
Information-Theoretic Methods in Data Science

Learn about the state-of-the-art at the interface between information theory and data
science with this first unified treatment of the subject. Written by leading experts
in a clear, tutorial style, and using consistent notation and definitions throughout, it
shows how information-theoretic methods are being used in data acquisition, data
representation, data analysis, and statistics and machine learning.
Coverage is broad, with chapters on signal data acquisition, data compression,
compressive sensing, data communication, representation learning, emerging topics in
statistics, and much more. Each chapter includes a topic overview, definition of the key
problems, emerging and open problems, and an extensive reference list, allowing readers
to develop in-depth knowledge and understanding.
Providing a thorough survey of the current research area and cutting-edge trends, this
is essential reading for graduate students and researchers working in information theory,
signal processing, machine learning, and statistics.

Miguel R. D. Rodrigues is Professor of Information Theory and Processing in the Department of Electronic and Electrical Engineering, University College London, and a Turing Fellow at the Alan Turing Institute, London.

Yonina C. Eldar is a Professor in the Faculty of Mathematics and Computer Science at the Weizmann Institute of Science, a Fellow of the IEEE and Eurasip, and a member of the Israel Academy of Sciences and Humanities. She is the author of Sampling Theory (Cambridge, 2015), and co-editor of Compressed Sensing in Radar Signal Processing (Cambridge, 2019), Compressed Sensing (Cambridge, 2012), and Convex Optimization in Signal Processing and Communications (Cambridge, 2009).
“How should we model the opportunities created by access to data and to computing,
and how robust and resilient are these models? What information is needed for analysis,
how is it obtained and applied? If you are looking for answers to these questions, then
there is no better place to start than this book.”
Robert Calderbank, Duke University

“This pioneering book elucidates the emerging cross-disciplinary area of information-theoretic data science. The book’s contributors are leading information theorists who compellingly explain the close connections between the fields of information theory and data science. Readers will learn about theoretical foundations for designing data collection and analysis systems, from data acquisition to data compression, computation, and learning.”
Al Hero, University of Michigan
Information-Theoretic Methods
in Data Science
Edited by
MIGUEL R. D. RODRIGUES
University College London

YONINA C. ELDAR
Weizmann Institute of Science
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.

It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108427135
DOI: 10.1017/9781108616799

© Cambridge University Press 2021

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2021

Printed in the United Kingdom by TJ International Ltd, Padstow Cornwall

A catalogue record for this publication is available from the British Library.

ISBN 978-1-108-42713-5 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
To my husband Shalomi and children Yonatan, Moriah, Tal, Noa, and Roei for their
boundless love and for filling my life with endless happiness
YE

To my wife Eduarda and children Isabel and Diana for their unconditional love,
encouragement, and support
MR
Contents

Preface page xiii


Notation xvii
List of Contributors xix

1 Introduction to Information Theory and Data Science 1


Miguel R. D. Rodrigues, Stark C. Draper, Waheed U. Bajwa, and Yonina C. Eldar

1.1 Classical Information Theory: A Primer 2


1.2 Source Coding: Near-Lossless Compression of Binary Sources 4
1.3 Channel Coding: Transmission over the Binary Symmetric Channel 8
1.4 Linear Channel Coding 12
1.5 Connecting Information Theory to Data Science 15
1.6 Information Theory and Data Acquisition 18
1.7 Information Theory and Data Representation 20
1.8 Information Theory and Data Analysis and Processing 26
1.9 Discussion and Conclusion 35
References 36

2 An Information-Theoretic Approach to Analog-to-Digital Compression 44


Alon Kipnis, Yonina C. Eldar, and Andrea J. Goldsmith

2.1 Introduction 44
2.2 Lossy Compression of Finite-Dimensional Signals 48
2.3 ADX for Continuous-Time Analog Signals 49
2.4 The Fundamental Distortion Limit 55
2.5 ADX under Uniform Sampling 63
2.6 Conclusion 69
References 70

3 Compressed Sensing via Compression Codes 72


Shirin Jalali and H. Vincent Poor

3.1 Compressed Sensing 72


3.2 Compression-Based Compressed Sensing 74
3.3 Definitions 76
3.4 Compressible Signal Pursuit 77
3.5 Compression-Based Gradient Descent (C-GD) 81
3.6 Stylized Applications 85
3.7 Extensions 92
3.8 Conclusions and Discussion 100
References 100

4 Information-Theoretic Bounds on Sketching 104


Mert Pilanci

4.1 Introduction 104


4.2 Types of Randomized Sketches 105
4.3 Background on Convex Analysis and Optimization 108
4.4 Sketching Upper Bounds for Regression Problems 110
4.5 Information-Theoretic Lower Bounds 114
4.6 Non-Parametric Problems 121
4.7 Extensions: Privacy and Communication Complexity 124
4.8 Numerical Experiments 126
4.9 Conclusion 127
A4.1 Proof of Theorem 4.3 128
A4.2 Proof of Lemma 4.3 129
References 130

5 Sample Complexity Bounds for Dictionary Learning from Vector- and Tensor-Valued Data 134

Zahra Shakeri, Anand D. Sarwate, and Waheed U. Bajwa

5.1 Introduction 134


5.2 Dictionary Learning for Vector-Valued Data 137
5.3 Dictionary Learning for Tensors 149
5.4 Extensions and Open Problems 157
References 160

6 Uncertainty Relations and Sparse Signal Recovery 163


Erwin Riegler and Helmut Bölcskei

6.1 Introduction 163


6.2 Uncertainty Relations in (C^m, ‖·‖_2) 165
6.3 Uncertainty Relations in (C^m, ‖·‖_1) 173
6.4 Sparse Signal Separation 181
6.5 The Set-Theoretic Null-Space Property 184
6.6 A Large Sieve Inequality in (C^m, ‖·‖_2) 187
6.7 Uncertainty Relations in L1 and L2 189
6.8 Proof of Theorem 6.4 189


6.9 Results for |||·|||_1 and |||·|||_2 192
References 193

7 Understanding Phase Transitions via Mutual Information and MMSE 197


Galen Reeves and Henry D. Pfister

7.1 Introduction 197


7.2 Problem Setup and Characterization 200
7.3 The Role of Mutual Information and MMSE 207
7.4 Proving the Replica-Symmetric Formula 215
7.5 Phase Transitions and Posterior Correlation 219
7.6 Conclusion 223
A7.1 Subset Response for the Standard Linear Model 223
References 225

8 Computing Choice: Learning Distributions over Permutations 229


Devavrat Shah

8.1 Background 229


8.2 Setup 232
8.3 Models 234
8.4 Sparse Model 236
8.5 Random Utility Model (RUM) 247
8.6 Discussion 255
References 258

9 Universal Clustering 263


Ravi Kiran Raman and Lav R. Varshney

9.1 Unsupervised Learning in Machine Learning 263


9.2 Universality in Information Theory 265
9.3 Clustering and Similarity 271
9.4 Distance-Based Clustering 276
9.5 Dependence-Based Clustering 280
9.6 Applications 291
9.7 Conclusion 292
References 293

10 Information-Theoretic Stability and Generalization 302


Maxim Raginsky, Alexander Rakhlin, and Aolin Xu

10.1 Introduction 302


10.2 Preliminaries 305
10.3 Learning Algorithms and Stability 308


10.4 Information Stability and Generalization 311
10.5 Information-Theoretically Stable Learning Algorithms 317
References 327

11 Information Bottleneck and Representation Learning 330


Pablo Piantanida and Leonardo Rey Vega

11.1 Introduction and Overview 330


11.2 Representation and Statistical Learning 333
11.3 Information-Theoretic Principles and Information Bottleneck 339
11.4 The Interplay between Information and Generalization 347
11.5 Summary and Outlook 356
References 356

12 Fundamental Limits in Model Selection for Modern Data Analysis 359


Jie Ding, Yuhong Yang, and Vahid Tarokh

12.1 Introduction 359


12.2 Fundamental Limits in Model Selection 360
12.3 An Illustration on Fitting and the Optimal Model 363
12.4 The AIC, the BIC, and Other Related Criteria 365
12.5 The Bridge Criterion – Bridging the Conflicts between the AIC
and the BIC 369
12.6 Modeling-Procedure Selection 375
12.7 Conclusion 379
References 380

13 Statistical Problems with Planted Structures: Information-Theoretical and Computational Limits 383

Yihong Wu and Jiaming Xu

13.1 Introduction 383


13.2 Basic Setup 384
13.3 Information-Theoretic Limits 386
13.4 Computational Limits 401
13.5 Discussion and Open Problems 412
A13.1 Mutual Information Characterization of Correlated Recovery 415
A13.2 Proof of (13.7) ⇒ (13.6) and Verification of (13.7)
in the Binary Symmetric SBM 418
References 419
14 Distributed Statistical Inference with Compressed Data 425


Wenwen Zhao and Lifeng Lai

14.1 Introduction 425


14.2 Basic Model 427
14.3 Interactive Communication 438
14.4 Identity Testing 445
14.5 Conclusion and Extensions 451
References 452

15 Network Functional Compression 455


Soheil Feizi and Muriel Médard

15.1 Introduction 455


15.2 Problem Setup and Prior Work 458
15.3 Network Functional Compression 463
15.4 Discussion and Future Work 473
15.5 Proofs 476
References 484

16 An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation 487

Jonathan Scarlett and Volkan Cevher

16.1 Introduction 487


16.2 Fano’s Inequality and Its Variants 492
16.3 Mutual Information Bounds 495
16.4 Applications – Discrete Settings 499
16.5 From Discrete to Continuous 510
16.6 Applications – Continuous Settings 516
16.7 Discussion 523
References 525

Index 529
Preface

Since its introduction in 1948, the field of information theory has proved instrumen-
tal in the analysis of problems pertaining to compressing, storing, and transmitting
data. For example, information theory has allowed analysis of the fundamental limits
of data communication and compression, and has shed light on practical communica-
tion system design for decades. Recent years have witnessed a renaissance in the use
of information-theoretic methods to address problems beyond data compression, data
communications, and networking, such as compressive sensing, data acquisition, data
analysis, machine learning, graph mining, community detection, privacy, and fairness.
In this book, we explore a broad set of problems on the interface of signal processing,
machine learning, learning theory, and statistics where tools and methodologies orig-
inating from information theory can provide similar benefits. The role of information
theory at this interface has indeed been recognized for decades. A prominent example is
the use of information-theoretic quantities such as mutual information, metric entropy
and capacity in establishing minimax rates of estimation back in the 1980s. Here we
intend to explore modern applications at this interface that are shaping data science in
the twenty-first century.
There are of course some notable differences between standard information-theoretic
tools and signal-processing or data analysis methods. Globally speaking, information
theory tends to focus on asymptotic limits, using large blocklengths, and assumes the
data is represented by a finite number of bits and viewed through a noisy channel.
The standard results are not concerned with complexity but focus more on fundamen-
tal limits characterized via achievability and converse results. On the other hand, some
signal-processing techniques, such as sampling theory, are focused on discrete-time rep-
resentations but do not necessarily assume the data is quantized or that there is noise in
the system. Signal processing is often concerned with concrete methods that are opti-
mal, namely, achieve the developed limits, and have bounded complexity. It is natural
therefore to combine these tools to address a broader set of problems and analysis which
allows for quantization, noise, finite samples, and complexity analysis.
This book is aimed at providing a survey of recent applications of information-
theoretic methods to emerging data-science problems. The potential reader of this book
could be a researcher in the areas of information theory, signal processing, machine
learning, statistics, applied mathematics, computer science or a related research area, or
a graduate student seeking to learn about information theory and data science and to
scope out open problems at this interface. The particular design of this volume ensures
that it can serve as both a state-of-the-art reference for researchers and a textbook for
students.
The book contains 16 diverse chapters written by recognized leading experts world-
wide, covering a large variety of topics that lie on the interface of signal processing, data
science, and information theory. The book begins with an introduction to information
theory which serves as a background for the remaining chapters, and also sets the nota-
tion to be used throughout the book. The following chapters are then organized into four
categories: data acquisition (Chapters 2–4), data representation and analysis (Chapters
5–9), information theory and machine learning (Chapters 10 and 11), and information
theory, statistics, and compression (Chapters 12–15). The last chapter, Chapter 16, con-
nects several of the book’s themes via a survey of Fano’s inequality in a diverse range
of data-science problems. The chapters are self-contained, covering the most recent
research results in the respective topics, and can all be treated independently of each
other. A brief summary of each chapter is given next.
Chapter 1 by Rodrigues, Draper, Bajwa, and Eldar provides an introduction to infor-
mation theory concepts and serves two purposes: It provides background on classical
information theory, and presents a taster of modern information theory applied to
emerging data-science problems.
Chapter 2 by Kipnis, Eldar, and Goldsmith extends the notion of rate-distortion the-
ory to continuous-time inputs deriving bounds that characterize the minimal distortion
that can be achieved in representing a continuous-time signal by a series of bits when
the sampler is constrained to a given sampling rate. For an arbitrary stochastic input and
given a total bitrate budget, the authors consider the lowest sampling rate required to
sample the signal such that reconstruction of the signal from a bit-constrained represen-
tation of its samples results in minimal distortion. It turns out that often the signal can
be sampled at sub-Nyquist rates without increasing the distortion.
Chapter 3 by Jalali and Poor discusses the interplay between compressed sensing and
compression codes. In particular, the authors consider the use of compression codes to
design compressed sensing recovery algorithms. This allows the expansion of the class
of structures used by compressed sensing algorithms to those used by data compression
codes, which is a much richer class of inputs and relies on decades of developments in
the field of compression.
Chapter 4 by Pilanci develops information-theoretical lower bounds on sketching for
solving large statistical estimation and optimization problems. The term sketching is
used for randomized methods that aim to reduce data dimensionality in computationally
intensive tasks for gains in space, time, and communication complexity. These bounds
allow one to obtain interesting trade-offs between computation and accuracy and shed
light on a variety of existing methods.
Chapter 5 by Shakeri, Sarwate, and Bajwa treats the problem of dictionary learning,
which is a powerful signal-processing approach for data-driven extraction of features
from data. The chapter summarizes theoretical aspects of dictionary learning for vector-
and tensor-valued data and explores lower and upper bounds on the sample complexity
of dictionary learning which are derived using information-theoretic tools. The depen-
dence of sample complexity on various parameters of the dictionary learning problem
is highlighted along with the potential advantages of taking the structure of tensor data
into consideration in representation learning.
Chapter 6 by Riegler and Bölcskei presents an overview of uncertainty relations for
sparse signal recovery starting from the work of Donoho and Stark. These relations are
then extended to richer data structures and bases, which leads to the recently discovered
set-theoretic uncertainty relations in terms of Minkowski dimension. The chapter also
explores the connection between uncertainty relations and the “large sieve,” a family
of inequalities developed in analytic number theory. It is finally shown how uncertainty
relations allow one to establish fundamental limits of practical signal recovery problems
such as inpainting, declipping, super-resolution, and denoising of signals.
Chapter 7 by Reeves and Pfister examines high-dimensional inference problems
through the lens of information theory. The chapter focuses on the standard linear model
for which the performance of optimal inference is studied using the replica method from
statistical physics. The chapter presents a tutorial of these techniques and presents a new
proof demonstrating their optimality in certain settings.
Chapter 8 by Shah discusses the question of learning distributions over permutations
of a given set of choices based on partial observations. This is central to capturing
choice in a variety of contexts such as understanding preferences of consumers over
a collection of products based on purchasing and browsing data in the setting of retail
and e-commerce. The chapter focuses on the learning task from marginal distributions
of two types, namely, first-order marginals and pair-wise comparisons, and provides a
comprehensive review of results in this area.
Chapter 9 by Raman and Varshney studies universal clustering, namely, clustering
without prior access to the statistical properties of the data. The chapter formalizes the
problem in information theory terms, focusing on two main subclasses of clustering that
are based on distance and dependence. A review of well-established clustering algo-
rithms, their statistical consistency, and their computational and sample complexities is
provided using fundamental information-theoretic principles.
Chapter 10 by Raginsky, Rakhlin, and Xu introduces information-theoretic measures
of algorithmic stability and uses them to upper-bound the generalization bias of learn-
ing algorithms. The notion of stability implies that its output does not depend too
much on any individual training example and therefore these results shed light on the
generalization ability of modern learning techniques.
Chapter 11 by Piantanida and Vega introduces the information bottleneck principle
and explores its use in representation learning, namely, in the development of com-
putational algorithms that learn the different explanatory factors of variation behind
high-dimensional data. Using these tools, the authors obtain an upper bound on the
generalization gap corresponding to the cross-entropy risk. This result provides an inter-
esting connection between mutual information and generalization, and helps to explain
why noise injection during training can improve the generalization ability of encoder
models.
Chapter 12 by Ding, Yang, and Tarokh discusses fundamental limits of inference and prediction based on model selection principles from modern data analysis. Using
information-theoretic tools the authors analyze several state-of-the-art model selec-
tion techniques and introduce two recent advances in model selection approaches, one
concerning a new information criterion and the other concerning modeling-procedure
selection.
Chapter 13 by Wu and Xu provides an exposition on some of the methods for deter-
mining the information-theoretical as well as computational limits for high-dimensional
statistical problems with a planted structure. Planted structures refer to a ground truth
structure (often of a combinatorial nature) which one is trying to discover in the presence
of random noise. In particular, the authors discuss first- and second-moment methods for
analyzing the maximum likelihood estimator, information-theoretic methods for proving
impossibility results using mutual information and rate-distortion theory, and techniques
originating from statistical physics. To investigate computational limits, they describe
randomized polynomial-time reduction schemes that approximately map planted-clique
problems to the problem of interest in total variation distance.
Chapter 14 by Zhao and Lai considers information-theoretic models for distributed
statistical inference problems with compressed data. The authors review several research
directions and challenges related to applying these models to various statistical learn-
ing problems. In these applications, data are distributed in multiple terminals, which
can communicate with each other via limited-capacity channels. Information-theoretic
tools are used to characterize the fundamental limits of the classical statistical inference
problems using compressed data directly.
Chapter 15 by Feizi and Médard treats different aspects of the network functional
compression problem. The goal is to compress a source of random variables for the
purpose of computing a deterministic function at the receiver where the sources and
receivers are nodes in a network. Traditional data compression schemes are special cases
of functional compression, in which the desired function is the identity function. It is
shown that for certain classes of functions considerable compression is possible in this
setting.
Chapter 16 by Scarlett and Cevher provides a survey of Fano’s inequality and its use
in various statistical estimation problems. In particular, the chapter overviews the use of
Fano’s inequality for establishing impossibility results, namely, conditions under which
a certain goal cannot be achieved by any estimation algorithm. The authors present
several general-purpose tools and analysis techniques, and provide representative exam-
ples covering group testing, graphical model selection, sparse linear regression, density
estimation, and convex optimization.
Within the chapters, the authors point to various open research directions at the
interface of information theory, data acquisition, data analysis, machine learning, and
statistics that will certainly see increasing attention in the years to come.
We would like to end by thanking all the authors for their contributions to this book
and for their hard work in presenting the material in a unified and accessible fashion.
Notation

z scalar (or value of random variable Z)


Z random variable
z vector (or value of random vector Z)
Z matrix (or random vector)
zi ith entry of vector z
Zi, j (i, j)th entry of matrix Z
Z^n = (Z_1, . . . , Z_n) sequence of n random variables
z^n = (z_1, . . . , z_n) value of sequence of n random variables Z^n
Z_i^j = (Z_i, . . . , Z_j) sequence of j − i + 1 random variables
z_i^j = (z_i, . . . , z_j) value of sequence of j − i + 1 random variables Z_i^j
‖·‖_p p-norm
(·)^T transpose operator
(·)∗ conjugate Hermitian operator
(·)† pseudo-inverse of the matrix argument
tr(·) trace of the square matrix argument
det(·) determinant of the square matrix argument
rank(·) rank of the matrix argument
range(·) range span of the column vectors of the matrix argument
λmax (·) maximum eigenvalue of the square matrix argument
λmin (·) minimum eigenvalue of the square matrix argument
λi (·) ith largest eigenvalue of the square matrix argument
I identity matrix (its size is determined from the context)
0 matrix with zero entries (its size is determined from the context)
T standard notation for sets
|T | cardinality of set T
R set of real numbers
C set of complex numbers
Rn set of n-dimensional vectors of real numbers
Cn set of n-dimensional vectors of complex numbers
j imaginary unit
Re(x) real part of the complex number x
Im(x) imaginary part of the complex number x
|x| modulus of the complex number x
arg(x) argument of the complex number x
E[·] statistical expectation
P[·] probability measure
H(·) entropy
H(·|·) conditional entropy
h(·) differential entropy
h(·|·) conditional differential entropy
D(·‖·) relative entropy
I(·; ·) mutual information
I(·; ·|·) conditional mutual information
N(μ, σ²) scalar Gaussian distribution with mean μ and variance σ²
N(μ, Σ) multivariate Gaussian distribution with mean μ and covariance matrix Σ
Contributors

Waheed U. Bajwa
Department of Electrical and Computer Engineering
Rutgers University

Helmut Bölcskei
Department of Information Technology and Electrical Engineering
Department of Mathematics
ETH Zürich

Volkan Cevher
Laboratory for Information and Inference Systems
Institute of Electrical Engineering
School of Engineering, EPFL

Jie Ding
School of Statistics
University of Minnesota

Stark C. Draper
Department of Electrical and Computer Engineering
University of Toronto

Yonina C. Eldar
Faculty of Mathematics and Computer Science
Weizmann Institute of Science

Soheil Feizi
Department of Computer Science
University of Maryland, College Park

Andrea J. Goldsmith
Department of Electrical Engineering
Stanford University

Shirin Jalali
Nokia Bell Labs

Alon Kipnis
Department of Statistics
Stanford University

Lifeng Lai
Department of Electrical and Computer Engineering
University of California, Davis

Muriel Médard
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Henry D. Pfister
Department of Electrical and Computer Engineering
Duke University

Pablo Piantanida
Laboratoire des Signaux et Systèmes
Université Paris Saclay
CNRS-CentraleSupélec
Montreal Institute for Learning Algorithms (Mila), Université de Montréal

Mert Pilanci
Department of Electrical Engineering
Stanford University

H. Vincent Poor
Department of Electrical Engineering
Princeton University

Maxim Raginsky
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Alexander Rakhlin
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology

Ravi Kiran Raman
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Galen Reeves
Department of Electrical and Computer Engineering
Department of Statistical Science
Duke University

Leonardo Rey Vega
Department of Electrical Engineering
School of Engineering
Universidad de Buenos Aires
CSC-CONICET

Erwin Riegler
Department of Information Technology and Electrical Engineering
ETH Zürich

Miguel R. D. Rodrigues
Department of Electronic and Electrical Engineering
University College London

Anand D. Sarwate
Department of Electrical and Computer Engineering
Rutgers University

Jonathan Scarlett
Department of Computer Science and Department of Mathematics
National University of Singapore

Devavrat Shah
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Zahra Shakeri
Electronic Arts

Vahid Tarokh
Department of Electrical and Computer Engineering
Duke University

Lav R. Varshney
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Yihong Wu
Department of Statistics and Data Science
Yale University

Aolin Xu
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

Jiaming Xu
Fuqua School of Business
Duke University

Yuhong Yang
School of Statistics
University of Minnesota

Wenwen Zhao
Department of Electrical and Computer Engineering
University of California, Davis
1 Introduction to Information Theory
and Data Science
Miguel R. D. Rodrigues, Stark C. Draper, Waheed U. Bajwa, and Yonina C. Eldar

Summary

The field of information theory – dating back to 1948 – is one of the landmark intel-
lectual achievements of the twentieth century. It provides the philosophical and math-
ematical underpinnings of the technologies that allow accurate representation, efficient
compression, and reliable communication of sources of data. A wide range of storage
and transmission infrastructure technologies, including optical and wireless commu-
nication networks, the internet, and audio and video compression, have been enabled
by principles illuminated by information theory. Technological breakthroughs based on
information-theoretic concepts have driven the “information revolution” characterized
by the anywhere and anytime availability of massive amounts of data and fueled by the
ubiquitous presence of devices that can capture, store, and communicate data.
The existence and accessibility of such massive amounts of data promise immense
opportunities, but also pose new challenges in terms of how to extract useful and action-
able knowledge from such data streams. Emerging data-science problems are different
from classical ones associated with the transmission or compression of information in
which the semantics of the data was unimportant. That said, we are starting to see
that information-theoretic methods and perspectives can, in a new guise, play impor-
tant roles in understanding emerging data-science problems. The goal of this book is
to explore such new roles for information theory and to understand better the modern
interaction of information theory with other data-oriented fields such as statistics and
machine learning.
The purpose of this chapter is to set the stage for the book and for the upcoming
chapters. We first overview classical information-theoretic problems and solutions. We
then discuss emerging applications of information-theoretic methods in various data-
science problems and, where applicable, refer the reader to related chapters in the book.
Throughout this chapter, we highlight the perspectives, tools, and methods that play
important roles in classic information-theoretic paradigms and in emerging areas of data
science. Table 1.1 provides a summary of the different topics covered in this chapter and
highlights the different chapters that can be read as a follow-up to these topics.


Table 1.1. Major topics covered in this chapter and their connections to other chapters

Section(s)   Topic                                              Related chapter(s)

1.1–1.4      An introduction to information theory              15
1.6          Information theory and data acquisition            2–4, 6, 16
1.7          Information theory and data representation         5, 11
1.8          Information theory and data analysis/processing    6–16

1.1 Classical Information Theory: A Primer

Claude Shannon’s 1948 paper “A Mathematical Theory of Communication,” Bell System Technical Journal, July/Oct. 1948, laid out a complete architecture for digital communication systems [1]. In addition, it articulated the philosophical decisions for the design choices made. Information theory, as Shannon’s framework has come to be known, is a beautiful and elegant example of engineering science. It is all the more impressive as Shannon presented his framework decades before the first digital communication system was implemented, and at a time when digital computers were in their infancy.
Figure 1.1 presents a general schematic of a digital communication system. This
figure is a reproduction of Shannon’s “Figure 1” from his seminal paper. Before 1948
no one had conceived of a communication system in this way. Today nearly all digital
communication systems obey this structure.
The flow of information through the system is as follows. An information source first
produces a random message that a transmitter wants to convey to a destination. The
message could be a word, a sentence, or a picture. In information theory, all information
sources are modeled as being sampled from a set of possibilities according to some
probability distribution. Modeling information sources as stochastic is a key aspect of
Shannon’s approach. It allowed him to quantify uncertainty as the lack of knowledge
and reduction in uncertainty as the gaining of knowledge or “information.”

[Figure 1.1: block diagram of a general communication system — information source → transmitter (source encoder, channel encoder) → channel with noise source → receiver (channel decoder, source decoder) → destination, with message, signal, and received signal passing between the blocks.]
Figure 1.1 Reproduction of Shannon’s Figure 1 in [1] with the addition of the source and channel
encoding/decoding blocks. In Shannon’s words, this is a “schematic diagram of a general
communication system.”

The message is then fed into a transmission system. The transmitter itself has two
main sub-components: the source encoder and the channel encoder. The source encoder
converts the message into a sequence of 0s and 1s, i.e., a bit sequence. There are two
classes of source encoders. Lossless source coding removes predictable redundancy
that can later be recreated. In contrast, lossy source coding is an irreversible process
wherein some distortion is incurred in the compression process. Lossless source cod-
ing is often referred to as data compression while lossy coding is often referred to as
rate-distortion coding. Naturally, the higher the distortion the fewer the number of bits
required.
The bit sequence forms the data payload that is fed into a channel encoder. The out-
put of the channel encoder is a signal that is transmitted over a noisy communication
medium. The purpose of the channel code is to convert the bits into a set of possible
signals or codewords that can be reliably recovered from the noisy received signal.
The communication medium itself is referred to as the channel. The channel can
model the physical separation of the transmitter and receiver. It can also, as in data
storage, model separation in time.
The destination observes a signal that is the output of the communication channel.
Similar to the transmitter, the receiver has two main components: a channel decoder and
a source decoder. The former maps the received signal into a bit sequence that is, one
hopes, the same as the bit sequence produced by the transmitter. The latter then maps
the estimated bit sequence to an estimate of the original message.
If lossless compression is used, then an apt measure of performance is the probabil-
ity that the message estimate at the destination is not equal to the original message at
the transmitter. If lossy compression (rate distortion) is used, then other measures of
goodness, such as mean-squared error, are more appropriate.
Interesting questions addressed by information theory include the following.
1. Architectures
• What trade-offs in performance are incurred by the use of the architecture
detailed in Figure 1.1?
• When can this architecture be improved upon; when can it not?
2. Source coding: lossless data compression
• How should the information source be modeled; as stochastic, as arbitrary
but unknown, or in some other way?
• What is the shortest bit sequence into which a given information source can
be compressed?
• What assumptions does the compressor work under?
• What are basic compression techniques?
3. Source coding: rate-distortion theory
• How do you convert an analog source into a digital bitstream?
• How do you reconstruct/estimate the original source from the bitstream?
• What is the trade-off involved between the number of bits used to describe
a source and the distortion incurred in reconstruction of the source?

4. Channel coding
• How should communication channels be modeled?
• What throughput, measured in bits per second, at what reliability, measured
in terms of probability of error, can be achieved?
• Can we quantify fundamental limits on the realizable trade-offs between
throughput and reliability for a given channel model?
• How does one build computationally tractable channel coding systems that
“saturate” the fundamental limits?
5. Multi-user information theory
• How do we design systems that involve multiple transmitters and receivers?
• How do many (perhaps correlated) information sources and transmission
channels interact?
The decades since Shannon’s first paper have seen fundamental advances in each of
these areas. They have also witnessed information-theoretic perspectives and thinking
impacting a number of other fields including security, quantum computing and com-
munications, and cryptography. The basic theory and many of these developments are
documented in a body of excellent texts, including [2–9]. Some recent advances in net-
work information theory, which involves multiple sources and/or multiple destinations,
are also surveyed in Chapter 15. In the next three sections, we illustrate the basics of
information-theoretic thinking by focusing on simple (point-to-point) binary sources and
channels. In Section 1.2, we discuss the compression of binary sources. In Section 1.3,
we discuss channel coding over binary channels. Finally, in Section 1.4, we discuss
computational issues, focusing on linear codes.

1.2 Source Coding: Near-Lossless Compression of Binary Sources

To gain a feel for the tools and results of classical information theory consider the fol-
lowing lossless source coding problem. One observes a length-n string of random coin
flips, X1 , X2 , . . . , Xn , each Xi ∈ {heads, tails}. The flips are independent and identically
distributed with P(Xi = heads) = p, where 0 ≤ p ≤ 1 is a known parameter. Suppose we
want to map this string into a bit sequence to store on a computer for later retrieval.
Say we are going to assign a fixed amount of memory to store the sequence. How much
memory must we allocate?
Since there are 2^n possible sequences, all of which could occur if p is not equal to 0
or 1, if we use n bits we can be 100% certain we could index any heads/tails sequence
that we might observe. However, certain sequences, while possible, are much less likely
than others. Information theory exploits such non-uniformity to develop systems that
can trade off between efficiency (the storage of fewer bits) and reliability (the greater
certainty that one will later be able to reconstruct the observed sequence). In the follow-
ing, we accept some (arbitrarily) small probability ε > 0 of observing a sequence that we
choose not to be able to store a description of.1 One can think of ε as the probability
the system failing. Under this assumption we derive bounds on the number of bits that
need to be stored.

1.2.1 Achievability: An Upper Bound on the Rate Required for Reliable Data Storage
To figure out which sequences we may choose not to store, let us think about the statistics. In expectation, we observe np heads. Of the 2^n possible heads/tails sequences there are $\binom{n}{np}$ sequences with np heads. (For the moment we ignore non-integer effects and deal with them later.) There will be some variability about this mean but, at a minimum, we must be able to store all these expected realizations since these realizations all have the same probability. While $\binom{n}{np}$ is the cardinality of the set, we prefer to develop a good approximation that is more amenable to manipulation. Further, rather than counting cardinality, we will count the log-cardinality. This is because given k bits we can index 2^k heads/tails source sequences. Hence, it is the exponent in which we are interested.
Using Stirling’s approximation to the factorial, log2 n! = n log2 n − (log2 e)n + O(log2 n), and ignoring the order term, we have

$$\log_2 \binom{n}{np} \approx n\log_2 n - n(1-p)\log_2\big(n(1-p)\big) - np\log_2(np) \qquad (1.1)$$
$$= n\log_2\frac{1}{1-p} + np\log_2\frac{1-p}{p} = n\left[(1-p)\log_2\frac{1}{1-p} + p\log_2\frac{1}{p}\right]. \qquad (1.2)$$
In (1.1), the (log2 e)n terms have canceled and the term in square brackets in (1.2) is
called the (binary) entropy, which we denote as HB (p), so

HB (p) = −plog2 p − (1 − p) log2 (1 − p), (1.3)

where 0 ≤ p ≤ 1 and 0 log 0 = 0. The binary entropy function is plotted in Fig. 1.2 within
Section 1.3. One can compute that when p = 0 or p = 1 then HB (0) = HB (1) = 0. The
interpretation is that, since there is only one all-tails and one all-heads sequence, and
we are quantifying log-cardinality, there is only one sequence to index in each case so
log2 (1) = 0. In these cases, we a priori know the outcome (respectively, all the heads
or all tails) and so do not need to store any bits to describe the realization. On the other hand, if the coin is fair then p = 0.5, HB(0.5) = 1, $\binom{n}{n/2} \approx 2^n$, and we must use n bits of storage. In other words, on an exponential scale almost all binary sequences are 50% heads and 50% tails. As an intermediate value, if p = 0.11 then HB(0.11) ≈ 0.5.
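To make these numbers concrete, here is a short Python sketch — our own illustrative addition, not code from the chapter (the function name and the choice n = 1000 are arbitrary) — that evaluates HB(p) from (1.3), checks the values quoted above, and verifies numerically that the normalized log-cardinality in (1.1)–(1.2) is close to HB(p).

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_B(p) in bits, using the convention 0*log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.0), binary_entropy(1.0))  # both 0.0: nothing to store
print(binary_entropy(0.5))                       # 1.0: a fair coin needs one bit per flip
print(round(binary_entropy(0.11), 4))            # roughly 0.5

# Stirling-based approximation: (1/n) * log2 C(n, np) is close to H_B(p).
n, p = 1000, 0.11
log_cardinality = math.log2(math.comb(n, round(n * p)))
print(round(log_cardinality / n, 4), round(binary_entropy(p), 4))
```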

1 In source coding, this is termed near-lossless source coding as the arbitrarily small ε bounds the probability
of system failure and thus loss of the original data. In the variable-length source coding paradigm, one
stores a variable amount of bits per sequence, and minimizes the expected number of bits stored. We focus
on the near-lossless paradigm as the concepts involved more closely parallel those in channel coding.

The operational upshot of (1.2) is that if one allocates nHB (p) bits then basically all
expected sequences can be indexed. Of course, there are caveats. First, np need not be
integer. Second, there will be variability about the mean. To deal with both, we allocate
a few more bits, n(HB (p) + δ) in total. We use these bits not just to index the expected
sequences, but also the typical sequences, those sequences with empirical entropy close
to the entropy of the source.2 In the case of coin flips, if a particular sequence consists
of n_H heads (and n − n_H tails) then we say that the sequence is “typical” if

$$H_B(p) - \delta \le \frac{n_H}{n}\log_2\frac{1}{p} + \frac{n-n_H}{n}\log_2\frac{1}{1-p} \le H_B(p) + \delta. \qquad (1.4)$$

It can be shown that the cardinality of the set of sequences that satisfies condition (1.4) is upper-bounded by $2^{n(H_B(p)+\delta)}$. Therefore if, for instance, one lists the typical sequences lexicographically, then any typical sequence can be described using n(HB(p) + δ) bits. One can also show that for any δ > 0 the probability of the source not producing a typical sequence can be upper-bounded by any ε > 0 as n grows large. This follows from the law of large numbers. As n grows the distribution of the fraction of heads in the realized source sequence concentrates about its expectation. Therefore, as long as n is sufficiently large, and as long as δ > 0, any ε > 0 will do. The quantity HB(p) + δ is termed the storage “rate” R. For this example R = HB(p) + δ. The rate is the amount of memory that must be made available per source symbol. In this case, there were n symbols (n coin tosses), so one normalizes n(HB(p) + δ) by n to get the rate HB(p) + δ.
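This counting argument can be checked directly at modest blocklengths. The sketch below is our own addition (the parameter values n = 200, p = 0.11, δ = 0.1 are arbitrary); it groups sequences by their number of heads, applies the typicality test (1.4), and compares the size and total probability of the typical set with the bounds discussed above.

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, p, delta = 200, 0.11, 0.1
H = binary_entropy(p)

typical_count, typical_prob = 0, 0.0
for n_heads in range(n + 1):
    # Empirical entropy (self-information rate) of any sequence with n_heads heads.
    emp = (n_heads / n) * math.log2(1 / p) + ((n - n_heads) / n) * math.log2(1 / (1 - p))
    if H - delta <= emp <= H + delta:
        count = math.comb(n, n_heads)            # every such sequence is typical
        typical_count += count
        typical_prob += count * p ** n_heads * (1 - p) ** (n - n_heads)

print("fraction of all 2^n sequences that are typical:", typical_count / 2 ** n)
print("cardinality bound 2^(n(H+delta)):", 2 ** (n * (H + delta)), ">=", typical_count)
print("probability of the typical set:", typical_prob)   # tends to 1 as n grows
```

Even at n = 200 the typical set is an exponentially small fraction of all binary sequences yet already carries most of the probability, which is exactly the effect the storage scheme exploits.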
The above idea can immediately be extended to independent and identically dis-
tributed (i.i.d.) finite-alphabet (and more general) sources as well. The general definition
of the entropy of a finite-alphabet random variable X with probability mass function
(p.m.f.) pX is

$$H(X) = -\sum_{x\in\mathcal{X}} p_X(x)\log_2 p_X(x), \qquad (1.5)$$

where “finite-alphabet” means the sample space X is finite.
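A direct transcription of (1.5) into code is given below; this is our own sketch (the function name and the example p.m.f.s are arbitrary), and for a Bernoulli(p) source it reduces to the binary entropy HB(p) used above.

```python
import math

def entropy(pmf):
    """Entropy in bits of a finite-alphabet p.m.f. given as a list of probabilities."""
    assert abs(sum(pmf) - 1.0) < 1e-9, "probabilities must sum to one"
    return -sum(q * math.log2(q) for q in pmf if q > 0)   # 0*log2(0) treated as 0

print(entropy([0.11, 0.89]))              # equals H_B(0.11), roughly 0.5
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over four symbols: 2 bits
print(entropy([1.0, 0.0, 0.0]))           # deterministic source: 0 bits
```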


Regardless of the distribution (binary, non-binary, even non-i.i.d.), the simple coin-
flipping example illustrates one of the central tenets of information theory. That is, to
focus one’s design on what is likely to happen, i.e., the typical events, rather than on
worst-case events. The partition of events into typical and atypical is, in information
theory, known as the asymptotic equipartition property (AEP). In a nutshell, the simplest
form of the AEP says that for long i.i.d. sequences one can, up to some arbitrarily small
probability ε, partition all possible outcomes into two sets: the typical set and the atypical set. The probability of observing an event in the typical set is at least 1 − ε. Furthermore,
on an exponential scale all typical sequences are of equal probability. Designing for
typical events is a hallmark of information theory.

2 In the literature, these are termed the “weakly” typical sequences. There are other definitions of typicality
that differ in terms of their mathematical use. The overarching concept is the same.

1.2.2 Converse: A Lower Bound on the Rate Required for Reliable Data Storage
A second hallmark of information theory is the emphasis on developing bounds. The
source coding scheme described above is known as an achievability result. Achievability
results involve describing an operational system that can, in principle, be realized in
practice. Such results provide (inner) bounds on what is possible. The performance of
the best system is at least this good. In the above example, we developed a source coding
technique that delivers high-reliability storage and requires a rate of H(X) + δ, where
both the error ε and the slack δ can be arbitrarily small if n is sufficiently large.
An important coupled question is how much (or whether) we can reduce the rate
further, thereby improving the efficiency of the scheme. In information theory, outer
bounds on what is possible – e.g., showing that if the encoding rate is too small one
cannot guarantee a target level of reliability – are termed converse results.
One of the key lemmas used in converse results is Fano’s inequality [7], named for
Robert Fano. The statement of the inequality is as follows: For any pair of random
variables (U, V) ∈ U × V jointly distributed according to pU,V (·, ·) and for any estimator
G : U → V with probability of error Pe = Pr[G(U) ≠ V],

H(V|U) ≤ HB (Pe ) + Pe log2 (|V| − 1). (1.6)

On the left-hand side of (1.6) we encounter the conditional entropy H(V|U) of the joint
p.m.f. pU,V (·, ·). We use the notation H(V|U = u) to denote the entropy in V when the
realization of the random variable U is set to U = u. Let us name this the “pointwise”
conditional entropy, the value of which can be computed by applying our formula for
entropy (1.5) to the p.m.f. pV|U (·|u). The conditional entropy is the expected pointwise
conditional entropy:
$$H(V|U) = \sum_{u\in\mathcal{U}} p_U(u)\,H(V|U=u) = \sum_{u\in\mathcal{U}} p_U(u)\left[\sum_{v\in\mathcal{V}} p_{V|U}(v|u)\log_2\frac{1}{p_{V|U}(v|u)}\right]. \qquad (1.7)$$
Fano’s inequality (1.6) can be interpreted as a bound on the ability of any hypothesis
test function G to make a (single) correct guess of the realization of V on the basis of
its observation of U. As the desired error probability Pe → 0, both terms on the right-
hand side go to zero, implying that the conditional entropy must be small. Conversely, if
the left-hand side is not too small, that asserts a non-zero lower bound on Pe . A simple
explicit bound is achieved by upper-bounding HB (Pe ) as HB (Pe ) ≤ 1 and rearranging to
find that Pe ≥ (H(V|U) − 1)/log2 (|V| − 1).
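The following numerical sketch is our own illustration (the joint p.m.f. is an arbitrary toy example, not one from the text): it computes the conditional entropy (1.7) for a pair (U, V) in which U is a noisy observation of V, evaluates the explicit Fano lower bound Pe ≥ (H(V|U) − 1)/log2(|V| − 1), and compares it with the error probability of the optimal (maximum a posteriori) guess, which can never fall below that bound.

```python
import math

def cond_entropy(p_uv):
    """H(V|U) in bits for a joint p.m.f. given as a dict {(u, v): probability}."""
    p_u = {}
    for (u, _), pr in p_uv.items():
        p_u[u] = p_u.get(u, 0.0) + pr
    # Sum of pr * log2(1 / p_{V|U}(v|u)), with p_{V|U}(v|u) = pr / p_U(u).
    return sum(pr * math.log2(p_u[u] / pr) for (u, _), pr in p_uv.items() if pr > 0)

# Toy joint p.m.f.: V uniform on {0,...,7}; U = V with probability 1/2, else uniform.
alphabet = range(8)
p_uv = {(u, v): (1 / 8) * (0.5 + 0.5 / 8 if u == v else 0.5 / 8)
        for u in alphabet for v in alphabet}

H_VgU = cond_entropy(p_uv)
fano_bound = (H_VgU - 1) / math.log2(len(alphabet) - 1)
map_error = 1 - sum(p_uv[(u, u)] for u in alphabet)   # best guess here is simply v = u

print("H(V|U)              =", round(H_VgU, 3))
print("Fano lower bound    =", round(fano_bound, 3))
print("Pe of MAP estimator =", round(map_error, 3))   # always >= the Fano bound
```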
The usefulness of Fano’s inequality stems, in part, from the weak assumptions it
makes. One can apply Fano’s inequality to any joint distribution. Often identification of
an applicable joint distribution is part of the creativity in the use of Fano’s inequality. For
instance in the source coding example above, one takes V to be the stored data sequence,
so |V| = $2^{n(H_B(p)+\delta)}$, and U to be the original source sequence, i.e., U = X^n. While we
do not provide the derivation herein, the result is that to achieve an error probability of
at most Pe the storage rate R is lower-bounded by R ≥ H(X) − Pe log2 |X| − HB (Pe )/n,

where |X| is the source alphabet size; for the binary example |X| = 2. As we let Pe → 0
we see that the lower bound on the achievable rate is H(X) which, letting δ → 0, is also
our upper bound. Hence we have developed an operational approach to data compression
where the rate we achieve matches the converse bound.
We now discuss the interaction between achievability and converse results. As long as
the compression rate R > H(X) then, due to concentration in measure, in the achievability
case the failure probability ε > 0 and rate slack δ > 0 can both be chosen to be arbitrarily
small. Concentration of measure occurs as the blocklength n becomes large. In parallel
with n getting large, the total number of bits stored nR also grows.
The entropy H(X) thus specifies a boundary between two regimes of operation. When
the rate R is larger than H(X), achievability results tell us that arbitrarily reliable storage
is possible. When R is smaller than H(X), converse results imply that reliable storage
is not possible. In particular, rearranging the converse expression and once again noting
that HB (Pe ) ≤ 1, the error probability can be lower-bounded as

$$P_e \ge \frac{H(X) - R - 1/n}{\log_2|\mathcal{X}|}. \qquad (1.8)$$

If R < H(X), then for n sufficiently large Pe is bounded away from zero.
The entropy H(X) thus characterizes a phase transition between one state, the pos-
sibility of reliable data storage, and another, the impossibility. Such sharp information-
theoretic phase transitions also characterize classical information-theoretic results on
data transmission which we discuss in the next section, and applications of information-
theoretic tools in the data sciences which we turn to later in the chapter.
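The sharpness of this transition can be seen by evaluating the converse bound (1.8) directly. The sketch below is our own addition, with arbitrary parameter choices: for a Bernoulli(0.11) source, for which H(X) ≈ 0.5, rates below the entropy force Pe away from zero, while above the entropy the bound becomes vacuous and, by the achievability argument, Pe can in fact be driven to zero.

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p, n, alphabet_size = 0.11, 10_000, 2
H = binary_entropy(p)

for R in [0.30, 0.40, 0.45, 0.50, 0.55, 0.60]:
    # Converse bound (1.8); non-positive values mean the bound is vacuous.
    lower_bound = (H - R - 1.0 / n) / math.log2(alphabet_size)
    print(f"R = {R:.2f}:  Pe >= {max(lower_bound, 0.0):.3f}")
```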

1.3 Channel Coding: Transmission over the Binary Symmetric Channel

Shannon applied the same mix of ideas (typicality, entropy, conditional entropy) to solve
the, perhaps at first seemingly quite distinct, problem of reliable and efficient digital
communications. This is typically referred to as Shannon’s “channel coding” problem
in contrast to the “source coding” problem already discussed.
To gain a sense of the problem we return to the simple binary setting. Suppose our
source coding system has yielded a length-k string of “information bits.” For simplicity
we assume these bits are randomly distributed as before, i.i.d. along the sequence, but
are now fair; i.e., each is equally likely to be “0” or a “1.” The objective is to convey this
sequence over a communications channel to a friend. Importantly we note that, since
the bits are uniformly distributed, our result on source coding tells us that no further
compression is possible. Thus, uniformity of message bits is a worst-case assumption.
The channel we consider is the binary symmetric channel (BSC). We can transmit
binary symbols over a BSC. Each input symbol is conveyed to the destination, but not
entirely accurately. The binary symmetric channel “flips” each channel input symbol
(0 → 1 or 1 → 0) with probability p, 0 ≤ p ≤ 1. Flips occur independently. The challenge
is for the destination to deduce, one hopes with high accuracy, the k information bits


Figure 1.2 On the left we present a graphical description of the binary symmetric channel (BSC).
Each transmitted binary symbol is represented as a 0 or 1 input on the left. Each received binary
observation is represented by a 0 or 1 output on the right. The stochastic relationship between
inputs and outputs is represented by the connectivity of the graph where the probability of
transitioning each edge is represented by the edge label p or 1 − p. The channel is “symmetric”
due to the symmetries in these transition probabilities. On the right we plot the binary entropy
function HB (p) as a function of p, 0 ≤ p ≤ 1. The capacity of the BSC is CBSC = 1 − HB (p).

transmitted. Owing to the symbol flipping noise, we get some slack; we transmit n ≥ k
binary channel symbols. For efficiency’s sake, we want n to be as close to k as possible,
while meeting the requirement of high reliability. The ratio k/n is termed the “rate” of
communication. The length-n binary sequence transmitted is termed the “codeword.”
This “bit flipping” channel can be used, e.g., to model data storage errors in a computer
memory. A graphical representation of the BSC is depicted in Fig. 1.2.

1.3.1 Achievability: A Lower Bound on the Rate of Reliable Data Communication


The idea of channel coding is analogous to human-evolved language. The length-k string
of information bits is analogous to what we think, i.e., the concept we want to impart to
the destination. The length-n codeword string of binary channel symbols is analogous
to what we say (the sentence). There is redundancy in spoken language that makes it
possible for spoken language to be understood in noisy (albeit not too noisy) situations.
We analogously engineer redundancy into what a computer transmits in order to be able
to combat the expected (the typical!) noise events. For the BSC those would be the
expected bit-flip sequences.
We now consider the noise process. For any chosen length-n codeword there are about $\binom{n}{np}$ typical noise patterns which, using the same logic as in our discussion of source compression, is a set of roughly $2^{nH_B(p)}$ patterns. If we call X^n the codeword and E^n the noise sequence, then what the receiver measures is Y^n = X^n + E^n. Here addition is vector addition over F_2, i.e., coordinate-wise, where the addition of two binary symbols is implemented using the XOR operator. The problem faced by the receiver is to identify
the transmitted codeword. One can imagine that if the possible codewords are far apart in
the sense that they differ in many entries (i.e., their Hamming distance is large) then the

receiver will be less likely to make an error when deciding on the transmitted codeword.
Once such a codeword estimate has been made it can then be mapped back to the length-
k information bit sequence. A natural decoding rule, in fact the maximum-likelihood rule,
is for the decoder to pick the codeword closest to Y n in terms of Hamming distance.
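To make the encode–transmit–decode pipeline concrete, here is a small self-contained simulation; it is our own sketch, not code from the book, and the message length, block length, crossover probability, and trial count are arbitrary. It draws a random codebook, passes a codeword through a BSC(p), and decodes by minimum Hamming distance, which for p < 1/2 coincides with maximum-likelihood decoding.

```python
import random

def simulate_bsc_code(k=4, n=15, p=0.05, trials=2000, seed=0):
    """Estimate the block error rate of a random code over a BSC(p)."""
    rng = random.Random(seed)
    # Random codebook: 2^k codewords, each a uniformly random length-n binary tuple.
    codebook = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(2 ** k)]

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    errors = 0
    for _ in range(trials):
        msg = rng.randrange(2 ** k)                       # k information bits
        x = codebook[msg]                                 # transmitted codeword
        y = tuple(b ^ (rng.random() < p) for b in x)      # BSC(p): flip each bit w.p. p
        msg_hat = min(range(2 ** k), key=lambda m: hamming(codebook[m], y))
        errors += (msg_hat != msg)
    return errors / trials

print("rate k/n =", 4 / 15, " estimated block error probability:", simulate_bsc_code())
```

With these (arbitrary) parameters the rate 4/15 sits far below the capacity 1 − HB(0.05) ≈ 0.71; experimenting with larger k at fixed n illustrates how reliability degrades as the rate approaches capacity, in line with the packing argument that follows.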
The design of the codebook (analogous to the choice of grammatically correct – and
thus allowable – sentences in a spoken language) is a type of probabilistic packing prob-
lem. The question is, how do we select the set of codewords so that the probability of a
decoding error is small? We can develop a simple upper bound on how large the set of
reliably decodable codewords can be. There are 2n possible binary output sequences. For
any codeword selected there are roughly 2nHB (p) typical output sequences, each associ-
ated with a typical noise sequence, that form a noise ball centered on the codeword. If we
were simply able to divide up the output space into disjoint sets of cardinality 2nHB (p) , we
would end up with 2n /2nHB (p) = 2n(1−HB (p)) distinct sets. This sphere-packing argument
tells us that the best we could hope to do would be to transmit this number of distinct
codewords reliably. Thus, the number of information bits k would equal n(1 − HB (p)).
Once we normalize by the number n of channel uses we get a transmission rate of
1 − HB (p).
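As a quick numerical illustration of the counting argument above (our own sketch, not part of the original text), the following Python snippet compares (1/n) log2 of the number of typical bit-flip patterns, \binom{n}{np}, with the binary entropy H_B(p); the values of p and n are arbitrary choices.

```python
# A numerical check (ours, not from the text) of the counting argument above:
# (1/n) log2 of the number of typical bit-flip patterns, binom(n, np), approaches
# the binary entropy H_B(p), so the sphere-packing bound leaves a rate of 1 - H_B(p).
from math import lgamma, log, log2

def binary_entropy(p):
    """H_B(p) in bits, with the convention 0*log 0 = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), computed via log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

p, n = 0.11, 10_000
flips = round(n * p)
print("H_B(p)                      :", binary_entropy(p))
print("(1/n) log2 binom(n, np)     :", log2_binom(n, flips) / n)
print("sphere-packing rate 1-H_B(p):", 1 - binary_entropy(p))
```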
Perhaps quite surprisingly, as n gets large, 1 − HB (p) is the supremum of achievable
rates at (arbitrarily) high reliability. This is the Shannon capacity CBSC = 1 − HB (p). The
result follows from the law of large numbers, which can be used to show that the typical
noise balls concentrate. Shannon’s proof that one can actually find a configuration of
codewords while keeping the probability of decoding error small was an early use of
the probabilistic method. For any rate R = C_BSC − δ, where δ > 0 is arbitrarily small, a
randomized choice of the positioning of each codeword will, with high probability, yield
a code with a small probability of decoding error. To see the plausibility of this statement
we revisit the sphere-packing argument. At rate R = C_BSC − δ the 2^{nR} codewords are each
associated with a typical noise ball of 2^{nH_B(p)} sequences. If the noise balls were all (in the
worst case) disjoint, this would be a total of 2^{nR} · 2^{nH_B(p)} = 2^{n(1−H_B(p)−δ)+nH_B(p)} = 2^{n(1−δ)}
sequences. As there are 2^n binary sequences, the fraction of the output space taken up
by the union of typical noise spheres associated with the codewords is 2^{n(1−δ)}/2^n = 2^{−nδ}.
So, for any δ > 0 fixed, as the blocklength n → ∞, only an exponentially disappearing
fraction of the output space is taken up by the noise balls. By choosing the codewords
independently at random, each uniformly chosen over all length-n binary sequences, one
can show that the expected (over the choice of codewords and channel noise realization)
average probability of error is small. Hence, at least one codebook exists that performs
at least as well as this expectation.
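The random-coding argument can also be probed empirically. The sketch below is our own illustration, not from the text; the crossover probability, blocklength, and rate are arbitrary values kept small so that brute-force minimum-distance decoding remains feasible.

```python
# A Monte Carlo sketch (our own illustration, not from the text) of the random-coding
# argument for the BSC: draw a random codebook, flip each transmitted bit with
# probability p, and decode to the closest codeword in Hamming distance (the
# maximum-likelihood rule). The tiny blocklength keeps brute-force decoding cheap;
# at a rate below capacity the estimated block error rate should be small.
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 0.05, 31, 10                  # rate k/n ~ 0.32 < C_BSC = 1 - H_B(0.05) ~ 0.71
codebook = rng.integers(0, 2, size=(2**k, n), dtype=np.uint8)   # Elias-style random code

errors, trials = 0, 200
for _ in range(trials):
    msg = int(rng.integers(0, 2**k))                 # index of the transmitted codeword
    noise = (rng.random(n) < p).astype(np.uint8)     # BSC bit-flip pattern
    y = codebook[msg] ^ noise                        # received word over F_2
    dists = np.count_nonzero(codebook ^ y, axis=1)   # Hamming distance to every codeword
    errors += int(np.argmin(dists) != msg)           # minimum-distance (ML) decoding
print("estimated block error rate:", errors / trials)
```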
While Shannon showed the existence of such a code (actually a sequence of codes
as n → ∞), it took another half-century for researchers in error-correction coding to
find asymptotically optimal code designs and associated decoding algorithms that were
computationally tractable and therefore implementable in practice. We discuss this
computational problem and some of these recent code designs in Section 1.4.
While the above example is set in the context of a binary-input and binary-output
channel model, the result is a prototype of the result that holds for discrete memoryless
channels. A discrete memoryless channel is described by the conditional distribution
p_{Y|X} : X → Y. Memoryless means that output symbols are conditionally independent given the input codeword, i.e., p_{Y^n|X^n}(y^n | x^n) = \prod_{i=1}^{n} p_{Y|X}(y_i | x_i). The supremum of achievable rates is the Shannon capacity C, where

C = \sup_{p_X} [H(Y) − H(Y|X)] = \sup_{p_X} I(X; Y).    (1.9)

In (1.9), H(Y) is the entropy of the output space, induced by the choice of input distribution p_X via p_Y(y) = \sum_{x ∈ X} p_X(x) p_{Y|X}(y|x), and H(Y|X) is the conditional entropy of
pX (·)pY|X (·|·). For the BSC the optimal choice of pX (·) is uniform. We shortly develop an
operational intuition for this choice by connecting it to hypothesis testing. We note that
this choice induces the uniform distribution on Y. Since |Y| = 2, this means that H(Y) = 1.
Further, plugging the channel law of the BSC into (1.7) yields H(Y|X) = HB (p). Putting
the pieces together recovers the Shannon capacity result for the BSC, CBSC = 1 − HB (p).
In (1.9), we introduce the equality H(Y) − H(Y|X) = I(X; Y), where I(X; Y) denotes
the mutual information of the joint distribution pX (·)pY|X (·|·). The mutual information is
another name for the Kullback–Leibler (KL) divergence between the joint distribution
pX (·)pY|X (·|·) and the product of the joint distribution’s marginals, pX (·)pY (·). The gen-
eral formula for the KL divergence between a pair of distributions pU and pV defined
over a common alphabet A is

D(p_U ‖ p_V) = \sum_{a ∈ A} p_U(a) \log_2 \frac{p_U(a)}{p_V(a)}.    (1.10)
In the definition of mutual information over A = X × Y, pX,Y (·, ·) plays the role of pU (·)
and pX (·)pY (·) plays the role of pV (·).
The KL divergence arises in hypothesis testing, where it is used to quantify the error
exponent of a binary hypothesis test. Conceiving of channel decoding as a hypothesis-
testing problem – which one of the codewords was transmitted? – helps us understand
why (1.9) is the formula for the Shannon capacity. One way the decoder can make its
decision regarding the identity of the true codeword is to test each codeword against
independence. In other words, does the empirical joint distribution of any particular
codeword X n and the received data sequence Y n look jointly distributed according to the
channel law or does it look independent? That is, does (X n , Y n ) look like it is distributed
i.i.d. according to p_{XY}(·, ·) = p_X(·) p_{Y|X}(·|·) or i.i.d. according to p_X(·) p_Y(·)? The exponent of the error in this test is −D(p_{XY} ‖ p_X p_Y) = −I(X; Y). Picking the input distribution p_X to maximize (1.9) maximizes this exponent. Finally, via an application of the union bound we can assert that, roughly, 2^{nI(X;Y)} codewords can be allowed before more than one codeword in the codebook appears to be jointly distributed with the observation vector Y^n according to p_{XY}.
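To make the connection between (1.9), (1.10), and the BSC concrete, the following sketch (ours, not from the text) evaluates the KL divergence between the joint distribution p_XY and the product of its marginals for a uniform input, and checks that it equals 1 − H_B(p).

```python
# A short sketch (ours, for illustration) that evaluates (1.10) numerically for the BSC
# with a uniform input: the KL divergence between p_XY and the product of its marginals
# equals the mutual information I(X;Y), which matches C_BSC = 1 - H_B(p).
import numpy as np

def kl(p, q):
    """D(p || q) in bits for two distributions given as arrays over the same alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = 0.11
p_x = np.array([0.5, 0.5])                          # capacity-achieving input for the BSC
p_y_given_x = np.array([[1 - p, p], [p, 1 - p]])    # channel law
p_xy = p_x[:, None] * p_y_given_x                   # joint distribution
p_y = p_xy.sum(axis=0)                              # output marginal (uniform here)
prod = p_x[:, None] * p_y[None, :]                  # product of the marginals

I_xy = kl(p_xy.ravel(), prod.ravel())
H_B = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print("I(X;Y) =", I_xy, "  1 - H_B(p) =", 1 - H_B)  # the two agree
```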

1.3.2 Converse: An Upper Bound on the Rate of Reliable Data Communication


An application of Fano’s inequality (1.6) shows that C is also an upper bound on the
achievable communication rate. This application of Fano’s inequality is similar to that
used in source coding. In this application of (1.6), we set V = X n and U = Y n . The greatest
additional subtlety is that we must leverage the memoryless property of the channel to
single-letterize the bound. To single-letterize means to express the final bound in terms
of only the pX (·)pY|X (·|·) distribution, rather than in terms of the joint distribution of
the length-n input and output sequences. This is an important step because n is allowed
to grow without bound. By single-letterizing we express the bound in terms of a fixed
distribution, thereby making the bound computable.
As at the end of the discussion of source coding, in channel coding we find a boundary
between two regimes of operation: the regime of efficient and reliable data transmission,
and the regime wherein such reliable transmission is impossible. In this instance, it is
the Shannon capacity C that demarcates the phase transition between these two regimes of
operation.

1.4 Linear Channel Coding

In the previous sections, we discussed the sharp phase transitions in both source and
channel coding discovered by Shannon. These phase transitions delineate fundamental
boundaries between what is possible and what is not. In practice, one desires schemes
that “saturate” these bounds. In the case of source coding, we can saturate the bound if
we can design source coding techniques with rates that can be made arbitrarily close to
H(X) (from above). For channel coding we desire coding methods with rates that can be
made arbitrarily close to C (from below). While Shannon discovered and quantified the
bounds, he did not specify realizable schemes that attained them.
Decades of effort have gone into developing methods of source and channel coding.
For lossless compression of memoryless sources, as in our motivating examples, good
approaches such as Huffman and arithmetic coding were found rather quickly. On the
other hand, finding computationally tractable and therefore implementable schemes of
error-correction coding that got close to capacity took much longer. For a long time it
was not even clear that computationally tractable techniques of error correction that
saturated Shannon’s bounds were possible. For many years researchers thought that
there might be a second phase transition at the cutoff rate, only below which compu-
tationally tractable methods of reliable data transmission existed. (See [10] for a nice
discussion.) Indeed, only with the emergence of modern coding theory in the 1990s and
2000s – which studies turbo, low-density parity-check (LDPC), spatially coupled LDPC, and
Polar codes – has the research community, even for the BSC, developed computationally
tractable methods of error correction that closed the gap to Shannon’s bound.
In this section, we introduce the reader to linear codes. Almost all codes in use have
linear structure, structure that can be exploited to reduce the complexity of the decoding
process. As in the previous sections we only scratch the surface of the discipline of
error-correction coding. We point the reader to the many excellent texts on the subject,
e.g., [6, 11–15].

1.4.1 Linear Codes and Syndrome Decoding


Linear codes are defined over finite fields. As we have been discussing the BSC, the field
we will focus on is F2 . The set of codewords of a length-n binary linear code corresponds
to a subspace of the vector space F_2^n. To encode we use a matrix–vector multiplication defined over F_2 to map a length-k column vector b ∈ F_2^k of “information bits” into a length-n column vector x ∈ F_2^n of binary “channel symbols” as

x = G^T b,    (1.11)

where G ∈ F_2^{k×n} is a k × n binary “generator” matrix and G^T denotes the transpose of G. Assuming that G is full rank, all 2^k possible binary vectors b are mapped by G into 2^k distinct codewords x, so the set of possible codewords (the “codebook”) is the row-space of G. We compute the rate of the code as R = k/n.
Per our earlier discussion, the channel adds the length-n noise sequence e to x,
yielding the channel output y = x + e. To decode, the receiver pre-multiplies y by the
parity-check matrix H ∈ F_2^{m×n} to produce the length-m syndrome s as

s = Hy.    (1.12)

We caution the reader not to confuse the parity-check matrix H with the entropy function
H(·). By design, the rows of H are all orthogonal to the rows of G and thus span the
null-space of G.3 When the columns of G are linearly independent, the dimension of the
null-space of G is n − k and the relation m = n − k holds.
Substituting the definition of x into the expression for y, and thence into (1.12), we
compute

s = H(G^T b + e) = H G^T b + H e = H e,    (1.13)

where the last step follows because the rows of G and H are orthogonal by design so that
HGT = 0, the m × k all-zeros matrix. Inspecting (1.13), we observe that the computation
of the syndrome s yields m linear constraints on the noise vector e.
Since e is of length n and m = n − k, (1.13) specifies an under-determined set of lin-
ear equations in F2 . However, as already discussed, while e could be any vector, when
the blocklength n becomes large, concentration of measure comes into play. With high
probability the realization of e will concentrate around those sequences that contain only
np non-zero elements. We recall that p ∈ [0, 1] is the bit-flip probability and note that
in F2 any non-zero element must be a one. In coding theory, we are therefore faced
with the problem of solving an under-determined set of linear equations subject to a
sparsity constraint: There are only about np non-zero elements in the solution vector.
Consider p ≤ 0.5. Then, as error vectors e containing fewer bit flips are more likely,
the maximum-likelihood solution for the noise vector e is to find the maximally sparse
vector that satisfies the syndrome constraints, i.e.,

\hat{e} = \arg\min_{e ∈ F_2^n} d_H(e)   such that   s = He,    (1.14)

where dH (·) is the Hamming weight (or distance from 0n ) of the argument. As mentioned
before, the Hamming weight is the number of non-zero entries of e. It plays a role

3 Note that in finite fields vectors can be self-orthogonal; e.g., in F2 any even-weight vector is orthogonal to
itself.
analogous to the cardinality function in R^n (sometimes denoted ‖·‖_0), which is often used to enforce sparsity in the solution to optimization problems.
We observe that there are roughly 2nHB (p) typical binary bit-flip sequences, each with
roughly np non-zero elements. The syndrome s provides m linear constraints on the noise
sequence. Each constraint is binary so that, if all constraints are linearly independent,
each constraint reduces by 50% the set of possible noise sequences. Thus, if the number
m of constraints exceeds log2 (2nHB (p) ) = nHB (p) we should be able to decode correctly.4
Decoders can thus be thought of as solving a binary search problem where the mea-
surements/queries are fixed ahead of time, and the decoder uses the results of the queries,
often in an iterative fashion, to determine e. Once \hat{e} has been calculated, the codeword
estimate is \hat{x} = y + \hat{e} = G^T b + (e − \hat{e}). If \hat{e} = e, then the term in brackets cancels and b
can uniquely and correctly be recovered from G^T b. This last point follows since the
codebook is the row-space of G and G is full rank.
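The encoding, syndrome, and minimum-weight decoding steps in (1.11)–(1.14) can be illustrated end to end with a toy linear code. The sketch below (ours) uses the standard (7,4) Hamming code as a small example – our choice, not a code analyzed in the text – and solves (1.14) by brute-force search, which is only feasible at such tiny blocklengths.

```python
# A toy end-to-end sketch of (1.11)-(1.14) (our own illustration): encode with a
# generator matrix over F_2, add a sparse error pattern, compute the syndrome, and
# decode by brute-force search for the lowest-weight error consistent with it.
import itertools
import numpy as np

G = np.array([[1, 0, 0, 0, 1, 1, 0],     # generator matrix, k = 4, n = 7
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)
H = np.array([[1, 1, 0, 1, 1, 0, 0],     # parity-check matrix, m = n - k = 3
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]], dtype=np.uint8)
assert not np.any(H @ G.T % 2)           # rows of H are orthogonal to rows of G

b = np.array([1, 0, 1, 1], dtype=np.uint8)
x = G.T @ b % 2                            # codeword, x = G^T b over F_2
e = np.zeros(7, dtype=np.uint8); e[2] = 1  # a single bit flip (sparse noise)
y = x ^ e                                  # channel output
s = H @ y % 2                              # syndrome, s = He

def syndrome_decode(H, s):
    """Return the minimum Hamming-weight e_hat with H e_hat = s (brute force, as in (1.14))."""
    n = H.shape[1]
    for w in range(n + 1):                           # search by increasing weight
        for idx in itertools.combinations(range(n), w):
            e_hat = np.zeros(n, dtype=np.uint8); e_hat[list(idx)] = 1
            if np.array_equal(H @ e_hat % 2, s):
                return e_hat
    raise ValueError("no solution")

e_hat = syndrome_decode(H, s)
print("recovered error pattern matches:", np.array_equal(e_hat, e))
print("decoded codeword matches       :", np.array_equal(y ^ e_hat, x))
```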
Noting from the previous section that the capacity of the BSC is C = 1 −
HB (p) and the rate of the code is R = k/n, we would achieve capacity if 1 − HB (p) = k/n
or, equivalently, if the syndrome length m = n − k = n(1 − k/n) = nHB (p). This is the
objective of coding theory: to find “good” codes (specified by their generator matrix G
or, alternately, by their parity-check matrix H) and associated decoding algorithms (that
attempt to solve (1.14) in a computationally efficient manner) so as to be able to keep
R = k/n as close as possible to CBSC = 1 − HB (p).

1.4.2 From Linear to Computationally Tractable: Polar Codes


To understand the challenge of designing computationally tractable codes, say that, in the
previous discussion, one picked G (or H) according to the Bernoulli-0.5 random i.i.d.
measure. Then, for any fixed rate R, if one sets the blocklength n to be sufficiently large,
the generator (or parity-check) matrix produced will, with arbitrarily high probability,
specify a code that is capacity-achieving. Such a selection of G or H is respectively
referred to as the Elias or the Gallager ensemble.
However, attempting to use the above capacity-achieving scheme can be problematic
from a computational viewpoint. To see the issue consider blocklength n = 4000 and
rate R = 0.5, which are well within normal operating ranges for these parameters. For
these choices there are 2^{nR} = 2^{2000} codewords. Such an astronomical number of code-
words makes the brute-force solution of (1.14) impossible. Hence, the selection of G or
H according to the Elias or Gallager ensembles, while yielding linear structure, does not
by itself guarantee that a computationally efficient decoder will exist. Modern methods
of coding – LDPC codes, spatially coupled codes, and Polar codes – while being linear
codes also design additional structure into the code ensemble with the express intent of

4 We comment that this same syndrome decoding can also be used to provide a solution to the near-lossless
source coding problem of Section 1.2. One pre-multiplies the source sequence by the parity-check matrix
H, and stores the syndrome of the source sequence. For a biased binary source, one can solve (1.14) to
recover the source sequence with high probability. This approach does not feature prominently in source
coding, with the exception of distributed source coding, where it plays a prominent role. See [7, 9] for
further discussion.
making it compatible with computationally tractable decoding algorithms. To summa-


rize, in coding theory the design of a channel coding scheme involves the joint design of
the codebook and the decoding algorithm.
Regarding the phase transitions discovered by Shannon for source and channel cod-
ing, a very interesting code construction is Erdal Arikan’s Polar codes [16]. Another
tractable code construction that connects to phase transitions is the spatial-coupling con-
cept used in convolutionally structured LDPC codes [17–19]. In [16], Arikan considers
symmetric channels and introduces a symmetry-breaking transformation. This transfor-
mation is a type of pre-coding that combines pairs of symmetric channels to produce
a pair of virtual channels. One virtual channel is “less noisy” than the original chan-
nel and one is more noisy. Arikan then applies this transformation recursively. In the
limit, the virtual channels polarize. They either become noiseless and so have capac-
ity one, or become useless and have capacity zero. Arikan shows that, in the limit, the
fraction of virtual channels that become noiseless is equal to the capacity of the orig-
inal symmetric channel; e.g., 1 − HB (p), if the original channel were the BSC. One
transmits bits uncoded over the noiseless virtual channels, and does not use the use-
less channels. The recursive construction yields log-linear complexity in encoding and
decoding, O(n log n), making Polar codes computationally attractive. In many ways, the
construction is information-theoretic in nature, focusing on mutual information rather
than Hamming distance as the quantity of importance in the design of capacity-achieving
codes.
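The polarization phenomenon can also be visualized numerically. The sketch below is our own illustration and tracks Arikan’s transform for a binary erasure channel (BEC) rather than the BSC, because for the BEC the virtual channels’ erasure probabilities obey a simple exact recursion; for other symmetric channels one typically tracks bounds such as the Bhattacharyya parameters instead.

```python
# A numerical sketch (ours) of channel polarization. For a BEC with erasure
# probability z, one polarization step produces a 'worse' channel with erasure
# probability 2z - z^2 and a 'better' one with z^2. The fraction of virtual channels
# that become nearly noiseless approaches the capacity 1 - eps of the original channel.
import numpy as np

def polarize(eps, levels):
    """Erasure probabilities of the 2**levels virtual channels of a BEC(eps)."""
    z = np.array([eps])
    for _ in range(levels):
        z = np.concatenate([2 * z - z**2, z**2])   # 'worse' and 'better' halves
    return z

eps = 0.3                                          # BEC capacity is 1 - eps = 0.7
for levels in (4, 8, 12, 16):
    z = polarize(eps, levels)
    good = np.mean(z < 1e-3)                       # nearly noiseless virtual channels
    print(f"n = 2^{levels:2d}: fraction of nearly noiseless channels = {good:.3f}")
```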
To conclude, we note that many important concepts, such as fundamental lim-
its, achievability results, converse results, and computational limitations, that arise in
classical information theory also arise in modern data-science problems. In classical
information theory, as we have seen, such notions have traditionally been consid-
ered in the context of data compression and transmission. In data science similar
notions are being studied in the realms of acquisition, data representation, analysis,
and processing. There are some instances where one can directly borrow classical
information-theoretic tools used to determine limits in, e.g., the channel coding prob-
lem to compute limits in data-science tasks. For example, in compressive sensing [20]
and group testing [21] achievability results have been derived using the probabilistic
method and converse results have been developed using Fano’s inequality [22]. How-
ever, there are various other data-science problems where information-theoretic methods
have not yet been directly applied. We elaborate further in the following sections on how information-theoretic ideas, tools, and methods are also gradually shaping data science.

1.5 Connecting Information Theory to Data Science

Data science – a loosely defined concept meant to bring together various problems stud-
ied in statistics, machine learning, signal processing, harmonic analysis, and computer
science under a unified umbrella – involves numerous other challenges that go beyond
the traditional source coding and channel coding problems arising in communication
or storage systems. These challenges are associated with the need to acquire, represent,
and analyze information buried in data in a reliable and computationally efficient manner
in the presence of a variety of constraints such as security, privacy, fairness, hardware
resources, power, noise, and many more.
Figure 1.3 presents a typical data-science pipeline, encompassing functions such as
data acquisition, data representation, and data analysis, whose overarching purpose is to
turn data captured from the physical world into insights for decision-making. It is also
common to consider various other functions within a data-science “system” such as data
preparation, data exploration, and more. We restrict ourselves to this simplified version
because it serves to illustrate how information theory is helping shape data science. The
goals of the different blocks of the data-science pipeline in Fig. 1.3 are as follows.
• The data-acquisition block is often concerned with the act of turning physical-
world continuous-time analog signals into discrete-time digital signals for further
digital processing.
• The data-representation block concentrates on the extraction of relevant attributes
from the acquired data for further analysis.
• The data-analysis block concentrates on the extraction of meaningful actionable
information from the data features for decision-making.
From the description of these goals, one might think that information theory – a field
that arose out of the need to study communication systems in a principled manner – has
little to offer to the principles of data acquisition, representation, analysis, or processing.
But it turns out that information theory has been advancing our understanding of data
science in three major ways.
• First, information theory has been leading to new system architectures for the
different elements of the data-science pipeline. Representative examples associ-
ated with new architectures for data acquisition are overviewed in Section 1.6.
• Second, information-theoretic methods can be used to unveil fundamental
operational limits in various data-science tasks, including in data acquisition,
representation, analysis, and processing. Examples are overviewed in Sections
1.6–1.8.
• Third, information-theoretic measures can be used as the basis for developing
algorithms for various data-science tasks. We allude to some examples in Sections
1.7 and 1.8.
In fact, the questions one can potentially ask about the data-science pipeline depicted
in Fig. 1.3 exhibit many parallels to the questions one asks about the communications

Figure 1.3 A simplified data-science pipeline encompassing functions such as data acquisition,
data representation, and data analysis and processing.
system architecture shown in Fig. 1.1. Specifically, what are the trade-offs in perfor-
mance incurred by adopting this data-science architecture? In particular, are there other
systems that do not involve the separation of the different data-science elements and
exhibit better performance? Are there fundamental limits on what is possible in data
acquisition, representation, analysis, and processing? Are there computationally feasi-
ble algorithms for data acquisition, representation, analysis, and processing that attain
such limits?
There has been progress in data science in all three of these directions. As a
concrete example that showcases many similarities between the data-compression
and data-communication problems and data-science problems, information-theoretic meth-
ods have been providing insight into the various operational regimes associated with
the following different data-science tasks: (1) the regime where there is no algorithm –
regardless of its complexity – that can perform the desired task subject to some accu-
racy; this “regime of impossibility” in data science has the flavor of converse results in
source coding and channel coding in information theory; (2) the regime where there are
algorithms, potentially very complex and computationally infeasible, that can perform
the desired task subject to some accuracy; this “regime of possibility” is akin to the
initial discussion of linear codes and the Elias and Gallager ensembles in channel cod-
ing; and (3) the regime where there are computationally feasible algorithms to perform
the desired task subject to some accuracy; this “regime of computational feasibility” in
data science has many characteristics that parallel those in design of computationally
tractable source and channel coding schemes in information theory.
Interestingly, in the same way that the classical information-theoretic problems of
source coding and channel coding exhibit phase transitions, many data-science prob-
lems have also been shown to exhibit sharp phase transitions in the high-dimensional
setting, where the number of data samples and the dimension of the data approach
infinity. Such phase transitions are typically expressed as a function of various param-
eters associated with the data-science problem. The resulting information-theoretic
limit/threshold/barrier (a.k.a. statistical phase transition) partitions the problem param-
eter space into two regions [23–25]: one defining problem instances that are impossible
to solve and another defining problem instances that can be solved (perhaps only
with a brute-force algorithm). In turn, the computational limit/threshold/barrier (a.k.a.
computational phase transition) partitions the problem parameter space into a region
associated with problem instances that are easy to solve and another region associated
with instances that are hard to solve [26, 27].
There can, however, be differences in how one establishes converse and achievability
results – and therefore phase transitions – in classical information-theoretic problems
and data-science ones. Converse results in data science can often be established using
Fano’s inequality or variations on it (see also Chapter 16). In contrast, achievability
results often cannot rely on classical techniques, such as the probabilistic method,
necessitating instead the direct analysis of the algorithms. Chapter 13 elaborates on
some emerging tools that may be used to establish statistical and computational limits
in data-science problems.
In summary, numerous information-theoretic tools, methods, and quantities are


increasingly becoming essential to cast insight into data science. It is impossible to cap-
ture all the recent developments in a single chapter, but the following sections sample a
number of recent results under three broad themes: data acquisition, data representation,
and data analysis and processing.

1.6 Information Theory and Data Acquisition

Data acquisition is a critical element of the data-science architecture shown in Fig. 1.3.
It often involves the conversion of a continuous-time analog signal into a discrete-time
digital signal that can be further processed in digital signal-processing pipelines.5
Conversion of a continuous-time analog signal x(t) into a discrete-time digital rep-
resentation typically entails two operations. The first operation – known as sampling –
involves recording the values of the original signal x(t) at particular instants of time. The
simplest form of sampling is direct uniform sampling in which the signal is recorded at
uniform sampling times x(kT s ) = x(k/Fs ), where T s denotes the sampling period (in
seconds), Fs denotes the sampling frequency (in Hertz), and k is an integer. Another
popular form of sampling is generalized shift-invariant sampling in which x(t) is first
filtered by a linear time-invariant (LTI) filter, or a bank of LTI filters, and only then sam-
pled uniformly [28]. Other forms of generalized and non-uniform sampling have also
been studied. Surprisingly, under certain conditions, the sampling process can be shown
to be lossless: for example, the classical sampling theorem for bandlimited processes
asserts that it is possible to perfectly recover the original signal from its uniform sam-
ples provided that the sampling frequency Fs is at least twice the signal bandwidth B.
This minimal sampling frequency FNQ = 2B is referred to as the Nyquist rate [28].
The second operation – known as quantization – involves mapping the continuous-
valued signal samples onto discrete-valued ones. The levels are taken from a finite set
of levels that can be represented using a finite sequence of bits. In (optimal) vector
quantization approaches, a series of signal samples are converted simultaneously to a
bit sequence, whereas in (sub-optimal) scalar quantization, each individual sample is
mapped to bits. The quantization process is inherently lossy since it is impossible to
accurately represent real-valued samples using a finite set of bits. Rate-distortion the-
ory establishes a trade-off between the average number of bits used to encode each
signal sample – referred to as the rate – and the average distortion incurred in the
reconstruction of each signal sample – referred to simply as the distortion – via two func-
tions. The rate-distortion function R(D) specifies the smallest number of bits required
on average per sample when one wishes to represent each sample with average distor-
tion less than D, whereas the distortion-rate function D(R) specifies the lowest average
distortion achieved per sample when one wishes to represent on average each sample
with R bits [7]. A popular measure of distortion in the recovery of the original signal

5 Note that it is also possible that the data are already presented in an inherently digital format; Chapters 3,
4, and 6 deal with such scenarios.
samples from the quantized ones is the mean-squared error (MSE). Note that this class of
problems – known as lossy source coding – is the counterpart of the lossless source
coding problems discussed earlier.
The motivation for this widely used data-acquisition architecture, involving (1) a
sampling operation at or just above the Nyquist rate and (2) scalar or vector quanti-
zation operations, is its simplicity, which leads to a practical implementation. However,
it is well known that the separation of the sampling and quantization operations is not
necessarily optimal. Indeed, the optimal strategy that attains Shannon’s distortion-rate
function associated with arbitrary continuous-time random signals with known statis-
tics involves a general mapping from continuous-time signal space to a sequence of
bits that does not consider any practical constraints in its implementation [1, 3, 29].
Therefore, recent years have witnessed various generalizations of this data-acquisition
paradigm informed by the principles of information theory, on the one hand, and guided
by practical implementations, on the other.
One recent extension considers a data-acquisition paradigm that illuminates the
dependence between these two operations [30–32]. In particular, given a total rate bud-
get, Kipnis et al. [30–32] draw on information-theoretic methods to study the lowest
sampling rate required to sample a signal such that the reconstruction of the signal
from the bit-constrained representation of its samples results in minimal distortion. The
sampling operation consists of an LTI filter, or bank of filters, followed by pointwise
sampling of the outputs of the filters. The authors also show that, without assuming
any particular structure on the input analog signal, this sampling rate is often below the
signal’s Nyquist rate. That is, due to the fact that there is loss encountered by the quan-
tization operation, there is no longer in general a requirement to sample the signal at the
Nyquist rate.
As an example, consider the case where x(t) is a stationary random process bandlim-
ited to B with a triangular power spectral density (PSD) given formally by

S(f) = (σ_x^2 / B) [1 − |f/B|]_+    (1.15)

with [a]_+ = max(a, 0). In this case, the Nyquist sampling rate is 2B. However, when
quantization is taken into account, the sampling rate can be lowered without introducing
further distortion. Specifically, assuming a bitrate leading to distortion D, the minimal
sampling rate is shown to be equal to [32]

f_R = 2B \sqrt{1 − D/σ_x^2}.    (1.16)
Thus, as the distortion grows, the minimal sampling rate is reduced. When we do not
allow any distortion, namely, no quantization takes place, D = 0 and fR = 2B so that
Nyquist rate sampling is required.
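A small sketch (ours, not from the text) of the trade-off captured by (1.16): for the triangular PSD of (1.15), the minimal sampling rate drops below the Nyquist rate 2B as soon as any distortion is tolerated, and returns to 2B only when D = 0. The values of B, σ_x², and D below are arbitrary illustrative choices.

```python
# Evaluate the minimal sampling rate of (1.16) for the triangular PSD of (1.15).
import numpy as np

def minimal_sampling_rate(D, B=1.0, sigma2_x=1.0):
    """f_R = 2B sqrt(1 - D/sigma_x^2), valid for 0 <= D <= sigma_x^2."""
    return 2 * B * np.sqrt(1 - D / sigma2_x)

for D in (0.0, 0.1, 0.25, 0.5):
    print(f"D = {D:4.2f}:  f_R = {minimal_sampling_rate(D):.3f}  (Nyquist rate = 2.000)")
```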
Such results show how information-theoretic methods are leading to new insights
about the interplay between sampling and quantization. In particular, these new results
can be seen as an extension of the classical sampling theorem applicable to bandlimited
random processes in the sense that they describe the minimal amount of excess distortion
in the reconstruction of a signal due to lossy compression of its samples, leading to
the minimal sampling frequency required to achieve this distortion.6 In general, this
sampling frequency is below the Nyquist rate. Chapter 2 surveys some of these recent
results in data acquisition.
Another generalization of the classical data-acquisition paradigm considers scenar-
ios where the end goal is not to reconstruct the original analog signal x(t) but rather to
perform some other operation on it [33]. For example, in the context of parameter esti-
mation, Rodrigues et al. [33] show that the number of bits per sample required to achieve
a certain distortion in such task-oriented data acquisition can be much lower than that
required for task-ignorant data acquisition. More recently, Shlezinger et al. [34, 35]
study task-oriented hardware-efficient data-acquisition systems, where optimal vector
quantizers are replaced by practical ones. Even though the optimal rate-distortion curve
cannot be achieved by replacing optimal vector quantizers by simple serial scalar ones,
it is shown in [34, 35] that one can get close to the minimal distortion in settings where
the information of interest is not the signal itself, but rather a low-dimensional param-
eter vector embedded in the signal. A practical application of this setting is in massive
multiple-input multiple-output (MIMO) systems where there is a strong need to utilize
simple low-resolution quantizers due to power and memory constraints. In this context, it
is possible to design a simple quantization system, consisting of scalar uniform quantiz-
ers and linear pre- and post-processing, leading to minimal channel estimation distortion.
These recent results also showcase how information-theoretic methods can provide
insight into the interplay between data acquisition, representation, and analysis, in
the sense that knowledge of the data-analysis goal can influence the data-acquisition
process. These results therefore also suggest new architectures for the conventional data-
science pipeline that do not involve a strict separation between the data-acquisition,
data-representation, and data-analysis and processing blocks.
Beyond this data-acquisition paradigm involving the conversion of continuous-time
signals to digital ones, recent years have also witnessed the emergence of various
other data-acquisition approaches. Chapters 3, 4, and 6 cover further data-acquisition
strategies that are also benefiting from information-theoretic methods.

1.7 Information Theory and Data Representation

The outputs of the data-acquisition block – often known as “raw” data – typically need
to be turned into “meaningful” representations – known as features – for further data
analysis. Note that the act of transforming raw data into features, where the number
of dimensions in the features is lower than that in the raw data, is also referred to as
dimensionality reduction.
Recent years have witnessed a shift from model-based data representations, rely-
ing on predetermined transforms – such as wavelets, curvelets, and shearlets – to
compute the features from raw data, to data-driven representations that leverage a

6 In fact, this theory can be used even when the input signal is not bandlimited.
number of (raw) data “examples”/“instances” to first learn a (linear or nonlinear)


data representation transform conforming to some postulated data generative model
[36–39]. Mathematically, given N (raw) data examples {y_i ∈ Y}_{i=1}^{N} (referred to as training samples), data-driven representation learning often assumes generative models of


the form
yi = F(xi ) + wi , i = 1, . . . , N, (1.17)
where xi ∈ X denote feature vectors that are assumed to be realizations of a random
vector X distributed as pX , wi denote acquisition noise and/or modeling errors that are
realizations of random noise W distributed as pW , and F : X → Y denotes the true
(linear or nonlinear) representation transform that belongs to some postulated class of
transforms F . The operational challenge in representation learning is to estimate F(·)
using the training samples, after which the features can be obtained using the inverse
images of data samples returned by
G \overset{\mathrm{def}}{=} F^{−1} : Y → X.    (1.18)
Note that if F(·) is not a bijection then G(·) will not be the inverse operator.7
In the literature, F(·) and G(·) are sometimes also referred to as the synthesis operator
and the analysis operator, respectively. In addition, the representation-learning problem
as stated is referred to as unsupervised representation learning [37, 39]. Another cate-
gory of representation learning is supervised representation learning [40, 41], in which
training data correspond to tuples (y_i, ℓ_i) with ℓ_i termed the label associated with training
sample yi . Representation learning in this case involves obtaining an analysis/synthesis
operator that results in the best task-specific (e.g., classification and regression) per-
formance. Another major categorization of representation learning is in terms of the
linearity of G(·), with the resulting classes referred to as linear representation learning
and nonlinear representation learning, respectively.
The problem of learning (estimating) the true transformation from a given postulated
generative model poses various challenges. One relates to the design of appropriate
algorithms for estimating F(·) and computing inverse images G(·). Another challenge
involves understanding information-theoretic and computational limitations in repre-
sentation learning in order to identify regimes where existing algorithms are nearly
optimal, regimes where existing algorithms are clearly sub-optimal, and to guide the
development of new algorithms. These challenges are also being addressed using
information-theoretic tools. For example, researchers often map the representation learn-
ing problem onto a channel coding problem, where the transformation F(·) represents
the message that needs to be decoded at the output of a channel that maps F(·) onto
F(X) + W. This allows leveraging of information-theoretic tools such as Fano’s inequal-
ity for derivation of fundamental limits on the estimation error of F(·) as a function
of the number of training samples [25, 42–45]. We next provide a small sampling of
representation learning results that involve the use of information-theoretic tools and
methods.

7 In some representation learning problems, instead of using F(·) to obtain the inverse images of data
samples, G(·) is learned directly from training samples.
1.7.1 Linear Representation Learning


Linear representation learning constitutes one of the oldest and, to date, most prevalent data-representation techniques in data science. While there are several different
variants of linear representation learning both in unsupervised and in supervised set-
tings, all these variants are based on the assumption that the raw data samples lie near
a low-dimensional (affine) subspace. Representation learning in this case is therefore
equivalent to learning the subspace(s) underlying raw data. This will be a single subspace
in the unsupervised setting, as in principal component analysis (PCA) [46–48] and inde-
pendent component analysis (ICA) [49, 50], and multiple subspaces in the supervised
setting, as in linear discriminant analysis (LDA) [51–53] and quadratic discriminant
analysis (QDA) [53].
Mathematically, linear representation learning operates under the assumption of the
raw data space being Y = R^m, the feature space being X = R^k with k ≪ m, the raw data
samples being given by

Y = AX + W (1.19)

with A ∈ F ⊂ R^{m×k}, and the feature estimates being given by \hat{X} = BY with B ∈ R^{k×m}. In
this setting, (F,G) = (A, B) and representation learning reduces to estimating the linear
operators A and/or B under various assumptions on F and the generative model.8 In the
case of PCA, for example, it is assumed that F is the Stiefel manifold in Rm and the
feature vector X is a random vector that has zero mean and uncorrelated entries. On
the other hand, ICA assumes X to have zero mean and independent entries. (The zero-
mean assumption in both PCA and ICA is for ease of analysis and can be easily removed
at the expense of extra notation.)
Information-theoretic frameworks have long been used to develop computational
approaches for estimating (A, B) in ICA and its variants; see, e.g., [49, 50, 54, 55].
Recent years have also seen the use of information-theoretic tools such as Fano’s
inequality to derive sharp bounds on the feasibility of linear representation learning.
One such result that pertains to PCA under the so-called spiked covariance model is
described next.
Suppose the training data samples are N independent realizations according to
(1.19), i.e.,

yi = Axi + wi , i = 1, . . . , N, (1.20)

where AT A = I and both xi and wi are independent realizations of X and W that have
zero mean and diagonal covariance matrices given by

E[XXT ] = diag(λ1 , . . . , λk ), λ1 ≥ λ2 ≥ · · · ≥ λk > 0, (1.21)

and E[WWT ] = σ2 I, respectively. Note that the ideal B in this PCA example is
given by B = AT . It is then shown in [43, Theorem 5] using various analytical

8 Supervised learning typically involves estimation of multiple As and/or Bs.


tools, which include Fano’s inequality, that A can be reliably estimated from N training
samples only if 9

N λ_k^2 / [k(m − k)(1 + λ_k)] → ∞.    (1.22)
This is the “converse” for the spiked covariance estimation problem.
The “achievability” result for this problem is also provided in [43]. Specifically, when
the condition given in (1.22) is satisfied, a practical algorithm exists that allows reliable
estimation of A [43]. This algorithm involves taking \hat{A} to be the k eigenvectors corresponding to the k largest eigenvalues of the sample covariance (1/N) \sum_{i=1}^{N} y^i (y^i)^T of the
training data. We therefore have a sharp information-theoretic phase transition in this
problem, which is characterized by (1.22). Notice here, however, that while the converse
makes use of information-theoretic tools, the achievability result does not involve the
use of the probabilistic method; rather, it requires analysis of an explicit (deterministic)
algorithm.
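The achievability side of this phase transition is easy to simulate. The sketch below is our own illustration (all problem sizes, spike values, and noise levels are arbitrary): it draws data from the spiked covariance model (1.20), forms the sample covariance, and takes its top-k eigenvectors as the estimate of A, showing the subspace error shrinking as N grows.

```python
# A simulation sketch (ours) of the achievability argument above for the spiked
# covariance model (1.20): estimate the subspace by the top-k eigenvectors of the
# sample covariance and watch the subspace error decrease with the sample size N.
import numpy as np

rng = np.random.default_rng(1)
m, k, sigma2 = 50, 3, 1.0
lam = np.array([8.0, 6.0, 4.0])                    # spike eigenvalues lambda_1..lambda_k
A = np.linalg.qr(rng.standard_normal((m, k)))[0]   # orthonormal columns: A^T A = I

def subspace_error(A, A_hat):
    """Spectral-norm distance between the projectors onto the two subspaces."""
    P, P_hat = A @ A.T, A_hat @ A_hat.T
    return np.linalg.norm(P - P_hat, 2)

for N in (50, 500, 5000):
    X = rng.standard_normal((N, k)) * np.sqrt(lam)        # features with E[XX^T] = diag(lam)
    W = np.sqrt(sigma2) * rng.standard_normal((N, m))     # isotropic noise
    Y = X @ A.T + W                                       # y_i = A x_i + w_i
    S = Y.T @ Y / N                                       # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)                  # eigenvalues in ascending order
    A_hat = eigvecs[:, -k:]                               # top-k eigenvectors
    print(f"N = {N:5d}:  subspace error = {subspace_error(A, A_hat):.3f}")
```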
The sharp transition highlighted by the aforementioned result can be interpreted in
various ways. One of the implications of this result is that it is impossible to reliably
estimate the PCA features when m > N and m, N → ∞. In such high-dimensional PCA
settings, it is now well understood that sparse PCA, in which the columns of A are
approximately “sparse,” is more appropriate for linear representation learning. We refer
the reader to works such as [43, 56, 57] that provide various information-theoretic limits
for the sparse PCA problem.
We conclude by noting that there has been some recent progress regarding bounds
on the computational feasibility of linear representation learning. For example, the
fact that there is a practical algorithm to learn a linear data representation in some
high-dimensional settings implies that computational barriers can almost coincide with
information-theoretic ones. It is important to emphasize, though, that recent work –
applicable to the detection of a subspace structure within a data matrix [25, 58–62] – has
revealed that classical computationally feasible algorithms such as PCA cannot always
approach the information-theoretic detection threshold [25, 61].

1.7.2 Nonlinear Representation Learning


While linear representation learning techniques tend to have low computational
complexity, they often fail to capture relevant information within complex physical
phenomena. This, coupled with a meteoric rise in computing power, has led to
widespread adoption of nonlinear representation learning in data science.
There is a very wide portfolio of nonlinear representation techniques, but one of the
most well-known classes, which has been the subject of much research during the last
two decades, postulates that (raw) data lie near a low-dimensional manifold embedded in
a higher-dimensional space. Representation learning techniques belonging to this class

9 Reliable estimation here means that the error between \hat{A} and A converges to 0 with increasing number of samples.
include local linear embedding [63], Isomap [64], kernel entropy component analysis
(ECA) [65], and nonlinear generalizations of linear techniques using the kernel trick
(e.g., kernel PCA [66], kernel ICA [67], and kernel LDA [68]). The use of information-
theoretic machinery in these methods has mostly been limited to formulations of the
algorithmic problems, as in kernel ECA and kernel ICA. While there exist some results
that characterize the regime in which manifold learning is impossible, such results
leverage the probabilistic method rather than more fundamental information-theoretic
tools [69].
Recent years have seen the data-science community widely embrace another non-
linear representation learning approach that assumes data lie near a union of subspaces
(UoS). This approach tends to have several advantages over manifold learning because of
the linearity of individual components (subspaces) in the representation learning model.
While there exist methods that learn the subspaces explicitly, one of the most popu-
lar classes of representation learning under the UoS model in which the subspaces are
implicitly learned is referred to as dictionary learning [38]. Formally, dictionary learning
assumes the data space to be Y = Rm , the feature space to be

X = {x ∈ R^p : ‖x‖_0 ≤ k}    (1.23)

with k ≪ m ≤ p, and the random generative model to be

Y = DX + W    (1.24)

with D ∈ F = {D ∈ R^{m×p} : ‖D_i‖_2 = 1, i = 1, . . . , p} representing a dictionary and W ∈ R^m representing the random noise vector. This corresponds to the random data vector Y lying near a union of \binom{p}{k} k-dimensional subspaces. Notice that, while F(·) = D is a linear operator,10 its inverse image G(·) is highly nonlinear and typically computed as

\hat{X} = G(Y) = \arg\min_{X : ‖X‖_0 ≤ k} ‖Y − DX‖_2^2,    (1.25)

with (1.25) referred to as sparse coding.
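Solving (1.25) exactly is combinatorial, so in practice one typically resorts to greedy or convex surrogates. The sketch below is our own illustration of sparse coding with a fixed, known dictionary using orthogonal matching pursuit (OMP); it is a common surrogate for (1.25), not the dictionary-learning algorithm analyzed in the references above, and all problem sizes are arbitrary.

```python
# A sketch (ours) of sparse coding via greedy orthogonal matching pursuit (OMP),
# used here as a practical surrogate for the combinatorial problem (1.25).
import numpy as np

rng = np.random.default_rng(2)
m, p, k = 20, 50, 3
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms, as required of F

true_support = rng.choice(p, size=k, replace=False)
x_true = np.zeros(p)
x_true[true_support] = rng.standard_normal(k)    # a k-sparse feature vector
y = D @ x_true + 0.01 * rng.standard_normal(m)   # y = Dx + w

def omp(D, y, k):
    """Greedily pick k atoms, refitting the coefficients by least squares at each step."""
    support, residual = [], y.copy()
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))   # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x_hat = np.zeros(D.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(D, y, k)
print("true support     :", sorted(true_support.tolist()))
print("recovered support:", sorted(np.flatnonzero(x_hat).tolist()))
```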


The last 15 years have seen the development of a number of algorithms that enable
learning of the dictionary D both in unsupervised and in supervised settings [41, 70–
72]. Sample complexity of these algorithms in terms of both infeasibility (converse) and
achievability has been a more recent effort [44, 45, 73–76]. In particular, it is established
in [44] using Fano’s inequality that the number of samples N, which are independent
realizations of the generative model (1.24), i.e.,

yi = Dxi + wi , i = 1, . . . , N, (1.26)

must scale at least as fast as N = O(m p^2 ε^{−2}) in order to ensure recovery of an estimate \hat{D} such that ‖\hat{D} − D‖_F ≤ ε. This lower bound on sample complexity, which is derived
in the minimax sense, is akin to the converse bounds in source and channel coding
in classical information theory. However, general tightness of this lower bound, which
requires analyzing explicit (deterministic) dictionary-learning algorithms and deriving

10 Strictly speaking, D restricted to X is also nonlinear.


matching achievability results, remains an open problem. Computational limits are also
in general open for dictionary learning.
Recent years have also seen extension of these results to the case of data that have a
multidimensional (tensor) structure [45]. We refer the reader to Chapter 5 in the book
for a more comprehensive review of dictionary-learning results pertaining to both vector
and tensor data.
Linear representation learning, manifold learning, and dictionary learning are all
based on a geometric viewpoint of data. It is also possible to view these representation-
learning techniques from a purely numerical linear algebra perspective. Data represen-
tations in this case are referred to as matrix factorization-based representations. The
matrix factorization perspective of representation learning allows one to expand the
classes of learning techniques by borrowing from the rich literature on linear algebra.
Non-negative matrix factorization [77], for instance, allows one to represent data that
are inherently non-negative in terms of non-negative features that can be assigned phys-
ical meanings. We refer the reader to [78] for a more comprehensive overview of matrix
factorizations in data science; [79] also provides a recent information-theoretic analysis
of non-negative matrix factorization.

1.7.3 Recent Trends in Representation Learning


Beyond the subspace and UoS models described above, another emerging approach to
learning data representations relates to the use of deep neural networks [80]. In particu-
lar, this involves designing a nonlinear transformation G : Y → X consisting of a series
of stages, with each stage encompassing a linear and a nonlinear operation, that can be
used to produce a data representation x ∈ X given a data instance y ∈ Y as follows:

x = G(y) = f_L(W_L · f_{L−1}(W_{L−1} · (· · · f_1(W_1 y + b_1) · · ·) + b_{L−1}) + b_L),    (1.27)

where W_i ∈ R^{n_i × n_{i−1}} is a weight matrix, b_i ∈ R^{n_i} is a bias vector, f_i : R^{n_i} → R^{n_i} is a nonlinear operator such as a rectified linear unit (ReLU), and L corresponds to the number
of layers in the deep neural network. The challenge then relates to how to learn the set
of weight matrices and bias vectors associated with the deep neural network. For exam-
ple, in classification problems where each data instance x is associated with a discrete
label ℓ, one typically relies on a training set (y_i, ℓ_i), i = 1, . . . , N, to define a loss function
that can be used to tune the various parameters of the network using algorithms such as
gradient descent or stochastic gradient descent [81].
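For concreteness, the following sketch (ours, not from the text) implements the forward map (1.27) in plain numpy with ReLU nonlinearities and randomly initialized weights; the layer widths are arbitrary, and learning the weights from labeled data, e.g., by stochastic gradient descent on a task loss, is not shown.

```python
# A minimal numpy sketch (ours) of the representation map (1.27): a stack of affine
# maps interleaved with ReLU nonlinearities, with illustrative random parameters.
import numpy as np

rng = np.random.default_rng(3)
widths = [10, 32, 16, 4]                       # n_0 (input dimension) through n_L
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]

def relu(z):
    return np.maximum(z, 0.0)

def representation(y, params):
    """x = f_L(W_L ... f_1(W_1 y + b_1) ... + b_L) with f_i = ReLU."""
    x = y
    for W, b in params:
        x = relu(W @ x + b)
    return x

y = rng.standard_normal(widths[0])             # a raw data instance
x = representation(y, params)
print("representation x has dimension", x.shape[0])
```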
This approach to data representation underlies some of the most spectacular advances
in areas such as computer vision, speech recognition, speech translation, natural lan-
guage processing, and many more, but this approach is also not fully understood.
However, information-theoretically oriented studies have also been recently conducted
to gain insight into the performance of deep neural networks by enabling the analysis
of the learning process or the design of new learning algorithms. For example, Tishby
et al. [82] propose an information-theoretic analysis of deep neural networks based on
the information bottleneck principle. They view the neural network learning process as
a trade-off between compression and prediction that leads up to the extraction of a set of
minimal sufficient statistics from the data in relation to the target task. Shwartz-Ziv and
Tishby [83] – building upon the work in [82] – also propose an information-bottleneck-
based analysis of deep neural networks. In particular, they study information paths in the
so-called information plane capturing the evolution of a pair of items of mutual informa-
tion over the network during the training process: one relates to the mutual information
between the ith layer output and the target data label, and the other corresponds to the
mutual information between the ith layer output and the data itself. They also demon-
strate empirically that the widely used stochastic gradient descent algorithm undergoes
a “fitting” phase – where the mutual information between the data representations and
the target data label increases – and a “compression” phase – where the mutual infor-
mation between the data representations and the data decreases. See also related works
investigating the flow of information in deep networks [84–87].
Achille and Soatto [88] also use an information-theoretic approach to understand
deep-neural-networks-based data representations. In particular, they show how deep
neural networks can lead to minimal sufficient representations with properties such as
invariance to nuisances, and provide bounds that connect the amount of information in
the weights and the amount of information in the activations to certain properties of
the activations such as invariance. They also show that a new information-bottleneck
Lagrangian involving the information between the weights of a network and the training
data can overcome various overfitting issues.
More recently, information-theoretic metrics have been used as a proxy to learn data
representations. In particular, Hjelm et al. [89] propose unsupervised learning of repre-
sentations by maximizing the mutual information between an input and the output of a
deep neural network.
In summary, this body of work suggests that information-theoretic quantities such as
mutual information can inform the analysis, design, and optimization of state-of-the-art
representation learning approaches. Chapter 11 covers some of these recent trends in
representation learning.

1.8 Information Theory and Data Analysis and Processing

The outputs of the data-representation block – the features – are often the basis for fur-
ther data analysis or processing, encompassing both statistical inference and statistical
learning tasks such as estimation, regression, classification, clustering, and many more.
Statistical inference forms the core of classical statistical signal processing and statis-
tics. Broadly speaking, it involves use of explicit stochastic data models to understand
various aspects of data samples (features). These models can be parametric, defined as
those characterized by a finite number of parameters, or non-parametric, in which the
number of parameters continuously increases with the number of data samples. There is
a large portfolio of statistical inference tasks, but we limit our discussion to the problems
of model selection, hypothesis testing, estimation, and regression.
Briefly, model selection involves the use of data features/samples to select a stochastic
data model from a set of candidate models. Hypothesis testing, on the other hand, is the
task of determining whether a certain postulated hypothesis (stochastic model) underly-


ing the data is true or false. This is referred to as binary hypothesis testing, as opposed
to multiple hypothesis testing in which the data are tested against several hypotheses.
Statistical estimation, often studied under the umbrella of inverse problems in many dis-
ciplines, is the task of inferring some parameters underlying the stochastic data model. In
contrast, regression involves estimating the relationships between different data features
that are divided into the categories of response variable(s) (also known as dependent
variables) and predictors (also known as independent variables).
Statistical learning, along with machine learning, primarily concentrates on
approaches to find structure in data. In particular, while the boundary between statistical
inference and statistical learning is not a hard one, statistical learning tends not to focus
on explicit stochastic models of data generation; rather, it often treats the data genera-
tion mechanism as a black box, and primarily concentrates on learning a “model” with
good prediction accuracy [90]. There are two major paradigms in statistical learning:
supervised learning and unsupervised learning.
In supervised learning, one wishes to determine predictive relationships between
and/or across data features. Representative supervised learning approaches include clas-
sification problems where the data features are mapped to a discrete set of values (a.k.a.
labels) and regression problems where the data features are mapped instead to a continu-
ous set of values. Supervised learning often involves two distinct phases of training and
testing. The training phase involves use of a dataset, referred to as training data, to learn
a model that finds the desired predictive relationship(s). These predictive relationships
are often implicitly known in the case of training data, and the goal is to leverage this
knowledge for learning a model during training that generalizes these relationships to
as-yet unseen data. Often, one also employs a validation dataset in concert with train-
ing data to tune possible hyperparameters associated with a statistical-learning model.
The testing phase involves use of another dataset with known characteristics, termed test
data, to estimate the learned model’s generalization capabilities. The error incurred by
the learned model on training and test data is referred to as training error and testing
error, respectively, while the error that the model will incur on future unseen data can be
captured by the so-called generalization error. One of the biggest challenges in super-
vised learning is understanding the generalization error of a statistical learning model as
a function of the number of data samples in training data.
In unsupervised learning, one wishes instead to determine the underlying structure
within the data. Representative unsupervised learning approaches include density esti-
mation, where the objective is to determine the underlying data distribution given a set
of data samples, and clustering, where the aim is to organize the data points into differ-
ent groups so that points belonging to the same group exhibit some degree of similarity
and points belonging to different groups are distinct.
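As a minimal sketch of the clustering task, the following code runs a from-scratch k-means iteration (a standard clustering heuristic, not a method prescribed in this chapter) on two synthetic Gaussian blobs; the data and the choice of k = 2 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs: the kind of group structure clustering should recover.
data = np.vstack([rng.standard_normal((100, 2)) + [3.0, 3.0],
                  rng.standard_normal((100, 2)) - [3.0, 3.0]])

def kmeans(x, k, iters=50):
    centers = x[rng.choice(len(x), size=k, replace=False)]   # initialize at k random points
    for _ in range(iters):
        # Assign each point to its nearest center, then move each center to its group mean.
        labels = np.argmin(((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

labels, centers = kmeans(data, k=2)
print("estimated cluster centers:\n", centers)
```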
Challenges arising in statistical inference and learning also involve analyzing, design-
ing and optimizing inference and learning algorithms, and understanding statistical
and computational limits in inference and learning tasks. We next provide a small
sampling of statistical inference and statistical learning results that involve the use of
information-theoretic tools and methods.

1.8.1 Statistical Inference


We now survey some representative results arising in model selection, estimation,
regression, and hypothesis-testing problems that benefit from information-theoretic
methods. We also offer an example associated with community detection and recov-
ery on graphs where information-theoretic and related tools can be used to determine
statistical and computational limits.

Model Selection
On the algorithmic front, the problem of model selection has been largely impacted by
information-theoretic tools. Given a data set, which statistical model “best” describes
the data? A huge array of work, dating back to the 1970s, has tackled this question
using various information-theoretic principles. The Akaike information criterion (AIC)
for model selection [91], for instance, uses the KL divergence as the main tool for deriva-
tion of the final criterion. The minimum description length (MDL) principle for model
selection [92], on the other hand, makes a connection between source coding and model
selection and seeks a model that best compresses the data. The AIC and MDL prin-
ciples are just two of a number of information-theoretically inspired model-selection
approaches; we refer the interested reader to Chapter 12 for further discussion.
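As a small illustration of these criteria, the sketch below scores polynomial models of increasing order using the Gaussian-noise form of AIC and a two-part, BIC-like approximation of MDL; the data-generating model and these particular penalty forms are illustrative simplifications rather than the full criteria of [91, 92].

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-1, 1, n)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.2 * rng.standard_normal(n)   # true model has k = 3 parameters

def rss(k):
    coeffs = np.polyfit(x, y, deg=k - 1)        # least-squares fit with k parameters
    return np.sum((y - np.polyval(coeffs, x)) ** 2)

def aic(k):
    return n * np.log(rss(k) / n) + 2 * k       # Gaussian-noise AIC, up to an additive constant

def mdl(k):
    return n * np.log(rss(k) / n) + k * np.log(n)   # two-part (BIC-like) MDL approximation

candidates = range(1, 8)
print("AIC selects", min(candidates, key=aic), "parameters")
print("MDL selects", min(candidates, key=mdl), "parameters")
```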

Estimation and Regression


Over the years, various information-theoretic tools have significantly advanced our
understanding of the interconnected problems of estimation and regression. In statistical
inference, a typical estimation/regression problem involving a scalar random variable
Y ∈ R takes the form
Y = f (X; β) + W, (1.28)
where the random vector X ∈ R p is referred to as a covariate in statistics and measure-
ment vector in signal processing, β ∈ R p denotes the unknown parameter vector, termed
regression parameters and signal in statistics and signal processing, respectively, and
W represents observation noise/modeling error.11 Both estimation and regression prob-
lems in statistical inference concern themselves with recovering β from N realizations
$\{(y_i, x_i)\}_{i=1}^{N}$ from the model (1.28) under an assumed f (·; ·). In estimation, one is inter-
ested in recovering a $\hat{\beta}$ that is as close to the true β as possible; in regression, on the
other hand, one is concerned with prediction, i.e., how close $f(X; \hat{\beta})$ is to f (X; β) for the
random vector X. Many modern setups in estimation/regression problems correspond to
the high-dimensional setting in which N ≪ p. Such setups often lead to seemingly ill-
posed mathematical problems, resulting in the following important question: How small
can the estimation and/or regression errors be as a function of N, p, and properties of
the covariates and parameters?
11 The assumption is that the raw data have been transformed into their features, which correspond to X.

Information-theoretic methods have been used in a variety of ways to address this
question for a number of estimation/regression problems. The most well known of these
results are for the generalized linear model (GLM), where the realizations of (Y, X, W)
are given by
$$y_i = x_i^T \beta + w_i \;\Longrightarrow\; y = \tilde{X}\beta + w, \qquad (1.29)$$

with $y \in \mathbb{R}^N$, $\tilde{X} \in \mathbb{R}^{N \times p}$, and $w \in \mathbb{R}^N$ denoting concatenations of $y_i$, $x_i^T$, and $w_i$, respec-
tively. Fano’s inequality has been used to derive lower bounds on the errors in GLMs
under various assumptions on the matrix $\tilde{X}$ and β [93–95]. Much of this work has been
limited to the case of sparse β, in which it is assumed that no more than a few (say,
s ≪ N) regression parameters are non-zero [93, 94]. The work by Raskutti et al. [95]
extends many of these results to β that is not strictly sparse. This work focuses on
approximately sparse regression parameters, defined as lying within an $\ell_q$ ball, q ∈ [0, 1],
of radius $R_q$, as follows:

$$\mathcal{B}_q(R_q) = \bigg\{ \beta : \sum_{i=1}^{p} |\beta_i|^q \leq R_q \bigg\}, \qquad (1.30)$$

and provides matching minimax lower and upper bounds (i.e., the optimal minimax rate)
both for the estimation error, $\|\hat{\beta} - \beta\|_2^2$, and for the prediction error, $(1/N)\|\tilde{X}(\hat{\beta} - \beta)\|_2^2$.
In particular, it is established that, under suitable assumptions on $\tilde{X}$, it is possible to
achieve estimation and prediction errors in GLMs that scale as $R_q \big( \log p / N \big)^{1-q/2}$. The
achieve estimation and prediction errors in GLMs that scale as Rq log p/N 1−q/2 . The
corresponding result for exact sparsity can be derived by setting q = 0 and Rq = s. Fur-
ther, there exist no algorithms, regardless of their computational complexity, that can
achieve errors smaller than this rate for every β in an q ball. As one might expect,
Fano’s inequality is the central tool used by Raskutti et al. [95] to derive this lower
bound (the “converse”). The achievability result requires direct analysis of algorithms,
as opposed to use of the probabilistic method in classical information theory. Since both
the converse and the achievability bounds coincide in regression and estimation under
the GLM, we end up with a sharp statistical phase transition. Chapters 6, 7, 8, and
16 elaborate further on various other recovery and estimation problems arising in data
science, along with key tools that can be used to gain insight into such problems.
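The sketch below (which assumes scikit-learn is available) illustrates this scaling in the exactly sparse case q = 0: the Lasso is run with a regularization level of order σ√(log p/N) on a Gaussian design, and the squared estimation error is compared with the s σ² log(p)/N benchmark for two sample sizes. Constants are not tuned, so only the decay with N should be read from the output.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s, sigma = 1000, 10, 0.5
beta = np.zeros(p)
beta[:s] = rng.choice([-1.0, 1.0], size=s)        # exactly s non-zero regression parameters

for N in [200, 800]:
    X = rng.standard_normal((N, p))
    y = X @ beta + sigma * rng.standard_normal(N)
    lam = 2 * sigma * np.sqrt(2 * np.log(p) / N)  # theory-flavored choice; constants not tuned
    beta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
    err = np.sum((beta_hat - beta) ** 2)
    benchmark = sigma**2 * s * np.log(p) / N      # s * log(p) / N scaling, constants omitted
    print(f"N = {N}: squared estimation error {err:.3f}, benchmark scaling {benchmark:.3f}")
```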
Additional information-theoretic results are known for the standard linear model –

where Y = √s Xβ + W, with Y ∈ Rn , X ∈ Rn×p , β ∈ R p , W ∈ Rn ∼ N(0, I), and s a scaling
factor representing a signal-to-noise ratio. In particular, subject to mild conditions on the
distribution of the parameter vector, it has been established that the mutual information
and the minimum mean-squared error obey the so-called I-MMSE relationship given
by [96]:
$$\frac{d\, I\big(\beta;\, \sqrt{s}\, X\beta + W\big)}{ds} = \frac{1}{2} \cdot \mathrm{mmse}\big(X\beta \,\big|\, \sqrt{s}\, X\beta + W\big), \qquad (1.31)$$

where $I\big(\beta;\, \sqrt{s}\, X\beta + W\big)$ corresponds to the mutual information between the standard
linear model input and output and

$$\mathrm{mmse}\big(X\beta \,\big|\, \sqrt{s}\, X\beta + W\big) = \mathbb{E}\Big[ \big\| X\beta - \mathbb{E}\big[X\beta \,\big|\, \sqrt{s}\, X\beta + W\big] \big\|_2^2 \Big] \qquad (1.32)$$


is the minimum mean-squared error associated with the estimation of Xβ given √s Xβ +
W. Other relations involving information-theoretic quantities, such as mutual infor-
mation, and estimation-theoretic ones have also been established in a wide variety of
settings in recent years, such as Poisson models [97]. These relations have been shown to
have important implications in classical information-theoretic problems – notably in the
analysis and design of communications systems (e.g., [98–101]) – and, more recently, in
data-science ones. In particular, Chapter 7 elaborates further on how the I-MMSE rela-
tionship can be used to gain insight into modern high-dimensional inference problems.
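A quick numerical check of (1.31) in its simplest special case, a scalar Gaussian input over the channel Y = √s X + W with X, W ∼ N(0, 1), is sketched below; there I(X; Y) = ½ ln(1 + s) and the conditional-mean estimator is (√s/(1 + s))Y, both standard closed forms, and the Monte Carlo sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_info(s):
    # Scalar Gaussian channel Y = sqrt(s) X + W with X, W ~ N(0, 1): I(X; Y) = 0.5 ln(1 + s) nats.
    return 0.5 * np.log1p(s)

def mmse_monte_carlo(s, n=10**6):
    x = rng.standard_normal(n)
    y = np.sqrt(s) * x + rng.standard_normal(n)
    x_hat = np.sqrt(s) / (1 + s) * y      # conditional-mean estimator for the Gaussian prior
    return np.mean((x - x_hat) ** 2)

for s in [0.5, 1.0, 4.0]:
    ds = 1e-4
    dI_ds = (mutual_info(s + ds) - mutual_info(s - ds)) / (2 * ds)
    print(f"s = {s}: dI/ds = {dI_ds:.4f}, 0.5 * mmse = {0.5 * mmse_monte_carlo(s):.4f}")
```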

Hypothesis Testing
Information-theoretic tools have also been advancing our understanding of hypothesis-
testing problems (one of the most widely used statistical inference techniques). In
general, we can distinguish between binary hypothesis-testing problems, where the data
are tested against two hypotheses often known as the null and the alternate hypotheses,
and multiple-hypothesis-testing problems in which the data are tested against multiple
hypotheses. We can also distinguish between Bayesian approaches to hypothesis test-
ing, where one specifies a prior probability associated with each of the hypotheses, and
non-Bayesian ones, in which one does not specify a priori any prior probability.
Formally, a classical formulation of the binary hypothesis-testing problem involves
testing whether a number of i.i.d. data samples (features) x1 , x2 , . . . , xN of a random
variable X ∈ X, X ∼ pX , conform to one or the other of the hypotheses H0 : pX = p0 and
H1 : pX = p1 , where under the first hypothesis one postulates that the data are generated
i.i.d. according to model (distribution) p0 and under the second hypothesis one assumes
the data are generated i.i.d. according to model (distribution) p1 . A binary hypothesis
test T : X × · · · × X → {H0 , H1 } is a mapping that outputs an estimate of the hypothesis
given the data samples.
In non-Bayesian settings, the performance of such a binary hypothesis test can be
described by two error probabilities. The type-I error probability, which relates to the
rejection of a true null hypothesis, is given by
   
$$P_{e|0}(T) = P\big( T(X_1, X_2, \ldots, X_N) = H_1 \,\big|\, H_0 \big) \qquad (1.33)$$

and the type-II error probability, which relates to the failure to reject a false null
hypothesis, is given by
   
$$P_{e|1}(T) = P\big( T(X_1, X_2, \ldots, X_N) = H_0 \,\big|\, H_1 \big). \qquad (1.34)$$

In this class of problems, one is typically interested in minimizing one of the error
probabilities subject to a constraint on the other error probability as follows:

$$P_e(\alpha) = \min_{T \,:\, P_{e|0}(T) \leq \alpha} P_{e|1}(T), \qquad (1.35)$$

where the minimum can be achieved using the well-known Neyman–Pearson test [102].

Information-theoretic tools – such as typicality [7] – have long been used to analyze
the performance of this class of problems. For example, the classical Stein lemma asserts
that asymptotically with the number of data samples approaching infinity [7]

$$\lim_{\alpha \to 0} \lim_{N \to \infty} \frac{1}{N} \log P_e(\alpha) = -D(p_0 \| p_1), \qquad (1.36)$$

where D(·||·) is the Kullback–Leibler divergence between the two distributions.
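The sketch below illustrates the Neyman–Pearson test and the Stein exponent for an arbitrary Bernoulli example (p0 = 0.5 versus p1 = 0.7): the threshold on the number of ones is chosen to satisfy the type-I constraint α, the type-II error probability is computed exactly from the binomial distribution, and its exponent is compared with D(p0‖p1).

```python
import numpy as np
from scipy.stats import binom

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p0, p1, alpha = 0.5, 0.7, 0.05
print("D(p0||p1) =", kl_bernoulli(p0, p1))

for N in [100, 500, 2000]:
    # Neyman-Pearson test: decide H1 when the number of ones k reaches a threshold t,
    # where t is the smallest integer with type-I error P(K >= t | H0) <= alpha.
    t = int(N * p0)
    while binom.sf(t - 1, N, p0) > alpha:   # sf(t - 1) = P(K >= t) under H0
        t += 1
    beta = binom.cdf(t - 1, N, p1)          # type-II error P(K < t) under H1 (computed exactly)
    print(f"N = {N}: -log(beta)/N = {-np.log(beta) / N:.4f}")
```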


In Bayesian settings, the performance of a hypothesis-testing problem can be
described by the average error probability given by
   
$$P_e(T) = P(H_0) \cdot P\big( T(X_1, X_2, \ldots, X_N) = H_1 \,\big|\, H_0 \big) + P(H_1) \cdot P\big( T(X_1, X_2, \ldots, X_N) = H_0 \,\big|\, H_1 \big), \qquad (1.37)$$
where P(H0 ) and P(H1 ) relate to the prior probabilities ascribed to each of the hypoth-
eses. It is well known that the maximum a posteriori test (or maximum a posteriori
decision rule) minimizes this average error probability [102].
Information-theoretic tools have been used to analyze the performance of Bayesian
hypothesis-testing problems too. For example, consider a simple M-ary Bayesian
hypothesis-testing problem involving M possible hypotheses, which are modeled by a
random variable C drawn according to some prior distribution pC , and the data are mod-
eled by a random variable X drawn according to the distribution pX|C . In particular, since
it is often difficult to characterize in closed form the minimum average error probability
associated with the optimal maximum a posteriori test, information-theoretic measures
can be used to upper- or lower-bound this quantity. A lower bound on the minimum
average error probability – derived from Fano’s inequality – is given by

H(C|X)
Pe,min = min Pe (T ) ≥ 1 − . (1.38)
T log2 (M − 1)

An upper bound on the minimum average error probability is [103]

$$P_{e,\min} = \min_{T} P_e(T) \leq 1 - \exp\big(-H(C|X)\big). \qquad (1.39)$$

A number of other bounds on the minimum average error probability involving Shannon
information measures, Rényi information measures, or other generalizations have also
been devised over the years [104–106] and have led to stronger converse results not only
in classical information-theory problems but also in data-science ones [107].
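The bounds (1.38) and (1.39) are straightforward to evaluate numerically. The sketch below does so for an arbitrary toy joint distribution of (C, X) and compares them with the exact error probability of the maximum a posteriori test; note that (1.38) is evaluated with entropy measured in bits and (1.39) with entropy measured in nats.

```python
import numpy as np

# Arbitrary toy joint pmf p(c, x) over M = 3 hypotheses (rows) and 4 observation values (columns).
P = np.array([[0.10, 0.05, 0.02, 0.03],
              [0.02, 0.20, 0.05, 0.03],
              [0.03, 0.05, 0.30, 0.12]])
M = P.shape[0]
px = P.sum(axis=0)                        # marginal of X
p_c_given_x = P / px                      # each column is p(c | x)

# Exact minimum average error probability of the MAP test: 1 - E[ max_c p(c | X) ].
pe_map = 1.0 - np.sum(px * p_c_given_x.max(axis=0))

H_bits = -np.sum(P * np.log2(p_c_given_x))   # conditional entropy H(C | X) in bits
H_nats = H_bits * np.log(2.0)

lower = (H_bits - 1.0) / np.log2(M - 1)      # Eq. (1.38); can be negative, i.e., vacuous
upper = 1.0 - np.exp(-H_nats)                # Eq. (1.39)
print(f"MAP error {pe_map:.3f}, lower bound {lower:.3f}, upper bound {upper:.3f}")
```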

Example: Community Detection and Estimation on Graphs


We now briefly offer examples of hypothesis testing and estimation problems arising in
modern data analysis that exhibit sharp statistical and computational phase transitions
which can be revealed using emerging information-theoretic methods.
To add some context, in modern data analysis it is increasingly common for datasets to
consist of various items exhibiting complex relationships among them, such as pair-wise

or multi-way interactions between items. Such datasets can therefore be represented by


a graph or a network of interacting items where the network vertices denote different
items and the network edges denote pair-wise interactions between the items.12 Our
example – involving a concrete challenge arising in the analysis of such networks of
interacting items – relates to the detection and recovery of community structures within
the graph. A community consists of a subset of vertices within the graph that are densely
connected to one another but sparsely connected to other vertices within the graph [108].
Concretely, consider a simple instance of such problems where one wishes to dis-
cern whether the underlying graph is random or whether it contains a dense subgraph (a
community). Mathematically, we can proceed by considering two objects: (1) an Erdős–
Rényi random graph model G(N, q) consisting of N vertices where each pair of vertices
is connected independently with probability q; and (2) a planted dense subgraph model
G(N, K, p, q) with N vertices where each vertex is assigned to a random set S with prob-
ability K/N (K ≤ N) and each pair of vertices are connected with probability p if both
of them are in the set S and with probability q otherwise (p > q). We can then proceed
by constructing a hypothesis-testing problem where under one hypothesis one postu-
lates that the observed graph is drawn from G(N, q) and under the other hypothesis one
postulates instead that the observed graph is drawn from G(N, K, p,q). It can then be
 
established in the asymptotic regime $p = cq = O(N^{-\alpha})$, $K = O(N^{\beta})$, N → ∞, that (a)
one can detect the community with arbitrarily low error probability with simple linear-
time algorithms when β > 1/2 + α/4; (b) one can detect the community with arbitrarily low
error probability only with non-polynomial-time algorithms when α < β < 1/2 + α/4; and
(c) there is no test – irrespective of its complexity – that can detect the community with
arbitrarily low error probability when β < min{α, 1/2 + α/4} [109]. It has also been estab-
lished that the recovery of the community exhibits identical statistical and computational
limits.
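A minimal simulation of regime (a) is sketched below. One simple linear-time test is to compare the total number of edges in the observed graph with a threshold calibrated under the null model, so only that statistic is simulated; the community size is fixed at exactly K for simplicity, and the parameters (which satisfy β > 1/2 + α/4) are otherwise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def edge_count_null(N, q):
    # Total edge count of G(N, q); only this sufficient statistic is simulated.
    return rng.binomial(N * (N - 1) // 2, q)

def edge_count_planted(N, K, p, q):
    # Planted dense subgraph with a fixed community of size K (a simplification of G(N, K, p, q)).
    in_pairs = K * (K - 1) // 2
    out_pairs = N * (N - 1) // 2 - in_pairs
    return rng.binomial(in_pairs, p) + rng.binomial(out_pairs, q)

# q = N^(-alpha) with alpha ~ 0.56 and K = N^beta with beta ~ 0.72 > 1/2 + alpha/4 ~ 0.64: regime (a).
N, q, p, K = 4000, 0.01, 0.05, 400
mean0 = N * (N - 1) / 2 * q
std0 = np.sqrt(N * (N - 1) / 2 * q * (1 - q))
threshold = mean0 + 3 * std0                      # calibrated under the null model

null_stats = np.array([edge_count_null(N, q) for _ in range(200)])
alt_stats = np.array([edge_count_planted(N, K, p, q) for _ in range(200)])
print("false-alarm rate:", np.mean(null_stats > threshold))
print("detection rate  :", np.mean(alt_stats > threshold))
```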
This problem in fact falls under a much wider problem class arising in modern data
analysis, involving the detection or recovery of structures planted in random objects such
as graphs, matrices, or tensors. The characterization of statistical limits in detection or
recovery of such structures can typically be done by leveraging various tools: (1) statis-
tical tools such as the first- and the second-moment methods; (2) information-theoretic
methods such as mutual information and rate distortion; and (3) statistical physics-based
tools such as the interpolation method. In contrast, the characterization of computational
limits associated with these statistical problems often involves finding an approximate
randomized polynomial-time reduction, mapping certain graph-theoretic problems such
as the planted-clique problem approximately to the statistical problem under considera-
tion, in order to show that the statistical problem is at least as hard as the planted-clique
problem. Chapter 13 provides a comprehensive overview of emerging methods –
including information-theoretic ones – used to establish both statistical and computa-
tional limits in modern data analysis.

12 Some datasets can also be represented by hyper-graphs of interacting items, where vertices denote the
different objects and hyper-edges denote multi-way interactions between the different objects.

1.8.2 Statistical Learning


We now survey emerging results in statistical learning that are benefiting from
information-theoretic methods.

Supervised Learning
In the supervised learning setup, one desires to learn a hypothesis based on a set of data
examples that can be used to make predictions given new data [90]. In particular, in order
to formalize the problem, let X be the domain set, Y be the label set, Z = X × Y be the
examples domain, μ be a distribution on Z, and W a hypothesis class (i.e., W = {W} is
a set of hypotheses W : X → Y). Let also S = z1 , . . . , zN = (x1 , y1 ), . . . , (xN , yN ) ∈ ZN
be the training set – consisting of a number of data points and their associated labels –
drawn i.i.d. from Z according to μ. A learning algorithm is a Markov kernel that maps
the training set S to an element W of the hypothesis class W according to the probability
law pW|S .
A key challenge relates to understanding the generalization ability of the learning
algorithm, where the generalization error corresponds to the difference between the
expected (or true) error and the training (or empirical) error. In particular, by consider-
ing a non-negative loss function L : W × Z → R+ , one can define the expected error and
the training error associated with a hypothesis W as follows:

$$\mathrm{loss}_{\mu}(W) = \mathbb{E}\{L(W, Z)\} \qquad \text{and} \qquad \mathrm{loss}_{S}(W) = \frac{1}{N}\sum_{i=1}^{N} L(W, z_i),$$

respectively. The generalization error is given by


gen(μ, W) = lossμ (W) − lossS (W)
and its expected value is given by

gen(μ, pW|S ) = E{lossμ (W) − lossS (W)},

where the expectation is with respect to the joint distribution of the algorithm input (the
training set) and the algorithm output (the hypothesis).
A number of approaches have been developed throughout the years to characterize
the generalization error of a learning algorithm, relying on either certain complexity
measures of the hypothesis space or certain properties of the learning algorithm. These
include VC-based bounds [110], algorithmic stability-based bounds [111], algorithmic
robustness-based bounds [112], PAC-Bayesian bounds [113], and many more. However,
many of these generalization error bounds cannot explain the generalization abilities of a
variety of machine-learning methods for various reasons: (1) some of the bounds depend
only on the hypothesis class and not on the learning algorithm, (2) existing bounds do
not easily exploit dependences between different hypotheses, and (3) existing bounds
also do not exploit dependences between the learning algorithm input and output.
More recently, approaches leveraging information-theoretic tools have been emerging
to characterize the generalization ability of various learning methods. Such approaches

often express the generalization error in terms of certain information measures between
the algorithm input (the training dataset) and the algorithm output (the hypothesis),
thereby incorporating the various ingredients associated with the learning problem,
including the dataset distribution, the hypothesis space, and the learning algorithm itself.
In particular, inspired by [114], Xu and Raginsky [115] derive an upper bound on the
generalization error, applicable to σ-sub-Gaussian loss functions, given by
$$\big|\mathrm{gen}(\mu, p_{W|S})\big| \leq \sqrt{\frac{2\sigma^2}{N} \cdot I(S; W)},$$
where I(S; W) corresponds to the mutual information between the input – the dataset –
and the output – the hypothesis – of the algorithm. This bound supports the intuition that
the less information the output of the algorithm contains about the input to the algorithm
the less it will overfit, providing a means to strike a balance between the ability to fit
data and the ability to generalize to new data by controlling the algorithm’s input–output
mutual information. Raginsky et al. [116] also propose similar upper bounds on the
generalization error based on several information-theoretic measures of algorithmic sta-
bility, capturing the idea that the output of a stable learning algorithm cannot depend “too
much” on any particular training example. Other generalization error bounds involving
information-theoretic quantities appear in [117, 118]. In particular, Asadi et al. [118]
combine chaining and mutual information methods to derive generalization error bounds
that significantly outperform existing ones.
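A classical sanity check of the bound of [115] is the problem of selecting the empirically best of M hypotheses whose losses are i.i.d. fair coin flips: every hypothesis has true risk 1/2, so the entire expected training-to-true-risk gap is due to overfitting, and I(S; W) can be bounded crudely by log M because the output W takes at most M values. The sketch below compares a Monte Carlo estimate of this gap with the resulting bound, taking σ = 1/2 since the losses lie in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, trials = 100, 50, 2000
sigma = 0.5                                    # losses in [0, 1] are 1/2-sub-Gaussian (Hoeffding)

gaps = []
for _ in range(trials):
    losses = rng.integers(0, 2, size=(N, M))   # column j: 0/1 losses of hypothesis j, true risk 1/2
    emp = losses.mean(axis=0)
    w = int(np.argmin(emp))                    # empirical risk minimization over the M hypotheses
    gaps.append(0.5 - emp[w])                  # true risk minus training risk of the chosen hypothesis

empirical_gap = float(np.mean(gaps))
bound = np.sqrt(2 * sigma**2 * np.log(M) / N)  # Xu-Raginsky bound with I(S; W) <= log M (nats)
print(f"estimated expected generalization gap {empirical_gap:.3f} <= bound {bound:.3f}")
```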
Of particular relevance, these information-theoretically based generalization error
bounds have also been used to gain further insight into machine-learning models and
algorithms. For example, Pensia et al. [119] build upon the work by Xu and Ragin-
sky [115] to derive very general generalization error bounds for a broad class of iterative
algorithms that are characterized by bounded, noisy updates with Markovian structure,
including stochastic gradient Langevin dynamics (SGLD) and variants of the stochas-
tic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. This work demonstrates
that mutual information is a very effective tool for bounding the generalization error of a
large class of iterative empirical risk minimization (ERM) algorithms. Zhang et al. [120],
on the other hand, build upon the work by Xu and Raginsky [115] to study the expected
generalization error of deep neural networks, and offer a bound that shows that the error
decreases exponentially to zero as the number of convolutional and pooling layers in
the network increases. Other works that study the generalization ability of deep networks based
on information-theoretic considerations and measures include [121, 122]. Chapters 10
and 11 scope these directions in supervised learning problems.

Unsupervised Learning
In unsupervised learning setups, one desires instead to understand the structure asso-
ciated with a set of data examples. In particular, multivariate information-theoretic
functionals such as partition information, minimum partition information, and multi-
information have been recently used in the formulation of unsupervised clustering
problems [123, 124]. Chapter 9 elaborates further on such approaches to unsupervised
learning problems.

1.8.3 Distributed Inference and Learning


Finally, we add that there has also been considerable interest in the generalization of the
classical statistical inference and learning problems overviewed here to the distributed
setting, where a statistician/learner has access only to data distributed across various
terminals via a series of limited-capacity channels. In particular, much progress has been
made in distributed-estimation [125–127], hypothesis-testing [128–138], learning [139,
140], and function-computation [141] problems in recent years. Chapters 14 and 15
elaborate further on how information theory is advancing the state-of-the-art for this
class of problems.

1.9 Discussion and Conclusion

This chapter overviewed the classical information-theoretic problems of data com-


pression and communication, questions arising within the context of these problems,
and classical information-theoretic tools used to illuminate fundamental architectures,
schemes, and limits for data compression and communication.
We then discussed how information-theoretic methods are currently advancing the
frontier of data science by unveiling new data-processing architectures, data-processing
limits, and algorithms. In particular, we scoped out how information theory is leading to
a new understanding of data-acquisition architectures, and provided an overview of how
information-theoretic methods have been uncovering limits and algorithms for linear and
nonlinear representation learning problems, including deep learning. Finally, we also
overviewed how information-theoretic tools have been contributing to our understanding
of limits and algorithms for statistical inference and learning problems.
Beyond the typical data-acquisition, data-representation, and data-analysis tasks cov-
ered throughout this introduction, there are also various other emerging challenges in
data science that are benefiting from information-theoretic techniques. For example, pri-
vacy is becoming a very relevant research area in data science in view of the fact that
data analysis can not only reveal useful insights but can also potentially disclose sensitive
information about individuals. Differential privacy is an inherently information-theoretic
framework that can be used as the basis for the development of data-release mechanisms
that control the amount of private information leaked to a data analyst while retain-
ing some degree of utility [142]. Other information-theoretic frameworks have also
been used to develop data pre-processing mechanisms that strike a balance between
the amount of useful information and private information leaked to a data analyst
(e.g., [143]).
Fairness is likewise becoming a very relevant area in data science because data anal-
ysis can also potentially exacerbate biases in decision-making, such as discriminatory
treatments of individuals according to their membership of a legally protected group
such as race or gender. Such biases may arise when protected variables (or correlated
ones) are used explicitly in the decision-making. Biases also arise when learning algo-
rithms that inherit biases present in training sets are then used in decision-making.

Recent works have concentrated on the development of information-theoretically based


data pre-processing schemes that aim simultaneously to control discrimination and
preserve utility (e.g., [144]).
Overall, we anticipate that information-theoretic methods will play an increasingly
important role in our understanding of data science in upcoming years, including in
shaping data-processing architectures, in revealing fundamental data-processing limits,
and in the analysis, design, and optimization of new data-processing algorithms.

Acknowledgments

The work of Miguel R. D. Rodrigues and Yonina C. Eldar was supported in part by the
Royal Society under award IE160348. The work of Stark C. Draper was supported in
part by a Discovery Research Grant from the Natural Sciences and Engineering Research
Council of Canada (NSERC). The work of Waheed U. Bajwa was supported in part by
the National Science Foundation under award CCF-1453073 and by the Army Research
Office under award W911NF-17-1-0546.

References

[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J.,


vol. 27, nos. 3–4, pp. 379–423, 623–656, 1948.
[2] R. G. Gallager, Information theory and reliable communication. Wiley, 1968.
[3] T. Berger, Rate distortion theory: A mathematical basis for data compression. Prentice-
Hall, 1971.
[4] I. Csiszár and J. Körner, Information theory: Coding theorems for discrete memoryless
systems. Cambridge University Press, 2011.
[5] A. Gersho and R. M. Gray, Vector quantization and signal compression. Kluwer
Academic Publishers, 1991.
[6] D. J. C. MacKay, Information theory, inference and learning algorithms. Cambridge
University Press, 2003.
[7] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2006.
[8] R. W. Yeung, Information theory and network coding. Springer, 2008.
[9] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge University Press,
2011.
[10] E. Arikan, “Some remarks on the nature of the cutoff rate,” in Proc. Workshop Information
Theory and Applications (ITA ’06), 2006.
[11] R. E. Blahut, Theory and practice of error control codes. Addison-Wesley Publishing
Company, 1983.
[12] S. Lin and D. J. Costello, Error control coding. Pearson, 2005.
[13] R. M. Roth, Introduction to coding theory. Cambridge University Press, 2006.
[14] T. Richardson and R. Urbanke, Modern coding theory. Cambridge University Press, 2008.
[15] W. E. Ryan and S. Lin, Channel codes: Classical and modern. Cambridge University
Press, 2009.

[16] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for
symmetric binary-input memoryless channels,” IEEE Trans. Information Theory, vol. 55,
no. 7, pp. 3051–3073, 2009.
[17] A. Jiménez-Feltström and K. S. Zigangirov, “Time-varying periodic convolutional codes
with low-density parity-check matrix,” IEEE Trans. Information Theory, vol. 45, no. 2,
pp. 2181–2191, 1999.
[18] M. Lentmaier, A. Sridharan, D. J. J. Costello, and K. S. Zigangirov, “Iterative decod-
ing threshold analysis for LDPC convolutional codes,” IEEE Trans. Information Theory,
vol. 56, no. 10, pp. 5274–5289, 2010.
[19] S. Kudekar, T. J. Richardson, and R. L. Urbanke, “Threshold saturation via spatial cou-
pling: Why convolutional LDPC ensembles perform so well over the BEC,” IEEE Trans.
Information Theory, vol. 57, no. 2, pp. 803–834, 2011.
[20] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal
Processing Mag., vol. 25, no. 2, pp. 21–30, 2008.
[21] H. Q. Ngo and D.-Z. Du, “A survey on combinatorial group testing algorithms with appli-
cations to DNA library screening,” Discrete Math. Problems with Medical Appl., vol. 55,
pp. 171–182, 2000.
[22] G. K. Atia and V. Saligrama, “Boolean compressed sensing and noisy group testing,”
IEEE Trans. Information Theory, vol. 58, no. 3, pp. 1880–1901, 2012.
[23] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional
geometry, with implications for modern data analysis and signal processing,” Phil. Trans.
Roy. Soc. A: Math., Phys. Engineering Sci., pp. 4273–4293, 2009.
[24] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp, “Living on the edge: Phase
transitions in convex programs with random data,” Information and Inference, vol. 3,
no. 3, pp. 224–294, 2014.
[25] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu, “Information-theoretic bounds
and phase transitions in clustering, sparse PCA, and submatrix localization,” IEEE Trans.
Information Theory, vol. 64, no. 7, pp. 4872–4894, 2018.
[26] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyansky, “Determining
computational complexity from characteristic ‘phase transitions’,” Nature, vol. 400, no.
6740, pp. 133–137, 1999.
[27] G. Zeng and Y. Lu, “Survey on computational complexity with phase transitions and
extremal optimization,” in Proc. 48th IEEE Conf. Decision and Control (CDC ’09), 2009,
pp. 4352–4359.
[28] Y. C. Eldar, Sampling theory: Beyond bandlimited systems. Cambridge University Press,
2014.
[29] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[30] A. Kipnis, A. J. Goldsmith, Y. C. Eldar, and T. Weissman, “Distortion-rate function of
sub-Nyquist sampled Gaussian sources,” IEEE Trans. Information Theory, vol. 62, no. 1,
pp. 401–429, 2016.
[31] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Analog-to-digital compression: A new
paradigm for converting signals to bits,” IEEE Signal Processing Mag., vol. 35, no. 3,
pp. 16–39, 2018.
[32] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Fundamental distortion limits of analog-
to-digital compression,” IEEE Trans. Information Theory, vol. 64, no. 9, pp. 6013–6033,
2018.

[33] M. R. D. Rodrigues, N. Deligiannis, L. Lai, and Y. C. Eldar, “Rate-distortion trade-offs in


acquisition of signal parameters,” in Proc. IEEE International Conference or Acoustics,
Speech, and Signal Processing (ICASSP ’17), 2017.
[34] N. Shlezinger, Y. C. Eldar, and M. R. D. Rodrigues, “Hardware-limited task-based
quantization,” submitted to IEEE Trans. Signal Processing, accepted 2019.
[35] N. Shlezinger, Y. C. Eldar, and M. R. D. Rodrigues, “Asymptotic task-based quantiza-
tion with application to massive MIMO,” submitted to IEEE Trans. Signal Processing,
accepted 2019.
[36] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Machine
Learning, vol. 73, no. 3, pp. 243–272, 2008.
[37] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised
feature learning,” in Proc. 14th International Conference on Artificial Intelligence and
Statistics (AISTATS ’11), 2011, pp. 215–223.
[38] I. Tosic and P. Frossard, “Dictionary learning,” IEEE Signal Processing Mag., vol. 28,
no. 2, pp. 27–38, 2011.
[39] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE Trans. Pattern Analysis Machine Intelligence, vol. 35, no. 8, pp.
1798–1828, 2013.
[40] S. Yu, K. Yu, V. Tresp, H.-P. Kriegel, and M. Wu, “Supervised probabilistic principal com-
ponent analysis,” in Proc. 12th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD ’06), 2006, pp. 464–473.
[41] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learn-
ing,” in Proc. Advances in Neural Information Processing Systems (NeurIPS ’09), 2009,
pp. 1033–1040.
[42] V. Vu and J. Lei, “Minimax rates of estimation for sparse PCA in high dimen-
sions,” in Proc. 15th International Conference on Artificial Intelligence and Statistics
(AISTATS ’12), 2012, pp. 1278–1286.
[43] T. T. Cai, Z. Ma, and Y. Wu, “Sparse PCA: Optimal rates and adaptive estimation,” Annals
Statist., vol. 41, no. 6, pp. 3074–3110, 2013.
[44] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE
Trans. Information Theory, vol. 62, no. 3, pp. 1501–1515, 2016.
[45] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary
learning for tensor data,” IEEE Trans. Information Theory, vol. 64, no. 4, 2018.
[46] H. Hotelling, “Analysis of a complex of statistical variables into principal components,”
J. Educ. Psychol., vol. 6, no. 24, pp. 417–441, 1933.
[47] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy.
Statist. Soc. Ser. B, vol. 61, no. 3, pp. 611–622, 1999.
[48] I. T. Jolliffe, Principal component analysis, 2nd edn. Springer-Verlag, 2002.
[49] P. Comon, “Independent component analysis: A new concept?” Signal Processing,
vol. 36, no. 3, pp. 287–314, 1994.
[50] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley &
Sons, 2004.
[51] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition
using class specific linear projection,” IEEE Trans. Pattern Analysis Machine Intelligence,
vol. 19, no. 7, pp. 711–720, 1997.
[52] J. Ye, R. Janardan, and Q. Li, “Two-dimensional linear discriminant analysis,” in
Proc. Advances in Neural Information Processing Systems (NeurIPS ’04), 2004, pp.
1569–1576.

[53] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data
mining, inference, and prediction, 2nd edn. Springer, 2016.
[54] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component
analysis,” IEEE Trans. Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] D. Erdogmus, K. E. Hild, Y. N. Rao, and J. C. Príncipe, “Minimax mutual information
approach for independent component analysis,” Neural Comput., vol. 16, no. 6, pp. 1235–
1252, 2004.
[56] A. Birnbaum, I. M. Johnstone, B. Nadler, and D. Paul, “Minimax bounds for sparse PCA
with noisy high-dimensional data,” Annals Statist., vol. 41, no. 3, pp. 1055–1084, 2013.
[57] R. Krauthgamer, B. Nadler, and D. Vilenchik, “Do semidefinite relaxations solve sparse
PCA up to the information limit?” Annals Statist., vol. 43, no. 3, pp. 1300–1322, 2015.
[58] Q. Berthet and P. Rigollet, “Optimal detection of sparse principal components in high dimension,”
Annals Statist., vol. 41, no. 4, pp. 1780–1815, 2013.
[59] T. Cai, Z. Ma, and Y. Wu, “Optimal estimation and rank detection for sparse spiked
covariance matrices,” Probability Theory Related Fields, vol. 161, nos. 3–4, pp. 781–815,
2015.
[60] A. Onatski, M. Moreira, and M. Hallin, “Asymptotic power of sphericity tests for high-
dimensional data,” Annals Statist., vol. 41, no. 3, pp. 1204–1231, 2013.
[61] A. Perry, A. Wein, A. Bandeira, and A. Moitra, “Optimality and sub-optimality of PCA
for spiked random matrices and synchronization,” arXiv:1609.05573, 2016.
[62] Z. Ke, “Detecting rare and weak spikes in large covariance matrices,” arXiv:1609.00883,
2018.
[63] D. L. Donoho and C. Grimes, “Hessian eigenmaps: Locally linear embedding techniques
for high-dimensional data,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 10, pp. 5591–5596,
2003.
[64] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework
for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323,
2000.
[65] R. Jenssen, “Kernel entropy component analysis,” IEEE Trans. Pattern Analysis Machine
Intelligence, vol. 32, no. 5, pp. 847–860, 2010.
[66] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in
Proc. Intl. Conf. Artificial Neural Networks (ICANN ’97), 1997, pp. 583–588.
[67] J. Yang, X. Gao, D. Zhang, and J.-Y. Yang, “Kernel ICA: An alternative formulation and
its application to face recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1784–1787,
2005.
[68] S. Mika, G. Ratsch, J. Weston, B. Schölkopf, and K. R. Mullers, “Fisher discriminant
analysis with kernels,” in Proc. IEEE Workshop Neural Networks for Signal Processing
IX, 1999, pp. 41–48.
[69] H. Narayanan and S. Mitter, “Sample complexity of testing the manifold hypothesis,”
in Proc. Advances in Neural Information Processing Systems (NeurIPS ’10), 2010, pp.
1786–1794.
[70] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski,
“Dictionary learning algorithms for sparse representation,” Neural Comput., vol. 15,
no. 2, pp. 349–396, 2003.
[71] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing over-
complete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54,
no. 11, pp. 4311–4322, 2006.

[72] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,”
in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’10),
2010, pp. 2691–2698.
[73] Q. Geng and J. Wright, “On the local correctness of ℓ1 -minimization for dictionary learn-
ing,” in Proc. IEEE International Symposium on Information Theory (ISIT ’14), 2014, pp.
3180–3184.
[74] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, “Learning
sparsely used overcomplete dictionaries,” in Proc. 27th Conference on Learning Theory
(COLT ’14), 2014, pp. 123–137.
[75] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcom-
plete dictionaries,” in Proc. 27th Conference on Learning Theory (COLT ’14), 2014, pp.
779–806.
[76] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: Dictionary learning with
noise and outliers,” IEEE Trans. Information Theory, vol. 61, no. 11, pp. 6298–6319,
2015.
[77] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in
Proc. Advances in Neural Information Processing Systems 13 (NeurIPS ’01), 2001,
pp. 556–562.
[78] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative matrix and ten-
sor factorizations: Applications to exploratory multi-way data analysis and blind source
separation. John Wiley & Sons, 2009.
[79] M. Alsan, Z. Liu, and V. Y. F. Tan, “Minimax lower bounds for nonnegative matrix fac-
torization,” in Proc. IEEE Statistical Signal Processing Workshop (SSP ’18), 2018, pp.
363–367.
[80] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444,
2015.
[81] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016,
www.deeplearningbook.org.
[82] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in
Proc. IEEE Information Theory Workshop (ITW ’15), 2015.
[83] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via
information,” arXiv:1703.00810, 2017.
[84] C. W. Huang and S. S. Narayanan, “Flow of Rényi information in deep neural net-
works,” in Proc. IEEE International Workshop Machine Learning for Signal Processing
(MLSP ’16), 2016.
[85] P. Khadivi, R. Tandon, and N. Ramakrishnan, “Flow of information in feed-forward deep
neural networks,” arXiv:1603.06220, 2016.
[86] S. Yu, R. Jenssen, and J. Príncipe, “Understanding convolutional neural network training
with information theory,” arXiv:1804.09060, 2018.
[87] S. Yu and J. Príncipe, “Understanding autoencoders with information theoretic concepts,”
arXiv:1804.00057, 2018.
[88] A. Achille and S. Soatto, “Emergence of invariance and disentangling in deep represen-
tations,” arXiv:1706.01350, 2017.
[89] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler,
and Y. Bengio, “Learning deep representations by mutual information estimation and
maximization,” in International Conference on Learning Representations (ICLR ’19),
2019.

[90] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to


algorithms. Cambridge University Press, 2014.
[91] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Automatic
Control, vol. 19, no. 6, pp. 716–723, 1974.
[92] A. Barron, J. Rissanen, and B. Yu, “The minimum description length principle in coding
and modeling,” IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
[93] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-
dimensional and noisy setting,” IEEE Trans. Information Theory, vol. 55, no. 12, pp.
5728–5741, 2009.
[94] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery
using ℓ1 -constrained quadratic programming (lasso),” IEEE Trans. Information Theory,
vol. 55, no. 5, pp. 2183–2202, 2009.
[95] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax rates of estimation for high-
dimensional linear regression over ℓq -balls,” IEEE Trans. Information Theory, vol. 57,
no. 10, pp. 6976–6994, 2011.
[96] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error
in Gaussian channels,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1261–1282,
2005.
[97] D. Guo, S. Shamai, and S. Verdú, “Mutual information and conditional mean estimation
in Poisson channels,” IEEE Trans. Information Theory, vol. 54, no. 5, pp. 1837–1849,
2008.
[98] A. Lozano, A. M. Tulino, and S. Verdú, “Optimum power allocation for parallel Gaussian
channels with arbitrary input distributions,” IEEE Trans. Information Theory, vol. 52,
no. 7, pp. 3033–3051, 2006.
[99] F. Pérez-Cruz, M. R. D. Rodrigues, and S. Verdú, “Multiple-antenna fading channels with
arbitrary inputs: Characterization and optimization of the information rate,” IEEE Trans.
Information Theory, vol. 56, no. 3, pp. 1070–1084, 2010.
[100] M. R. D. Rodrigues, “Multiple-antenna fading channels with arbitrary inputs: Characteri-
zation and optimization of the information rate,” IEEE Trans. Information Theory, vol. 60,
no. 1, pp. 569–585, 2014.
[101] A. G. C. P. Ramos and M. R. D. Rodrigues, “Fading channels with arbitrary inputs:
Asymptotics of the constrained capacity and information and estimation measures,” IEEE
Trans. Information Theory, vol. 60, no. 9, pp. 5653–5672, 2014.
[102] S. M. Kay, Fundamentals of statistical signal processing: Detection theory. Prentice Hall,
1998.
[103] M. Feder and N. Merhav, “Relations between entropy and error probability,” IEEE Trans.
Information Theory, vol. 40, no. 1, pp. 259–266, 1994.
[104] I. Sason and S. Verdú, “Arimoto–Rényi conditional entropy and Bayesian M-ary hypoth-
esis testing,” IEEE Trans. Information Theory, vol. 64, no. 1, pp. 4–25, 2018.
[105] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength
regime,” IEEE Trans. Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[106] G. Vazquez-Vilar, A. T. Campo, A. Guillén i Fàbregas, and A. Martinez, “Bayesian M-
ary hypothesis testing: The meta-converse and Verdú–Han bounds are tight,” IEEE Trans.
Information Theory, vol. 62, no. 5, pp. 2324–2333, 2016.
[107] R. Venkataramanan and O. Johnson, “A strong converse bound for multiple hypothesis
testing, with applications to high-dimensional estimation,” Electron. J. Statist, vol. 12,
no. 1, pp. 1126–1149, 2018.

[108] E. Abbe, “Community detection and stochastic block models: Recent developments,”
J. Machine Learning Res., vol. 18, pp. 1–86, 2018.
[109] B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection on
random graphs,” in Proc. 28th Conference on Learning Theory (COLT ’15), Paris, 2015,
pp. 1–30.
[110] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Networks,
vol. 10, no. 5, pp. 988–999, 1999.
[111] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learning Res.,
vol. 2, pp. 499–526, 2002.
[112] H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3,
pp. 391–423, 2012.
[113] D. A. McAllester, “PAC-Bayesian stochastic model selection,” Machine Learning,
vol. 51, pp. 5–21, 2003.
[114] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via
information usage,” arXiv:1511.05219, 2016.
[115] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability
of learning algorithms,” in Proc. Advances in Neural Information Processing Systems
(NeurIPS ’17), 2017.
[116] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of
stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop
(ITW ’16), 2016.
[117] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learners that use little
information,” arXiv:1710.05233, 2018.
[118] A. R. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening
generalization bounds,” arXiv:1806.03803, 2018.
[119] A. Pensia, V. Jog, and P. L. Loh, “Generalization error bounds for noisy, iterative
algorithms,” arXiv:1801.04295v1, 2018.
[120] J. Zhang, T. Liu, and D. Tao, “An information-theoretic view for deep learning,”
arXiv:1804.09060, 2018.
[121] M. Vera, P. Piantanida, and L. R. Vega, “The role of information complexity and
randomization in representation learning,” arXiv:1802.05355, 2018.
[122] M. Vera, L. R. Vega, and P. Piantanida, “Compression-based regularization with an
application to multi-task learning,” arXiv:1711.07099, 2018.
[123] C. Chan, A. Al-Bashabsheh, and Q. Zhou, “Info-clustering: A mathematical theory of
data clustering,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2, no. 1, pp. 64–91,
2016.
[124] R. K. Raman and L. R. Varshney, “Universal joint image clustering and registration using
multivariate information measures,” IEEE J. Selected Topics Signal Processing, vol. 12,
no. 5, pp. 928–943, 2018.
[125] Z. Zhang and T. Berger, “Estimation via compressed information,” IEEE Trans. Informa-
tion Theory, vol. 34, no. 2, pp. 198–211, 1988.
[126] T. S. Han and S. Amari, “Parameter estimation with multiterminal data compression,”
IEEE Trans. Information Theory, vol. 41, no. 6, pp. 1802–1833, 1995.
[127] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower
bounds for distributed statistical estimation with communication constraints,” in Proc.
Advances in Neural Information Processing Systems (NeurIPS ’13), 2013.

[128] R. Ahlswede and I. Csiszár, “Hypothesis testing with communication constraints,” IEEE
Trans. Information Theory, vol. 32, no. 4, pp. 533–542, 1986.
[129] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Trans.
Information Theory, vol. 33, no. 6, pp. 759–772, 1987.
[130] T. S. Han and K. Kobayashi, “Exponential-type error probabilities for multiterminal
hypothesis testing,” IEEE Trans. Information Theory, vol. 35, no. 1, pp. 2–14, 1989.
[131] T. S. Han and S. Amari, “Statistical inference under multiterminal data compression,”
IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2300–2324, 1998.
[132] H. M. H. Shalaby and A. Papamarcou, “Multiterminal detection with zero-rate data
compression,” IEEE Trans. Information Theory, vol. 38, no. 2, pp. 254–267, 1992.
[133] G. Katz, P. Piantanida, R. Couillet, and M. Debbah, “On the necessity of binning for
the distributed hypothesis testing problem,” in Proc. IEEE International Symposium on
Information Theory (ISIT ’15), 2015.
[134] Y. Xiang and Y. Kim, “Interactive hypothesis testing against independence,” in Proc.
IEEE International Symposium on Information Theory (ISIT ’13), 2013.
[135] W. Zhao and L. Lai, “Distributed testing against independence with conferencing
encoders,” in Proc. IEEE Information Theory Workshop (ITW ’15), 2015.
[136] W. Zhao and L. Lai, “Distributed testing with zero-rate compression,” in Proc. IEEE
International Symposium on Information Theory (ISIT ’15), 2015.
[137] W. Zhao and L. Lai, “Distributed detection with vector quantizer,” IEEE Trans. Signal
Information Processing Networks, vol. 2, no. 2, pp. 105–119, 2016.
[138] W. Zhao and L. Lai, “Distributed testing with cascaded encoders,” IEEE Trans. Informa-
tion Theory, vol. 64, no. 11, pp. 7339–7348, 2018.
[139] M. Raginsky, “Learning from compressed observations,” in Proc. IEEE Information
Theory Workshop (ITW ’07), 2007.
[140] M. Raginsky, “Achievability results for statistical learning under communication con-
straints,” in Proc. IEEE International Symposium on Information Theory (ISIT ’09),
2009.
[141] A. Xu and M. Raginsky, “Information-theoretic lower bounds for distributed function
computation,” IEEE Trans. Information Theory, vol. 63, no. 4, pp. 2314–2337, 2017.
[142] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations
and Trends Theoretical Computer Sci., vol. 9, no. 3–4, pp. 211–407, 2014.
[143] J. Liao, L. Sankar, V. Y. F. Tan, and F. P. Calmon, “Hypothesis testing under mutual
information privacy constraints in the high privacy regime,” IEEE Trans. Information
Forensics Security, vol. 13, no. 4, pp. 1058–1071, 2018.
[144] F. P. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney, “Data
pre-processing for discrimination prevention: Information-theoretic optimization and
analysis,” IEEE J. Selected Topics Signal Processing, vol. 12, no. 5, pp. 1106–1119, 2018.
2 An Information-Theoretic Approach
to Analog-to-Digital Compression
Alon Kipnis, Yonina C. Eldar, and Andrea J. Goldsmith

Summary

Processing, storing, and communicating information that originates as an analog phe-


nomenon involve conversion of the information to bits. This conversion can be described
by the combined effect of sampling and quantization, as illustrated in Fig. 2.1. The dig-
ital representation in this procedure is achieved by first sampling the analog signal so as
to represent it by a set of discrete-time samples and then quantizing these samples to a
finite number of bits. Traditionally, these two operations are considered separately. The
sampler is designed to minimize information loss due to sampling on the basis of prior
assumptions about the continuous-time input [1]. The quantizer is designed to represent
the samples as accurately as possible, subject to the constraint on the number of bits that
can be used in the representation [2]. The goal of this chapter is to revisit this paradigm
by considering the joint effect of these two operations and to illuminate the dependence
between them.

2.1 Introduction

Consider the minimal sampling rate that arises in classical sampling theory due to
Whittaker, Kotelnikov, Shannon, and Landau [1, 3, 4]. These works establish the Nyquist
rate or the spectral occupancy of the signal as the critical sampling rate above which
the signal can be perfectly reconstructed from its samples. This statement, however,
focuses only on the condition for perfectly reconstructing a bandlimited signal from
its infinite-precision samples; it does not incorporate the quantization precision of the
samples and does not apply to signals that are not bandlimited. In fact, as follows from
lossy source coding theory, it is impossible to obtain an exact representation of any
continuous-amplitude sequence of samples by a digital sequence of numbers due to
finite quantization precision, and therefore any digital representation of an analog sig-
nal is prone to error. That is, no continuous amplitude signal can be reconstructed from
its quantized samples with zero distortion regardless of the sampling rate, even when
the signal is bandlimited. This limitation raises the following question. In converting a
signal to bits via sampling and quantization at a given bit precision, can the signal be


Figure 2.1 Analog-to-digital conversion is achieved by combining sampling and quantization.

reconstructed from these samples with minimal distortion based on sub-Nyquist sam-
pling? One of the goals of this chapter is to discuss this question by extending classical
sampling theory to account for quantization and for non-bandlimited inputs. Namely,
for an arbitrary stochastic input and given a total bitrate budget, we consider the lowest
sampling rate required to sample the signal such that reconstruction of the signal from
a bit-constrained representation of its samples results in minimal distortion. As we shall
see, without assuming any particular structure for the input analog signal, this sampling
rate is often below the signal’s Nyquist rate.
The minimal distortion achievable in recovering a signal from its representation by
a finite number of bits per unit time depends on the particular way the signal is quan-
tized or, more generally, encoded, into a sequence of bits. Since we are interested in the
fundamental distortion limit in recovering an analog signal from its digital representa-
tion, we consider all possible encoding and reconstruction (decoding) techniques. As an
example, in Fig. 2.1 the smartphone display may be viewed as a reconstruction of the
real-world painting The Starry Night from its digital representation. No matter how fine
the smartphone screen, this recovery is not perfect since the digital representation of the
analog image is not accurate. That is, loss of information occurs during the transforma-
tion from analog to digital. Our goal is to analyze this loss as a function of hardware
limitations on the sampling mechanism and the number of bits used in the encoding. It
is convenient to normalize this number of bits by the signal’s free dimensions, that is,
the dimensions along which new information is generated. For example, the free dimen-
sions of a visual signal are usually the horizontal and vertical axes of the frame, and the
free dimension of an audio wave is time. For simplicity, we consider analog signals with
a single free dimension, and we denote this dimension as time. Therefore, our restriction
on the digital representation is given in terms of its bitrate – the number of bits per unit
time.
For an arbitrary continuous-time random signal with a known distribution, the funda-
mental distortion limit due to the encoding of the signal using a limited bitrate is given
by Shannon’s distortion-rate function (DRF) [5–7]. This function provides the optimal
trade-off between the bitrate of the signal’s digital representation and the distortion in
recovering the original signal from this representation. Shannon’s DRF is described only
in terms of the distortion criterion, the probability distribution of the continuous-time
signal, and the maximal bitrate allowed in the digital representation. Consequently, the

[Figure 2.2 block diagram: X(t) → sampler (fs smp/s) → encoder (R bits/s) → decoder → X̂(t); the sampler output is discrete-time analog, the encoder output is digital, and the distortion is measured between X(t) and X̂(t).]
Figure 2.2 Analog-to-digital compression (ADX) and reconstruction setting. Our goal is to derive
the minimal distortion between the signal and its reconstruction from any encoding at bitrate R of
the samples of the signal taken at sampling rate fs .

optimal encoding scheme that attains Shannon’s DRF is a general mapping from the
space of continuous-time signals to bits that does not consider any practical constraints
in implementing such a mapping. In practice, the minimal distortion in recovering analog
signals from their mapping to bits considers the digital encoding of the signal samples,
with a constraint on both the sampling rate and the bitrate of the system [8–10]. Here the
sampling rate fs is defined as the number of samples per unit time of the continuous-time
source signal; the bitrate R is the number of bits per unit time used in the representation
of these samples. The resulting system describing our problem is illustrated in Fig. 2.2,
and is denoted as the analog-to-digital compression (ADX) setting.
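For a stationary Gaussian source under the MSE criterion, Shannon's DRF admits the classical reverse water-filling parameterization R(θ) = ∫ max{0, ½ log₂(S(f)/θ)} df and D(θ) = ∫ min{S(f), θ} df, where S(f) is the power spectral density. The sketch below evaluates this parameterization on a discretized flat spectrum of bandwidth W = 0.5 Hz and unit variance (an arbitrary example), for which the DRF reduces to D(R) = 2^(−2R).

```python
import numpy as np

def gaussian_drf_point(psd, freqs, theta):
    """One (R, D) point of the reverse water-filling curve for a Gaussian source with PSD `psd`."""
    df = freqs[1] - freqs[0]
    D = np.sum(np.minimum(psd, theta)) * df                       # distortion at water level theta
    R = np.sum(0.5 * np.log2(np.maximum(psd / theta, 1.0))) * df  # bits per second
    return R, D

# Flat unit-variance spectrum on [-W, W] with W = 0.5 Hz, so D(R) should equal 2^(-2R).
W = 0.5
freqs = np.linspace(-W, W, 2001)
psd = np.ones_like(freqs)
for theta in [0.5, 0.25, 0.1]:
    R, D = gaussian_drf_point(psd, freqs, theta)
    print(f"R = {R:.3f} bits/s, D = {D:.3f}, 2^(-2R) = {2.0 ** (-2 * R):.3f}")
```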
The digital representation in this setting is obtained by transforming a continuous-time
continuous-amplitude random source signal X(·) through the concatenated operation of
a sampler and an encoder, resulting in a bit sequence. The decoder estimates the original
analog signal from this bit sequence. The distortion is defined to be the mean-squared
error (MSE) between the input signal X(·) and its reconstruction X̂(·). Since we are
interested in the fundamental distortion limit subject to a sampling constraint, we allow
optimization over the encoder and the decoder as the time interval over which X(·) is
sampled goes to infinity. When X(·) is bandlimited and the sampling rate fs exceeds its
Nyquist rate fNyq , the encoder can recover the signal using standard interpolation and
use the optimal source code at bitrate R to attain distortion equal to Shannon’s DRF
of the signal [11]. Therefore, for bandlimited signals, a non-trivial interplay between
the sampling rate and the bitrate arises only when fs is below a signal’s Nyquist rate.
In addition to the optimal encoder and decoder, we also explore the optimal sampling
mechanism, but limit ourselves to the class of linear and continuous deterministic sam-
plers. Namely, each sample is defined by a bounded linear functional over a class of
signals. Finally, in order to account for system imperfections or those due to external
interferences, we assume that the signal X(·) is corrupted by additive noise prior to sam-
pling. The noise-free version is obtained from our results by setting the intensity of this
noise to zero.
The minimal distortion in the ADX system of Fig. 2.2 is bounded from below by
two extreme cases of the sampling rate and the bitrate, as illustrated in Fig. 2.3: (1)
when the bitrate R is unlimited, the minimal ADX distortion reduces to the minimal


Figure 2.3 The minimal sampling rate for attaining the minimal distortion achievable under a
bitrate-limited representation is usually below the Nyquist rate fNyq . In this figure, the noise is
assumed to be zero.

MSE (MMSE) in interpolating a signal from its noisy samples at rate fs [12, 13].
(2) When the sampling rate fs is unlimited or above the Nyquist rate of the signal and
when the noise is zero, the ADX distortion reduces to Shannon’s DRF of the signal.
Indeed, in this case, the optimal encoder can recover the original continuous-time source
without distortion and then encode this recovery in an optimal manner according to the
optimal lossy compression scheme attaining Shannon’s DRF. When fs is unlimited or
above the Nyquist rate and the noise is not zero, the minimal distortion is the indi-
rect (or remote) DRF of the signal given its noise-corrupted version, see Section 3.5 of
[7] and [14]. Our goal is therefore to characterize the MSE due to the joint effect of a
finite bitrate constraint and sampling at a sub-Nyquist sampling rate. In particular, we
are interested in the minimal sampling rate for which Shannon’s DRF, or the indirect
DRF, describing the minimal distortion subject to a bitrate constraint, is attained. As
illustrated in Fig. 2.3, and as will be explained in more detail below, this sampling rate
is usually below the Nyquist rate of X(·), or, more generally, the spectral occupancy
of X(·) when non-uniform or generalized sampling techniques are allowed. We denote
this minimal sampling rate as the critical sampling rate subject to a bitrate constraint,
since it describes the minimal sampling rate required to attain the optimal performance
in systems operating under quantization or bitrate restrictions. The critical sampling
rate extends the minimal-distortion sampling rate considered by Shannon, Nyquist, and
Landau. It is only as the bitrate goes to infinity that sampling at the Nyquist rate is
necessary to attain minimal (namely zero) distortion.
In order to gain intuition as to why the minimal distortion under a bitrate constraint
may be attained by sampling below the Nyquist rate, we first consider in Section 2.2 a
simpler version of the ADX setup involving the lossy compression of linear projections
of signals represented as finite-dimensional random real vectors. Next, in Section 2.3 we
formalize the combined sampling and source coding problem arising from Fig. 2.2 and
provide basic properties of the minimal distortion in this setting. In Section 2.4, we fully
characterize the minimal distortion in ADX as a function of the bitrate and sampling rate
and derive the critical sampling rate that leads to optimal performance. We conclude this
chapter in Section 2.5, where we consider uniform samplers, in particular single-branch
and more general multi-branch uniform samplers, and show that these samplers attain
the fundamental distortion limit.

2.2 Lossy Compression of Finite-Dimensional Signals

Let X n be an n-dimensional Gaussian random vector with covariance matrix ΣX n , and let
Y m be a projected version of X n defined by
Y m = HX n , (2.1)
where H ∈ Rm×n is a deterministic matrix and m < n. This projection of X n into a lower-
dimensional space is the counterpart of sampling the continuous-time analog signal
X(·) in the ADX setting. We consider the normalized MMSE estimate of X n from a
representation of Y m using a limited number of bits.
Without constraining the number of bits, the distortion in this estimation is given by
$$\mathrm{mmse}(X^n|Y^m) \triangleq \frac{1}{n}\,\mathrm{tr}\left(\Sigma_{X^n} - \Sigma_{X^n|Y^m}\right), \qquad (2.2)$$
where ΣX n |Y m is the conditional covariance matrix. However, when Y m is to be encoded
using a code of no more than nR bits, the minimal distortion cannot be smaller than
the indirect DRF of X n given Y m , denoted by DX n |Y m (R). This function is given by the
following parametric expression [14]:

$$D(R_\theta) = \mathrm{tr}\left(\Sigma_{X^n}\right) - \sum_{i=1}^{m}\left[\lambda_i\!\left(\Sigma_{X^n|Y^m}\right) - \theta\right]^+, \qquad R_\theta = \frac{1}{2}\sum_{i=1}^{m}\log^+\!\left[\lambda_i\!\left(\Sigma_{X^n|Y^m}\right)/\theta\right], \qquad (2.3)$$
where $[x]^+ = \max\{x, 0\}$ and $\lambda_i\!\left(\Sigma_{X^n|Y^m}\right)$ is the $i$th eigenvalue of $\Sigma_{X^n|Y^m}$.
It follows from (2.2) that X n can be recovered from Y m with zero MMSE if and
only if
 
$$\lambda_i\left(\Sigma_{X^n}\right) = \lambda_i\!\left(\Sigma_{X^n|Y^m}\right), \qquad (2.4)$$
for all i = 1, . . . , n. When this condition is satisfied, (2.3) takes the form

$$D(R_\theta) = \sum_{i=1}^{n} \min\{\lambda_i(\Sigma_{X^n}), \theta\}, \qquad R_\theta = \frac{1}{2}\sum_{i=1}^{n}\log^+\left[\lambda_i(\Sigma_{X^n})/\theta\right], \qquad (2.5)$$

which is Kolmogorov’s reverse water-filling expression for the DRF of the vector Gaus-
sian source X n [15], i.e., the minimal distortion in encoding X n using codes of rate R bits
per source realization. The key insight is that the requirements for equality between (2.3)
and (2.5) are not as strict as (2.4): all that is needed is equality among those eigenvalues
that affect the value of (2.5). In particular, assume that for a point (R, D) on DX n (R),
only λn (ΣX n ), . . . , λn−m+1 (ΣX n ) are larger than θ, where the eigenvalues are organized in
ascending order. Then we can choose the rows of H to be the m left eigenvectors corre-
sponding to λn (ΣX n ), . . . , λn−m+1 (ΣX n ). With this choice of H, the m largest eigenvalues
of ΣX n |Y m are identical to the m largest eigenvalues of ΣX n , and (2.5) is equal to (2.3).
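To make the role of the water level concrete, the following numerical sketch (assuming numpy; the eigenvalues and the bitrate are illustrative values, not taken from the chapter) verifies the observation above: when only the m largest eigenvalues of Σ_{X^n} lie above the water level, water-filling over those m eigenvalues alone, with the discarded eigenvalues counted entirely as distortion, yields the same (R, D) point as water-filling over all n eigenvalues. This mirrors sampling only the m principal eigendirections.

```python
import numpy as np

def reverse_waterfill(eigs, R, iters=200):
    """Kolmogorov reverse water-filling over a set of eigenvalues:
    returns the distortion at rate R (bits per source realization)
    and the corresponding water level theta."""
    eigs = np.asarray(eigs, dtype=float)
    lo, hi = 1e-12, eigs.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(eigs / theta, 1.0)))
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    theta = 0.5 * (lo + hi)
    return np.sum(np.minimum(eigs, theta)), theta

# illustrative eigenvalues of Sigma_{X^n}, in descending order
eigs = np.array([5.0, 4.0, 3.0, 0.5, 0.2])
R = 3.0  # bits per source vector (illustrative)

D_full, theta = reverse_waterfill(eigs, R)

# keep only the m eigenvalues above the water level (the "sampled" subspace);
# the energy of the discarded eigenvalues is incurred as distortion in full
kept = eigs[eigs > theta]
D_kept, _ = reverse_waterfill(kept, R)
D_sub = D_kept + np.sum(eigs[eigs <= theta])

print(f"m = {kept.size}, D(full) = {D_full:.4f}, D(sub-sampled) = {D_sub:.4f}")
```

Here m = 3 plays the role of the number of rows of H; the two distortions agree up to the bisection tolerance.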


Figure 2.4 Optimal sampling occurs whenever DX n (R) = DX n |Y m (R). This condition is satisfied
even for m < n, as long as there is equality among the eigenvalues of ΣX n and ΣX n |Y m which are
larger than the water-level parameter θ.

Since the rank of the sampling matrix is now m < n, we effectively performed sam-
pling below the “Nyquist rate” of X n without degrading the performance dictated by its
DRF. One way to understand this phenomenon is as an alignment between the range
of the sampling matrix H and the subspace over which X n is represented, according
to Kolmogorov’s expression (2.5). That is, when Kolmogorov’s expression implies that
not all degrees of freedom are utilized by the optimal distortion-rate code, sub-sampling
does not incur further performance loss, provided that the sampling matrix is aligned
with the optimal code. This situation is illustrated in Fig. 2.4. Sampling with an H that
has fewer rows than the rank of Σ_{X^n} is the finite-dimensional analog of sub-Nyquist
sampling in the infinite-dimensional setting of continuous-time signals.
In the rest of this chapter, we explore the counterpart of the phenomena described
above in the richer setting of continuous-time stationary processes that may or may not
be bandlimited, and whose samples may be corrupted by additive noise.

2.3 ADX for Continuous-Time Analog Signals

We now explore the fundamental ADX distortion in the setting of continuous-time


stationary processes that may be corrupted by noise prior to sampling. We consider
the system of Fig. 2.5 in which X(·) ≜ {X(t), t ∈ R} is a stationary process with power


Figure 2.5 Combined sampling and source coding setting.



spectral density (PSD) S X ( f ). This PSD is assumed to be a real, symmetric, and absolute
integrable function that satisfies
$$\mathbb{E}[X(t)X(s)] = \int_{-\infty}^{\infty} S_X(f)\, e^{2\pi j (t-s) f}\, df, \qquad t, s \in \mathbb{R}. \qquad (2.6)$$

The noise process ε(·) is another stationary process independent of X(·) with PSD S ε ( f )
of similar properties, so that the input to the sampler is the stationary process Xε(·) ≜ X(·) + ε(·) with PSD S_{Xε}(f) = S_X(f) + S_ε(f).
We note that, by construction, X(·) and ε(·) are regular processes in the sense that
their spectral measure has an absolutely continuous density with respect to the Lebesgue
measure. If in addition the support of S X ( f ), denoted by supp S X , is contained1 within
a bounded interval, then we say that X(·) is bandlimited and denote by fNyq its Nyquist
rate, defined as twice the maximal element in supp S X . The spectral occupancy of X(·)
is defined to be the Lebesgue measure of supp S X .
Although this is not necessary for all parts of our discussion, we assume that the
processes X(·) and ε(·) are Gaussian. This assumption leads to closed-form characteriza-
tions for many of the expressions we consider. In addition, it follows from [16, 17] that
a lossy compression policy that is optimal under a Gaussian distribution can be used to
encode non-Gaussian signals with matching second-order statistics, while attaining the
same distortion as if the signals were Gaussian. Hence, the optimal sampler and encod-
ing system we use to obtain the fundamental distortion limit for Gaussian signals attains
the same distortion limit for non-Gaussian signals as long as the second-order statistics
of the two signals are the same.

2.3.1 Bounded Linear Sampling


The sampler in Fig. 2.5 outputs a finite-dimensional vector of samples where, most gen-
erally, each sample is defined by a linear and bounded (hence, continuous) functional of
the process Xε (·). For this reason, we denote a sampler of this type as a bounded linear
sampler. In order to consider this sampler in applications, it is most convenient to define
it in terms of a bilinear kernel KH (t, s) on R × R and a discrete sampling set Λ ⊂ R,
as illustrated in Fig. 2.6. The kernel KH (t, s) defines a time-varying linear system on a
suitable class of signals [18], and hence each element tn ∈ Λ defines a linear bounded
functional K(tn , s) on this class by
$$Y_n \triangleq \int_{-\infty}^{\infty} X_\varepsilon(s)\, K_H(t_n, s)\, ds.$$

For a time horizon T , we denote by YT the finite-dimensional vector obtained by


sampling at times t1 , . . . , tn ∈ ΛT , where

Λ_T ≜ Λ ∩ [−T/2, T/2].

1 Since the PSD is associated with an absolutely continuous spectral measure, sets defined in terms of the PSD, e.g., supp S_X, are understood to be unique up to symmetric difference of Lebesgue measure zero.

Figure 2.6 Bounded linear sampler (N_T ≜ |Λ_T|).

We assume in addition that Λ is uniformly discrete in the sense that there exists ε > 0
such that |t − s| > ε for every non-identical t, s ∈ Λ. The density of ΛT is defined as the
number of points in ΛT divided by T and denoted here by d(ΛT ). Whenever it exists, we
define the limit
$$d(\Lambda) = \lim_{T\to\infty} d(\Lambda_T) = \lim_{T\to\infty} \frac{\left|\Lambda \cap [-T/2,\, T/2]\right|}{T}$$
as the symmetric density of Λ, or simply its density.

Linear Time-Invariant Uniform Sampling


An important special case of the bounded linear sampler is that of a linear time-invariant
(LTI) uniform sampler [1], illustrated in Fig. 2.7. For this sampler, the sampling set
is a uniform grid ZT s = {nT s , n ∈ Z}, where T s = fs−1 > 0. The kernel is of the form
KH (t, s) = h(t − s) where h(t) is the impulse response of an LTI system with frequency
response H( f ). Therefore, the entries of YT corresponding to sampling at times nT s ∈ Λ
are given by
$$Y_n \triangleq \int_{-\infty}^{\infty} h(nT_s - s)\, X_\varepsilon(s)\, ds.$$

It is easy to check that d(T s Z) = fs and hence, in this case, the density of the sampling
set has the usual interpretation of sampling rate.

Multi-Branch Linear Time-Invariant Uniform Sampling


A generalization of the uniform LTI sampler incorporates several of these samplers in
parallel, as illustrated in Fig. 2.8. Each of the L branches in Fig. 2.8 consists of an LTI
system with frequency response Hl ( f ) followed by a uniform pointwise sampler with
sampling rate fs /L, so that the overall sampling rate is fs . The vector YT consists of the
concatenation of the vectors Y1,T , . . . , YL,T obtained from each of the sampling branches.

2.3.2 Encoding and Reconstruction


For a time horizon T , the encoder in Fig. 2.5 can be any function of the form

$$f : \mathbb{R}^{N_T} \to \left\{1, \ldots, 2^{TR}\right\}, \qquad (2.7)$$


Figure 2.7 Uniform linear time-invariant sampler.




Figure 2.8 Multi-branch linear time-invariant uniform sampler.

where NT = dim(YT ) = |ΛT |. That is, the encoder receives the vector of samples YT
and outputs an index out of 2T R possible indices. The decoder receives this index, and
produces an estimate X̂(·) for the signal X(·) over the interval [−T/2, T/2]. Thus, it is a mapping
$$g : \left\{1, \ldots, 2^{TR}\right\} \to \mathbb{R}^{[-T/2,\, T/2]}. \qquad (2.8)$$
The goal of the joint operation of the encoder and the decoder is to minimize the
expected mean-squared error (MSE)

$$\frac{1}{T}\int_{-T/2}^{T/2} \mathbb{E}\left(X(t) - \widehat{X}(t)\right)^2 dt.$$
In practice, an encoder may output a finite number of samples that are then interpo-
lated to the continuous-time estimate X(·). Since our goal is to understand the limits in
converting signals to bits, this separation between decoding and interpolation, as well
as the possible restrictions each of these steps encounters in practice, are not explored
within the context of ADX.
Given a particular bounded linear sampler S = (Λ, KH ) and a bitrate R, we are
interested in characterizing the function

$$D_T(S, R) \triangleq \inf_{f, g}\ \frac{1}{T}\int_{-T/2}^{T/2} \mathbb{E}\left(X(t) - \widehat{X}(t)\right)^2 dt, \qquad (2.9)$$

or its limit as T → ∞, where the infimum is over all encoders and decoders of the form
(2.7) and (2.8). The function DT (S , R) is defined only in terms of the sampler S and the
bitrate R, and in this sense measures the minimal distortion that can be attained using
the sampler S subject to a bitrate constraint R on the representation of the samples.

2.3.3 Optimal Distortion in ADX


From the definition of DT (S , R) and the ADX setting, it immediately follows that
DT (S , R) is non-increasing in R. Indeed, any encoding into a set of 2T R elements can
be obtained as a special case of encoding to a set of 2T (R+r) elements, for r > 0. In
addition, by using the trivial decoder g ≡ 0 we see that D_T(S, R) is bounded from above by the variance σ²_X of X(·), which is given by
$$\sigma_X^2 \triangleq \int_{-\infty}^{\infty} S_X(f)\, df.$$
In what follows, we explore additional important properties of DT (S , R).

Optimal Encoding
Denote by X̃_T(·) the process that is obtained by estimating X(·) from the output of the sampler according to an MSE criterion. That is,
$$\widetilde{X}_T(t) \triangleq \mathbb{E}\left[X(t)\,\middle|\,Y_T\right], \qquad t \in \mathbb{R}. \qquad (2.10)$$
From properties of the conditional expectation and MSE, under any encoder f we may
write

$$\frac{1}{T}\int_{-T/2}^{T/2} \mathbb{E}\left(X(t) - \widehat{X}(t)\right)^2 dt = \mathrm{mmse}_T(S) + \mathrm{mmse}\!\left(\widetilde{X}_T \,\middle|\, f(Y_T)\right), \qquad (2.11)$$
where
$$\mathrm{mmse}_T(S) \triangleq \frac{1}{T}\int_{-T/2}^{T/2} \mathbb{E}\left(X(t) - \widetilde{X}_T(t)\right)^2 dt \qquad (2.12)$$
is the distortion due to sampling and
$$\mathrm{mmse}\!\left(\widetilde{X}_T \,\middle|\, f(Y_T)\right) \triangleq \frac{1}{T}\int_{-T/2}^{T/2} \mathbb{E}\left(\widetilde{X}_T(t) - g(f(Y_T))\right)^2 dt$$
is the distortion associated with the lossy compression procedure, and depends on the sampler only through X̃_T(·).
The decomposition (2.11) already provides important clues on an optimal encoder
and decoder pair that attains DT (S , R). Specifically, it follows from (2.11) that there
is no loss in performance if the encoder tries to describe the process XT (·) subject to
the bitrate constraint, rather than the process X(·). Consequently, the optimal decoder
outputs the conditional expectation of XT (·) given f (YT ). The decomposition (2.11) was
first used in [14] to derive the indirect DRF of a pair of stationary Gaussian processes,
and later in [19] to derive indirect DRF expressions in other settings. An extension of the
principle presented in this decomposition to arbitrary distortion measures is discussed
in [20].
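The orthogonality behind (2.11) is easy to observe in a finite-dimensional toy experiment. In the Monte Carlo sketch below (assuming numpy; the dimensions, covariance, noisy sampling matrix, and the crude rounding used as a stand-in for the encoder/decoder pair are all illustrative and not the chapter's construction), the total MSE of any reconstruction that is a function of the samples splits into the estimation MSE plus the MSE of encoding the estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 8, 4, 200_000            # illustrative dimensions and sample size

A = rng.standard_normal((n, n))
Sigma_X = A @ A.T / n              # illustrative source covariance
H = rng.standard_normal((m, n))    # illustrative sampling matrix
sigma_noise = 0.3

X = rng.multivariate_normal(np.zeros(n), Sigma_X, size=N)
Y = X @ H.T + sigma_noise * rng.standard_normal((N, m))

# MMSE estimate of X from Y (linear, since X and Y are jointly Gaussian)
Sigma_Y = H @ Sigma_X @ H.T + sigma_noise ** 2 * np.eye(m)
W = Sigma_X @ H.T @ np.linalg.inv(Sigma_Y)
X_tilde = Y @ W.T

# any function of Y may serve as the "encoded" reconstruction; rounding the
# MMSE estimate is a crude stand-in for g(f(Y_T))
X_hat = np.round(X_tilde)

total = np.mean(np.sum((X - X_hat) ** 2, axis=1))
est   = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
lossy = np.mean(np.sum((X_tilde - X_hat) ** 2, axis=1))
print(total, est + lossy)          # the two agree up to Monte Carlo error
```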
The decomposition (2.11) also sheds light on the behavior of the optimal distortion
DT (S , R) under the two extreme cases of unlimited bitrate and unrestricted sampling
rate, each of which is illustrated in Fig. 2.3. We discuss these two cases next.

Unlimited Bitrate
If we remove the bitrate constraint in the ADX setting (formally, letting R → ∞), loss of
information is only due to noise and sampling. In this case, the second term in the RHS
of (2.11) disappears, and the distortion in ADX is given by mmseT (S ). Namely, we have
lim DT (S , R) = mmseT (S ).
R→∞

The unlimited bitrate setting reduces the ADX problem to a classical problem in sam-
pling theory: the MSE under a given sampling system. Of particular interest is the case
of optimal sampling, i.e., when this MSE vanishes as T → ∞. For example, by con-
sidering the noiseless case and assuming that KH (t, s) = δ(t − s) is the identity operator,
the sampler is defined solely in terms of Λ. The condition on mmseT (S ) to converge

to zero is related to the conditions for stable sampling in Paley–Wiener spaces studied
by Landau and Beurling [21, 22]. In order to see this connection more precisely, note
that (2.6) defines an isomorphism between the Hilbert spaces of finite-variance ran-
dom variables measurable with respect to the sigma algebra generated by X(·) and the
Hilbert space generated by the inverse Fourier transform of $e^{2\pi j t f}\sqrt{S_X(f)}$, $t \in \mathbb{R}$ [23]. Specifically, this isomorphism is obtained by extending the map
$$X(t) \longleftrightarrow \mathcal{F}^{-1}\!\left[e^{2\pi j t f}\sqrt{S_X(f)}\right]\!(s)$$

to the two aforementioned spaces. It follows that sampling and reconstructing X(·) with
vanishing MSE is equivalent to the same operation in the Paley–Wiener space of analytic
functions whose Fourier transform vanishes outside supp S_X. In particular, the condition mmse_T(S) → 0 as T → ∞ holds whenever Λ is a set of stable sampling in this Paley–Wiener
space, i.e., there exists a universal constant A > 0 such that the L2 norm of each function
in this space is bounded by A times the energy of the samples of this function. Landau
[21] showed that a necessary condition for this property is that the number of points in
Λ that fall within the interval [−T/2, T/2] is at least the spectral occupancy of X(·) times
T , minus a constant that is logarithmic in T . For this reason, this spectral occupancy is
termed the Landau rate of X(·), and we denote it here by fLnd . In the special case where
supp S X is an interval (symmetric around the origin since X(·) is real), the Landau and
Nyquist rates coincide.

Optimal Sampling
The other special case of the ADX setting is obtained when there is no loss of infor-
mation due to sampling. For example, this is the case when mmseT (S ) goes to zero
under the conditions mentioned above of zero noise, identity kernel, and sampling den-
sity exceeding the spectral occupancy. More generally, this situation occurs whenever
XT (·) converges (in expected norm) to the MMSE estimator of X(·) from Xε (·). This
MMSE estimator is a stationary process obtained by non-causal Wiener filtering, and
its PSD is

$$S_{X|X_\varepsilon}(f) \triangleq \frac{S_X^2(f)}{S_X(f) + S_\varepsilon(f)}, \qquad (2.13)$$

where in (2.13) and in similar expressions henceforth we interpret the expression to be


zero whenever both the numerator and denominator are zero. The resulting MMSE is
given by
$$\mathrm{mmse}(X|X_\varepsilon) \triangleq \int_{-\infty}^{\infty} \left(S_X(f) - S_{X|X_\varepsilon}(f)\right) df. \qquad (2.14)$$

Since our setting does not limit the encoder from computing XT (·), the ADX problem
reduces in this case to the indirect source coding problem of recovering X(·) from a
bitrate R representation of its corrupted version Xε (·). This problem was considered and


Figure 2.9 Water-filling interpretation of (2.15). The distortion is the sum of mmse(X|Xε ) and the
lossy compression distortion.

solved by Dobrushin and Tsybakov in [14], where the following expression was given
for the optimal trade-off between bitrate and distortion:
$$D_{X|X_\varepsilon}(R_\theta) \triangleq \mathrm{mmse}(X|X_\varepsilon) + \int_{-\infty}^{\infty} \min\left\{S_{X|X_\varepsilon}(f), \theta\right\} df, \qquad (2.15a)$$
$$R_\theta = \frac{1}{2}\int_{-\infty}^{\infty} \log^+\left[S_{X|X_\varepsilon}(f)/\theta\right] df. \qquad (2.15b)$$
A graphical water-filling interpretation of (2.15) is given in Fig. 2.9. When the noise ε(·)
is zero, S X|Xε ( f ) = S X ( f ), and hence (2.15) reduces to
$$D_X(R_\theta) \triangleq \int_{-\infty}^{\infty} \min\{S_X(f), \theta\}\, df, \qquad (2.16a)$$
$$R_\theta = \frac{1}{2}\int_{-\infty}^{\infty} \log^+\left[S_X(f)/\theta\right] df, \qquad (2.16b)$$
which is Pinsker’s expression [15] for the DRF of the process X(·), denoted here by
DX (R). Note that (2.16) is the continuous-time counterpart of (2.3).
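Numerically, (2.15) and (2.16) amount to a one-dimensional search for the water level. The sketch below (assuming numpy, a discretized frequency grid, and an illustrative bell-shaped PSD that is not one of the chapter's examples) evaluates D_X(R) of (2.16) by bisection on θ:

```python
import numpy as np

f = np.linspace(-4.0, 4.0, 4001)      # frequency grid (illustrative)
df = f[1] - f[0]
S_X = np.exp(-f ** 2)                  # illustrative PSD, not from the chapter

def drf(S, R, iters=200):
    """Water-filling evaluation of the DRF (2.16) on a frequency grid."""
    lo, hi = 1e-12, S.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(S / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    theta = 0.5 * (lo + hi)
    return np.sum(np.minimum(S, theta)) * df, theta

for R in (0.5, 1.0, 2.0):
    D, theta = drf(S_X, R)
    print(f"R = {R}: D_X(R) = {D:.4f}, water level = {theta:.4f}")
```

Replacing S_X(f) by S_{X|Xε}(f) and adding mmse(X|Xε) gives the indirect DRF (2.15) in the same way.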
From the discussion above, we conclude that
$$D_T(S, R) \geq D_{X|X_\varepsilon}(R) \geq \max\left\{D_X(R),\ \mathrm{mmse}(X|X_\varepsilon)\right\}. \qquad (2.17)$$
Furthermore, when the estimator E[X(t)|Xε(·)] can be obtained from Y_T as T → ∞, we have that D_T(S, R) → D_{X|Xε}(R) as T → ∞. In this situation, we say that the conditions for optimal
sampling are met, since the only distortion is due to the noise and the bitrate constraint.
The two lower bounds in Fig. 2.3 describe the behavior of DT (S , R) in the two special
cases of unrestricted bitrate and optimal sampling. Our goal in the next section is to
characterize the intermediate case of non-optimal sampling and a finite bitrate constraint.

2.4 The Fundamental Distortion Limit

Given a particular bounded linear sampler S = (Λ, KH ) and a bitrate R, we defined the
function DT (S , R) as the minimal distortion that can be attained in the combined sam-
pling and lossy compression setup of Fig. 2.5. Our goal in this section is to derive and
analyze a function D⋆(fs, R) that bounds from below D_T(S, R) for any such bounded

linear sampler with symmetric density of Λ not exceeding fs . The achievability of this
lower bound is addressed in the next section.

2.4.1 Definition of D⋆(fs, R)


In order to define D⋆(fs, R), we let F⋆(fs) ⊂ R be any set that maximizes
$$\int_F S_{X|X_\varepsilon}(f)\, df = \int_F \frac{S_X^2(f)}{S_X(f) + S_\varepsilon(f)}\, df \qquad (2.18)$$
over all Lebesgue measurable sets F whose Lebesgue measure does not exceed fs. In other words, F⋆(fs) consists of the fs spectral bands with the highest energy in the spectrum of the process {E[X(t)|Xε(·)], t ∈ R}. Define
$$D^\star(f_s, R_\theta) = \mathrm{mmse}^\star(f_s) + \int_{F^\star(f_s)} \min\left\{S_{X|X_\varepsilon}(f), \theta\right\} df, \qquad (2.19a)$$
$$R_\theta = \frac{1}{2}\int_{F^\star(f_s)} \log^+\left[S_{X|X_\varepsilon}(f)/\theta\right] df, \qquad (2.19b)$$
where
$$\mathrm{mmse}^\star(f_s) \triangleq \sigma_X^2 - \int_{F^\star(f_s)} S_{X|X_\varepsilon}(f)\, df = \int_{-\infty}^{\infty} \left(S_X(f) - S_{X|X_\varepsilon}(f)\,\mathbb{1}_{F^\star(f_s)}(f)\right) df.$$

Graphical interpretations of D⋆(fs, R) and mmse⋆(fs) are provided in Fig. 2.10.
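The definition above also translates directly into a numerical recipe: select the fs highest-energy bands of S_{X|Xε}(f), charge the remaining energy as mmse⋆(fs), and water-fill the bitrate over the selected bands. The sketch below (assuming numpy; the signal and noise PSDs, the grid, and the bitrate are illustrative choices) evaluates D⋆(fs, R) this way:

```python
import numpy as np

f = np.linspace(-4.0, 4.0, 8001)
df = f[1] - f[0]
S_X   = np.exp(-f ** 2)              # illustrative signal PSD
S_eps = 0.1 * np.ones_like(f)        # illustrative flat noise PSD
S_XgXe = S_X ** 2 / (S_X + S_eps)    # S_{X|X_eps}(f) of (2.13)

def D_star(fs, R, iters=200):
    # F*(fs): the fs spectral bands where S_{X|X_eps} is largest
    order = np.argsort(S_XgXe)[::-1]
    keep = np.zeros_like(f, dtype=bool)
    keep[order[:int(round(fs / df))]] = True
    mmse_star = np.sum(S_X) * df - np.sum(S_XgXe[keep]) * df
    S = S_XgXe[keep]
    lo, hi = 1e-12, S.max()          # bisection for the water level
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(S / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    theta = 0.5 * (lo + hi)
    return mmse_star + np.sum(np.minimum(S, theta)) * df

for fs in (0.5, 1.0, 2.0, 4.0):
    print(f"fs = {fs}: D*(fs, R=1) = {D_star(fs, 1.0):.4f}")
```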


The importance of the function D⋆(fs, R) can be deduced from the following two
theorems:
theorem 2.1 (converse) Let X(·) be a Gaussian stationary process corrupted by a
Gaussian stationary noise ε(·), and sampled using a bounded linear sampler S =
(KH , Λ).
(i) Assume that for any T > 0, d(ΛT ) ≤ fs . Then, for any bitrate R,

$$D_T(S, R) \geq D^\star(f_s, R).$$


Figure 2.10 Water-filling interpretation of D⋆(fs, Rθ): the fundamental distortion limit under any bounded linear sampling. This distortion is the sum of the fundamental estimation error mmse⋆(fs) and the lossy compression distortion.

(ii) Assume that the symmetric density of Λ exists and satisfies d(Λ) ≤ fs . Then, for
any bitrate R,
$$\liminf_{T\to\infty} D_T(S, R) \geq D^\star(f_s, R).$$
In addition to the negative statement of Theorem 2.1, we show in the next section the
following positive coding result.
theorem 2.2 (achievability) Let X(·) be a Gaussian stationary process corrupted by a
Gaussian stationary noise ε(·). Then, for any fs and ε > 0, there exists a bounded linear
sampler S with a sampling set of symmetric density not exceeding fs such that, for any
R, the distortion in ADX attained by sampling Xε (·) using S over a large enough time
interval T, and encoding these samples using TR bits, does not exceed D⋆(fs, R) + ε.
A full proof of Theorem 2.1 can be found in [24]. Intuition for Theorem 2.1 may be
obtained by representing X(·) according to its Karhunen–Loève (KL) expansion over
[−T/2, T/2], and then using a sampling matrix that keeps only N_T = ⌊T fs⌋ of these coefficients. The function D⋆(fs, R) arises as the limiting expression in the noisy version of (2.5), when the sampling matrix is tuned to keep those KL coefficients corresponding to the N_T largest eigenvalues in the expansion.
In Section 2.5, we provide a constructive proof of Theorem 2.2 that also shows that

D⋆(fs, R) is attained using a multi-branch LTI uniform sampler with an appropriate choice of pre-sampling filters. The rest of the current section is devoted to studying properties of the minimal ADX distortion D⋆(fs, R).

2.4.2 Properties of D⋆(fs, R)


In view of Theorems 2.1 and 2.2, the function D⋆(fs, R) trivially satisfies the properties mentioned in Section 2.3.3 for the optimal distortion in ADX. It is instructive to observe how these properties can be deduced directly from the definition of D⋆(fs, R) in (2.19).

Unlimited Bitrate
As R → ∞, the parameter θ goes to zero and (2.19a) reduces to mmse⋆(fs). This function describes the MMSE that can be attained by any bounded linear sampler with symmetric density at most fs. In particular, in the non-noisy case, mmse⋆(fs) = 0 if and only if fs exceeds the Landau rate of X(·). Therefore, in view of the explanation in Section 2.3.3 and under unlimited bitrate, zero noise, and the identity pre-sampling operation, Theorem 2.1 agrees with the necessary condition derived by Landau for stable sampling in the Paley–Wiener space [21].

Optimal Sampling
The other extreme in the expression for D⋆(fs, R) is when fs is large enough that it does not impose any constraint on sampling. In this case, we expect the ADX distortion to coincide with the function D_{X|Xε}(R) of (2.15), since the latter is the minimal distortion only due to noise and lossy compression at bitrate R. From the definition of F⋆(fs), we observe that F⋆(fs) = supp S_X (almost everywhere) whenever fs is equal to or greater than the Landau rate of X(·). By examining (2.19), we see that this equality implies that
$$D^\star(f_s, R) = D_{X|X_\varepsilon}(R). \qquad (2.20)$$

In other words, the condition fs ≥ fLnd means that there is no loss due to sampling in
the ADX system. This property of the minimal distortion is not surprising. It merely
expresses the fact anticipated in Section 2.3.3 that, when (2.10) vanishes as T goes to
infinity, the estimator E[X(t)|Xε ] is obtained from the samples in this limit and thus the
only loss of information after sampling is due to the noise.
In the next section we will see that, under some conditions, the equality (2.20) is
extended to sampling rates smaller than the Landau rate of the signal.

2.4.3 Optimal Sampling Subject to a Bitrate Constraint


We now explore the minimal sampling rate fs required in order to attain equality in
(2.20), that is, the minimal sampling rate at which the minimal distortion in ADX equals
the indirect DRF of X(·) given Xε (·), describing the minimal distortion subject only to
a bitrate R constraint and additive noise. Intuition for this sampling rate is obtained by
exploring the behavior of D⋆(fs, R) as a function of fs for a specific PSD and a fixed
bitrate R. For simplicity, we explore this behavior under the assumption of zero noise
(ε ≡ 0) and signal X(·) with a unimodal PSD as in Fig. 2.11. Note that in this case we

Figure 2.11 Water-filling interpretation for the function D⋆(fs, R) under zero noise, a fixed bitrate R, and three sampling rates: (a) fs < fR, (b) fs = fR, and (c) fs > fR. (d) corresponds to the DRF of X(·) at bitrate R. This DRF is attained whenever fs ≥ fR, where fR is smaller than the Nyquist rate.


Figure 2.12 The function D⋆(fs, R) for the PSD of Fig. 2.11 and two values of the bitrate R. Also shown is the DRF of X(·) at these values, which is attained at the sub-Nyquist sampling rates marked by fR.

have S_{X|Xε}(f) = S_X(f) since the noise is zero, fLnd = fNyq since S_X(f) has a connected support, and F⋆(fs) is the interval of length fs centered around the origin since S_X(f) is unimodal. In all cases in Fig. 2.11 the bitrate R is fixed and corresponds to the preserved part of the spectrum through (2.19b). The distortion D⋆(fs, R) changes with fs, and is given by the sum of two terms in (2.19a): mmse⋆(fs) and the lossy compression distortion. For example, the increment in fs from (a) to (b) reduces mmse⋆(fs) and increases the lossy compression distortion, although the overall distortion decreases due to this increment. However, the increase in fs leading from (b) to (c) is different: while (c) shows an additional reduction in mmse⋆(fs) compared with (b), the sum of the two distortion terms is identical in both cases and, as illustrated in (d), equals the DRF of X(·) from (2.16). It follows that, in the case of Fig. 2.11, the optimal ADX performance is attained at some sampling rate fR that is smaller than the Nyquist rate, and depends on the bitrate R through expression (2.16). The full behavior of D⋆(fs, R) as a function of fs is illustrated in Fig. 2.12 for two values of R.
The phenomenon described above and in Figs. 2.11 and 2.12 can be generalized to any
Gaussian stationary process with arbitrary PSD and noise in the ADX setting, according
to the following theorem.
theorem 2.3 (optimal sampling rate [24]) Let X(·) be a Gaussian stationary process
with PSD S X ( f ) corrupted by a Gaussian noise ε(·). For each point (R, D) on the graph
of DX|Xε (R) associated with a water-level θ via (2.15), let fR be the Lebesgue measure of
the set
 
$$F_\theta \triangleq \left\{ f : S_{X|X_\varepsilon}(f) \geq \theta \right\}.$$
Then, for all fs ≥ fR,
$$D^\star(f_s, R) = D_{X|X_\varepsilon}(R).$$

The proof of Theorem 2.3 is relatively straightforward and follows from the definition
of Fθ and D⋆(fs, R).
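Theorem 2.3 can also be checked numerically: compute the water level θ of D_{X|Xε}(R), take fR as the measure of {f : S_{X|Xε}(f) ≥ θ}, and evaluate D⋆(fR, R) as in (2.19). The sketch below (assuming numpy and illustrative PSDs that are not the chapter's examples) shows that the two distortions coincide:

```python
import numpy as np

f = np.linspace(-4.0, 4.0, 16001)
df = f[1] - f[0]
S_X    = np.exp(-f ** 2)                 # illustrative signal PSD
S_eps  = 0.05 * np.ones_like(f)          # illustrative flat noise PSD
S_XgXe = S_X ** 2 / (S_X + S_eps)
mmse_XgXe = np.sum(S_X - S_XgXe) * df    # mmse(X|X_eps) of (2.14)

def waterfill(S, R, base, iters=200):
    """base + water-filling distortion of S at bitrate R; returns (D, theta)."""
    lo, hi = 1e-12, S.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(S / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    theta = 0.5 * (lo + hi)
    return base + np.sum(np.minimum(S, theta)) * df, theta

R = 1.0
D_indirect, theta = waterfill(S_XgXe, R, mmse_XgXe)      # D_{X|X_eps}(R)
f_R = np.sum(S_XgXe >= theta) * df                       # measure of F_theta

keep = S_XgXe >= theta                                   # here F*(f_R) = F_theta
mmse_star = np.sum(S_X) * df - np.sum(S_XgXe[keep]) * df
D_star, _ = waterfill(S_XgXe[keep], R, mmse_star)        # D*(f_R, R) of (2.19)
print(f"f_R = {f_R:.3f}, D_indirect = {D_indirect:.4f}, D_star = {D_star:.4f}")
```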
We emphasize that the critical frequency fR depends only on the PSDs S_X(f) and S_ε(f), and on the operating point on the graph of D⋆(fs, R). This point may be parametrized by D, R, or the water-level θ using (2.15). Furthermore, we can consider a version of Theorem 2.3 in which the bitrate is a function of the distortion and the sampling rate, by inverting D⋆(fs, R) with respect to R. This inverse function, R⋆(fs, D), is the minimal number of bits per unit time one must provide on the samples of Xε(·), obtained by any bounded linear sampler with sampling density not exceeding fs, in order to attain distortion not exceeding D. The following representation of R⋆(fs, D) in terms of fR is equivalent to Theorem 2.3.
theorem 2.4 (rate-distortion lower bound) Consider the samples of a Gaussian sta-
tionary process X(·) corrupted by a Gaussian noise ε(·) obtained by a bounded linear
sampler of maximal sampling density fs . The bitrate required to recover X(·) with MSE
at most D > mmse ( fs ) is at least
⎧   


⎪ 1 + f s S X|X ( f )
d f, f s < fR ,
⎨ 2 F  ( fs ) log D−mmse ( f s )
R ( f s , D) = ⎪
⎪ (2.21)

⎩RX|X (D), f s ≥ fR ,

where
 ∞
1  
RX|Xε (Dθ ) = log+ S X|Xε ( f )/θ d f
2 −∞
is the indirect rate-distortion function of X(·) given Xε (·), and θ is determined by
 ∞
 
Dθ = mmse ( fs ) + min S X|Xε ( f ), θ d f.
−∞

Theorems 2.3 and 2.4 imply that the equality in (2.20), which was previously shown to hold for fs ≥ fLnd, is extended to all sampling rates above fR ≤ fLnd. As R goes to infinity, D⋆(fs, R) converges to mmse⋆(fs), the water-level θ goes to zero, the set Fθ coincides with the support of S_X(f), and fR converges to fLnd. Theorem 2.3 then implies that mmse⋆(fs) = 0 for all fs ≥ fLnd, a fact that agrees with Landau's characterization of sets of sampling for perfect recovery of signals in the Paley–Wiener space, as explained in Section 2.3.3.
An intriguing way to explain the critical sampling rate subject to a bitrate constraint
arising from Theorem 2.3 follows by considering the degrees of freedom in the repre-
sentation of the analog signal pre- and post-sampling and with lossy compression of the
samples. For stationary Gaussian signals with zero sampling noise, the degrees of free-
dom in the signal representation are those spectral bands in which the PSD is non-zero.
When the signal energy is not uniformly distributed over these bands, the optimal lossy
compression scheme described by (2.16) calls for discarding those bands with the lowest
energy, i.e., the parts of the signal with the lowest uncertainty.
The degree to which the new critical rate fR is smaller than the Nyquist rate depends
on the energy distribution of X(·) across its spectral occupancy. The more uniform this

Figure 2.13 The critical sampling rate fR as a function of the bitrate R for the PSDs given in the small frames at the top of the figure. For the bandlimited PSDs S_Π(f), S_△(f), and S_ω(f), the critical sampling rate is always at or below the Nyquist rate. The critical sampling rate is finite for any R, even for the non-bandlimited PSD S_Ω(f).

distribution, the more degrees of freedom are required to represent the lossy compressed
signal and therefore the closer fR is to the Nyquist rate. In the examples below we derive
the precise relation between fR and R for various PSDs. These relations are illustrated in
Fig. 2.13 for the signals S_Π(f), S_△(f), S_ω(f), and S_Ω(f) defined below.

2.4.4 Examples

Example 2.1 Consider the Gaussian stationary process X_△(·) with PSD
$$S_\triangle(f) \triangleq \sigma^2\, \frac{\left[1 - |f/W|\right]^+}{W},$$
for some W > 0. Assuming that the noise is zero,
$$F_\theta = \left[W(W\theta - 1),\; W(1 - W\theta)\right]$$
and thus fR = 2W(1 − Wθ). The exact relation between fR and R is obtained from (2.19b) and found to be
$$R = \frac{1}{2}\int_{-f_R/2}^{f_R/2} \log\!\left(\frac{1 - |f/W|}{1 - f_R/2W}\right) df = W \log\!\left(\frac{2W}{2W - f_R}\right) - \frac{f_R}{2\ln 2}.$$
In particular, note that R → ∞ leads to fR → fNyq = 2W, as anticipated.
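The closed-form relation above is easy to confirm against a direct grid computation of (2.19b). The sketch below (assuming numpy and the normalization σ² = W = 1 used implicitly in the expressions above) compares the measure of Fθ and the water-filling integral with the closed forms for a few water levels:

```python
import numpy as np

W = 1.0                                   # with sigma^2 = W = 1 (assumed)
f = np.linspace(-W, W, 100001)
df = f[1] - f[0]
S = np.maximum(1.0 - np.abs(f) / W, 0.0)  # triangular PSD of Example 2.1

for theta in (0.8, 0.5, 0.2):
    above = S >= theta
    fR_num = np.sum(above) * df                                   # |F_theta|
    R_num = 0.5 * np.sum(np.log2(S[above] / theta)) * df          # (2.19b)
    fR_cf = 2 * W * (1 - W * theta)
    R_cf = W * np.log2(2 * W / (2 * W - fR_cf)) - fR_cf / (2 * np.log(2))
    print(f"theta={theta}: f_R {fR_num:.3f} vs {fR_cf:.3f}, "
          f"R {R_num:.4f} vs {R_cf:.4f}")
```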

Example 2.2 Let XΠ (·) be the process with PSD


1| f |<W ( f )
S Π( f ) = . (2.22)
2W

Assume that ε(·) is noise with a flat spectrum within the band (−W, W) such that γ ≜ S_Π(f)/S_ε(f) is the SNR at the spectral component f. Under these conditions, the water-level θ in (2.15) satisfies
$$\theta = \frac{\sigma^2}{2W}\,\frac{\gamma}{1+\gamma}\, 2^{-R/W},$$
and hence
$$\frac{D_{X|X_\varepsilon}(R)}{\sigma^2} = \frac{1}{1+\gamma} + \frac{\gamma}{1+\gamma}\, 2^{-R/W}. \qquad (2.23)$$
In particular, Fθ = [−W, W], so that fR = 2W = fNyq for any bitrate R and D⋆(fs, R) = D_{X|Xε}(R) only for fs ≥ fNyq. That is, for the process X_Π(·), optimal sampling under a bitrate constraint occurs only at or above its Nyquist rate.

Example 2.3 Consider the PSD
$$S_\Omega(f) \triangleq \frac{\sigma^2/f_0}{(\pi f/f_0)^2 + 1},$$
for some f0 > 0. The Gaussian stationary process X_Ω(·) corresponding to this PSD is a Markov process, and it is in fact the unique Gaussian stationary process that is also Markovian (a.k.a. the Ornstein–Uhlenbeck process). Since the spectral occupancy of X_Ω(·) is the entire real line, its Nyquist and Landau rates are infinite and it is impossible to recover it with zero MSE using any bounded linear sampler. Assuming ε(·) ≡ 0 and noting that S_Ω(f) is unimodal, F⋆(fs) = (−fs/2, fs/2), and thus
$$\mathrm{mmse}^\star(f_s) = 2\int_{f_s/2}^{\infty} S_\Omega(f)\, df = \sigma^2\left(1 - \frac{2}{\pi}\arctan\!\left(\frac{\pi f_s}{2 f_0}\right)\right).$$
Consider now a point (R, D) on the DRF of X_Ω(·) and its corresponding water-level θ determined from (2.16). It follows that $f_R = (2 f_0/\pi)\sqrt{\sigma^2/(\theta f_0) - 1}$, so that Theorem 2.3 implies that the distortion cannot be reduced below D by sampling above this rate. The exact relation between R and fR is found to be
$$R = \frac{1}{\ln 2}\left( f_R - \frac{2 f_0}{\pi}\arctan\!\left(\frac{\pi f_R}{2 f_0}\right)\right). \qquad (2.24)$$

Note that, although the Nyquist rate of XΩ (·) is infinite, for any finite R there exists a
critical sampling frequency fR satisfying (2.24) such that DXΩ (R) is attained by sampling
at or above fR .
The asymptotic behavior of (2.24) as R goes to infinity is given by R ∼ fR /ln 2. Thus,
for R sufficiently large, the optimal sampling rate is linearly proportional to R. The ratio
R/ fs is the average number of bits per sample used in the resulting digital representa-
tion. It follows from (2.24) that, asymptotically, the “right” number of bits per sample
converges to 1/ ln 2 ≈ 1.45. If the number of bits per sample is below this value, then

the distortion in ADX is dominated by the DRF DXΩ (·), as there are not enough bits to
represent the information acquired by the sampler. If the number of bits per sample is
greater than this value, then the distortion in ADX is dominated by the sampling distor-
tion, as there are not enough samples for describing the signal up to a distortion equal to
its DRF.
As a numerical example, assume that we encode XΩ (t) using two bits per sample,
i.e., fs = 2R. As R → ∞, the ratio between the minimal distortion D⋆(fs, R) and D_{X_Ω}(R) converges to approximately 1.08, whereas the ratio between D⋆(fs, R) and mmse⋆(fs)
converges to approximately 1.48. In other words, it is possible to attain the optimal
encoding performance within an approximately 8% gap by providing one sample per
each two bits per unit time used in this encoding. On the other hand, it is possible to
attain the optimal sampling performance within an approximately 48% gap by providing
two bits per each sample taken.
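Relation (2.24) and the limiting number of bits per sample are simple to evaluate. The short sketch below (assuming numpy and f0 = 1; the chosen values of fR are arbitrary) shows R/fR approaching 1/ln 2 ≈ 1.44:

```python
import numpy as np

f0 = 1.0  # assumed value

def rate_from_fR(fR):
    """Bitrate R corresponding to the critical sampling rate f_R, from (2.24)."""
    return (fR - (2 * f0 / np.pi) * np.arctan(np.pi * fR / (2 * f0))) / np.log(2)

for fR in (1.0, 5.0, 20.0, 100.0):
    R = rate_from_fR(fR)
    print(f"f_R = {fR:7.1f}   R = {R:9.3f}   bits per sample R/f_R = {R/fR:.3f}")
```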

2.5 ADX under Uniform Sampling

We now analyze the distortion in the ADX setting of Fig. 2.5 under the important class of
single- and multi-branch LTI uniform samplers. Our goal in this section is to show that
for any source and noise PSDs, S_X(f) and S_ε(f), respectively, the function D⋆(fs, R)
describing the fundamental distortion limit in ADX is attainable using a multi-branch
LTI uniform sampler. By doing so, we also provide a proof of Theorem 2.2.
We begin by analyzing the ADX system of Fig. 2.5 under an LTI uniform sampler.
As we show, the asymptotic distortion in this case can be obtained in a closed form
that depends only on the signal and noise PSDs, the sampling rate, the bitrate, and the
pre-sampling filter H( f ). We then show that, by taking H( f ) to be a low-pass filter with
cutoff frequency fs/2, we can attain the fundamental distortion limit D⋆(fs, R) whenever the function S_{X|Xε}(f) of (2.13) attains its maximum at the origin. In the more general case of an arbitrarily shaped S_{X|Xε}(f), we use multi-branch sampling in order to achieve D⋆(fs, R).

2.5.1 Single-Branch LTI Uniform Sampling


Assume that the sampler S is the LTI uniform sampler defined in Section 2.3.1 and illus-
trated in Fig. 2.7. This sampler is characterized by its sampling rate fs and the frequency
response H( f ) of the pre-sampling filter.
In Section 2.3.3, we saw that, for any bounded linear sampler, optimal encod-
ing in ADX is obtained by first forming the estimator X̃_T(·) from Y_T, and then encoding X̃_T(·) in an optimal manner subject to the bitrate constraint. That is, the encoder performs estimation under an MSE criterion followed by optimal source coding for this estimate. Under the LTI uniform sampler, the process X̃_T(·) has an asymptotic distribution described by the conditional expectation of X(·) given the sigma

algebra generated by {Xε(n/fs), n ∈ Z}. Using standard linear estimation techniques, this conditional expectation has a representation similar to that of a Wiener filter, given by [12]:
$$\widetilde{X}(t) \triangleq \mathbb{E}\left[X(t)\,\middle|\,\{X_\varepsilon(n/f_s),\ n \in \mathbb{Z}\}\right] = \sum_{n\in\mathbb{Z}} X_\varepsilon(n/f_s)\, w(t - n/f_s), \qquad t \in \mathbb{R}, \qquad (2.25)$$
where the Fourier transform of w(t) is
$$W(f) = \frac{S_X(f)\,|H(f)|^2}{\sum_{k\in\mathbb{Z}} S_{X_\varepsilon}(f - k f_s)\,|H(f - k f_s)|^2}.$$
Moreover, the resulting MMSE, which is the asymptotic value of mmseT (S ), is given by
$$\mathrm{mmse}_H(f_s) \triangleq \int_{-f_s/2}^{f_s/2} \left(\sum_{n\in\mathbb{Z}} S_X(f - n f_s) - \widetilde{S}_X(f)\right) df, \qquad (2.26)$$
where
$$\widetilde{S}_X(f) \triangleq \frac{\sum_{n\in\mathbb{Z}} S_X^2(f - f_s n)\,|H(f - f_s n)|^2}{\sum_{n\in\mathbb{Z}} S_{X_\varepsilon}(f - f_s n)\,|H(f - f_s n)|^2}. \qquad (2.27)$$
From the decomposition (2.11), it follows that, when S is an LTI uniform sampler, the
distortion can be expressed as

$$D_H(f_s, R) \triangleq \liminf_{T\to\infty} D_T(S, R) = \mathrm{mmse}_H(f_s) + D_{\widetilde{X}}(R),$$
where D_{X̃}(R) is the DRF of the Gaussian process X̃(·) defined by (2.25), satisfying the law of the process X̃_T(·) in the limit as T goes to infinity.
Note that, whenever fs ≥ fNyq and supp S_X is included within the passband of H(f), we have that S̃_X(f) = S_{X|Xε}(f) and thus mmse_H(fs) = mmse(X|Xε), i.e., no distortion due to sampling. Moreover, in this situation, X̃(t) = E[X(t)|Xε(·)] and
$$D_H(f_s, R) = D_{X|X_\varepsilon}(R). \qquad (2.28)$$

The equality (2.28) is a special case of (2.20) for LTI uniform sampling, and says that
there is no loss due to sampling in ADX whenever the sampling rate exceeds the Nyquist
rate of X(·).
When the sampling rate is below fNyq, (2.25) implies that the estimator X̃(·) has the form of a stationary process modulated by a deterministic pulse, and is therefore a block-stationary process, also called a cyclostationary process [25]. The DRF for this class of processes can be described by a generalization of the orthogonal transformation and rate allocation that leads to the water-filling expression (2.16) [26]. Evaluating the resulting expression for the DRF of the cyclostationary process X̃(·) leads to a closed-form expression for D_H(fs, R), which was initially derived in [27].
theorem 2.5 (achievability for LTI uniform sampling) Let X(·) be a Gaussian station-
ary process corrupted by a Gaussian stationary noise ε(·). The minimal distortion in


Figure 2.14 Water-filling interpretation of (2.29) with an all-pass filter H(f). The function $\sum_{k\in\mathbb{Z}} S_X(f - f_s k)$ is the aliased PSD that represents the full energy of the original signal within the discrete-time spectrum interval (−fs/2, fs/2). The part of the energy recovered by the estimator X̃(·) is S̃_X(f). The distortion due to lossy compression is obtained by water-filling over the recovered energy according to (2.29a). The overall distortion D_H(fs, R) is the sum of the sampling distortion and the distortion due to lossy compression.

ADX at bitrate R with an LTI uniform sampler with sampling rate fs and pre-sampling
filter H( f ) is given by
$$D_H(f_s, R_\theta) = \mathrm{mmse}_H(f_s) + \int_{-f_s/2}^{f_s/2} \min\left\{\widetilde{S}_X(f), \theta\right\} df, \qquad (2.29a)$$
$$R_\theta = \frac{1}{2}\int_{-f_s/2}^{f_s/2} \log_2^+\left[\widetilde{S}_X(f)/\theta\right] df, \qquad (2.29b)$$
where mmse_H(fs) and S̃_X(f) are given by (2.26) and (2.27), respectively.


A graphical water-filling interpretation of (2.29) is provided in Fig. 2.14. This
expression combines the MMSE (2.26), which depends only on fs and H( f ), with the
expression for the indirect DRF of (2.15), which also depends on the bitrate R. The func-
tion S̃_X(f) arises in the MMSE estimation of X(·) from its samples and can be interpreted as an average over the PSD of polyphase components of the cyclostationary process X̃(·) [26].
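Expression (2.29) can be evaluated numerically by truncating the aliasing sums. The sketch below (assuming numpy; the signal and noise PSDs, the all-pass default filter, the truncation depth, and the grid resolution are all illustrative choices) computes S̃_X(f) of (2.27), mmse_H(fs) of (2.26), and the water-filling term of (2.29) for a few sampling rates:

```python
import numpy as np

def S_X(f):                         # illustrative signal PSD (not from the chapter)
    return np.exp(-f ** 2)

def S_eps(f):                       # illustrative, rapidly decaying noise PSD
    return 0.05 * np.exp(-(f / 3.0) ** 2)

def D_H(fs, R, H=lambda f: np.ones_like(f), n_alias=50, n_grid=2001, iters=200):
    """Evaluate (2.26)-(2.29) for an LTI uniform sampler at rate fs with filter H."""
    f = np.linspace(-fs / 2, fs / 2, n_grid)
    df = f[1] - f[0]
    shifts = np.arange(-n_alias, n_alias + 1)[:, None] * fs
    fa = f[None, :] - shifts                              # aliased frequencies f - n*fs
    Hsq = np.abs(H(fa)) ** 2
    S_tilde = np.sum(S_X(fa) ** 2 * Hsq, axis=0) / \
              np.sum((S_X(fa) + S_eps(fa)) * Hsq, axis=0)          # (2.27)
    mmse_H = np.sum(np.sum(S_X(fa), axis=0) - S_tilde) * df        # (2.26)
    lo, hi = 1e-12, S_tilde.max()                                  # water level for (2.29)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(S_tilde / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    theta = 0.5 * (lo + hi)
    return mmse_H + np.sum(np.minimum(S_tilde, theta)) * df

for fs in (0.5, 1.0, 2.0, 4.0):
    print(f"fs = {fs}: D_H(fs, R=1) = {D_H(fs, 1.0):.4f}")
```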
As is implicit in the analysis above, the coding scheme that attains DH ( fs , R) is
described by the decomposition of the non-causal MSE estimate of X(·) from its sam-
ples YT . This estimate is encoded using a codebook with 2T R elements that attains the
DRF of the Gaussian process X̃(·) at bitrate R, and the decoded codeword (which is a waveform over [−T/2, T/2]) is used as the reconstruction of X(·). For any ρ > 0, the
MSE resulting from this process can be made smaller than DH ( fs , R − ρ) by taking T to
be sufficiently large.

Example 2.4 (continuation of Example 2.2) As a simple example for using formula
(2.29), consider the process XΠ (·) of Example 2.2. Assuming that the noise ε(·) ≡ 0
(equivalently, γ → ∞) and that H( f ) passes all frequencies f ∈ [−W, W], the relation
between the distortion in (2.29a) and the bitrate in (2.29b) is given by

Figure 2.15 Distortion as a function of sampling rate for the source with PSD S Π ( f ) of (2.22),
zero noise, and source coding rates R = 1 and R = 2 bits per time unit.



$$D_H(f_s, R) = \begin{cases} \mathrm{mmse}_H(f_s) + \sigma^2\, \dfrac{f_s}{2W}\, 2^{-2R/f_s}, & f_s < 2W, \\[1ex] \sigma^2\, 2^{-R/W}, & f_s \geq 2W, \end{cases} \qquad (2.30)$$
where mmse_H(fs) = σ²[1 − fs/2W]⁺. Expression (2.30) is shown in Fig. 2.15 for two
fixed values of the bitrate R. It has a very intuitive structure: for frequencies below
fNyq = 2W, the distortion as a function of the bitrate increases by a constant factor due
to the error as a result of non-optimal sampling. This factor completely vanishes once
the sampling rate exceeds the Nyquist frequency, in which case DH ( fs , R) coincides with
the DRF of X(·).
In the noisy case when γ = S_Π(f)/S_ε(f), we have mmse_H(fs) = σ²(1 − (fs/2W)(γ/(1 + γ))) for fs < 2W, and the distortion takes the form
$$D^\star(f_s, R) = \begin{cases} \mathrm{mmse}_H(f_s) + \sigma^2\, \dfrac{f_s}{2W}\,\dfrac{\gamma}{1+\gamma}\, 2^{-2R/f_s}, & f_s < 2W, \\[1ex] \mathrm{mmse}(X|X_\varepsilon) + \sigma^2\, \dfrac{\gamma}{1+\gamma}\, 2^{-R/W}, & f_s \geq 2W, \end{cases} \qquad (2.31)$$
where mmse(X|Xε) = σ²/(1 + γ).
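As a quick numerical check of the closed forms above (a sketch in plain Python, assuming the illustrative values W = 1, σ² = 1, and SNR γ = 10), the distortion decreases with fs and saturates once fs reaches the Nyquist rate 2W:

```python
W, sigma2, gamma = 1.0, 1.0, 10.0        # illustrative values

def D_flat(fs, R):
    """Evaluate (2.31) for the flat-spectrum process X_Pi with flat noise."""
    if fs < 2 * W:
        mmse_H = sigma2 * (1 - (fs / (2 * W)) * (gamma / (1 + gamma)))
        return mmse_H + sigma2 * (fs / (2 * W)) * (gamma / (1 + gamma)) * 2 ** (-2 * R / fs)
    mmse_inf = sigma2 / (1 + gamma)
    return mmse_inf + sigma2 * (gamma / (1 + gamma)) * 2 ** (-R / W)

for fs in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(fs, round(D_flat(fs, 1.0), 4))
```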

Next, we show that, when S X|Xε ( f ) is unimodal, an LTI uniform sampler can be used
to attain D⋆(fs, R).

2.5.2 Unimodal PSD and Low-Pass Filtering


Under the assumption that H(f) is an ideal low-pass filter with cutoff frequency fs/2, (2.29) becomes
$$D_{\mathrm{LPF}}(f_s, R_\theta) = \int_{-f_s/2}^{f_s/2} \left(\sum_{n\in\mathbb{Z}} S_X(f - n f_s) - S_{X|X_\varepsilon}(f)\right) df + \int_{-f_s/2}^{f_s/2} \min\left\{S_{X|X_\varepsilon}(f), \theta\right\} df, \qquad (2.32a)$$
$$R_\theta = \frac{1}{2}\int_{-f_s/2}^{f_s/2} \log_2^+\left[S_{X|X_\varepsilon}(f)/\theta\right] df. \qquad (2.32b)$$

Comparing (2.32) with (2.19), we see that the two expressions coincide whenever the
interval [−fs/2, fs/2] maximizes (2.18). Therefore, we conclude that, when the function
S X|Xε ( f ) is unimodal in the sense that it attains its maximal value at the origin, the

fundamental distortion in ADX is attained using an LTI uniform sampler with a low-pass
filter of cutoff frequency fs /2 as its pre-sampling operation.
An example for a PSD for which (2.32) describes its fundamental distortion limit is
the one in Fig. 2.11. Note the LPF with cutoff frequency fs /2 in cases (a)–(c) there.
Another example of this scenario for a unimodal PSD is given in Example 2.4 above.

Example 2.5 (continuation of Examples 2.2 and 2.4) In the case of the process XΠ (·)
with a flat spectrum noise as in Examples 2.2 and 2.4, (2.32) leads to (2.31). It follows
that the fundamental distortion limit in ADX with respect to XΠ (·) and a flat spectrum
noise is given by (2.31), which was obtained from (2.29). Namely, the fundamental
distortion limit in this case is obtained using any pre-sampling filter whose passband
contains [−W, W], and using an LPF is unnecessary.
In particular, the distortion in (2.31) corresponding to fs ≥ fNyq equals the indirect
DRF of X(·) given Xε (·), which can be found directly from (2.15). Therefore, (2.31)
implies that optimal sampling for XΠ (·) under LTI uniform sampling occurs only at or
above its Nyquist rate. This conclusion is not surprising since, according to Example 2.2,
super-Nyquist sampling of XΠ (·) is necessary for (2.20) to hold under any bounded linear
sampler.

The analysis above implies in particular that a distortion of D⋆(fs, R) is achievable using LTI uniform sampling for any signal X(·) with a unimodal PSD in the noiseless case, or a unimodal S_{X|Xε}(f) in the noisy case. Therefore, Theorem 2.5 implies Theorem 2.2 in these special cases. In what follows, we use multi-branch sampling in order to show that Theorem 2.2 holds for an arbitrary S_{X|Xε}(f).

2.5.3 Multi-Branch LTI Uniform Sampling


We now consider the ADX system of Fig. 2.5 where the sampler is the multi-branch
sampler defined in Section 2.3.1 and illustrated in Fig. 2.8. This sampler is characterized
by L filters H1 ( f ), . . . , HL ( f ) and a sampling rate fs .
The generalization of Theorem 2.5 under this sampler is as follows [27].
theorem 2.6 Let X(·) be a Gaussian stationary process corrupted by a Gaussian sta-
tionary noise ε(·). The minimal distortion in ADX at bitrate R with a multi-branch LTI
uniform sampler is given by

$$D_{H_1,\ldots,H_L}(f_s, R_\theta) = \mathrm{mmse}_{H_1,\ldots,H_L}(f_s) + \sum_{l=1}^{L} \int_{-f_s/2}^{f_s/2} \min\left\{\lambda_l(f), \theta\right\} df, \qquad (2.33a)$$
$$R_\theta = \frac{1}{2}\sum_{l=1}^{L} \int_{-f_s/2}^{f_s/2} \log_2^+\left[\lambda_l(f)/\theta\right] df, \qquad (2.33b)$$

where $\lambda_1(f), \ldots, \lambda_L(f)$ are the eigenvalues of the matrix
$$\widetilde{\mathbf{S}}_X(f) = \mathbf{S}_Y^{-\frac{H}{2}}(f)\, \mathbf{K}(f)\, \mathbf{S}_Y^{-\frac{1}{2}}(f),$$
with
$$\left(\mathbf{S}_Y(f)\right)_{i,j} = \sum_{n\in\mathbb{Z}} S_{X_\varepsilon}(f - f_s n)\, H_i(f - f_s n)\, H_j^*(f - f_s n), \qquad i, j = 1, \ldots, L,$$
$$\left(\mathbf{K}(f)\right)_{i,j} = \sum_{n\in\mathbb{Z}} S_X^2(f - f_s n)\, H_i(f - f_s n)\, H_j^*(f - f_s n), \qquad i, j = 1, \ldots, L.$$

In addition,
$$\mathrm{mmse}_{H_1,\ldots,H_L}(f_s) \triangleq \sigma_X^2 - \int_{-f_s/2}^{f_s/2} \mathrm{tr}\left(\widetilde{\mathbf{S}}_X(f)\right) df$$

is the minimal MSE in estimating X(·) from the combined output of the L sampling
branches as T approaches infinity.
The most interesting feature in the extension of (2.29) provided by (2.33) is the dependence between samples obtained over different branches, expressed in the definition of the matrix S̃_X(f). In particular, if fs ≥ fNyq, then we may choose the passbands of the L filters to be a set of L disjoint intervals, each of length at most fs/L, such that the union of their supports contains the support of S_{X|Xε}. With this choice, the matrix S̃_X(f) is diagonal and its eigenvalues are
$$\lambda_l(f) = \widetilde{S}_l(f) \triangleq \frac{\sum_{n\in\mathbb{Z}} S_X^2(f - f_s n)}{\sum_{n\in\mathbb{Z}} S_{X_\varepsilon}(f - f_s n)}\; \mathbb{1}_{\mathrm{supp}\, H_l}(f).$$
Since the union of the filters' supports contains the support of S_{X|Xε}, we have
$$D_{H_1,\ldots,H_L}(f_s, R) = D_{X|X_\varepsilon}(R).$$

While it is not surprising that a multi-branch sampler attains the optimal sampling dis-
tortion when fs is above the Nyquist rate, we note that at each branch the sampling rate
can be as small as fNyq /L. This last remark suggests that a similar principle may be used
under sub-Nyquist sampling to sample those particular parts of the spectrum of maximal
energy whenever S X|Xε ( f ) is not unimodal.
Our goal now is to prove Theorem 2.2 by showing that, for any PSDs S_X(f) and S_ε(f), the distortion in (2.33) can be made arbitrarily close to the fundamental distortion limit D⋆(fs, R) with an appropriate choice of the number of sampling branches and their filters. Using the intuition gained above, given a sampling rate fs we cover the set of maximal energy F⋆(fs) of (2.18) using L disjoint intervals, such that the length of each interval does not exceed fs/L. For any ε > 0, it can be shown that there exists L large enough such that $\int_\Delta S_{X|X_\varepsilon}(f)\, df < \varepsilon$, where Δ is the part of F⋆(fs) that is not covered by the L intervals [28].
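The covering argument admits a simple numerical illustration: partition the frequency axis into bins of width fs/L and keep the L most energetic bins; their union has measure fs and, as L grows, captures essentially all of the energy of F⋆(fs). The sketch below (assuming numpy; the bimodal S_{X|Xε} and the sampling rate are illustrative choices) shows the captured energy approaching that of F⋆(fs):

```python
import numpy as np

f = np.linspace(-4.0, 4.0, 16001)
df = f[1] - f[0]
S_XgXe = np.exp(-8 * (np.abs(f) - 1.5) ** 2)   # illustrative bimodal S_{X|X_eps}
fs = 1.0

# energy of F*(fs): the fs highest-energy bands
order = np.argsort(S_XgXe)[::-1]
E_star = np.sum(S_XgXe[order[:int(round(fs / df))]]) * df

for L in (1, 2, 4, 16, 64):
    width = fs / L                                   # each interval has length fs/L
    bins = np.floor((f - f[0]) / width).astype(int)
    energy_per_bin = np.bincount(bins, weights=S_XgXe) * df
    E_covered = np.sum(np.sort(energy_per_bin)[::-1][:L])
    print(f"L = {L:3d}: covered energy {E_covered:.4f} of {E_star:.4f}")
```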

From this explanation, we conclude that, for any PSD S X|Xε ( f ), fs > 0, and ε > 0,
there exists an integer L and a set of L pre-sampling filters H1 ( f ), . . . , HL ( f ) such that,
for every bitrate R,

$$D_{H_1,\ldots,H_L}(f_s, R) \leq D^\star(f_s, R) + \varepsilon. \qquad (2.34)$$

Since DH1 ,...,HL ( fs , R) is obtained in the limit as T approaches infinity of the mini-
mal distortion in ADX under the aforementioned multi-branch uniform sampler, the
fundamental distortion limit in ADX is achieved up to an arbitrarily small constant.
The description starting from Theorem 2.6 and ending in (2.34) sketches the proof
of the achievability side of the fundamental ADX distortion (Theorem 2.2). Below we
summarize the main points in the procedure described in this section.
(i) Given a sampling rate fs, use a multi-branch LTI uniform sampler with a sufficiently large number of sampling branches L such that the effective passband of all branches is close enough to F⋆(fs), which is a set of Lebesgue measure fs that maximizes (2.18).
(ii) Estimate the signal X(·) under an MSE criterion, leading to X̃_T(·) defined in (2.10). As T → ∞ this process converges in L₂ norm to X̃(·) defined in (2.25).
(iii) Given a bitrate constraint R, encode a realization of X̃_T(·) in an optimal manner
subject to an MSE constraint as in standard source coding [7]. For example, for
ρ > 0 arbitrarily small, we may use a codebook consisting of 2T (R+ρ) waveforms
of duration T generated by independent draws from the distribution defined by
the preserved part of the spectrum in Fig. 2.10. We then use minimum distance
encoding with respect to this codebook.

2.6 Conclusion

The processing, communication, and digital storage of an analog signal requires first
representing it as a bit sequence. Hardware and modeling constraints in processing ana-
log information imply that the digital representation is obtained by first sampling the
analog waveform and then quantizing or encoding its samples. That is, the transforma-
tion from analog signals to bits involves the composition of sampling and quantization
or, more generally, lossy compression operations.
In this chapter we explored the minimal sampling rate required to attain the funda-
mental distortion limit in reconstructing a signal from its quantized samples subject to a
strict constraint on the bitrate of the system. We concluded that, when the energy of the
signal is not uniformly distributed over its spectral occupancy, the optimal signal repre-
sentation can be attained by sampling at some critical rate that is lower than the Nyquist
rate or, more generally, the Landau rate, in bounded linear sampling. This critical sam-
pling rate depends on the bitrate constraint, and converges to the Nyquist or Landau rates
in the limit of infinite bitrate. This reduction in the optimal sampling rate under finite bit
precision is made possible by designing the sampling mechanism to sample only those
parts of the signals that are not discarded due to optimal lossy compression.

The information-theoretic approach to analog-to-digital compression explored in this


chapter can be extended in various directions. First, while we considered the minimal
sampling rate and resulting distortion under an ideal encoding of the samples, such
an encoding is rarely possible in practice. Indeed, in most cases, the encoding of the
samples is subject to additional constraints in addition to the bit resolution, such as
complexity, time delay, or limited information on the distribution of the signal and the
noise. It is therefore important to characterize the optimal sampling rate and resulting
distortion under these limitations. In addition, the reduction in the optimal sampling rate
under the bitrate constraint from the Nyquist rate to fR can be understood as the result
of a reduction in degrees of freedom in the compressed signal representation compared
with the original source. It is interesting to consider whether a similar principle holds
for non-stationary [29] or non-Gaussian [30, 31] signal models (e.g., sparse signals).

References

[1] Y. C. Eldar, Sampling theory: Beyond bandlimited systems. Cambridge University Press,
2015.
[2] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Information Theory, vol. 44,
no. 6, pp. 2325–2383, 1998.
[3] C. E. Shannon, "Communication in the presence of noise," Proc. IRE, vol. 37, pp. 10–21, 1949.
[4] H. Landau, “Sampling, data transmission, and the Nyquist rate,” Proc. IEEE, vol. 55,
no. 10, pp. 1701–1706, 1967.
[5] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J.,
vol. 27, pp. 379–423, 623–656, 1948.
[6] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[7] T. Berger, Rate-distortion theory: A mathematical basis for data compression. Prentice-
Hall, 1971.
[8] R. Walden, “Analog-to-digital converter survey and analysis,” IEEE J. Selected Areas in
Communications, vol. 17, no. 4, pp. 539–550, 1999.
[9] J. Candy, “A use of limit cycle oscillations to obtain robust analog-to-digital converters,”
IEEE Trans. Communications, vol. 22, no. 3, pp. 298–305, 1974.
[10] B. Oliver, J. Pierce, and C. Shannon, “The philosophy of PCM,” IRE Trans. Information
Theory, vol. 36, no. 11, pp. 1324–1331, 1948.
[11] D. L. Neuhoff and S. S. Pradhan, “Information rates of densely sampled data: Distributed
vector quantization and scalar quantization with transforms for Gaussian sources,” IEEE
Trans. Information Theory, vol. 59, no. 9, pp. 5641–5664, 2013.
[12] M. Matthews, “On the linear minimum-mean-squared-error estimation of an undersampled
wide-sense stationary random process,” IEEE Trans. Signal Processing, vol. 48, no. 1, pp.
272–275, 2000.
[13] D. Chan and R. Donaldson, “Optimum pre- and postfiltering of sampled signals with appli-
cation to pulse modulation and data compression systems,” IEEE Trans. Communication
Technol., vol. 19, no. 2, pp. 141–157, 1971.
[14] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE
Trans. Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[15] A. Kolmogorov, “On the Shannon theory of information transmission in the case of
continuous signals,” IRE Trans. Information Theory, vol. 2, no. 4, pp. 102–108, 1956.
[16] A. Lapidoth, “On the role of mismatch in rate distortion theory,” IEEE Trans. Information
Theory, vol. 43, no. 1, pp. 38–47, 1997.
[17] I. Kontoyiannis and R. Zamir, “Mismatched codebooks and the role of entropy coding in
lossy data compression,” IEEE Trans. Information Theory, vol. 52, no. 5, pp. 1922–1938,
2006.
[18] A. H. Zemanian, Distribution theory and transform analysis: An introduction to general-
ized functions, with applications. Courier Corporation, 1965.
[19] J. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum
distortion,” IEEE Trans. Information Theory, vol. 16, no. 4, pp. 406–411, 1970.
[20] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Trans. Information Theory,
vol. 26, no. 5, pp. 518–521, 1980.
[21] H. Landau, “Necessary density conditions for sampling and interpolation of certain entire
functions,” Acta Mathematica, vol. 117, no. 1, pp. 37–52, 1967.
[22] A. Beurling and L. Carleson, The collected works of Arne Beurling: Complex analysis.
Birkhäuser, 1989, vol. 1.
[23] F. J. Beutler, “Sampling theorems and bases in a Hilbert space,” Information and Control,
vol. 4, nos. 2–3, pp. 97–117, 1961.
[24] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Fundamental distortion limits of analog-
to-digital compression,” IEEE Trans. Information Theory, vol. 64, no. 9, pp. 6013–6033,
2018.
[25] W. Bennett, “Statistics of regenerative digital transmission,” Bell Labs Technical J., vol. 37,
no. 6, pp. 1501–1542, 1958.
[26] A. Kipnis, A. J. Goldsmith, and Y. C. Eldar, “The distortion rate function of cyclostationary
Gaussian processes,” IEEE Trans. Information Theory, vol. 64, no. 5, pp. 3810–3824, 2018.
[27] A. Kipnis, A. J. Goldsmith, Y. C. Eldar, and T. Weissman, “Distortion rate function of sub-
Nyquist sampled Gaussian sources,” IEEE Trans. Information Theory, vol. 62, no. 1, pp.
401–429, 2016.
[28] A. Kipnis, “Fundamental distortion limits of analog-to-digital compression,” Ph.D. disser-
tation, Stanford University, 2018.
[29] A. Kipnis, A. J. Goldsmith, and Y. C. Eldar, "The distortion-rate function of sampled
Wiener processes," IEEE Trans. Information Theory, vol. 65, no. 1, pp. 482–499, 2019.
[30] A. Kipnis, G. Reeves, and Y. C. Eldar, “Single letter formulas for quantized compressed
sensing with Gaussian codebooks,” in 2018 IEEE International Symposium on Information
Theory (ISIT), 2018, pp. 71–75.
[31] A. Kipnis, G. Reeves, Y. C. Eldar, and A. J. Goldsmith, “Compressed sensing under opti-
mal quantization,” in 2017 IEEE International Symposium on Information Theory (ISIT),
2017, pp. 2148–2152.
3 Compressed Sensing via
Compression Codes
Shirin Jalali and H. Vincent Poor

Summary

Compressed sensing (CS) refers to the following fundamental data-acquisition problem.
A signal x ∈ Rn is measured as y = Ax+z, where A ∈ Rm×n (m < n) and z ∈ Rm denote the
sensing matrix and the measurement noise, respectively. The goal of a CS recovery algo-
rithm is to estimate x from under-determined measurements y. CS is possible because
the signals we are interested in capturing are typically highly structured. Recovery algo-
rithms take advantage of such structures to solve the described under-determined system
of linear equations. On the other hand, data compression, i.e., efficiently encoding of an
acquired signal, is one of the pillars of information theory. Like CS, data-compression
codes are designed to take advantage of a signal’s structure to encode it efficiently.
Studying modern image and video compression codes on one hand and CS recovery
algorithms on the other hand reveals that, to a great extent, structures used by com-
pression codes are much more elaborate than those used by CS algorithms. Using more
complex structures in CS, similar to those employed by data-compression codes, poten-
tially leads to more efficient recovery methods that require fewer linear measurements
or have a better reconstruction quality. In this chapter, we establish fundamental con-
nections between the two seemingly independent problems of data compression and CS.
This connection leads to CS recovery methods that are based on compression codes.
Such compression-based CS recovery methods, indirectly, take advantage of all struc-
tures used by compression codes. This process, with minimal effort, elevates the class of
structures used by CS algorithms to those used by compression codes and hence leads
to more efficient CS recovery methods.

3.1 Compressed Sensing

Data acquisition refers to capturing signals that lie in the physical world around us and
converting them to processable digital signals. It is a basic operation that takes place
before any other data processing can begin to happen. Audio recorders, cameras, X-ray
computed tomography (CT) scanners, and magnetic resonance imaging (MRI) machines
are some examples of data-acquisition devices that employ different mechanisms to
perform this crucial step.

Figure 3.1 Block diagram of a CS measurement system. Signal x is measured as y ∈ Rm and later
recovered as x̂.

For decades the Nyquist–Shannon sampling theorem served as the theoretical foun-
dation of most data-acquisition systems. The sampling theorem states that for a band-
limited signal with a maximum frequency fc , a sampling rate of 2 fc is enough for perfect
reconstruction. CS is arguably one of the most disruptive ideas conceived after this fun-
damental theorem [1, 2]. CS proves that, for sparse signals, the sampling rate can in
fact be substantially below the rate required by the Nyquist–Shannon sampling theorem
while perfect reconstruction is still possible.
CS is a special data-acquisition technique that is applicable to various measurement
devices that can be described as a noisy linear measurement system. Let x ∈ Rn denote
the desired signal that is to be measured. Then, the measured signal y ∈ Rm in such
systems can be written as

y = Ax + z, (3.1)

where A ∈ Rm×n and z ∈ Rm denote the sensing matrix and the measurement noise,
respectively. A reconstruction algorithm then recovers x from measurements y, while
having access to sensing matrix A. Fig. 3.1 shows a block diagram representation of the
described measurement system.
For years, the conventional wisdom has been that, for x to be recoverable from y = Ax + z,
the number of measurements m should be equal to or larger than the number of unknown
variables n. However, in recent years, researchers have observed that most signals we are
interested in acquiring do not behave like random noise and are typically "structured";
employing our knowledge about their structure might therefore enable us to recover them,
even if the number of measurements (m) is smaller than the ambient dimension of the
signal (n). CS theory proves that this is in fact the case, at least
for signals that have sparse representations in some transform domain.
More formally, the problem of CS can be stated as follows. Consider a class of sig-
nals denoted by a set Q, which is a compact subset of Rn . A signal x ∈ Q is measured
by m noisy linear projections as y = Ax + z. A CS recovery (reconstruction) algorithm
estimates the signal x from measurements y, while having access to the sensing matrix
A and knowing the set Q.
While the original idea of CS was developed for sparse or approximately sparse
signals, the results were soon extended to other types of structure, such as block-
sparsity and low-rankness as well. (See [3–26] for some examples of this line of work.)
Despite such extensions, for one-dimensional signals, and to a great extent for higher-
dimensional signals such as images and video files, the main focus has still remained on
sparsity and its extensions. For two-dimensional signals, in addition to sparse signals,
there has also been extensive study of low-rank matrices.
The main reasons for such a focus on sparsity have been two-fold. First, sparsity is
a relevant structure that shows up in many signals of interest. For instance, images are
known to have (approximately) sparse wavelet representations. In most applications of
CS in wireless communications, the desired coefficients that are to be estimated are
sparse. The second reason for focusing on sparsity has been theoretical results that show
that the ℓ₀-"norm" can be replaced with the ℓ₁-norm. The ℓ₀-"norm" of a signal x ∈ Rn
counts the number of non-zero elements of x, and hence serves as a measure of its
sparsity. For solving y = Ax, when x is known to be sparse, a natural optimization is the
following:

    min ‖u‖₀   s.t.   Au = y,                                           (3.2)

where u ∈ Rn. This is an NP-hard combinatorial optimization. However, replacing the
ℓ₀-"norm" with the ℓ₁-norm leads to a tractable convex optimization problem.
Since the inception of the idea of CS there has been a large body of work on develop-
ing practical, efficient, and robust CS algorithms that are able to recover sparse signals
from their under-sampled measurements. Such algorithms have found compelling appli-
cations for instance in MRI machines and other computational imaging systems. Since
most focus in CS has been on sparsity and its extensions, naturally, most reconstruction
algorithms are also designed for recovering sparse signals from their noisy linear pro-
jections. For instance, widely used optimizations and algorithms such as basis pursuit
[27], the least absolute shrinkage and selection operator (LASSO) [28], fast iterative
shrinkage-thresholding algorithm (FISTA) [29], orthogonal matching pursuit (OMP)
[30], the iterative thresholding method of [31], least-angle regression (LARS) [32], iter-
ative hard thresholding (IHT) [33], CoSaMP [34], approximate message passing (AMP)
[35], and subspace pursuit [36] have been developed for recovering sparse signals.
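
As a concrete illustration (not taken from the chapter), the following minimal numpy sketch implements one such sparse-recovery scheme, iterative soft thresholding (ISTA) for the LASSO; the step size, regularization weight, and iteration count are illustrative choices and the function name is ours.

    import numpy as np

    def ista(A, y, lam=0.1, n_iter=200):
        # Iterative soft thresholding for min_x 0.5*||y - Ax||_2^2 + lam*||x||_1.
        m, n = A.shape
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
        x = np.zeros(n)
        for _ in range(n_iter):
            grad = A.T @ (A @ x - y)           # gradient of the quadratic term
            z = x - grad / L                   # gradient step
            x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
        return x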

3.2 Compression-Based Compressed Sensing

While sparsity is an important and attractive structure that is present in many signals of
interest, most such signals, including images and video files, in addition to exhibiting
sparsity, follow structures that are beyond sparsity. A CS recovery algorithm that takes
advantage of the full structure that is present in a signal, potentially, outperforms those
that merely rely on a simple structure. In other words, such an algorithm would poten-
tially require fewer measurements, or, for an equal number of measurements, present a
better reconstruction quality.
To have high-performance CS recovery methods, it is essential to design algorithms
that take advantage of the full structure that is present in a signal. To address this issue,
researchers have proposed different potential solutions. One line of work has been based
on using already-existing denoising algorithms to develop denoising-based CS recovery
methods. Denoising, especially image denoising, compared with CS is a very well-
established research area. In fact, denoising-based recovery methods [37] show very
good performance in practice, as they take advantage of more complex structures. Most
other directions, such as those based on universal CS [38–40], or those based on learning
general structure [41], while interesting from a theoretical viewpoint, have yet to yield
efficient practical algorithms.
In this chapter, we focus on another approach for addressing this issue, which we
refer to as compression-based CS. As we discuss later in this chapter, compression-
based CS leads to theoretically tractable efficient algorithms that achieve state-of-the-art
performance in some applications.
Data compression is a crucial data-processing task, which is important from the per-
spectives of both data communication and data storage. For instance, modern image and
video compression codes play a key role in current wired and wireless communication
systems, as they substantially reduce the size of such files. A data-compression code
provides an encoding and a decoding mechanism that enables storage of data using as
few bits as possible. For image, audio, and video files, compared with CS, data compres-
sion is a very well-established research area with, in some cases, more than 50 years of
active research devoted to developing efficient codes.
As with CS, the success of data-compression codes hinges upon their ability to take
advantage of structures that are present in signals. Comparing the set of structures used
by data-compression codes and those employed by CS algorithms reveals that, espe-
cially for images and video files, the class of structures used by compression codes is
much richer and much more complex. Given this discrepancy in the scope of structures
used in the two cases, and the shortcomings of existing recovery methods highlighted
earlier, the following question arises: is it possible to use existing compression codes as
part of a CS recovery algorithm to have “compression-based CS” recovery algorithms?
The motivation for such a method is to provide a shortcut to the usual path of design-
ing recovery methods, which consists of the following two rather challenging steps: (i)
(explicitly) specifying the structure that is present in a class of signals, and (ii) finding
an efficient and robust algorithm that, given y = Ax + z, finds x̂ that is consistent with
both the measurements and the discovered structure. Given this motivation, in the rest
of this chapter, we try to address the following questions.
question 1 Can compression codes be used as a tool to design CS recovery algorithms?
question 2 If compression-based CS is possible, how well do such recovery algorithms
perform, in terms of their required sampling rate, robustness to noise, and reconstruction
quality?
question 3 Are there efficient compression-based CS algorithms with low compu-
tational complexity that employ off-the-shelf compression codes, such as JPEG or
JPEG2000 for images and MPEG for videos?
Throughout this chapter, we mainly focus on deterministic signal models. However,
at the end, we describe how most of the results can be extended to stochastic processes
as well. We also discuss the implications of such generalizations. Before proceeding
to the main argument, in the following section we review some basic definitions and
notation used throughout this chapter.

3.3 Definitions

3.3.1 Notation
For x ∈ R, δ_x denotes the Dirac measure with an atom at x. Consider x ∈ R and b ∈ N+.
Every real number can be written as x = ⌊x⌋ + x_q, where ⌊x⌋ denotes the largest integer
smaller than or equal to x and x_q = x − ⌊x⌋. Since x − 1 < ⌊x⌋ ≤ x, x_q ∈ [0, 1). Let 0.a₁a₂ . . .
denote the binary expansion of x_q. That is,

    x_q = Σ_{i=1}^{∞} a_i 2^{−i}.

Then, define the b-bit quantized version of x as

    [x]_b = ⌊x⌋ + Σ_{i=1}^{b} a_i 2^{−i}.

Using this definition,

    x − 2^{−b} < [x]_b ≤ x.

For a vector x^n ∈ Rn, let [x^n]_b = ([x₁]_b, . . . , [x_n]_b).
Given two vectors x and y, both in Rn, ⟨x, y⟩ = Σ_{i=1}^{n} x_i y_i denotes their inner product.
Throughout this chapter, ln and log denote the natural logarithm and logarithm to base
2, respectively.
Sets are denoted by calligraphic letters and the size of set A is denoted by |A|. The
ℓ₀-"norm" of x ∈ Rn is defined as ‖x‖₀ = |{i : x_i ≠ 0}|.
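
For concreteness, a minimal numpy sketch of the b-bit quantization map [x]_b defined above (the function name is ours, not the chapter's):

    import numpy as np

    def quantize(x, b):
        # [x]_b = floor(x) plus the first b bits of the binary expansion of the
        # fractional part, i.e., the fractional part rounded down to a multiple of 2^{-b}.
        return np.floor(x) + np.floor((x - np.floor(x)) * 2**b) / 2**b

    x = np.array([3.14159, -2.71828])
    print(quantize(x, 8))   # satisfies x - 2^{-b} < [x]_b <= x elementwise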

3.3.2 Compression
Data compression is about efficiently storing an acquired signal, and hence is a step that
happens after the data have been collected. The main goal of data-compression codes is
to take advantage of the structure of the data and represent it as efficiently as possible, by
minimizing the required number of bits. Data-compression algorithms are either lossy
or lossless. In this part, we briefly review the definition of a lossy compression scheme
for real-valued signals in Rn . (Note that lossless compression of real-valued signals is
not feasible.)
Consider Q, a compact subset of Rn . A fixed-rate compression code for set Q is
defined via encoding and decoding mappings (E, D), where
E : Rn → {1, . . . , 2r }
and
D : {1, . . . , 2r } → Rn .

Figure 3.2 Encoder maps signal x to r bits, E(x), and, later, the decoder maps the coded bits to x̂.

Here r denotes the rate of the code. The codebook of such a code is defined as

    C = { D(E(x)) : x ∈ Q }.

Note that C is always a finite set with at most 2^r distinct members. Fig. 3.2 shows a
block diagram presentation of the described compression code. The compression code
defined by (E, D) encodes signal x ∈ Q into r bits as E(x), and decodes the encoded bits
to x̂ = D(E(x)). The performance of this code is measured in terms of its rate r and its
induced distortion δ defined as

    δ = sup_{x∈Q} ‖x − x̂‖₂.

Consider a family of compression codes {(Er, Dr) : r > 0} for set Q indexed by their
rate r. The (deterministic) distortion-rate function of this family of compression codes
is defined as

    δ(r) = sup_{x∈Q} ‖x − Dr(Er(x))‖₂.

In other words, δ(r) denotes the distortion of the code operating at rate r. The
corresponding rate-distortion function of this family of compression codes is defined as

    r(δ) = inf{ r : δ(r) ≤ δ }.

Finally, the α-dimension of a family of compression codes {(Er, Dr) : r > 0} is defined
as [42]

    α = lim sup_{δ→0} r(δ)/log(1/δ).                                    (3.3)
This dimension, as shown later, serves as a measure of structuredness and plays a key
role in understanding the performance of compression-based CS schemes.
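
As a toy illustration of these definitions (not from the chapter), the sketch below builds a fixed-rate compression code for the unstructured set Q = [0, 1]^n by uniform scalar quantization with b bits per coordinate, so that r = nb; the function names and the choice of b are ours.

    import numpy as np

    def encode(x, b):
        # Map each coordinate of x in [0, 1] to a b-bit index; total rate r = n*b bits.
        return np.minimum((x * 2**b).astype(int), 2**b - 1)

    def decode(idx, b):
        # Reproduce each coordinate by the midpoint of its quantization cell.
        return (idx + 0.5) / 2**b

    # Each coordinate errs by at most 2^{-(b+1)}, so delta(r) <= sqrt(n) 2^{-(b+1)} and
    # the alpha-dimension of this family is n, reflecting the absence of structure in Q.
    n, b = 16, 8
    x = np.random.rand(n)
    x_hat = decode(encode(x, b), b)
    print(np.linalg.norm(x - x_hat))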

3.4 Compressible Signal Pursuit

Consider the standard problem of CS, which is recovering a "structured" signal x ∈ Rn
from an under-sampled set of linear measurements y = Ax + z, where y ∈ Rm , and m
is (typically) much smaller than n. The CS theory shows that for various classes of
structured signals solving this problem is possible. For different types of structure such
as sparsity and low-rankness, it also provides various theoretically analyzable efficient
and robust recovery algorithms. However, there is a considerable gap between structures
covered by CS theory and those existing in actual signals of interest such as images and
videos. Compression-based CS potentially provides a shortcut to generalizing the class
of structures employed in CS without much additional effort. In this section, we review
compressible signal pursuit (CSP), as the first attempt at designing compression-based
CS recovery schemes [42].
Consider a class of signals represented by a compact set Q ⊂ Rn . Further, assume that
there exists a family of compression algorithms for set Q, denoted by {(Er , Dr ) : r > 0}.
Let r(δ) denote the deterministic rate-distortion function of this family of codes, as
defined in Section 3.3.2. As an example, consider the class of natural images and the
JPEG compression code [43] used at different rates. In general, a compression code
might take advantage of signals’ sparsity in a certain transform domain such as the
wavelet domain, or a very different type of structure. Instead of being interested in
structures and mechanisms used by the compression code, we are interested in recov-
ering x ∈ Q from the under-sampled measurements y = Ax + z, using compression
algorithms {(Er , Dr ) : r > 0}.
CSP is a compression-based CS recovery approach that recovers x from y = Ax + z,
by having access to the sensing matrix A and a rate-r compression code characterized
by mappings (Er , Dr ) and a corresponding codebook Cr . CSP operates according to
Occam’s principle and outputs a compressible signal (a signal in the codebook of the
code) that best matches the measurements. More precisely, CSP estimates x by solving
the following optimization:

    x̂ = arg min_{c∈Cr} ‖y − Ac‖₂².                                     (3.4)
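
Conceptually, CSP is an exhaustive search over the codebook. A brute-force sketch (only workable for tiny codebooks, and assuming the codebook is available as an array of codewords) is:

    import numpy as np

    def csp(y, A, codebook):
        # codebook: array of shape (num_codewords, n); return the codeword
        # minimizing ||y - A c||_2^2, i.e., the CSP estimate in (3.4).
        residuals = np.linalg.norm(y[None, :] - codebook @ A.T, axis=1)
        return codebook[np.argmin(residuals)]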

The following theorem considers the noise-free regime where z = 0, and connects
the number of measurements m and the reconstruction quality of CSP, ‖x̂ − x‖₂, to the
properties of the compression code, i.e., its rate and its distortion.
theorem 3.1 Consider compact set Q ⊂ Rn with a rate-r compression code (Er , Dr )
operating at distortion δ. Let A ∈ Rm×n , where Ai, j are independently and identically
distributed (i.i.d.) as N(0, 1). For x ∈ Q, let x̂ denote the reconstruction of x from y = Ax,
generated by the CSP optimization employing code (Er, Dr). Then,

    ‖x̂ − x‖₂ ≤ δ √( (1 + τ₁)/(1 − τ₂) ),

with probability at least

    1 − 2^r e^{(m/2)(τ₂ + log(1−τ₂))} − e^{−(m/2)(τ₁ − log(1+τ₁))},

where τ1 > 0 and τ2 ∈ (0, 1) are arbitrary.

Proof Let x̃ = Dr(Er(x)) and x̂ = arg min_{c∈Cr} ‖y − Ac‖₂². Since x̂ is the minimizer of
‖y − Ac‖₂² over all c ∈ Cr, and since x̃ ∈ Cr, it follows that ‖y − Ax̂‖₂ ≤ ‖y − Ax̃‖₂. That is,

    ‖Ax − Ax̂‖₂ ≤ ‖Ax − Ax̃‖₂.                                          (3.5)

Given x, define set U to denote the set of all possible normalized error vectors as

    U = { (x − c)/‖x − c‖₂ : c ∈ Cr }.

Note that since |Cr| ≤ 2^r and x is fixed, |U| ≤ 2^r as well.
For τ₁ > 0 and τ₂ ∈ (0, 1), define events E₁ and E₂ as

    E₁ ≜ { ‖A(x − x̃)‖₂² ≤ m(1 + τ₁)‖x − x̃‖₂² }

and

    E₂ ≜ { ‖Au‖₂² ≥ m(1 − τ₂) : ∀ u ∈ U },

respectively.
For a fixed u ∈ U, since the entries of A are i.i.d. N(0, 1), Au is a vector of m
i.i.d. standard normal random variables. Hence, by Lemma 2 in [39],

    P( ‖Au‖₂² ≥ m(1 + τ₁) ) ≤ e^{−(m/2)(τ₁ − log(1+τ₁))}               (3.6)

and

    P( ‖Au‖₂² ≤ m(1 − τ₂) ) ≤ e^{(m/2)(τ₂ + log(1−τ₂))}.               (3.7)

Since x̃ ∈ Cr,

    (x − x̃)/‖x − x̃‖₂ ∈ U

and, from (3.6),

    P(E₁^c) ≤ e^{−(m/2)(τ₁ − log(1+τ₁))}.

Moreover, since |U| ≤ 2^r, by the union bound, we have

    P(E₂^c) ≤ 2^r e^{(m/2)(τ₂ + log(1−τ₂))}.                           (3.8)

Hence, by the union bound,

    P(E₁ ∩ E₂) = 1 − P(E₁^c ∪ E₂^c)
               ≥ 1 − 2^r e^{(m/2)(τ₂ + log(1−τ₂))} − e^{−(m/2)(τ₁ − log(1+τ₁))}.

By definition, x̃ is the reconstruction of x using code (Er, Dr). Hence, ‖x − x̃‖₂ ≤ δ,
and, conditioned on E₁,

    ‖y − Ax̃‖₂ ≤ δ √(m(1 + τ₁)).                                        (3.9)

On the other hand, conditioned on E₂,

    ‖y − Ax̂‖₂ ≥ ‖x − x̂‖₂ √(m(1 − τ₂)).                                 (3.10)

Combining (3.9) and (3.10), conditioned on E₁ ∩ E₂, we have

    ‖x − x̂‖₂ ≤ δ √( (1 + τ₁)/(1 − τ₂) ).

To understand the implications of Theorem 3.1, the following corollary considers a
specific choice of parameters τ1 and τ2 and relates the required number of measurements
by the CSP algorithm to the α-dimension of the compression code.
corollary 3.1 (Corollary 1 in [42]) Consider a family of compression codes (Er , Dr )
for set Q with corresponding codebook Cr and rate-distortion function r(δ). Let A ∈
Rm×n, where Ai,j are i.i.d. N(0, 1). For x ∈ Q and y = Ax, let x̂ denote the solution of
(3.4). Given ν > 0 and η > 1, such that η/log(1/eδ) < ν, let

    m = ηr/log(1/eδ).

Then,

    P( ‖x − x̂‖₂ ≥ θδ^{1−(1+ν)/η} ) ≤ e^{−0.8m} + e^{−0.3νr},

where θ = 2e^{−(1+ν)/η}.
Proof In Theorem 3.1, let τ₁ = 3 and τ₂ = 1 − (eδ)^{2(1+ν)/η}. For τ₁ = 3, 0.5(τ₁ −
log(1 + τ₁)) > 0.8. For τ₂ = 1 − (eδ)^{2(1+ν)/η},

    r ln 2 + (m/2)(τ₂ + ln(1 − τ₂))
        = r ln 2 + (m/2)( 1 − (eδ)^{2(1+ν)/η} + (2(1 + ν)/η) ln(eδ) )
        ≤ r ln 2 + ηr/(2 log(1/eδ)) − (1 + ν) r ln 2
      (a)
        ≤ r ln 2 ( 1 + ν/2 − (1 + ν) )
        ≤ −0.3νr,                                                      (3.11)

where (a) is due to the fact that

    η/log(1/eδ) = η ln 2/ln(1/eδ) < ν/2.

Finally,

    δ √( (1 + τ₁)/(1 − τ₂) ) = δ √( 4/(eδ)^{2(1+ν)/η} )
                             = θδ^{1−(1+ν)/η},                         (3.12)

where θ = 2e^{−(1+ν)/η}.
Corollary 3.1 implies that, using a family of compression codes {(Er, Dr) : r > 0},
as r → ∞ and δ → 0, the achieved reconstruction error converges to zero (as
lim_{δ→0} θδ^{1−(1+ν)/η} = 0), while the number of measurements converges to ηα, where α
is the α-dimension of the compression algorithms and η > 1 is a free parameter. In other
words, as long as the number of measurements m is larger than α, using an appropriate
compression code, CSP recovers x.

Theorem 3.1 and its corollary both ignore the effect of noise and consider noise-free
measurements. In practice, noise is always present. So it is important to understand the
effect of noise on the performance. For instance, the following theorem characterizes the
effect of a deterministic noise with a bounded ℓ₂-norm on the performance of CSP.
theorem 3.2 (Theorem 2 in [42]) Consider compression code (E, D) operating at rate
r and distortion δ on set Q ⊂ Rn. For x ∈ Q, and y = Ax + z with ‖z‖₂ ≤ ζ, let x̂ denote
the reconstruction of x from y offered by the CSP optimization employing code (E, D).
Then,

    ‖x̂ − x‖₂ ≤ δ √( (1 + τ₁)/(1 − τ₂) ) + 2ζ/√( (1 − τ₂) d ),

with probability exceeding

    1 − 2^r e^{(d/2)(τ₂ + log(1−τ₂))} − e^{−(d/2)(τ₁ − log(1+τ₁))},
where τ1 > 0 and τ2 ∈ (0, 1) are free parameters.
CSP optimization is a discrete optimization that minimizes a convex cost function
over a discrete set of exponential size. Hence, solving CSP in its original form is com-
putationally prohibitive. In the next section, we study this critical issue and review an
efficient algorithm with low computational complexity that is designed to approximate
the solution of the CSP optimization.

3.5 Compression-Based Gradient Descent (C-GD)

As discussed in the previous section, CSP is based on an exhaustive search over expo-
nentially many codewords and as a result is computationally infeasible. Compression-
based gradient descent (C-GD) is a computationally efficient and theoretically
analyzable approach to approximating the solution of CSP. The C-GD algorithm
[44, 45], inspired by the projected gradient descent (PGD) algorithm [46], works as
follows. Start from some x0 ∈ Rn . For t = 1, 2, . . ., proceed as follows:
    s^{t+1} = x^t + η A^T (y − Ax^t),                                  (3.13)

and

    x^{t+1} = P_{Cr}(s^{t+1}),                                         (3.14)

where P_{Cr}(·) denotes projection onto the set of codewords. In other words, for x ∈ Rn,

    P_{Cr}(x) = arg min_{c∈Cr} ‖x − c‖₂.                               (3.15)

Here index t denotes the iteration number and η ∈ R denotes the step size. Each iteration
of this algorithm involves performing two operations. The first step is moving in the
direction of the negative of the gradient of ‖y − Ax‖₂² with respect to x to find solutions
that are closer to the y = Ax hyperplane. The second step, i.e., the projection step, ensures
that the estimate C-GD obtains belongs to the codebook and hence conforms with the
source structure. The following theorem characterizes the convergence performance of
the described C-GD algorithm.
theorem 3.3 (Theorem 2 in [45]) Consider x ∈ Rn . Let y = Ax + z and assume that
the entries of the sensing matrix A are i.i.d. N(0, 1) and that zi , i = 1, . . . , m, are
i.i.d. N(0, σ_z²). Let

    η = 1/m

and define x̃ = P_{Cr}(x), where P_{Cr}(·) is defined in (3.15). Given ε > 0, for m ≥ 80r(1 + ε),
with a probability larger than 1 − e^{−m/2} − 2^{−40r} − 2^{−2r+0.5} − e^{−0.15m}, we have

    ‖x^{t+1} − x̃‖₂ ≤ 0.9 ‖x^t − x̃‖₂ + 2( 2 + √(n/m) )² δ + σ_z √( 32(1 + ε)r/m ),        (3.16)

for t = 0, 1, 2, . . ..
As will be clear in the proof, the choice of 0.9 in Theorem 3.3 is arbitrary, and the
result could be derived for any positive value strictly smaller than one. We present the
result for this choice as it clarifies the statement of the result and its proof.
As stated earlier, each iteration of the C-GD involves two steps: (i) moving in the
direction of the gradient and (ii) projection of the result onto the set of codewords of
the compression code. The first step is straightforward and requires two matrix–vector
multiplications. For the second step, optimally solving (3.15) might be challenging.
However, for any “good” compression code, it is reasonable to assume that employ-
ing the code’s encoder and decoder consecutively well approximates this operation.
In fact, it can be proved that, if ‖P_{Cr}(x) − Dr(Er(x))‖₂ is smaller than ε for all x, then
replacing P_{Cr}(·) with Dr(Er(·)) only results in an additive error of ε in (3.16). (Refer
to Theorem 3 in [45].) Under this simplification, the algorithm's two steps can be
summarized as

    x^{t+1} = Dr(Er(x^t + η A^T (y − Ax^t))).
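
A minimal numpy sketch of this simplified iteration follows; the compression code's encoder/decoder is passed in as a black box (compress_decompress is our placeholder for Dr(Er(·))), and the default step size and iteration count are illustrative choices rather than the chapter's prescription.

    import numpy as np

    def c_gd(y, A, compress_decompress, eta=None, n_iter=50):
        # Compression-based (projected) gradient descent:
        #   s^{t+1} = x^t + eta * A^T (y - A x^t)
        #   x^{t+1} = D_r(E_r(s^{t+1}))   (approximate projection onto the codebook)
        m, n = A.shape
        if eta is None:
            eta = 1.0 / m                  # step size used in Theorem 3.3
        x = np.zeros(n)
        for _ in range(n_iter):
            s = x + eta * A.T @ (y - A @ x)
            x = compress_decompress(s)
        return x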

Proof By definition, x^{t+1} = P_{Cr}(s^{t+1}) and x̃ = P_{Cr}(x). Hence, since x^{t+1} is the closest
vector in Cr to s^{t+1} and x̃ is also in Cr, we have ‖x^{t+1} − s^{t+1}‖₂² ≤ ‖x̃ − s^{t+1}‖₂². Equivalently,
‖(x^{t+1} − x̃) − (s^{t+1} − x̃)‖₂² ≤ ‖x̃ − s^{t+1}‖₂², or

    ‖x^{t+1} − x̃‖₂² ≤ 2⟨x^{t+1} − x̃, s^{t+1} − x̃⟩.                    (3.17)

For t = 0, 1, . . ., define the error vector and its normalized version as

    θ^t ≜ x^t − x̃

and

    θ̄^t ≜ θ^t / ‖θ^t‖₂,

respectively.
Also, given θ^t ∈ Rn, θ^{t+1} ∈ Rn, η ∈ R+, and A ∈ Rm×n, define the coefficient μ as

    μ(θ^{t+1}, θ^t, η) ≜ ⟨θ̄^{t+1}, θ̄^t⟩ − η⟨Aθ̄^{t+1}, Aθ̄^t⟩.

Using these definitions and substituting for s^{t+1} and y = Ax + z, it follows from (3.17) that

    ‖θ^{t+1}‖₂² ≤ 2⟨x^{t+1} − x̃, x^t + η A^T (Ax + z − Ax^t) − x̃⟩
               = 2⟨x^{t+1} − x̃, x^t − x̃⟩ + 2η⟨x^{t+1} − x̃, A^T A(x − x^t)⟩ + 2η⟨x^{t+1} − x̃, A^T z⟩
               = 2⟨θ^{t+1}, θ^t⟩ − 2η⟨Aθ^{t+1}, Aθ^t⟩ + 2η⟨Aθ^{t+1}, A(x − x̃)⟩ + 2η⟨θ^{t+1}, A^T z⟩.   (3.18)

Hence, dividing both sides of (3.18) by ‖θ^{t+1}‖₂, we have

    ‖θ^{t+1}‖₂ ≤ 2( ⟨θ̄^{t+1}, θ̄^t⟩ − η⟨Aθ̄^{t+1}, Aθ̄^t⟩ ) ‖θ^t‖₂ + 2η(σmax(A))² ‖x − x̃‖₂
              + 2η⟨θ̄^{t+1}, A^T z⟩.                                    (3.19)
In the following we bound the three main random terms on the right-hand side of (3.19),
namely, ⟨θ̄^{t+1}, θ̄^t⟩ − η⟨Aθ̄^{t+1}, Aθ̄^t⟩, (σmax(A))², and ⟨θ̄^{t+1}, A^T z⟩.
(i) Bounding ⟨θ̄^{t+1}, θ̄^t⟩ − η⟨Aθ̄^{t+1}, Aθ̄^t⟩. We prove that with high probability this
    term is smaller than 0.45. Let F denote the set of all possible normalized error
    vectors, defined as

        F ≜ { (x̃₁ − x̃₂)/‖x̃₁ − x̃₂‖₂ : ∀ x̃₁, x̃₂ ∈ Cr, x̃₁ ≠ x̃₂ }.       (3.20)

    Now define event E₁ as

        E₁ ≜ { ⟨u, v⟩ − η⟨Au, Av⟩ < 0.45 : ∀ u, v ∈ F }.               (3.21)

    From Lemma 6 in [41], given u, v ∈ F, we have

        P( ⟨u, v⟩ − η⟨Au, Av⟩ ≥ 0.45 ) ≤ 2^{−m/20}.                    (3.22)

    Hence, by the union bound,

        P(E₁^c) ≤ |F|² 2^{−m/20}.                                      (3.23)

    Note that by construction |F| ≤ |Cr|² ≤ 2^{2r}. Therefore,

        P(E₁) ≥ 1 − 2^{4r − 0.05m}.

    By the theorem assumption, m ≥ 80r(1 + ε), where ε > 0. Therefore, with
    probability at least 1 − 2^{−40r}, event E₁ happens.
(ii) Bounding (σmax(A))². Define event E₂ as

        E₂ ≜ { σmax(A) ≤ 2√m + √n }.

    As proved in [47], we have

        P(E₂^c) ≤ e^{−m/2}.

(iii) Bounding ⟨θ̄^{t+1}, A^T z⟩. Note that

        ⟨θ̄^{t+1}, A^T z⟩ = ⟨Aθ̄^{t+1}, z⟩.
    Let A_i^T denote the ith row of random matrix A. Then,

        Aθ̄^{t+1} = [ ⟨A₁, θ̄^{t+1}⟩, . . . , ⟨A_m, θ̄^{t+1}⟩ ]^T.

    But, for fixed θ̄^{t+1}, the ⟨A_i, θ̄^{t+1}⟩ are i.i.d. N(0, 1) random variables. Hence,
    from Lemma 3 in [39], ⟨θ̄^{t+1}, A^T z⟩ has the same distribution as ‖z‖₂ ⟨θ̄^{t+1}, g⟩, where g =
    [g₁, . . . , g_n]^T is independent of ‖z‖₂ and g_i ~ i.i.d. N(0, 1). Given τ₁ > 0 and τ₂ > 0,
    define events E₃ and E₄ as follows:

        E₃ ≜ { ‖z‖₂² ≤ (1 + τ₁) m σ_z² }

    and

        E₄ ≜ { ⟨θ, g⟩² ≤ 1 + τ₂, ∀ θ ∈ F }.

    From Lemma 2 in [39], we have

        P(E₃^c) ≤ e^{−(m/2)(τ₁ − ln(1+τ₁))},                           (3.24)

    and by the same lemma, for fixed θ̄^{t+1}, we have

        P( ⟨θ̄^{t+1}, g⟩² ≥ 1 + τ₂ ) ≤ e^{−(1/2)(τ₂ − ln(1+τ₂))}.       (3.25)

    Hence, by the union bound,

        P(E₄^c) ≤ |F| e^{−(1/2)(τ₂ − ln(1+τ₂))}
                ≤ 2^{2r} e^{−(1/2)(τ₂ − ln(1+τ₂))}
                ≤ 2^{2r − τ₂/2},                                       (3.26)

    where the last inequality holds for all τ₂ > 7. Setting τ₂ = 4(1 + ε)r − 1, where
    ε > 0, ensures that P(E₄^c) ≤ 2^{−2r+0.5}. Setting τ₁ = 1,

        P(E₃^c) ≤ e^{−0.15m}.                                          (3.27)
Now using the derived bounds and noting that ‖x − x̃‖₂ ≤ δ, conditioned on E₁ ∩ E₂ ∩
E₃ ∩ E₄, we have

    2( ⟨θ̄^{t+1}, θ̄^t⟩ − η⟨Aθ̄^{t+1}, Aθ̄^t⟩ ) ≤ 0.9,                   (3.28)

    (2/m)(σmax(A))² ‖x − x̃‖₂ ≤ (2/m)( 2√m + √n )² δ = 2( 2 + √(n/m) )² δ,   (3.29)

and

    2η⟨θ̄^{t+1}, A^T z⟩ = (2/m) ⟨θ̄^{t+1}, A^T z⟩
                       ≤ (2/m) √( σ_z² (1 + τ₁) m (1 + τ₂) )
                       = (2σ_z/m) √( 8m(1 + ε)r )
                       = σ_z √( 32(1 + ε)r/m ).                        (3.30)

Hence, combining (3.28), (3.29), and (3.30) yields the desired error bound. Finally, note
that, by the union bound,

    P(E₁ ∩ E₂ ∩ E₃ ∩ E₄) ≥ 1 − Σ_{i=1}^{4} P(E_i^c)
                         ≥ 1 − e^{−m/2} − 2^{−40r} − 2^{−2r+0.5} − e^{−0.15m}.

3.6 Stylized Applications

In this section, we consider three well-studied classes of signals, namely, (i) sparse sig-
nals, (ii) piecewise polynomials, and (iii) natural images. For each class of signal, we
explore the implications of the main results discussed so far for these specific classes of
signals. For the first two classes, we consider simple families of compression codes to
study the performance of the compression-based CS methods. For images, on the other
hand, we use standard compression codes such as JPEG and JPEG2000. These examples
enable us to shed light on different aspects of the CSP optimization and the C-GD algo-
rithm, such as (i) their required number of measurements, (ii) the reconstruction error
in a noiseless setting, (iii) the reconstruction error in the presence of noise, and (iv) the
convergence rate of the C-GD algorithm.

3.6.1 Sparse Signals and Connections to IHT


The focus of this chapter has been on moving beyond simple structures such as sparsity
in the context of CS. However, to better understand the performance of compression-
based CS algorithms and to compare them with other results in the literature, as our first
application, we consider the standard class of sparse signals and study the performance
of CSP and C-GD for this special class of signals.
Let B_2^n(ρ) and Γ_k^n(ρ) denote the ball of radius ρ in Rn and the set of all k-sparse signals
in B_2^n(ρ), respectively. That is,

    B_2^n(ρ) = { x ∈ Rn : ‖x‖₂ ≤ ρ },                                  (3.31)

and

    Γ_k^n(ρ) = { x ∈ B_2^n(ρ) : ‖x‖₀ ≤ k }.                            (3.32)

Consider the following compression code for signals in Γ_k^n(ρ), and take x ∈ Γ_k^n(ρ).
By definition, x contains at most k non-zero entries. The encoder encodes x by first
describing the locations of the k non-zero entries and then the values of those non-zero
entries, each b-bit quantized. To encode x ∈ Γnk (ρ), the described code spends at most
r bits, where

    r ≤ k⌈log n⌉ + k(b + ⌈log ρ⌉)
      ≤ k(b + log ρ + log n + 2).                                      (3.33)
The resulting distortion δ of this compression code can be bounded as

    δ = sup_{x∈Γ_k^n(ρ)} ‖x − x̃‖₂
      = sup_{x∈Γ_k^n(ρ)} √( Σ_{i: x_i ≠ 0} (x_i − x̃_i)² )
      ≤ √( Σ_{i: x_i ≠ 0} 2^{−2b} )
      = 2^{−b} √k.                                                     (3.34)
The α-dimension of this compression code can be bounded as

    α = lim_{δ→0} r(δ)/log(1/δ) ≤ lim_{b→∞} k(b + log ρ + log n + 2)/(b − log k) = k.   (3.35)
It can in fact be shown that the α-dimension is equal to k.
Consider using this specific compression algorithm in the C-GD framework. The
resulting algorithm is very similar to the well-known iterative hard thresholding (IHT)
algorithm [33]. The IHT algorithm, a CS recovery algorithm for sparse signals, is an
iterative algorithm. Its first step, as with C-GD, is moving in the opposite direction of
the gradient of the cost function. At the projection step, IHT keeps the k largest elements
and sets the rest to zero. The C-GD algorithm, on the other hand, after moving in the
opposite direction to the gradient, finds the codeword in the described code that is clos-
est to the result. For the special code described earlier, this is equivalent to first finding
the k largest entries and setting the rest to zero. Then, each remaining non-zero entry xi
is first clipped between [−1, 1] as
    x_i 1_{x_i ∈ (−1,1)} + sign(x_i) 1_{|x_i| ≥ 1},
where 1A is an indicator of event A, and then quantized to b + 1 bits.
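
A minimal numpy sketch of this projection step (keep the k largest entries, clip, quantize) follows; the function name and the uniform-quantizer implementation are ours, intended only to illustrate the described operation.

    import numpy as np

    def project_sparse_codebook(s, k, b):
        # Approximate projection used by C-GD with the sparse-signal code of this
        # section: keep the k largest-magnitude entries, clip them to [-1, 1], and
        # quantize each surviving entry with b + 1 bits (resolution 2^{-b} on [-1, 1]).
        x = np.zeros_like(s)
        support = np.argsort(np.abs(s))[-k:]       # indices of the k largest entries
        vals = np.clip(s[support], -1.0, 1.0)      # clip to [-1, 1]
        step = 2.0 ** (-b)                         # quantizer resolution
        x[support] = np.floor(vals / step) * step  # uniform quantization
        return x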
Consider x ∈ Γ_k^n(1) and let y = Ax + z, where A_{i,j} ~ i.i.d. N(0, 1/n) and z_i ~ i.i.d. N(0, σ_z²). Let
x̃ denote the projection of x, as described earlier. The following corollary of Theorem 3.3
characterizes the convergence performance of C-GD applied to y when using this code.
corollary 3.2 (Corollary 2 of [45]) Given γ > 0, set b = γ log n + (1/2) log k bits. Also,
set η = n/m. Then, given ε > 0, for m ≥ 80r(1 + ε), where r = (1 + γ)k log n + (k/2) log k + 2k,

    (1/√n) ‖x^{t+1} − x̃‖₂ ≤ (0.9/√n) ‖x^t − x̃‖₂ + 2( 2 + √(n/m) )² n^{−1/2−γ} + σ_z √( 8(1 + ε)r/m ),   (3.36)

for t = 0, 1, . . . , with probability larger than 1 − 2^{−2r}.

Proof Consider u ∈ [−1, 1]. Quantizing u by a uniform quantizer that uses b + 1 bits
yields û, which satisfies |u − û| < 2^{−b}. Therefore, using b + 1 bits to quantize each non-
zero element of x ∈ Γ_k^n yields a code which achieves distortion δ ≤ 2^{−b}√k. Hence, for
b + 1 = γ log n + (1/2) log k + 1,

    δ ≤ n^{−γ}.

On the other hand, the code rate r can be upper-bounded as

    r ≤ log( Σ_{i=0}^{k} (n choose i) ) + k(b + 1) ≤ log n^{k+1} + k(b + 1) = (k + 1) log n + k(b + 1),

where the last inequality holds for all n large enough. The rest of the proof follows
directly from inserting these numbers into the statement of Theorem 3.3.
In the noiseless setting, according to this corollary, if the number of measurements m
satisfies m = Ω(k log n), using (3.36), the final reconstruction error can be bounded as

    lim_{t→∞} (1/√n) ‖x^{t+1} − x̃‖₂ = O( ( 2 + √(n/m) )² n^{−1/2−γ} ),

or

    lim_{t→∞} (1/√n) ‖x^{t+1} − x̃‖₂ = O( n^{1/2−γ}/m ).
Hence, if γ > 0.5, the error vanishes as the ambient dimension of the signal space grows
to infinity. There are two key observations regarding the number of measurements used
by C-GD and CSP that one can make.
1. The α-dimension of the described compression code is equal to k. This implies
that, using slightly more than k measurements, CSP is able to almost accurately
recover the signal from its under-determined linear measurements. However, solv-
ing CSP involves an exhaustive search over all of the exponentially many code-
books. On the other hand, the computationally efficient C-GD method employs
k log n measurements, which is more than what is required by CSP by a factor
log n.
2. The results derived for C-GD are slightly weaker than those derived for IHT, as (i)
the C-GD algorithm’s reconstruction is not exact, even in the noiseless setting,
and (ii) C-GD requires O(k log n) measurements, compared with O(k log(n/k))
required by IHT.
In the case where the measurements are distorted by additive white Gaussian noise,
Corollary 3.2 states that there will be an additive

    O( σ_z √( k log n / m ) )
distortion term. While no similar result on the performance of the IHT algorithm in the
presence of stochastic noise exists, the noise sensitivity of C-GD is comparable to the
noise-sensitivity performance of other algorithms based on convex optimization, such as
LASSO and the Dantzig selector [48, 49].

3.6.2 Piecewise-Polynomial Functions


Consider the class of piecewise-polynomial functions p : [0, 1] → [0, 1], with at most Q
singularities,1 and maximum degree of N. Let Poly_N^Q denote the defined class of signals.
For p ∈ Poly_N^Q, let (x₁, x₂, . . . , x_n) be the samples of p at

    0, 1/n, . . . , (n − 1)/n.

Let N_ℓ and {a_i^ℓ}_{i=0}^{N_ℓ}, ℓ = 1, . . . , Q, denote the degree and the set of coefficients of the ℓth
polynomial in p. Furthermore, assume the following.
1. The coefficients of each polynomial belong to the interval [0, 1].
2. For every ℓ, Σ_{i=0}^{N_ℓ} a_i^ℓ < 1.
Define P, the class of signals derived from sampling piecewise-polynomial functions,
as follows:

    P ≜ { x ∈ Rn | x_i = p(i/n), p ∈ Poly_N^Q }.                       (3.37)
n
The described class of signals can be considered as a generalization of the class of
piecewise-constant functions, which is a popular signal model in imaging applications.
To apply C-GD to this class of signals, consider the following compression code for
signals in P. For x ∈ P, the encoder first describes the locations of the discontinuities
of its corresponding piecewise-polynomial function, and then, using a uniform quantizer
that spends b bits per coefficient, describes the quantized coefficients of the polynomials.
Using b bits per coefficient, the described code, in total, spends at most r bits, where
r ≤ (N + 1)(Q + 1)b + Q(log n + 1). (3.38)
Moreover, using b bits per coefficient, the distortion in approximating each point can be
bounded as

    | Σ_{i=0}^{N_ℓ} a_i^ℓ t^i − Σ_{i=0}^{N_ℓ} [a_i^ℓ]_b t^i | ≤ Σ_{i=0}^{N_ℓ} | a_i^ℓ − [a_i^ℓ]_b |
                                                              ≤ (N_ℓ + 1) 2^{−b}
                                                              ≤ (N + 1) 2^{−b}.        (3.39)
The C-GD algorithm combined with the described compression code provides a natu-
ral extension of IHT to piecewise-polynomial functions. At every iteration, the resulting
C-GD algorithm projects its current estimate of the signal to the space of quantized
piecewise-polynomial functions. As shown in Appendix B of [45], this projection step
can be done efficiently using dynamic programming.
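
The following rough sketch (not the routine of Appendix B in [45]) illustrates such a projection via segmented least squares solved by dynamic programming; the coefficient quantization and the [0, 1] constraints are omitted for brevity, and the function names are ours.

    import numpy as np

    def project_piecewise_poly(s, Q, N):
        # Project s (approximately) onto sampled piecewise polynomials with at most
        # Q singularities (Q + 1 pieces) of degree <= N, by dynamic programming.
        n = len(s)
        t = np.arange(n) / n

        def fit(i, j):
            # Least-squares fit of a single degree-<=N polynomial to s[i:j].
            deg = min(N, j - i - 1)
            coef = np.polyfit(t[i:j], s[i:j], deg)
            vals = np.polyval(coef, t[i:j])
            return np.sum((s[i:j] - vals) ** 2), vals

        # cost[q, j]: best squared error covering s[0:j] with q + 1 pieces.
        cost = np.full((Q + 1, n + 1), np.inf)
        choice = {}
        for j in range(1, n + 1):
            cost[0, j], choice[(0, j)] = fit(0, j)
        for q in range(1, Q + 1):
            for j in range(q + 1, n + 1):
                for i in range(q, j):
                    c, vals = fit(i, j)
                    if cost[q - 1, i] + c < cost[q, j]:
                        cost[q, j] = cost[q - 1, i] + c
                        choice[(q, j)] = (i, vals)
        # Back-track the best number of pieces and the corresponding segmentation.
        x = np.empty(n)
        q, j = int(np.argmin(cost[:, n])), n
        while q > 0:
            i, vals = choice[(q, j)]
            x[i:j] = vals
            q, j = q - 1, i
        x[0:j] = choice[(0, j)]
        return x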
Consider x ∈ P and y = Ax + z, where A_{i,j} ~ i.i.d. N(0, 1/n) and z_i ~ i.i.d. N(0, σ_z²). As with
Corollary 3.2, the following corollary characterizes the convergence performance of
C-GD combined with the described compression code.

1 A singularity is a point at which the function is not infinitely differentiable.



corollary 3.3 (Corollary 3 in [45]) Given γ > 0, let η = n/m and

    b = (γ + 0.5) log n + log(N + 1).

Set

    r̃ = ((γ + 0.5)(N + 1)(Q + 1) + Q) log n + (N + 1)(Q + 1)(log(N + 1) + 1) + 1.

Then, given ε > 0, for m ≥ 80r̃(1 + ε), for t = 0, 1, 2, . . ., we have

    (1/√n) ‖x^{t+1} − x̃‖₂ ≤ (0.9/√n) ‖x^t − x̃‖₂ + 2( 2 + √(n/m) )² n^{−0.5−γ} + σ_z √( 8(1 + ε)r̃/m ),   (3.40)

with a probability larger than 1 − 2^{−2r̃+1}.


Proof For b = (γ + 0.5) log n + log(N + 1), from (3.38),

    r ≤ ((γ + 0.5)(N + 1)(Q + 1) + Q) log n + (N + 1)(Q + 1)(log(N + 1) + 1) + 1 = r̃.

Also, from (3.39), the overall error can be bounded as

    δ ≤ √n (N + 1) 2^{−b},

and for b = (γ + 0.5) log n + log(N + 1),

    δ ≤ n^{−γ}.

Inserting these numbers in Theorem 3.3 yields the desired result.

Here are the takeaways from this corollary.


1. If we assume that n is much larger than N and Q, then the required number
of measurements is Ω((N + 1)(Q + 1) log n). Given that there are (N + 1)(Q + 1)
free parameters that describe x ∈ P, no algorithm is expected to require less than
(N + 1)(Q + 1) observations.
2. Using an argument similar to the one used in the previous example, in the absence
of measurement noise, the reconstruction error can be bounded as follows
⎛ 1 ⎞
1 t+1 ⎜⎜ n 2 −γ ⎟⎟⎟
lim √ x − x2 = O⎜⎜⎜⎝ ⎟⎟.
k→∞ n m ⎠

Therefore, for γ > 0.5, the reconstruction distortion converges to zero, as n grows to
infinity.
3. In the case of noisy measurements, the effect of the noise in the reconstruction error
behaves as

    O( σ_z √( (Q + 1)(N + 1) log n / m ) ).

3.6.3 Practical Image Compressed Sensing


As the final application, we consider the case of natural images. In this case JPEG and
JPEG2000 are well-known efficient compression codes that exploit complex structures
that are present in images. Hence, C-GD combined with these compression codes yields a
recovery algorithm that also takes advantage of the same structures.
To actually run the C-GD algorithm, there are three parameters that need to be
specified:
1. step-size η,
2. compression rate r,
3. number of iterations.
Proper setting of the first two parameters is crucial to the success of the algorithm. Note
that the theoretical results of Section 3.5 were derived by fixing η as

    η = 1/m.
However, choosing η in this way can lead to a very slow convergence rate of the algo-
rithm in practice. Hence, this parameter can also be considered as a free parameter that
needs to be specified. One approach to set the step size is to set it adaptively. Let ηt
denote the step size in iteration t. Then, ηt can be set such that after projection the result
yields the smallest measurement error. That is,

    η_t = arg min_η ‖ y − A P_{Cr}( x^t + η A^T (y − Ax^t) ) ‖₂²,      (3.41)

where the projection operation P_{Cr} is defined in (3.15). The optimization described in
(3.41) is a scalar optimization problem. Derivative-free methods, such as the Nelder–
Mead method, also known as the downhill simplex method [50], can be employed to
solve it.
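
One possible implementation of this adaptive step-size rule uses SciPy's Nelder–Mead routine; this is a sketch under our own naming, with compress_decompress again standing in for the projection P_{Cr}.

    import numpy as np
    from scipy.optimize import minimize

    def adaptive_step(y, A, x_t, compress_decompress, eta0=1.0):
        # Choose eta_t by (approximately) minimizing the post-projection
        # measurement error in (3.41) with a derivative-free Nelder-Mead search.
        grad_step = A.T @ (y - A @ x_t)

        def objective(eta):
            x_proj = compress_decompress(x_t + eta[0] * grad_step)
            return np.sum((y - A @ x_proj) ** 2)

        res = minimize(objective, x0=[eta0], method='Nelder-Mead')
        return float(res.x[0])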
Tuning the parameter r is a more challenging task, which can be described as a model-
selection problem. (See Chapter 7 of [51].) There are some standard techniques such as
multi-fold cross validation that address this issue. However, finding a method with low
computational complexity that works well for the C-GD algorithm is an open problem
that has yet to be addressed.
Finally, controlling the number of iterations in the C-GD method can be done using
standard stopping rules, such as limiting the maximum number of iterations, or lower-
bounding the reduction of the error per iteration, i.e., ‖x^{t+1} − x^t‖₂.
Using these parameter-selection methods results in an algorithm described below as
Algorithm 3.1. Here, as usual, E and D denote the encoder and decoder of the com-
pression code, respectively. K1,max denotes the maximum number of iterations and T
denotes the lower bound in the reduction mentioned earlier. In all the simulation results
cited later, K1,max = 50 and T = 0.001.
Tables 3.1 and 3.2 show some simulation results reported in [45], for noiseless and
noisy measurements, respectively. In both cases the measurement matrix is a random
partial-Fourier matrix, generated by randomly choosing rows from a Fourier matrix.

Table 3.1. PSNR of 512 × 512 reconstructions with no noise, sampled by a random
partial-Fourier measurement matrix

    Method      m/n     Boat    House   Barbara
    NLR-CS      10%     23.06   27.26   20.34
                30%     26.38   30.74   23.67
    JPEG-GD     10%     18.38   24.11   16.36
                30%     24.70   30.51   20.37
    JP2K-GD     10%     20.75   26.30   18.64
                30%     27.73   38.07   24.89

Note: The bold numbers highlight the best performance achieved in each case.

Algorithm 3.1 C-GD: Compression-based (projected) gradient descent

1: Inputs: compression code (E, D), measurements y ∈ Rm, sensing matrix A ∈ Rm×n
2: Initialize: x^0, η_0, K1,max, K2,max, T
3: for 1 ≤ t ≤ K1,max do
4:    η_t ← Apply Nelder–Mead method with K2,max iterations to solve (3.41).
5:    x^{t+1} ← D(E(x^t + η_t A^T (y − Ax^t)))
      if (1/√n) ‖x^{t+1} − x^t‖₂ < T then return x^{t+1}
      end if
6: end for
7: Output: x^{t+1}

The derived results are compared with a state-of-the-art recovery algorithm for partial-
Fourier matrices, non-local low-rank regularization (NLR-CS) [52]. In both tables, when
compression algorithm a is employed in the platform of C-GD, the resulting C-GD algo-
rithm is referred to as a-GD. For instance, C-GD that uses JPEG compression code is
referred to as JPEG-GD. The results are derived using the C-GD algorithm applied to
a number of standard test images. In these results, standard JPEG and JPEG2000 codes
available in the Matlab-R2016b image- and video-processing package are used as com-
pression codes for images. Boat, House, and Barbara test images are the standard images
available in the Matlab image toolbox.
The mean square error (MSE) between x ∈ Rn and x̂ ∈ Rn is measured as follows:

    MSE = (1/n) ‖x − x̂‖₂².                                             (3.42)

Then, the corresponding peak signal-to-noise ratio (PSNR) is defined as

    PSNR = 20 log( 255/√MSE ).                                         (3.43)

In the case of noisy measurements, y = Ax + z, the signal-to-noise ratio (SNR) is
defined as

    SNR = 20 log( ‖Ax‖₂/‖z‖₂ ).                                        (3.44)
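
A small helper for (3.42)–(3.43), written in numpy; the base-10 logarithm is our assumption, as is standard for PSNR in dB.

    import numpy as np

    def psnr(x, x_hat):
        # Peak signal-to-noise ratio for 8-bit images, following (3.42)-(3.43).
        mse = np.mean((x - x_hat) ** 2)
        return 20 * np.log10(255.0 / np.sqrt(mse))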

Table 3.2. PSNR of 512 × 512 reconstructions with Gaussian measurement noise with various SNR
values, sampled by a random partial-Fourier measurement matrix

                          MRI                Barbara              Snake
    Method     m/n    SNR=10  SNR=30     SNR=10  SNR=30      SNR=10  SNR=30
    NLR-CS     10%    11.66   24.14      12.10   19.83       10.50   18.75
               30%    12.60   26.84      13.32   24.05       11.98   24.82
    JPEG-GD    10%    14.34   20.50      15.60   18.60       12.33   15.67
               30%    19.20   24.70      18.17   22.89       14.40   22.37
    JP2K-GD    10%    17.33   25.40      16.53   21.65       18.00   23.12
               30%    21.56   35.38      21.82   28.19       21.06   29.30

Note: The bold numbers highlight the best performance achieved in each case.

In both tables, JP2K-GD consistently outperforms JPEG-GD. This confirms the
premise of this chapter that taking advantage of the full structure of a signal improves
the performance of a recovery algorithm. As mentioned earlier, JP2K-GD and JPEG-GD
employ JPEG2000 and JPEG codes, respectively. These tables suggest that the advan-
tage of JPEG2000 compression over JPEG compression, in terms of their compression
performance, translates directly into an advantage for the CS recovery algorithms that use them.
In the case of noiseless measurements, when the sampling rate is small, m/n = 0.1, as
reported in [45], for the majority of images (Boat, Barbara, House, and Panda), NLR-
CS outperforms both JPEG-GD and JP2K-GD. For some images, though (Dog, Snake),
JP2K-GD presents the best result. As the sampling rate increases to m/n = 0.3, JP2K-GD
consistently outperforms other algorithms. Finally, it can be observed that, in the case
of noisy measurements too, JP2K-GD consistently outperforms NLR-CS and JPEG-GD
both for m/n = 0.1 and for m/n = 0.3.

3.7 Extensions

In this section, we explore a couple of other areas in which extensions of compression-
based compressed sensing ideas can have meaningful impact.

3.7.1 Phase Retrieval


Compression algorithms take advantage of complex structures that are present in signals
such as images and videos. Therefore, compression-based CS algorithms automatically
take advantage of such structures without much extra effort. The same reasoning can be
employed in other areas as well. In this section we briefly review some recent results on
the application of compression codes to the problem of phase retrieval.
Phase retrieval refers to the problem of recovering a signal from its phaseless mea-
surements. It is a well-known fundamental problem with applications such as X-ray
crystallography [53], astronomical imaging [54], and X-ray tomography [55]. Briefly,
the problem can be described as follows. Consider an n-dimensional complex-valued
signal x ∈ Cn. The signal is captured by some phaseless measurements y ∈ Rm, where

    y = |Ax|² + z.
Here, A ∈ Cm×n and z ∈ Rm denote the sensing matrix and the measurement noise,
respectively. Also, | · | denotes the element-wise absolute value operation. That is, for
x = (x1 , . . . , xn ), |x| = (|x1 |, . . . , |xn |). The goal of a phase-retrieval recovery algorithm is to
recover x, up to a constant phase shift, from measurements y. This is arguably a more
complicated problem than the problem of CS as the measurement process itself is also
nonlinear. Hence, solving this problem in the case where x is an arbitrary signal in Cn
and m is larger than n is still challenging.
In recent years, there has been a fresh interest in the problem of phase retrieval and
developing more efficient recovery algorithms. In one of the influential results in this
area, Candès et al. propose lifting a phase-retrieval problem into a higher-dimensional
problem, in which, as with CS, the measurement constraints can be stated as linear
constraints [56, 57]. After this transform, ideas from convex optimization and matrix
completion can be employed to efficiently and robustly solve this problem. (Refer to [58,
59] for a few other examples of works in this area.) One common aspect of these works
is that, given the challenging structure of the problem, to a great extent, they all either
ignore the structure of the signal or only consider sparsity. In general, however, signals
follow complex patterns that if used efficiently can yield considerable improvement in
the performance of the systems in terms of the required number of measurements or
reconstruction quality.
Given our discussions so far on compression codes and compression-based com-
pressed sensing, it is not surprising that using compression codes for the problem of
phase retrieval provides a platform for addressing this problem. Consider Q, a compact
subset of Cn. Assume that (Er, Dr) denotes a rate-r compression code with maximum
reconstruction distortion δ for signals in Q. In other words, for all x ∈ Q,

    ‖x − Dr(Er(x))‖₂ ≤ δ.                                              (3.45)

Consider x ∈ Q and noise-free measurements y = |Ax|². COmpressible PhasE Retrieval
(COPER), a compression-based phase-retrieval solution proposed in [60], recovers x by
solving the following optimization:

    x̂ = arg min_{c∈Cr} ‖ y − |Ac|² ‖,                                  (3.46)

where as usual Cr denotes the codebook of the code defined by (Er , Dr ). In other words,
similarly to CSP, COPER also seeks the codeword that minimizes the measurement
error. The following theorem characterizes the performance of COPER and connects the
number of measurements m with the rate r, distortion δ, and reconstruction quality.
Before stating the theorem, note that, since the measurements are phaseless, for all
values of θ ∈ [0, 2π], e^{jθ}x and x will generate the same measurements. That is, |Ax| =
|A(e^{jθ}x)|. Hence, without any additional information about the phase of the signal, any
recovery algorithm is expected only to recover x, up to an unknown phase shift.
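
As with CSP, COPER is conceptually a search over the codebook; the sketch below (our own illustration, feasible only for very small codebooks) also shows how the recovery error can be evaluated up to the unavoidable global phase.

    import numpy as np

    def coper(y, A, codebook):
        # Brute-force COPER (3.46): pick the codeword whose phaseless
        # measurements |Ac|^2 best match y.
        errs = [np.linalg.norm(y - np.abs(A @ c) ** 2) for c in codebook]
        return codebook[int(np.argmin(errs))]

    def phase_aligned_error(x, x_hat):
        # inf over theta of ||e^{i theta} x - x_hat||_2; the minimizing rotation is
        # theta* = angle(<x, x_hat>), with <a, b> = sum(conj(a) * b).
        theta = np.angle(np.vdot(x, x_hat))
        return np.linalg.norm(np.exp(1j * theta) * x - x_hat)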

theorem 3.4 (Theorem 1 in [60]) Consider x ∈ Q and measurements y = |Ax|². Let x̂
denote the solution of the COPER optimization, described in (3.46). Assume that the
elements of matrix A ∈ Cm×n are i.i.d. N(0, 1) + iN(0, 1). Then,

    inf_θ ‖e^{iθ}x − x̂‖₂² ≤ 16√3 ( (1 + τ₂)/√τ₁ ) m δ,                 (3.47)

with a probability larger than

    1 − 2^r e^{(m/2)(K + ln τ₁ − ln m)} − e^{−2m(τ₂ − ln(1+τ₂))},
where K = ln(2πe), and τ1 and τ2 are arbitrary positive real numbers.
To understand the implications of the above theorem, consider the following corollary.
corollary 3.4 (Corollary 1 in [60]) Consider the same setup as that in Theorem 3.4.
For η > 1, let

    m = ηr/log(1/δ).

Then, given ε > 0, for large enough r, we have

    P( inf_θ ‖e^{iθ}x − x̂‖₂ ≤ Cδ ) ≥ 1 − 2^{−c_η r} − e^{−0.6m},       (3.48)

where C = 32√3. If η > 1/(1 − ε), c_η is a positive number greater than η(1 − ε) − 1.
Given a family of compression codes {(Er , Dr ) : r > 0}, indexed by their rates, as δ
approaches zero,
r(δ)/log(1/δ)
approaches the α-dimension of this family of codes defined in (3.3). Hence, Corollary 3.4
states that, as δ approaches zero, COPER generates, by employing high-rate codes, an
almost zero-distortion reconstruction of the input vector, using slightly more than α
measurements. For instance, for the class of complex-valued bounded k-sparse signals,
one could use the same strategy as used earlier to build a code for the class of real-valued
k-sparse signals to find a code of this class of signals. It is straightforward to confirm that
the α-dimension of the resulting class of codes is equal to 2k. This shows that COPER
combined with an appropriate compression code is able to recover signals in this class
from slightly more than 2k measurements.

3.7.2 Stochastic Processes


In this chapter, we mainly focus on deterministic signal classes that are defined as com-
pact subsets of Rn . In this section, we shift our focus to structured stochastic processes
and show how most of the results derived so far carry over to such sources as well.
First, in the next section, we start by reviewing information-theoretic lossy compres-
sion codes defined for stationary processes. As highlighted below, these definitions are
slightly different from how we have defined compression codes earlier in this chapter.
We also explain why we have adopted these non-mainstream definitions. In the follow-
ing, we highlight the differences between the two sets of definitions, explain why we
have adopted the alternative definitions as the main setup in this chapter, and finally
specify how compression-based CS of stochastic processes raises new questions about
fundamental limits of noiseless CS.

Lossy Compression
Consider a stationary stochastic process X = {Xi }i∈Z , with alphabet X, where X ⊂ R. A
(fixed-length) lossy compression code for process X is defined by its blocklength n and
encoding and decoding mappings (E, D), where

E : Xn → {1, . . . , 2nR }

and
n .
D : {1, . . . , 2nR } → X

Here, X and X  denote the source and the reconstruction alphabet, respectively. Unlike
the earlier definitions, where the rate r represented the total number of bits used to
encode each source vector, here the rate R denotes the number of bits per source symbol.
The distortion performance of the described code is measured in terms of its induced
expected average distortion defined as

1 !
n
D= i ) ,
E d(Xi , X
n i=1

where X  → R+ denotes a per-letter distortion measure. Dis-


n = D(E(X n )) and d : X × X
tortion D is said to be achievable at rate R if, for any  > 0, there exists a family of
compression codes (n, En , Dn ) such that, for all large enough n,

1 !
n
i ) ≤ D + ,
E d(Xi , X
n i=1

where X n = Dn (En (X n )). The distortion-rate function D(X, R) of process X is defined as


the infimum of all achievable distortions at rate R. The rate-distortion function of process
X is defined as R(X, D) = inf{R : D(X, R) ≤ D}.
The distortion performance of such codes can alternatively be measured through their excess distortion probability, which is the probability that the distortion between the source output and its reconstruction offered by the code exceeds some fixed level, i.e., $P(d_n(X^n, \hat{X}^n) > D)$. The distortion D is said to be achievable at rate R under vanishing excess distortion probability if, for any ε > 0, there exists a family of rate-R lossy compression codes $(n, E_n, D_n)$ such that, for all large enough n,
$$ P\big(d_n(X^n, \hat{X}^n) > D\big) \le \varepsilon, $$
where $\hat{X}^n = D_n(E_n(X^n))$.


While in general these two definitions lead to two different distortion-rate functions,
for stationary ergodic processes the distortion-rate functions under these definitions
coincide and are identical [61–63].

Structured Processes
Consider a real-valued stationary stochastic process X = {X_i}_{i∈Z}. CS of process X, i.e., recovering $X^n$ from measurements $Y^m = AX^n + Z^m$, where m < n, is possible only if X is "structured." This raises the following important questions.
question 4 What does it mean for a real-valued stationary stochastic process to be
structured?
question 5 How can we measure the level of structuredness of a process?
For stationary processes with a discrete alphabet, information theory provides us with
a powerful tool, namely, the entropy rate function, that measures the complexity or
compressibility of such sources [24, 64]. However, the entropy rate of real-valued (continuous-alphabet) processes is infinite, and hence this measure by itself is not useful for quantifying the complexity of real-valued sources.
In this section, we briefly address the above questions in the context of CS. Before
proceeding, to better understand what it means for a process to be structured, let us
review a classic example, which has been studied extensively in CS. Consider process X
which is i.i.d., and for which Xi , i ∈ N, is distributed as

$$ (1-p)\,\delta_0 + p\,\nu_c\,. \qquad (3.49) $$
Here, $\nu_c$ denotes the probability density function (p.d.f.) of an absolutely continuous distribution. For p < 1, the output vectors of this source are, with high probability, sparse, and therefore compressed sensing of them is feasible. Hence, for p < 1, process X can be considered a structured process. As the value of p increases, the percentage
of non-zero entries in X n increases as well. Hence, as the signal becomes less sparse, a
CS recovery algorithm would require more measurements. This implies that, at least for
such sparse processes, p can serve as a reasonable measure of structuredness. In general, a plausible measure of structuredness is expected to conform with this intuition: the larger the value of the measure, the higher the required sampling rate. In the following
example we derive one such measure that satisfies this property.

Example 3.1 Consider random variable X distributed such that with probability 1 − p
it is equal to 0, and with probability p it is uniformly distributed between 0 and 1. That
is,

$$ X \sim (1-p)\,\delta_0 + p\,\nu, $$
where ν denotes the p.d.f. of Unif[0, 1]. Consider quantizing X by b bits. It is straightforward to confirm that

$$ P([X]_b = 0) = 1 - p + p\,2^{-b} $$
and
$$ P([X]_b = i) = p\,2^{-b}, $$
for $i \in \{2^{-b}, 2^{-b+1}, \ldots, 1 - 2^{-b}\}$. Hence,
$$ H([X]_b) = -(1 - p + p\,2^{-b})\log(1 - p + p\,2^{-b}) - (2^b - 1)\,p\,2^{-b}\log(p\,2^{-b}) $$
$$ \qquad = pb - (1 - p + p\,2^{-b})\log(1 - p + p\,2^{-b}) - (1 - 2^{-b})\,p\log p - p\,b\,2^{-b}. \qquad (3.50) $$

As expected, $H([X]_b)$ grows to infinity as b grows without bound. Dividing both sides by b, it follows that, for a fixed p,
$$ \frac{H([X]_b)}{b} = p + \delta_b, \qquad (3.51) $$
where $\delta_b = o(1)$. This suggests that $H([X]_b)$ grows almost linearly with b and that the asymptotic slope of its growth is equal to
$$ \lim_{b\to\infty} \frac{H([X]_b)}{b} = p. $$
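To make Example 3.1 concrete, the short script below (an illustrative sketch using NumPy, not code from the chapter) evaluates $H([X]_b)$ for the spike-and-slab distribution and shows that $H([X]_b)/b$ approaches p as the quantization resolution b grows.

import numpy as np

def quantized_entropy(p, b):
    """Entropy (in bits) of [X]_b when X ~ (1-p)*delta_0 + p*Unif[0,1]."""
    q0 = (1 - p) + p * 2.0**(-b)        # mass of the quantization cell containing 0
    qi = p * 2.0**(-b)                  # mass of each of the remaining 2^b - 1 cells
    return -q0 * np.log2(q0) - (2**b - 1) * qi * np.log2(qi)

p = 0.1
for b in [4, 8, 16, 24]:
    print(b, quantized_entropy(p, b) / b)   # decreases toward p = 0.1 as b grows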

The result derived in Example 3.1 on the slope of growth of the quantized entropy function holds for any absolutely continuous distribution satisfying $H(\lfloor X\rfloor) < \infty$ [66]. In fact, this slope is a well-known quantity referred to as the Rényi information dimension of X [66].
definition 3.1 (Rényi information dimension) Given random variable X, the upper
Rényi information dimension of X is defined as

$$ \bar{d}(X) = \limsup_{b\to\infty} \frac{H([X]_b)}{b}\,. \qquad (3.52) $$
The lower Rényi information dimension of X, $\underline{d}(X)$, is defined similarly by replacing lim sup with lim inf. In the case where $\bar{d}(X) = \underline{d}(X)$, the Rényi information dimension of X is defined as $d(X) = \bar{d}(X) = \underline{d}(X)$.
For a stationary memoryless process X = {X_i}_{i∈Z}, under some regularity conditions on the distribution of X, as n and m grow to infinity while m/n stays constant, the Rényi information dimension of $X_1$ characterizes the minimum required sampling rate (m/n) for almost lossless recovery of $X^n$ from measurements $Y^m = AX^n$ [67]. In other words, for such sources, $d(X_1)$ is the infimum of all sampling rates for which there exists a proper decoding function with $P(\hat{X}^n \neq X^n) \le \epsilon$, where $\epsilon \in (0, 1)$. This result provides an
operational meaning for the Rényi information dimension and partially addresses the
questions we raised earlier about measuring the structuredness of processes, at least for
i.i.d. sources.
To measure the structuredness of general non-i.i.d. stationary processes, a general-
ized version of this definition is needed. Such a generalized definition would also take
structures that manifest themselves as specific correlations between output symbols into
account. In fact such a generalization has been proposed in [40], and more recently
in [68]. In the following we briefly review the information dimension of stationary
processes, as proposed in [40].

definition 3.2 (kth-order information dimension) The upper kth-order information dimension of stationary process X is defined as
$$ \bar{d}_k(X) = \limsup_{b\to\infty} \frac{H\big([X_{k+1}]_b \,\big|\, [X^k]_b\big)}{b}\,. \qquad (3.53) $$
The lower kth-order information dimension of X, $\underline{d}_k(X)$, is defined similarly by replacing lim sup with lim inf. In the case where $\bar{d}_k(X) = \underline{d}_k(X)$, the kth-order information dimension of X is defined as $d_k(X) = \bar{d}_k(X) = \underline{d}_k(X)$.
It can be proved that the kth-order information dimension always lies between zero and one. Also, both $\bar{d}_k(X)$ and $\underline{d}_k(X)$ are non-increasing functions of k.
definition 3.3 (Information dimension) The upper information dimension of stationary
process X is defined as
$$ \bar{d}_o(X) = \lim_{k\to\infty} \bar{d}_k(X). \qquad (3.54) $$
The lower information dimension of process X is defined as $\underline{d}_o(X) = \lim_{k\to\infty} \underline{d}_k(X)$. When $\bar{d}_o(X) = \underline{d}_o(X)$, the information dimension of process X is defined as $d_o(X) = \underline{d}_o(X) = \bar{d}_o(X)$.
Just like the Rényi information dimension, the information dimension of general sta-
tionary processes has an operational meaning in the context of universal CS of such
sources.2
Another possible measure of structuredness for stochastic processes, which is more
closely connected to compression-based CS, is the rate-distortion dimension defined as
follows.
definition 3.4 (Rate-distortion dimension) The upper rate-distortion dimension of stationary process X is defined as
$$ \overline{\dim}_R(X) = \limsup_{D\to 0} \frac{2R(X, D)}{\log(1/D)}\,. \qquad (3.55) $$
The lower rate-distortion dimension of process X, $\underline{\dim}_R(X)$, is defined similarly by replacing lim sup with lim inf. Finally, if
$$ \overline{\dim}_R(X) = \underline{\dim}_R(X), $$
the rate-distortion dimension of process X is defined as $\dim_R(X) = \overline{\dim}_R(X) = \underline{\dim}_R(X)$.
For memoryless stationary sources, the information dimension simplifies to the Rényi
information dimension of the marginal distribution, which is known to be equal to the
rate-distortion dimension [69]. For more general sources, under some technical condi-
tions, the rate-distortion dimension and the information dimension can be proved to be
equal [70].
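As a quick sanity check of Definition 3.4 (a standard computation included here for illustration, not taken from the chapter), consider an i.i.d. Gaussian source under squared-error distortion, for which the rate-distortion function is available in closed form; its rate-distortion dimension equals one, matching the Rényi information dimension of an absolutely continuous marginal.

% Rate-distortion dimension of an i.i.d. N(0, sigma^2) source (squared-error distortion).
% Using R(X, D) = (1/2) log(sigma^2 / D) for 0 < D <= sigma^2:
\[
  \dim_R(X)
  = \lim_{D \to 0} \frac{2 R(X, D)}{\log(1/D)}
  = \lim_{D \to 0} \frac{\log(\sigma^2) + \log(1/D)}{\log(1/D)}
  = 1 .
\]
% In contrast, a discrete-alphabet source has R(X, D) bounded in D, so its
% rate-distortion dimension is 0.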

2 In information theory, an algorithm is called universal if it does not need to know the source distribution
and yet, asymptotically, achieves the optimal performance.

CSP Applied to Stochastic Processes


In Section 3.4, we discussed the performance of the CSP optimization, a compression-
based CS scheme, in the deterministic setting, where signal x is from compact set Q.
The results can be extended to the case of stochastic processes as well. The following
theorem characterizes the performance of the CSP in the stochastic setting in terms of
the rate, distortion, and excess probability of the compression code.
theorem 3.5 (Theorem 4 in [70]) Consider process X and a rate-R blocklength-n lossy compression code whose distortion exceeds D with probability smaller than ε. Without any loss of generality assume that the source is normalized such that D < 1, and let $\mathcal{C}_n$ denote the codebook of this code. Let $Y^m = AX^n$, and assume that the $A_{i,j}$ are i.i.d. as N(0, 1). For arbitrary α > 0 and η > 1, let $\delta = \eta/\ln(1/D) + \alpha$, and
$$ m = \frac{2\eta R}{\log(1/D)}\, n. \qquad (3.56) $$
Further, let $\hat{X}^n$ denote the solution of the CSP optimization, that is, $\hat{X}^n = \arg\min_{c\in\mathcal{C}_n} \|Ac - Y^m\|_2$. Then,
$$ P\left( \frac{1}{\sqrt{n}}\,\big\|X^n - \hat{X}^n\big\|_2 \;\ge\; \frac{2 + \sqrt{\tfrac{n}{m}}}{1 - \sqrt{\tfrac{1+\delta}{\eta}}}\;\sqrt{2D} \right) \;\le\; \epsilon + 2^{-\frac{nR\alpha}{2}} + e^{-\frac{m}{2}}\,. $$

corollary 3.5 (Corollary 3 in [70]) Consider a stationary process X with upper rate-distortion dimension $\overline{\dim}_R(X)$. Let $Y^m = AX^n$, where the $A_{i,j}$ are i.i.d. N(0, 1). For any Δ > 0, if the number of measurements $m = m_n$ satisfies
$$ \liminf_{n\to\infty} \frac{m_n}{n} > \overline{\dim}_R(X), $$
then there exists a family of compression codes which, when used by the CSP optimization, yields
$$ \lim_{n\to\infty} P\left( \frac{1}{\sqrt{n}}\,\big\|X^n - \hat{X}^n\big\|_2 \ge \Delta \right) = 0, $$
where $\hat{X}^n$ denotes the solution of the CSP optimization.
This corollary has the following important implication for the problem of Bayesian
CS in the absence of measurement noise. Asymptotically, zero-distortion recovery is
achievable, as long as the sampling rate m/n is larger than the upper rate-distortion
dimension of the source. This observation raises the following fundamental and open
questions about noiseless CS of stationary stochastic processes.
question 6 Does the rate-distortion dimension of a stationary process characterize the
fundamental limit of the required sampling rate for zero-distortion recovery?
question 7 Are the fundamental limits of the required rates for zero-distortion recovery
and for almost-lossless recovery the same?
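To make the compression-based recovery idea concrete, the toy script below (an illustrative sketch, not code from [70]) implements the CSP decoder $\hat{X}^n = \arg\min_{c\in\mathcal{C}_n}\|Ac - Y^m\|_2$ by brute force over a small synthetic codebook of sparse, coarsely quantized vectors, and recovers a codeword from m < n random linear measurements. The codebook here is purely hypothetical and chosen only so that exhaustive search is feasible.

import itertools
import numpy as np

rng = np.random.default_rng(0)

# A tiny stand-in "compression code": 1-sparse vectors in R^n with amplitudes
# on a coarse grid (not a real image/video code, just small enough to enumerate).
n, m = 20, 6
levels = np.linspace(-1.0, 1.0, 9)
codebook = []
for idx, amp in itertools.product(range(n), levels):
    c = np.zeros(n)
    c[idx] = amp
    codebook.append(c)
codebook = np.array(codebook)            # each row is one codeword

# Pick a codeword as the true signal and take m < n Gaussian measurements.
x_true = codebook[rng.integers(len(codebook))]
A = rng.standard_normal((m, n))
y = A @ x_true

# CSP: pick the codeword whose measurements best match y.
residuals = np.linalg.norm(codebook @ A.T - y, axis=1)
x_hat = codebook[np.argmin(residuals)]
print("recovery error:", np.linalg.norm(x_hat - x_true))

Real compression codes have far too many codewords for such an exhaustive search; the point of the chapter is that the same objective can be attacked with efficient compression-based solvers.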

3.8 Conclusions and Discussion

Data-compression algorithms and CS algorithms are fundamentally different, as they are designed to address different problems. Data-compression codes are designed to represent (encode) signals by as few bits as possible. The goal of CS recovery algorithms,
on the other hand, is to recover a structured signal from its under-determined (typi-
cally linear) measurements. Both types of algorithm succeed by taking advantage of signals' structures and patterns. While the field of CS is only about a decade old, there
has been more than 50 years of research on developing efficient image and video com-
pression algorithms. Therefore, structures and patterns that are used by state-of-the-art
compression codes are typically much more complex than those used by CS recovery
algorithms. In this chapter, we discussed how using compression codes to solve the CS
problem potentially enables us to expand the class of structures used by CS codes to
those already used by data-compression codes.
This approach of using compression codes to address a different problem, namely, CS,
can be used to address other emerging data-acquisition problems as well. In this chapter,
we briefly reviewed the problem of phase retrieval, and how, theoretically, compression-
based phase-retrieval recovery methods are able to recover a signal from its phase-less
measurements.

References

[1] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4,
pp. 1289–1306, 2006.
[2] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal
encoding strategies?” IEEE Trans. Information Theory, vol. 52, no. 12, pp. 5406–5425,
2006.
[3] S. Bakin, “Adaptive regression and model selection in data mining problems,” Ph.D.
Thesis, Australian National University, 1999.
[4] Y. C. Eldar and M. Mishali, “Robust recovery of signals from a structured union of
subspaces,” IEEE Trans. Information. Theory, vol. 55, no. 11, pp. 5302–5316, 2009.
[5] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty
relations and efficient recovery,” IEEE Trans. Signal Processing, vol. 58, no. 6,
pp. 3042–3054, 2010.
[6] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,”
J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.
[7] S. Ji, D. Dunson, and L. Carin, “Multi-task compressive sensing,” IEEE Trans. Signal
Processing, vol. 57, no. 1, pp. 92–106, 2009.
[8] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk, “Asymptotic analysis of complex lasso
via complex approximate message passing (CAMP),” IEEE Trans. Information Theory,
vol. 59, no. 7, pp. 4290–4308, 2013.
[9] M. Stojnic, “Block-length dependent thresholds in block-sparse compressed sensing,”
arXiv:0907.3679, 2009.

[10] M. Stojnic, F. Parvaresh, and B. Hassibi, “On the reconstruction of block-sparse signals
with an optimal number of measurements,” IEEE Trans. Signal Processing, vol. 57, no. 8,
pp. 3075–3085, 2009.
[11] M. Stojnic, "ℓ2/ℓ1-optimization in block-sparse compressed sensing and its strong thresh-
olds,” IEEE J. Selected Topics Signal Processing, vol. 4, no. 2, pp. 350–357, 2010.
[12] L. Meier, S. Van De Geer, and P. Buhlmann, “The group Lasso for logistic regression,”
J. Roy. Statist. Soc. Ser. B, vol. 70, no. 1, pp. 53–71, 2008.
[13] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of
linear inverse problems,” Found. Comput. Math., vol. 12, no. 6, pp. 805–849, 2012.
[14] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Model-based compressive
sensing,” IEEE Trans. Information Theory, vol. 56, no. 4, pp. 1982–2001, 2010.
[15] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum rank solutions to linear matrix
equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[16] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation,”
IEEE Trans. Signal Processing, vol. 50, no. 6, pp. 1417–1428, 2002.
[17] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a
Markov-tree prior,” IEEE Trans. Signal Processing, vol. 60, no. 7, pp. 3439–3448, 2012.
[18] D. Donoho and G. Kutyniok, “Microlocal analysis of the geometric separation problem,”
Commun. Pure Appl. Math., vol. 66, no. 1, pp. 1–47, 2013.
[19] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?"
J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
[20] A. E. Waters, A. C. Sankaranarayanan, and R. Baraniuk, “Sparcs: Recovering low-rank
and sparse matrices from compressive measurements,” in Proc. Advances in Neural
Information Processing Systems, 2011, pp. 1089–1097.
[21] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. Willsky, “Rank-sparsity incoherence
for matrix decomposition,” SIAM J. Optimization, vol. 21, no. 2, pp. 572–596, 2011.
[22] M. F. Duarte, W. U. Bajwa, and R. Calderbank, “The performance of group Lasso for
linear regression of grouped variables,” Technical Report TR-2010-10, Duke University,
Department of Computer Science, Durham, NC, 2011.
[23] T. Blumensath and M. E. Davies, “Sampling theorems for signals from the union of finite-
dimensional linear subspaces,” IEEE Trans. Information Theory, vol. 55, no. 4, pp. 1872–
1882, 2009.
[24] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex deconvolution, with
applications,” arXiv:1205.1580, 2012.
[25] C. Studer and R. G. Baraniuk, “Stable restoration and separation of approximately sparse
signals,” Appl. Comp. Harmonic Analysis (ACHA), vol. 37, no. 1, pp. 12–35, 2014.
[26] G. Peyré and J. Fadili, “Group sparsity with overlapping partition functions,” in Proc.
EUSIPCO, 2011, pp. 303–307.
[27] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,”
SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001.
[28] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser.
B vol. 58, no. 1, pp. 267–288, 1996.
[29] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear
inverse problems,” SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[30] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal
matching pursuit,” IEEE Trans. Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.

[31] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear
inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11,
pp. 1413–1457, 2004.
[32] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals
Statist., vol. 32, no. 2, pp. 407–499, 2004.
[33] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,”
Appl. Comp. Harmonic Analysis (ACHA), vol. 27, no. 3, pp. 265–274, 2009.
[34] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inac-
curate samples,” Appl. Comp. Harmonic Analysis (ACHA), vol. 26, no. 3, pp. 301–321,
2009.
[35] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed
sensing,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, 2009.
[36] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruc-
tion,” IEEE Trans. Information Theory, vol. 55, no. 5, pp. 2230–2249, 2009.
[37] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,”
IEEE Trans. Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
[38] J. Zhu, D. Baron, and M. F. Duarte, “Recovery from linear measurements with complexity-
matching universal signal estimation,” IEEE Trans. Signal Processing, vol. 63, no. 6, pp.
1512–1527, 2015.
[39] S. Jalali, A. Maleki, and R. G. Baraniuk, “Minimum complexity pursuit for universal
compressed sensing,” IEEE Trans. Information Theory, vol. 60, no. 4, pp. 2253–2268,
2014.
[40] S. Jalali and H. V. Poor, “Universal compressed sensing for almost lossless recovery,” IEEE
Trans. Information Theory, vol. 63, no. 5, pp. 2933–2953, 2017.
[41] S. Jalali and A. Maleki, “New approach to Bayesian high-dimensional linear regression,”
Information and Inference, vol. 7, no. 4, pp. 605–655, 2018.
[42] S. Jalali and A. Maleki, “From compression to compressed sensing,” Appl. Comput
Harmonic Analysis, vol. 40, no. 2, pp. 352–385, 2016.
[43] D. S. Taubman and M. W. Marcellin, JPEG2000: Image compression fundamentals,
standards and practice. Kluwer Academic Publishers, 2002.
[44] S. Beygi, S. Jalali, A. Maleki, and U. Mitra, “Compressed sensing of compressible signals,”
in Proc. IEEE International Symposium on Information Theory, 2017, pp. 2158–2162.
[45] S. Beygi, S. Jalali, A. Maleki, and U. Mitra, “An efficient algorithm for compression-based
compressed sensing,” vol. 8, no. 2, pp. 343–375, June 2019.
[46] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM J. Cont.
Opt., vol. 14, no. 5, pp. 877–898, 1976.
[47] E. J. Candès, J. Romberg, and T. Tao, “Decoding by linear programming,” IEEE Trans.
Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[48] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger
than n,” Annals Statist, vol. 35, no. 6, pp. 2313–2351, 2007.
[49] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of Lasso and Dantzig
selector,” Annals Statist, vol. 37, no. 4, pp. 1705–1732, 2009.
[50] J. A. Nelder and R. Mead, “A simplex method for function minimization,” Comp. J, vol. 7,
no. 4, pp. 308–313, 1965.
[51] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning: Data mining,
inference, and prediction, 2nd edn. Springer, 2009.

[52] W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang, “Compressive sensing via nonlocal low-rank
regularization,” IEEE Trans. Image Processing, 2014.
[53] R. W. Harrison, “Phase problem in crystallography,” J. Opt. Soc. America A, vol. 10, no. 5,
pp. 1046–1055, 1993.
[54] C. Fienup and J. Dainty, “Phase retrieval and image reconstruction for astronomy,” Image
Rec.: Theory and Appl., pp. 231–275, 1987.
[55] F. Pfeiffer, T. Weitkamp, O. Bunk, and C. David, “Phase retrieval and differential phase-
contrast imaging with low-brilliance X-ray sources,” Nature Physics, vol. 2, no. 4,
pp. 258–261, 2006.
[56] E. J. Candès, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery
from magnitude measurements via convex programming,” Commun. Pure Appl. Math.,
vol. 66, no. 8, pp. 1241–1274, 2013.
[57] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix
completion,” SIAM Rev., vol. 57, no. 2, pp. 225–251, 2015.
[58] H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “CPRL – an extension of compressive sens-
ing to the phase retrieval problem,” in Proc. Advances in Neural Information Processing
Systems 25, 2012, pp. 1367–1375.
[59] P. Schniter and S. Rangan, “Compressive phase retrieval via generalized approximate
message passing,” IEEE Trans. Information Theory, vol. 63, no. 4, pp. 1043–1055, 2015.
[60] M. Bakhshizadeh, A. Maleki, and S. Jalali, “Compressive phase retrieval of struc-
tured signals,” in Proc. IEEE International Symposium on Information Theory, 2018,
pp. 2291–2295.
[61] Y. Steinberg and S. Verdú, "Simulation of random processes and rate-distortion theory,"
IEEE Trans. Information Theory, vol. 42, no. 1, pp. 63–86, 1996.
[62] S. Ihara and M. Kubo, “Error exponent of coding for stationary memoryless sources with
a fidelity criterion,” IEICE Trans. Fund. Elec., Comm. and Comp. Sciences, vol. 88, no. 5,
pp. 1339–1345, 2005.
[63] K. Iriyama, “Probability of error for the fixed-length lossy coding of general sources,”
IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1498–1507, 2005.
[64] C. E. Shannon, “A mathematical theory of communication: Parts I and II,” Bell Systems
Technical J., vol. 27, pp. 379–423 and 623–656, 1948.
[65] T. Cover and J. Thomas, Elements of information theory, 2nd edn. Wiley, 2006.
[66] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Math. Acad.
Sci. Hungarica, vol. 10, nos. 1–2, pp. 193–215, 1959.
[67] Y. Wu and S. Verdú, “Rényi information dimension: Fundamental limits of almost lossless
analog compression,” IEEE Trans. Information Theory, vol. 56, no. 8, pp. 3721–3748,
2010.
[68] B. C. Geiger and T. Koch, “On the information dimension rate of stochastic processes,” in
Proc. IEEE International Symposium on Information Theory, 2017, pp. 888–892.
[69] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and measures,” IEEE
Trans. Information Theory, vol. 40, no. 5, pp. 1564–1572, 1994.
[70] F. E. Rezagah, S. Jalali, E. Erkip, and H. V. Poor, “Compression-based compressed
sensing,” IEEE Trans. Information Theory, vol. 63, no. 10, pp. 6735–6752, 2017.
4 Information-Theoretic Bounds
on Sketching
Mert Pilanci

Summary

Approximate computation methods with provable performance guarantees are becoming important and relevant tools in practice. In this chapter, we focus on sketching methods
designed to reduce data dimensionality in computationally intensive tasks. Sketching can
often provide better space, time, and communication complexity trade-offs by sacrificing
minimal accuracy. This chapter discusses the role of information theory in sketching
methods for solving large-scale statistical estimation and optimization problems. We
investigate fundamental lower bounds on the performance of sketching. By exploring
these lower bounds, we obtain interesting trade-offs in computation and accuracy. We
employ Fano’s inequality and metric entropy to understand fundamental lower bounds
on the accuracy of sketching, which is parallel to the information-theoretic techniques
used in statistical minimax theory.

4.1 Introduction

In recent years we have witnessed an unprecedented increase in the amount of available data in a wide variety of fields. Approximate computation methods with provable
performance guarantees are becoming important and relevant tools in practice to attack
larger-scale problems. The term sketching is used for randomized algorithms designed
to reduce data dimensionality in computationally intensive tasks. In large-scale prob-
lems, sketching allows us to leverage limited computational resources such as memory,
time, and bandwidth, and also explore favorable trade-offs between accuracy and
computational complexity.
Random projections are widely used instances of sketching, and have attracted sub-
stantial attention in the literature, especially very recently in the machine-learning,
signal processing, and theoretical computer science communities [1–6]. Other popular
sketching techniques include leverage score sampling, graph sparsification, core sets,
and randomized matrix factorizations. In this chapter we overview sketching meth-
ods, develop lower bounds using information-theoretic techniques, and present upper
bounds on their performance. In the next section we begin by introducing commonly
used sketching methods.


This chapter focuses on the role of information theory in sketching methods for
solving large-scale statistical estimation and optimization problems, and investigates
fundamental lower bounds on their performance. By exploring these lower bounds, we
obtain interesting trade-offs in computation and accuracy. Moreover, we may hope to
obtain improved sketching constructions by understanding their information-theoretic
properties. The lower-bounding techniques employed here parallel the information-
theoretic techniques used in statistical minimax theory [7, 8]. We apply Fano’s inequality
and packing constructions to understand fundamental lower bounds on the accuracy of
sketching.
Randomness and sketching also have applications in privacy-preserving queries
[9, 10]. Privacy has become an important concern in the age of information where
breaches of sensitive data are frequent. We will illustrate that randomized sketch-
ing offers a computationally simple and effective mechanism to preserve privacy in
optimization and machine learning.
We start with an overview of different constructions of sketching matrices in Sec-
tion 4.2. In Section 4.3, we briefly review some background on convex analysis and
optimization. Then we present upper bounds on the performance of sketching from an
optimization viewpoint in Section 4.4. To be able to analyze upper bounds, we introduce
the notion of localized Gaussian complexity, which also plays an important role in the
characterization of minimax statistical bounds. In Section 4.5, we discuss information-
theoretic lower bounds on the statistical performance of sketching. In Section 4.6,
we turn to non-parametric problems and information-theoretic lower bounds. Finally,
in Section 4.7 we discuss privacy-preserving properties of sketching using a mutual
information characterization, and communication-complexity lower bounds.

4.2 Types of Randomized Sketches

In this section we describe popular constructions of sketching matrices. Given a sketching matrix S, we use $\{s_i\}_{i=1}^m$ to denote the collection of its n-dimensional rows. Here
we consider sketches which are zero mean, and are normalized, i.e., they satisfy the
following two conditions:
(a) $\mathbb{E}[S^T S] = I_{n\times n}$, (4.1)
(b) $\mathbb{E}[S] = 0_{m\times n}$. (4.2)
The reasoning behind the above conditions will become clearer when they are applied
to sketching optimization problems involving data matrices.
A very typical use of sketching is to obtain compressed versions of a large data matrix
A. We obtain the matrix SA ∈ Rm×d using simple matrix multiplication. See Fig. 4.1 for
an illustration. As we will see in a variety of examples, random matrices preserve most
of the information in the matrix A.

4.2.1 Gaussian Sketches


The most classical sketch is based on a random matrix S ∈ Rm×n with i.i.d. standard
Gaussian entries. Suppose that we generate a random matrix S ∈ Rm×n with entries drawn

Figure 4.1 Sketching a tall matrix A. The smaller matrix SA ∈ Rm×d is a compressed version of
the original data A ∈ Rn×d .

from i.i.d. zero-mean Gaussian random variables with variance 1/m. Note that we have $\mathbb{E}[S] = 0_{m\times n}$ and also $\mathbb{E}[S^T S] = \sum_{i=1}^{m} \mathbb{E}[s_i s_i^T] = \sum_{i=1}^{m} \frac{1}{m} I_n = I_n$. Analyzing the Gaussian

sketches is considerably easier than analyzing sketches of other types, because of the
special properties of the Gaussian distribution such as rotation invariance. However,
Gaussian sketches may not be the most computationally efficient choice for many data
matrices, as we will discuss in the following sections.
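As a quick illustration of the normalization convention above (a minimal NumPy sketch of ours, not code from the chapter), the snippet below draws a Gaussian sketching matrix with entries of variance 1/m and checks empirically that it preserves squared norms on average, i.e., that $\mathbb{E}\|Sv\|_2^2 = \|v\|_2^2$.

import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 500, 50, 2000

v = rng.standard_normal(n)               # an arbitrary fixed vector
sq_norms = []
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch, entries of variance 1/m
    sq_norms.append(np.linalg.norm(S @ v) ** 2)

print("||v||^2           :", np.linalg.norm(v) ** 2)
print("average ||S v||^2 :", np.mean(sq_norms))    # close to ||v||^2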

4.2.2 Sub-Gaussian Sketches


A generalization of the previous construction is a random sketch with rows drawn from i.i.d. sub-Gaussian random variables. In particular, a zero-mean random vector $s \in \mathbb{R}^n$ is 1-sub-Gaussian if, for any $u \in \mathbb{R}^n$, we have
$$ P\big[\langle s, u\rangle \ge \varepsilon \|u\|_2\big] \le e^{-\varepsilon^2/2} \quad \text{for all } \varepsilon \ge 0. \qquad (4.3) $$
For instance, a vector with i.i.d. N(0, 1) entries is 1-sub-Gaussian, as is a vector with
i.i.d. Rademacher entries (uniformly distributed over {−1, +1}). In many models of
computation, multiplying numbers by random signs is simpler than multiplying by Gaussian variables, and applying the sketch then only costs addition and subtraction operations. Note that multiplying by
−1 only amounts to flipping the sign bit in the signed number representation of the
number in the binary system. In modern computers, the difference between addition
and multiplication is often not appreciable. However, the real disadvantage of sub-
Gaussian and Gaussian sketches is that they require matrix–vector multiplications with
unstructured and dense random matrices. In particular, given a data matrix A ∈ Rn×d ,
computing its sketched version SA requires O(mnd) basic operations using classical
matrix multiplication algorithms, in general.

4.2.3 Randomized Orthonormal Systems


The second type of randomized sketch we consider is the randomized orthonormal
system (ROS), for which matrix multiplication can be performed much more efficiently.
In order to define an ROS sketch, we first let $H \in \mathbb{R}^{n\times n}$ be an orthonormal matrix with entries $H_{ij} \in [-1/\sqrt{n},\, 1/\sqrt{n}]$. Standard classes of such matrices are the Hadamard

or Fourier bases, for which matrix–vector multiplication can be performed in O(n log n)
time via the fast Hadamard or Fourier transforms, respectively. For example, an n × n
Hadamard matrix H = Hn can be recursively constructed as follows:
   
$$ H_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H_4 = \frac{1}{\sqrt{2}}\begin{bmatrix} H_2 & H_2 \\ H_2 & -H_2 \end{bmatrix}, \qquad H_{2^t} = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{\text{Kronecker product } t \text{ times}}\,. $$
From any such matrix, a sketching matrix S ∈ Rm×n from a ROS ensemble can be
obtained by sampling i.i.d. rows of the form

$$ s^T = \sqrt{n}\, e_j^T H D \quad \text{with probability } 1/n \text{ for } j = 1, \ldots, n, $$

where the random vector e j ∈ Rn is chosen uniformly at random from the set of all
n canonical basis vectors, and D = diag(r) is a diagonal matrix of i.i.d. Rademacher
variables $r \in \{-1, +1\}^n$, where $P[r_i = +1] = P[r_i = -1] = 1/2$ for all i. Alternatively, the rows
of the ROS sketch can be sampled without replacement and one can obtain similar
guarantees to sampling with replacement. Given a fast routine for matrix–vector
multiplication, ROS sketch SA of the data A ∈ Rn×d can be formed in O(n d log m) time
(for instance, see [11]).
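The snippet below gives a minimal illustration of an ROS (Hadamard-based) sketch: it flips signs with a random diagonal D, applies the orthonormal Hadamard transform H, and keeps m randomly chosen rows. It uses scipy.linalg.hadamard for clarity; a fast Walsh–Hadamard transform would replace the explicit matrix in practice. The extra 1/√m factor is our own normalization choice so that the sketch matches the convention (4.1).

import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)

n, d, m = 256, 10, 32                      # n must be a power of 2 for hadamard()
A = rng.standard_normal((n, d))

# ROS-style sketch: rescaled, randomly sampled rows of H @ D @ A.
D = rng.choice([-1.0, 1.0], size=n)        # Rademacher diagonal (random signs)
H = hadamard(n) / np.sqrt(n)               # orthonormal Hadamard matrix
rows = rng.integers(0, n, size=m)          # sample m rows with replacement

SA = np.sqrt(n / m) * (H @ (A * D[:, None]))[rows, :]
print(SA.shape)                            # (m, d): compressed version of A

The point of the construction is that H can be applied with a fast transform rather than a dense matrix multiply, which is what makes ROS sketches cheaper than Gaussian ones for large n.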

4.2.4 Sketches Based on Random Row Sampling


Given a probability distribution $\{p_j\}_{j=1}^n$ over $[n] = \{1, \ldots, n\}$, another choice of sketch is to randomly sample the rows of the extended data matrix A a total of m times with replacement from the given probability distribution. Thus, the rows of S are independent and take on the values
$$ s^T = \frac{e_j^T}{\sqrt{p_j}} \quad \text{with probability } p_j \text{ for } j = 1, \ldots, n, $$
where $e_j \in \mathbb{R}^n$ is the jth canonical basis vector. Different choices of the weights $\{p_j\}_{j=1}^n$
are possible, including those based on the leverage scores of A. Leverage scores are
defined as
$$ p_j := \frac{\|u_j\|_2^2}{\sum_{i=1}^{n} \|u_i\|_2^2}\,, $$
where $u_1, u_2, \ldots, u_n$ are the rows of $U \in \mathbb{R}^{n\times d}$, which is the matrix of left singular vectors of A. Leverage scores can be obtained using a singular value decomposition $A = U\Sigma V^T$.
Moreover, there also exist faster randomized algorithms to approximate the leverage
scores (e.g., see [12]). In our analysis of lower bounds to follow, we assume that the
weights are α-balanced, meaning that
$$ \max_{j=1,\ldots,n} p_j \le \frac{\alpha}{n} \qquad (4.4) $$
for some constant α that is independent of n.
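A minimal sketch of leverage-score row sampling is given below (illustrative only; an exact SVD is used here, whereas [12] describes faster approximate alternatives). The 1/√m factor in the rescaling is our own normalization choice so that the sketched Gram matrix is an unbiased estimate of AᵀA.

import numpy as np

rng = np.random.default_rng(2)
n, d, m = 1000, 8, 80

A = rng.standard_normal((n, d))

# Leverage scores: p_j proportional to ||u_j||_2^2, u_j = rows of U from the thin SVD.
U, _, _ = np.linalg.svd(A, full_matrices=False)
p = np.sum(U**2, axis=1)
p = p / p.sum()

# Sample m row indices with replacement and rescale each sampled row.
idx = rng.choice(n, size=m, replace=True, p=p)
SA = A[idx, :] / np.sqrt(m * p[idx])[:, None]

# Check: E[(SA)^T (SA)] = A^T A; one realization is already a decent approximation.
print(np.linalg.norm(SA.T @ SA - A.T @ A) / np.linalg.norm(A.T @ A))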

4.2.5 Graph Sparsification via Sub-Sampling


Let G = (V, E) be a weighted, undirected graph with d nodes and n edges, where V and
E are the set of nodes and the set of edges, respectively. Let A ∈ Rn×d be the node–edge
incidence matrix of the graph G. Suppose we randomly sample the edges in the graph a
total of m times with replacement from a given probability distribution over the edges.
The obtained graph is a weighted subgraph of the original, whose incidence matrix is
SA. Similarly to the row sampling sketch, the sketch can be written as
$$ s^T = \frac{e_j^T}{\sqrt{p_j}} \quad \text{with probability } p_j \text{ for } j = 1, \ldots, n. $$
We note that row and graph sub-sampling sketches satisfy the condition (4.1). However,
they do not satisfy the condition (4.2). In many computational problems on graphs, spar-
sifying the graph has computational advantages. Notable examples of such problems
are solving Laplacian linear systems and graph partitioning, where sparsification can be
used. We refer the reader to Spielman and Srivastava [13] for details.

4.2.6 Sparse Sketches Based on Hashing


In many applications, the data matrices contain very few non-zero entries. For sparse
data matrices, special constructions of the sketching matrices yield greatly improved
performance. Here we describe the count-sketch construction from [14, 15]. Let
h : [n] → [m] be a hash functions from a pair-wise independent family.1 The entry Si j of
the sketch matrix is given by σ j if i = h( j), and otherwise it is zero, where σ ∈ {−1, +1}n
is a random vector containing 4-wise independent variables. Therefore, the jth column
of S is non-zero only in the row indexed by h( j). We refer the reader to [14, 15] for
the details. An example realization of the sparse sketch is given below, where each col-
umn contains a single non-zero entry, equal to a uniformly random sign ±1, at a uniformly random index:
$$ \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & -1 \end{bmatrix} . $$
Figure 4.2 shows examples of different randomized sketching matrices S ∈ Rm×n , where
m = 64, n = 1024, which are drawn randomly. We refer readers to Nelson and Nguyên
[16] for details on sparse sketches.
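The count-sketch can be applied without ever forming S explicitly: each row of A is multiplied by a random sign and added into the output row selected by the hash. The snippet below is a minimal illustration of this idea; it uses plain pseudorandom hashing rather than an explicit pair-wise/4-wise independent family.

import numpy as np

rng = np.random.default_rng(3)
n, d, m = 10000, 20, 256

A = rng.standard_normal((n, d))

h = rng.integers(0, m, size=n)           # hashed output row for each input row
sigma = rng.choice([-1.0, 1.0], size=n)  # random sign for each input row

# Apply the sparse sketch in time proportional to nnz(A):
#   SA[h[i], :] += sigma[i] * A[i, :]
SA = np.zeros((m, d))
np.add.at(SA, h, sigma[:, None] * A)

print(SA.shape)
print(np.linalg.norm(SA.T @ SA - A.T @ A) / np.linalg.norm(A.T @ A))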

4.3 Background on Convex Analysis and Optimization

In this section, we first briefly review relevant concepts from convex analysis and
optimization. A set C ⊆ Rd is convex if, for any x, y ∈ C,
tx + (1 − t)y ∈ C for all t ∈ [0, 1] .
1 A hash function is from a pair-wise independent family if $P[h(j) = i, h(k) = l] = 1/m^2$ for $j \neq k$ and $P[h(j) = i] = 1/m$ for all i, j, k, l.

Figure 4.2 Different types of sketching matrices: (a) Gaussian sketch, (b) ±1 random sign sketch,
(c) randomized orthogonal system sketch, and (d) sparse sketch.

Let X be a convex set. A function f : X → R is convex if, for any x, y ∈ X,

f (tx + (1 − t)y) ≤ t f (x) + (1 − t) f (y) for all t ∈ [0, 1] .

Given a matrix A ∈ Rn×d , we define the linear transform of the convex set C as AC =
{Ax | x ∈ C}. It can be shown that AC is convex if C is convex.
A convex optimization problem is a minimization problem of the form

$$ \min_{x\in C} f(x), \qquad (4.5) $$

where f (x) is a convex function and C is a convex set. In order to characterize optimality
of solutions, we will define the tangent cone of C at a fixed vector x∗ as follows:

 
$$ T_C(x^*) = \big\{ t(x - x^*) \mid t \ge 0 \text{ and } x \in C \big\}. \qquad (4.6) $$

Figures 4.3 and 4.4 illustrate2 examples of tangent cones of a polyhedral convex set in
R2 . A first-order characterization of optimality in the convex optimization problem (4.5)
is given by the tangent cone. If a vector x∗ is optimal in (4.5), it holds that

zT ∇ f (x∗ ) ≥ 0, ∀z ∈ TC (x∗ ) . (4.7)

We refer the reader to Hiriart-Urruty and Lemaréchal [17] for details on convex analysis,
and Boyd and Vandenberghe [18] for an in-depth discussion of convex optimization
problems and applications.

2 Note that the tangent cones extend toward infinity in certain directions, whereas the shaded regions in
Figs. 4.3 and 4.4 are compact for illustration.

Figure 4.3 A narrow tangent cone where the Gaussian complexity is small.


Figure 4.4 A wide tangent cone where the Gaussian complexity is large.

4.4 Sketching Upper Bounds for Regression Problems

Now we consider an instance of a convex optimization problem. Consider the least-squares optimization
$$ x^* = \arg\min_{x\in C}\ \underbrace{\|Ax - b\|_2}_{f(x)}\,, \qquad (4.8) $$

where A ∈ Rn×d and b ∈ Rn are the input data and C ⊆ Rd is a closed and convex constraint
set. In statistical and signal-processing applications, it is typical to use the constraint
set to impose structure on the obtained solution x. Important examples of the convex constraint C include the non-negative orthant, the ℓ1-ball for promoting sparsity, and the ℓ∞-ball as a relaxation of the combinatorial set $\{0, 1\}^d$.
In the unconstrained case when $C = \mathbb{R}^d$, a closed-form expression exists for the solution of (4.8), which is given by $x^* = (A^T A)^{-1} A^T b$. However, forming the Gram matrix $A^T A$
and inverting using direct methods such as QR decomposition, or the singular value
decomposition, typically requires O(nd2 ) + O(nd min(n, d)) operations. Faster iterative
algorithms such as the conjugate gradient (CG) method can be used to obtain an approx-
imate solution in O(ndκ(A)) time, where κ(A) is the condition number of the data matrix
A. Using sketching methods, it is possible to obtain even faster approximate solutions,
as we will discuss in what follows.
In the constrained case, a variety of efficient iterative algorithms have been developed
in the last couple of decades to obtain the solution, such as proximal and projected
gradient methods, their accelerated variants, and barrier-based second-order methods.
Sketching can also be used to improve the run-time of these methods.

4.4.1 Over-Determined Case (n > d )


In many applications, the number of observations, n, exceeds the number of unknowns,
d, which gives rise to the tall n × d matrix A. In machine learning, it is very common
to encounter datasets where n is very large and d is of moderate size. Suppose that we
first compute the sketched data matrices SA and Sb from the original data A and b, then
consider the following approximation to the above optimization problem:


$$ \hat{x} = \arg\min_{x\in C} \|SAx - Sb\|_2^2\,. \qquad (4.9) $$

After applying the sketch to the data matrices, the sketched problem has dimensions
m × d, which is lower than the original dimensions when m < n. Note that the objective
in the above problem (4.9) can be seen as an unbiased approximation of the original
objective function (4.8), since it holds that

E  SAx − Sb 22 =  Ax − b 22

for any fixed choice of A, x, and b. This is a consequence of the condition (4.1), which
is satisfied by all of the sketching matrices considered in Section 4.2.
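The following minimal example (our own NumPy illustration, not code from the chapter) forms the sketched problem (4.9) with a Gaussian sketch for an unconstrained least-squares instance and compares the sketched solution with the exact one.

import numpy as np

rng = np.random.default_rng(4)
n, d, m = 20000, 50, 500

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.5 * rng.standard_normal(n)

# Exact least squares on the full data.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: compress (A, b) once, then solve the m x d problem.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

f = lambda x: np.linalg.norm(A @ x - b)
print("f(x*)      :", f(x_star))
print("f(x_sketch):", f(x_sketch))   # close to f(x*) when m >> d, as in Theorem 4.1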

4.4.2 Gaussian Complexity


Gaussian complexity plays an important role in statistics, empirical process theory, com-
pressed sensing, and the theory of Banach spaces [19–21]. Here we consider a localized
version of the Gaussian complexity, which is defined as follows:
$$ W_t(C) := \mathbb{E}_g\Bigg[\sup_{\substack{z\in C \\ \|z\|_2 \le t}} |\langle g, z\rangle|\Bigg], \qquad (4.10) $$

where g is a random vector with i.i.d. standard Gaussian entries, i.e., g ∼ N(0, In ). The
parameter t > 0 controls the radius at which the random deviations are localized. For a
finite value of t, the supremum in (4.10) is always achieved since the constraint set is
compact.
Analyzing the sketched optimization problem requires us to control the random devi-
ations constrained to the set of possible descent directions {x − x∗ | x ∈ C}. We now define
a transformed tangent cone at x∗ as follows:
 
$$ K = \big\{ tA(x - x^*) \mid t \ge 0 \text{ and } x \in C \big\}, $$

which can be alternatively defined as ATC (x∗ ) using the definition given in (4.6). The
next theorem provides an upper bound on the performance of the sketching method for
constrained optimization based on localized Gaussian complexity.
theorem 4.1 Let S be a Gaussian sketch, and let $\hat{x}$ be the solution of (4.9). Suppose that $m \ge c_0 W_1(K)^2/\epsilon^2$, where $c_0$ is a universal constant. Then it holds that
$$ \frac{\|A(\hat{x} - x^*)\|_2}{f(x^*)} \le \epsilon, $$

and consequently we have

$$ f(x^*) \le f(\hat{x}) \le f(x^*)(1 + \epsilon)\,. \qquad (4.11) $$

As predicted by the theorem, the approximation ratio improves as the sketch dimen-
sion m increases, and converges to one as m → ∞. However, we are often interested in the
rate of convergence of the approximation ratio. Theorem 4.1 characterizes this rate by
relating the geometry of the constraint set to the accuracy of the sketching method (4.9).
As an illustration, Figs. 4.3 and 4.4 show narrow and wide tangent cones in R2 , respec-
tively. The proof of Theorem 4.1 combines the convex optimality condition involving
the tangent cone in (4.7) with results on empirical processes, and can be found in Pilanci
and Wainwright [22]. An important feature of Theorem 4.1 is that the approximation
quality is relative to the optimal value f (x∗ ). This is advantageous when f (x∗ ) is small,
e.g., the optimal value can be zero in noiseless signal recovery problems. However, in
problems where the signal-to-noise ratio is low, f (x∗ ) can be large, and hence negatively
affects the approximation quality. We illustrate the implications of Theorem 4.1 on some
concrete examples in what follows.

Example 4.1 Unconstrained Least Squares For unconstrained problems, we have K = range(A), i.e., the tangent cone is equal to the range space of the data matrix A. In order
to apply Theorem 4.1, we need the following lemma about the Gaussian complexity of
a subspace.

lemma 4.1 Let Q be a subspace of dimension q. The Gaussian complexity of Q satisfies
$$ W_t(Q) \le t\sqrt{q}\,. $$

Proof Let U be an orthonormal basis for the subspace Q, so that we have the representation Q = {Ux | x ∈ R^q}. Consequently, the Gaussian complexity $W_t(Q)$ can be written as
$$ \mathbb{E}_g\Bigg[\sup_{\substack{x \\ \|Ux\|_2\le t}} \langle g, Ux\rangle\Bigg] = \mathbb{E}_g\Bigg[\sup_{\substack{x \\ \|x\|_2\le t}} \langle U^T g, x\rangle\Bigg] = t\,\mathbb{E}_g\big\|U^T g\big\|_2 \le t\,\sqrt{\mathbb{E}\,\mathrm{tr}\big(UU^T g g^T\big)} = t\,\sqrt{\mathrm{tr}\big(U^T U\big)} = t\sqrt{q}\,, $$
where the inequality follows from Jensen's inequality and concavity of the square root, and the first and last equalities follow since $U^T U = I_q$. Therefore, the Gaussian complexity
of the range of A for t = 1 satisfies

$$ W_1(\mathrm{range}(A)) \le \sqrt{\mathrm{rank}(A)}\,. $$

Setting the dimension of the sketch $m \ge c_0\,\mathrm{rank}(A)/\epsilon^2$ suffices to obtain an ε-approximate solution in the sense of (4.11). We note that rank(A) might not be known a priori, but the upper bound rank(A) ≤ d may be useful when n ≫ d.
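A quick Monte Carlo check of Lemma 4.1 (an illustrative sketch of ours): for a subspace spanned by an orthonormal U with q columns, the supremum in (4.10) at t = 1 equals ‖Uᵀg‖₂, whose average is at most √q.

import numpy as np

rng = np.random.default_rng(5)
n, q, trials = 400, 12, 5000

# Orthonormal basis of a random q-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, q)))

# For t = 1:  sup_{z in subspace, ||z||_2 <= 1} |<g, z>| = ||U^T g||_2.
vals = [np.linalg.norm(U.T @ rng.standard_normal(n)) for _ in range(trials)]

print("Monte Carlo estimate of W_1:", np.mean(vals))
print("Lemma 4.1 bound sqrt(q)    :", np.sqrt(q))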

Example 4.2 ℓ1-Constrained Least Squares For ℓ1-norm-constrained problems we have $C = \{x \mid \|x\|_1 \le r\}$ for some radius parameter r. The tangent cone K at the optimal point $x^*$ depends on the support3 of $x^*$, and hence on its cardinality $\|x^*\|_0$. In [23], it is shown that the localized Gaussian complexity satisfies
$$ W_1(K) \le c_1\, \frac{\gamma_k^+(A)}{\beta_k^-(A)}\, \sqrt{\|x^*\|_0\, \log d}\,, $$
where $c_1$ is a universal constant and $\gamma_k^+$ and $\beta_k^-$ are ℓ1-restricted maximum and minimum eigenvalues defined as follows:

$$ \gamma_k^+(A) := \max_{\substack{\|z\|_2 = 1 \\ \|z\|_1 \le \sqrt{k}}} \|Az\|_2^2 \qquad\text{and}\qquad \beta_k^-(A) := \min_{\substack{\|z\|_2 = 1 \\ \|z\|_1 \le \sqrt{k}}} \|Az\|_2^2\,. $$

As a result, we conclude that for ℓ1-constrained problems, the sketch dimension can be substantially smaller when the ℓ1-restricted eigenvalues are well behaved.

4.4.3 Under-Determined Case (n ≤ d )


In many applications the dimension of the data vectors may be larger than the sample
size. In these situations, it makes sense to reduce the dimensionality by applying the
sketch on the right, i.e., AST , and solve

$$ \arg\min_{\substack{z\in\mathbb{R}^m \\ S^T z\,\in\, C}} \big\|AS^T z - b\big\|_2\,. \qquad (4.12) $$

Note that the vector z ∈ Rm is of smaller dimension than the original variable x ∈ Rd .
After solving the reduced-dimensional problem and obtaining its optimal solution $z^*$, the final estimate for the original variable x can be taken as $\hat{x} = S^T z^*$. We will investigate this
approach in Section 4.5 in non-parametric statistical estimation problems and present
concrete theoretical guarantees.
It is instructive to note that, in the special case where we have ℓ2 regularization and $C = \mathbb{R}^d$, we can easily transform the under-determined least-squares problem into an over-determined one using convex duality, or the matrix-inversion lemma. We first write the sketched problem (4.12) as the constrained convex program
$$ \min_{\substack{z\in\mathbb{R}^m,\; y\in\mathbb{R}^n \\ y = AS^T z}} \ \frac{1}{2}\|y - b\|_2^2 + \rho\|z\|_2^2\,, $$

and form the convex dual. It can be shown that strong duality holds, and consequently
primal and dual programs can be stated as follows:
$$ \min_{z\in\mathbb{R}^m}\ \frac{1}{2}\big\|AS^T z - b\big\|_2^2 + \rho\|z\|_2^2 \;=\; \max_{x\in\mathbb{R}^n}\ -\frac{1}{4\rho}\big\|SA^T x\big\|_2^2 - \frac{1}{2}\|x\|_2^2 + x^T b\,, $$

3 The term support refers to the set of indices where the solution has a non-zero value.

where the primal and dual solutions satisfy z∗ = (1/2ρ)SAT x∗ at the optimum [18].
Therefore the sketching matrix applied from the right, AST , corresponds to a sketch
applied on the left, SAT , in the dual problem which parallels (4.9). This observation can
be used to derive approximation results on the dual program. We refer the reader to [22]
for an application in support vector machine classification where b = 0n .
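As a numerical check of the duality discussed above (our own illustration in NumPy), the script below solves the ℓ2-regularized right-sketched problem and its dual, and verifies that the optimal values coincide and that $z^* = \frac{1}{2\rho}SA^T x^*$ holds at the optimum.

import numpy as np

rng = np.random.default_rng(6)
n, d, m, rho = 50, 2000, 200, 1.0              # under-determined: n <= d

A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
S = rng.standard_normal((m, d)) / np.sqrt(m)   # right sketch, S in R^{m x d}
B = A @ S.T                                    # n x m compressed design

# Primal: min_z 0.5*||A S^T z - b||^2 + rho*||z||^2
z_star = np.linalg.solve(B.T @ B + 2 * rho * np.eye(m), B.T @ b)
primal = 0.5 * np.linalg.norm(B @ z_star - b)**2 + rho * np.linalg.norm(z_star)**2

# Dual: max_x  -1/(4 rho)*||S A^T x||^2 - 0.5*||x||^2 + x^T b,  with x in R^n
M = S @ A.T                                    # m x n
x_star = np.linalg.solve(np.eye(n) + (M.T @ M) / (2 * rho), b)
dual = -np.linalg.norm(M @ x_star)**2 / (4 * rho) \
       - 0.5 * np.linalg.norm(x_star)**2 + x_star @ b

print("primal optimum:", primal)
print("dual optimum  :", dual)                                  # equal (strong duality)
print("||z* - S A^T x*/(2 rho)||:",
      np.linalg.norm(z_star - (M @ x_star) / (2 * rho)))        # approximately zero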

4.5 Information-Theoretic Lower Bounds

4.5.1 Statistical Upper and Lower Bounds


In order to develop information-theoretic lower bounds, we consider a statistical
observation model for the constrained regression problem. Consider the following
model:
b = Ax† + w, where w ∼ N(0, σ2 In ) and x† ∈ C0 , (4.13)
where x† is the unknown vector to be estimated and w is an i.i.d. noise vector whose
entries are distributed as N(0, σ2 ). In this section we will focus on the observation model
(4.13) and present a lower bound on all estimators which use the sketched data (SA, Sb)
to form an estimate  x.
We assume that the unknown vector x† belongs to some set C0 ⊆ C that is star-shaped
around zero.4 In many cases of interest we have C = C0 , i.e., when the set C is convex
and simple to describe. In this case, the constrained least-squares estimate x∗ from equa-
tion (4.8) corresponds to the constrained maximum-likelihood estimator for estimating
the unknown regression vector x† under the Gaussian observation model (4.13). How-
ever, C0 may not be computationally tractable as an optimization constraint set, e.g., it may be a non-convex set, and we can then consider a set C which is a convex relaxation5 of this set, such that $C_0 \subset C$. An important example is the set of s-sparse and bounded vectors given by $C_0 = \{x : \|x\|_0 \le s,\ \|x\|_\infty \le 1\}$, which has combinatorially many elements. The well-known ℓ1 relaxation given by $C = \{x : \|x\|_1 \le s,\ \|x\|_\infty \le 1\}$ satisfies $C_0 \subset C$, which follows from the Cauchy–Schwarz inequality, and is widely used [24, 25] to find sparse solutions.
We now present a theoretical result on the statistical performance of the original
constrained least-squares estimator in (4.8)
theorem 4.2 Let C be any set that contains the true parameter x† . Then the constrained
estimator x∗ in (4.8) under the observation model (4.13) has mean-squared error upper-
bounded as
  
1 σ2 
Ew A(x − x )2 ≤ c2 δ∗ (n)2 +
∗ † 2
,
n n

4 This assumption means that, for any x ∈ C0 and scalar t ∈ [0, 1], the point tx also belongs to C0 .
5 We may also consider an approximation of C0 which doesn't necessarily satisfy $C_0 \subset C$, for example, the ℓ1 and ℓ0 unit balls.

where δ∗ (n) is the critical radius, equal to the smallest positive solution δ > 0 to the
inequality
$$ \frac{W_\delta(C)}{\delta\sqrt{n}} \le \frac{\delta}{\sigma}\,. \qquad (4.14) $$
We refer the reader to [20, 23] for a proof of this theorem. This result provides a
baseline against which to compare the statistical recovery performance of the random-
ized sketching method. In particular, an important goal is characterizing the minimal projection dimension m that will enable us to find an estimate $\hat{x}$ with the error guarantee
$$ \frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \;\approx\; \frac{1}{n}\big\|A(x^* - x^\dagger)\big\|_2^2\,, $$
in a computationally simpler manner using the compressed data SA, Sb.


An application of Theorem 4.1 will yield that the sketched solution $\hat{x}$ in (4.9), using the choice of sketch dimension $m = c_0 W_1(K)^2/\epsilon^2$, satisfies the bound
$$ \big\|A(\hat{x} - x^*)\big\|_2 \le \epsilon\,\|Ax^* - b\|_2\,, $$
where $\|Ax^* - b\|_2 = f(x^*)$ is the optimal value of the optimization problem (4.8). However, under the model (4.13) we have
$$ \|Ax^* - b\|_2 = \big\|A(x^* - x^\dagger) - w\big\|_2 \le \big\|A(x^* - x^\dagger)\big\|_2 + \|w\|_2\,, $$
which is at least $O(\sigma\sqrt{n})$ because of the term $\|w\|_2$. This upper bound suggests that $\frac{1}{n}\|A(\hat{x} - x^\dagger)\|_2^2$ is bounded by $O(\epsilon^2\sigma^2) = O\big(\sigma^2 W_1(K)^2/m\big)$. This can be consid-
ered as a negative result for the sketching method, since the error scales as O(1/m)
instead of O(1/n). We will show that this upper bound is tight, and the O(1/m) scaling is
unavoidable for all methods that sketch the data once. In contrast, as we will discuss in
Section 4.5.7, an iterative sketching method can achieve optimal prediction error using
sketches of comparable dimension.
We will in fact show that, unless m ≥ n, any method based on observing only the pair
(SA, Sb) necessarily has a substantially larger error than the least-squares estimate. In
particular, our result applies to an arbitrary measurable function $(SA, Sb) \to \hat{x}$, which we refer to as an estimator.
More precisely, our lower bound applies to any random matrix S ∈ Rm×n for which
$$ \big\|\mathbb{E}\big[S^T(SS^T)^{-1}S\big]\big\|_{\mathrm{op}} \le \eta\,\frac{m}{n}\,, \qquad (4.15) $$
where η is a constant that is independent of n and m, and $\|A\|_{\mathrm{op}}$ denotes the ℓ2-operator norm, which reduces to the maximum eigenvalue for a symmetric matrix. These con-
ditions hold for various standard choices of the sketching matrix, including most of
those discussed in the Section 4.2: the Gaussian sketch, the ROS sketch,6 the sparse
sketch, and the α-balanced leverage sampling sketch. The following lemma shows that
the condition (4.15) is satisfied for Gaussian sketches with equality and η = 1.

6 See [23] for a proof of this fact for Gaussian and ROS sketches. To be more precise, for ROS sketches, the
condition (4.15) holds when rows are sampled without replacement.

lemma 4.2 Let $S \in \mathbb{R}^{m\times n}$ be a random matrix with i.i.d. Gaussian entries. We have
$$ \big\|\mathbb{E}\big[S^T(SS^T)^{-1}S\big]\big\|_{\mathrm{op}} = \frac{m}{n}\,. $$

Proof Let S = UΣVT denote the singular value decomposition of the random matrix
S. Note that we have ST (SST )−1 S = VVT . By virtue of the rotation invariance of
the Gaussian distribution, the columns of V, denoted by $\{v_i\}_{i=1}^m$, are uniformly distributed over the n-dimensional unit sphere, and it holds that $\mathbb{E}[v_i v_i^T] = \frac{1}{n} I_n$ for $i = 1, \ldots, m$. Consequently, we obtain
$$ \mathbb{E}\big[S^T(SS^T)^{-1}S\big] = \sum_{i=1}^{m} \mathbb{E}\big[v_i v_i^T\big] = m\,\mathbb{E}\big[v_1 v_1^T\big] = \frac{m}{n}\,I_n\,, $$

and the bound on the operator norm follows.
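A quick numerical confirmation of Lemma 4.2 (our own illustration): averaging the projector Sᵀ(SSᵀ)⁻¹S over many Gaussian draws gives a matrix whose operator norm is close to m/n.

import numpy as np

rng = np.random.default_rng(7)
n, m, trials = 40, 10, 3000

acc = np.zeros((n, n))
for _ in range(trials):
    S = rng.standard_normal((m, n))
    acc += S.T @ np.linalg.solve(S @ S.T, S)   # projector onto the row space of S
acc /= trials

print("m/n                      :", m / n)
print("operator norm of average :", np.linalg.norm(acc, 2))   # close to m/n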

4.5.2 Fano’s Inequality


Let X and Y represent two random variables with a joint probability distribution $P_{x,y}$, where X is discrete and takes values from a finite set $\mathcal{X}$. Let $\hat{X} = g(Y)$ be the predicted value of X for some deterministic function g which also takes values in $\mathcal{X}$. Then Fano's inequality states that
$$ P\big[\hat{X} \neq X\big] \ge \frac{H(X\,|\,Y) - 1}{\log_2(|\mathcal{X}| - 1)}\,. $$

Fano’s inequality follows as a simple consequence of the chain rule for entropy. How-
ever, it is very powerful for deriving lower bounds on the error probabilities in coding
theory, statistics, and machine learning [7, 26–30].
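In the form most often used for minimax lower bounds (a standard corollary stated here for reference, not taken verbatim from the chapter), one applies Fano's inequality with the hypothesis index uniform over M alternatives and bounds the conditional entropy via mutual information:

% Fano's inequality for an M-ary hypothesis test with J uniform on {1,...,M}
% and any test psi(Z):  H(J | Z) = log_2 M - I(J; Z), hence
\[
  P\big[\psi(Z) \neq J\big]
  \;\ge\; 1 - \frac{I(J; Z) + 1}{\log_2 M}.
\]
% A lower bound on this testing error translates, via Lemma 4.3 below,
% into a lower bound on the minimax estimation risk.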

4.5.3 Metric Entropy


For a given positive tolerance value δ > 0, we define the δ-packing number $M_{\delta,\|\cdot\|}$ of a set $C \subseteq \mathbb{R}^d$ with respect to a norm $\|\cdot\|$ as the largest number M of vectors $\{x_j\}_{j=1}^M \subseteq C$ that satisfy
$$ \|x_k - x_l\| > \delta \quad \forall\, k \neq l\,. $$

We define the metric entropy of the set C with respect to a norm $\|\cdot\|$ as the logarithm of the corresponding packing number,
$$ N_{\delta,\|\cdot\|}(C) = \log_2 M_{\delta,\|\cdot\|}\,. $$

The concept of metric entropy provides a way to measure the complexity, or effec-
tive size, of a set with infinitely many elements and dates back to the seminal work
of Kolmogorov and Tikhomirov [31].
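As a concrete example of these definitions (a standard volume argument, included here only for illustration), the packing number of the Euclidean unit ball $B_2(1) \subset \mathbb{R}^d$ grows exponentially in the dimension:

% Volume bounds for the delta-packing of the unit ball B = B_2(1) in R^d.
% Balls of radius delta/2 around the packing points are disjoint and lie in (1+delta/2)B:
\[
  M_{\delta}\,\Big(\tfrac{\delta}{2}\Big)^{d} \le \Big(1+\tfrac{\delta}{2}\Big)^{d}
  \quad\Longrightarrow\quad
  M_{\delta} \le \Big(1 + \tfrac{2}{\delta}\Big)^{d},
\]
% while a maximal delta-packing is also a delta-covering, so comparing volumes gives
\[
  M_{\delta} \ge \Big(\tfrac{1}{\delta}\Big)^{d}
  \quad\Longrightarrow\quad
  N_{\delta}\big(B_2(1)\big) = \Theta\big(d \log_2(1/\delta)\big)
  \;\text{ for small } \delta .
\]

In particular, a 1/2-packing of the Euclidean unit ball has at least $2^d$ points, consistent with the packing count used in Example 4.3 below.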

4.5.4 Minimax Risk


In this chapter, we will take a frequentist approach in modeling the unknown vector
x† we are trying to estimate from the data. In order to assess the quality of estimation,
we will consider a risk function associated with our estimation method. Note that, for
a fixed value of the unknown vector x† , there exist estimators which make no error for
that particular vector x† , such as the estimator which always returns x† regardless of
the observation. We will take the worst-case risk approach considered in the statistical
estimation literature, which focuses on the minimax risk. More precisely, we define the
minimax risk as follows:
$$ M(Q) = \inf_{\hat{x}\in Q}\ \sup_{x^\dagger\in\mathcal{X}}\ \mathbb{E}\left[\frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2\right], \qquad (4.16) $$
where the infimum ranges over all estimators that use the input data A and b to
estimate x† .

4.5.5 Reduction to Hypothesis Testing


In this section we present a reduction of the minimax estimation risk to hypothesis test-
ing. Suppose that we have a packing of the constraint set C given by the collection
$z^{(1)}, \ldots, z^{(M)}$ with radius 2δ. More precisely, we have
$$ \big\|A(z^{(i)} - z^{(j)})\big\|_2 \ge 2\delta \quad \forall\, i \neq j, $$
where $z^{(i)} \in C$ for all $i = 1, \ldots, M$. Next, consider a set of probability distributions $\{P_{z^{(j)}}\}_{j=1}^M$ corresponding to the distribution of the observation when the unknown vector is $x^\dagger = z^{(j)}$.
Suppose that we have an M-ary hypothesis-testing problem constructed as follows. Let
Jδ denote a random variable with uniform distribution over the index set {1, . . . , M} that
allows us to pick an element of the packing set at random. Note that M is a function of δ,
hence we keep the dependence of Jδ on δ explicit in our notation. Let us set the random
variable Z according to the probability distribution $P_{z^{(j)}}$ in the event that $J_\delta = j$, i.e.,
$$ Z \sim P_{z^{(j)}} \quad \text{whenever } J_\delta = j\,. $$
Now we will consider the problem of detecting the index $J_\delta$ given the value of Z.
The next lemma is a standard reduction in minimax theory, and relates the minimax
estimation risk to the M-ary hypothesis-testing error (see Birgé [30] and Yu [7]).
lemma 4.3 The minimax risk M(Q) is lower-bounded by
$$ M(Q) \ge \delta^2\, \inf_{\psi} P\big[\psi(Z) \neq J_\delta\big]\,. \qquad (4.17) $$

A proof of this lemma can be found in Section A4.2. Lemma 4.3 allows us to apply
Fano’s method after transforming the estimation problem into a hypothesis-testing
problem based on sketched data. Let us recall the condition on sketching matrices stated
earlier,
$$ \big\|\mathbb{E}\big[S^T(SS^T)^{-1}S\big]\big\|_{\mathrm{op}} \le \eta\,\frac{m}{n}\,, \qquad (4.18) $$
where η is a constant that is independent of n and m. Now we are ready to present the
lower bound on the statistical performance of sketching.

theorem 4.3 For any random sketching matrix $S \in \mathbb{R}^{m\times n}$ satisfying condition (4.18), any estimator $(SA, Sb) \to \hat{x}$ has MSE lower-bounded as
$$ \sup_{x^\dagger\in C_0}\ \mathbb{E}_{S,w}\left[\frac{1}{n}\big\|A(x^\dagger - \hat{x})\big\|_2^2\right] \ge \frac{\sigma^2\, \log_2\!\big(\tfrac{1}{2}M_{1/2}\big)}{128\,\eta\, \min\{m, n\}}\,, \qquad (4.19) $$
where $M_{1/2}$ is the 1/2-packing number of $C_0 \cap B_A(1)$ in the semi-norm $\frac{1}{\sqrt{n}}\|A(\cdot)\|_2$.
We defer the proof to Section 4.8, and investigate the implications of the lower bound in
the next section. It can be shown that Theorem 4.3 is tight, since Theorem 4.1 provides
a matching upper bound.

4.5.6 Implications of the Information-Theoretic Lower Bound


We now investigate some consequences of the lower bound given in Theorem 4.3.
We will focus on concrete examples of popular statistical estimation and optimization
problems to illustrate its applicability.

Example 4.3 Unconstrained Least Squares We first consider the simple unconstrained
case, where the constraint is the entire d-dimensional space, i.e., C = Rd . With this
choice, it is well known that, under the observation model (4.13), the least-squares
solution $x^*$ has prediction mean-squared error upper-bounded as follows:7
$$ \frac{1}{n}\,\mathbb{E}\big\|A(x^* - x^\dagger)\big\|_2^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{rank}(A)}{n} \qquad (4.20a) $$
$$ \le\; \frac{\sigma^2 d}{n}\,, \qquad (4.20b) $$
where the expectation is over the noise variable w in (4.13). On the other hand, with the
choice $C_0 = B_2(1)$, it is well known that we can construct a 1/2-packing with $M = 2^d$ elements, so that Theorem 4.3 implies that any estimator $\hat{x}$ based on (SA, Sb) has prediction MSE lower-bounded as
$$ \frac{1}{n}\,\mathbb{E}_{S,w}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \;\gtrsim\; \frac{\sigma^2 d}{\min\{m, n\}}\,. \qquad (4.20c) $$
Consequently, the sketch dimension m must grow proportionally to n in order for the
sketched solution to have a mean-squared error comparable to the original least-squares
estimate. This may not be desirable for least-squares problems in which n ≫ d, since
it should be possible to sketch down to a dimension proportional to rank(A) which is
always upper-bounded by d. Thus, Theorem 4.3 reveals a surprising gap between the
classical least-squares sketch (4.9) and the accuracy of the original least-squares esti-
mate. In the regime n  m, the prediction MSE of the sketched solution is O(σ2 (d/m))

⁷ In fact, a closed-form expression exists for the prediction error, which is straightforward to obtain from the closed-form solution of the least-squares estimator. However, this simple form is sufficient to illustrate the information-theoretic lower bounds.

which is a factor of n/m larger than the optimal prediction MSE in (4.20b). In Section 4.5.7, we will see that this gap can be removed by iterative sketching algorithms, which are not subject to the information-theoretic lower bound (4.20c).
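The gap described above is easy to observe numerically. The following snippet is a minimal sketch (not from the chapter; the sizes n, d, m and noise level are illustrative assumptions) comparing the prediction MSE of the classical sketch-and-solve estimate with that of the full least-squares solution when n ≫ d.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma = 2000, 50, 200, 1.0          # assumed sizes: n >> d, sketch dimension m

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + sigma * rng.standard_normal(n)

# Original least-squares solution.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Classical sketch-and-solve: min_x ||S(Ax - b)||_2 with a Gaussian sketch S.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

def pred_mse(x):
    """Prediction mean-squared error (1/n) ||A(x - x_true)||_2^2."""
    return np.sum((A @ (x - x_true)) ** 2) / n

print("LS prediction MSE      :", pred_mse(x_ls))   # roughly sigma^2 d / n
print("sketched prediction MSE:", pred_mse(x_sk))   # roughly sigma^2 d / m, a factor n/m larger
```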

Example 4.4 1 Constrained Least Squares We can consider other forms of con-
strained least-squares estimates as well, such as those involving an 1 -norm constraint
to encourage sparsity in the solution. We now consider the sparse variant of the linear
regression problem, which involves the ℓ₀ “ball”
$$B_0(s) := \Big\{ x \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^{d} \mathbb{I}[x_j \ne 0] \le s \Big\},$$

corresponding to the set of all vectors with at most s non-zero entries. Fixing some radius R ≥ s, consider a vector x† ∈ C_0 := B_0(s) ∩ {‖x‖₁ = R}, and suppose that we have noisy observations of the form b = Ax† + w.
Given this setup, one way to estimate x† is by computing the least-squares estimate x* constrained to the ℓ₁-ball C = {x ∈ R^d | ‖x‖₁ ≤ R}.⁸ This estimator is a form of the Lasso [2, 32], which has been studied extensively in the context of statistical estimation and signal reconstruction.
On the other hand, the 1/2-packing number M of the set C_0 can be lower-bounded as log₂ M ≳ s log₂(ed/s). We refer the reader to [33] for a proof. Consequently, in application to this particular problem, Theorem 4.3 implies that any estimator x̂ based on the pair (SA, Sb) has mean-squared error lower-bounded as
$$\mathbb{E}_{w,S}\, \frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \;\gtrsim\; \frac{\sigma^2\, s \log_2(ed/s)}{\min\{m, n\}}. \qquad (4.21)$$
Again, we see that the projection dimension m must be of the order of n in order to
match the mean-squared error of the constrained least-squares estimate x∗ up to constant
factors.

Example 4.5 Low-Rank Matrix Estimation In the problem of multivariate regression, the goal is to estimate a matrix X† ∈ R^{d₁×d₂} based on observations of the form
Y = AX† + W, (4.22)
where Y ∈ Rn×d2 is a matrix of observed responses, A ∈ Rn×d1 is a data matrix, and
W ∈ Rn×d2 is a matrix of noise variables. A typical interpretation of this model is
a collection of d2 regression problems, where each one involves a d1 -dimensional
regression vector, namely a particular column of the matrix X† . In many applications,
including reduced-rank regression, multi-task learning, and recommender systems

⁸ This setup is slightly unrealistic, since the estimator is assumed to know the radius R = ‖x†‖₁. In practice, one solves the least-squares problem with a Lagrangian constraint, but the underlying arguments are essentially the same.

(e.g., [34–37]), it is reasonable to model the matrix X† as being a low-rank matrix. Note that a rank constraint on the matrix X can be written as an ℓ₀-“norm” sparsity constraint on its singular values. In particular, we have
$$\mathrm{rank}(X) \le r \quad \text{if and only if} \quad \sum_{j=1}^{\min\{d_1, d_2\}} \mathbb{I}[\gamma_j(X) > 0] \le r,$$
where γ_j(X) denotes the jth singular value of X. This observation motivates a standard relaxation of the rank constraint using the nuclear norm $\|X\|_{\mathrm{nuc}} := \sum_{j=1}^{\min\{d_1, d_2\}} \gamma_j(X)$.
Accordingly, let us consider the constrained least-squares problem
$$X^* = \arg\min_{X \in \mathbb{R}^{d_1 \times d_2}} \frac{1}{2}\|Y - AX\|_{\mathrm{fro}}^2 \quad \text{such that} \quad \|X\|_{\mathrm{nuc}} \le R, \qquad (4.23)$$
where ‖·‖_fro denotes the Frobenius norm on matrices, or equivalently the Euclidean norm on its vectorized version. Let C_0 denote the set of matrices with rank r < ½ min{d₁, d₂}, and Frobenius norm at most one. In this case the constrained least-squares solution X* satisfies the bound
$$\mathbb{E}\, \frac{1}{n}\big\|A(X^* - X^\dagger)\big\|_{\mathrm{fro}}^2 \;\lesssim\; \frac{\sigma^2\, r\,(d_1 + d_2)}{n}. \qquad (4.24a)$$
On the other hand, the 1/2-packing number of the set C_0 is lower-bounded as log₂ M ≳ r(d₁ + d₂) (see [36] for a proof), so that Theorem 4.3 implies that any estimator X̂ based on the pair (SA, SY) has MSE lower-bounded as
$$\mathbb{E}_{w,S}\, \frac{1}{n}\big\|A(\hat{X} - X^\dagger)\big\|_{\mathrm{fro}}^2 \;\gtrsim\; \frac{\sigma^2\, r\,(d_1 + d_2)}{\min\{m, n\}}. \qquad (4.24b)$$

As with the previous examples, we see the sub-optimality of the sketched approach
in the regime m < n.

4.5.7 Iterative Sketching


It is possible to improve the basic sketching estimator using adaptive measurements.
Consider the constrained least-squares problem in (4.8):
$$x^* = \arg\min_{x \in C} \frac{1}{2}\|Ax - b\|_2^2 \qquad (4.25)$$
$$= \arg\min_{x \in C}\; \underbrace{\frac{1}{2}\|Ax\|_2^2 - b^T A x + \frac{1}{2}\|b\|_2^2}_{f(x)}. \qquad (4.26)$$

We may use an iterative method to obtain x*, which uses the gradient ∇f(x) = A^T(Ax − b) and Hessian ∇²f(x) = A^T A to minimize the second-order Taylor expansion of f(x) at a current iterate x_t using ∇f(x_t) and ∇²f(x_t) as follows:
$$x_{t+1} = x_t + \arg\min_{x \in C} \big\| \big(\nabla^2 f(x_t)\big)^{1/2} x \big\|_2^2 + x^T \nabla f(x_t) \qquad (4.27)$$
$$= x_t + \arg\min_{x \in C} \|Ax\|_2^2 - x^T A^T (b - Ax_t). \qquad (4.28)$$

We apply a sketching matrix S to the data A in the formulation (4.28) and define this procedure as an iterative sketch:
$$x_{t+1} = x_t + \arg\min_{x \in C} \|SAx\|_2^2 - 2x^T A^T (b - Ax_t). \qquad (4.29)$$
Note that this procedure uses more information than the classical sketch (4.9); in particular, it calculates the left matrix–vector multiplications with the data A in the following order:
$$s_1^T A,\; s_2^T A,\; \ldots,\; s_m^T A,\; (b - Ax_1)^T A,\; \ldots,\; (b - Ax_t)^T A,$$
where s_1^T, ..., s_m^T are the rows of the sketching matrix S. This can be considered as
an adaptive form of sketching where the residual directions (b − Axt ) are used after
the random directions s1 , ..., sm . As a consequence, the information-theoretic bounds
we considered in Section 4.4.6 do not apply to iterative sketching. In Pilanci and
Wainwright [23], it is shown that this algorithm achieves the minimax statistical risk
given in (4.16) using at most O(log2 n) iterations while obtaining equivalent speedups
from sketching. We also note that the iterative sketching method can be applied to more general convex optimization problems beyond the least-squares objective. We refer the reader to Pilanci and Wainwright [38] for the application of sketching to solving general convex optimization problems.
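For the unconstrained case C = R^d, the update (4.29) has a closed form, which makes the idea easy to prototype. The snippet below is a minimal illustration of the iterative sketch (a fresh Gaussian sketch per iteration, assumed problem sizes, and a simple scaling convention chosen for this sketch); it is not the reference implementation analyzed in [23].

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, sigma = 2000, 50, 200, 1.0          # assumed sizes
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + sigma * rng.standard_normal(n)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)  # target of the iterations

x = np.zeros(d)
for t in range(10):
    # Fresh sketch of the data; with this scaling E[S^T S] = I_n.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A
    # Unconstrained minimizer of ||SAx||_2^2 - 2 x^T A^T (b - A x_t), cf. (4.29).
    grad = A.T @ (b - A @ x)
    x = x + np.linalg.solve(SA.T @ SA, grad)
    print("iteration", t, "relative error:",
          np.linalg.norm(x - x_ls) / np.linalg.norm(x_ls))
```

With m moderately larger than d, the relative error to the exact least-squares solution decreases geometrically, consistent with the O(log n) iteration count quoted above.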

4.6 Non-Parametric Problems

4.6.1 Non-Parametric Regression


In this section we discuss an extension of the sketching method to non-parametric regres-
sion problems over Hilbert spaces. The goal of non-parametric regression is making pre-
dictions of a continuous response after observing a covariate, where they are related via
$$y_i = f^*(x_i) + v_i, \qquad (4.30)$$
where v ∼ N(0, σ²I_n), and the function f* needs to be estimated from the samples {x_i, y_i}_{i=1}^n. We will consider the well-studied case where the function f* is assumed to belong to a reproducing kernel Hilbert space (RKHS) H, and has a bounded Hilbert norm ‖f*‖_H

[39, 40]. For these regression problems it is customary to consider the kernel ridge regression (KRR) problem based on convex optimization:
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \bigg\{ \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2 \bigg\}. \qquad (4.31)$$

An RKHS is generated by a kernel function which is positive semidefinite (PSD). A PSD kernel is a symmetric function K : X × X → R that satisfies
$$\sum_{i,j=1}^{r} y_i y_j K(x_i, x_j) \ge 0$$
for all collections of points {x_1, ..., x_r}, weights {y_1, ..., y_r}, and all r ∈ Z_+. The vector space of all functions of the form
$$f(\cdot) = \sum_{i=1}^{r} y_i K(\cdot, x_i)$$
generates an RKHS by taking closure of all such linear combinations. It can be shown
that this RKHS is uniquely associated with the kernel function K (see Aronszajn [41] for
details). Let us define a finite-dimensional kernel matrix K using the n covariates as follows:
$$K_{ij} = \frac{1}{n} K(x_i, x_j),$$
which is a positive semidefinite matrix. In the linear least-squares regression, the kernel matrix reduces to the Gram matrix given by K = AA^T. It is also known that the above infinite-dimensional program can be recast as a finite-dimensional quadratic optimization problem involving the kernel matrix:
$$\hat{w} = \arg\min_{w \in \mathbb{R}^n} \frac{1}{2}\big\|Kw - (1/\sqrt{n})\, y\big\|_2^2 + \lambda\, w^T K w \qquad (4.32)$$
$$= \arg\min_{w \in \mathbb{R}^n} \frac{1}{2} w^T K^2 w - w^T \frac{Ky}{\sqrt{n}} + \lambda\, w^T K w, \qquad (4.33)$$
and we can find the optimal solution to the infinite-dimensional problem (4.31) via the following relation:⁹
$$\hat{f}(\cdot) = \frac{1}{n} \sum_{i=1}^{n} \hat{w}_i K(\cdot, x_i). \qquad (4.34)$$

We now define a kernel complexity measure that is based on the eigenvalues of the ker-
nel matrix K. Let λ1 ≥ λ2 ≥ · · · ≥ λn correspond to the real eigenvalues of the symmetric
positive-definite kernel matrix K. The kernel complexity is defined as follows.

⁹ Our definition of the kernel optimization problem differs slightly from the literature. The classical kernel problem can be recovered by the variable change w̃ = K^{1/2} w, where K^{1/2} is the matrix square root. We refer the reader to [40] for more details on kernel-based methods.

definition 4.1 (Kernel complexity)
$$R(\delta) = \sqrt{\sum_{i=1}^{n} \min\{\delta^2, \lambda_i\}},$$
which is the square root of the sum of the eigenvalues truncated at level δ². As in (4.14), we define a critical radius δ*(n) as the smallest positive solution δ*(n) > 0 to the following inequality:
$$\frac{R(\delta)}{\delta \sqrt{n}} \le \frac{\delta}{\sigma}, \qquad (4.35)$$
where σ is the noise standard deviation in the statistical model (4.30). The existence of a
unique solution is guaranteed for all kernel classes (see Bartlett et al. [20]). The critical
radius plays an important role in the minimax risk through an information-theoretic
argument. The next theorem provides a lower bound on the statistical risk of any
estimator applied to the observation model (4.30).
theorem 4.4 Given n i.i.d. samples from the model (4.30), any estimator f̂ has prediction error lower-bounded as
$$\sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\, \frac{1}{n} \sum_{i=1}^{n} \big(\hat{f}(x_i) - f^*(x_i)\big)^2 \;\ge\; c_0\, \delta^*(n)^2, \qquad (4.36)$$
where c_0 is a numerical constant and δ*(n) is the critical radius defined in (4.35).
The lower bound given by Theorem 4.4 can be shown to be tight, and is achieved
by the kernel-based optimization procedure (4.33) and (4.34) (see Bartlett et al. [20]).
The proof of Theorem 4.4 can be found in Yang et al. [42]. We may define the effective
dimension d∗ (n) of the kernel via the relation
d∗ (n) := nδ∗ (n)2 .
This definition allows us to interpret the convergence rate in (4.36) as d∗ (n)/n, which
resembles the classical parametric convergence rate where the number of variables is
d∗ (n).
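The critical radius in (4.35) is straightforward to compute from the kernel eigenvalues. The snippet below is a small illustration (the eigenvalue decay, noise level, and bisection bracket are assumptions made for this example, not taken from the chapter): it finds the smallest δ satisfying (4.35) by bisection and reports the effective dimension d*(n) = nδ*(n)².

```python
import numpy as np

def critical_radius(eigvals, sigma, n, lo=1e-8, iters=100):
    """Smallest delta with R(delta)/(delta*sqrt(n)) <= delta/sigma, cf. (4.35)."""
    hi = max(np.sqrt(float(eigvals.max())), sigma, 1.0)   # large enough that (4.35) holds
    def satisfied(delta):
        R = np.sqrt(np.sum(np.minimum(delta ** 2, eigvals)))
        return R / (delta * np.sqrt(n)) <= delta / sigma
    for _ in range(iters):            # bisection: the condition is monotone in delta
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if satisfied(mid) else (mid, hi)
    return hi

n, sigma = 1000, 0.5
eigvals = 1.0 / (1.0 + np.arange(n)) ** 2     # assumed polynomial eigenvalue decay
delta_star = critical_radius(eigvals, sigma, n)
print("critical radius :", delta_star)
print("effective dim   :", n * delta_star ** 2)
```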

4.6.2 Sketching Kernels


Solving the optimization problem (4.33) becomes a computational challenge when the
sample size n is large, since it involves linear algebraic operations on an n × n matrix K.
There is a large body of literature on approximating kernel matrices using randomized
methods [43–46]. Here we assume that the matrix K is available, and a sketching matrix
S ∈ Rm×n can be applied to form a randomized approximation of the kernel matrix.
We will present an extension of (4.9), which achieves optimal statistical accuracy.
Specifically, the sketching method we consider solves
$$\hat{v} = \arg\min_{v \in \mathbb{R}^m} \frac{1}{2} v^T (SK)(KS^T)\, v - v^T \frac{SKy}{\sqrt{n}} + \lambda\, v^T S K S^T v, \qquad (4.37)$$
which involves smaller-dimensional sketched kernel matrices SK, SKS^T and a lower-dimensional decision variable v ∈ R^m. Then we can recover the original variable via ŵ = S^T v̂. The next theorem shows that the sketched kernel-based optimization method achieves the optimal prediction error.
theorem 4.5 Let S ∈ R^{m×n} be a Gaussian sketching matrix where m ≥ c_3 d_n, and choose λ = 3δ*(n). Given n i.i.d. samples from the model (4.30), the sketching procedure (4.37) produces a regression estimate f̂ which satisfies the bound
$$\frac{1}{n} \sum_{i=1}^{n} \big(\hat{f}(x_i) - f^*(x_i)\big)^2 \;\le\; c_2\, \delta^*(n)^2,$$
where δ*(n) is the critical radius defined in (4.35).


A proof of this theorem can be found in Yang et al. [42]. We note that a similar result
holds for the ROS sketch matrices with extra logarithmic terms in the dimension of
the sketch, i.e., when m ≥ c4 dn log4 (n) holds. Notably, Theorem 4.5 guarantees that the
sketched estimator achieves the optimal error. This is in contrast to the lower-bound
case in Section 4.4.6, where the sketching method does not achieve a minimax optimal
error. This is due to the fact that the sketched problem in (4.37) is using the observation
SKy instead of Sy. Therefore, the lower bound in Section 4.4.6 does not apply for this
construction. It is worth noting that one can formulate the ordinary least-squares case as
a kernel regression problem with kernel K = AAT , and then apply the sketching method
(4.37), which is guaranteed to achieve the minimax optimal risk. However, computing the kernel matrix AA^T would cost O(n²d) operations, which is more than would be required for solving the original least-squares problem.
We note that some kernel approximation methods avoid computing the kernel matrix
K and directly form low-rank approximations. We refer the reader to [43] for an
example, which also provides an error guarantee for the approximate kernel.
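As a concrete illustration of how (4.37) is assembled and solved, the snippet below builds an RBF kernel (the kernel choice, data, and regularization value are assumptions made for illustration), solves the full program (4.33) and the sketched program (4.37), and compares the recovered fits Kŵ. It is a minimal sketch of the construction, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam, sigma = 500, 60, 1e-3, 0.1          # assumed sizes and regularization

x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(3 * np.pi * x) + sigma * rng.standard_normal(n)

# Normalized kernel matrix K_ij = (1/n) K(x_i, x_j) with an RBF kernel (illustrative choice).
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1) / n

# Full program (4.33): stationarity gives (K^2 + 2*lam*K) w = K y / sqrt(n).
w_full = np.linalg.solve(K @ K + 2 * lam * K, K @ y / np.sqrt(n))

# Sketched program (4.37): solve for v, then recover w = S^T v.
S = rng.standard_normal((m, n)) / np.sqrt(m)
SK = S @ K
v = np.linalg.solve(SK @ K @ S.T + 2 * lam * SK @ S.T, SK @ y / np.sqrt(n))
w_sk = S.T @ v

# Compare the fitted values K w (cf. (4.34), up to the chapter's normalization).
print("relative difference of fits:",
      np.linalg.norm(K @ (w_sk - w_full)) / np.linalg.norm(K @ w_full))
```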

4.7 Extensions: Privacy and Communication Complexity

4.7.1 Privacy and Information-Theoretic Bounds


Another interesting property of randomized sketching is privacy preservation in the
context of optimization and learning. Privacy properties of random projections for
various statistical tasks have been studied in the recent literature [10, 11, 47]. It is
of great theoretical and practical interest to characterize fundamental privacy and
optimization trade-offs of randomized algorithms. We first show the relation between
sketching and a mutual information-based privacy measure.

4.7.2 Mutual Information Privacy


Suppose we model the data matrix A ∈ Rn×d as stochastic, where each entry is drawn
randomly. One way we can assess the information revealed to the server is by considering the mutual information per symbol, which is given by the formula

$$\frac{I(SA;\, A)}{nd} = \frac{1}{nd}\big\{H(A) - H(A \mid SA)\big\} = \frac{1}{nd}\, D\big(P_{SA,A} \,\big\|\, P_{SA} P_{A}\big),$$
where we normalize by nd since the data matrix A has nd entries in total. The following
corollary is a direct application of Theorem 4.1.
corollary 4.1 Let the entries of the matrix A be i.i.d. from an arbitrary distribution with finite variance σ². Using sketched data, we can obtain an ε-approximate¹⁰ solution to the optimization problem while ensuring that the revealed mutual information satisfies
$$\frac{I(SA;\, A)}{nd} \le \frac{c_0\, W^2(AK)}{\epsilon^2\, n} \log_2\big(2\pi e \sigma^2\big).$$
Therefore, we can guarantee the mutual information privacy of the sketching-based
methods, whenever the term W(AK) is small.
An alternative and popular characterization of privacy is referred to as the differential
privacy (see Dwork et al. [9]), where other randomized methods, such as additive noise
for preserving privacy, were studied. It is also possible to directly analyze differential
privacy-preserving aspects of random projections as considered in Blocki et al. [10].

4.7.3 Optimization-Based Privacy Attacks


We briefly discuss a possible approach an adversary might take to circumvent the
privacy provided by sketching. If the data matrix is sparse, then one might consider
optimization-based recovery techniques borrowed from compressed sensing to recover the data A given the sketched data Ã = SA:
$$\min_{A} \|A\|_1 \quad \text{s.t.} \quad SA = \tilde{A},$$
where we have used the matrix ℓ₁ norm $\|A\|_1 := \sum_{i=1}^{n}\sum_{j=1}^{d} |A_{ij}|$. The success of the
above optimization method will critically depend on the sparsity level of the original
data A. Most of the randomized sketching constructions shown in Section 4.2 can be
shown to be susceptible to data recovery via optimization (see Candès and Tao [25]
and Candès et al. [48]). However, this method assumes that the sketching matrix S is
available to the attacker. If S is not available to the adversary, then the above method
cannot be used and the recovery is not straightforward.
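For a sparse data matrix and a known sketching matrix S, the recovery program above separates across the columns of A, and each column can be recovered by a standard basis-pursuit linear program. The snippet below is a rough illustration of this point using scipy.optimize.linprog (the problem sizes and sparsity level are assumptions chosen so that recovery is likely to succeed); it is not a statement about any particular privacy mechanism.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, d, m, k = 60, 5, 30, 4                  # assumed sizes; columns of A are k-sparse

A = np.zeros((n, d))
for j in range(d):                         # sparse data matrix
    idx = rng.choice(n, size=k, replace=False)
    A[idx, j] = rng.standard_normal(k)

S = rng.standard_normal((m, n))
A_tilde = S @ A                            # sketched data available to the adversary

# Recover each column a by min ||a||_1 s.t. S a = A_tilde[:, j],
# written as an LP over (u, v) with a = u - v and u, v >= 0.
A_hat = np.zeros_like(A)
c = np.ones(2 * n)
Aeq = np.hstack([S, -S])
for j in range(d):
    res = linprog(c, A_eq=Aeq, b_eq=A_tilde[:, j],
                  bounds=[(0, None)] * (2 * n), method="highs")
    A_hat[:, j] = res.x[:n] - res.x[n:]

print("max recovery error:", np.abs(A_hat - A).max())
```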

4.7.4 Communication Complexity-Space Lower Bounds


In this section we consider a streaming model of computation, where the algorithm
is allowed to make only one pass over the data. In this model, an algorithm receives
updates to the entries of the data matrix A in the form “add a to Ai j .” An entry can

¹⁰ Here, an ε-approximate solution refers to the approximation defined in Theorem 4.1, relative to the optimal value.

be updated more than once, and the value a is any arbitrary real number. The sketches
introduced in this chapter provide a valuable data structure when the matrix is very
large in size, and storing and updating the matrix directly can be impractical. Owing
to the linearity of sketches, we can update the sketch SA by adding a·(Se_i)e_j^T to SA, and
maintain an approximation with limited memory.
The following theorem due to Clarkson and Woodruff [49] provides a lower bound
of the space used by any algorithm for least-squares regression which performs a single
pass over the data.
theorem 4.6 Any randomized 1-pass algorithm which returns an ε-approximate solution to the unconstrained least-squares problem with probability at least 7/9 needs Ω(d²(1/ε + log(nd))) bits of space.
This theorem confirms that the space complexity of sketching for unconstrained
least-squares regression is near optimal. Because of the choice of the sketching
dimension m = O(d), the space used by the sketch SA is O(d2 ), which is optimal up to
constants according to the theorem.
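The rank-one update described above takes O(m) time per entry update and never materializes A. A minimal sketch of the idea follows (the update stream and sizes are assumptions made for illustration; the dense copy of A is kept only to verify the invariant).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 1000, 20, 50                     # assumed sizes
S = rng.standard_normal((m, n))

SA = np.zeros((m, d))                      # sketch maintained with O(md) memory
A = np.zeros((n, d))                       # kept here only to check SA == S @ A

for _ in range(5000):                      # stream of updates "add a to A_ij"
    i, j, a = rng.integers(n), rng.integers(d), rng.standard_normal()
    A[i, j] += a
    SA[:, j] += a * S[:, i]                # SA <- SA + a * (S e_i) e_j^T

print("invariant holds:", np.allclose(SA, S @ A))
```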

4.8 Numerical Experiments

In this section, we illustrate the sketching method numerically and confirm the theoreti-
cal predictions of Theorems 4.1 and 4.3. We consider both the classical low-dimensional
statistical regime where n > d, and the ℓ₁-constrained least-squares minimization known as the LASSO (see Tibshirani [50]):
$$x^* = \arg\min_{x\,:\, \|x\|_1 \le \lambda} \|Ax - b\|_2.$$

We generate a random i.i.d. data matrix A ∈ R^{n×d}, where n = 10 000 and d = 1000, and set the observation vector b = Ax† + σw, where x† ∈ {−1, 0, 1}^d is a random s-sparse vector and w has i.i.d. N(0, 10⁻⁴) components. For the sketching matrix S ∈ R^{m×n}, we consider the Gaussian and Rademacher (i.i.d. ±1-valued) random matrices, where m ranges between 10 and 400. Consequently, we solve the sketched program
$$\hat{x} = \arg\min_{x\,:\, \|x\|_1 \le \lambda} \|SAx - Sb\|_2.$$
Figures 4.5 and 4.6 show the relative prediction mean-squared error given by the ratio
$$\frac{(1/n)\|A(\hat{x} - x^\dagger)\|_2^2}{(1/n)\|A(x^* - x^\dagger)\|_2^2},$$
where the ratio is averaged over 20 realizations of the sketching matrix, and x̂ and x* are
the sketched and the original solutions, respectively. As predicted by the upper and
lower bounds given in Theorems 4.1 and 4.3, the prediction mean-squared error of
the sketched estimator scales as O((s log d)/m), since the corresponding Gaussian
complexity W1 (K)2 is O(s log d). These plots reveal that the prediction mean-squared
error of the sketched estimators for both Gaussian and Rademacher sketches are in
agreement with the theory.
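The following code is a scaled-down version of this experiment (smaller n, d and a simple projected-gradient solver written for this illustration; the chapter's figures were produced at larger scale), comparing the sketched and original constrained solutions through the ratio above.

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection onto the l1-ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_ls(M, y, radius, iters=500):
    """Projected gradient for min ||Mx - y||_2^2 s.t. ||x||_1 <= radius."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2
    x = np.zeros(M.shape[1])
    for _ in range(iters):
        x = project_l1(x - step * M.T @ (M @ x - y), radius)
    return x

rng = np.random.default_rng(5)
n, d, s, m = 2000, 200, 5, 100            # scaled-down sizes (the chapter uses n = 10000, d = 1000)
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.choice([-1.0, 1.0], s)
b = A @ x_true + 1e-2 * rng.standard_normal(n)
radius = np.abs(x_true).sum()

x_star = constrained_ls(A, b, radius)      # original constrained solution
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat = constrained_ls(S @ A, S @ b, radius)  # sketched solution

num = np.sum((A @ (x_hat - x_true)) ** 2)
den = np.sum((A @ (x_star - x_true)) ** 2)
print("relative prediction error:", num / den)
```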

Figure 4.5 Sketching LASSO using Gaussian random projections: relative prediction error versus sketch dimension m, for s = 5, 10, 20.

Figure 4.6 Sketching LASSO using Rademacher random projections: relative prediction error versus sketch dimension m, for s = 5, 10, 20.

4.9 Conclusion

This chapter presented an overview of random projection-based methods for solving large-scale statistical estimation and constrained optimization problems. We
investigated fundamental lower bounds on the performance of sketching using
information-theoretic tools. Randomized sketching has interesting theoretical proper-
ties, and also has numerous practical advantages in machine-learning and optimization
128 Mert Pilanci

problems. Sketching yields faster algorithms with lower space complexity while
maintaining strong approximation guarantees.
For the upper bound on the approximation accuracy in Theorem 4.2, Gaussian
complexity plays an important role, and also provides a geometric characterization of
the dimension of the sketch. The lower bounds given in Theorem 4.3 are statistical in
nature, and involve packing numbers, and consequently metric entropy, which measures
the complexity of the sets. The upper bounds on the Gaussian sketch can be extended
to Rademacher sketches, sub-Gaussian sketches, and randomized orthogonal system
sketches (see Pilanci and Wainwright [22] and also Yang et al. [42] for the proofs). How-
ever, the results for non-Gaussian sketches often involve superfluous logarithmic factors
and large constants as artifacts of the analysis. As can be observed in Figs. 4.5 and 4.6,
the mean-squared error curves for Gaussian and Rademacher sketches are in agreement
with each other. It can be conjectured that the approximation ratio of sketching is uni-
versal for random matrices with entries sampled from well-behaved distributions. This
is an important theoretical question for future research. We refer the reader to the work
of Donoho and Tanner [51] for observations of the universality in compressed sensing.
Finally, a number of important limitations of the analysis techniques need to be
considered. The minimax criterion (4.16) is a worst-case criterion by virtue of its definition, and may not correctly reflect the average error of sketching when the unknown vector x† is randomly distributed. Furthermore, in some applications, it might be suitable to consider prior information on the unknown vector. As a direction of future research, it would be interesting to study lower bounds for sketching in a Bayesian setting.

A4.1 Proof of Theorem 4.3

Let us define the shorthand notation ‖·‖_A := (1/√n)‖A(·)‖₂. Let {z^j}_{j=1}^M be a 1/2-packing of C_0 ∩ B_A(1) in the semi-norm defined by ‖·‖_A, and, for a fixed δ ∈ (0, 1/4), define x^j = 4δ z^j. Since 4δ ∈ (0, 1), the star-shaped assumption guarantees that each x^j belongs to C_0. We thus obtain a collection of M vectors in C_0 such that
$$2\delta \le \|x^j - x^k\|_A \le 8\delta \quad \text{for all } j \ne k.$$
Letting J be a random index uniformly distributed over {1, ..., M}, suppose that, conditionally on J = j, we observe the sketched observation vector Sb = SAx^j + Sw, as well as the sketched matrix SA. Conditioned on J = j, the random vector Sb follows an N(SAx^j, σ²SS^T) distribution, denoted by P_{x^j}. We let Ȳ denote the resulting mixture variable, with distribution (1/M)∑_{j=1}^{M} P_{x^j}.
Consider the multi-way testing problem of determining the index J by observing Ȳ. With this setup, we may apply Lemma 4.3 (see, e.g., [30, 46]), which implies that, for any estimator x̂, the worst-case mean-squared error is lower-bounded as
$$\sup_{x^* \in C} \mathbb{E}_{S,w}\, \|\hat{x} - x^*\|_A^2 \ge \delta^2 \inf_{\psi} \mathbb{P}[\psi(\bar{Y}) \ne J], \qquad (4.38)$$
where the infimum ranges over all testing functions ψ. Consequently, it suffices to show that the testing error is lower-bounded by 1/2.

In order to do so, we first apply Fano’s inequality [27] conditionally on the sketching matrix S and get
$$\mathbb{P}[\psi(\bar{Y}) \ne J] = \mathbb{E}_S\big[\mathbb{P}[\psi(\bar{Y}) \ne J \mid S]\big] \ge 1 - \frac{\mathbb{E}_S\big[I_S(\bar{Y}; J)\big] + 1}{\log_2 M}, \qquad (4.39)$$
where I_S(Ȳ; J) denotes the mutual information between Ȳ and J with S fixed. Our next step is to upper-bound the expectation E_S[I(Ȳ; J)].
Letting D(P_{x^j} ‖ P_{x^k}) denote the Kullback–Leibler (KL) divergence between the distributions P_{x^j} and P_{x^k}, the convexity of the KL divergence implies that
$$I_S(\bar{Y}; J) = \frac{1}{M}\sum_{j=1}^{M} D\bigg(P_{x^j} \,\bigg\|\, \frac{1}{M}\sum_{k=1}^{M} P_{x^k}\bigg) \le \frac{1}{M^2}\sum_{j,k=1}^{M} D\big(P_{x^j} \,\big\|\, P_{x^k}\big).$$

Computing the KL divergence for Gaussian vectors yields
$$I_S(\bar{Y}; J) \le \frac{1}{M^2}\sum_{j,k=1}^{M} \frac{1}{2\sigma^2}\, (x^j - x^k)^T A^T S^T (SS^T)^{-1} S\, A (x^j - x^k).$$
Thus, using condition (4.15), we have
$$\mathbb{E}_S[I(\bar{Y}; J)] \le \frac{1}{M^2}\sum_{j,k=1}^{M} \frac{m\,\eta}{2 n \sigma^2}\, \|A(x^j - x^k)\|_2^2 \le \frac{32\, m\, \eta}{\sigma^2}\, \delta^2,$$
where the final inequality uses the fact that ‖x^j − x^k‖_A = (1/√n)‖A(x^j − x^k)‖₂ ≤ 8δ for all pairs.
Combined with our previous bounds (4.38) and (4.39), we find that
$$\sup_{x^* \in C} \mathbb{E}\, \|\hat{x} - x^*\|_A^2 \ge \delta^2 \left(1 - \frac{32\big(m\,\eta\,\delta^2/\sigma^2\big) + 1}{\log_2 M}\right).$$
Setting δ² = σ² log₂(M/2)/(64 η m) yields the lower bound (4.19).

A4.2 Proof of Lemma 4.3

By Markov’s inequality applied to the random variable ‖x̂ − x†‖_A², we have
$$\mathbb{E}\,\|\hat{x} - x^\dagger\|_A^2 \ge \delta^2\, \mathbb{P}\big[\|\hat{x} - x^\dagger\|_A^2 \ge \delta^2\big]. \qquad (4.40)$$
Now note that
$$\sup_{x^\dagger \in C} \mathbb{P}\big[\|\hat{x} - x^\dagger\|_A \ge \delta\big] \ge \max_{j \in \{1, \ldots, M\}} \mathbb{P}\big[\|\hat{x} - x^{(j)}\|_A \ge \delta \mid J_\delta = j\big] \ge \frac{1}{M}\sum_{j=1}^{M} \mathbb{P}\big[\|\hat{x} - x^{(j)}\|_A \ge \delta \mid J_\delta = j\big], \qquad (4.41)$$

since every element of the packing set satisfies x^(j) ∈ C and the discrete maximum is lower-bounded by the average over {1, ..., M}. Since we have P[J_δ = j] = 1/M, we equivalently have
$$\frac{1}{M}\sum_{j=1}^{M} \mathbb{P}\big[\|\hat{x} - x^{(j)}\|_A \ge \delta \mid J_\delta = j\big] = \sum_{j=1}^{M} \mathbb{P}\big[\|\hat{x} - x^{(j)}\|_A \ge \delta \,\big|\, J_\delta = j\big]\, \mathbb{P}[J_\delta = j] = \mathbb{P}\big[\|\hat{x} - x^{(J_\delta)}\|_A \ge \delta\big]. \qquad (4.42)$$

Now we will argue that, whenever the true index is J_δ = j and ‖x̂ − x^(j)‖_A < δ, we can form a hypothesis test ψ(Z) identifying the true index j. Consider the test
$$\psi(Z) := \arg\min_{j \in [M]} \|x^{(j)} - \hat{x}\|_A.$$
Now note that ‖x^(j) − x̂‖_A < δ ensures that
$$\|x^{(i)} - \hat{x}\|_A \ge \|x^{(i)} - x^{(j)}\|_A - \|x^{(j)} - \hat{x}\|_A \ge 2\delta - \delta = \delta,$$
where the second inequality follows from the 2δ-packing construction of our collection x^(1), ..., x^(M). Consequently ‖x^(i) − x̂‖_A ≥ δ for all i ∈ {1, ..., M} \ {j}, and the test ψ(Z) identifies the true index J_δ = j. Therefore we obtain
$$\big\{\|x^{(j)} - \hat{x}\|_A < \delta\big\} \;\Rightarrow\; \{\psi(Z) = j\},$$
and conclude that the complements of these events obey
$$\mathbb{P}\big[\|x^{(j)} - \hat{x}\|_A \ge \delta \mid J_\delta = j\big] \ge \mathbb{P}\big[\psi(Z) \ne j \mid J_\delta = j\big].$$
Taking averages over the indices 1, ..., M, we obtain
$$\mathbb{P}\big[\|x^{(J_\delta)} - \hat{x}\|_A \ge \delta\big] = \frac{1}{M}\sum_{j=1}^{M} \mathbb{P}\big[\|x^{(j)} - \hat{x}\|_A \ge \delta \mid J_\delta = j\big] \ge \mathbb{P}\big[\psi(Z) \ne J_\delta\big].$$

Combining the above with the earlier lower bound (4.41) and the identity (4.42), we obtain
$$\sup_{x^\dagger \in C} \mathbb{P}\big[\|\hat{x} - x^\dagger\|_A \ge \delta\big] \ge \mathbb{P}\big[\psi(Z) \ne J_\delta\big] \ge \inf_{\psi} \mathbb{P}\big[\psi(Z) \ne J_\delta\big],$$
where the second inequality follows by taking the infimum over all tests, which can only make the probability smaller. Plugging the above lower bound into (4.40) completes the proof of the lemma.

References

[1] S. Vempala, The random projection method. American Mathematical Society, 2004.
[2] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Univer-
sal encoding strategies?” IEEE Trans. Information Theory, vol. 52, no. 12, pp. 5406–5425,
2006.

[3] N. Halko, P. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilis-
tic algorithms for constructing approximate matrix decompositions,” SIAM Rev., vol. 53,
no. 2, pp. 217–288, 2011.
[4] M. W. Mahoney, Randomized algorithms for matrices and data. Now Publishers, 2011.
[5] D. P. Woodruff, “Sketching as a tool for numerical linear algebra,” Foundations and Trends
Theoretical Computer Sci., vol. 10, nos. 1–2, pp. 1–157, 2014.
[6] S. Muthukrishnan, “Data streams: Algorithms and applications,” Foundations and Trends
Theoretical Computer Sci., vol. 1, no. 2, pp. 117–236, 2005.
[7] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift in Honor of Lucien Le Cam. Springer,
1997, pp. 423–435.
[8] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in
private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
[10] J. Blocki, A. Blum, A. Datta, and O. Sheffet, “The Johnson–Lindenstrauss transform
itself preserves differential privacy,” in Proc. 2012 IEEE 53rd Annual Symposium on
Foundations of Computer Science, 2012, pp. 410–419.
[11] N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform,” in Proc. 38th Annual ACM Symposium on Theory of Computing,
2006, pp. 557–563.
[12] P. Drineas and M. W. Mahoney, “Effective resistances, statistical leverage, and applications
to linear equation solving,” arXiv:1005.3097, 2010.
[13] D. A. Spielman and N. Srivastava, “Graph sparsification by effective resistances,” SIAM J.
Computing, vol. 40, no. 6, pp. 1913–1926, 2011.
[14] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in
International Colloquium on Automata, Languages, and Programming, 2002, pp. 693–703.
[15] D. M. Kane and J. Nelson, “Sparser Johnson–Lindenstrauss transforms,” J. ACM, vol. 61,
no. 1, article no. 4, 2014.
[16] J. Nelson and H. L. Nguyên, “Osnap: Faster numerical linear algebra algorithms via sparser
subspace embeddings,” in Proc. 2013 IEEE 54th Annual Symposium on Foundations of
Computer Science (FOCS), 2013, pp. 117–126.
[17] J. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms.
Springer, 1993, vol. 1.
[18] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[19] M. Ledoux and M. Talagrand, Probability in Banach spaces: Isoperimetry and processes.
Springer, 1991.
[20] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local Rademacher complexities,” Annals
Statist., vol. 33, no. 4, pp. 1497–1537, 2005.
[21] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of
linear inverse problems,” Foundations Computational Math., vol. 12, no. 6, pp. 805–849,
2012.
[22] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp
guarantees,” UC Berkeley, Technical Report, 2014, full-length version at arXiv:1404.7203;
Presented in part at ISIT 2014.
[23] M. Pilanci and M. J. Wainwright, “Iterative Hessian sketch: Fast and accurate solution
approximation for constrained least-squares,” J. Machine Learning Res., vol. 17, no. 1,
pp. 1842–1879, 2016.

[24] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,”


SIAM J. Sci. Computing, vol. 20, no. 1, pp. 33–61, 1998.
[25] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Information
Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[26] R. M. Fano and W. Wintringham, “Transmission of information,” Phys. Today, vol. 14,
p. 56, 1961.
[27] T. Cover and J. Thomas, Elements of information theory. John Wiley & Sons, 1991.
[28] P. Assouad, “Deux remarques sur l’estimation,” Comptes Rendus Acad. Sci. Paris, vol.
296, pp. 1021–1024, 1983.
[29] I. A. Ibragimov and R. Z. Has’minskii, Statistical estimation: Asymptotic theory. Springer,
1981.
[30] L. Birgé, “Estimating a density under order restrictions: Non-asymptotic minimax risk,”
Annals Statist., vol. 15, no. 3, pp. 995–1012, 1987.
[31] A. Kolmogorov and B. Tikhomirov, “ε-entropy and ε-capacity of sets in functional
spaces,” Uspekhi Mat. Nauk, vol. 86, pp. 3–86, 1959, English transl. Amer. Math. Soc.
Translations, vol. 17, pp. 277–364, 1961.
[32] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser.
B, vol. 58, no. 1, pp. 267–288, 1996.
[33] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax rates of estimation for high-
dimensional linear regression over ℓq-balls,” IEEE Trans. Information Theory, vol. 57,
no. 10, pp. 6976–6994, 2011.
[34] N. Srebro, N. Alon, and T. S. Jaakkola, “Generalization error bounds for collaborative
prediction with low-rank matrices,” in Proc. Advances in Neural Information Processing
Systems, 2005, pp. 1321–1328.
[35] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped
variables,” J. Roy. Statist. Soc. B, vol. 1, no. 68, p. 49, 2006.
[36] S. Negahban and M. J. Wainwright, “Estimation of (near) low-rank matrices with noise
and high-dimensional scaling,” Annals Statist., vol. 39, no. 2, pp. 1069–1097, 2011.
[37] F. Bunea, Y. She, and M. Wegkamp, “Optimal selection of reduced rank estimators of
high-dimensional matrices,” Annals Statist., vol. 39, no. 2, pp. 1282–1309, 2011.
[38] M. Pilanci and M. J. Wainwright, “Newton sketch: A near linear-time optimization
algorithm with linear-quadratic convergence,” SIAM J. Optimization, vol. 27, no. 1,
pp. 205–245, 2017.
[39] H. L. Weinert, (ed.), Reproducing kernel hilbert spaces: Applications in statistical signal
processing. Hutchinson Ross Publishing Co., 1982.
[40] B. Schölkopf and A. Smola, Learning with kernels. MIT Press, 2002.
[41] N. Aronszajn, “Theory of reproducing kernels,” Trans. Amer. Math. Soc., vol. 68,
pp. 337–404, 1950.
[42] Y. Yang, M. Pilanci, and M. J. Wainwright, “Randomized sketches for kernels: Fast and
optimal nonparametric regression,” Annals Statist., vol. 45, no. 3, pp. 991–1023, 2017.
[43] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Proc.
Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
[44] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimiza-
tion with randomization in learning,” in Proc. Advances in Neural Information Processing
Systems, 2009, pp. 1313–1320.
[45] P. Drineas and M. W. Mahoney, “On the Nyström method for approximating a Gram
matrix for improved kernel-based learning,” J. Machine Learning Res., vol. 6, no. 12,
pp. 2153–2175, 2005.

[46] Q. Le, T. Sarlós, and A. Smola, “Fastfood – approximating kernel expansions in loglinear
time,” in Proc. 30th International Conference on Machine Learning, 2013, 9 unnumbered
pages.
[47] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed regression,” IEEE Trans.
Information Theory, vol. 55, no. 2, pp. 846–866, 2009.
[48] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information,” IEEE Trans. Information
Theory, vol. 52, no. 2, pp. 489–509, 2004.
[49] K. L. Clarkson and D. P. Woodruff, “Numerical linear algebra in the streaming model,” in
Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009,
pp. 205–214.
[50] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc.
Ser. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[51] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional
geometry, with implications for modern data analysis and signal processing,” Phil. Trans.
Roy. Soc. London A: Math., Phys. Engineering Sci., vol. 367, no. 1906, pp. 4273–4293,
2009.
5 Sample Complexity Bounds for
Dictionary Learning from Vector- and
Tensor-Valued Data
Zahra Shakeri, Anand D. Sarwate, and Waheed U. Bajwa

Summary

During the last decade, dictionary learning has emerged as one of the most powerful
methods for data-driven extraction of features from data. While the initial focus on dic-
tionary learning had been from an algorithmic perspective, recent years have seen an
increasing interest in understanding the theoretical underpinnings of dictionary learning.
Many such results rely on the use of information-theoretic analytic tools and help us to
understand the fundamental limitations of different dictionary-learning algorithms. This
chapter focuses on the theoretical aspects of dictionary learning and summarizes existing
results that deal with dictionary learning both from vector-valued data and from tensor-
valued (i.e., multi-way) data, which are defined as data having multiple modes. These
results are primarily stated in terms of lower and upper bounds on the sample complexity
of dictionary learning, defined as the number of samples needed to identify or recon-
struct the true dictionary underlying data from noiseless or noisy samples, respectively.
Many of the analytic tools that help yield these results come from the information-theory
literature; these include restating the dictionary-learning problem as a channel coding
problem and connecting the analysis of minimax risk in statistical estimation to Fano’s
inequality. In addition to highlighting the effects of different parameters on the sample
complexity of dictionary learning, this chapter also brings out the potential advantages
of dictionary learning from tensor data and concludes with a set of open problems that
remain unaddressed for dictionary learning.

5.1 Introduction

Modern machine learning and signal processing relies on finding meaningful and suc-
cinct representations of data. Roughly speaking, data representation entails transforming
“raw” data from its original domain to another domain in which it can be processed
more effectively and efficiently. In particular, the performance of any information-
processing algorithm is dependent on the representation it is built on [1]. There are
two major approaches to data representation. In model-based approaches, a predeter-
mined basis is used to transform data. Such a basis can be formed using predefined


transforms such as the Fourier transform [2], wavelets [3], and curvelets [4]. The
data-driven approach infers transforms from the data to yield efficient representations.
Prior works on data representation show that data-driven techniques generally outper-
form model-based techniques as the learned transformations are tuned to the input
signals [5, 6].
Since contemporary data are often high-dimensional and high-volume, we need
efficient algorithms to manage them. In addition, rapid advances in sensing and data-
acquisition technologies in recent years have resulted in individual data samples or
signals with multimodal structures. For example, a single observation may contain
measurements from a two-dimensional array over time, leading to a data sample
with three modes. Such data are often termed tensors or multi-way arrays [7]. Spe-
cialized algorithms can take advantage of this tensor structure to handle multimodal
data more efficiently. These algorithms represent tensor data using fewer parame-
ters than vector-valued data-representation methods by means of tensor decomposi-
tion techniques [8–10], resulting in reduced computational complexity and storage
costs [11–15].
In this chapter, we focus on data-driven representations. As data-collection sys-
tems grow and proliferate, we will need efficient data representations for processing,
storage, and retrieval. Data-driven representations have successfully been used for sig-
nal processing and machine-learning tasks such as data compression, recognition, and
classification [5, 16, 17]. From a theoretical standpoint, there are several interest-
ing questions surrounding data-driven representations. Assuming there is an unknown
generative model forming a “true” representation of data, these questions include the
following. (1) What algorithms can be used to learn the representation effectively? (2)
How many data samples are needed in order for us to learn the representation? (3) What
are the fundamental limits on the number of data samples needed in order for us to
learn the representation? (4) How robust are the solutions addressing these questions to
parameters such as noise and outliers? In particular, state-of-the-art data-representation
algorithms have excellent empirical performance but their non-convex geometry makes
analyzing them challenging.
The goal of this chapter is to provide a brief overview of some of the aforementioned
questions for a class of data-driven-representation methods known as dictionary learning
(DL). Our focus here will be both on the vector-valued and on the tensor-valued (i.e.,
multidimensional/multimodal) data cases.

5.1.1 Dictionary Learning: A Data-Driven Approach to Sparse Representations


Data-driven methods have a long history in representation learning and can be divided
into two classes. The first class includes linear methods, which involve transform-
ing (typically vector-valued) data using linear functions to exploit the latent structure
in data [5, 18, 19]. From a geometric point of view, these methods effectively learn
a low-dimensional subspace and projection of data onto that subspace, given some
constraints. Examples of classical linear approaches for vector-valued data include

principal component analysis (PCA) [5], linear discriminant analysis (LDA) [18], and
independent component analysis (ICA) [19].
The second class consists of nonlinear methods. Despite the fact that historically lin-
ear representations have been preferred over nonlinear methods because of their lesser
computational complexity, recent advances in available analytic tools and computational
power have resulted in an increased interest in nonlinear representation learning. These
techniques have enhanced performance and interpretability compared with linear tech-
niques. In nonlinear methods, data are transformed into a higher-dimensional space,
in which they lie on a low-dimensional manifold [6, 20–22]. In the world of nonlin-
ear transformations, nonlinearity can take different forms. In manifold-based methods
such as diffusion maps, data are projected onto a nonlinear manifold [20]. In kernel
(nonlinear) PCA, data are projected onto a subspace in a higher-dimensional space [21].
Auto-encoders encode data according to the desired task [22]. DL uses a union of sub-
spaces as the underlying geometric structure and projects input data onto one of the
learned subspaces in the union. This leads to sparse representations of the data, which
can be represented in the form of an overdetermined matrix multiplied by a sparse vec-
tor [6]. Although nonlinear representation methods result in non-convex formulations,
we can often take advantage of the problem structure to guarantee the existence of a
unique solution and hence an optimal representation.
Focusing specifically on DL, it is known to have slightly higher computational com-
plexity than linear methods, but it surpasses their performance in applications such as
image denoising and inpainting [6], audio processing [23], compressed sensing [24],
and data classification [17, 25]. More specifically, given input training signals y ∈ Rm ,
the goal in DL is to construct a basis such that y ≈ Dx. Here, D ∈ Rm×p is denoted as
the dictionary that has unit-norm columns and x ∈ R p is the dictionary coefficient vector
that has a few non-zero entries. While the initial focus in DL had been on algorith-
mic development for various problem setups, works in recent years have also provided
fundamental analytic results that help us understand the fundamental limits and perfor-
mance of DL algorithms for both vector-valued [26–33] and tensor-valued [12, 13, 15]
data.
There are two paradigms in the DL literature: the dictionary can be assumed to be a
complete or an overcomplete basis (effectively, a frame [34]). In both cases, columns of
the dictionary span the entire space [27]; in complete dictionaries, the dictionary matrix
is square (m = p), whereas in overcomplete dictionaries the matrix has more columns
than rows (m < p). In general, overcomplete representations result in more flexibility to
allow both sparse and accurate representations [6].

5.1.2 Chapter Outline


In this chapter, we are interested in summarizing key results in learning of overcomplete
dictionaries. We group works according to whether the data are vector-valued (one-
dimensional) or tensor-valued (multidimensional). For both these cases, we focus on
works that provide fundamental limits on the sample complexity for reliable dictionary

Figure 5.1 A graphical representation of the scope of this chapter in relation to the literature on representation learning: within data-driven, nonlinear representation methods, the chapter focuses on overcomplete dictionary learning for vector-valued and tensor data.

estimation, i.e., the number of observations that are necessary to recover the true
dictionary that generates the data up to some predefined error. The main information-
theoretic tools that are used to derive these results range from reformulating the
dictionary-learning problem as a channel coding problem to connecting the minimax
risk analysis to Fano’s inequality. We refer the reader to Fig. 5.1 for a graphical overview
of the relationship of this chapter to other themes in representation learning.
We address the DL problem for vector-valued data in Section 5.2, and that for ten-
sor data in Section 5.3. Finally, we talk about extensions of these works and some open
problems in DL in Section 5.4. We focus here only on the problems of identifiability and
fundamental limits; in particular, we do not survey DL algorithms in depth apart from
some brief discussion in Sections 5.2 and 5.3. The monograph of Okoudjou [35] dis-
cusses algorithms for vector-valued data. Algorithms for tensor-valued data are relatively
more recent and are described in our recent paper [13].

5.2 Dictionary Learning for Vector-Valued Data

We first address the problem of reliable estimation of dictionaries underlying data that
have a single mode, i.e., are vector-valued. In particular, we focus on the subject of the
sample complexity of the DL problem from two prospectives: (1) fundamental limits
on the sample complexity of DL using any DL algorithm, and (2) the numbers of sam-
ples that are needed for different DL algorithms to reliably estimate a true underlying
dictionary that generates the data.

5.2.1 Mathematical Setup


In the conventional vector-valued dictionary learning setup, we are given a total number N of vector-valued samples, {y_n ∈ R^m}_{n=1}^N, that are assumed to be generated from a fixed dictionary, D⁰, according to the following model:
$$y_n = D^0 x_n + w_n, \qquad n = 1, \ldots, N. \qquad (5.1)$$
Here, D⁰ ∈ R^{m×p} is a (deterministic) unit-norm frame (m < p) that belongs to the following compact set:¹
$$D^0 \in \mathcal{D} \triangleq \big\{ D \in \mathbb{R}^{m \times p} : \|D_j\|_2 = 1 \;\; \forall j \in \{1, \ldots, p\} \big\}, \qquad (5.2)$$
and is referred to as the generating, true, or underlying dictionary. The vector xn ∈ R p is
the coefficient vector that lies in some set X ⊆ R p , and wn ∈ Rm denotes the observation
noise. Concatenating the observations into a matrix Y ∈ Rm×N , their corresponding coef-
ficient vectors into X ∈ R p×N , and noise vectors into W ∈ Rm×N , we get the following
generative model:
Y = D0 X + W. (5.3)
Various works in the DL literature impose different conditions on the coefficient vectors
{xn } to define the set X. The most common assumption is that xn is sparse with one
of several probabilistic models for generating sparse xn . In contrast to exact sparsity,
some works consider approximate sparsity and assume that xn satisfies some decay pro-
file [38], while others assume group sparsity conditions for xn [39]. The latter condition
comes up implicitly in DL for tensor data, as we discuss in Section 5.3. Similarly, exist-
ing works consider a variety of noise models, the most common being Gaussian white
noise. Regardless of the assumptions on coefficient and noise vectors, all of these works
assume that the observations are independent for n = 1, 2, . . . , N.
We are interested here in characterizing when it is possible to recover the true dictio-
nary D0 from observations Y. There is an inherent ambiguity in dictionary recovery:
reordering the columns of D0 or multiplying any column by −1 yields a dictionary
which can generate the same Y (with appropriately modified X). Thus, each dictionary
is equivalent to 2^p p! other dictionaries. To measure the distance between dictionaries,
we can either define the distance between equivalence classes of dictionaries or consider
errors within a local neighborhood of a fixed D0 , where the ambiguity can potentially
disappear.
The specific criterion that we focus on is sample complexity, defined as the number
of observations necessary to recover the true dictionary up to some predefined error.
The measure of closeness of the recovered dictionary and the true dictionary can be
defined in several ways. One approach is to compare the representation error of these
dictionaries. Another measure is the mean-squared error (MSE) between the estimated
and generating dictionary, defined as

¹ A frame F ∈ R^{m×p}, m ≤ p, is defined as a collection of vectors {F_i ∈ R^m}_{i=1}^p in some separable Hilbert space H that satisfy c₁‖v‖₂² ≤ Σ_{i=1}^p |⟨F_i, v⟩|² ≤ c₂‖v‖₂² for all v ∈ H and for some constants c₁ and c₂ such that 0 < c₁ ≤ c₂ < ∞. If c₁ = c₂, then F is a tight frame [36, 37].

$$\mathbb{E}_Y\Big[ d\big(\widehat{D}(Y), D^0\big)^2 \Big], \qquad (5.4)$$
where d(·, ·) is some distance metric and D̂(Y) is the recovered dictionary according to
observations Y. For example, if we restrict the analysis to a local neighborhood of the
generating dictionary, then we can use the Frobenius norm as the distance metric.
We now discuss an optimization approach to solving the dictionary recovery problem.
Understanding the objective function within this approach is the key to understanding
the sample complexity of DL. Recall that solving the DL problem involves using the observations to estimate a dictionary D̂ such that D̂ is close to D⁰. In the ideal case, the objective function involves solving the statistical risk minimization problem as follows:
$$\widehat{D} \in \arg\min_{D \in \mathcal{D}}\; \mathbb{E}\bigg[ \inf_{x \in \mathcal{X}} \frac{1}{2}\|y - Dx\|_2^2 + R(x) \bigg]. \qquad (5.5)$$

Here, R(·) is a regularization operator that enforces the pre-specified structure, such
as sparsity, on the coefficient vectors. Typical choices for this parameter include
functions of ‖x‖₀ or its convex relaxation, ‖x‖₁.² However, solving (5.5) requires knowl-
edge of exact distributions of the problem parameters as well as high computational
power. Hence, works in the literature resort to algorithms that solve the empirical risk
minimization (ERM) problem [40]:
$$\widehat{D} \in \arg\min_{D \in \mathcal{D}} \Bigg\{ \sum_{n=1}^{N} \inf_{x_n \in \mathcal{X}} \frac{1}{2}\big\|y_n - D x_n\big\|_2^2 + R(x_n) \Bigg\}. \qquad (5.6)$$

In particular, to provide analytic results, many estimators solve this problem in lieu
of (5.5) and then show that the solution of (5.6) is close to (5.5).
There are a number of computational algorithms that have been proposed to solve
(5.6) directly for various regularizers, or indirectly using heuristic approaches. One of
the most popular heuristic approaches is the K-SVD algorithm, which can be thought
of as solving (5.6) with ℓ₀-norm regularization [6]. There are also other methods such
as the method of optimal directions (MOD) [41] and online DL [25] that solve (5.6)
with convex regularizers. While these algorithms have been known to perform well in
practice, attention has shifted in recent years to theoretical studies to (1) find the funda-
mental limits of solving the statistical risk minimization problem in (5.5), (2) determine
conditions on objective functions like (5.6) to ensure recovery of the true dictionary, and
(3) characterize the number of samples needed for recovery using either (5.5) or (5.6). In
this chapter, we are also interested in understanding the sample complexity for the DL
statistical risk minimization and ERM problems. We summarize such results in the exist-
ing literature for the statistical risk minimization of DL in Section 5.2.2 and for the ERM
problem in Section 5.2.3. Because the measure of closeness or error differs between
these theoretical results, the corresponding sample complexity bounds are different.

² The so-called ℓ₀-norm counts the number of non-zero entries of a vector; it is not a norm.
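To make the ERM setup concrete, the following toy sketch generates data from the model (5.3) and runs a simple alternating-minimization scheme in the spirit of MOD. The sparse-coding step here is a crude thresholded least-squares surrogate, and all sizes, noise levels, and iteration counts are illustrative assumptions; this is not one of the algorithms analyzed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(6)
m, p, s, N = 20, 40, 3, 2000                 # assumed sizes: m x p dictionary, s-sparse codes

# Generate data from the model (5.3): Y = D0 X + W.
D0 = rng.standard_normal((m, p))
D0 /= np.linalg.norm(D0, axis=0)
X = np.zeros((p, N))
for n in range(N):
    X[rng.choice(p, s, replace=False), n] = rng.standard_normal(s)
Y = D0 @ X + 0.01 * rng.standard_normal((m, N))

def sparse_code(D, Y, s):
    """Toy sparse coder: keep the s most correlated atoms, then least squares."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    for n in range(Y.shape[1]):
        supp = np.argsort(-np.abs(D.T @ Y[:, n]))[:s]
        coef, *_ = np.linalg.lstsq(D[:, supp], Y[:, n], rcond=None)
        X[supp, n] = coef
    return X

D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)
for it in range(30):
    Xh = sparse_code(D, Y, s)
    err = np.linalg.norm(Y - D @ Xh) / np.linalg.norm(Y)
    D = Y @ np.linalg.pinv(Xh)               # MOD-style dictionary update
    norms = np.linalg.norm(D, axis=0)
    D /= np.where(norms > 0, norms, 1.0)     # re-normalize columns, guard unused atoms
print("final relative representation error:", err)
```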

remark 5.1 In this section, we assume that the data are available in a batch, central-
ized setting and the dictionary is deterministic. In the literature, DL algorithms have
been proposed for other settings such as streaming data, distributed data, and Bayesian
dictionaries [42–45]. Discussion of these scenarios is beyond the scope of this chapter.
In addition, some works have looked at ERM problems that are different from (5.6). We
briefly discuss these works in Section 5.4.

5.2.2 Minimax Lower Bounds on the Sample Complexity of DL


In this section, we study the fundamental limits on the accuracy of the dictionary recov-
ery problem that is achievable by any DL method in the minimax setting. Specifically,
we wish to understand the behavior of the best estimator that achieves the lowest worst-
case MSE among all possible estimators. We define the error of such an estimator as the
minimax risk, which is formally defined as
  2

ε∗ = inf sup EY d D(Y), D . (5.7)

D(Y) 
D∈D

Note that the minimax risk does not depend on any specific DL method and provides a
lower bound for the error achieved by any estimator.
The first result we present pertains to lower bounds on the minimax risk, i.e., minimax
lower bounds, for the DL problem using the Frobenius norm as the distance metric
between dictionaries. The result is based on the following assumption.
A1.1 (Local recovery) The true dictionary lies in the neighborhood of a fixed, known reference dictionary,³ D* ∈ D, i.e., D⁰ ∈ D̃, where
$$\widetilde{\mathcal{D}} = \big\{ D \mid D \in \mathcal{D},\; \|D - D^*\|_F \le r \big\}. \qquad (5.8)$$
The range for the neighborhood radius r in (5.8) is (0, 2√p]. This conditioning comes from the fact that, for any D, D′ ∈ D, ‖D − D′‖_F ≤ ‖D‖_F + ‖D′‖_F = 2√p. By restricting
dictionaries to this class, for small enough r, ambiguities that are a consequence of using
the Frobenius norm can be prevented. We also point out that any lower bound on ε∗ is
also a lower bound on the global DL problem.
theorem 5.1 (Minimax lower bounds [33]) Consider a DL problem for vector-valued data with N i.i.d. observations and true dictionary D satisfying assumption A1.1 for some r ∈ (0, 2√p]. Then, for any coefficient distribution with mean zero and covariance matrix Σ_x, and white Gaussian noise with mean zero and variance σ², the minimax risk ε* is lower-bounded as
$$\varepsilon^* \ge c_1 \min\bigg\{ r^2,\; \frac{\sigma^2}{N \|\Sigma_x\|_2}\big(c_2\, p(m-1) - 1\big) \bigg\}, \qquad (5.9)$$
for some positive constants c_1 and c_2.

³ The use of a reference dictionary is an artifact of the proof technique and, for sufficiently large r, D̃ ≈ D.
Sample Complexity Bounds for Dictionary Learning 141

Theorem 5.1 holds both for square and for overcomplete dictionaries. To obtain
this lower bound on the minimax risk, a standard information-theoretic approach is
taken in [33] to reduce the dictionary-estimation problem to a multiple-hypothesis-testing problem. In this technique, given fixed D and r, and L ∈ N, a packing D_L = {D^1, D^2, ..., D^L} ⊆ D̃ of D̃ is constructed. The distance of the packing is chosen to ensure a tight lower bound on the minimax risk. Given observations Y = D^l X + W, where D^l ∈ D_L and the index l is chosen uniformly at random from {1, ..., L}, and any estimation algorithm that recovers a dictionary D̂(Y), a minimum-distance detector can be used to find the recovered dictionary index l̂ ∈ {1, ..., L}. Then, Fano’s inequality can be used to relate the probability of error, i.e., P(l̂(Y) ≠ l), to the mutual information between the observations and the dictionary (equivalently, the dictionary index l), i.e., I(Y; l) [46].
Let us assume that r is sufficiently large that the minimum on the right-hand side of (5.9) is attained by the second term. In this case, Theorem 5.1 states that, to achieve any error ε ≥ ε*, we need the number of samples to be on the order of⁴
$$N = \Omega\bigg( \frac{\sigma^2\, m p}{\|\Sigma_x\|_2\, \varepsilon} \bigg).$$
Hence, the lower bound on the minimax risk of DL can be translated to a lower bound
on the number of necessary samples, as a function of the desired dictionary error. This
can further be interpreted as a lower bound on the sample complexity of the dictionary
recovery problem.
We can also specialize this result to sparse coefficient vectors. Assume xn has up to s
non-zero elements, and the random support of the non-zero elements of xn is assumed to
be uniformly distributed over the set {S ⊆ {1, . . . , p} : |S| = s}, for n = {1, . . . , N}. Assum-
ing that the non-zero entries of xn are i.i.d. with variance σ2x , we get Σ x = (s/p)σ2x I p .
Therefore, for sufficiently large r, the sample complexity scaling to achieve any error ε
becomes Ω((σ2 mp2 )/(σ2x sε)). In this special case, it can be seen that, in order to achieve
a fixed error ε, the sample complexity scales with the number of degrees of freedom of
the dictionary multiplied by the number of dictionary columns, i.e., N = Ω(mp2 ). There
is also an inverse dependence on the sparsity level s. Defining the signal-to-noise ratio
of the observations as SNR = (sσ2x )/(mσ2 ), this can be interpreted as an inverse relation-
ship with the SNR. Moreover, if all parameters except the data dimension, m, are fixed,
increasing m requires a linear increase in N. Evidently, this linear relation is limited by
the fact that m ≤ p has to hold in order to maintain completeness or overcompleteness of
the dictionary: increasing m by a large amount requires increasing p also.
While the tightness of this result remains an open problem, Jung et al. [33] have
shown that for a special class of square dictionaries that are perturbations of the identity
matrix, and for sparse coefficients following a specific distribution, this result is order-
wise tight. In other words, a square dictionary that is perturbed from the identity matrix
can be recovered from this sample size order. Although this result does not extend to
overcomplete dictionaries, it suggests that the lower bounds may be tight.

4 We use f (n) = Ω(g(n)) and f (n) = O(g(n)) if, for sufficiently large n ∈ N, f (n) > c1 g(n) and f (n) < c2 g(n),
respectively, for some positive constants c1 and c2 .

Finally, while distance metrics that are invariant to dictionary ambiguities have
been used for achievable overcomplete dictionary recovery results [30, 31], obtaining
minimax lower bounds for DL using these distance metrics remains an open problem.
In this subsection, we discussed the number of necessary samples for reliable dictio-
nary recovery (the sample complexity lower bound). In the next subsection, we focus
on achievability results, i.e., the number of sufficient samples for reliable dictionary
recovery (the sample complexity upper bound).

5.2.3 Achievability Results


The preceding lower bounds on minimax risk hold for any estimator or computational
algorithm. However, the proofs do not provide an understanding of how to construct
effective estimators and provide little intuition about the potential performance of
practical estimation techniques. In this section, we direct our attention to explicit recon-
struction methods and their sample complexities that ensure reliable recovery of the
underlying dictionary. Since these achievability results are tied to specific algorithms
that are guaranteed to recover the true dictionary, the sample complexity bounds from
these results can also be used to derive upper bounds on the minimax risk. As we will
see later, there remains a gap between the lower bound and the upper bound on the mini-
max risk. Alternatively, one can interpret the sample complexity lower bound and upper
bound as the necessary number of samples and sufficient number of samples for reliable
dictionary recovery, respectively. In the following, we focus on identifiability results: the
estimation procedures are not required to be computationally efficient.
One of the first achievability results for DL was derived in [27, 28] for square matri-
ces. Since then, a number of works have been carried out for overcomplete DL involving
vector-valued data [26, 29–32, 38]. These works differ from each other in terms of their
assumptions on the true underlying dictionary, the dictionary coefficients, the presence
or absence of noise and outliers, the reconstruction objective function, the distance met-
ric used to measure the accuracy of the solution, and the local or global analysis of
the solution. In this section, we summarize a few of these results in terms of various
assumptions on the noise and outliers and provide a brief overview of the landscape of
these results in Table 5.1. We begin our discussion with achievability results for DL for
the case where Y is exactly given by Y = D0 X, i.e., the noiseless setting.
Before proceeding, we provide a definition and an assumption that will be used for
the rest of this section. We note that the constants that are used in the presented theorems
change from one result to another.

(Worst-case coherence) For any dictionary D ∈ D, its worst-case coherence is defined as μ(D) = max_{i≠j} |⟨D_i, D_j⟩|, where μ(D) ∈ (0, 1) [36].
(Random support of sparse coefficient vectors) For any xn that has up to s non-zero elements, the support of the non-zero elements of xn is assumed to be distributed uniformly at random over the set {S ⊆ {1, …, p} : |S| = s}, for n ∈ {1, …, N}.
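As a quick illustration of these two definitions, the following Python sketch (illustrative only; the dictionary here is just a random matrix with normalized columns, not one from any cited work) computes the worst-case coherence of a unit-norm dictionary and draws one support uniformly at random.

import numpy as np

rng = np.random.default_rng(1)
m, p, s = 16, 32, 4                        # illustrative dimensions (assumed)

D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)             # unit-norm columns, so D belongs to the dictionary class D

G = np.abs(D.T @ D)                        # absolute inner products |<D_i, D_j>|
np.fill_diagonal(G, 0.0)                   # exclude i = j
print("worst-case coherence mu(D) =", G.max())

support = np.sort(rng.choice(p, size=s, replace=False))   # uniform over {S subset of {1,...,p} : |S| = s}
print("random support S =", support)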

Noiseless Recovery
We begin by discussing the first work that proves local identifiability of the overcomplete
DL problem. The objective function that is considered in that work is
 
    (D̂, X̂) = arg min_{D ∈ D, X} ‖X‖₁   subject to   Y = DX,        (5.10)

where ‖X‖₁ ≜ Σ_{i,j} |X_{i,j}| denotes the sum of the absolute values of the entries of X.
This result is based on the following set of assumptions.
A2.1 (Gaussian random coefficients) The values of the non-zero entries of the xn s are independent Gaussian random variables with zero mean and common standard deviation σ_x = √(p/(sN)).
A2.2 (Sparsity level) The sparsity level satisfies s ≤ min{c₁/μ(D0), c₂ p} for some constants c₁ and c₂.
theorem 5.2 (Noiseless, local recovery [29]) There exist positive constants c₁, c₂ such that if assumptions A2.1 and A2.2 are satisfied for the true (X, D0), then (X, D0) is a local minimum of (5.10) with high probability.
The probability in this theorem depends on various problem parameters and implies that N = Ω(sp³) samples are sufficient for the desired solution, i.e., the true dictionary and coefficient matrix, to be locally recoverable. The proof of this theorem relies on studying the local properties of (5.10) around its optimal solution and does not require defining a distance metric.
We now present a result that is based on the use of a combinatorial algorithm, which can provably and exactly recover the true dictionary. The proposed algorithm solves the objective function (5.6) with R(x) = λ‖x‖₁, where λ is the regularization parameter, and the distance metric that is used is the column-wise distance. Specifically, for two dictionaries D1 and D2, their column-wise distance is defined as

    d(D1_j, D2_j) = min_{l ∈ {−1,1}} ‖D1_j − l D2_j‖₂,   j ∈ {1, …, p},        (5.11)

where D1_j and D2_j are the jth columns of D1 and D2, respectively. This distance metric avoids the sign ambiguity among dictionaries belonging to the same equivalence class.
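The following few lines of Python (purely illustrative) evaluate the column-wise distance (5.11), resolving the sign ambiguity column by column.

import numpy as np

def columnwise_distance(D1, D2):
    """d(D1_j, D2_j) = min over l in {-1, +1} of ||D1_j - l D2_j||_2, for every column j."""
    d_plus = np.linalg.norm(D1 - D2, axis=0)    # l = +1
    d_minus = np.linalg.norm(D1 + D2, axis=0)   # l = -1
    return np.minimum(d_plus, d_minus)

rng = np.random.default_rng(0)
D1 = rng.standard_normal((8, 12))
D1 /= np.linalg.norm(D1, axis=0)
D2 = D1 * np.sign(rng.standard_normal(12))      # same dictionary up to column sign flips
print(columnwise_distance(D1, D2).max())        # 0.0: the metric is blind to sign flips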
To solve (5.6), Agarwal et al. [30] provide a novel DL algorithm that consists of an
initial dictionary-estimation stage and an alternating minimization stage to update the
dictionary and coefficient vectors. The provided guarantees are based on using this algo-
rithm to update the dictionary and coefficients. The forthcoming result is based on the
following set of assumptions.
A3.1 (Bounded random coefficients) The non-zero entries of the xn s are drawn from a zero-mean unit-variance distribution and their magnitudes satisfy x_min ≤ |x^n_i| ≤ x_max.
A3.2 (Sparsity level) The sparsity level satisfies s ≤ min{c₁/√(μ(D0)), c₂ m^{1/9}, c₃ p^{1/8}} for some positive constants c₁, c₂, c₃ that depend on x_min, x_max, and the spectral norm of D0.
A3.3 (Dictionary assumptions) The true dictionary has bounded spectral norm, i.e., ‖D0‖₂ ≤ c₄ √(p/m), for some positive constant c₄.

theorem 5.3 (Noiseless, exact recovery [30]) Consider a DL problem with N i.i.d. observations and assume that assumptions A3.1–A3.3 are satisfied. Then, there exists a universal constant c such that, for given η > 0, if

    N ≥ c (x_max/x_min)² p² log(2p/η),        (5.12)

there exists a procedure consisting of an initial dictionary-estimation stage and an alternating minimization stage such that, after T = O(log(1/ε)) iterations of the second stage, with probability at least 1 − 2η − 2ηN², d(D̂, D0) ≤ ε, ∀ε > 0.
This theorem guarantees that the true dictionary can be recovered to arbitrary precision given N = Ω(p² log p) samples. This result is based on two steps. The first step is guaranteeing an error bound for the initial dictionary-estimation step. This step involves using a clustering-style algorithm to approximate the dictionary columns. The second step is proving a local convergence result for the alternating minimization stage. This step involves improving estimates of the coefficient vectors and the dictionary through Lasso [47] and least-squares steps, respectively. More details on this work can be found in [30].
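To make the alternating structure concrete, here is a minimal Python sketch of a generic alternating-minimization DL loop of the kind analyzed above. It is not the algorithm of [30]: the sparse-coding step below uses plain iterative soft-thresholding, the dictionary update is a least-squares fit, and all parameter values are arbitrary.

import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_code(Y, D, lam, n_iter=100):
    """Lasso-type sparse coding step via iterative soft-thresholding (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2               # Lipschitz constant of the quadratic term
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        X = soft_threshold(X - (D.T @ (D @ X - Y)) / L, lam / L)
    return X

def dictionary_update(Y, X, eps=1e-10):
    """Least-squares dictionary update followed by column normalization."""
    D = Y @ X.T @ np.linalg.pinv(X @ X.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), eps)

def alt_min_dl(Y, D_init, lam=0.1, n_outer=20):
    D = D_init.copy()
    for _ in range(n_outer):
        X = sparse_code(Y, D, lam)              # update coefficients with D fixed
        D = dictionary_update(Y, X)             # update dictionary with X fixed
    return D, X

rng = np.random.default_rng(0)
Y = rng.standard_normal((16, 200))              # stand-in data, not drawn from a planted model
D_init = rng.standard_normal((16, 32))
D_hat, X_hat = alt_min_dl(Y, D_init / np.linalg.norm(D_init, axis=0))

The normalization step keeps every iterate inside the set D of unit-norm-column dictionaries, which is the constraint imposed in (5.6).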
While the works in [29, 30] study the sample complexity of the overcomplete DL
problem, they do not take noise into account. Next, we present works that obtain sample
complexity results for noisy reconstruction of dictionaries.

Noisy Reconstruction
The next result we discuss is based on the following objective function:
    max_{D ∈ D} (1/N) Σ_{n=1}^{N} max_{|S|=s} ‖P_S(D) yn‖₂²,        (5.13)

where P_S(D) denotes the projection onto the span of D_S = (D_j)_{j∈S}.⁵ Here, the distance metric that is used is d(D1, D2) = max_{j∈{1,…,p}} ‖D1_j − D2_j‖₂. In addition, the results are based on the following set of assumptions.
A4.1 (Unit-norm tight frame) The true dictionary is a unit-norm tight frame, i.e., for all v ∈ R^m we have Σ_{j=1}^{p} ⟨D0_j, v⟩² = p‖v‖₂²/m.
5 This objective function can be thought of as a manipulation of (5.6) with the ℓ₀-norm regularizer for the coefficient vectors. See equation 2 of [38] for more details.

A4.2 (Lower isometry constant) The lower isometry constant of D0, defined as δ_s(D0) ≜ max_{|S|≤s} δ_S(D0), with 1 − δ_S(D0) denoting the minimal eigenvalue of D0_S^T D0_S, satisfies δ_s(D0) ≤ 1 − (s/m).
A4.3 (Decaying random coefficients) The coefficient vector xn is drawn from a symmetric decaying probability distribution ν on the unit sphere S^{p−1}.⁶
A4.4 (Bounded random noise) The vector wn is a bounded random white-noise vector satisfying ‖wn‖₂ ≤ M_w almost surely, E{wn} = 0, and E{wn wn^∗} = ρ² I_m.
A4.5 (Maximal projection constraint) Define c(xn) to be the non-increasing rearrangement of the absolute values of xn. Given a sign sequence l ∈ {−1, 1}^p and a permutation operator π : {1, …, p} → {1, …, p}, define c_{π,l}(xn) whose ith element is equal to l_i c(xn)_{π(i)} for i ∈ {1, …, p}. There exists κ > 0 such that, for c(xn) and S_π ≜ π^{−1}({1, …, s}), we have

    ν( min_{π,l} ‖P_{S_π}(D0) D0 c_{π,l}(xn)‖₂ − max_{|S|=s, S≠S_π} ‖P_S(D0) D0 c_{π,l}(xn)‖₂ ≥ 2κ + 2M_w ) = 1.        (5.15)

theorem 5.4 (Noisy, local recovery [38]) Consider a DL problem with N i.i.d. obser-
vations and assume that assumptions A4.1–A4.5 are satisfied. If, for some 0 < q < 1/4,
the number of samples satisfies

    2N^{−q} + N^{−2q} ≤ c₁ (1 − δ_s(D0)) / ( √s ( 1 + c₂ log( c₃ p √(ps) / ( c₄ s (1 − δ_s(D0)) ) ) ) ),        (5.16)

then, with high probability, there is a local maximum of (5.13) within distance at most 2N^{−q} of D0.
The constants c₁, c₂, c₃, and c₄ in Theorem 5.4 depend on the underlying dictionary, the coefficient vectors, and the underlying noise. The proof of this theorem relies on the fact that, for the true dictionary and its perturbations, the maximal response, i.e., ‖P_S(D̃) D0 xn‖₂,⁷ is attained for the set S = S_π for most signals. A detailed explanation of the theorem and its proof can be found in the paper of Schnass [38].
In order to understand Theorem 5.4, let us set q ≈ 1/4 − (log p)/(log N). We can then understand this theorem as follows. Given N/log N = Ω(mp³), except with probability O(N^{−mp}), there is a local maximum of (5.13) within distance O(pN^{−1/4}) of the true dictionary. Moreover, since the objective function that is considered in this work is also the one optimized by the K-SVD algorithm, this result gives an understanding of the performance of the K-SVD algorithm. Compared with results with R(x) being a function of the ℓ₁-norm [29, 30], this result requires the true dictionary to be a tight frame. On the flip side,

6 A probability measure ν on the unit sphere S^{p−1} is called symmetric if, for all measurable sets X ⊆ S^{p−1}, for all sign sequences l ∈ {−1, 1}^p, and all permutations π : {1, …, p} → {1, …, p}, we have

    ν(lX) = ν(X), where lX = {(l₁x₁, …, l_p x_p) : x ∈ X},
    ν(π(X)) = ν(X), where π(X) = {(x_{π(1)}, …, x_{π(p)}) : x ∈ X}.        (5.14)
7 D̃ can be D0 itself or some perturbation of D0.

the coefficient vector in Theorem 5.4 is not necessarily sparse; instead, it only has to
satisfy a decaying condition.
Next, we present a result obtained by Arora et al. [31] that is similar to that of
Theorem 5.3 in the sense that it uses a combinatorial algorithm that can provably recover
the true dictionary given noiseless observations. It further obtains dictionary reconstruc-
tion results for the case of noisy observations. The objective function considered in this
work is similar to that of the K-SVD algorithm and can be thought of as (5.6) with
R(x) = λ‖x‖₀, where λ is the regularization parameter.
Arora et al. [31] define two dictionaries D1 and D2 to be column-wise ε-close if there exist a permutation π and l ∈ {−1, 1} such that ‖D1_j − l D2_{π(j)}‖₂ ≤ ε. This distance metric captures the distance between equivalence classes of dictionaries and avoids the sign-permutation ambiguity. They propose a DL algorithm that first uses combinatorial
sign-permutation ambiguity. They propose a DL algorithm that first uses combinatorial
techniques to recover the support of coefficient vectors, by clustering observations into
overlapping clusters that use the same dictionary columns. To find these large clusters,
a clustering algorithm is provided. Then, the dictionary is roughly estimated given the
clusters, and the solution is further refined. The provided guarantees are based on using
the proposed DL algorithm. In addition, the results are based on the following set of
assumptions.
A5.1 (Bounded coefficient distribution) Non-zero entries of xn are drawn from a zero-mean distribution and lie in [−x_max, −1] ∪ [1, x_max], where x_max = O(1). Moreover, conditioned on any subset of coordinates of xn being non-zero, the non-zero values of x^n_i are independent of each other. Finally, the distribution has bounded 3-wise moments, i.e., the probability that xn is non-zero in any subset S of three coordinates is at most c³ times ∏_{i∈S} P(x^n_i ≠ 0), where c = O(1).⁸
A5.2 (Gaussian noise) The wn s are independent and follow a spherical Gaussian distribution with standard deviation σ = o(√m).
A5.3 (Dictionary coherence) The true dictionary is μ̃-incoherent, that is, for all i ≠ j, |⟨D0_i, D0_j⟩| ≤ μ̃(D0)/√m and μ̃(D0) = O(log(m)).
A5.4 (Sparsity level) The sparsity level satisfies s ≤ c₁ min{p^{2/5}, √m/(μ̃(D0) log m)}, for some positive constant c₁.
theorem 5.5 (Noisy, exact recovery [31]) Consider a DL problem with N i.i.d. observations and assume that assumptions A5.1–A5.4 are satisfied. Provided that

    N = Ω( σ² ε^{−2} p log p ( p/s² + s² + log(1/ε) ) ),        (5.17)

there is a universal constant c₁ and a polynomial-time algorithm that learns the underlying dictionary. With high probability, this algorithm returns D̂ that is column-wise ε-close to D0.

8 This condition is trivially satisfied if the set of the locations of non-zero entries of xn is a random subset
of size s.

For desired error ε, the run-time of the algorithm and the sample complexity depend on log(1/ε). With the addition of noise, there is also a dependence on ε^{−2} for N, which is inevitable for noisy reconstruction of the true dictionary [31, 38]. In the noiseless setting, this result translates into N = Ω( p log p ( p/s² + s² + log(1/ε) ) ).
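The column-wise ε-closeness used in Theorem 5.5 is invariant to column permutations and sign flips. One simple way to evaluate it numerically (an illustration only, not the procedure used in [31]) is to match columns by solving a linear assignment problem over sign-resolved column distances; note that the assignment minimizes the summed cost, which is used here merely as a proxy for minimizing the maximum.

import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_sign_distance(D1, D2):
    """Match columns of D2 to columns of D1 up to signs, then report the largest matched distance."""
    plus = np.linalg.norm(D1[:, :, None] - D2[:, None, :], axis=0)   # ||D1_i - D2_j||_2
    minus = np.linalg.norm(D1[:, :, None] + D2[:, None, :], axis=0)  # ||D1_i + D2_j||_2
    cost = np.minimum(plus, minus)                                   # best sign per pair (i, j)
    row, col = linear_sum_assignment(cost)        # permutation minimizing the summed cost (a proxy for the max)
    return cost[row, col].max()

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 12))
D /= np.linalg.norm(D, axis=0)
D_hat = D[:, rng.permutation(12)] * np.sign(rng.standard_normal(12)) + 0.01 * rng.standard_normal((8, 12))
print(permutation_sign_distance(D, D_hat))        # small: D_hat is close to D up to permutation and signs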

Noisy Reconstruction with Outliers


In some scenarios, in addition to observations Y drawn from D0 , we encounter obser-
vations Yout that are not generated according to D0 . We call such observations outliers
(as opposed to inliers). In this case, the observation matrix is Yobs = [Y, Yout ], where
Y is the inlier matrix and Yout is the outlier matrix. In this part, we study the robust-
ness of dictionary identification in the presence of noise and outliers. The following
result studies (5.6) with R(x) = λx1 , where λ is the regularization parameter. Here, the
Frobenius norm is considered as the distance metric. In addition, the result is based on
the following set of assumptions.
A6.1 (Cumulative coherence) The cumulative coherence of the true dictionary D0 ,
which is defined as
 
0
μ s (D0 )  sup supD0T 
S Dj , (5.18)
|S|≤s jS 1

satisfies μ s (D0 ) ≤ 1/4.


A6.2 (Bounded random coefficients) Assume non-zero entries of xn are drawn i.i.d.
from a distribution with absolute mean E{|x|} and variance E x2 . We denote
ln = sign(xn ).9 Dropping the index of xn and ln for simplicityof notation,  the fol-

lowing assumptions are satisfied for the coefficient vector: E xS xTS |S = E x2 I s ,
   
E xS lTS |S = E{|x|}I s , E lS lTS |S = I s , x2 ≤ M x , and mini∈S |xi | ≥ xmin . We define
  
κ x  E{|x|}/ E x2 as a measure of the flatness of x. Moreover, the following
inequality is satisfied:
 
E x2    
D0 2 + 1 D0T D0 − IF ,
cs
> (5.19)
M x E{|x|} (1 − 2μ s (D0 ))p
where c is a positive constant.
A6.3 (Regularization parameter) The regularization parameter satisfies λ ≤ x_min/4.
A6.4 (Bounded random noise) Assume the non-zero entries of wn are drawn i.i.d. from a distribution with mean 0 and variance E{w²}. Dropping the index of vectors for simplicity, w is a bounded random white-noise vector satisfying E{w w^T | S} = E{w²} I_m, E{w x^T | S} = E{w l^T | S} = 0, and ‖w‖₂ ≤ M_w. Furthermore, denoting λ̄ ≜ λ/E{|x|},

    M_w / M_x ≤ (7/2) (c_max − c_min) λ̄,        (5.20)

where c_min and c_max depend on problem parameters such as s, the coefficient distribution, and D0.
A6.5 (Sparsity level) The sparsity level satisfies s ≤ p / ( 16 (‖D0‖₂ + 1)² ).
A6.6 (Radius range) The error radius ε > 0 satisfies ε ∈ (λ̄ c_min, λ̄ c_max).
9 The sign of the vector v is defined as l = sign(v), whose elements are l_i = v_i/|v_i| for v_i ≠ 0 and l_i = 0 for v_i = 0, where i denotes any index of the elements of v.
A6.7 (Outlier energy) Given the inlier matrix Y = {yn}_{n=1}^{N} and the outlier matrix Yout = {yn}_{n=1}^{N_out}, the energy of Yout satisfies

    ‖Yout‖_{1,2} / N ≤ ( c₁ ε √s E{x²} / (λ̄ E{|x|} p) ) (A0/p)^{3/2} [ 1 − c_min λ̄/ε − c₂ √((mp + η)/N) ],        (5.21)

where ‖Yout‖_{1,2} denotes the sum of the ℓ₂-norms of the columns of Yout, c₁ and c₂ are positive constants, independent of parameters, and A0 is the lower frame bound of D0, i.e., A0 ‖v‖₂² ≤ ‖D0^T v‖₂² for any v ∈ R^m.
theorem 5.6 (Noisy with outliers, local recovery [32]) Consider a DL problem with N i.i.d. observations and assume that assumptions A6.1–A6.6 are satisfied. Suppose

    N > c₀ (mp + η) p² ( M_x² / E{‖x‖₂²} )² ( ( ε + M_w/M_x + λ̄ + (M_w/M_x + λ̄)² ) / ( ε − c_min λ̄ ) )².        (5.22)

Then, with probability at least 1 − 2^{−η}, (5.6) admits a local minimum within distance ε of D0. In addition, this result is robust to the addition of the outlier matrix Yout, provided that the assumption A6.7 is satisfied.
The proof of this theorem relies on using the Lipschitz continuity property of the objective function in (5.6) with respect to the dictionary and sample complexity analysis using Rademacher averages and Slepian's Lemma [48]. Theorem 5.6 implies that

    N = Ω( ( mp³ + ηp² ) ( M_w / (M_x ε) )² )        (5.23)

samples are sufficient for the existence of a local minimum within distance ε of the true dictionary D0, with high probability. In the noiseless setting, this result translates into N = Ω(mp³), and the sample complexity becomes independent of the radius ε. Furthermore, this result applies to overcomplete dictionaries with dimensions p = O(m²).

5.2.4 Summary of Results


In this section, we have discussed DL minimax risk lower bounds [33] and achievability
results [29–32, 38]. These results differ in terms of the distance metric they use. An
interesting question arises here. Can these results be unified so that the bounds can be
directly compared with one another? Unfortunately, the answer to this question is not as
straightforward as it seems, and the inability to unify them is a limitation that we discuss
in Section 5.4. A summary of the general scaling of the discussed results for sample
complexity of (overcomplete) dictionary learning is provided in Table 5.1. We note that
these are general scalings that ignore other technicalities. Here, the provided sample

complexity results depend on the presence or absence of noise and outliers. All the
presented results require that the underlying dictionary satisfies incoherence conditions
in some way. For a one-to-one comparison of these results, the bounds for the case of
absence of noise and outliers can be compared. A detailed comparison of the noiseless
recovery for square and overcomplete dictionaries can be found in Table I of [32].

5.3 Dictionary Learning for Tensors

Many of today’s data are collected using various sensors and tend to have a multidi-
mensional/tensor structure (Fig. 5.2). Examples of tensor data include (1) hyperspectral
images that have three modes, two spatial and one spectral; (2) colored videos that have
four modes, two spatial, one depth, and one temporal; and (3) dynamic magnetic res-
onance imaging in a clinical trial that has five modes, three spatial, one temporal, and
one subject. To find representations of tensor data using DL, one can follow two paths.
A naive approach is to vectorize tensor data and use traditional vectorized representa-
tion learning techniques. A better approach is to take advantage of the multidimensional
structure of data to learn representations that are specific to tensor data. While the main
focus of the literature on representation learning has been on the former approach, recent
works have shifted focus to the latter approaches [8–11]. These works use various tensor
decompositions to decompose tensor data into smaller components. The representation
learning problem can then be reduced to learning the components that represent different
modes of the tensor. This results in a reduction in the number of degrees of freedom in
the learning problem, due to the fact that the dimensions of the representations learned
for each mode are significantly smaller than the dimensions of the representation learned
for the vectorized tensor. Consequently, this approach gives rise to compact and efficient
representation of tensors.
To understand the fundamental limits of dictionary learning for tensor data, one can
use the sample complexity results in Section 5.2, which are a function of the underlying
dictionary dimensions. However, considering the reduced number of degrees of freedom
in the tensor DL problem compared with vectorized DL, this problem should be solv-
able with a smaller number of samples. In this section, we formalize this intuition and
address the problem of reliable estimation of dictionaries underlying tensor data. As in
the previous section, we will focus on the subject of sample complexity of the DL prob-
lem from two perspectives: (1) fundamental limits on the sample complexity of DL for
tensor data using any DL algorithm, and (2) the numbers of samples that are needed for
different DL algorithms in order to reliably estimate the true dictionary from which the
tensor data are generated.

5.3.1 Tensor Terminology


A tensor is defined as a multi-way array, and the tensor order is defined as the number
of components of the array. For instance, X ∈ R p1 ×···×pK is a Kth-order tensor. For K = 1
and K = 2, the tensor is effectively a vector and a matrix, respectively. In order to
Table 5.1. Summary of the sample complexity results of various works

Jung et al. [33]: distance metric ‖D1 − D2‖_F; regularizer ℓ0; sparse coefficient distribution: non-zero entries i.i.d. zero-mean with variance σ_x²; sparsity level –; noise i.i.d. ∼ N(0, σ); outliers –; local analysis; sample complexity mp²/ε².

Geng et al. [29]: distance metric –; regularizer ℓ1; sparse coefficient distribution: non-zero entries i.i.d. ∼ N(0, σ_x); sparsity level O(min{1/μ, p}); noise –; outliers –; local analysis; sample complexity sp³.

Agarwal et al. [30]: distance metric min_{l∈{±1}} ‖D1_j − lD2_j‖₂; regularizer ℓ1; sparse coefficient distribution: non-zero entries zero-mean unit-variance with x_min ≤ |x_i| ≤ x_max; sparsity level O(min{1/√μ, m^{1/9}, p^{1/8}}); noise –; outliers –; global analysis; sample complexity p² log p.

Schnass et al. [38]: distance metric max_j ‖D1_j − D2_j‖₂; regularizer ℓ0; coefficient distribution: symmetric decaying (non-sparse); sparsity level O(1/μ); noise E{w} = 0, E{ww^∗} = ρ²I_m, ‖w‖₂ ≤ M_w; outliers –; local analysis; sample complexity mp³/ε².

Arora et al. [31]: distance metric min_{l∈{±1}} ‖D1_j − lD2_{π(j)}‖₂; regularizer ℓ0; sparse coefficient distribution: non-zero entries zero-mean with x_i ∈ ±[1, x_max]; sparsity level O(min{1/(μ log m), p^{2/5}}); noise i.i.d. ∼ N(0, σ); outliers –; global analysis; sample complexity p log p (p/s² + s² + log(1/ε)).

Gribonval et al. [32]: distance metric ‖D1 − D2‖_F; regularizer ℓ1; sparse coefficient distribution: non-zero entries with |x_i| > x_min and ‖x‖₂ ≤ M_x; sparsity level O(m); noise E{ww^T|S} = E{w²}I_m, E{wx^T|S} = 0, ‖w‖₂ ≤ M_w; robust to outliers; local analysis; sample complexity mp³/ε².

Figure 5.2 Two of countless examples of tensor data in today's sensor-rich world: a hyperspectral video and a clinical study using dynamic MRI (panel axes: spatial (horizontal) and temporal).

better understand the results reported in this section, we first need to define some tensor
notation that will be useful throughout this section.
Tensor Unfolding: Elements of tensors can be rearranged to form matrices. Given a Kth-order tensor, X ∈ R^{p_1×···×p_K}, its mode-k unfolding is denoted by X_(k) ∈ R^{p_k × ∏_{i≠k} p_i}. The columns of X_(k) are formed by fixing all the indices except the one in the kth mode.
Tensor Multiplication: The mode-k product between the Kth-order tensor X ∈ R^{p_1×···×p_K} and a matrix A ∈ R^{m_k×p_k} is defined as

    (X ×_k A)_{i_1,…,i_{k−1}, j, i_{k+1},…,i_K} = Σ_{i_k=1}^{p_k} X_{i_1,…,i_{k−1},i_k,i_{k+1},…,i_K} A_{j,i_k}.        (5.24)

Tucker Decomposition [49]: Given a Kth-order tensor Y ∈ R^{m_1×···×m_K} satisfying rank(Y_(k)) ≤ p_k, the Tucker decomposition decomposes Y into a core tensor X ∈ R^{p_1×···×p_K} multiplied by factor matrices D_k ∈ R^{m_k×p_k} along each mode, i.e.,

    Y = X ×_1 D_1 ×_2 D_2 ×_3 ··· ×_K D_K.        (5.25)

This can be restated as

    vec(Y_(1)) = (D_K ⊗ D_{K−1} ⊗ ··· ⊗ D_1) vec(X_(1)),        (5.26)

where ⊗ denotes the matrix Kronecker product [50] and vec(·) denotes stacking of the columns of a matrix into one column. We will use the shorthand notation vec(Y) to denote vec(Y_(1)) and ⨂_k D_k to denote D_1 ⊗ ··· ⊗ D_K.
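The following numpy sketch (with arbitrary small dimensions) implements the mode-k product (5.24) and checks the vectorized identity (5.26) numerically; it assumes the standard column-major unfolding and vectorization convention, under which (5.26) holds.

import numpy as np

def mode_k_product(X, A, k):
    """Mode-k product X x_k A as in (5.24): contracts mode k of X with the columns of A."""
    Xk = np.moveaxis(X, k, 0)                          # bring mode k to the front
    front_shape = Xk.shape
    Xk = Xk.reshape(front_shape[0], -1, order="F")     # mode-k unfolding (column-major convention)
    Yk = A @ Xk                                        # multiply along mode k
    Y = Yk.reshape((A.shape[0],) + front_shape[1:], order="F")
    return np.moveaxis(Y, 0, k)                        # fold back

rng = np.random.default_rng(0)
m, p = (4, 3, 5), (2, 2, 3)                            # illustrative mode dimensions (assumed)
D = [rng.standard_normal((m[k], p[k])) for k in range(3)]
X = rng.standard_normal(p)                             # core tensor

Y = X
for k in range(3):                                     # Y = X x_1 D_1 x_2 D_2 x_3 D_3, as in (5.25)
    Y = mode_k_product(Y, D[k], k)

lhs = Y.reshape(-1, order="F")                         # vec(Y_(1)), stacked column-major
rhs = np.kron(np.kron(D[2], D[1]), D[0]) @ X.reshape(-1, order="F")   # (D_3 kron D_2 kron D_1) vec(X_(1))
print(np.allclose(lhs, rhs))                           # True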

5.3.2 Mathematical Setup


To exploit the structure of tensors in DL, we can model tensors using various
tensor decomposition techniques. These include Tucker decomposition, CANDE-
COMP/PARAFAC (CP) decomposition [51], and the t-product tensor factorization [52].
While the Tucker decomposition can be restated as the Kronecker product of matri-
ces multiplied by a vector, other decompositions result in different formulations. In this
chapter, we consider the Tucker decomposition for the following reasons: (1) it rep-
resents a sequence of independent transformations, i.e., factor matrices, for different

data modes, and (2) Kronecker-structured matrices have successfully been used for
data representation in applications such as magnetic resonance imaging, hyperspectral
imaging, video acquisition, and distributed sensing [8, 9].

Kronecker-Structured Dictionary Learning (KS-DL)


In order to state the main results of this section, we begin with a generative model for
tensor data that is based on Tucker decomposition. Specifically, we assume we have
access to a total number of N tensor observations, Yn ∈ Rm1 ×···×mK , that are generated
according to the following model:10
 
    vec(Yn) = (D0_1 ⊗ D0_2 ⊗ ··· ⊗ D0_K) vec(Xn) + vec(Wn),   n = 1, …, N.        (5.27)

Here, {D0_k ∈ R^{m_k×p_k}}_{k=1}^{K} are the true fixed coordinate dictionaries, Xn ∈ R^{p_1×···×p_K} is the coefficient tensor, and Wn ∈ R^{m_1×···×m_K} is the underlying noise tensor. In this case, the true dictionary D0 ∈ R^{m×p} is Kronecker-structured (KS) and has the form

    D0 = ⨂_k D0_k,   m = ∏_{k=1}^{K} m_k,   and   p = ∏_{k=1}^{K} p_k,

where

    D0_k ∈ D_k = { D_k ∈ R^{m_k×p_k} : ‖D_{k,j}‖₂ = 1 ∀ j ∈ {1, …, p_k} }.        (5.28)

We define the set of KS dictionaries as

    D_KS = { D ∈ R^{m×p} : D = ⨂_k D_k, D_k ∈ D_k ∀ k ∈ {1, …, K} }.        (5.29)

Comparing (5.27) with the traditional formulation in (5.1), it can be seen that KS-DL
also involves vectorizing the observation tensor, but it has the main difference that the
structure of the tensor is captured in the underlying KS dictionary. An illustration of this
for a second-order tensor is shown in Fig. 5.3. As with (5.3), we can stack the vectorized
observations, yn = vec(Yn ), vectorized coefficient tensors, xn = vec(Xn ), and vectorized
noise tensors, wn = vec(Wn ), in columns of Y, X, and W, respectively. We now discuss
the role of sparsity in coefficient tensors for dictionary learning. While in vectorized DL
it is usually assumed that the random support of non-zero entries of xn is uniformly dis-
tributed, there are two different definitions of the random support of Xn for tensor data.
(1) Random sparsity. The random support of xn is uniformly distributed over the set
{S ⊆ {1, . . . , p} : |S| = s}.
(2) Separable sparsity. The random support of xn is uniformly distributed over the set S that is related to {S_1 × ··· × S_K : S_k ⊆ {1, …, p_k}, |S_k| = s_k} via lexicographic indexing. Here, s = ∏_k s_k.
Separable sparsity requires non-zero entries of the coefficient tensor to be grouped in
blocks. This model also implies that the columns of Y(k) have sk -sparse representations
with respect to D0k [53].
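As a concrete (and purely illustrative) example of the model (5.27) and of the two sparsity patterns, the sketch below builds a small KS dictionary from random unit-norm coordinate dictionaries and draws one support of each type. The dimensions, distributions, and the particular lexicographic indexing convention used here are assumptions, not taken from any cited result.

import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(0)
m, p, s = (4, 3), (6, 5), (2, 2)                      # K = 2, illustrative sizes (assumed)

# Coordinate dictionaries with unit-norm columns, i.e., members of the sets D_k in (5.28)
Dk = []
for k in range(2):
    D = rng.standard_normal((m[k], p[k]))
    Dk.append(D / np.linalg.norm(D, axis=0))

D_ks = reduce(np.kron, Dk)                            # D = D_1 kron D_2, a KS dictionary as in (5.29)
print(np.allclose(np.linalg.norm(D_ks, axis=0), 1))   # True: Kronecker products of unit vectors have unit norm

P = int(np.prod(p))
# Random sparsity: support uniform over subsets of {0, ..., p-1} of size s_1 * s_2
rand_support = rng.choice(P, size=int(np.prod(s)), replace=False)

# Separable sparsity: support is a Cartesian product S_1 x S_2 of per-mode supports
S = [rng.choice(p[k], size=s[k], replace=False) for k in range(2)]
sep_support = [i1 * p[1] + i2 for i1, i2 in product(S[0], S[1])]   # lexicographic indexing (assumed convention)
print(sorted(sep_support))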

10 We have reindexed the Dk s here for simplicity of notation.




Figure 5.3 Illustration of the distinctions of KS-DL versus vectorized DL for a second-order
tensor: both vectorize the observation tensor, but the structure of the tensor is exploited in the KS
dictionary, leading to the learning of two coordinate dictionaries with fewer parameters than the
dictionary learned in vectorized DL.

The aim in KS-DL is to estimate coordinate dictionaries, D̂_k s, such that they are close to the D0_k s. In this scenario, the statistical risk minimization problem has the form

    (D̂_1, …, D̂_K) ∈ arg min_{{D_k ∈ D_k}_{k=1}^{K}} E{ inf_{x∈X} [ (1/2) ‖ y − (⨂_k D_k) x ‖₂² + R(x) ] },        (5.30)

and the ERM problem is formulated as


    (D̂_1, …, D̂_K) ∈ arg min_{{D_k ∈ D_k}_{k=1}^{K}} Σ_{n=1}^{N} inf_{xn ∈ X} [ (1/2) ‖ yn − (⨂_k D_k) xn ‖₂² + R(xn) ],        (5.31)

where R(·) is a regularization operator on the coefficient vectors. Various KS-DL algorithms have been proposed that solve (5.31) heuristically by means of optimization tools such as alternating minimization [9] and tensor rank minimization [54], and by taking
advantage of techniques in tensor algebra such as the higher-order SVD for tensors [55].
In particular, an algorithm is proposed in [11], which shows that the Kronecker prod-
uct of any number of matrices can be rearranged to form a rank-1 tensor. In order
to solve (5.31), therefore, in [11] a regularizer is added to the objective function that
enforces this low-rankness on the rearrangement tensor. The dictionary update stage of
this algorithm involves learning the rank-1 tensor and rearranging it to form the KS dic-
tionary. This is in contrast to learning the individual coordinate dictionaries by means of
alternating minimization [9].
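The rank-one rearrangement mentioned above is easy to verify numerically for K = 2: a Kronecker product of two matrices can be reshaped into a rank-one matrix. The sketch below uses one concrete index convention (the classical reshuffling into vec(A)vec(B)^T), which is an illustration of the idea and not necessarily the exact rearrangement used in [11].

import numpy as np

def rearrange_kron(K, m1, n1, m2, n2):
    """Reshape K = A kron B (A of size m1 x n1, B of size m2 x n2) into vec(A) vec(B)^T,
    which has rank one."""
    K4 = K.reshape(m1, m2, n1, n2)             # K[i1*m2 + i2, j1*n2 + j2] = A[i1, j1] * B[i2, j2]
    return K4.transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 5))
R = rearrange_kron(np.kron(A, B), 3, 4, 2, 5)
print(np.linalg.matrix_rank(R))                                 # 1
print(np.allclose(R, np.outer(A.reshape(-1), B.reshape(-1))))   # True (row-major vec convention)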
In the case of theory for KS-DL, the notion of closeness can have two interpretations. One is the distance between the true KS dictionary and the recovered KS dictionary, i.e., d(D̂(Y), D0). The other is the distance between each true coordinate dictionary and the corresponding recovered coordinate dictionary, i.e., d(D̂_k(Y), D0_k). While small recovery
errors for coordinate dictionaries imply a small recovery error for the KS dictionary,

the other side of the statement does not necessarily hold. Hence, the latter notion is of
importance when we are interested in recovering the structure of the KS dictionary.
In this section, we focus on the sample complexity of the KS-DL problem. The ques-
tions that we address in this section are as follows. (1) What are the fundamental limits
of solving the statistical risk minimization problem in (5.30)? (2) Under what kind of
conditions do objective functions like (5.31) recover the true coordinate dictionaries and
how many samples do they need for this purpose? (3) How do these limits compare with
their vectorized DL counterparts? Addressing these questions will help in understanding
the benefits of KS-DL for tensor data.

5.3.3 Fundamental Limits on the Minimax Risk of KS-DL


Below, we present a result that obtains lower bounds on the minimax risk of the KS-DL
problem. This result can be considered as an extension of Theorem 5.1 for the KS-DL
problem for tensor data. Here, the Frobenius norm is considered as the distance metric
and the result is based on the following assumption.
A7.1 (Local recovery) The true KS dictionary lies in the neighborhood of some reference dictionary, D∗ ∈ D_KS, i.e., D0 ∈ D̃_KS, where

    D̃_KS = { D | D ∈ D_KS, ‖D − D∗‖_F ≤ r }.        (5.32)

theorem 5.7 (KS-DL minimax lower bounds [13]) Consider a KS-DL problem with N i.i.d. observations and true KS dictionary D0 satisfying assumption A7.1 for some r ∈ (0, 2√p]. Then, for any coefficient distribution with mean zero and covariance matrix Σ_x, and white Gaussian noise with mean zero and variance σ², the minimax risk ε∗ is lower-bounded as

    ε∗ ≥ (t/4) min{ p, r²/(2K), (σ² / (4NK‖Σ_x‖₂)) ( c₁ Σ_{k=1}^{K} (m_k − 1)p_k − (K/2) log₂(2K) − 2 ) },        (5.33)

for any 0 < t < 1 and any 0 < c₁ < (1 − t)/(8 log 2).

for any 0 < t < 1 and any 0 < c1 < ((1 − t)/(8 log 2)).
Similarly to Theorem 5.1, the proof of this theorem relies on using the standard
procedure for lower-bounding the minimax risk by connecting it to the maximum prob-
ability of error of a multiple-hypothesis-testing problem. Here, since the constructed
hypothesis-testing class consists of KS dictionaries, the construction procedure and the
minimax risk analysis are different from that in [33].
To understand this theorem, let us assume that r and p are sufficiently large that the
minimizer of the left-hand side of (5.33) is the third term. In this case, Theorem 5.7
states that to achieve any error ε for the Kth-order tensor dictionary-recovery problem,
we need the number of samples to be on the order of
    N = Ω( σ² Σ_k m_k p_k / (K ‖Σ_x‖₂ ε) ).

Comparing this scaling with the results for the unstructured dictionary-learning problem
provided in Theorem 5.1, the lower bound here is decreased from the scaling Ω(mp)

 
to Ω(Σ_k m_k p_k / K). This reduction can be attributed to the fact that the average number of degrees of freedom in a KS-DL problem is Σ_k m_k p_k / K, compared with the number of degrees of freedom of the vectorized DL problem, which is mp. For the case of K = 2, m1 = m2 = √m, and p1 = p2 = √p, the sample complexity lower bound scales with Ω(mp) for vectorized DL and with Ω(√(mp)) for KS-DL. On the other hand, when m1 = αm, m2 = 1/α and p1 = αm1, p2 = 1/α, where α < 1, 1/α ∈ N, the sample complexity lower bound scales with Ω(mp) for KS-DL, which is similar to the scaling for vectorized DL.
Specializing this result to random sparse coefficient vectors and assuming that the non-zero entries of xn are i.i.d. with variance σ_x², we get Σ_x = (s/p) σ_x² I_p. Therefore, for sufficiently large r, the sample complexity scaling required in order to achieve any error ε for strictly sparse representations becomes

    Ω( σ² p Σ_k m_k p_k / (σ_x² s K ε) ).

A very simple KS-DL algorithm is also provided in [13], which can recover a square
KS dictionary that consists of the Kronecker product of two smaller dictionaries and
is a perturbation of the identity matrix. It is shown that, in this case, the lower bound
provided in (5.33) is order-wise achievable for the case of sparse coefficient vectors. This
suggests that the obtained sample complexity lower bounds for overcomplete KS-DL are
not too loose.
In the next section, we focus on achievability results for the KS dictionary recovery
problem, i.e., upper bounds on the sample complexity of KS-DL.

5.3.4 Achievability Results


While the results in the previous section provide us with a lower bound on the sample
complexity of the KS-DL problem, we are further interested in the sample complexity of
specific KS-DL algorithms that solve (5.31). Below, we present a KS-DL achievability
result that can be interpreted as an extension of Theorem 5.6 to the KS-DL problem.

Noisy Recovery
We present a result that states conditions that ensure reliable recovery of the coordinate
dictionaries from noisy observations using (5.31). Shakeri et al. [15] solve (5.31) with R(x) = λ‖x‖₁, where λ is a regularization parameter. Here, the coordinate dictionary error is defined as

    ε_k = ‖D̂_k − D0_k‖_F,   k ∈ {1, …, K},        (5.34)

where D̂_k is the recovered coordinate dictionary. The result is based on the following set of assumptions.

A8.1 (Cumulative coherence) The cumulative coherences of the true coordinate dictionaries satisfy μ_{s_k}(D0_k) ≤ 1/4, and the cumulative coherence of the true dictionary satisfies μ_s(D0) ≤ 1/2.¹¹
A8.2 (Bounded random coefficients) The random support of xn is generated from the separable sparsity model. Assume that the non-zero entries of xn are drawn i.i.d. from a distribution with absolute mean E{|x|} and variance E{x²}. Denoting ln = sign(xn), and dropping the index of xn and ln for simplicity of notation, the following assumptions are satisfied for the coefficient vector: E{x_S x_S^T | S} = E{x²} I_s, E{x_S l_S^T | S} = E{|x|} I_s, E{l_S l_S^T | S} = I_s, ‖x‖₂ ≤ M_x, and min_{i∈S} |x_i| ≥ x_min. Moreover, defining κ_x ≜ E{|x|}/√(E{x²}) as a measure of the flatness of x, the following inequality is satisfied:

    E{x²} / (M_x E{|x|}) > ( c₁ / (1 − 2μ_s(D0)) ) max_{k∈{1,…,K}} (s_k/p_k) (‖D0_k‖₂ + 1) ‖D0_k^T D0_k − I‖_F,        (5.35)

where c₁ is a positive constant that is an exponential function of K.
A8.3 (Regularization parameter) The regularization parameter satisfies λ ≤ x_min/c₂, where c₂ is a positive constant that is an exponential function of K.
A8.4 (Bounded random noise) Assume that the non-zero entries of wn are drawn i.i.d. from a distribution with mean 0 and variance E{w²}. Dropping the index of vectors for simplicity of notation, w is a bounded random white-noise vector satisfying E{w w^T | S} = E{w²} I_m, E{w x^T | S} = E{w l^T | S} = 0, and ‖w‖₂ ≤ M_w. Furthermore, denoting λ̄ ≜ λ/E{|x|}, we have

    M_w / M_x ≤ c₃ ( λ̄ K c_max − Σ_{k=1}^{K} ε_k ),        (5.36)

where c₃ is a positive constant that is an exponential function of K, and c_max depends on the coefficient distribution, D0, and K.
A8.5 (Sparsity level) The sparsity levels for each mode satisfy s_k ≤ p_k / ( 8 (‖D0_k‖₂ + 1)² ) for k ∈ {1, …, K}.
A8.6 (Radii range) The error radii ε_k > 0 satisfy ε_k ∈ (λ̄ c_{k,min}, λ̄ c_max) for k ∈ {1, …, K}, where c_{k,min} depends on s, the coefficient distribution, D0, and K.
theorem 5.8 (Noisy KS-DL, local recovery [15]) Consider a KS-DL problem with N i.i.d. observations and suppose that assumptions A8.1–A8.6 are satisfied. Assume

    N ≥ max_{k∈[K]} Ω( ( p_k² (η + m_k p_k) / (ε_k − ε_{k,min}(λ̄))² ) ( 2^K (1 + λ̄²) M_x² / (s² E{x²}²) + M_w² / (s E{x²}) ) ),        (5.37)

where ε_{k,min}(λ̄) is a function of K, λ̄, and c_{k,min}. Then, with probability at least 1 − e^{−η}, there exists a local minimum of (5.31), D̂ = ⨂_k D̂_k, such that d(D̂_k, D0_k) ≤ ε_k, for all k ∈ {1, …, K}.

11 The cumulative coherence is defined in (5.18).



Table 5.2. Comparison of the scaling of vectorized DL sample complexity bounds with KS-DL, given fixed SNR

                         Vectorized DL          KS-DL
Minimax lower bound      mp²/ε²  [33]           p Σ_k m_k p_k / (K ε²)  [13]
Achievability bound      mp³/ε²  [32]           max_k m_k p_k³ / ε_k²  [15]

The proof of this theorem relies on coordinate-wise Lipschitz continuity of the objec-
tive function in (5.31) with respect to coordinate dictionaries and the use of similar
sample complexity analysis arguments  to those in [32]. Theorem 5.8 implies that, for
fixed K and SNR, N = max_{k∈{1,…,K}} Ω(m_k p_k³ ε_k^{−2}) is sufficient for the existence of a local
minimum within distance εk of true coordinate dictionaries, with high probability. This
result holds for coefficients that are generated according to the separable sparsity model.
The case of coefficients generated according to the random sparsity model requires a
different analysis technique that is not explored in [15].
We compare this result with the scaling in the vectorized DL problem in Theorem 5.6, which stated that N = Ω(mp³ε^{−2}) = Ω(∏_k m_k p_k³ ε^{−2}) is sufficient for the existence of D0 as a local minimum of (5.6) up to the predefined error ε. In contrast, N = max_k Ω(m_k p_k³ ε_k^{−2}) is sufficient in the case of tensor data for the existence of the D0_k s as
local minima of (5.31) up to predefined errors εk . This reduction in the scaling can be
attributed to the reduction in the number of degrees of freedom of the KS-DL problem.
We can also compare this result with the sample complexity lower-bound scaling obtained in Theorem 5.7 for KS-DL, which stated that, given sufficiently large r and p, N = Ω(p Σ_k m_k p_k ε^{−2}/K) is necessary in order to recover the true KS dictionary D0 up to error ε. We can relate ε to the ε_k s using the relation ε ≤ √p Σ_k ε_k [15]. Assuming all of the ε_k s are equal to each other, this implies that ε ≤ √p K ε_k, and we have N = max_k Ω(2^K K² p (m_k p_k³) ε^{−2}). It can be seen from Theorem 5.7 that the sample com-
plexity lower bound depends on the average dimension of coordinate dictionaries; in
contrast, the sample complexity upper bound reported in this section depends on the
maximum dimension of coordinate dictionaries. There is also a gap between the lower
bound and the upper bound of order max_k p_k². This suggests that the obtained bounds
may be loose.
The sample complexity scaling results in Theorems 5.1, 5.6, 5.7, and 5.8 are shown
in Table 5.2 for sparse coefficient vectors.

5.4 Extensions and Open Problems

In Sections 5.2 and 5.3, we summarized some of the key results of dictionary identifica-
tion for vectorized and tensor data. In this section, we look at extensions of these works
and discuss related open problems.

5.4.1 DL for Vector-Valued Data


Extensions to alternative objective functions. The works discussed in Section 5.2 all
analyze variants of (5.5) and (5.6), which minimize the representation error of the
dictionary. However, there do exist other works that look for a dictionary that opti-
mizes different criteria. Schnass [56] proposed a new DL objective function called the
“response maximization criterion” that extends the K-means objective function to the
following:


    max_{D∈D} Σ_{n=1}^{N} max_{|S|=s} ‖D_S^∗ yn‖₁.        (5.38)

Given the distance metric d(D1, D2) = max_j ‖D1_j − D2_j‖₂, Schnass shows that the sample complexity needed to recover a true generating dictionary up to precision ε scales as O(mp³ε^{−2}) using this objective. This sample complexity is achieved by a novel DL
algorithm, called Iterative Thresholding and K-Means (ITKM), that solves (5.38) under
certain conditions on the coefficient distribution, noise, and the underlying dictionary.
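For a fixed dictionary, the inner maximization in (5.38) has a closed form: the best S simply collects the s columns with the largest absolute responses. The sketch below (illustrative only; it evaluates the criterion and is not the ITKM algorithm, and all dimensions are arbitrary) makes this explicit.

import numpy as np

def response_criterion(D, Y, s):
    """Evaluate sum_n max_{|S|=s} ||D_S^* y_n||_1 for a fixed dictionary D."""
    R = np.abs(D.T @ Y)                      # |<D_j, y_n>| for all columns j and samples n
    top_s = np.sort(R, axis=0)[-s:, :]       # the s largest responses per sample attain the inner max
    return top_s.sum()

rng = np.random.default_rng(0)
m, p, s, N = 16, 32, 3, 500
D_true = rng.standard_normal((m, p))
D_true /= np.linalg.norm(D_true, axis=0)
X = np.zeros((p, N))
for n in range(N):
    X[rng.choice(p, size=s, replace=False), n] = rng.standard_normal(s)
Y = D_true @ X

D_rand = rng.standard_normal((m, p))
D_rand /= np.linalg.norm(D_rand, axis=0)
print(response_criterion(D_true, Y, s) > response_criterion(D_rand, Y, s))   # True: the generating dictionary concentrates the responses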
Efficient representations can help improve the complexity and performance of
machine-learning tasks such as prediction. This means that a DL algorithm could explic-
itly tune the representation to optimize prediction performance. For example, some
works learn dictionaries to improve classification performance [17, 25]. These works
add terms to the objective function that measure the prediction performance and min-
imize this loss. While these DL algorithms can yield improved performance for their
desired prediction task, proving sample complexity bounds for these algorithms remains
an open problem.
Tightness guarantees. While dictionary identifiability has been well studied for
vector-valued data, there remains a gap between the upper and lower bounds on the sam-
ple complexity. The lower bound presented in Theorem 5.1 is for the case of a particular
distance metric, i.e., the Frobenius norm, whereas the presented achievability results in
Theorems 5.2–5.6 are based on a variety of distance metrics. Restricting the distance
metric to the Frobenius norm, we still observe a gap of order p between the sample
complexity lower bound in Theorem 5.1 and the upper bound in Theorem 5.6. The par-
tial converse result for square dictionaries that is provided in [33] shows that the lower
bound is achievable for square dictionaries close to the identity matrix. For more gen-
eral square matrices, however, the gap may be significant: either improved algorithms
can achieve the lower bounds or the lower bounds may be further tightened. For over-
complete dictionaries the question of whether the upper bound or lower bound is tight
remains open. For metrics other than the Frobenius norm, the bounds are incomparable,
making it challenging to assess the tightness of many achievability results.
Finally, the works reported in Table 5.1 differ significantly in terms of the mathemat-
ical tools they use. Each approach yields a different insight into the structure of the DL
problem. However, there is no unified analytic framework encompassing all of these
perspectives. This gives rise to the following question. Is there a unified mathematical
tool that can be used to generalize existing results on DL?

5.4.2 DL for Tensor Data


Extensions of sample complexity bounds for KS-DL. In terms of theoretical results, there
are many aspects of KS-DL that have not been addressed in the literature so far. The
results that are obtained in Theorems 5.7 and 5.8 are based on the Frobenius norm
distance metric and provide only local recovery guarantees. Open questions include
corresponding bounds for other distance metrics and global recovery guarantees. In
particular, getting global recovery guarantees requires using a distance metric that can
handle the inherent permutation and sign ambiguities in the dictionary. Moreover, the
results of Theorem 5.8 are based on the fact that the coefficient tensors are gener-
ated according to the separable sparsity model. Extensions to coefficient tensors with
arbitrary sparsity patterns, i.e., the random sparsity model, have not been explored.
Algorithmic open problems. Unlike vectorized DL problems whose sample complex-
ity is explicitly tied to the actual algorithmic objective functions, the results in [13, 15]
are not tied to an explicit KS-DL algorithm. While there exist KS-DL algorithms in
the literature, none of them explicitly solve the problem in these papers. Empirically,
KS-DL algorithms can outperform vectorized DL algorithms for a variety of real-world
datasets [10, 11, 57–59]. However, these algorithms lack theoretical analysis in terms of
sample complexity, leaving open the question of how many samples are needed in order
to learn a KS dictionary using practical algorithms.
Parameter selection in KS-DL. In some cases we may not know a priori the param-
eters for which a KS dictionary yields a good model for the data. In particular, given
dimension p, the problem of selecting the pk s for coordinate dictionaries such that
!
p = k pk has not been studied. For instance, in the case of RGB images, the selec-
tion of pk s for the spatial modes is somewhat intuitive, as each column in the separable
transform represents a pattern in each mode. However, selecting the number of columns
for the depth mode, which has three dimensions (red, green, and blue), is less obvious.
Given a fixed number of overall columns for the KS dictionary, how should we divide it
between the number of columns for each coordinate dictionary?
Alternative structures on dictionary. In terms of DL for tensor data, the issue of exten-
sions of identifiability results to structures other than the Kronecker product is an open
problem. The main assumption in KS-DL is that the transforms for different modes of
the tensor are separable from one another, which can be a limiting assumption for real-
world datasets. Other structures can be enforced on the underlying dictionary to reduce
sample complexity while conferring applicability to a wider range of datasets. Examples
include DL using the CP decomposition [60] and the tensor t-product [61]. Character-
izing the DL problem and understanding the practical benefits of these models remain
interesting questions for future work.

Acknowledgments

This work is supported in part by the National Science Foundation under awards CCF-
1525276 and CCF-1453073, and by the Army Research Office under award W911NF-
17-1-0546.

References

[1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 8,
pp. 1798–1828, 2013.
[2] R. N. Bracewell and R. N. Bracewell, The Fourier transform and its applications.
McGraw-Hill, 1986.
[3] I. Daubechies, Ten lectures on wavelets. SIAM, 1992.
[4] E. J. Candès and D. L. Donoho, “Curvelets: A surprisingly effective nonadaptive repre-
sentation for objects with edges,” in Proc. 4th International Conference on Curves and
Surfaces, 1999, vol. 2, pp. 105–120.
[5] I. T. Jolliffe, “Principal component analysis and factor analysis,” in Principal component
analysis. Springer, 1986, pp. 115–128.
[6] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcom-
plete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54, no. 11,
pp. 4311–4322, 2006.
[7] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Rev.,
vol. 51, no. 3, pp. 455–500, 2009.
[8] M. F. Duarte and R. G. Baraniuk, “Kronecker compressive sensing,” IEEE Trans. Image
Processing, vol. 21, no. 2, pp. 494–504, 2012.
[9] C. F. Caiafa and A. Cichocki, “Multidimensional compressed sensing and their applica-
tions,” Wiley Interdisciplinary Rev.: Data Mining and Knowledge Discovery, vol. 3, no. 6,
pp. 355–380, 2013.
[10] S. Hawe, M. Seibert, and M. Kleinsteuber, “Separable dictionary learning,” in Proc. IEEE
Conference Computer Vision and Pattern Recognition, 2013, pp. 438–445.
[11] M. Ghassemi, Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “STARK: Structured dictionary
learning through rank-one tensor recovery,” in Proc. IEEE 7th International Workshop
Computational Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
[12] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds for Kronecker-
structured dictionary learning,” in Proc. IEEE International Symposium on Information
Theory, 2016, pp. 1148–1152.
[13] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learn-
ing for tensor data,” IEEE Trans. Information Theory, vol. 64, no. 4, pp. 2706–2726, 2018.
[14] Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “Identification of Kronecker-structured dictio-
naries: An asymptotic analysis,” in Proc. IEEE 7th International Workshop Computational
Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
[15] Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “Identifiability of Kronecker-structured
dictionaries for tensor data,” IEEE J. Selected Topics Signal Processing, vol. 12, no. 5,
pp. 1047–1062, 2018.
[16] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE
Trans. Pattern Analysis Machine Intelligence, vol. 27, no. 12, pp. 1945–1959, 2005.
[17] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learn-
ing from unlabeled data,” in Proc. 24th International Conference on Machine Learning,
2007, pp. 759–766.
[18] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals Human
Genetics, vol. 7, no. 2, pp. 179–188, 1936.
[19] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley &
Sons, 2004.

[20] R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmonic Analysis, vol. 21,
no. 1, pp. 5–30, 2006.
[21] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in Proc.
International Conference on Artificial Neural Networks, 1997, pp. 583–588.
[22] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[23] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, “Shift-invariance sparse coding for audio
classification,” in Proc. 23rd Conference on Uncertainty in Artificial Intelligence, 2007,
pp. 149–158.
[24] J. M. Duarte-Carvajalino and G. Sapiro, “Learning to sense sparse signals: Simultaneous
sensing matrix and sparsifying dictionary optimization,” IEEE Trans. Image Processing,
vol. 18, no. 7, pp. 1395–1408, 2009.
[25] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern
Analysis Machine Intelligence, vol. 34, no. 4, pp. 791–804, 2012.
[26] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictio-
naries, and a practical way to retrieve them,” Linear Algebra Applications, vol. 416, no. 1,
pp. 48–67, 2006.
[27] R. Gribonval and K. Schnass, “Dictionary identification – sparse matrix-factorization via ℓ1-minimization,” IEEE Trans. Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.
[28] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,”
in Proc. Conference on Learning Theory, 2012, pp. 37.11–37.18.
[29] Q. Geng and J. Wright, “On the local correctness of ℓ1-minimization for dictionary
learning,” in Proc. IEEE International Symposium on Information Theory, 2014, pp. 3180–
3184.
[30] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, “Learning sparsely used
overcomplete dictionaries,” in Proc. 27th Annual Conference on Learning Theory, 2014,
pp. 1–15.
[31] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcomplete
dictionaries,” in Proc. 25th Annual Conference Learning Theory, 2014, pp. 1–28.
[32] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: Dictionary learning with
noise and outliers,” IEEE Trans. Information Theory, vol. 61, no. 11, pp. 6298–6319, 2015.
[33] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE
Trans. Information Theory, vol. 62, no. 3, pp. 1501–1515, 2015.
[34] O. Christensen, An introduction to frames and Riesz bases. Springer, 2016.
[35] K. A. Okoudjou, Finite frame theory: A complete introduction to overcompleteness.
American Mathematical Society, 2016.
[36] W. U. Bajwa, R. Calderbank, and D. G. Mixon, “Two are better than one: Fundamen-
tal parameters of frame coherence,” Appl. Comput. Harmonic Analysis, vol. 33, no. 1,
pp. 58–78, 2012.
[37] W. U. Bajwa and A. Pezeshki, “Finite frames for sparse signal processing,” in Finite
frames, P. Casazza and G. Kutyniok, eds. Birkhäuser, 2012, ch. 10, pp. 303–335.
[38] K. Schnass, “On the identifiability of overcomplete dictionaries via the minimisation prin-
ciple underlying K-SVD,” Appl. Comput. Harmonic Analysis, vol. 37, no. 3, pp. 464–491,
2014.
[39] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,”
J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.
[40] V. Vapnik, “Principles of risk minimization for learning theory,” in Proc. Advances in
Neural Information Processing Systems, 1992, pp. 831–838.

[41] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in
Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5,
1999, pp. 2443–2446.
[42] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and
sparse coding,” J. Machine Learning Res., vol. 11, no. 1, pp. 19–60, 2010.
[43] H. Raja and W. U. Bajwa, “Cloud K-SVD: A collaborative dictionary learning algorithm
for big, distributed data,” IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173–188, 2016.
[44] Z. Shakeri, H. Raja, and W. U. Bajwa, “Dictionary learning based nonlinear classifier train-
ing from distributed data,” in Proc. 2nd IEEE Global Conference Signal and Information
Processing, 2014, pp. 759–763.
[45] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro, and
L. Carin, “Nonparametric Bayesian dictionary learning for analysis of noisy and incom-
plete images,” IEEE Trans. Image Processing, vol. 21, no. 1, pp. 130–144, 2012.
[46] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997,
pp. 423–435.
[47] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery
using ℓ1-constrained quadratic programming (lasso),” IEEE Trans. Information Theory,
vol. 55, no. 5, pp. 2183–2202, 2009.
[48] P. Massart, Concentration inequalities and model selection. Springer, 2007.
[49] L. R. Tucker, “Implications of factor analysis of three-way matrices for measurement
of change,” in Problems Measuring Change. University of Wisconsin Press, 1963,
pp. 122–137.
[50] C. F. Van Loan, “The ubiquitous Kronecker product,” J. Comput. Appl. Math., vol. 123,
no. 1, pp. 85–100, 2000.
[51] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an
explanatory multi-modal factor analysis,” in UCLA Working Papers in Phonetics, vol. 16,
pp. 1–84, 1970.
[52] M. E. Kilmer, C. D. Martin, and L. Perrone, “A third-order generalization of the matrix
SVD as a product of third-order tensors,” Technical Report, 2008.
[53] C. F. Caiafa and A. Cichocki, “Computing sparse representations of multidimensional
signals using Kronecker bases,” Neural Computation, vol. 25, no. 1, pp. 186–220, 2013.
[54] S. Gandy, B. Recht, and I. Yamada, “Tensor completion and low-n-rank tensor recovery
via convex optimization,” Inverse Problems, vol. 27, no. 2, p. 025010, 2011.
[55] L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decompo-
sition,” SIAM J. Matrix Analysis Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[56] K. Schnass, “Local identification of overcomplete dictionaries,” J. Machine Learning Res.,
vol. 16, pp. 1211–1242, 2015.
[57] S. Zubair and W. Wang, “Tensor dictionary learning with sparse Tucker decomposition,”
in Proc. IEEE 18th International Conference on Digital Signal Processing, 2013, pp. 1–6.
[58] F. Roemer, G. Del Galdo, and M. Haardt, “Tensor-based algorithms for learning multidi-
mensional separable dictionaries,” in Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, 2014, pp. 3963–3967.
[59] C. F. Dantas, M. N. da Costa, and R. da Rocha Lopes, “Learning dictionaries as a sum of
Kronecker products,” IEEE Signal Processing Lett., vol. 24, no. 5, pp. 559–563, 2017.
[60] Y. Zhang, X. Mou, G. Wang, and H. Yu, “Tensor-based dictionary learning for spectral CT
reconstruction,” IEEE Trans. Medical Imaging, vol. 36, no. 1, pp. 142–154, 2017.
[61] S. Soltani, M. E. Kilmer, and P. C. Hansen, “A tensor-based dictionary learning approach to
tomographic image reconstruction,” BIT Numerical Math., vol. 56, no. 4, pp. 1–30, 2015.
6 Uncertainty Relations and Sparse
Signal Recovery
Erwin Riegler and Helmut Bölcskei

Summary

This chapter provides a principled introduction to uncertainty relations underlying sparse
signal recovery. We start with the seminal work by Donoho and Stark, 1989, which
defines uncertainty relations as upper bounds on the operator norm of the band-limitation
operator followed by the time-limitation operator, generalize this theory to arbitrary
pairs of operators, and then develop out of this generalization the coherence-based
uncertainty relations due to Elad and Bruckstein, 2002, as well as uncertainty rela-
tions in terms of concentration of the 1-norm or 2-norm. The theory is completed with
the recently discovered set-theoretic uncertainty relations which lead to best possible
recovery thresholds in terms of a general measure of parsimony, namely Minkowski
dimension. We also elaborate on the remarkable connection between uncertainty rela-
tions and the “large sieve,” a family of inequalities developed in analytic number theory.
It is finally shown how uncertainty relations allow us to establish fundamental limits of
practical signal recovery problems such as inpainting, declipping, super-resolution, and
denoising of signals corrupted by impulse noise or narrowband interference. Detailed
proofs are provided throughout the chapter.

6.1 Introduction

The uncertainty principle in quantum mechanics says that certain pairs of physical prop-
erties of a particle, such as position and momentum, can be known to within a limited
precision only [1]. Uncertainty relations in signal analysis [2–5] state that a signal
and its Fourier transform cannot both be arbitrarily well concentrated; corresponding
mathematical formulations exist for square-integrable or integrable functions [6, 7], for
vectors in (C^m, ‖·‖_2) or (C^m, ‖·‖_1) [6–10], and for finite abelian groups [11, 12]. These
results feature prominently in many areas of the mathematical data sciences. Specifi-
cally, in compressed sensing [6–9, 13, 14] uncertainty relations lead to sparse signal
recovery thresholds, in Gabor and Wilson frame theory [15] they characterize limits on
the time–frequency localization of frame elements, in communications [16] they play a
fundamental role in the design of pulse shapes for orthogonal frequency division multi-
plexing (OFDM) systems [17], in the theory of partial differential equations they serve to

characterize existence and smoothness properties of solutions [18], and in coding theory
they help to understand questions around the existence of good cyclic codes [19].
This chapter provides a principled introduction to uncertainty relations underlying
sparse signal recovery, starting with the seminal work by Donoho and Stark [6], rang-
ing over the Elad–Bruckstein coherence-based uncertainty relation for general pairs
of orthonormal bases [8], later extended to general pairs of dictionaries [10], to the
recently discovered set-theoretic uncertainty relation [13] which leads to information-
theoretic recovery thresholds for general notions of parsimony. We also elaborate on the
remarkable connection [7] between uncertainty relations for signals and their Fourier
transforms–with concentration measured in terms of support–and the “large sieve,” a
family of inequalities involving trigonometric polynomials, originally developed in the
field of analytic number theory [20, 21].
Uncertainty relations play an important role in data science beyond sparse sig-
nal recovery, specifically in the sparse signal separation problem, which comprises
numerous practically relevant applications such as (image or audio signal) inpainting,
declipping, super-resolution, and the recovery of signals corrupted by impulse noise or
by narrowband interference. We provide a systematic treatment of the sparse signal sep-
aration problem and develop its limits out of uncertainty relations for general pairs of
dictionaries as introduced in [10]. While the flavor of these results is that beyond certain
thresholds something is not possible, for example a non-zero vector cannot be concen-
trated with respect to two different orthonormal bases beyond a certain limit, uncertainty
relations can also reveal that something unexpected is possible. Specifically, we demon-
strate that signals that are sparse in certain bases can be recovered in a stable fashion
from partial and noisy observations.
In practice one often encounters more general concepts of parsimony, such as, e.g.,
manifold structures and fractal sets. Manifolds are prevalent in the data sciences, e.g.,
in compressed sensing [22–27], machine learning [28], image processing [29, 30],
and handwritten-digit recognition [31]. Fractal sets find application in image com-
pression and in modeling of Ethernet traffic [32]. In the last part of this chapter, we
develop an information-theoretic framework for sparse signal separation and recovery,
which applies to arbitrary signals of “low description complexity.” The complexity mea-
sure our results are formulated in, namely Minkowski dimension, is agnostic to signal
structure and goes beyond the notion of sparsity in terms of the number of non-zero
entries or concentration in 1-norm or 2-norm. The corresponding recovery thresholds
are information-theoretic in the sense of applying to arbitrary signal structures and pro-
viding results of best possible nature that are, however, not constructive in terms of
recovery algorithms.
To keep the exposition simple and to elucidate the main conceptual aspects, we restrict
ourselves to the finite-dimensional cases (Cm ,  · 2 ) and (Cm ,  · 1 ) throughout. Refer-
ences to uncertainty relations for the infinite-dimensional case will be given wherever
possible and appropriate. Some of the results in this chapter have not been reported
before in the literature. Detailed proofs will be provided for most of the statements, with
the goal of allowing the reader to acquire a technical working knowledge that can serve
as a basis for own further research.
The chapter is organized as follows. In Sections 6.2 and 6.3, we derive uncertainty
relations for vectors in (Cm ,  · 2 ) and (Cm ,  · 1 ), respectively, discuss the connec-
tion to the large sieve, present applications to noisy signal recovery problems, and
establish a fundamental relation between uncertainty relations for sparse vectors and
null-space properties of the accompanying dictionary matrices. Section 6.4 is devoted
to understanding the role of uncertainty relations in sparse signal separation problems.
In Section 6.5, we generalize the classical sparsity notion as used in compressed
sensing to a more comprehensive concept of description complexity, namely, lower
modified Minkowski dimension, which in turn leads to a set-theoretic null-space
property and corresponding recovery thresholds. Section 6.6 presents a large sieve
inequality in (Cm ,  · 2 ) that one of our results in Section 6.2 is based on. Section 6.7
lists infinite-dimensional counterparts–available in the literature–to some of the results
in this chapter. In Section 6.8, we provide a proof of the set-theoretic null-space
property stated in Section 6.5. Finally, Section 6.9 contains results on operator norms
used frequently in this chapter.
Notation. For A ⊆ {1, . . . , m}, D_A denotes the m × m diagonal matrix with diagonal entries (D_A)_{i,i} = 1 for i ∈ A, and (D_A)_{i,i} = 0 else. With U ∈ C^{m×m} unitary and A ⊆ {1, . . . , m}, we define the orthogonal projection Π_A(U) = U D_A U^* and set W_{U,A} = range(Π_A(U)). For x ∈ C^m and A ⊆ {1, . . . , m}, we let x_A = D_A x. With A ∈ C^{m×m}, |||A|||_1 = max_{x: ‖x‖_1=1} ‖Ax‖_1 refers to the operator 1-norm, |||A|||_2 = max_{x: ‖x‖_2=1} ‖Ax‖_2 designates the operator 2-norm, ‖A‖_2 = √(tr(AA^*)) is the Frobenius norm, and ‖A‖_1 = Σ_{i,j=1}^m |A_{i,j}|. The m × m discrete Fourier transform (DFT) matrix F has entry (1/√m) e^{−2πjkl/m} in its kth row and lth column for k, l ∈ {1, . . . , m}. For x ∈ R, we set [x]_+ = max{x, 0}. The vector x ∈ C^m is said to be s-sparse if it has at most s non-zero entries. The open ball in (C^m, ‖·‖_2) of radius ρ centered at u ∈ C^m is denoted by B_m(u, ρ), and V_m(ρ) refers to its volume. The indicator function on the set A is χ_A. We use the convention 0 · ∞ = 0.

6.2 Uncertainty Relations in (C^m, ‖·‖_2)

Donoho and Stark [6] define uncertainty relations as upper bounds on the operator norm
of the band-limitation operator followed by the time-limitation operator. We adopt this
elegant concept and extend it to refer to an upper bound on the operator norm of a general
orthogonal projection operator (replacing the band-limitation operator) followed by the
“time-limitation operator” DP as an uncertainty relation. More specifically, let U ∈ Cm×m
be a unitary matrix, P, Q ⊆ {1, . . . , m}, and consider the orthogonal projection ΠQ (U) onto
the subspace W_{U,Q} which is spanned by {u_i : i ∈ Q}. Let¹ Δ_{P,Q}(U) = |||D_P Π_Q(U)|||_2. In
the setting of [6] U would correspond to the DFT matrix F and ΔP,Q (F) is the operator
2-norm of the band-limitation operator followed by the time-limitation operator, both in
finite dimensions. By Lemma 6.12 we have

1 We note that, for general unitary A, B ∈ C^{m×m}, unitary invariance of ‖·‖_2 yields |||Π_P(A)Π_Q(B)|||_2 = |||D_P Π_Q(U)|||_2 with U = A^*B. The situation where both the band-limitation operator and the time-limitation
operator are replaced by general orthogonal projection operators can hence be reduced to the case
considered here.
Δ_{P,Q}(U) = max_{x ∈ W_{U,Q}\{0}} ‖x_P‖_2 / ‖x‖_2.   (6.1)

An uncertainty relation in (C^m, ‖·‖_2) is an upper bound of the form Δ_{P,Q}(U) ≤ c with c ≥ 0, and states that ‖x_P‖_2 ≤ c‖x‖_2 for all x ∈ W_{U,Q}. Δ_{P,Q}(U) hence quantifies how well a vector supported on Q in the basis U can be concentrated on P. Note that an uncertainty relation in (C^m, ‖·‖_2) is non-trivial only if c < 1. Application of Lemma 6.13 now yields

‖D_P Π_Q(U)‖_2 / √(rank(D_P Π_Q(U))) ≤ Δ_{P,Q}(U) ≤ ‖D_P Π_Q(U)‖_2,   (6.2)
where the upper bound constitutes an uncertainty relation and the lower bound will allow us to assess its tightness. Next, note that

‖D_P Π_Q(U)‖_2 = √(tr(D_P Π_Q(U)))   (6.3)

and

rank(D_P Π_Q(U)) = rank(D_P U D_Q U^*)   (6.4)
                 ≤ min{|P|, |Q|},   (6.5)

where (6.5) follows from rank(D_P U D_Q) ≤ min{|P|, |Q|} and Property (c) in Section 0.4.5 of [33]. When used in (6.2) this implies

√(tr(D_P Π_Q(U)) / min{|P|, |Q|}) ≤ Δ_{P,Q}(U) ≤ √(tr(D_P Π_Q(U))).   (6.6)
Particularizing to U = F, we obtain

tr(D_P Π_Q(F)) = tr(D_P F D_Q F^*)   (6.7)
               = Σ_{i∈P} Σ_{j∈Q} |F_{i,j}|^2   (6.8)
               = |P||Q|/m,   (6.9)

so that (6.6) reduces to

√(max{|P|, |Q|}/m) ≤ Δ_{P,Q}(F) ≤ √(|P||Q|/m).   (6.10)
There exist sets P, Q ⊆ {1, . . . , m} that saturate both bounds in (6.10), e.g., P = {1} and Q = {1, . . . , m}, which yields √(max{|P|, |Q|}/m) = √(|P||Q|/m) = 1 and therefore Δ_{P,Q}(F) = 1. An example of sets P, Q ⊆ {1, . . . , m} saturating only the lower bound in (6.10) is as follows. Take n to divide m and set

P = {m/n, 2m/n, . . . , (n − 1)m/n, m}   (6.11)
and
Q = {l + 1, . . . , l + n}, (6.12)
with l ∈ {1, . . . , m} and Q interpreted circularly in {1, . . . , m}. Then, the upper bound in (6.10) is

√(|P||Q|/m) = n/√m,   (6.13)

whereas the lower bound becomes

√(max{|P|, |Q|}/m) = √(n/m).   (6.14)

Thus, for m → ∞ with fixed ratio m/n, the upper bound in (6.10) tends to infinity whereas the corresponding lower bound remains constant. The following result states that the lower bound in (6.10) is tight for P and Q as in (6.11) and (6.12), respectively. This implies a lack of tightness of the uncertainty relation Δ_{P,Q}(F) ≤ √(|P||Q|/m) by a factor of √n. The large sieve-based uncertainty relation developed in the next section will be seen to remedy this problem.
lemma 6.1 (Theorem 11 of [6]) Let n divide m and consider

P = {m/n, 2m/n, . . . , (n − 1)m/n, m}   (6.15)

and

Q = {l + 1, . . . , l + n},   (6.16)

with l ∈ {1, . . . , m} and Q interpreted circularly in {1, . . . , m}. Then, Δ_{P,Q}(F) = √(n/m).
Proof We have

Δ_{P,Q}(F) = |||Π_Q(F) D_P|||_2   (6.17)
           = |||D_Q F^* D_P|||_2   (6.18)
           = max_{x: ‖x‖_2=1} ‖D_Q F^* D_P x‖_2   (6.19)
           = max_{x: x≠0} ‖D_Q F^* x_P‖_2 / ‖x‖_2   (6.20)
           = max_{x: x=x_P, x≠0} ‖D_Q F^* x‖_2 / ‖x‖_2,   (6.21)

where in (6.17) we applied Lemma 6.12 and in (6.18) we used unitary invariance of ‖·‖_2. Next, consider an arbitrary but fixed x ∈ C^m with x = x_P and define y ∈ C^n according to y_s = x_{ms/n} for s = 1, . . . , n. It follows that

‖D_Q F^* x‖_2^2 = (1/m) Σ_{q∈Q} |Σ_{p∈P} x_p e^{2πjpq/m}|^2   (6.22)
               = (1/m) Σ_{q∈Q} |Σ_{s=1}^n x_{ms/n} e^{2πjsq/n}|^2   (6.23)
               = (1/m) Σ_{q∈Q} |Σ_{s=1}^n y_s e^{2πjsq/n}|^2   (6.24)
               = (n/m) ‖F^* y‖_2^2   (6.25)
               = (n/m) ‖y‖_2^2,   (6.26)

where F in (6.25) is the n × n DFT matrix and in (6.26) we used unitary invariance of ‖·‖_2. With (6.22)–(6.26) and ‖x‖_2 = ‖y‖_2 in (6.21), we get Δ_{P,Q}(F) = √(n/m).

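The following minimal numerical sketch (assuming NumPy is available; all variable names are ours and purely illustrative) makes Lemma 6.1 and the bounds in (6.10) concrete: it computes Δ_{P,Q}(F) for a picket-fence set P with spacing m/n and the interval Q of (6.16), and compares it with the two bounds in (6.10).

```python
import numpy as np

m, n = 64, 8                                  # n divides m
F = np.fft.fft(np.eye(m)) / np.sqrt(m)        # unitary m x m DFT matrix
P = np.arange(0, m, m // n)                   # support with spacing m/n, cf. (6.15) (up to a cyclic shift)
Q = np.arange(n)                              # interval of length n, cf. (6.16)

D_P = np.zeros((m, m)); D_P[P, P] = 1.0
Pi_Q = F[:, Q] @ F[:, Q].conj().T             # orthogonal projection onto W_{F,Q}

delta = np.linalg.norm(D_P @ Pi_Q, 2)         # operator 2-norm, i.e., Delta_{P,Q}(F)
print(delta, np.sqrt(n / m))                  # agree, as stated in Lemma 6.1
print(np.sqrt(max(len(P), len(Q)) / m),       # lower bound in (6.10)
      np.sqrt(len(P) * len(Q) / m))           # upper bound in (6.10), loose by a factor sqrt(n)
```

For m = 64 and n = 8 the computed operator norm matches √(n/m) to numerical precision, while the coherence-type upper bound in (6.10) is larger by √n.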
6.2.1 Uncertainty Relations Based on the Large Sieve


The uncertainty relation in (6.6) is very crude as it simply upper-bounds the operator
2-norm by the Frobenius norm. For U = F a more sophisticated upper bound on ΔP,Q (F)
was reported in Theorem 12 of [7]. The proof of this result establishes a remarkable
connection to the so-called “large sieve,” a family of inequalities involving trigonometric
polynomials originally developed in the field of analytic number theory [20, 21]. We next
present a slightly improved and generalized version of Theorem 12 of [7].
theorem 6.1 Let P ⊆ {1, . . . , m}, l, n ∈ {1, . . . , m}, and
Q = {l + 1, . . . , l + n}, (6.27)
with Q interpreted circularly in {1, . . . , m}. For λ ∈ (0, m], we define the circular Nyquist
density ρ(P, λ) according to
ρ(P, λ) = (1/λ) max_{r∈[0,m)} |P̃ ∩ (r, r + λ)|,   (6.28)

where P̃ = P ∪ {m + p : p ∈ P}. Then,

Δ_{P,Q}(F) ≤ √((λ(n − 1)/m + 1) ρ(P, λ))   (6.29)

for all λ ∈ (0, m].
Proof If P = ∅, then Δ_{P,Q}(F) = 0 as a consequence of Π_∅(F) = 0 and (6.29) holds trivially. Suppose now that P ≠ ∅, consider an arbitrary but fixed x ∈ W_{F,Q} with ‖x‖_2 = 1, and set a = F^* x. Then, a = a_Q and, by unitarity of F, ‖a‖_2 = 1. We have

|x_p|^2 = |(Fa)_p|^2   (6.30)
        = (1/m) |Σ_{q∈Q} a_q e^{−2πjpq/m}|^2   (6.31)
        = (1/m) |Σ_{k=1}^n a_k e^{−2πjpk/m}|^2   (6.32)
        = (1/m) |ψ(p/m)|^2   for p ∈ {1, . . . , m},   (6.33)

where we defined the 1-periodic trigonometric polynomial ψ(s) according to

ψ(s) = Σ_{k=1}^n a_k e^{−2πjks}.   (6.34)

Next, let ν_t denote the unit Dirac measure centered at t ∈ R and set μ = Σ_{p∈P} ν_{p/m} with 1-periodic extension outside [0, 1). Then,

‖x_P‖_2^2 = (1/m) Σ_{p∈P} |ψ(p/m)|^2   (6.35)
          = (1/m) ∫_{[0,1)} |ψ(s)|^2 dμ(s)   (6.36)
          ≤ ((n − 1)/m + 1/λ) sup_{r∈[0,1)} μ((r, r + λ/m))   (6.37)

for all λ ∈ (0, m], where (6.35) is by (6.30)–(6.33) and in (6.37) we applied the large sieve inequality Lemma 6.10 with δ = λ/m and ‖a‖_2 = 1. Now,

sup_{r∈[0,1)} μ((r, r + λ/m))   (6.38)
  = sup_{r∈[0,m)} Σ_{p∈P} (ν_p((r, r + λ)) + ν_{m+p}((r, r + λ)))   (6.39)
  = max_{r∈[0,m)} |P̃ ∩ (r, r + λ)|   (6.40)
  = λ ρ(P, λ)   for all λ ∈ (0, m],   (6.41)

where in (6.39) we used the 1-periodicity of μ. Using (6.38)–(6.41) in (6.37) yields

‖x_P‖_2^2 ≤ (λ(n − 1)/m + 1) ρ(P, λ)   for all λ ∈ (0, m].   (6.42)

As x ∈ W_{F,Q} with ‖x‖_2 = 1 was arbitrary, we conclude that

Δ_{P,Q}^2(F) = max_{x∈W_{F,Q}\{0}} ‖x_P‖_2^2 / ‖x‖_2^2   (6.43)
             ≤ (λ(n − 1)/m + 1) ρ(P, λ)   for all λ ∈ (0, m],   (6.44)

thereby finishing the proof.

Theorem 6.1 slightly improves upon Theorem 12 of [7] by virtue of applying to more general sets Q and defining the circular Nyquist density in (6.28) in terms of open intervals (r, r + λ).
We next apply Theorem 6.1 to specific choices of P and Q. First, consider P = {1} and Q = {1, . . . , m}, which were shown to saturate the upper and the lower bound in (6.10), leading to Δ_{P,Q}(F) = 1. Since P consists of a single point, ρ(P, λ) = 1/λ for all λ ∈ (0, m]. Thus, Theorem 6.1 with n = m yields

Δ_{P,Q}(F) ≤ √((m − 1)/m + 1/λ)   for all λ ∈ (0, m].   (6.45)

Setting λ = m in (6.45) yields Δ_{P,Q}(F) ≤ 1.
Next, consider P and Q as in (6.11) and (6.12), respectively, which, as already mentioned, have the uncertainty relation in (6.10) lacking tightness by a factor of √n. Since P consists of points spaced m/n apart, we get ρ(P, λ) = 1/λ for all λ ∈ (0, m/n]. The upper bound (6.29) now becomes

Δ_{P,Q}(F) ≤ √((n − 1)/m + 1/λ)   for all λ ∈ (0, m/n].   (6.46)

Setting λ = m/n in (6.46) yields

Δ_{P,Q}(F) ≤ √((2n − 1)/m) ≤ √2 √(n/m),   (6.47)

which is tight up to a factor of √2 (cf. Lemma 6.1). We hasten to add, however, that the large sieve technique applies to U = F only.
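As a quick sanity check of (6.29), the sketch below (assuming NumPy; the supremum in (6.28) is only approximated on a grid over r, and all parameter choices are illustrative) evaluates the large sieve bound over a range of λ for the picket-fence P and compares it with the bound (6.47) and with the exact value from Lemma 6.1.

```python
import numpy as np

m, n = 64, 8
P = np.arange(1, n + 1) * (m // n)        # 1-based picket-fence support {m/n, ..., m}
P_tilde = np.concatenate([P, P + m])      # circular extension, cf. (6.28)

def rho(lam):
    # circular Nyquist density (6.28); the sup over r is approximated on a grid
    counts = [np.sum((P_tilde > r) & (P_tilde < r + lam))
              for r in np.linspace(0, m, 4 * m, endpoint=False)]
    return max(counts) / lam

lams = np.linspace(0.1, m / n, 50)
bounds = [np.sqrt((lam * (n - 1) / m + 1) * rho(lam)) for lam in lams]
print(min(bounds))                 # best large sieve bound over the lambda grid
print(np.sqrt((2 * n - 1) / m))    # bound (6.47), obtained with lambda = m/n
print(np.sqrt(n / m))              # exact value, Lemma 6.1
```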

6.2.2 Coherence-Based Uncertainty Relation


We next present an uncertainty relation that is of simple form and applies to general
unitary U. To this end, we first introduce the concept of coherence of a matrix.
definition 6.1 For A = (a_1 . . . a_n) ∈ C^{m×n} with columns ‖·‖_2-normalized to 1, the coherence is defined as μ(A) = max_{i≠j} |a_i^* a_j|.
We have the following coherence-based uncertainty relation valid for general
unitary U.
lemma 6.2 Let U ∈ Cm×m be unitary and P, Q ⊆ {1, . . . , m}. Then,

Δ_{P,Q}(U) ≤ √(|P||Q|) μ([I U]).   (6.48)
Proof The claim follows from

Δ_{P,Q}^2(U) ≤ tr(D_P U D_Q U^*)   (6.49)
             = Σ_{k∈P} Σ_{l∈Q} |U_{k,l}|^2   (6.50)
             ≤ |P||Q| max_{k,l} |U_{k,l}|^2   (6.51)
             = |P||Q| μ^2([I U]),   (6.52)

where (6.49) is by (6.6) and in (6.52) we used the definition of coherence.

Since μ([I F]) = 1/√m, Lemma 6.2 particularized to U = F recovers the upper bound in (6.10).
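The coherence of Definition 6.1 is straightforward to compute. The following sketch (assuming NumPy; the helper coherence is ours, not from the chapter) verifies μ([I F]) = 1/√m, so that (6.48) indeed reproduces the upper bound in (6.10).

```python
import numpy as np

def coherence(A):
    """Largest absolute inner product between distinct (unit-norm) columns."""
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.conj().T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

m = 64
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
print(coherence(np.hstack([np.eye(m), F])), 1 / np.sqrt(m))   # both equal 0.125
```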
6.2.3 Concentration Inequalities


As mentioned at the beginning of this chapter, the classical uncertainty relation in signal
analysis quantifies how well concentrated a signal can be in time and frequency. In the
finite-dimensional setting considered here this amounts to characterizing the concentra-
tion of p and q in p = Fq. We will actually study the more general case obtained by
replacing I and F by unitary A ∈ Cm×m and B ∈ Cm×m , respectively, and will ask our-
selves how well concentrated p and q in Ap = Bq can be. Rewriting Ap = Bq according
to p = Uq with U = A∗ B, we now show how the uncertainty relation in Lemma 6.2 can
be used to answer this question. Let us start by introducing a measure for concentration
in (Cm ,  · 2 ).
definition 6.2 Let P ⊆ {1, . . . , m} and ε_P ∈ [0, 1]. The vector x ∈ C^m is said to be ε_P-concentrated if ‖x − x_P‖_2 ≤ ε_P ‖x‖_2.
The fraction of 2-norm an εP -concentrated vector exhibits outside P is therefore no
more than εP . In particular, if x is εP -concentrated with εP = 0, then x = xP and x is |P|-
sparse. The zero vector is trivially εP -concentrated for all P ⊆ {1, . . . , m} and εP ∈ [0, 1].
We next derive a lower bound on ΔP,Q (U) for unitary matrices U that relate εP -
concentrated vectors p to εQ -concentrated vectors q through p = Uq. The formal
statement is as follows.
lemma 6.3 Let U ∈ Cm×m be unitary and P, Q ⊆ {1, . . . , m}. Suppose that there exist
a non-zero εP -concentrated p ∈ Cm and a non-zero εQ -concentrated q ∈ Cm such that
p = Uq. Then,
ΔP,Q (U) ≥ [1 − εP − εQ ]+ . (6.53)
Proof We have
p − ΠQ (U)pP 2 ≤ p − ΠQ (U)p2 + ΠQ (U)pP − ΠQ (U)p2 (6.54)
≤ p − ΠQ (U)p2 + |||ΠQ (U)|||2 pP − p2 (6.55)

≤ p − UDQ U p2 + pP − p2 (6.56)
= q − qQ 2 + pP − p2 , (6.57)
≤ εQ q2 + εP p2 (6.58)
= (εP + εQ )p2 , (6.59)
where in (6.57) we made use of the unitary invariance of  · 2 . It follows that
ΠQ (U)pP 2 ≥ [p2 − p − ΠQ (U)pP 2 ]+ (6.60)
≥ p2 [1 − εP − εQ ]+ , (6.61)
where (6.60) is by the reverse triangle inequality and in (6.61) we used (6.54)–(6.59).
Since p ≠ 0 by assumption, (6.60)–(6.61) implies

‖Π_Q(U) D_P (p/‖p‖_2)‖_2 ≥ [1 − ε_P − ε_Q]_+,   (6.62)
which in turn yields |||ΠQ (U)DP |||2 ≥ [1 − εP − εQ ]+ . This concludes the proof as
ΔP,Q (U) = |||ΠQ (U)DP |||2 by Lemma 6.12.
Combining Lemma 6.3 with the uncertainty relation Lemma 6.2 yields the announced
result stating that a non-zero vector cannot be arbitrarily well concentrated with respect
to two different orthonormal bases.
corollary 6.1 Let A, B ∈ Cm×m be unitary and P, Q ⊆ {1, . . . , m}. Suppose that there
exist a non-zero εP -concentrated p ∈ Cm and a non-zero εQ -concentrated q ∈ Cm such
that Ap = Bq. Then,
|P||Q| ≥ [1 − ε_P − ε_Q]_+^2 / μ^2([A B]).   (6.63)
Proof Let U = A∗ B. Then, by Lemmata 6.2 and 6.3, we have

[1 − ε_P − ε_Q]_+ ≤ Δ_{P,Q}(U) ≤ √(|P||Q|) μ([I U]).   (6.64)
The claim now follows by noting that μ([I U]) = μ([A B]).
For εP = εQ = 0, we recover the well-known Elad–Bruckstein result.
corollary 6.2 (Theorem 1 of [8]) Let A, B ∈ C^{m×m} be unitary. If Ap = Bq for non-zero p, q ∈ C^m, then ‖p‖_0 ‖q‖_0 ≥ 1/μ^2([A B]).
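A picket-fence signal makes the Elad–Bruckstein bound of Corollary 6.2 tight. The short sketch below (assuming NumPy; the spacing a = 8 is an arbitrary divisor of m chosen for illustration) checks that ‖p‖_0 ‖q‖_0 = m = 1/μ^2([I F]) for p = Fq.

```python
import numpy as np

m, a = 64, 8
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
p = np.zeros(m); p[::a] = 1.0            # picket fence with spacing a (a divides m)
q = F.conj().T @ p                       # p = F q, i.e., q = F^* p, also a picket fence
nnz = lambda x: int(np.sum(np.abs(x) > 1e-10))
print(nnz(p) * nnz(q), m)                # (m/a) * a = m = 1/mu^2([I F]): equality in Corollary 6.2
```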

6.2.4 Noisy Recovery in (C^m, ‖·‖_2)


Uncertainty relations are typically employed to prove that something is not possible.
For example, by Corollary 6.1 there is a limit on how well a non-zero vector can
be concentrated with respect to two different orthonormal bases. Donoho and Stark
[6] noticed that uncertainty relations can also be used to show that something unex-
pected is possible. Specifically, Section 4 of [6] considers a noisy signal recovery
problem, which we now translate to the finite-dimensional setting. Let p, n ∈ Cm and
P ⊆ {1, . . . , m}, set Pc = {1, . . . , m}\P, and suppose that we observe y = pPc + n. Note
that the information contained in pP is completely lost in the observation. Without
structural assumptions on p, it is therefore not possible to recover information on pP
from y. However, if p is sufficiently sparse with respect to an orthonormal basis and
|P| is sufficiently small, it turns out that all entries of p can be recovered in a linear
fashion to within a precision determined by the noise level. This is often referred
to in the literature as stable recovery [6]. The corresponding formal statement is as
follows.
lemma 6.4 Let U ∈ Cm×m be unitary, Q ⊆ {1, . . . , m}, p ∈ WU,Q , and consider
y = pPc + n, (6.65)
where n ∈ Cm and Pc = {1, . . . , m}\P with P ⊆ {1, . . . , m}. If ΔP,Q (U) < 1, then there exists
a matrix L ∈ Cm×m such that
‖Ly − p‖_2 ≤ C ‖n_{P^c}‖_2   (6.66)

with C = 1/(1 − Δ_{P,Q}(U)). In particular,

|P||Q| < 1/μ^2([I U])   (6.67)

is sufficient for Δ_{P,Q}(U) < 1.
Proof For Δ_{P,Q}(U) < 1, it follows that (see p. 301 of [33]) (I − D_P Π_Q(U)) is invertible with

|||(I − D_P Π_Q(U))^{−1}|||_2 ≤ 1/(1 − |||D_P Π_Q(U)|||_2)   (6.68)
                              = 1/(1 − Δ_{P,Q}(U)).   (6.69)

We now set L = (I − D_P Π_Q(U))^{−1} D_{P^c} and note that

L p_{P^c} = (I − D_P Π_Q(U))^{−1} p_{P^c}   (6.70)
          = (I − D_P Π_Q(U))^{−1} (I − D_P)p   (6.71)
          = (I − D_P Π_Q(U))^{−1} (I − D_P Π_Q(U))p   (6.72)
          = p,   (6.73)
where in (6.72) we used ΠQ (U)p = p, which is by assumption. Next, we upper-bound
‖Ly − p‖_2 according to

‖Ly − p‖_2 = ‖L p_{P^c} + L n − p‖_2   (6.74)
           = ‖L n‖_2   (6.75)
           ≤ |||(I − D_P Π_Q(U))^{−1}|||_2 ‖n_{P^c}‖_2   (6.76)
           ≤ ‖n_{P^c}‖_2 / (1 − Δ_{P,Q}(U)),   (6.77)
where in (6.75) we used (6.70)–(6.73). Finally, Lemma 6.2 implies that (6.67) is
sufficient for ΔP,Q (U) < 1.
We next particularize Lemma 6.4 for U = F,

P = {m/n, 2m/n, . . . , (n − 1)m/n, m}   (6.78)

and

Q = {l + 1, . . . , l + n},   (6.79)

with l ∈ {1, . . . , m} and Q interpreted circularly in {1, . . . , m}. This means that p is n-sparse in F and we are missing n entries in the noisy observation y. From Lemma 6.1, we know that Δ_{P,Q}(F) = √(n/m). Since n divides m by assumption, stable recovery of p is possible for n ≤ m/2. In contrast, the coherence-based uncertainty relation in Lemma 6.2 yields Δ_{P,Q}(F) ≤ n/√m, and would hence suggest that n^2 < m is needed for stable recovery.
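The linear recovery map L = (I − D_P Π_Q(U))^{−1} D_{P^c} from the proof of Lemma 6.4 is easy to test numerically. The following sketch (assuming NumPy; dimensions and noise level are arbitrary choices of ours) erases the n picket-fence entries of a vector that is n-sparse in the DFT basis and verifies the error bound (6.66).

```python
import numpy as np
rng = np.random.default_rng(0)

m, n = 64, 8
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
P = np.arange(0, m, m // n)                      # erased entries, spacing m/n as in (6.78)
Q = np.arange(n)                                 # spectral support of p, cf. (6.79)

D_P = np.zeros((m, m)); D_P[P, P] = 1.0
Pi_Q = F[:, Q] @ F[:, Q].conj().T                # projection onto W_{F,Q}

p = F[:, Q] @ rng.standard_normal(n)             # p in W_{F,Q}
noise = 0.01 * rng.standard_normal(m)
y = (np.eye(m) - D_P) @ p + noise                # observation (6.65)

L = np.linalg.inv(np.eye(m) - D_P @ Pi_Q) @ (np.eye(m) - D_P)
delta = np.linalg.norm(D_P @ Pi_Q, 2)            # equals sqrt(n/m) here
err = np.linalg.norm(L @ y - p)
bound = np.linalg.norm((np.eye(m) - D_P) @ noise) / (1 - delta)   # C ||n_{P^c}||_2
print(err <= bound, err, bound)                  # err stays below the bound (6.66)
```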

6.3 Uncertainty Relations in (C^m, ‖·‖_1)

We introduce uncertainty relations in (C^m, ‖·‖_1) following the same story line as in Section 6.2. Specifically, let U = (u_1 . . . u_m) ∈ C^{m×m} be a unitary matrix, P, Q ⊆ {1, . . . , m}, and consider the orthogonal projection Π_Q(U) onto the subspace W_{U,Q}, which is spanned by {u_i : i ∈ Q}. Let² Σ_{P,Q}(U) = |||D_P Π_Q(U)|||_1. By Lemma 6.12 we have

Σ_{P,Q}(U) = max_{x ∈ W_{U,Q}\{0}} ‖x_P‖_1 / ‖x‖_1.   (6.80)

An uncertainty relation in (C^m, ‖·‖_1) is an upper bound of the form Σ_{P,Q}(U) ≤ c with c ≥ 0 and states that ‖x_P‖_1 ≤ c‖x‖_1 for all x ∈ W_{U,Q}. Σ_{P,Q}(U) hence quantifies how well a vector supported on Q in the basis U can be concentrated on P, where now concentration is measured in terms of 1-norm. Again, an uncertainty relation in (C^m, ‖·‖_1) is non-trivial only if c < 1. Application of Lemma 6.14 yields

(1/m) ‖D_P Π_Q(U)‖_1 ≤ Σ_{P,Q}(U) ≤ ‖D_P Π_Q(U)‖_1,   (6.81)

which constitutes the 1-norm equivalent of (6.2).

6.3.1 Coherence-Based Uncertainty Relation


We next derive a coherence-based uncertainty relation for (Cm ,  · 1 ), which comes with
the same advantages and disadvantages as its 2-norm counterpart.
lemma 6.5 Let U ∈ Cm×m be a unitary matrix and P, Q ⊆ {1, . . . , m}. Then,

Σ_{P,Q}(U) ≤ |P||Q| μ^2([I U]).   (6.82)

Proof Let u_i denote the column vectors of U^*. It follows from Lemma 6.14 that

Σ_{P,Q}(U) = max_{j∈{1,...,m}} ‖D_P U D_Q u_j‖_1.   (6.83)

With

max_{j∈{1,...,m}} ‖D_P U D_Q u_j‖_1 ≤ |P| max_{i,j∈{1,...,m}} |u_i^* D_Q u_j|   (6.84)
                                    ≤ |P||Q| max_{i,j,k∈{1,...,m}} |U_{i,k}||U_{j,k}|   (6.85)
                                    ≤ |P||Q| μ^2([I U]),   (6.86)

this establishes the proof.

For P = {1}, Q = {1, . . . , m}, and U = F, the upper bounds on ΣP,Q (F) in (6.81) and
(6.82) coincide and equal 1. We next present an example where (6.82) is sharper

2 In contrast to the operator 2-norm, the operator 1-norm is not invariant under unitary transformations, so
that we do not have |||ΠP (A)ΠQ (B)|||1 =|||DP ΠQ (A∗ B)|||1 for general unitary A, B. This, however, does not
constitute a problem as, whenever we apply uncertainty relations in (Cm ,  · 1 ), the case of general unitary
A, B can always be reduced directly to ΠP (I) = DP and ΠQ (A∗ B), simply by rewriting Ap = Bq according
to p = A∗ Bq.
than (6.81). Let m be even, P = {m}, Q = {1, . . . , m/2}, and U = F. Then, (6.82) becomes Σ_{P,Q}(F) ≤ 1/2, whereas

‖D_P Π_Q(F)‖_1 = (1/m) Σ_{l=1}^m |Σ_{k=1}^{m/2} e^{2πjlk/m}|   (6.87)
               = 1/2 + (1/m) Σ_{l=1}^{m−1} |(1 − e^{πjl})/(1 − e^{2πjl/m})|   (6.88)
               = 1/2 + (2/m) Σ_{l=1}^{m/2} 1/|1 − e^{2πj(2l−1)/m}|   (6.89)
               = 1/2 + (1/m) Σ_{l=1}^{m/2} 1/sin(π(2l − 1)/m).   (6.90)

Applying Jensen’s inequality, Theorem 2.6.2 of [34], to (6.90) and using Σ_{l=1}^{m/2} (2l − 1) = (m/2)^2 then yields ‖D_P Π_Q(F)‖_1 ≥ 1, which shows that (6.81) is trivial.
For P and Q as in (6.11) and (6.12), respectively, (6.82) becomes Σ_{P,Q}(F) ≤ n^2/m, which for fixed ratio n/m increases linearly in m and becomes trivial for m ≥ (m/n)^2. A more sophisticated uncertainty relation based on a large sieve inequality exists for strictly bandlimited (infinite) ℓ_1-sequences, see Theorem 14 of [7]; a corresponding finite-dimensional result does not seem to be available.
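For the example with P = {m} and Q = {1, . . . , m/2} one can also verify numerically that the coherence bound (6.82) is the informative one, while the entrywise bound in (6.81) is trivial. A minimal sketch, assuming NumPy (recall that the operator 1-norm is the maximum absolute column sum):

```python
import numpy as np

m = 64
F = np.fft.fft(np.eye(m)) / np.sqrt(m)
D_P = np.zeros((m, m)); D_P[m - 1, m - 1] = 1.0     # P = {m}
Q = np.arange(m // 2)                               # Q = {1, ..., m/2}
Pi_Q = F[:, Q] @ F[:, Q].conj().T
M = D_P @ Pi_Q

op1 = np.abs(M).sum(axis=0).max()    # operator 1-norm |||M|||_1 = Sigma_{P,Q}(F)
entry1 = np.abs(M).sum()             # entrywise norm ||M||_1, upper bound in (6.81)
print(op1, 0.5, entry1)              # op1 meets the bound (6.82) of 1/2, while entry1 >= 1
```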

6.3.2 Concentration Inequalities


Analogously to Section 6.2.3, we next ask how well concentrated a given signal vector
can be in two different orthonormal bases. Here, however, we consider a different
measure of concentration accounting for the fact that we are dealing with the 1-norm.
definition 6.3 Let P ⊆ {1, . . . , m} and ε_P ∈ [0, 1]. The vector x ∈ C^m is said to be ε_P-concentrated if ‖x − x_P‖_1 ≤ ε_P ‖x‖_1.
The fraction of 1-norm an εP -concentrated vector exhibits outside P is therefore
no more than εP . In particular, if x is εP -concentrated for εP = 0, then x = xP and
x is |P|-sparse. The zero vector is trivially εP -concentrated for all P ⊆ {1, . . . , m} and
εP ∈ [0, 1]. In the remainder of Section 6.3, concentration is with respect to the 1-norm
according to Definition 6.3.
We are now ready to state the announced result on the concentration of a vector in
two different orthonormal bases.
lemma 6.6 Let A, B ∈ Cm×m be unitary and P, Q ⊆ {1, . . . , m}. Suppose that there exist
a non-zero εP -concentrated p ∈ Cm and a non-zero q ∈ Cm with q = qQ such that
Ap = Bq. Then,

1 − εP
|P||Q| ≥ . (6.91)
μ2 ([A B])
Proof Rewriting Ap = Bq according to p = A^*Bq, it follows that p ∈ W_{U,Q} with U = A^*B. We have

1 − ε_P ≤ ‖p_P‖_1/‖p‖_1   (6.92)
        ≤ Σ_{P,Q}(U)   (6.93)
        ≤ |P||Q| μ^2([I U]),   (6.94)

where (6.92) is by ε_P-concentration of p, (6.93) follows from (6.80) and p ∈ W_{U,Q}, and in (6.94) we applied Lemma 6.5. The proof is concluded by noting that μ([I U]) = μ([A B]).

For εP = 0, Lemma 6.6 recovers Corollary 6.2.

6.3.3 Noisy Recovery in (C^m, ‖·‖_1)


We next consider a noisy signal recovery problem akin to that in Section 6.2.4.
Specifically, we investigate recovery–through 1-norm minimization–of a sparse signal
corrupted by εP -concentrated noise.
lemma 6.7 Let

y = p + n, (6.95)

where n ∈ Cm is εP -concentrated to P ⊆ {1, . . . , m} and p ∈ WU,Q for U ∈ Cm×m unitary


and Q ⊆ {1, . . . , m}. Denote

z = argmin_{w ∈ W_{U,Q}} ‖y − w‖_1.   (6.96)

If Σ_{P,Q}(U) < 1/2, then ‖z − p‖_1 ≤ C ε_P ‖n‖_1 with C = 2/(1 − 2Σ_{P,Q}(U)). In particular,

|P||Q| < 1/(2μ^2([I U]))   (6.97)

is sufficient for Σ_{P,Q}(U) < 1/2.
Proof Set P^c = {1, . . . , m}\P and let q = U^* p. Note that q_Q = q as a consequence of p ∈ W_{U,Q}, which is by assumption. We have

‖n‖_1 = ‖y − p‖_1   (6.98)
      ≥ ‖y − z‖_1   (6.99)
      = ‖n − z̃‖_1   (6.100)
      = ‖(n − z̃)_P‖_1 + ‖(n − z̃)_{P^c}‖_1   (6.101)
      ≥ ‖n_P‖_1 − ‖n_{P^c}‖_1 + ‖z̃_{P^c}‖_1 − ‖z̃_P‖_1   (6.102)
      = ‖n‖_1 − 2‖n_{P^c}‖_1 + ‖z̃‖_1 − 2‖z̃_P‖_1   (6.103)
      ≥ ‖n‖_1 (1 − 2ε_P) + ‖z̃‖_1 (1 − 2Σ_{P,Q}(U)),   (6.104)

where in (6.100) we set z̃ = z − p, in (6.102) we applied the reverse triangle inequality, and in (6.104) we used that n is ε_P-concentrated and z̃ ∈ W_{U,Q}, owing to z ∈ W_{U,Q} and p ∈ W_{U,Q}, together with (6.80). This yields

‖z − p‖_1 = ‖z̃‖_1   (6.105)
          ≤ 2ε_P ‖n‖_1 / (1 − 2Σ_{P,Q}(U)).   (6.106)

Finally, (6.97) implies Σ_{P,Q}(U) < 1/2 thanks to (6.82).

Note that, for ε_P = 0, i.e., the noise vector is supported on P, we can recover p from y = p + n perfectly, provided that Σ_{P,Q}(U) < 1/2. For the special case U = F, this is guaranteed by

|P||Q| < m/2,   (6.107)

and perfect recovery of p from y = p + n amounts to the finite-dimensional version of what is known as Logan’s phenomenon (see Section 6.2 of [6]).
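The ℓ1-recovery program (6.96) is a convex problem and can be prototyped in a few lines. The sketch below is an illustration only, assuming NumPy and CVXPY are available; we use a real orthonormal DCT basis as U so that all variables stay real, and the sets P and Q are arbitrary choices of ours satisfying (6.97). A signal p ∈ W_{U,Q} is corrupted by impulse noise supported on P (ε_P = 0) and recovered exactly, in the spirit of Logan’s phenomenon.

```python
import numpy as np
import cvxpy as cp
rng = np.random.default_rng(0)

m = 64
t = np.arange(m)
U = np.sqrt(2 / m) * np.cos(np.pi * (2 * t[:, None] + 1) * t[None, :] / (2 * m))
U[:, 0] = 1 / np.sqrt(m)                 # orthonormal DCT-II basis as columns, mu([I U]) = sqrt(2/m)

Q = np.array([3, 11, 25, 40, 57])        # |Q| = 5
P = np.array([5, 20, 45])                # |P| = 3, so |P||Q| = 15 < m/4 = 16, cf. (6.97)
p = U[:, Q] @ rng.standard_normal(len(Q))                          # p in W_{U,Q}
noise = np.zeros(m); noise[P] = 10 * rng.standard_normal(len(P))   # impulse noise, eps_P = 0
y = p + noise                                                      # observation (6.95)

a = cp.Variable(len(Q))
cp.Problem(cp.Minimize(cp.norm1(y - U[:, Q] @ a))).solve()
z = U[:, Q] @ a.value
print(np.linalg.norm(z - p, 1))          # essentially 0 (solver tolerance): perfect recovery
```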

6.3.4 Coherence-Based Uncertainty Relation for Pairs of General Matrices


In practice, one is often interested in sparse signal representations with respect to
general (i.e., possibly redundant or incomplete) dictionaries. The purpose of this section
is to provide a corresponding general uncertainty relation. Specifically, we consider
representations of a given signal vector s according to s = Ap = Bq, where A ∈ Cm×p
and B ∈ Cm×q are general matrices, p ∈ C p , and q ∈ Cq . We start by introducing the
notion of mutual coherence for pairs of matrices.
definition 6.4 For A = (a_1 . . . a_p) ∈ C^{m×p} and B = (b_1 . . . b_q) ∈ C^{m×q}, both with columns ‖·‖_2-normalized to 1, the mutual coherence μ̄(A, B) is defined as μ̄(A, B) = max_{i,j} |a_i^* b_j|.
The general uncertainty relation we are now ready to state is in terms of a pair of
upper bounds on pP 1 and qQ 1 for P ⊆ {1, . . . , p} and Q ⊆ {1, . . . , q}.
theorem 6.2 Let A ∈ C^{m×p} and B ∈ C^{m×q}, both with column vectors ‖·‖_2-normalized to 1, and consider p ∈ C^p and q ∈ C^q. Suppose that Ap = Bq. Then, we have

‖p_P‖_1 ≤ |P| (μ(A)‖p‖_1 + μ̄(A, B)‖q‖_1) / (1 + μ(A))   (6.108)

for all P ⊆ {1, . . . , p} and, by symmetry,

‖q_Q‖_1 ≤ |Q| (μ(B)‖q‖_1 + μ̄(A, B)‖p‖_1) / (1 + μ(B))   (6.109)

for all Q ⊆ {1, . . . , q}.
Proof Since (6.109) follows from (6.108) simply by replacing A by B, p by q, and P by Q, and noting that μ̄(A, B) = μ̄(B, A), it suffices to prove (6.108). Let P ⊆ {1, . . . , p} and consider an arbitrary but fixed i ∈ {1, . . . , p}. Multiplying Ap = Bq from the left by a_i^* and taking absolute values results in

|a_i^* Ap| = |a_i^* Bq|.   (6.110)

The left-hand side of (6.110) can be lower-bounded according to

|a_i^* Ap| = |p_i + Σ_{k=1, k≠i}^p a_i^* a_k p_k|   (6.111)
           ≥ |p_i| − |Σ_{k=1, k≠i}^p a_i^* a_k p_k|   (6.112)
           ≥ |p_i| − Σ_{k=1, k≠i}^p |a_i^* a_k||p_k|   (6.113)
           ≥ |p_i| − μ(A) Σ_{k=1, k≠i}^p |p_k|   (6.114)
           = (1 + μ(A))|p_i| − μ(A)‖p‖_1,   (6.115)

where (6.112) is by the reverse triangle inequality and in (6.114) we used Definition 6.1. Next, we upper-bound the right-hand side of (6.110) according to

|a_i^* Bq| = |Σ_{k=1}^q a_i^* b_k q_k|   (6.116)
           ≤ Σ_{k=1}^q |a_i^* b_k||q_k|   (6.117)
           ≤ μ̄(A, B)‖q‖_1,   (6.118)

where the last step is by Definition 6.4. Combining the lower bound (6.111)–(6.115) and the upper bound (6.116)–(6.118) yields

(1 + μ(A))|p_i| − μ(A)‖p‖_1 ≤ μ̄(A, B)‖q‖_1.   (6.119)

Since (6.119) holds for arbitrary i ∈ {1, . . . , p}, we can sum over all i ∈ P and get

‖p_P‖_1 ≤ |P| (μ(A)‖p‖_1 + μ̄(A, B)‖q‖_1) / (1 + μ(A)).   (6.120)

For the special case A = I ∈ C^{m×m} and B ∈ C^{m×m} with B unitary, we have μ(A) = μ(B) = 0 and μ̄(I, B) = μ([I B]), so that (6.108) and (6.109) simplify to

‖p_P‖_1 ≤ |P| μ([I B]) ‖q‖_1   (6.121)

and

‖q_Q‖_1 ≤ |Q| μ([I B]) ‖p‖_1,   (6.122)

respectively. Thus, for arbitrary but fixed p ∈ W_{B,Q} and q = B^* p, we have q_Q = q so that (6.121) and (6.122) taken together yield

‖p_P‖_1 ≤ |P||Q| μ^2([I B]) ‖p‖_1.   (6.123)

As p was assumed to be arbitrary, by (6.80) this recovers the uncertainty relation

Σ_{P,Q}(B) ≤ |P||Q| μ^2([I B])   (6.124)

in Lemma 6.5.

6.3.5 Concentration Inequalities for Pairs of General Matrices


We next refine the result in Theorem 6.2 to vectors that are concentrated in 1-norm
according to Definition 6.3. The formal statement is as follows.
corollary 6.3 Let A ∈ C^{m×p} and B ∈ C^{m×q}, both with column vectors ‖·‖_2-normalized to 1, P ⊆ {1, . . . , p}, Q ⊆ {1, . . . , q}, p ∈ C^p, and q ∈ C^q. Suppose that Ap = Bq. Then, the following statements hold.
1. If q is ε_Q-concentrated, then

‖p_P‖_1 ≤ (|P|/(1 + μ(A))) (μ(A) + μ̄^2(A, B)|Q| / [(1 + μ(B))(1 − ε_Q) − μ(B)|Q|]_+) ‖p‖_1.   (6.125)

2. If p is ε_P-concentrated, then

‖q_Q‖_1 ≤ (|Q|/(1 + μ(B))) (μ(B) + μ̄^2(A, B)|P| / [(1 + μ(A))(1 − ε_P) − μ(A)|P|]_+) ‖q‖_1.   (6.126)

3. If p is ε_P-concentrated, q is ε_Q-concentrated, μ̄(A, B) > 0, and (p^T q^T)^T ≠ 0, then

|P||Q| ≥ [(1 + μ(A))(1 − ε_P) − μ(A)|P|]_+ [(1 + μ(B))(1 − ε_Q) − μ(B)|Q|]_+ / μ̄^2(A, B).   (6.127)

Proof By Theorem 6.2, we have

‖p_P‖_1 ≤ |P| (μ(A)‖p‖_1 + μ̄(A, B)‖q‖_1) / (1 + μ(A))   (6.128)

and

‖q_Q‖_1 ≤ |Q| (μ(B)‖q‖_1 + μ̄(A, B)‖p‖_1) / (1 + μ(B)).   (6.129)

Suppose now that q is ε_Q-concentrated, i.e., ‖q_Q‖_1 ≥ (1 − ε_Q)‖q‖_1. Then (6.129) implies that

‖q‖_1 ≤ |Q| μ̄(A, B) ‖p‖_1 / [(1 + μ(B))(1 − ε_Q) − μ(B)|Q|]_+.   (6.130)

Using (6.130) in (6.128) yields (6.125). The relation (6.126) follows from (6.125) by swapping the roles of A and B, p and q, and P and Q, and upon noting that μ̄(A, B) = μ̄(B, A). It remains to establish (6.127). Using ‖p_P‖_1 ≥ (1 − ε_P)‖p‖_1 in (6.128) and ‖q_Q‖_1 ≥ (1 − ε_Q)‖q‖_1 in (6.129) yields

‖p‖_1 [(1 + μ(A))(1 − ε_P) − μ(A)|P|]_+ ≤ μ̄(A, B)‖q‖_1 |P|   (6.131)

and

‖q‖_1 [(1 + μ(B))(1 − ε_Q) − μ(B)|Q|]_+ ≤ μ̄(A, B)‖p‖_1 |Q|,   (6.132)

respectively. Suppose first that p = 0. Then q ≠ 0 by assumption, and (6.132) becomes

[(1 + μ(B))(1 − ε_Q) − μ(B)|Q|]_+ = 0.   (6.133)

In this case (6.127) holds trivially. Similarly, if q = 0, then p ≠ 0 again by assumption, and (6.131) becomes

[(1 + μ(A))(1 − ε_P) − μ(A)|P|]_+ = 0.   (6.134)

As before, (6.127) holds trivially. Finally, if p ≠ 0 and q ≠ 0, then we multiply (6.131) by (6.132) and divide the result by μ̄^2(A, B)‖p‖_1‖q‖_1, which yields (6.127).
Corollary 6.3 will be used in Section 6.4 to derive recovery thresholds for sparse
signal separation. The lower bound on |P||Q| in (6.127) is Theorem 1 of [9] and states
that a non-zero vector cannot be arbitrarily well concentrated with respect to two
different general matrices A and B. For the special case εQ = 0 and A and B unitary, and
hence μ(A) = μ(B) = 0 and μ̄(A, B) = μ([A B]), (6.127) recovers Lemma 6.6.
Particularizing (6.127) to εP = εQ = 0 yields the following result.
corollary 6.4 (Lemma 33 of [10]) Let A ∈ C^{m×p} and B ∈ C^{m×q}, both with column vectors ‖·‖_2-normalized to 1, and consider p ∈ C^p and q ∈ C^q with (p^T q^T)^T ≠ 0. Suppose that Ap = Bq. Then, ‖p‖_0 ‖q‖_0 ≥ f_{A,B}(‖p‖_0, ‖q‖_0), where

f_{A,B}(u, v) = [1 + μ(A)(1 − u)]_+ [1 + μ(B)(1 − v)]_+ / μ̄^2(A, B).   (6.135)

Proof Let P = {i ∈ {1, . . . , p} : p_i ≠ 0} and Q = {i ∈ {1, . . . , q} : q_i ≠ 0}, so that p_P = p, q_Q = q, |P| = ‖p‖_0, and |Q| = ‖q‖_0. The claim now follows directly from (6.127) with ε_P = ε_Q = 0.
If A and B are both unitary, then μ(A) = μ(B) = 0 and μ̄(A, B) = μ([A B]), and
Corollary 6.4 recovers the Elad–Bruckstein result in Corollary 6.2.
Corollary 6.4 admits the following appealing geometric interpretation in terms of a
null-space property, which will be seen in Section 6.5 to pave the way to an extension
of the classical notion of sparsity to a more general concept of parsimony.
lemma 6.8 Let A ∈ C^{m×p} and B ∈ C^{m×q}, both with column vectors ‖·‖_2-normalized to 1. Then, the set (which actually is a finite union of subspaces)

S = {(p^T q^T)^T : p ∈ C^p, q ∈ C^q, ‖p‖_0 ‖q‖_0 < f_{A,B}(‖p‖_0, ‖q‖_0)}   (6.136)
with fA,B defined in (6.135) intersects the kernel of [A B] trivially, i.e.,

ker([A B]) ∩ S = {0}. (6.137)

Proof The statement of this lemma is equivalent to the statement of Corollary 6.4
through a chain of equivalences between the following statements:
(1) ker([A B]) ∩ S = {0},
(2) if (p^T −q^T)^T ∈ ker([A B])\{0}, then ‖p‖_0 ‖q‖_0 ≥ f_{A,B}(‖p‖_0, ‖q‖_0),
(3) if Ap = Bq with (p^T q^T)^T ≠ 0, then ‖p‖_0 ‖q‖_0 ≥ f_{A,B}(‖p‖_0, ‖q‖_0),
where (1) ⇔ (2) is by the definition of S, (2) ⇔ (3) follows from the fact that Ap = Bq with (p^T q^T)^T ≠ 0 is equivalent to (p^T −q^T)^T ∈ ker([A B])\{0}, and (3) is the statement in Corollary 6.4.

6.4 Sparse Signal Separation

Numerous practical signal recovery tasks can be cast as sparse signal separation
problems of the following form. We want to recover y ∈ C^p with ‖y‖_0 ≤ s and/or z ∈ C^q with ‖z‖_0 ≤ t from the noiseless observation

w = Ay + Bz, (6.138)

where A ∈ Cm×p and B ∈ Cm×q . Here, s and t are the sparsity levels of y and z with cor-
responding ambient dimensions p and q, respectively. Prominent applications include
(image) inpainting, declipping, super-resolution, the recovery of signals corrupted by
impulse noise, and the separation of (e.g., audio or video) signals into two distinct
components (see Section I of [9]). We next briefly describe some of these problems.
1. Clipping. Non-linearities in power-amplifiers or in analog-to-digital converters
often cause signal clipping or saturation [35]. This effect can be cast into the signal
model (6.138) by setting B = I, identifying s = Ay with the signal to be clipped, and
setting z = (ga (s) − s) with ga (·) realizing entry-wise clipping of the amplitude to the
interval [0, a]. If the clipping level a is not too small, then z will be sparse, i.e., t ≪ q.
2. Missing entries. Our framework also encompasses super-resolution [36, 37] and
inpainting [38] of, e.g., images, audio, and video signals. In both these applications
only a subset of the entries of the (full-resolution) signal vector s = Ay is available
and the task is to fill in the missing entries, which are accounted for by writing
w = s + z with zi = −si if the ith entry of s is missing and zi = 0 else. If the number
of entries missing is not too large, then z is sparse, i.e., t ≪ q.
3. Signal separation. Separation of (audio, image, or video) signals into two struc-
turally distinct components also fits into the framework described above. A promi-
nent example is the separation of texture from cartoon parts in images (see [39, 40]
and references therein). The matrices A and B are chosen to allow sparse representa-
tions of the two distinct features. Note that here Bz no longer plays the role of unde-
sired noise, and the goal is to recover both y and z from the observation w = Ay+Bz.
The first two examples above demonstrate that in many practically relevant applica-
tions the locations of the possibly non-zero entries of one of the sparse vectors, say z,
may be known. This can be accounted for by removing the columns of B corresponding
to the other entries, which results in t = q, i.e., the sparsity level of z equals the ambient
dimension. We next show how Corollary 6.3 can be used to state a sufficient condition
for recovery of y from w = Ay + Bz when t = q. For recovery guarantees in the case
where the sparsity levels of both y and z are strictly smaller than their corresponding
ambient dimensions, we refer to Theorem 8 of [9].
theorem 6.3 (Theorems 4 and 7 of [9]) Let y ∈ C^p with ‖y‖_0 ≤ s, z ∈ C^q, A ∈ C^{m×p}, and B ∈ C^{m×q}, both with column vectors ‖·‖_2-normalized to 1 and μ̄(A, B) > 0. Suppose that

2sq < f_{A,B}(2s, q)   (6.139)

with

f_{A,B}(u, v) = [1 + μ(A)(1 − u)]_+ [1 + μ(B)(1 − v)]_+ / μ̄^2(A, B).   (6.140)

Then, y can be recovered from w = Ay + Bz by either of the following algorithms:

(P0)  minimize ‖ỹ‖_0 subject to Aỹ ∈ {w + Bz̃ : z̃ ∈ C^q},   (6.141)

(P1)  minimize ‖ỹ‖_1 subject to Aỹ ∈ {w + Bz̃ : z̃ ∈ C^q}.   (6.142)

Proof We provide the proof for (P1) only. The proof for recovery through (P0) is very similar and can be found in Appendix B of [9].
Let w = Ay + Bz and suppose that (P1) delivers ŷ ∈ C^p. This implies ‖ŷ‖_1 ≤ ‖y‖_1 and the existence of a ẑ ∈ C^q such that

Aŷ = w + Bẑ.   (6.143)

On the other hand, we also have

Ay = w − Bz.   (6.144)

Subtracting (6.144) from (6.143) yields

A(ŷ − y) = B(ẑ + z),   (6.145)

and we abbreviate p = ŷ − y and q = ẑ + z. We now set

U = {i ∈ {1, . . . , p} : y_i ≠ 0}   (6.146)

and

U^c = {1, . . . , p}\U,   (6.147)

and show that p is ε_U-concentrated (with respect to the 1-norm) for ε_U = 1/2, i.e.,

‖p_{U^c}‖_1 ≤ (1/2) ‖p‖_1.   (6.148)

We have

‖y‖_1 ≥ ‖ŷ‖_1   (6.149)
      = ‖y + p‖_1   (6.150)
      = ‖y_U + p_U‖_1 + ‖p_{U^c}‖_1   (6.151)
      ≥ ‖y_U‖_1 − ‖p_U‖_1 + ‖p_{U^c}‖_1   (6.152)
      = ‖y‖_1 − ‖p_U‖_1 + ‖p_{U^c}‖_1,   (6.153)

where (6.151) follows from the definition of U in (6.146), and in (6.152) we applied the reverse triangle inequality. Now, (6.149)–(6.153) implies ‖p_U‖_1 ≥ ‖p_{U^c}‖_1. Thus, 2‖p_{U^c}‖_1 ≤ ‖p_U‖_1 + ‖p_{U^c}‖_1 = ‖p‖_1, which establishes (6.148). Next, set V = {1, . . . , q} and note that q is trivially ε_V-concentrated (with respect to the 1-norm) for ε_V = 0.
Suppose, toward a contradiction, that p ≠ 0. Then, we have

2sq ≥ 2|U||V|   (6.154)
    ≥ [(1 + μ(A)) − 2μ(A)|U|]_+ [1 + μ(B)(1 − |V|)]_+ / μ̄^2(A, B)   (6.155)
    ≥ [(1 + μ(A)) − 2sμ(A)]_+ [1 + μ(B)(1 − q)]_+ / μ̄^2(A, B),   (6.156)

where (6.155) is obtained by applying Part 3 of Corollary 6.3 with p ε_U-concentrated for ε_U = 1/2 and q ε_V-concentrated for ε_V = 0. But (6.154)–(6.156) contradicts (6.139). Hence, we must have p = 0, which yields ŷ = y.
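As a concrete illustration of Theorem 6.3 in an inpainting-type setting, the following sketch assumes NumPy and CVXPY; A is taken to be a real orthonormal DCT basis (so that the variables stay real) and B collects the identity columns at the q missing positions, so that t = q. The choice sq < m/4 guarantees (6.139), since μ(A) = μ(B) = 0 and μ̄(A, B) ≤ √(2/m), and (P1) then recovers the s-sparse y exactly from the observation with q erased entries.

```python
import numpy as np
import cvxpy as cp
rng = np.random.default_rng(1)

m, s, q = 64, 3, 5                           # s*q = 15 < m/4 = 16, so (6.139) holds
t = np.arange(m)
A = np.sqrt(2 / m) * np.cos(np.pi * (2 * t[:, None] + 1) * t[None, :] / (2 * m))
A[:, 0] = 1 / np.sqrt(m)                     # orthonormal DCT-II basis as columns

y_true = np.zeros(m)
y_true[rng.choice(m, s, replace=False)] = rng.standard_normal(s)
missing = rng.choice(m, q, replace=False)    # B = I[:, missing], i.e., t = q
keep = np.setdiff1d(np.arange(m), missing)

w = A @ y_true
w[missing] = 0.0                             # the q entries of the signal are lost

y_var = cp.Variable(m)                       # (P1): the constraint only involves observed rows
cp.Problem(cp.Minimize(cp.norm1(y_var)),
           [A[keep, :] @ y_var == w[keep]]).solve()
print(np.linalg.norm(y_var.value - y_true))  # essentially 0: y recovered, cf. Theorem 6.3
```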

We next provide an example showing that, as soon as (6.139) is saturated, recovery through (P0) or (P1) can fail. Take m = n^2 with n even, A = F ∈ C^{m×m}, and B ∈ C^{m×√m} containing every √m-th column of the m × m identity matrix, i.e.,

B_{k,l} = 1 if k = √m l, and B_{k,l} = 0 else,   (6.157)

for all k ∈ {1, . . . , m} and l ∈ {1, . . . , √m}. For every a ∈ N dividing m, we define the vector d^(a) ∈ C^m with components

d_l^(a) = 1 if l ∈ {a, 2a, . . . , (m/a − 1)a, m}, and d_l^(a) = 0 else.   (6.158)

Straightforward calculations now yield

F d^(a) = (√m/a) d^(m/a)   (6.159)

for all a ∈ N dividing m. Suppose that w = Fy + Bz with

y = d^(2√m) − d^(√m) ∈ C^m,   (6.160)
z = (1 . . . 1)^T ∈ C^{√m}.   (6.161)

Evaluating (6.139) for A = F, B as defined in (6.157), and q = √m results in s < √m/2. Now, y in (6.160) has ‖y‖_0 = √m/2 and thus just violates the threshold s < √m/2. We next show that this slender violation is enough for the existence of an alternative pair ŷ ∈ C^m, ẑ ∈ C^{√m} satisfying w = Fŷ + Bẑ with ‖ŷ‖_0 = ‖y‖_0 and ‖ŷ‖_1 = ‖y‖_1. Thus, neither (P0) nor (P1) can distinguish between y and ŷ. Specifically, we set

ŷ = d^(2√m) ∈ C^m,   (6.162)
ẑ = 0 ∈ C^{√m},   (6.163)

and note that ‖ŷ‖_0 = ‖y‖_0 = ‖ŷ‖_1 = ‖y‖_1 = √m/2. It remains to establish that w = Fŷ + Bẑ. To this end, first note that

w = Fy + Bz   (6.164)
  = (1/2) d^(√m/2) − d^(√m) + Bz   (6.165)
  = (1/2) d^(√m/2),   (6.166)

where (6.165) follows from (6.159) and (6.166) is by (6.157). Finally, again using (6.159), we find that

Fŷ + Bẑ = (1/2) d^(√m/2),   (6.167)

which completes the argument.

The threshold s < √m/2 constitutes a special instance of the so-called “square-root
bottleneck” [41] all coherence-based deterministic recovery thresholds suffer from.
The square-root bottleneck says that the number of measurements, m, has to scale
at least quadratically in the sparsity level s. It can be circumvented by considering
random models for either the signals or the measurement matrices [42–45], leading to
thresholds of the form m ∝ s log p and applying with high probability. Deterministic
linear recovery thresholds, i.e., m ∝ s, were, to the best of our knowledge, first reported
in [46] for the DFT measurement matrix under positivity constraints on the vector to be
recovered. Further instances of deterministic linear recovery thresholds were discovered
in the context of spectrum-blind sampling [47, 48] and system identification [49].

6.5 The Set-Theoretic Null-Space Property

The notion of sparsity underlying the theory developed so far is that of either the number
of non-zero entries or of concentration in terms of 1-norm or 2-norm. In practice, one
often encounters more general concepts of parsimony, such as manifold or fractal
set structures. Manifolds are prevalent in data science, e.g., in compressed sensing
[22–27], machine learning [28], image processing [29, 30], and handwritten-digit
recognition [31]. Fractal sets find application in image compression and in modeling
of Ethernet traffic [32]. Based on the null-space property established in Lemma 6.8,
we now extend the theory to account for more general notions of parsimony. To this
end, we first need a suitable measure of “description complexity” that goes beyond
the concepts of sparsity and concentration. Formalizing this idea requires an adequate
dimension measure, which, as it turns out, is lower modified Minkowski dimension. We
start by defining Minkowski dimension and modified Minkowski dimension.
definition 6.5 (from Section 3.1 of [50])³ For U ⊆ C^m non-empty, the lower and upper Minkowski dimensions of U are defined as

\underline{dim}_B(U) = lim inf_{ρ→0} log N_U(ρ) / log(1/ρ)   (6.168)

and

\overline{dim}_B(U) = lim sup_{ρ→0} log N_U(ρ) / log(1/ρ),   (6.169)

respectively, where

N_U(ρ) = min{k ∈ N : U ⊆ ∪_{i∈{1,...,k}} B_m(u_i, ρ), u_i ∈ U}   (6.170)

is the covering number of U for radius ρ > 0. If \underline{dim}_B(U) = \overline{dim}_B(U), this common value, denoted by dim_B(U), is the Minkowski dimension of U.
definition 6.6 (from Section 3.3 of [50]) For U ⊆ C^m non-empty, the lower and upper modified Minkowski dimensions of U are defined as

\underline{dim}_{MB}(U) = inf{ sup_{i∈N} \underline{dim}_B(U_i) : U ⊆ ∪_{i∈N} U_i }   (6.171)

and

\overline{dim}_{MB}(U) = inf{ sup_{i∈N} \overline{dim}_B(U_i) : U ⊆ ∪_{i∈N} U_i },   (6.172)

respectively, where in both cases the infimum is over all possible coverings {U_i}_{i∈N} of U by non-empty compact sets U_i. If \underline{dim}_{MB}(U) = \overline{dim}_{MB}(U), this common value, denoted by dim_{MB}(U), is the modified Minkowski dimension of U.
For further details on (modified) Minkowski dimension, we refer the interested reader
to Section 3 of [50].
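Definition 6.5 can be explored numerically via box counting. The rough sketch below (assuming NumPy; the unit circle, the sampling density, and the range of scales are arbitrary illustrative choices of ours, and occupied boxes serve as a proxy for the covering number) estimates the box-counting dimension of the unit circle in R^2 from the slope of log N_U(ρ) versus log(1/ρ) and obtains a value close to 1.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 200000, endpoint=False)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # dense samples of the unit circle

log_inv_rho, log_N = [], []
for k in range(3, 10):
    rho = 2.0 ** (-k)
    boxes = np.unique(np.floor(points / rho), axis=0)       # occupied boxes of side rho
    log_inv_rho.append(np.log(1 / rho))
    log_N.append(np.log(len(boxes)))

slope = np.polyfit(log_inv_rho, log_N, 1)[0]
print(slope)    # close to 1, the Minkowski dimension of the circle
```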
We are now ready to extend the null-space property in Lemma 6.8 to the following
set-theoretic null-space property.
theorem 6.4 Let U ⊆ C^{p+q} be non-empty with \underline{dim}_{MB}(U) < 2m, and let B ∈ C^{m×q} with m ≥ q be a full-rank matrix. Then, ker[A B] ∩ (U\{0}) = ∅ for Lebesgue-a.a. A ∈ C^{m×p}.
Proof See Section 6.8.

3 Minkowski dimension is sometimes also referred to as box-counting dimension, which is the origin of the
subscript B in the notation dimB (·) used henceforth.
The set U in this set-theoretic null-space property generalizes the finite union of linear
subspaces S in Lemma 6.8. For U ⊆ R p+q , the equivalent of Theorem 6.4 was reported
previously in Proposition 1 of [13]. The set-theoretic null-space property can be inter-
preted in geometric terms as follows. If p + q ≤ m, then [A B] is a tall matrix so that the
kernel of [A B] is {0} for Lebesgue-a.a. matrices A. The statement of the theorem holds
trivially in this case. If p + q > m, then the kernel of [A B] is a (p + q − m)-dimensional
subspace of the ambient space C p+q for Lebesgue-a.a. matrices A. The theorem
therefore says that, for Lebesgue-a.a. A, the set U intersects the subspace ker([A B])
at most trivially if the sum of dim ker([A B]) and⁴ \underline{dim}_{MB}(U)/2 is strictly smaller
than the dimension of the ambient space. What is remarkable here is that the notions
of Euclidean dimension (for the kernel of [A B]) and of lower modified Minkowski
dimension (for the set U) are compatible. We finally note that, by virtue of the chain
of equivalences in the proof of Lemma 6.8, the set-theoretic null-space property in
Theorem 6.4 leads to a set-theoretic uncertainty relation, albeit not in the form of an
upper bound on an operator norm; for a detailed discussion of this equivalence the
interested reader is referred to [13].
We next put the set-theoretic null-space property in Theorem 6.4 in perspective with
the null-space property in Lemma 6.8. Fix the sparsity levels s and t, consider the set

S_{s,t} = {(p^T q^T)^T : p ∈ C^p, q ∈ C^q, ‖p‖_0 ≤ s, ‖q‖_0 ≤ t},   (6.173)
which is a finite union of (s + t)-dimensional linear subspaces, and, for the sake of
concreteness, let A = I and B = F of size q × q. Lemma 6.8 then states that the kernel of
[I F] intersects S s,t trivially, provided that

m > st, (6.174)

which leads to a recovery threshold in the signal separation problem that is quadratic in
the sparsity levels s and t (see Theorem 8 of [9]). To see what the set-theoretic null-space
property gives, we start by noting that, by Example II.2 of [26], dimMB (S s,t ) = 2(s + t).
Theorem 6.4 hence states that, for Lebesgue-a.a. matrices A ∈ Cm×p , the kernel of
[A B] intersects S s,t trivially, provided that

m > s + t. (6.175)

This is striking as it says that, while the threshold in (6.174) is quadratic in the sparsity
levels s and t and, therefore, suffers from the square-root bottleneck, the threshold in
(6.175) is linear in s and t.
To understand the operational implications of the observation just made, we demon-
strate how the set-theoretic null-space property in Theorem 6.4 leads to a sufficient con-
dition for the recovery of vectors in sets of small lower modified Minkowski dimension.

4 The factor 1/2 stems from the fact that (modified) Minkowski dimension “counts real dimensions.”
For example, the modified Minkowski dimension of an n-dimensional linear subspace of Cm is 2n (see
Example II.2 of [26]).
lemma 6.9 Let S ⊆ C^{p+q} be non-empty with \underline{dim}_{MB}(S ⊖ S) < 2m, where S ⊖ S = {u − v : u, v ∈ S}, and let B ∈ C^{m×q}, with m ≥ q, be a full-rank matrix. Then, [A B] is one-to-one on S for Lebesgue-a.a. A ∈ C^{m×p}.
Proof The proof follows immediately from the set-theoretic null-space property in
Theorem 6.4 and the linearity of matrix-vector multiplication.
To elucidate the implications of Lemma 6.9, consider S_{s,t} defined in (6.173). Since S_{s,t} ⊖ S_{s,t} is again a finite union of linear subspaces of dimensions no larger than min{p, 2s} + min{q, 2t}, where the min{·, ·} operation accounts for the fact that the dimension of a linear subspace cannot exceed the dimension of its ambient space, we have (see Example II.2 of [26])

dim_{MB}(S_{s,t} ⊖ S_{s,t}) = 2(min{p, 2s} + min{q, 2t}).   (6.176)
Application of Lemma 6.9 now yields that, for Lebesgue-a.a. matrices A ∈ Cm×p , we
can recover y ∈ C^p with ‖y‖_0 ≤ s and z ∈ C^q with ‖z‖_0 ≤ t from w = Ay + Bz, provided
that m > min{p, 2s} + min{q, 2t}. This qualitative behavior (namely, linear in s and t) is
best possible as it cannot be improved even if the support sets of y and z were known
prior to recovery. We emphasize, however, that the statement in Lemma 6.9 guarantees
injectivity of [A B] only absent computational considerations for recovery.
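The injectivity claim can be checked by brute force in small dimensions. The following sketch (assuming NumPy; all dimensions are toy values of ours, and a Gaussian draw stands in for a "Lebesgue-a.a." choice of A) verifies that [A B] is one-to-one on S_{s,t} by testing that every submatrix built from 2s columns of A and 2t columns of B has full column rank, which holds here since m > min{p, 2s} + min{q, 2t}.

```python
import numpy as np
from itertools import combinations
rng = np.random.default_rng(2)

m, p, q, s, t = 7, 10, 6, 2, 1            # m = 7 > 2s + 2t = 6
A = rng.standard_normal((m, p))           # generic A
B = rng.standard_normal((m, q))           # full-rank B with m >= q (almost surely)

injective = True
for S in combinations(range(p), min(p, 2 * s)):
    for T in combinations(range(q), min(q, 2 * t)):
        M = np.hstack([A[:, list(S)], B[:, list(T)]])
        injective &= (np.linalg.matrix_rank(M) == M.shape[1])
print(injective)   # True: ker([A B]) meets S_{s,t} - S_{s,t} only in {0}
```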

6.6 A Large Sieve Inequality in (C^m, ‖·‖_2)

We present a slightly improved and generalized version of the large sieve inequality
stated in Equation (32) of [7].
lemma 6.10 Let μ be a 1-periodic, σ-finite measure on R, n ∈ N, ϕ ∈ [0, 1), a ∈ C^n, and consider the 1-periodic trigonometric polynomial

ψ(s) = e^{2πjϕ} Σ_{k=1}^n a_k e^{−2πjks}.   (6.177)

Then,

∫_{[0,1)} |ψ(s)|^2 dμ(s) ≤ (n − 1 + 1/δ) sup_{r∈[0,1)} μ((r, r + δ)) ‖a‖_2^2   (6.178)

for all δ ∈ (0, 1].
Proof Since

|ψ(s)| = |Σ_{k=1}^n a_k e^{−2πjks}|,   (6.179)

we can assume, without loss of generality, that ϕ = 0. The proof now follows closely the line of argumentation on pp. 185–186 of [51] and in the proof of Lemma 5 of [7]. Specifically, we make use of the result on p. 185 of [51] saying that, for every δ > 0, there exists a function g ∈ L^2(R) with Fourier transform

G(s) = ∫_{−∞}^{∞} g(t) e^{−2πjst} dt   (6.180)

such that ‖G‖_2^2 = n − 1 + 1/δ, |g(t)|^2 ≥ 1 for all t ∈ [1, n], and G(s) = 0 for all s ∉ [−δ/2, δ/2]. With this g, consider the 1-periodic trigonometric polynomial

θ(s) = Σ_{k=1}^n (a_k/g(k)) e^{−2πjks}   (6.181)

and note that

∫_{−δ/2}^{δ/2} G(r) θ(s − r) dr = Σ_{k=1}^n (a_k/g(k)) e^{−2πjks} ∫_{−∞}^{∞} G(r) e^{2πjkr} dr   (6.182)
                               = Σ_{k=1}^n a_k e^{−2πjks}   (6.183)
                               = ψ(s)   for all s ∈ R.   (6.184)

We now have

∫_{[0,1)} |ψ(s)|^2 dμ(s) = ∫_{[0,1)} |∫_{−δ/2}^{δ/2} G(r) θ(s − r) dr|^2 dμ(s)   (6.185)
  ≤ ‖G‖_2^2 ∫_{[0,1)} ∫_{−δ/2}^{δ/2} |θ(s − r)|^2 dr dμ(s)   (6.186)
  = ‖G‖_2^2 ∫_{[0,1)} ∫_{s−δ/2}^{s+δ/2} |θ(r)|^2 dr dμ(s)   (6.187)
  = ‖G‖_2^2 ∫_{−1}^{2} μ((r − δ/2, r + δ/2) ∩ [0, 1)) |θ(r)|^2 dr   (6.188)
  = ‖G‖_2^2 Σ_{i=−1}^{1} ∫_{0+i}^{1+i} μ((r − δ/2, r + δ/2) ∩ [0, 1)) |θ(r)|^2 dr   (6.189)
  = ‖G‖_2^2 Σ_{i=−1}^{1} ∫_{0}^{1} μ((r − δ/2, r + δ/2) ∩ [i, 1 + i)) |θ(r)|^2 dr   (6.190)
  = ‖G‖_2^2 ∫_{0}^{1} μ((r − δ/2, r + δ/2) ∩ [−1, 2)) |θ(r)|^2 dr   (6.191)
  = ‖G‖_2^2 ∫_{0}^{1} μ((r − δ/2, r + δ/2)) |θ(r)|^2 dr   (6.192)

for all δ ∈ (0, 1], where (6.185) follows from (6.182)–(6.184), in (6.186) we applied the Cauchy–Schwartz inequality (Theorem 1.37 of [52]), (6.188) is by Fubini’s theorem (Theorem 1.14 of [53]) (recall that μ is σ-finite by assumption) upon noting that

{(r, s) : s ∈ [0, 1), r ∈ (s − δ/2, s + δ/2)} = {(r, s) : r ∈ [−1, 2), s ∈ (r − δ/2, r + δ/2) ∩ [0, 1)}   (6.193)

for all δ ∈ (0, 1], in (6.190) we used the 1-periodicity of μ and θ, and (6.191) is by σ-additivity of μ. Now,

∫_{0}^{1} μ((r − δ/2, r + δ/2)) |θ(r)|^2 dr ≤ sup_{r∈[0,1)} μ((r, r + δ)) ∫_{0}^{1} |θ(r)|^2 dr   (6.194)
  = sup_{r∈[0,1)} μ((r, r + δ)) Σ_{k=1}^n |a_k|^2/|g(k)|^2   (6.195)
  ≤ sup_{r∈[0,1)} μ((r, r + δ)) ‖a‖_2^2   (6.196)

for all δ > 0, where (6.196) follows from |g(t)|^2 ≥ 1 for all t ∈ [1, n]. Using (6.194)–(6.196) and ‖G‖_2^2 = n − 1 + 1/δ in (6.192) establishes (6.178).
Lemma 6.10 is a slightly strengthened version of the large sieve inequality (Equation
(32) of [7]). Specifically, in (6.178) it is sufficient to consider open intervals (r, r + δ),
whereas Equation (32) of [7] requires closed intervals [r, r + δ]. Thus, the upper bound
in Equation (32) of [7] can be strictly larger than that in (6.178) whenever μ has mass
points.

6.7 Uncertainty Relations in L^1 and L^2

Table 6.1 contains a list of infinite-dimensional counterparts–available in the literature–to results in this chapter. Specifically, these results apply to bandlimited L^1- and L^2-functions and hence correspond to A = I and B = F in our setting.
Table 6.1. Infinite-dimensional counterparts to results in this chapter

L2 analog L1 analog

Upper bound in (6.10) Lemma 2 of [6]


Corollary 6.1 Theorem 2 of [6]
Lemma 6.4 Theorem 4 of [6]
Lemma 6.5 Lemma 3 of [6]
Lemma 6.7 Lemma 2 of [7]
Lemma 6.10 Theorem 4 of [7]

6.8 Proof of Theorem 6.4

By definition of lower modified Minkowski dimension, there exists a covering {U_i}_{i∈N} of U by non-empty compact sets U_i satisfying \underline{dim}_B(U_i) < 2m for all i ∈ N. The countable sub-additivity of Lebesgue measure λ now implies

λ({A ∈ C^{m×p} : ker[A B] ∩ (U\{0}) ≠ ∅}) ≤ Σ_{i=1}^{∞} λ({A ∈ C^{m×p} : ker[A B] ∩ (U_i\{0}) ≠ ∅}).   (6.197)
We next establish that every term in the sum on the right-hand side of (6.197) equals zero. Take an arbitrary but fixed i ∈ N. Repeating the steps in Equations (10)–(14) of [13] shows that it suffices to prove that

P[ker([A B]) ∩ V ≠ ∅] = 0,   (6.198)

with

V = {(u^T v^T)^T : u ∈ C^p, v ∈ C^q, ‖u‖_2 > 0} ∩ U_i   (6.199)

and A = (A_1 . . . A_m)^*, where the random vectors A_i are independent and uniformly distributed on B_p(0, r) for arbitrary but fixed r > 0. Suppose, toward a contradiction, that (6.198) is false. This implies

0 = lim inf_{ρ→0} log P[ker([A B]) ∩ V ≠ ∅] / log(1/ρ)   (6.200)
  ≤ lim inf_{ρ→0} log (Σ_{i=1}^{N_V(ρ)} P[ker([A B]) ∩ B_{p+q}(c_i, ρ) ≠ ∅]) / log(1/ρ),   (6.201)

where we have chosen {c_i : i = 1, . . . , N_V(ρ)} ⊆ V such that

V ⊆ ∪_{i=1}^{N_V(ρ)} B_{p+q}(c_i, ρ),   (6.202)

with N_V(ρ) denoting the covering number of V for radius ρ > 0 (cf. (6.170)). Now let i ∈ {1, . . . , N_V(ρ)} be arbitrary but fixed and write c_i = (u_i^T v_i^T)^T. It follows that

‖Au_i + Bv_i‖_2 = ‖[A B]c_i‖_2   (6.203)
                ≤ ‖[A B](x − c_i)‖_2 + ‖[A B]x‖_2   (6.204)
                ≤ |||[A B]|||_2 ‖x − c_i‖_2 + ‖[A B]x‖_2   (6.205)
                ≤ (‖A‖_2 + ‖B‖_2)ρ + ‖[A B]x‖_2   (6.206)
                ≤ (r√m + ‖B‖_2)ρ + ‖[A B]x‖_2   for all x ∈ B_{p+q}(c_i, ρ),   (6.207)

where in the last step we made use of ‖A_i‖_2 ≤ r for i = 1, . . . , m. We now have

P[ker([A B]) ∩ B_{p+q}(c_i, ρ) ≠ ∅]   (6.208)
  ≤ P[∃ x ∈ B_{p+q}(c_i, ρ) : ‖[A B]x‖_2 < ρ]   (6.209)
  ≤ P[‖Au_i + Bv_i‖_2 < ρ(1 + r√m + ‖B‖_2)]   (6.210)
  ≤ (C(p, m, r)/‖u_i‖_2^{2m}) ρ^{2m} (1 + r√m + ‖B‖_2)^{2m},   (6.211)

where (6.210) is by (6.203)–(6.207), and in (6.211) we applied the concentration of measure result Lemma 6.11 below (recall that c_i = (u_i^T v_i^T)^T ∈ V implies u_i ≠ 0) with C(p, m, r) as in (6.216) below. Inserting (6.208)–(6.211) into (6.201) yields

0 ≤ lim inf_{ρ→0} log(N_V(ρ) ρ^{2m}) / log(1/ρ)   (6.212)
  = \underline{dim}_B(V) − 2m   (6.213)
  < 0,   (6.214)

where (6.214) follows from \underline{dim}_B(V) ≤ \underline{dim}_B(U_i) < 2m, which constitutes a contradiction. Therefore, (6.198) must hold.
lemma 6.11 Let A = (A1 . . . Am )∗ with independent random vectors Ai uniformly
distributed on B p (0, r) for r > 0. Then
C(p, m, r) 2m
P[Au + v2 < δ] ≤ δ , (6.215)
u2m
2

with
πV p−1 (r) m
C(p, m, r) = (6.216)
V p (r)
for all u ∈ C p \{0}, v ∈ Cm , and δ > 0.
Proof Since
m
P[Au + v2 < δ] ≤ P[|A∗i u + vi | < δ] (6.217)
i=1

owing to the independence of the Ai and as Au + v2 < δ implies |A∗i u + vi | < δ for
i = 1, . . . , m, it is sufficient to show that
D(p, r) 2
P[|B∗ u + v| < δ] ≤ δ (6.218)
u22
for all u ∈ C p \{0}, v ∈ C, and δ > 0, where the random vector B is uniformly distributed
on B p (0, r) and
πV p−1 (r)
D(p, r) = . (6.219)
V p (r)
We have
! ∗ "
∗ |B u + v| δ
P[|B u + v| < δ] = P < (6.220)
u2 u2
= P[|B∗ U∗ e1 + v| < δ] (6.221)
= P[|B∗ e1 + v| < δ] (6.222)

1
= χ (b1 ) db (6.223)
V p (r) B p (0,r) {b1 :|b1 +v| < δ}
192 Erwin Riegler and Helmut Bölcskei

 
1
≤ db1 d(b2 . . . b p )T (6.224)
V p (r) |b1 +v|≤δ B p−1 (0,r)

V p−1 (r)
= db1 (6.225)
V p (r) |b1 +v|<δ

V p−1 (r) 2
= πδ (6.226)
V p (r)
πV p−1 (r) 2
= δ , (6.227)
V p (r)u2
where the unitary matrix U in (6.221) has been chosen such that U(u/u2 ) = e1 =
(1 0 . . . 0)T ∈ C p and we set δ := δ/u2 and v := v/u2 . Further, (6.222) follows
from the unitary invariance of the uniform distribution on B p (0, r), and in (6.223) the
factor 1/V p (r) is owing to the assumption of a uniform probability density function on
B p (0, r).

6.9 Results for ||| · |||1 and ||| · |||2

lemma 6.12 Let U ∈ Cm×m be unitary, P, Q ⊆ {1, . . . , m}, and consider the orthogonal
projection ΠQ (U) = UDQ U∗ onto the subspace WU,Q . Then

|||ΠQ (U)DP |||2 = |||DP ΠQ (U)|||2 . (6.228)

Moreover, we have
xP 2
|||DP ΠQ (U)|||2 = max (6.229)
x∈WU,Q \{0} x2
and
xP 1
|||DP ΠQ (U)|||1 = max . (6.230)
x∈WU,Q \{0} x1
Proof The identity (6.228) follows from

|||DP ΠQ (U)|||2 = |||(DP ΠQ (U))∗ |||2 (6.231)


= |||Π∗Q (U)D∗P |||2 (6.232)
= |||ΠQ (U)DP |||2 , (6.233)

where in (6.231) we used that ||| · |||2 is self-adjoint (see p. 309 of [33]), ΠQ (U)∗ = ΠQ (U),
and D∗P = DP . To establish (6.229), we note that

|||DP ΠQ (U)|||2 = max DP ΠQ (U)x2 (6.234)


x: x2 =1

= max DP ΠQ (U)x2 (6.235)


x: ΠQ (U)x0
x2 =1
 
≤ max D ΠQ (U)x  (6.236)
x: ΠQ (U)x0 ΠQ (U)x2 2
P
x2 =1
Uncertainty Relations and Sparse Signal Recovery 193

 
≤ max D ΠQ (U)x  (6.237)
x: ΠQ (U)x0 ΠQ (U)x2 2
P

xP 2
= max (6.238)
x∈W \{0} x2
U,Q
 
 ΠQ (U)x 
= max DP ΠQ (U)   (6.239)
x: ΠQ (U)x0 ΠQ (U)x2 2
≤ max DP ΠQ (U)x2 (6.240)
x: x2 =1

= |||DP ΠQ (U)|||2 , (6.241)

where in (6.236) we used ΠQ (U)x2 ≤ x2 , which implies ΠQ (U)x2 ≤ 1 for all x
with x2 = 1. Finally, (6.230) follows by repeating the steps in (6.234)–(6.241) with
 · 2 replaced by  · 1 at all occurrences.
lemma 6.13 Let A ∈ Cm×n . Then
A2
√ ≤ |||A|||2 ≤ A2 . (6.242)
rank(A)
Proof The proof is trivial for A = 0. If A  0, set r = rank(A) and let σ1 , . . . , σr denote
the non-zero singular values of A organized in decreasing order. Unitary invariance # of
||| · |||2 and ·2 (see Problem 5, on p. 311 of [33]) yields |||A|||2 = σ1 and A2 = i=1 σi .
r 2

The claim now follows from


$
% r
 √
σ1 ≤ σ2i ≤ rσ1 . (6.243)
i=1

lemma 6.14 For A = (a1 . . . an ) ∈ Cm×n , we have

|||A|||1 = max a j 1 (6.244)


j∈{1,...,n}

and
1
A1 ≤ |||A|||1 ≤ A1 . (6.245)
n
Proof The identity (6.244) is established on p. 294 of [33], and (6.245) follows
directly from (6.244).

References

[1] W. Heisenberg, The physical principles of the quantum theory. University of Chicago
Press, 1930.
[2] W. G. Faris, “Inequalities and uncertainty principles,” J. Math. Phys., vol. 19, no. 2, pp.
461–466, 1978.
[3] M. G. Cowling and J. F. Price, “Bandwidth versus time concentration: The Heisenberg–
Pauli–Weyl inequality,” SIAM J. Math. Anal., vol. 15, no. 1, pp. 151–165, 1984.
194 Erwin Riegler and Helmut Bölcskei

[4] J. J. Benedetto, Wavelets: Mathematics and applications. CRC Press, 1994, ch. Frame
decompositions, sampling, and uncertainty principle inequalities.
[5] G. B. Folland and A. Sitaram, “The uncertainty principle: A mathematical survey,” J.
Fourier Analysis and Applications, vol. 3, no. 3, pp. 207–238, 1997.
[6] D. L. Donoho and P. B. Stark, “Uncertainty principles and signal recovery,” SIAM J. Appl.
Math., vol. 49, no. 3, pp. 906–931, 1989.
[7] D. L. Donoho and B. F. Logan, “Signal recovery and the large sieve,” SIAM J. Appl. Math.,
vol. 52, no. 2, pp. 577–591, 1992.
[8] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse represen-
tation in pairs of bases,” IEEE Trans. Information Theory, vol. 48, no. 9, pp. 2558–2567,
2002.
[9] C. Studer, P. Kuppinger, G. Pope, and H. Bölcskei, “Recovery of sparsely corrupted
signals,” IEEE Trans. Information Theory, vol. 58, no. 5, pp. 3115–3130, 2012.
[10] P. Kuppinger, G. Durisi, and H. Bölcskei, “Uncertainty relations and sparse signal recovery
for pairs of general signal sets,” IEEE Trans. Information Theory, vol. 58, no. 1, pp.
263–277, 2012.
[11] A. Terras, Fourier analysis on finite groups and applications. Cambridge University Press,
1999.
[12] T. Tao, “An uncertainty principle for cyclic groups of prime order,” Math. Res. Lett.,
vol. 12, no. 1, pp. 121–127, 2005.
[13] D. Stotz, E. Riegler, E. Agustsson, and H. Bölcskei, “Almost lossless analog signal
separation and probabilistic uncertainty relations,” IEEE Trans. Information Theory,
vol. 63, no. 9, pp. 5445–5460, 2017.
[14] S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing.
Birkhäuser, 2013.
[15] K. Gröchenig, Foundations of time–frequency analysis. Birkhäuser, 2001.
[16] D. Gabor, “Theory of communications,” J. Inst. Elec. Eng., vol. 96, pp. 429–457, 1946.
[17] H. Bölcskei, Advances in Gabor analysis. Birkhäuser, 2003, ch. Orthogonal frequency
division multiplexing based on offset QAM, pp. 321–352.
[18] C. Fefferman, “The uncertainty principle,” Bull. Amer. Math. Soc., vol. 9, no. 2, pp.
129–206, 1983.
[19] S. Evra, E. Kowalski, and A. Lubotzky, “Good cyclic codes and the uncertainty principle,”
L’Enseignement Mathématique, vol. 63, no. 2, pp. 305–332, 2017.
[20] E. Bombieri, Le grand crible dans la théorie analytique des nombres. Société
Mathématique de France, 1974.
[21] H. L. Montgomery, Twentieth century harmonic analysis – A celebration. Springer, 2001,
ch. Harmonic analysis as found in analytic number theory.
[22] R. G. Baraniuk and M. B. Wakin, “Random projections of smooth manifolds,” Found.
Comput. Math., vol. 9, no. 1, pp. 51–77, 2009.
[23] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found.
Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
[24] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty relations
and efficient recovery,” IEEE Trans. Signal Processing, vol. 58, no. 6, pp. 3042–3054,
2010.
[25] E. J. Candès and Y. Plan, “Tight oracle inequalities for low-rank matrix recovery from
a minimal number of noisy random measurements,” IEEE Trans. Information Theory,
vol. 4, no. 57, pp. 2342–2359, 2011.
Uncertainty Relations and Sparse Signal Recovery 195

[26] G. Alberti, H. Bölcskei, C. De Lellis, G. Koliander, and E. Riegler, “Lossless analog


compression,” IEEE Trans. Information Theory, vol. 65, no. 11, pp. 7480–7513, 2019.
[27] E. Riegler, D. Stotz, and H. Bölcskei, “Information-theoretic limits of matrix completion,”
in Proc. IEEE International Symposium on Information Theory, 2015, pp. 1836–1840.
[28] A. J. Izenman, “Introduction to manifold learning,” WIREs Comput. Statist., vol. 4, pp.
439–446, 2012.
[29] H. Lu, Y. Fainman, and R. Hecht-Nielsen, “Image manifolds,” in Proc. SPIE, 1998, vol.
3307, pp. 52–63.
[30] N. Sochen and Y. Y. Zeevi, “Representation of colored images by manifolds embedded
in higher dimensional non-Euclidean space,” in Proc. IEEE International Conference on
Image Processing, 1998, pp. 166–170.
[31] G. E. Hinton, P. Dayan, and M. Revow, “Modeling the manifolds of images of handwritten
digits,” IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 65–74, 1997.
[32] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, “On the self-similar nature of
Ethernet traffic (extended version),” IEEE/ACM Trans. Networks, vol. 2, no. 1, pp. 1–15,
1994.
[33] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge University Press, 1990.
[34] T. M. Cover and J. A. Thomas, Elements of information theory, 2nd edn. Wiley, 2006.
[35] J. S. Abel and J. O. Smith III, “Restoring a clipped signal,” in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing, 1991, pp. 1745–1748.
[36] S. G. Mallat and G. Yu, “Super-resolution with sparse mixing estimators,” IEEE Trans.
Image Processing, vol. 19, no. 11, pp. 2889–2900, 2010.
[37] M. Elad and Y. Hel-Or, “Fast super-resolution reconstruction algorithm for pure transla-
tional motion and common space-invariant blur,” IEEE Trans. Image Processing, vol. 10,
no. 8, pp. 1187–1193, 2001.
[38] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proc. 27th
Annual Conference on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[39] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho, “Simultaneous cartoon and texture
image inpainting using morphological component analysis (MCA),” Appl. Comput.
Harmonic Analysis, vol. 19, pp. 340–358, 2005.
[40] D. L. Donoho and G. Kutyniok, “Microlocal analysis of the geometric separation
problem,” Commun. Pure Appl. Math., vol. 66, no. 1, pp. 1–47, 2013.
[41] J. A. Tropp, “On the conditioning of random subdictionaries,” Appl. Comput. Harmonic
Analysis, vol. 25, pp. 1–24, 2008.
[42] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information,” IEEE Trans. Information
Theory, vol. 52, no. 2, pp. 489–509, 2006.
[43] G. Pope, A. Bracher, and C. Studer, “Probabilistic recovery guarantees for sparsely
corrupted signals,” IEEE Trans. Information Theory, vol. 59, no. 5, pp. 3104–3116, 2013.
[44] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4, pp.
1289–1306, 2006.
[45] J. A. Tropp, “On the linear independence of spikes and sines,” J. Fourier Analysis and
Applications, vol. 14, no. 5, pp. 838–858, 2008.
[46] D. L. Donoho, I. M. Johnstone, J. C. Hoch, and A. S. Stern, “Maximum entropy and the
nearly black object,” J. Roy. Statist. Soc. Ser. B., vol. 54, no. 1, pp. 41–81, 1992.
196 Erwin Riegler and Helmut Bölcskei

[47] P. Feng and Y. Bresler, “Spectrum-blind minimum-rate sampling and reconstruction of


multiband signals,” in Proc. IEEE International Conference on Acoustics, Speech and
Signal Processing, 1996, pp. 1688–1691.
[48] M. Mishali and Y. C. Eldar, “From theory to practice: Sub-Nyquist sampling of sparse
wideband analog signals,” IEEE J. Selected Areas Commun., vol. 4, no. 2, pp. 375–391,
2010.
[49] R. Heckel and H. Bölcskei, “Identification of sparse linear operators,” IEEE Trans.
Information Theory, vol. 59, no. 12, pp. 7985–8000, 2013.
[50] K. Falconer, Fractal geometry, 1st edn. Wiley, 1990.
[51] J. D. Vaaler, “Some extremal functions in Fourier analysis,” Bull. Amer. Math. Soc., vol. 2,
no. 12, pp. 183–216, 1985.
[52] C. Heil, A basis theory primer. Springer, 2011.
[53] P. Mattila, Geometry of sets and measures in Euclidean space: Fractals and rectifiability.
Cambridge University Press, 1999.
7 Understanding Phase Transitions
via Mutual Information and MMSE
Galen Reeves and Henry D. Pfister

Summary

The ability to understand and solve high-dimensional inference problems is essential


for modern data science. This chapter examines high-dimensional inference problems
through the lens of information theory and focuses on the standard linear model as a
canonical example that is both rich enough to be practically useful and simple enough
to be studied rigorously. In particular, this model can exhibit phase transitions where
an arbitrarily small change in the model parameters can induce large changes in the
quality of estimates. For this model, the performance of optimal inference can be studied
using the replica method from statistical physics but, until recently, it was not known
whether the resulting formulas were actually correct. In this chapter, we present a tutorial
description of the standard linear model and its connection to information theory. We
also describe the replica prediction for this model and outline the authors’ recent proof
that it is exact.

7.1 Introduction

7.1.1 What Can We Learn from Data?


Given a probabilistic model, this question can be answered succinctly in terms of the dif-
ference between the prior distribution (what we know before looking at the data) and the
posterior distribution (what we know after looking at the data). Throughout this chapter
we will focus on high-dimensional inference problems where the posterior distribution
may be complicated and difficult to work with directly. We will show how techniques
rooted in information theory and statistical physics can provide explicit characterizations
of the statistical relationship between the data and the unknown quantities of interest.
In his seminal paper, Shannon showed that mutual information provides an impor-
tant measure of the difference between the prior and posterior distributions [1]. Since
its introduction, mutual information has played a central role for applications in engi-
neering, such as communication and data compression, by describing the fundamental
constraints imposed solely by the statistical properties of the problems.

197
198 Galen Reeves and Henry D. Pfister

Traditional problems in information theory assume that all statistics are known and
that certain system parameters can be chosen to optimize performance. In contrast, data-
science problems typically assume that the important distributions are either given by
the problem or must be estimated from the data. When the distributions are unknown,
the implied inference problems are more challenging and their analysis can become
intractable. Nevertheless, similar behavior has also been observed in more general high-
dimensional inference problems such as Gaussian mixture clustering [2].
In this chapter, we use the standard linear model as a simple example to illustrate
phase transitions in high-dimensional inference. In Section 7.2, basic properties of the
standard linear model are described and examples are given to describe its behav-
ior. In Section 7.3, a number of connections to information theory are introduced. In
Section 7.4, we present an overview of the authors’ proof that the replica formula for
mutual information is exact. In Section 7.5, connections between posterior correla-
tion and phase transitions are discussed. Finally, in Section 7.6, we offer concluding
remarks.
Notation The probability P[Y = y|X = x] is denoted succinctly by pY|X (y|x) and short-
ened to p(y|x) when the meaning is clear. Similarly, the distinction between discrete and
continuous distributions is neglected when it is inconsequential.

7.1.2 High-Dimensional Inference


Suppose that the relationship between the unobserved random vector X = (X1 , . . . , XN )
and the observed random vector Y = (Y1 , . . . , Y M ) is modeled by the joint probability
distribution p(x, y). The central problem of Bayesian inference is to answer questions
about the unobserved variables in terms of the posterior distribution:

p(y | x)p(x)
p(x | y) = . (7.1)
p(y)

In the high-dimensional setting where both M and N are large, direct evaluation of the
posterior distribution can become intractable, and one often resorts to summary statistics
such as the posterior mean/covariance or the marginal posterior distribution of a small
subset of the variables.
The analysis of high-dimensional inference problems focuses on two questions.
• What is the fundamental limit of inference without computational constraints?
• What can be inferred from data using computationally efficient methods?
It is becoming increasingly common for the answers to these questions to be framed
in terms of phase diagrams, which provide important information about fundamental
trade-offs involving the amount and quality of data. For example, the phase diagram
in Fig. 7.1 shows that increasing the amount of data not only provides more informa-
tion, but also moves the problem into a regime where efficient methods are optimal. By
contrast, increasing the SNR may lead to improvements that can be attained only with
significant computational complexity.
Understanding Phase Transitions via Mutual Information and MMSE 199

Easy – Problems can be solved using


Easy computationally efficient methods.

Amount
Hard – All known efficient methods fail
of data
but brute-force methods can still succeed.
Hard
Impossible
Impossible – All methods fail regardless
of computational complexity.

Quality of data (SNR)


Figure 7.1 Example phase diagram for a high-dimensional inference problem such as signal
estimation for the standard linear model. The parameter regions indicate the difficulty of
inference of some fixed quality.

7.1.3 Three Approaches to Analysis


For the standard linear model, the qualitative behavior shown in Fig. 7.1 is correct and
can be made quantitatively precise in the large-system limit. In general, there are many
different approaches that can be used to analyze high-dimensional inference problems.
Three popular approaches are described below.

Information-Theoretic Analysis
The standard approach taken in information theory is to first obtain precise characteri-
zations of the fundamental limits, without any computational constraints, and then use
these limits to inform the design and analysis of practical methods. In many cases, the
fundamental limits can be understood by studying macroscopic system properties in the
large-system limit, such as the mutual information and the minimum mean-squared error
(MMSE). There are a wide variety of mathematical tools to analyze these quantities in
the context of compression and communication [3, 4]. Unfortunately, these tools alone
are often unable to provide simple descriptions for the behavior of high-dimensional
statistical inference problems.

The Replica Method from Statistical Physics


An alternative approach for analyzing the fundamental limits is provided by the pow-
erful but heuristic replica method from statistical physics [5, 6]. This method, which
was developed originally in the context of disordered magnetic materials known as spin
glasses, has been applied successfully to a wide variety of problems in science and engi-
neering. At a high level, the replica method consists of a sequence of derivations that
provide explicit formulas for the mutual information in the large-system limit. The main
limitation, however, is that the validity of these formulas relies on certain assumptions
that are unproven in general. A common progression in the statistical physics litera-
ture is that results are first conjectured using the replica method and then proven using
very different techniques. For example, formulas for the Sherrington–Kirkpatrick model
were conjectured using the replica method in 1980 by Parisi [7], but were not rigorously
proven until 2006 by Talagrand [8].
200 Galen Reeves and Henry D. Pfister

Analysis of Approximate Inference


Significant work has focused on tractable methods for computing summary statistics of
the posterior distribution. Variational inference [9–11] refers to a large class of meth-
ods where the inference problem is recast as an optimization problem. The well-known
mean-field variational approach typically refers to minimizing the Kullback–Leibler
divergence with respect to product distributions. More generally, however, the vari-
ational formulation also encompasses the Bethe and Kikuchi methods for sparsely
connected or highly decomposable models as well as the expectation-consistent (EC)
approximate inference framework of Opper and Winther [12].
A variety of methods can be used to solve or approximately solve the variational
optimization problem, including message-passing algorithms such as belief propaga-
tion [13], expectation propagation [14], and approximate message passing [15]. In some
cases, the behavior of these algorithms can be characterized precisely via density evolu-
tion (coding theory) or state evolution (compressed sensing), which leads to single-letter
characterizations of the behavior in the large-system limit.

7.2 Problem Setup and Characterization

7.2.1 Standard Linear Model


Consider the inference problem implied by an unobserved random vector X ∈ RN and
an observed random vector Y ∈ R M . An important special case is the Gaussian channel,
where M = N and the unknown variables are related to the observations via

Yn = sXn + Wn , n = 1, . . . , N, (7.2)

where {Wn } is i.i.d. standard Gaussian noise and s ∈ [0, ∞) parameterizes the signal-to-
noise ratio. The Gaussian channel, which is also known as the Gaussian sequence model
in the statistics literature [16], provides a useful first-order approximation for a wide
variety of applications in science and engineering.
The standard linear model is an important generalization of the Gaussian channel in
which the observations consist of noisy linear measurements:

Ym = Am , X + Wm , m = 1, . . . , M, (7.3)

where ·, · denotes the standard Euclidean inner product, {Am } is a known sequence
of N-length measurement vectors, and {Wm } is i.i.d. Gaussian noise. Unlike for the
Gaussian channel, the number of observations M may be different from the number of
unknown variables N. For this reason the measurement indices are denoted by m instead
of n. In matrix form, the standard linear model can be expressed as

Y = AX + W, (7.4)

where A ∈ R M×N is a known matrix and W ∼ N(0, I M ). Inference problems involving


the standard linear model include linear regression in statistics, both channel and sym-
bol estimation in wireless communications, and sparse signal recovery in compressed
Understanding Phase Transitions via Mutual Information and MMSE 201

sensing [17, 18]. In the standard linear model, the matrix A induces dependences
between the unknown variables which make the inference problem significantly more
difficult.
Typical inference questions for the standard linear model include the following.
•  is
Estimation of unknown variables. The performance of an estimator Y → X
often measured using its mean-squared error (MSE),
 
E X−X  2.

The optimal MSE, computed by minimizing over all possible estimators, is called
the minimum mean-squared error (MMSE),
 
E X − E[X | Y] 2 .
The MMSE is also equivalent to the Bayes risk under squared-error loss.
• Prediction of a new observation Ynew = Anew , X + Wnew . The performance of an
estimator (Y, Anew ) → 
Ynew is often measured using the prediction mean-squared
error,
 
E (Ynew − 
Ynew )2 .
• Detection of whether the ith entry belongs to a subset K of the real line. For
example, K = R \ {0} tests whether entries are non-zero. In practice, one typically
defines a test statistic
 
p(Y = y|Xi ∈ K)
T (y) = ln
p(Y = y|Xi  K)
and then uses a threshold rule that chooses Xi ∈ K if T (y) ≥ λ and Xi  K otherwise.
The performance of this detection rule can be measured using the true-positive
rate (TPR) and the false-positive rate (FPR) given by
TPR = p(T (Y) ≥ λ|Xi ∈ K), FPR = p(T (Y) ≥ λ|Xi  K).
The receiver operating characteristic (ROC) curve for this binary decision prob-
lem is obtained by plotting the TPR versus the FPR as a parametric function of
the threshold λ. An example is given in Fig. 7.3 below.
• Posterior marginal approximation of a subset S of unknown variables. The goal
is to compute an approximation  p(xS | Y) of the marginal distribution of entries in
S, which can be used to provide summary statistics and measures of uncertainty. In
some cases, accurate approximation of the posterior for small subsets is possible
even though the full posterior distribution is intractable.

Analysis of Fundamental Limits


To understand the fundamental and practical limits of inference with the standard linear
model, a great deal of work has focused on the setting where (1) the entries of X are
drawn i.i.d. from a known prior distribution, and (2) the matrix A is an M × N random
matrix whose entries Ai j are drawn i.i.d. from N(0, 1/N). Sparsity in X can be modeled
by using a spike–slab signal distribution (i.e., a mixture of a very narrow distribution
202 Galen Reeves and Henry D. Pfister

with a very wide distribution). Consider a sequence of problems where the number of
measurements per signal dimension converges to δ. In this case, the normalized mutual
information and MMSE corresponding to the large-system limit1 are given by
1 1
I(δ)  lim I(X; Y | A), M(δ)  lim mmse(X | Y, A).
M,N→∞ N M,N→∞ N
M/N→δ M/N→δ

Part of what makes this problem interesting is that the MMSE can have discon-
tinuities, which are referred to as phase transitions. The values of δ at which these
discontinuities occur are of significant interest because they correspond to problem set-
tings in which a small change in the number of measurements makes a large difference
in the ability to estimate the unknown variables. In the above limit, the value of M(δ) is
undefined at these points.

Replica-Symmetric Formula
Using the heuristic replica method from statistical physics, Guo and Verdú [19] derived
single-letter formulas for the mutual information and MMSE in the standard linear
model with i.i.d. variables and an i.i.d. Gaussian matrix:
  √ δ δ s
I(δ) = min I X; sX + W + log + −1 (7.5)
s≥0 2 s δ

F (s)
  
and M(δ) = mmse X | s∗ (δ) X + W . (7.6)
In these expressions, X is a univariate random variable drawn according to the prior pX ,
W ∼ N(0, 1) is independent Gaussian noise, and s∗ (δ) is a minimizer of the objective
function F (s). Precise definitions of the mutual information and the MMSE are pro-
vided in Section 7.3 below. By construction, the replica mutual information (7.5) is a
continuous function of the measurement rate δ. However, the replica MMSE predic-
tion (7.6) may have discontinuities when the global minimizer s∗ (δ) jumps from one
minimum to another, and M(δ) is well defined only if s∗ (δ) is the unique minimizer.
In [20–22], the authors prove these expressions are exact for the standard linear model
with an i.i.d. Gaussian measurement matrix. An overview of this proof is presented in
Section 7.4.

Approximate Message Passing


An algorithmic breakthrough for the standard linear model with i.i.d. Gaussian matri-
ces was provided by the approximate message-passing (AMP) algorithm [15, 23] and
its generalizations [24–29]. For CDMA waveforms, the same idea was applied earlier
in [30].
An important property of this class of algorithms is that the performance for large i.i.d.
Gaussian matrices is characterized precisely via a state-evolution formalism [23, 31].
Remarkably, the fixed points of the state evolution correspond to the stationary points

1 Under reasonable conditions, one can show that these limits are well defined for almost all non-negative
values of δ and that I(δ) is continuous.
Understanding Phase Transitions via Mutual Information and MMSE 203

of the objective function F (s) in (7.5). For cases where the replica formulas are exact,
this means that AMP-type algorithms can be optimal with respect to marginal inference
problems whenever the largest local minimizer of F (s) is also the global minimizer [32].

The Generalized Linear Model


In the generalized linear model, the observations Y ∈ R M are related to the unknown
variables X ∈ RN by way of the M × N matrix A, the random variable Z = AX, and the
conditional distribution
pY|X (y | x)  pY|Z (y | AX), (7.7)
where pY|Z (y | z) defines a memoryless (i.e., separable) channel. The generalized linear
model is fundamental to generalized linear regression in statistics. It is also used to
model different sensing architectures (e.g., Poisson channels, phase retrieval) and the
effects of scalar quantization. The AMP algorithm was introduced by Donoho, Maleki,
and Montanari in [15] and extended to the GLM by Rangan in [24]. More recent
work has focused on AMP-style algorithms for rotationally invariant random matrices
[33–38].

7.2.2 Illustrative Examples


We now consider some examples that illustrate the similarities and differences between
the Gaussian channel and the standard linear model. In these examples, the unknown
variables are drawn i.i.d. from the Bernoulli-Gaussian distribution, which corresponds
to the product of independent Bernoulli and Gaussian random variables and is given by
BG(x | μ, σ2 , γ)  (1 − γ)δ0 (x) + γ N(x | μ, σ2 ). (7.8)
Here, δ0 denotes a Dirac distribution with all probability mass at zero and N(x|μ, σ2 )
denotes the Gaussian p.d.f. (2πσ2 )−1/2 e−(x−μ) /(2σ ) with mean μ and variance σ2 . The
2 2

parameter γ ∈ (0, 1) determines the expected fraction of non-zero entries. The mean and
variance of a random variable X ∼ BG(x | μ, σ2 , γ) are given by
E[X] = γμ, Var(X) = γ(1 − γ)μ2 + γσ2 . (7.9)

Gaussian Channel with Bernoulli–Gaussian Prior


If the unknown variables are i.i.d. BG(x | μ, σ2 , γ) and the observations are generated
according to the Gaussian channel (7.2), then the posterior distribution decouples into
the product of its marginals:

N
p(x | y) = p(xn | yn ). (7.10)
n=1
Furthermore, the posterior marginal p(xn | yn ) is also a Bernoulli–Gaussian distribution
but with new parameters (μn , σ2n , γn ) that depend on yn :
√ 2
sσ √
μn = μ + (yn − sμ), (7.11)
1 + sσ2
σ2
σ2n = , (7.12)
1 + sσ2
204 Galen Reeves and Henry D. Pfister

   √ −1
sμ2 − 2 sμyn − sσ2 y2n
γn = 1 + (1/γ − 1) 1 + sσ2 exp . (7.13)
2(1 + sσ2 )
Given these parameters, the posterior mean E[Xn | Yn ] and the posterior variance
Var(Xn | Yn ) can be computed using (7.9). The parameter γn is the conditional probability
that Xn is non-zero given Yn . This parameter is often called the posterior inclusion
probability in the statistics literature.
The decoupling of the posterior distribution makes it easy to characterize the funda-
mental limits of performance measures. For example the MMSE is the expectation of the
posterior variance Var(Xn | Yn ) and the optimal trade-off between the true-positive rate
and false-positive rate for detecting the event {Xn  0} is characterized by the distribution
of γn .
To investigate the statistical properties of the posterior distribution we perform a
numerical experiment. First, we draw N = 10 000 variables according to the Bernoulli–
Gaussian variables with μ = 0, σ2 = 106 , and prior inclusion probability γ = 0.2. Then,
for various values of the signal-to-noise-ratio parameter s, we evaluate the posterior
distribution corresponding to the output of the Gaussian channel.
In Fig. 7.2 (left panel), we plot three quantities associated with the estimation error:

1
N
average squared error: (Xn − E[X | Yn ])2 ,
N n=1

1
N
average posterior variance: Var(Xn | Yn ),
N n=1

1
N
average MMSE: E[Var(Xn | Yn )].
N n=1
Note that the squared error and posterior variance are both random quantities because
they are functions of the data. This means that the corresponding plots would look

Gaussian Channel Standard Linear Model


6 6
10 10
squared error AMP squared error
5
posterior variance AMP posterior variance
10 A MMSE 10 5 A fixed-point curve
asymptotic MMSE
4 4
10 10

3
10 10 3

10 2 10 2

1
10 10 1
B B
0
10 10 0

–1
10 10
–1
–8 –6 –4 –2 0
10 10 10 10 10 0 1000 2000 3000 4000 5000 6000 7000 8000
Signal-to-noise ratio (dB) Number of observations

Figure 7.2 Comparison of average squared error for the Gaussian channel as a function of the
signal-to-noise ratio (left panel) and the standard linear model as a function of the number of
observations (right panel). In both cases, the unknown variables are i.i.d. Bernoulli–Gaussian
with zero mean and a fraction γ = 0.20 of non-zero entries.
Understanding Phase Transitions via Mutual Information and MMSE 205

1
Evaluated at points
0.9 B in Figure 7.2
0.8
Evaluated at points
0.7
A in Figure 7.2
True-positive rate

0.6

0.5

0.4

0.3

0.2

0.1 Gaussian channel


standard linear model
0
0 0.2 0.4 0.6 0.8 1
False-positive rate
Figure 7.3 ROC curves for detecting the non-zero variables in the parameter regimes labeled A
and B in Fig. 7.2. The curves for the Gaussian channel are obtained by thresholding the true
posterior inclusion probabilities {γn }. The curves for the standard linear model are obtain by
thresholding the AMP approximations of the posterior inclusion probabilities.

slightly different if the experiment were repeated multiple times. The MMSE, however,
is a function of the joint distribution of (X, Y) and is thus non-random. In this setting, the
fact that there is little difference between the averages of these quantities can be seen
as a consequence of the decoupling of the posterior distribution and the law of large
numbers.
In Fig. 7.3, we plot the ROC curve for the problem of detecting the non-zero vari-
ables. The curves are obtained by thresholding the posterior inclusion probabilities {γn }
associated with the values of the signal-to-noise ratio at the points A and B in Fig. 7.2.

Standard Linear Model with Bernoulli–Gaussian Prior


Next, we consider the setting where the observations are generated by the standard linear
model (7.4). In this case, the measurement matrix introduces dependence in the posterior
distribution, and the decoupling seen in (7.10) does not hold in general.
To characterize the posterior distribution, one can use the fact that X is conditionally
Gaussian given the support vector U ∈ {0, 1}N , where Xn = 0 if Un = 0 and Xn  0 with
probability one if Un = 1. Consequently, the posterior distribution can be expressed as a
Gaussian mixture model of the form

    
p(x | y, A) = p(u | y, A) N x | E X | y, A, u , Cov(X | y, A, u) ,
u∈{0,1}N
206 Galen Reeves and Henry D. Pfister

where the summation is over all possible support sets. The posterior probability of the
support set is given by

p(u | y, A) ∝ p(u)N y | μAu 1, I + σ2 ATu Au ,

where Au is the submatrix of A formed by removing columns where un = 0 and 1 denotes


a vector of ones. The posterior marginal is obtained by integrating out the other variables
to get

p(xn | y, A) = p(x | y, A) dx∼n ,

where x∼n denotes all the entries except for xn .


Here, the challenge is that the number of terms in the summation over u grows
exponentially with the signal dimension N. Since it is difficult to compute the poste-
rior distribution in general, we use AMP to compute approximations to the posterior
marginal distributions. The marginals of the approximation, which belong to the
Bernoulli–Gaussian family of distributions, are given by


p(xn | y, A) = BG(xn | μn , σ2n , γn ), (7.14)

where the parameters (μn , σ2n , γn ) are the outputs of the AMP algorithm.
Similarly to the previous example, we perform a numerical experiment to investi-
gate the statistical properties of the marginal approximations. First, we draw N = 10 000
variables according to the Bernoulli–Gaussian variables with μ = 0, σ2 = 106 , and
prior inclusion probability γ = 0.2. Then, for various values of M, we obtain mea-
surements from the standard linear model with i.i.d. Gaussian measurement vectors
Am ∼ N(0, N −1 I) and use AMP to compute the parameters (μn , σ2n , γn ) used in the
marginal posterior approximations.
In Fig. 7.2 (right panel), we plot the squared error and the approximation of the
posterior variance associated with the AMP marginal approximations:

1 
N
2
average AMP squared error: Xn − Ep [Xn | Y, A] ,
N n=1

1
N
average AMP posterior variance: Varp (Xn | Y, A).
N n=1

In these expressions, the expectation and the variance are computed with respect to the
marginal approximation in (7.14). Because these quantities are functions of the random
data, one expects that they would look slightly different if the experiment were repeated
multiple times.
At this point, there are already some interesting observations that can be made. First,
we note that the AMP approximation of the mean can be viewed as a point-estimate
of the unknown variables. Similarly, the AMP approximation of the posterior variance
(which depends on the observations but not the ground truth) can be viewed as a point-
estimate of the squared error. From this perspective, the close correspondence between
Understanding Phase Transitions via Mutual Information and MMSE 207

the squared error and the AMP approximation of the variance seen in Fig. 7.2 suggests
that AMP is self-consistent in the sense that it provides an accurate estimate of its
square error.
Another observation is that the squared error undergoes an abrupt change at around
3500 observations, between the points labeled A and B. Before this, the squared error
is within an order of magnitude of the prior variance. After this, the squared error drops
discontinuously. This illustrates that the estimator provided by AMP is quite accurate in
this setting.
However, there are still some important questions that remain. For example, how
accurate are the AMP posterior marginal approximations? Is it possible that a different
algorithm (e.g., one that computes the true posterior marginals) would lead to estimates
with significantly smaller squared error? Further questions concern how much informa-
tion is lost in focusing only on the marginals of the posterior distribution as opposed to
the full posterior distribution.
Unlike the Gaussian channel, it is not possible to evaluate the MMSE directly because
the summation over all 210000 support vectors is prohibitively large. For comparison, we
plot the large-system MMSE predicted by (7.6), which corresponds to the large-N limit
where the fraction of observations is parameterized by δ = M/N.
The behavior of the large-system MMSE is qualitatively similar to the AMP squared
error because it has a single jump discontinuity (or phase transition). However, the jump
occurs after only 2850 observations for the MMSE as opposed to after 3600 observations
for AMP. By comparing the AMP squared error with the asymptotic MMSE, we see that
the AMP marginal approximations are accurate in some cases (e.g., when the number
of observations is fewer than 2840 or greater than 3600) but highly inaccurate in others
(e.g., when the number of observations is between 2840 and 3600).
In Fig. 7.3, we plot the ROC curve for the problem of detecting the non-zero variables
in the standard linear model. In this case, the curves are obtained by thresholding AMP
approximations of the posterior inclusion probabilities {γn }. It is interesting to note that
the ROC curves corresponding to the two different observation models (the Gaussian
channel and the standard linear model) have similar shapes when they are evaluated in
problem settings with matched squared error.

7.3 The Role of Mutual Information and MMSE

The amount one learns about an unknown vector X from an observation Y can be quan-
tified in terms of the difference between the prior and posterior distributions. For a
particular realization of the observations y, a fundamental measure of this difference
is provided by the relative entropy
 
D pX|Y (· | y) pX (·) .
This quantity is non-negative and equal to zero if and only if the posterior is the same as
the prior almost everywhere.
For real vectors, another way to assess the difference between the prior and posterior
distributions is to compare the first and second moments of their distributions. These
208 Galen Reeves and Henry D. Pfister

 
moments are summarized by the mean E[X], the conditional mean E X | Y = y , the
covariance matrix
 
Cov(X)  E (X − E[X])(X − E[X])T ,

and the conditional covariance matrix


 
Cov(X | Y = y)  E (X − E[X | Y])(X − E[X | Y])T | Y = y .

Together, these provide some measure of how much “information” there is in the data.
One of the difficulties of working with the posterior distribution directly is that it can
depend non-trivially on the particular realization of the data. It can be much easier to
focus on the behavior for typical realizations of the data by studying the distribution of
the relative entropy when Y is drawn according to the marginal distribution p(y). For
example, the expectation of the relative entropy is the mutual information
  
I(X; Y)  E D pX|Y (· | Y) pX (·) .

Similarly, the expected value of the conditional covariance matrix2 is


 
E[Cov(X | Y)] = E (X − E[X | Y])(X − E[X | Y])T .

The trace of this matrix equals the Bayes risk for squared-error loss, which is more
commonly called the MMSE and defined by
 
mmse(X | Y)  tr(E[Cov(X | Y)]) = E X − E[X | Y] 22 ,

where · 2 denotes the Euclidean norm. Part of the appeal of working with the mutual
information and the MMSE is that they satisfy a number of useful functional properties,
including chain rules and data-processing inequalities.
The prudence of focusing on the expectation with respect to the data depends on the
extent to which the random quantities of interest deviate from their expectations. In the
statistical physics literature, the concentration of the relative entropy and squared error
around their expectations is called the self-averaging property and is often assumed for
large systems [5].

7.3.1 I-MMSE Relationships for the Gaussian Channel


Given an N-dimensional random vector X = (X1 , . . . , XN ), the output of the Gaussian
channel with signal-to-noise-ratio parameter s ∈ [0, ∞) is denoted by

Y(s) = s X + W,

where W ∼ N(0, IN ) is independent Gaussian noise. Two important functionals of the


joint distribution of (X, Y s ) are the mutual information function
1
IX (s) = I(X; Y(s))
N

2 Observe that Cov(X | Y) is a random variable in the same sense as E[X | Y].
Understanding Phase Transitions via Mutual Information and MMSE 209

and the MMSE function


1  
MX (s) = E X − E[X | Y(s)] 22 .
N
In some cases, these functions can be computed efficiently using numerical integration
or Monte Carlo approximation. For example, if the entries of X are i.i.d. copies of a
scalar random variable X then the mutual information and MMSE depend only on the
marginal distribution:

IX (s) = I(X; sX + W),
  √  2
MX (s) = E X − E X | sX + W .

Another example is if X is drawn according to a Gaussian mixture model with a


small number of mixture components. For general high-dimensional distributions, how-
ever, direct computation of these functions can be intractable due to the curse of
dimensionality. Instead, one often resorts to asymptotic approximations.
The I-MMSE relationship [39] asserts that the derivative of the mutual information is
one-half of the MMSE,
d 1
IX (s) = MX (s). (7.15)
ds 2
This result is equivalent to the classical De Bruijn identity [40], which relates the deriva-
tive of differential entropy to the Fisher information. Part of the significance of the
I-MMSE relationship is that it provides a link between an information-theoretic quantity
and an estimation-theoretic quantity.
Another important property of the MMSE function is that its derivative is
1  √ 
MX (s) = − E Cov(X | s X + W) 2F , (7.16)
N
where · F denotes the Frobenious norm. Since the derivative is non-positive, it fol-
lows that the MMSE function is non-increasing and the mutual information function is
concave.
The relationship between the MMSE function and its derivative imposes some useful
constraints on the MMSE. One example is the so-called single-crossing property [41],
which asserts that, for any random vector X and isotropic Gaussian random vector Z,
the MMSE functions MX (s) and MZ (s) cross at most once.
The following result states a monotonicity property concerning a transformation of
the MMSE function. A matrix generalization of this result is given in [42].
theorem 7.1 (Monotonicity of MMSE) For any random vector X that is not almost-
surely constant, the function
1
kX (s)  −s (7.17)
MX (s)
is well defined and non-decreasing on (0, ∞).
210 Galen Reeves and Henry D. Pfister

Proof The MMSE function is real analytic, and hence infinitely differentiable, on
(0, ∞) [41]. By differentiation, one finds that
d
kX (s) = −MX (s)/MX2 (s) − 1. (7.18)
ds
Let λ1 , . . . , λN ∈ [0, ∞) be the eigenvalues of the N × N matrix E[Cov(X | Y))], where

Y = s X + W. Starting with (7.16), we find that
1   1
−MX (s) = E Cov(X | Y) 2F ≥ E[Cov(X | Y)] 2F
N N
⎛ ⎞2
1  2 ⎜⎜⎜⎜ 1  ⎟⎟⎟⎟
N N
= λ ≥ ⎜⎜ λn ⎟⎟ = MX2 (s),
N n=1 n ⎝ N n=1 ⎠
where both inequalities are due to Jensen’s inequality. Combining this inequality with
(7.18) establishes that the derivative of kX (s) is non-negative and hence kX (s) is non-
decreasing. 
We remark that Theorem 7.1 implies the single-crossing property. To see this, note
that if Z ∼ N(0, σ2 I) then kZ (s) = σ−2 is a constant and thus kX (s) and kZ (s) cross
at most once. Furthermore, Theorem 7.1 shows that, for many problems, the Gaussian
distribution plays an extremal role for distributions with finite second moments. For
example, if we let Z be a Gaussian random vector with the same mean and covariance
as X, then we have
IX (s) ≤ IZ (s),
MX (s) ≤ MZ (s),
MX ≤ MZ (s),
where equality holds if and only if X is Gaussian. The importance of these inequalities
follows from the fact that the Gaussian distribution is easy to analyze and often well
behaved.

7.3.2 Analysis of Good Codes for the Gaussian Channel


This section provides an example of how the properties described in Section 7.3.1 can
be applied in the context of a high-dimensional inference problem. The focus is on the
channel coding problem for the Gaussian channel. A code for the Gaussian channel is
a collection X = {x(1), . . . , x(L)} of L codewords in RN such that X = x(J), where J is
drawn uniformly from {1, 2, . . . , L}. The output of the channel is given by

Y = snr X + W, (7.19)
where W ∼ N(0, I) is independent Gaussian noise.
The code is called η-good if it satisfies three conditions.
• Power Constraint. The codewords satisfy the average power constraint
1  
E X 22 ≤ 1. (7.20)
N
Understanding Phase Transitions via Mutual Information and MMSE 211

•  we have
Low Error Probability. For the MAP decoding decision X,
  
Pr X = X ≥ P pX|Y (x(J)|Y) > max pX|Y (x()|Y) ≥ 1 − η.
J

• Sufficient Rate. The number of codewords satisfies L ≥ (1 + snr)(1−η)N/2 .


lemma 7.1 (Corollary of the channel coding theorem) For every snr > 0 and > 0 there
exist an integer N and random vector X = (X1 , . . . , XN ) satisfying the average power
constraint (7.20) as well as the following inequalities:
1
IX (snr) ≥ log(1 + snr) − , (7.21)
2
MX (snr) ≤ . (7.22)
The distribution on X induced by a good code is fundamentally different from the i.i.d.
Gaussian distribution that maximizes the mutual information. For example, a good code
defines a discrete distribution that has finite entropy, whereas the Gaussian distribution
is continuous and hence has infinite entropy. Furthermore, while the MMSE of a good
code can be made arbitrarily small (in the large-N limit), the MMSE of the Gaussian
channel is lower-bounded by 1/(1 + snr) for all N.
Nevertheless, the distribution induced by a good code and the Gaussian distribution
are similar in the sense that their mutual information functions must become arbitrarily
close for large N. It is natural to ask whether this closeness implies other similari-
ties between the good code and the Gaussian distribution. The next result shows that
closeness in mutual information also implies closeness in MMSE.
theorem 7.2 For any N-dimensional random vector X satisfying the average power
constraint (7.20) and the mutual information lower bound (7.21) the MMSE function
satisfies
e−2 1 − e−2 1
− ≤ MX (s) ≤ (7.23)
1 + s snr − s 1+ s
for all 0 ≤ s < snr.
Proof For the upper bound, we have
1 1 1
MX (s) = ≤ ≤ , (7.24)
kX (s) + s kX (0) + s 1 + s
where the first inequality follows  from  Theorem 7.1 and the second inequality holds
because the assumption (1/N)E X 22 ≤ 1 implies that MX (0) ≤ 1, and hence kX (0) ≥ 1.
For the lower bound, we use the following chain of inequalities:
   snr  
(1 + snr)(kX (s) + s) 1 1
log = − dt (7.25)
(1 + s)(kX (s) + snr) s 1 + t kX (s) + t
 snr  
1 1
≤ − dt (7.26)
s 1 + t kX (t) + t
 snr  
1
= − MX (t) dt (7.27)
s 1+t
212 Galen Reeves and Henry D. Pfister

 snr  
1
≤ − MX (t) dt (7.28)
0 1+t
= log(1 + snr) − 2IX (snr) (7.29)
≤2 , (7.30)
where (7.26) follows from Theorem 7.1, (7.28) holds because the upper bound in (7.23)
ensures that the integrand is non-negative, and (7.30) follows from the assumed lower
bound on the mutual information. Exponentiating both sides, rearranging terms, and
recalling the definition of kX (s) leads to the stated lower bound.
An immediate consequence of Theorem 7.2 is that the MMSE function associated
with a sequence of good codes undergoes a phase transition in the large-N limit. In
particular,



⎪ 1

⎨ 1 + s , 0 ≤ s < snr,
lim MX (s) = ⎪
⎪ (7.31)
N→∞ ⎪

⎩0, snr < s.
The case s ∈ [snr, ∞) follows from the definition of a good code and the monotonicity of
the MMSE function. The case s ∈ [0, snr) follows from Theorem 7.2 and the fact that
can be arbitrarily small.
An analogous result for binary linear codes on the Gaussian channel can be
found in [43]. The characterization of the asymptotic MMSE for good Gaussian
codes, described by (7.31), was also obtained previously using ideas from statisti-
cal physics [44]. The derivation presented in this chapter, which relies only on the
monotonicity of kX (s), bypasses some technical difficulties encountered in the previous
approach.

7.3.3 Incremental-Information Sequence


We now consider a different approach for decomposing the mutual information between
random vectors X = (X1 , . . . , XN ) and Y = (Y1 , . . . , Y M ). The main idea is to study the
increase in information associated with new observations. In order to make general state-
ments, we average over all possible presentation orders for the elements of Y. To this
end, we define the information sequence {Im } according to
1 
Im  I(X; Yπ(1) , . . . Yπ(m) ), m = 1, . . . , M,
M! π
where the sum is over all permutations π : [M] → [M]. Note that each summand on the
right-hand side is the mutual information between X and an m-tuple of the observa-
tions. The average over all possible permutations can be viewed as the expected mutual
information when the order of observations is chosen uniformly at random.
Owing to the random ordering, we will find that Im is an increasing sequence with 0 =
I0 ≤ I1 ≤ · · · ≤ I M−1 ≤ I M = I(X; Y). To study the increase in information with additional
observations, we focus on the first- and second-order difference sequences, which are
defined as follows:
Understanding Phase Transitions via Mutual Information and MMSE 213

Im  Im+1 − Im ,
Im  Im+1

− Im .

Using the chain rule for mutual information, it is straightforward to show that the first-
and second-order difference sequences can also be expressed as
1 
Im = I(X; Yπ(m+1) | Yπ(1) , . . . Yπ(m) ),
M! π
1 
Im = I(Yπ(m+2) ; Yπ(m+1) | X, Yπ(1) , . . . Yπ(m) ) (7.32)
M! π
1 
− I(Yπ(m+2) ; Yπ(m+1) | Yπ(1) , . . . Yπ(m) ).
M! π

The incremental-information approach is well suited to observation models in which


the entries of Y are conditionally independent given X, that is

M
pY|X (y | x) = pYk |X (yk | x). (7.33)
m=1

The class of models satisfying this condition is quite broad and includes memory-
less channels and generalized linear models as special cases. The significance of the
conditional independence assumption is summarized in the following result.
theorem 7.3 (Monotonicity of incremental information) The first-order difference
sequence {Im } is monotonically decreasing for any observation model satisfying the
conditional independence condition in (7.33).
Proof Under assumption (7.33), two new observations Yπ(m+1) and Yπ(m+2) are condi-
tionally independent given X, and thus the first term on the right-hand side of (7.33)
is zero. This means that the second-order difference sequence is non-positive, which
implies monotonicity of the first-order difference.
The monotonicity in Theorem 7.3 can also be seen as a consequence of the subset
inequalities studied by Han; see Chapter 17 of [3]. Our focus on the incremental infor-
mation is also related to prior work in coding theory that uses an integral–derivative
relationship for the mutual information called the area theorem [45].
Similarly to the monotonicity properties studied in Section 7.3.1, the monotonicity
of the first-order difference imposes a number of constraints on the mutual informa-
tion sequence. Some examples illustrating the usefulness of these constraints will be
provided in the following sections.

7.3.4 Standard Linear Model with i.i.d. Measurement Vectors


We now provide an example of how the incremental-information sequences can be used
in the context of the standard linear model (7.3). We focus on the setting where the
measurement vectors {Am } are drawn i.i.d. from a distribution on RN . In this setting,
the entire observation consists of the pair (Y, A) and the mutual information sequences
214 Galen Reeves and Henry D. Pfister

defined in Section 7.3.3 can be expressed compactly as


Im = I(X; Y m | Am ),

Im = I(X; Ym+1 | Y m , Am+1 ),

Im = −I(Ym+1 ; Ym+2 | Y m , Am+2 ),


where Y m = (Y1 , . . . , Ym ) and Am = (A1 , . . . , Am ). In these expressions, we do not average
over permutations of measurement indices because the distribution of the observations is
permutation-invariant. Furthermore, the measurement vectors appear only as conditional
variables in the mutual information because they are independent of all other random
variables.
The sequence perspective can also be applied to other quantities of interest. For
example, the MMSE sequence {Mm } is defined by
1 '  '2 
Mm  E ''X − E X | Y m , Am ''2 , (7.34)
N
where M0 = (1/N) tr(Cov(X)) and M M = (1/N) mmse(X | Y, A). By the data-processing
inequality for MMSE, it follows that Mm is a decreasing sequence.
Motivated by the I-MMSE relations in Section 7.3.1, one might wonder whether there
also exists a relationship between the mutual information and MMSE sequences. For
simplicity, consider the setting where the measurement vectors are i.i.d. with mean zero
and covariance proportional to the identity matrix:
 
E[Am ] = 0, E Am ATm = N −1 IN . (7.35)
One example of a distribution satisfying these constraints is when the entries of Am are
i.i.d. with mean zero and variance 1/N. Another example is when Am is drawn uniformly
from a collection of N mutually orthogonal unit vectors.
theorem 7.4 Consider the standard linear model (7.3) with i.i.d. measurement vectors
{Am } satisfying (7.35). If X has finite covariance, then the sequences {Im } and {Mm }
satisfy
1
Im ≤ log(1 + Mm ) (7.36)
2
for all integers m.
Proof Conditioned on the observations (Y m , Am+1 ), the variance of a new measurement
can be expressed as

Var(Am+1 , X | Y m , Am+1 ) = ATm+1 Cov(X | Y m , Am )Am+1 .


Taking the expectation of both sides and leveraging the assumptions in (7.35), we see
that
 
E Var(Am+1 , X | Y m , Am+1 ) = Mm . (7.37)
Next, starting with the fact that the mutual information in a Gaussian channel is
maximized when the input (i.e., Am+1 , X) is Gaussian, we have
Understanding Phase Transitions via Mutual Information and MMSE 215

Im = I(X; Ym+1 | Y m , Am+1 )


  
1
≤ E log 1 + Var(Am+1 , X | Y , A )
m m+1
2
   
1
≤ E log 1 + E Var(Am+1 , X | Y n , Am+1 ) , (7.38)
2
where the second step follows from Jensen’s inequality and the concavity of the
logarithm. Combining (7.37) and (7.38) gives the stated inequality.
Theorem 7.4 is reminiscent of the I-MMSE relation for Gaussian channels in the sense
that it relates a change in mutual information to an MMSE estimate. One key difference,
however, is that (7.36) is an inequality instead of an equality. The difference between the
right- and left-hand sides of (7.36) can be viewed as a measure of the difference between
the posterior distribution of a new observation Ym+1 given observations (Y m , Am+1 ) and
the Gaussian distribution with matched first and second moments [20–22, 46, 47].
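As a sanity check of Theorem 7.4, consider the special case of a standard Gaussian signal X ~ N(0, I_N) with unit-variance noise, where the posterior covariance Σ_m = (I_N + (A^m)^T A^m)^{-1} is available in closed form and I'_m = E[(1/2) log(1 + A_{m+1}^T Σ_m A_{m+1})]. The Monte Carlo sketch below (ours; variable names are illustrative) verifies the inequality (7.36) numerically in this case.

import numpy as np

rng = np.random.default_rng(1)
N, m, trials = 20, 30, 2000

log_terms, traces = [], []
for _ in range(trials):
    A = rng.normal(0.0, np.sqrt(1.0 / N), size=(m, N))       # rows A_1, ..., A_m
    a_new = rng.normal(0.0, np.sqrt(1.0 / N), size=N)        # new measurement vector A_{m+1}
    Sigma = np.linalg.inv(np.eye(N) + A.T @ A)                # Cov(X | Y^m, A^m) for a Gaussian signal
    log_terms.append(0.5 * np.log(1.0 + a_new @ Sigma @ a_new))
    traces.append(np.trace(Sigma) / N)

I_prime_m = np.mean(log_terms)      # Monte Carlo estimate of I'_m
M_m = np.mean(traces)               # Monte Carlo estimate of M_m
print(I_prime_m, 0.5 * np.log(1.0 + M_m))    # the first number is smaller, consistent with (7.36)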
Combining Theorem 7.4 with the monotonicity of the first-order difference sequence
(Theorem 7.3) leads to a lower bound on the MMSE in terms of the total mutual
information.
theorem 7.5 Under the assumptions of Theorem 7.4, we have

    M_k ≥ exp( (2 I_m − k log(1 + M_0)) / (m − k) ) − 1    (7.39)

for all integers 0 ≤ k < m.

Proof For any 0 ≤ k < m, the monotonicity of I'_m (Theorem 7.3) allows us to write I_m = Σ_{ℓ=0}^{m−1} I'_ℓ = Σ_{ℓ=0}^{k−1} I'_ℓ + Σ_{ℓ=k}^{m−1} I'_ℓ ≤ k I'_0 + (m − k) I'_k. Using Theorem 7.4 to upper-bound the terms I'_0 and I'_k and then rearranging terms leads to the stated result.
Theorem 7.5 is particularly meaningful when the mutual information is large. For example, if the mutual information satisfies the lower bound

    I_m ≥ (m/2) (1 − ε) log(1 + M_0)

for some ε ∈ [0, 1), then Theorem 7.5 implies that

    M_k ≥ (1 + M_0)^{1 − ε/(1 − k/m)} − 1

for all integers 0 ≤ k < m. As ε converges to zero, the right-hand side of this inequality increases to M_0. In other words, a large value of I_m after m observations implies that the MMSE sequence is nearly constant for all k that are sufficiently small relative to m.
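As a quick numerical illustration with our own example values: if M_0 = 1, ε = 0.1, and k = m/2, the exponent equals 1 − 0.1/0.5 = 0.8, so the bound reads M_k ≥ 2^{0.8} − 1 ≈ 0.74; that is, even after half of the m observations the MMSE is guaranteed to retain roughly three-quarters of its prior value.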

7.4 Proving the Replica-Symmetric Formula

The authors’ prior work [20–22] provided the first rigorous proof of the replica formulas
(7.5) and (7.6) for the standard linear model with an i.i.d. signal and a Gaussian sensing
matrix.

In this section, we give an overview of the proof. It begins by focusing on the


increase in mutual information associated with adding a new observation as described
in Sections 7.3.3 and 7.3.4. Although this approach is developed formally using
finite-length sequences, we describe the large-system limit first for simplicity.

7.4.1 Large-System Limit and Replica Formulas


For the large-system limit, the increase in mutual information I(δ) with additional measurements is characterized by its derivative I'(δ). The main technical challenge is to establish the following relationships:

    fixed-point formula:    M(δ) = M_X( δ / (1 + M(δ)) ),    (7.40)

    I-MMSE formula:    I'(δ) = (1/2) log(1 + M(δ)),    (7.41)

where these equalities hold almost everywhere but not at phase transitions.
The next step is to use these two relationships to prove that the replica formulas, (7.5)
and (7.6), are exact. First, by solving the minimization over s in (7.5), one finds that any
local minimizer must satisfy the fixed-point formula (7.40). In addition, by differenti-
ating I(δ) in (7.5), one can show that I'(δ) must satisfy the I-MMSE formula (7.41).
Thus, if the fixed-point formula (7.40) defines M(δ) uniquely, then the mutual informa-
tion I(δ) can be computed by integrating (7.41) and the proof is complete. However,
this happens only if there are no phase transitions. Later, we will discuss how to handle
the case of multiple solutions and phase transitions.
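To make this program concrete, the sketch below (ours, not from the chapter) carries out the two steps for the simplest possible prior, a standard Gaussian signal, for which M_X(s) = 1/(1 + s); in this case the fixed point is unique for every δ, so there is no phase transition and no branch selection is needed.

import numpy as np

def M_X(s):
    # scalar MMSE of a unit-variance Gaussian signal observed at SNR s
    return 1.0 / (1.0 + s)

def solve_fixed_point(delta, iters=200):
    # iterate M <- M_X(delta / (1 + M)), which converges for this prior
    M = 1.0
    for _ in range(iters):
        M = M_X(delta / (1.0 + M))
    return M

deltas = np.linspace(0.0, 3.0, 301)
M = np.array([solve_fixed_point(d) for d in deltas])              # candidate M(delta) from (7.40)
I_prime = 0.5 * np.log(1.0 + M)                                   # I'(delta) from (7.41)
I = np.concatenate(([0.0],
                    np.cumsum(0.5 * (I_prime[1:] + I_prime[:-1]) * np.diff(deltas))))  # I(delta) by integration

# closed-form check: for this prior the fixed point is M = (sqrt(delta^2 + 4) - delta) / 2
print(np.max(np.abs(M - (np.sqrt(deltas**2 + 4) - deltas) / 2)))
print(I[-1])   # mutual information prediction at delta = 3

For priors whose fixed-point equation admits multiple solutions, the same recipe produces the fixed-point curve of Fig. 7.4, and the branch-selection step of Section 7.4.3 is then required.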

7.4.2 Information and MMSE Sequences


To establish the large-system-limit formulas, (7.40) and (7.41), the authors’ proof focuses
on functional properties of the incremental mutual information as well as the MMSE
sequence {Mm } defined by (7.34). In particular, the results in [20, 21] first establish
the following approximate relationships between the mutual information and MMSE
sequences:
 
    fixed-point formula:    M_m ≈ M_X( (m/N) / (1 + M_m) ),    (7.42)

    I-MMSE formula:    I'_m ≈ (1/2) log(1 + M_m).    (7.43)
The fixed-point formula (7.42) shows that the MMSE Mm corresponds to a scalar estima-
tion problem whose signal-to-noise ratio is a function of the number of observations m
as well as Mm . In Section A7.1, it is shown how the standard linear model can be related
to a scalar estimation problem of one signal entry. Finally, the I-MMSE formula (7.43)
implies that the increase in information with a new measurement corresponds to a sin-
gle use of a Gaussian channel with a Gaussian input whose variance is matched to the

MMSE. The following theorem, from [20, 21], quantifies the precise sense in which
these approximations hold.
theorem 7.6 Consider the standard linear model (7.3) with i.i.d. Gaussian measurement vectors A_m ∼ N(0, N^{-1} I_N). If the entries of X are i.i.d. with bounded fourth moment E[X_n^4] ≤ B, then the sequences {I'_m} and {M_m} satisfy

    Σ_{m=1}^{δN} | I'_m − (1/2) log(1 + M_m) | ≤ C_{B,δ} N^α,    (7.44)

    Σ_{m=1}^{δN} | M_m − M_X( (m/N) / (1 + M_m) ) | ≤ C_{B,δ} N^α    (7.45)

for every integer N and δ > 0, where α ∈ (0, 1) is a universal constant and C_{B,δ} is a constant that depends only on the pair (B, δ).
Theorem 7.6 shows that the cumulative absolute error in the
approximations (7.42) and (7.43) grows sub-linearly with the vector length N. Thus,
if one normalizes these sums by M = δN, then the resulting expressions converge to
zero as N → ∞. This is sufficient to establish (7.40) and (7.41).

7.4.3 Multiple Fixed-Point Solutions


At this point, the remaining difficulty is that the MMSE fixed-point formula (7.40) can
have multiple solutions, as is illustrated by the information fixed-point curve in Fig. 7.4.
In this case, the formulas (7.40) and (7.41) alone are not sufficient to uniquely define the
actual mutual information I(δ).
For many signal distributions with a single phase transition, the curve (see Fig. 7.4)
defined by the fixed-point formula (7.40) has the following property. For each δ, there
are at most two solutions where the slope of the curve is non-increasing. Since M(δ) is

non-increasing, in this case, it must jump from the upper solution branch to the lower solution branch (see Fig. 7.4) at the phase transition.

Figure 7.4 The derivative of the mutual information I'(δ) as a function of the measurement rate δ for linear estimation with i.i.d. Gaussian matrices. A phase transition occurs when the derivative jumps from one branch of the information fixed-point curve to another. (Figure annotations: the information fixed-point curve is the graph of all possible solutions to the MMSE fixed-point equation (7.40) and the I-MMSE formula (7.41); the correct branch is the non-increasing subset of the information fixed-point curve that matches the boundary conditions for the mutual information. Horizontal axis: measurement ratio δ.)
The final step in the authors’ proof technique is to resolve the location of the phase
transition using boundary conditions on the mutual information for δ = 0 and δ → ∞,
which can be obtained directly using different arguments. Under the signal property
stated below in Definition 7.1, it is shown that the only solution consistent with the
boundary conditions is the one predicted by the replica method. A graphical illustration
of this argument is provided in Fig. 7.4.

7.4.4 Formal Statement


In this section, we formally state the main theorem in [21]. To do this, we need the
following definition.
definition 7.1 A signal distribution pX has the single-crossing property3 if its replica
MMSE (7.6) crosses its fixed-point curve (7.40) at most once.
For any δ > 0, consider a sequence of standard linear models indexed by N where
the number of measurements is M = δN, the signal X ∈ RN is an i.i.d. vector with
entries drawn from pX , the M × N measurement matrix A has i.i.d. entries drawn from
N(0, 1/N), and the observed vector is Y = AX + W, where W ∈ R M is a standard Gaus-
sian vector. For this sequence, we can also define a sequence of mutual information and
MMSE functions:
    I_N(δ) ≜ (1/N) I(X^N; Y^{⌊δN⌋}),    (7.46)

    M_N(δ) ≜ (1/N) mmse(X^N | Y^{⌊δN⌋}).    (7.47)
theorem 7.7 Consider the sequence of problems defined above and assume that p_X has a bounded fourth moment (i.e., E[X^4] ≤ B < ∞) and satisfies the single-crossing property. Then the following two statements hold.


1. The sequence of mutual information functions IN (δ) converges to the replica
prediction (7.5). In other words, for all δ > 0,
    lim_{N→∞} I_N(δ) = I(δ).

2. The sequence of MMSE functions MN (δ) converges almost everywhere to the


replica prediction (7.6). In other words, at all continuity points of M(δ),
    lim_{N→∞} M_N(δ) = M(δ).

Relationship with Other Methods


The use of an integral relationship defining the mutual information is reminiscent of the
generalized area theorems introduced by Méasson et al. in coding theory [45]. However,

3 Regrettably, this is unrelated to the “single-crossing property” described earlier that says the MMSE
function MX (s) may cross the matched Gaussian MMSE MZ (s) at most once.

one of the key differences in the compressed sensing problem is that the conditional
entropy of the signal does not drop to zero after the phase transition.
The authors’ proof technique also differs from previous approaches that use system-
wide interpolation methods to obtain one-sided bounds [5, 48] or that focus on special
cases, such as sparse matrices [49], Gaussian mixture models [50], or the detection
problem of support recovery [51, 52]. After [20, 22], Barbier et al. obtained similar
results using a substantially different method [53, 54]. More recent work has provided
rigorous results for the generalized linear model [55].

7.5 Phase Transitions and Posterior Correlation

A phase transition refers to an abrupt change in the macroscopic properties of a sys-


tem. In the context of thermodynamic systems, a phase transition may correspond to the
transition from one state of matter to another (e.g., from solid to liquid or from liquid
to gas). In the context of inference problems, a phase transition can be used to describe
a sharp change in the quality of inference. For example, the channel coding problem
undergoes a phase transition as the signal-to-noise ratio crosses a threshold because the
decoder error probability transitions from ≈ 1 to ≈ 0 over a very small range of signal-
to-noise ratios. In the standard linear model, the asymptotic MMSE may also contain a
jump discontinuity with respect to the fraction of observations.
In many cases, the existence of phase transitions in inference problems can be related
to the emergence of significant correlation in the posterior distribution [5]. In these cases,
a small change in the uncertainty for one variable (e.g., reduction in the posterior vari-
ance with a new observation) corresponds to a change in the uncertainty for a large
number of other variables as well. The net effect is a large change in the overall system
properties, such as the MMSE.
In this section, we show how the tools introduced in Section 7.3 can be used to provide
a link between a measure of the average correlation in the posterior distribution and
second-order differences in the mutual information for both the Gaussian channel and
the standard linear model.

7.5.1 Mean-Squared Covariance


Let us return to the general inference problem of estimating a random vector X =
(X1 , . . . , XN ) from observations Y = (Y1 , . . . , Y M ). As discussed in Section 7.3, the pos-
terior covariance matrix Cov(X | Y) provides a geometric measure of the amount of
uncertainty in the posterior distribution. One important function of this matrix is the
MMSE, which corresponds to the expected posterior variance:

    mmse(X | Y) = Σ_{n=1}^{N} E[Var(X_n | Y)].    (7.48)

Going beyond the MMSE, there is also important information contained in the off-
diagonal entries, which describe the pair-wise correlations. A useful measure of this
correlation is provided by the mean-squared covariance:

    E[ ||Cov(X | Y)||_F^2 ] = Σ_{k=1}^{N} Σ_{n=1}^{N} E[ Cov^2(X_k, X_n | Y) ].    (7.49)

Note that, while the MMSE corresponds to N terms, the mean-squared covariance corresponds to N^2 terms. If the entries in X have bounded fourth moments (i.e., E[X_i^4] ≤ B), then it follows from the Cauchy–Schwarz inequality that each summand on the right-hand side of (7.49) is upper-bounded by B, and it can be verified that

    (1/N) (mmse(X | Y))^2 ≤ E[ ||Cov(X | Y)||_F^2 ] ≤ N^2 B.    (7.50)
The left inequality is tight when the posterior distribution is uncorrelated and hence the
off-diagonal terms of the conditional covariance are zero. The right inequality is tight
when the off-diagonal terms are of the same order as the variance.
Another way to view the relationship between the MMSE and mean-squared covari-
ance is to consider the spectral decomposition of the covariance matrix. Let Λ_1 ≥ Λ_2 ≥ · · · ≥ Λ_N denote the (random) eigenvalues of Cov(X | Y). Then we can write

    tr(Cov(X | Y)) = Σ_{n=1}^{N} Λ_n,        ||Cov(X | Y)||_F^2 = Σ_{n=1}^{N} Λ_n^2.

Taking the expectations of these random quantities and rearranging terms, one finds that the mean-squared covariance can be decomposed into three non-negative terms:

    E[ ||Cov(X | Y)||_F^2 ] = N (E[Λ̄])^2 + N Var(Λ̄) + E[ Σ_{n=1}^{N} (Λ_n − Λ̄)^2 ],    (7.51)

where Λ̄ = (1/N) Σ_{n=1}^{N} Λ_n denotes the arithmetic mean of the eigenvalues. The first term
on the right-hand side corresponds to the square of the MMSE and is equal to the lower
bound in (7.50). The second term on the right-hand side corresponds to the variance of
(1/N) tr(Cov(X | Y)) with respect to the randomness in Y. This term is equal to zero if
X and Y are jointly Gaussian. The last term corresponds to the expected variation in
the eigenvalues. If a small number of eigenvalues are significantly larger than the others
then it is possible for this term to be N times larger than the first term. When this occurs,
most of the uncertainty in the posterior distribution is concentrated on a low-dimensional
subspace.
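The decomposition (7.51) is a purely algebraic identity, so it can be checked numerically with any random ensemble of positive semidefinite matrices standing in for draws of Cov(X | Y). The sketch below (ours) does exactly that.

import numpy as np

rng = np.random.default_rng(2)
N, T = 6, 5000

G = rng.normal(size=(T, N, 2 * N))
covs = G @ np.swapaxes(G, 1, 2) / (2 * N)        # T random PSD matrices playing the role of Cov(X | Y)

eigs = np.linalg.eigvalsh(covs)                   # eigenvalues Lambda_1, ..., Lambda_N for each draw
lam_bar = eigs.mean(axis=1)                       # (1/N) tr Cov(X | Y)

lhs = np.mean(np.sum(eigs**2, axis=1))            # E || Cov(X | Y) ||_F^2
rhs = (N * lam_bar.mean()**2                      # N (E[Lambda_bar])^2, i.e., (1/N) (MMSE)^2
       + N * lam_bar.var()                        # N Var(Lambda_bar)
       + np.mean(np.sum((eigs - lam_bar[:, None])**2, axis=1)))   # expected spread of the eigenvalues
print(lhs - rhs)    # ~ 1e-15: the three terms reproduce the mean-squared covariance as in (7.51)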

7.5.2 Conditional MMSE Function and Its Derivative


The relationship between phase transitions and correlation in the posterior distribu-
tion can be made precise using the properties of the Gaussian channel discussed in
Section 7.3.1. Given a random vector X ∈ RN and a random observation Y ∈ R M , the
conditional MMSE function is defined by
    M_{X|Y}(s) ≜ (1/N) E[ ||X − E[X | Y, Z(s)]||^2 ],    (7.52)

where Z(s) = √s X + W is a new observation of X from an independent Gaussian noise
channel [32]. From the expression for the derivative of the MMSE in (7.16), it can be

verified that the derivative of the conditional MMSE function is


    M'_{X|Y}(s) = −(1/N) E[ ||Cov(X | Y, Z(s))||_F^2 ].    (7.53)
Here, we recognize that the right-hand side is proportional to the mean-squared covari-
ance associated with the pair of observations (Y, Z(s)). Meanwhile, the left-hand side
describes the change in the MMSE associated with a small increase in the signal-to-noise
ratio of the Gaussian channel.
In Section 7.3.2, we saw that a phase transition in the channel coding problem for
the Gaussian channel corresponds to a jump discontinuity in the MMSE. More gener-
ally, one can say that the inference problem defined by the pair (Y, Z(s)) undergoes a
phase transition whenever M_{X|Y}(s) has a jump discontinuity in the large-N limit. If such a phase transition occurs, then it implies that the magnitude of M'_{X|Y}(s) is increasing
without bound. From (7.53), we see that this also implies significant correlation in the
posterior distribution.
Evaluating the conditional MMSE function and its derivative at s = 0 provides
expressions for the MMSE and mean-squared covariance associated with the original observation model:

    M_{X|Y}(0) = (1/N) mmse(X | Y),    (7.54)

    M'_{X|Y}(0) = −(1/N) E[ ||Cov(X | Y)||_F^2 ].    (7.55)
In light of the discussion above, the mean-squared covariance can be interpreted as the
rate of MMSE change with s that occurs when one is presented with an independent
observation of X from a Gaussian channel with infinitesimally small signal-to-noise
ratio. Furthermore, we see that significant correlation in the posterior distribution
corresponds to a jump discontinuity in the large-N limit of MX|Y (s) at the point s = 0.
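In the jointly Gaussian case the expectation in (7.53) is superfluous because Cov(X | Y, Z(s)) does not depend on the realization, so the relations (7.52)–(7.55) can be verified directly. The sketch below (ours; the observation matrix H and the dimensions are arbitrary choices) compares a finite-difference derivative of M_{X|Y}(s) with the scaled mean-squared covariance.

import numpy as np

rng = np.random.default_rng(3)
N, M_obs = 8, 5

H = rng.normal(size=(M_obs, N))          # Y = H X + unit-variance Gaussian noise
Sigma0 = np.eye(N)                        # prior covariance of the Gaussian signal X

def post_cov(s):
    # Cov(X | Y, Z(s)) for Z(s) = sqrt(s) X + W with unit-variance noise
    return np.linalg.inv(np.linalg.inv(Sigma0) + H.T @ H + s * np.eye(N))

def M_XY(s):
    return np.trace(post_cov(s)) / N      # conditional MMSE function (7.52) in this Gaussian model

s, h = 0.7, 1e-5
print((M_XY(s + h) - M_XY(s - h)) / (2 * h),                 # finite-difference derivative M'_{X|Y}(s)
      -np.linalg.norm(post_cov(s), 'fro')**2 / N)            # right-hand side of (7.53)
print(M_XY(0.0), -np.linalg.norm(post_cov(0.0), 'fro')**2 / N)   # the quantities in (7.54) and (7.55)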

7.5.3 Second-Order Differences of the Information Sequence


Next, we consider some further properties of the incremental-mutual information
sequence introduced in Section 7.3.3. For any observation model satisfying the conditional independence condition in (7.33), the second-order difference sequence can be expressed as

    I''_m = −(1/M!) Σ_π I(Y_{π(m+2)}; Y_{π(m+1)} | Y_{π(1)}, . . . , Y_{π(m)}).    (7.56)
Note that each summand is a measure of the pair-wise dependence in the posterior dis-
tribution of new measurements. If the pair-wise dependence is large (on average), then
this means that there is a significant decrease in the first-order difference sequence I'_m.
The monotonicity and non-negativity of the first-order difference sequence impose some important constraints on the second-order difference sequence. For example, the number of terms for which |I''_m| is “large” can be upper-bounded in terms of the infor-
mation provided by a single observation. The following result provides a quantitative
description of this constraint.

theorem 7.8 For any observation model satisfying the conditional independence condition in (7.33) and positive number T, the second-order difference sequence {I''_m} satisfies

    |{ m : |I''_m| ≥ T }| ≤ I_1 / T,    (7.57)

where I_1 = (1/M) Σ_{m=1}^{M} I(X; Y_m) is the first term in the information sequence.

Proof The monotonicity of the first-order difference (Theorem 7.3) means that I''_m is non-positive, and hence the indicator function of the event {|I''_m| ≥ T} is upper-bounded by −I''_m / T. Summing this inequality over m, we obtain

    |{ m : |I''_m| ≥ T }| = Σ_{m=0}^{M−2} 1_{[T,∞)}(|I''_m|) ≤ Σ_{m=0}^{M−2} (−I''_m / T) = (I'_0 − I'_{M−1}) / T.

Noting that I'_0 = I_1 and I'_{M−1} ≥ 0 completes the proof.
An important property of Theorem 7.8 is that, for many problems of interest, the
term I1 does not depend on the total number of observations M. For example, in the
standard linear model with i.i.d. measurement vectors, the upper bound in Theorem 7.4
gives I_1 ≤ (1/2) log(1 + M_X(0)). Some implications of these results are discussed in the next
section.
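As a quick numerical illustration with our own numbers: if M_X(0) = 1, then I_1 ≤ (1/2) log 2 ≈ 0.35 nats, so applying (7.57) with T = 0.01 shows that no more than 34 indices m can satisfy |I''_m| ≥ 0.01, regardless of how large M is.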

Implications for the Standard Linear Model


One of the key steps in the authors’ prior work on the standard linear model [20–22]
is the following inequality, which relates the second-order difference sequence and the
mean-squared covariance.
theorem 7.9 Consider the standard linear model (7.3) with i.i.d. Gaussian measurement vectors A_m ∼ N(0, N^{-1} I_N). If the entries of X are independent with bounded fourth moment E[X_n^4] ≤ B, then the mean-squared covariance satisfies

    (1/N^2) E[ ||Cov(X | Y^m, A^m)||_F^2 ] ≤ C_B |I''_m|^{1/4},    (7.58)

for all integers N and m = 1, . . . , M, where C_B is a constant that depends only on the fourth-moment upper bound B.
Theorem 7.9 shows that significant correlation in the posterior distribution implies
pair-wise dependence in the joint distribution of new measurements and, hence, a
significant decrease in the first-order difference sequence I'_m. In particular, if the mean-squared covariance is order N^2 (corresponding to the upper bound in (7.50)), then |I''_m| is lower-bounded by a constant. If we consider the large-N limit in which the number of observations is parameterized by the fraction δ = m/N, then an order-one difference in I'_m corresponds to a jump discontinuity with respect to δ. In other words, signif-
icant pair-wise correlation implies a phase transition with respect to the fraction of
observations.
Viewed in the other direction, Theorem 7.9 also shows that small changes in the first-
order difference sequence imply that the average pair-wise correlation is small. From

Theorem 7.8, we see that this is, in fact, the typical situation. Under the assumptions of
Theorem 7.9, it can be verified that
    |{ m : E[ ||Cov(X | Y^m, A^m)||_F^2 ] ≥ N^{2 − ε/4} }| ≤ C_{ε,B} N^{ε}    (7.59)

for all 0 ≤ ε ≤ 1, where C_{ε,B} is a constant that depends only on ε and the fourth-moment bound
B. In other words, the number of m-values for which the mean-squared covariance has
the same order as the upper bound in (7.50) must be sub-linear in N. This fact plays a
key role in the proof of Theorem 7.6; see [20, 21] for details.
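Under our reading of the two theorems, (7.59) follows by combining them: if E[ ||Cov(X | Y^m, A^m)||_F^2 ] ≥ N^{2−ε/4}, then (7.58) forces |I''_m| ≥ N^{−ε}/C_B^4, and applying (7.57) with T = N^{−ε}/C_B^4, together with the bound I_1 ≤ (1/2) log(1 + M_X(0)) from Theorem 7.4, limits the number of such indices to a constant multiple of N^ε.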

7.6 Conclusion

This chapter provides a tutorial introduction to high-dimensional inference and its con-
nection to information theory. The standard linear model is analyzed in detail and used
as a running example. The primary goal is to present intuitive links between phase tran-
sitions, mutual information, and estimation error. To that end, we show how general
functional properties (e.g., the chain rule, data-processing inequality, and I-MMSE rela-
tionship) of mutual information and MMSE can imply meaningful constraints on the
solutions of challenging problems. In particular, the replica prediction of the mutual
information and MMSE is described and an outline is given for the authors’ proof that it
is exact in some cases. We hope that the approach described here will make this material
accessible to a wider audience.

7.6.1 Further Directions


Beyond the standard linear model, there are other interesting high-dimensional inference
problems that can be addressed using the ideas in this chapter. For example, one recent
line of work has focused on multilayer networks, which consist of multiple stages of a
linear transform followed by a nonlinear (possibly random) function [32, 56, 57]. There
has also been significant work on bilinear estimation problems, matrix factorization, and
community detection [58–61]. Finally, there has been some initial progress on optimal
quantization for the standard linear model [62, 63].

A7.1 Subset Response for the Standard Linear Model

This appendix describes the mapping between the standard linear model and a signal-
plus-noise response model for a subset of the observations. Recall the problem
formulation
Y = AX + W (7.60)
where A is an M × N matrix. Suppose that we are interested in the posterior distribution
of a subset S ⊂ {1, . . . , N} of the signal entries where the size of the subset K = |S | is
small relative to the signal length N and the number of measurements M. Letting S^c = {1, . . . , N} \ S denote the complement of S, the measurements can be decomposed as

    Y = A_S X_S + A_{S^c} X_{S^c} + W,    (7.61)

where A_S is an M × K matrix corresponding to the columns of A indexed by S and A_{S^c} is an M × (N − K) matrix corresponding to the columns indexed by the complement of S.
This decomposition suggests an alternative interpretation of the linear model in which
XS is a low-dimensional signal of interest and XS c is a high-dimensional interference
term. Note that AS is a tall skinny matrix, and thus the noiseless measurements of XS
lie in a K-dimensional subspace of the M-dimensional measurement space.
Next, we introduce a linear transformation of the problem that attempts to sepa-
rate the signal of interest from the interference term. The idea is to consider the QR
decomposition of the tall skinny matrix AS of the form
    A_S = [Q_1, Q_2] [R; 0] = Q [R; 0],

where [R; 0] denotes the K × K matrix R stacked on top of an (M − K) × K block of zeros, Q = [Q_1, Q_2] is an M × M orthogonal matrix (Q Q^T = I), Q_1 is M × K, Q_2 is M × (M − K), and R is a K × K upper-triangular matrix whose diagonal entries are non-negative. If A_S has full column rank, then the pair (Q_1, R) is uniquely defined. The matrix Q_2 can be chosen arbitrarily subject to the constraint Q_2 Q_2^T = I − Q_1 Q_1^T. To facilitate the analysis, we will assume that Q_2 is chosen uniformly at random over the set of matrices satisfying this constraint.
Multiplication by Q^T is a one-to-one linear transformation. The transformed problem parameters are defined as

    Ỹ ≜ Q^T Y,    B ≜ Q^T A_{S^c},    W̃ ≜ Q^T W.
At this point, it is important to note that the isotropic Gaussian distribution is invariant
to orthogonal transformations. Consequently, the transformed noise W̃ has the same dis-
tribution as W and is independent of everything else. Using the transformed parameters,
the linear model can be expressed as
    [Ỹ_1; Ỹ_2] = [R; 0] X_S + [B_1; B_2] X_{S^c} + [W̃_1; W̃_2],    (7.62)

where Ỹ_1 corresponds to the first K measurements and Ỹ_2 corresponds to the remaining (M − K) measurements.
A useful property of the transformed model is that Ỹ_2 is independent of the signal of interest X_S. This decomposition motivates a two-stage approach in which one first estimates X_{S^c} from the data (Ỹ_2, B_2) and then uses this estimate to “subtract out” the interference term in Ỹ_1. To be more precise, we define

    Z ≜ Ỹ_1 − B_1 E[ X_{S^c} | Ỹ_2, B_2 ]

to be the measurements Ỹ_1 after subtracting the conditional expectation of the interfer-
ence term. Rearranging terms, one finds that the relationship between Z and XS can be
expressed succinctly as
    Z = R X_S + V,    V ∼ p(v | Ỹ_2, B),    (7.63)

where
  
,2 , B2 + -
V  B1 XS c − E XS c | Y W1 (7.64)

is the error due to both the interference and the measurement noise.
Thus far, this decomposition is quite general in the sense that it can be applied for any
matrix A and subset S of size less than M. The key question at this point is whether the
error term V is approximately Gaussian.
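The change of variables leading to (7.62) is easy to reproduce numerically. The sketch below (ours; dimensions and the random instance are arbitrary) computes the QR decomposition of A_S, applies Q^T, and confirms that the transformed observations take the claimed block form, with X_S entering only the first K rows. Here Q_2 is whatever the QR routine returns rather than a uniformly random completion; the identity being checked does not depend on that choice.

import numpy as np

rng = np.random.default_rng(4)
N, M, K = 30, 20, 3

A = rng.normal(0.0, np.sqrt(1.0 / N), size=(M, N))
X = rng.normal(size=N)
W = rng.normal(size=M)
Y = A @ X + W                                         # standard linear model (7.60)

S = np.arange(K)                                       # subset of interest
Sc = np.arange(K, N)
Q, R_full = np.linalg.qr(A[:, S], mode='complete')     # A_S = Q [R; 0]
# flip signs so that the diagonal of R is non-negative, matching the convention in the text
signs = np.where(np.diag(R_full[:K]) < 0, -1.0, 1.0)
Q[:, :K] *= signs
R = signs[:, None] * R_full[:K]

Y_t, B, W_t = Q.T @ Y, Q.T @ A[:, Sc], Q.T @ W         # transformed observations, interference matrix, noise

# first K rows carry R X_S; the remaining M - K rows contain no contribution from X_S
print(np.allclose(Y_t[:K], R @ X[S] + B[:K] @ X[Sc] + W_t[:K]),
      np.allclose(Y_t[K:], B[K:] @ X[Sc] + W_t[K:]))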

References

[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J.,


vol. 27, pp. 379–423, 623–656, 1948.
[2] T. Lesieur, C. De Bacco, J. Banks, F. Krzakala, C. Moore, and L. Zdeborová, “Phase tran-
sitions and optimal algorithms in high-dimensional Gaussian mixture clustering,” in Proc.
Allerton Conference on Communication, Control, and Computing, 2016.
[3] T. M. Cover and J. A. Thomas, Elements of information theory, 2nd edn. Wiley-
Interscience, 2006.
[4] T. S. Han, Information-spectrum methods in information theory. Springer, 2004.
[5] M. Mézard and A. Montanari, Information, physics, and computation. Oxford University
Press, 2009.
[6] L. Zdeborová and F. Krzakala, “Statistical physics of inference: Thresholds and algo-
rithms,” Adv. Phys., vol. 65, no. 5, pp. 453–552, 2016.
[7] G. Parisi, “A sequence of approximated solutions to the S–K model for spin glasses,” J.
Phys. A: Math. and General, vol. 13, no. 4, pp. L115–L121, 1980.
[8] M. Talagrand, “The Parisi formula,” Annals Math., vol. 163, no. 1, pp. 221–263, 2006.
[9] D. J. MacKay, Information theory, inference, and learning algorithms. Cambridge Univer-
sity Press, 2003.
[10] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and varia-
tional inference. Now Publisher Inc., 2008.
[11] M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. O. Hero, and
S. McLaughlin, “A survey of stochastic simulation and optimization methods in signal
processing,” IEEE J. Selected Topics Signal Processing, vol. 10, no. 2, pp. 224–241, 2016.
[12] M. Opper and O. Winther, “Expectation consistent approximate inference,” J. Machine
Learning Res., vol. 6, pp. 2177–2204, 2005.
[13] J. Pearl, Probabilistic reasoning in intelligent systems: Networks of plausible inference.
Morgan Kaufmann Publishers Inc., 1998.
[14] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proc. 17th
Conference in Uncertainty in Artificial Intelligence, 2001, pp. 362–369.
[15] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed
sensing,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, 2009.
[16] I. M. Johnstone, “Gaussian estimation: Sequence and wavelet models,” 2015, http://
statweb.stanford.edu/~imj/.
[17] S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing. Birkhäuser,
2013.

[18] Y. C. Eldar and G. Kutyniok, Compressed sensing theory and applications. Cambridge
University Press, 2012.
[19] D. Guo and S. Verdú, “Randomly spread CDMA: Asymptotics via statistical physics,”
IEEE Trans. Information Theory, vol. 51, no. 6, pp. 1983–2010, 2005.
[20] G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing
with Gaussian matrices is exact,” in Proc. IEEE International Symposium on Information
Theory (ISIT), 2016, pp. 665–669.
[21] G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing
with Gaussian matrices is exact,” 2016, https://arxiv.org/abs/1607.02524.
[22] G. Reeves, “Understanding the MMSE of compressed sensing one measurement at a time,”
presented at the Institut Henri Poincaré Spring 2016 Thematic Program on the Nexus of
Information and Computation Theories, Paris, 2016, https://youtu.be/vmd8-CMv04I.
[23] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with
applications to compressed sensing,” IEEE Trans. Information Theory, vol. 57, no. 2, pp.
764–785, 2011.
[24] S. Rangan, “Generalized approximate message passing for estimation with random linear
mixing,” in Proc. IEEE International Symposium on Information Theory (ISIT), 2011, pp.
2174–2178.
[25] J. P. Vila and P. Schniter, “Expectation-maximization Gaussian-mixture approximate
message passing,” IEEE Trans. Signal Processing, vol. 61, no. 19, pp. 4658–4672, 2013.
[26] Y. Ma, J. Zhu, and D. Baron, “Compressed sensing via universal denoising and approxi-
mate message passing,” IEEE Trans. Signal Processing, vol. 64, no. 21, pp. 5611–5622,
2016.
[27] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,”
IEEE Trans. Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
[28] P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the
generalized linear model,” in Asilomar Conference on Signals, Systems and Computers,
2016.
[29] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” in Proc.
IEEE International Symposium on Information Theory (ISIT), 2017, pp. 1588–1592.
[30] Y. Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propaga-
tion,” J. Phys. A: Math. General, vol. 36, no. 43, pp. 11 111–11 121, 2003.
[31] M. Bayati, M. Lelarge, and A. Montanari, “Universality in polytope phase transitions and
iterative algorithms,” in IEEE International Symposium on Information Theory, 2012.
[32] G. Reeves, “Additivity of information in multilayer networks via additive Gaussian noise
transforms,” in Proc. Allerton Conference on Communication, Control, and Computing,
2017, https://arxiv.org/abs/1710.04580.
[33] B. Çakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate message passing for
general matrix ensembles,” 2014, http://arxiv.org/abs/1405.2767.
[34] A. Fletcher, M. Sahree-Ardakan, S. Rangan, and P. Schniter, “Expectation consistent
approximate inference: Generalizations and convergence,” in Proc. IEEE International
Symposium on Information Theory (ISIT), 2016.
[35] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” 2016,
https://arxiv.org/abs/1610.03082.
[36] P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the
generalized linear model,” 2016, https://arxiv.org/abs/1612.01186.

[37] B. Çakmak, M. Opper, O. Winther, and B. H. Fleury, “Dynamical functional theory for
compressed sensing,” 2017, https://arxiv.org/abs/1705.04284.
[38] H. He, C.-K. Wen, and S. Jin, “Generalized expectation consistent signal recovery for
nonlinear measurements,” in Proc. IEEE International Symposium on Information Theory
(ISIT), 2017.
[39] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in
Gaussian channels,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[40] A. J. Stam, “Some inequalities satisfied by the quantities of information of Fisher and
Shannon,” Information and Control, vol. 2, no. 2, pp. 101–112, 1959.
[41] D. Guo, Y. Wu, S. Shamai, and S. Verdú, “Estimation in Gaussian noise: Properties of
the minimum mean-square error,” IEEE Trans. Information Theory, vol. 57, no. 4, pp.
2371–2385, 2011.
[42] G. Reeves, H. D. Pfister, and A. Dytso, “Mutual information as a function of matrix SNR
for linear Gaussian channels,” in Proc. IEEE International Symposium on Information
Theory (ISIT), 2018.
[43] K. Bhattad and K. R. Narayanan, “An MSE-based transfer chart for analyzing iterative
decoding schemes using a Gaussian approximation,” IEEE Trans. Information Theory,
vol. 58, no. 1, pp. 22–38, 2007.
[44] N. Merhav, D. Guo, and S. Shamai, “Statistical physics of signal estimation in Gaussian
noise: Theory and examples of phase transitions,” IEEE Trans. Information Theory, vol. 56,
no. 3, pp. 1400–1416, 2010.
[45] C. Méasson, A. Montanari, T. J. Richardson, and R. Urbanke, “The generalized area theo-
rem and some of its consequences,” IEEE Trans. Information Theory, vol. 55, no. 11, pp.
4793–4821, 2009.
[46] G. Reeves, “Conditional central limit theorems for Gaussian projections,” in Proc. IEEE
International Symposium on Information Theory (ISIT), 2017, pp. 3055–3059.
[47] G. Reeves, “Two-moment inequalities for Rényi entropy and mutual information,” in Proc.
IEEE International Symposium on Information Theory (ISIT), 2017, pp. 664–668.
[48] S. B. Korada and N. Macris, “Tight bounds on the capacity of binary input random CDMA
systems,” IEEE Trans. Information Theory, vol. 56, no. 11, pp. 5590–5613, 2010.
[49] A. Montanari and D. Tse, “Analysis of belief propagation for non-linear problems: The
example of CDMA (or: How to prove Tanaka’s formula),” in Proc. IEEE Information
Theory Workshop (ITW), 2006, pp. 160–164.
[50] W. Huleihel and N. Merhav, “Asymptotic MMSE analysis under sparse representation
modeling,” Signal Processing, vol. 131, pp. 320–332, 2017.
[51] G. Reeves and M. Gastpar, “The sampling rate–distortion trade-off for sparsity pattern
recovery in compressed sensing,” IEEE Trans. Information Theory, vol. 58, no. 5, pp.
3065–3092, 2012.
[52] G. Reeves and M. Gastpar, “Approximate sparsity pattern recovery: Information-theoretic
lower bounds,” IEEE Trans. Information Theory, vol. 59, no. 6, pp. 3451–3465, 2013.
[53] J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear
estimation,” in Proc. Allerton Conference on Communication, Control, and Computing,
2016.
[54] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, “Phase transitions,
optimal errors and optimality of message-passing in generalized linear models,” 2017,
https://arxiv.org/abs/1708.03395.

[55] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, “Optimal errors


and phase transitions in high-dimensional generalized linear models,” in Conference on
Learning Theory, 2018, pp. 728–731.
[56] A. Manoel, F. Krzakala, M. Mézard, and L. Zdeborová, “Multi-layer generalized linear
estimation,” in Proc. IEEE International Symposium on Information Theory (ISIT), 2017,
pp. 2098–2102.
[57] A. K. Fletcher, S. Rangan, and P. Schniter, “Inference in deep networks in high dimen-
sions,” in Proc. IEEE International Symposium on Information Theory (ISIT), 2018.
[58] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, “Mutual infor-
mation for symmetric rank-one matrix estimation: A proof of the replica formula,” in
Advances in Neural Information Processing Systems (NIPS), 2016, pp. 424–432.
[59] M. Lelarge and L. Miolane, “Fundamental limits of symmetric low-rank matrix estima-
tion,” 2016, https://arxiv.org/abs/1611.03888.
[60] T. Lesieur, F. Krzakala, and L. Zdeborová, “Constrained low-rank matrix estimation: Phase
transitions, approximate message passing and applications,” 2017, https://arxiv.org/abs/
1701.00858.
[61] E. Abbe, “Community detection and stochastic block models: Recent developments,”
2017, https://arxiv.org/abs/1703.10146.
[62] A. Kipnis, G. Reeves, Y. C. Eldar, and A. Goldsmith, “Compressed sensing under optimal
quantization,” in Proc. IEEE International Symposium on Information Theory (ISIT), 2017,
pp. 2153–2157.
[63] A. Kipnis, G. Reeves, and Y. C. Eldar, “Single letter formulas for quantized compressed
sensing with Gaussian codebooks,” in Proc. IEEE International Symposium on Information
Theory (ISIT), 2018.
8 Computing Choice: Learning
Distributions over Permutations
Devavrat Shah

Summary

We discuss the question of learning distributions over permutations of a given set of


choices, options, or items on the basis of partial observations. This is central to cap-
turing the so-called “choice” in a variety of contexts: understanding preferences of
consumers over a collection of products from purchasing and browsing data in the
setting of retail and e-commerce, learning public opinion amongst a collection of socio-
economic issues from sparse polling data, deciding a ranking of teams or players from
outcomes of games, electing leaders according to votes, and more generally collabo-
rative decision-making employing collective judgment such as accepting papers in a
competitive academic conference. The question of learning distributions over permu-
tations arises beyond capturing “choice” as well, for example, tracking a collection of
objects using noisy cameras, or aggregating ranking of web-pages using outcomes of
multiple search engines. It is only natural that this topic has been studied extensively in
economics, political science, and psychology for more than a century, and more recently
in computer science, electrical engineering, statistics, and operations research.
Here we shall focus on the task of learning distributions over permutations from
marginal distributions of two types: first-order marginals and pair-wise comparisons.
A lot of progress has been made on this topic in the last decade. The ideal goal is to
provide a comprehensive overview of the state-of-the-art on this topic. We shall pro-
vide a detailed overview of selective aspects, biased by the author’s perspective of the
topic, and provide sufficient pointers to aspects not covered here. We shall emphasize the
ability to identify the entire distribution over permutations as well as the “best ranking.”

8.1 Background

8.1.1 Learning from Comparisons


Consider a grocery store around the corner from your home. The owner of the store
would like to have the ability to identify exactly what every customer would purchase (or
not) given the options available in the store. If such an ability exists, then, for example,
optimal stocking decisions can be made by the store operator or the net worth of the store
can be evaluated. This ability is what one would call the “choice model” of consumers of


the store. More precisely, such a “choice model” can be viewed as a black-box that spits
out the probability of purchase of a particular option when the customer is presented
with a collection of options.
A canonical fine-grained representation for such a “choice model” is the distribution
over permutations of all the possible options (including the no-purchase option). Then,
probability of purchasing a particular option when presented with a collection of options
is simply the probability that this particular option has the highest (relative) order or rank
amongst all the presented options (including the no-purchase option).
Therefore, one way to operationalize such a “choice model” is to learn the distribution
over permutations of all options that a store owner can stock in the store. Clearly, such a
distribution needs to be learned from the observations or data. The data available to the
store owner are the historical transactions as well as what was stocked in the store when
each transaction happened. Such data effectively provide a bag of pair-wise comparisons
between options: consumer exercises or purchases option A over option B corresponds
to a pair-wise comparison A > B or “A is preferred to B.”
In summary, to model consumer choice, we wish to learn the distribution over per-
mutations of all possible options using observations in terms of a collection of pair-wise
comparisons that are consistent with the learned distribution.
In the context of sports, we wish to go a step further, to obtain a ranking of sports
teams or players that is based on outcomes of games, which are simply pair-wise
comparisons (between teams or players). Similarly, for the purpose of data-driven
policy-making, we wish to aggregate people’s opinions about socio-economic issues
such as modes of transportation according to survey data; for designing online recom-
mendation systems that are based on historical online activity of individuals, we wish
to recommend the top few options; or to sort objects on the basis of noisy outcomes of
pair-wise comparisons.

8.1.2 Learning from First-Order Marginals


The task of learning distributions over permutations of options, using different types
of partial information comes up in other scenarios. To that end, now suppose that the
store owner wants to track each consumer’s journey within the store with the help of
cameras. The consumers constantly move within the store as they search through the
aisles. Naturally, when multiple consumers are in the store, their paths are likely to
cross. When the paths of two (or more) consumers cross, and they subsequently follow
different trajectories, confusion can arise regarding which of the multiple trajectories
maps to which of the consumers. That is, at each instant of time we need to continue
mapping physical locations of consumers observed by cameras with the trajectories of
consumers who are being tracked. Equivalently, it’s about keeping track of a “matching”
between locations and individuals in a bipartite graph or keeping track of permutations!
The authors of [1] proposed distribution over permutations as the canonical model
where a permutation corresponds to “matching” of consumers or trajectories to loca-
tions. In such a scenario, due to various constraints and tractability reasons, the
information that is available is the likelihoods of each consumer or trajectory being in
a specific location. In the context of distribution over permutations, this corresponds to

knowing the “first-order” marginal distribution information that states the probability of
a given option being in a certain position in the permutation. Therefore, to track con-
sumers in the store, we wish to learn the distribution over consumer trajectories that is
consistent with this first-order marginal information over time.
In summary, the model to track trajectories of individuals boils down to continually
learning the distribution over permutations that is consistent with the first-order marginal
information and subsequently finding the most likely ranking or permutation from the
learned distribution. It is the very same question that arises in the context of aggregating
web-page rankings obtained through results of search from multiple search engines in a
computationally efficient manner.

8.1.3 Historical Remarks


This fine-grained representation for choice, distribution over permutations, is ancient.
Here, we provide a brief historical overview of the use of distribution over permu-
tations as a model for choice and other applications. We also refer to Chapter 9 of
the monograph by Diaconis [2] for a nice historical overview from a statistician’s
perspective.
One of the earliest references regarding how to model and learn choice using (poten-
tially inconsistent) comparisons is the seminal work by Thurstone [3]. It presents “a law
of comparative judgement” or more precisely a simple parametric model to capture the
outcomes of a collection of pair-wise comparisons between given options (or stimuli in
the language of [3]). This model can be rephrased as an instance of the random util-
ity model (RUM) as follows (also see [4, 5]): given N options, let each option, say i,
have inherent utility ui associated with it; when two options i and j are compared, ran-
dom variables Yi , Y j are sampled and i is preferred over j iff Yi > Y j ; here Yi = ui + εi ,
Y j = u j + ε j , with εi , ε j independent random variables with identical mean.
A specialization of the above model when the εi s are assumed to be Gaussian with
mean 0 and variance 1 for all i is known as the Thurstone–Mosteller model. It is also
known as the probit model. Another specialization of the Thurstone model is realized
when the εi s are assumed to have the Gumbel distribution (one of the extreme value
distributions). This model has been credited differently across communities. Holman
and Marley established that this model is equivalent (see [6] for details) to a generative
model described in detail in Section 8.3.2. It is known as the Luce model [7] and the
Plackett model [8]. In the context when the partial observations are choice observations
(i.e., the observation that an item is chosen from an offered subset of items), this model
is called the multinomial logit model (MNL) after McFadden [9] called it conditional
logit; also see [10]. It is worth remarking that, when restricted to pair-wise comparisons
only, this model matches the Bradley–Terry [11] model, but the Bradley–Terry model
did not consider the requirement that the pair-wise comparison marginals need to be
consistent with an underlying distribution over permutations.
The MNL model is of central importance for various reasons. It was introduced by
Luce to be consistent with the axiom of independence from irrelevant alternatives (IIA).
The model was shown to be consistent with the induced preferences assuming a form of
random utility maximization framework whose inquiry was started by [4, 5]. Very early

on, simple statistical tests as well as simple estimation procedures were developed to fit
such a model to observed data [9]. Now the IIA property possessed by the MNL model
is not necessarily desirable as evidenced in many empirical scenarios. Despite such
structural limitations, the MNL model has been widely utilized across application areas
primarily due to the ability to learn the model parameters easily from observed data. For
example, see [12–14] for applications in transportation and [15, 16] for applications in
operations management and marketing.
With a view to addressing the structural limitations of the MNL model, a number of
generalizations to this model have been proposed over the years. Notable among these
are the so-called “nested” MNL model, as well as mixtures of MNL models (or MMNL
models). These generalizations avoid the IIA property and continue to be consistent with
the random utility maximization framework at the expense of increased model complex-
ity; see [13, 17–20] for example. The interested reader is also referred to an overview
article on this line of research [14]. While generalized models of this sort are in prin-
ciple attractive, their complexity makes them difficult to learn while avoiding the risk
of overfitting. More generally, specifying an appropriate parametric model is a difficult
task, and the risks associated with mis-specification are costly in practice. For an applied
view of these issues, see [10, 21, 22].
As an alternative to the MNL model (and its extensions), one might also consider
the parametric family of choice models induced by the exponential family of distri-
butions over permutations. These may be viewed as the models that have maximum
entropy among those models that satisfy the constraints imposed by the observed data.
The number of parameters in such a model is equal to the number of constraints in the
maximum-entropy optimization formulation, or equivalently the effective dimension of
the underlying data, see the Koopman–Pitman–Darmois theorem [23]. This scaling of
the number of parameters with the effective data dimension makes the exponential fam-
ily obtained via the maximum-entropy principle very attractive. Philosophically, this
approach imposes on the model only those constraints implied by the observed data.
On the flip side, learning the parameters of an exponential family model is a computa-
tionally challenging task (see [24–26]) as it requires computing a “partition function,”
possibly over a complex state space.
Very recently, Jagabathula and Shah [27, 28] introduced a non-parametric sparse
model. Here the distribution over permutations is assumed to have sparse (or small)
support. While this may not be exactly true, it can be an excellent approximation to the
reality and can provide computationally efficient ways to both infer the model [27, 28]
in a manner consistent with observations and utilize it for effective decision-making
[29, 30].

8.2 Setup

Given N objects or items denoted as [N] = {1, . . . , N}, we are interested in the distribution
over permutations of these N items. A permutation σ : [N] → [N] is a one-to-one and onto
mapping, with σ(i) denoting the position or ordering of element i ∈ [N].

Let S_N denote the space of N! permutations of these N items. The set of distributions over S_N is denoted as M(S_N) = {ν : S_N → [0, 1] : Σ_{σ∈S_N} ν(σ) = 1}. Given ν ∈ M(S_N), the first-order marginal information, M(ν) = [M_ij(ν)], is an N × N doubly stochastic matrix with non-negative entries defined as

    M_ij(ν) = Σ_{σ∈S_N} ν(σ) 1_{σ(i)=j},    (8.1)

where, for σ ∈ S N , σ(i) denotes the rank of item i under permutation σ, and 1{x} is the
standard indicator with 1{true} = 1 and 1{false} = 0. The comparison marginal information,
C(ν) = [Ci j (ν)], is an N × N matrix with non-negative entries defined as

    C_ij(ν) = Σ_{σ∈S_N} ν(σ) 1_{σ(i)>σ(j)}.    (8.2)

By definition, the diagonal entries of C(ν) are all 0s, and C_ij(ν) + C_ji(ν) = 1 for all 1 ≤ i ≠ j ≤ N. We shall abuse the notation by using M(σ) and C(σ) to denote the matrices
obtained by applying them to the distribution where the support of the distribution is
simply {σ}.
Throughout, we assume that there is a ground-truth model ν. We observe marginal
information M(ν) or C(ν), or their noisy versions.
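As a small illustration of these definitions, the following Python sketch (ours) computes M(ν) and C(ν) by direct enumeration for a toy distribution on S_4 and checks that M(ν) is doubly stochastic and that C_ij(ν) + C_ji(ν) = 1 for i ≠ j; positions are 0-indexed here, unlike the 1-to-N convention in the text.

import itertools
import numpy as np

N = 4
perms = list(itertools.permutations(range(N)))         # sigma[i] is the (0-indexed) position of item i
rng = np.random.default_rng(5)

# a toy distribution nu supported on three permutations
support = rng.choice(len(perms), size=3, replace=False)
nu = {perms[j]: p for j, p in zip(support, (0.5, 0.3, 0.2))}

M = np.zeros((N, N))
C = np.zeros((N, N))
for sigma, p in nu.items():
    for i in range(N):
        M[i, sigma[i]] += p                            # first-order marginal, cf. (8.1)
        for j in range(N):
            if i != j and sigma[i] > sigma[j]:
                C[i, j] += p                           # comparison marginal, cf. (8.2)

offdiag = ~np.eye(N, dtype=bool)
print(np.allclose(M.sum(axis=0), 1.0), np.allclose(M.sum(axis=1), 1.0),   # doubly stochastic
      np.allclose((C + C.T)[offdiag], 1.0))                                # C_ij + C_ji = 1 for i != j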

8.2.1 Questions of Interest


We are primarily interested in two questions: recovering the distribution and producing
the ranking, on the basis of the distribution.
question 1 (Recover the distribution.) The primary goal is to recover ν from obser-
vations. Precisely, we observe D = P(ν) + η, where P(ν) ∈ {M(ν), C(ν)} and there is
a potentially noisy perturbation η. The precise form of noisy perturbation that allows
recovery of the model is explained later in detail. Intuitively, the noisy perturbation may
represent the finite sample error introduced due to forming an empirical estimation of
P(ν) or an inability to observe data associated with certain components.
A generic ν has N! − 1 unknowns, while the dimension of D is at most N^2. Learning ν from D boils down to finding the solution to a set of linear equations where there are at most N^2 linear equations involving N! − 1 unknowns. This is a highly under-determined
system of equations and hence, without imposing structural conditions on ν, it is unlikely
that we will be able to recover ν faithfully. Therefore, the basic “information” question
would be to ask under what structural assumption on ν it is feasible to recover it from
P(ν) (i.e., when η = 0). The next question concerns the “robustness” of such a recov-
ery condition when we have non-trivial noise, η. And finally, we would like to answer
the “computational” question associated with it, which asks whether such recovery is
possible using computation that scales polynomially in N.
question 2 (Produce the ranking.) An important associated decision question is that of
finding the “ranking” or most “relevant” permutation for the underlying ν. To begin
with, what is the most “relevant” permutation assuming we know the ν perfectly? This,

in a sense, is ill-defined due to the impossibility result of Arrow [31]: there is no ranking
algorithm that works for all ν and satisfies certain basic hypotheses expected from any
ranking algorithm even when N = 3.
For this reason, like in the context of recovering ν, we will have to impose structure
on ν. In particular, the structure that we shall impose (e.g., the sparse model or the
multinomial logit model) seems to suggest a natural answer for ranking or the most
“relevant” permutation: find the σ that has maximal probability, i.e., find σ∗ (ν), where

    σ*(ν) ∈ arg max_{σ∈S_N} ν(σ).    (8.3)

Again, the goals would include the ability to recover σ∗ (ν) (exactly or approximately)
using observations (a) when η = 0 and (b) when there is a non-trivial η, and (c) the ability
to do this in a computationally efficient manner.

8.3 Models

We shall consider two types of model here: the non-parametric sparse model and the
parametric random utility model. As mentioned earlier, a large number of models have
been studied in the literature and are not discussed in detail here.

8.3.1 Sparse Model


The support of distribution ν, denoted as supp(ν) is defined as


supp(ν) = {σ ∈ S N : ν(σ) > 0}. (8.4)

The ℓ_0-norm of ν, denoted as ||ν||_0, is defined as

    ||ν||_0 = |supp(ν)|.    (8.5)

We say that ν has sparsity K if K = ||ν||_0. Naturally, by varying K, all possible ν ∈


M(S N ) can be captured. In that sense, this is a non-parametric model. This model was
introduced in [27, 28].
The goal would be to learn the sparsest possible ν that is consistent with observations.
Formally, this corresponds to solving

    minimize ||μ||_0 over μ ∈ M(S_N)
    such that P(μ) ≈ D,    (8.6)

where P(μ) ∈ {M(μ), C(μ)} depending upon the type of information considered.
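For very small N, the program (8.6) can be attacked by brute force: enumerate candidate supports in increasing order of size and check, via non-negative least squares, whether some probability vector on that support reproduces the observed marginals exactly. The sketch below (ours; it uses the equality version P(μ) = D and first-order marginals on S_3) illustrates the idea, although this enumeration clearly does not scale.

import itertools
import numpy as np
from scipy.optimize import nnls

N = 3
perms = list(itertools.permutations(range(N)))

def first_order_marginal(sigma):
    M = np.zeros((N, N))
    for i in range(N):
        M[i, sigma[i]] = 1.0
    return M.ravel()

A = np.array([first_order_marginal(s) for s in perms]).T    # maps nu (length N!) to M(nu) (length N^2)

nu_true = np.zeros(len(perms))
nu_true[[0, 4]] = [0.6, 0.4]                                 # a 2-sparse ground truth
D = A @ nu_true

for K in range(1, len(perms) + 1):                           # smallest support first, as in (8.6)
    found = None
    for support in itertools.combinations(range(len(perms)), K):
        A_s = np.vstack([A[:, list(support)], np.ones((1, K))])   # append the sum-to-one constraint
        b = np.concatenate([D, [1.0]])
        w, resid = nnls(A_s, b)
        if resid < 1e-9:
            found = dict(zip(support, np.round(w, 6)))
            break
    if found is not None:
        print(K, found)                                       # typically prints: 2 {0: 0.6, 4: 0.4}
        break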

8.3.2 Random Utility Model (RUM)


We consider the random utility model (RUM) that in effect was considered in the “law
of comparative judgement” by Thurstone [3]. Formally, each option i ∈ [N] has a deter-
ministic utility ui associated with it. The random utility Yi associated with option i ∈ [N]
obeys the form

Yi = ui + εi , (8.7)

where εi are independent random variables across all i ∈ [N] – they represent “random
perturbation” of the “inherent utility” ui . We assume that all of the εi have identical mean
across all i ∈ [N], but can have varying distribution. The specific form of the distribution
gives rise to different types of models. We shall describe a few popular examples of this
in what follows. Before we do that, we explain how this setup gives rise to the distribu-
tion over permutations by describing the generative form of the distribution. Specifically,
to generate a random permutation over the N options, we first sample random variable
Yi , i ∈ [N], independently. Then we sort Y1 , . . . , YN in decreasing order,1 and this sorted
order of indices [N] provides the permutation. Now we describe two popular examples
of this model.
Probit model. Let εi have a Gaussian distribution with mean 0 and variance σ2i for i ∈
[N]. Then, the resulting model is known as the Probit model. In the homogeneous setting,
we shall assume that σ2i = σ2 for all i ∈ [N].
Multinomial logit (MNL) model. Let εi have a Gumbel distribution with mode μi and
scaling parameter βi > 0, i.e., the PDF of εi is given by
    f(x) = (1/β_i) exp(−(z + exp(−z))),    where z = (x − μ_i)/β_i,    for x ∈ R.    (8.8)
In the homogeneous setting, μi = μ and βi = β for all i ∈ [N]. In this scenario, the resulting
distribution over permutations turns out to be equivalent to the following generative
model.
Let wi > 0 be a parameter associated with i ∈ [N]. Then the probability of permutation
σ ∈ S N is given by (for example, see [32])

    P(σ) = Π_{j=1}^{N}  w_{σ^{-1}(j)} / ( w_{σ^{-1}(j)} + w_{σ^{-1}(j+1)} + · · · + w_{σ^{-1}(N)} ).    (8.9)

Above, σ^{-1}(j) = i iff σ(i) = j. Specifically, for i ≠ j ∈ [N],


    P(σ(i) > σ(j)) = w_i / (w_i + w_j).    (8.10)
We provide a simple explanation of the above, seemingly mysterious, relationship
between two very different descriptions of the MNL model.

1 We shall assume that the distribution of εi , i ∈ [N], has a density and hence ties never happen between
Yi , Y j for any i ≠ j ∈ [N].

lemma 8.1 Let εi , ε j be independent random variables with Gumbel distributions with
mode μi , μ j , respectively, with scaling parameters βi = β j = β > 0. Then, Δi j = εi − ε j has
a logistic distribution with parameters μi − μ j (location) and β (scale).
The proof of Lemma 8.1 follows by, for example, using the characteristic function
associated with the Gumbel distribution along with the property of the gamma function
(Γ(1 + z)Γ(1 − z) = zπ/ sin(πz)) and then identifying the characteristic function of the
logistic distribution.
Returning to our model, when we compare the random utilities associated with
options i and j, Yi and Y j , respectively, we assume the corresponding random pertur-
bation to be homogeneous, i.e., μi = μ j = μ and βi = β j = β > 0. Therefore, Lemma 8.1
suggests that
   
    P(Y_i > Y_j) = P(ε_i − ε_j > u_j − u_i)
                 = P(Logistic(0, β) > u_j − u_i)
                 = 1 − P(Logistic(0, β) < u_j − u_i)
                 = 1 − 1 / (1 + exp(−(u_j − u_i)/β))
                 = exp(u_i/β) / ( exp(u_i/β) + exp(u_j/β) )
                 = w_i / (w_i + w_j),    (8.11)

where w_i = exp(u_i/β) and w_j = exp(u_j/β). It is worth remarking that the pair-wise comparison property (8.10) relates the MNL model to the Bradley–Terry model.
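The equivalence between the random-utility description (Gumbel perturbations) and the sequential description (8.9) is easy to check by simulation. The sketch below (ours; the weights are arbitrary) samples orderings both ways and compares the empirical probability that item 0 is preferred to item 1 with w_0/(w_0 + w_1).

import numpy as np

rng = np.random.default_rng(6)
N, T = 4, 50000
w = np.array([3.0, 2.0, 1.0, 0.5])
u = np.log(w)                                    # utilities with beta = 1, so w_i = exp(u_i)

def frac_0_before_1(orders):
    # fraction of sampled orderings (most to least preferred) in which item 0 precedes item 1
    count = 0
    for order in orders:
        order = list(order)
        count += order.index(0) < order.index(1)
    return count / len(orders)

# (i) random-utility description: perturb u with Gumbel noise and sort in decreasing order
gumbel_orders = np.argsort(-(u + rng.gumbel(size=(T, N))), axis=1)

# (ii) sequential description behind (8.9): repeatedly pick the next item with probability proportional to its weight
pl_orders = []
for _ in range(T):
    remaining, order = list(range(N)), []
    while remaining:
        probs = w[remaining] / w[remaining].sum()
        order.append(remaining.pop(rng.choice(len(remaining), p=probs)))
    pl_orders.append(order)

print(frac_0_before_1(gumbel_orders), frac_0_before_1(pl_orders), w[0] / (w[0] + w[1]))   # all close to 0.6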
Learning the model and ranking. For the random utility model, the question of learn-
ing the model from data effectively boils down to learning the model parameters from
observations. In the context of a homogeneous model, i.e., εi in (8.7), for which we have
an identical distribution across all i ∈ [N], the primary interest is in learning the inherent
utility parameters ui , for i ∈ [N]. The question of recovering the ranking, on the other
hand, is about recovering σ ∈ S N , which is the sorted (decreasing) order of the inherent
utilities ui , i ∈ [N]: for example, if u1 ≥ u2 ≥ · · · ≥ uN , then the ranking is the identity
permutation.

8.4 Sparse Model

In this section, we describe the conditions under which we can learn the underlying
sparse distribution using first-order marginal and comparison marginal information. We
divide the presentation into two parts: first, we consider access to exact or noise-less
marginals for exact recovery; and, then, we discuss its robustness.
We can recover a ranking in terms of the most likely permutation once we have
recovered the sparse model by simply sorting the likelihoods of the permutations in the
support of the distribution, which requires time O(K log K), where K is the sparsity of

the model. Therefore, the key question in the context of the sparse model is the recovery
of the distribution, on which we shall focus in the remainder of this section.

8.4.1 Exact Marginals: Infinite Samples


We are interested in understanding when it is feasible to recover the underlying distribu-
tion ν given access to its marginal information M(ν) or C(ν). As mentioned earlier, one
way to recover such a distribution using exact marginal information is to solve (8.6)
with the equality constraint of P(ν) = D, where P(ν) ∈ {M(ν), C(ν)} depending upon the
type of marginal information.
We can view the unknown ν as a high-dimensional vector in R^{N!} which is sparse. That
is, ‖ν‖_0 ≪ N!. The observations are marginals of ν, either first-order marginals M(ν) or
comparison marginals C(ν). They can be viewed as linear projections of the ν vector,
of dimension N² or N(N − 1). Therefore, recovering the sparse model from marginal
information boils down to recovering a sparse vector in high-dimensional space (here
N!-dimensional) from a small number of linear measurements of the sparse vector. That
is, we wish to recover x ∈ R^n from the observation y = Ax, where y ∈ R^m, A ∈ R^{m×n} with
m ≪ n. In the best case, one can hope to recover x uniquely as long as m ∼ ‖x‖_0.
This question has been well studied in the context of sparse model learning from
linear measurements of the signal in signal processing, and it has been popularized under
the umbrella term of compressed sensing; see, for example, [33–37]. It has been argued
that such recovery is possible as long as A satisfies certain conditions, for example the
restricted isometry property (RIP) (see [34, 38]) with m ∼ K log(n/K), in which case
the ℓ0 optimization problem,

min_z ‖z‖_0 subject to y = Az,

recovers the true signal x as long as the sparsity of x, ‖x‖_0, is at most K. The remarkable
fact about an RIP-like condition is that not only does the ℓ0 optimization recover the sparse
signal, but the recovery can also be carried out by a computationally efficient procedure, a linear
program.
Impossibility of recovering the distribution even for N = 4. Returning to our setting,
n = N!, m = N² and we wish to understand up to what level of sparsity of ν we can
recover it. The key difference here is in the fact that A is not designed but given.
Therefore, the question is whether A has a nice property such as RIP that can allow
sparse recovery. To that end, consider the following simple counterexample that shows
that it is impossible to recover a sparse model uniquely even with support size 3 using
the ℓ0 optimization [27, 28].

Example 8.1 (Impossibility) For N = 4, consider the four permutations σ1 = [1 → 2, 2 → 1, 3 → 3, 4 → 4], σ2 = [1 → 1, 2 → 2, 3 → 4, 4 → 3], σ3 = [1 → 2, 2 → 1, 3 → 4, 4 → 3],
and σ4 = id = [1 → 1, 2 → 2, 3 → 3, 4 → 4], i.e., the identity permutation. It is easy to
check that

M(σ1) + M(σ2) = M(σ3) + M(σ4).

Now suppose that ν(σi ) = pi , where pi ∈ [0, 1] for 1 ≤ i ≤ 3, and ν(σ) = 0 for all other
σ ∈ S N . Without loss of generality, let p1 ≤ p2 . Then
p1 M(σ1 ) + p2 M(σ2 ) + p3 M(σ3 ) = (p2 − p1 )M(σ2 ) + (p3 + p1 )M(σ3 ) + p1 M(σ4 ).
Here, note that {M(σ1 ), M(σ2 ), M(σ3 )} are linearly independent, yet the sparsest
solution is not unique. Therefore, it is not feasible to recover the sparse model uniquely.

Note that the above example can be extended to any N ≥ 4 by simply letting all elements
larger than 4 be fixed by the permutations in the above example. Therefore, for any N,
distributions with support size 3 cannot always be recovered uniquely.
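The identity M(σ1) + M(σ2) = M(σ3) + M(σ4) of Example 8.1 can be verified numerically with a few lines of Python (a standalone illustration; the permutations are written 0-indexed):

import numpy as np

def perm_matrix(sigma):
    # M(sigma)[i, j] = 1 iff item i is placed in position j (0-indexed).
    N = len(sigma)
    M = np.zeros((N, N), dtype=int)
    M[np.arange(N), sigma] = 1
    return M

s1, s2, s3, s4 = [1, 0, 2, 3], [0, 1, 3, 2], [1, 0, 3, 2], [0, 1, 2, 3]
lhs = perm_matrix(s1) + perm_matrix(s2)
rhs = perm_matrix(s3) + perm_matrix(s4)
print(np.array_equal(lhs, rhs))   # True: the first-order marginals coincide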
Signature condition for recovery. Example 8.1 suggests that it is not feasible to expect an
RIP-like condition for the “projection matrix” corresponding to the first-order marginals
or comparison marginals so that any sparse probability distribution can be recovered.
The next best thing we can hope for is the ability to recover almost all of the sparse
probability distributions. This leads us to the signature condition of the matrix for a
given sparse vector, which, as we shall see, allows recovery of the particular sparse
vector [27, 28].
condition 8.1 (Signature condition) A given matrix A ∈ R^{m×n} is said to satisfy the sig-
nature condition with respect to an index set S ⊂ {1, . . . , n} if, for each i ∈ S, there exists
j(i) ∈ [m] such that A_{j(i)i} ≠ 0 and A_{j(i)i'} = 0 for all i' ∈ S with i' ≠ i.
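A direct translation of Condition 8.1 into code might look as follows (an illustrative sketch; the function name and interface are my own):

import numpy as np

def satisfies_signature(A, S):
    # Condition 8.1: for each i in S there must be a row j(i) with
    # A[j(i), i] != 0 while A[j(i), i'] == 0 for every other i' in S.
    S = list(S)
    for i in S:
        others = [k for k in S if k != i]
        rows = (A[:, i] != 0)
        if others:
            rows = rows & np.all(A[:, others] == 0, axis=1)
        if not rows.any():
            return False
    return True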
The signature condition allows recovery of a sparse vector using a simple “peeling”
algorithm. We summarize the recovery result, followed by the algorithm that implies
the result.
theorem 8.1 Let A ∈ {0, 1}^{m×n} with all of its columns distinct. Let x ∈ R^n_{≥0} be such
that A satisfies the signature condition with respect to the set supp(x). Let the non-zero
components of x, i.e., {xi : i ∈ supp(x)}, be such that, for any two distinct S1, S2 ⊂ supp(x),
Σ_{i∈S1} xi ≠ Σ_{i'∈S2} xi'. Then x can be recovered from y, where y = Ax.

Proof To establish Theorem 8.1, we shall describe the algorithm that recovers x under
the conditions of the theorem and simultaneously argue for its correctness. To that end,
the algorithm starts by sorting the components of y. Since A ∈ {0, 1}^{m×n}, for each j ∈ [m],
yj = Σ_{i∈Sj} xi, with Sj ⊆ supp(x). Owing to the signature condition, for each i ∈ supp(x),
there exists j(i) ∈ [m] such that y j(i) = xi . If we can identify j(i) for each i, we recover the
values xi , but not necessarily the position i. To identify the position, we identify the ith
column of matrix A, and, since the columns of matrix A are all distinct, this will help us
identify the position. This will require use of the property that, for any S1 ≠ S2 ⊂ supp(x),
Σ_{i∈S1} xi ≠ Σ_{i'∈S2} xi'. This implies, to begin with, that all of the non-zero elements of x
are distinct. Without loss of generality, let the non-zero elements of x be x1 , . . . , xK , with
K = |supp(x)| such that 0 < x1 < · · · < xK .
Now consider the smallest non-zero element of y. Let it be y j1 . From the property of
x, it follows that y j1 must be the smallest non-zero element of x, x1 . The second smallest
component of y that is distinct from x1 , let it be y j2 , must be x2 . The third distinct
smallest component, y j3 , however, could be x1 + x2 or x3 . Since we know x1 , x2 , and

the property of x, we can identify whether y_{j3} is x3 or not. Iteratively, we consider the
kth distinct smallest value of y, say y_{jk}. Then, it equals either the sum of the subset
of already identified components of x or the next smallest unidentified component of x
due to the signature property and non-negativity of x. In summary, by the time we have
gone through all of the non-zero components of y in the increasing order as described
above, we will recover all the non-zero elements of x in increasing order as well as the
corresponding columns of A. This is because iteratively we identify for each yj the set
Sj ⊂ supp(x) such that yj = Σ_{i∈Sj} xi. That is, A_{ji} = 1 for all i ∈ Sj and 0 otherwise. This
completes the proof.

Now we remark on the computational complexity of the “peeling” algorithm described
above. It runs for at most m iterations. In each iteration, it tries to effectively solve a
subset sum problem whose computation cost is at most O(2^K), where K = ‖x‖_0 is the
sparsity of x. Then the additional step for sorting the components of y costs O(m log m).
In summary, the computation cost is O(2^K + m log m). Notice that, somewhat surpris-
ingly, this does not depend on n at all. In contrast, for the linear programming-based
approach used for sparse signal recovery in the context of compressed sensing litera-
ture, the computational complexity scales at least linearly in n, the ambient dimension
of the signal. For example, if this were applicable to our setting, it would scale as N!,
which is simply prohibitive.
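For concreteness, here is one possible Python sketch of the “peeling” recovery under the assumptions of Theorem 8.1 (binary A with distinct columns, signature condition, all subset sums of the non-zero entries distinct); the helper names and the numerical tolerance are my own choices, and the sketch is an illustration rather than an optimized implementation:

import itertools
import numpy as np

def _matching_subset(target, vals, tol):
    # Return indices of the (unique, by assumption) subset of vals summing to target, else None.
    for r in range(len(vals) + 1):
        for c in itertools.combinations(range(len(vals)), r):
            if abs(target - sum(vals[k] for k in c)) <= tol:
                return c
    return None

def peel(y, A, tol=1e-9):
    # Identify the distinct non-zero values of x by scanning y in increasing order.
    vals = []
    for v in sorted(set(float(t) for t in y)):
        if v <= tol:
            continue
        if _matching_subset(v, vals, tol) is None:
            vals.append(v)            # a new non-zero component of x (argument of Theorem 8.1)
    m, K = len(y), len(vals)
    # For each measurement, record which recovered values contribute to it.
    membership = np.zeros((m, K))
    for j in range(m):
        c = _matching_subset(float(y[j]), vals, tol)
        if c is not None:
            membership[j, list(c)] = 1
    # Match the inferred column patterns against the columns of A to place the values.
    x_hat = np.zeros(A.shape[1])
    for k in range(K):
        col = int(np.argmax(np.all(A == membership[:, [k]], axis=0)))
        x_hat[col] = vals[k]
    return x_hat

The subset-sum search makes the cost O(2^K) per measurement, mirroring the complexity discussion above.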
Recovering the distribution using first-order marginals via the signature condition. We
shall utilize the signature condition, Condition 8.1, in the context of recovering the
distribution over permutations from its first-order marginals. Again, given the counterex-
ample, Example 8.1, it is not feasible to recover all sparse models even with sparsity 3
from first-order marginals uniquely. However, with the aid of the signature condition, we
will argue that it is feasible to recover most sparse models with reasonably large sparsity.
To that end, let Af ∈ {0, 1}^{N²×N!} denote the first-order marginal matrix that maps the
N!-dimensional vector corresponding to the distribution over permutations to an N²-
dimensional vector corresponding to the first-order marginals of the distribution. We
state the signature property of Af next.
lemma 8.2 Let S be a randomly chosen subset of {1, . . . , N!} of size K. Then the
first-order marginal matrix Af satisfies the signature condition with respect to S with
probability 1 − o(1) as long as K ≤ (1 − ε)N log N for any ε > 0.
The proof of the above lemma can be found in [27, 28]. Lemma 8.2 and Theorem 8.1
immediately imply the following result.
theorem 8.2 Let S ⊂ S N be a randomly chosen subset of S N of size K, denoted as S =
{σ1 , . . . , σK }. Let p1 , . . . , pK be chosen from a joint distribution with a continuous density
over subspace of [0, 1]K corresponding to p1 + · · · + pK = 1. Let ν be the distribution over
S_N such that

ν(σ) = pk if σ = σk for some k ∈ [K], and ν(σ) = 0 otherwise.        (8.12)

Then ν can be recovered from its first-order marginal distribution with probability
1 − o(1) as long as K ≤ (1 − ε)N log N for a fixed ε > 0.
The proof of Theorem 8.2 can be found in [27, 28]. In a nutshell, it states that most
sparse distributions over permutations with sparsity up to N log N can be recovered from
their first-order marginals. This is in sharp contrast with the counterexample, Example 8.1,
which shows that, for any N, some distributions with sparsity 3 cannot be recovered
uniquely.
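The flavour of Lemma 8.2 and Theorem 8.2 can be explored empirically for a small N by building the first-order marginal matrix Af explicitly and testing the signature condition on random supports. The following sketch is one way to do so; N, the sparsity K, and the number of trials are assumed, illustrative values:

import itertools
import numpy as np

def first_order_matrix(N):
    # Columns indexed by permutations of [N]; each column is the flattened
    # N x N permutation matrix with entry (i, position of i) equal to 1.
    perms = list(itertools.permutations(range(N)))
    A = np.zeros((N * N, len(perms)), dtype=int)
    for c, sigma in enumerate(perms):
        for i, pos in enumerate(sigma):
            A[i * N + pos, c] = 1
    return A

def has_signature(A, S):
    # Condition 8.1 restricted to the columns indexed by S.
    return all(
        np.any((A[:, i] != 0) &
               np.all(A[:, [k for k in S if k != i]] == 0, axis=1))
        for i in S
    )

rng = np.random.default_rng(0)
N = 5
A = first_order_matrix(N)
K = 6          # a sparsity level below N log N for N = 5
trials = 200
hits = sum(has_signature(A, list(rng.choice(A.shape[1], size=K, replace=False)))
           for _ in range(trials))
print(f"signature condition held for {hits}/{trials} random supports of size {K}")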
Recovering the distribution using comparison marginals via the signature condition.
Next, we utilize the signature condition, Condition 8.1, in the context of recovering the
distribution over permutations from its comparison marginals. Let Ac ∈ {0, 1}^{N(N−1)×N!}
denote the comparison marginal matrix that maps the N!-dimensional vector corre-
sponding to the distribution over permutations to an N(N − 1)-dimensional vector correspond-
ing to the comparison marginals of the distribution. We state the signature property of
Ac next.
lemma 8.3 Let S be a randomly chosen subset of {1, . . . , N!} of size K. Then the com-
parison marginal matrix Ac satisfies the signature condition with respect to S with
probability 1 − o(1) as long as K = o(log N).
The proof of the above lemma can be found in [29, 30]. Lemma 8.3 and Theorem 8.1
immediately imply the following result.
theorem 8.3 Let S ⊂ S N be a randomly chosen subset of S N of size K, denoted as S =
{σ1 , . . . , σK }. Let p1 , . . . , pK be chosen from a joint distribution with a continuous density
over the subspace of [0, 1]K corresponding to p1 + · · · + pK = 1. Let ν be a distribution
over S_N such that

ν(σ) = pk if σ = σk for some k ∈ [K], and ν(σ) = 0 otherwise.        (8.13)
Then ν can be recovered from its comparison marginal distribution with probability
1 − o(1) as long as K = o(log N).
The proof of Theorem 8.3 can be found in [29, 30]. It shows that it is feasible to recover
a sparse model whose support size grows with N, as long as it is o(log N). However,
this is exponentially smaller than the support size recoverable from the first-order marginals.
This seems to be related to the fact that the first-order marginal
is relatively information-rich compared with the comparison marginal.

8.4.2 Noisy Marginals: Finite Samples


Thus far, we have considered a setup where we had access to exact marginal distribution
information. Instead, suppose we have access to marginal distributions formed on the
basis of an empirical distribution of finite samples from the underlying distribution.
This can be viewed as access to a “noisy” marginal distribution. Specifically, given a
distribution ν, we observe D = P(ν) + η, where P(ν) ∈ {M(ν), C(ν)} depending upon the

type of marginal information, with η being noise such that some norm of η, e.g., ‖η‖2
or ‖η‖∞, is bounded above by δ, with δ > 0 being small if we have access to enough
samples. Here δ represents the error observed due to access to finitely many samples
and is assumed to be known.
For example, if we have access to n independent samples for each marginal entry (e.g.,
i ranked in position j for a first-order marginal or i found to be better than j for a com-
parison marginal) according to ν, and we create an empirical estimation of each entry
in M(ν) or C(ν), then, using the Chernoff bound for the binomial distribution and the
union bound over a collection of events, it can be argued that ‖η‖∞ ≤ δ with probability
1 − δ as long as n ∼ (1/δ²) log(4N/δ) for each entry. Using more sophisticated methods
from the matrix-estimation literature, it is feasible to obtain better estimations of M(ν)
or C(ν) from fewer samples of entries and even when some of the entries are entirely
unobserved as long as M(ν) or C(ν) has structure. This is beyond the scope of this expo-
sition; however, we refer the interested reader to [39–42] as well as the discussion in
Section 8.6.4.
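As a simple illustration of forming such noisy marginals, the sketch below builds an empirical comparison matrix from sampled permutations; the convention assumed here (not stated in the original) is that position 1 is the top, so “i preferred to j” corresponds to i receiving a smaller position number:

import numpy as np

def empirical_comparison_marginals(position_samples):
    # position_samples[s, i] = position of item i in the s-th sampled permutation.
    S, N = position_samples.shape
    C_hat = np.zeros((N, N))
    for s in range(S):
        pos = position_samples[s]
        for i in range(N):
            for j in range(N):
                if i != j and pos[i] < pos[j]:    # i ranked above j in this sample
                    C_hat[i, j] += 1.0
    return C_hat / S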
Given this, the goal is to recover a sparse distribution whose marginals are close to the
observations. More precisely, we wish to find a distribution ν such that ‖P(ν) − D‖2 ≤ f(δ)
and ‖ν‖0 is small. Here, ideally we would like f(δ) = δ but we may settle for any f such
that f (δ) → 0 as δ → 0.
Following the line of reasoning in Section 8.4.1, we shall assume that there is a sparse
model νs with respect to which the marginal matrix satisfies the signature condition,
Condition 8.1, and ‖P(νs) − D‖2 ≤ δ. The goal would be to produce an estimate ν̂ so that
‖ν̂‖0 = ‖νs‖0 and ‖P(ν̂) − D‖2 ≤ f(δ).
This is the exact analog of the robust recovery of a sparse signal in the context of
compressed sensing where the RIP-like condition allowed recovery of a sparse approx-
imation to the original signal from linear projections through linear optimization. The
computational complexity of such an algorithm scales at least linearly in n, the ambient
dimension of the signal. As discussed earlier, in our context this would lead to the com-
putation cost scaling as N!, which is prohibitive. The exact recovery algorithm discussed
in Section 8.4.1 has computation cost O(2^K + N² log N) in the context of recovering a
sparse model satisfying the signature condition. The brute-force search for the sparse
model will lead to cost at least (N! choose K) ≈ (N!)^K or exp(Θ(K N log N)) for K ≪ N!. The ques-
tion is whether it is possible to get rid of the dependence on N!, and ideally to achieve a
scaling of O(2^K + N² log N), as in the case of exact model recovery.
In what follows, we describe the conditions on noise under which the algorithm
described in Section 8.4.1 is robust. This requires an assumption that the underlying
ground-truth distribution is sparse and satisfies the signature condition. This recovery
result requires noise to be small. How to achieve such a recovery in a higher-noise regime
remains broadly unknown; initial progress toward it has been made in [43].
Robust recovery under the signature condition: low-noise regime. Recall that the “peel-
ing” algorithm recovers the sparse model when the signature condition is satisfied using
exact marginals. Here, we discuss the robustness of the “peeling” algorithm under noise.
Specifically, we argue that the “peeling” algorithm as described is robust as long as the
noise is “low.” We formalize this in the statement below.

theorem 8.4 Let A ∈ {0, 1}^{m×n}, with all of its columns distinct. Let x ∈ R^n_{≥0} be such
that A satisfies the signature condition with respect to the set supp(x). Let the non-zero
components of x, i.e., {xi : i ∈ supp(x)}, be such that, for any S1 ≠ S2 ⊂ supp(x),

| Σ_{i∈S1} xi − Σ_{i'∈S2} xi' | > 2δK,        (8.14)

for some δ > 0. Then, given y = Ax + η with ‖η‖∞ < δ, it is feasible to find x̂ so that
‖x̂ − x‖∞ ≤ δ.
Proof To establish Theorem 8.4, we shall utilize effectively the same algorithm as in the
proof of Theorem 8.1. However, we will have to deal with the “error” in the measurement
y delicately.
To begin with, according to arguments in the proof of Theorem 8.1, it follows that all
non-zero elements of x are distinct. Without loss of generality, let the non-zero elements
of x be x1 , . . . , xK , with K = |supp(x)| ≥ 2 such that 0 < x1 < · · · < xK ; xi = 0 for K + 1 ≤
i ≤ n. From (8.14), it follows that xi+1 ≥ xi + 4δ for 1 ≤ i < K and x1 ≥ 2δ. Therefore,

xk ≥ (k − 1)4δ + 2δ, (8.15)


for 1 ≤ k ≤ K. Next, we shall argue that, inductively, it is feasible to find x̂i, 1 ≤ i ≤ n, so
that |x̂i − xi| ≤ δ for 1 ≤ i ≤ K and x̂i = 0 for K + 1 ≤ i ≤ n.

Now, since A ∈ {0, 1}^{m×n}, for each j ∈ [m], yj = Σ_{i∈Sj} xi + ηj, with Sj ⊆ supp(x) and
|ηj| ≤ δ. From (8.14), it follows that x1 > 2δ. Therefore, if Sj ≠ ∅ then yj > δ. That is, we
will start by restricting the discussion to the indices J1 = { j ∈ [m] : yj > δ}.
Let j1 be an index in J1 such that y_{j1} ∈ arg min_{j∈J1} yj. We set x̂1 = y_{j1} and Â_{j1,1} = 1.
To justify this, we next argue that y_{j1} = x1 + η_{j1} and hence |x̂1 − x1| < δ. By virtue of the
signature condition, for each i ∈ supp(x), there exists j(i) ∈ [m] such that |y j(i) − xi | ≤ δ
and hence j(i) ∈ J 1 , since y j(i) ≥ xi − δ > δ. Let J(1) = { j ∈ J 1 : y j = x1 + η j }. Clearly,
J(1) ≠ ∅. Effectively, we want to argue that j1 ∈ J(1). To that end, suppose that this does
not hold. Then there exists S ⊂ supp(x), such that S ≠ ∅, S ≠ {1}, and y_{j1} = Σ_{i∈S} xi + η_{j1}.
Then

y_{j1} > Σ_{i∈S} xi − δ,   since |η_{j1}| < δ,
      > x1 + 2Kδ − δ,      by (8.14),
      ≥ x1 + δ,            since K ≥ 1,        (8.16)
      ≥ yj,        (8.17)

for any j ∈ J(1) ⊂ J1. But this is a contradiction, since y_{j1} ≤ yj for all j ∈ J1. That is,
S = {1} or y_{j1} = x1 + η_{j1}. Thus, we have found x̂1 = y_{j1} such that |x̂1 − x1| < δ.

Now, for any j ∈ J1 with yj = Σ_{i∈S} xi + ηj with S ∩ {2, . . . , K} ≠ ∅, we have, with the
notation x(S) = Σ_{i∈S} xi,

|x̂1 − yj| = |x1 − yj + x̂1 − x1|
          = |x1 − x(S) − ηj + x̂1 − x1|
          ≥ |x1 − x(S)| − |ηj| − |x̂1 − x1|
          > 2Kδ − δ − δ
          ≥ 2δ.        (8.18)
Furthermore, if S = {1}, then |x̂1 − yj| < 2δ. Therefore, we set
Â_{j,1} = 1 if |x̂1 − yj| < 2δ, for j ∈ J1,
and
J2 ← J1 \ { j ∈ J1 : |yj − x̂1| < 2δ}.
Clearly,
j ∈ J2 ⇔ j ∈ [m], yj = x(S) + ηj, such that S ∩ {2, . . . , K} ≠ ∅.
Now suppose, inductively, that we have found x̂1, . . . , x̂k, 1 ≤ k < K, so that |x̂i − xi| < δ
for 1 ≤ i ≤ k and
j ∈ J_{k+1} ⇔ j ∈ [m], yj = x(S) + ηj, such that S ∩ {k + 1, . . . , K} ≠ ∅.
To establish the inductive step, we suggest that one should set j_{k+1} ∈ arg min_{j∈J_{k+1}} {yj},
x̂_{k+1} = y_{j_{k+1}} and
J_{k+2} ← J_{k+1} \ { j ∈ J_{k+1} : |yj − x̂1| < (k + 1)δ}.
We shall argue that |x̂_{k+1} − x_{k+1}| < δ by showing that y_{j_{k+1}} = x_{k+1} + η_{j_{k+1}} and establishing

j ∈ J_{k+2} ⇔ j ∈ [m], yj = x(S) + ηj, such that S ∩ {k + 2, . . . , K} ≠ ∅.


To that end, let y_{j_{k+1}} = x(S) + η_{j_{k+1}}. By the inductive hypothesis, S ⊂ supp(x) and S ∩
{k + 1, . . . , K} ≠ ∅. Suppose S ≠ {k + 1}. Then

y_{j_{k+1}} > x(S) − δ
            = (x(S) − x_{k+1}) + x_{k+1} − δ
            ≥ 2δ + x_{k+1} − δ
            = δ + x_{k+1}
            > yj,

for any j ∈ J(k + 1) ≡ { j ∈ J_{k+1} : yj = x_{k+1} + ηj}. In the above, we have used the fact that,
since S ∩ {k + 1, . . . , K} ≠ ∅ and S ≠ {k + 1}, it must hold that x(S) ≥ min{x1 + x_{k+1}, x_{k+2}}.
In either case, using (8.15), it follows that x(S) − x_{k+1} ≥ 2δ. We note that, due to the
signature condition and the inductive hypothesis about J_{k+1}, it follows that J(k + 1) ≠ ∅.
But y_{j_{k+1}} is the minimal value of yj for j ∈ J_{k+1}. This is a contradiction. Therefore,
S = {k + 1}. That is, x̂_{k+1} = y_{j_{k+1}} satisfies |x̂_{k+1} − x_{k+1}| < δ.

Now, consider any set S' ⊂ {1, . . . , k + 1} and any j ∈ J_{k+2} such that yj = Σ_{i∈S} xi + ηj
with S ∩ {k + 2, . . . , K} ≠ ∅. Using the notation x̂(S') = Σ_{i∈S'} x̂i, we have

|x̂(S') − yj| = |x(S') − yj + x̂(S') − x(S')|
             = |x(S') − x(S) − ηj + x̂(S') − x(S')|
             ≥ |x(S') − x(S)| − |ηj| − |x̂(S') − x(S')|
             > 2Kδ − (1 + |S'|)δ
             ≥ (|S'| + 1)δ,        (8.19)

where we used the fact that |S'| + 1 ≤ K. Therefore, if we set

J_{k+2} ← J_{k+1} \ { j ∈ J_{k+1} : |yj − x̂(S')| ≤ (|S'| + 1)δ, for some S' ⊂ {1, . . . , k + 1}},

it follows that

j ∈ J_{k+2} ⇔ j ∈ [m], yj = x(S) + ηj, such that S ∩ {k + 2, . . . , K} ≠ ∅.

This completes the induction step. It also establishes the desired result that we can
recover x̂ such that ‖x̂ − x‖∞ ≤ δ.
Naturally, as before, Theorem 8.4 implies robust versions of Theorems 8.2 and 8.3.
In particular, if we are forming an empirical estimation of M(ν) or C(ν) that is based on
independently drawn samples, then simple application of the Chernoff bound along with
a union bound will imply that it may be sufficient to have samples that scale as δ⁻² log N
in order to have estimates M̂(ν) or Ĉ(ν) so that ‖M̂(ν) − M(ν)‖∞ < δ or ‖Ĉ(ν) − C(ν)‖∞ < δ with high
probability (i.e., 1 − o_N(1)). Then, as long as ν satisfies condition (8.14) in addition to the
signature condition, Theorem 8.4 guarantees approximate recovery as discussed above.
Robust recovery under the signature condition: high-noise regime. Theorem 8.4
provides conditions under which the “peeling” algorithm manages to recover the
distribution as long as the elements in the support are far enough apart. Putting it
another way, for a given x, the error tolerance needs to be small enough compared with
the gap that is implicitly defined by (8.14) for recovery to be feasible.
Here, we make an attempt to go beyond such restrictions. In particular, supposing
we can view observations as a noisy version of a model that satisfies the signature
condition, then we will be content if we can recover any model that satisfies the
signature condition and is consistent with observations (up to noise). For this, we shall
assume knowledge of the sparsity K.
Now, we need to learn supp(x), i.e., positions of x that are non-zero and the non-zero
values at those positions. The determination of supp(x) corresponds to selecting the
columns of A. Now, if A satisfies the signature condition with respect to supp(x), then
we can simply choose the entries in the positions of y corresponding to the signature
component. If the choice of supp(x) is correct, then this will provide an estimate x̂
so that ‖x̂ − x‖2 ≤ δ. In general, if we assume that there exists x such that A satisfies
the signature condition with respect to supp(x) with K = ‖x‖0 and ‖y − Ax‖2 ≤ δ, then
a suitable approach would be to find x̂ such that ‖x̂‖0 = K, A satisfies the signature
condition with respect to supp(x̂), and x̂ minimizes ‖y − Ax̂‖2.
In summary, we are solving a combinatorial optimization problem over the space
of columns of A that collectively satisfy the signature condition. Formally, the space
of subsets of columns of A of size K can be encoded through a binary-valued matrix
Z ∈ {0, 1}m×m as follows: all but K columns of Z are zero, and the non-zero columns
of Z are distinct columns of A collectively satisfying the signature condition. More

precisely, for any 1 ≤ i1 < i2 < · · · < iK ≤ m, representing the signature columns, the
variable Z should satisfy
Z_{i_j i_j} = 1, for 1 ≤ j ≤ K,        (8.20)
Z_{i_j i_k} = 0, for 1 ≤ j ≠ k ≤ K,        (8.21)
[Z_{a i_j}]_{a∈[m]} ∈ col(A), for 1 ≤ j ≤ K,        (8.22)
Z_{ab} = 0, for a ∈ [m], b ∉ {i1, . . . , iK}.        (8.23)
In the above, col(A) = {[A_{ij}]_{i∈[m]} : 1 ≤ j ≤ n} represents the set of columns of matrix A.
Then the optimization problem of interest is
minimize ‖y − Zy‖2 over Z ∈ {0, 1}^{m×m}
such that Z satisfies constraints (8.20)–(8.23).        (8.24)
The constraint set (8.20)–(8.23) can be viewed as a disjoint union of (m choose K) sets, each one
corresponding to a choice of 1 ≤ i1 < · · · < iK ≤ m. For each such choice, we can solve
the optimization (8.24) and choose the best solution across all of them. That is, the
computation cost is O(m^K) times the cost of solving the optimization problem (8.24).
The complexity of solving (8.24) fundamentally depends on the constraint (8.22) – it
captures the structural complexity of describing the column set of matrix A.
A natural convex relaxation of the optimization problem (8.24) involves replacing
(8.22) and Z ∈ {0, 1}^{m×m} by

[Z_{a i_j}]_{a∈[m]} ∈ convex-hull(col(A)), for 1 ≤ j ≤ K;  Z ∈ [0, 1]^{m×m}.        (8.25)

In the above, for any set S,

convex-hull(S) ≡ { Σ_{ℓ=1}^{Q} a_ℓ x_ℓ : a_ℓ ≥ 0, x_ℓ ∈ S for ℓ ∈ [Q], Σ_{ℓ=1}^{Q} a_ℓ = 1, for Q ≥ 2 }.

In the best case, it may be feasible to solve the optimization with the convex relaxation
efficiently. However, the relaxation may not yield a solution that is achieved at the
extreme points of convex-hull(col(A)), which is what we desire. This is due to the fact
that the objective, the ℓ2-norm of the error we are considering, is strictly convex. To overcome
this challenge, we can replace ℓ2 by ℓ∞. The constraints of interest are, for a given ε > 0,
Z_{i_j i_j} = 1, for 1 ≤ j ≤ K,        (8.26)
Z_{i_j i_k} = 0, for 1 ≤ j ≠ k ≤ K,        (8.27)
yi − (Zy)i ≤ ε, for 1 ≤ i ≤ m,        (8.28)
yi − (Zy)i ≥ −ε, for 1 ≤ i ≤ m,        (8.29)
[Z_{a i_j}]_{a∈[m]} ∈ convex-hull(col(A)), for 1 ≤ j ≤ K,        (8.30)
Z_{ab} = 0, for a ∈ [m], b ∉ {i1, . . . , iK}.        (8.31)
This results in the linear program

minimize Σ_{i,j=1}^{m} ζ_{ij} Z_{ij} over Z ∈ [0, 1]^{m×m}
such that Z satisfies constraints (8.26)–(8.31).        (8.32)


In the above, ζ = [ζ_{ij}] ∈ [0, 1]^{m×m} is a random vector with each of its components chosen
by drawing a number from [0, 1] uniformly at random. The purpose of choosing ζ is to
obtain a unique solution, if it is feasible. Note that, when this is feasible, the solution
is achieved at an extreme point, which happens to be the valid solution of interest. We
can solve (8.32) iteratively for choices of ε = 2^{−q} for q ≥ 0 until we fail to find a feasible
solution. The value of ε before that will be the smallest (within a factor of 2) error
tolerance that is feasible within the signature family. Therefore, the cost of finding such
a solution is within O(log 1/ε) times the cost of solving the linear program (8.32). The
cost of the linear program (8.32) depends on the complexity of the convex relaxation of
the set col(A). If it is indeed simple enough, then we can solve (8.32) efficiently.
As it turns out, for the case of first-order marginals, the convex hull of col(A) is
succinct due to the classical result by Birkhoff and von Neumann which characterizes
the convex relaxation of permutation matrices through a number of equalities that is
linear in the size of the permutation, here N. Each of the elements in col(A) corre-
sponds to a (flattened) permutation matrix. Therefore, its convex hull is simply that
of (flattened) permutation matrices, leading to a succinct description. This results in a
polynomial-time algorithm for solving (8.32). In summary, we conclude the following
(see [43] for details).
theorem 8.5 For a given observation vector y ∈ [0, 1]^m, if there exists a distribution μ
in a signature family of support size K such that the corresponding projection is within
ε ∈ (0, 1] of y in terms of the ℓ∞-norm, then it can be found through an algorithm with
computation cost O(N^{Θ(K)} log(1/ε)).

Open question. It is not known how to perform an efficient computation equivalent to
Theorem 8.5 (or whether one can do so) for finding the approximate distribution in the
signature family for the pair-wise comparison marginals.
On the universality of the signature family. Thus far, we have focused on developing
algorithms for learning a sparse model with a signature condition. The sparse model
is a natural approximation for a generic distribution over permutations. In Theorems
8.2 and 8.3, we effectively argued that a model with randomly chosen sparse support
satisfies the signature condition as long as the sparsity is not too large. However, it is
not clear whether a sparse model with a signature condition is a good approximation
beyond such a setting. For example, is there a sparse model with a signature condition
that approximates the marginal information of a simple parametric model such as the
multinomial logit (MNL) model well?
To that end, recently Farias et al. [43] have established the following representation
result which we state without proof here.
theorem 8.6 Let ν be an MNL model with parameters w1, . . . , wN (and, without loss of
generality, let 0 < w1 < · · · < wN) such that

wN / Σ_{k=1}^{N−L} wk ≤ (log N)/N,        (8.33)

for some L = N^δ with δ ∈ (0, 1). Then there exists a distribution ν̃ such that |supp(ν̃)| = O(N/ε²),
ν̃ satisfies the signature condition with respect to the first-order marginals, and
‖M(ν) − M(ν̃)‖2 ≤ ε.

8.5 Random Utility Model (RUM)

We discuss recovery of an exact model for the MNL model and recovery of the ranking
for a generic random utility model with homogeneous random perturbation.

8.5.1 Exact Marginals: Infinite Samples


Given the exact marginal information M(ν) or C(ν) for ν, we wish to recover the
parameters of the model when ν is MNL, and we wish to recover the ranking when ν
is a generic random utility model. We first discuss recovery of the MNL model for both
types of marginal information and then discuss recovery of ranking for a generic model.

Recovering MNL: first-order marginals. Without loss of generality, let us assume that
the parameters w1, . . . , wN are normalized so that Σ_i wi = 1. Then, under the MNL model
according to (8.9),

P(σ(i) = 1) = wi.        (8.34)

That is, the first column of the first-order marginal matrix M(ν) = [Mi j (ν)] precisely
provides the parameters of the MNL model!
Recovering the MNL model: comparison marginals. Under the MNL model, according
to (8.11), for any i ≠ j ∈ [N],

P(σ(i) > σ(j)) = wi / (wi + wj).        (8.35)

The comparison marginals C(ν) provide access to P(σ(i) > σ(j)) for all i ≠ j ∈ [N].
Using these, we wish to recover the parameters w1 , . . . , wN .
Next, we describe a reversible Markov chain over N states whose stationary distribu-
tion is precisely the parameters of our interest, and its transition kernel utilizes the C(ν).
This alternative representation provides an intuitive algorithm for recovering the MNL
parameters, and more generally what is known as the rank centrality [44, 45].
To that end, the Markov chain of interest has N states. The transition kernel or
transition probability matrix Q = [Qi j ] ∈ [0, 1]N×N of the Markov chain is defined using
comparison marginals C = C(ν) as follows:



Qij = Cji/(2N) if i ≠ j,   and   Qii = 1 − Σ_{j≠i} Cji/(2N).        (8.36)

The Markov chain has a unique stationary distribution because (a) Q is irreducible,
since Cij, Cji > 0 for all i ≠ j as long as wi > 0 for all i ∈ [N], and (b) Qii > 0 by definition

for all i ∈ [N] and hence it is aperiodic. Further, w = [wi ]i∈[N] ∈ [0, 1]N is a stationary
distribution since it satisfies the detailed-balance condition, i.e., for any i  j ∈ [N],
C ji wj
wi Qi j = wi = wi
2N 2N(wi + w j )
wi Ci j
= wj = wj
2N(wi + w j ) 2N
= w j Q ji . (8.37)
Thus, by finding the stationary distribution of a Markov chain as defined above, we
can find the parameters of the MNL model. This boils down to finding the largest
eigenvector of Q, which can be done using various efficient algorithms, including the
standard power-iteration method.
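A minimal sketch of this procedure, assuming full access to C and illustrative MNL weights, is given below; it builds the transition matrix of (8.36) and runs plain power iteration:

import numpy as np

def rank_centrality(C, n_iter=5000):
    # Build Q with Q[i, j] = C[j, i] / (2N) for i != j and a diagonal that makes rows sum to 1.
    N = C.shape[0]
    Q = C.T / (2.0 * N)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))
    pi = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        pi = pi @ Q          # power iteration: converges to the stationary distribution
    return pi / pi.sum()

# Illustrative check with assumed MNL weights.
w = np.array([0.5, 0.3, 0.15, 0.05])
C = w[:, None] / (w[:, None] + w[None, :])
np.fill_diagonal(C, 0.0)
print(rank_centrality(C))    # should be close to w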
We note that the algorithm employed to find the parameters of the MNL model does
not need to have access to all entries of C. Let E ⊂ {(i, j) : i ≠ j ∈ [N]} be a
subset of all possible pairs for which we have access to C. Let us define a Markov
chain with Q such that, for i ≠ j ∈ [N], Qij is defined according to (8.36) if (i, j) ∈ E (we
assume that (i, j) ∈ E implies (j, i) ∈ E because Cji = 1 − Cij by definition), else Qij = 0; and
Qii = 1 − Σ_{j≠i} Qij. The resulting Markov chain is aperiodic, since by definition Qii > 0.
Therefore, as long as the resulting Markov chain is irreducible, it has a unique stationary
distribution. Now the Markov chain is irreducible if effectively all N states are reachable
from each other via transitions {(i, j), ( j, i) : (i, j) ∈ E}. That is, there are data that compare
any two i ≠ j ∈ [N] through, potentially, chains of comparisons, which, in a sense, is a
minimal requirement in order to have a consistent ranking across all i ∈ [N]. Once we have
this, again it follows that the stationary distribution is given by w = [wi]_{i∈[N]} ∈ [0, 1]^N,
since the detailed-balance equation (8.37) holds for all i ≠ j ∈ [N] with (i, j) ∈ E.
Recovering ranking for a homogeneous RUM. As mentioned in Section 8.3.2, we wish
to recover the ranking or ordering of inherent utilities for a homogeneous RUM. That
is, if u1 ≥ · · · ≥ uN , then the ranking of interest is identity, i.e., σ ∈ S N such that σ(i) = i
for all i ∈ [N]. Recall that the homogeneous RUM random perturbations εi in (8.7)
have an identical distribution for all i ∈ [N]. We shall assume that the distribution of the
random perturbation is absolutely continuous with respect to the Lebesgue measure on
R. Operationally, for any t1 < t2 ∈ R,
 
P ε1 ∈ (t1 , t2 ) > 0. (8.38)
The following is the key characterization of a homogeneous RUM with (8.38) that will
enable recovery of ranking from marginal data (both comparison and first-order); also
see [46, 47].
lemma 8.4 Consider a homogeneous RUM with property (8.38). Then, for i ≠ j ∈ [N],

ui > uj ⇔ P(Yi > Yj) > 1/2.        (8.39)

Further, for any k ≠ i, j ∈ [N],

ui > uj ⇔ P(Yi > Yk) > P(Yj > Yk).        (8.40)

Proof By definition,

P(Yi > Yj) = P(εi − εj > uj − ui).        (8.41)

Since εi, εj are independent and identically distributed (i.i.d.) with property (8.38), their
difference εi − εj has mean 0, is symmetric, and also satisfies property (8.38).
That is, 0 is its unique median as well, and, for any t > 0,

P(εi − εj > t) = P(εi − εj < −t) < 1/2.        (8.42)

This leads to the conclusion that

ui > uj ⇔ P(Yi > Yj) > 1/2.

Similarly,

P(Yi > Yk) = P(εi − εk > uk − ui),        (8.43)
P(Yj > Yk) = P(εj − εk > uk − uj).        (8.44)

Now εi − εk and εj − εk are identically distributed with property (8.38). That is, each has
a strictly monotonically increasing cumulative distribution function (CDF). Therefore,
(8.40) follows immediately.
Recovering the ranking: comparison marginals. From (8.39) of Lemma 8.4, using
comparison marginals C(ν), we can recover a ranking of [N] that corresponds to the
ranking of their inherent utility for a generic homogeneous RUM as follows. For each
i ∈ [N], assign rank as

rank(i) = N − |{ j ∈ [N] : j ≠ i, Cij > 1/2 }|.        (8.45)
From Lemma 8.4, it immediately follows that the rank provides the ranking of [N] as
desired.
We also note that (8.40) of Lemma 8.4 suggests an alternative way (which will turn
out to be robust and more useful) to find the same ranking. To that end, for each i ∈ [N],
define the score as

score(i) = (1/(N − 1)) Σ_{k≠i} Cik.        (8.46)

From (8.40) of Lemma 8.4, it follows that, for any i ≠ j ∈ [N],

score(i) > score(j) ⇔ ui > uj.        (8.47)
That is, by ordering [N] in decreasing order of score values, we obtain the desired
ranking.
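Both (8.45) and (8.46) translate directly into code; a short sketch (illustrative, with Cij again denoting the probability that i is preferred to j) is:

import numpy as np

def rank_and_score(C):
    N = C.shape[0]
    off_diag = ~np.eye(N, dtype=bool)
    # (8.45): rank(i) = N - number of j != i with C[i, j] > 1/2
    rank = N - np.sum((C > 0.5) & off_diag, axis=1)
    # (8.46): score(i) = average of C[i, k] over k != i
    score = (C.sum(axis=1) - np.diag(C)) / (N - 1)
    return rank, score

Sorting the items in decreasing order of score, e.g., via np.argsort(-score), then yields the ranking.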
Recovering the ranking: first-order marginals. We are given a first-order marginal data
matrix, M = M(ν) ∈ [0, 1]N×N , where the Mi j represent P(σ(i) = j) under ν for i, j ∈ [N].
To recover the ranking under a generic homogeneous RUM using M, we shall introduce
the notion of the Borda count, see [48]. More precisely, for any i ∈ [N],
 
borda(i) = E[σ(i)] = Σ_{j∈[N]} j P(σ(i) = j) = Σ_{j∈[N]} j Mij.        (8.48)

That is, borda(i) can be computed using M for any i ∈ [N]. Recall that we argued earlier
that the score(·) (in decreasing order) provides the desired ordering or ranking of [N].
However, computing the score requires access to the comparison marginals C, and it is not
feasible to recover C from M.
On the other hand, intuitively it seems that borda (in increasing order) provides an
ordering over [N] that might be what we want. Next, we state a simple invariant that ties
score(i) and borda(i), which will lead to the conclusion that we can recover the desired
ranking by sorting [N] in increasing order of borda count [46, 47].
lemma 8.5 For any i ∈ [N] and any distribution over permutations,

borda(i) + (N − 1)score(i) = N. (8.49)

Proof Consider any permutation σ ∈ S N . For any i ∈ [N], σ(i) denotes the position in
[N] that i is ranked to. That is, N − σ(i) is precisely the number of elements in [N] (and
not equal to i) that are ranked below i. Formally,

N − σ(i) = Σ_{j≠i} 1(σ(i) > σ(j)).        (8.50)

Taking the expectation on both sides with respect to the underlying distribution over
permutations and rearranging terms, we obtain

N = E[σ(i)] + Σ_{j≠i} P(σ(i) > σ(j)).        (8.51)

Using definitions from (8.46) and (8.48), we have

N = borda(i) + (N − 1)score(i). (8.52)

This completes the proof.
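The Borda count of (8.48), and the invariant (8.49), can also be checked numerically; the sketch below assumes that M and C have been computed (or estimated) from the same underlying distribution:

import numpy as np

def borda_count(M):
    # (8.48): borda(i) = sum_j j * M[i, j], with positions numbered 1..N.
    N = M.shape[0]
    return M @ np.arange(1, N + 1)

def score(C):
    N = C.shape[0]
    return (C.sum(axis=1) - np.diag(C)) / (N - 1)

# For any distribution over permutations, (8.49) gives, for every i,
# borda(i) + (N - 1) * score(i) == N; e.g.,
# assert np.allclose(borda_count(M) + (M.shape[0] - 1) * score(C), M.shape[0])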

8.5.2 Noisy Marginals: Finite Samples


Now we consider a setup where we have access to marginal distributions formed on
the basis of an empirical distribution of finite samples from the underlying distribution.
This can be viewed as access to a “noisy” marginal distribution. Specifically, given a
distribution ν, we observe D = P(ν) + η, where P(ν) ∈ {M(ν), C(ν)} depending upon the
type of marginal information, with η being noise such that ‖η‖2 ≤ δ, with δ > 0 being
small if we have access to enough samples. Here δ represents the error observed due to
access to finitely many samples and is assumed to be known.
As before, we wish to recover the parameters of the model when ν is MNL, and we
wish to recover the ranking when ν is a generic RUM. We first discuss recovery of the
MNL model for both types of marginal information and then discuss recovery of the
ranking for the generic model.
Recovering the MNL model: first-order marginals. As discussed in Section 8.5.1, we
can recover the parameters of the MNL model, w = [wi]_{i∈[N]}, using the first column of
the first-order marginal matrix, M = M(ν), by simply setting wi = Mi1. Since we have
access to Mi1 + ηi1, a simple estimator is to set ŵi = Mi1 + ηi1. Then it follows that

‖ŵ − w‖2 = ‖η·1‖2 ≤ ‖η‖2 ≤ δ.        (8.53)

That is, using the same algorithm for estimating a parameter as in the case of access
to the exact marginals, we obtain an estimator which seems to have reasonably good
properties.
Recovering the MNL model: comparison marginals. We shall utilize the noisy compari-
son data to create a Markov chain as in Section 8.5.1. The stationary distribution of this
noisy or perturbed Markov chain will be a good approximation of the original Markov
chain, i.e., the true MNL parameters. This will lead to a good estimator of the MNL
model using noisy comparison data.
To that end, we have access to noisy comparison marginals Ĉ = C + η. To keep things
generic, we shall assume that we have access to comparisons for a subset of pairs. Let
E ⊂ {(i, j) : i ≠ j ∈ [N]} denote the subset of all possible pairs for which we have
access to noisy comparison marginals, and we shall assume that (i, j) ∈ E iff (j, i) ∈ E.
Define di = |{ j ∈ [N] : j ≠ i, (i, j) ∈ E}| and dmax = maxi di. Define a noisy Markov chain
with transition matrix Q̂ as

Q̂ij = Ĉji/(2dmax) if i ≠ j and (i, j) ∈ E,
Q̂ii = 1 − (1/(2dmax)) Σ_{j:(i,j)∈E} Ĉji if i = j,        (8.54)
Q̂ij = 0 if (i, j) ∉ E.

We shall assume that E is such that the resulting Markov chain with transition matrix Q̂
is irreducible; it is aperiodic since Q̂ii > 0 for all i ∈ [N] by definition (8.54). As before,
it can be verified that this noisy Markov chain is reversible and has a unique stationary
distribution that satisfies the detailed-balance condition. Let π̂ denote this stationary
distribution. The corresponding ideal Markov chain has transition matrix Q defined as



Qij = Cji/(2dmax) if i ≠ j and (i, j) ∈ E,
Qii = 1 − (1/(2dmax)) Σ_{j:(i,j)∈E} Cji if i = j,        (8.55)
Qij = 0 if (i, j) ∉ E.

It is reversible and has a unique stationary distribution π = w. We want to bound ‖π̂ − π‖.
By definition of π̂ being the stationary distribution of Q̂, we have that

π̂^T Q̂ = π̂^T.        (8.56)

We can find π̂ using a power-iteration algorithm. More precisely, let ν0 ∈ [0, 1]^N be a
probability distribution as our initial guess. Iteratively, for iteration t ≥ 0,

ν_{t+1}^T = ν_t^T Q̂.        (8.57)

We make the following claim, see [44, 45].


lemma 8.6 For any t ≥ 1,

‖νt − π‖/‖π‖ ≤ ρ^t ‖ν0 − π‖/‖π‖ + (1/(1 − ρ)) ‖Δ‖2 √(πmax/πmin).        (8.58)

Here Δ = Q̂ − Q, πmax = maxi πi, πmin = mini πi; λmax(Q) denotes the second largest (in
norm) eigenvalue of Q; ρ = λmax(Q) + ‖Δ‖2 √(πmax/πmin); and it is assumed that ρ < 1.
Before we provide the proof of this lemma, let us consider its implications. It is
quantifying the robustness of our approach for identifying the parameters of the MNL
model using comparison data. Specifically, since νt → π̂ as t → ∞, (8.58) implies

‖π̂ − π‖/‖π‖ ≤ (1/(1 − ρ)) ‖Δ‖2 √(πmax/πmin).        (8.59)

Thus, the operator or spectral norm of the perturbation matrix Δ = Q̂ − Q determines
the error in our ability to learn parameters using the above-mentioned rank centrality
algorithm. By definition,

|Q̂ij − Qij| ≤ |Ĉji − Cji|/(2dmax) if i ≠ j and (i, j) ∈ E,
|Q̂ii − Qii| ≤ (1/(2dmax)) | Σ_{j:(i,j)∈E} (Ĉji − Cji) | if i = j,        (8.60)
|Q̂ij − Qij| = 0 if (i, j) ∉ E.

Therefore, it follows that

‖Δ‖_F² = Σ_{i,j} Δij² = Σ_{i,j} (Q̂ij − Qij)²
       = (1/(4dmax²)) Σ_{i≠j} (Ĉij − Cij)² + (1/(4dmax²)) Σ_i ( Σ_{j:(i,j)∈E} (Ĉij − Cij) )²
       ≤ (1/(4dmax²)) Σ_{i,j} (Ĉij − Cij)² + (1/(4dmax²)) Σ_i dmax Σ_{j:(i,j)∈E} (Ĉij − Cij)²
       ≤ ((dmax + 1)/(4dmax²)) Σ_{i,j} (Ĉij − Cij)²
       = ((dmax + 1)/(4dmax²)) ‖η‖_F² ≤ (1/(2dmax)) ‖η‖_F²,        (8.61)

where η is the error in the comparison marginals, i.e., η = Ĉ − C. Thus, if ‖η‖_F² ≤ δ², then,
since ‖Δ‖2 ≤ ‖Δ‖_F, we have that

‖π̂ − π‖/‖π‖ ≤ (1/(1 − ρ)) (δ/√(2dmax)) √(πmax/πmin),        (8.62)

with

|ρ − λmax(Q)| ≤ (δ/√(2dmax)) √(πmax/πmin).        (8.63)
Therefore, if Q has a good spectral gap, i.e., 1 − λmax(Q) is large enough, and δ is small
enough, then the estimate π̂ is a good proxy for π, the true parameters. The precise
role of the “graph structure” induced by the observed entries, E, comes into play in
determining λmax (Q). This, along with implications regarding the sample complexity
for the random sampling model, is discussed in detail in [44, 45].

Proof of Lemma 8.6. Define the inner-product space induced by π. For any u, v ∈ R^N,
define the inner product ⟨·, ·⟩_π as

⟨u, v⟩_π = Σ_i ui vi πi.        (8.64)

This defines the norm ‖u‖_π = √⟨u, u⟩_π for all u ∈ R^N. Let L2(π) denote the space of all vectors
with finite ‖·‖_π norm endowed with the inner product ⟨·, ·⟩_π. Then, for any u, v ∈ L2(π),
⟨u, Qv⟩_π = Σ_i ui ( Σ_j Qij vj ) πi
          = Σ_{i,j} ui vj πi Qij = Σ_{i,j} ui vj πj Qji
          = Σ_j πj vj ( Σ_i Qji ui ) = ⟨Qu, v⟩_π.        (8.65)

That is, Q is self-adjoint over L2(π). For a self-adjoint matrix Q over L2(π), define the
norm

‖Q‖_{2,π} = max_u ‖Qu‖_π / ‖u‖_π.        (8.66)

It can be verified that, for any u ∈ R^N and Q,

√πmin ‖u‖2 ≤ ‖u‖_π ≤ √πmax ‖u‖2,        (8.67)
√(πmin/πmax) ‖Q‖2 ≤ ‖Q‖_{2,π} ≤ √(πmax/πmin) ‖Q‖2.        (8.68)
Consider a symmetrized version of Q as S = Π^{1/2} Q Π^{−1/2}, where Π^{±1/2} is the N × N
diagonal matrix with the ith entry on the diagonal being πi^{±1/2}. The symmetry of S
follows due to the detailed-balance property of Q, i.e., πi Qij = πj Qji for all i, j. Since
Q is a probability transition matrix, by the Perron–Frobenius theorem we have that
its eigenvalues are in [−1, 1], with the top eigenvalue being 1 and unique. Let the
eigenvalues be 1 = λ1 > λ2 ≥ · · · ≥ λN > −1. Let the corresponding (left) eigenvectors of
Q be v1, . . . , vN. By definition v1 = π. Therefore, ui = Π^{−1/2} vi are (left) eigenvectors of S
with eigenvalue λi for 1 ≤ i ≤ N, since

ui^T S = (Π^{−1/2} vi)^T Π^{1/2} Q Π^{−1/2} = vi^T Q Π^{−1/2}
       = λi vi^T Π^{−1/2} = λi (Π^{−1/2} vi)^T = λi ui^T.        (8.69)

That is, u1 = π^{1/2}, or Π^{−1/2} u1 = 1. By singular value decomposition, we can write
S = S1 + S_{\1}, where S1 = λ1 u1 u1^T and S_{\1} = Σ_{i=2}^{N} λi ui ui^T. That is,

Q = Π^{−1/2} S Π^{1/2} = Π^{−1/2} S1 Π^{1/2} + Π^{−1/2} S_{\1} Π^{1/2} = 1π^T + Π^{−1/2} S_{\1} Π^{1/2}.        (8.70)

Recalling the notation Δ = Q̂ − Q, we can write (8.57) as

ν_{t+1}^T − π^T = ν_t^T Q̂ − π^T Q = (νt − π)^T (Q + Δ) + π^T Δ
                = (νt − π)^T (1π^T + Π^{−1/2} S_{\1} Π^{1/2}) + (νt − π)^T Δ + π^T Δ
                = (νt − π)^T Π^{−1/2} S_{\1} Π^{1/2} + (νt − π)^T Δ + π^T Δ,        (8.71)

where we used the fact that (νt − π)^T 1 = 0 since both νt and π are probability vectors.
Now, for any matrix W, ‖Π^{−1/2} W Π^{1/2}‖_{2,π} = ‖W‖2. Therefore,

‖ν_{t+1}^T − π^T‖_π ≤ ‖νt − π‖_π ( ‖S_{\1}‖2 + ‖Δ‖_{2,π} ) + ‖π^T Δ‖_π.        (8.72)

By definition ‖S_{\1}‖2 = max{λ2, |λN|} = λmax(Q). Let γ = λmax(Q) + ‖Δ‖_{2,π}. Then

‖νt^T − π^T‖_π ≤ γ^t ‖ν0 − π‖_π + Σ_{s=0}^{t−1} γ^s ‖π^T Δ‖_π.        (8.73)

Using the bounds in (8.67) and (8.68), we have

γ ≤ λmax(Q) + √(πmax/πmin) ‖Δ‖2 ≡ ρ,        (8.74)
‖νt^T − π^T‖2 ≤ (1/√πmin) ‖νt^T − π^T‖_π,        (8.75)
‖ν0 − π‖_π ≤ √πmax ‖ν0 − π‖2,        (8.76)
‖π^T Δ‖_π ≤ ‖π‖2 ‖Δ‖2 √πmax.        (8.77)

Therefore, we conclude that

‖νt^T − π^T‖/‖π‖ ≤ ρ^t ‖ν0 − π‖/‖π‖ + Σ_{s=0}^{t−1} ρ^s ‖Δ‖2 √(πmax/πmin).        (8.78)

This completes the proof by bounding Σ_{s=0}^{t−1} ρ^s = (1 − ρ^t)/(1 − ρ) ≤ 1/(1 − ρ) for ρ < 1.

Recovering the ranking: comparison marginals. We will consider recovering the
ranking from noisy comparison marginals Ĉ = C + η, using the scores as in (8.46)
defined using the noisy marginals. That is, for i ∈ [N] define

ŝcore(i) = (1/(N − 1)) Σ_{k≠i} Ĉik.        (8.79)

Then the error in the score for i is

error(i) = |ŝcore(i) − score(i)| = (1/(N − 1)) | Σ_{k≠i} (Ĉik − Cik) |
         ≤ (1/(N − 1)) Σ_{k≠i} |Ĉik − Cik| ≤ (1/(N − 1)) ‖ηi·‖1,        (8.80)

where ηi· = [ηik]_{k∈[N]}. Therefore, the relative order of any pair of i, j ∈ [N] is preserved
under the noisy score as long as

error(i) + error(j) < |score(i) − score(j)|.        (8.81)



That is,

‖ηi·‖1 + ‖ηj·‖1 < (N − 1)|score(i) − score(j)|.        (8.82)

In summary, (8.82) provides the robustness property of a ranking algorithm based on


noisy comparison being able to recover the true relative order for each pair of i, j; and
subsequently for the entire ranking.
Recovering the ranking: first-order marginals. We consider using the Borda count for
finding the ranking using noisy first-order marginals. More precisely, given noisy
first-order marginals M̂ = M + η, we define the noisy Borda count for i ∈ [N] as

b̂orda(i) = Σ_{k∈[N]} k M̂ik.        (8.83)

Then, the error in the Borda count for i is

error(i) = |b̂orda(i) − borda(i)| ≤ Σ_{k∈[N]} k |M̂ik − Mik|
         = Σ_{k∈[N]} k |ηik| = borda^{η+}(i).        (8.84)

That is, error(i) is like computing the Borda count for i using η+ ≡ [|ηik|], which we
define as borda^{η+}(i). Then the relative order of any pair i, j ∈ [N] per the noisy Borda count
is preserved if

borda^{η+}(i) + borda^{η+}(j) < |borda(i) − borda(j)|.        (8.85)

In summary, (8.85) provides the robustness property of a ranking algorithm based on


noisy first-order marginal being able to recover the relative order for a given pair i, j;
and subsequently for the entire ranking.

8.6 Discussion

We discussed learning the distribution over permutations from marginal information. In


particular, we focused on marginal information of two types: first-order marginals and
pair-wise marginals. We discussed the conditions for recovering the distribution as well
as the ranking associated with the distribution under two model classes: sparse models
and random utility models (RUM). For all of these settings, we discussed settings
where we had access to exact marginal information as well as noisy marginals. A lot of
progress has been made, especially in the past decade, on this topic in various directions.
Here, we point out some of the prominent directions and provide associated references.

8.6.1 Beyond First-Order and Pair-Wise Marginals


To start with, learning the distribution for different types of marginal information has
been discussed in [28], where Jagabathula and Shah discuss the relationship between
the level of sparsity and the type of marginal information available. Specifically, through

connecting marginal information with the spectral representation of the permutation


group, Jagabathula and Shah find that, as the higher-order marginal information is made
available, the distribution with larger support size can be recovered with a tantalizingly
similar relationship between the dimensionality of the higher-order information and the
recoverability of the support size just like in the first-order marginal information. The
reader is referred to [28] for more details and some of the open questions. We note that
this collection of results utilizes the signature condition discussed in Section 8.4.1.

8.6.2 Learning MNL beyond Rank Centrality


The work in [44, 45] on recovering the parameters of the MNL model from noisy
observations has led to exciting subsequent work in recent years. Maximum-likelihood
estimation (MLE) for the MNL model is particularly nice – it boils down to solving
a simple, convex minimization problem. For this reason, historically it has been well
utilized in practice. However, only recently has its sample complexity been well studied.
In particular, in [49] Hajek et al. argue that the sample complexity requirement of MLE
is similar to that of rank centrality, discussed in Section 8.5.2.
There has been work to find refined estimation of parameters, for example, restricting
the analysis to the top few parameters, see [50–52]. We also note an interesting
algorithmic generalization of rank centrality that has been discussed in [53] through
a connection to a continuous-time representation of the reversible Markov chain
considered in rank centrality.

8.6.3 Mixture of MNL Models


The RUM discussed in detail here has the weakness that all options are parameterized
by one parameter. This does not allow for heterogeneity in options in terms of multiple
“modes” of preferences or rankings. Putting it another way, the RUM captures a sliver
of the space of all possible distributions over permutations. A natural way to generalize
such a model is to consider a mixture of RUM models. A specific instance is a mixture
of MNL models, which is known as the mixture MNL or mixed MNL model. It
can be argued that such a mixed MNL model can approximate any distribution over
permutations with enough mixture components. This is because we can approximate a
distribution with unit mass on a permutation by an MNL model by choosing parameters
appropriately. Therefore, it makes sense to understand when we can learn the mixed
MNL model from pair-wise rankings or more generally partial rankings. In [54],
Ammar et al. considered this question and effectively identified an impossibility result
that suggests that pair-wise information is not sufficient to learn mixtures in general.
They provided lower bounds that related the number of mixture components, the
number of choices (here N), and the length of partial rankings observed. For a separable
mixture MNL model, they provide a natural clustering-based solution for recovery.
Such a recovery approach has been further refined in the context of collaborative
ranking [55, 56] through use of convex optimization-based methods and imposing a

“low-rank” structure on the model parameter matrix to enable recovery. In another line
of work, using higher-moment information for a separable mixture model, in [57] Oh
and Shah provided a tensor-decomposition-based approach for recovering the mixture
MNL model.

8.6.4 Matrix Estimation for Denoising Noisy Marginals


In a very different view, the first-order marginal and pair-wise marginal information
considered here can be viewed as a matrix of observations with an underlying structure.
Or, more generally, we have access to a noisy observation of an underlying ground
matrix with structure. The structure is implied by the underlying distribution over
permutations generating it. Therefore, recovering exact marginal information from
noisy marginal information can be viewed as recovering a matrix, with structure,
employing a noisy and potentially partial view of it. In [39], this view was considered
for denoising pair-wise comparison marginal data. In [39], it was argued that, when
the ground-truth comparison marginal matrix satisfies a certain stochastic transitivity
condition, for example that implied by the MNL model, the true pair-wise marginal
matrix can be recovered from noisy, partial observations. This approach has been further
studied in a sequence of recent works including [42, 58].

8.6.5 Active Learning and Noisy Sorting


Active learning with a view to ranking using pair-wise comparisons with noisy observa-
tions has been well studied for a long time. For example, in [59] Adler et al. considered
the design of an adaptive “tournament,” i.e., the draw of games is not decided upfront,
but is decided as more and more games are played. The eventual goal is to find the
“true” winner with a high probability. It is assumed that there is an underlying ranking
which corresponds to the ordering implied by the MNL or Bradley–Terry–Luce model.
The work of Adler et al. provides an adaptive algorithm that leads to finding the true
winner with high probability efficiently.
When there is “geometry” imposed on the space of preferences, very efficient
adaptive algorithms can be designed for searching using comparisons, for example
[60, 61]. Another associated line of works is that of noisy sorting. For example, see [62].
It is worth noting that the variation of online learning in the context of a “bandit”
setting is known as “dueling bandits” wherein comparisons between pairs of arms are
provided and the goal is to find the top arm. This, again can be viewed as finding the
top element from pair-wise comparisons in an online setting, with the goal being to
minimize “regret.” For example, see [63–65].
In our view, the dueling-bandits model using distributions over permutations, i.e.,
the outcome of pair-wise comparisons of arms are consistent with the underlying
distribution over permutations, provides an exciting direction for making progress
toward online matrix estimation.

8.6.6 And It Continues ...


There is a lot more that is tangentially related to this topic. For example, the question
of ranking or selecting the winner in an election is fundamental to so many disciplines
and each (sub)discipline brings to the question a different perspective that makes this
topic rich and exciting. The statistical challenges related to learning the distribution
over permutations are very recent, as is clear from the exposition provided here. The
scale and complexity of the distribution over permutations (N! for N options) makes
it challenging from a computational point of view and thus provides a fertile ground
for emerging interactions between statistics and computation. The rich group structure
embedded within the permutation group makes it an exciting arena for the development
of algebraic statistics, see [66].
The statistical philosophy of max-entropy model learning leads to learning a para-
metric distribution from an exponential family. This brings in rich connections to the
recent advances in learning and inference on graphical models. For example, fitting
such a model using first-order marginals boils down to computing a partition function
over the space of matchings or permutations; which can be computationally efficiently
solved due to fairly recent progress in computing the permanent [67]. In contrast,
learning such a model efficiently in the context of pair-wise marginals is not easy due to
a connection to the feedback arc set problem.
We note an interesting connection: a mode computation heuristic based on maximum
weight matching in a bipartite graph using a first-order marginal turns out to be a
“first-order” approximation of the mode of such a distribution, see [47]. On the other
hand, using pair-wise comparison marginals, there is a large number of heuristics
to compute ranking, including the rank centrality algorithm discussed in detail here,
or more generally a variety of spectral methods considered for ranking, including in
[68, 69], and more recently [70, 71].
This exciting list of work continues to grow even as the author completes these final
keystrokes and as you get inspired to immerse yourself in this fascinating topic of
computing choice.

References

[1] J. Huang, C. Guestrin, and L. Guibas, “Fourier theoretic probabilistic inference over
permutations,” J. Machine Learning Res., vol. 10, no. 5, pp. 997–1070, 2009.
[2] P. Diaconis, Group representations in probability and statistics. Institute of Mathematical
Statistics, 1988.
[3] L. Thurstone, “A law of comparative judgement,” Psychol. Rev., vol. 34, pp. 237–286,
1927.
[4] J. Marschak, “Binary choice constraints on random utility indicators,” Cowles Foundation
Discussion Paper, 1959.
[5] J. Marschak and R. Radner, Economic theory of teams. Yale University Press, 1972.

[6] J. I. Yellott, “The relationship between Luce’s choice axiom, Thurstone’s theory of
comparative judgment, and the double exponential distribution,” J. Math. Psychol.,
vol. 15, no. 2, pp. 109–144, 1977.
[7] R. Luce, Individual choice behavior: A theoretical analysis. Wiley, 1959.
[8] R. Plackett, “The analysis of permutations,” Appl. Statist., vol. 24, no. 2, pp. 193–202,
1975.
[9] D. McFadden, “Conditional logit analysis of qualitative choice behavior,” in Frontiers in
Econometrics, P. Zarembka, ed. Academic Press, 1973, pp. 105–142.
[10] G. Debreu, “Review of R. D. Luce, ‘individual choice behavior: A theoretical analysis,’”
Amer. Economic Rev., vol. 50, pp. 186–188, 1960.
[11] R. A. Bradley, “Some statistical methods in taste testing and quality evaluation,”
Biometrics, vol. 9, pp. 22–38, 1953.
[12] D. McFadden, “Econometric models of probabilistic choice,” in Structural analysis of
discrete data with econometric applications, C. F. Manski and D. McFadden, eds. MIT
Press, 1981.
[13] M. E. Ben-Akiva and S. R. Lerman, Discrete choice analysis: Theory and application to
travel demand. MIT Press, 1985.
[14] D. McFadden, “Disaggregate behavioral travel demand’s RUM side: A 30-year
retrospective,” Travel Behaviour Res., pp. 17–63, 2001.
[15] P. M. Guadagni and J. D. C. Little, “A logit model of brand choice calibrated on scanner
data,” Marketing Sci., vol. 2, no. 3, pp. 203–238, 1983.
[16] S. Mahajan and G. J. van Ryzin, “On the relationship between inventory costs and variety
benefits in retail assortments,” Management Sci., vol. 45, no. 11, pp. 1496–1509, 1999.
[17] M. E. Ben-Akiva, “Structure of passenger travel demand models,” Ph.D. dissertation,
Department of Civil Engineering, MIT, 1973.
[18] J. H. Boyd and R. E. Mellman, “The effect of fuel economy standards on the U.S.
automotive market: An hedonic demand analysis,” Transportation Res. Part A: General,
vol. 14, nos. 5–6, pp. 367–378, 1980.
[19] N. S. Cardell and F. C. Dunbar, “Measuring the societal impacts of automobile
downsizing,” Transportation Res. Part A: General, vol. 14, nos. 5–6, pp. 423–434, 1980.
[20] D. McFadden and K. Train, “Mixed MNL models for discrete response,” J. Appl.
Econometrics, vol. 15, no. 5, pp. 447–470, 2000.
[21] K. Bartels, Y. Boztug, and M. M. Muller, “Testing the multinomial logit model,” 1999,
unpublished working paper.
[22] J. L. Horowitz, “Semiparametric estimation of a work-trip mode choice model,” J.
Econometrics, vol. 58, pp. 49–70, 1993.
[23] B. Koopman, “On distributions admitting a sufficient statistic,” Trans. Amer. Math. Soc.,
vol. 39, no. 3, pp. 399–409, 1936.
[24] B. Crain, “Exponential models, maximum likelihood estimation, and the Haar condition,”
J. Amer. Statist. Assoc., vol. 71, pp. 737–745, 1976.
[25] R. Beran, “Exponential models for directional data,” Annals Statist., vol. 7, no. 6, pp.
1162–1178, 1979.
[26] M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational
inference,” Foundations and Trends Machine Learning, vol. 1, nos. 1–2, pp. 1–305, 2008.
[27] S. Jagabathula and D. Shah, “Inferring rankings under constrained sensing,” in Advances
in Neural Information Processing Systems, 2009, pp. 753–760.

[28] S. Jagabathula and D. Shah, “Inferring rankings under constrained sensing,” IEEE Trans.
Information Theory, vol. 57, no. 11, pp. 7288–7306, 2011.
[29] V. Farias, S. Jagabathula, and D. Shah, “A data-driven approach to modeling choice,” in
Advances in Neural Information Processing Systems, 2009.
[30] V. Farias, S. Jagabathula, and D. Shah, “A nonparametric approach to modeling choice
with limited data,” Management Sci., vol. 59, no. 2, pp. 305–322, 2013.
[31] K. J. Arrow, “A difficulty in the concept of social welfare,” J. Political Economy, vol. 58,
no. 4, pp. 328–346, 1950.
[32] J. Marden, Analyzing and modeling rank data. Chapman & Hall/CRC, 1995.
[33] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Information
Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[34] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete
and inaccurate measurements,” Communications Pure Appl. Math., vol. 59, no. 8, pp.
1207–1223, 2006.
[35] E. J. Candès and J. Romberg, “Quantitative robust uncertainty principles and optimally
sparse decompositions,” Foundations Computational Math., vol. 6, no. 2, pp. 227–254,
2006.
[36] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information,” IEEE Trans. Information
Theory, vol. 52, no. 2, pp. 489–509, 2006.
[37] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4,
pp. 1289–1306, 2006.
[38] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry
and combinatorics: A unified approach to sparse signal recovery,” in Proc. 46th Annual
Allerton Conference on Communications, Control, and Computation, 2008, pp. 798–805.
[39] S. Chatterjee, “Matrix estimation by universal singular value thresholding,” Annals Statist.,
vol. 43, no. 1, pp. 177–214, 2015.
[40] D. Song, C. E. Lee, Y. Li, and D. Shah, “Blind regression: Nonparametric regression
for latent variable models via collaborative filtering,” in Advances in Neural Information
Processing Systems, 2016, pp. 2155–2163.
[41] C. Borgs, J. Chayes, C. E. Lee, and D. Shah, “Thy friend is my friend: Iterative collabora-
tive filtering for sparse matrix estimation,” in Advances in Neural Information Processing
Systems, 2017, pp. 4715–4726.
[42] N. Shah, S. Balakrishnan, A. Guntuboyina, and M. Wainwright, “Stochastically transitive
models for pairwise comparisons: Statistical and computational issues,” in International
Conference on Machine Learning, 2016, pp. 11–20.
[43] V. F. Farias, S. Jagabathula, and D. Shah, “Sparse choice models,” in 46th Annual
Conference on Information Sciences and Systems (CISS), 2012, pp. 1–28.
[44] S. Negahban, S. Oh, and D. Shah, “Iterative ranking from pair-wise comparisons,” in
Advances in Neural Information Processing Systems, 2012, pp. 2474–2482.
[45] S. Negahban, S. Oh, and D. Shah, “Rank centrality: Ranking from pairwise comparisons,”
Operations Res., vol. 65, no. 1, pp. 266–287, 2016.
[46] A. Ammar and D. Shah, “Ranking: Compare, don’t score,” in Proc. 49th Annual Allerton
Conference on Communication, Control, and Computing, 2011, pp. 776–783.
[47] A. Ammar and D. Shah, “Efficient rank aggregation using partial data,” in ACM
SIGMETRICS Performance Evaluation Rev., vol. 40, no. 1, 2012, pp. 355–366.

[48] P. Emerson, “The original Borda count and partial voting,” Social Choice and Welfare,
vol. 40, no. 2, pp. 353–358, 2013.
[49] B. Hajek, S. Oh, and J. Xu, “Minimax-optimal inference from partial rankings,” in
Advances in Neural Information Processing Systems, 2014, pp. 1475–1483.
[50] Y. Chen and C. Suh, “Spectral MLE: Top-k rank aggregation from pairwise comparisons,”
in International Conference on Machine Learning, 2015, pp. 371–380.
[51] Y. Chen, J. Fan, C. Ma, and K. Wang, “Spectral method and regularized MLE are both
optimal for top-k ranking,” arXiv:1707.09971, 2017.
[52] M. Jang, S. Kim, C. Suh, and S. Oh, “Top-k ranking from pairwise comparisons: When
spectral ranking is optimal,” arXiv:1603.04153, 2016.
[53] L. Maystre and M. Grossglauser, “Fast and accurate inference of Plackett–Luce models,”
in Advances in Neural Information Processing Systems, 2015, pp. 172–180.
[54] A. Ammar, S. Oh, D. Shah, and L. F. Voloch, “What’s your choice?: Learning the mixed
multi-nomial,” in ACM SIGMETRICS Performance Evaluation Rev., vol. 42, no. 1, 2014,
pp. 565–566.
[55] S. Oh, K. K. Thekumparampil, and J. Xu, “Collaboratively learning preferences from ordi-
nal data,” in Advances in Neural Information Processing Systems, 2015, pp. 1909–1917.
[56] Y. Lu and S. N. Negahban, “Individualized rank aggregation using nuclear norm regu-
larization,” in Proc. 53rd Annual Allerton Conference on Communication, Control, and
Computing, 2015, pp. 1473–1479.
[57] S. Oh and D. Shah, “Learning mixed multinomial logit model from ordinal data,” in
Advances in Neural Information Processing Systems, 2014, pp. 595–603.
[58] N. B. Shah, S. Balakrishnan, and M. J. Wainwright, “Feeling the bern: Adaptive estimators
for Bernoulli probabilities of pairwise comparisons,” in IEEE International Symposium on
Information Theory (ISIT), 2016, pp. 1153–1157.
[59] M. Adler, P. Gemmell, M. Harchol-Balter, R. M. Karp, and C. Kenyon, “Selection in the
presence of noise: The design of playoff systems,” in SODA, 1994, pp. 564–572.
[60] K. G. Jamieson and R. Nowak, “Active ranking using pairwise comparisons,” in Advances
in Neural Information Processing Systems, 2011, pp. 2240–2248.
[61] A. Karbasi, S. Ioannidis, and L. Massoulié, “From small-world networks to comparison-
based search,” IEEE Trans. Information Theory, vol. 61, no. 6, pp. 3056–3074, 2015.
[62] M. Braverman and E. Mossel, “Noisy sorting without resampling,” in Proc. 19th Annual
ACM–SIAM Symposium on Discrete Algorithms, 2008, pp. 268–276.
[63] Y. Yue and T. Joachims, “Interactively optimizing information retrieval systems as a
dueling bandits problem,” in Proc. 26th Annual International Conference on Machine
Learning, 2009, pp. 1201–1208.
[64] K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. Nowak, “Sparse dueling bandits,”
in AISTATS, 2015.
[65] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi, “Contextual dueling
bandits,” in Proc. 28th Conference on Learning Theory, 2015, pp. 563–587.
[66] R. Kondor, A. Howard, and T. Jebara, “Multi-object tracking with representations of the
symmetric group,” in Artificial Intelligence and Statistics, 2007, pp. 211–218.
[67] M. Jerrum, A. Sinclair, and E. Vigoda, “A polynomial-time approximation algorithm for
the permanent of a matrix with nonnegative entries,” J. ACM, vol. 51, no. 4, pp. 671–697,
2004.
[68] T. L. Saaty and G. Hu, “Ranking by eigenvector versus other methods in the analytic
hierarchy process,” Appl. Math. Lett., vol. 11, no. 4, pp. 121–125, 1998.

[69] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the
web,” in Proc. 10th International Conference on World Wide Web, 2001, pp. 613–622.
[70] A. Rajkumar and S. Agarwal, “A statistical convergence perspective of algorithms for rank
aggregation from pairwise data,” in International Conference on Machine Learning, 2014,
pp. 118–126.
[71] H. Azari, D. Parks, and L. Xia, “Random utility theory for social choice,” in Advances in
Neural Information Processing Systems, 2012, pp. 126–134.
9 Universal Clustering
Ravi Kiran Raman and Lav R. Varshney

Summary

Clustering is a general term for the set of techniques that, given a set of objects, aim to
select those that are closer to one another than to the rest of the objects, according to
a chosen notion of closeness. It is an unsupervised learning problem since objects are
not externally labeled by category. Much research effort has been expended on finding
natural mathematical definitions of closeness and then developing/evaluating algorithms
in these terms [1]. Many have argued that there is no domain-independent mathematical
notion of similarity, but that it is context-dependent [2]; categories are perhaps natural in
that people can evaluate them when they see them [3, 4]. Some have dismissed the prob-
lem of unsupervised learning in favor of supervised learning, saying it is not a powerful
natural phenomenon (see p. 159 of [5]).
Yet, most of the learning carried out by people and animals is unsupervised. We
largely learn how to think through categories by observing the world in its unlabeled
state. This is a central problem in data science. Whether grouping behavioral traces into
categories to understand their neural correlates [6], grouping astrophysical data to under-
stand galactic potentials [7], or grouping light spectra into color names [8], clustering is
crucial to data-driven science.
Drawing on insights from universal information theory, in this chapter we ask whether
there are universal approaches to unsupervised clustering. In particular, we consider
instances wherein the ground-truth clusters are defined by the unknown statistics gov-
erning the data to be clustered. By universality, we mean that the system does not have
prior access to such statistical properties of the data to be clustered (as is standard in
machine learning), nor does it have a strong sense of the appropriate notion of similarity
to measure which objects are close to one another.

9.1 Unsupervised Learning in Machine Learning

In an age with explosive data-acquisition rates, we have seen a drastic increase in the
amount of raw, unlabeled data in the form of pictures, videos, text, and voice. By 2023,
it is projected that the per capita amount of data stored in the world will exceed the entire
Library of Congress (10^14 bits) at the time Shannon first estimated it in 1949 [9, 10].


Making sense of such large volumes of data, e.g., for statistical inference, requires
extensive, efficient pre-processing to transform the data into meaningful forms. In addi-
tion to the increasing volumes of known data forms, we are also collecting new varieties
of data with which we have little experience [10].
In astronomy, GalaxyZoo uses the general intelligence of people in crowdsourcing
systems to classify celestial imagery (and especially point out unknown and anoma-
lous objects in the universe); supervised learning is not effective since there is a limited
amount of training data available for this task, especially given the level of noise in the
images [11]. Analyses of such new data forms often need to be performed with minimal
or no contextual information.
Crowdsourcing is also popular for collecting (noisy) labeled training data for
machine-learning algorithms [12, 13]. As an example, impressive classification perfor-
mance of images has been attained using deep convolutional networks on the ImageNet
database [14], but this is after training over a set of 1000 classes of images with 1000
labeled samples for each class, i.e., one million labeled samples, which were obtained
from costly crowdsourcing [15]. Similarly, Google Translate has achieved near-human-
level translation using deep recurrent neural networks (RNNs) [16] at the cost of large
labeled training corpora on the order of millions of examples [17]. For instance, the
training dataset for English to French consisted of 36 million pairs – far more than what
the average human uses. On the other hand, reinforcement learning methods in their
current form rely on exploring state–action pairs to learn the underlying cost function
by Q-learning. While this might be feasible in learning games such as chess and Go, in
practice such exploration through reinforcement might be expensive. For instance, the
recently successful Alpha Go algorithm uses a combination of supervised and reinforce-
ment learning of the optimal policy, by learning from a dataset consisting of 29.4 million
positions learned from 160 000 games [18]. Beside this dataset, the algorithm learned
from the experience of playing against other Go programs and players. Although col-
lecting such a database and the games to learn from might be feasible for a game of Go,
in general inferring from experience is far more expensive.
Although deep learning has given promising solutions for supervised learning prob-
lems, results for unsupervised learning pale in comparison [19]. Is it possible to make
sense of (training) data without people? Indeed, one of the main goals of artificial
general intelligence has always been to develop solutions that work with minimal
supervision [20]. To move beyond data-intensive supervised learning methods, there
is growing interest in unsupervised learning problems such as density/support estima-
tion, clustering, and independent-component analysis. Careful choice and application
of unsupervised methods often reveal structure in the data that is otherwise not evi-
dent. Good clustering solutions help sort unlabeled data into simpler sub-structures,
not only allowing for subsequent inferential learning, but also adding insights into
the characteristics of the data being studied. This chapter studies information-theoretic
formulations of unsupervised clustering, emphasizing their strengths under limited
contextual information.

9.2 Universality in Information Theory

Claude Shannon identified the fundamental mathematical principles governing data


compression and transmission [21], introducing the field of information theory. Much
work in understanding the fundamental limits of and in designing compression and
communication systems followed, typically under the assumption that statistical knowl-
edge of the source and channel is given, respectively. Information theory has also
subsequently found fundamental limits and optimal algorithms for statistical inference.
However, such contextual knowledge of the statistics governing the system is not
always available. Thus, it is important to be able to design solutions that operate reliably
under an appropriate performance criterion, for any model in a large family. This branch
of exploration, more commonly known as universal information theory, has also helped
understand worst-case fundamental performance limits of systems.
Here we give a very brief overview of prior work in the field, starting with com-
pression and communication, and then highlighting results from universal statistical
inference. As part of our development, we use block-diagrammatic reasoning [22] to
provide graphical summaries.

9.2.1 Universality in Compression and Communication


Compression and communication are the two most extensively studied problems in
information theory. However, besides being of historical significance the two topics also
shed light on problems in statistical inference, such as the clustering problem that we
are interested in.
In his Shannon lecture, Robert Gray described the inherent connection between data
compression and simulation [23]. The problem of simulation aims to design a coder that
takes a random bitstream as input and generates a sequence that has the statistical prop-
erties of a desired distribution [24, 25]. In particular, it is notable that a data-compression
decoder that operates close to the Shannon limit also serves as the nearly optimal coder
for the corresponding simulation problem, highlighting the underlying relation between
lossy data compression and this statistical inference problem. Similarly, the connec-
tion between data compression and the problem of gambling is well studied, in that
a sequence of outcomes over which a gambler can make significant profits is also a
sequence that can be compressed well [26].
Contrarily, in his Shannon lecture, Jorma Rissanen discussed the philosophical
undertones of similarity between data compression and estimation theory with a
disclaimer [27]:

in data compression the shortest code length cannot be achieved without taking advantage of the
regular features in data, while in estimation it is these regular features, the underlying mecha-
nism, that we want to learn . . . like a jigsaw puzzle where the pieces almost fit but not quite, and,
moreover, vital pieces were missing.

This observation is significant not only in reaffirming the connection between the two
fields, but also in highlighting the fact that the unification/translation of ideas often
requires closer inspection and some additional effort.
Compression, by definition, is closely related to the task of data clustering. Consider
the k-means clustering algorithm where each data point is represented by one of k pos-
sible centroids, chosen according to Euclidean proximity (and thereby similar to lossy
data compression under the squared loss function where the compressed representation
can be viewed as a cluster center). Explicit clustering formulations for communication
under the universal setting, wherein the transmitted messages represent cluster labels,
have also been studied [28]; this work and its connection to clustering are elaborated in
Section 9.5.2.
It is evident from history and philosophy that universal compression and commu-
nication may yield much insight into both the design and the analysis of inference
algorithms, especially of clustering. Thus, we give some brief insight into prominent
prior work in these areas.

Compression
Compression, as originally introduced by Shannon [21], considers the task of efficiently
representing discrete data samples, X1 , . . . , Xn , generated by a source independently and
identically according to a distribution PX (·), as shown in Fig. 9.1. Shannon showed that
the best rate of lossless compression for such a case is given by the entropy, H(PX ), of
the source.
A variety of subsequently developed methods, such as arithmetic coding [29], Huff-
man coding [30], and Shannon–Fano–Elias coding [26], approach this limit to perform
optimal compression of memoryless sources. However, they often require knowledge of
the source distribution for encoding and decoding.
On the other hand, Lempel and Ziv devised a universal encoding scheme for com-
pressing unknown sources [31], which has proven to be asymptotically optimal for
strings generated by stationary, ergodic sources [32, 33]. This result is impressive as
it highlights the feasibility of optimal compression (in the asymptotic sense), without
contextual knowledge, by using a simple encoding scheme. Unsurprisingly, this work
also inspired the creation of practical compression algorithms such as gzip. The method
has subsequently been generalized to random fields (such as images) with lesser stor-
age requirements as well [34]. Another approach to universal compression is to first
approximate the source distribution from the symbol stream, and then perform optimal
compression for this estimated distribution [35–37]. Davisson identifies the necessary
and sufficient condition for noiseless universal coding of ergodic sources as being that
the per-letter average mutual information between the parameter that defines the source
distribution and the message is zero [38].
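To make the parsing idea concrete, the following minimal Python sketch illustrates LZ78-style incremental parsing, in which the input string is split into phrases, each being the shortest prefix not seen as a phrase before, and each phrase is encoded by the index of its longest previously seen prefix together with one new symbol. The function name and the toy string are ours, purely for illustration.

# A minimal sketch of LZ78-style incremental parsing: each phrase is encoded as
# (index of its longest proper prefix phrase, new symbol).
def lz78_parse(s):
    dictionary = {"": 0}          # phrase -> index; the empty phrase has index 0
    phrases = []                  # list of (prefix_index, new_symbol) pairs
    current = ""
    for symbol in s:
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate   # keep extending until the phrase is new
        else:
            phrases.append((dictionary[current], symbol))
            dictionary[candidate] = len(dictionary)
            current = ""
    if current:                   # flush a trailing phrase that was already known
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases

# For a highly regular source the number of phrases (and hence the code length)
# grows slowly, which is what drives the asymptotic optimality for stationary
# ergodic sources.
print(lz78_parse("ABBABBABBBAABABAA"))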

Figure 9.1 Model of data compression. The minimum rate of compression for a memoryless
source is given by the entropy of the source.

Beyond the single-terminal case, the problem of separately compressing a correlated


pair of sources was addressed in the seminal work of Slepian and Wolf [39] and later
generalized to jointly stationary ergodic sources [40]. Universal analogs for this problem
have also been derived by Csiszár [41].
Through rate-distortion theory, Shannon also established theoretical foundations of
lossy compression, where the aim is to compress a source subject to a constraint on
the distortion experienced by the decoded bitstream, instead of exact recovery [42, 43].
Universal analogs for lossy compression, albeit limited in comparison with the lossless
setting, have also been explored [44–46]. These are particularly important from the uni-
versal clustering standpoint due to the underlying similarity between the two problems
noted earlier. A comprehensive survey of universal compression can be found in [47]
and references therein.

Communication
In the context of communicating messages across a noisy, discrete memoryless chan-
nel (DMC), W, Shannon established that the rate of communication is limited by the
capacity of the channel [21, 48], which is given by

C(W) = max_{P_X} I(X; Y).                                                (9.1)

Here, the maximum is over all information sources generating the codeword symbol X
that is transmitted through the channel W to receive the noisy symbol Y, as shown in
Fig. 9.2.
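As a small numerical illustration of (9.1), the following Python sketch runs the standard Blahut–Arimoto iteration to maximize I(X; Y) over input distributions for a given channel matrix. The function name, iteration count, and toy channel are illustrative assumptions; for a binary symmetric channel with crossover probability 0.1 the returned estimate should be close to 1 − h(0.1) ≈ 0.53 bits.

import numpy as np

# Blahut-Arimoto iteration for the capacity of a DMC W (rows: inputs, cols: outputs).
def blahut_arimoto(W, n_iter=200):
    n_in = W.shape[0]
    p = np.full(n_in, 1.0 / n_in)                   # start from the uniform input law
    for _ in range(n_iter):
        joint = p[:, None] * W                      # p(x) W(y|x)
        q = joint / joint.sum(axis=0, keepdims=True)   # posterior q(x|y)
        log_r = (W * np.log(q + 1e-300)).sum(axis=1)
        r = np.exp(log_r - log_r.max())
        p = r / r.sum()                             # updated input distribution
    joint = p[:, None] * W
    py = joint.sum(axis=0)
    mi = np.sum(joint * np.log((joint + 1e-300) / (p[:, None] * py[None, :] + 1e-300)))
    return p, mi / np.log(2)                        # capacity estimate in bits

# Binary symmetric channel with crossover 0.1; capacity is 1 - h(0.1), about 0.531 bits.
W = np.array([[0.9, 0.1], [0.1, 0.9]])
print(blahut_arimoto(W))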
Shannon showed that using a random codebook for encoding the messages and joint
typicality decoding achieves the channel capacity. Thus, the decoder requires knowledge
of the channel. Goppa later defined the maximum mutual information (MMI) decoder
to perform universal decoding [49]. The decoder estimates the message as the one with
the codeword that maximizes the empirical mutual information with the received vec-
tor. This decoder is universally optimal in the error exponent [50] and has also been
generalized to account for erasures using constant-composition random codes [51].
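Operationally, the MMI rule computes the joint type of each codeword with the received word and selects the codeword whose empirical mutual information is largest; no channel knowledge is used. The following Python sketch, with illustrative function names and a toy binary codebook, follows this recipe.

import numpy as np
from math import log

# Empirical mutual information computed from the joint type of (x, y).
def empirical_mi(x, y, alphabet_x, alphabet_y):
    n = len(x)
    joint = np.zeros((len(alphabet_x), len(alphabet_y)))
    for a, b in zip(x, y):
        joint[alphabet_x.index(a), alphabet_y.index(b)] += 1.0 / n
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * log(joint[i, j] / (px[i] * py[j]))
    return mi

# MMI decoding: pick the codeword maximizing the empirical mutual information.
def mmi_decode(codebook, y, alphabet_x=(0, 1), alphabet_y=(0, 1)):
    scores = [empirical_mi(x, y, list(alphabet_x), list(alphabet_y)) for x in codebook]
    return int(np.argmax(scores))

# Toy usage: the received word is statistically closer to the first codeword.
codebook = [[0, 0, 1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0, 0, 0]]
y = [0, 0, 1, 1, 0, 1, 1, 1]
print(mmi_decode(codebook, y))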
Universal communication over channels subject to conditions on the decoding error
probability has also been studied [52]. Beside DMCs, the problem of communica-
tion over channels with memory has also been extensively studied both with channel
knowledge [53–55] and in the universal sense [56, 57].
Note that the problem of universal communication translates to one of classification
of the transmitted codewords, on the basis of the message, using noisy versions without
knowledge of the channel. In addition, if the decoder is unaware of the transmission
codebook, then the best one could hope for is to cluster the messages. This problem
has been studied in [28, 58], which demonstrate that the MMI decoder is sub-optimal

Figure 9.2 Model of data communication. The maximum rate of communication for discrete
memoryless channels is given by the capacity of the channel.

for this context, and therefore introduces minimum partition information (MPI) decod-
ing to cluster the received noisy codewords. More recently, universal communication in
the presence of noisy codebooks has been considered [59]. A comprehensive survey of
universal communication under a variety of channel models can be found in [60].

9.2.2 Universality in Statistical Inference


Beside compression and communication, a variety of statistical inference problems
have also been studied from the perspective of universality. Let us review a few such
problems.
The likelihood ratio test is well known to be Bayes-optimal for hypothesis test-
ing when the conditional distributions defining the hypotheses are known. Composite
hypothesis testing, shown in Fig. 9.3, considers null and alternate hypotheses defined
by classes of distributions parameterized by families Θ0 and Θ1 , respectively. In such
contexts, the generalized likelihood ratio, defined by

L_G(y) = sup_{θ ∈ Θ_1} p_θ(y) / sup_{θ ∈ Θ_0} p_θ(y),

is used in place of the likelihood ratio. The generalized likelihood ratio test (GLRT) is
asymptotically optimal under specific hypothesis classes [61].
A universal version of composite hypothesis testing is given by the competitive
minimax formulation [62]:

min_δ  max_{(θ_0, θ_1) ∈ Θ_0 × Θ_1}   P_e(δ | θ_0, θ_1) / inf_{δ′} P_e(δ′ | θ_0, θ_1).

That is, the formulation compares the worst-case error probabilities with the correspond-
ing Bayes error probability of the binary hypothesis test. The framework highlights the
loss from the composite nature of the hypotheses and from the lack of knowledge of
these families, proving to be a benchmark. This problem has also been studied in the
Neyman–Pearson formulation [63].
Universal information theory has also been particularly interested in the problem of
prediction.

Figure 9.3 Composite hypothesis testing: infer a hypothesis without knowledge of the conditional
distribution of the observations. In this and all other subsequent figures in this chapter we adopt
the convention that the decoders and algorithms are aware of the system components enclosed in
solid lines and unaware of those enclosed by dashed lines.

Figure 9.4 Prediction: estimate the next symbol given all past symbols without knowledge of the
stationary, ergodic source distribution.

Figure 9.5 Denoising: estimate a transmitted sequence of symbols without knowledge of one or
both of the channel and source distributions.

Figure 9.6 Multireference alignment: estimate the transmitted sequence up to a cyclic rotation,
without knowledge of the channel distribution and permutation.

In particular, given the sequence X1 , . . . , Xn , the aim is to predict Xn+1 such
that a distortion function is minimized, as shown in Fig. 9.4. A universal prediction


algorithm is designed using the parsing idea of Lempel–Ziv compression in [64] and
has been proven to achieve finite-state predictability in the asymptotic sense. A similar
mechanism is also used for determining wagers in a sequential gambling framework
using finite-state machines [65].
Another problem of interest in the universal information theory framework is that
of denoising. This considers the context of estimating a sequence of symbols X1 , . . . , Xn
transmitted through a channel from the noisy received sequence Y1 , . . . , Yn , without chan-
nel coding to add redundancy (as against communication). This is shown in Fig. 9.5.
Denoising in the absence of knowledge of the source distribution, for a known, discrete
memoryless channel, has been studied in [66]. It has been shown that universal denoising
algorithms asymptotically perform as well as the optimal algorithm that has knowledge
of the source distribution.
For a memoryless source, estimating the transmitted sequence translates to a symbol-
wise m-ary hypothesis test for each symbol without knowledge of a prior distribution. By
restricting their consideration to a DMC with a full-rank transition matrix in this work,
the authors use the statistics of the noisy symbols to estimate the source distribution.
Once the source distribution has been estimated, the sequence is denoised.
The twice universal version of the problem wherein both source and channel distribu-
tions are unknown has also been explored [67]. Given a parameter k, the loss between
the universal denoiser and the optimal kth-order sliding window denoiser is shown to
vanish asymptotically. These problems try to exploit the inherent redundancy (if any) in
the source to predict the input sequences.
A problem closely related to the denoising problem is that of multireference align-
ment [68]. Here an input sequence is corrupted by a channel and permuted from a
given class of feasible permutations as shown in Fig. 9.6. The aim here is to recover the
input sequence, up to a cyclic shift, given the noisy signal. It finds application in signal
alignment and image registration. The sample complexity of the model-aware, Bayes-
optimal maximum a posteriori (MAP) decoder for the Boolean version of the problem
was recently identified in [69]. Multireference alignment with a linear observation
model and sub-Gaussian additive noise was considered in [70]. Here, denoising meth-
ods agnostic to the noise model parameters were defined with order-optimal minimax
performance rates.
The problem of image registration involves identifying the transformation that aligns
a noisy copy of an image to its reference, as shown in Fig. 9.7.

Figure 9.7 Image registration: align a corrupted copy Y to reference X without knowledge of the
channel distribution.

Figure 9.8 Delay estimation: estimate the finite cyclic delay from a noisy, cyclically rotated
sequence without channel knowledge.

Figure 9.9 Clustering: determine the correct partition without knowledge of the conditional
distributions of observations.

Note that the aim here is only to align the copy to the reference and not necessarily to
denoise the signal. The max
mutual information method has been considered for universal image registration [71].
In the universal version of the registration problem with unknown discrete memory-
less channels, the decoder has been shown to be universally asymptotically optimal in
the error exponent for registering two images (but not for more than two images) [72].
The registration approach is derived from that used for universal delay estimation for
discrete channels and memoryless sources, given cyclically shifted, noisy versions of
the transmitted signal, in [73]. The model for the delay estimation problem is shown in
Fig. 9.8.
Classification and clustering of data sources have also been considered from the uni-
versal setting and can broadly be summarized by Fig. 9.9. As described earlier, even
the communication problem can be formulated as one of universal clustering of mes-
sages, encoded according to an unknown random codebook, and transmitted across an
unknown discrete memoryless channel, by using the MPI decoder [58].
Note that the task of clustering such objects is strictly stronger than the universal
binary hypothesis test of identifying whether two objects are drawn from the same source
or not. Naturally, several ideas from the hypothesis literature translate to the design of
clustering algorithms. Binary hypothesis testing, formulated as a classification problem
using empirically observed statistics, and the relation of the universal discriminant func-
tions to universal data compression are elaborated in [74]. An asymptotically optimal
decision rule for the problem under hidden Markov models with unknown statistics in
the Neyman–Pearson formulation is designed and compared with the GLRT in [75].
Unsupervised clustering of discrete objects under universal crowdsourcing was stud-
ied in [76]. In the absence of knowledge of the crowd channels, we define budget-
optimal universal clustering algorithms that use distributional identicality and response
dependence across similar objects. Exponentially consistent universal image clustering
algorithms using multivariate information functionals were also defined in [72].

Figure 9.10 Outlier hypothesis testing: determine outlier without knowledge of typical and outlier
hypothesis distributions.

A problem that is similar to clustering data sources is that of outlier hypothesis testing,
where the aim is to identify data sources that are drawn according to an outlier distribu-
tion. The system model for the outlier hypothesis testing problem is shown in Fig. 9.10.
Note that this forms a strict subset of the clustering problems. The GLRT is studied for
outlier testing in the universal setting where the distributions of the outlier distribution
and typical distribution are unknown [77]. The error exponents of the test and converse
results on the error exponent are also computed.
Note the strong similarities between the various problems of compression, communi-
cation, and statistical inference and the intuition they add to the problem of universal
clustering. We next define information-theoretic formulations of clustering for both
distance- and dependence-based methods, and present some results therein. Before pro-
ceeding, however, we formally define the problem of clustering and discuss various
notions of similarity that are central to the definition of any clustering algorithm.

9.3 Clustering and Similarity

We now formally define the problem of clustering. Consider a set of n objects


{X1 , . . . , Xn }. The term object here is defined broadly to encompass a variety of data
types. For instance, the objects could be random variables drawn according to an under-
lying joint distribution, or a set of hypotheses drawn from a family of distributions, or
even points in a vector space. The definition of objects is application-specific and is
often evident from usage. In general, they may be viewed as random variables, taking
values dictated by the space they belong to. For instance, the objects could be images of
dogs and cats, with each image being a collection of pixels as determined by the kind of
animal in it.
In the following discussion, let P be the set of all partitions of the index set [n] =
{1, . . . , n}. Given a set of constraints to be followed by the clusters, as defined by the
problem, let Pc ⊆ P be the subset of viable partitions satisfying the system constraints.
definition 9.1 (Clustering) A clustering of a set of n objects is a partition P ∈ Pc of
the set [n] that satisfies the set of all constraints. A cluster is a set in the partition and
objects Xi and X j are in the same cluster when i, j ∈ C for some cluster C ∈ P.

In this chapter we restrict the discussion to hard, non-overlapping clustering models


and methods. Some applications consider soft versions of clustering wherein each object
is assigned a probability of belonging to a set rather than a deterministic assignment.
Other versions also consider clusters such that an object can belong to multiple over-
lapping clusters owing to varying similarity criteria. In our discussion, each object is
assigned to a unique cluster.
Data sources have an underlying label associated with them depending on either the
class they are sampled from or the statistics governing their generation. However, in
the absence of training data to identify the optimal label assignment, clustering the data
according to the labels is desirable.
definition 9.2 (Correct clustering) A clustering, P ∈ Pc , is said to be correct if all
objects with the same label are clustered together, and no two objects with differing
labels are clustered together.
The space of partitions includes a natural ordering to compare the partitions, and is
defined as follows.
definition 9.3 (Partition ordering) For P, P′ ∈ P, P is finer than P′ , if the following
ordering holds:

P ⪯ P′ ⇔ for all C ∈ P, there exists C′ ∈ P′ : C ⊆ C′ .

Similarly, P is denser than P′ , P ⪰ P′ ⇔ P′ ⪯ P. If either P ⪯ P′ or P ⪰ P′ , then the partitions
are comparable, P ∼ P′ .
As is evident from Definition 9.2, learning algorithms aim to identify the finest
clustering maximizing the similarity criterion. In the context of designing hierarchical
clustering structures (also known as taxonomies), algorithms identify sequentially finer
partitions, defined by the similarity criterion, to construct the phylogenetic tree.
The performance of a clustering algorithm is characterized by distance metrics such
as the symbol or block-wise error probabilities. Depending on context, a unique correct
clustering may not exist and, in such cases, we might be interested in a hierarchical
clustering of the objects using relative similarity structures.

9.3.1 Universal Clustering Statistical Model


In this chapter we focus on universal clustering of objects drawn according to a
statistical model defined according to the correct clustering of the data sources. Let us
first formalize the data model under consideration. Let the maximum number of labels
be ℓ, and the set of labels, without loss of generality, be L = [ℓ]. Let the true label
of object i be Li ∈ L. Each label has an associated source distribution defined by the
label, which for object i is QLi ∈ {Q1 , . . . , Qℓ }. According to this source distribution, an
m-dimensional latent vector representation, Xi = (Xi,1 , . . . , Xi,m ) ∈ Xm , of the source is
drawn. Depending on the source distribution, X could be a finite or infinite alphabet.
Latent vector representations define the class that the source belongs to. However,
we often get to observe only noisy versions of these data sources. In this respect,
let W(Y1 , . . . , Yn |X1 , . . . , Xn ) be the channel that corrupts the latent vectors to generate
observation vectors Y1 , . . . , Yn ∈ Ym . The specific source or channel models that define
the data samples are application-specific.
A clustering algorithm infers the correct clustering of the data sources using the obser-
vation vectors, such that any two objects are in the same cluster if and only if they share
the same label. The statistical model is depicted in Fig. 9.11.

Figure 9.11 Data-generation model for universal clustering. Here we cluster n data sources with
labels (L1 , . . . , Ln ) = (1, 2, 2, . . . , ℓ). True labels define unknown source distributions. Latent
vectors are drawn from the source distributions and corrupted by the unknown noise channel,
resulting in m-dimensional observation vectors for each source. Universal clustering algorithms
cluster the sources using the observation vectors generated according to this model.
definition 9.4 (Universal clustering) A clustering algorithm Φ is universal if it per-
forms clustering in the absence of knowledge of the source and/or channel distributions,
{Q1 , . . . , Qℓ }, W, and other parameters defining the data samples.

Example 9.1 Let us now consider a simple example to illustrate the statistical model
under consideration. Let μ1 , . . . , μℓ ∈ R^m be the latent vectors corresponding
to the labels, i.e., the source distributions are given by Q j (μ j ) = 1. That is, if source i has
label j, i.e., Li = j, then Xi = μ j with probability one. Let W be a memoryless additive
white Gaussian noise (AWGN) channel, i.e., Yi = Xi + Zi , where Zi ∼ N(0, I).
Since the latent vector representation is dictated by the mean of the observation vec-
tors corresponding to the label, and since the distribution of each observation vector is
Gaussian, the maximum-likelihood estimation of the correct cluster translates to the par-
tition that minimizes the intra-cluster distance of points from the centroids of the cluster.
That is, it is the same problem as that which the k-means algorithm attempts to solve.
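To make the connection explicit, the following minimal Python sketch implements the Lloyd iteration for k-means on data generated exactly as in this example. The number of clusters, the random initialization, and all names are assumptions made only for illustration.

import numpy as np

# Lloyd iteration for k-means: under the point-mass latent model with AWGN noise,
# maximum-likelihood clustering amounts to minimizing intra-cluster squared
# Euclidean distance from the cluster centroids.
def kmeans(Y, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)              # nearest-centroid assignment
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = Y[assign == j].mean(axis=0)
    return assign, centroids

# Synthetic data following the example: two labels with latent vectors mu_1, mu_2
# observed through an AWGN channel.
rng = np.random.default_rng(1)
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = rng.integers(0, 2, size=200)
Y = mu[labels] + rng.normal(size=(200, 2))
assign, _ = kmeans(Y, k=2)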

9.3.2 Similarity Scores and Comparing Clusters


As is evident from the example, the optimal clustering algorithm is dependent on the sta-
tistical model (or family) defining the data, the similarity criterion, and the performance
metric. The three defining characteristics often depend on each other. For instance, in the
example, the Gaussian statistical model in conjunction with the Hamming loss (proba-
bility of clustering error) will result in the Euclidean distance between the vectors being

the distance criterion, and the ML estimate as obtained from the k-means algorithm (with
appropriate initializations) being the optimal clustering algorithm. One can extrapolate
the example to see that a Laplacian model for data generation with Hamming loss would
translate to an L1 -distance criterion among the data vectors.
In the absence of knowledge of the statistics governing data generation, we require
appropriate notions of similarity to define insightful clustering algorithms. Let us define
the similarity score of the set of objects {Y1 , . . . , Yn } clustered according to a partition
P as S (Y1 , . . . , Yn ; P), where S : (Ym )n × P → R. Clustering can equivalently be defined
by a notion of distance between objects, where the similarity can be treated as just the
negative/inverse of the distance.
Identifying the best clustering for a task is thus equivalent to maximizing the intra-
cluster (or minimizing the inter-cluster) similarity of objects over the space of all viable
partitions, i.e.,

P∗ ∈ arg max_{P ∈ Pc} S (Y1 , . . . , Yn ; P).                           (9.2)

Such optimization may not be computationally efficient except for certain similarity
functions, as the partition space is discrete and exponentially large.
In the absence of a well-defined similarity criterion, S , defining the “true” clusters that
satisfy (9.2), the task may be more art than science [2]. Developing contextually appro-
priate similarity functions is thus of standalone interest, and we outline some popular
similarity functions used in the literature.
Efforts have long been made at identifying universal discriminant functions that define
the notion of similarity in classification [74]. Empirical approximation of the Kullback–
Leibler (KL) divergence has been considered as a viable candidate for a universal
discriminant function to perform binary hypothesis testing in the absence of knowledge
of the alternate hypothesis, such that it incurs an error exponent arbitrarily close to the
optimal exponent.
The Euclidean distance is the most common measure of pair-wise distance between
observations used to cluster objects, as evidenced in the k-means clustering algorithm.
A variety of other distances such as the Hamming distance, edit distance, Lempel–Ziv
distance, and more complicated formulations such as those in [78, 79] have also been
considered to quantify similarity in a variety of applications. The more general class of
Bregman divergences have been considered as inter-object distance functions for clus-
tering [80]. On the other hand, a fairly broad class of f -divergence functionals were used
as the notion of distance between two objects in [76]. Both of these studies estimate the
distributions from the data samples Y1 , . . . , Yn and use a notion of pair-wise distance
between the sources on the probability simplex to cluster them.
Clusters can also be viewed as alternate efficient representations of the data, com-
prising sufficient information about the objects. The Kolmogorov complexity is widely
accepted as a measure of the information content in an object [81]. The Kolmogorov
complexity, K(x), of x is the length of the shortest binary program to compute x using a
universal Turing machine. Similarly, K(x|y), the Kolmogorov complexity of x given y, is
the length of the shortest binary program to compute x, given y, using a universal Turing

machine. Starting from the Kolmogorov complexity, in [82] Bennett et al. introduce the
notion of the information distance between two objects X, Y as

E1 (X, Y) = max{K(X|Y), K(Y|X)},

and argue that this is a universal cognitive similarity distance. The smaller the informa-
tion distance between two sources, the easier it is to represent one by the other, and hence
the more similar they are. A corresponding normalized version of this distance, called the
normalized information distance, is obtained by normalizing the information distance
by the maximum Kolmogorov complexity of the individual objects [83] and is a useful
metric in clustering sources according to the pair-wise normalized cognitive similarity.
Salient properties of this notion of universal similarity have been studied [84], and
heuristically implemented using word-similarity scores, computed using Google page
counts, as empirical quantifications of the score [85]. While such definitions are the-
oretically universal and generalize a large class of similarity measures, it is typically
not feasible to compute the Kolmogorov complexity. They do, however, inspire other
practical notions of similarity. For instance, the normalized mutual information between
objects has been used as the notion of similarity to maximize the inter-cluster correlation
[86]. The normalized mutual information has also been used for feature selection [87],
a problem that translates to clustering in the feature space.
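Although the Kolmogorov complexity itself cannot be computed, a commonly used practical proxy, not developed in this chapter, replaces it with the output length of a standard compressor, yielding the normalized compression distance. The short Python sketch below, with zlib standing in for an ideal compressor and all names chosen by us, is only an illustration of this idea.

import zlib

# Normalized compression distance, a computable proxy for the normalized
# information distance: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
# where C(.) is the compressed length under a real compressor.
def compressed_len(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Strings drawn from the same "source" compress well together and get a small NCD.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy dog " * 20
c = bytes(range(256)) * 4
print(ncd(a, b), ncd(a, c))   # the first distance should be noticeably smaller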
Whereas a similarity score guides us toward the design of a clustering algorithm, it
is also important to be able to quantitatively compare and evaluate the results of such
methods. One standard notion of the quality of a clustering scheme is comparing it
against the benchmark set by random clustering ensembles. A variety of metrics based
on this notion have been defined.
Arguably the most popular such index was introduced by Rand [1]. Let P1 , P2 ∈ Pc
be two viable partitions of a set of objects {X1 , . . . , Xn }. Let a be the number of pairs
of objects that are in the same cluster in both partitions and b be the number of pairs
of objects that are in different clusters in both partitions. Then, the Rand index for
deterministic clustering algorithms is given by
R = (a + b) / ( n(n − 1)/2 ),

quantifying the similarity of the two partitions as observed through the fraction of pair-
wise comparisons they agree on. For randomized algorithms, this index is adjusted using
expectations of these quantities.
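For concreteness, the Rand index can be computed directly from its definition by counting the object pairs on which two partitions agree; the following Python sketch, with function names of our choosing and partitions given as lists of index sets, does exactly this.

from itertools import combinations

# Rand index: fraction of object pairs on which the two partitions agree
# (either both place the pair in one cluster, or both separate it).
def rand_index(P1, P2):
    def labels(P):
        lab = {}
        for k, cluster in enumerate(P):
            for i in cluster:
                lab[i] = k
        return lab
    l1, l2 = labels(P1), labels(P2)
    items = sorted(l1)
    agree = sum(
        (l1[i] == l1[j]) == (l2[i] == l2[j]) for i, j in combinations(items, 2)
    )
    return agree / (len(items) * (len(items) - 1) / 2)

# Two partitions of six objects that agree on most pairwise co-membership decisions.
P1 = [{0, 1, 2}, {3, 4, 5}]
P2 = [{0, 1}, {2}, {3, 4, 5}]
print(rand_index(P1, P2))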
A quantitative comparison of clustering methods can also be made through the nor-
malized mutual information [88]. Specifically, for two partitions P1 , P2 , if p̂(C1 ,C2 ) is
the fraction of objects that are in cluster C1 in P1 and in cluster C2 in P2 , and if p̂1 , p̂2
are the corresponding marginal distributions, then the normalized mutual information is
defined as

NMI(P1 , P2 ) = 2 I( p̂) / ( H( p̂1 ) + H( p̂2 ) ).
Naturally the NMI value is higher for similar clustering algorithms.
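A matching sketch for the normalized mutual information, again with partitions represented as lists of index sets and with illustrative function names, is given below; identical partitions attain the maximal value of one.

import numpy as np

# Normalized mutual information between two partitions: p_hat(C1, C2) is the
# fraction of objects in cluster C1 of P1 and cluster C2 of P2, and the mutual
# information of p_hat is normalized by the sum of the marginal entropies.
def nmi(P1, P2):
    n = sum(len(c) for c in P1)
    p = np.array([[len(c1 & c2) / n for c2 in P2] for c1 in P1])
    p1, p2 = p.sum(axis=1), p.sum(axis=0)
    def H(q):
        q = q[q > 0]
        return -(q * np.log(q)).sum()
    mi = sum(
        p[i, j] * np.log(p[i, j] / (p1[i] * p2[j]))
        for i in range(len(P1)) for j in range(len(P2)) if p[i, j] > 0
    )
    return 2 * mi / (H(p1) + H(p2))

P1 = [{0, 1, 2}, {3, 4, 5}]
P2 = [{0, 1}, {2}, {3, 4, 5}]
print(nmi(P1, P2))   # identical partitions would give 1.0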

Whereas comparing against the baselines established by random clustering ensembles


helps identify structure in the result, the permutation model is often used to gener-
ate these ensembles. Drawbacks of using the same method across applications and the
variation in the evaluations across different models have been identified [89].
Apart from intuitive quantifications such as these, it has long been argued that the task
of clustering is performed only as well as it is evaluated to be done by the humans who
evaluate it [2]. The inherently subjective nature of the task has evoked calls for human-
based evaluation of clustering methods [90]; some of the standard clustering quality
measures are at times in sharp contrast with human evaluations.
Clustering is thus best understood when the appropriate notions of similarity, or
equivalently the statistical class generating the data, are available, and when the meth-
ods can be benchmarked and evaluated using rigorous mathematical notions or human
test subjects. We now describe clustering algorithms defined under distance- and
dependence-based similarity notions.

9.4 Distance-Based Clustering

The most prominent criterion for clustering objects is based on a notion of distance
between the samples. Conventional clustering tools in machine learning such as k-
means, support vector machines, linear discriminant analysis, and k-nearest-neighbor
classifiers [91–93] have used Euclidean distances as the notion of distance between
objects represented as points in a feature space. As in Example 9.1, these for instance
translate to optimal universal clustering methods under Gaussian observation models
and Hamming loss. Other similar distances in the feature-space representation of the
observations can be used to perform clustering universally and optimally under varying
statistical and loss models.
In this section, however, we restrict the focus to universal clustering of data sources
in terms of the notions of distance in the conditional distributions generating them. That
is, according to the statistical model of Fig. 9.11, we cluster according to the distance
between the estimated conditional distributions of p(Yi |Li ). We describe clustering under
sample independence in detail first, and then highlight some distance-based methods
used for clustering sources with memory.

9.4.1 Clustering under Independence and Identicality


Let us first consider universal clustering under the following restrictions of independence
in data sources. Let us assume that the latent vectors corresponding to the same label are
generated as independent and identical samples of a source distribution and that the noise
channel is memoryless across sources and samples of the vector. That is, each sample of
an observation vector is independent and identically distributed (i.i.d.), and observations
of the same label are drawn according to identical distributions. The effective model is
depicted in Fig. 9.12.
The iterative clustering mechanism of the k-means algorithm is, for instance, used by
the Linde–Buzo–Gray algorithm [94] under the Itakura–Saito distance. Note that this
is equivalent to data clustering using pair-wise distances between the spectral densities
of the sources.

Figure 9.12 Data-generation model for universal distance-based clustering. Here we cluster n
sources with labels (L1 , . . . , Ln ) = (1, 2, 2, . . . , ℓ). True labels define the source distributions and,
for each source, m i.i.d. samples are drawn to generate the observation vectors. Source
identicality is used to cluster the sources.

A similar algorithm using KL divergences between probability distri-
butions was considered in [95]. More generally, the k-means clustering algorithm can
be extended directly to solve the clustering problem under a variety of distortion func-
tions between the distributions corresponding to the sources, in particular, the class of
Bregman divergences [96].
definition 9.5 (Bregman divergence) Let f : R+ → R be a strictly convex function,
and let p1 , p2 be two probability mass functions (p.m.f.s), represented by vectors. The
Bregman divergence is defined as

B_f (p1 ‖ p2 ) = f (p1 ) − f (p2 ) − ⟨∇ f (p2 ), p1 − p2 ⟩.              (9.3)

The class of Bregman divergences includes a variety of distortion functions such as


the squared loss, KL divergence, Mahalanobis distance, I-divergence, and Itakura–Saito
distance. Clustering according to various such distortion functions translates to varying
notions of similarity between the sources. Each of these similarity notions proves to be
an optimal universal clustering method for corresponding families of statistical models
and loss functions.
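A small Python sketch of (9.3) is given below for two standard generators: the squared Euclidean norm, which recovers the squared loss, and the negative entropy, which recovers the KL divergence on the probability simplex. The function names are ours, and the snippet is only a numerical check of the definition.

import numpy as np

# Bregman divergence B_f(p || q) = f(p) - f(q) - <grad f(q), p - q>.
def bregman(f, grad_f, p, q):
    return f(p) - f(q) - np.dot(grad_f(q), p - q)

sq = lambda x: float(np.dot(x, x))              # squared Euclidean norm
sq_grad = lambda x: 2 * x
negent = lambda x: float(np.sum(x * np.log(x))) # negative Shannon entropy
negent_grad = lambda x: np.log(x) + 1

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, p, q))          # squared Euclidean distance ||p - q||^2
print(bregman(negent, negent_grad, p, q))  # KL divergence D(p || q)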
Hard and soft clustering algorithms based on minimizing an intra-cluster Bregman
divergence functional were defined in [80]. In particular, the clustering task is posed as a
quantization problem aimed at minimizing the loss in Bregman information, which is a
generalized notion of mutual information defined with respect to Bregman divergences.
Iterative solutions are developed for the clustering problem, as in the k-means method.
Co-clustering of rows and columns of a doubly stochastic matrix, with each row and
column treated as a p.m.f., has also been considered under the same criterion [97].
In [76], we consider the task of clustering n discrete objects according to their labels
using the responses of temporary crowd workers. Here, for each object, we collect a
set of m independent responses from unknown independent workers drawn at random
from a worker pool. Thus, the m responses to a given object are i.i.d., conditioned on a
label. Clustering is performed using the empirical response distribution for each object.
Consistent clustering involves grouping objects that have the same conditional response
distributions.

In particular, we cluster objects that have empirical distributions that are close to each
other in the sense of an f -divergence functional.
definition 9.6 ( f -divergence) Let p, q be discrete probability distributions defined on
a space of m alphabets. Given a convex function f : [0, ∞) → R, the f -divergence is
defined as
D_f (p ‖ q) = Σ_{i=1}^{m} q_i f ( p_i / q_i ).                           (9.4)

The function f is said to be normalized if f (1) = 0.


We consider clustering according to a large family of pair-wise distance functionals
defined by f -divergence functionals with convex functions f such that
(1) f is twice differentiable on [r, R], and
(2) there exist real constants c,C < ∞ such that

c ≤ x f ″(x) ≤ C, for all x ∈ (r, R).

Using empirical estimates of conditional distributions of observations, we define


a weighted graph G such that the weights are defined by w(i, j) = d_ij · 1{ d_ij > γ_m },
where d_ij = D̂_f (Yi ‖ Y j ) is the plug-in estimate of the f -divergence. Here, a
sufficiently small threshold is chosen according to the number of samples m per object,
decaying polynomially with m. Then the clustering is performed by identifying the
minimum-weight partition such that the clusters are maximal cliques in the graph, as
shown in Fig. 9.13.
This universal clustering algorithm is not only statistically consistent, but also incurs
order-optimal sample complexity [98]. In particular, the number of responses per object,
m, for reliable clustering scales as m = O(log n) with the number of objects n to clus-
ter, which is order-optimal. That is, in the absence of the statistics defining worker
responses to objects and their reliabilities, the universal clustering algorithm performs
budget-optimal reliable clustering of the discrete set of objects.
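A much-simplified sketch of this style of algorithm is given below: it forms plug-in empirical distributions, thresholds a pair-wise f-divergence (total variation, for simplicity), and returns the connected components of the resulting graph as clusters. The actual scheme of [76] uses a weighted, maximal-clique formulation and a carefully chosen threshold; the code, names, and hand-picked threshold here are illustrative only.

import numpy as np
from collections import Counter

# Empirical response distribution of one object over a finite alphabet.
def empirical_pmf(responses, alphabet):
    counts = Counter(responses)
    return np.array([counts[a] / len(responses) for a in alphabet])

# Total variation distance, an instance of an f-divergence.
def tv_distance(p, q):
    return 0.5 * np.abs(p - q).sum()

def threshold_cluster(objects, alphabet, gamma):
    pmfs = [empirical_pmf(r, alphabet) for r in objects]
    n = len(objects)
    adj = [[tv_distance(pmfs[i], pmfs[j]) < gamma for j in range(n)] for i in range(n)]
    labels, next_label = [-1] * n, 0
    for i in range(n):                       # connected components by DFS
        if labels[i] < 0:
            stack = [i]
            while stack:
                u = stack.pop()
                if labels[u] < 0:
                    labels[u] = next_label
                    stack.extend(v for v in range(n) if adj[u][v] and labels[v] < 0)
            next_label += 1
    return labels

# Three objects: the first two share a response distribution, the third does not.
objects = [[0, 0, 1, 0, 1, 0], [0, 1, 0, 0, 0, 1], [1, 1, 1, 0, 1, 1]]
print(threshold_cluster(objects, alphabet=[0, 1], gamma=0.25))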
Outlier hypothesis testing is a problem closely related to clustering and has been
considered extensively in the universal context. In this problem, the set of labels is
L = {0} ∪ S, where S is a set of parameters. We have a collection of n sources of i.i.d.
data, of which a majority follow a typical distribution π (label 0), and the rest are outliers

Figure 9.13 Distance-based clustering of nine objects of three types according to [76]. The graph
is obtained by thresholding the f -divergences of empirical response distributions. The clustering
is then done by identifying the maximal clusters in the thresholded graph.

whose data are generated according to a distribution p s for some s ∈ S. In general the
set of parameters can be uncountably infinite, and in this sense the set of labels need
not necessarily be discrete. However, this does not affect our study of clustering of a
finite collection of sources. We are interested in separating the outliers from the typical
sources, and thus we wish to cluster the sources into two clusters. In the presence of
knowledge of the typical and outlier distributions, the optimal detection rule is charac-
terized by the generalized likelihood ratio test (GLRT) wherein each sequence can be
tested for being an outlier as

\delta(y^m) : \ \frac{\max_{s \in S} \prod_{j \in [m]} p_s(y_j)}{\prod_{j \in [m]} \pi(y_j)} \ \underset{0}{\overset{1}{\gtrless}} \ \eta,

where the threshold η is chosen depending on the priors of the typical and outlier
distributions, and decision 1 implies that the sequence is drawn from an outlier.
In universal outlier hypothesis testing [77], we are unaware of the typical and out-
lier hypothesis distributions. The aim is then to design universal tests such that we
can detect the outlier sequences universally. We know that, for any distribution p and
a sequence of i.i.d. observations y^m drawn according to this distribution, if \hat{p} is the
empirical distribution of the observed sequence, then

p(y^m) = \exp\!\big(-m\,(H(\hat{p}) + D(\hat{p} \,\|\, p))\big).   (9.5)

Then any likelihood ratio for two distributions essentially depends on the difference
in KL divergences of the corresponding distributions from the empirical estimate.
Hence, the outlier testing problem is formulated as one of clustering typical and out-
lier sequences according to the KL divergences of the empirical distributions from
the cluster centroids by using (9.5) in a variety of settings of interest [99–101].
Efficient clustering-based outlier detection methods, with a linear computational com-
plexity in the number of sources, with universal exponential consistency have also
been devised [102]. Here the problem is addressed by translating it into one of clus-
tering according to the empirical distributions, with the KL divergence as the similarity
measure.
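
Since (9.5) is an exact identity, it is easy to verify numerically; the following short check (ours, not from [77]) compares the i.i.d. log-likelihood of a sequence against -m(H(p̂) + D(p̂ ‖ p)).

import numpy as np

def entropy(q):
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
y = rng.choice(3, size=1000, p=p)              # i.i.d. observations y^m
m = len(y)
p_hat = np.bincount(y, minlength=3) / m        # empirical distribution of y^m
log_likelihood = float(np.sum(np.log(p[y])))   # log p(y^m)
rhs = -m * (entropy(p_hat) + kl(p_hat, p))     # log of the right side of (9.5)
print(log_likelihood, rhs)                     # identical up to floating-point error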
A sequential testing variation of the problem has also been studied [103], where the
sequential probability ratio test (SPRT) is adapted to incorporate (9.5) in the likelihood
ratios. In the presence of a unique outlier distribution μ, the universal test is exponen-
tially consistent with the optimal error exponent given by 2B(π, μ), where B(p, q) is the
Bhattacharyya distance between the distributions p and q. Further, this approach is also
exponentially consistent when there exists at least one outlier, and it is consistent in the
case of the null hypothesis (no outlier distributions) [77].

9.4.2 Clustering Sources with Memory


Let us now consider universal clustering models using sources with memory. That is,
for each label, the latent vector is generated by a source with memory and thus the

components of the vector are dependent. Let us assume that the channel W remains
memoryless, as before. For such sources, we now describe some novel universal clustering
methods.
As noted earlier, compressed streams represent cluster centers either as themselves
or as clusters in the space of the codewords. In this sense, compression techniques
have been used in defining clustering algorithms for images [104]. In particular, a vari-
ety of image segmentation algorithms have been designed starting from compression
techniques that fundamentally employ distance-based clustering mechanisms over the
sources generating the image features [105–107].
Compression also inspires more distance-based clustering algorithms, using the com-
pression length of each sequence when compressed by a selected compressor C. The
normalized compression distance between sequences X, Y is defined as

NCD(X, Y) = \frac{C(X, Y) - \min\{C(X), C(Y)\}}{\max\{C(X), C(Y)\}},   (9.6)

where C(X, Y) is the length of compressing the pair. Thus, the normalized compression
distance represents the closeness of the compressed representations of the sequence,
therein translating to closeness of the sources as dictated through the compressor C.
Thus, the choice of compression scheme here characterizes the notion of similarity
and the corresponding universal clustering model it applies to. Clustering schemes
based on the NCD minimizing (maximizing) the intra-cluster (inter-cluster) distance
are considered in [108].
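
As an illustration of (9.6), any off-the-shelf compressor can stand in for C; the sketch below uses zlib, which is only one possible choice and, as noted above, determines the induced notion of similarity.

import zlib

def clen(x: bytes) -> int:
    """Length of x under the chosen compressor C (zlib here)."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance of (9.6)."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"abcabcabc" * 50
b = b"abcabcabc" * 48 + b"abxabc" * 2
c = bytes(range(256)) * 4
print(ncd(a, b))   # small: the sequences share structure
print(ncd(a, c))   # close to 1: the sequences are dissimilar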
Clustering of stationary, ergodic random processes according to the source distribu-
tion has also been studied, wherein two objects are in the same cluster if and only if
they are drawn from the same distribution [109]. The algorithm uses empirical esti-
mates of a weighted L1 distance between the source distributions as obtained from the
data streams corresponding to each source. Statistical consistency of the algorithm is
established for generic ergodic sources. This method has also been used for clustering
time-series data [110].
The works we have summarized here represent a small fraction of the distance-based
clustering methods in the literature. It is important to notice the underlying thread of
universality in each of these methods as they do not assume explicit knowledge of the
statistics that defines the objects to be clustered. Thus we observe that, in the universal
framework, one is able to perform clustering reliably, and often with strong guaran-
tees such as order-optimality and exponential consistency using information-theoretic
methods.

9.5 Dependence-Based Clustering

Another large class of clustering algorithms consider the dependence among random
variables to cluster them. Several applications such as epidemiology and meteorology,
for instance, generate temporally or spatially correlated sources of information with an
element of dependence across sources belonging to the same label. We study the task

of clustering such sources in this section, highlighting some of these formulations and
universal solutions under two main classes – independence clustering using graphical
models and clustering using multivariate information functionals.

9.5.1 Independence Clustering by Learning Graphical Models


In this section, we consider the problem of independence clustering of a set of random
variables, defined according to an underlying graphical model, in a universal sense.
definition 9.7 (Independence clustering) Consider a collection of random variables
(X1 , . . . , Xn ) ∼ QG , where QG is the joint distribution defined according to a graphical
model defined on the graph G = ([n], E). Independence clustering of this set of random
variables identifies the finest partition P∗ ∈ P such that the clusters of random variables
are mutually independent [111].
Before we study the problem in detail, it is important to note that this problem is fun-
damentally different from independent component analysis (ICA) [112], where the idea
is to learn a linear transformation such that the components are mutually independent.
Several studies have also focused on learning the dependence structures [113] or their
optimal tree-structured approximations [114]. In its most general form, recovering the
Bayesian network structure of a collection of random variables from data is known to be
NP-hard, even for Ising models [115, 116]. Independence clustering essen-
tially requires us to identify the connected components of the Bayesian network and
hence, though simpler, cannot always be efficiently learned from data. Additionally, as
we are interested in mutual independence among clusters, pair-wise measurements are
insufficient, translating to prohibitive sample complexities.
It has been observed that under constraints like bounded in-degrees, or under restricted
statistical families, the recovery of a graphical model can in fact be performed efficiently
[117–119].

Example 9.2 Consider a collection of n jointly Gaussian random variables


(X1 , . . . , Xn ) ∼ N(0, Σ), represented by the Gaussian graphical model defined on the graph
G = ([n], E). For the Gaussian graphical model, the adjacency matrix of the graph is
equivalent to the support of the precision matrix Σ−1 . Hence, learning the graphical
model structure is equivalent to identifying the support of the precision matrix.
Since independence clustering seeks only to identify mutual independence, for the
Gaussian graphical model, we are interested in identifying the block diagonal decompo-
sition of the precision matrix, i.e., elements of each block diagonal form the clusters.
Identifying the non-zero blocks of the precision matrix (independence clusters) is
exactly equivalent to identifying the non-zero blocks of the covariance matrix, making
the task of independence clustering for Gaussian graphical models trivial in terms of
both computational complexity, O\!\big(nk - \binom{k+1}{2}\big) for k clusters, and sample complexity
(which depends on the exact loss function, but the worst-case complexity is O(n)).
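
A minimal sketch of this observation, assuming the covariance matrix is known (or has already been estimated): the independence clusters are simply the connected components of the support graph of Σ, with the tolerance standing in for whatever estimation error must be accounted for in practice.

import numpy as np

def gaussian_independence_clusters(sigma, tol=1e-8):
    """Connected components of the support graph of the covariance
    (equivalently, of the precision matrix) give the independence clusters."""
    n = sigma.shape[0]
    adj = np.abs(sigma) > tol
    labels, cluster = -np.ones(n, dtype=int), 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if labels[u] < 0:
                labels[u] = cluster
                stack.extend(v for v in range(n) if adj[u, v] and labels[v] < 0)
        cluster += 1
    return labels

# Block-diagonal covariance: {X_0, X_1} and {X_2, X_3} are mutually independent.
sigma = np.array([[1.0, 0.6, 0.0, 0.0],
                  [0.6, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, -0.4],
                  [0.0, 0.0, -0.4, 1.0]])
print(gaussian_independence_clusters(sigma))   # [0 0 1 1]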

In this section we study the problem of independence clustering under similar con-
straints on the statistical models to obtain insightful results and algorithms. A class of
graphical models that has been well understood is that of the Ising model [120–122].
Most generally, a fundamental threshold for conditional mutual information has been
used in devising an iterative algorithm for structure recovery in Ising models [122]. For
the independence clustering problem, we can use the same threshold to design a simple
extension of this algorithm to define an iterative independence clustering algorithm for
Ising models.
Beside such restricted distribution classes, a variety of other techniques involving
greedy algorithms [123], maximum-likelihood estimation [124], variational methods
[125], locally tree-like grouping strategies [126], and temporal sampling [127] have been
studied for dependence structure recovery using samples, for all possible distributions
under specific families of graphs and alphabets. The universality of these methods in
recovering the graphical model helps define direct extensions of these algorithms to
perform universal independence clustering.
Clustering using Bayesian network recovery has been studied in a universal crowd-
sourcing context [76], where the conditional dependence structure is used to identify
object similarity. We consider object clustering using responses of a crowdsourcing sys-
tem with long-term employees. Here, the system employs m workers such that each
worker labels each of the n objects. The responses of a given worker are assumed to
be dependent in a Markov fashion on the responses to the previous object and the most
recent object of the same class as shown in Fig. 9.14.
More generally, relating to the universal clustering model of Fig. 9.11, this version
of independence clustering reduces to Fig. 9.15. That is, the crowd workers observe the
object according to their true label, and provide a label appropriately, depending on the
observational noise introduced by the channel W. Thus, the latent vector representation
is a simple repetition code of the true label. The responses of the crowd workers are
defined by the Markov channel W, such that the response to an object is dependent on
that offered for the most recent object of the same label, and the most recent object.
Here, we define a clustering algorithm that identifies the clusters by recovering the
Bayesian network from the responses of workers. In particular, the algorithm com-
putes the maximum-likelihood (ML) estimates of the mutual information between the
responses to pairs of objects and, using the data-processing inequality, reconstructs the
Bayesian network defining the worker responses. As elaborated in [98] the number of
responses per object required for reliable clustering of n objects is m = O(log n), which is
order-optimal. The method and algorithm directly extend to graphical models with any
finite-order Markov dependences.
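
The first step of this procedure, the ML (plug-in) estimation of the pairwise mutual information between responses, is sketched below on synthetic responses; the subsequent use of the data-processing inequality to prune indirect dependences and reconstruct the Bayesian network, as in [76], is not reproduced here.

import numpy as np

def plugin_mutual_information(x, y, kx, ky):
    """ML (plug-in) estimate of I(X; Y) in nats from paired discrete samples."""
    m = len(x)
    joint = np.zeros((kx, ky))
    for a, b in zip(x, y):
        joint[a, b] += 1.0 / m
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])))

# m workers respond to two objects; same-label objects induce dependent responses.
rng = np.random.default_rng(2)
m, k = 500, 4
r1 = rng.integers(0, k, size=m)
r_same = np.where(rng.random(m) < 0.8, r1, rng.integers(0, k, size=m))  # noisy copy
r_diff = rng.integers(0, k, size=m)                                     # unrelated object
print(plugin_mutual_information(r1, r_same, k, k))   # large
print(plugin_mutual_information(r1, r_diff, k, k))   # near zero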

Figure 9.14 Bayesian network model of responses to a set of seven objects chosen from a set of
three types in [76]. The most recent response and the response to the most recent object of the
same type influence a response.

Figure 9.15 Block diagram of independence clustering under a temporal dependence structure.
Here we cluster n objects with labels (L_1, . . . , L_n) = (1, 2, 2, . . .). The latent vectors are
m-dimensional repetitions of the true label. The observations are drawn according to the Markov
channel that introduces dependences in observations across objects depending on the true labels.
Clustering is performed according to the inferred conditional dependences in the observations.

A hierarchical approach toward independence clustering is adopted by Ryabko in


[111]. Here Ryabko first designs an algorithm to identify the clusters when the joint
distribution is given, by recursively dividing each cluster into two, as long as it is feasi-
ble. Then the idea is extended to clustering from data samples, wherein the criterion for
splitting into clusters can be obtained through a statistical test for conditional indepen-
dence [128–131]. The universal method is proven to be consistent for stationary, ergodic,
unknown sources when the number of clusters is given, as the conditional independence
test is reliable, given a sufficient number of samples.
The relationship between compression and clustering within the independence clus-
tering framework emerges through the information-bottleneck principle [132, 133]. This
problem considers quantization of an observation in the presence of latent variables, such
that the random variable is maximally compressed up to a requirement of a minimum
amount of information on the latent variable. Specifically, consider the Markov chain

Y → X → X̂.
Let Y be the latent variable representing the object label, and let X be the observation.
The quantized representation of X, denoted X̂, represents the cluster it is associated with. If the
(randomized) clustering is performed according to the conditional distribution p(x̂|x),
then the information-bottleneck principle aims to minimize the Lagrangian

L(p(\hat{x}|x)) = I(X; \hat{X}) - \beta I(\hat{X}; Y),   (9.7)

where β ≥ 0 is the Lagrange multiplier, as against the unconstrained maximization of
I(X̂; Y). It has been observed that the optimal assignment minimizing (9.7) is given by
the exponential family

p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x; \beta)} \exp\!\big(-\beta D(p(Y|x) \,\|\, p(Y|\hat{x}))\big),   (9.8)
and can be computed iteratively using a method similar to the Blahut–Arimoto algorithm
[134]. This framework extends naturally to the clustering of multiple random variables
according to the latent variable, and corresponding universal clustering methods can be
defined according to the information-bottleneck constraints.
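
A minimal sketch of these self-consistent updates, assuming the joint distribution p(x, y) and the number of clusters are given: the routine below alternates the assignment rule (9.8) with the re-estimation of p(x̂) and p(Y|x̂), in the spirit of the Blahut–Arimoto-style iteration mentioned above, and is not the reference implementation of [132].

import numpy as np

def kl_rows(a, b, eps=1e-12):
    """d[i, j] = D(a_i || b_j) for row-stochastic matrices a (n x |Y|) and b (k x |Y|)."""
    a, b = np.maximum(a, eps), np.maximum(b, eps)
    return np.einsum('iy,iy->i', a, np.log(a))[:, None] - a @ np.log(b).T

def information_bottleneck(p_xy, k, beta, iters=200, seed=0):
    """Iterative minimization of the Lagrangian (9.7) via the fixed point (9.8)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    q = rng.random((len(p_x), k))                 # soft assignment p(xhat | x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(iters):
        p_xhat = np.maximum(p_x @ q, 1e-12)       # p(xhat)
        p_y_given_xhat = (q * p_x[:, None]).T @ p_y_given_x / p_xhat[:, None]
        d = kl_rows(p_y_given_x, p_y_given_xhat)  # D(p(y|x) || p(y|xhat))
        log_q = np.log(p_xhat)[None, :] - beta * d
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)         # the normalization Z(x; beta)
    return q

# Four x-symbols whose conditionals p(y|x) fall into two groups; k = 2 clusters.
p_xy = np.array([[0.20, 0.05],
                 [0.18, 0.07],
                 [0.05, 0.20],
                 [0.07, 0.18]])
p_xy /= p_xy.sum()
print(np.round(information_bottleneck(p_xy, k=2, beta=20.0), 2))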

The rate of information bottleneck (hypercontractivity coefficient) for random variables X, Y is defined as

s(X; Y) = \sup_{U - X - Y} \frac{I(U; Y)}{I(U; X)},
where U−X−Y is a Markov chain and U is viewed as a summary of X. This func-
tional has been used as a measure of correlation between random variables and
has been used in performing correlation-based clustering [135]. Other variants of
the information-bottleneck method including multivariate [136], Gaussian [137], and
agglomerative formulations [138, 139] have also been studied extensively, forming
information-constrained universal clustering methods.
Thus we saw in this section that universal clustering algorithms can be defined for ran-
dom variables drawn according to a graphical model, using reliable, and often efficient,
structure recovery methods.

9.5.2 Multivariate Information-Based Clustering


Whereas independence clustering aims to recover mutually independent components,
such components rarely exist in practice. By analogy to the distance clustering frame-
work, we can consider a relaxed, and more practical notion of clustering that clusters
components such that the intra-cluster information is maximized.
Mutual information has been considered as a valid measure of pair-wise similarity
to perform clustering of random variables according to independence [140]. In par-
ticular, it has been prominent in the area of genomic clustering and gene-expression
evaluation [141, 142]. Hierarchical clustering algorithms using mutual information as
the pair-wise similarity functional have also been defined [143, 144]. Mutual informa-
tion is also a prominent tool used in clustering features to perform feature extraction in
big data analysis [87].
However, mutual information quantifies only pair-wise dependence and can miss non-
linear dependences in the data. For instance, consider objects X1 , X2 ∼ Bern(1/2) drawn
independently of each other, and let X3 = X1 ⊕ X2 . While any two of the three are pair-
wise independent, collectively the three objects in the set are dependent on each other.
Hence a meaningful cluster might require all three to be grouped in the same set, which
would not be the case if mutual information were used as the similarity score.
In applications that require one to account for such dependences, multivariate infor-
mation functionals prove to be an effective alternative. One such information functional
that is popular for clustering is the minimum partition information [145].
definition 9.8 (Partition information) Let X1 , . . . , Xn be a set of random variables, and
let P ∈ P be a partition of [n] with |P| > 1. Then, the partition information is defined as
I_P(X_1; \ldots; X_n) = \frac{1}{|P| - 1}\left(\sum_{C \in P} H(X_C) - H(X_1, \ldots, X_n)\right),   (9.9)

where H(XC ) is the joint entropy of all random variables in the cluster C.

The partition information quantifies the inter-cluster information, normalized by the


size of the partition.
lemma 9.1 For any partition P ∈ P, IP (X1 ; . . . ; Xn ) ≥ 0, with equality if and only if the
clusters of P are mutually independent.
Proof Let (X1 , . . . , Xn ) ∼ Q. Then, the results follow from the observation that
I_P(X_1; \ldots; X_n) = \frac{1}{|P| - 1}\, D\!\left(Q \,\Big\|\, \prod_{C \in P} Q_C\right) \geq 0,

where QC is the marginal distribution of XC .

Example 9.3 Consider a set of jointly Gaussian random variables (X_1, . . . , X_n) ∼ N(0, Σ).
Then, for any partition P ∈ P and any cluster C ∈ P,

H(X_C) = \frac{1}{2}\big(|C| \log(2\pi e) + \log(|\Sigma_C|)\big),
where ΣC is the covariance submatrix of Σ corresponding to the variables in the cluster,
and |ΣC | is the determinant of the matrix. Here of course we use the differential entropy.
Thus, the partition information is given by
I_P(X_1; \ldots; X_n) = \frac{1}{2(|P| - 1)}\left(\sum_{C \in P} \log(|\Sigma_C|) - \log(|\Sigma|)\right)   (9.10)

= \frac{1}{2(|P| - 1)} \log\!\left(\frac{|\Sigma_P|}{|\Sigma|}\right),   (9.11)
where [ΣP ]i, j = [Σ]i, j 1{i, j ∈ C for some C ∈ P}, for all i, j ∈ [n]2 . This is essentially the
block diagonal matrix obtained according to the partition P. The clusters of P are
mutually independent if and only if ΣP = Σ.
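
In the Gaussian case, (9.10) and (9.11) reduce the partition information to a log-determinant computation, as in the following minimal sketch (our own helper names):

import numpy as np

def gaussian_partition_information(sigma, partition):
    """Partition information (9.9) of N(0, Sigma) via (9.10)-(9.11);
    partition is a list of index lists with at least two clusters."""
    logdet = lambda m: np.linalg.slogdet(m)[1]
    total = sum(logdet(sigma[np.ix_(c, c)]) for c in partition)
    return 0.5 * (total - logdet(sigma)) / (len(partition) - 1)

# Two strongly coupled pairs {0, 1} and {2, 3} that are only weakly coupled to each other.
sigma = np.array([[1.0, 0.8, 0.1, 0.1],
                  [0.8, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.8],
                  [0.1, 0.1, 0.8, 1.0]])
print(gaussian_partition_information(sigma, [[0, 1], [2, 3]]))   # small
print(gaussian_partition_information(sigma, [[0, 2], [1, 3]]))   # much larger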

From the definition of the partition information, it is evident that the average
inter-cluster (intra-cluster) dependence is minimized (maximized) by a partition that
minimizes the partition information.
definition 9.9 (Minimum partition information) The minimum partition information
(MPI) of a set of random variables X1 , . . . , Xn is defined as
I(X_1; \ldots; X_n) = \min_{P \in \mathcal{P}, |P| > 1} I_P(X_1; \ldots; X_n).   (9.12)

The finest partition minimizing the partition information is referred to as the fundamental
partition, i.e., if \mathcal{P}^* = \arg\min_{P \in \mathcal{P}, |P| > 1} I_P(X_1; \ldots; X_n) denotes the set of minimizers,
then the fundamental partition P^* ∈ \mathcal{P}^* is the one that is finer than every other partition in \mathcal{P}^*.
The MPI finds operational significance as the capacity of the multiterminal secret-key
agreement problem [146]. More recently, the change in MPI with the addition or removal
of sources of common randomness was identified [147], giving us a better understanding
of the multivariate information functional.

The MPI finds functional use in universal clustering under communication, in the
absence of knowledge of the codebook (also thought of as communicating with aliens)
[28, 58]. It is of course infeasible to recover the messages when the codebook is not
available, and so the focus here is on clustering similar messages according to the trans-
mitted code stream. In particular, the empirical MPI was used as the universal decoder
for clustering messages transmitted through an unknown discrete memoryless channel,
when the decoder is unaware of the codebook used for encoding the messages. By
identifying the received codewords that are most dependent on each other, the decoder
optimally clusters the messages up to the capacity of the channel.
Clustering random variables under the Chow–Liu tree approximation by minimizing
the partition information is studied in [148]. A more general version of the clustering
problem that is obtained by minimizing the partition information is considered in [149],
where the authors describe the clustering algorithm, given the joint distribution.
Identifying the optimal partition is often computationally expensive as the number
of partitions is exponential in the number of objects. An efficient method to cluster the
random variables is identified in [149]. We know that entropy is a submodular function
as for any set of indices A, B,
H(XA∪B ) + H(XA∩B ) ≤ H(XA ) + H(XB ),
as conditioning reduces entropy. The equality holds if and only if XA ⊥ XB . For any
A ⊆ [n], let

h_\gamma(A) = H(X_A) - \gamma, \qquad \hat{h}_\gamma(A) = \min_{P \in \mathcal{P}(A)} \sum_{C \in P} h_\gamma(C),   (9.13)
where P(A) is the set of all partitions of the index set A. Then, we have

\hat{h}_\gamma(A_1 \cup A_2) = \min_{P \in \mathcal{P}(A_1 \cup A_2)} \sum_{C \in P} h_\gamma(C)
\leq \min_{P \in \mathcal{P}(A_1)} \sum_{C \in P} h_\gamma(C) + \min_{P \in \mathcal{P}(A_2)} \sum_{C \in P} h_\gamma(C)
= \hat{h}_\gamma(A_1) + \hat{h}_\gamma(A_2).

Thus, ĥ_γ(·) is intersection submodular.


Further, the MPI is given by the value of γ satisfying (9.13) for A = [n]. Thus, iden-
tifying the partition that minimizes the intersection submodular function ĥ_γ yields the
clustering solution (fundamental partition), and the value of γ minimizing the function
corresponds to the MPI.
Since submodular function minimization can be performed efficiently, the clustering
can be performed in strongly polynomial time by using the principal sequence of parti-
tions (PSP) based on the Dilworth truncation lattice formed by ĥ_γ [145, 149]. From the
partition information and the PSP, a hierarchical clustering solution can be obtained
that creates the phylogenetic tree with increasing partition information [150]. How-
ever, the submodular minimization requires knowledge of the joint distribution of the
random variables. Defining universal equivalents of these clustering algorithms would
require not only efficient methods for estimating the partition information, but also a
minimization algorithm that is robust to the errors in the estimate.

Example 9.4 Continuing with the example of the multivariate Gaussian random variable
in Example 9.3, the fundamental partition is determined by submodular minimization over
the functions f : 2^{[n]} → R defined by f(C) = \log(|\Sigma_C|), since the optimal partition is
obtained as

P^* = \arg\min_{P \in \mathcal{P}} \frac{1}{|P| - 1} \sum_{C \in P} \log(|\Sigma_C|).

Alternatively, the minimization problem can also be viewed as


P^* = \arg\min_{P \in \mathcal{P}} \frac{1}{|P| - 1}\left(\sum_{C \in P} \sum_{i \in [|C|]} \log \lambda_i^C - \sum_{i \in [n]} \log \lambda_i\right),

where λ_i^C are the eigenvalues of the covariance submatrix Σ_C and λ_i are the eigenvalues
of Σ. In this sense, clustering according to the multivariate information translates to a
clustering similar to spectral clustering.
For another view on the optimal clustering solution to this problem, relating to the
independence clustering equivalent, let us compare this case with Example 9.2. The
independence clustering solution for Gaussian graphical models recovered the block-
diagonal decomposition of the covariance matrix. Here, we perform a relaxed version of
the same retrieval wherein we return the block-diagonal decomposition with minimum
normalized entropy difference from the given distribution.
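
For small n, the fundamental partition can also be found by exhaustive search over set partitions, which is useful as a reference against the efficient PSP-based methods discussed above; the sketch below (exponential in n and purely illustrative, with our own helper names) does this for the Gaussian case of Example 9.4.

import numpy as np

def partitions(elements):
    """All set partitions of a list (Bell-number many; small n only)."""
    if len(elements) == 1:
        yield [elements]
        return
    first, rest = elements[0], elements[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def gaussian_partition_information(sigma, partition):
    logdet = lambda m: np.linalg.slogdet(m)[1]
    total = sum(logdet(sigma[np.ix_(c, c)]) for c in partition)
    return 0.5 * (total - logdet(sigma)) / (len(partition) - 1)

def fundamental_partition(sigma, tol=1e-9):
    """Brute-force MPI (9.12) and the finest minimizer (proxy: most clusters)."""
    n = sigma.shape[0]
    cands = [p for p in partitions(list(range(n))) if len(p) > 1]
    vals = [gaussian_partition_information(sigma, p) for p in cands]
    mpi = min(vals)
    minimizers = [p for p, v in zip(cands, vals) if v <= mpi + tol]
    return mpi, max(minimizers, key=len)

sigma = np.array([[1.0, 0.8, 0.0, 0.0],
                  [0.8, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])
print(fundamental_partition(sigma))   # MPI = 0, fundamental partition {{0, 1}, {2, 3}}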

Thus, the partition information functional proves to be a useful multivariate informa-


tion functional to cluster random variables according to the dependence structure. This
approach is particularly conducive for the design of efficient optimization frameworks
to perform the clustering, given a sufficiently large number of samples of the sources.
Multiinformation is another multivariate information functional that has been used
extensively to study multivariate dependence [151].
definition 9.10 (Multiinformation) Given a set of random variables X1 , . . . , Xn , the
multiinformation is defined as

I_M(X_1; \ldots; X_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \ldots, X_n).   (9.14)

The multiinformation has, for instance, been used in identifying function transforma-
tions that maximize the multivariate correlation within the set of random variables, using
a greedy search algorithm [86]. It has also been used in devising unsupervised methods
for identifying abstract structure in data from areas like genomics and natural language
[152]. In both of these studies the multiinformation proves to be a more robust measure
of correlation between data sources and hence a stronger notion to characterize the rela-
tionship between them. The same principle is exploited in devising meaningful universal
clustering algorithms.

Figure 9.16 Source model for multi-image clustering according to mutual independence of
images. Here we cluster n images with labels (L_1, . . . , L_n) = (1, 2, 2, . . .). The labels define the
base scene being imaged (latent vectors). The scenes are subject to an image-capture noise which
is independent across scenes and pixels. Clustering is performed according to these captured
images universally.

The role of multiinformation in independence-based clustering is best elucidated


through the example of joint clustering and registration of images under rigid-body
transformations [153]. The problem of clustering of images is defined by the source
model depicted in Fig. 9.16. Consider a collection of labels drawn i.i.d. and assume an
n-dimensional latent vector for each label. Let the noise model be a discrete memoryless
channel resulting in corrupted versions of the latent vectors. In the case of multi-image
clustering, the latent vectors represent the true scene being captured and the observations
are the resulting images to be clustered. The channel represents the image-capture noise
from the imaging process.
Under this model, observations corresponding to the same label (scene) are dependent
on each other, whereas those of different labels (scenes) are mutually independent. Thus
the sources can be clustered using a multiinformation-based clustering algorithm [72].
In particular, the algorithm considers the intra-cluster multiinformation, called cluster
information, defined as

I_C^{(P)}(X_1; \ldots; X_n) = \sum_{C \in P} I_M(X_C).   (9.15)

Here I M (XC ) is the multiinformation of the random variables in cluster C.


lemma 9.2 ([72]) If I(X1 ; . . . ; Xn ) = 0, then the optimal clustering that minimizes
the partition information is given by the finest partition that maximizes the cluster
information.
Proof For any partition P,
I_P(X_1; \ldots; X_n) = \frac{1}{|P| - 1}\left(I_M(X_1; \ldots; X_n) - I_C^{(P)}(X_1; \ldots; X_n)\right).
Thus,
\max_{P \in \mathcal{P}} I_C^{(P)}(X_1; \ldots; X_n) = I_M(X_1; \ldots; X_n) - \min_{P \in \mathcal{P}} (|P| - 1)\, I_P(X_1; \ldots; X_n) = I_M(X_1; \ldots; X_n).

From the non-negativity of partition information, and since I(X1 ; . . . ; Xn ) = 0,


\arg\max_{P \in \mathcal{P}} I_C^{(P)}(X_1; \ldots; X_n) = \arg\min_{P \in \mathcal{P}} I_P(X_1; \ldots; X_n).

This attribute characterizes the above-defined optimal universal clustering algorithm


for the model in Fig. 9.16. This algorithm is exponentially consistent given the number
of clusters [153]. We can also use the cluster information for hierarchical clustering,
constructing finer partitions of increasing inter-cluster dependence.
The cluster information is also intersection supermodular, and thus its maximization
to identify the optimal partition can also be done efficiently given the joint distribution,
much like for the partition information. Robust estimates of the multiinformation and
cluster information [154] can thus be used to perform efficient universal clustering.

Example 9.5 For a jointly Gaussian random vector (X1 , . . . , Xn ) ∼ N(0, Σ), the multi-
information is given by
I_M(X_1; \ldots; X_n) = \frac{1}{2} \log\!\left(\frac{\prod_{i \in [n]} \sigma_i^2}{|\Sigma|}\right),   (9.16)
where σ2i is the variance of Xi .
Then, the cluster information is given by
I_C^{(P)}(X_1; \ldots; X_n) = \frac{1}{2} \log\!\left(\frac{\prod_{i \in [n]} \sigma_i^2}{|\Sigma_P|}\right),   (9.17)

where again [Σ_P]_{i,j} = [Σ]_{i,j} 1{i, j ∈ C for some C ∈ P} for all (i, j) ∈ [n]^2. If P̄ =
{{1}, . . . , {n}} denotes the partition of singletons, then the cluster information is essentially
given by

I_C^{(P)}(X_1; \ldots; X_n) = \frac{1}{2} \log\!\left(\frac{|\Sigma_{\bar{P}}|}{|\Sigma_P|}\right).
Thus the cluster information represents the information in the clustered form as
compared against the singletons.
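
The Gaussian expressions (9.16) and (9.17), together with the identity I_P = (I_M − I_C^{(P)})/(|P| − 1) used in the proof of Lemma 9.2, admit a short numerical consistency check (our own sketch, with assumed helper names):

import numpy as np

def logdet(m):
    return np.linalg.slogdet(m)[1]

def sigma_p(sigma, partition):
    """Sigma_P: zero out entries whose indices are not in a common cluster."""
    mask = np.zeros_like(sigma)
    for c in partition:
        mask[np.ix_(c, c)] = 1.0
    return sigma * mask

def multiinformation(sigma):                      # (9.16)
    return 0.5 * (np.sum(np.log(np.diag(sigma))) - logdet(sigma))

def cluster_information(sigma, partition):        # (9.17)
    return 0.5 * (np.sum(np.log(np.diag(sigma))) - logdet(sigma_p(sigma, partition)))

sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.2],
                  [0.2, 0.2, 1.0]])
P = [[0, 1], [2]]
i_m, i_c = multiinformation(sigma), cluster_information(sigma, P)
i_p = 0.5 * (logdet(sigma_p(sigma, P)) - logdet(sigma)) / (len(P) - 1)   # (9.11)
print(i_p, (i_m - i_c) / (len(P) - 1))   # the two values coincide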

Another useful multivariate information functional, which is inspired by the multi-


information, is the illum information [155]. This is the Csiszár conjugate of the multi-
information and the multivariate extension of the lautum information [156]:
L(X_1; \ldots; X_n) = D\!\left(\prod_{i \in [n]} Q_i \,\Big\|\, Q\right),   (9.18)

where (X1 , . . . , Xn ) ∼ Q, and Qi is the marginal distribution of Xi for all i ∈ [n]. The cluster
version of the illum information, correspondingly defined as
L_C^{(P)} = D\!\left(\prod_{C \in P} Q_C \,\Big\|\, Q\right),

is minimized by the partition that corresponds to the independence clustering solu-


tion, yielding another similarity function to perform clustering. Whereas the multi-
information is upper-bounded by the sum of marginal entropies, the illum information
has no such upper bound. This proves particularly useful at recovering dependence
structures and in clustering low-entropy data sources.

Example 9.6 For the jointly Gaussian random vector (X_1, . . . , X_n) ∼ N(0, Σ), let
\Sigma = \sum_{i=1}^{n} \lambda_i u_i u_i^T be the orthonormal eigendecomposition. Then |\Sigma| = \prod_{i=1}^{n} \lambda_i, and the illum
information is given by

L(X_1; \ldots; X_n) = \frac{1}{2} \sum_{i=1}^{n} \left[\frac{u_i^T \tilde{\Sigma} u_i}{\lambda_i} - \log\!\left(\frac{\sigma_i^2}{\lambda_i}\right) - 1\right],   (9.19)

where \tilde{\Sigma} is the diagonal matrix of variance values. In comparison, the multiinformation
is given by

I_M(X_1; \ldots; X_n) = \frac{1}{2} \sum_{i=1}^{n} \log\!\left(\frac{\sigma_i^2}{\lambda_i}\right).
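
As a sanity check of (9.19) (our own sketch, with assumed helper names), the spectral form can be compared against the direct computation of D(∏_i Q_i ‖ Q) as a Gaussian relative entropy; the two agree:

import numpy as np

def illum_information_spectral(sigma):
    """Illum information of N(0, Sigma) via the eigendecomposition form (9.19)."""
    lam, u = np.linalg.eigh(sigma)
    var = np.diag(np.diag(sigma))                  # diagonal matrix of variances
    return 0.5 * sum(u[:, i] @ var @ u[:, i] / lam[i]
                     - np.log(sigma[i, i] / lam[i]) - 1
                     for i in range(len(lam)))

def illum_information_direct(sigma):
    """D(prod_i Q_i || Q): relative entropy from the product of marginals to Q."""
    n = sigma.shape[0]
    var = np.diag(np.diag(sigma))
    return 0.5 * (np.trace(np.linalg.inv(sigma) @ var) - n
                  + np.linalg.slogdet(sigma)[1] - np.linalg.slogdet(var)[1])

sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
print(illum_information_spectral(sigma), illum_information_direct(sigma))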

Example 9.7 For a pair-wise Markov random field (MRF) defined on a graph G =
(V, E) as
Q_G(X_1, \ldots, X_n) = \exp\!\left(\sum_{i \in V} \psi_i(X_i) + \sum_{(i,j) \in E} \psi_{ij}(X_i, X_j) - A(\psi)\right),   (9.20)

where A(ψ) is the log partition function, the sum information [155], which is defined as

S(X_1; \ldots; X_n) = I_M(X_1; \ldots; X_n) + L(X_1; \ldots; X_n),   (9.21)

reduces to
    
S(X_1; \ldots; X_n) = \sum_{(i,j) \in E} \left(\mathbb{E}_{Q_G}\!\left[\psi_{ij}(X_i, X_j)\right] - \mathbb{E}_{Q_i Q_j}\!\left[\psi_{ij}(X_i, X_j)\right]\right).

That is, the sum information is dependent only on the expectation of the pair-wise poten-
tial functions defining the MRF. In particular, it is independent of the partition function
and thus statistically and computationally easier to estimate when the potential functions
are known. This also gives us an alternative handle on information-based dependence
structure recovery.

From Example 9.7, we see the advantages of the illum information and other corre-
spondingly defined multivariate information functionals in universal clustering of data
sources. Thus, depending on the statistical families defining the data sources, we see that

a variety of multivariate information functionals, serving as natural notions of similar-


ity, can be used to recover clusters. The bottlenecks for most of these methods are the
sample and/or the computational complexity (moving away from universality, costs may
be reduced for restricted families).

9.6 Applications

Data often include all the information we require for processing them, and the difficulty
in extracting useful information typically lies in understanding their fundamental struc-
ture and in designing effective methods to process such data. Universal methods aim to
learn this information and perform unsupervised learning tasks such as clustering and
provide considerable insight into problems of which we have little prior understanding. In this sec-
tion we give a very brief overview of some of the applications of information-theoretic
clustering algorithms.
For instance, in [28] Misra hypothesizes the ability to communicate with aliens by
clustering messages under a universal communication setting without knowledge of the
codebook. That is, if one presumes that aliens communicate using a random codebook,
then the codewords can be clustered reliably, albeit without the ability to recover the
mapping between message and codeword.
The information-bottleneck method has been used for document clustering [157] and
for image clustering [158] applications. Similarly, clustering of pieces of music through
universal compression techniques has been studied [159, 160].
Clustering using the minimum partition information has been considered in biology
[149]. The information content in gene-expression patterns is exploited through info-
clustering for genomic studies of humans. Similarly, firing patterns of neurons over time
are used as data to better understand neural stimulation of the human brain by identifying
modular structures through info-clustering.
Clustering of images in the absence of knowledge of image and noise models along
with joint registration has been considered [153]. It is worth noting that statistically
consistent clustering and registration of images can be performed with essentially no
information about the source and channel models.
Multivariate information functionals in the form of multiinformation have also been
utilized in clustering data sources according to their latent factors in a hierarchical fash-
ion [152, 161]. This method, which is related to ICA in the search for latent factors
using information, has been found to be efficient in clustering statements according
to personality types, in topical characterization of news, and in learning hierarchical
clustering from DNA strands. The method has in fact been leveraged in social-media
content categorization with the motivation of identifying useful content for disaster
response [162].
Beside direct clustering applications, the study of universal clustering is significant
for understanding fundamental performance limits of systems. As noted earlier, under-
standing the task of clustering using crowdsourced responses, without knowledge of the
crowd channels, helps establish fundamental benchmarks for practical systems [76].

Establishing tight bounds on the cost of clustering or the performance of universal


algorithms helps in choosing appropriate parameters for practical adaptations of the
algorithms. Moreover, universal methods highlight the fundamental structural properties
inherent in the data, with minimal contextual knowledge.

9.7 Conclusion

Clustering is ubiquitous in data pre-processing and structure learning in unsupervised


settings. Hence it is imperative that we have a stronger understanding of the prob-
lem. Although it is often context-sensitive, the task of clustering can also be viewed
in the broader sense of universality. Here we have reviewed the strength of these
context-independent techniques for clustering.
In our review, we have seen universal information-theoretic methods yielding novel
techniques that not only translate to practice, but also establish fundamental limits in the
space. Practical heuristics are often built either using these techniques or by using the
fundamental limits as performance benchmarks.
We also noticed the integral role of classical information theory in inference problems
in unsupervised clustering. Design principles observed for the problems of commu-
nication and compression also prove to be useful guidelines for the data clustering
framework. Further, information theory also helps develop rigorous theoretical insight
into the problem and the methods therein, helping us devise systematic development
pipelines for tackling the challenge.

9.7.1 Future Directions


While many tools and techniques have been explored, devised, and analyzed, much is
still left to be understood on universal clustering in particular, and unsupervised learning
in general. A careful exploration of the information-theory literature could very well be
a strong guiding principle for solving learning problems.
Efficient, robust, universal learning could very well reduce the sample complexities
of supervised learning methods. It has already been observed that efficient unsupervised
learning of abstract data representations significantly reduces the training complex-
ity in visual object recognition [163]. Universal clustering in crowdsourcing has been
proposed to reduce the labeling complexity and the domain expert requirement [76].
Similarly, statistically grounded, information-theoretic methods addressing universal
clustering could very well generate efficient representations of unlabeled information,
thereby facilitating better learning algorithms.
The focus of most information-theoretic methods addressing universal clustering has
largely been on statistical consistency, in the process often neglecting the associated
computational complexity. An important avenue of study for the future would be to
design computationally efficient clustering algorithms. More generally, we require a con-
crete understanding of the fundamental trade-offs between computational and statistical
optimalities [164]. This would also help in translating such methods to practical datasets
and in developing new technologies and applications.

Most analyses performed within this framework have considered first-order optimal-
ities in the performance of the algorithms such as error exponents. These results are
useful when we have sufficiently large (asymptotic) amounts of data. However, when
we have finite (moderately large) data, stronger theoretical results focusing on second-
and third-order characteristics of the performance through CLT-based finite-blocklength
analyses are necessary [165].
Information functionals, which are often used in universal clustering algorithms,
have notoriously high computational and/or sample complexities of estimation and thus
an important future direction of research concerns the design of robust and efficient
estimators [166].
These are just some of the open problems in the rich and burgeoning area of universal
clustering.

References

[1] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” J. Am. Statist.
Assoc., vol. 66, no. 336, pp. 846–850, 1971.
[2] U. von Luxburg, R. C. Williamson, and I. Guyon, “Clustering: Science or art?” in Proc.
29th International Conference on Machine Learning (ICML 2012), 2012, pp. 65–79.
[3] G. C. Bowker and S. L. Star, Sorting things out: Classification and its consequences. MIT
Press, 1999.
[4] D. Niu, J. G. Dy, and M. I. Jordan, “Iterative discovery of multiple alternative clustering
views,” IEEE Trans. Pattern Analysis Machine Intelligence, vol. 36, no. 7, pp. 1340–1353,
2014.
[5] L. Valiant, Probably approximately correct: Nature’s algorithms for learning and pros-
pering in a complex world. Basic Books, 2013.
[6] J. T. Vogelstein, Y. Park, T. Ohyama, R. A. Kerr, J. W. Truman, C. E. Priebe, and M. Zlatic,
“Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure
learning,” Science, vol. 344, no. 6182, pp. 386–392, 2014.
[7] R. E. Sanderson, A. Helmi, and D. W. Hogg, “Action-space clustering of tidal streams to
infer the Galactic potential,” Astrophys. J., vol. 801, no. 2, 18 pages, 2015.
[8] E. Gibson, R. Futrell, J. Jara-Ettinger, K. Mahowald, L. Bergen, S. Ratnasingam, M. Gib-
son, S. T. Piantadosi, and B. R. Conway, “Color naming across languages reflects color
use,” Proc. Natl. Acad. Sci. USA, vol. 114, no. 40, pp. 10 785–10 790, 2017.
[9] C. E. Shannon, “Bits storage capacity,” Manuscript Division, Library of Congress,
handwritten note, 1949.
[10] M. Weldon, The Future X Network: A Bell Labs perspective. CRC Press, 2015.
[11] C. Lintott, K. Schawinski, S. Bamford, A. Slosar, K. Land, D. Thomas, E. Edmondson,
K. Masters, R. C. Nichol, M. J. Raddick, A. Szalay, D. Andreescu, P. Murray, and J. Van-
denberg, “Galaxy Zoo 1: Data release of morphological classifications for nearly 900 000
galaxies,” Monthly Notices Roy. Astron. Soc., vol. 410, no. 1, pp. 166–178, 2010.
[12] A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with Mechanical Turk,”
in Proc. SIGCHI Conference on Human Factors in Computational Systems (CHI 2008),
2008, pp. 453–456.

[13] P. G. Ipeirotis, F. Provost, and J. Wang, “Quality management on Amazon Mechanical


Turk,” in Proc. ACM SIGKDD Workshop Human Computation (HCOMP ’10), 2010, pp.
64–67.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convo-
lutional neural networks,” in Proc. Advances in Neural Information Processing Systems
25, 2012, pp. 1097–1105.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual
recognition challenge,” arXiv:1409.0575 [cs.CV], 2014.
[16] T. Simonite, “Google’s new service translates languages almost as well as humans can,”
MIT Technol. Rev., Sep. 27, 2016.
[17] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,
Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws,
Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean,
“Google’s neural machine translation system: Bridging the gap between human and
machine translation,” arXiv:1609.08144 [cs.CL], 2017.
[18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrit-
twieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham,
N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and
D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,”
Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[19] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
436–444, 2015.
[20] T. Simonite, “The missing link of artificial intelligence,” MIT Technol. Rev., Feb. 18, 2016.
[21] C. E. Shannon, “A mathematical theory of communication,” Bell Systems Technical J.,
vol. 27, nos. 3–4, pp. 379–423, 623–656, 1948.
[22] L. R. Varshney, “Block diagrams in information theory: Drawing things closed,” in SHOT
Special Interest Group on Computers, Information, and Society Workshop 2014, 2014.
[23] R. M. Gray, “Source coding and simulation,” IEEE Information Theory Soc. Newsletter,
vol. 58, no. 4, pp. 1/5–11, 2008 (2008 Shannon Lecture).
[24] R. M. Gray, “Time-invariant trellis encoding of ergodic discrete-time sources with a
fidelity criterion,” IEEE Trans. Information Theory, vol. 23, no. 1, pp. 71–83, 1977.
[25] Y. Steinberg and S. Verdú, “Simulation of random processes and rate-distortion theory,”
IEEE Trans. Information Theory, vol. 42, no. 1, pp. 63–86, 1996.
[26] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 1991.
[27] J. Rissanen, “Optimal estimation,” IEEE Information Theory Soc. Newsletter, vol. 59,
no. 3, pp. 1/6–7, 2009 (2009 Shannon Lecture).
[28] V. Misra, “Universal communication and clustering,” Ph.D. dissertation, Stanford Univer-
sity, 2014.
[29] J. J. Rissanen, “Generalized Kraft inequality and arithmetic coding,” IBM J. Res.
Development, vol. 20, no. 3, pp. 198–203, 1976.
[30] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proc.
IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
[31] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,”
IEEE Trans. Information Theory, vol. 24, no. 5, pp. 530–536, 1978.

[32] J. Ziv, “Coding theorems for individual sequences,” IEEE Trans. Information Theory, vol.
24, no. 4, pp. 405–412, 1978.
[33] A. D. Wyner and J. Ziv, “The rate-distortion function for source coding with side infor-
mation at the decoder,” IEEE Trans. Information Theory, vol. 22, no. 1, pp. 1–10,
1976.
[34] J. J. Rissanen, “A universal data compression system,” IEEE Trans. Information Theory,
vol. 29, no. 5, pp. 656–664, 1983.
[35] R. Gallager, “Variations on a theme by Huffman,” IEEE Trans. Information Theory, vol.
24, no. 6, pp. 668–674, 1978.
[36] J. C. Lawrence, “A new universal coding scheme for the binary memoryless source,”
IEEE Trans. Information Theory, vol. 23, no. 4, pp. 466–472, 1977.
[37] J. Ziv, “Coding of sources with unknown statistics – Part I: Probability of encoding error,”
IEEE Trans. Information Theory, vol. 18, no. 3, pp. 384–389, 1972.
[38] L. D. Davisson, “Universal noiseless coding,” IEEE Trans. Information Theory, vol. 19,
no. 6, pp. 783–795, 1973.
[39] D. Slepian and J. K. Wolf, “Noiseless coding of correlated information sources,” IEEE
Trans. Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
[40] T. M. Cover, “A proof of the data compression theorem of Slepian and Wolf for ergodic
sources,” IEEE Trans. Information Theory, vol. 21, no. 2, pp. 226–228, 1975.
[41] I. Csiszár, “Linear codes for sources and source networks: Error exponents, universal
coding,” IEEE Trans. Information Theory, vol. 28, no. 4, pp. 585–592, 1982.
[42] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[43] T. Berger, Rate distortion theory: A mathematical basis for data compression. Prentice-
Hall, 1971.
[44] J. Ziv, “Coding of sources with unknown statistics – Part II: Distortion relative to a fidelity
criterion,” IEEE Trans. Information Theory, vol. 18, no. 3, pp. 389–394, May 1972.
[45] J. Ziv, “On universal quantization,” IEEE Trans. Information Theory, vol. 31, no. 3, pp.
344–347, 1985.
[46] E. Hui Yang and J. C. Kieffer, “Simple universal lossy data compression schemes derived
from the Lempel–Ziv algorithm,” IEEE Trans. Information Theory, vol. 42, no. 1, pp.
239–245, 1996.
[47] A. D. Wyner, J. Ziv, and A. J. Wyner, “On the role of pattern matching in information
theory,” IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2045–2056, 1998.
[48] C. E. Shannon, “Communication in the presence of noise,” Proc. IRE, vol. 37, no. 1, pp.
10–21, 1949.
[49] V. D. Goppa, “Nonprobabilistic mutual information without memory,” Problems Control
Information Theory, vol. 4, no. 2, pp. 97–102, 1975.
[50] I. Csiszár, “The method of types,” IEEE Trans. Information Theory, vol. 44, no. 6, pp.
2505–2523, 1998.
[51] P. Moulin, “A Neyman–Pearson approach to universal erasure and list decoding,” IEEE
Trans. Information Theory, vol. 55, no. 10, pp. 4462–4478, 2009.
[52] N. Merhav, “Universal decoding for arbitrary channels relative to a given class of
decoding metrics,” IEEE Trans. Information Theory, vol. 59, no. 9, pp. 5566–5576, 2013.
[53] C. E. Shannon, “Certain results in coding theory for noisy channels,” Information Control,
vol. 1, no. 1, pp. 6–25, 1957.

[54] A. Feinstein, “On the coding theorem and its converse for finite-memory channels,”
Information Control, vol. 2, no. 1, pp. 25–44, 1959.
[55] I. Csiszár and P. Narayan, “Capacity of the Gaussian arbitrarily varying channel,” IEEE
Trans. Information Theory, vol. 37, no. 1, pp. 18–26, 1991.
[56] J. Ziv, “Universal decoding for finite-state channels,” IEEE Trans. Information Theory,
vol. 31, no. 4, pp. 453–460, 1985.
[57] M. Feder and A. Lapidoth, “Universal decoding for channels with memory,” IEEE Trans.
Information Theory, vol. 44, no. 5, pp. 1726–1745, 1998.
[58] V. Misra and T. Weissman, “Unsupervised learning and universal communication,” in
Proc. 2013 IEEE International Symposium on Information Theory, 2013, pp. 261–265.
[59] N. Merhav, "Universal decoding using a noisy codebook," IEEE Trans. Information Theory, vol. 64, no. 4, pp. 2231–2239, 2018.
[60] A. Lapidoth and P. Narayan, “Reliable communication under channel uncertainty” IEEE
Trans. Information Theory, vol. 44, no. 6, pp. 2148–2177, 1998.
[61] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test
optimal?” IEEE Trans. Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
[62] M. Feder and N. Merhav, “Universal composite hypothesis testing: A competitive
minimax approach,” IEEE Trans. Information Theory, vol. 48, no. 6, pp. 1504–1517,
2002.
[63] E. Levitan and N. Merhav, “A competitive Neyman–Pearson approach to universal
hypothesis testing with applications,” IEEE Trans. Information Theory, vol. 48, no. 8,
pp. 2215–2229, 2002.
[64] M. Feder, N. Merhav, and M. Gutman, “Universal prediction of individual sequences,”
IEEE Trans. Information Theory, vol. 38, no. 4, pp. 1258–1270, 1992.
[65] M. Feder, “Gambling using a finite state machine,” IEEE Trans. Information Theory,
vol. 37, no. 5, pp. 1459–1465, 1991.
[66] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger, “Universal
discrete denoising: Known channel,” IEEE Trans. Information Theory, vol. 51, no. 1,
pp. 5–28, 2005.
[67] E. Ordentlich, K. Viswanathan, and M. J. Weinberger, “Twice-universal denoising,” IEEE
Trans. Information Theory, vol. 59, no. 1, pp. 526–545, 2013.
[68] T. Bendory, N. Boumal, C. Ma, Z. Zhao, and A. Singer, “Bispectrum inversion with
application to multireference alignment," IEEE Trans. Signal Processing, vol. 66, no. 4, pp. 1037–1050, 2018.
[69] E. Abbe, J. M. Pereira, and A. Singer, “Sample complexity of the Boolean multirefer-
ence alignment problem,” in Proc. 2017 IEEE International Symposium on Information
Theory, 2017, pp. 1316–1320.
[70] A. Pananjady, M. J. Wainwright, and T. A. Courtade, “Denoising linear models with per-
muted data,” in Proc. 2017 IEEE International Symposium on Information Theory, 2017,
pp. 446–450.
[71] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” Int. J.
Computer Vision, vol. 24, no. 2, pp. 137–154, 1997.
[72] R. K. Raman and L. R. Varshney, “Universal joint image clustering and registration using
partition information,” in Proc. 2017 IEEE International Symposium on Information
Theory, 2017, pp. 2168–2172.
[73] J. Stein, J. Ziv, and N. Merhav, “Universal delay estimation for discrete channels,” IEEE
Trans. Information Theory, vol. 42, no. 6, pp. 2085–2093, 1996.

[74] J. Ziv, “On classification with empirically observed statistics and universal data compres-
sion,” IEEE Trans. Information Theory, vol. 34, no. 2, pp. 278–286, 1988.
[75] N. Merhav, “Universal classification for hidden Markov models,” IEEE Trans. Informa-
tion Theory, vol. 37, no. 6, pp. 1586–1594, Nov. 1991.
[76] R. K. Raman and L. R. Varshney, “Budget-optimal clustering via crowdsourcing,” in Proc.
2017 IEEE International Symposium on Information Theory, 2017, pp. 2163–2167.
[77] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” IEEE
Trans. Information Theory, vol. 60, no. 7, pp. 4066–4082, 2014.
[78] G. Cormode, M. Paterson, S. C. Sahinalp, and U. Vishkin, “Communication complex-
ity of document exchange,” in Proc. 11th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA ’00), 2000, pp. 197–206.
[79] S. Muthukrishnan and S. C. Sahinalp, “Approximate nearest neighbors and sequence
comparison with block operations,” in Proc. 32nd Annual ACM Symposium on Theory
Computation (STOC ’00), 2000, pp. 416–424.
[80] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman
divergences,” J. Machine Learning Res., vol. 6, pp. 1705–1749, 2005.
[81] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications,
3rd edn. Springer, 2008.
[82] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek, “Information distance,”
IEEE Trans. Information Theory, vol. 44, no. 4, pp. 1407–1423, 1998.
[83] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity metric,” IEEE Trans.
Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
[84] P. Vitanyi, “Universal similarity,” in Proc. IEEE Information Theory Workshop (ITW ’05),
2005, pp. 238–243.
[85] R. L. Cilibrasi and P. M. B. Vitányi, “The Google similarity distance,” IEEE Trans.
Knowledge Data Engineering, vol. 19, no. 3, pp. 370–383, 2007.
[86] H. V. Nguyen, E. Müller, J. Vreeken, P. Efros, and K. Böhm, “Multivariate maximal
correlation analysis,” in Proc. 31st Internatinal Conference on Machine Learning (ICML
2014), 2014, pp. 775–783.
[87] P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, “Normalized mutual infor-
mation feature selection,” IEEE Trans. Neural Networks, vol. 20, no. 2, pp. 189–201,
2009.
[88] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure
identification,” J. Statist. Mech., vol. 2005, p. P09008, 2005.
[89] A. J. Gates and Y.-Y. Ahn, “The impact of random models on clustering similarity,” J.
Machine Learning Res., vol. 18, no. 87, pp. 1–28, 2017.
[90] J. Lewis, M. Ackerman, and V. de Sa, “Human cluster evaluation and formal quality
measures: A comparative study,” in Proc. 34th Annual Conference on Cognitive Science
in Society, 2012.
[91] J. MacQueen, “Some methods for classification and analysis of multivariate observa-
tions,” in Proc. 5th Berkeley Symposium on Mathematics Statistics and Probability, 1967,
pp. 281–297.
[92] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification, 2nd edn. Wiley, 2001.
[93] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[94] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE
Trans. Communication, vol. 28, no. 1, pp. 84–95, 1980.

[95] I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information-theoretic feature cluster-


ing algorithm for text classification,” J. Machine Learning Res., vol. 3, pp. 1265–1287,
2003.
[96] A. Banerjee, X. Guo, and H. Wang, “On the optimality of conditional expectation as
a Bregman predictor,” IEEE Trans. Information Theory, vol. 51, no. 7, pp. 2664–2669,
2005.
[97] I. S. Dhillon, S. Mallela, and D. S. Modha, “Information-theoretic co-clustering,” in Proc.
9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD ’03), 2003, pp. 89–98.
[98] R. K. Raman and L. R. Varshney, “Universal clustering via crowdsourcing,”
arXiv:1610.02276 [cs.IT], 2016.
[99] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier detection,” in Proc. 2013
Information Theory Applications Workshop, 2013.
[100] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” in Proc.
2014 IEEE Internatinal Symposium on Information Theory, 2014, pp. 4066–4082.
[101] Y. Li, S. Nitinawarat, Y. Su, and V. V. Veeravalli, “Universal outlier hypothesis testing:
Application to anomaly detection,” in Proc. IEEE International Conference on Acoustics,
Speech, Signal Process. (ICASSP 2015), 2015, pp. 5595–5599.
[102] Y. Bu, S. Zou, and V. V. Veeravalli, “Linear-complexity exponentially-consistent tests for
universal outlying sequence detection,” in Proc. 2017 IEEE International Symposium on
Information Theory, 2017, pp. 988–992.
[103] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal sequential outlier hypothesis
testing,” Sequence Analysis, vol. 36, no. 3, pp. 309–344, 2017.
[104] J. Wright, Y. Tao, Z. Lin, Y. Ma, and H.-Y. Shum, “Classification via minimum incremen-
tal coding length (MICL),” in Proc. Advances in Neural Information Processing Systems
20. MIT Press, 2008, pp. 1633–1640.
[105] Y. Ma, H. Derksen, W. Hong, and J. Wright, “Segmentation of multivariate mixed data via
lossy data coding and compression,” IEEE Trans. Pattern Analysis Machine Intelligence,
vol. 29, no. 9, pp. 1546–1562, 2007.
[106] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Unsupervised segmentation of natu-
ral images via lossy data compression,” Comput. Vision Image Understanding, vol. 110,
no. 2, pp. 212–225, 2008.
[107] S. R. Rao, H. Mobahi, A. Y. Yang, S. S. Sastry, and Y. Ma, “Natural image segmenta-
tion with adaptive texture and boundary encoding,” in Computer Vision – ACCV 2009.
Springer.
[108] R. Cilibrasi and P. M. B. Vitányi, “Clustering by compression,” IEEE Trans. Information
Theory, vol. 51, no. 4, pp. 1523–1545, 2005.
[109] D. Ryabko, “Clustering processes,” in 27th International Conference on Machine Learn-
ing, 2010, pp. 919–926.
[110] A. Khaleghi, D. Ryabko, J. Mary, and P. Preux, “Consistent algorithms for clustering time
series,” J. Machine Learning Res., vol. 17, no. 3, pp. 1–32, 2016.
[111] D. Ryabko, “Independence clustering (without a matrix),” in Proc. Advances in Neural
Information Processing Systems 30, 2017, pp. 4016–4026.
[112] A. J. Bell and T. J. Sejnowski, “An information-maximization approach to blind sep-
aration and blind deconvolution,” Neural Computation, vol. 7, no. 6, pp. 1129–1159,
1995.
Universal Clustering 299

[113] F. R. Bach and M. I. Jordan, “Beyond independent components: Trees and clusters,” J.
Machine Learning Res., vol. 4, no. 12, pp. 1205–1233, 2003.
[114] C. K. Chow and C. N. Liu, “Approximating discrete probability distributions with
dependence trees,” IEEE Trans. Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[115] D. M. Chickering, “Learning Bayesian networks is NP-complete,” in Learning from data,
D. Fisher and H.-J. Lenz, eds. Springer, 1996, pp. 121–130.
[116] A. Montanari and J. A. Pereira, “Which graphical models are difficult to learn?” in Proc.
Advances in Neural Information Processing Systems 22, 2009, pp. 1303–1311.
[117] P. Abbeel, D. Koller, and A. Y. Ng, “Learning factor graphs in polynomial time and
sample complexity,” J. Machine Learning Res., vol. 7, pp. 1743–1788, 2006.
[118] Z. Ren, T. Sun, C.-H. Zhang, and H. H. Zhou, “Asymptotic normality and optimali-
ties in estimation of large Gaussian graphical models,” Annals Statist., vol. 43, no. 3,
pp. 991–1026, 2015.
[119] P.-L. Loh and M. J. Wainwright, “Structure estimation for discrete graphical models: Gen-
eralized covariance matrices and their inverses,” in Proc. Advances in Neural Information
Processing Systems 25, 2012, pp. 2087–2095.
[120] N. P. Santhanam and M. J. Wainwright, “Information-theoretic limits of selecting binary
graphical models in high dimensions,” IEEE Trans. Information Theory, vol. 58, no. 7,
pp. 4117–4134, 2012.
[121] L. Bachschmid-Romano and M. Opper, “Inferring hidden states in a random kinetic Ising
model: Replica analysis,” J. Statist. Mech., vol. 2014, no. 6, p. P06013, 2014.
[122] G. Bresler, “Efficiently learning Ising models on arbitrary graphs,” in Proc. 47th Annual
ACM Symposium Theory of Computation (STOC ’15), 2015, pp. 771–782.
[123] P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai, “Greedy learning of Markov
network structure,” in Proc. 48th Annual Allerton Conference on Communication Control
Computation, 2010, pp. 1295–1302.
[124] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, “A large-deviation analysis of
the maximum-likelihood learning of Markov tree structures,” IEEE Trans. Information
Theory, vol. 57, no. 3, pp. 1714–1735, 2011.
[125] M. J. Beal and Z. Ghahramani, “Variational Bayesian learning of directed graphical
models with hidden variables,” Bayesian Analysis, vol. 1, no. 4, pp. 793–831, 2006.
[126] A. Anandkumar and R. Valluvan, “Learning loopy graphical models with latent variables:
Efficient methods and guarantees,” Annals Statist., vol. 41, no. 2, pp. 401–435, 2013.
[127] G. Bresler, D. Gamarnik, and D. Shah, “Learning graphical models from the Glauber
dynamics,” arXiv:1410.7659 [cs.LG], 2014, to be published in IEEE Trans. Information
Theory.
[128] A. P. Dawid, “Conditional independence in statistical theory,” J. Roy. Statist. Soc. Ser. B.
Methodol., vol. 41, no. 1, pp. 1–31, 1979.
[129] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, “Testing ran-
dom variables for independence and identity,” in Proc. 42nd Annual Symposium on the
Foundations Computer Science, 2001, pp. 442–451.
[130] A. Gretton and L. Györfi, “Consistent non-parametric tests of independence,” J. Machine
Learning Res., vol. 11, no. 4, pp. 1391–1423, 2010.
[131] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai, “Model-powered
conditional independence test,” in Proc. Advances in Neural Information Processing
Systems 30, 2017, pp. 2955–2965.
300 Ravi Kiran Raman and Lav R. Varshney

[132] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in


Proc. 37th Annual Allerton Conference on Communation Control Computication, 1999,
pp. 368–377.
[133] R. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic trade-off between
complexity and accuracy,” in Learning Theory and Kernel Machines, B. Schölkopf and
M. K. Warmuth, eds. Springer, 2003, pp. 595–609.
[134] N. Slonim, “The information bottleneck: Theory and applications,” Ph.D. dissertation,
The Hebrew University of Jerusalem, 2002.
[135] H. Kim, W. Gao, S. Kannan, S. Oh, and P. Viswanath, “Discovering potential correla-
tions via hypercontractivity,” in Proc. 30th Annual Conference on Neural Information
Processing Systems (NIPS), 2017, pp. 4577–4587.
[136] N. Slonim, N. Friedman, and N. Tishby, “Multivariate information bottleneck,” Neural
Comput., vol. 18, no. 8, pp. 1739–1789, 2006.
[137] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, “Information bottleneck for Gaussian
variables,” J. Machine Learning Res., vol. 6, no. 1, pp. 165–188, 2005.
[138] N. Slonim and N. Tishby, “Agglomerative information bottleneck,” in Proc. Advances in
Neural Information Processing Systems 12, 1999, pp. 617–625.
[139] N. Slonim, N. Friedman, and N. Tishby, “Agglomerative multivariate information
bottleneck,” in Proc. Advances in Neural Information Processing Systems 14, 2002,
pp. 929–936.
[140] J. S. Bridle, A. J. R. Heading, and D. J. C. MacKay, “Unsupervised classifiers, mutual
information and ‘phantom targets,”’ in Proc. Advances in Neural Information Processing
Systems 4, 1992, pp. 1096–1101.
[141] A. J. Butte and I. S. Kohane, “Mutual information relevance networks: Functional
genomic clustering using pairwise entropy measurements,” in Biocomputing 2000, 2000,
pp. 418–429.
[142] I. Priness, O. Maimon, and I. Ben-Gal, “Evaluation of gene-expression clustering via
mutual information distance measure,” BMC Bioinformatics, vol. 8, no. 1, p. 111, 2007.
[143] A. Kraskov, H. Stögbauer, R. G. Andrzejak, and P. Grassberger, “Hierarchical clustering
using mutual information,” Europhys. Lett., vol. 70, no. 2, p. 278, 2005.
[144] M. Aghagolzadeh, H. Soltanian-Zadeh, B. Araabi, and A. Aghagolzadeh, “A hierarchi-
cal clustering based on mutual information maximization,” in Proc. IEEE International
Conference on Image Processing (ICIP 2007), vol. 1, 2007, pp. I-277–I-280.
[145] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu, “Multivariate mutual
information inspired by secret-key agreement,” Proc. IEEE, vol. 103, no. 10, pp. 1883–
1913, 2015.
[146] I. Csiszár and P. Narayan, “Secrecy capacities for multiple terminals,” IEEE Trans.
Information Theory, vol. 50, no. 12, pp. 3047–3061, 2004.
[147] C. Chan, A. Al-Bashabsheh, and Q. Zhou, “Change of multivariate mutual information:
From local to global,” IEEE Trans. Information Theory, vol. 64, no. 1, pp. 57–76, 2018.
[148] C. Chan and T. Liu, “Clustering by multivariate mutual information under Chow–Liu tree
approximation,” in Proc. 53rd Annual Allerton Conference on Communication Control
Computation, 2015, pp. 993–999.
[149] C. Chan, A. Al-Bashabsheh, Q. Zhou, T. Kaced, and T. Liu, “Info-clustering: A mathe-
matical theory for data clustering,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2,
no. 1, pp. 64–91, 2016.
Universal Clustering 301

[150] C. Chan, A. Al-Bashabsheh, and Q. Zhou, “Agglomerative info-clustering,”


arXiv:1701.04926 [cs.IT], 2017.
[151] M. Studený and J. Vejnarová, “The multiinformation function as a tool for measuring
stochastic dependence,” in Learning in Graphical Models, M. I. Jordan, ed. Kluwer
Academic Publishers, 1998, pp. 261–297.
[152] G. V. Steeg and A. Galstyan, “Discovering structure in high-dimensional data through cor-
relation explanation,” in Proc. 28th Annual Conference on Neural Information Processing
Systems (NIPS), 2014, pp. 577–585.
[153] R. K. Raman and L. R. Varshney, “Universal joint image clustering and registration using
multivariate information measures,” IEEE J. Selected Topics Signal Processing, vol. 12,
no. 5, pp. 928–943, 2018.
[154] M. Studený, “Asymptotic behaviour of empirical multiinformation,” Kybernetika, vol. 23,
no. 2, pp. 124–135, 1987.
[155] R. K. Raman, H. Yu, and L. R. Varshney, “Illum information,” in Proc. 2017 Information
Theory Applications Workshop, 2017.
[156] D. P. Palomar and S. Verdú, “Lautum information,” IEEE Trans. Information Theory,
vol. 54, no. 3, pp. 964–975, 2008.
[157] N. Slonim and N. Tishby, “Document clustering using word clusters via the informa-
tion bottleneck method,” in Proc. 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR ’00), 2000, pp. 208–215.
[158] J. Goldberger, H. Greenspan, and S. Gordon, “Unsupervised image clustering using the
information bottleneck method,” in Pattern Recognition, L. Van Gool, ed. Springer, 2002,
pp. 158–165.
[159] S. Dubnov, G. Assayag, O. Lartillot, and G. Bejerano, “Using machine-learning methods
for musical style modeling,” IEEE Computer, vol. 36, no. 10, pp. 73–80, 2003.
[160] R. Cilibrasi, P. Vitányi, and R. de Wolf, “Algorithmic clustering of music based on string
compression,” Czech. Math. J., vol. 28, no. 4, pp. 49–67, 2004.
[161] G. V. Steeg and A. Galstyan, “The information sieve,” in Proc. 33rd International
Conference on Machine Learning (ICML 2016), 2016, pp. 164–172.
[162] N. O. Hodas, G. V. Steeg, J. Harrison, S. Chikkagoudar, E. Bell, and C. D. Corley,
“Disentangling the lexicons of disaster response in twitter,” in Proc. 24th International
Conference on the World Wide Web (WWW ’15), 2015, pp. 1201–1204.
[163] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, “Unsupervised
learning of invariant representations,” Theoretical Computer Sci., vol. 633, pp. 112–121,
2016.
[164] M. I. Jordan, “On statistics, computation and scalability,” Bernoulli, vol. 19, no. 4, pp.
1378–1390, 2013.
[165] V. Y. Tan, “Asymptotic estimates in information theory with non-vanishing error prob-
abilities,” Foundations Trends Communication Information Theory, vol. 11, nos. 1–2,
pp. 1–184, 2014.
[166] W. Gao, S. Oh, and P. Viswanath, “Demystifying fixed k-nearest neighbor informa-
tion estimators,” arXiv:1604.03006 [cs.LG], to be published in IEEE Trans. Information
Theory.
10 Information-Theoretic Stability
and Generalization
Maxim Raginsky, Alexander Rakhlin, and Aolin Xu

Summary

Machine-learning algorithms can be viewed as stochastic transformations that map train-


ing data to hypotheses. Following Bousquet and Elisseeff, we say that such an algorithm
is stable if its output does not depend too much on any individual training example. Since
stability is closely connected to the generalization capabilities of learning algorithms, it
is of theoretical and practical interest to obtain sharp quantitative estimates on the gener-
alization bias of machine-learning algorithms in terms of their stability properties. This
chapter describes several information-theoretic measures of algorithmic stability and
illustrates their use for upper-bounding the generalization bias of learning algorithms.
Specifically, we relate the expected generalization error of a learning algorithm (i.e.,
the expected difference between the empirical risk and the population risk of the hypoth-
esis generated by the algorithm) to several information-theoretic quantities that capture
the amount of statistical dependence between the training data and the hypothesis. These
include the mutual information and the erasure mutual information, as well as their coun-
terparts induced by the total variation distance. We illustrate the general theory through
a number of examples, including the Gibbs algorithm and differentially private algo-
rithms, and discuss various strategies for controlling the generalization error, such as
pre- and post-processing, as well as adaptive composition.

10.1 Introduction

In the standard framework of statistical learning theory (see, e.g., [1]), we are faced with
the stochastic optimization problem

minimize Lμ (w) := (w, z)μ(dz),
Z
where w takes values in some hypothesis space W, μ is an unknown probability law on
an instance space Z, and  : W × Z → R+ is a given loss function. The quantity Lμ (w)
defined above is referred to as the expected (or population) risk of a hypothesis w ∈ W.

302
Information-Theoretic Stability and Generalization 303

We are given a training sample of size n, i.e., an n-tuple Z = (Z1 , . . . , Zn ) of i.i.d. random
elements of Z drawn from μ. A (possibly randomized) learning algorithm1 is a Markov
kernel PW|Z that maps the training data Z to a random element W of W, and the objective
is to generate W with a suitably small population risk

Lμ (W) = (W, z)μ(dz).
Z

Since Lμ (W) is a random variable, we require it to be small either in expectation or with


high enough probability.
This framework is sufficiently rich to cover a wide variety of learning problems (see
[2] for additional examples).
• Binary classification – let Z = X × {0, 1}, where X is an arbitrary space, let W be
a class of functions w : X → {0, 1}, and let (w, z) = (w, (x, y)) = 1{yw(x)} . The
elements of W are typically referred to as (binary) classifiers.
• Regression with quadratic loss – let Z = X × Y, where X is arbitrary and Y ⊆ R,
let W be a class of functions w : X → R, and let (w, z) = (w, (x, y)) = (y − w(x))2 .
The elements of W are typically referred to as predictors.
• Clustering – let Z be a Hilbert space, let W be the class of all k-element subsets
of Z, and let (w, z) = mina∈w a − z2Z , where  · Z denotes the norm induced by
the inner product on Z.
• Density estimation – let Z be a Borel set in Rd , let W be a class of bounded prob-
ability densities on Z, and let (w, z) = −log w(z), the negative log-likelihood of z
according to the hypothesis density w. It is typically assumed that the densities
w ∈ W are all lower-bounded by some constant c > 0.
The first two examples describe supervised learning problems, where each instance z ∈ Z
splits naturally into a “feature” x and a “label” y, and the objective is to learn to predict
the label corresponding to a given feature. The last two examples fall into the class of
unsupervised learning problems.
Since the data-generating distribution μ is unknown, we do not have enough informa-
tion to compute Lμ (W). However, the empirical loss

1
n
LZ (W) := (W, Zi )
n i=1

is a natural proxy that can be computed from the data Z and from the output of the
algorithm W. The generalization error of PW|Z is the difference Lμ (W) − LZ (W), and we
are interested in its expected value:
 
gen(μ, PW|Z ) := E Lμ (W) − LZ (W) ,

where the expectation is w.r.t. the joint probability law P := μ⊗n ⊗ PW|Z of the training
data Z and the algorithm’s output W.

1 We are using the term “algorithm” as a synonym for “rule” or “procedure,” without necessarily assuming
computational efficiency.
304 Maxim Raginsky et al.

One motivation to study this quantity is as follows. Let us assume, for simplicity, that
the infimum inf w∈W Lμ (w) exists, and is achieved by some w◦ ∈ W. We can decompose
the expected value of the excess risk Lμ (W) − Lμ (w◦ ) as
 
E[Lμ (W) − Lμ (w◦ )] = E Lμ (W) − LZ (W) + E[LZ (W)] − Lμ (w◦ )
 
= E Lμ (W) − LZ (W) + E[LZ (W) − LZ (w◦ )]
= gen(μ, PW|Z ) + E[LZ (W) − LZ (w◦ )], (10.1)

where in the second line we have used the fact that, for any fixed w ∈ W, the empiri-
cal risk LZ (w) is an unbiased estimate of the population risk Lμ (w): ELZ (w) = Lμ (w).
This decomposition shows that the expected excess risk of a learning algorithm will
be small if its expected generalization error is small and if the expected difference
of the empirical risks of W and w◦ is bounded from above by a small non-negative
quantity.
For example, we can consider the empirical risk minimization (ERM) algorithm [1]
that returns any minimizer of the empirical risk:

W ∈ arg min LZ (w).


w∈W

Evidently, the second term in (10.1) is non-positive, so the expected excess risk of ERM
is upper-bounded by its expected generalization error. A crude upper bound on the latter
is given by
 
gen(μ, PW|Z ) ≤ E sup |Lμ (w) − LZ (w)| , (10.2)
w∈W

and it can be shown that, under some restrictions on the complexity of W, the expected
supremum on the right-hand side decays to zero as the sample n → ∞, i.e., empiri-
cal risks converge to population risks uniformly over the hypothesis class. However, in
many cases it is possible to attain asymptotically vanishing excess risk without uniform
convergence (see the article by Shalev-Shwartz et al. [2] for several practically relevant
examples); moreover, the bound in (10.2) is oblivious to the interaction between the data
Z and the algorithm’s output W, and may be rather loose in some settings (for example,
if the algorithm has a fixed computational budget and therefore cannot be expected to
explore the entire hypothesis space W).

10.1.1 Algorithmic Stability


As discussed above, machine-learning algorithms are stochastic transformations (or
channels, in information-theoretic terms) that map training data to hypotheses. To gauge
the quality of the resulting hypothesis, one should ideally evaluate its performance on
a “fresh” independent dataset. In practice, this additional verification step is not always
performed. In the era of “small” datasets, such a procedure would have been deemed
wasteful, while these days it is routine to test many hypotheses and tune many parame-
ters on the same dataset [3]. One may ask whether the quality of the hypothesis can be
estimated simply by the resubstitution estimate LZ (W) or by some other statistic, such
Information-Theoretic Stability and Generalization 305

as the leave-one-out estimate. Algorithmic stability, which quantifies the sensitivity of


learning algorithms to local modifications of the training dataset, was introduced in the
1970s by Devroye and Wagner [4] and Rogers and Wagner [5] as a tool for control-
ling the error of these estimates. More recently, it was studied by Ron and Kearns [6],
Bousquet and Elisseeff [7], Poggio et al. [8], and Shalev-Shwartz et al. [2]. Kutin and
Niyogi [9] provided a taxonomy of at least twelve different notions of stability, and
Rakhlin et al. [10] studied another handful.
Recently, the interest in algorithmic stability was renewed through the work on differ-
ential privacy [11], which quantifies the sensitivity of the distribution of the algorithm’s
output to local modifications in the data, and can therefore be viewed as a form of algo-
rithmic stability. Once the connection to generalization bounds had been established,
it was used to study adaptive methods that choose hypotheses by interacting with the
data over multiple rounds [12]. Quantifying stability via differential privacy is an inher-
ently information-theoretic idea, as it imposes limits on the amount of information the
algorithm can glean from the observed data. Moreover, differential privacy behaves
nicely under composition of algorithms [13, 14], which makes it particularly amenable
to information-theoretic analysis.
In this chapter, which is based in part on preliminary results from [15, 16], we focus
on information-theoretic definitions of stability, which capture the idea that the output
of a stable learning algorithm cannot depend “too much” on any particular example
from the training dataset. These notions of stability are weaker (i.e., less restrictive) than
differential privacy.

10.2 Preliminaries

10.2.1 Information-Theoretic Background


The relative entropy between two probability measures μ and ν on a common measurable
space (Ω, F ) is given by
⎧  


⎪ dμ
⎪Eμ log
⎨ , if μ ν,
D(μν) := ⎪
⎪ dν


⎩+∞, otherwise,

where denotes absolute continuity of measures, and dμ/dν is the Radon–Nikodym


derivative of μ with respect to ν. Here and in what follows, we work with natural
logarithms. The total variation distance between μ and ν is defined as

dTV (μ, ν) := sup |μ(E) − ν(E)|,


E∈F

where the supremum is over all measurable subsets of Ω. An equivalent representation is


  
dTV (μ, ν) = sup  f dμ − f dν. (10.3)
f :Ω→[0,1] Ω Ω
306 Maxim Raginsky et al.

Moreover, if μ ν, then
  
dν − 1.
1 dμ
dTV (μ, ν) = (10.4)
2 Ω dν
For these and related results, see Section 1.2 of [17].
If (U, V) is a random couple with joint probability law PUV , the mutual information
between U and V is defined as I(U; V) := D(PUV PU ⊗ PV ). The conditional mutual
information between U and V given a third random element Y jointly distributed with
(U, V) is defined as

I(U; V|Y) := PY (dy)D(PUV|Y=y PU|Y=y ⊗ PV|Y=y ).

If we use the total variation distance instead of the relative entropy, we obtain the
T -information T (U; V) := dTV (PUV , PU ⊗ PV ) (see, e.g., [18]) and the conditional
T -information

T (U; V|Y) := PY (dy)dTV (PUV|Y=y , PU|Y=y ⊗ PV|Y=y ).

The erasure mutual information [19] between jointly distributed random objects U and
V = (V1 , . . . , Vm ) is

m
I − (U; V) = I − (U; V1 , . . . , Vm ) := I(U; Vk |V −k ),
k=1
−k
where V := (V1 , . . . , Vk−1 , Vk+1 , . . . , Vm ). By analogy, we define the erasure T -
information as

m
T − (U; V) = T − (U; V1 , . . . , Vm ) := T (U; Vk |V −k ).
k=1

The erasure mutual information I − (U; V) is related to the usual mutual information
I(U; V) = I(U; V1 , . . . , Vm ) via the identity

m
I − (U; V) = mI(U; V) − I(U; V −k ) (10.5)
k=1

(Theorem 7 of [19]). Moreover, I − (U; V) may be larger or smaller than I(U; V), as can
be seen from the following examples [19]:
• if U takes at most countably many values and V1 = . . . = Vm = U, then I(U; V) =
H(U), the Shannon entropy of U, while I − (U; V) = 0;
i.i.d.
• if V1 , . . . , Vm ∼ Bern( 12 ) and U = V1 ⊕ V2 ⊕ · · · ⊕ Vm , then I(U; V) = log 2, while
I − (U; V) = n log 2;
• if U ∼ N(0, σ2 ) and Vm = U + Nm , where N1 , . . . , Nm are i.i.d. N(0, 1) independent
of U, then
1
I(U; V) = log(1 + nσ2 ),
2
Information-Theoretic Stability and Generalization 307

n σ2
I − (U; V) = log 1 + .
2 1 + (n − 1)σ2
We also have the following.
proposition 10.1 If V1 , . . . , Vm are independent, then, for an arbitrary U jointly
distributed with V,

I(U; V) ≤ I − (U; V). (10.6)

Proof For any k ∈ [m], we can write

I(U; Vk |V1 , . . . , Vk−1 ) = I(U, V1 , . . . , Vk−1 ; Vk )


≤ I(U, V −k ; Vk )
= I(U; Vk |V −k ),

where the first and third steps use the chain rule and the independence of V1 , . . . , Vm ,
while the second step is provided by the data-processing inequality. Summing over all
k, we get (10.6).

10.2.2 Sub-Gaussian Random Variables and a Decoupling Estimate


A real-valued random variable U with EU < ∞ is σ2 -sub-Gaussian if, for every λ ∈ R,

E[eλ(U−EU) ] ≤ eλ
2 σ2 /2
. (10.7)

A classic result due to Hoeffding states that any almost surely bounded random variable
is sub-Gaussian.
lemma 10.1 (Hoeffding [20]) If a ≤ U ≤ b almost surely, for some −∞ < a ≤ b < ∞, then

E[eλ(U−EU) ] ≤ eλ
2 (b−a)2 /8
, (10.8)
 
i.e., U is (b − a)2 /4 -sub-Gaussian.
Consider a pair of random elements U and V of some spaces U and V, respectively,
with joint distribution PUV . Let Ū and V̄ be independent copies of U and V, such
that PŪ V̄ = PU ⊗ PV . For an arbitrary real-valued function f : U × V → R, we have the
following upper bound on the absolute difference between E[ f (U, V)] and E[ f (Ū, V̄)].
lemma 10.2 If f (u, V) is σ2 -sub-Gaussian under PV for every u, then
  
E[ f (U, V)] − E[ f (Ū, V̄)] ≤ 2σ2 I(U; V). (10.9)

Proof We exploit the Donsker–Varadhan variational representation of the relative


entropy, Corollary 4.15 of [21]: for any two probability measures π, ν on a common
measurable space (Ω, F ),
  
D(πν) = sup F dπ − log e dν ,
F
(10.10)
F Ω Ω
308 Maxim Raginsky et al.


where the supremum is over all measurable functions F : Ω → R, such that eF dν < ∞.
From (10.10), we know that, for any λ ∈ R,
 
D(PV|U=u PV ) ≥ E[λ f (u, V)|U = u] − log E eλ f (u,V)
  λ2 σ 2
≥ λ E[ f (u, V)|U = u] − E[ f (u, V)] − , (10.11)
2
where the second step follows from the sub-Gaussian assumption on f (u, V):
  λ2 σ 2
log E eλ( f (u,V)−E[ f (u,V)]) ≤ ∀λ ∈ R.
2
By maximizing the right-hand side of (10.11) over all λ ∈ R and rearranging we obtain
  
E[ f (u, V)|U = u] − E[ f (u, V)] ≤ 2σ2 D(PV|U=u PV ).
Then, using the law of iterated expectation and Jensen’s inequality,
   
E[ f (U, V)] − E[ f (Ū, V̄)] = E[E[ f (U, V) − f (U, V̄)|U]]

≤ PU (du)|E[ f (u, V)|U = u] − E[ f (u, V)]|
U 
≤ PU (du) 2σ2 D(PV|U=u PV )
U

≤ 2σ2 D(PUV PU ⊗ PV ).
The result follows by noting that I(U; V) = D(PUV PU ⊗ PV ).

10.3 Learning Algorithms and Stability

The magnitude of gen(μ, PW|Z ) is determined by the stability properties of PW|Z , i.e.,
by the sensitivity of PW|Z to local modifications of the training data Z. We wish to
quantify this variability in information-theoretic terms. Let PW|Z=z denote the prob-
ability distribution of the output of the algorithm in response to a deterministic z =
(z1 , . . . , zn ) ∈ Zn . This coincides with the conditional distribution of W given Z = z, i.e.,
PW|Z=z (·) = PW|Z=z (·). Recalling the notation z−i := (z1 , . . . , zi−1 , zi+1 , . . . , zn ), we can write
the conditional distribution of W given Z−i = z−i in the following form:

PW|Z−i =z−i (·) = μ(dzi )PW|Z=(z1 ,...,zi ,...,zn ) (·); (10.12)
Z
unlike PW|Z=z , this conditional distribution is determined by both μ and PW|Z . We put
forward the following definition.
definition 10.1 Given the data-generating distribution μ, we say that a learning
algorithm PW|Z is (ε, μ)-stable in erasure T -information if
T − (W; Z) = T − (W; Z1 , . . . , Zn ) ≤ nε, (10.13)
and (ε, μ)-stable in erasure mutual information if
I − (W; Z) = I − (W; Z1 , . . . , Zn ) ≤ nε, (10.14)
Information-Theoretic Stability and Generalization 309

where all expectations are w.r.t. P. We say that PW|Z is ε-stable (in erasure T -information
or in erasure mutual information) if it is (ε, μ)-stable in the appropriate sense for every μ.
These two notions of stability are related.
lemma 10.3 Consider a learning algorithm PW|Z and a data-generating distribution μ.

1. If PW|Z is (ε, μ)-stable in erasure mutual information, then it is ( ε/2, μ)-stable in
erasure T -information.
2. If PW|Z is (ε, μ)-stable in erasure T -information with ε ≤ 1/4 and the hypothesis
class W is finite, i.e., |W| < ∞, then PW|Z is (ε log(|W|/ε), μ)-stable in erasure mutual
information.
Proof For Part 1, using Pinsker’s inequality and the concavity of the square root,
we have
1
n
1 −
T (W; Z) = T (W; Zi |Z−i )
n n i=1

1 1
n
≤ I(W; Zi |Z−i )
n i=1 2


1 
n
≤ I(W; Zi |Z−i )
2n i=1

ε
≤ .
2
For Part 2, since the output W is finite-valued, we can express the conditional mutual
information I(W; Zi |Z−i ) as the difference of two conditional entropies:

I(W; Zi |Z−i ) = H(W|Z−i ) − H(W|Zi , Z−i )


= H(W|Z−i ) − H(W|Z)
  
= μ⊗n (dz) H(PW|Z−i =z−i ) − H(PW|Z=z ) . (10.15)

We now recall the following inequality (Theorem 17.3.3 of [22]): for any two probability
distributions μ and ν on a common finite set Ω with dTV (μ, ν) ≤ 1/4,

|Ω|
|H(μ) − H(ν)| ≤ dTV (μ, ν)log . (10.16)
dTV (μ, ν)
Applying (10.16) to (10.15), we get

I(W; Zi |Z−i ) ≤ μ⊗n (dz)dTV (PW|Z−i =z−i , PW|Z=z )

|W|
· log . (10.17)
dTV (PW|Z−i =z−i , PW|Z=z )
310 Maxim Raginsky et al.

Since the function u → u log(1/u) is concave, an application of Jensen’s inequality to


(10.17) gives
|W|
I(W; Zi |Z−i ) ≤ T (W; Zi |Z−i )log .
T (W; Zi |Z−i )
Summing from i = 1 to i = n, using the fact that the function u → u log(1/u) is concave
and monotone increasing on [0, 1/4], and the stability assumption on PW|Z , we obtain
the claimed stability result.
The next lemma gives a sufficient condition for stability by comparing the distributions
PW|Z=z and PW|Z=z for any two training sets that differ in one instance.
lemma 10.4 A learning algorithm PW|Z is ε-stable in erasure mutual information if
D(PW|Z=z PW|Z=z ) ≤ ε and in erasure T -information if dTV (PW|Z=z , PW|Z=z ) ≤ ε for all
z, z ∈ Z n with

n
dH (z, z ) := 1{zi  zi } ≤ 1,
i=1

where dH (·, ·) is the Hamming distance on Z . n

remark 10.1 Note that the sufficient conditions of Lemma 10.4 are distribution-free;
that is, they do not involve μ. These notions of stability were introduced recently by
Bassily et al. [23] under the names of KL- and TV-stability.
Proof We give the proof for the erasure mutual information; the case of the erasure
T -information is analogous. Fix μ ∈ P(Z), s ∈ Zn , and i ∈ [n]. For z ∈ Z, let zi,z denote the
n-tuple obtained by replacing zi in z with z. Then

D(PW|Z=z PW|Z−i =z−i ) ≤ μ(dz )D(PW|Z=z PW|Z=zi,z ) ≤ ε,
Z
where the first inequality follows from the convexity of the relative entropy. The claimed
result follows by substituting this estimate into the expression

I(W; Zi |Z−i ) = μ⊗n (dz)D(PW|Z=z PW|Z−i =z−i )
Zn
for the conditional mutual information.
Another notion of stability results if we consider plain T -information and mutual
information.
definition 10.2 Given a pair (μ, PW|Z ), we say that PW|Z is (ε, μ)-stable in
T -information if
T (W; Z) = T (W; Z1 , . . . , Zn ) ≤ nε, (10.18)
and (ε, μ)-stable in mutual information if
I(W; Z) = I(W; Z1 , . . . , Zn ) ≤ nε, (10.19)
where all expectations are w.r.t. P. We say that PW|Z is ε-stable (in T-information or in
mutual information) if it is (ε, μ)-stable in the appropriate sense for every μ.
Information-Theoretic Stability and Generalization 311

lemma 10.5 Consider a learning algorithm PW|Z and a data-generating distribution μ.



1. If PW|Z is (ε, μ)-stable in mutual information, then it is ( ε/2, μ)-stable in
T -information.
2. If PW|Z is (ε, μ)-stable in T -information with ε ≤ 1/4 and the hypothesis class W is
finite, i.e., |W| < ∞, then PW|Z is (ε log(|W|/ε), μ)-stable in mutual information.
3. If PW|Z is (ε, μ)-stable in erasure mutual information, then it is (ε, μ)-stable in
mutual information.
4. If PW|Z is (ε, μ)-stable in mutual information, then it is (nε, μ)-stable in erasure
mutual information.
Proof Parts 1 and 2 are proved essentially in the same way as in Lemma 10.3. For
Part 3, since the Zi s are independent, I(W; Z) ≤ I − (W; Z) by Proposition 10.1. Hence,
I − (W; Z) ≤ nε implies I(W; Z) ≤ nε. For Part 4, I − (W; Z) ≤ nI(W; Z) by (10.5), so
I(W; Z) ≤ nε implies I − (W; Z) ≤ n2 ε.

10.4 Information Stability and Generalization

In this section, we will relate the generalization error of an arbitrary learning algorithm
to its information-theoretic stability properties. We start with the following simple, but
important, result.
theorem 10.1 If the loss function  takes values in [0, 1], then, for any pair (μ, PW|Z ),
1
| gen(μ, PW|Z )| ≤ T − (W; Z). (10.20)
n
In particular, if PW|Z is (ε, μ)-stable in erasure T -information, then | gen(μ, PW|Z )| ≤ 2ε.
Proof The proof technique is standard in the literature on algorithmic stability (see,
e.g., the proof of Lemma 7 in [7], the discussion at the beginning of Section 3.1 in [10],
or the proof of Lemma 11 in [2]); note, however, that we do not require PW|Z to be
symmetric. Introduce an auxiliary sample Z = (Z1 , . . . , Zn ) ∼ μ⊗n that is independent of
(Z, W) ∼ P. Since E[Lμ (W)] = E[(W, Zi )] for each i ∈ [n], we write
1
n
gen(μ, PW|Z ) = E[(W, Zi ) − (W, Zi )].
n i=1

Now, for each i ∈ [n] let us denote by W (i) the output of the algorithm when the input is

equal to Zi,Zi . Then, since the joint probability law of (W, Z, Zi ) evidently coincides with

the joint probability law of (W (i) , Zi,Zi , Zi ), we have
E[(W, Zi ) − (W, Zi )] = E[(W (i) , Zi ) − (W (i) , Zi )]. (10.21)
Moreover,

E(W (i) , Zi ) = μ⊗(n−1) (dz−i )μ(dzi )μ(dzi )P i,z  (dw)(w, zi )
W|Z=z i

= μ⊗n (dz)PW|Z−i =z−i (dw)(w, zi ), (10.22)
312 Maxim Raginsky et al.

where in the second line we have used (10.12), and



E(W , Zi ) = μ⊗(n−1) (dz−i )μ(dzi )μ(dzi )P
(i) 
i,z (dw)(w, zi )
W|Z=z i

= μ⊗(n−1) (dz−i )μ(dzi )μ(dzi )PW|Z=z (dw)(w, zi )

= μ⊗n (dz)PW|Z=z (w, zi ), (10.23)

where in the second line we used the fact that Z1 , . . . , Zn and Zi are i.i.d. draws from μ.
Using (10.22) and (10.23), we obtain
 
E(W (i) , Z ) − E(W (i) , Z  )
i i
   
≤ μ⊗n (dz) PW|Z−i =z−i (dw)(w, zi ) − PW|Z=z (dw)(w, zi )
Zn W W

≤ μ⊗n (dz)dTV (PW|Z=z , PW|Z−i =z−i )
Zn

= T (W; Zi |Z−i ),
where we have used (10.3) and the fact that T (U; V|Y) = EdTV (PU|VY , PU|Y ) [18].
Summing over i ∈ [n] and using the definition of T − (W; Z), we get the claimed
bound.
The following theorem replaces the assumption of bounded loss with a less restrictive
sub-Gaussianity condition.
theorem 10.2 Consider a pair (μ, PW|Z ), where (w, Z) is σ2 -sub-Gaussian under μ for
every w ∈ W. Then

2σ2
|gen(μ, PW|Z )| ≤ I(W; Z). (10.24)
n

In particular, if PW|Z is (ε, μ)-stable in mutual information, then | gen(μ, PW|Z )| ≤ 2σ2 ε.
remark 10.2 Upper bounds on the expected generalization error in terms of the mutual
information I(Z; W) go back to earlier results of McAllester on PAC-Bayes methods [24]
(see also the tutorial paper [25]).
remark 10.3 For a bounded loss function (·, ·) ∈ [a, b], (w, Z) is (b − a)2 /4-sub-
Gaussian for all μ and all w ∈ W, by Hoeffding’s lemma.
remark 10.4 Since Z1 , . . . , Zn are i.i.d., I(W; Z) ≤ I − (W; Z), by Proposition 10.1. Thus,
if PW|Z is ε-stable in erasure mutual information, then it is automatically ε-stable √ in
mutual information, and therefore the right-hand side of (10.24) is bounded by 2σ2 ε.
On the other hand, proving stability in erasure mutual information is often easier than
proving stability in mutual infornation.

Proof For each w, LZ (w) = (1/n) ni=1 (w, Zi ) is (σ2 /n)-sub-Gaussian. Thus, we can
apply Lemma 10.2 to U = W, V = Z, f (U, V) = LZ (W).
Information-Theoretic Stability and Generalization 313

Theorem 10.2 allows us to control the generalization error of a learning algorithm by


the erasure mutual information or by the plain mutual information between the input
and the output of the algorithm. Russo and Zou [26] considered the same setup, but
with the restriction to finite hypothesis spaces, and showed that |gen(μ, PW|Z )| can be
upper-bounded in terms of I(ΛW (Z); W), where
 
ΛW (Z) := LZ (w) w∈W (10.25)

is the collection of empirical risks of the hypotheses in W. Using Lemma 10.2 by set-
ting U = ΛW (Z), V = W, and f (ΛW (z), w) = Lz (w), we immediately recover the result
obtained by Russo and Zou even when W is uncountably infinite.
theorem 10.3 (Russo and Zou [26]) Suppose (w, Z) is σ2 -sub-Gaussian under μ for
all w ∈ W, then

  2σ2
 gen(μ, PW|Z ) ≤ I(W; ΛW (Z)). (10.26)
n
It should be noted that Theorem 10.2 can also be obtained as a consequence of
Theorem 10.3 because

I(W; ΛW (Z)) ≤ I(W; Z), (10.27)

by the data-processing inequality for the Markov chain ΛW (Z) − Z − W. The latter holds
since, for each w ∈ W, LZ (w) is a function of Z. However, if the output W depends on Z
only through the empirical risks ΛW (Z) (i.e., the Markov chain Z − ΛW (Z) − W holds),
then Theorems 10.2 and 10.3 are equivalent. The advantage of Theorem 10.2 is that
I(W; Z) is often much easier to evaluate than I(W; ΛW (Z)). We will elaborate on this
when we discuss the Gibbs algorithm and adaptive composition of learning algorithms.
Recent work by Jiao et al. [27] extends the results of Russo and Zou by introducing
a generalization of mutual information that can handle the cases when (w, Z) is not
sub-Gaussian.

10.4.1 A Concentration Inequality for |Lμ (W) − LZ (W)|


Theorems 10.2 and 10.3 merely give upper bounds on the expected generalization error.
We are often interested in analyzing the expected value or the tail behavior of the
absolute generalization error |Lμ (W) − LZ (W)|. This is the subject of the present section.
For any fixed w ∈ W, if (w, Z) is σ2 -sub-Gaussian, the Chernoff–Hoeffding bound
gives P[|Lμ (w) − LZ (w)| > α] ≤ 2e−α n/2σ . Thus, if Z and W are independent, a sample
2 2

size of
2σ2 2
n≥ log (10.28)
α2 β
suffices to guarantee

P[|Lμ (W) − LZ (W)| > α] ≤ β. (10.29)


314 Maxim Raginsky et al.

The following results pertain to the case when W and Z are dependent, but the mutual
information I(W; Z) is sufficiently small. The tail probability now is taken with respect
to the joint distribution P = μ⊗n ⊗ PW|Z .
theorem 10.4 Suppose (w, Z) is σ2 -sub-Gaussian under μ for all w ∈ W. If a learning
algorithm satisfies I(W; ΛW (Z)) ≤ C, then for any α > 0 and 0 < β ≤ 1, (10.29) can be
guaranteed by a sample complexity of

8σ2 C 2
n≥ + log . (10.30)
α2 β β

In view of (10.27), any learning algorithm that is (ε, μ)-stable in mutual information
satisfies the condition I(W; ΛW (Z)) ≤ nε. We also have the following corollary.
corollary 10.1 Under the conditions in Theorem 10.4, if C ≤ (g(n) − 1)β log(2/β)
for some function g(n) ≥ 1, then a sample complexity that satisfies n/g(n) ≥
(8σ2 /α2 ) log(2/β) guarantees (10.29).
For example, taking g(n) = 2, Corollary 10.1 implies that, if C ≤ β log(2/β), then
(10.29) can be guaranteed by a sample complexity of n = (16σ2 /α2 ) log(2/β), which
is on the same order as the sample complexity when Z and W are independent as in

(10.28). As another example, taking g(n) = n, Corollary 10.1 implies that, if C ≤
√  
( n − 1)β log(2/β), then a sample complexity of n = (64σ4 /α4 ) log(2/β) 2 guarantees
(10.29).
Recent papers of Dwork et al. [28, 29] give “high-probability” bounds on the abso-
lute generalization error of differentially private algorithms with bounded loss functions,
i.e., the tail bound P[|Lμ (W) − LZ (W)| > α] ≤ β is guaranteed to hold whenever n 
(1/α2 ) log(1/β). By contrast, Theorem 10.4 does not require differential privacy and
assumes that (w, Z) is sub-Gaussian. Bassily et al. [30] obtain a concentration inequal-
ity on the absolute generalization error on the same order as the bound of Theorem 10.4
and show that this bound is sharp – they give an example of a learning problem (μ, W, )
and a learning algorithm PW|Z that satisfies I(W; Z) ≤ O(1) and

1 1
P |LZ (W) − Lμ (W)| ≥ ≥ .
2 n

They also give an example where a sufficient amount of mutual information is necessary
in order for the ERM algorithm to generalize.
The proof of Theorem 10.4 is based on Lemma 10.2 and an adaptation of the “monitor
technique” of Bassily et al. [23]. We first need the following two lemmas. The first
lemma is a simple consequence of the tensorization property of mutual information.
lemma 10.6 Consider the parallel execution of m independent copies of PW|Z on
independent datasets Z1 , . . . , Zm : for t = 1, . . . , m, an independent copy of PW|Z takes
Zt ∼ μ⊗n as input and outputs Wt . Let Zm := (Z1 , . . . , Zm ) be the overall dataset. If
under μ, PW|Z satisfies I(W; ΛW (Z)) ≤ C, then the overall algorithm PW m |Zm satisfies
I(W m ; ΛW (Z1 ), . . . , ΛW (Zm )) ≤ mC.
Information-Theoretic Stability and Generalization 315

Proof The proof is by the independence among (Zt , Wt ), t = 1, . . . , m, and the chain rule
of mutual information.
The next lemma is the key piece. It will be used to construct a procedure that executes
m copies of a learning algorithm in parallel and then selects the one with the largest
absolute generalization error.
lemma 10.7 Let Zm = (Z1 , . . . , Zm ), where each Zt ∼ μ⊗n is independent of
all of the others. If an algorithm PW,T,R|Zm : Z m×n → W × [m] × {±1} satisfies
I(W, T, R; (ΛW (Z1 ), . . . , ΛW (Zm )) ≤ C, and if (w, Z) is σ2 -sub-Gaussian for all
w ∈ W, then

  2σ2C
E R(LZT (W) − Lμ (W)) ≤ .
n
Proof The proof is based on Lemma 10.2. Let U = (ΛW (Z1 ), . . . , ΛW (Zm )), V =
(W, T, R), and
 
f (ΛW (z1 ), . . . , ΛW (zm )), (w, t, r) = rLzt (w).

If (w, Z) is σ2 -sub-Gaussian under Z ∼ μ for all w ∈ W, then (r/n) ni=1 (w, Zt,i ) is
(σ2 /n)-sub-Gaussian for all w ∈ W, t ∈ [m] and r ∈ {±1}, and hence f (u, V) is (σ2 /n)-
sub-Gaussian for every u. Lemma 10.2 implies that

2σ2 I(W, T, R; ΛW (Z1 ), . . . , ΛW (Zm ))
E[RLZT (W)] − E[RLμ (W)] ≤ ,
n
which proves the claim.
Note that the upper bound in Lemma 10.7 does not depend on m. With these lemmas,
we can prove Theorem 10.4.
Proof of Theorem 10.4 First, let PW m |Zm be the parallel execution of m independent
copies of PW|Z , as in Lemma 10.6. Given Zm and W m , let the output of the “monitor” be
a sample (W ∗ , T ∗ , R∗ ) drawn from W × [m] × {±1} according to
 
(T ∗ , R∗ ) = arg max r Lμ (Wt ) − LZt (Wt ) and W ∗ = WT ∗ . (10.31)
t∈[m], r∈{±1}

This gives
   
R∗ Lμ (W ∗ ) − LZT ∗ (W ∗ ) = max Lμ (Wt ) − LZt (Wt ).
t∈[m]

Taking the expectation of both sides, we have


     
E R∗ Lμ (W ∗ ) − LZT ∗ (W ∗ ) = E max Lμ (Wt ) − LZt (Wt ) . (10.32)
t∈[m]

Note that, conditionally on W m , the tuple (W ∗ , T ∗ , R∗ ) can take only 2m values, which
means that
I(W ∗ , T ∗ , R∗ ; ΛW (Z1 ), . . . , ΛW (Zm )|W m ) ≤ log(2m). (10.33)
In addition, since PW|Z is assumed to satisfy I(W; ΛW (Z)) ≤ C, Lemma 10.6 implies that
I(W m ; ΛW (Z1 ), . . . , ΛW (Zm )) ≤ mC.
316 Maxim Raginsky et al.

Therefore, by the chain rule of mutual information and the data-processing inequality,
we have

I(W ∗ , T ∗ , R∗ ; ΛW (Z1 ), . . . , ΛW (Zm ))


≤ I(W m , W ∗ , T ∗ , R∗ ; ΛW (Z1 ), . . . , ΛW (Zm ))
≤ mC + log(2m).

By Lemma 10.7 and the assumption that (w, Z) is σ2 -sub-Gaussian,



 ∗ ∗ ∗  2σ2  
E R LZT ∗ (W ) − Lμ (W ) ≤ mC + log(2m) . (10.34)
n
Combining (10.34) and (10.32) gives

   2σ2  
E max LZt (Wt ) − Lμ (Wt ) ≤ mC + log(2m) . (10.35)
t∈[m] n
The rest of the proof is by contradiction. Choose m = 1/β. Suppose the algorithm
PW|Z does not satisfy the claimed generalization property, namely
  
P LZ (W) − Lμ (W) > α > β. (10.36)

Then, by the independence among the pairs (Zt , Wt ), t = 1, . . . , m,


   
P max LZt (Wt ) − Lμ (Wt ) > α > 1 − (1 − β)1/β > .
1
t∈[m] 2
Thus
   α
E max LZt (Wt ) − Lμ (Wt ) > . (10.37)
t∈[m] 2
Combining (10.35) and (10.37) gives

α 2σ2 C 2
< + log . (10.38)
2 n β β

The above inequality implies that


8σ2 C 2
n< + log , (10.39)
α2 β β
which contradicts the condition in (10.30). Therefore, under the condition in (10.30), the
assumption in (10.36) cannot hold. This completes the proof.

A byproduct of the proof of Theorem 10.4 (setting m = 1 in (10.35)) is an upper bound


on the expected absolute generalization error.
theorem 10.5 Suppose (w, Z) is σ2 -sub-Gaussian under μ for all w ∈ W. If a learning
algorithm satisfies I(W; ΛW (Z)) ≤ C, then

  2σ2
ELμ (W) − LZ (W) ≤ (C + log 2). (10.40)
n
Information-Theoretic Stability and Generalization 317

 
This result improves on Proposition 3.2 of [26], which states that ELZ (W) − Lμ (W) ≤
√ 
σ/ n + 36 2σ2C/n. Theorem 10.5  together
 with Markov’s inequality implies that

(10.29) can be guaranteed by n ≥ 2σ2 /α2 β2 C + log 2 , but it has a worse dependence
on β than does the sample complexity given by Theorem 10.4.

10.5 Information-Theoretically Stable Learning Algorithms

In this section, we illustrate the framework of information-theoretic stability in the con-


text of several learning problems and algorithms. We first consider two cases where
upper bounds on the input–output mutual information of a learning algorithm can be
obtained in terms of the geometric or combinatorial properties of the hypothesis space.
Then we show that one can obtain learning algorithms with controlled input–output
mutual information by regularizing the ERM algorithm. We also discuss alternative
methods to induce input–output mutual information stability, as well as stability of
learning algorithms built by adaptive composition.

10.5.1 Countable Hypothesis Spaces


When the hypothesis space is countable, the input–output mutual information can be
upper-bounded by H(W), the entropy of W. In that case, if (w, Z) is σ2 -sub-Gaussian
for all w ∈ W, any learning algorithm PW|Z satisfies

  2σ2 H(W)
gen(μ, PW|Z ) ≤ (10.41)
n
by Theorem 10.2. If |W| = k < ∞, we have H(W) ≤ log k, and the bound (10.41) can be
weakened to

  2σ2 log k
gen(μ, PW|Z ) ≤ .
n
However, a simple argument based on the union bound also shows that, when |S | = k < ∞,
we have
 
2σ2 log k
E max |LZ (w) − Lμ (w)| ≤ .
w∈W n
Thus, the information-theoretic bound (10.41) is useful only when the learning algorithm
PW|Z is such that H(W) log k.
An uncountable hypothesis space can often be replaced with a countable one by quan-
tizing (or discretizing) the output hypothesis. For example, suppose W is a bounded
subset of Rd equipped with some norm  · , i.e., W ⊆ {w ∈ Rd : w ≤ B} for some
B < ∞. Let PW|Z be an arbitrary learning algorithm. Let N(r,  · , W) denote the cov-
ering number of W, defined as the cardinality of the smallest set W ⊂ Rm , such that for
every w ∈ W there exists some w ∈ W with w − w  ≤ r. Consider the Markov chain
Z−W−W  , where W  = arg minv∈W v − W, with ties broken arbitrarily. If r < B, then,
318 Maxim Raginsky et al.

using standard estimates for covering numbers in finite-dimensional Banach spaces [31],
we can write
3B
I(Z; W  ) ≤ log N(W, · · , r) ≤ d log ,
r
and therefore, under the sub-Gaussian assumption on , the composite learning algoritm
PW  |Z satisfies

  2σ2 d 3B
 gen(μ, PW  |Z ) ≤ log . (10.42)
n r

Forexample, if we set r = 3B/ n, the above bound on the generalization error will scale
 2 
as σ d log n /n. If (·, z) is Lipschitz, i.e., |(w, z) − (w , z)| ≤ w − w 2 , then we can
use (10.42) to obtain the following generalization bound for the original algorithm PW|Z :
⎛  ⎞
  ⎜⎜⎜ 2d ⎟⎟⎟

 gen(μ, PW|Z ) ≤ inf ⎜⎜⎜2r +

log
3B ⎟⎟⎟.
r≥0⎝ n r ⎟⎠

Again, taking r = 3B/ n, we get

  2σ2 d log n 6B
 gen(μ, PW|Z ) ≤ + √ .
n n

10.5.2 Binary Classification


Recall the problem of binary classification, briefly described in Section 16.1: Z = X × Y,
where Y = {0, 1}, W is a collection of classifiers w : X → Y (which could be uncountably
infinite), and (w, z) = 1{w(x)  y}. Before proceeding, we need some basic facts from
Vapnik–Chervonenkis theory [1, 32]. We say that W shatters a set S ⊂ X if for every
S ⊆ S there exists some w ∈ W such that, for any x ∈ S,
w(x) = 1{x∈S } .
The Vapnik–Chervonenkis dimension (or VC dimension) of W is defined as
' (
vc(W) := sup |S| : S is shattered by W .
If vc(W) is finite, we say that W is a VC class. For example, if X ⊆ Rd and W consists
of indicators of all halfspaces of Rd , then vc(W) = d + 1. A fundamental combinatorial
estimate, known as the Sauer–Shelah lemma, states that, for any finite set S ⊆ X and any
VC class W with vc(W) = V,
V
e|S|
|{(w(x) : x ∈ S) : w ∈ W}| ≤ ≤ (|S| + 1)V .
V
With these preliminaries out of the way, we can use Theorem 10.2 to perform a sim-
ple analysis of the following two-stage algorithm [32, 33] that can achieve the same
performance as ERM. Given the dataset Z of size n, split it into Z1 and Z2 with sizes n1
and n2 . First, pick a subset of hypotheses W1 ⊂ W based on Z1 , such that the binary
Information-Theoretic Stability and Generalization 319

strings (w(X1 ), . . . , w(Xn1 )) for w ∈ W1 are all distinct and {(w(X1 ), . . . , w(Xn1 )) : w ∈
W1 } = {(w(X1 ), . . . , w(Xn1 ) : w ∈ W}. In other words, W1 forms an empirical cover of
W with respect to Z1 . Then pick a hypothesis from W1 with the minimal empirical risk
on Z2 , i.e.,
W = arg min LZ2 (w). (10.43)
w∈W1

Let V denote the VC dimension of W. We can upper-bound the expected generalization


error of W with respect to Z2 as

  V log(n1 + 1)
E[Lμ (W)] − E[LZ2 (W)] = E E[Lμ (W) − LZ2 (W)|Z1 ] ≤ , (10.44)
2n2
where we have used Theorem 10.2 and the fact that I(W; Z2 |Z1 = z1 ) ≤ H(W|Z1 = z1 ) ≤
V log(n1 + 1), by the Sauer–Shelah lemma. It can also be shown that [32, 33]

  V
E[LZ2 (W)] ≤ E inf Lμ (w) ≤ inf Lμ (w) + c , (10.45)
w∈W1 w∈W n1
where the second expectation is taken with respect to W1 , which depends on Z1 , and c
is an absolute constant. Combining (10.44) and (10.45) and setting n1 = n2 = n/2, we
have, for some constant c,

V log n
E[Lμ (W)] ≤ inf Lμ (w) + c . (10.46)
w∈W n
From an information-theoretic point of view, the above two-stage algorithm effectively
controls the conditional mutual information I(W; Z2 |Z1 ) by extracting an empirical cover
of W using Z1 , while maintaining a small empirical risk using Z2 .

10.5.3 Gibbs Algorithm


Theorem 10.2 upper-bounds the generalization error of a learning algorithm by the
mutual information I(W; Z). Hence, it is natural to consider an idealized algorithm that
minimizes the empirical risk regularized by I(W; Z):
1
P◦W|Z = arg min E[LZ (W)] + I(Z; W) , (10.47)
PW|Z β
where β > 0 is a parameter that balances data fit and generalization. Since μ is unknown
to the learning algorithm, we relax the above optimization problem by appealing to the
so-called golden formula for the mutual information:
I(U; V) = D(PUV PU ⊗ QV ) − D(PV QV ), (10.48)
which is valid for any QV satisfying D(PV QV ) < ∞. In particular, applying (10.48) with
U = Z and V = W, we see that we can choose an arbitrary probability distribution Q on
W and upper-bound I(W; Z) as follows:
I(W; Z) ≤ D(PWZ PZ ⊗ Q)

= μ⊗n (dz)D(PW|Z=z Q).
Zn
320 Maxim Raginsky et al.

Using this, we relax the minimization problem (10.47) to


1
P∗W|Z = arg min E[LZ (W)] + D(PWZ Q ⊗ PZ ) (10.49a)
PW|Z β
  
⊗n 1
= arg min μ (dz) PW|Z=z (dw)Lz (w) + D(PW|Z=z Q) . (10.49b)
PW|Z Zn W β
It is evident that the solution of the relaxed optimization problem does not depend on μ,
and it is straightforward to obtain the following.
theorem 10.6 The solution to the optimization problem (10.49) is given by the Gibbs
algorithm
e−βLz (w) Q(dw)
P∗W|Z=z (dw) = for each z ∈ Zn . (10.50)
EQ [e−βLz (W) ]
remark 10.5 We would not have been able to arrive at the Gibbs algorithm had we
used I(W; ΛW (Z)) as the regularization term instead of I(W; Z) in (10.47), even if we
had chosen to upper-bound I(W; ΛW (Z)) by D(PW|ΛW (Z) Q|PΛW (Z) ).
Using Theorems 10.1 and 10.2, we can obtain the following distribution-free bounds
on the generalization error of the Gibbs algorithm.
theorem 10.7 If the function  takes values in [0, 1], then, for any data-generating
distribution μ, the generalization error of the Gibbs algorithm defined in (10.50) is
bounded as

∗ 1 −2β/n
 β β
| gen(μ, PW|Z )| ≤ 1 − e ∧ ∧ . (10.51)
2 2n n
Proof Fix any two z, z ∈ Zn with dH (z, z ) = 1 and let i ∈ [n] be the coordinate where
zi  zi . Then, from the identity
1 
Lz (w) = (w, zi ) − (w, zi ) + Lz (w),
n
it follows that
    

EQ [e−βLz (W) ] EQ exp −(β/n) (W, zi ) − (W, zi ) − βLz (W)
=
EQ [e−βLz (W) ] EQ [e−βLz (W) ]
≤ eβ/n .
Consequently,
dP∗W|Z=z )β * EQ [e−βLz (W) ]
(w) = exp (w, zi ) − (w, zi ) ≤ e2β/n ,
dP∗W|Z=z n EQ [e−βLz (W) ]
   
and, interchanging the roles of z and z , we see that e−2β/n ≤ dP∗W|Z=z / dP∗W|Z=z ≤
e2β/n . It follows that
 ⎛ ∗ ⎞
⎜⎜⎜ dPW|Z=z ⎟⎟⎟ 2β
∗ ∗
D(PW|Z=z PW|Z=z ) = dPW|Z=z log⎜⎜⎝ ∗
∗ ⎟⎟⎠ ≤
W dP n
W|Z=z
Information-Theoretic Stability and Generalization 321

and
  dP∗ 
 1 
dP∗W|Z=z  ∗ − 1 ≤ 1 − e−2β/n ,
1
dTV (P∗W|Z=z , P∗W|Z=z )
W|Z=z
=
2 W  dPW|Z=z  2
where we have used (10.4). Another bound on the relative entropy between P∗W|Z=z and
P∗W|Z=z can be obtained as follows. We start with

β   EQ [e−βLz (W) ]
D(P∗W|Z=z P∗W|Z=z ) = P∗W|Z=z (dw) (w, zi )−(w, zi ) +log . (10.52)
n W EQ [e−βLz (W) ]
Then

EQ [e−βLz (W) ] β 
log −βL
= log exp (w, zi ) − (w, zi ) P∗W|Z=z (dw)
EQ [e z (W) ] W n

β   β2
≤ P∗W|Z=z (dw) (w, zi ) − (w, zi ) + 2 , (10.53)
n W 2n
where the inequality follows by applying Hoeffding’s lemma to the random vari-
able (W, zi ) − (W, zi ) with W ∼ P∗W|Z=z , which takes values in [−1, 1] almost surely.
Combining (10.52) and (10.53), we get the bound
β2
D(P∗W|Z=z P∗W|Z=z ) ≤
2n2
for any z, z ∈ Z with dH (z, z ) = 1. Invoking Lemma 10.4, we see that P∗W|Z is (1 −
n

e−2β/n )-stable in erasure T -information and ((2β/n) ∧ (β2 /2n2 ))-stable in erasure mutual
information.
Therefore, Theorem 10.1 gives | gen(μ, P∗W|Z )| ≤ 1 − e−2β/n . Moreover, since  takes
values in [0, 1], the sub-Gaussian assumption of Theorem
 10.2 is satisfied with σ2 = 1/4

by Hoeffding’s lemma, and thus | gen(μ, PW|Z )| ≤ β/n ∧ (β/2n). Taking the minimum
of the two bounds, we obtain (10.51).
With the above guarantees on the generalization error, we can analyze the population
risk of the Gibbs algorithm. We first present a result for countable hypothesis spaces.
corollary 10.2 Suppose W is countable. Let W denote the output of the Gibbs
algorithm applied to Z, and let wo denote the hypothesis that achieves the minimum
population risk among W. If  takes values in [0, 1], the population risk of W satisfies
1 1 β
E[Lμ (W)] ≤ inf Lμ (w) + log + . (10.54)
w∈W β Q(wo ) 2n
Proof We can bound the expected empirical risk of the Gibbs algorithm P∗W|Z as
1
E[LZ (W)] ≤ E[LZ (W)] + D(P∗W|Z Q|PZ ) (10.55)
β
1
≤ E[LZ (w)] + D(δw Q) for all w ∈ W, (10.56)
β
where δw is the point mass at w. The second inequality is obtained via the optimality of
the Gibbs P∗W|Z in (10.49), since one can view δw as a sub-optimal learning algorithm that
322 Maxim Raginsky et al.

simply ignores the dataset and always outputs w. Taking w = wo , noting that E[LZ (wo )] =
Lμ (wo ), and combining this with the upper bound of Theorem 10.7 we obtain
1 β
E[Lμ (W)] ≤ inf Lμ (w) + D(δwo Q) + . (10.57)
w∈W β 2n
This leads to (10.54), as D(δwo Q) = − log Q(wo ) when W is countable.

The auxiliary distribution Q in the Gibbs algorithm can encode any prior knowledge
of the population risks of the hypotheses in W. For example, we can order the hypotheses
according to our (possibly imperfect) prior knowledge of their population risks, and set

Q(wm ) = 6/(π2 m2 ) for the mth hypothesis in the order.2 Then, setting β = n, (10.54)
becomes
2 log mo + 1
E[Lμ (W)] ≤ inf Lμ (w) + √ , (10.58)
w∈W n
where mo is the index of wo in the order. Thus, better prior knowledge of the population
risks leads to a smaller sample complexity to achieve a certain expected excess risk. As
another example, if |W| = k < ∞ and we do not have any a priori preferences among
the hypotheses,
 then we can take Q to be the uniform distribution on W. Upon setting
β = 2n log k, (10.54) becomes

2 log k
E[Lμ (W)] ≤ inf Lμ (w) + .
w∈W n
For uncountable hypothesis spaces, we can proceed in an analogous fashion to analyze
the population risk under the assumption that the loss function is Lipschitz.
corollary 10.3 Suppose W = Rd with the Euclidean norm  · 2 . Let wo be the hypoth-
esis that achieves the minimum population risk among W. Suppose that  takes values
in [0, 1], and (·, z) is ρ-Lipschitz for all z ∈ Z. Let W denote the output of the Gibbs
algorithm applied to Z. The population risk of W satisfies
β √ 1  
E[Lμ (W)] ≤ inf Lμ (w) + + inf aρ d + D N(wo , a2 Id )Q , (10.59)
w∈W 2n a>0 β
where N(v, Σ) denotes the d-dimensional normal distribution with mean v and covari-
ance matrix Σ.
Proof Just as we did in the proof of Corollary 10.2, we first bound the expected
empirical risk of the Gibbs algorithm P∗W|Z . For any a > 0, we can view N(wo , a2 Id )
as a learning algorithm that ignores the dataset and always draws a hypothesis from
this Gaussian distribution. Denote by γd the standard normal density on Rd . The
non-negativity of relative entropy and (10.49) imply that
1
E[LZ (W)] ≤ E[LZ (W)] + D(P∗W|Z Q|PZ ) (10.60)
β

∞
m=1 (1/m ) = (π /6) < 2.
2 Recall that 2 2
Information-Theoretic Stability and Generalization 323

 )w−w *
0 1  
≤ E[LZ (w)]γd dw + D N(wo , a2 Id )Q (10.61)
a β
W )w−w *
0 1  
= Lμ (w)γd dw + D N(wo , a2 Id )Q . (10.62)
W a β
Combining this with the upper bound on the expected generalization error of Theo-
rem 10.7, we obtain
 )w−w *
o 1   β
E[Lμ (W)] ≤ inf Lμ (w)γd dw + D N(wo , a2 Id )Q + . (10.63)
a>0 W a β 2n
Since (·, z) is ρ-Lipschitz for all z ∈ Z, we have that, for any w ∈ W,
|Lμ (w) − Lμ (wo )| ≤ E[|(w, Z) − (wo , Z)|] ≤ ρw − wo 2 . (10.64)
Then
 )w−w * 
0   ) w − w0 *
Lμ (w)γd dw ≤ Lμ (wo ) + ρw − wo 2 γd dw (10.65)
W a W a

≤ Lμ (wo ) + ρa d. (10.66)
Substituting this into (10.63), we obtain (10.59).
Again, we can use the distribution Q to express our preference of the hypotheses in
W. For example, we can choose Q = N(wQ , b2 Id ) with b = n−1/4 d−1/4 ρ−1/2 and choose
β = n3/4 d1/4 ρ1/2 . Then, setting a = b in (10.59), we have
d1/4 ρ1/2  
E[Lμ (W)] ≤ inf Lμ (w) + 1/4
wQ − wo 22 + 3 . (10.67)
w∈W 2n
This result places essentially no restrictions on W, which could be unbounded, and
only requires the Lipschitz condition on (·, z), which could be non-convex. The sample
complexity decreases with better prior knowledge of the optimal hypothesis.

10.5.4 Differentially Private Learning Algorithms


As mentioned in Section 10.1.1, differential privacy can be interpreted as a stability
property. In this section, we investigate this in detail. A learning algorithm PW|Z is (ε, δ)-
differentially private [13] if, for any measurable set F ⊆ W,
dH (z, z ) ≤ 1 =⇒ PW|Z=z (F) ≤ eε PW|Z=z (F) + δ. (10.68)
theorem 10.8 Suppose that the loss function $\ell$ takes values in [0, 1]. If $P_{W|Z}$ is (ε, δ)-differentially private for some ε ≥ 0 and δ ∈ [0, 1], then
$$|\,\mathrm{gen}(\mu, P_{W|Z})\,| \le 1 - e^{-\varepsilon}(1 - \delta) \qquad (10.69)$$
for any data-generating distribution μ. If $\ell$ is arbitrary, but the hypothesis space W is finite, the pair $(\mu, P_{W|Z})$ satisfies the conditions of Theorem 10.2, and $e^{\varepsilon} \le 4(1-\delta)/3$, then
$$|\,\mathrm{gen}(\mu, P_{W|Z})\,| \le \sqrt{ 2\sigma^2 \big(1 - e^{-\varepsilon}(1-\delta)\big)\, \log\frac{|\mathcal{W}|}{1 - e^{-\varepsilon}(1-\delta)} }. \qquad (10.70)$$
Proof We will prove that any (ε, δ)-differentially private learning algorithm is $(1 - e^{-\varepsilon}(1-\delta))$-stable in total variation. To that end, let us rewrite the differential privacy condition (10.68) as follows. For any z, z' with $d_{\mathrm{H}}(z, z') \le 1$,
$$E_{e^{\varepsilon}}\big( P_{W|Z=z} \,\|\, P_{W|Z=z'} \big) \le \delta, \qquad (10.71)$$
where the $E_\gamma$-divergence (with γ ≥ 1) between two probability measures μ and ν on a measurable space (Ω, F) is defined as
$$E_\gamma(\mu\,\|\,\nu) := \max_{E \in \mathcal{F}} \big( \mu(E) - \gamma\, \nu(E) \big)$$
(see, e.g., [34] or p. 28 of [35]). It satisfies the following inequality (see Section VII.A of [36]):
$$E_\gamma(\mu\,\|\,\nu) \ge 1 - \gamma\big(1 - d_{\mathrm{TV}}(\mu, \nu)\big). \qquad (10.72)$$
Using (10.72) in (10.71) with $\gamma = e^{\varepsilon}$, we get
$$d_{\mathrm{H}}(z, z') \le 1 \;\Longrightarrow\; d_{\mathrm{TV}}\big( P_{W|Z=z}, P_{W|Z=z'} \big) \le 1 - e^{-\varepsilon}(1 - \delta),$$
and therefore the algorithm is $(1 - e^{-\varepsilon}(1-\delta))$-stable in erasure T-information by Lemma 10.4. Using Theorem 10.1, we obtain Eq. (10.69); the inequality (10.70) follows from (10.72), Lemma 10.3, and Theorem 10.2.
For example, the Gibbs algorithm is (2β/n, 0)-differentially private [37], so
Theorem 10.7 is a special case of Theorem 10.8.
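For a quick numerical feel, the helper below (a small illustration; the function name and the example values are ours) evaluates the bound (10.69) and applies it to the Gibbs algorithm through its (2β/n, 0) privacy guarantee.

```python
import numpy as np

def dp_generalization_bound(eps, delta):
    """Bound (10.69): |gen(mu, P_{W|Z})| <= 1 - e^{-eps} * (1 - delta), for losses in [0, 1]."""
    return 1.0 - np.exp(-eps) * (1.0 - delta)

n = 1000
for beta in (np.sqrt(n), n / 10.0):
    eps = 2.0 * beta / n          # the Gibbs algorithm is (2*beta/n, 0)-differentially private
    print(f"beta = {beta:7.2f}  ->  generalization bound {dp_generalization_bound(eps, 0.0):.4f}")
```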

10.5.5 Noisy Empirical Risk Minimization


Another algorithm that provides control on the input–output mutual information is noisy
ERM, where we add independent scalar noise Nw to the empirical risk of each hypothesis
w ∈ W and then output any hypothesis that minimizes the noise-perturbed empirical risk:
 
$$W = \arg\min_{w\in\mathcal{W}} \big( L_Z(w) + N_w \big). \qquad (10.73)$$

Just as in the case of the Gibbs algorithm, we can encode our preferences for (or prior
knowledge about) various hypotheses by controlling the amount of noise added to each
hypothesis. The following result formalizes this idea.
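A minimal sketch of noisy ERM for a finite, preference-ordered hypothesis list is given below; the noise schedule $b_m = m^{1.1}/n^{1/3}$ matches the choice analyzed below, while the stand-in empirical risks and function names are illustrative assumptions.

```python
import numpy as np

def noisy_erm(emp_risks, n, seed=0):
    """Noisy ERM (10.73): add independent Exp(mean b_m) noise with b_m = m^1.1 / n^(1/3)
    and return the index of the hypothesis minimizing the perturbed empirical risk."""
    rng = np.random.default_rng(seed)
    emp_risks = np.asarray(emp_risks, dtype=float)
    m = np.arange(1, len(emp_risks) + 1)
    b = m**1.1 / n**(1.0 / 3.0)              # preferred (low-index) hypotheses receive less noise
    return int(np.argmin(emp_risks + rng.exponential(scale=b)))

rng = np.random.default_rng(1)
k, n = 100, 8000
emp_risks = rng.uniform(0.2, 0.8, size=k)    # stand-in for L_Z(w_1), ..., L_Z(w_k)
chosen = noisy_erm(emp_risks, n)
```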
corollary 10.4 Suppose $\mathcal{W} = \{w_m\}_{m=1}^{\infty}$ is countably infinite, and the ordering is such that a hypothesis with a lower index is preferred over one with a higher index. Also suppose $\ell \in [0,1]$. For the noisy ERM algorithm in (10.73), choosing $N_m$ to be an exponential random variable with mean $b_m$, we have
$$\mathbb{E}[L_\mu(W)] \le \min_{m} L_\mu(w_m) + b_{m_o} + \sqrt{ \frac{1}{2n} \sum_{m=1}^{\infty} \frac{L_\mu(w_m)}{b_m} } - \bigg( \sum_{m=1}^{\infty} \frac{1}{b_m} \bigg)^{-1}, \qquad (10.74)$$
where $m_o = \arg\min_m L_\mu(w_m)$. In particular, choosing $b_m = m^{1.1}/n^{1/3}$, we have
$$\mathbb{E}[L_\mu(W)] \le \min_{m} L_\mu(w_m) + \frac{m_o^{1.1} + 3}{n^{1/3}}. \qquad (10.75)$$
Without adding noise, the ERM algorithm applied to the above case when card(W) = k < ∞ can achieve $\mathbb{E}[L_\mu(W_{\mathrm{ERM}})] \le \min_{m\in[k]} L_\mu(w_m) + \sqrt{(1/2n)\log k}$. Compared with (10.75), we see that performing noisy ERM may be beneficial when we have high-quality prior knowledge of $w_o$ and when k is large.
Proof We prove the result assuming card (W) = k < ∞. When W is countably infinite,
the proof carries over by taking the limit as k → ∞.
First, we upper-bound the expected generalization error via I(W; Z). We have the following chain of inequalities:
$$I(W; Z) \le I\big( (L_Z(w_m) + N_m)_{m\in[k]};\, (L_Z(w_m))_{m\in[k]} \big)$$
$$\le \sum_{m=1}^{k} I\big( L_Z(w_m);\, L_Z(w_m) + N_m \big)$$
$$\le \sum_{m=1}^{k} \log\Big( 1 + \frac{\mathbb{E}[L_Z(w_m)]}{b_m} \Big)$$
$$= \sum_{m=1}^{k} \log\Big( 1 + \frac{L_\mu(w_m)}{b_m} \Big), \qquad (10.76)$$
where we have used the data-processing inequality for mutual information; the fact that, for product channels, the overall input–output mutual information is upper-bounded by the sum of the input–output mutual information of individual channels [38]; the formula for the capacity of the additive exponential noise channel under an input mean constraint [39]; and the fact that $\mathbb{E}[L_Z(w_m)] = L_\mu(w_m)$. The assumption that $\ell$ takes values in [0, 1] implies that $\ell(w, Z)$ is 1/4-sub-Gaussian for all w ∈ W, and, as a consequence of (10.76),
$$\mathrm{gen}(\mu, P_{W|Z}) \le \sqrt{ \frac{1}{2n} \sum_{m=1}^{k} \log\Big( 1 + \frac{L_\mu(w_m)}{b_m} \Big) }. \qquad (10.77)$$
We now upper-bound the expected empirical risk. By the construction of the algorithm, almost surely
$$L_Z(W) = L_Z(W) + N_W - N_W \le L_Z(w_{m_o}) + N_{m_o} - N_W \le L_Z(w_{m_o}) + N_{m_o} - \min\{N_m,\, m \in [k]\}.$$
Taking the expectation on both sides, we get
$$\mathbb{E}[L_Z(W)] \le L_\mu(w_{m_o}) + b_{m_o} - \bigg( \sum_{m=1}^{k} \frac{1}{b_m} \bigg)^{-1}. \qquad (10.78)$$
Combining (10.77) and (10.78), we have
$$\mathbb{E}[L_\mu(W)] \le \min_{m\in[k]} L_\mu(w_m) + \sqrt{ \frac{1}{2n} \sum_{m=1}^{k} \log\Big( 1 + \frac{L_\mu(w_m)}{b_m} \Big) } + b_{m_o} - \bigg( \sum_{m=1}^{k} \frac{1}{b_m} \bigg)^{-1},$$
which leads to (10.74) with the fact that log(1 + x) ≤ x.


When $b_m = m^{1.1}/n^{1/3}$, using the fact that
$$\sum_{m=1}^{k} \frac{1}{m^{1.1}} \le 11 - 10\, k^{-1/10}$$
and upper-bounding the $L_\mu(w_m)$s by 1, we get
$$\mathbb{E}[L_\mu(W)] \le \min_{m\in[k]} L_\mu(w_m) + \frac{1}{n^{1/3}}\bigg( \sqrt{ \tfrac{1}{2}\big( 11 - 10 k^{-1/10} \big) } + m_o^{1.1} - \frac{1}{11 - 10 k^{-1/10}} \bigg) \le \min_{m\in[k]} L_\mu(w_m) + \frac{3 + m_o^{1.1}}{n^{1/3}},$$
which proves (10.75).

10.5.6 Other Methods to Induce Input–Output Mutual Information Stability


There is a wide variety of methods for controlling the input–output mutual information of a learning algorithm. One method is to pre-process the dataset Z and then run a learning algorithm on the transformed dataset Z̃. The pre-processing step can consist of adding noise to the data or removing some of the instances from the dataset. For any choice of the pre-processing mechanism and the learning algorithm, we have the Markov chain Z − Z̃ − W, which implies $I(W; Z) \le \min\{ I(Z; \tilde Z),\, I(W; \tilde Z) \}$. Another method is to post-process the output of a learning algorithm. For instance, the entries of the weight matrix W̃ generated by a neural-network training algorithm can be quantized or perturbed by noise. Any choice of the learning algorithm and the post-processing mechanism induces the Markov chain Z − W − W̃, which implies $I(\tilde W; Z) \le \min\{ I(\tilde W; W),\, I(W; Z) \}$. One can use strong data-processing inequalities [40] to sharpen
these upper bounds on I(W; Z). Pre-processing of the dataset and post-processing of the
output hypothesis are among the numerous regularization methods used in deep learning
(see Chapter 7.5 of [41]). Other regularization methods may also be interpreted as strate-
gies to induce the input–output mutual information stability of a learning algorithm; this
would be an interesting direction of future research.
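As a concrete and deliberately simple instance of post-processing, the sketch below quantizes a trained weight vector onto a uniform grid; since the quantized W̃ takes finitely many values, I(W̃; Z) ≤ H(W̃) is bounded by the number of bits of the grid, whatever the training algorithm did. The grid range and resolution are illustrative choices of ours.

```python
import numpy as np

def quantize_weights(w, num_levels=16, w_min=-1.0, w_max=1.0):
    """Post-process W into W~ on a uniform grid:
    I(W~; Z) <= H(W~) <= len(w) * log2(num_levels) bits."""
    grid = np.linspace(w_min, w_max, num_levels)
    idx = np.argmin(np.abs(np.clip(w, w_min, w_max)[:, None] - grid[None, :]), axis=1)
    return grid[idx]

w_trained = np.random.default_rng(2).standard_normal(8) * 0.3   # stand-in for a learned W
w_tilde = quantize_weights(w_trained)
```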

10.5.7 Adaptive Composition of Learning Algorithms


Apart from controlling the generalization error of individual learning algorithms, the
input–output mutual information can also be used for analyzing the generalization
capability of complex learning algorithms obtained by adaptively composing simple
constituent algorithms. Under a k-fold adaptive composition, a common Z is shared
by k learning algorithms that are executed sequentially. Specifically, for each j ∈ [k], the output $W_j$ of the jth algorithm is drawn from a hypothesis space $\mathcal{W}_j$, based on the data Z and on the outputs $W^{j-1}$ of the previously executed algorithms, according to $P_{W_j | Z, W^{j-1}}$. An example with k = 2 is the simplest case of selective inference [3]: data-driven model selection followed by a learning algorithm executed on the same dataset.
Various boosting techniques in machine learning can also be viewed as instances of adaptive composition. From the data-processing inequality and the chain rule of mutual information,
$$I(W_k; Z) \le I(W^k; Z) = \sum_{j=1}^{k} I\big( W_j; Z \,\big|\, W^{j-1} \big). \qquad (10.79)$$
If the Markov chain $Z - \Lambda_{W_j}(Z) - W_j$ holds conditional on $W^{j-1}$ for j = 1, . . . , k, then the upper bound in (10.79) can be sharpened to $\sum_{j=1}^{k} I\big( W_j; \Lambda_{W_j}(Z) \,\big|\, W^{j-1} \big)$. Thus, we can
control the generalization error of the final output by controlling the conditional mutual
information at each step of the composition. This also gives us a way to analyze the gen-
eralization error of the composite learning algorithm using local information-theoretic
properties of the constituent algorithms. Recent work by Feldman and Steinke [42] pro-
poses a stronger notion of stability in erasure mutual information (uniformly with respect
to all, not necessarily product, distributions of Z) and applies it to adaptive data analysis.
Armed with this notion, they design and analyze a noise-adding algorithm that calibrates
the noise variance to the empirical variance of the data. A related information-theoretic
notion of stability was also proposed by Cuff and Yu [43] in the context of differential
privacy.

Acknowledgments

The authors would like to thank the referees and Nir Weinberger for reading the
manuscript and for suggesting a number of corrections, and Vitaly Feldman for valu-
able general comments and for pointing out the connection to the work of McAllester
on PAC-Bayes bounds. The work of A. Rakhlin is supported in part by the NSF under
grant no. CDS&E-MSS 1521529 and by the DARPA Lagrange program. The work of
M. Raginsky and A. Xu is supported in part by NSF CAREER award CCF–1254041 and
in part by the Center for Science of Information (CSoI), an NSF Science and Technology
Center, under grant agreement CCF-0939370.

References

[1] V. N. Vapnik, Statistical learning theory. Wiley, 1998.


[2] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability, and
uniform convergence,” J. Machine Learning Res., vol. 11, pp. 2635–2670, 2010.
[3] J. Taylor and R. J. Tibshirani, “Statistical learning and selective inference,” Proc. Natl.
Acad. Sci. USA, vol. 112, no. 25, pp. 7629–7634, 2015.
[4] L. Devroye and T. Wagner, “Distribution-free performance bounds for potential function
rules,” IEEE Trans. Information Theory, vol. 25, no. 5, pp. 601–604, 1979.
[5] W. H. Rogers and T. J. Wagner, “A finite sample distribution-free performance bound for
local discrimination rules,” Annals Statist., vol. 6, no. 3, pp. 506–514, 1978.
[6] D. Ron and M. Kearns, “Algorithmic stability and sanity-check bounds for leave-one-out
crossvalidation,” Neural Computation, vol. 11, no. 6, pp. 1427–1453, 1999.
[7] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learning Res.,
vol. 2, pp. 499–526, 2002.
[8] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, “General conditions for predictivity in
learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004.
[9] S. Kutin and P. Niyogi, “Almost-everywhere algorithmic stability and generalization error,”
in Proc. 18th Conference on Uncertainty in Artificial Intelligence (UAI 2002), 2002, pp.
275–282.
[10] A. Rakhlin, S. Mukherjee, and T. Poggio, “Stability results in learning theory,” Analysis
Applications, vol. 3, no. 4, pp. 397–417, 2005.
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in
private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
[12] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generalization in
adaptive data analysis and holdout reuse,” arXiv:1506.02629, 2015.
[13] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations
and Trends in Theoretical Computer Sci., vol. 9, nos. 3–4, pp. 211–407, 2014.
[14] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” in
Proc. 32nd International Conference on Machine Learning (ICML), 2015, pp. 1376–1385.
[15] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of
stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop,
2016.
[16] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of
learning algorithms,” in Proc. Conference on Neural Information Processing Systems,
2017.
[17] H. Strasser, Mathematical theory of statistics: Statistical experiments and asymptotic
decision Theory. Walter de Gruyter, 1985.
[18] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,”
IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
[19] S. Verdú and T. Weissman, “The information lost in erasures,” IEEE Trans. Information
Theory, vol. 54, no. 11, pp. 5030–5058, 2008.
[20] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer.
Statist. Soc., vol. 58, no. 301, pp. 13–30, 1963.
[21] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic
theory of independence. Oxford University Press, 2013.
[22] T. M. Cover and J. A. Thomas, Elements of information theory, 2nd edn. Wiley, 2006.
[23] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, “Algorithmic sta-
bility for adaptive data analysis,” in Proc. 48th ACM Symposium on Theory of Computing
(STOC), 2016, pp. 1046–1059.
[24] D. McAllester, “PAC-Bayesian model averaging,” in Proc. 1999 Conference on Learning
Theory, 1999.
[25] D. McAllester, “A PAC-Bayesian tutorial with a dropout bound,” arXiv:1307.2118, 2013,
http://arxiv.org/abs/1307.2118.
[26] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,”
in Proc. 19th International Conference on Artificial Intelligence and Statistics (AISTATS),
2016, pp. 1232–1240.
[27] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for
general measurements,” in Proc. IEEE International Symposium on Information Theory,
2017, pp. 1475–1479.
[28] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Preserving sta-
tistical validity in adaptive data analysis,” in Proc. 47th ACM Symposium on Theory of
Computing (STOC), 2015.
[29] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generaliza-
tion in adaptive data analysis and holdout reuse,” in 28th Annual Conference on Neural
Information Processing Systems (NIPS), 2015.
[30] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learners that use little
information,” in Proc. Conference on Algorithmic Learning Theory (ALT), 2018.
[31] R. Vershynin, High-dimensional probability: An introduction with applications in data
science. Cambridge University Press, 2018.
[32] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition.
Springer, 1996.
[33] K. L. Buescher and P. R. Kumar, “Learning by canonical smooth estimation. I. Simultane-
ous estimation,” IEEE Trans. Automatic Control, vol. 41, no. 4, pp. 545–556, 1996.
[34] J. E. Cohen, J. H. B. Kemperman, and G. Zbǎganu, Comparisons of stochastic matrices,
with applications in information theory, statistics, economics, and population sciences.
Birkhäuser, 1998.
[35] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation,
Princeton University, 2010.
[36] I. Sason and S. Verdú, “ f -divergence inequalities,” IEEE Trans. Information Theory,
vol. 62, no. 11, pp. 5973–6006, 2016.
[37] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in Proc. 48th
Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
[38] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture Notes for
ECE563 (UIUC) and 6.441 (MIT), 2012–2016, http://people.lids.mit.edu/yp/homepage/
data/itlectures_v4.pdf.
[39] S. Verdú, “The exponential distribution in information theory,” Problems of Information
Transmission, vol. 32, no. 1, pp. 86–95, 1996.
[40] M. Raginsky, “Strong data processing inequalities and Φ-Sobolev inequalities for discrete
channels,” IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[41] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
[42] V. Feldman and T. Steinke, “Calibrating noise to variance in adaptive data analysis,” in
Proc. 2018 Conference on Learning Theory, 2018.
[43] P. Cuff and L. Yu, “Differential privacy as a mutual information constraint,” in Proc.
2016 ACM SIGSAC Conference on Computer and Communication Security (CCS), 2016,
pp. 43–54.
11 Information Bottleneck and
Representation Learning
Pablo Piantanida and Leonardo Rey Vega

A grand challenge in representation learning is the development of computational algo-


rithms that learn the different explanatory factors of variation behind high-dimensional
data. Representation models (usually referred to as encoders) are often determined
for optimizing performance on training data when the real objective is to generalize
well to other (unseen) data. The first part of this chapter is devoted to providing an
overview of and introduction to fundamental concepts in statistical learning theory and
the information-bottleneck principle. It serves as a mathematical basis for the technical
results given in the second part, in which an upper bound to the generalization gap cor-
responding to the cross-entropy risk is given. When this penalty term times a suitable
multiplier and the cross-entropy empirical risk are minimized jointly, the problem is
equivalent to optimizing the information-bottleneck objective with respect to the empir-
ical data distribution. This result provides an interesting connection between mutual
information and generalization, and helps to explain why noise injection during the
training phase can improve the generalization ability of encoder models and enforce
invariances in the resulting representations.

11.1 Introduction and Overview

Information theory aims to characterize the fundamental limits for data compression,
communication, and storage. Although the coding techniques used to prove these funda-
mental limits are impractical, they provide valuable insight, highlighting key properties
of good codes and leading to designs approaching the theoretical optimum (e.g., turbo
codes, ZIP and JPEG compression algorithms). On the other hand, statistical models and
machine learning are used to acquire knowledge from data. Models identify relationships
between variables that allow one to make predictions and assess their accuracy. A good
choice of data representation is paramount for performing large-scale data processing
in a computationally efficient and statistically meaningful manner [1], allowing one to
decrease the need for storage, or to reduce inter-node communication if the data are
distributed.
Shannon’s abstraction of information merits careful study [2]. While a layman might
think that the problem of communication is to convey meaning, Shannon clarified
that “the fundamental problem of communication is that of reproducing at one point


a message selected at another point.” Shannon further argued that the meaning of a
message is subjective, i.e., dependent on the observer, and irrelevant to the engineering
problem of communication. However, what does matter for the theory of communica-
tion is finding suitable representations for given data. In source coding, for example, one
generally aims at distilling the relevant information from the data by removing unneces-
sary redundancies. This can be cast in information-theoretic terms, as higher redundancy
makes data more predictable and lowers their information content.
In the context of learning [3, 4], we propose to distinguish between these two rather
different aspects of data: information and knowledge. Information contained in data is
unpredictable and random, while additional structure and redundancy in the data stream
constitute knowledge about the data-generation process, which a learner must acquire.
Indeed, according to connectionist models [5], the redundancy contained within mes-
sages enables the brain to build up its cognitive maps and the statistical regularities in
these messages are being used for this purpose. Hence, this knowledge, provided by
redundancy [6, 7] in the data, must be what drives unsupervised learning. While infor-
mation theory is a unique success story, from its birth, it discarded knowledge as being
irrelevant to the engineering problem of communication. However, knowledge is recog-
nized as being a critical – almost central – component of representation learning. The
present text provides an information-theoretic treatment of this problem.
Knowledge representation. The data deluge of recent decades has led to new expec-
tations for scientific discoveries from massive data. While mankind is drowning in data, a
significant part of the data is unstructured, and it is difficult to discover relevant informa-
tion. A common denominator in these novel scenarios is the challenge of representation
learning: how to extract salient features or statistical relationships from data in order
to build meaningful representations of the relevant content. In many ways, deep neural
networks have turned out to be very good at discovering structures in high-dimensional
data and have dramatically improved the state-of-the-art in several pattern-recognition
tasks [8]. The global learning task is decomposed into a hierarchy of layers with non-
linear processing, a method achieving great success due to its ability not only to fit
different types of datasets but also to generalize incredibly well. The representational
capabilities of neural networks [9] have attracted significant interest from the machine-
learning community. These networks seem to be able to learn multi-level abstractions,
with a capability to harness unlabeled data, multi-task learning, and multiple inputs,
while learning from distributed and hierarchical data, to represent context at multiple
levels.
The actual goal of representation learning is neither accurate estimation of model
parameters [10] nor compact representation of the data themselves [11, 12]; rather,
we are mostly interested in the generalization capabilities, meaning the ability to suc-
cessfully apply rules extracted from previously seen data to characterize unseen data.
According to the statistical learning theory [13], models with many parameters tend to
overfit by representing the learned data too accurately, therefore diminishing their abil-
ity to generalize to unseen data. In order to reduce this “generalization gap,” i.e., the
difference between the “training error” and the “test error” (a measure of how well the
learner has learned), several regularization methods were proposed in the literature [13].

A recent breakthrough in this area has been the development of dropout [14] for training
deep neural networks. This consists in randomly dropping units during training to
prevent their co-adaptation, including some information-based regularization [15] that
yields a slightly more general form of the variational auto-encoder [16].
Why is it that we succeed in learning high-dimensional representations? Recently
there has been much interest in understanding the importance of implicit regularization.
Numerical experiments in [17] demonstrate that network size may not be the main form
of capacity control for deep neural networks and, hence, some other, unknown, form of
capacity control plays a central role in learning multilayer feed-forward networks. From
a theoretical perspective, regularization seems to be an indispensable component in order
to improve the final misclassification probability, while convincing experiments support
the idea that the absence of all regularization does not necessarily induce a poor gener-
alization gap. Possible explanations were approached via rate-distortion theory [18, 19]
by exploring heuristic connections with the celebrated information-bottleneck princi-
ple [20]. Within the same line of work, in [21, 22] Russo and Zou and Xu and Raginsky
have proven bounds showing that the square root of the mutual information between
the training inputs and the parameters inferred from the training algorithm provides a
concise bound on the generalization gap. These bounds crucially depend on the Markov
operator that maps the training set into the network parameters and whose characteriza-
tion could not be an easy task. Similarly, in [23] Achille and Soatto explored how the
use of an information-bottleneck objective on the network parameters (and not on the
representations) may help to avoid overfitting while enforcing invariant representations.
The interplay between information and complexity. The goal of data representation
may be cast as trying to find regularity in the data. Regularity may be identified with the
“ability to compress” by viewing representation learning as lossy data compression: this
tells us that, for a given set of encoder models and dataset, we should try to find the
encoder or combination of encoders that compresses the data most. In this sense, we
may speak of the information complexity of a structure, meaning the minimum amount
of information (number of bits) we need to store enough information about the structure
that allows us to achieve its reconstruction. The central result in this chapter states that
good representation models should squeeze out as much regularity as possible from
the given data. In other words, representations are expected to distill the meaningful
information present in the data, i.e., to separate structure as seeing the regularity from
noise, interpreted as the accidental information.
The structure of this chapter. This chapter can be read without any prior knowledge
of information theory and statistical learning theory. In the first part, the basic learning
framework for analysis is developed and an accessible overview of basic concepts in
statistical learning theory and the information-bottleneck principle are presented. The
second part introduces an upper bound to the generalization gap corresponding to the
cross-entropy loss and shows that, when this penalty term times a suitable multiplier
and the cross-entropy empirical risk are minimized jointly, the problem is equivalent
to optimizing the information-bottleneck objective with respect to the empirical data
distribution. The notion of information complexity is introduced and intuitions behind it
are developed.

11.2 Representation and Statistical Learning

We introduce the framework within which learning from examples is to be studied. We


develop precise notions of risk and the generalization gap, and discuss the mathematical
factors upon which these depend.

11.2.1 Basic Definitions


In this text we are concerned with the problem of pattern classification, which is about
predicting the unknown class of an example or observation. An example can be modeled
as an information source X ∈ X presented to the learner about a target concept Y ∈ Y
(the concept class). In our model we simply assume (X, Y) are abstract discrete spaces
equipped with a σ-algebra. In the problem of pattern classification, one searches for
a function c : X −→ Y which represents one’s guess of Y given X. Although there is
much to say about statistical learning, this section does not cover the field extensively
(an overview can be found in [13]). Besides this, we limit ourselves to describing the
key ideas in a simple way, often sacrificing generality.
definition 11.1 (Misclassification probability) An |Y|-ary classifier is defined by a (possibly stochastic) decision rule $Q_{\hat Y|X}: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, where $\hat Y \in \mathcal{Y}$ denotes the random variable associated with the classifier output and X is the information source. The probability of misclassification of a rule $Q_{\hat Y|X}$ with respect to a data distribution $P_{XY}$ is given by
$$P_E\big( Q_{\hat Y|X} \big) \triangleq 1 - \mathbb{E}_{P_{XY}}\big[ Q_{\hat Y|X}(Y|X) \big]. \qquad (11.1)$$

Minimizing over all possible classifiers $Q_{\hat Y|X}$ gives the smallest average probability of misclassification. An optimum classifier $c^\star(\cdot)$ chooses the hypothesis $\hat y \in \mathcal{Y}$ with largest posterior probability $P_{Y|X}$ given the observation x, that is, the maximum a posteriori (MAP) decision. The MAP test that breaks ties randomly with equal probability is given by¹
$$Q^{\mathrm{MAP}}_{\hat Y|X}(\hat y\,|\,x) \triangleq \begin{cases} \dfrac{1}{|\mathcal{B}(x)|}, & \text{if } \hat y \in \mathcal{B}(x), \\ 0, & \text{otherwise}, \end{cases} \qquad (11.2)$$
where the set B(x) is defined as
$$\mathcal{B}(x) \triangleq \Big\{ \hat y \in \mathcal{Y} : P_{Y|X}(\hat y\,|\,x) = \max_{y'\in\mathcal{Y}} P_{Y|X}(y'|x) \Big\}.$$

This classification rule is called the Bayes decision rule. The Bayes decision rule is opti-
mal in the sense that no other decision rule has a smaller probability of misclassification.
It is straightforward to obtain the following lemma.

1 In general, the optimum classifier given in (11.2) is not unique. Any conditional p.m.f. with support in
B(x) for each x ∈ X will be equally good.

lemma 11.1 (Bayes error) The misclassification error rate of the Bayes decision rule is given by
$$P_E\big( Q^{\mathrm{MAP}}_{\hat Y|X} \big) = 1 - \mathbb{E}_{P_X}\Big[ \max_{y'\in\mathcal{Y}} P_{Y|X}(y'|X) \Big]. \qquad (11.3)$$

Finding the Bayes decision rule requires knowledge of the underlying distribution
PXY , but typically in applications these distributions are not known. In fact, even a para-
metric form or an approximation to the true distribution is unknown. In this case, the
learner tries to overcome the lack of knowledge by resorting to labeled examples. In
addition, the probability of misclassification using the labeled examples has the par-
ticularity that it is mathematically hard to solve for the optimal decision rule. As a
consequence, it is common to work with a surrogate (information measure) given by
the average logarithmic loss or cross-entropy loss. This loss is used when a probabilis-
tic interpretation of the scores is desired by measuring the dissimilarity between the true
label distribution PY|X and the predicted label distribution Q Y|X , and is defined below.

lemma 11.2 (Surrogate based on the average logarithmic loss) A natural surrogate for the probability of misclassification $P_E(Q_{\hat Y|X})$ corresponding to a classifier $Q_{\hat Y|X}$ is given by the average logarithmic loss $\mathbb{E}_{P_{XY}}\big[ -\log Q_{\hat Y|X}(Y|X) \big]$, which satisfies
$$P_E\big( Q_{\hat Y|X} \big) \le 1 - \exp\Big( -\mathbb{E}_{P_{XY}}\big[ -\log Q_{\hat Y|X}(Y|X) \big] \Big). \qquad (11.4)$$
A lower bound for the average logarithmic loss can be computed as
$$\mathbb{E}_{P_{XY}}\big[ -\log Q_{\hat Y|X}(Y|X) \big] \ge H\big( P_{Y|X} \,\big|\, P_X \big).$$
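For finite alphabets, the MAP rule (11.2), its error (11.3), and the surrogate bound (11.4) can be checked directly; the sketch below does so for a randomly drawn joint distribution (the alphabet sizes and the random classifier are illustrative choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 6, 3
P_xy = rng.random((nx, ny)); P_xy /= P_xy.sum()      # joint distribution P_XY
P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]                    # posterior P_{Y|X}

# Bayes (MAP) error, Eq. (11.3): 1 - E_{P_X}[ max_y P_{Y|X}(y|X) ]
bayes_error = 1.0 - np.sum(P_x * P_y_given_x.max(axis=1))

# An arbitrary stochastic classifier, its error (11.1), and its logarithmic-loss surrogate
Q = rng.random((nx, ny)); Q /= Q.sum(axis=1, keepdims=True)
p_error = 1.0 - np.sum(P_xy * Q)
cross_entropy = -np.sum(P_xy * np.log(Q))
assert bayes_error <= p_error <= 1.0 - np.exp(-cross_entropy) + 1e-12   # bound (11.4)
```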

The average logarithmic loss can provide an effective and better-behaved surrogate for
the particular problem of minimizing the probability of misclassification [9]. Evidently,
the optimal decision rule for the average logarithmic loss is QY|X ≡ PY|X . This does not
match in general with the optimal decision rule for the probability of misclassification
QMAP in expression (11.2). Although the average logarithmic loss may induce an irre-
Y|X
ducible gap with respect to the probability of misclassification, it is clear that when the
true PY|X concentrates around a particular value y(x) for each x ∈ X (which is necessary
for a statistical model PY|X to induce low probability of misclassification) this gap could
be significantly reduced.

11.2.2 Learning Data Representations


We will concern ourselves with learning representation models (randomized encoders)
and self-classifiers (randomized decoders) from labeled examples, in other words, learn-
ing target probability distributions which are assumed to belong to some class of
distributions. The motivation behind this paradigm relies on a view of the brain as an
information processor that in solving certain problems (e.g., object recognition) builds a
series of internal representations starting with the sensory (external) input from which it
computes a function (e.g., detecting the orientations of edges in an image or learning to
recognize individual faces).

The problem of finding a good classifier can be divided into that of simultaneously
finding a (possibly randomized) encoder QU|X : X → P(U) that maps raw data to a
representation, possibly living in a higher-dimensional (feature) space U, and a soft-
decoder Q Y|U : U → P(Y), which maps the representation to a probability distribution
on the label space Y. Although these mappings induce an equivalent classifier,

$$Q_{\hat Y|X}(y|x) = \sum_{u\in\mathcal{U}} Q_{U|X}(u|x)\, Q_{\hat Y|U}(y|u), \qquad (11.6)$$
the computation of the latter expression requires marginalizing out u ∈ U, which is in general computationally hard due to the exponential number of atoms involved in the representations. A variational upper bound is commonly used to rewrite this intractable problem into
$$\mathbb{E}_{P_{XY}}\big[ -\log Q_{\hat Y|X}(Y|X) \big] \le \mathbb{E}_{P_{XY}}\, \mathbb{E}_{Q_{U|X}}\big[ -\log Q_{\hat Y|U}(Y|U) \big], \qquad (11.7)$$
which simply follows by applying the Jensen inequality [24]. This bound induces the well-known cross-entropy risk defined below.
definition 11.2 (Cross-entropy loss and risk) Given two randomized mappings $Q_{U|X}: \mathcal{X} \to \mathcal{P}(\mathcal{U})$ and $Q_{\hat Y|U}: \mathcal{U} \to \mathcal{P}(\mathcal{Y})$, we define the average (over representations) cross-entropy loss as
$$\ell\big( Q_{U|X}(\cdot|x),\, Q_{\hat Y|U}(y|\cdot) \big) \triangleq \big\langle Q_{U|X}(\cdot|x),\, -\log Q_{\hat Y|U}(y|\cdot) \big\rangle \qquad (11.8)$$
$$= -\sum_{u\in\mathcal{U}} Q_{U|X}(u|x)\, \log Q_{\hat Y|U}(y|u). \qquad (11.9)$$
We measure the expected performance of $(Q_{U|X}, Q_{\hat Y|U})$ via the risk function:
$$(Q_{U|X}, Q_{\hat Y|U}) \mapsto \mathcal{L}\big( Q_{U|X}, Q_{\hat Y|U} \big) \triangleq \mathbb{E}_{P_{XY}}\Big[ \ell\big( Q_{U|X}(\cdot|X),\, Q_{\hat Y|U}(Y|\cdot) \big) \Big]. \qquad (11.10)$$
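For finite alphabets, the encoder and decoder are stochastic matrices, the induced classifier (11.6) is a matrix product, and the risk (11.10) is a finite sum; the short sketch below (alphabet sizes and randomly drawn models are illustrative) also verifies the variational bound (11.7).

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, ny = 8, 4, 3
P_xy = rng.random((nx, ny)); P_xy /= P_xy.sum()                            # data distribution P_XY
Q_u_x = rng.random((nx, nu)); Q_u_x /= Q_u_x.sum(axis=1, keepdims=True)    # encoder Q_{U|X}
Q_y_u = rng.random((nu, ny)); Q_y_u /= Q_y_u.sum(axis=1, keepdims=True)    # decoder Q_{Y^|U}

# Induced classifier (11.6): Q_{Y^|X} = Q_{U|X} Q_{Y^|U}
Q_y_x = Q_u_x @ Q_y_u

# Cross-entropy risk (11.10) and the log loss of the induced classifier
risk = -np.sum(P_xy[:, None, :] * Q_u_x[:, :, None] * np.log(Q_y_u)[None, :, :])
log_loss = -np.sum(P_xy * np.log(Q_y_x))
assert log_loss <= risk + 1e-12                                            # bound (11.7)
```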
In addition to the points noted earlier, another crucial component of knowledge rep-
resentation is the use of deep representations. Formally speaking, we consider Kth-layer randomized encoders $\{Q_{U_k|U_{k-1}}\}_{k=1}^{K}$ with $U_0 \equiv X$ instead of one randomized encoder $Q_{U|X}$. Although this appears at first to be more general, it can be cast using the one-
layer randomized encoder formulation induced by the marginal distribution that relates
the input layer and the output layer of the network. Therefore any result for the one-
layer formulation immediately implies a result for the Kth-layer formulation, and for
this reason we shall focus on the one-layer case without loss of generality.
lemma 11.3 (Optimal decoders) The minimum cross-entropy risk satisfies
$$\inf_{Q_{\hat Y|U}:\, \mathcal{U}\to\mathcal{P}(\mathcal{Y})} \mathcal{L}\big( Q_{U|X}, Q_{\hat Y|U} \big) = H\big( Q_{Y|U} \,\big|\, Q_U \big), \qquad (11.11)$$
where
$$Q_{Y|U}(y|u) = \frac{ \sum_{x\in\mathcal{X}} Q_{U|X}(u|x)\, P_{XY}(x, y) }{ \sum_{x\in\mathcal{X}} Q_{U|X}(u|x)\, P_X(x) }. \qquad (11.12)$$
Proof The proof follows from the positivity of the relative entropy by noticing that $\mathcal{L}\big( Q_{U|X}, Q_{\hat Y|U} \big) = D\big( Q_{Y|U} \,\|\, Q_{\hat Y|U} \,\big|\, Q_U \big) + H\big( Q_{Y|U} \,\big|\, Q_U \big)$.

The associated risk to the optimal decoder is
$$\mathcal{L}\big( Q_{U|X}, Q_{Y|U} \big) \triangleq \mathbb{E}_{P_{XY}}\Big[ -\sum_{u\in\mathcal{U}} Q_{U|X}(u|X)\, \log Q_{Y|U}(Y|u) \Big], \qquad (11.13)$$
which is only a function of the encoder model $Q_{U|X}$. However, the optimal decoder cannot be determined since $P_{XY}$ is unknown.
The learner's goal is to select $Q_{U|X}$ and $Q_{\hat Y|U}$ by minimizing the risk (11.10). However,
since PXY is unknown the learner cannot directly measure the risk, and it is common to
measure the agreement of a pair of candidates with a finite training dataset in terms of
the empirical risk.
definition 11.3 (Empirical risk) Let $\hat P_{XY}$ denote the empirical distribution through the training dataset $\mathcal{S}_n \triangleq \{(x_1, y_1), \ldots, (x_n, y_n)\}$. The empirical risk is
$$\mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, Q_{\hat Y|U} \big) \triangleq \mathbb{E}_{\hat P_{XY}}\Big[ \ell\big( Q_{U|X}(\cdot|X),\, Q_{\hat Y|U}(Y|\cdot) \big) \Big] \qquad (11.14)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \ell\big( Q_{U|X}(\cdot|x_i),\, Q_{\hat Y|U}(y_i|\cdot) \big). \qquad (11.15)$$
lemma 11.4 (Optimality of empirical decoders) Given a randomized encoder $Q_{U|X}: \mathcal{X} \to \mathcal{P}(\mathcal{U})$, define the empirical decoder with respect to the empirical distribution $\hat P_{XY}$ as
$$\hat Q_{Y|U}(y|u) \triangleq \frac{ \sum_{x\in\mathcal{X}} Q_{U|X}(u|x)\, \hat P_{XY}(x, y) }{ \sum_{x\in\mathcal{X}} Q_{U|X}(u|x)\, \hat P_X(x) }. \qquad (11.16)$$
Then, the risk can be lower-bounded uniformly over $Q_{\hat Y|U}: \mathcal{U} \to \mathcal{P}(\mathcal{Y})$ as
$$\mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, Q_{\hat Y|U} \big) \ge \mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, \hat Q_{Y|U} \big), \qquad (11.17)$$
where equality holds provided that $Q_{\hat Y|U} \equiv \hat Q_{Y|U}$, i.e., the optimal decoder is computed from the encoder and the empirical distribution as done in (11.16).
Proof The inequality follows along the lines of Lemma 11.3 by noticing that $\mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, Q_{\hat Y|U} \big) = D\big( \hat Q_{Y|U} \,\|\, Q_{\hat Y|U} \,\big|\, \hat Q_U \big) + \mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, \hat Q_{Y|U} \big)$. Finally, the non-negativity of the relative conditional entropy completes the proof.
Since the empirical risk is evaluated on finite samples, its evaluation may be sensi-
tive to sampling (noise) error, thus giving rise to the issue of generalization. It can be
argued that a key component of learning is not just the development of a representa-
tion model on the basis of a finite training dataset, but its use in order to generalize to
unseen data. Clearly, successful generalization necessitates the closeness (in some sense)
of the selected representation and decoder models. Therefore, successful representation
learning would involve successful generalization. This chapter deals with the informa-
tion complexity of successful generalization. The generalization gap defined below is a
measure of how an algorithm could perform on new data, i.e., data that are not available
during the training phase. In the light of Lemmas 11.3 and 11.4, we will restrict our anal-
ysis to encoders only and assume that the optimal empirical decoder has been selected,


i.e., $Q_{\hat Y|U} \equiv \hat Q_{Y|U}$ both in the empirical risk (11.14) and in the true risk (11.10). This is reasonable given the fact that the true $P_{XY}$ is not known, and the only decoder that can be implemented in practice is the empirical one.
definition 11.4 (Generalization gap) Given a stochastic mapping $Q_{U|X}: \mathcal{X} \to \mathcal{P}(\mathcal{U})$, the generalization gap is defined as
$$(Q_{U|X}, \mathcal{S}_n) \mapsto \mathcal{E}_{\mathrm{gap}}(Q_{U|X}, \mathcal{S}_n) \triangleq \mathcal{L}\big( Q_{U|X}, \hat Q_{Y|U} \big) - \mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, \hat Q_{Y|U} \big), \qquad (11.18)$$
which represents the error incurred by the selected $Q_{U|X}$ when the rule $\mathcal{L}_{\mathrm{emp}}(Q_{U|X}, \hat Q_{Y|U})$ is used instead of the true risk $\mathcal{L}(Q_{U|X}, \hat Q_{Y|U})$.

11.2.3 Optimizing on Restricted Classes of Randomized Encoders


We have already introduced the notions of representation and inference models and risk
functions from which these candidates are chosen. Another related question of interest
is as follows: how do we define the encoder class? A simple approach is to model classes
in a parametric fashion. We first introduce the Bayes risk and then the restricted classes
of randomized encoders and decoders.
definition 11.5 (Bayes risk) The minimum cross-entropy risk over all possible candidates is called the Bayes risk and will be denoted by $\mathcal{L}^\star$. In this case,
$$\mathcal{L}^\star \triangleq \inf_{Q_{U|X}:\, \mathcal{X}\to\mathcal{P}(\mathcal{U})} \mathcal{L}\big( Q_{U|X}, Q_{\hat Y|U} \big) = H\big( P_{Y|X} \,\big|\, P_X \big). \qquad (11.19)$$

definition 11.6 (Learning model) The encoder functions are defined by $f_\theta: \mathcal{X}^d \times \mathcal{Z} \to \mathcal{U}_\theta^m$, where X is the finite input alphabet with cardinality |X| and d is a positive integer, $\theta \in \Theta \subset \mathbb{R}^{d_\Theta}$ denotes the unknown parameters to be optimized, Z is a random variable taking values on a finite alphabet Z with probability $P_Z$, whose role is to randomize encoders, and $\mathcal{U}_\theta \subset [0, 1]$ is the alphabet corresponding to the hidden representation, which satisfies $|\mathcal{U}_\theta| \le |\mathcal{X}|^d \cdot |\mathcal{Z}|$. For notational convenience, we let $\mathcal{X} \equiv \mathcal{X}^d$ and $\mathcal{U} \equiv \mathcal{U}_\theta^m$, and denote this class as
$$\mathcal{F} \triangleq \Big\{ Q_{U|X}(u|x) = \mathbb{E}_{P_Z}\big[ \mathbb{1}[u = f_\theta(x, Z)] \big] : \theta \in \Theta \Big\}.$$
It is clear that, for every θ, θ ↦ $Q_{U|X} \in \mathcal{F}$ induces a randomized encoder.
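The following sketch builds the randomized encoder induced by a parametric map $f_\theta$ and the randomization variable Z, as in the definition above; the toy threshold encoder and the distribution of Z are our own illustrative (hypothetical) choices.

```python
import numpy as np

def induced_encoder(f_theta, x_alphabet, z_alphabet, p_z):
    """Q_{U|X}(u|x) = E_{P_Z}[ 1{u = f_theta(x, Z)} ] for finite alphabets."""
    u_alphabet = sorted({f_theta(x, z) for x in x_alphabet for z in z_alphabet})
    u_index = {u: i for i, u in enumerate(u_alphabet)}
    Q = np.zeros((len(x_alphabet), len(u_alphabet)))
    for i, x in enumerate(x_alphabet):
        for z, pz in zip(z_alphabet, p_z):
            Q[i, u_index[f_theta(x, z)]] += pz       # rows of Q sum to one
    return u_alphabet, Q

theta = 2                                            # hypothetical parameter: a threshold
f_theta = lambda x, z: (x >= theta) ^ z              # Z occasionally flips the quantized output
u_vals, Q_u_x = induced_encoder(f_theta, x_alphabet=range(5), z_alphabet=[0, 1], p_z=[0.9, 0.1])
```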


In order to simplify subsequent analysis we will assume the following conditions over
the possible data p.m.f. and over the family F of encoders.
definition 11.7 (Restricted model class) We assume that the alphabets X, Y are of arbitrarily large but finite size. Furthermore, there exists η > 0 such that the unknown data-generating distribution $P_{XY}$ satisfies $P_X(x_{\min}) \triangleq \min_{x\in\mathcal{X}} P_X(x) \ge \eta$ and $P_Y(y_{\min}) \triangleq \min_{y\in\mathcal{Y}} P_Y(y) \ge \eta$.
definition 11.8 (Empirical risk minimization) The methodology of empirical risk minimization is one of the most straightforward approaches, yet it is usually efficient, provided that the chosen model class F is restricted [25]. The learner chooses an encoder $\hat Q_{U|X} \in \mathcal{F}$ that minimizes the empirical risk:
$$\mathcal{L}_{\mathrm{emp}}\big( \hat Q_{U|X}, \hat Q_{Y|U} \big) \le \mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, \hat Q_{Y|U} \big), \quad \text{for all } Q_{U|X} \in \mathcal{F}. \qquad (11.20)$$

Moreover, it is possible to minimize a surrogate of the true risk:
$$\mathcal{L}\big( Q_{U|X}, \hat Q_{Y|U} \big) \le \mathcal{L}_{\mathrm{emp}}\big( Q_{U|X}, \hat Q_{Y|U} \big) + \mathcal{E}_{\mathrm{gap}}(Q_{U|X}, \mathcal{S}_n), \qquad (11.21)$$

which depends on the empirical risk and the so-called generalization gap, respectively.
Expression (11.21) states that an adequate selection of the encoder should be performed
in order to minimize the empirical risk and the generalization gap simultaneously. It
is reasonable to expect that the optimal encoder achieving the minimal risk in (11.19)
does not belong to our restricted class of models F , so the learner may want to enlarge
the model classes F as much as possible. However, this could induce a larger value of
the generalization gap, which could lead to a trade-off between these two fundamental
quantities.
definition 11.9 (Approximation and estimation error) The sub-optimality of the model class F is measured in terms of the approximation error:
$$\mathcal{E}_{\mathrm{app}}(\mathcal{F}) \triangleq \inf_{Q_{U|X}\in\mathcal{F}} \mathcal{L}\big( Q_{U|X}, \hat Q_{Y|U} \big) - \mathcal{L}^\star. \qquad (11.22)$$
The estimation error induced by the selected pair of encoder and decoder is given by
$$\mathcal{E}_{\mathrm{est}}\big( \mathcal{F}, \hat Q_{U|X}, \hat Q_{Y|U} \big) \triangleq \mathcal{L}\big( \hat Q_{U|X}, \hat Q_{Y|U} \big) - \inf_{Q_{U|X}\in\mathcal{F}} \mathcal{L}\big( Q_{U|X}, \hat Q_{Y|U} \big), \qquad (11.23)$$
where $\hat Q_{U|X}$ denotes the minimizer of expression (11.20).

definition 11.10 (Excess risk) The excess risk of the algorithm (11.20) selecting an optimal pair $(\hat Q_{U|X}, \hat Q_{Y|U})$ can be decomposed as
$$\mathcal{E}_{\mathrm{exc}}\big( \mathcal{F}, \hat Q_{U|X}, \hat Q_{Y|U} \big) \triangleq \mathbb{E}\Big[ \mathcal{L}\big( \hat Q_{U|X}, \hat Q_{Y|U} \big) \Big] - \mathcal{L}^\star = \mathcal{E}_{\mathrm{app}}(\mathcal{F}) + \mathbb{E}\Big[ \mathcal{E}_{\mathrm{est}}\big( \mathcal{F}, \hat Q_{U|X}, \hat Q_{Y|U} \big) \Big],$$
where the expectation is taken with respect to the random choice of the dataset $\mathcal{S}_n$, which induces the optimal pair $(\hat Q_{U|X}, \hat Q_{Y|U})$.

The approximation error $\mathcal{E}_{\mathrm{app}}(\mathcal{F})$ measures how closely encoders in the model class F can approximate the optimal solution $\mathcal{L}^\star$. On the other hand, the estimation error $\mathcal{E}_{\mathrm{est}}\big( \mathcal{F}, \hat Q_{U|X}, \hat Q_{Y|U} \big)$ measures the effect of minimizing the empirical risk instead of the
true risk, which is caused by the finite size of the training data. The estimation error is
determined by the number of training samples and by the complexity of the model class,
i.e., large models have smaller approximation error but lead to higher estimation errors,
and it is also related to the generalization error [25]. However, for the sake of simplicity,
in this chapter we restrict our attention only to the generalization gap.

11.3 Information-Theoretic Principles and Information Bottleneck

11.3.1 Lossy Source Coding


The problem of source coding is, jointly with that of channel coding, one of the two most
important and relevant problems in information theory [24, 26]. In the source coding
problem, we face the fundamental question of how to represent in the most compact way
possible a given stochastic source such that we can be able to reconstruct, with a given
level of fidelity, the original source. Shannon was the first to formalize and completely
solve this problem [2, 27] in the asymptotic regime,2 establishing the optimal trade-off
between the compactness of the representations and the level of fidelity in the reconstruc-
tion. In the lossless source coding problem, the level of fidelity required is maximal: it
is desired to have a short-length representation which can be used to reconstruct almost
exactly the original source. According to the more general lossy source coding setup, we
look for more compact representations of the original source by dropping the require-
ment of almost-exact reconstruction. The level of fidelity required is measured by using
a predefined distortion measure, which is an essential part of the problem. Interest-
ingly enough, the problem of lossy source coding can be solved completely for any
well-defined distortion measure. Let us formulate this problem mathematically.
definition 11.11 (Lossy source coding problem) Consider a discrete and finite alphabet X and a stochastic source X, which generates identically and independently distributed samples according to $P_X \in \mathcal{P}(\mathcal{X})$. Consider an alternative alphabet $\hat{\mathcal{X}}$ and a distortion function $d: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_{\ge 0}$. Consider also a realization $X^n$ of the source, an encoder function $f_n: \mathcal{X}^n \to \{1, \ldots, M_n\}$, where $M_n \in \mathbb{N}$, and a decoder function $g_n: \{1, \ldots, M_n\} \to \hat{\mathcal{X}}^n$. We say that $C_n = (f_n, g_n)$ is an n-code for X and d(·; ·).
definition 11.12 (Achievable rate and fidelity) A pair (R, D) is said to be achievable if, for every ε > 0, there exist n ≥ 1 and an n-code $C_n$ such that
$$\frac{\log M_n}{n} \le R + \epsilon, \qquad (11.24)$$
$$\mathbb{E}_{P_X^n}\Big[ \bar d\big( X^n;\, g_n(f_n(X^n)) \big) \Big] \le D + \epsilon, \qquad (11.25)$$
where $\bar d(X^n; \hat X^n) \equiv (1/n)\sum_{i=1}^{n} d(X_i; \hat X_i)$.
The set of all achievable pairs (R, D) contains the complete characterization of all
the possible trade-offs between the rate R (which quantifies the level of compression of
the source X measuring the necessary number of bits per symbol) and the distortion D
(which quantifies the average fidelity level per symbol in the reconstruction using the

2 In the asymptotic regime one considers that the number of realizations of the stochastic source to be
compressed tends to infinity. Although this could be questionable in practice, the asymptotic problem
reflects accurately the important trade-offs of the problem. In this presentation, our focus will be on the
asymptotic problem originally solved by Shannon.

distortion function d(·; ·) symbol by symbol). An equivalent characterization of the set


of achievable pairs (R, D) is given by the rate-distortion function defined by

R(D) = inf{R : (R, D) is achievable}. (11.26)

It is the great achievement of Shannon [27] to have obtained the following result.
theorem 11.1 (Rate-distortion function) The rate-distortion function for source X with reconstruction alphabet $\hat{\mathcal{X}}$ and with distortion function d(·; ·) is given by
$$R_{X,d}(D) = \inf_{\substack{P_{\hat X|X}:\, \mathcal{X}\to\mathcal{P}(\hat{\mathcal{X}}) \\ \mathbb{E}_{P_{X\hat X}}[d(X;\hat X)] \le D}} I\big( P_X;\, P_{\hat X|X} \big). \qquad (11.27)$$

This function depends solely on the distribution PX and the distortion function d(·; ·)
and contains the exact trade-off between compression and fidelity that can be expected
for the particular source and distortion function. It is easy to establish that this func-
tion is positive, non-increasing in D, and convex. Moreover, there exists D > 0 such
that RX,d (D) is finite and we denote the minimum of such values of D by Dmin with
Rmax  limD→Dmin + RX,d (D). Although RX,d (D) could be hard to compute in closed form
for a particular PX and d(·; ·), the problem in (11.27) is a convex optimization one, for
which there exist efficient numerical techniques. However, several important cases admit
closed-form expressions, such as the Gaussian case with quadratic distortion3 [24].
Another important function related to the rate-distortion function is the distortion-rate function. This function can be defined independently from the rate-distortion function and directly from information-theoretic principles. Intuitively, this function is the infimum value of the distortion D as a function of the rate R for all (R, D) achievable pairs. We will define it directly from the rate-distortion function:
$$R^{-1}_{X,d}(I) \triangleq \inf\big\{ D \in \mathbb{R}_{\ge 0} : R_{X,d}(D) \le I \big\}. \qquad (11.28)$$
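One standard numerical technique for the convex problem (11.27) is the Blahut-Arimoto algorithm; the sketch below traces points of the rate-distortion curve for a finite-alphabet source (the slope parameter `s` and the binary example are illustrative choices of ours, not taken from the chapter).

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, num_iter=200):
    """Blahut-Arimoto iterations for R_{X,d}(D) in (11.27).
    p_x: source p.m.f.; dist[i, j] = d(x_i, xhat_j); s >= 0 is a slope parameter.
    Returns one (rate, distortion) point on the curve (rate in nats)."""
    q = np.full(dist.shape[1], 1.0 / dist.shape[1])   # reproduction marginal q(xhat)
    for _ in range(num_iter):
        W = q[None, :] * np.exp(-s * dist)            # test channel P_{Xhat|X}, unnormalized
        W /= W.sum(axis=1, keepdims=True)
        q = p_x @ W
    D = np.sum(p_x[:, None] * W * dist)
    R = np.sum(p_x[:, None] * W * np.log(W / q[None, :]))
    return R, D

p_x = np.array([0.4, 0.6])                            # binary source with Hamming distortion
dist = 1.0 - np.eye(2)
curve = [blahut_arimoto(p_x, dist, s) for s in np.linspace(0.1, 8.0, 20)]
```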

It is not hard to show the following.4


lemma 11.5 Consider the distortion-rate function defined according to (11.28). This
function is positive, non-increasing in I and convex.
Proof The proof follows easily from (11.28).

Besides their obvious importance in the problem of source coding, the definitions of
the rate-distortion and distortion-rate functions will be useful for the problem of learn-
ing as presented in the previous section. They will permit one to establish connections
between the misclassification probability, the cross-entropy, and the mutual information

3 Although the Gaussian case does not correspond to a finite cardinality set X, the result in (11.27) can
easily be extended to that case using quantization arguments.
4 It is worth mentioning that by using R−1 X,d (I) we are abusing notation. This is because in general it is not
true that RX,d (D) is injective for every D ≥ 0. However, when I ∈ [Rmin , Rmax ) with Rmin  RX,d (Dmax )
 
and Dmax  minx∈X EPX d(X; x) , under some very mild conditions on PX and d(·; ·), R−1 X,d (I) is the true
inverse of RX,d (D), which is guaranteed to be injective in the interval D ∈ (Dmin , Dmax ].

between the input X and the output of the encoder QU|X . These connections will be con-
ceptually important for the rest of the chapter, at least from a qualitative point of view.

11.3.2 Misclassification Probability and Cross-Entropy Loss


It is easy to show that the proposed learning framework can be set up as a lossy source
coding problem. This formulation, however, is not an operational one as was the case
for the information-theoretic one presented in Definitions 11.11 and 11.12. The reason
for this comes from the fact that for our learning framework we do not have the same
type of scaling with n as in the source coding problem in information theory. While, in
the typical source coding problem, encoders and decoders act upon the entire sequence
of observed samples xn = (x1 , . . . , xn ), in the learning framework, the encoder QU|X acts
on a sample-by-sample basis. Nevertheless, the definition of the rate-distortion (w.r.t.
distortion-rate) function is relevant for the learning framework as well, provided that
we avoid any operational interpretation and concentrate on its strictly mathematical
meaning.
Consider alphabets U, X, and Y, corresponding to the descriptions generated by
the encoder QU|X and to the examples and their corresponding labels. From (11.1) and
(11.6), we can write the misclassification probability as
⎡ ⎤
  ⎢⎢⎢  ⎥⎥⎥
PE QU|X , Q ⎢
⎢ ⎥⎥
Y|U = 1 − EPXY ⎣⎢ QU|X (u|X)QY|U (Y|u)⎦⎥
u∈U
⎡ ⎤
⎢⎢⎢   ⎥⎥⎥⎥
= 1 − EPY ⎢⎢⎣ ⎢ ⎥
QY|U (Y|u)EPX|Y QU|X (u|X) ⎥⎦
u∈U
⎡ ⎤
⎢⎢⎢  ⎥⎥⎥
= 1 − EPY ⎢⎢⎣⎢ QY|U (Y|u)QU|Y (u|Y)⎦⎥
⎥⎥
u∈U
 
= EPUY 1 − Q Y|U (Y|U) . (11.29)

From the above derivation, we can set a distortion measure: d(u; y)  1 − Q Y|U (y|u).
In this way, the probability of misclassification can be written as an average over the
outcomes of Y (taken as the source) and U (taken as the reconstruction) of the distortion
measure: 1 − Q Y|U (y|u). In this manner, we can consider the following rate-distortion
function:
 
RY,QY|U (D)  inf I PY ; PU|Y , (11.30)
PU|Y : Y →P(U)
EPUY [1−Q Y|U
(Y|U)] ≤ D

which provides a connection between the misclassification probability and the mutual
 
information I PY ; PU|Y .
From this formulation, we are able to obtain the following lemma, which provides an
upper and a lower bound on the probability of misclassification via the distortion-rate
function and the cross-entropy loss.

lemma 11.6 (Probability of misclassification and cross-entropy loss) The probability of misclassification $P_E\big( Q_{\hat Y|U}, Q_{U|X} \big)$ induced by a randomized encoder $Q_{U|X}: \mathcal{X} \to \mathcal{P}(\mathcal{U})$ and decoder $Q_{\hat Y|U}: \mathcal{U} \to \mathcal{P}(\mathcal{Y})$ is bounded by
$$R^{-1}_{Y, Q_{\hat Y|U}}\big( I(P_X; Q_{U|X}) \big) \le R^{-1}_{Y, Q_{\hat Y|U}}\big( I(P_Y; Q_{U|Y}) \big) \qquad (11.31)$$
$$\le P_E\big( Q_{\hat Y|U}, Q_{U|X} \big) \qquad (11.32)$$
$$\le 1 - \exp\big( -\mathcal{L}(Q_{\hat Y|U}, Q_{U|X}) \big), \qquad (11.33)$$
where $Q_{U|Y}(u|y) = \sum_{x\in\mathcal{X}} Q_{U|X}(u|x)\, P_{X|Y}(x|y)$ for (u, y) ∈ U × Y.
Proof The upper bound simply follows by using the Jensen inequality from [24], while the lower bound is a consequence of the definition of the rate-distortion and distortion-rate functions. The probability of misclassification corresponding to the classifier can be expressed by the expected distortion $\mathbb{E}_{P_{XY} Q_{U|X}}[d(Y, U)] = P_E\big( Q_{\hat Y|U}, Q_{U|X} \big)$, which is based on the fidelity function $d(y, u) \triangleq 1 - Q_{\hat Y|U}(y|u)$ as shown in (11.29). Because of the Markov chain Y −− X −− U, we can use the data-processing inequality [24] and the definition of the rate-distortion function, obtaining the following bound for the classification error:
$$I(P_X; Q_{U|X}) \ge I(P_Y; Q_{U|Y}) \qquad (11.34)$$
$$\ge \inf_{\substack{P_{U|Y}:\, \mathcal{Y}\to\mathcal{P}(\mathcal{U}) \\ \mathbb{E}_{P_{UY}}[d(Y,U)] \le \mathbb{E}_{P_{XY} Q_{U|X}}[d(Y,U)]}} I\big( P_Y;\, P_{U|Y} \big) \qquad (11.35)$$
$$= R_{Y, Q_{\hat Y|U}}\Big( P_E\big( Q_{\hat Y|U}, Q_{U|X} \big) \Big). \qquad (11.36)$$
For $\mathbb{E}_{P_{XY} Q_{U|X}}[d(Y, U)]$, we can use the definition of $R^{-1}_{Y, Q_{\hat Y|U}}(\cdot)$, and thus obtain from (11.34) the fundamental bound
$$R^{-1}_{Y, Q_{\hat Y|U}}\big( I(P_X; Q_{U|X}) \big) \le R^{-1}_{Y, Q_{\hat Y|U}}\big( I(P_Y; Q_{U|Y}) \big) \le P_E\big( Q_{\hat Y|U}, Q_{U|X} \big).$$

The lower bound in the above expression states that any limitation in terms of the
mutual information between the raw data and their representation will bound from below
the probability of misclassification while the upper bound shows that the cross-entropy
loss introduced in (11.10) can be used as a surrogate to optimize the probability of mis-
classification, as was also pointed out in Lemma 11.2. As a matter of fact, it appears that
the probability of misclassification is controlled by two fundamental information quan-
tities: the mutual information I(PX ; QU|X ) and the cross-entropy loss L(QY|U , QU|X ).

11.3.3 Noisy Lossy Source Coding and the Information Bottleneck


A more subtle variant of the lossy source coding problem is the noisy lossy source
coding problem, which was first introduced in [28]. The main difference with respect to
Shannon’s original problem is that the source Y is not observed directly at the encoder.
Instead, a noisy version of Y denoted by X is observed and appropriately compressed.

More precisely, we have a memoryless source with single-letter distribution PY


observed through a noisy channel with single-input transition probability PX|Y . From
the compressed version of X, it is desired to reconstruct, with a predetermined level of
fidelity, the realization of the unobserved source Y. The fidelity is measured, similarly
to the usual lossy source coding problem, with distortion function d : Y × U → R≥0 ,
where U is the alphabet in which we generate the reconstructions. The operational
information-theoretic definitions for this problem are analogous to Definitions 11.11
and 11.12, and for this reason are omitted. The rate distortion in this case is given by
 
RXY,d (D) = inf I PX ; PU|X . (11.37)
PU|X : X →P(U)
EPUY [d(Y;U)] ≤ D

Consider the case of logarithmic distortion d(y; u) = − log PY|U (y|u), where

x∈X PU|X (u|x)PXY (x, y)
PY|U (y|u) =  . (11.38)
x∈X PU|X (u|x)PX (x)

The noisy lossy source coding with this choice of distortion function gives rise to the
celebrated information bottleneck [20]. In precise terms,
 
RXY,d (D) = inf I PX ; PU|X . (11.39)
PU|X : X →P(U)
H(PY|U |PU ) ≤ D
 
Noticing that H(PY|U |PU ) = −I PY ; PU|Y + H(PY ) and defining μ  H(PY ) − D, we can
write (11.39) as
 
R̄XY (μ) = inf I PX ; PU|X . (11.40)
PU|X : X →P(U)
I(PY ;PU|Y ) ≥ μ

Equation (11.40) summarizes the trade-off that exists between the level of compression
of the observable source X, using representation U, and the level of information
about the hidden source Y preserved by this representation. This function is called the
rate-relevance function, where μ is the minimum level of relevance we expect from
representation U when the rate used for the compression of X is R̄XY (μ). Notice that in
the information-bottleneck case the distortion d(y; u) depends on the optimal conditional
distribution P∗U|X through (11.38). This makes the problem of characterizing R̄XY (μ)
more difficult than (11.37), in which the distortion function is fixed. In fact, although
R̄XY (μ) is positive, non-decreasing, and convex, the problem in (11.40) is not convex,
which leads to the need for more sophisticated tools for its solution. Moreover, from the
corresponding operational definition for the lossy source coding problem (analogous
to Definitions 11.11 and 11.12), it is clear that the distortion function for sequences

Y n and U n is applied symbol-by-symbol d̄(Y n ; U n ) = −(1/n) i=1 log PY|U (Yi |Ui ),
implying a memoryless condition between hidden source realization Y n and description
U n = fn (X n ). It is possible to show [29, 30] that, if we apply a full logarithmic distortion
d̄(Y n ; U n ) = −(1/n) log PY n |U n (Y n |U n ), not necessarily additive as in the previous case,
the rate-relevance function in (11.40) remains unchanged, where relevance is measured
by the non-additive multi-letter mutual information:
$$\bar d(Y^n; U^n) \equiv \frac{1}{n}\, I\big( P_{Y^n};\, P_{f_n(X^n)|Y^n} \big). \qquad (11.41)$$
As a simple example in which the rate-relevance function in (11.40) can be calculated
in closed form, we can consider the case in which X and Y are jointly Gaussian with
zero mean, variances σ2X and σ2Y , and Pearson correlation coefficient given by ρXY .
Using standard information-theoretic arguments [30], it can be shown that the optimal
distribution PU|X is also Gaussian, with mean X and variance given by

2−2μ − (1 − ρ2XY )
σ2U|X = σ2X . (11.42)
1 − 2−2μ
 
With this choice for PU|X we easily obtain that I PY ; PU|Y = μ and that
⎛ ⎞ ⎛ ⎞
1 ⎜⎜ ρ2 ⎟⎟ 1 ⎜⎜ 1 ⎟⎟⎟
R̄XY (μ) = log ⎜⎜⎝ −2μ XY 2 ⎟⎟⎠, 0 ≤ μ ≤ log ⎜⎜⎝ ⎟⎠. (11.43)
2 2 − (1 − ρXY ) 2 1 − ρ2XY

It is interesting to observe that R̄XY (μ) depends only on the structure of the sources X
and Y through the correlation coefficient ρXY and not on their variances. It should also
be noted that the level of relevance μ is constrained to lie in a bounded interval. This is
not surprising because of the Markov chain U −− X −− Y, the maximum value for the
 
relevance level is I PX ; PY|X , which is easily shown to be equal to 12 log (1/(1 − ρ2XY )).
The maximum level of relevance is achievable only as long as the rate R → ∞, that is,
when the source X is minimally compressed. The trade-off between rate and relevance
for this simple example can be appreciated in Fig. 11.1 for ρXY = 0.9.

Figure 11.1 The rate-relevance function $\bar R_{XY}(\mu)$ plotted against μ, for $\rho_{XY} = 0.9$.
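The curve in Fig. 11.1 follows directly from (11.43); a short sketch of its evaluation is given below (we take logarithms to base 2, consistent with the range of μ shown in the figure; the sampling grid is an arbitrary choice of ours).

```python
import numpy as np

def gaussian_rate_relevance(mu, rho):
    """Rate-relevance function (11.43) for jointly Gaussian (X, Y) with correlation rho,
    logarithms in base 2, valid for 0 <= mu <= 0.5 * log2(1 / (1 - rho**2))."""
    return 0.5 * np.log2(rho**2 / (2.0 ** (-2.0 * mu) - (1.0 - rho**2)))

rho = 0.9
mu_max = 0.5 * np.log2(1.0 / (1.0 - rho**2))      # about 1.2 for rho = 0.9, as in Fig. 11.1
mu = np.linspace(0.0, 0.995 * mu_max, 200)
rate = gaussian_rate_relevance(mu, rho)           # grows without bound as mu approaches mu_max
```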

11.3.4 The Information-Bottleneck Method


Noisy lossy source coding with logarithmic loss can be used as a general principle
for learning problems leading to the information-bottleneck method. This method was
successfully used in several learning problems with considerable success (see [31, 32]
and references therein). Consider the classification problem introduced in Section
11.2.1 and encoder–decoder pairs (QU|X , QY|U ), as was explained in Section 11.2.2. The
information-bottleneck method can be introduced through the following optimization
problem:
 
inf Y|U (Y|U)] + β · I PX ; QU|X .
EPXY QU|X [− log Q (11.44)
QU|X ,Q
Y|U

Expression (11.44) can be interpreted as a cross-entropy loss with a regularization term


 
given by β · I PX ; QU|X , where β is a positive number. The regularization term can be
interpreted as penalization on the complexity of the descriptions generated from the
 
examples X using the encoder QU|X . The smaller the term I PX ; QU|X , the simpler the
descriptions U will be. Moreover, the simpler the descriptions U, the less information
they share with labels X and Y (because of the Markov chain U −− X −− Y). As the
information content in U with respect to Y is naturally decreased, the value of the
cross-entropy EPXY QU|X [− log Q
Y|U (Y|U)] increases. In this way, a trade-off between
the cross-entropy loss and the complexity of the descriptions extracted from X is
 
established. It can happen, though, that the regularization term given I PX ; QU|X
penalizes very complex descriptions that could provide a low cross-entropy value at the
cost of poor generalization and overfitting.
 
From the result in Lemma 11.3 and the fact that the regularization term I PX ; QU|X
does not depend on the decoder Q Y|U , problem (11.44) can be written as
 
inf EPXY QU|X [− log QY|U (Y|U)] + β · I PX ; QU|X , (11.45)
QU|X :X→P(U)

where the decoder can be written as a function of the encoder as follows:



QU|X (u|x)PXY (x, y)
x∈X
QY|U (y|u) =  . (11.46)
QU|X (u|x)PX (x)
x∈X

On recognizing that EPXY QU|X [− log QY|U (Y|U)] = H(QY|U |QU ), where QU (u) =

x∈X QU|X (u|x)PX (x), we see that (11.45) is closely related to the information bot-
tleneck and to the rate-relevance function defined in (11.40). In fact, the problem in
(11.45) can be equivalently written as
    
sup I PY ; QU|Y − β · I PX ; QU|X , (11.47)
QU|X :X→P(U)

with QU|Y (u|y) = x∈X QU|X (u|x)PX|Y (x|y). We can easily see that in (11.47) we are
considering the dual problem to (11.40), looking for the supremum of relevance μ
subject to a given rate R. The value of β (which can be thought of as a typical Lagrange
multiplier [33]) can be thought of as a hyperparameter which controls the trade-off
346 Pablo Piantanida and Leonardo Rey Vega

   
between I PY ; QU|Y (relevance) and I PX ; QU|X (rate). In more precise terms, consider
the following set:

R  (μ, R) ∈ R2≥0 : ∃ QU|X : X → P(U) s.t.
 
R ≥ I PX ; QU|X ,
 
μ ≤ I PY ; QU|Y , U −− X −− Y . (11.48)
It is easy to show that this region corresponds to the set of achievable values of
relevance and rate (μ, R) for the corresponding noisy lossy source coding problem with
logarithmic distortion as was defined in Section 11.3.3. This set is closed and convex
and it is not difficult to show that [34]
 
sup I PY ; QU|Y = sup {μ : (μ, R) ∈ R}. (11.49)
QU|X : X →P(U)
 
I PX ;QU|X ≤ R

Using convex optimization theory [33], we can easily conclude that (11.47) corresponds
to obtaining the supporting hyperplane of region R with slope β. As any convex and
closed set is characterized by all of its supporting hyperplanes, by varying β and solving
(11.47) we are reconstructing the upper boundary of R which coincides with (11.49).
In other words, the hyperparameter β is directly related to the value of R at which we
are considering the maximum possible value of the redundancy μ, or, which amounts
to the same thing, the value of β controls the complexity of representations of X, as was
pointed out above.
It remains only to discuss the implementation of a procedure for solving (11.47).
Unfortunately, although the set R characterizing the solutions of (11.47) is convex, it
is not true that (11.47) is itself a convex optimization problem. However, the structure
of the problem allows the use of efficient numerical optimization procedures that
guarantee convergence to local optimum solutions. These numerical procedures are
basically Blahut–Arimoto (BA)-type algorithms. These are often used to refer to a
class of algorithms for numerically computing the capacity of a noisy channel and the
rate-distortion function for given channel and source distributions, respectively [35, 36].
For these reasons, these algorithms can be applied with minor changes to the problem
(11.47), as was done in [20].
Clearly, for the solution of (11.47) we need as input the distribution PXY . When
only training samples and labels Sn  {(x1 , y1 ), . . . , (xn , yn )} are available, we use the
empirical distribution PXY instead of the true distribution PXY .
In Fig. 11.2, we plot what we call the excess risk (as presented in Definition 11.10),
rewritten as
∗,β ∗,β
Excess risk  H(QY|U |QU ) − H(PY|X |PX ), (11.50)
∗,β ∗,β ∗,β
where QY|U , QU are computed by using the optimal solution QU|X in (11.47) and the
XY . As β defines unequivocally the value of I PX ; Q∗,β , which
empirical distribution P U|X
is basically the rate or complexity associated with the chosen encoder, we choose the
horizontal axis to be labeled by rate R. Experiments were performed by using synthetic
data with alphabets |X| = 128 and |Y| = 4. The excess-risk curve as a function of the
Information Bottleneck and Representation Learning 347

1
103 samples
104 samples
0.8
Excess Risk 105 samples

0.6

0.4

0.2

0
0 0.5 1 1.5 2 2.5 3 H(PX)
Rate (R)
Figure 11.2 Excess risk (11.50) as a function of rate R being the mutual information between the
representation U and the corresponding input X.

rate constraint for different sizes of training samples is plotted. With dashed vertical
lines, we denote the rate for which the excess risk achieves its minimum. When the
number of training samples increases, the optimal rate R approaches its maximum
possible value: H(PX ) (black vertical dashed line on the far right). We emphasize that
for every curve there exists a different limiting rate Rlim , such that, for each R ≥ Rlim , the
excess risk remains constant for that value. It is not difficult to check that Rlim = H(P X ).
Furthermore, for every size of the training samples, there is an optimal value of Ropt
which provides the lowest excess risk in (11.50). In a sense, this is indicating that the
rate R can be interpreted as an effective regularization term and, thus, it can provide
robustness for learning in practical scenarios in which the true input distribution is not
known and the empirical data distribution is used. It is worth mentioning that when
more data are available the optimal value of the regularizing rate R becomes less critical.
This fact was expected since, when the amount of training data increases, the empirical
distribution approaches the data-generating distribution.
In the next section, we provide a formal mathematical proof of the explicit relation
between the generalization gap and the rate constraint, which explains the heuristic
observations presented in Fig. 11.2.

11.4 The Interplay between Information and Generalization

In the following, we will denote L(QU|X ) ≡ L(QU|X , Q Y|U ) and Lemp (QU|X ) ≡

Lemp (QU|X , QY|U ). We will study informational bounds on the generalization
gap (11.18). More precisely, the goal is to find the learning rate n (Q, Sn , γn ) such that
 
P Egap (QU|X , Sn ) > n (QU|X , Sn , γn ) ≤ γn , (11.51)

for a given QU|X ∈ F and some γn → 0 as n → ∞. We will further comment on the


implications for practical algorithms minimizing the surrogate of the risk function,

L(QU|X ) ≤ Lemp (QU|X ) + Egap (QU|X ), (11.52)


348 Pablo Piantanida and Leonardo Rey Vega

which depends on the empirical risk and the so-called generalization gap. Expression
(11.52) states that a suitable selection of the encoder can be obtained by minimizing the
empirical risk and the generalization gap simultaneously, that is
 
Lemp QU|X + λ · n (QU|X , Sn , γn ), (11.53)

for some suitable multiplier λ ≥ 0. It is reasonable to expect that the optimal encoder
achieving the minimal risk in (11.10) does not belong to F , so we may want to enlarge
the model classes as much as possible. However, as usual, we expect a sensitive trade-off
between these two fundamental quantities.

11.4.1 Bounds on the Generalization Gap


We first present the main technical result in Theorem 11.2, that is a sample-dependent
bound on the generalization gap (11.18) with probability of at least 1 − δ, as a function
of a selected randomized encoder QU|X and the data probability distribution PXY . In
particular, we will show that the mutual information between  the raw data and their
√ 
representation controls the learning rate with an order O log(n)/ n , which leads
to an informational PAC-style generalization error bound. From this perspective,
we discuss the implications for model selection, variational auto-encoders, and the
information-bottleneck method.
theorem 11.2 (Informational bound) Let F be a class of encoders. Then, for every
PXY and every δ ∈ (0, 1), with probability at least 1 − δ over the choice of Sn ∼ PnXY the
following inequality holds ∀ QU|X ∈ F :
' ( )
 log(n) Cδ log(n)
Egap (QU|X , Sn ) ≤ Aδ I(PX ; QU|X ) · √ + √ + O , (11.54)
n n n
where Aδ , Bδ ,Cδ are universal constants:
√ + ( )
2Bδ  *  |Y| + 3
Aδ  1 + 1/ |X| , Bδ  2 + log , (11.55)
PX (xmin ) δ
* ( )
−1 |U|
Cδ  2|U|e + Bδ |Y| log . (11.56)
PY (ymin )
The importance of this result is that the main quantity involves the empirical mutual
information between the data X and their randomized representation U(X). This can
be understood as a “measure of information complexity” scaling with rate n−1/2 log(n).
The remaining issue is merely how to interpret this information-theoretic bound and its
implication in the learning problem.
By combining Theorem 11.2 with inequality (11.52) we obtain the following
corollary.
corollary 11.1 (PAC-style generalization error bound) Let F be the class of random-
ized encoders. Then, for every PXY and every δ ∈ (0, 1), with probability at least 1 − δ
over the choice of Sn ∼ PnXY the following inequality holds:
Information Bottleneck and Representation Learning 349

' ( )
Y|U , QU|X ) ≤ H(Q
L(Q Y|U |Q
U ) + Aδ X ; QU|X ) log(n)
I(P

√ + √ +O
log(n)
. (11.57)
n n n
An interesting connection between the empirical risk minimization of the cross-
entropy loss and the information-bottleneck method presented in the previous section
arises which motivates formally the following algorithm [15, 20, 37].
definition 11.13 (Information-bottleneck algorithm) A representation learning
algorithm inspired by the information-bottleneck principle [20] consists in finding an
encoder QU|X ∈ F that minimizes over the random choice Sn ∼ PnXY the functional

L(λ)   
IB (QU|X )  H( QY|U | QU ) + λ · I( PX ; QU|X ), (11.58)
Y|U is given by (11.16) and Q
for a suitable multiplier λ > 0, where Q U is its denominator.

This algorithm optimizes a trade-off between H(Q U ) and the information-based


Y|U |Q

regularization term I(PX ; QU|X ). Interestingly, the resulting regularized empirical risk
suggested by (11.57) can be seen as an optimization of the information-bottleneck
method from the empirical distribution (11.58) but based on the square root of the
mutual information in expression (11.58). Additionally, we observe that, on selecting
,U ∈ P(U) in (11.58) and using the information-radius identity [24], the
an arbitrary Q
following inequality holds:

L(λ) (QU|X ) ≤ H(Q U ) + λ · DQU|X (·|X) Q


Y|U |Q X 
,U |P (11.59)
IB
≡ L(λ) ,U ).
(QU|X , Q (11.60)
VA

The new surrogate function (11.60), denoted by L(λ) VA


(QU|X ), shares a lot in common
with a slightly more general form of the variational auto-encoders (VAEs) [16] and
the recently introduced information dropout (ID) [15, 37], where the latent space is
regularized using a prior Q ,U . Therefore, the information-theoretic bound in Theo-
rem 11.2 shows that the algorithm in Definition 11.13 as well as VAEs and ID are
slightly different but related information-theoretic ways to control the generalization
gap. In all of them the mutual information I(P X ; QU|X ) (or its upper bound given by
 X ) plays the fundamental role, although the specific way in which
,U |P
D QU|X (·|X) Q
this term controls the generalization gap could be different for each case.

11.4.2 Information Complexity of Representations


We could think of the most significative term in the upper bound (11.54) as an
information-complexity cost of data representations, which depends only on the data
samples and on the selected randomized encoder from the restricted model. Suppose
we are given a set of different model classes for the randomized encoders k = [1 : K]:
 
FE(k)  QU|X ≡ EPZ 1[u = fθ (x, Z)] : θ = (θ1 , . . . , θk ) ∈ Θk , PZ ∈ Pk (Z) ,

where there are two kinds of parameters: a structure parameter k and real-value
parameters θ, whose parameters depend on the structure, e.g., Θk may account for
different number of layers or nonlinearities, while Pk (Z) indicates different kinds of
350 Pablo Piantanida and Leonardo Rey Vega

noise distribution. Theorem 11.2 motivates the following model-selection principle for
learning compact representations.

Find a parameter k and real-value parameters θ for the observed data Sn with which
the corresponding data representation can be encoded with the shortest code length:
-   '  .
inf Lemp Q(θ,k) , S n + λ · I 
P X ; Q(θ,k)
, (11.61)
θ∈Θk , k=[1:K]U|X U|X

where the mutual information penalty term indicates the minimum of the expected
redundancy between the minimum code length5 (measured in bits) − log Q(θ,k)U|X
(·|x) to
encode representations under a known data source and the best code length − log QU (·)
chosen to encode the data representations without knowing the input samples:
   
I PX ; Q(θ,k) = min E E (θ,k) − log QU (U) + log Q(θ,k) (U|X) . (11.62)
U|X PX Q
QU ∈P(U) U|X
U|X

This information principle combines the empirical cross-entropy risk (11.14) with the
“information complexity” of the selected encoder (11.62) as being a regularization
that acts as a sample-dependent penalty against overfitting. One may view (11.62) as
a possible means of comparing the appropriateness of distinct representation models
(e.g., number of layers or amount of noise), after a parametric choice has been selected.
The coding interpretation of the penalty term in (11.61) is that the length of the
description of the representations themselves can be quantified in the same units
as the code length in data compression, namely, bits. In other words, for each data
sample x, a randomized encoder can induce different types of representations U(x) with
 
expected information length given by H QU|X (·|x) . When this representation has to
be encoded without knowing QU|X since x is not given to us (e.g., in a communication
problem where the sender wishes to communicate the representations only), the
 
required average length of an encoding distribution QU results in EQU|X − log QU (U) .
In this sense, expression (11.61) suggests that we should select encoders that allow
us to then encode representations efficiently. Interestingly, this is closely related to the
celebrated minimum-description-length (MDL) method for density estimation [38, 39].
However, the fundamental difference between these principles is that the information
complexity (11.62) follows from the generalization gap and measures the amount of
information conveyed by the representations relative to an encoder model, as opposed
to the model parameters of the encoder itself.
The information-theoretic significance of (11.62) goes beyond simply a regulariza-
tion term, since it leads us to introduce the fundamental notion of encoder capacity.
This key idea of encoder capacity is made possible thanks to Theorem 11.2 that
connects mathematically the generalization gap to the information complexity, which
is intimately related to the number of distinguishable samples from the representations.
Notice that the information complexity can be upper-bounded as

5 As is well known in information theory, the shortest expected code length is achievable by a uniquely
decodable code under a known data source [24].
Information Bottleneck and Representation Learning 351

⎛ ⎞
  1  ⎜⎜⎜⎜⎜  1  ⎟⎟⎟
n n
X ; QU|X
I P = D⎜⎜⎝QU|X (·|xi ) QU|X (·|x j )⎟⎟⎟⎟⎠ (11.63)
n i=1 n j=1

1    
n n
≤ 2
D QU|X (·|xi )QU|X (·|x j ) , (11.64)
n i=1 j=1

where {xi }ni=1 are the training examples from the dataset Sn and the last inequality
follows from the convexity of the relative entropy. This bound is measuring the average
degree of closeness between the corresponding representations for the different sample
inputs. When two distributions, QU|X (·|xi ) and QU|X (·|x j ), are very close to each
other, i.e., QU|X assigns high likelihood to similar representations corresponding to
different inputs xi  x j , they do not contribute so much to the complexity of the overall
representations. In other words, the more sample inputs an encoder can differentiate,
the more patterns it can fit well, and hence the larger the mutual information and thus
the risk of overfitting. This observation suggests that the complexity of a representation
model with respect to a sample dataset can be related to the number of data samples
that essentially yield different (distinguishable) representations. Inspired by the concept
of stochastic complexity [39], we introduce below the notion of encoder capacity to
measure the complexity of a representation model.
definition 11.14 (Capacity of randomized encoders) The encoder capacity Ce of a
randomized encoder QU|X with respect to a sample set A ⊆ X is defined as
⎛ ⎞ ( )
⎜⎜⎜    ⎟⎟⎟ 1
Ce (A, QU|X )  max log⎝⎜ ⎜
⎜ ⎟

QU|X u|ψ(u) ⎠⎟ = log|A| − log , (11.65)
ψ : U→A
u∈U
1−ε
1    1
ε  min QU|X (u|x)1 ψ(u)  x ≤ 1 − . (11.66)
ψ : U→A |A| |A|
x∈A u∈U

The argument of the logarithm in the second term of (11.65) represents the probabil-
ity of being able to distinguish samples from their representations 1 − ε, i.e., the average
probability that estimated samples via the maximum-likelihood estimator ψ(·) from QU|X
are equal to the true samples. Therefore, the encoder capacity is the logarithm of the total
number of samples minus a term that depends on the probability of misclassification
of the input samples from their representations. When ε is small, then Ce (A, QU|X ) ≈
log |A| − ε and thus all samples are perfectly distinguishable. The following proposition
gives simple bounds6 on the encoder capacity from the information complexity (11.62),
which, as we already know, has a close relation to the generalization gap.
proposition 11.1 Let QU|X be an encoder distribution and let P X be an empirical

distribution with support An ≡ supp(PX ). Then, the information complexity and the
encoder capacity satisfy

6 Notice that it is possible to provide better bounds on ε by relying on the results in [40]. However, we
preferred simplicity to “tightness” since the purpose of Proposition 11.1 is to link the encoder capacity
and the information complexity.
352 Pablo Piantanida and Leonardo Rey Vega

( )
  1
Ce An , QU|X = log|An | − log (11.67)
1−ε
and
   1  
g−1 log|An | − I PX ; QU|X ≤ ε ≤ log|An | − I P X ; QU|X , (11.68)
2
where ε is defined by (11.66) with respect to An and, for 0 ≤ t ≤ 1,

g(t)  t · log(|An | − 1) + h(t), (11.69)

with h(t)  −t log(t) − (1 − t)log(1 − t) and 0 log 0  0. Furthermore,


 
I P X ; QU|X ≤ Ce . (11.70)

Proof We begin with the lower bound (11.70). Consider the inequalities
     
I PX ; QU|X = min D QU|X QU P X (11.71)
QU ∈P(U)
( )
QU|X (U|x)
≤ min EPX EQU|X max log (11.72)
QU ∈P(U) x∈An QU (U)
(   )
QU|X u|ψ (u)
≤ min max log (11.73)
QU ∈P(U) u∈U QU (u)
⎛ ⎞
⎜⎜⎜    ⎟⎟⎟  
= log⎜⎜⎝⎜ QU|X u|ψ (u) ⎟⎟⎠⎟ = Ce QU|X , An ,

(11.74)
u∈U
 
where (11.73) follows by letting ψ be the mapping maximizing Ce QU|X , An ,
and (11.74) follows by noticing that (11.73) is the smallest worst-case regret, known as
the minimax regret, and thus by choosing QU to be the normalized maximum-likelihood
distribution on the restricted set An the claim is a consequence of the remarkable result
of Shtarkov [41].
It remains to show the bounds in (11.68). In order to show the lower bound, we can
simply apply Fano’s lemma (Lemma 2.10 of [42]), from which we can bound from
below the error probability (11.66) that is based on An . As for the upper bound,
     
log|An | − I P X ; QU|X ≥ H PX − I PX ; QU|X (11.75)
  
= U (u)H Q
Q X|U (·|u) (11.76)
u∈U
 / 0
≥2 U (u) 1 − max Q
Q X|U (x |u) (11.77)

x ∈X
u∈U
= 2ε, (11.78)
 
where (11.75) follows from the assumption An = supp P X and the fact that the entropy
is maximal over the uniform distribution; (11.77) follows by using Equation (7) of [43]
and (11.78) by the definition of ε in (11.66). This concludes the proof.
remark 11.1 In Proposition 11.1, the function g−1 (t)  0 for t < 0 and, for
0 < t < log|An |, g−1 (t) is a solution of the equation g(ε) = t with respect to
Information Bottleneck and Representation Learning 353

 
ε ∈ 0, 1 − 1/|An | ; this solution exists since the function g is continuous and increasing
   
on 0, 1 − 1/|An | and g(0) = 0, g 1 − 1/|An | = log|An |.
remark 11.2 (Generalization requires learning invariant representations) An important
consequence of the lower bound in (11.68) in Proposition 11.1 is that by limiting the
information complexity, i.e., by controlling the generalization gap according to the
criterion (11.61), we bound from below the error probability of distinguishing input
samples from their representations. In other words, from expression (11.67) and Theo-
rem 11.2 we can conclude that encoders inducing a large misclassification probability
on input samples from their representations, i.e., different inputs must share similar
representations, are expected to achieve better generalization. Specifically, this also
implies formally that we need only enforce invariant representations to control the
encoder capacity (e.g., injecting noise during training), from which the generalization
is upper-bounded naturally thanks to Theorem 11.2 and the connection with the
information complexity. However, there is a sensitive trade-off between the amount of
noise (enforcing both invariance and generalization) and the minimization of the cross-
entropy loss. Additionally, it is not difficult to show from the data-processing inequality
that stacking noisy encoder layers reinforces increasingly invariant representations
since distinguishing inputs from their representations becomes harder – or equivalently
the encoder capacity decreases – the deeper the network.

11.4.3 Sketch of the Proofs


We begin by observing that the generalization gap can easily be bounded as
,gap (QU|X , Sn )
Egap (QU|X , Sn ) ≤ E
 ⎛ ⎞
    ⎜⎜ QY|U (y|u) ⎟⎟
+  ⎜
YU (y, u) log ⎜⎜⎝
QYU (y, u) − Q ⎟⎟⎟,
 ⎠ (11.79)
(u,y)∈U×Y QY|U (y|u) 


where we define
 ⎡ ⎤
 ⎢⎢⎢  ⎥⎥⎥
,gap (QU|X , Sn ) = EP ⎢⎢⎢−
E QU|X (u|x) log QY|U (Y|U)⎥⎥⎥⎦
 XY ⎣
u∈U
⎡ ⎤
⎢⎢⎢  ⎥⎥⎥
− EPXY ⎢⎢⎣−⎢ QU|X (u|x)log QY|U (Y|U)⎥⎥⎥⎦. (11.80)
u∈U

,gap (QU|X , Sn ) is the gap corresponding to the optimal decoder selecting, which
That is, E
depends on the true PXY , according to Lemma 11.3. It is not difficult to show that
   
,gap (QU|X , Sn ) ≤ H(QY|U |QU ) − H(Q
E U ) + E  DQ
Y|U |Q Y|U QY|U  , (11.81)
QU
  
where the second term can be bounded as EQU D Q Y|U QY|U  ≤ D 
PXY PXY . The first
term of (11.81) is bounded as
     
H(QY|U |QU ) − H(Q U ) ≤ H(QU ) − H(Q
Y|U |Q U ) + H(PY ) − H(P Y )
 
+ H(QU|Y |PY ) − H(Q U|Y |P Y ). (11.82)
354 Pablo Piantanida and Leonardo Rey Vega

To obtain an upper bound, we use the following bounds [18]:


  / '

U ) ≤
H(QU ) − H(Q  0
φ pX −pX 2 · V {QU|X (u|x)} x∈X , (11.83)
u∈U
  *
H(QU|Y |PY ) − H(Q Y ) ≤ pY −
U|Y |P pY 2 |Y| log|U|

⎢⎢⎢   
+ EPY ⎢⎢⎢⎣ φ pX|Y (·|Y) −
pX|Y (·|Y)2
u∈U
'
 0.
· V {qU|X (u|x)} x∈X , (11.84)

where



⎪ 0 x ≤ 0,


φ(x) = ⎪
⎪ −x log(x) 0 < x < e−1 , (11.85)


⎩ e−1 x ≥ e−1

and V(a) = a − ā1d 22 , with a ∈ Rd , d ∈ N+ , ā = (1/d) di=1 ai , and 1d is the vector of
ones of length d.
It is clear that PY → H(PY ) is a differentiable function and, thus, we can apply a
first-order Taylor expansion to obtain
1 2
Y ) = ∂H(PY ) , pY −  
H(PY ) − H(P pY + o pY − pY 2 , (11.86)
∂pY
where ∂H(PY )/∂PY (y) = − log PY (y) − log(e) for each y ∈ Y. Then, using the
Cauchy–Schwartz inequality, we have
  '  
Y ) ≤ V{log pY (y)}y∈Y pY −
H(PY ) − H(P 
pY 2 + o pY −

pY 2 . (11.87)

McDiarmid’s concentration inequality and Theorem 12.2.1 of [24] allow us to bound


 
with an arbitrary probability close to one the terms D P XY PXY , pX − pX 2 , pY −
pY 2 ,
and pX|Y (·|y) −
pX|Y (·|y) 2 , ∀ y ∈ Y simultaneously. To make sure the bounds hold simul-
taneously over these Y + 3 quantities, we replace δ with δ/(|Y| + 3) in each concentration
inequality. Then, with probability at least 1 − δ, the following bounds hold:
     
pX 2 , pX|Y (·|y) −
max pX − pX|Y (·|y)2 , pY −
pY 2 ≤ √

(11.88)
n
and
( )
  log(n + 1) 1 |Y| + 3
DP XY PXY ≤ |X||Y| + log . (11.89)
n n δ
Then, with probability at least 1 − δ, we have
 ( Bδ ' )
Bδ *
,
Egap (QU|X , Sn ) ≤ 2 φ √ V({qU|X (u|x)} x∈X ) + √ |Y| log|U|
u∈U
n n
'  ( )
 Bδ log(n)
+ V {log pY (y)}y∈Y √ + O (11.90)
n n
Information Bottleneck and Representation Learning 355

*
log(n)  ' Bδ |Y| log |U|
≤ √ Bδ V({qU|X (u|x)} x∈X ) + √
n u∈U
n
−1 '  ( )
2|U|e  Bδ log(n)
+ √ + V {log pY (y)}y∈Y √ + O , (11.91)
n n n
 √  √ √
where we use n ≥ a2 e2 and φ a/ n ≤ ((a/2) log(n)/ n) + (e−1/ n). By combining this
result with the next inequality [18]:
'  √ ⎛ 3 ⎞
 2 ⎜⎜⎜ 1 ⎟⎟⎟ *
V {qU|X (u|x)} x∈X ≤ ⎜
⎝1 + ⎟⎠ I(PX ; QU|X ), (11.92)
u∈U
p X (xmin ) |X|

we relate to the mutual information. Finally, using Taylor arguments as above, we can
easily write
 ' ' 
 I P ; Q  − I P
X ; QU|X  ≡ O( pX −pX 2 ) ≤ O(n−1/2 ) (11.93)
 X U|X 
with probability 1 − δ. It only remains to analyze the second term on the right-hand
side of (11.79). Using standard manipulations, we can easily show that this term can be
equivalently written as
 ⎛ ⎞
    ⎜⎜⎜ QY|U (y|u) ⎟⎟⎟
 PXY (x, y) − P XY (x, y) QU|X (u|x)log ⎜⎜⎝ ⎟⎟. (11.94)
 Y|U (y|u) ⎠
Q
(x,y)∈X×Y u∈U
 
It is not difficult to see that given QU|X , PXY → log QY|U (y|u) is a differentiable
function and, thus, we can apply a first-order Taylor expansion to obtain
⎛ ⎞ 1 2
 ⎜⎜⎜ QY|U (y|u) ⎟⎟⎟  ∂log QY|U (y|u)
QU|X (u|x)log ⎝⎜ ⎜ ⎟
⎟=− QU|X (u|x) , pXY −
pXY
Y|U (y|u) ⎠
Q ∂pXY
u∈U u∈U
 
+ o pXY − pXY 2 (11.95)

and
 
∂log QY|U (y|u) QU|X (u|x ) 1{y = y} − QY|U (y|u)
= . (11.96)
∂PXY (x , y ) QUY (u, y)
With the assumption that every encoder QU|X (u|x) in the family F satisfies that
QU|X (u|x) > α for every (u, x) ∈ U × X with α > 0, we obtain that
 
 ∂ log QY|U (y|u)  2  
  ≤ , ∀(x, x , y , u) ∈ X × X × Y × U. (11.97)
∂PXY (x , y )  α
From simple algebraic manipulations, we can bound the term in (11.94) as
 ⎛ ⎞
    ⎜⎜⎜ QY|U (y|u) ⎟⎟⎟
 XY (x, y)
PXY (x, y) − P QU|X (u|x)log ⎜⎝⎜ ⎟⎟

u∈U
Y|U (y|u) ⎠
Q
(x,y)∈X×Y
⎛ ⎞2
2 ⎜⎜⎜⎜⎜  ⎟⎟
XY (x, y)|⎟⎟⎟⎟⎟ .
≤ ⎜⎜⎝ |PXY (x, y) − P ⎠ (11.98)
α
(x,y)∈X×Y
356 Pablo Piantanida and Leonardo Rey Vega

Again, using McDiarmid’s concentration inequality, it can be shown that with proba-
bility close to one this term is O(1/n), which can be neglected compared with the other
terms calculated previously. This concludes the proof of the theorem.

11.5 Summary and Outlook

We discussed how generalization in representation learning that is based on the


cross-entropy loss is related to the notion of information complexity, and how this
connection is employed to view learning in terms of the information-bottleneck
principle. The resulting information-complexity penalty is a sample-dependent bound
on the generalization gap that crucially depends on the mutual information between the
inputs and the randomized (representation) outputs of the selected encoder, revealing an
interesting connection between the generalization capabilities of representation models
and the information carried by the representations. Furthermore, we have shown that the
information complexity is closely related to the so-called encoder capacity, revealing
the well-known fact that enforcing invariance in the representations is a critical aspect to
control the generalization gap. Among other things, the results of this chapter present a
new viewpoint on the foundations of representation learning, showing the usefulness of
information-theoretic concepts and tools in the comprehension of fundamental learning
problems. This survey provided a summary of some useful links between information
theory and representation learning from which we expect to see advances in years
to come.
In the present analysis, the number of samples is the most useful resource for
the reduction of the generalization gap. Nevertheless, we have not considered other
important ingredients of the problem related to the computational-complexity aspect
of learning representation models. One of them is the particular optimization problem
that has to be solved in order to find an appropriate encoder. It is well known that the
specific “landscape” of the cost function (as a function of the parameters of the family of
encoders) to be optimized and the particular optimization algorithm used (e.g., stochas-
tic gradient-descent algortihms) could have some major effects such that performance
may not be improved by increasing the number of samples. Additional constraints
imposed by real-world applications such as that computations must be performed with a
limited time budget could also be relevant from a more practical perspective. Evidently,
it is pretty clear that many challenges still remain in this exciting research area.

References

[1] National Research Council, Frontiers in massive data analysis. National Academies Press,
2013.
[2] C. Shannon, “A mathematical theory of communication,” Bell System Technical J.,
vols. 3, 4, 27, pp. 379–423, 623–656, 1948.
[3] V. Vapnik, The nature of statistical learning theory, 2nd edn. Springer, 2000.
Information Bottleneck and Representation Learning 357

[4] G. I. Hinton, “Connectionist learning procedures,” in Machine learning, Y. Kodratoff and


R. S. Michalski, eds. Elsevier, 1990, pp. 555–610.
[5] H. B. Barlow, “Unsupervised learning,” Neural Computation, vol. 1, no. 3, pp. 295–311,
1989.
[6] A. Pouget, J. M. Beck, W. J. Ma, and P. E. Latham, “Probabilistic brains: Knowns and
unknowns,” Nature Neurosci., vol. 16, no. 9, pp. 1170–1178, 2013.
[7] H. Barlow, “The exploitation of regularities in the environment by the brain,” Behav. Brain
Sci., vol. 24, no. 8, pp. 602–607, 2001.
[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553,
pp. 436–444, May 2015.
[9] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE Trans. Pattern Analysis Machine Intelligence, vol. 35, no. 8,
pp. 1798–1828, 2013.
[10] A. R. Barron, “Approximation and estimation bounds for artificial neural networks,”
Machine Learning, vol. 14, no. 1, pp. 115–133, 1994.
[11] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5,
pp. 465–471, 1978.
[12] A. R. Barron and T. M. Cover, “Minimum complexity density estimation,” IEEE Trans.
Information Theory, vol. 37, no. 4, pp. 1034–1054, 1991.
[13] S. Boucheron, O. Bousquet, and G. Lugosi, “Theory of classification: A survey of some
recent advances,” ESAIM: Probability Statist., vol. 9, no. 11, pp. 323–375, 2005.
[14] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting,” J. Machine Learning Res.,
vol. 15, no. 1, pp. 1929–1958, 2014.
[15] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through
noisy computation,” arXiv:1611.01353 [stat.ML], 2016.
[16] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. 2nd
International Conference on Learning Representations (ICLR), 2013.
[17] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning
requires rethinking generalization,” CoRR, vol. abs/1611.03530, 2016.
[18] O. Shamir, S. Sabato, and N. Tishby, “Learning and generalization with the information
bottleneck,” Theor. Comput. Sci., vol. 411, nos. 29–30, pp. 2696–2711, 2010.
[19] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via
information,” CoRR, vol. abs/1703.00810, 2017.
[20] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc.
37th Annual Allerton Conference on Communication, Control and Computing, 1999,
pp. 368–377.
[21] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via
information usage,” arXiv:1511.05219 [CS, stat], 2015.
[22] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of
learning algorithms,” in Proc. Advances in Neural Information Processing Systems 30,
2017, pp. 2524–2533.
[23] A. Achille and S. Soatto, “Emergence of invariance and disentangling in deep
representations,” arXiv:1706.01350 [CS, stat], 2017.
[24] T. M. Cover and J. A. Thomas, Elements of information theory. Wiley-Interscience, 2006.
[25] V. N. Vapnik, Statistical learning theory. Wiley, 1998.
358 Pablo Piantanida and Leonardo Rey Vega

[26] A. E. Gamal and Y.-H. Kim, Network information theory. Cambridge University Press,
2012.
[27] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[28] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IEEE
Trans. Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[29] T. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,”
IEEE Trans. Information Theory, vol. 60, no. 1, pp. 740–761, 2014.
[30] M. Vera, L. R. Vega, and P. Piantanida, “Collaborative representation learning,”
arXiv:1604.01433 [cs.IT], 2016.
[31] N. Slonim and N. Tishby, “Document clustering using word clusters via the information
bottleneck method,” in Proc. 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2000, pp. 208–215.
[32] L. Wang, M. Chen, M. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, “Information-
theoretic compressive measurement design,” IEEE Trans. Pattern Analysis Machine
Intelligence, vol. 39, no. 6, pp. 1150–1164, 2017.
[33] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[34] M. Vera, L. R. Vega, and P. Piantanida, “Compression-based regularization with an
application to multi-task learning,” IEEE J. Selected Topics Signal Processing, vol. 5,
no. 12, pp. 1063–1076, 2018.
[35] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless
channels,” IEEE Trans. Information Theory, vol. 18, no. 1, pp. 14–20, 1972.
[36] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Trans.
Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
[37] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information
bottleneck,” CoRR, vol. abs/1612.00410, 2016.
[38] J. Rissanen, “Paper: Modeling by shortest data description,” Automatica, vol. 14, no. 5,
pp. 465–471, 1978.
[39] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length:
Theory and applications. MIT Press, 2005.
[40] S. Arimoto, “On the converse to the coding theorem for discrete memoryless channels
(corresp.),” IEEE Trans. Information Theory, vol. 19, no. 3, pp. 357–359, 1973.
[41] Y. M. Shtarkov, “Universal sequential coding of single messages,” Problems Information
Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[42] A. B. Tsybakov, Introduction to nonparametric estimation, 1st edn. Springer, 2008.
[43] D. Tebbe and S. Dwyer, “Uncertainty and the probability of error (corresp.),” IEEE Trans.
Information Theory, vol. 14, no. 3, pp. 516–518, 1968.
12 Fundamental Limits in Model
Selection for Modern Data Analysis
Jie Ding, Yuhong Yang, and Vahid Tarokh

Summary

With rapid development in hardware storage, precision instrument manufacturing, and


economic globalization etc., data in various forms have become ubiquitous in human
life. This enormous amount of data can be a double-edged sword. While it provides the
possibility of modeling the world with a higher fidelity and greater flexibility, improper
modeling choices can lead to false discoveries, misleading conclusions, and poor predic-
tions. Typical data-mining, machine-learning, and statistical-inference procedures learn
from and make predictions on data by fitting parametric or non-parametric models (in a
broad sense). However, there exists no model that is universally suitable for all datasets
and goals. Therefore, a crucial step in data analysis is to consider a set of postulated
candidate models and learning methods (referred to as the model class) and then select
the most appropriate one. In this chapter, we provide integrated discussions on the
fundamental limits of inference and prediction that are based on model-selection prin-
ciples from modern data analysis. In particular, we introduce two recent advances of
model-selection approaches, one concerning a new information criterion and the other
concerning selection of the modeling procedure.

12.1 Introduction

Model selection is the task of selecting a statistical model or learning method from a
model class, given a set of data. Some common examples are selecting the variables for
low- or high-dimensional linear regression, basis terms such as polynomials, splines, or
wavelets in function estimation, the order of an autoregressive process, the best machine-
learning techniques for solving real-data challenges on an online competition platform,
etc. There has been a long history of model-selection techniques that arise from fields
such as statistics, information theory, and signal processing. A considerable number
of methods have been proposed, following different philosophies and with sometimes
drastically different performances. Reviews of the literature can be found in [1–6] and
references therein. In this chapter, we aim to provide an integrated understanding of

359
360 Jie Ding et al.

the properties of various approaches, and introduce two recent advances leading to the
improvement of classical model-selection methods.
We first introduce some notation. We use Mm = {pθ , θ ∈ Hm } to denote a model which
is a set of probability density functions, where Hm is the parameter space and pθ is short
for pm (Z1 , . . . , Zn | θ), the probability density function of data Z1 , . . . , Zn ∈ Z. A model
class, {Mm }m∈M , is a collection of models indexed by m ∈ M. We denote by n the sample
size, and by dm the size/dimension of the parameter in model Mm . We use p∗ to denote
the true-data generating distribution, and E∗ for the expectation associated with it. In
the parametric framework, there exists some m ∈ M and some θ∗ ∈ Hm such that p∗ is
equivalent to pθ∗ almost surely. Otherwise it is in the non-parametric framework. We use
→ p to denote convergence in probability (under p∗ ), and N(μ, σ2 ) to denote a Gaussian
distribution of mean μ and variance σ2 . We use capital and lower-case letters to denote
random variables and the realized values.
The chapter’s content can be outlined as follows. In Section 12.2, we review the two
statistical/machine-learning goals (i.e., prediction and inference) and the fundamental
limits associated with each of them. In Section 12.3, we explain how model selection is
the key to reliable data analysis through a toy example. In Section 12.4, we introduce
the background and theoretical properties of the Akaike information criterion (AIC)
and the Bayesian information criterion (BIC), as two fundamentally important model-
selection criteria (and other principles sharing similar asymptotic properties). We shall
discuss their conflicts in terms of large-sample performances in relation to two statistical
goals. In Section 12.5 we introduce a new information criterion, referred to as the bridge
criterion (BC), that bridges the conflicts of the AIC and the BIC. We provide its back-
ground, theoretical properties, and a related quantity referred to as the parametricness
index that is practically very useful to describe how likely it is that the selected model
can be practically trusted as the “true model.” In Section 12.6, we review recent develop-
ments in modeling-procedure selection, which, differently from model selection in the
narrow sense of choosing among parametric models, aims to select the better statistical
or machine-learning procedure.

12.2 Fundamental Limits in Model Selection

Data analysis usually consists of two steps. In the first step, candidate models are pos-
tulated, and for each candidate model Mm = {pθ , θ ∈ Hm } we estimate its parameter
θ ∈ Hm . In the second step, from the set of estimated candidate models pθm (m ∈ M) we
select the most appropriate one (for either interpretation or prediction purposes). We note
that not every data analysis and its associated model-selection procedure formally rely
on probability distributions. An example is nearest-neighbor learning with the neighbor
size chosen by cross-validation, which requires only that the data splitting is meaningful
and that the predictive performance of each candidate model/method can be assessed
in terms of some measure (e.g., quadratic loss, hinge loss, or perceptron loss). Also,
Fundamental Limits in Model Selection for Modern Data Analysis 361

motivated by computational feasibility, there are methods (e.g., LASSO1 ) that combine
the two steps into a single step.
Before we proceed, we introduce the concepts of “fitting” and “optimal model” rele-
vant to the above two steps. The fitting procedure given a certain candidate model Mm
is usually achieved by minimizing the negative log-likelihood function
θ → n,m (θ),
with θm being the maximum-likelihood estimator (MLE) (under model Mm ). The maxi-
mized log-likelihood value is defined by n,m ( θm ). The above notion applies to non-i.i.d.
data as well. Another function often used in time-series analysis is the quadratic loss
function {zt − E p (Zt | z1 , . . . , zt−1 )}2 instead of the negative log-likelihood. (Here the
expectation is taken over the distribution p.) Since there can be a number of other vari-
ations, the notion of the negative log-likelihood function could be extended as a general
loss function (p, z) involving a density function p(·) and data z. Likewise, the notion of
the MLE could be thought of as a specific form of M-estimator.
To define what the “optimal model” means, let  pm = pθm denote the estimated distribu-
tion under model Mm . The predictive  performance may be assessed via the out-sample
prediction loss: E∗ (( 
pm , Z )|Z) = Z ( pm , z )p∗ (z )dz , where Z  is independent from and
 

identically distributed to the data used to obtain  pm , and Z is the data domain. A sample
analog of E∗ (( pm , Z  )|Z) may be the in-sample prediction loss (also referred to as the

empirical loss), defined as En (( pm , z)) = n−1 nt=1 ( pm , zt ), to measure the fitness of the
observed data to the model Mm . In view of this definition, the optimal model can be
naturally defined as the candidate model with the smallest out-sample prediction loss,
i.e., m0 = arg minm∈M E∗ (( pm , Z  )|Z). In other words, Mm0 is the model whose predic-
tive power is the best offered by the candidate models given the observed data and the
specified model class. It can be regarded as the theoretical limit of learning given the
current data and the model list.
In a parametric framework, the true data-generating model is usually the optimal
model for sufficiently large sample size [8]. In this vein, if the true density function p∗
belongs to some model Mm , or equivalently p∗ = pθ∗ for some θ∗ ∈ Hm and m ∈ M, then
we aim to select such Mm (from {Mm }m∈M ) with probability going to one as the sample
size increases. This is called consistency in model selection. In addition, the MLE of pθ
for θ ∈ Hm is known to be an asymptotically efficient estimator of the true parameter
θ∗ [9]. In a non-parametric framework, the optimal model depends on the sample size:
for a larger sample size, the optimal model tends to be larger since more observations
can help reveal weak effects that are out of reach at a small sample size. In that situa-
tion, it can be statistically unrealistic to pursue selection consistency [10]. We note that
the aforementioned equivalence between the optimal model and the true model may not

1 Least absolute shrinkage and selection operator (LASSO) [7] is a penalized regression method, whose
penalty term is in the form of λβ1 , where β is the regression coefficient vector and λ is a tuning param-
eter that controls how many (and which) variables are selected. In practice, data analysts often select an
appropriate λ that is based on e.g., five-fold cross-validation.
362 Jie Ding et al.

hold for high-dimensional regression settings where the number of independent vari-
ables is large relative to the sample size [8]. Here, even if the true model is included as a
candidate, its dimension may be too high to be appropriately identified on the basis of a
relatively small amount of data. Then the (literally) parametric setting becomes virtually
non-parametric.
There are two main objectives in learning from data. One is to understand the data-
generation process for scientific discoveries. Under this objective, a notion of funda-
mental limits is concerned with the consistency of selecting the optimal model. For
example, a scientist may use the data to support his/her physics model or identify genes
that clearly promote early onset of a disease. Another objective of learning from data
is for prediction, where the data scientist does not necessarily care about obtaining an
accurate probabilistic description of the data (e.g., which covariates are independent of
the response variable given a set of other covariates). Under this objective, a notion of
fundamental limits is taken in the sense of achieving optimal predictive performance.
Of course, one may also be interested in both directions. For example, scientists may be
interested in a physical model that explains well the causes of precipitation (inference),
and at the same time they may also want to have a good model for predicting the amount
of precipitation on the next day or during the next year (prediction).
In line with the two objectives above, model selection can also have two directions:
model selection for inference and model selection for prediction. The first is intended to
identify the optimal model for the data, in order to provide a reliable characterization of
the sources of uncertainty for scientific interpretation. The second is to choose a model
as a vehicle to arrive at a model/method that offers a satisfying top performance. For
the former goal, it is crucial that the selected model is stable for the data, meaning that
a small perturbation of the data does not affect the selection result. For the latter goal,
however, the selected model may be simply the lucky winner among a few close com-
petitors whose predictive performance can be nearly optimal. If so, the model selection
is perfectly fine for prediction, but the use of the selected model for insight and interpre-
tation may be misleading. For instance, in linear regression, because the covariates are
often correlated, it is quite possible that two very different sets of covariates may offer
nearly identical top predictive performances yet neither can justify its own explanation
of the regression relationship against that by the other.
Associated with the first goal of model selection for inference or identifying the best
candidate is the concept of selection consistency. The selection consistency means that
the optimal model/method is selected with probability going to one as the sample size
goes to infinity. This idealization that the optimal model among the candidates can be
practically deemed the “true model” is behind the derivations of several model-selection
methods. In the context of variable selection, in practical terms, model-selection con-
sistency is intended to mean that the useful variables are identified and their statistical
significance can be ascertained in a follow-up study while the rest of the variables can-
not. However, in reality, with limited data and large noise the goal of model-selection
consistency may not be reachable. Thus, to certify the selected model as the “true” model
for reliable statistical inference, the data scientist must conduct a proper selection of
model diagnostic assessment (see [11] and references therein). Otherwise, the use of
Fundamental Limits in Model Selection for Modern Data Analysis 363

the selected model for drawing conclusions on the data-generation process may give
irreproducible results, which is a major concern in the scientific community [12].
In various applications where prediction accuracy is the dominating consideration,
the optimal model as defined earlier is the target. When it can be selected with high
probability, the selected model can not only be trusted for optimal prediction but also
comfortably be declared the best. However, even when the optimal model is out of reach
in terms of selection with high confidence, other models may provide asymptotically
equivalent predictive performance. In this regard, asymptotic efficiency is a natural con-
sideration for the second goal of model selection. When prediction is the goal, obviously
prediction accuracy is the criterion to assess models. For theoretical examination, the
convergence behavior of the loss of the prediction based on the selected model charac-
terizes the performance of the model-selection criterion. Two properties are often used
to describe good model-selection criteria. The asymptotic efficiency property demands
that the loss of the selected model/method is asymptotically equivalent to the smallest
among all the candidates. The asymptotic efficiency is technically defined by

minm∈M Lm
→p 1 (12.1)
Lm

as n → ∞, where m  denotes the selected model. Here, Lm = E∗ (( pm , Z)) − E∗ ((p∗ , Z))
is the adjusted prediction loss, where 
pm denotes the estimated distribution under model
m. The subtraction of E∗ ((p∗ , Z)) makes the definition more refined, which allows one
to make a better comparison of competing model-selection methods. Overall, the goal
of prediction is to select a model that is comparable to the optimal model regardless of
whether it is stable or not as the sample size varies. This formulation works both for
parametric and for non-parametric settings.

12.3 An Illustration on Fitting and the Optimal Model

We provide a synthetic experiment to demonstrate that better fitting does not imply better
predictive performance due to inflated variances in parameter estimation.

Example 12.1 Linear regression Suppose that we generated synthetic data from a
regression model Y = f (X) + ε, where each item of data is in the form of zi = (yi , xi ).
Each response yi (i = 0, . . . , n − 1) is observed at xi = i/n (fixed design points), namely
yi = f (i/n) + εi . Suppose that the εi s are independent standard Gaussian noises. Suppose
that we use polynomial regression, and the specified models are in the form of f (x) =
m
j=0 β j x (0 ≤ x < 1, m being a positive integer). The candidate models are specified to be
j

{Mm , m = 1, . . . , dn }, with Mm corresponding to f (x) = mj=0 β j x j . Clearly, the dimension
of Mm is dm = m + 1.

The prediction loss for regression is calculated as Lm = n−1 ni=1 ( f (xi ) −  fm (xi ))2 ,
where  f is the least-squares estimate of f using model Mm (see, for example, [8]).
The efficiency, as before, is defined as minm∈M Lm /Lm .
364 Jie Ding et al.

When the data-generating model is unknown, one critical problem is the identification
of the degree of the polynomial model fitted to the data. We need to first estimate polyno-
mial coefficients with different degrees 1, . . . , dn , and then select one of them according
to a certain principle.

In an experiment, we first generate independent data using each of the following true
data-generating models, with sample sizes n = 100, 500, 2000, 3000. We then fit the data
using the model class as given above, with the maximal order dn = 15.
(1) Parametric framework. The data are generated by f (x) = 10(1 + x + x2 ).
Suppose that we adopt the quadratic loss in this example. Then we obtain the in-

sample prediction loss em = n−1 ni=1 (Yi − fm (xi ))2 . Suppose that we plot em against
dm , then the curve must be monotonically decreasing, because a larger model fits the
same data better. We then compute the out-sample prediction loss E∗ ((pm , Z)), which is
equivalent to

n
pm , Z)) = n−1 ( f (xi ) − 
E∗ (( fm (xi ))2 + σ2 (12.2)
i=1
in this example. The above expectation is taken over the true distribution of an inde-
pendent future data item Zt . Instead of showing the out-sample prediction loss of each
candidate model, we plot its rescaled version (on [0, 1]). Recall the asymptotic efficiency
as defined in (12.1). Under quadratic loss, we have E∗ ((p∗ , Z)) = σ2 , and the asymptotic
efficiency requires

minm∈M n−1 ni=1 ( f (xi ) −  fm (xi ))2
 → p 1, (12.3)
n−1 ni=1 ( f (xi ) − 
fm
 (xi ))
2

where m  denotes the selected model. In order to describe how the predictive performance
of each model deviates from the best possible, we define the efficiency of each model as
the term on the left-hand side in (12.1), or (12.3) in this example.
We now plot the efficiency of each candidate model on the left-hand side of Fig. 12.1.
The curves show that the predictive performance is optimal only for the true model. We
note that the minus-σ2 adjustment of the out-sample prediction loss in the numerator and
denominator of (12.3), compared with (12.2), makes it highly non-trivial to achieve the
property (see, for example, [8, 13–15]). Consider, for example, the comparison between
two nested polynomial models with degrees d = 2 and d = 3, the former being the true
data-generating model. It can be proved that, without subtracting σ2 , the ratio (of the
mean-square prediction errors) for each candidate model approaches 1; on subtracting
σ2 , the ratio for the former model still approaches 1, while the ratio for the latter model
approaches 2/3.
(2) Non-parametric framework. The data are generated by f (x) = 20x1/3 .
As for framework (1), we plot the efficiency on the right-hand side of Fig. 12.1. Dif-
ferently from the case (1), the predictive performance is optimal at increasing model
dimensions (as the sample size n increases). As mentioned before, in such a non-
parametric framework (i.e., the true f is not in any of the candidate models), the optimal
model is highly unstable as the sample size varies, so that pursuing an inference of a
fixed good model becomes improper. Intuitively, this is because, in a non-parametric
Fundamental Limits in Model Selection for Modern Data Analysis 365

1 1

0.8 0.8
Efficiency

Efficiency
0.6 0.6

0.4 0.4

0.2 n = 100 0.2 n = 100


n = 500 n = 500
0 0
2 4 6 8 10 12 2 4 6 8 10 12

Figure 12.1 The efficiency of each candidate model under two different data-generating processes.
The best-performing model is the true model of dimension 3 in the parametric framework
(left figure), whereas the best-performing model varies with sample size in the non-parametric
framework (right figure).

framework, more complex models are needed to accommodate more observed data in
order to strike an appropriate trade-off between the estimation bias (i.e., the smallest
approximation error between the data-generating model and a model in the model space)
and the variance (i.e., the variance due to parameter estimation) so that the prediction
loss can be reduced. Thus, in the non-parametric framework, the optimal model changes,
and the model-selection task aims to select a model that is optimal for prediction (e.g.,
asymptotically efficient), while recognizing it is not tangible to identify the true/optimal
model for the inference purpose. Note that Fig. 12.1 is drawn using information of the
underlying true model, but that information is unavailable in practice, hence the need for
a model-selection method so that the asymptotically best efficiency can still be achieved.
This toy experiment illustrates the general rules that (1) a larger model tends to fit data
better, and (2) the predictive performance is optimal with a candidate model that typi-
cally depends both on the sample size and on the true data-generating process (which
is unknown in practice). In a virtually (or practically) parametric scenario, the optimal
model is stable around the present sample size and it may be practically treated as the
true model. In contrast, in a virtually (or practically) non-parametric scenario, the opti-
mal model changes sensitively with the sample size (around the present sample size) and
the task of identifying the elusive optimal model for reliable inference is unrealistic.
With this understanding, an appropriate model-selection technique is called for so as to
single out the optimal model for inference and prediction in a strong practically paramet-
ric scenario, or to strike a good balance between the goodness of fit and model complexity
(i.e., the number of free unknown parameters) on the observed data to facilitate optimal
prediction in a practically non-parametric scenario.
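As a concrete companion to this toy experiment, the following Python sketch estimates the efficiency curve of nested polynomial candidate models on synthetic data. The specific design (uniform covariates on [0, 1], noise level, candidate dimensions, and the particular quadratic truth) is an assumption made here only for illustration, not the exact setup behind Fig. 12.1, and averaging over many replications would be needed to obtain smooth curves.

import numpy as np

rng = np.random.default_rng(0)

def efficiency_curve(f_true, n=500, sigma=1.0, dims=range(1, 13)):
    # Efficiency of dimension d: (smallest out-sample loss among candidates)
    # divided by (out-sample loss of the dimension-d least-squares fit),
    # with the loss n^{-1} sum_i (f(x_i) - fhat_d(x_i))^2 as in (12.3).
    x = rng.uniform(0.0, 1.0, size=n)
    y = f_true(x) + sigma * rng.standard_normal(n)
    losses = []
    for d in dims:
        X = np.vander(x, N=d, increasing=True)          # columns 1, x, ..., x^{d-1}
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares fit of dimension d
        losses.append(np.mean((f_true(x) - X @ coef) ** 2))
    losses = np.array(losses)
    return losses.min() / losses

eff_parametric = efficiency_curve(lambda x: 1.0 + 2.0 * x - 3.0 * x ** 2)   # true dimension 3
eff_nonparametric = efficiency_curve(lambda x: 20.0 * x ** (1.0 / 3.0))     # truth outside all candidates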

12.4 The AIC, the BIC, and Other Related Criteria

Various model-selection criteria have been proposed in the literature. Though each
approach was motivated by a different consideration, many of them originally aimed to
select either the order in an autoregressive model or a subset of variables in a regression
model. We shall revisit an important class of them referred to as information criteria.

12.4.1 Information Criteria and Related Methods


Several model-selection methods that are based on likelihood functions can be put
within the framework of information criteria. They are generally applicable to paramet-
ric model-based problems, and their asymptotic performances have been well studied in
various settings. A general form of information criterion is to select the model $M_{\hat{m}}$ such that
$$\hat{m} = \arg\min_{m\in\mathcal{M}} \Bigl\{ n^{-1} \sum_{t=1}^{n} \ell(p_{\hat{\theta}_m}, z_t) + f_{n,d_m} \Bigr\},$$
where the objective function is the estimated in-sample loss plus a penalty $f_{n,d}$ (indexed
by the sample size $n$ and model dimension $d$).
The Akaike information criterion (AIC) [16, 17] is a model-selection principle that
was originally derived by minimizing the Kullback–Leibler (KL) divergence from a
candidate model to the true data-generating model p∗ . Equivalently, the idea is to approx-
imate the out-sample prediction loss by the sum of the in-sample prediction loss and a
correction term. In the typical setting where the loss is logarithmic, the AIC procedure
is to select the model Mm that minimizes

$$\text{AIC}_m = -2\hat{\ell}_{n,m} + 2 d_m, \qquad (12.4)$$
where $\hat{\ell}_{n,m}$ is the maximized log-likelihood of model $M_m$ given $n$ observations and $d_m$ is
the dimension of model $M_m$. In the task of selecting the appropriate order in autoregressive
models or subsets of variables in regression models, it is also common to use the
alternative version $\text{AIC}_k = n \log \hat{e}_k + 2k$ for the model of order $k$, where $\hat{e}_k$ is the average
in-sample prediction error based on the quadratic loss.
The Bayesian information criterion (BIC) [18] is a principle that selects the model m
that minimizes

$$\text{BIC}_m = -2\hat{\ell}_{n,m} + d_m \log n. \qquad (12.5)$$

The only difference from the AIC is that the constant 2 in the penalty term is replaced
with the logarithm of the sample size. Its original derivation by Schwarz was only for
an exponential family from a frequentist perspective. But it turned out to have a nice
Bayesian interpretation, as its current name suggests. Recall that, in Bayesian data anal-
ysis, marginal likelihood is commonly used for model selection [19]. In a Bayesian
setting, we would introduce a prior with density θ → pm (θ) (θ ∈ Hm ), and a likelihood
of data pm (Z | θ), where Z = [Z1 , . . . , Zn ], for each m ∈ M. We first define the marginal
likelihood of model Mm by

$$p(Z \mid M_m) = \int_{H_m} p_m(Z \mid \theta)\, p_m(\theta)\, d\theta. \qquad (12.6)$$

The candidate model with the largest marginal likelihood should be selected. Inter-
estingly, this Bayesian principle is asymptotically equivalent to the BIC in selecting
models. To see the equivalence, we assume that Z1 , . . . , Zn are i.i.d., and π(·) is any


prior distribution on $\theta$, which has dimension $d$. We let $\ell_n(\theta) = \sum_{i=1}^{n} \log p_\theta(z_i)$ be the log-likelihood
function, and $\hat{\theta}_n$ the MLE of $\theta$. Note that $\ell_n$ implicitly depends on the model.
A proof of the Bernstein–von Mises theorem (see Chapter 10.2 of [20]) implies that (under
regularity conditions)
$$p(Z_1,\ldots,Z_n)\, \exp\!\Bigl(-\ell_n(\hat{\theta}_n) + \frac{d}{2}\log n\Bigr) \to_p c_* \qquad (12.7)$$
as n → ∞, for some constant c∗ that does not depend on n. Therefore, selecting a
model with the largest marginal likelihood p(Z1 , . . . , Zn ) (as advocated by Bayesian
model comparison) is asymptotically equivalent to selecting a model with the small-
est BIC in (12.5). It is interesting to see that, given a sufficiently large sample size, the
marginal likelihood of a model essentially does not depend on the imposed prior, which
affects only the constant $c_*$ in (12.7). We note that, in many cases of practical data
analysis, especially when likelihoods cannot be written analytically, the BIC is used less
often than the Bayesian marginal likelihood, because the latter can be readily computed by
utilizing Monte Carlo-based computation methods [21].
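To make the use of (12.4) and (12.5) concrete, the sketch below computes the quadratic-loss versions $\text{AIC}_k = n \log \hat{e}_k + 2k$ and $\text{BIC}_k = n \log \hat{e}_k + k \log n$ for nested polynomial regression models; the data-generating choices are again illustrative assumptions rather than a prescribed setup.

import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.standard_normal(n)   # true model has dimension 3

def avg_in_sample_error(k):
    # Average in-sample prediction error e_k of the k-dimensional polynomial fit.
    X = np.vander(x, N=k, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ coef) ** 2)

dims = np.arange(1, 11)
e_hat = np.array([avg_in_sample_error(k) for k in dims])
aic = n * np.log(e_hat) + 2 * dims           # quadratic-loss form of (12.4)
bic = n * np.log(e_hat) + dims * np.log(n)   # quadratic-loss form of (12.5)
print("AIC selects dimension", dims[np.argmin(aic)])
print("BIC selects dimension", dims[np.argmin(bic)])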
Cross-validation (CV) [22–26] is a class of model-selection methods that are widely
used in machine-learning practice. CV does not require the candidate models to be
parametric, and it works as long as data splittings make sense and one can assess the
predictive performance in terms of some measure. A specific type of CV is the delete-1
CV method [27] (or leave-one-out, LOO). The idea is explained as follows. For brevity,
let us consider a parametric model class as before. Recall that we wish to select a
model $M_m$ with as small an out-sample loss $E_*(\ell(p_{\hat{\theta}_m}, Z))$ as possible. Its computation
involves the unknown true data-generating process, but we may approximate it by
$n^{-1}\sum_{i=1}^{n} \ell(p_{\hat{\theta}_{m,-i}}, z_i)$, where $\hat{\theta}_{m,-i}$ is the MLE under model $M_m$ using all the observations
except $z_i$. In other words, given $n$ observations, we leave each observation out
and attempt to predict that data point by using the n − 1 remaining observations, and
record the average prediction loss over n rounds. It is worth mentioning that LOO is
asymptotically equivalent to the AIC under some regularity conditions [27].
The general practice of CV works in the following manner. It randomly splits the
original data into a training set of nt data and a validation set of nv = n − nt data; each
candidate model is trained from the nt data and validated on the remaining data (i.e., to
record the average validation loss); the above procedure is replicated a few times, each
with a different validation set, in order to alleviate the variance caused by splitting; in
the end, the model with the least average validation loss is selected, and the model is
re-trained using the complete data for future use. The v-fold CV (with v being a positive
integer) is a specific version of CV. It randomly partitions data into v subsets of (approx-
imately) equal size; each model is trained on v − 1 folds and validated on the remaining
1 fold; the procedure is repeated v times, and the model with the smallest average vali-
dation loss is selected. The v-fold CV is perhaps more commonly used than LOO, partly
due to the large computational complexity involved in LOO. The holdout method, which
is often used in data competitions (e.g., Kaggle competition), may be viewed as a special
case of CV: it does data splitting only once, producing one part as the training set and
the remaining part as the validation set.
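The v-fold procedure just described takes only a few lines; the sketch below reuses the illustrative polynomial candidates and the quadratic validation loss (the fold count and the candidate set are assumptions made for illustration).

import numpy as np

def v_fold_cv(x, y, dims, v=5, rng=None):
    # Average validation loss of each candidate dimension under v-fold CV.
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    folds = np.array_split(rng.permutation(n), v)        # random partition into v folds
    losses = np.zeros(len(dims))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for j, d in enumerate(dims):
            Xtr = np.vander(x[train], N=d, increasing=True)
            coef, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
            Xva = np.vander(x[held_out], N=d, increasing=True)
            losses[j] += np.mean((y[held_out] - Xva @ coef) ** 2)
    return losses / v

# The dimension with the smallest average validation loss is selected, and that model
# is then re-trained on the complete data for future use.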

We have mentioned that LOO was asymptotically equivalent to the AIC. How about a
general CV with nt training data and nv validation data? For regression problems, it has
been proved that CV is asymptotically similar to the AIC when nv /nt → 0 (including
LOO as a special case), and to the BIC when nv /nt → ∞ (see, e.g., [8]). Additional
comments on CV and corrections of some misleading folklore will be elaborated in
Section 12.6.
Note that many other model-selection methods are closely related to the AIC and
the BIC. For example, methods that are asymptotically equivalent to the AIC include
finite-sample-corrected AIC [28], which was proposed as a corrected version of the
AIC, generalized cross-validation [29], which was proposed for selecting the degree of
smoothing in spline regression, the final-prediction error criterion which was proposed
as a predecessor of the AIC [30, 31], and Mallows’ $C_p$ method [32] for regression-
variable selection. Methods that share the selection-consistency property of the BIC
include the Hannan and Quinn criterion [33], which was proposed as the smallest
penalty that achieves strong consistency (meaning that the best-performing model is
selected almost surely for sufficiently large sample size), the predictive least-squares
(PLS) method based on the minimum-description-length principle [34, 35], and the use
of Bayes factors, which is another form of Bayesian marginal likelihood. For some
methods such as CV and the generalized information criterion (GIC, or written as
$\text{GIC}_{\lambda_n}$) [8, 36, 37], their asymptotic behavior usually depends on the tuning parameters.
In general, the AIC and the BIC have served as the golden rules for model selection
in statistical theory since their coming into existence. Their asymptotic properties have
been rigorously established for autoregressive models and regression models, among
many other models. Though cross-validations or Bayesian procedures have also been
widely used, their asymptotic justifications are still rooted in frequentist approaches in
the form of the AIC, the BIC, etc. Therefore, understanding the asymptotic behavior of
the AIC and the BIC is of vital value both in theory and in practice. We therefore focus
on the properties of the AIC and the BIC in the rest of this section and Section 12.5.

12.4.2 Theoretical Properties of the Model-Selection Criteria


Theoretical examinations of model-selection criteria have centered on several proper-
ties: selection consistency, asymptotic efficiency (both defined in Section 12.2), and
minimax-rate optimality (defined below). The selection consistency targets the goal
of identifying the optimal model or method on its own for scientific understanding,
statistical inference, insight, or interpretation. Asymptotic efficiency and minimax-rate
optimality are in tune with the goal of prediction. Before we introduce these properties,
it is worth mentioning that many model-selection methods can also be categorized into
two classes according to their large-sample performances, respectively represented by
the AIC and the BIC.
First of all, the AIC has been proved to be minimax-rate optimal (defined below)
for a range of variable-selection tasks, including the usual subset-selection and order-
selection problems in linear regression, and non-parametric regression based on series
expansion (with bases such as polynomials, splines, or wavelets) (see, e.g., [38] and the

references therein). For example, consider the minimax risk of estimating the regression
function f ∈ F under the squared error


n
inf sup n−1 E∗ ( 
f (xi ) − f (xi ))2 , (12.8)

f f ∈F i=1

where  f is over all estimators based on the observations, and f (xi ) equals the expecta-
tion of the ith response variable (or the ith value of the regression function) conditional
on the ith vector of variables xi . Each xi can refer to a vector of explanatory vari-
ables, or polynomial basis terms, etc. For a model-selection method δ, its worst-case

risk is sup f ∈F R( f, δ, n), where R( f, δ, n) = n−1 ni=1 E∗ { 
fδ (xi ) − f (xi )}2 , and 
fδ is the least-
squares estimate of f under the variables selected by δ. The method δ is said to be
minimax-rate optimal over F if sup f ∈F R( f, δ, n) converges (as n → ∞) at the same rate
as the minimax risk in (12.8).
Another good property of the AIC is that it is asymptotically efficient (as defined in
(12.1)) in a non-parametric framework. In other words, the predictive performance of its
selected model is asymptotically equivalent to the best offered by the candidate models
(even though it is highly sensitive to the sample size). However, the BIC is known to be
consistent in selecting the data-generating model with the smallest dimension in a para-
metric framework. For example, suppose that the data are truly generated by a quadratic
function corrupted by random noise, and the candidate models are a quadratic polyno-
mial, a cubic polynomial, and an exponential function. Then the quadratic polynomial
is selected with probability going to one as the sample size tends to infinity. The expo-
nential function is not selected because it is a wrong model, and the cubic polynomial is
not selected because it overfits (even though it nests the true model as a special case).

12.5 The Bridge Criterion − Bridging the Conflicts between the AIC and the BIC

In this section, we review a recent advance in the understanding of the AIC, the BIC, and
related criteria. We choose to focus on the AIC and the BIC here because they represent
two cornerstones of model-selection principles and theories. We are concerned only with
settings where the sample size is larger than the model dimensions. Many details of the
following discussion can be found in technical papers such as [8, 13, 15, 39–41] and
references therein.
Recall that the AIC is asymptotically efficient for the non-parametric scenario and
is also minimax optimal. In contrast, the BIC is consistent and asymptotically effi-
cient for the parametric scenario. Despite the good properties of the AIC and the BIC,
they have their own drawbacks. The AIC is known to be inconsistent in a paramet-
ric scenario where there are at least two correct candidate models. As a result, the
AIC is not asymptotically efficient in such a scenario. For example, if data are truly
generated by a quadratic polynomial function corrupted by random noise, and the can-
didate models include quadratic and cubic polynomials, then the former model cannot
be selected with probability going to one as the sample size increases. The asymptotic

probability of it being selected can actually be analytically computed in a way simi-


lar to that shown in [39]. The BIC, on the other hand, does not exhibit the beneficial
properties of minimax-rate optimality and asymptotic efficiency in a non-parametric
framework [8, 41].
Why do the AIC and the BIC have the aforementioned differences? In fact, theoretical
examinations of those aspects are highly non-trivial and have motivated a vast literature
since the arrival of the AIC and the BIC. Here we provide some heuristic explanations.
The formulation of the AIC in (12.4) was originally motivated by searching for the
candidate model p with the smallest KL divergence (denoted by D) from p to the data-
generating model p∗ . Since min p D(p∗ , p) is equivalent to min p E∗ (− log p) for a fixed
p∗ , the AIC was designed to perform well at minimizing the prediction loss. But the
AIC is not consistent for a model class containing a true model and at least one over-
sized model, because fitting the oversized model would reduce the first term $-2\hat{\ell}_{n,m}$ in
(12.4) only by approximately a chi-square-distributed random variable [42], while the
increased penalty on the second term $2d_m$ is at a constant level, which is not sufficiently
large to suppress the overfitting gain. On the other hand, the selection consistency of the
BIC in a parametric framework is not surprising owing to its nice Bayesian interpretation.
However, its penalty $d_m \log n$ in (12.5) is much larger than the $2d_m$ in the AIC, so it can-
not enjoy the predictive optimality in a non-parametric framework (if the AIC already
does so).
To briefly summarize, for achieving asymptotic efficiency, the AIC (BIC) is suit-
able only in non-parametric (parametric) settings. There has been a debate regarding
the choice between the AIC and the BIC in model-selection practice, and a key argu-
ment is on whether the true model is in a parametric framework or not. The same debate
may also appear under other terminologies. In a parametric (non-parametric) frame-
work, the true data-generating model is often said to be well-specified (mis-specified), or
finite (infinite)-dimensional. To see a reason for such terminology, consider for instance
regression analysis using polynomial basis functions as variables. If the true regres-
sion function is indeed a polynomial, then it can be parameterized with a finite number
of parameters, or represented by finitely many basis functions. If the true regression
function is an exponential function, then it cannot be parameterized with any finite-
dimensional parameter and its representation requires infinitely many basis functions.
Without prior knowledge on how the observations were generated, determining which
method to use becomes very challenging.
The following question is then naturally motivated: “Is it possible to have a new infor-
mation criterion that bridges the AIC and the BIC in such a way that the strengths of
both of them can be had in combination?”

12.5.1 Bridging the Fundamental Limits in Inference and Prediction


The above question is of fundamental importance, because, in real applications, data
analysts usually do not know whether the data-generating model is correctly specified or
not. If there indeed exists an “ideal” model-selection procedure that adaptively achieves
the better performance of the AIC and the BIC under all situations, then the risks of

misusing the AIC (BIC) in parametric (non-parametric) scenarios will no longer be a


matter of concern. Recall the previous discussions on the good properties of the AIC
and BIC. An “ideal” procedure could be defined in two possible ways, in terms of what
good properties to combine.
First, can the properties of minimax-rate optimality and consistency be shared? Unfor-
tunately, it has been theoretically shown that there exists no model-selection method that
achieves both types of optimality at the same time [41]. That is, for any model-selection
procedure to be consistent in selection, it must behave sub-optimally in terms of the
minimax rate of convergence in the prediction loss.
Second, can the properties of asymptotic efficiency (in both parametric and non-
parametric frameworks) and consistency (in the parametric framework) be shared?
Recall that consistency in a parametric framework is typically equivalent to asymptotic
efficiency [8, 15]. Clearly, if an ideal method were able to combine asymptotic effi-
ciency and consistency, it would then always achieve asymptotic efficiency irrespective
of whether it were operating in a parametric framework or not.
We note that a negative answer to the first question does not imply a negative answer
to the second question. Because, in contrast to asymptotic efficiency, minimax-rate
optimality allows the true data-generating model to vary for a given sample size (see
the definition (12.8)) and is therefore more demanding. Asymptotic efficiency is in a
pointwise sense, meaning that the data are generated by some fixed (unknown) data-
generating model (i.e., the data-generating process is not of an adversarial nature).
Therefore, the minimaxity (uniformity over a model space) is a much stronger require-
ment than the (pointwise) asymptotic efficiency. That window of opportunity motivated
an active line of recent advances in reconciling the two classes of model-selection
methods [15, 43–46].
In particular, a new model-selection method called the bridge criterion (BC) [10, 15]
was recently proposed to simultaneously achieve consistency in a parametric framework
and asymptotic efficiency in both (parametric and non-parametric) frameworks. It there-
fore bridges the advantages of the AIC and the BIC in the asymptotic regime. The idea
of the BC, which is similar to the ideas of other types of penalized model selection, is
to select a model by minimizing its in-sample loss plus a penalty. Its penalty, which is
different from the penalties in the AIC and the BIC, is a nonlinear function of the model
dimension. The key idea is to impose a BIC-like heavy penalty for a range of small mod-
els, but to alleviate the penalty for larger models if there is more evidence supporting an
infinite-dimensional true model. In that way, the selection procedure is automatically
adaptive to the appropriate setting (either parametric or non-parametric).
The BC selects the model $M_m$ that minimizes
$$\text{BC}_m = -2\hat{\ell}_{n,m} + c_n \bigl( 1 + 2^{-1} + \cdots + d_m^{-1} \bigr)$$
(with the default $c_n = n^{2/3}$) over all the candidate models whose dimensions are no larger
than $d_{\hat{m}_{\mathrm{AIC}}}$, defined as the dimension of the model selected by the AIC in (12.4).
Note that the penalty is approximately $c_n \log d_m$, but it is written as a harmonic number
to highlight some of its nice interpretations. Its original derivation was motivated by
the recent discovery that the information loss of underfitting a model of dimension d

using dimension $d-1$ is asymptotically $\chi^2_1/d$ (where $\chi^2_1$ denotes the chi-squared random
variable with one degree of freedom) for large $d$, assuming that nature generates the
model from a non-informative uniform distribution over its model space (in particular
the coefficient space of all stationary autoregressions). Its heuristic derivation is reviewed
in Section 12.5.2. The BC was later proved to be asymptotically equivalent to the AIC
in a non-parametric framework, and equivalent to the BIC otherwise in rather general
settings [10, 15]. Some intuitive explanations will be given in Section 12.5.3. A technical
explanation of how the AIC, BIC, and BC relate to each other can be found in [15].
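A minimal sketch of the BC is given below, assuming the quadratic-loss surrogate $-2\hat{\ell}_{n,m} \approx n \log \hat{e}_m$ used earlier for the AIC and the BIC, with the default $c_n = n^{2/3}$. The restriction to dimensions no larger than the AIC-selected dimension follows the definition above; the surrogate itself is an assumption made purely for illustration.

import numpy as np

def bridge_criterion_select(e_hat, n, c_n=None):
    # Select a dimension with the BC, given the average in-sample errors
    # e_hat[k-1] of candidate dimensions k = 1, ..., len(e_hat).
    dims = np.arange(1, len(e_hat) + 1)
    c_n = n ** (2.0 / 3.0) if c_n is None else c_n
    neg2loglik = n * np.log(np.asarray(e_hat))      # quadratic-loss surrogate for -2 * log-likelihood
    aic = neg2loglik + 2 * dims
    d_aic = dims[np.argmin(aic)]                    # BC searches only dimensions <= d_AIC
    harmonic = np.cumsum(1.0 / dims)                # 1 + 1/2 + ... + 1/d_m
    bc = neg2loglik + c_n * harmonic
    eligible = dims <= d_aic
    return dims[eligible][np.argmin(bc[eligible])]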

12.5.2 Original Motivation


The proposal of the BC was originally motivated from the selection of autoregressive
orders, though it can be applied in general settings just like the AIC and the BIC.
Suppose that a set of time-series data {zt : t = 1, . . . , n} is observed, and we specify
an autoregressive (AR) model class with the largest size $d_n$. Each model of size/order
$d$ ($d = 1,\ldots,d_n$) is of the form
$$z_t = \psi_{d,1} z_{t-1} + \cdots + \psi_{d,d} z_{t-d} + \varepsilon_t, \qquad (12.9)$$
where $\psi_{d,\ell} \in \mathbb{R}$ ($\ell = 1,\ldots,d$), $\psi_{d,d} \neq 0$, the roots of the polynomial $z^d + \sum_{\ell=1}^{d} \psi_{d,\ell}\, z^{d-\ell}$ have
modulus less than 1, and the $\varepsilon_t$ are independent random noises with zero mean and variance
$\sigma^2$. The autoregressive model is referred to as an AR($d$) model, and $[\psi_{d,1},\ldots,\psi_{d,d}]^{\mathrm{T}}$
is referred to as the stable autoregressive filter. The best-fitting loss of AR($d$) to the data
is defined by
$$e_d = \min_{\psi_{d,1},\ldots,\psi_{d,d} \in \mathbb{R}} E_*\bigl(z_t - \psi_{d,1} z_{t-1} - \cdots - \psi_{d,d} z_{t-d}\bigr)^2,$$
where $E_*$ is with respect to the stationary distribution of $\{z_t\}$ (see Chapter 3 of [47]).
The above quantity can also be regarded as the minimum KL divergence from the
space of order-$d$ AR models to the true data-generating model (rescaled) under Gaussian noise
assumptions. We further define the relative loss $g_d = \log(e_{d-1}/e_d)$ for any positive
integer $d$.
Now suppose that nature generates the data from an AR($d$) process, which is in turn
randomly generated from the uniform distribution $U_d$. Here, $U_d$ is defined over the
space $S_d$ of all the stable AR filters of order $d$ whose roots have modulus less than 1:
$$S_d = \Bigl\{ [\psi_{d,1},\ldots,\psi_{d,d}]^{\mathrm{T}} : \ z^d + \sum_{\ell=1}^{d} \psi_{d,\ell}\, z^{d-\ell} = \prod_{\ell=1}^{d} (z - a_\ell), \ \psi_{d,\ell} \in \mathbb{R}, \ |a_\ell| < 1, \ \ell = 1,\ldots,d \Bigr\}.$$
Under this data-generating procedure, $g_d$ is a random variable whose distribution is
described by the following result from [15]: the random variable $d\, g_d$ converges in distribution to $\chi^2_1$ as $d$ tends to infinity.
This result is actually a corollary of the following algorithm [15]: a filter $\Psi_d$ uniformly
distributed on $S_d$ can be generated by the following recursive procedure.

Algorithm 12.1 Random generator of the uniform distribution on $S_d$
input: $d$
output: $\psi_{d,1},\ldots,\psi_{d,d}$
1: Randomly draw $\psi_{1,1}$ from the uniform distribution on $(-1, 1)$
2: for $k = 1 \to d$ do
3:   Randomly draw $\beta_k$ according to the beta distribution $\beta_k \sim B\bigl(\lfloor k/2 \rfloor + 1, \lfloor (k+1)/2 \rfloor\bigr)$
4:   Let $\psi_{k,k} = 2\beta_k - 1$ and $\psi_{k,\ell} = \psi_{k-1,\ell} + \psi_{k,k}\, \psi_{k-1,k-\ell}$ ($\ell = 1,\ldots,k-1$)
5: end for

This result suggests that the underfitting loss $g_d \approx \chi^2_1/d$ tends to decrease with $d$.
Because the increment of the penalty from dimension d − 1 to dimension d can be treated
as a quantity to compete with the underfitting loss [15], it suggests that we penalize in a
way not necessarily linear in model dimension.
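For readers who wish to experiment with this construction, the following Python sketch is a direct transcription of the recursive procedure of Algorithm 12.1 as stated above (stability of the returned filter corresponds to every drawn $\psi_{k,k}$ lying in $(-1,1)$); it is a sketch rather than an optimized sampler.

import numpy as np

def uniform_stable_ar_filter(d, rng=None):
    # Draw [psi_{d,1}, ..., psi_{d,d}] from the uniform distribution on S_d
    # by the recursion of Algorithm 12.1.
    rng = np.random.default_rng() if rng is None else rng
    psi = np.empty(0)
    for k in range(1, d + 1):
        beta_k = rng.beta(k // 2 + 1, (k + 1) // 2)   # B(floor(k/2)+1, floor((k+1)/2))
        psi_kk = 2.0 * beta_k - 1.0                   # lies in (-1, 1); at k = 1 this is uniform
        new = np.empty(k)
        new[k - 1] = psi_kk
        # psi_{k,l} = psi_{k-1,l} + psi_{k,k} * psi_{k-1,k-l},  l = 1, ..., k-1
        for l in range(1, k):
            new[l - 1] = psi[l - 1] + psi_kk * psi[k - l - 1]
        psi = new
    return psi

# Example: a random stable AR(10) filter.
# print(uniform_stable_ar_filter(10))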

12.5.3 Interpretation
In this section, we provide some explanations of how the BC can be related to the AIC
and BIC. The explanation does not depend on any specific model assumption, which
shows that it can be applied to a wide range of other models.
The penalty curves (normalized by n) for the AIC, BIC, and BC can be respectively
denoted by
$$f_{n,d_m}(\text{AIC}) = \frac{2}{n}\, d_m, \qquad f_{n,d_m}(\text{BIC}) = \frac{\log n}{n}\, d_m, \qquad f_{n,d_m}(\text{BC}) = \frac{c_n}{n} \sum_{k=1}^{d_m} \frac{1}{k}.$$
Any of the above penalty curves can be written in the form $\sum_{k=1}^{d} t_k$, and only the slopes
$t_k$ ($k = 1,\ldots,d_n$) matter to the performance of order selection. For example, suppose that
$k_2$ is selected instead of $k_1$ ($k_2 > k_1$) by some criterion. This implies that the gain in
prediction loss $L_{k_1} - L_{k_2}$ is greater than the sum of slopes $\sum_{k=k_1+1}^{k_2} t_k$. Thus, without loss
of generality, we can shift the curves of the above three criteria to be tangent to the bent
curve of the BC in order to highlight their differences and connections. Here, two curves
are referred to as tangent to each other if one is above the other and they intersect at one
point, the tangent point.
Given a sample size $n$, the tangent point between the $f_{n,d_m}(\text{BC})$ and $f_{n,d_m}(\text{BIC})$ curves
is at $T_{\text{BC:BIC}} = c_n/\log n$. Consider the example $c_n = n^{1/3}$. If the true order $d_0$ is finite,
$T_{\text{BC:BIC}}$ will be larger than $d_0$ for all sufficiently large $n$. In other words, there will be
an infinitely large region as $n$ tends to infinity, namely $1 \leq k \leq T_{\text{BC:BIC}}$, where $d_0$ falls

Table 12.1. Regression variable selection (standard errors are given in parentheses)

                      LOO          AIC          BC           BIC          CV
Case (1)  Efficiency  0.79 (0.03)  0.80 (0.03)  0.96 (0.02)  1.00 (0.00)  0.99 (0.01)
          Dimension   3.94 (0.19)  3.88 (0.18)  3.22 (0.11)  3.00 (0.00)  3.01 (0.01)
          PI          0.94 (0.02)
Case (2)  Efficiency  0.91 (0.01)  0.92 (0.01)  0.92 (0.01)  0.74 (0.02)  0.65 (0.02)
          Dimension   8.99 (0.11)  9.11 (0.10)  9.11 (0.10)  7.53 (0.13)  6.66 (0.09)
          PI          0.33 (0.05)

into and where the BC penalizes more than does the BIC. As a result, asymptotically
the BC does not overfit. On the other hand, the BC will not underfit because the largest
penalty preventing one from selecting dimension $k+1$ versus $k$ is $c_n/n$, which will be
less than any fixed positive constant (close to the KL divergence from a smaller model
to the true model) with high probability for large n. This reasoning suggests that the BC
is consistent.
Since the BC penalizes less for larger orders and finally becomes similar to the AIC,
it is able to share the asymptotic optimality of the AIC under suitable conditions. A full
illustration of why the BC is expected to work well in general can be found in [15]. As
we shall see, the bent curve of the BC well connects the BIC and the AIC so that a good
balance between the underfitting and overfitting risks is achieved.
Moreover, in many applications, the data analyst would like to quantify to what
extent the framework under consideration can be virtually treated as parametric, or,
in other words, how likely it is that the postulated model class is well specified. This
motivated the concept of the “parametricness index” (PI) [15, 45] to evaluate the relia-
bility of model selection. One definition of the PI, which we shall use in the following
experiment, is the quantity

$$\text{PI}_n = \frac{|d_{\hat{m}_{\text{BC}}} - d_{\hat{m}_{\text{AIC}}}|}{|d_{\hat{m}_{\text{BC}}} - d_{\hat{m}_{\text{AIC}}}| + |d_{\hat{m}_{\text{BC}}} - d_{\hat{m}_{\text{BIC}}}|}$$
on $[0, 1]$ if the denominator is well defined, and $\text{PI}_n = 0$ otherwise. Here, $d_{\hat{m}_\delta}$ is the
dimension of the model selected by the method $\delta$. Under some conditions, it can be
proved that $\text{PI}_n \to_p 1$ in a parametric framework and $\text{PI}_n \to_p 0$ otherwise.
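Given the dimensions selected by the BC, the AIC, and the BIC, the PI can be computed directly; a tiny sketch (with the stated convention that PI_n = 0 when the denominator is not well defined):

def parametricness_index(d_bc, d_aic, d_bic):
    # PI_n = |d_BC - d_AIC| / (|d_BC - d_AIC| + |d_BC - d_BIC|), or 0 if the denominator is 0.
    denom = abs(d_bc - d_aic) + abs(d_bc - d_bic)
    return abs(d_bc - d_aic) / denom if denom > 0 else 0.0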
In an experiment concerning Example 1, we generate data using each of those two
true data-generating processes, with sample size $n = 500$. We replicate 100 independent
trials and then summarize the performances of LOO, GCV, the AIC, the BC, the BIC,
and delete-$n_v$ CV (abbreviated as CV) with $n_v = n - n/\log n$ in Table 12.1. Note that
CV with such a validation size is proved to be asymptotically close to the BIC, because
delete-$n_v$ CV has the same asymptotic behavior as $\text{GIC}_{\lambda_n}$, introduced in Section 12.4, with
$$\lambda_n = n/(n - n_v) + 1 \qquad (12.10)$$

in selecting regression models [8]. From Table 12.1, it can be seen that the methods on
the right-hand side of the AIC perform better than the others in the parametric setting,

while the methods on the left-hand side of the BIC perform better than the others in the
non-parametric setting. The PI being close to one (zero) indicates the parametricness
(non-parametricness). These are in accord with the existing theory.

12.6 Modeling-Procedure Selection

In this section, we introduce cross-validation as a general tool not only for model
selection but also for modeling-procedure selection, and highlight the point that there
is no one-size-fits-all data-splitting ratio of cross-validation. In particular, we clarify
some widespread folklore on training/validation data size that may lead to improper
data analysis.
Before we proceed, it is helpful to first clarify some commonly used terms involved
in cross-validation (CV). Recall that a model-selection method would conceptually fol-
low two phases: (A) the estimation and selection phase, where each candidate model is
trained using all the available data and one of them is selected; and (B) the test phase,
where the future data are predicted and the true out-sample performance is checked. For
CV methods, the above phase (A) is further split into two phases, namely (A1), the train-
ing phase to build up each statistical/machine-learning model on the basis of the training
data; and (A2), the validation phase which selects the best-performing model on the
basis of the validation data. In the above step (B), analysts are ready to use/implement
the obtained model for predicting the future (unseen) data. Nevertheless, in some data
practice (e.g., in writing papers), analysts may wonder how that model is going to per-
form on completely unseen real-world data. In that scenario, part of the original dataset
(referred to as the “test set”) is taken out before phase (A), in order to approximate the
true predictive performance of the finally selected model after phase (B). In a typical
application that is based on the given data, the test set is not really available, and thus
we need only consider the original dataset being split into two parts: (1) the training set
and (2) the validation set.

12.6.1 Cross-Validation for Model Selection


There are widespread general recommendations on how to apply cross-validation (CV)
for model selection. For instance, it is stated in the literature that 10-fold CV is the
best for model selection. Such guidelines seem to be unwarranted, since they mistakenly
disregard the goal of model selection. For the purpose of prediction, LOO may actually be
preferred for tuning-parameter selection in traditional non-parametric regression.
In fact, common practices that use 5-fold, 10-fold, or 30%-for-validation do not
exhibit asymptotic optimality (neither consistency nor asymptotic efficiency) in simple
regression models, and their performance can be very different depending on the goal of
applying CV. Recall the equivalence between delete-$n_v$ CV and $\text{GIC}_{\lambda_n}$ in (12.10). It is
also known that $\text{GIC}_{\lambda_n}$ achieves asymptotic efficiency in non-parametric scenarios only
with $\lambda_n = 2$, and asymptotic efficiency in parametric scenarios only with $\lambda_n \to \infty$ (as
$n \to \infty$). In light of that, the optimal splitting ratio $n_v/n_t$ of CV should either converge

to zero or diverge to infinity, depending on whether the setting is non-parametric or not,


in order to achieve asymptotic efficiency.
In an experiment, we showed how the splitting ratio could affect CV for model selec-
tion by using the MNIST database [6]. The MNIST database contains 70 000 images,
each being a handwritten digit (from 0 to 9) made of 28 × 28 pixels. Six feed-forward
neural network models with different numbers of neurons or layers were considered as
candidates. With different splitting ratios, we ran CV and computed the average valida-
tion loss of each candidate model based on 10 random partitions. We then selected the
model with the smallest average validation loss, and recorded its true predictive perfor-
mance using the separate test data. The performance in terms of the frequency of correct
classifications indicates that a smaller splitting ratio nv /nt leads to better accuracy. The
results are expected from the existing theory, since the neural network modeling here
seems to be of a non-parametric regression nature.
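The effect of the splitting ratio can also be probed directly in a simpler setting than the MNIST experiment. The sketch below uses the illustrative polynomial candidates from earlier rather than neural networks (an assumption made only to keep the example short) and records which dimension delete-n_v CV selects as the ratio n_v/n_t varies.

import numpy as np

def delete_nv_cv_select(x, y, dims, ratio, n_repeats=20, rng=None):
    # Dimension selected by delete-n_v CV with validation/training ratio n_v/n_t ~ ratio.
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    n_t = max(max(dims) + 1, int(round(n / (1.0 + ratio))))    # training size implied by the ratio
    losses = np.zeros(len(dims))
    for _ in range(n_repeats):                                  # replicate splits to reduce variance
        perm = rng.permutation(n)
        tr, va = perm[:n_t], perm[n_t:]
        for j, d in enumerate(dims):
            Xtr = np.vander(x[tr], N=d, increasing=True)
            coef, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
            Xva = np.vander(x[va], N=d, increasing=True)
            losses[j] += np.mean((y[va] - Xva @ coef) ** 2)
    return dims[int(np.argmin(losses))]

# Small ratios behave like the AIC, large ratios like the BIC:
# for r in (0.1, 1.0, 10.0):
#     print(r, delete_nv_cv_select(x, y, list(range(1, 11)), r))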

12.6.2 Cross-Validation for Modeling-Procedure Selection


Our discussions in the previous sections have focused on model selection in the narrow
sense where the candidates are parametric models. In this section, we review the use of
CV as a general tool for modeling-procedure selection, which aims to select one from a
finite set of modeling procedures [48]. For example, one may apply modeling procedures
such as the AIC, the BIC, and CV for variable selection. Not knowing the true nature
of the data-generating process, one may select one of those procedures (together with
the model selected by the procedure), using an appropriately designed CV (which is at
the second level). One may also choose among a number of machine-learning methods
based on CV. A measure of accuracy is used to evaluate and compare the performances
of the modeling procedures, and the “best” procedure is defined in the sense that it out-
performs, with high probability, the other procedures in terms of out-sample prediction
loss for sufficiently large n (see, for example, Definition 1 of [14]).
There are two main goals of modeling-procedure selection. The first is to identify with
high probability the best procedure among the candidates. This selection consistency
is similar to the selection consistency studied earlier. The second goal of modeling-
procedure selection does not intend to pinpoint which candidate procedure is the best.
Rather, it tries to achieve the best possible performance offered by the candidates. Note
again that, if there are procedures that have similar best performances, we do not need
to distinguish their minute differences (so as to single out the best candidate) for the
purpose of achieving the asymptotically optimal performance. As in model selection, for
the task of modeling-procedure selection, CV randomly splits n data into nt training data
and nv validation data (so n = nt + nv ). The first nt data are used to run different modeling
procedures, and the remaining nv data are used to select the better procedure. Judging
from the lessons learned from the CV paradox to be given below, for the first goal above,
the evaluation portion of CV should be large enough. But, for the second goal, a smaller
portion of the evaluation may be enough to achieve optimal predictive performance.
In the literature, much attention has been focused on choosing whether to use the AIC
procedure or the BIC procedure for data analysis. For regression-variable selection, it

has been proved that the CV method is consistent in choosing between the AIC and the
BIC given nt → ∞, nv /nt → ∞, and some other regularity assumptions (Theorem 1 of
[48]). In other words, the probability of the BIC being selected goes to 1 in a paramet-
ric framework, and the probability of the AIC being selected goes to 1 otherwise. In
this way, the modeling-procedure selection using CV naturally leads to a hybrid model-
selection criterion that builds upon the strengths of the AIC and the BIC. Such hybrid
selection is going to combine some of the theoretical advantages of both the AIC and
the BIC, as the BC does. The task of classification is somewhat more relaxed than the
task of regression. In order to achieve consistency in selecting the better classifier, the
splitting ratio may be allowed to converge to infinity or any positive constant, depending
on the situation [14]. In general, it is safe to let nt → ∞ and nv /nt → ∞ for consistency
in modeling-procedure selection.
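As a sketch of such second-level CV, the code below chooses between the AIC procedure and the BIC procedure by comparing their validation losses over random splits with a large validation fraction, in line with the guidance that n_t → ∞ and n_v/n_t → ∞; the candidate models, the loss, and the particular choice of n_t are illustrative assumptions.

import numpy as np

def select_procedure(x, y, dims, n_splits=20, rng=None):
    # Second-level CV: vote between the AIC procedure and the BIC procedure,
    # each applied to the training part only and judged by validation loss.
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    n_t = max(2 * max(dims), int(n / np.log(n)))    # small training fraction, so n_v/n_t is large
    votes = {"AIC": 0, "BIC": 0}
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, va = perm[:n_t], perm[n_t:]
        val_loss = {}
        for name, penalty in (("AIC", 2.0), ("BIC", np.log(n_t))):
            crit = []
            for d in dims:
                X = np.vander(x[tr], N=d, increasing=True)
                coef, *_ = np.linalg.lstsq(X, y[tr], rcond=None)
                crit.append(n_t * np.log(np.mean((y[tr] - X @ coef) ** 2)) + penalty * d)
            d_sel = dims[int(np.argmin(crit))]
            X = np.vander(x[tr], N=d_sel, increasing=True)
            coef, *_ = np.linalg.lstsq(X, y[tr], rcond=None)
            Xva = np.vander(x[va], N=d_sel, increasing=True)
            val_loss[name] = np.mean((y[va] - Xva @ coef) ** 2)
        votes[min(val_loss, key=val_loss.get)] += 1
    return max(votes, key=votes.get)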

12.6.3 The Cross-Validation Paradox


Closely related to the above discussion is the following paradox. Suppose that a data ana-
lyst has used part of the original data for training and the remaining part for validation.
Now suppose that a new set of data is given to the analyst. The analyst would naturally
add some of the new data in the training phase and some in the validation phase. Clearly,
with more data added to the training set, each sensible candidate modeling procedure is
improved in accuracy; with more data added to the validation set, the evaluation is also
more reliable. It is tempting to think that improving the accuracy on both training and
validation would lead to a sharper comparison between procedures. However, this is not
the case. The prediction error estimation and procedure comparison are two different
targets. In general, we have the following cross-validation paradox: “Better estimation
of the prediction error by CV does not imply better modeling-procedure selection.”
Technically speaking, suppose that an analyst originally splits the data with a large
ratio of nv /nt ; if a new set of data arrives, suppose that half of them were added to
the training set and half to the validation set. Then the new ratio can decrease a lot,
violating the theoretical requirement nv /nt → ∞ (as mentioned above), and leading to
worse procedure selection. Intuitively speaking, when one is comparing two procedures
that are naturally close to each other, the improved estimation accuracy obtained by
adopting more observations in the training part only makes it more difficult to distinguish
between the procedures. If the validation size does not diverge fast enough, consistency
in identifying the better procedure cannot be achieved.

Example 12.2 A set of real-world data reported by Scheetz et al. in [49] (and avail-
able from http://bioconductor.org) consists of over 31 000 gene probes represented on an
Affymetrix expression microarray, from an experiment using 120 rats. Using domain-
specific prescreening procedures as proposed in [49], 18 976 probes were selected that
exhibit sufficient signals for reliable analysis. One purpose of the real data was to dis-
cover how genetic variation relates to human eye disease. Researchers are interested in
finding the genes whose expression is correlated with that of the gene TRIM32, which

has recently been found to cause human eye diseases. Its probe is 1389163_at, one of the
18 976 probes. In other words, we have n = 120 data observations and 18 975 variables
that possibly relate to the observation. From domain knowledge, one expects that only a
few genes are related to TRIM32.

In an experiment we demonstrate the cross-validation paradox using the above dataset.


We first select the top 200 variables with the largest correlation coefficients as was
done in [50], and standardize them. We then compare the following two procedures: the
LASSO for the 200 candidate variables, and the LASSO for those 200 variables plus 100
randomly generated independent N(0, 1) variables. We used LASSO here because we
have more variables than observations. The first procedure is expected to be better, since
the second procedure includes irrelevant variables. We repeat the procedure for 1000
independent replications, and record the frequency of the first procedure being favored
(i.e., the first procedure has the smaller quadratic loss on randomly selected n − nt val-
idation data). We also repeat the experiment for n = 80, 90, 100, 110, by sampling from
the original 120 data without replacement.
The results are shown in Table 12.2. As the paradox suggests, the accuracy of identifying
the better procedure does not necessarily increase when more observations are
added to both the estimation phase and the validation phase; indeed, in this experiment
the accuracy actually decreases.
Modeling-procedure selection, differently from model selection that aims to select
one model according to some principle, focuses more on the selection of a procedure.
Such a procedure refers to the whole process of model estimation, model selection, and
the approach used for model selection. Thus, the goal of model selection for inference,
in its broad sense, includes not only the case of selection of the true/optimal model with
confidence but also the case of selection of the optimal modeling procedure among those
considered. One example is an emerging online competition platform such as Kaggle
that compares new problem-solving techniques/procedures for real-world data analysis.
In comparing procedures and awarding a prize, Kaggle usually uses cross-validation
(holdout) or its counterpart for time series.
We also note that, in many real applications, the validation size nv itself should be
large enough for inference on the prediction accuracy. For example, in a two-label clas-
sification competition, the standard error of the estimated success rate (based on $n_v$
validation data) is
$$\sqrt{p(1-p)/n_v},$$

where p is the true out-sample success rate of a procedure. Suppose the top-ranked
team achieves slightly above p = 50%. Then, in order for the competition to declare it
to be the true winner at 95% confidence level from the second-ranked team with only

1% observed difference in classification accuracy, the standard error $1/\sqrt{4 n_v}$ is roughly
required to be smaller than 0.5%, which demands $n_v \geq 10\,000$. Thus if the holdout sam-
ple size is not enough, the winning team may well be just the lucky one among the

Table 12.2. The cross-validation paradox: more observations do not lead to higher accuracy in selecting the better procedure

Sample size n       80      90      100     110     120
Training size nt    25      30      35      40      45
Accuracy            70.0%   69.4%   62.7%   63.2%   59.0%

top-ranking teams. It is worth pointing out that the discussion above is based on sum-
mary accuracy measures (e.g., classification accuracy on the holdout data), and other
hypothesis tests can be employed for formal comparisons of the competing teams (e.g.,
tests based on differencing of the prediction errors).
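The rough count above can be reproduced by treating "95% confidence on a 1% observed gap" as requiring the gap to be about two standard errors; the two-standard-error reading is an assumption used only to recover the back-of-the-envelope number.

import numpy as np

def required_validation_size(p=0.5, observed_gap=0.01, num_std_errors=2.0):
    # Smallest n_v such that num_std_errors * sqrt(p * (1 - p) / n_v) <= observed_gap.
    return int(np.ceil(p * (1.0 - p) * (num_std_errors / observed_gap) ** 2))

print(required_validation_size())   # 10000, matching the requirement n_v >= 10 000 in the text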
To summarize this section, we introduced the use of cross-validation as a general tool
for modeling-procedure selection, and we discussed the issue of choosing the test data
size so that such selection is indeed reliable for the purpose of consistency. For modeling-
procedure selection, it is often safe to let the validation size take a large proportion (e.g.,
half) of the data in order to achieve good selection behavior. In particular, the use of
LOO for the goal of comparing procedures is the least trustworthy method. The popu-
lar 10-fold CV may leave too few observations in evaluation to be stable. Indeed, the
5-fold CV often produces more stable selection results for high-dimensional regression.
Moreover, v-fold CV, regardless of v, in general, is often unstable, and a repeated v-fold
approach usually improves performance. A quantitative relation between model stability
and selection consistency remains to be established by research. We also introduced the
paradox that using more training data and validation data does not necessarily lead to
better modeling-procedure selection. That further indicates the importance of choosing
an appropriate splitting ratio.

12.7 Conclusion

There has been a debate regarding whether to use the AIC or the BIC in the past decades,
centering on whether the true data-generating model is parametric or not with respect to
the specified model class, which in turn affects the achievability of fundamental limits
of learning the optimal model given a dataset and a model class at hand. Compared
with the BIC, the AIC seems to be more widely used in practice, perhaps mainly due
to the thought that “all models are wrong” and the minimax-rate optimality of the AIC
offers more protection than does the BIC. Nevertheless, the parametric setting is still
of vital importance. One reason for this is that being consistent in selecting the true
model if it is really among the candidates is certainly mathematically appealing. Also, a
non-parametric scenario can be a virtually parametric scenario, where the optimal model
(even if it is not the true model) is stable for the current sample size. The war between the
AIC and the BIC originates from two fundamentally different goals: one is to minimize
the certain loss for prediction purposes, and the other is to select the optimal model for
inference purposes. A unified perspective on integrating their fundamental limits is a
central issue in model selection, which remains an active line of research.

References

[1] S. Greenland, “Modeling and variable selection in epidemiologic analysis,” Am. J. Public
Health, vol. 79, no. 3, pp. 340–349, 1989.
[2] C. M. Andersen and R. Bro, “Variable selection in regression – a tutorial,” J. Chemomet-
rics, vol. 24, nos. 11–12, pp. 728–737, 2010.
[3] J. B. Johnson and K. S. Omland, “Model selection in ecology and evolution,” Trends
Ecology Evolution, vol. 19, no. 2, pp. 101–108, 2004.
[4] P. Stoica and Y. Selen, “Model-order selection: A review of information criterion rules,”
IEEE Signal Processing Mag., vol. 21, no. 4, pp. 36–47, 2004.
[5] J. B. Kadane and N. A. Lazar, “Methods and criteria for model selection,” J. Amer. Statist.
Assoc., vol. 99, no. 465, pp. 279–290, 2004.
[6] J. Ding, V. Tarokh, and Y. Yang, “Model selection techniques: An overview,” IEEE Signal
Processing Mag., vol. 35, no. 6, pp. 16–34, 2018.
[7] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” J. Roy. Statist. Soc.
Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[8] J. Shao, “An asymptotic theory for linear model selection,” Statist. Sinica, vol. 7, no. 2,
pp. 221–242, 1997.
[9] C. R. Rao, “Information and the accuracy attainable in the estimation of statistical
parameters,” in Breakthroughs in statistics. Springer, 1992, pp. 235–247.
[10] J. Ding, V. Tarokh, and Y. Yang, “Optimal variable selection in regression models,”
http://jding.org/jie-uploads/2017/11/regression.pdf, 2016.
[11] Y. Nan and Y. Yang, “Variable selection diagnostics measures for high-dimensional
regression,” J. Comput. Graphical Statist., vol. 23, no. 3, pp. 636–656, 2014.
[12] J. P. Ioannidis, “Why most published research findings are false,” PLoS Medicine, vol. 2,
no. 8, p. e124, 2005.
[13] R. Shibata, “Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process,” Annals Statist., vol. 8, no. 1, pp. 147–164, 1980.
[14] Y. Yang, “Comparing learning methods for classification,” Statist. Sinica, vol. 16, no. 2,
pp. 635–657, 2006.
[15] J. Ding, V. Tarokh, and Y. Yang, “Bridging AIC and BIC: A new criterion for autoregres-
sion,” IEEE Trans. Information Theory, vol. 64, no. 6, pp. 4024–4043, 2018.
[16] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Automation
Control, vol. 19, no. 6, pp. 716–723, 1974.
[17] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in
Selected papers of Hirotugu Akaike. Springer, 1998, pp. 199–213.
[18] G. Schwarz, “Estimating the dimension of a model,” Annals Statist., vol. 6, no. 2, pp. 461–
464, 1978.
[19] A. Gelman, H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian
data analysis. Chapman and Hall/CRC, 2013.
[20] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 1998, vol. 3.
[21] J. S. Liu, Monte Carlo strategies in scientific computing. Springer Science & Business
Media, 2008.
[22] D. M. Allen, “The relationship between variable selection and data augmentation and a
method for prediction,” Technometrics, vol. 16, no. 1, pp. 125–127, 1974.
[23] S. Geisser, “The predictive sample reuse method with applications,” J. Amer. Statist. Assoc.,
vol. 70, no. 350, pp. 320–328, 1975.

[24] P. Burman, “A comparative study of ordinary cross-validation, v-fold cross-validation and
the repeated learning–testing methods,” Biometrika, vol. 76, no. 3, pp. 503–514, 1989.
[25] J. Shao, “Linear model selection by cross-validation,” J. Amer. Statist. Assoc., vol. 88, no.
422, pp. 486–494, 1993.
[26] P. Zhang, “Model selection via multifold cross validation,” Annals Statist., vol. 21, no. 1,
pp. 299–313, 1993.
[27] M. Stone, “An asymptotic equivalence of choice of model by cross-validation and Akaike’s
criterion,” J. Roy. Statist. Soc. Ser. B, pp. 44–47, 1977.
[28] C. M. Hurvich and C.-L. Tsai, “Regression and time series model selection in small
samples,” Biometrika, vol. 76, no. 2, pp. 297–307, 1989.
[29] P. Craven and G. Wahba, “Smoothing noisy data with spline functions,” Numerische
Mathematik, vol. 31, no. 4, pp. 377–403, 1978.
[30] H. Akaike, “Fitting autoregressive models for prediction,” Ann. Inst. Statist. Math., vol. 21,
no. 1, pp. 243–247, 1969.
[31] H. Akaike, “Statistical predictor identification,” Ann. Inst. Statist. Math., vol. 22, no. 1,
pp. 203–217, 1970.
[32] C. L. Mallows, “Some comments on $C_p$,” Technometrics, vol. 15, no. 4, pp. 661–675, 1973.
[33] E. J. Hannan and B. G. Quinn, “The determination of the order of an autoregression,”
J. Roy. Statist. Soc. Ser. B, vol. 41, no. 2, pp. 190–195, 1979.
[34] J. Rissanen, “Estimation of structure by minimum description length,” Circuits, Systems
Signal Processing, vol. 1, no. 3, pp. 395–406, 1982.
[35] C.-Z. Wei, “On predictive least squares principles,” Annals Statist., vol. 20, no. 1, pp. 1–42,
1992.
[36] R. Nishii et al., “Asymptotic properties of criteria for selection of variables in multiple
regression,” Annals Statist., vol. 12, no. 2, pp. 758–765, 1984.
[37] R. Rao and Y. Wu, “A strongly consistent procedure for model selection in a regression
problem,” Biometrika, vol. 76, no. 2, pp. 369–374, 1989.
[38] A. Barron, L. Birgé, and P. Massart, “Risk bounds for model selection via penalization,”
Probability Theory Related Fields, vol. 113, no. 3, pp. 301–413, 1999.
[39] R. Shibata, “Selection of the order of an autoregressive model by Akaike’s information
criterion,” Biometrika, vol. 63, no. 1, pp. 117–126, 1976.
[40] R. Shibata, “An optimal selection of regression variables,” Biometrika, vol. 68, no. 1,
pp. 45–54, 1981.
[41] Y. Yang, “Can the strengths of AIC and BIC be shared? A conflict between model
identification and regression estimation,” Biometrika, vol. 92, no. 4, pp. 937–950, 2005.
[42] S. S. Wilks, “The large-sample distribution of the likelihood ratio for testing composite
hypotheses,” Annals Math. Statist., vol. 9, no. 1, pp. 60–62, 1938.
[43] C.-K. Ing, “Accumulated prediction errors, information criteria and optimal forecasting for
autoregressive time series,” Annals Statist., vol. 35, no. 3, pp. 1238–1277, 2007.
[44] Y. Yang, “Prediction/estimation with simple linear models: Is it really that simple?”
Economic Theory, vol. 23, no. 1, pp. 1–36, 2007.
[45] W. Liu and Y. Yang, “Parametric or nonparametric? A parametricness index for model
selection,” Annals Statist., vol. 39, no. 4, pp. 2074–2102, 2011.
[46] T. van Erven, P. Grünwald, and S. De Rooij, “Catching up faster by switching sooner: A
predictive approach to adaptive estimation with an application to the AIC–BIC dilemma,”
J. Roy. Statist. Soc. Ser. B., vol. 74, no. 3, pp. 361–417, 2012.

[47] G. E. Box, G. M. Jenkins, and G. C. Reinsel, Time series analysis: Forecasting and control.
John Wiley & Sons, 2011.
[48] Y. Zhang and Y. Yang, “Cross-validation for selecting a model selection procedure,”
J. Econometrics, vol. 187, no. 1, pp. 95–112, 2015.
[49] T. E. Scheetz, K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson,
A. M. Dorrance, G. F. DiBona, J. Huang, T. L. Casavant, V. C. Sheffield, and E. M. Stone,
“Regulation of gene expression in the mammalian eye and its relevance to eye disease,”
Proc. Natl. Acad. Sci. USA, vol. 103, no. 39, pp. 14 429–14 434, 2006.
[50] J. Huang, S. Ma, and C.-H. Zhang, “Adaptive lasso for sparse high-dimensional regression
models,” Statist. Sinica, pp. 1603–1618, 2008.
13 Statistical Problems with Planted
Structures: Information-Theoretical
and Computational Limits
Yihong Wu and Jiaming Xu

Summary

This chapter provides a survey of the common techniques for determining the sharp sta-
tistical and computational limits in high-dimensional statistical problems with planted
structures, using community detection and submatrix detection problems as illustrative
examples. We discuss tools including the first- and second-moment methods for ana-
lyzing the maximum-likelihood estimator, information-theoretic methods for proving
impossibility results using mutual information and rate-distortion theory, and methods
originating from statistical physics such as the interpolation method. To investigate com-
putational limits, we describe a common recipe to construct a randomized polynomial-
time reduction scheme that approximately maps instances of the planted-clique problem
to the problem of interest in total variation distance.

13.1 Introduction

The interplay between information theory and statistics is a constant theme in the devel-
opment of both fields. Since its inception, information theory has been indispensable for
understanding the fundamental limits of statistical inference. The classical information
bound provides fundamental lower bounds for the estimation error, including Cramér–
Rao and Hammersley–Chapman–Robbins lower bounds in terms of Fisher information
and χ2 -divergence [1, 2]. In the classical “large-sample” regime in parametric statistics,
Fisher information also governs the sharp minimax risk in regular statistical models [3].
The prominent role of information-theoretic quantities such as mutual information, met-
ric entropy, and capacity in establishing the minimax rates of estimation has long been
recognized since the seminal work of [4–7], etc.
Instead of focusing on the large-sample asymptotics, the attention of contemporary
statistics has shifted toward high dimensions, where the problem size and the sample
size grow simultaneously and the main objective is to obtain a tight characterization of
the optimal statistical risk. Certain information-theoretic methods have been remarkably
successful for high-dimensional problems. Such methods include those based on metric
entropy and Fano’s inequality for determining the minimax risk within universal con-
stant factors (minimax rates) [7]. Unfortunately, the aforementioned methods are often


too crude for the task of determining the sharp constant, which requires more refined
analysis and stronger information-theoretic tools.
An additional challenge in dealing with high dimensionality is the need to address
the computational aspect of statistical inference. An important element absent from the
classical statistical paradigm is the computational complexity of inference procedures,
which is becoming increasingly relevant for data scientists dealing with large-scale noisy
datasets. Indeed, recent results [8–13] revealed the surprising phenomenon that certain
problems concerning large networks and matrices undergo an “easy–hard–impossible”
phase transition, and computational constraints can severely penalize the statistical per-
formance. It is worth pointing out that here the notion of complexity differs from the
worst-case computational hardness studied in the computer science literature which
focused on the time and space complexity of various worst-case problems. In contrast,
in a statistical context, the problem is of a stochastic nature and the existing theory on
average-case hardness is significantly underdeveloped. Here, the hardness of a statistical
problem is often established either within the framework of certain computation models,
such as the sums-of-squares relaxation hierarchy, or by means of a reduction argument
from another problem, notably the planted-clique problem, which is conjectured to be
computationally intractable.
In this chapter, we provide an exposition on some of the methods for determining
the information-theoretic as well as the computational limits for high-dimensional sta-
tistical problems with a planted structure, with a specific focus on characterizing sharp
thresholds. Here the planted structure refers to the true parameter, which is often of
a combinatorial nature (e.g., partition) and hidden in the presence of random noise.
To characterize the information-theoretic limit, we will discuss tools including the
first- and second-moment methods for analyzing the maximum-likelihood estimator,
information-theoretic methods for proving impossibility results using mutual informa-
tion and rate-distortion theory, and methods originating from statistical physics such
as the interpolation method. There is no established recipe for determining the com-
putational limit of statistical problems, especially the “easy–hard–impossible” phase
transition, and it is usually done on a case-by-case basis; nevertheless, the common
element is to construct a randomized polynomial-time reduction scheme that approxi-
mately maps instances of a given hard problem to one that is close to the problem of
interest in total variation distance.

13.2 Basic Setup

To be concrete, in this chapter we consider two representative problems, namely com-


munity detection and submatrix detection, as running examples. Both problems can be
cast as the Bernoulli and Gaussian version of the following statistical model with planted
community structure.
We first consider a random graph model containing a single hidden community whose
size can be sub-linear in data matrix size n.
Statistical Problems with Planted Structures 385

definition 13.1 (Single-community model) Let C ∗ be drawn uniformly at random from


all subsets of [n] of cardinality K. Given probability measures P and Q on a common
measurable space, let A be an n × n symmetric matrix with zero diagonal where, for all
1 ≤ i < j ≤ n, Ai j are mutually independent, and Ai j ∼ P if i, j ∈ C ∗ and Ai j ∼ Q otherwise.
Here we assume that we have access only to pair-wise information Ai j for distinct
indices i and j whose distribution is either P or Q depending on the community mem-
bership; no direct observation about the individual indices is available (hence the zero
diagonal of A). Two choices of P and Q arising in many applications are the following:
• Bernoulli case: P = Bern(p) and Q = Bern(q) with p  q. When p > q, this coincides
with the planted dense subgraph model studied in [10, 14–17], which is also a special
case of the general stochastic block model (SBM) [18] with a single community. In
this case, the data matrix A corresponds to the adjacency matrix of a graph, where
two vertices are connected with probability p if both belong to the community C ∗ ,
and with probability q otherwise. Since p > q, the subgraph induced by C ∗ is likely
to be denser than the rest of the graph.
• Gaussian case: P = N(μ, 1) and Q = N(0, 1) with μ  0. This corresponds to a sym-
metric version of the submatrix detection problem studied in [9, 16, 19–23]. When
μ > 0, the entries of A with row and column indices in C ∗ have positive mean μ except
those on the diagonal, while the rest of the entries have zero mean.
We will also consider a binary symmetric community model with two communities
of equal sizes. The Bernoulli case is known as the binary symmetric stochastic block
model (SBM).
definition 13.2 (Binary symmetric community model) Let (C1∗ ,C2∗ ) be two communities
of equal size that are drawn uniformly at random from all equal-sized partitions of [n].
Let A be an n × n symmetric matrix with empty diagonal where, for all 1 ≤ i < j ≤ n, Ai j
are mutually independent, and Ai j ∼ P if i, j are from the same community and Ai j ∼ Q
otherwise.
Given the data matrix A, the problem of interest is to accurately recover the underlying
single community C ∗ or community partition (C1∗ ,C2∗ ) up to a permutation of cluster
indices. The distributions P and Q as well as the community size K depend on the matrix
size n in general. For simplicity we assume that these model parameters are known to
the estimator. Common objectives of recovery include the following.
• Detection: detect the presence of planted communities versus the absence. This is
a hypothesis problem: in the null case the observation consists purely of noise with
independently and identically distributed (i.i.d.) entries, while in the alternative case
the distribution of the entries is dependent on the hidden communities according to
Definition 13.1 or 13.2.
• Correlated recovery: recover the hidden communities better than would be achieved
by random guessing. For example, for the binary symmetric SBM, the goal is to
achieve a misclassification rate strictly less than 1/2.
386 Yihong Wu and Jiaming Xu

• Almost exact recovery: the expected number of misclassified vertices is sub-linear


in the hidden community sizes.
• Exact recovery: all vertices are classified correctly with probability converging to 1
as the dimension n → ∞.

13.3 Information-Theoretic Limits

13.3.1 Detection and Correlated Recovery


In this section, we study detection and correlated recovery under the binary symmetric
community model. The community structure under the binary symmetric community
model can be represented by a vector σ ∈ {±1}n such that σi = 1 if vertex i is in the
first community and σi = −1 otherwise. Let σ∗ denote the true community partition and

σ ∈ {±1}n denote an estimator of σ. For detection, we assume under the null model,
Aii = 0 for all 1 ≤ i ≤ n and Ai j = Ai j are i.i.d. as 12 (P + Q) for 1 ≤ i < j ≤ n, so that E[A]
is matched between the planted model and the null model.
definition 13.3 (Detection) Let P be the distribution of A in the planted model, and
denote by Q the distribution of A in the null model. A test statistic T (A) with a threshold
τ achieves detection if1

lim sup[P(T (A) < τ) + Q(T (A) ≥ τ)] = 0,


n→∞

so that the criterion T (A) ≥ τ determines with high probability whether A is drawn from
P or Q.
definition 13.4 (Correlated recovery) The estimator  σ achieves correlated recovery of
σ∗ if there exists a fixed constant  > 0 such that E[| σ, σ∗ |] ≥ n for all n.
The detection problem can be understood as a binary hypothesis-testing problem.
Given a test statistic T (A), we consider its distribution under the planted and null
models. If these two distributions are asymptotically disjoint, i.e., their total variation
distance tends to 1 in the limit of large datasets, then it is information-theoretically pos-
sible to distinguish the two models with high probability by measuring T (A). A classic
choice of statistic for binary hypothesis testing is the likelihood ratio,
 
P(A) P(A, σ) P(A|σ) P(σ)
= σ = σ .
Q(A) Q(A) Q(A)
This object will figure heavily both in our upper bounds and in our lower bounds of the
detection threshold.
Before presenting our proof techniques, we first give the sharp threshold for detection
and correlated recovery under the binary symmetric community model.

1 This criterion is also known as strong detection, in contrast to weak detection which requires only that
P(T (A) < τ) + Q(T (A) ≥ τ) be bounded away from 1 as n → ∞. Here we focus exclusively on strong
detection. See [24, 25] for detailed discussions on weak detection.
Statistical Problems with Planted Structures 387

theorem 13.1 Consider the binary symmetric community model.


• If P = Bern(a/n) and Q = Bern(b/n) for fixed constants a, b, then both detection and
correlated recovery are information-theoretically possible when (a − b)2 > 2(a + b)
and impossible when (a − b)2 < 2(a + b).
√ √
• If P = N(μ/ n, 1) and Q = N(−μ/ n, 1), then both detection and correlated recovery
are information-theoretically possible when μ > 1 and impossible when μ < 1.
We will explain how to prove the converse part of Theorem 13.1 using second-
moment analysis of the likelihood ratio P(A)/Q(A) and mutual information arguments.
For the positive part of Theorem 13.1, we will present a simple first-moment method to
derive upper bounds that often concide with the sharp thresholds up to a multiplicative
constant.
To achieve the sharp detection upper bound for the SBM, one can use the count
of short cycles as test statistics as in [26]. To achieve the sharp detection threshold
in the Gaussian model and the correlated recovery threshold in both models, one can
resort to spectral methods. For the Gaussian case, this directly follows from a celebrated
phase-transition result on the rank-one perturbation of Wigner matrices [27–29]. For
the SBM, naive spectral methods fail due to the existence of high-degree vertices [30].
More sophisticated spectral methods based on self-avoiding walks or non-backtracking
walks have been shown to achieve the sharp correlated recovery threshold efficiently
[26, 31, 32].

First-Moment Method for Detection and Correlated Recovery Upper


Bound
Our upper bounds do not use the likelihood ratio directly, since it is hard to furnish lower
bounds on the typical value of P(A)/Q(A) when A is drawn from P. Instead, we use the
generalized likelihood ratio
P(A|σ)
max
σ Q(A)
as the test statistic. In the planted model where the underlying true community is σ∗ , this
quantity is trivially bounded below by P(A|σ∗ )/Q(A). Then using a simple first-moment
argument (union bound) one can show that, in the null model Q, with high probability,
this lower bound is not achieved by any σ and hence the generalized likelihood ratio
test succeeds. An easy extension of this argument shows that, in the planted model, the
maximum-likelihood estimator (MLE)

σML = argmaxσ P(A|σ)
has non-zero correlation with σ∗ , achieving the correlated recovery.
Note that the first-moment analysis of the MLE often falls short of proving the sharp
detection and correlated recovery upper bound. For instance, as we will explain next,
the first-moment calculation of Theorem 2 in  [33] shows only that the MLE achieves
detection and correlated recovery when μ > 2 log 2 in the Gaussian model,2 which is

2 Throughout this chapter, logarithms are with respect to the natural base.
388 Yihong Wu and Jiaming Xu

sub-optimal in view of Theorem 13.1. One reason is that the naive union bound in the
first-moment analysis may not be tight; it does not take into the account the correlation
between P(A|σ) and P(A|σ ) for two different σ, σ under the null model.
Next we explain how to carry out the first-moment analysis in the Gaussian case with
√ √ √  
P = N(μ/ n, 1) and Q = N(−μ/ n, 1). Specifically, assume A = (μ/ n) σ∗ (σ∗ )T − I +
W, where W is a symmetric Gaussian random variable with zero diagonal and
i.i.d. √ 
Wi j ∼ N(0, 1) for i < j. It follows that log (P(A|σ)/Q(A)) = (μ/ n) i< j Ai j σi σ j +
μ2 (n − 1)/4. Therefore, the generalized likelihood test reduces to the test statistic

maxσ T (σ)  i< j Ai, j σi σ j . Under the null model Q, T (σ) ∼ N(0, n(n − 1)/2). Under
√  
the planted model P, T (σ) = μ/ n i< j σ∗i σ∗j σi σ j + i< j Wi, j σi σ j . Hence the distribu-
tion of T (σ) depends on the overlap | σ, σ∗ | between σ and the planted partition σ∗ .
Suppose | σ, σ∗ | = nω. Then
 
μn(nω2 − 1) n(n − 1)
T (σ) ∼ N √ , .
2 n 2
To prove that detection is possible, notice that, in the planted model, maxσ T (σ) ≥
T (σ∗ ). Setting ω = 1, Gaussian tail bounds yield that
μn(n − 1) 
P T (σ∗ ) ≤ √ − n log n ≤ n−1 .
2 n
Under the null model, taking the union bound over at most 2n ways to choose σ, we
can bound the probability that any partition is as good, according to T , as the planted
one, by
⎛ ⎛   ⎞2 ⎞⎟
μn(n − 1)  ⎜⎜⎜⎜ ⎜⎜⎜ μ n − 1 log n ⎟⎟⎟ ⎟⎟⎟
Q max T (σ) > √ − n log n ≤ 2 exp⎜⎜⎜⎝−n⎜⎜⎝
n
− ⎟⎟ ⎟⎟.
σ 2 n 2 n n − 1 ⎠ ⎟⎠

Thus the probability of this event is e−Ω(n) whenever μ > 2 log 2, meaning that above
this threshold we can distinguish the null and planted models with the generalized
likelihood test. 
To prove that correlated recovery
 is possible, since μ > 2 log 2, there exists a fixed
 > 0 such that μ(1 −  2 ) > 2 log 2. Taking the union bound over every σ with | σ, σ∗ | ≤
n gives
μn(n − 1) 
P max

T (σ) ≥ √ − n log n
| σ,σ |≤n 2 n
⎛ ⎛  ⎞2 ⎞⎟
⎜⎜⎜ ⎜⎜ μ(1 −  2 )  n ⎟⎟⎟ ⎟⎟⎟
≤ 2n exp⎜⎜⎜⎜⎝−n⎜⎜⎜⎝
log n ⎟⎟ ⎟⎟.

2 n−1 n − 1 ⎠ ⎟⎠

Hence, with probability at least 1 − e−Ω(n) ,


n(n − 1)μ 
max

T (σ) < √ − n log n,
| σ,σ |≤n 2 n
σML , σ∗ | ≥ n with high probability. Thus, 
and consequently |  σML achieves correlated
recovery.
Statistical Problems with Planted Structures 389

Second-Moment Method for Detection Lower Bound


Intuitively, if the planted model P and the null model Q are close to being mutually
singular, then the likelihood ratio P/Q is almost always either very large or close to
zero. In particular, its variance under Q, that is, the χ2 -divergence
⎡ 2 ⎤
⎢⎢⎢ P(A) ⎥⎥
χ (P Q)  EA∼Q ⎣⎢
2 ⎢ − 1 ⎥⎥⎦⎥,
Q(A)
must diverge. This suggests that we can derive lower bounds on the detection threshold
by bounding the second moment of P(A)/Q(A) under Q, or equivalently its expectation
under P. Suppose the second moment is bounded by some constant C, i.e.,
⎡  ⎤
⎢⎢⎢ P(A) 2 ⎥⎥⎥
χ (P Q) + 1 = EA∼Q ⎢⎢⎣
2 ⎥⎥⎦ = EA∼P P(A) ≤ C . (13.1)
Q(A) Q(A)
A bounded second moment readily implies a bounded Kullback–Leibler divergence
between P and Q, since Jensen’s inequality gives
   
P(A) P(A)
D(P Q) = EA∼P log ≤ log EA∼P ≤ logC = O(1) . (13.2)
Q(A) Q(A)
Moreover, it also implies non-detectability. To see this, let E = En be a sequence of
events such that Q(E) → 0 as n → ∞, and let 1E denote the indicator random variable
for E. Then the Cauchy–Schwarz inequality gives

 2
P(A) P(A) 
P(E) = EA∼Q 1E ≤ EA∼Q × EA∼Q 12E ≤ CQ(E) → 0 . (13.3)
Q(A) Q(A)
In other words, the sequence of distributions P is contiguous to Q [4]. Therefore, no
algorithm can return “yes” with high probability (or even positive probability) in the
planted model, and “no” with high probability in the null model. Hence, detection is
impossible.
Next we explain how to compute the χ2 -divergence for the binary symmetric SBM.
One useful observation due to [34] (see also Lemma 21.1 of [35]) is that, using Fubini’s
theorem, the χ2 -divergence between a mixture distribution and a simple distribution can
be written as
P(A|σ)P(A|
σ)
χ2 (P Q) + 1 = Eσ,σ EA∼Q ,
Q(A)
where σ is an independent copy of σ. Note that under the planted model P, the distri-
bution of the ijth entry is given by P1{σi =σ j } + Q1{σi σ j } = (P + Q)/2 + ((P − Q)/2)σi σ j .
Thus3
⎡  ⎤
⎢⎢⎢ ((P + Q)/2 + ((P − Q)/2)σ σ )((P + Q)/2 + ((P − Q)/2) σ 
σ ) ⎥⎥⎥
χ2 (P Q) + 1 = E⎢⎢⎢⎣ ⎢ i j i j ⎥⎥⎥
(P + Q)/2 ⎥⎦
i< j


3 In fact, the quantity ρ = (P − Q)2 /(2(P + Q)) is an f -divergence known as the Vincze–Le Cam distance
[4, 36].
390 Yihong Wu and Jiaming Xu

 
(P − Q)2

=E i σ
1 + σi σ j σ j (13.4)
2(P + Q)
i< j 

⎡ ⎛ ⎞⎤
⎢⎢⎢ ⎜⎜⎜ ! ⎟⎟⎟⎥⎥⎥
⎢ ⎜
≤ E⎢⎢⎢⎣exp⎜⎜⎜⎝ρ  j ⎟⎟⎟⎟⎠⎥⎥⎥⎥⎦.
i σ j σ
σi σ (13.5)
i< j

For the Bernoulli setting where P = Bern(a/n) and Q = Bern(b/n) for fixed constants
a, b, we have ρ  τ/n + O(1/n2 ), where τ  (a − b)2 /(2(a + b)). Thus,
" #τ $%
χ2 (P Q) + 1 ≤ E exp  2 + O(1) .
σ, σ
2n
We then write σ = 2ξ − 1, where ξ ∈ {0, 1}n is the indicator vector for the first com-
munity which is drawn uniformly at random from all binary vectors with Hamming
weight n/2, and ξ is its independent copy. Then σ, σ  = 4 ξ, 
ξ − n, where H  ξ, 
ξ ∼
Hypergeometric(n, n/2, n/2). Thus
⎡ ⎛   ⎞⎤
⎢⎢ ⎜⎜ τ 4H − n 2 ⎟⎟⎟⎥⎥⎥
χ2 (P Q) + 1 ≤ E⎢⎢⎢⎣exp⎜⎜⎜⎝ √ + O(1) ⎟⎟⎠⎥⎥⎦.
2 n

Since (1/ n/16)(H − n/4) → N(0, 1) as n → ∞ by the central limit theorem for hyper-
geometric distributions (see, e.g., p. 194 of [37]), using Theorem 1 of [38] for the
convergence of the moment-generating function, we conclude that χ2 (P Q) is bounded
if τ < 1.

Mutual Information-Based Lower Bound for Correlated Recovery


It is tempting to conclude that whenever detection is impossible – that is, whenever we
cannot correctly tell with high probability whether the observation was generated from
the null or planted model – we cannot infer the planted community structure σ∗ better
than chance either; this deduction, however, is not true in general (see Section III.D of
[33] for a simple counterexample). Instead, we resort to mutual information in proving
lower bounds for correlated recovery. In fact, there are two types of mutual information
that are relevant in the context of correlated recovery.
Pair-wise mutual information I(σ1 , σ2 ; A). For two communities, it is easy to show
that correlated recovery is impossible if and only if
I(σ1 , σ2 ; A) = o(1) (13.6)
as n → ∞. This in fact also holds for k communities for any constant k. See
Appendix A13.1 for a justification in a general setting. Thus (13.6) provides an
information-theoretic characterization for correlated recovery.
The intuition is that, since I(σ1 , σ2 ; A) = I(1{σ1 =σ2 } ; A), (13.6) means that the observa-
tion A does not provide enough information to distinguish whether any two vertices are
in the same community. Alternatively, since I(σ1 ; A) = 0 by symmetry and I(σ1 ; σ2 ) =
o(1),4 it follows from the chain rule that I(σ1 , σ2 ; A) = I(σ1 ; σ2 |A) + o(1). Thus (13.6) is

4 Indeed, since P{σ2 = −|σ1 = +} = n/(2n − 2), I(σ1 ; σ2 ) = log 2 − h(n/(2n − 2)) = Θ(n−2 ), where h is the
binary entropy function in (13.34).
Statistical Problems with Planted Structures 391

equivalent to stating that σ1 and σ2 are asymptotically independent given the observa-
tion A; this is shown in Theorem 2.1 of [39] for the SBM below the recovery threshold
τ = (a − b)2 /(2(a + b)) < 1.
Polyanskiy and Wu recently [40] proposed an information-percolation method based
on strong data-processing inequalities for mutual information to bound the mutual infor-
mation in (13.6) in terms of bond percolation probabilities, which yields bounds or a
sharp recovery threshold for correlated recovery; a similar program is carried out inde-
pendently in [41] for a variant of mutual information defined via the χ2 -divergence. For
two communities, this method yields the sharp threshold in the Gaussian model but not
in the SBM.
Next, we describe another method of proving (13.6) via second-moment analysis that
reaches the sharp threshold. Let P+ and P− denote the conditional distribution of A con-
ditioned on σ1 = σ2 and on σ1  σ2 , respectively. The following result can be distilled
from [42] (see Appendix A13.2 for a proof): for any probability distribution Q, if

(P+ − P− )2
= o(1), (13.7)
Q
then (13.6) holds and hence correlated recovery is impossible. The LHS of (13.7) can
be computed similarly to the usual second moment (13.4) when Q is chosen to be the
distribution of A under the null model. In Appendix A13.2 we verify that (13.7) is sat-
isfied below the correlated recovery threshold τ = (a − b)2 /(2(a + b)) < 1 for the binary
symmetric SBM.
Blockwise mutual information I(σ; A). Although this quantity is not directly related
to correlated recovery per se, its derivative with respect to some appropriate signal-to-
noise-ratio (SNR) parameter can be related to or coincides with the reconstruction error
thanks to the I-MMSE formula [43] or variants. Using this method, we can prove that
the Kullback–Leibler divergence D(P Q) = o(n) implies the impossibility of correlated
recovery in the Gaussian case. As shown in (13.2), a bounded second moment read-
ily implies a bounded KL divergence. Hence, as a corollary, we prove that a bounded
second moment also implies the impossibility of correlated recovery in the Gaussian
case. Below, we sketch the proof of the impossibility of correlated recovery in the Gaus-
sian case, by assuming D(P Q) = o(n). The proof makes use of mutual information, the
I-MMSE formula, and a type of interpolation argument [44–46].

Assume that A(β) = βM + W in the planted model and A = W in the null model,

where β ∈ [0, 1] is an SNR parameter, M = (μ/ n)(σσT − I), W is a symmetric Gaus-
i.i.d.
sian random matrix with zero diagonal, and Wi j ∼ N(0, 1) for all i < j. Note that
β = 1 corresponds to the binary symmetric community model in Definition 13.2 with
√ √
P = N(μ/ n, 1) and Q = N(−μ/ n, 1). Below we abbreviate A(β) as A whenever the
context is clear. First, recall that the minimum mean-squared error estimator is given by
the posterior mean of M:
MMSE (A) = E[M|A],
M
and the resulting (rescaled) minimum mean-squared error is
1
MMSE(β) = E M − E[M|A] 2F . (13.8)
n
392 Yihong Wu and Jiaming Xu

We will start by proving that, if D(P Q) = o(n), then, for all β ∈ [0, 1], the MMSE
 = 0, i.e.,
tends to that of the trivial estimator M
1
lim MMSE(β) = lim E M 2
F = μ2 . (13.9)
n→∞ n→∞ n
Note that limn→∞ MMSE(β) exists by virtue of Proposition III.2 of [44]. Let us compute
the mutual information between M and A:
 
P(A|M)
I(β)  I(M; A) = E M,A log (13.10)
P(A)
   
Q(A) P(A|M)
= EA log + E M,A log
P(A) Q(A)
⎡ ⎤
1 ⎢⎢⎢  β M 2F ⎥⎥⎥

= −D(P Q) + E M,A ⎢⎣ β M, A − ⎥⎥
2 2 ⎦
β
= −D(P Q) + E M 2F . (13.11)
4
By assumption, we have that D(P Q) = o(n) holds for β = 1; by the data-processing
inequality for KL divergence [47], this holds for all β < 1 as well. Thus (13.11) becomes

1 β 1 βμ2
lim I(β) = lim E M 2
F = . (13.12)
n→∞ n 4 n→∞ n 4
Next we compute the MMSE. Recall the I-MMSE formula [43] for Gaussian
channels:
dI(β) 1 ! & '2 n
= Mi j − E Mi j |A = MMSE(β) . (13.13)
dβ 2 i< j 4

Note that the MMSE is by definition bounded above by the squared error of the trivial
estimator M = 0, so that for all β we have

1
MMSE(β) ≤ E M 2
F ≤ μ2 . (13.14)
n
On combining these we have
 1
μ2 (a) I(1) (b) 1
= lim = lim MMSE(β) dβ
4 n→∞ n 4 n→∞ 0
 1
(c) 1
≤ lim MMSE(β) dβ
4 0 n→∞
 1
(d) 1 μ2
≤ μ2 dβ = ,
4 0 4
where (a) and (b) hold due to (13.12) and (13.13), (c) follows from Fatou’s lemma, and
(d) follows from (13.14), i.e., MMSE(β) ≤ μ2 pointwise. Since we began and ended with
the same expression, these inequalities must all be equalities. In particular, since (d)
holds with equality, we have that (13.9) holds for almost all β ∈ [0, 1]. Since MMSE(β)
Statistical Problems with Planted Structures 393

is a non-increasing function of β, its limit limn→∞ MMSE(β) is also non-increasing in β.


Therefore, (13.9) holds for all β ∈ [0, 1]. This completes the proof of our claim that the
optimal MMSE estimator cannot outperform the trivial one asymptotically.
To show that the optimal estimator actually converges to the trivial one, we expand
the definition of MMSE(β) in (13.8) and subtract (13.9) from it. This gives

1 & '
lim E −2 M, E[M|A] + E[M|A] 2F = 0 . (13.15)
n→∞ n

From the tower property of conditional expectation and the linearity of the inner product,
it follows that

E M, E[M|A] = E E[M|A], E[M|A] = E E[M|A] 2F ,

and combining this with (13.15) gives

1
lim E E[M|A] 2
F = 0. (13.16)
n→∞ n

Finally, for any estimator 


σ(A) of the community membership σ, we can define an

estimator for M by M  = (μ/ n)(σσT − I). Then using the Cauchy–Schwarz inequality,
we have

 ] = EA [ E[M|A], M
E M,A [ M, M ]
& '
≤ EA E[M|A] F M  F
( √ (13.16)
≤ EA [ E[M|A] 2F ] × μ n = o(n).

Since M,&+M  = μ2 ()σ, * )


σ 2 /n − 1), it follows that E[ σ, 
*
σ 2 ] = o(n2 ), which further
) * + '
implies E ++ σ, σ ++ = o(n) by Jensen’s inequality. Hence, correlated recovery of σ is
impossible.
In passing, we remark that, while we focus on the binary symmetric community in this
section, the proof techniques are widely applicable for many other high-dimensional
inference problems such as detecting a single community [15], sparse PCA, Gaus-
sian mixture clustering [33], synchronization [48], and tensor  PCA [24]. In fact, for
√ 
a more general k-symmetric community model with P = N (k − 1)μ/ n, 1 and Q =
 √ 
N −μ/ n, 1 , the first-moment method shows that both detection and correlated recov-

ery are information-theoretically
 possible when μ > 2 log k/(k − 1) and impossible√
when μ < 2 log(k − 1)/(k − 1). The upper and √ lower bounds differ by a factor of 2
when k is asymptotically large. This gap of 2 is due to the looseness of the second-
moment lower bound. A more refined conditional second lower bound can be applied to
show that the sharp information-theoretic threshold for detection and correlated recovery
is μ = 2 log k/k(1 + ok (1)) when k → ∞ [33]. Complete, but not explicit, characteriza-
tions of information-theoretic reconstruction thresholds were obtained in [46, 49, 50] for
all finite k through the Guerra interpolation technique and cavity method.
394 Yihong Wu and Jiaming Xu

13.3.2 Almost Exact and Exact Recovery


In this section, we study almost-exact and exact recovery using the single-community
model as an illustrating example. The hidden community can be represented by its indi-
cator vector ξ ∈ {0, 1}n such that ξi = 1 if vertex i is in the community and ξi = 0 otherwise.
Let ξ∗ denote the indicator of the true community and  ξ =
ξ(A) ∈ {0, 1}n an estimator.
The only assumptions on the community size K we impose are that K/n is bounded
away from one, and, to avoid triviality, that K ≥ 2. Of particular interest is the case of
K = o(n), where the community size grows sub-linearly with respect to the network size.
definition 13.5 (Almost-exact recovery) An estimator  ξ is said to almost exactly
recover ξ∗ if, as n → ∞, dH (ξ∗ , 
ξ)/K → 0 in probability, where dH denotes the Hamming
distance.
One can verify that the existence of an estimator satisfying Definition 13.5 is
equivalent to the existence of an estimator such that E[dH (ξ∗ , 
ξ)] = o(K).
definition 13.6 (Exact recovery) An estimator  ξ exactly recovers ξ∗ , if, as n → ∞,

P[ξ  ξ] → 0, where the probability is with respect to the randomness of ξ∗ and A.

To obtain upper bounds on the thresholds for almost-exact and exact recovery, we turn
to the MLE. Specifically,
• to show that the MLE achieves almost-exact recovery, it suffices to prove that there
exists n = o(1) such that, with high probability, P(A|ξ) < P(A|ξ∗ ) for all ξ with
dH (ξ, ξ∗ ) ≥ n K; and
• to show that the MLE achieves exact recovery, it suffices to prove that, with high
probability, P(A|ξ) < P(A|ξ∗ ) for all ξ  ξ∗ .
This type of argument often involves two key steps. First, upper-bound the probability
that P(A|ξ) ≥ P(A|ξ∗ ) for a fixed ξ using large-deviation techniques. Second, take an
appropriate union bound over all possible ξ using a “peeling” argument which takes into
account the fact that the further away ξ is from ξ∗ the less likely it is for P(A|ξ) ≥ P(A|ξ∗ )
to occur. Below we discuss these two key steps in more detail.
Given the data matrix A, a sufficient statistic for estimating the community C ∗ is the
log likelihood ratio (LLR) matrix L ∈ Rn×n , where Li j = log(dP/dQ)(Ai j ) for i  j and
Lii = 0. For S , T ⊂ [n], define
!
e(S , T ) = Li j . (13.17)
(i< j):(i, j)∈(S ×T )∪(T ×S )

ML denote the MLE of C ∗ , given by


Let C
ML = argmaxC⊂[n] {e(C,C) : |C| = K},
C (13.18)

which minimizes the error probability P{C  C ∗ } because C ∗ is equiprobable by assump-


tion. It is worth noting that the optimal estimator that minimizes the misclassification
rate (Hamming loss) is the bit-MAP decoder  ξi ), where 
ξ = ( ξi  argmax j∈{0,1} P[ξi = j|L].
Therefore, although the MLE is optimal for exact recovery, it need not be optimal for
Statistical Problems with Planted Structures 395

almost-exact recovery; nevertheless, we choose to analyze the MLE due to its simplicity,
and it turns out to be asymptotically optimal for almost-exact recovery as well.
To state the main results, we introduce some standard notations associated with
binary hypothesis testing based on independent samples. We assume the KL divergences
D(P Q) and D(Q P) are finite. In particular, P and Q are mutually& absolutely
' continu-
ous, and the likelihood ratio, dP/dQ, satisfies EQ [dP/dQ] = EP (dP/dQ)−1 = 1. Let L =
log(dP/dQ) denote the LLR. The likelihood-ratio test for n observations and threshold

nθ is to declare P to be the true distribution if nk=1 Lk ≥ nθ and to declare Q otherwise.
For θ ∈ [−D(Q P), D(P Q)], the standard Chernoff bounds for error probability of this
likelihood-ratio test are given by
⎡ n ⎤
⎢⎢⎢! ⎥⎥⎥
Q⎢⎣⎢ Lk ≥ nθ⎥⎥⎦⎥ ≤ exp(−nE Q (θ)),
⎢ (13.19)
k=1
⎡ n ⎤
⎢⎢⎢! ⎥⎥⎥

P⎢⎢⎣ Lk ≤ nθ⎥⎥⎥⎦ ≤ exp(−nE P (θ)), (13.20)
k=1

where the log moment generating functions of L are denoted by ψQ (λ) = log EQ [exp(λL)]
and ψP (λ) = log EP [exp(λL)] = ψQ (λ + 1) and the large-deviation exponents are given by
Legendre transforms of the log moment generating functions:

E Q (θ) = ψ∗Q (θ)  sup λθ − ψQ (λ), (13.21)


λ∈R
E P (θ) = ψ∗P (θ)  sup λθ − ψP (λ) = E Q (θ) − θ. (13.22)
λ∈R

In particular, E P and E Q are convex functions. Moreover, since ψQ (0) = −D(Q P) and
ψQ (1) = D(P Q), we have E Q (−D(Q P)) = E P (D(P Q)) = 0 and hence E Q (D(P Q)) =
D(P Q) and E P (−D(Q P)) = D(Q P).
Under mild assumptions on the distribution (P, Q) (cf. Assumption 1 of [51]) which
are satisfied both by the Gaussian distribution and by the Bernoulli distribution, the sharp
thresholds for almost exact and exact recovery under the single-community model are
given by the following result.
theorem 13.2 Consider the single-community model with P = N(μ, 1) and Q = N(0, 1),
or P = Bern(p) and Q = Bern(q) with log(p/q) and log((1 − p)/(1 − q)) bounded. If
(K − 1)D(P Q)
K · D(P Q) → ∞ and lim inf > 2, (13.23)
n→∞ log(n/K)
then almost-exact recovery is information-theoretically possible. If, in addition to
(13.23),
, -
KE Q (1/K) log(n/K)
lim inf >1 (13.24)
n→∞ log n
holds, then exact recovery is information-theoretically possible.
Conversely, if almost-exact recovery is information-theoretically possible, then
(K − 1)D(P Q)
K · D(P Q) → ∞ and lim inf ≥ 2. (13.25)
n→∞ log(n/K)
396 Yihong Wu and Jiaming Xu

If exact recovery is information-theoretically possible, then in addition to (13.25), the


following holds:
, -
KE Q (1/K) log(n/K)
lim inf ≥ 1. (13.26)
n→∞ log n
Next we sketch the proof of Theorem 13.2.
Sufficient Conditions
For any C ⊂ [n] such that |C| = K and |C ∩ C ∗ | = , let S = C ∗ \ C and T = C \ C ∗ . Then

e(C,C) − e(C ∗ ,C ∗ ) = e(T, T ) + e(T,C ∗ \ S ) − e(S ,C ∗ ).


K    
Let m = 2 − 2 . Notice that e(S ,C ∗ ) has the same distribution as m i=1 Li under measure

P; e(T, T ) + e(T,C ∗ \S ) has the same distribution as m i=1 L i under measure Q where Li
are i.i.d. copies of log(dP/dQ). It readily follows from large-deviation bounds (13.19)
and (13.20) that
. / , -
P e(T, T ) + e(T,C ∗ \ S ) ≥ mθ ≥ exp −mE Q (θ) , (13.27)
. ∗ /
P e(S ,C ) ≤ mθ ≤ exp(−mE P (θ)).

Next we proceed to describe the union bound for the proof of almost-exact recovery.
Note
0 that showing that MLE 1 achieves almost exact recovery is equivalent to showing
ML ∩ C ∗ | ≤ (1 − n )K = o(1). The first layer of the union bound is straightforward:
P |C
0 1 0 1
ML ∩ C ∗ | ≤ (1 − n )K = ∪(1−n )K |C
|C ML ∩ C ∗ | = . (13.28)
=0

For the second layer of the union bound, one naive way to proceed is
0 1
ML ∩ C ∗ | = ⊂ {C ∈ C : e(C,C) ≥ e(C ∗ ,C ∗ )}
|C
. /
= ∪C∈C e(C,C) ≥ e(C ∗ ,C ∗ ,

where C = {C ⊂ [n] : |C| = K, |C ∩ C ∗ | = }. However, this union bound is too loose


because of the high correlations among e(C,C) − e(C ∗ ,C ∗ ) for different C ∈ C . Instead,
we use the following union bound. Let S = {S ⊂ C ∗ : |S | = K − } and T = {T ⊂ (C ∗ )c :
|T | = K − }. Then, for any θ ∈ R,
ML ∩ C ∗ | = } ⊂ {∃S ∈ S , T ∈ T : e(S ,C ∗ ) ≤ e(T, T ) + e(T,C ∗ \S )}
{|C
⊂ {∃S ∈ S : e(S ,C ∗ ) ≤ mθ}
∪ {∃S ∈ S , T ∈ T : e(T, T ) + e(T,C ∗ \S ) ≥ mθ}
⊂ ∪S ∈S {e(S ,C ∗ ) ≤ mθ}
∪S ∈S ,T ∈T {e(T, T ) + e(T,C ∗ \S ) ≥ mθ}. (13.29)
 K 
Note that we single out e(S ,C ∗ ) because the number of different choices of S , K− , is
 
much smaller than the number of different choices of T , n−K K− , when K  n. Combining
the above union bound with the large-deviation bound (13.27) yields that
0 1  K  
n−K

K

 ∗
P |CML ∩ C | = ≤ e−mE P (θ)
+ e−mE Q (θ) . (13.30)
K− K− K−
Statistical Problems with Planted Structures 397

Note that, for any ≤ (1 − )K,


  #
K Ke $K− # e $K−
≤ ≤ ,
K− K− 
   K−  K−
n−K (n − K)e (n − K)e
≤ ≤ .
K− K− K
Hence, for any ≤ (1 − )K,
0 1
P |CML ∩ C ∗ | = ≤ e−(K− )E1
+ e−(K− )E2
, (13.31)
where
1 #e$
E1  (K − 1)E P (θ) − log ,
2 
 
1 (n − K)e2
E2  (K − 1)E Q (θ) − log .
2 K 2
Thanks to the second condition in (13.23), we have (K − 1)D(P Q)(1 − η) ≥ 2 log(n/K)
for some η ∈ (0, 1). Choose θ = (1 − η)D(P Q). Under some mild assumption on P and Q
which is satisfied in the Gaussian and Bernoulli cases, we have E P (θ) ≥ cη2 D(P Q) for
some universal constant c > 0. Furthermore, recall from (13.22) that E P (θ) = E Q (θ) − θ.

Hence, since KD(P Q) → ∞ by the assumption (13.23), by choosing  = 1/ KD(P Q),
we have min{E1 , E2 } → ∞. The proof for almost-exact recovery is completed by taking
the first layer of the union bound in (13.28)0 over . 1
For exact recovery, we need to show P |C ML ∩ C ∗ | ≤ K − 1 = o(1). Hence, we need
0 1
to further bound P |C ML ∩ C ∗ | = for any (1 − )K ≤ ≤ K − 1. It turns out that the
previous union bound (13.29) is no longer tight. Instead, using e(T, T ) + e(T,C ∗ \ S ) =
e(T, T ∪ C ∗ ) − e(T, S ), we have the following union bound:
ML ∩ C ∗ | = } ⊂ ∪S ∈S {e(S ,C ∗ ) ≤ m1 θ1 } ∪T ∈T {e(T, T ∪ C ∗ ) ≥ m2 θ2 }
{|C
∪S ∈S ,T ∈T {e(T, S ) ≤ m2 θ2 − m1 θ1 },
K    K− 
where m1 = 2 − 2 , m2 = 2 + (K − )K, and θ1 , θ2 are to be optimized. Note that we
further single out e(T, T ∪ C ∗ ) because it depends only on T ∈ T once C ∗ is fixed. Since
(1 − )K ≤ ≤ K − 1, we have |T | = |S | = K − ≤ K and thus the effect of e(T, S ) can be
neglected. Therefore, approximately we can set θ1 = θ2 = θ and get
0 1  K   
n − K −m2 E P (θ)
 ∗
P |CML ∩ C | =  e−m1 E P (θ)
+ e . (13.32)
K− K−
 K   
Using K− ≤ K K− , n−K
K− ≤ (n − K)
K− , and m ≥ m ≥ (1 − )(K − )K, we get that,
2 1
for any (1 − )K ≤ ≤ K − 1,
0 1
ML ∩ C ∗ | = ≤ e−(K− )E3 + e−(K− )E4 ,
P |C (13.33)
where

E3  (1 − )KE P (θ) − log K,

E4  (1 − )KE Q (θ) − log n.


398 Yihong Wu and Jiaming Xu

Note that E P (θ) = E Q (θ) − θ. Hence, we set θ = (1/K) log(n/K) so that E3 = E4 , which
goes to +∞ under the assumption of (13.24). The proof of exact recovery is completed
by taking the union bound over all .

Necessary Conditions
To derive lower bounds on the almost-exact recovery threshold, we resort to a sim-
ple rate-distortion argument. Suppose ξ achieves almost-exact recovery of ξ∗ . Then
E[dH (ξ, ξ)] = n K with n → 0. On the one hand, consider the following chain of
inequalities, which lower-bounds the amount of information required for a distortion
level n :
(a)
I(A; ξ∗ ) ≥ I(
ξ; ξ∗ ) ≥ min I(
ξ; ξ∗ )
E[d(
ξ,ξ∗ )]≤n K

≥ H(ξ∗ ) − max H(


ξ ⊕ ξ∗ )
E[d(
ξ,ξ∗ )]≤n K
  #  K $ (c) #n$
(b) n n
= log − nh ≥ K log (1 + o(1)),
K n K

where (a) follows from the data-processing inequality for mutual information since ξ →
A → ξ forms a Markov chain; (b) is due to the fact that maxE[w(X)]≤pn H(X) = nh(p) for
any p ≤ 1/2, where
   
1 1
h(p)  p log + (1 − p)log (13.34)
p 1− p
  
is the binary entropy function and w(x) = i xi ; and (c) follows from the bound Kn ≥
(n/K)K , the assumption K/n is bounded away from one, and the bound h(p) ≤ −p log p +
p for p ∈ [0, 1].
On the other hand, consider the following upper bound on the mutual information:
 
n K
I(A; ξ∗ ) = min D(PA|ξ∗ Q|Pξ∗ ) ≤ D(PA|ξ∗ Q⊗(2) |Pξ∗ ) = D(P Q),
Q 2

where the first equality follows from the geometric interpretation of mutual information
as an “information radius” (see, e.g., Corollary 3.1 of [52]); the last equality follows
from the tensorization property of KL divergence for product distributions. Combining
the last two displays, we conclude that the second condition in (13.25) is necessary for
almost-exact recovery.
To show the necessity of the first condition in (13.25), we can reduce almost-exact
recovery to a local hypothesis testing via a genie-type argument. Given i, j ∈ [n], let
ξ\i, j denote {ξk : k  i, j}. Consider the following binary hypothesis-testing problem for
determining ξi . If ξi = 0, a node J is randomly and uniformly chosen from { j : ξ j = 1},
and we observe (A, J, ξ\i,J ); if ξi = 1, a node J is randomly and uniformly chosen from
{ j : ξ j = 0}, and we observe (A, J, ξ\i,J ). It is straightforward to verify that this hypothesis-
testing problem is equivalent to testing H0 : Q⊗(K−1) P⊗(K−1) versus H1 : P⊗(K−1) Q⊗(K−1) .
Let E denote the optimal average probability of testing error, pe,0 denote the type-I error
Statistical Problems with Planted Structures 399

probability, and pe,1 denote the type-II error probability. Then we have the following
chain of inequalities:

!
n
E[dH (ξ, 
ξ)] ≥ min P[ξi  
ξi ]

i=1 ξi (A)
!
n
≥ min P[ξi  
ξi ]

i=1 ξi (A,J, ξ\i,J )
=n min P[ξ1  
ξ1 ] = nE.

ξ1 (A,J, ξ\1,J )

By the assumption E[dH (ξ,  ξ)] = o(K), it follows that E = o(K/n). Under the assump-
tion that K/n is bounded away from one, E = o(K/n) further implies that the sum
of type-I and type-II probabilities of error pe,0 + pe,1 = o(1), or equivalently, TV((P ⊗
Q)⊗K−1 , (Q ⊗ P)⊗K−1 ) → 1, where TV(P, Q)  |dP − dQ|/2 denotes the total variation
distance. Using D(P Q) ≥ log(1/(2(1 − TV(P, Q)))) (Equation (2.25) of [53]) and the
tensorization property of KL divergence for product distributions, we conclude that
(K − 1)(D(P Q) + D(Q P)) → ∞ is necessary for almost-exact recovery. It turns out
that, both for the Bernoulli distribution and for the Gaussian distribution as specified in
the theorem statement, D(P Q)  D(Q P), and hence KD(P Q) → ∞ is necessary for
almost-exact recovery.
Clearly, any estimator achieving exact recovery also achieves almost-exact recovery.
Hence lower bounds for almost-exact recovery hold automatically for exact recovery.
Finally, we show the necessity of (13.26) for exact recovery. Since the MLE mini-
mizes the error probability among all estimators if the true community C ∗ is uniformly
distributed, it follows that, if exact recovery is possible, then, with high probability,
C ∗ has a strictly higher likelihood than any other community C  C ∗ , in particular,
C = C ∗ \ {i} ∪ { j} for any pair of vertices i ∈ C ∗ and j  C ∗ . To further illustrate the
proof ideas, consider the Bernoulli case of the single-community model. Then C ∗ has a
strictly higher likelihood than C ∗ \ {i} ∪ { j} if and only if e(i,C ∗ ), the number of edges
connecting i to vertices in C ∗ , is larger than e( j,C ∗ \ {i}), the number of edges connecting
j to vertices in C ∗ \ {i}. Therefore, with high probability, it holds that

min∗ e(i,C ∗ ) > max∗ e( j,C ∗ \ {i0 }), (13.35)


i∈C jC

where i0 is the random index such that i0 ∈ arg mini∈C ∗ e(i,C ∗ ). Note that the e( j,C ∗ \
{i0 }) s are i.i.d. for different j  C ∗ and hence a large-probability lower bound to their
maximum can be derived using inverse concentration inequalities. Specifically, for the
sake of argument by contradiction, suppose that (13.26) does not hold. Furthermore, for
ease of presentation, assume that the large-deviation inequality (13.19) also holds in the
reverse direction (cf. Corollary 5 of [51] for a precise statement). Then it follows that

2 # n $3   # n $
∗ 1
P e( j,C \ {i0 }) ≥ log  exp −KE Q log ≥ n−1+δ
K K K
400 Yihong Wu and Jiaming Xu

for some small δ > 0. Since the e( j,C ∗ \ {i0 })s are i.i.d. and there are n − K of them, it
further follows that, with large probability,
#n$
max∗ e( j,C ∗ \ {i0 }) ≥ log .
jC K
Similarly, by assuming that the large-deviation inequality (13.20) also holds in the
opposite direction and using the fact that E P (θ) = E Q (θ) − θ, we get that
2 # n $3   # n $
1
P e(i,C ∗ ) ≤ log  exp −KE P log ≥ K −1+δ .
K K K
Although the e(i,C ∗ )s are not independent for different i ∈ C ∗ , the dependence is weak
and can be controlled properly. Hence, following the same argument as above, we get
that, with large probability,
#n$
min∗ e(i,C ∗ ) ≤ log .
i∈C K
Combining the large-probability lower and upper bounds and (13.35) yields the contra-
diction. Hence, (13.26) is necessary for exact recovery.
remark 13.1 Note that, instead of using the MLE, one could also apply a two-step
procedure to achieve exact recovery: first use an estimator capable of almost-exact recov-
ery and then clean up the residual errors through a local voting procedure for every
vertex. Such a two-step procedure has been analyzed in [51]. From the computational
perspective, both for the Bernoulli case and for the Gaussian case we have the following
results:
• if K = Θ(n), a linear-time degree-thresholding algorithm achieves the information
limit of almost exact recovery (see Appendix A of [54] and Appendix A of [55]);
• if K = ω(n/ log n), whenever information-theoretically possible, exact recovery can
be achieved in polynomial time using semidefinite programming [56];
• if K ≥ (n/ log n)(1/(8e) + o(1)) for the Gaussian case and K ≥ (n/ log n)(ρBP (p/q) +
o(1)) for the Bernoulli case, exact recovery can be attained in nearly linear time via
message passing plus clean-up [54, 55] whenever information-theoretically possible.
Here ρBP (p/q) denotes a constant depending only on p/q.
However, it remains unknown whether any polynomial-time algorithm can achieve the
respective information limit of almost exact recovery for K = o(n), or exact recovery for
K ≤ (n/ log n)(1/(8e) − ) in the Gaussian case and for K ≤ (n/ log n)(ρBP (p/q) − ) in
the Bernoulli case, for any fixed  > 0.
Similar techniques can be used to derive the almost-exact and exact recovery thresh-
olds for the binary symmetric community model. For the Bernoulli case, almost-exact
recovery is efficiently achieved by a simple spectral method if n(p−q)2 /(p+q) → ∞[57],
which turns out to be also information-theoretically necessary [58]. An exact recov-
ery threshold for the binary community model has been derived and further shown to
be efficiently achievable by a two-step procedure consisting of a spectral method plus
clean-up [58, 59]. For the binary symmetric community model with general discrete dis-
tributions P and Q, the information-theoretic limit of exact recovery has been shown to
Statistical Problems with Planted Structures 401

be determined by the Rényi divergence of order 1/2 between P and Q [60]. The analysis
of the MLE has been carried out under k-symmetric community models for general k,
and the information-theoretic exact recovery threshold has been identified in [16] up to
a universal constant. The precise information-theoretic limit of exact recovery has been
determined in [61] for k = Θ(1) with a sharp constant and has further been shown to be
efficiently achievable by a polynomial-time two-step procedure.

13.4 Computational Limits

In this section we discuss the computational limits (performance limits of all possible
polynomial-time procedures) of detecting the planted structure under the planted-clique
hypothesis (to be defined later). To investigate the computational hardness of a given
statistical problem, one main approach is to find an approximate randomized
polynomial-time reduction, which maps certain graph-theoretic problems, in particular,
the planted-clique problem, to our problem approximately in total variation, thereby
showing that these statistical problems are at least as hard as solving the planted-clique
problem.
We focus on the single-community model in Definition 13.1 and present results
for both the submatrix detection problem (Gaussian) [9] and the community detec-
tion problem (Bernoulli) [10]. Surprisingly, under appropriate parameterizations, the
two problems share the same “easy–hard–impossible” phase transition. As shown in
Fig. 13.1, where the horizontal and vertical axes correspond to the relative community
size and the noise level, respectively, the hardness of the detection has a sharp phase
transition: optimal detection can be achieved by computationally efficient procedures
for a relatively large community, but provably not for a small community. This is one
of the first results in high-dimensional statistics where the optimal trade-off between
statistical performance and computational efficiency can be precisely quantified.
Specifically, consider the submatrix detection problem in the Gaussian case of Def-
inition 13.1, where P = N(μ, 1) and Q = N(0, 1). In other words, the goal is to test
the null model, where the observation is an N × N Gaussian noise matrix, versus the
planted model, where there exists a K × K submatrix of elevated mean μ. Consider the
high-dimensional setting of K = N α and μ = N −β with N → ∞, where α, β > 0 parameter-
izes the cluster size and signal strength, respectively. Information-theoretically, it can be
shown that there exist detection procedures achieving vanishing error probability if and
only if β < β∗  max(α/2, 2α − 1) [21]. In contrast, if only randomized polynomial-time
algorithms are allowed, then reliable detection is impossible if β > β  max(0, 2α − 1);
conversely if β < β , there exists a near-linear-time detection algorithm with vanish-
ing error probability. The plots of β∗ and β in Fig 13.1 correspond to the statistical and
computational limits of submatrix detection, respectively, revealing the following strik-
ing phase transition: for a large community (α ≥ 23 ), optimal detection can be achieved
by computationally efficient procedures; however, for a small community (α < 23 ),
computational constraint incurs a severe penalty on the statistical performance and
the optimal computationally intensive procedure cannot be mimicked by any efficient
algorithms.
402 Yihong Wu and Jiaming Xu

impossible

1
3 easy

hard
α
0 1 2 1
2 3

Figure 13.1 Computational versus statistical limits. For the submatrix detection problem, the size
of the submatrix is K = N α and the elevated mean is μ = N −β . For the community detection
problem, the cluster size is K = N α , and the in-cluster and inter-cluster edge probabilities p and q
are both on the order of N −2β .

For the Bernoulli case, it has been shown that, to detect a planted dense subgraph,
when the in-cluster and inter-cluster edge probabilities p and q are on the same order
and parameterized as N −2β and the cluster size as K = N α , the easy–hard–impossible
phase transition obeys the same diagram as that in Fig. 13.1 [10].
Our intractability result is based on the common hardness assumption of the planted-
clique problem in the Erdős–Rényi graph when the clique size is of smaller order than
the square root of the graph cardinality [62], which has been widely used to establish
various hardness results in theoretical computer science [63–68] as well as the hardness
of detecting sparse principal components [8]. Recently, the average-case hardness of
the planted-clique problem has been established under certain computational models
[69, 70] and within the sum-of-squares relaxation hierarchy [71–73].
The rest of the section is organized as follows. Section 13.4.1 gives the precise def-
inition of the planted-clique problem, which forms the basis of reduction both for the
submatrix detection problem and for the community detection problem, with the latter
requiring a slightly stronger assumption. Section 13.4.2 discusses how to approximately
reduce the planted-clique problem to the single-community detection problem in poly-
nomial time in both Bernoulli and Gaussian settings. Finally, Section 13.4.3 presents the
key techniques to bound the total variation between the reduced instance and the target
hypothesis.

13.4.1 Planted-Clique Problem


Let G(n, γ) denote the Erdős–Rényi graph model with n vertices, where each pair of
vertices is connected independently with probability γ. Let G(n, k, γ) denote the planted-
clique model in which we add edges to k vertices uniformly chosen from G(n, γ) to form
a clique.
Statistical Problems with Planted Structures 403

definition 13.7 The PC detection problem with parameters (n, k, γ), denoted by
PC(n, k, γ) henceforth, refers to the problem of testing the following hypotheses:
H0C : G ∼ G(n, γ), H1C : G ∼ G(n, k, γ).
The problem of finding the planted clique has been extensively studied for γ = 12 and

the state-of-the-art polynomial-time algorithms [14, 62, 74–78] work only for k = Ω( n).

There is no known polynomial-time solver for the PC problem for k = o( n) and any
constant γ > 0. It has been conjectured [63, 64, 67, 70, 79] that the PC problem cannot

be solved in polynomial time for k = o( n) with γ = 12 , which we refer to as the PC
hypothesis.

PC hypothesis Fix some constant 0 < γ ≤ 12 . For any sequence of randomized


polynomial-time tests {ψn,kn } such that lim supn→∞ (log kn / log n) < 1/2,
lim inf PH C {ψn,k (G) = 1} + PH C {ψn,k (G) = 0} ≥ 1.
n→∞ 0 1

The PC hypothesis with γ = is similar to Hypothesis 1 of [9] and Hypothesis BPC


1
2
of [8]. Our computational lower bounds for submatrix detection require that the PC
hypothesis holds for γ = 12 , and for community detection we need to assume that the
PC hypothesis holds for any positive constant γ. The even stronger assumption that the
PC hypothesis holds for γ = 2− log n has been used in Theorem 10.3 of [68] for public-
0.99

key cryptography. Furthermore, Corollary 5.8 of [70] shows


 that,
 under a statistical
log n
Ω
query model, any statistical algorithm requires at least n log(1/γ) queries for detecting
the planted bi-clique in an Erdős–Rényi random bipartite graph with edge probability γ.

13.4.2 Polynomial-Time Randomized Reduction


We present a polynomial-time randomized reduction scheme for the problem of detect-
ing a single community (Definition 13.1) in both Bernoulli and Gaussian cases. For
ease of presentation, we use the Bernoulli case as the main example, and discuss the
minor modifications needed for the Gaussian case. The recent work [13] introduces a
general reduction recipe for the single-community detection problem under general P, Q
distributions, as well as various other detection problems with planted structures.
Let G(N, q) denote the Erdős–Rényi random graph with N vertices, where each pair
of vertices is connected independently with probability q. Let G(N, K, p, q) denote the
planted dense subgraph model with N vertices where (1) each vertex is included in
the random set S independently with probability K/N; and (2) any two vertices are
connected independently with probability p if both of them are in S and with probability
q otherwise, where p > q. The planted dense subgraph here has a random size5 with mean
K, instead of a deterministic size K as assumed in [15, 80].

5 We can also consider a planted dense subgraph with a fixed size K, where K vertices are chosen uniformly
at random to plant a dense subgraph with edge probability p. Our reduction scheme extends to this fixed-
size model; however, we have not been able to prove that the distributions are approximately matched
under the alternative hypothesis. Nevertheless, the recent work [13] showed that the computational limit
for detecting fixed-sized community is the same as that in Fig. 13.1, resolving an open problem in [10].
404 Yihong Wu and Jiaming Xu

definition 13.8 The planted dense subgraph detection problem with parameters
(N, K, p, q), henceforth denoted by PDS(N, K, p, q), refers to the problem of distinguish-
ing between the following hypotheses:
H0 : G ∼ G(N, q)  P0 , H1 : G ∼ G(N, K, p, q)  P1 .
We aim to reduce the PC(n, k, γ) problem to the PDS(N, K, cq, q) problem. For sim-
plicity, we focus on the case of c = 2; the general case follows similarly with a change in
some numerical constants that come up in the proof. We are given an adjacency matrix
A ∈ {0, 1}n×n , or, equivalently, a graph G, and, with the help of additional randomness,
will map it to an adjacency matrix A  ∈ {0, 1}N×N , or, equivalently, a graph G  such that the
C C
hypothesis H0 (H1 ) in Definition 13.7 is mapped to H0 exactly (H1 approximately) in
Definition 13.8. In other words, if A is drawn from G(n, γ), then A  is distributed accord-
ing to P0 ; if A is drawn from G(n, k, 1, γ), then the distribution of A  is close in total
variation to P1 .
Our reduction scheme works as follows. Each vertex in G  is randomly assigned a
parent vertex in G, with the choice of parent being made independently for different
vertices in G, and uniformly over the set [n] of vertices in G. Let V s denote the set
of vertices in G  with parent s ∈ [n] and let s = |V s |. Then the set of children nodes
{V s : s ∈ [n]} will form a random partition of [N]. For any 1 ≤ s ≤ t ≤ n, the number of
edges, E(V s , Vt ), from vertices in V s to vertices in Vt in G will be selected randomly with
a conditional probability distribution specified below. Given E(V s , Vt ), the particular set
of edges with cardinality E(V s , Vt ) is chosen uniformly at random.
It remains to specify, for 1 ≤ s ≤ t ≤ n, the conditional distribution of E(V s , Vt )
given s , t , and A s,t . Ideally, conditioned on s and t , we want to construct a Markov
kernel from A s,t to E(V s , Vt ) which maps Bern(1) to the desired edge distribution
Binom( s t , p), and Bern(1/2) to Binom( s t , q), depending on whether both s and t
are in the clique or not, respectively. Such a kernel, unfortunately, provably does not
exist. Nevertheless, this objective can be accomplished  approximately
 in terms of the
total variation. For s = t ∈ [n], let E(V s , Vt ) ∼ Binom 2 , q . For 1 ≤ s < t ≤ n, denote
t

P s t  Binom( s t , p) and Q s t  Binom( s t , q). Fix 0 < γ ≤ 12 and put m0  log2 (1/γ).
Define



⎪ P (m) + a s t for m = 0,

⎨ st
P s t (m) = ⎪
⎪ P s t (m) for 1 ≤ m ≤ m0 ,


⎩ (1/γ)Q s t (m) for m0 < m ≤ s t ,

where a s t = m0 <m≤ s t [P s t (m)−(1/γ)Q s t (m)]. Let Q s t = (1/(1−γ))(Q s t −γP s t ).
The idea behind our choice of P s t and Q s t is as follows. For a given P s t , we choose
Q s t to map Bern(γ) to Binom( s t , q) exactly; however, in order for Q to be a well-
defined probability distribution, we need to ensure that Q s t (m) ≥ γP s t (m), which fails
when m ≤ m0 . Thus, we set P s t (m) = Q s t (m)/γ for m > m0 . The remaining probability
mass a s t is added to P s t (0) so that P s t is a well-defined probability distribution.
It is straightforward to verify that Q s t and P s t are well-defined probability
distributions, and
dTV (P s t
,P s t ) ≤ 4(8q 2 )(m0 +1) (13.36)
Statistical Problems with Planted Structures 405

as long as s , t ≤ 2 and 16q 2 ≤ 1, where = N/n. Then, for 1 ≤ s < t ≤ n, the conditional
distribution of E(V s , Vt ) given s , t , and A s,t is given by



⎪ P s t if A st = 1, s , t ≤ 2 ,


E(V s , Vt ) ∼ ⎪
⎪ Q if A st = 0, s , t ≤ 2 , (13.37)


⎩ Q
s t
s t if max{ s , t } > 2 .
Next we show that the randomized reduction defined above maps G(n, γ) into G(N, q)
under the null hypothesis and G(n, k, γ) approximately into G(N, K, p, q) under the alter-
native hypothesis. By construction, (1 − γ)Q s t + γP s t = Q s t = Binom( s t , q) and
therefore the null distribution of the PC problem is exactly matched to that of the PDS
problem, i.e., PG|H  C = P0 . The core of the proof lies in establishing that the alternative
0
distributions are approximately matched. The key observation is that, by (13.36), P s t
is close to P s t = Binom( s t , p) and thus, for nodes with distinct parents s  t in the
planted clique, the number of edges E(V s , Vt ) is approximately distributed as the desired
Binom( s t , p); for nodes with the  same
  parent s in the planted clique, even though
E(V s , V s ) is distributed as Binom 2 , q which is not sufficiently close to the desired
s
  
Binom 2s , p , after averaging over the random partition {V s }, the total variation dis-
tance becomes negligible. More formally, we have the following proposition; the proof
is postponed to the next section.
proposition 13.1 Let , n ∈ N, k ∈ [n] and γ ∈ (0, 12 ]. Let N = n, K = k , p = 2q, and
m0 = log2 (1/γ). Assume that 16q 2 ≤ 1 and k ≥ 6e . If G ∼ G(n, γ), then G ∼ G(N, q),
 C = P0 . If G ∼ G(n, k, 1, γ), then
i.e., PG|H
0
# $ 
−K − 2 m0 +1
 C , P1 ≤ e 12 + 1.5ke 18 + 2k (8q ) + 0.5 e72e q − 1
2 2 2
dTV PG|H
1

+ 0.5ke− 36 . (13.38)
Reduction Scheme in the Gaussian Case
The same reduction scheme can be tweaked slightly to work for the Gaussian case,
which, in fact, needs only the PC hypothesis for γ = 12 .6 In this case, we aim to map an
adjacency matrix A ∈ {0, 1}n×n to a symmetric data matrix A  ∈ RN×N with zero diagonal,
or, equivalently, a weighted complete graph G. 
For any 1 ≤ s ≤ t ≤ n, we let E(V s , Vt ) denote the average weights of edges between
V s and Vt in G. As for the Bernoulli model, we will first generate E(V s , Vt ) randomly
with a properly chosen conditional probability distribution. Since E(V s , Vt ) is a sufficient
statistic for the set of Gaussian edge weights, the specific weight assignment can be
generated from the average weight using the same kernel both for the null and for the
alternative.
i.i.d.
To see how this works, consider a general setup where X1 , . . . , Xn ∼ N(μ, 1). Let X̄ =
n
(1/n) i=1 Xi . Then we can simulate X1 , . . . , Xn on the basis of the sufficient statistic X̄

as follows. Let [v0 , v1 , . . . , vn−1 ] be an orthonormal basis for Rn , with v0 = (1/ n)1 and

6 The original reduction proof in [9] for the submatrix detection problem crucially relies on the Gaussianity
and the reduction maps a bigger planted-clique instance into a smaller instance for submatrix detection by
means of averaging.
406 Yihong Wu and Jiaming Xu


1 = (1, . . . , 1) . Generate Z1 , . . . , Zn−1 ∼ N(0, 1). Then X̄1 + n−1
i.i.d.
i=1 Zi vi ∼ N(μ1, In ). Using

this general procedure, we can generate the weights AV s ,Vt on the basis of E(V s , Vt ).
It remains to specify, for $1 \le s \le t \le n$, the conditional distribution of $\widetilde{E}(V_s, V_t)$ given $\ell_s$, $\ell_t$, and $A_{s,t}$. As for the Bernoulli case, conditioned on $\ell_s$ and $\ell_t$, ideally we would want to find a Markov kernel from $A_{s,t}$ to $\widetilde{E}(V_s, V_t)$ which maps $\mathrm{Bern}(1)$ to the desired distribution $N(\mu, 1/(\ell_s\ell_t))$ and $\mathrm{Bern}(1/2)$ to $N(0, 1/(\ell_s\ell_t))$, depending on whether both $s$ and $t$ are in the clique or not, respectively. This objective can be accomplished approximately in terms of the total variation. For $s = t \in [n]$, let $\widetilde{E}(V_s, V_t) \sim N(0, 1/(\ell_s\ell_t))$. For $1 \le s < t \le n$, denote $P_{\ell_s\ell_t} \triangleq N(\mu, 1/(\ell_s\ell_t))$ and $Q_{\ell_s\ell_t} \triangleq N(0, 1/(\ell_s\ell_t))$, with density functions $p_{\ell_s\ell_t}(x)$ and $q_{\ell_s\ell_t}(x)$, respectively.
Fix $\gamma = \frac{1}{2}$. Note that
$\dfrac{q_{\ell_s\ell_t}(x)}{p_{\ell_s\ell_t}(x)} = \exp\big(\ell_s\ell_t\,\mu(\mu/2 - x)\big) \ge \gamma$
if and only if $x \le x_0 \triangleq \mu/2 + (1/(\mu\ell_s\ell_t))\log(1/\gamma)$. Therefore, we define $P'_{\ell_s\ell_t}$ and $Q'_{\ell_s\ell_t}$ with the density $q'_{\ell_s\ell_t} = (1/(1-\gamma))(q_{\ell_s\ell_t} - \gamma p_{\ell_s\ell_t})$ and
$p'_{\ell_s\ell_t}(x) = \begin{cases} p_{\ell_s\ell_t}(x) + f_{\ell_s\ell_t}(2\mu - x) & \text{for } x < 2\mu - x_0, \\ p_{\ell_s\ell_t}(x) & \text{for } 2\mu - x_0 \le x \le x_0, \\ (1/\gamma)\,q_{\ell_s\ell_t}(x) & \text{for } x > x_0, \end{cases}$
where $f_{\ell_s\ell_t}(x) = p_{\ell_s\ell_t}(x) - (1/\gamma)q_{\ell_s\ell_t}(x)$. Let
$a_{\ell_s\ell_t} = \int_{x_0}^{\infty} f_{\ell_s\ell_t}(x)\,dx \le \bar{\Phi}\bigg(-\dfrac{\mu}{2}\sqrt{\ell_s\ell_t} + \dfrac{1}{\mu\sqrt{\ell_s\ell_t}}\log\dfrac{1}{\gamma}\bigg).$
As for the Bernoulli case, it is straightforward to verify that $Q'_{\ell_s\ell_t}$ and $P'_{\ell_s\ell_t}$ are well-defined probability distributions, and
$d_{\mathrm{TV}}(P'_{\ell_s\ell_t}, P_{\ell_s\ell_t}) = a_{\ell_s\ell_t} \le \bar{\Phi}\bigg(\dfrac{1}{2\mu\sqrt{\ell_s\ell_t}}\log\dfrac{1}{\gamma}\bigg) \le \exp\bigg(-\dfrac{1}{32\mu^2\ell^2}\log^2\dfrac{1}{\gamma}\bigg)$   (13.39)
as long as $\ell_s, \ell_t \le 2\ell$ and $4\mu^2\ell^2 \le \log(1/\gamma)$, where $\ell = N/n$. Following the same argument as in the Bernoulli case, we can obtain a counterpart to Proposition 13.1.

proposition 13.2 Let $\ell, n \in \mathbb{N}$, $k \in [n]$ and $\gamma = 1/2$. Let $N = \ell n$ and $K = k\ell$. Assume that $16\mu^2\ell^2 \le 1$ and $k \ge 6e\ell$. Let $P_0$ and $P_1$ denote the desired null and alternative distributions of the submatrix detection problem $(N, K, \mu)$. If $G \sim G(n, \gamma)$, then $P_{\widetilde{G}|H_0^C} = P_0$. If $G \sim G(n, k, 1, \gamma)$, then
$d_{\mathrm{TV}}\big(P_{\widetilde{G}|H_1^C}, P_1\big) \le e^{-K/12} + 1.5k e^{-\ell/18} + k^2\exp\bigg(-\dfrac{\log^2 2}{32\mu^2\ell^2}\bigg) + 0.5\sqrt{e^{72e^2\mu^2\ell^2} - 1} + 0.5k e^{-\ell/36}.$   (13.40)
Let us close this section with two remarks. First, to investigate the computational
aspect of inference in the Gaussian model, an immediate hurdle is that the computational
complexity is not well defined for tests dealing with samples drawn from non-discrete
distributions, which cannot be represented by finitely many bits almost surely. To over-
come this difficulty, we consider a sequence of discretized Gaussian models that is
asymptotically equivalent to the original model in the sense of Le Cam [4] and hence
preserves the statistical difficulty of the problem. In other words, the continuous model
and its appropriately discretized counterpart are statistically indistinguishable and, more
importantly, the computational complexity of tests on the latter is well defined. More
precisely, for the submatrix detection model, provided that each entry of the n × n matrix
A is quantized by Θ(log n) bits, the discretized model is asymptotically equivalent to the
previous model (cf. Section 3 and Theorem 1 of [9] for a precise bound on the Le Cam
distance). With a slight modification, the above reduction scheme can be applied to the
discretized model (cf. Section 4.2 of [9]).
Second, we comment on the distinctions between the reduction scheme here and the
prior work that relies on the planted clique as the hardness assumption. Most previous
work [63, 64, 68, 81] in the theoretical computer science literature uses the reduction
from the PC problem to generate computationally hard instances of other problems and
establish worst-case hardness results; the underlying distributions of the instances could
be arbitrary. The idea of proving the hardness of a hypothesis-testing problem by means
of approximate reduction from the planted-clique problem such that the reduced instance
is close to the target hypothesis in total variation originates from the seminal work by
Berthet and Rigollet [8] and the subsequent paper by Ma and Wu [9]. The main dis-
tinction between these works and the results presented here, which are based on the
techniques in [10], is that Berthet and Rigollet [8] studied a composite-versus-composite
testing problem and Ma and Wu [9] studied a simple-versus-composite testing problem,
both in the minimax sense, as opposed to the simple-versus-simple hypothesis consid-
ered here and in [10], which constitutes a stronger hardness result. For the composite
hypothesis, a reduction scheme works as long as the distribution of the reduced instance
is close to some mixture distribution under the hypothesis. This freedom is absent in
constructing reduction for the simple hypothesis, which renders the reduction scheme as
well as the corresponding calculation of the total variation considerably more difficult.
In contrast, for the simple-versus-simple hypothesis, the underlying distributions of the
problem instances generated from the reduction must be close to the desired distributions
in total variation both under the null hypothesis and under the alternative hypothesis.

13.4.3 Bounding the Total Variation Distance


Below we prove Proposition 13.1 and obtain the desired computational limits given by
Fig. 13.1. We consider only the Bernoulli case as the derivations for the Gaussian case
are analogous. The main technical challenge is bounding the total variation distance
in (13.38).
Proof of Proposition 13.1 Let $[i, j]$ denote the unordered pair of $i$ and $j$. For any set $I \subset [N]$, let $E(I)$ denote the set of unordered pairs of distinct elements in $I$, i.e., $E(I) = \{[i, j] : i, j \in I,\ i \ne j\}$, and let $E(I)^c = E([N]) \setminus E(I)$. For $s, t \in [n]$ with $s \ne t$, let $\widetilde{G}_{V_s V_t}$ denote the bipartite graph where the set of left (right) vertices is $V_s$ ($V_t$) and the set of edges is the set of edges in $\widetilde{G}$ from vertices in $V_s$ to vertices in $V_t$. For $s \in [n]$, let $\widetilde{G}_{V_s V_s}$ denote the subgraph of $\widetilde{G}$ induced by $V_s$. Let $\widetilde{P}_{V_s V_t}$ denote the edge distribution of $\widetilde{G}_{V_s V_t}$ for $s, t \in [n]$.
It is straightforward to verify that the null distributions are exactly matched by the reduction scheme. Henceforth, we consider the alternative hypothesis, under which $G$ is drawn from the planted-clique model $G(n, k, \gamma)$. Let $C \subset [n]$ denote the planted clique. Define $S = \cup_{t \in C} V_t$ and recall that $K = k\ell$. Then $|S| \sim \mathrm{Binom}(N, K/N)$ and, conditional on $|S|$, $S$ is uniformly distributed over all possible subsets of size $|S|$ in $[N]$. By the symmetry of the vertices of $\widetilde{G}$, the distribution of $\widetilde{A}$ conditional on $C$ does not depend on $C$. Hence, without loss of generality, we shall assume that $C = [k]$ henceforth. The distribution of $\widetilde{A}$ can be written as a mixture distribution indexed by the random set $S$ as
$\widetilde{P}_1 \triangleq \mathbb{E}_S\Big[\widetilde{P}_{SS} \times \bigotimes_{[i,j] \in E(S)^c}\mathrm{Bern}(q)\Big].$
By the definition of $P_1$,
$d_{\mathrm{TV}}(\widetilde{P}_1, P_1) = d_{\mathrm{TV}}\Big(\mathbb{E}_S\Big[\widetilde{P}_{SS} \times \bigotimes_{[i,j] \in E(S)^c}\mathrm{Bern}(q)\Big],\ \mathbb{E}_S\Big[\bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\bigotimes_{[i,j] \in E(S)^c}\mathrm{Bern}(q)\Big]\Big)$
$\le \mathbb{E}_S\Big[d_{\mathrm{TV}}\Big(\widetilde{P}_{SS} \times \bigotimes_{[i,j] \in E(S)^c}\mathrm{Bern}(q),\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\bigotimes_{[i,j] \in E(S)^c}\mathrm{Bern}(q)\Big)\Big]$
$= \mathbb{E}_S\Big[d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big)\Big]$
$\le \mathbb{E}_S\Big[d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big)\mathbf{1}_{\{|S| \le 1.5K\}}\Big] + \exp(-K/12),$   (13.41)
where the first inequality follows from the convexity of $(P, Q) \mapsto d_{\mathrm{TV}}(P, Q)$, and the last inequality follows from applying the Chernoff bound to $|S|$. Fix an $S \subset [N]$ such that $|S| \le 1.5K$. Define $P_{V_t V_t} = \bigotimes_{[i,j] \in E(V_t)}\mathrm{Bern}(q)$ for $t \in [k]$ and $P_{V_s V_t} = \bigotimes_{(i,j) \in V_s \times V_t}\mathrm{Bern}(p)$ for $1 \le s < t \le k$. By the triangle inequality,
$d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big) \le d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big]\Big) + d_{\mathrm{TV}}\Big(\mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big],\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big).$   (13.42)
To bound the first term on the right-hand side of (13.42), first note that, conditioned on the set $S$, $\{V_1^k\}$ can be generated as follows. Throw balls indexed by $S$ into bins indexed by $[k]$ independently and uniformly at random, and let $V_t$ be the set of balls in the $t$th bin. Define the event $E = \{V_1^k : |V_t| \le 2\ell,\ t \in [k]\}$. Since $|V_t| \sim \mathrm{Binom}(|S|, 1/k)$ is stochastically dominated by $\mathrm{Binom}(1.5K, 1/k)$ for each fixed $1 \le t \le k$, it follows from the Chernoff bound and the union bound that $P\{E^c\} \le k\exp(-\ell/18)$. Then we have
$d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big]\Big)$
$\overset{(a)}{=} d_{\mathrm{TV}}\Big(\mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}\widetilde{P}_{V_s V_t}\,\Big|\,S\Big],\ \mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big]\Big)$
$\le \mathbb{E}_{V_1^k}\Big[d_{\mathrm{TV}}\Big(\bigotimes_{1 \le s \le t \le k}\widetilde{P}_{V_s V_t},\ \bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\Big)\,\Big|\,S\Big]$
$\le \mathbb{E}_{V_1^k}\Big[d_{\mathrm{TV}}\Big(\bigotimes_{1 \le s \le t \le k}\widetilde{P}_{V_s V_t},\ \bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\Big)\mathbf{1}_{\{V_1^k \in E\}}\,\Big|\,S\Big] + k\exp(-\ell/18),$
where (a) holds because, conditional on $V_1^k$, $\{\widetilde{A}_{V_s V_t} : s, t \in [k]\}$ are independent. Recall that $\ell_t = |V_t|$. For any fixed $V_1^k \in E$, we have
$d_{\mathrm{TV}}\Big(\bigotimes_{1 \le s \le t \le k}\widetilde{P}_{V_s V_t},\ \bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\Big) \overset{(a)}{=} d_{\mathrm{TV}}\Big(\bigotimes_{1 \le s < t \le k}\widetilde{P}_{V_s V_t},\ \bigotimes_{1 \le s < t \le k}P_{V_s V_t}\Big)$
$\overset{(b)}{=} d_{\mathrm{TV}}\Big(\bigotimes_{1 \le s < t \le k}P'_{\ell_s\ell_t},\ \bigotimes_{1 \le s < t \le k}P_{\ell_s\ell_t}\Big) \le \sum_{1 \le s < t \le k} d_{\mathrm{TV}}\big(P'_{\ell_s\ell_t},\ P_{\ell_s\ell_t}\big) \overset{(c)}{\le} 2k^2(8q\ell^2)^{m_0+1},$
where (a) follows since $\widetilde{P}_{V_t V_t} = P_{V_t V_t}$ for all $t \in [k]$; (b) is because the number of edges $\widetilde{E}(V_s, V_t)$ is a sufficient statistic for testing $\widetilde{P}_{V_s V_t}$ versus $P_{V_s V_t}$ on the submatrix $\widetilde{A}_{V_s V_t}$ of the adjacency matrix; and (c) follows from the total variation bound (13.36). Therefore,
$d_{\mathrm{TV}}\Big(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big]\Big) \le 2k^2(8q\ell^2)^{m_0+1} + k\exp(-\ell/18).$   (13.43)
To bound the second term on the right-hand side of (13.42), applying Lemma 9 of [10], which is a conditional version of the second-moment method, yields
$d_{\mathrm{TV}}\Big(\mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big],\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big) \le \dfrac{1}{2}P\{E^c\} + \dfrac{1}{2}\sqrt{\mathbb{E}_{V_1^k, \widetilde{V}_1^k}\Big[g(V_1^k, \widetilde{V}_1^k)\mathbf{1}_{\{V_1^k \in E\}}\mathbf{1}_{\{\widetilde{V}_1^k \in E\}}\,\Big|\,S\Big] - 1 + 2P\{E^c\}},$   (13.44)
where
$g(V_1^k, \widetilde{V}_1^k) = \dfrac{\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\cdot\bigotimes_{1 \le s \le t \le k}P_{\widetilde{V}_s\widetilde{V}_t}}{\bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)} = \prod_{s,t=1}^{k}\bigg(\dfrac{q^2}{p} + \dfrac{(1-q)^2}{1-p}\bigg)^{\binom{|V_s \cap \widetilde{V}_t|}{2}} = \prod_{s,t=1}^{k}\bigg(\dfrac{1 - \frac{3}{2}q}{1 - 2q}\bigg)^{\binom{|V_s \cap \widetilde{V}_t|}{2}}.$   (13.45)
Let $X \sim \mathrm{Binom}(1.5K, 1/k^2)$ and $Y \sim \mathrm{Binom}(3\ell, e/k)$. It follows that
$\mathbb{E}_{V_1^k, \widetilde{V}_1^k}\bigg[\prod_{s,t=1}^{k}\bigg(\dfrac{1 - \frac{3}{2}q}{1 - 2q}\bigg)^{\binom{|V_s \cap \widetilde{V}_t|}{2}}\prod_{s,t=1}^{k}\mathbf{1}_{\{|V_s| \le 2\ell,\,|\widetilde{V}_t| \le 2\ell\}}\,\bigg|\,S\bigg]$
$\overset{(a)}{\le} \mathbb{E}_{V_1^k, \widetilde{V}_1^k}\bigg[\prod_{s,t=1}^{k}e^{q\binom{|V_s \cap \widetilde{V}_t| \wedge 2\ell}{2}}\,\bigg|\,S\bigg] \overset{(b)}{\le} \prod_{s,t=1}^{k}\mathbb{E}\Big[e^{q\binom{|V_s \cap \widetilde{V}_t| \wedge 2\ell}{2}}\,\Big|\,S\Big]$
$\overset{(c)}{\le} \Big(\mathbb{E}\Big[e^{q\binom{X \wedge 2\ell}{2}}\Big]\Big)^{k^2} \overset{(d)}{\le} \Big(\mathbb{E}\Big[e^{q\binom{Y}{2}}\Big]\Big)^{k^2} \overset{(e)}{\le} \exp(72e^2 q\ell^2),$   (13.46)
where (a) follows from $1 + x \le e^x$ for all $x \ge 0$ and $q < 1/4$; (b) follows from the negative association property of $\{|V_s \cap \widetilde{V}_t| : s, t \in [k]\}$ proved in Lemma 10 of [10] in view of the monotonicity of $x \mapsto e^{q\binom{x \wedge 2\ell}{2}}$ on $\mathbb{R}_+$; (c) follows because $|V_s \cap \widetilde{V}_t|$ is stochastically dominated by $\mathrm{Binom}(1.5K, 1/k^2)$ for all $(s, t) \in [k]^2$; (d) follows from Lemma 11 of [10]; (e) follows from Lemma 12 of [10] with $\lambda = q/2$ and $q \le 1/8$. Therefore, by (13.44),
$d_{\mathrm{TV}}\Big(\mathbb{E}_{V_1^k}\Big[\bigotimes_{1 \le s \le t \le k}P_{V_s V_t}\,\Big|\,S\Big],\ \bigotimes_{[i,j] \in E(S)}\mathrm{Bern}(p)\Big) \le 0.5k e^{-\ell/18} + 0.5\sqrt{e^{72e^2 q\ell^2} - 1 + 2k e^{-\ell/18}}$
$\le 0.5k e^{-\ell/18} + 0.5\sqrt{e^{72e^2 q\ell^2} - 1} + 0.5k e^{-\ell/36}.$   (13.47)
Proposition 13.1 follows by combining (13.41), (13.42), (13.44), and (13.47).

The following theorem establishes the computational hardness of the PDS problem in the interior of the hard region in Fig. 13.1.

theorem 13.3 Assume the PC hypothesis holds for all $0 < \gamma \le 1/2$. Let $\alpha > 0$ and $0 < \beta < 1$ be such that
$\max\{0, 2\alpha - 1\} < \beta < \dfrac{\alpha}{2}.$   (13.48)
Then there exists a sequence $\{(N_\ell, K_\ell, q_\ell)\}_{\ell \in \mathbb{N}}$ satisfying
$\lim_{\ell \to \infty}\dfrac{\log(1/q_\ell)}{\log N_\ell} = 2\beta, \qquad \lim_{\ell \to \infty}\dfrac{\log K_\ell}{\log N_\ell} = \alpha$
such that, for any sequence of randomized polynomial-time tests $\phi_\ell : \{0, 1\}^{\binom{N_\ell}{2}} \to \{0, 1\}$ for the $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ problem, the type-I-plus-type-II error probability is lower-bounded by
$\liminf_{\ell \to \infty}\ P_0\{\phi_\ell(G_\ell) = 1\} + P_1\{\phi_\ell(G_\ell) = 0\} \ge 1,$
where $G_\ell \sim G(N_\ell, q_\ell)$ under $H_0$ and $G_\ell \sim G(N_\ell, K_\ell, 2q_\ell, q_\ell)$ under $H_1$.


Proof Let $m_0 = \log_2(1/\gamma)$. By (13.48), there exist $0 < \gamma \le 1/2$ and thus $m_0$ such that
$2\beta < \alpha < \dfrac{1}{2} + \dfrac{m_0\beta + 2}{2m_0\beta + 1}\beta - \dfrac{1}{m_0\beta}.$   (13.49)
Fix $\beta > 0$ and $0 < \alpha < 1$ that satisfy (13.49). Let $\delta = 1/(m_0\beta)$. Then it is straightforward to verify that $\frac{2 + m_0\delta}{2 + \delta}\beta \ge \frac{1}{2} - \delta + \frac{1 + 2\delta}{2 + \delta}\beta$. It follows from the assumption (13.49) that
$2\beta < \alpha < \min\bigg\{\dfrac{2 + m_0\delta}{2 + \delta}\beta,\ \dfrac{1}{2} - \delta + \dfrac{1 + 2\delta}{2 + \delta}\beta\bigg\}.$   (13.50)
Let $\ell \in \mathbb{N}$ and $q_\ell = \ell^{-(2+\delta)}$. Define
$n_\ell = \big\lfloor \ell^{\frac{2+\delta}{2\beta} - 1}\big\rfloor, \qquad k_\ell = \big\lfloor \ell^{\frac{(2+\delta)\alpha}{2\beta} - 1}\big\rfloor, \qquad N_\ell = \ell n_\ell, \qquad K_\ell = k_\ell\ell.$   (13.51)
Then
$\lim_{\ell \to \infty}\dfrac{\log(1/q_\ell)}{\log N_\ell} = \dfrac{2+\delta}{(2+\delta)/(2\beta) - 1 + 1} = 2\beta, \qquad \lim_{\ell \to \infty}\dfrac{\log K_\ell}{\log N_\ell} = \dfrac{(2+\delta)\alpha/(2\beta) - 1 + 1}{(2+\delta)/(2\beta) - 1 + 1} = \alpha.$   (13.52)
Suppose that, for the sake of contradiction, there exist a small $\epsilon > 0$ and a sequence of randomized polynomial-time tests $\{\phi_\ell\}$ for $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ such that
$P_0\{\phi_{N_\ell, K_\ell}(G_\ell) = 1\} + P_1\{\phi_{N_\ell, K_\ell}(G_\ell) = 0\} \le 1 - \epsilon$
holds for arbitrarily large $\ell$, where $G_\ell$ is the graph in the $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ problem. Since $\alpha > 2\beta$, we have $k_\ell \ge \ell^{1+\delta}$. Therefore, $16q_\ell\ell^2 \le 1$ and $k_\ell \ge 6e\ell$ for all sufficiently large $\ell$. Applying Proposition 13.1, we conclude that $G \mapsto \phi_\ell(\widetilde{G})$ is a randomized polynomial-time test for $\mathrm{PC}(n_\ell, k_\ell, \gamma)$ whose type-I-plus-type-II error probability satisfies
$P_{H_0^C}\{\phi_\ell(\widetilde{G}) = 1\} + P_{H_1^C}\{\phi_\ell(\widetilde{G}) = 0\} \le 1 - \epsilon + \xi_\ell,$   (13.53)
where $\xi_\ell$ is given by the right-hand side of (13.38). By the definition of $q_\ell$, we have $q_\ell\ell^2 = \ell^{-\delta}$ and thus
$k_\ell^2(q_\ell\ell^2)^{m_0+1} \le \ell^{(2+\delta)\alpha/\beta - 2 - (m_0+1)\delta} \le \ell^{-\delta},$
where the last inequality follows from (13.50). Therefore $\xi_\ell \to 0$ as $\ell \to \infty$. Moreover, by the definition in (13.51),
$\lim_{\ell \to \infty}\dfrac{\log k_\ell}{\log n_\ell} = \dfrac{(2+\delta)\alpha/(2\beta) - 1}{(2+\delta)/(2\beta) - 1} \le \dfrac{1}{2} - \delta,$
where the above inequality follows from (13.50). Therefore, (13.53) contradicts the assumption that the PC hypothesis holds for $\gamma$.
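To make the parameter choice (13.51) concrete, the short numerical check below (an illustration we add here; it is not part of the original argument) evaluates $N_\ell, K_\ell, q_\ell$ for one pair $(\alpha, \beta)$ in the hard regime and confirms that the exponents in (13.52) approach $2\beta$ and $\alpha$ as $\ell$ grows. The specific values of $\alpha$, $\beta$, and $\delta$ are arbitrary choices for the demonstration.

```python
import numpy as np

# A point in the hard regime max{0, 2*alpha - 1} < beta < alpha/2, and some delta > 0
# (in the proof, delta = 1/(m0*beta) for a suitably small gamma).
alpha, beta, delta = 0.6, 0.25, 0.02

for ell in [10**2, 10**4, 10**6]:
    q = ell ** (-(2 + delta))
    n = int(ell ** ((2 + delta) / (2 * beta) - 1))
    k = max(int(ell ** ((2 + delta) * alpha / (2 * beta) - 1)), 1)
    N, K = ell * n, ell * k
    print(ell,
          round(np.log(1 / q) / np.log(N), 4),   # approaches 2*beta = 0.5
          round(np.log(K) / np.log(N), 4))       # approaches alpha = 0.6
```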

13.5 Discussion and Open Problems

Recent years have witnessed a great deal of progress on understanding the information-
theoretical and computational limits of various statistical problems with planted struc-
tures. As outlined in this survey, various techniques to identify the information-theoretic
limits are available. In some cases, polynomial-time procedures have been shown to
achieve the information-theoretic limits. However, in many other cases, it is believed
that there exists a wide gap between the information-theoretic limits and the compu-
tational limits. For the planted-clique problem, a recent exciting line of research has
identified the performance limits of a sum-of-squares hierarchy [71–73, 82, 83]. Under
the PC hypothesis, complexity-theoretic computational lower bounds have been derived
for sparse PCA [8], submatrix location [9], single-community detection [10], and var-
ious other detection problems with planted structures [13]. Despite these encouraging
results, a variety of interesting questions remain open. Below we list a few representative
problems. Closing the observed computational gap, or, equally importantly, disprov-
ing the possibility thereof on rigorous complexity-theoretic grounds, is an exciting new
topic at the intersection of high-dimensional statistics, information theory, and computer
science.
Computational Lower Bounds for Recovering the Planted Dense Subgraph
Closely related to the PDS detection problem is the recovery problem, where, given a
graph generated from G(N, K, p, q), the task is to recover the planted dense subgraph.
Consider the asymptotic regime depicted in Fig. 13.1. It has been shown in [16, 84] that
exact recovery is information-theoretically possible if and only if β < α/2 and can be
achieved in polynomial time if $\beta < \alpha - \frac{1}{2}$. Our computational lower bounds for the PDS detection problem imply that the planted dense subgraph is hard to approximate to any constant factor if $\max(0, 2\alpha - 1) < \beta < \alpha/2$ (the hard regime in Fig. 13.1). Whether the planted dense subgraph is hard to approximate within any constant factor in the regime of $\alpha - \frac{1}{2} \le \beta \le \min\{2\alpha - 1, \alpha/2\}$ is an interesting open problem. For the Gaussian case, Cai et al. [23] showed that exact recovery is computationally hard for $\beta > \alpha - \frac{1}{2}$, by assuming a variant of the standard PC hypothesis (see p. 1425 of [23]).
Finally, we note that in order to prove our computational lower bounds for the planted
dense subgraph detection problem in Theorem 13.3, we have assumed that the PC detec-
tion problem is hard for any constant γ > 0. An important open problem is to show by
means of reduction that, if the PC detection problem is hard with γ = 0.5, then it is also
hard with γ = 0.49.
Computational Lower Bounds within the Sum-of-Squares Hierarchy
For the single-community model, Hajek et al. [56] obtained a tight characterization of
the performance limits of semidefinite programming (SDP) relaxations, corresponding
to the sum-of-squares hierarchy with degree 2. In particular, (1) if K = ω(n/log n), SDP
attains the information-theoretic threshold with sharp constants; (2) if K = Θ(n/log n),
SDP is sub-optimal by a constant factor; and (3) if K = o(n/log n) and K → ∞, SDP is


order-wise sub-optimal. An interesting future direction would be to generalize this result
to the sum-of-squares hierarchy, showing that sum-of-squares relaxations of any constant degree are sub-optimal when $K = o(n/\log n)$.
Furthermore, if $K \ge (n/\log n)(1/(8e) + o(1))$ for the Gaussian case and $K \ge (n/\log n)(\rho_{\mathrm{BP}}(p/q) + o(1))$ for the Bernoulli case, exact recovery can be attained in
nearly linear time via message passing plus clean-up [54, 55] whenever information-
theoretically possible. An interesting question is whether exact recovery beyond the
aforementioned two limits is possible in polynomial time.

Recovering Multiple Communities


Consider the stochastic block model under which n vertices are partitioned into k equal-
sized communities, and two vertices are connected by an edge with probability p if they
are from the same community and q otherwise.
First let us focus on correlated recovery in the sparse regime where p = a/n and
q = b/n for two fixed constants a > b in the assortative case. For k = 2, it has been
shown [26, 31, 39] that the information-theoretic and computational thresholds coincide
at (a − b)2 = 2(a + b). Employing statistical physics heuristics, it is further conjectured
that the information-theoretic and computational thresholds continue to coincide for
k = 3, 4, but depart from each other for k ≥ 5; however, a rigorous proof remains to
be derived.
Next let us turn to exact recovery in the relatively sparse regime where p = a log n/n
and $q = b\log n/n$ for two fixed constants $a > b$. For $k = \Theta(1)$, it has been shown that the SDP relaxations achieve the information-theoretic limit $\sqrt{a} - \sqrt{b} > \sqrt{k}$. Furthermore,
it has been shown that SDP continues to be optimal for k = o(log n), but ceases to be
optimal for k = Θ(log n). It is conjectured in [16] that no polynomial-time procedure can
be optimal for k = Θ(log n).
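To fix ideas, the following short sketch (ours, not from the chapter) samples a symmetric SBM in the sparse regime $p = a/n$, $q = b/n$ used above; for $k = 2$, comparing $(a - b)^2$ with $2(a + b)$ indicates on which side of the correlated-recovery threshold quoted above the parameters fall.

```python
import numpy as np

def sample_sbm(n, k, a, b, rng):
    """Symmetric SBM in the sparse regime: edge prob a/n within, b/n across communities."""
    sigma = rng.integers(k, size=n)                      # community labels
    same = sigma[:, None] == sigma[None, :]
    prob = np.where(same, a / n, b / n)
    upper = np.triu(rng.random((n, n)) < prob, 1)        # independent edges for i < j
    return (upper | upper.T).astype(int), sigma

rng = np.random.default_rng(1)
A, sigma = sample_sbm(n=2000, k=2, a=5.0, b=1.0, rng=rng)
# For k = 2 the correlated-recovery threshold quoted above is (a - b)^2 = 2(a + b):
print((5.0 - 1.0) ** 2 / (2 * (5.0 + 1.0)))   # 1.33... > 1, so correlated recovery is possible
```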

Estimating Graphons
The graphon is a powerful model for studying large networks [85]. Concretely, given $n$ vertices, the edges are generated independently, connecting each pair of distinct vertices $i$ and $j$ with probability $M_{ij} = f(x_i, x_j)$, where $x_i \in [0, 1]$ is the latent feature of vertex $i$ that captures its various characteristics, and $f : [0, 1] \times [0, 1] \to [0, 1]$ is a symmetric function called a graphon. The problem of interest is to estimate either the edge probability matrix $M$ or the graphon $f$ on the basis of the observed graph; a short sampling sketch is given at the end of this discussion.
• When $f$ is a step function, which corresponds to the stochastic block model with $k$ blocks for some $k$, the minimax optimal estimation error rate is shown to be on the order of $k^2/n^2 + \log k/n$ [86], while the currently best error rate achievable in polynomial time is $k/n$ [87].
• When $f$ belongs to a Hölder or Sobolev space with smoothness index $\alpha$, the minimax optimal rate is shown to be $n^{-2\alpha/(\alpha+1)}$ for $\alpha < 1$ and $\log n/n$ for $\alpha > 1$ [86], while the best error rate achievable in polynomial time that is known in the literature is $n^{-2\alpha/(2\alpha+1)}$ [88].
For both cases, it remains to be determined whether the minimax optimal rate can be
achieved in polynomial time.
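The sampling model just described is easy to make concrete. The sketch below (ours; the particular smooth graphon is chosen only for illustration) draws latent features uniformly, forms $M_{ij} = f(x_i, x_j)$, and flips independent coins; replacing $f$ by a step function recovers the SBM case of the first bullet.

```python
import numpy as np

def sample_graphon(n, f, rng):
    """Draw x_i ~ Unif[0,1] and a graph with independent edges, P(edge ij) = f(x_i, x_j)."""
    x = rng.random(n)
    M = f(x[:, None], x[None, :])                 # edge-probability matrix M_ij = f(x_i, x_j)
    upper = np.triu(rng.random((n, n)) < M, 1)    # edges only for i < j, empty diagonal
    A = (upper | upper.T).astype(int)
    return A, M, x

# Example: an arbitrary smooth (Hoelder) graphon taking values in [0, 1].
f = lambda u, v: 0.3 + 0.4 * np.exp(-np.abs(u - v))
A, M, x = sample_graphon(500, f, np.random.default_rng(0))
```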
Sparse PCA
Consider the following spiked Wigner model, where the underlying signal is a rank-one matrix:
$X = \dfrac{\lambda}{\sqrt{n}}vv^{\top} + W.$   (13.54)
Here, $v \in \mathbb{R}^n$, $\lambda > 0$ and $W \in \mathbb{R}^{n \times n}$ is a Wigner random matrix with $W_{ii} \overset{\mathrm{i.i.d.}}{\sim} N(0, 2)$ and $W_{ij} = W_{ji} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$ for $i < j$. We assume that for some $\gamma \in [0, 1]$ the support of $v$ is drawn uniformly from all $\binom{n}{\gamma n}$ subsets $S \subset [n]$ with $|S| = \gamma n$. Once the support has been chosen, each non-zero component $v_i$ is drawn independently and uniformly from $\{\pm\gamma^{-1/2}\}$, so that $\|v\|_2^2 = n$. When $\gamma$ is small, the data matrix $X$ is a sparse, rank-one matrix contaminated by Gaussian noise. For detection, we also consider a null model of $\lambda = 0$ where $X = W$.
One natural approach for this problem is PCA: that is, diagonalize $X$ and use its leading eigenvector $\hat{v}$ as an estimate of $v$. Using the theory of random matrices with rank-one perturbations [27–29], both detection and correlated recovery of $v$ are possible if and only if $\lambda > 1$. Intuitively, PCA exploits only the low-rank structure of the underlying signal, and not the sparsity of $v$; it is natural to ask whether one can succeed in detection or reconstruction for some $\lambda < 1$ by taking advantage of this additional structure.
Through analysis of an approximate message-passing algorithm and the free energy, it has been conjectured [46, 89] that there exists a critical sparsity threshold $\gamma^* \in (0, 1)$ such that, if $\gamma \ge \gamma^*$, then both the information-theoretic threshold and the computational threshold are given by $\lambda = 1$; if $\gamma < \gamma^*$, then the computational threshold is given by $\lambda = 1$, but the information-theoretic threshold for $\lambda$ is strictly smaller. A recent series of papers has identified the sharp information-theoretic threshold for correlated recovery through the Guerra interpolation technique and the cavity method [46, 49, 50, 90]. Also, the sharp information-theoretic threshold for detection has recently been determined in [25]. However, there is no rigorous evidence justifying the claim that $\lambda = 1$ is the computational threshold.
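For concreteness, here is a minimal sketch (ours, not from the chapter) that samples from the spiked Wigner model (13.54) and performs PCA-based detection by comparing the top eigenvalue of $X/\sqrt{n}$ with the bulk edge $2$; above the threshold $\lambda > 1$ the top eigenvalue separates from the bulk (at roughly $\lambda + 1/\lambda$).

```python
import numpy as np

def spiked_wigner(n, lam, gamma, rng):
    """Sample X = (lam/sqrt(n)) v v^T + W from (13.54) with a sparse +-gamma^{-1/2} spike."""
    v = np.zeros(n)
    support = rng.choice(n, size=max(int(gamma * n), 1), replace=False)
    v[support] = rng.choice([-1.0, 1.0], size=len(support)) / np.sqrt(gamma)  # ||v||^2 ~ n
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2)          # symmetric: off-diagonal N(0,1), diagonal N(0,2)
    return lam / np.sqrt(n) * np.outer(v, v) + W, v

rng = np.random.default_rng(0)
X, v = spiked_wigner(n=1500, lam=1.5, gamma=0.05, rng=rng)
top = np.linalg.eigvalsh(X / np.sqrt(1500))[-1]
print(top > 2.05)   # PCA detects the spike; expected top eigenvalue lam + 1/lam ~ 2.17
```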

Tensor PCA
We can also consider a planted tensor model, in which we observe an order-$k$ tensor
$X = \lambda v^{\otimes k} + W,$   (13.55)
where $v$ is uniformly distributed over the unit sphere in $\mathbb{R}^n$ and $W \in (\mathbb{R}^n)^{\otimes k}$ is a totally symmetric noise tensor with Gaussian entries $N(0, 1/n)$ (see Section 3.1 of [91] for a precise definition). This model is known as the $p$-spin model in statistical physics, and is widely used in machine learning and data analysis to model high-order correlations in a dataset. A natural approach is tensor PCA, which coincides with the MLE: $\max_{\|u\|_2 = 1}\langle X, u^{\otimes k}\rangle$. When $k = 2$, this reduces to standard PCA, which can be efficiently computed by singular value decomposition; however, as soon as $k \ge 3$, tensor PCA becomes NP-hard in the worst case [92].
Previous work [24, 91, 93] has shown that tensor PCA achieves consistent estimation of $v$ if $\lambda \gtrsim \sqrt{k\log k}$, while this is information-theoretically impossible if $\lambda \lesssim \sqrt{k\log k}$. The exact location of the information-theoretic threshold for any $k$ was determined recently in [94], but all known polynomial-time algorithms fail far from this threshold. A "tensor unfolding" algorithm is shown in [93] to succeed if $\lambda \gtrsim n^{(\lceil k/2\rceil - 1)/2}$. In the special case $k = 3$, it is further shown in [95] that a degree-4 sum-of-squares relaxation succeeds if $\lambda = \omega\big((n\log n)^{1/4}\big)$ and fails if $\lambda = O\big((n/\log n)^{1/4}\big)$. More recent work [96] shows that a spectral method achieves consistent estimation provided that $\lambda = \Omega(n^{1/4})$, improving the positive result in [95] by a poly-logarithmic factor. It remains to be determined whether any polynomial-time algorithm succeeds in the regime of $1 \ll \lambda \ll n^{1/4}$. Under a hypergraph version of the planted-clique detection hypothesis, it is shown in [96] that no polynomial-time algorithm can succeed when $\lambda \le n^{1/4 - \epsilon}$ for an arbitrarily small constant $\epsilon > 0$. It remains to be determined whether the usual planted-clique problem can be reduced to the hypergraph version.
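As an illustration of the unfolding idea for $k = 3$, the sketch below (ours; a toy version in which, for simplicity, the noise tensor is not symmetrized) reshapes the $n \times n \times n$ observation into an $n \times n^2$ matrix and returns its top left-singular vector.

```python
import numpy as np

def tensor_pca_unfolding(X):
    """Spectral estimate of v from an order-3 tensor X ~ lam * (v x v x v) + noise."""
    n = X.shape[0]
    M = X.reshape(n, n * n)                           # mode-1 unfolding
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0]

# Toy run with lam a constant multiple of n^{1/2}, where unfolding is proven to work.
rng = np.random.default_rng(0)
n = 60
lam = 3.0 * np.sqrt(n)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X = lam * np.einsum('i,j,k->ijk', v, v, v) + rng.standard_normal((n, n, n)) / np.sqrt(n)
print(abs(np.dot(tensor_pca_unfolding(X), v)))        # correlation close to 1
```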

Gaussian Mixture Clustering
Consider the following model of clustering in high dimensions. Let $v_1, \ldots, v_k$ be i.i.d. as $N\big(0, \frac{k}{k-1}I_n\big)$, and define $\bar{v} = (1/k)\sum_s v_s$ to be their mean. The scaling of the expected norm of each $v_s$ with $k$ ensures that $\mathbb{E}\|v_s - \bar{v}\|_2^2 = n$ for all $1 \le s \le k$. For a fixed parameter $\alpha > 0$, we then generate $m = \alpha n$ points $x_i \in \mathbb{R}^n$ which are partitioned into $k$ clusters of equal size by a balanced partition $\sigma : [m] \to [k]$, again chosen uniformly at random from all such partitions. For each data point $i$, let $\sigma_i \in [k]$ denote its cluster index, and generate $x_i$ independently according to a Gaussian distribution with mean $\sqrt{\rho/n}\,(v_{\sigma_i} - \bar{v})$ and identity covariance matrix, where $\rho > 0$ is a fixed parameter characterizing the separation between clusters. Equivalently, this model can be described in the following matrix form:
$X = \sqrt{\dfrac{\rho}{n}}\Big(S - \dfrac{1}{k}J_{m,k}\Big)V^{\top} + W,$   (13.56)
where $X = [x_1, \ldots, x_m]^{\top}$, $V = [v_1, \ldots, v_k]$, $S$ is an $m \times k$ matrix with $S_{i,t} = \mathbf{1}_{\{\sigma_i = t\}}$, $J_{m,k}$ is the $m \times k$ all-one matrix, and $W_{i,j} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$. In the null model, there is no cluster structure and $X = W$. The subtraction of $J_{m,k}/k$ centers the signal matrix so that $\mathbb{E}X = 0$ in both models. It follows from the celebrated BBP phase transition [27, 97] that detection and correlated recovery using spectral methods are possible if and only if $\rho\sqrt{\alpha} > k - 1$. In contrast, detection and correlated recovery are shown to be information-theoretically possible if $\rho > 2\sqrt{k\log k/\alpha} + 2\log k$. The sharp characterization of the information-theoretic limit still remains to be done, and it has been conjectured [98] that the computational threshold coincides with the spectral detection threshold.
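A minimal generator for the model (13.56) (ours; it assumes $m$ is divisible by $k$ purely to keep the sketch short) is as follows.

```python
import numpy as np

def gaussian_mixture(n, k, alpha, rho, rng):
    """Sample X per (13.56): m = alpha*n points in R^n from k balanced Gaussian clusters.
    Assumes m is divisible by k (a simplification for this sketch)."""
    m = int(alpha * n)
    V = np.sqrt(k / (k - 1)) * rng.standard_normal((n, k))      # cluster centers v_1..v_k
    sigma = rng.permutation(np.repeat(np.arange(k), m // k))    # balanced cluster labels
    S = np.eye(k)[sigma]                                        # m x k one-hot membership
    X = np.sqrt(rho / n) * (S - 1.0 / k) @ V.T + rng.standard_normal((m, n))
    return X, sigma
```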

A13.1 Mutual Information Characterization of Correlated Recovery

We consider a general setup. Let the number of communities $k$ be a constant. Denote the membership vector by $\sigma = (\sigma_1, \ldots, \sigma_n) \in [k]^n$ and the observation by $A = (A_{ij} : 1 \le i < j \le n)$. Assume the following conditions.
A1 For any permutation $\pi \in S_k$, $(\sigma, A)$ and $(\pi(\sigma), A)$ are equal in law, where $\pi(\sigma) \triangleq (\pi(\sigma_1), \ldots, \pi(\sigma_n))$.
A2 For any $i \ne j \in [n]$, $I(\sigma_i, \sigma_j; A) = I(\sigma_1, \sigma_2; A)$.
A3 For any $z_1, z_2 \in [k]$, $P\{\sigma_1 = z_1, \sigma_2 = z_2\} = 1/k^2 + o(1)$ as $n \to \infty$.
These assumptions are satisfied, for example, for the $k$-community SBM (where each pair of vertices $i$ and $j$ is connected independently with probability $p$ if $\sigma_i = \sigma_j$ and $q$ otherwise), and the membership vector $\sigma$ can be uniformly distributed either on $[k]^n$ or on the set of equal-sized $k$-partitions of $[n]$.
Recall that correlated recovery entails the following. For any $\sigma, \hat{\sigma} \in [k]^n$, define the overlap
$o(\sigma, \hat{\sigma}) = \dfrac{1}{n}\max_{\pi \in S_k}\sum_{i \in [n]}\Big(\mathbf{1}_{\{\pi(\sigma_i) = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big).$   (13.57)
We say an estimator $\hat{\sigma} = \hat{\sigma}(A)$ achieves correlated recovery if⁷
$\mathbb{E}\big[o(\sigma, \hat{\sigma})\big] = \Omega(1),$   (13.58)
that is, the misclassification rate, up to a global permutation, outperforms random guessing. Under the above three assumptions, we have the following characterization of correlated recovery.
⁷ For the special case of $k = 2$, (13.58) is equivalent to $(1/n)\mathbb{E}[|\langle\sigma, \hat{\sigma}\rangle|] = \Omega(1)$, where $\sigma, \hat{\sigma}$ are assumed to be $\{\pm\}^n$-valued.
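A direct implementation of the overlap (13.57) (ours; brute force over the $k!$ permutations, which is fine since $k$ is a constant, with labels written as $\{0, \ldots, k-1\}$) is:

```python
import numpy as np
from itertools import permutations

def overlap(sigma, sigma_hat, k):
    """Overlap o(sigma, sigma_hat) of (13.57): best agreement over label permutations,
    centered so that random guessing gives an overlap near 0."""
    best = max(np.mean(np.asarray(perm)[sigma] == sigma_hat)
               for perm in permutations(range(k)))
    return best - 1.0 / k

# Random guessing has vanishing overlap; a noisy copy of sigma has Omega(1) overlap.
rng = np.random.default_rng(0)
sigma = rng.integers(2, size=10000)
print(round(overlap(sigma, rng.integers(2, size=10000), 2), 3))   # ~ 0
noisy = np.where(rng.random(10000) < 0.3, 1 - sigma, sigma)       # 30% label noise
print(round(overlap(sigma, noisy, 2), 3))                         # ~ 0.2
```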
lemma A13.1 Correlated recovery is possible if and only if $I(\sigma_1, \sigma_2; A) = \Omega(1)$.
Proof We start by recalling the relation between the mutual information and the total variation. For any pair of random variables $(X, Y)$, define the so-called $T$-information [99]: $T(X; Y) \triangleq d_{\mathrm{TV}}(P_{XY}, P_X P_Y) = \mathbb{E}[d_{\mathrm{TV}}(P_{Y|X}, P_Y)]$. For $X \sim \mathrm{Bern}(p)$, this simply reduces to
$T(X; Y) = 2p(1-p)\,d_{\mathrm{TV}}(P_{Y|X=0}, P_{Y|X=1}).$   (13.59)
Furthermore, the mutual information can be bounded by the $T$-information, by Pinsker's and Fano's inequalities, as follows (from Equation (84) and Proposition 12 of [100]):
$2T(X; Y)^2 \le I(X; Y) \le \log(M - 1)\,T(X; Y) + h(T(X; Y)),$   (13.60)
where in the upper bound $M$ is the number of possible values of $X$, and $h$ is the binary entropy function in (13.34).
We prove the "if" part. Suppose $I(\sigma_1, \sigma_2; A) = \Omega(1)$. We first claim that assumption A1 implies that
$I(\mathbf{1}_{\{\sigma_1 = \sigma_2\}}; A) = I(\sigma_1, \sigma_2; A),$   (13.61)
that is, $A$ is independent of $(\sigma_1, \sigma_2)$ conditional on $\mathbf{1}_{\{\sigma_1 = \sigma_2\}}$. Indeed, for any $z \ne z' \in [k]$, let $\pi$ be any permutation such that $\pi(z') = z$. Since $P_{\sigma, A} = P_{\pi(\sigma), A}$, we have $P_{A|\sigma_1 = z', \sigma_2 = z'} = P_{A|\pi(\sigma_1) = z, \pi(\sigma_2) = z}$, i.e., $P_{A|\sigma_1 = z, \sigma_2 = z} = P_{A|\sigma_1 = z', \sigma_2 = z'}$. Similarly, one can show that $P_{A|\sigma_1 = z_1, \sigma_2 = z_2} = P_{A|\sigma_1 = z_1', \sigma_2 = z_2'}$ for any $z_1 \ne z_2$ and $z_1' \ne z_2'$, and this proves the claim.
Let $x_j = \mathbf{1}_{\{\sigma_1 = \sigma_j\}}$. By the symmetry assumption A2, $I(x_j; A) = I(x_2; A) = \Omega(1)$ for all $j \ne 1$. Since $P\{x_j = 1\} = 1/k + o(1)$ by assumption A3, applying (13.60) with $M = 2$ and in view of (13.59), we have $d_{\mathrm{TV}}(P_{A|x_j = 0}, P_{A|x_j = 1}) = \Omega(1)$. Thus, there exists an estimator $\hat{x}_j \in \{0, 1\}$ as a function of $A$, such that
$P\{\hat{x}_j = 1 \mid x_j = 1\} + P\{\hat{x}_j = 0 \mid x_j = 0\} \ge 1 + d_{\mathrm{TV}}(P_{A|x_j = 0}, P_{A|x_j = 1}) = 1 + \Omega(1).$   (13.62)
Define $\hat{\sigma}$ as follows: set $\hat{\sigma}_1 = 1$; for $j \ne 1$, set $\hat{\sigma}_j = 1$ if $\hat{x}_j = 1$ and draw $\hat{\sigma}_j$ from $\{2, \ldots, k\}$ uniformly at random if $\hat{x}_j = 0$. Next, we show that $\hat{\sigma}$ achieves correlated recovery. Indeed, fix a permutation $\pi \in S_k$ such that $\pi(\sigma_1) = 1$. It follows from the definition of the overlap that
$\mathbb{E}[o(\sigma, \hat{\sigma})] \ge \dfrac{1}{n}\sum_{j \ge 2}\Big(P\{\pi(\sigma_j) = \hat{\sigma}_j\} - \dfrac{1}{k}\Big).$   (13.63)
Furthermore, since $\pi(\sigma_1) = 1$, we have, for any $j \ne 1$,
$P\{\pi(\sigma_j) = \hat{\sigma}_j, x_j = 1\} = P\{\hat{x}_j = 1, x_j = 1\}$
and
$P\{\pi(\sigma_j) = \hat{\sigma}_j, x_j = 0\} = P\{\pi(\sigma_j) = \hat{\sigma}_j, \hat{x}_j = 0, x_j = 0\} = \dfrac{1}{k-1}P\{\hat{x}_j = 0, x_j = 0\},$
where the last step is because, conditional on $\hat{x}_j = 0$, $\hat{\sigma}_j$ is chosen from $\{2, \ldots, k\}$ uniformly and independently of everything else. Since $P\{x_j = 1\} = 1/k + o(1)$, we have
$P\{\pi(\sigma_j) = \hat{\sigma}_j\} = \dfrac{1}{k}\big(P\{\hat{x}_j = 1 \mid x_j = 1\} + P\{\hat{x}_j = 0 \mid x_j = 0\}\big) + o(1) \overset{(13.62)}{\ge} \dfrac{1}{k} + \Omega(1).$
By (13.63), we conclude that $\hat{\sigma}$ achieves correlated recovery of $\sigma$.
Next we prove the "only if" part. Suppose $I(\sigma_1, \sigma_2; A) = o(1)$; we aim to show that $\mathbb{E}[o(\sigma, \hat{\sigma})] = o(1)$ for any estimator $\hat{\sigma}$. By the definition of the overlap, we have
$o(\sigma, \hat{\sigma}) \le \dfrac{1}{n}\sum_{\pi \in S_k}\bigg|\sum_{i \in [n]}\Big(\mathbf{1}_{\{\pi(\sigma_i) = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big)\bigg|.$
Since there are $k! = O(1)$ permutations in $S_k$, it suffices to show that, for any fixed permutation $\pi$,
$\mathbb{E}\bigg[\bigg|\sum_{i \in [n]}\Big(\mathbf{1}_{\{\pi(\sigma_i) = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big)\bigg|\bigg] = o(n).$
Since $I(\pi(\sigma_i), \pi(\sigma_j); A) = I(\sigma_i, \sigma_j; A)$, without loss of generality we assume $\pi = \mathrm{id}$ in the following. By the Cauchy–Schwarz inequality, it further suffices to show
$\mathbb{E}\bigg[\bigg(\sum_{i \in [n]}\Big(\mathbf{1}_{\{\sigma_i = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big)\bigg)^2\bigg] = o(n^2).$   (13.64)
Note that
$\mathbb{E}\bigg[\bigg(\sum_{i \in [n]}\Big(\mathbf{1}_{\{\sigma_i = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big)\bigg)^2\bigg] = \sum_{i,j \in [n]}\mathbb{E}\bigg[\Big(\mathbf{1}_{\{\sigma_i = \hat{\sigma}_i\}} - \dfrac{1}{k}\Big)\Big(\mathbf{1}_{\{\sigma_j = \hat{\sigma}_j\}} - \dfrac{1}{k}\Big)\bigg]$
$= \sum_{i,j \in [n]}P\{\sigma_i = \hat{\sigma}_i, \sigma_j = \hat{\sigma}_j\} - \dfrac{2n}{k}\sum_{i \in [n]}P\{\sigma_i = \hat{\sigma}_i\} + \dfrac{n^2}{k^2}.$
For the first term in the last displayed equation, let $\widetilde{\sigma}$ be identically distributed as $\hat{\sigma}$ but independent of $\sigma$. Since $I(\sigma_i, \sigma_j; \hat{\sigma}_i, \hat{\sigma}_j) \le I(\sigma_i, \sigma_j; A) = o(1)$ by the data-processing inequality, it follows from the lower bound in (13.60) that $d_{\mathrm{TV}}(P_{\sigma_i, \sigma_j, \hat{\sigma}_i, \hat{\sigma}_j}, P_{\sigma_i, \sigma_j, \widetilde{\sigma}_i, \widetilde{\sigma}_j}) = o(1)$. Since $P\{\sigma_i = \widetilde{\sigma}_i, \sigma_j = \widetilde{\sigma}_j\} \le \max_{a,b \in [k]}P\{\sigma_i = a, \sigma_j = b\} \le 1/k^2 + o(1)$ by assumption A3, we have
$P\{\sigma_i = \hat{\sigma}_i, \sigma_j = \hat{\sigma}_j\} \le \dfrac{1}{k^2} + o(1).$
Similarly, for the second term, we have
$P\{\sigma_i = \hat{\sigma}_i\} = \dfrac{1}{k} + o(1),$
where the last equality holds due to $I(\sigma_i; A) = o(1)$. Combining the last three displayed equations gives (13.64) and completes the proof.

A13.2 Proof of (13.7) ⇒ (13.6) and Verification of (13.7) in the Binary Symmetric SBM

Combining (13.61) with (13.60) and (13.59), we have $I(\sigma_1, \sigma_2; A) = o(1)$ if and only if $d_{\mathrm{TV}}(P_+, P_-) = o(1)$, where $P_+ = P_{A|\sigma_1 = \sigma_2}$ and $P_- = P_{A|\sigma_1 \ne \sigma_2}$. Note the following characterization of the total variation distance, which simply follows from the Cauchy–Schwarz inequality:
$d_{\mathrm{TV}}(P_+, P_-) = \dfrac{1}{2}\inf_{Q}\sqrt{\int\dfrac{(P_+ - P_-)^2}{Q}},$   (13.65)
where the infimum is taken over all probability distributions $Q$. Therefore (13.7) implies (13.6).
Finally, we consider the binary symmetric SBM and show that, below the correlated recovery threshold $\tau = (a - b)^2/(2(a + b)) < 1$, (13.7) is satisfied if the reference distribution $Q$ is the distribution of $A$ in the null (Erdős–Rényi) model. Note that
$\int\dfrac{(P_+ - P_-)^2}{Q} = \int\dfrac{P_+^2}{Q} + \int\dfrac{P_-^2}{Q} - 2\int\dfrac{P_+ P_-}{Q}.$
Hence, it is sufficient to show
$\int\dfrac{P_z P_{z'}}{Q} = C + o(1), \qquad \forall z, z' \in \{\pm\}$
for some constant $C$ that is independent of $z$ and $z'$. Specifically, following the derivations in (13.4), we have
$\int\dfrac{P_z P_{z'}}{Q} = \mathbb{E}\bigg[\prod_{i < j}\big(1 + \sigma_i\sigma_j\widetilde{\sigma}_i\widetilde{\sigma}_j\rho\big)\,\bigg|\,\sigma_1\sigma_2 = z, \widetilde{\sigma}_1\widetilde{\sigma}_2 = z'\bigg]$
$= (1 + o(1))\,e^{-\tau^2/4 - \tau/2}\times\mathbb{E}\bigg[\exp\Big(\dfrac{\rho}{2}\langle\sigma, \widetilde{\sigma}\rangle^2\Big)\,\bigg|\,\sigma_1\sigma_2 = z, \widetilde{\sigma}_1\widetilde{\sigma}_2 = z'\bigg],$   (13.66)
where the last equality holds for $\rho = \tau/n + O(1/n^2)$ and $\log(1 + x) = x - x^2/2 + O(x^3)$. Write $\sigma = 2\xi - 1$ for $\xi \in \{0, 1\}^n$ and let
$H_1 \triangleq \xi_1\widetilde{\xi}_1 + \xi_2\widetilde{\xi}_2 \qquad\text{and}\qquad H_2 \triangleq \sum_{j \ge 3}\xi_j\widetilde{\xi}_j.$
Then $\langle\sigma, \widetilde{\sigma}\rangle = 4(H_1 + H_2) - n$. Moreover, conditional on $\sigma_1, \sigma_2$ and $\widetilde{\sigma}_1, \widetilde{\sigma}_2$,
$H_2 \sim \mathrm{Hypergeometric}\big(n - 2,\ n/2 - \xi_1 - \xi_2,\ n/2 - \widetilde{\xi}_1 - \widetilde{\xi}_2\big).$
Since $|H_1| \le 2$, $\xi_1 + \xi_2 \le 2$, and $\widetilde{\xi}_1 + \widetilde{\xi}_2 \le 2$, it follows that, conditional on $\sigma_1\sigma_2 = z$, $\widetilde{\sigma}_1\widetilde{\sigma}_2 = z'$, $(1/\sqrt{n})(4H_1 + 4H_2 - n)$ converges to $N(0, 1)$ in distribution as $n \to \infty$ by the central limit theorem for the hypergeometric distribution. Therefore
$\mathbb{E}\bigg[\exp\Big(\dfrac{\rho}{2}\langle\sigma, \widetilde{\sigma}\rangle^2\Big)\,\bigg|\,\sigma_1\sigma_2 = z, \widetilde{\sigma}_1\widetilde{\sigma}_2 = z'\bigg] = \mathbb{E}\bigg[\exp\bigg(\dfrac{n\rho}{2}\Big(\dfrac{4H_1 + 4H_2 - n}{\sqrt{n}}\Big)^2\bigg)\,\bigg|\,\sigma_1\sigma_2 = z, \widetilde{\sigma}_1\widetilde{\sigma}_2 = z'\bigg] = \dfrac{1 + o(1)}{\sqrt{1 - \tau}},$
where the last equality holds due to $n\rho = \tau + O(1/n)$, $\tau < 1$, and the convergence of the moment-generating function.

References

[1] E. L. Lehmann and G. Casella, Theory of point estimation, 2nd edn. Springer, 1998.
[2] L. D. Brown and M. G. Low, “Information inequality bounds on the minimax risk (with
an application to nonparametric regression),” Annals Statist., vol. 19, no. 1, pp. 329–337,
1991.
[3] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000.
[4] L. Le Cam, Asymptotic methods in statistical decision theory. Springer, 1986.
[5] I. A. Ibragimov and R. Z. Khas’minskı̆, Statistical estimation: Asymptotic theory.
Springer, 1981.
[6] L. Birgé, “Approximation dans les espaces métriques et théorie de l’estimation,” Z.
Wahrscheinlichkeitstheorie verwandte Gebiete, vol. 65, no. 2, pp. 181–237, 1983.
[7] Y. Yang and A. R. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[8] Q. Berthet and P. Rigollet, “Complexity theoretic lower bounds for sparse principal com-
ponent detection,” Journal of Machine Learning Research: Workshop and Conference
Proceedings, vol. 30, pp. 1046–1066, 2013.
[9] Z. Ma and Y. Wu, “Computational barriers in minimax submatrix detection,” Annals
Statist., vol. 43, no. 3, pp. 1089–1116, 2015.
[10] B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection on
random graphs,” in Proc. COLT 2015, 2015, pp. 899–928.
[11] T. Wang, Q. Berthet, and R. J. Samworth, “Statistical and computational trade-offs in
estimation of sparse principal components,” Annals Statist., vol. 44, no. 5, pp. 1896–1930,
2016.
[12] C. Gao, Z. Ma, and H. H. Zhou, “Sparse CCA: Adaptive estimation and computational
barriers,” Annals Statist., vol. 45, no. 5, pp. 2074–2101, 2017.
[13] M. Brennan, G. Bresler, and W. Huleihel, “Reducibility and computational lower bounds
for problems with planted sparse structure,” in Proc. COLT 2018, 2018, pp. 48–166.
[14] F. McSherry, “Spectral partitioning of random graphs,” in 42nd IEEE Symposium on
Foundations of Computer Science, 2001, pp. 529–537.
[15] E. Arias-Castro and N. Verzelen, “Community detection in dense random networks,”
Annals Statist., vol. 42, no. 3, pp. 940–969, 2014.
[16] Y. Chen and J. Xu, “Statistical–computational tradeoffs in planted problems and subma-
trix localization with a growing number of clusters and submatrices,” in Proc. ICML 2014,
2014, arXiv:1402.1267.
[17] A. Montanari, “Finding one community in a sparse random graph,” J. Statist. Phys., vol.
161, no. 2, pp. 273–299, arXiv:1502.05680, 2015.
[18] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,”
Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[19] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel, “Finding large average
submatrices in high dimensional data,” Annals Appl. Statist., vol. 3, no. 3, pp. 985–1012,
2009.
[20] M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh, “Minimax localization of struc-
tural information in large noisy matrices,” in Advances in Neural Information Processing
Systems, 2011.
[21] C. Butucea and Y. I. Ingster, “Detection of a sparse submatrix of a high-dimensional noisy
matrix,” Bernoulli, vol. 19, no. 5B, pp. 2652–2688, 2013.
[22] C. Butucea, Y. Ingster, and I. Suslina, “Sharp variable selection of a sparse submatrix in a
high-dimensional noisy matrix,” ESAIM: Probability and Statistics, vol. 19, pp. 115–134,
2015.
[23] T. T. Cai, T. Liang, and A. Rakhlin, “Computational and statistical boundaries for subma-
trix localization in a large noisy matrix,” Annals Statist., vol. 45, no. 4, pp. 1403–1430,
2017.
[24] A. Perry, A. S. Wein, and A. S. Bandeira, “Statistical limits of spiked tensor models,”
arXiv:1612.07728, 2016.
[25] A. E. Alaoui, F. Krzakala, and M. I. Jordan, “Finite size corrections and likelihood ratio
fluctuations in the spiked Wigner model,” arXiv:1710.02903, 2017.
[26] E. Mossel, J. Neeman, and A. Sly, “A proof of the block model threshold conjecture,”
Combinatorica, vol. 38, no. 3, pp. 665–708, 2013.
[27] J. Baik, G. Ben Arous, and S. Péché, “Phase transition of the largest eigenvalue
for nonnull complex sample covariance matrices,” Annals Probability, vol. 33, no. 5,
pp. 1643–1697, 2005.
[28] S. Péché, “The largest eigenvalue of small rank perturbations of hermitian random
matrices,” Probability Theory Related Fields, vol. 134, no. 1, pp. 127–173, 2006.
[29] F. Benaych-Georges and R. R. Nadakuditi, “The eigenvalues and eigenvectors of finite,
low rank perturbations of large random matrices,” Adv. Math., vol. 227, no. 1, pp. 494–
521, 2011.
[30] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, “Spec-
tral redemption in clustering sparse networks,” Proc. Natl. Acad. Sci. USA, vol. 110,
no. 52, pp. 20 935–20 940, 2013.
[31] L. Massoulié, “Community detection thresholds and the weak Ramanujan property,” in
Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, 2014,
pp. 694–703, arXiv:1109.3318.
[32] C. Bordenave, M. Lelarge, and L. Massoulié, “Non-backtracking spectrum of random
graphs: Community detection and non-regular Ramanujan graphs,” in 2015 IEEE 56th
Annual Symposium on Foundations of Computer Science (FOCS), 2015, pp. 1347–1357,
arXiv: 1501.06087.
[33] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu, “Information-theoretic bounds
and phase transitions in clustering, sparse PCA, and submatrix localization,” IEEE Trans.
Information Theory, vol. 64, no. 7, pp. 4872–4894, 2018.
[34] Y. I. Ingster and I. A. Suslina, Nonparametric goodness-of-fit testing under Gaussian
models. Springer, 2003.
[35] Y. Wu, “Lecture notes on information-theoretic methods for high-dimensional statistics,”
2017, www.stat.yale.edu/~yw562/teaching/598/it-stats.pdf.
[36] I. Vajda, “On metric divergences of probability measures,” Kybernetika, vol. 45, no. 6,
pp. 885–900, 2009.
[37] W. Feller, An introduction to probability theory and its applications, 3rd edn. Wiley, 1970,
vol. I.
[38] W. Kozakiewicz, “On the convergence of sequences of moment generating functions,”
Annals Math. Statist., vol. 18, no. 1, pp. 61–69, 1947.
[39] E. Mossel, J. Neeman, and A. Sly, “Reconstruction and estimation in the planted partition
model,” Probability Theory Related Fields, vol. 162, nos. 3–4, pp. 431–461, 2015.
[40] Y. Polyanskiy and Y. Wu, “Application of information-percolation method to reconstruc-
tion problems on graphs,” arXiv:1804.05436, 2018.
[41] E. Abbe and E. Boix, “An information-percolation bound for spin synchronization on
general graphs,” arXiv:1806.03227, 2018.
[42] J. Banks, C. Moore, J. Neeman, and P. Netrapalli, “Information-theoretic thresholds for
community detection in sparse networks,” in Proc. 29th Conference on Learning Theory,
COLT 2016, 2016, pp. 383–416.
[43] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error
in Gaussian channels,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1261–1282,
2005.
[44] Y. Deshpande and A. Montanari, “Information-theoretically optimal sparse PCA,” in
IEEE International Symposium on Information Theory, 2014, pp. 2197–2201.
[45] Y. Deshpande, E. Abbe, and A. Montanari, “Asymptotic mutual information for the two-
groups stochastic block model,” arXiv:1507.08685, 2015.
[46] F. Krzakala, J. Xu, and L. Zdeborová, “Mutual information in rank-one matrix esti-
mation,” in 2016 IEEE Information Theory Workshop (ITW), 2016, pp. 71–75, arXiv:
1603.08447.
[47] I. Csiszár, “Information-type measures of difference of probability distributions and indi-


rect observations,” Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 299–318,
1967.
[48] A. Perry, A. S. Wein, A. S. Bandeira, and A. Moitra, “Optimality and sub-optimality of
PCA for spiked random matrices and synchronization,” arXiv:1609.05573, 2016.
[49] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, and L. Zdeborová, “Mutual
information for symmetric rank-one matrix estimation: A proof of the replica formula,”
in Advances in Neural Information Processing Systems, 2016, pp. 424–432, arXiv:
1606.04142.
[50] M. Lelarge and L. Miolane, “Fundamental limits of symmetric low-rank matrix estima-
tion,” in Proceedings of the 2017 Conference on Learning Theory, 2017, pp. 1297–1301.
[51] B. Hajek, Y. Wu, and J. Xu, “Information limits for recovering a hidden community,”
IEEE Trans. Information Theory, vol. 63, no. 8, pp. 4729–4745, 2017.
[52] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” 2015,
http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf.
[53] A. B. Tsybakov, Introduction to nonparametric estimation. Springer, 2009.
[54] B. Hajek, Y. Wu, and J. Xu, “Recovering a hidden community beyond the Kesten–Stigum
threshold in O(|E| log∗ |V|) time,” J. Appl. Probability, vol. 55, no. 2, pp. 325–352, 2018.
[55] B. Hajek, Y. Wu, and J. Xu “Submatrix localization via message passing,” J. Machine
Learning Res., vol. 18, no. 186, pp. 1–52, 2018.
[56] B. Hajek, Y. Wu, and J. Xu “Semidefinite programs for exact recovery of a hidden
community,” in Proc. Conference on Learning Theory (COLT), 2016, pp. 1051–1095,
arXiv:1602.06410.
[57] S.Y. Yun and A. Proutiere, “Community detection via random and adaptive sampling,” in
Proc. 27th Conference on Learning Theory, 2014, pp. 138–175.
[58] E. Mossel, J. Neeman, and A. Sly, “Consistency thresholds for the planted bisection
model,” in Proc. Forty-Seventh Annual ACM on Symposium on Theory of Computing,
2015, pp. 69–75.
[59] E. Abbe, A. S. Bandeira, and G. Hall, “Exact recovery in the stochastic block model,”
IEEE Trans. Information Theory, vol. 62, no. 1, pp. 471–487, 2016.
[60] V. Jog and P.-L. Loh, “Information-theoretic bounds for exact recovery in weighted
stochastic block models using the Rényi divergence,” arXiv:1509.06418, 2015.
[61] E. Abbe and C. Sandon, “Community detection in general stochastic block models: Fun-
damental limits and efficient recovery algorithms,” in 2015 IEEE 56th Annual Symposium
on Foundations of Computer Science (FOCS), 2015, pp. 670–688, arXiv:1503.00609.
[62] N. Alon, M. Krivelevich, and B. Sudakov, “Finding a large hidden clique in a random
graph,” Random Structures and Algorithms, vol. 13, nos. 3–4, pp. 457–466, 1998.
[63] E. Hazan and R. Krauthgamer, “How hard is it to approximate the best Nash equilibrium?”
SIAM J. Computing, vol. 40, no. 1, pp. 79–91, 2011.
[64] N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie, “Testing k-
wise and almost k-wise independence,” in Proc. Thirty-Ninth Annual ACM Symposium
on Theory of Computing, 2007, pp. 496–505.
[65] P. Koiran and A. Zouzias, “On the certification of the restricted isometry property,”
arXiv:1103.4984, 2011.
[66] L. Kučera, “A generalized encryption scheme based on random graphs,” in Graph-
Theoretic Concepts in Computer Science, 1992, pp. 180–186.
[67] A. Juels and M. Peinado, “Hiding cliques for cryptographic security,” Designs, Codes and
Cryptography, vol. 20, no. 3, pp. 269–280, 2000.
[68] B. Applebaum, B. Barak, and A. Wigderson, “Public-key cryptography from differ-


ent assumptions,” in Proc. 42nd ACM Symposium on Theory of Computing, 2010,
pp. 171–180.
[69] B. Rossman, “Average-case complexity of detecting cliques,” Ph.D. dissertation, Mas-
sachusetts Institute of Technology, 2010.
[70] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao, “Statistical algorithms
and a lower bound for detecting planted cliques,” in Proc. 45th Annual ACM Symposium
on Theory of Computing, 2013, pp. 655–664.
[71] R. Meka, A. Potechin, and A. Wigderson, “Sum-of-squares lower bounds for planted
clique,” in Proc. Forty-Seventh Annual ACM Symposium on Theory of Computing, 2015,
pp. 87–96.
[72] Y. Deshpande and A. Montanari, “Improved sum-of-squares lower bounds for hidden
clique and hidden submatrix problems,” in Proc. COLT 2015, 2015, pp. 523–562.
[73] B. Barak, S. B. Hopkins, J. A. Kelner, P. Kothari, A. Moitra, and A. Potechin, “A nearly
tight sum-of-squares lower bound for the planted clique problem,” in IEEE 57th Annual
Symposium on Foundations of Computer Science, FOCS, 2016, pp. 428–437.
[74] U. Feige and R. Krauthgamer, “Finding and certifying a large hidden clique in a
semirandom graph,” Random Structures and Algorithms, vol. 16, no. 2, pp. 195–208,
2000.
[75] U. Feige and D. Ron, “Finding hidden cliques in linear time,” in Proc. DMTCS, 2010,
pp. 189–204.
[76] Y. Dekel, O. Gurel-Gurevich, and Y. Peres, “Finding hidden cliques in linear time with
high probability,” Combinatorics, Probability and Computing, vol. 23, no. 01, pp. 29–49,
2014.
[77] B. P. Ames and S. A. Vavasis, “Nuclear norm minimization for the planted clique and
biclique problems,” Mathematical Programming, vol. 129, no. 1, pp. 69–89, 2011.

[78] Y. Deshpande and A. Montanari, “Finding hidden cliques of size N/e in nearly linear
time,” Foundations Comput. Math., vol. 15, no. 4, pp. 1069–1128, 2015.
[79] M. Jerrum, “Large cliques elude the Metropolis process,” Random Structures and
Algorithms, vol. 3, no. 4, pp. 347–359, 1992.
[80] N. Verzelen and E. Arias-Castro, “Community detection in sparse random networks,”
Annals Appl. Probability, vol. 25, no. 6, pp. 3465–3510, 2015.
[81] N. Alon, S. Arora, R. Manokaran, D. Moshkovitz, and O. Weinstein, “Inapproximability of densest κ-subgraph from average case hardness,” 2011, www.nada.kth.se/~rajsekar/papers/dks.pdf.
[82] S. B. Hopkins, P. K. Kothari, and A. Potechin, “SoS and planted clique: Tight analysis
of MPW moments at all degrees and an optimal lower bound at degree four,” arXiv:
1507.05230, 2015.
[83] P. Raghavendra and T. Schramm, “Tight lower bounds for planted clique in the degree-4
SOS program,” arXiv:1507.05136, 2015.
[84] B. Ames, “Robust convex relaxation for the planted clique and densest k-subgraph
problems,” arXiv:1305.4891, 2013.
[85] L. Lovász, Large networks and graph limits. American Mathematical Society, 2012.
[86] C. Gao, Y. Lu, and H. H. Zhou, “Rate-optimal graphon estimation,” Annals Statist.,
vol. 43, no. 6, pp. 2624–2652, 2015.
[87] O. Klopp and N. Verzelen, “Optimal graphon estimation in cut distance,”
arXiv:1703.05101, 2017.
[88] J. Xu, “Rates of convergence of spectral methods for graphon estimation,” in Proc. 35th
International Conference on Machine Learning, 2018, arXiv:1709.03183.
[89] T. Lesieur, F. Krzakala, and L. Zdeborová, “Phase transitions in sparse PCA,” in IEEE
International Symposium on Information Theory, 2015, pp. 1635–1639.
[90] A. E. Alaoui and F. Krzakala, “Estimation in the spiked Wigner model: A short proof of
the replica formula,” arXiv:1801.01593, 2018.
[91] A. Montanari, D. Reichman, and O. Zeitouni, “On the limitation of spectral methods:
From the Gaussian hidden clique problem to rank one perturbations of Gaussian ten-
sors,” in Advances in Neural Information Processing Systems, 2015, pp. 217–225, arXiv:
1411.6149.
[92] C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” J. ACM, vol. 60, no. 6,
pp. 45:1–45:39, 2013.
[93] A. Montanari and E. Richard, “A statistical model for tensor PCA,” in Proc. 27th Inter-
national Conference on Neural Information Processing Systems, 2014, pp. 2897–2905.
[94] T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, and L. Zdeborová, “Statistical and
computational phase transitions in spiked tensor estimation,” arXiv:1701.08010, 2017.
[95] S. B. Hopkins, J. Shi, and D. Steurer, “Tensor principal component analysis via sum-of-
square proofs,” in COLT, 2015, pp. 956–1006.
[96] A. Zhang and D. Xia, “Tensor SVD: Statistical and computational limits,”
arXiv:1703.02724, 2017.
[97] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance
model,” Statistica Sinica, vol. 17, no. 4, pp. 1617–1642, 2007.
[98] T. Lesieur, C. D. Bacco, J. Banks, F. Krzakala, C. Moore, and L. Zdeborová, “Phase
transitions and optimal algorithms in high-dimensional Gaussian mixture clustering,”
arXiv:1610.02918, 2016.
[99] I. Csiszár, “Almost independence and secrecy capacity,” Problemy peredachi informatsii,
vol. 32, no. 1, pp. 48–57, 1996.
[100] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,”
IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
14 Distributed Statistical Inference with
Compressed Data
Wenwen Zhao and Lifeng Lai

Summary

This chapter introduces basic ideas of information-theoretic models for distributed sta-
tistical inference problems with compressed data, discusses current and future research
directions and challenges in applying these models to various statistical learning prob-
lems. In these applications, data are distributed in multiple terminals, which can
communicate with each other via limited-capacity channels. Instead of recovering data
at a centralized location first and then performing inference, this chapter describes
schemes that can perform statistical inference without recovering the underlying data.
Information-theoretic tools are borrowed to characterize the fundamental limits of the
classical statistical inference problems using compressed data directly. In this chapter,
distributed statistical learning problems are first introduced. Then, models and results of
distributed inference are discussed. Finally, new directions that generalize and improve
the basic scenarios are described.

14.1 Introduction

Nowadays, large amounts of data are collected by devices and sensors in multiple termi-
nals. In many scenarios, it is risky and expensive to store all of the data in a centralized
location, due to the data size, privacy concerns, etc. Thus, distributed inference, a class
of problems aiming to infer useful information from these distributed data without col-
lecting all of the data in a centralized location, has attracted significant attention. In these
problems, communication among different terminals is usually beneficial but the com-
munication channels between terminals are typically of limited capacity. Compared with
the relatively well-studied centralized setting, where data are stored in one terminal, the
distributed problems with a limited communication budget are more challenging.
Two main scenarios are considered in the existing works on distributed inference. In
the first scenario, named sample partitioning, each terminal has data samples related
to all random variables [1, 2], as shown in Fig. 14.1. In this figure, we use a matrix to
represent the available data. Here, different columns in the matrix denote corresponding
random variables to which the samples are related. The data matrix is partitioned in a


Figure 14.1 Sample partitioning.

Figure 14.2 Feature partitioning.

row-wise manner, and each terminal observes a subset of the samples, which relates
to all random variables (X1 , X2 , . . . , XL ). This scenario is quite common in real life. For
example, there are large quantities of voice and image data stored in personal smart
devices but, due to the sensitive nature of the data, we cannot ask all users to send
their voice messages or photos to a centralized location. Generally, in this scenario, even though each terminal has fewer data than in the centralized setting, terminals can
still apply learning methods to their local data. Certainly, communicating and combining
learning results from distributed terminals may improve the performance.
In the second scenario, named feature partitioning, the data stored in each terminal
are related to only a subset, not all, of the random variables. Fig. 14.2 illustrates the
feature partitioning scenario, in which the data matrix is partitioned in a column-wise
manner and each terminal observes the data related to a subset of random variables.
For example, terminal X1 has all observations related to random variable X1 . This
scenario is also quite common in practice. For example, different information about
each patient is typically stored in different locations as patients may go to different
departments or different hospitals for different tests. In general, this scenario is more
challenging than sample partitioning as each terminal in the feature partitioning scenario
is not able to obtain meaningful information from local data alone. Moreover, due to the
limited communication budget (which can be as small as a vanishing rate), recovering the data
first and then conducting inference is neither optimal nor necessary. Thus, we need to
design inference algorithms that can deal with compressed data, which is a much more
complicated and challenging problem than the problems discussed above.
This chapter will focus on the inference problems for the feature partitioning
scenario, which was first proposed by Berger [3] in 1979. Many existing works
on the classic distributed inference problem focus on the following three branches:
distributed hypothesis testing or distributed detection, distributed pattern classification,
and distributed estimation [4–14]. Using powerful information-theoretic tools such
as typical sequences, the covering lemma, etc., good upper and lower bounds on
the inference performance are derived for the basic model and many special cases.
Moreover, some results have been extended to more general cases. In this chapter,
we focus on the distributed hypothesis-testing problem. In Section 14.2, we study
the basic distributed hypothesis-testing model with non-interactive communica-
tion in the sense that each terminal sends only one message that is a function of
its local data to the decision-maker [4–9, 11, 15]. More details are introduced in
Section 14.2.
In Section 14.3, we consider more sophisticated models that allow interactive com-
munications. In the interactive communication cases, there can be multiple rounds of
communication among different terminals. In each round, each terminal can utilize
messages received from other terminals in previous rounds along with its own data
to determine the transmitted message. We start with a special form of interaction, i.e.,
cascaded communication among terminals [16, 17], in which terminals broadcast their
messages in a sequential order and each terminal uses all messages received so far along
with its own observations for encoding. Then we discuss the full interaction between
two terminals and analyze their performance [18–20]. More details will be introduced
in Section 14.3.
In Sections 14.2 and 14.3, the probability mass functions (PMFs) are fully specified.
In Section 14.4, we generalize the discussion to the scenario with model uncertainty. In
particular, we will discuss the identity-testing problem, in which the goal is to determine
whether given samples are generated from a certain distribution or not. By interpreting
the distributed identity-testing problem as composite hypothesis-testing problems, the
type-2 error exponent can be characterized using information-theoretic tools. We will
introduce more details in Section 14.4.

14.2 Basic Model

In this section, we review basic theoretic models for distributed hypothesis-testing prob-
lems. We first introduce the general model, then discuss basic ideas and results for two
special cases: (1) hypothesis testing against independence; and (2) zero-rate data com-
pression. In this chapter, we will present some results and the main ideas behind these
results. For detailed proofs, interested readers can refer to [4–11].

14.2.1 Model
Consider a system with L terminals: Xi , i = 1, . . . , L and a decision-maker Y. Each ter-
minal and the decision-maker observe a component of the random vector (X1 , . . . , XL , Y)
that takes values in a finite set X1 × · · · × XL × Y and admits a PMF with two possible
forms:

H0 : PX1 ···XL Y , H1 : QX1 ···XL Y . (14.1)

With a slight abuse of notation, Xi is used to denote both the terminal and the alphabet set
from which the random variable Xi takes values. (X1n , . . . , XLn , Y n ) are independently and
identically (i.i.d.) generated according to one of the above joint PMFs. In other words,
(X1n , . . . , XLn , Y n ) is generated by either PnX1 ···XL Y or QnX1 ···XL Y . In a typical hypothesis-
testing problem, one determines which hypothesis is true under the assumption that
(X1n , . . . , XLn , Y n ) are fully available at the decision-maker. In the distributed setting, Xin ,
i = 1, . . . , L and Y n are at different locations. In particular, terminal Xi observes only
Xin and terminal Y observes only Y n . Terminals Xi are allowed to send messages to the
decision-maker Y. Using Y n and the received messages, Y determines which hypothesis
is true. We denote this system as S X1 ···XL |Y . Fig. 14.3 illustrates the system model. In the
following, we will use the term “decision-maker” and terminal Y interchangeably. Here,
Y n is used to model any side information available at the decision-maker. If Y is defined
to be an empty set, then the decision-maker does not have side information.
After observing the data sequence x_i^n ∈ X_i^n, terminal X_i will use an encoder f_i to transform the sequence x_i^n into a message f_i(x_i^n), which takes values from the message set M_i,

f_i : X_i^n → M_i = {1, 2, . . . , M_i},    (14.2)

with rate constraint

lim sup_{n→∞} (1/n) log M_i ≤ R_i,   i = 1, . . . , L.    (14.3)

Using messages Mi , i = 1, . . . , L and its side information Y n , the decision-maker will


employ a decision function ψ to determine which hypothesis is true:

ψ : M1 × · · · × ML × Yn → {H0 , H1 }. (14.4)

Figure 14.3 The basic model: terminals X_1, . . . , X_L send messages f_1(X_1^n), . . . , f_L(X_L^n) to the decision-maker Y.

For any given encoding functions fi , i = 1, . . . , L and decision function ψ, one can define
the acceptance region as

A_n = {(x_1^n, . . . , x_L^n, y^n) ∈ X_1^n × · · · × X_L^n × Y^n : ψ(f_1(x_1^n), . . . , f_L(x_L^n), y^n) = H_0}.    (14.5)

Correspondingly, the type-1 error probability is defined as

αn = PnX1 ···XL Y (Acn ), (14.6)

in which Acn denotes the complement of An , and the type-2 error probability is defined as

βn = QnX1 ···XL Y (An ). (14.7)

The goal is to design the encoding functions fi , i = 1, . . . , L and the decision function
ψ to maximize the type-2 error exponent under certain type-1 error and communication-
rate constraints (14.3).
More specifically, one can consider two kinds of type-1 error constraint, namely the
following.

• Constant-type constraint

α_n ≤ ε    (14.8)

for a prefixed ε > 0, which implies that the type-1 error probability must be smaller
than a given threshold.

• Exponential-type constraint

αn ≤ exp(−nr) (14.9)

for a given r > 0, which implies that the type-1 error probability must decrease expo-
nentially fast with an exponent no less than r. Hence the exponential-type constraint
is stricter than the constant-type constraint.

To distinguish these two different type-1 error constraints, we use different notations
to denote the corresponding type-2 error exponent, and we use the subscript ‘b’ to denote
that the error exponent is under the basic model.

• Under the constant-type constraint, we define the type-2 error exponent as

θ_b(R_1, . . . , R_L, ε) = liminf_{n→∞} −(1/n) log ( min_{f_1, . . . , f_L, ψ} β_n ),

in which the minimization is over all f_i s and ψ satisfying conditions (14.3) and (14.8).
• Under the exponential-type constraint, we define the type-2 error exponent as

σ_b(R_1, . . . , R_L, r) = liminf_{n→∞} −(1/n) log ( min_{f_1, . . . , f_L, ψ} β_n ),

in which the minimization is over all f_i s and ψ satisfying conditions (14.3) and (14.9).

14.2.2 Connections and Differences with Distributed Source Coding Problems


From the introduction of the model for the distributed testing problem, we can see that it
is different from the classic distributed source coding problems [21]. In the source coding
problem, which is illustrated in Fig. 14.4, the goal of the decoder is to recover the source
sequences after it has received the compressed messages from terminals {X_l}_{l=1}^L. According to the Slepian–Wolf theorem [21], the decoder can recover the original sequences
with diminishing error probability when the compression rates are larger than certain
values. Hence, when the compression rates are sufficiently large, we can adopt the source
coding method to first recover the original source sequences and then perform infer-
ences. However, in general, the rate constraints are typically too strict for the decision-maker to fully recover {X_l^n}_{l=1}^L in the inference problem. Moreover, in the inference problem, recovery of the source sequences is not the goal and typically is not necessary.
On the other hand, this inference problem is closely connected to distributed source
coding problems. In particular, the general idea of the existing schemes in distributed
inference problems is to mimic the schemes used in distributed source coding problems.
In the existing studies [4–7, 22], each terminal X_l compresses its sequence X_l^n into U_l^n. Then these terminals send the auxiliary sequences {U_l^n}_{l=1}^L to the decision-maker using source coding ideas so that the decision-maker can obtain estimates {Û_l^n}_{l=1}^L, which have a high probability of being the same as {U_l^n}_{l=1}^L. The compression step is to make sure each terminal sends enough information for one to recover U_l^n but does not exceed the rate constraint. Finally, the decision-maker will decide between the two hypotheses using {Û_l^n}_{l=1}^L. Hence, we can see that, even though the decision-maker does not need to recover the sequences {X_l^n}_{l=1}^L, it does need to recover {U_l^n}_{l=1}^L from the compressed messages.

14.2.3 Typical Sequences


The notation and properties of typical sequences are heavily used in the schemes and
proofs in the existing work. Here, for reference, we provide a brief overview of the
definition and key properties. Following [6], for any sequence x^n = (x(1), . . . , x(n)) ∈ X^n, we use n(a|x^n) to denote the total number of indices t at which x(t) = a. Then, the relative frequency or empirical PMF, π(a|x^n) ≜ n(a|x^n)/n, ∀a ∈ X, of the components of x^n is called the type of x^n and is denoted by tp(x^n). The set of all types of sequences in X^n

Figure 14.4 A canonical example for the source coding problem: terminals X_1, . . . , X_L send messages f_1(X_1^n), . . . , f_L(X_L^n) to a decoder that recovers X_1^n, X_2^n, . . . , X_L^n.

is denoted by P_n(X). Furthermore, we call a random variable X^(n) that has the same distribution as tp(x^n) the type variable of x^n.
For any given sequence x^n, we use typicality to measure how likely it is that this sequence was generated from a PMF P_X.
definition 14.1 (Definition 1 of [6]) For a given type P_X ∈ P_n(X) and a constant η, we denote by T_η^n(X) the set of (P_X, η)-typical sequences in X^n:

T_η^n(X) ≜ { x^n ∈ X^n : |π(a|x^n) − P_X(a)| ≤ ηP_X(a), ∀a ∈ X }.

In the same manner, we use T_η^n(X̃) to denote the set of (P_X̃, η)-typical sequences. Note that, when η = 0, T_0^n(X) denotes the set of sequences x^n ∈ X^n of type P_X, and we use T^n(X) for simplicity.


We use the following lemma to summarize key properties of typical sequences.

lemma 14.1 (Lemma 1 of [6]) Let λ > 0 be arbitrary.
1. P_X^n(T_η^n(X)) ≥ 1 − λ for all sufficiently large n.
2. Let X^(n) be a type variable for a sequence in X^n; then

(n + 1)^{−|X|} exp[nH(X^(n))] ≤ |T_0^n(X^(n))| ≤ exp[nH(X^(n))].    (14.10)

3. Let x^n ∈ X^n with type variable X^(n) and let X be a random variable on X; then

Pr(X^n = x^n) = exp[−n(H(X^(n)) + D(X^(n)||X))].    (14.11)

Let Λ_n(X) be the set of all the different types of x^n over X^n. Then, the data space X^n is partitioned into classes, each with the same type:

X^n = ∪_{X^(n) ∈ Λ_n(X)} T_0^n(X^(n)).    (14.12)

Similarly, for multiple random variables, define their joint empirical PMF as

π(a, b|x^n, y^n) ≜ n(a, b|x^n, y^n)/n,   ∀(a, b) ∈ X × Y.    (14.13)

For a given type P_XY ∈ P_n(X × Y) and a constant η, we denote by T_η^n(XY) the set of jointly (P_XY, η)-typical sequences in X^n × Y^n:

T_η^n(XY) ≜ { (x^n, y^n) ∈ X^n × Y^n : |π(a, b|x^n, y^n) − P_XY(a, b)| ≤ ηP_XY(a, b), ∀(a, b) ∈ X × Y }.    (14.14)

Furthermore, for y^n ∈ Y^n, we define T_η^n(X|y^n) as the set of all x^n s that are jointly typical with y^n:

T_η^n(X|y^n) = {x^n ∈ X^n : (x^n, y^n) ∈ T_η^n(XY)}.    (14.15)
More details and properties can be found in [6, 21].
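To make these definitions concrete, the following short Python sketch computes the empirical PMF (type) of a sequence and checks the (P_X, η)-typicality condition of Definition 14.1. It is only an illustration: the alphabet, distribution, and η used here are arbitrary choices, not values from the chapter.

```python
import numpy as np

def empirical_pmf(x, alphabet):
    """Type (empirical PMF) of the sequence x over the given alphabet."""
    n = len(x)
    return np.array([np.sum(x == a) for a in alphabet]) / n

def is_typical(x, p_x, alphabet, eta):
    """Check the (P_X, eta)-typicality condition of Definition 14.1:
    |pi(a|x^n) - P_X(a)| <= eta * P_X(a) for every symbol a."""
    pi = empirical_pmf(x, alphabet)
    return bool(np.all(np.abs(pi - p_x) <= eta * p_x))

# Illustrative example (alphabet, P_X, and eta are arbitrary choices).
rng = np.random.default_rng(0)
alphabet = np.array([0, 1, 2])
p_x = np.array([0.5, 0.3, 0.2])
x = rng.choice(alphabet, size=10000, p=p_x)
print(empirical_pmf(x, alphabet))              # close to p_x for large n
print(is_typical(x, p_x, alphabet, eta=0.05))  # True with high probability
```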

14.2.4 Testing against Independence with Positive-Rate Data Compression


First, we focus on a special case: testing against independence, in which we are inter-
ested in determining whether (X1 , . . . , XL ) and Y are independent or not. In this case,

QX1 ···XL Y in (14.1) has the special form

QX1 ···XL Y = PX1 ···XL PY

and the two hypotheses in (14.1) become

H0 : PX1 ···XL Y versus H1 : PX1 ···XL PY . (14.16)

Note that the marginal distribution of (X1 , . . . , XL ) and Y are the same under both
hypotheses in the case of testing against independence. The problem has been studied
in [4, 9].
When L = 1 (the system is then denoted as S_{X_1|Y}), this problem can be shown to have a close connection with the source coding with a single helper problem of [22]. Building on this connection and the results in [22], Ahlswede
and Csiszár [4] provided a single-letter characterization of the optimal type-2 error
exponent.
theorem 14.1 (Theorem 2 of [4]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.8) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability satisfies

θ_b(R_1, ε) = max_{U_1 ∈ ϕ_1} I(U_1; Y),    (14.17)

where

ϕ_1 = {U_1 : R_1 ≥ I(U_1; X_1), U_1 ↔ X_1 ↔ Y, |U_1| ≤ |X_1| + 1}.    (14.18)

When L ≥ 2, one can follow a similar approach to that in [4] to connect the testing against independence problem to the problem of source coding with multiple helpers. However, unlike the problem of source coding with a single helper, the general problem of source coding with multiple helpers is still open. Hence, a different approach is needed to tackle this more complicated problem. First, we provide a lower bound on the type-2 error exponent by generalizing Theorem 6 of [9] to the case of L terminals.
theorem 14.2 (Theorem 6 of [9]) In the system S X1 ···XL |Y with Ri > 0, i = 1, . . . , L,
when the constraint on the type-1 error probability (14.8) and communication con-
straints (14.3) are satisfied, the error exponent of the type-2 error probability is
lower-bounded by

θ_b(R_1, . . . , R_L, ε) ≥ max_{P_{U_1|X_1} ··· P_{U_L|X_L}} I(U_1 · · · U_L; Y),    (14.19)

in which the maximization is over the P_{U_i|X_i} s such that I(U_i; X_i) ≤ R_i and |U_i| ≤ |X_i| + 1 for i = 1, . . . , L.
The lower bound in Theorem 14.2 can be viewed as a generalization of the bound
in Theorem 14.1. The constraints in Theorem 14.2 can be interpreted as the following
Markov-chains condition on the auxiliary random variables:

Ui ↔ Xi ↔ (U1 , . . . , Ui−1 , X1 , . . . , Xi−1 , Xi+1 , . . . , XL , Y). (14.20)



To achieve this lower bound, we design the following encoding/decoding scheme. For a given rate constraint R_i, terminal X_i first generates a quantization codebook containing 2^{nR_i} quantization sequences u_i^n. After observing x_i^n, terminal X_i picks one sequence u_i^n from the quantization codebook to describe x_i^n and sends this sequence to the decision-maker. After receiving the descriptions from the terminals, the decision-maker will declare that the hypothesis H_0 is true if the descriptions from these terminals and the side information at the decision-maker are correlated. Otherwise, the decision-maker will declare H_1 true.
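As a rough illustration of how the single-letter expressions above can be evaluated numerically (this is our own sketch, not a computation from the chapter), the following Python snippet performs a crude grid search over binary test channels P_{U_1|X_1} for an L = 1 toy example, keeping only channels with I(U_1; X_1) ≤ R_1 and reporting the largest resulting I(U_1; Y), as in Theorem 14.1. The joint PMF and rate are arbitrary illustrative choices, the search is restricted to binary U_1 for simplicity (the theorem allows |U_1| ≤ |X_1| + 1), and base-2 logarithms are used.

```python
import itertools
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint PMF given as a 2-D array."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def achievable_exponent(p_x1y, rate, grid):
    """Grid search over binary channels P_{U1|X1}; U1 -> X1 -> Y holds by
    construction, so P_{U1 Y}(u, y) = sum_x P_{U1|X1}(u|x) P_{X1 Y}(x, y)."""
    p_x = p_x1y.sum(axis=1)
    best = 0.0
    for a, b in itertools.product(grid, repeat=2):
        w = np.array([[a, 1 - a], [b, 1 - b]])        # w[x, u] = P(U1=u | X1=x)
        i_ux = mutual_information(w * p_x[:, None])   # I(U1; X1)
        if i_ux <= rate:
            i_uy = mutual_information(w.T @ p_x1y)    # I(U1; Y)
            best = max(best, i_uy)
    return best

# Illustrative binary joint PMF P_{X1 Y} (not from the chapter) and rate R1.
p_x1y = np.array([[0.35, 0.15],
                  [0.10, 0.40]])
grid = np.linspace(0.0, 1.0, 51)
print(achievable_exponent(p_x1y, rate=0.2, grid=grid))
```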
Then, we establish an upper bound on the type-2 error exponent that any scheme can
achieve by generalizing Theorem 7 in [9] to the case of L terminals.

theorem 14.3 (Theorem 7 of [9]) In the system S X1 ···XL |Y with Ri ≥ 0, i = 1, . . . , L,


when the constraint on the type-1 error probability (14.8) and communication con-
straints (14.3) are satisfied, the best error exponent for the type-2 error probability is
upper-bounded by

lim_{ε→0} θ_b(R_1, . . . , R_L, ε) ≤ max_{U_1···U_L} I(U_1 · · · U_L; Y),    (14.21)

in which the maximization is over the U_i s such that R_i ≥ I(U_i; X_i), |U_i| ≤ |X_i| + 1, and U_i ↔ X_i ↔ (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_L, Y) for i = 1, . . . , L.

We note that the constraints on auxiliary random variables in Theorems 14.2 and 14.3
are different. In particular, the Markov constraints in Theorem 14.3 are less strict than
those in Theorem 14.2. This implies that the lower bound and the upper bound do not
match with each other. Hence, more exploration of this problem is needed.
For the case of testing against independence under an exponential-type constraint on
the type-1 error probability, only a lower bound is established in [7], which is stated in
the following theorem.
First, let Φ denote the set of all continuous mappings from P(X_1) to P(U_1|X_1). For ω ∈ Φ and one particular X̃_1, ω(X̃_1) is a conditional PMF, and Ũ_1 is an auxiliary random variable that satisfies P_{Ũ_1|X̃_1} = ω(X̃_1). Then, define

φ_{X_1}(R_1, r) = { ω ∈ Φ : max_{X̃_1 : D(X̃_1||X_1) ≤ r, P_{Ũ_1|X̃_1} = ω(X̃_1)} I(Ũ_1; X̃_1) ≤ R_1 }.    (14.22)

theorem 14.4 (Corollary 2 of [7]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.9) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability is lower-bounded by
 
σ_b(R_1, r) ≥ max_{ω ∈ φ_{X_1}(R_1, r)} min_{Ũ_1 X̃_1 Ỹ} [ D(X̃_1||X_1) + I(Ũ_1; Ỹ) ],    (14.23)

in which the minimization is over all (Ũ_1, X̃_1, Ỹ) such that |Ũ_1| ≤ |X_1| + 2, D(Ũ_1 X̃_1 Ỹ||U_1 X_1 Y) ≤ r, P_{Ũ_1|X̃_1} = P_{U_1|X_1} = ω(X̃_1), and Ũ_1 ↔ X̃_1 ↔ Ỹ.
If we let r = 0, then D(X̃_1||X_1) + I(Ũ_1; Ỹ) reduces to I(U_1; Y), which is the same as in (14.19). Unlike Theorem 14.1, a matching upper bound is hard to establish due to the stronger constraint on the type-1 error probability.

14.2.5 General PMF with Zero-Rate Data Compression


In this section, we focus on the “zero-rate” data compression, i.e.,
M_i ≥ 2,   lim_{n→∞} (1/n) log M_i = 0,   i = 1, . . . , L.    (14.24)

In this case, σb (R1 , . . . , RL , r) will be denoted as σb (0, . . . , 0, r). This zero-rate compres-
sion is of practical interest, as the normalized (by the length of the data) communication
cost is minimal. It is well known that this kind of zero-rate information is not useful in
the traditional distributed source coding with side-information problems [21, 23], whose
goal is to recover (X1n , . . . , XLn ) at terminal Y. However, in the distributed inference setup,
the goal is only to determine which hypothesis is true. The limited information from
zero-rate compressed messages will be very useful. A clear benefit of this zero-rate
compression approach is that the terminals need to consume only a limited amount of
communication resources.

Under a Constant-Type Constraint


We first discuss zero-rate data compression under a constant-type constraint on the
type-1 error probability.
A special case of general zero-rate compression was first studied in Theorem 5 of [6],
in which Han assumed that Mi = 2, i.e., each terminal can send only one bit of informa-
tion. Using the properties of typical sequences, a matching upper and lower bound were
derived, as shown in Theorem 14.5.
theorem 14.5 (Theorem 5 of [6]) Suppose that D(P_{X_1Y}||Q_{X_1Y}) < +∞. In the system S_{X_1|Y} with one-bit data compression, denoted as R_1 = 0_2, when the constraint on the type-1 error probability (14.8) is satisfied for some 0 < ε_0 ≤ 1 and for all 0 < ε < ε_0, the best error exponent for the type-2 error probability is

θ_b(0_2, ε) = min_{P̃_{X_1Y} : P̃_{X_1} = P_{X_1}, P̃_Y = P_Y} D(P̃_{X_1Y}||Q_{X_1Y}).    (14.25)

To achieve this bound, one can use the following simple encoding/decoding scheme: if the observed sequence x_1^n ∈ T^n(P_{X_1}), i.e., it is a typical sequence of P_{X_1}, then we send 1 to the decision-maker Y; otherwise, we send 0. If terminal Y receives 1, then it decides H_0 is true; otherwise, it decides H_1 is true. Using the properties of typical sequences, we can easily get the lower bound on the type-2 error exponent. To get the matching upper bound, the condition D(P_{X_1Y}||Q_{X_1Y}) < +∞ is required. Readers can refer to [6] for more details.
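A minimal Python sketch of this one-bit scheme is given below: terminal X_1 sends a single bit indicating whether its observation is η-typical for P_{X_1}, and the decision-maker declares H_0 exactly when it receives a 1. The distributions, blocklength, and η below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def empirical_pmf(x, k):
    """Type of an integer-valued sequence over an alphabet of size k."""
    return np.bincount(x, minlength=k) / len(x)

def one_bit_encoder(x1, p_x1, eta):
    """Send 1 iff x1^n is (P_{X1}, eta)-typical (the scheme behind Theorem 14.5)."""
    pi = empirical_pmf(x1, len(p_x1))
    return int(np.all(np.abs(pi - p_x1) <= eta * p_x1))

def decision(bit):
    return "H0" if bit == 1 else "H1"

# Illustrative setup (not from the chapter): marginal of X1 under each hypothesis.
rng = np.random.default_rng(1)
p_x1 = np.array([0.6, 0.4])     # marginal of X1 under H0
q_x1 = np.array([0.3, 0.7])     # marginal of X1 under H1
n, eta = 2000, 0.1

x_h0 = rng.choice(2, size=n, p=p_x1)
x_h1 = rng.choice(2, size=n, p=q_x1)
print(decision(one_bit_encoder(x_h0, p_x1, eta)))  # typically "H0"
print(decision(one_bit_encoder(x_h1, p_x1, eta)))  # typically "H1"
```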
For the general zero-rate data compression, as we have Mi ≥ 2, more information is
sent to the decision-maker, which may lead to a better performance. Hence, the following
inequality holds:

θ_b(0_2, . . . , 0_2, ε) ≤ θ_b(0, . . . , 0, ε) ≤ θ_b(R_1, . . . , R_L, ε).    (14.26)

The scenario with general zero-rate data compression under a constant-type constraint
has been considered in [8]. Shalaby adopted the blowing-up lemma [24] to give a tight
upper bound on the type-2 error exponent.

theorem 14.6 (Theorem 1 of [8]) Let Q_{X_1Y} > 0. In the system S_{X_1|Y} with R_1 ≥ 0, when the constraint on the type-1 error probability (14.8) is satisfied for all ε ∈ (0, 1), the best error exponent for the type-2 error probability is upper-bounded by

θ_b(0, ε) ≤ min_{P̃_{X_1Y} : P̃_{X_1} = P_{X_1}, P̃_Y = P_Y} D(P̃_{X_1Y}||Q_{X_1Y}).    (14.27)

Here, the positive condition QX1 Y (x1 , y) > 0, ∀(x1 , y) ∈ X1 × Y is required by the
blowing-up lemma.
Using the inequality (14.26), the combination of Theorems 14.5 and 14.6 yields the
following theorem.
theorem 14.7 (Theorem 2 of [8]) Let Q_{X_1Y} > 0. In the system S_{X_1|Y} with R_1 ≥ 0, when the constraint on the type-1 error probability (14.8) is satisfied for all ε ∈ (0, 1), the best error exponent for the type-2 error probability is

θ_b(0, ε) = min_{P̃_{X_1Y} : P̃_{X_1} = P_{X_1}, P̃_Y = P_Y} D(P̃_{X_1Y}||Q_{X_1Y}).    (14.28)

The above results are given for the case with L = 1; in [8], Shalaby and Papamarcou also discussed the case of general L.

Under an Exponential-Type Constraint


Compared with the constant-type constraint, the exponential-type constraint is much stricter, as it requires the type-1 error probability to decrease exponentially fast with an exponent no less than r. Hence, the encoding/decoding scheme is more complex.
Zero-rate data compression under an exponential-type constraint was first studied by
Han and Kobayashi [7]. They studied the case S X1 Y , which means L = 1, terminal Y also
sends encoded messages to the decision-maker, and the decision-maker does not have
any side information. They provided a lower bound on the type-2 error exponent, which
is stated in the following theorem.
theorem 14.8 (Theorem 6 of [7]) For zero-rate data compression in S X1 Y with R1 =
RY = 0, the error exponent satisfies

σb (0, 0, r) ≥ σopt , (14.29)

in which

σ_opt ≜ min_{P̂_{X_1Y} ∈ H_r} D(P̂_{X_1Y}||Q_{X_1Y})    (14.30)

with

H_r = { P̂_{X_1Y} : P̂_{X_1} = P̃_{X_1}, P̂_Y = P̃_Y for some P̃_{X_1Y} ∈ ϕ_r },    (14.31)

where

ϕ_r = { P̃_{X_1Y} : D(P̃_{X_1Y}||P_{X_1Y}) ≤ r }.    (14.32)

To show this bound, the following coding scheme is adopted. After observing x_1^n, terminal X_1 knows the type tp(x_1^n) and sends tp(x_1^n) (or an approximation of it, see below) to the decision-maker. Terminal Y does the same. As there are at most n^{|X_1|} types [25], the rate required for sending the type from terminal X_1 is (|X_1| log n)/n, which goes to zero as n increases. After receiving all type information from the terminals, the decision-maker will check whether there is a joint type P̂_{X_1Y} ∈ H_r such that its marginal types are the same as the information received from the terminals. If so, the decision-maker declares H_0 to be true, otherwise it declares H_1 to be true. If the message size M_1 is less than n^{|X_1|}, then, instead of the exact type information tp(x_1^n), each terminal will send an approximated version. For more details, please refer to [7].
Later, Han and Amari [5] proved an upper bound that matched the lower bound in
Theorem 14.8 by converting the problem under an exponential-type constraint to the
problem under a constant-type constraint.
theorem 14.9 (Theorem 5.5 of [5]) Let PX1 Y be arbitrary and QX1 Y > 0. For zero-rate
compression in S X1 Y with R1 = RY = 0, the error exponent is upper-bounded by

σb (0, 0, r) ≤ σopt , (14.33)

with σopt defined in (14.30).


A consequence of Theorems 14.8 and 14.9 is the following theorem.
theorem 14.10 (Corollary 5.3 of [5]) Let P_{X_1Y} be arbitrary and Q_{X_1Y} > 0. For zero-rate compression in S_{X_1Y} with R_1 = R_Y = 0, the error exponent is

σ_b(0, 0, r) = σ_opt,    (14.34)

with σ_opt defined in (14.30).


For the case with L ≥ 2, the following results hold.
theorem 14.11 (Theorem 4 of [11]) Let PX1 ,...,XL Y be arbitrary and QX1 ,...,XL Y > 0.
For zero-rate compression in S X1 ...XL |Y with Ri = 0, i = 1, . . . , L and type-1 error
constraint (14.9), the best type-2 error exponent is

σ_b(0, . . . , 0, r) = min_{P̂_{X_1···X_LY} ∈ H_r} D(P̂_{X_1···X_LY}||Q_{X_1···X_LY}),    (14.35)

with

H_r = { P̂_{X_1···X_LY} : P̂_{X_i} = P̃_{X_i}, P̂_Y = P̃_Y, i = 1, . . . , L, for some P̃_{X_1···X_LY} ∈ ϕ_r },    (14.36)

where

ϕ_r = { P̃_{X_1···X_LY} : D(P̃_{X_1···X_LY}||P_{X_1···X_LY}) ≤ r }.    (14.37)

Thus, we have obtained a single-letter characterization of the distributed testing problem under zero-rate data compression with an exponential-type constraint on the type-1 error probability.
Furthermore, the minimization problems in (14.30) and (14.35) are convex optimization
problems, which can be solved efficiently.

corollary 14.1 Given PX1 ···XL Y and QX1 ···XL Y , the problem of finding σopt defined in
(14.35) is a convex optimization problem.
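To illustrate Corollary 14.1, the sketch below uses CVXPY (an assumed external library; a conic solver with exponential-cone support, such as ECOS or SCS, is required) to compute σ_opt in (14.30) for a toy pair (P_{X_1Y}, Q_{X_1Y}). It jointly optimizes over P̂ (the distribution whose divergence to Q is minimized) and P̃ (a distribution within KL radius r of P), subject to the marginal-matching constraints that define H_r. All numerical values are illustrative assumptions, not data from the chapter.

```python
import cvxpy as cp
import numpy as np

# Illustrative PMFs on a binary X1 x binary Y alphabet (not from the chapter).
P = np.array([[0.40, 0.10],
              [0.15, 0.35]])   # P_{X1 Y} under H0
Q = np.array([[0.25, 0.25],
              [0.25, 0.25]])   # Q_{X1 Y} under H1
r = 0.05                       # required type-1 error exponent

P_hat = cp.Variable(P.shape, nonneg=True)   # candidate element of H_r
P_til = cp.Variable(P.shape, nonneg=True)   # candidate element of the KL ball phi_r

constraints = [
    cp.sum(P_hat) == 1,
    cp.sum(P_til) == 1,
    # P_til lies in phi_r: D(P_til || P) <= r (the elementwise kl_div terms sum
    # to the KL divergence because both arguments are normalized PMFs).
    cp.sum(cp.kl_div(P_til, P)) <= r,
    # Marginals of P_hat match those of P_til (the definition of H_r).
    cp.sum(P_hat, axis=1) == cp.sum(P_til, axis=1),   # X1 marginal
    cp.sum(P_hat, axis=0) == cp.sum(P_til, axis=0),   # Y marginal
]

# sigma_opt = min over H_r of D(P_hat || Q).
problem = cp.Problem(cp.Minimize(cp.sum(cp.kl_div(P_hat, Q))), constraints)
problem.solve()
print("sigma_opt (nats):", problem.value)
```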

14.2.6 General PMF with Positive-Rate Data Compression


In this section, we discuss the general case of hypothesis testing with positive data com-
pression. Unfortunately, establishing a tight upper bound on the type-2 error exponent is
still an open problem, so here we give only a classic lower bound.
The result under a constant-type constraint on the type-1 error probability is stated in
the following theorem.

theorem 14.12 (Theorem 1 of [6]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.8) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability satisfies
θ_b(R_1, ε) ≥ max_{U_1 ∈ ϕ_0} min_{P̃_{U_1X_1Y} ∈ ξ_0} D(P̃_{U_1X_1Y}||Q_{U_1X_1Y}),    (14.38)

where

ϕ_0 = {U_1 : R_1 ≥ I(U_1; X_1), U_1 ↔ X_1 ↔ Y, |U_1| ≤ |X_1| + 1},

ξ_0 = { P̃_{U_1X_1Y} : P̃_{U_1X_1} = P_{U_1X_1}, P̃_{U_1Y} = P_{U_1Y} },

and Q_{U_1|X_1} = P_{U_1|X_1}.

For the case of an exponential-type constraint on the type-1 error probability, we first define

φ̃(ω) = { Ũ X̃_1 Ỹ : D(Ũ X̃_1 Ỹ||U X_1 Y) ≤ r, P_{Ũ|X̃_1} = P_{U|X_1} = ω(X̃_1), Ũ ↔ X̃_1 ↔ Ỹ }    (14.39)

and

φ̂(ω) = { Û X̂_1 Ŷ : P_{ÛX̂_1} = P_{ŨX̃_1}, P_{ÛŶ} = P_{ŨỸ} for some Ũ X̃_1 Ỹ ∈ φ̃(ω) }.    (14.40)

theorem 14.13 (Theorem 1 of [7]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.9) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability satisfies

σ_b(R_1, r) ≥ max_{ω ∈ φ_{X_1}(R_1, r)} min_{Û X̂_1 Ŷ ∈ φ̂(ω), |Û| ≤ |X_1| + 2} D(P̂_{UX_1Y}||Q_{UX_1Y}),    (14.41)

with Q_{U|X_1} = P̃_{U|X_1} and φ_{X_1}(R_1, r) defined in (14.22).

14.3 Interactive Communication

In Section 14.2, we discussed basic models with non-interactive communications. In


this section, we discuss results for more sophisticated models with interactive com-
munication among different terminals. The basic idea is that, by allowing interactive
communications among terminals, the decision-maker could receive more information
that may lead to a smaller error probability. We discuss two forms of interaction. First,
we study cascaded communication [16, 17], in which these terminals broadcast their
messages in a sequential order from terminal X1 to terminal XL and each terminal uses
all previously received messages along with its own observations for encoding. Then,
we discuss the scenario with multiple rounds of communication between two terminals
X1 and Y [18–20].

14.3.1 Cascaded Communication


Model
As in Section 14.2, we consider the following hypothesis-testing problem:

H0 : PX1 ···XL Y , H1 : QX1 ···XL Y . (14.42)

(X1n , . . . , XLn , Y n ) are i.i.d. generated according to one of the above joint PMFs, and are
observed at different terminals. These terminals broadcast messages in a sequential order
from terminal 1 until terminal L, and each terminal will use all messages received so far
along with its own observations for encoding. More specifically, terminal X1 will first
broadcast its encoded message, which depends only on X1n , and then terminal X2 will
broadcast its encoded message, which now depends not only on its own observations X2n
but also on the message received from terminal X1 . The process continues until terminal
XL , which will use messages received from X1 until XL−1 and its own observations XLn
for encoding. Finally, terminal Y decides which hypothesis is true on the basis of its own
information and the messages received from terminals X1 , . . . , XL . The system model
is illustrated in Fig. 14.5.

Figure 14.5 Model with cascaded encoders.



More specifically, terminal X_1 uses an encoder

f_1 : X_1^n → M_1 = {1, 2, . . . , M_1},    (14.43)

which is a map from X_1^n to M_1. Terminal X_l, l = 2, . . . , L, uses an encoder

f_l : (X_l^n, M_1, . . . , M_{l−1}) → M_l = {1, 2, . . . , M_l},    (14.44)

with rates R_l such that

lim sup_{n→∞} (1/n) log M_l ≤ R_l,   l = 1, . . . , L.    (14.45)

Using its own observations and messages received from encoders, terminal Y will use
a decoding function ψ to decide which hypothesis is true:

ψ : (M1 , . . . , ML , Yn ) → {H0 , H1 }. (14.46)

Given the encoding and decoding functions, we can define the acceptance region and
corresponding type-1 error probability, type-2 error probability, and type-2 error expo-
nents under different types of constraints on the type-1 error probability in a similar
manner to those in Section 14.2. To distinguish this case from the basic model, we use
θc and σc to denote the type-2 error exponent under a constant-type constraint and the
type-2 error exponent under an exponential-type constraint, respectively.

Testing against Independence with Positive-Rate Data Compression


For testing against independence, we fully characterize the type-2 error exponent for the
case of general L. To simplify our presentation, we use L = 2 as an example.
theorem 14.14 (Theorem 3 of [17]) For testing against independence with L = 2
cascaded encoders, the best error exponent for the type-2 error probability satisfies

lim_{ε→0} θ_c(R_1, R_2, ε) = max_{U_1U_2 ∈ ϕ_0} I(U_1U_2; Y),    (14.47)

where
ϕ0 = {U1 U2 : R1 ≥ I(U1 ; X1 ), R2 ≥ I(U2 ; X2 |U1 ),
U1 ↔ X1 ↔ (X2 , Y),
U2 ↔ (X2 , U1 ) ↔ (X1 , Y),
|U1 | ≤ |X1 | + 1, |U2 | ≤ |X2 | · |U1 | + 1}. (14.48)

To achieve this bound, we employ the following coding scheme. For a given rate constraint R_1, terminal X_1 first generates a quantization codebook containing 2^{nR_1} quantization sequences u_1^n that are based on x_1^n. After observing x_1^n, terminal X_1 picks one sequence u_1^n from the quantization codebook which is jointly typical with x_1^n and broadcasts this sequence. After receiving u_1^n from terminal X_1, terminal X_2 generates a quantization codebook containing 2^{nR_2} quantization sequences u_2^n that are based on both x_2^n and u_1^n. Then, after observing x_2^n, terminal X_2 picks one sequence u_2^n such that it is jointly typical with u_1^n and x_2^n and broadcasts this sequence.

Table 14.1. The joint PMF P_{X_1X_2Y}

  X_1X_2Y         000      010      100      110
  P_{X_1X_2Y}     0.0704   0.2108   0.0015   0.3233
  X_1X_2Y         001      011      101      111
  P_{X_1X_2Y}     0.2206   0.0667   0.0046   0.1021

Table 14.2. P_{U_1|X_1} and P_{U_2|X_2} for the non-interactive case when R = 0.48

  U_1|X_1         0|0      1|0      0|1      1|1
  P_{U_1|X_1}     0.9991   0.0009   0.1564   0.8436
  U_2|X_2         0|0      1|0      0|1      1|1
  P_{U_2|X_2}     0.9686   0.0314   0.0357   0.9643

Table 14.3. P_{U_1|X_1} and P_{U_2|X_2U_1} for the cascaded case when R = 0.48

  U_1|X_1         0|0      1|0      0|1      1|1
  P_{U_1|X_1}     0.0155   0.9845   0.5829   0.4171
  U_2|X_2U_1      0|00     1|00     0|01     1|01
  P_{U_2|X_2U_1}  0.0636   0.9364   0.9727   0.0273
  U_2|X_2U_1      0|10     1|10     0|11     1|11
  P_{U_2|X_2U_1}  0.9898   0.0102   0.0005   0.9995

Upon receiving both u_1^n and u_2^n, the decision-maker will declare that the hypothesis H_0 is true if the descriptions from these terminals and the side information at the decision-maker are correlated. Otherwise, the decision-maker will declare H_1 true. To prove the converse part, please refer to [17] for more details.
From the description above, we can get an intuitive idea that the decision-maker in
the interactive case receives more information than does the decision-maker in the non-
interactive case; thus a better performance is expected. In the following, we provide a
numerical example to illustrate the gain obtained from interactive communications.
In the example, we let X1 , X2 , and Y be binary random variables with joint PMF
PX1 X2 Y , which is shown in Table 14.1. With QX1 X2 Y = PX1 X2 PY and increasing com-
munication constraints R = R1 = R2 , we use Theorem 14.14 to find the best value of
the type-2 error exponent that we can achieve using our cascaded scheme. For compar-
ison, we also use Theorem 14.2 to find an upper bound on the type-2 error exponent
of the non-interactive case. By applying a grid search, we find the optimal conditional
distributions PU1 |X1 and PU2 |X2 for the non-interactive case and the optimal conditional
distributions PU1 |X1 and PU2 |X2 U1 for the cascaded case. We then calculate the bound on
the type-2 error exponent for both cases. For R = 0.48, we list the conditional distribu-
tions PU1 |X1 and PU2 |X2 for the non-interactive case in Table 14.2 and the conditional
distributions PU1 |X1 and PU2 |X2 U1 for the cascaded case in Table 14.3.
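The bound values plotted below come from evaluating mutual-information expressions such as those in Theorem 14.14. As a hedged illustration of that calculation (our own sketch, not the authors' code), the following Python snippet takes P_{X_1X_2Y} from Table 14.1 and the cascaded conditionals from Table 14.3 and computes I(U_1; X_1), I(U_2; X_2|U_1), and I(U_1U_2; Y); the first two quantities can be compared with the rate constraints of Theorem 14.14 and the last is the achieved exponent. Base-2 logarithms are assumed here, and we do not claim the outputs reproduce the figure exactly.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a joint PMF given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Table 14.1: P_{X1 X2 Y}, indexed as p_xxy[x1, x2, y].
p_xxy = np.array([[[0.0704, 0.2206],
                   [0.2108, 0.0667]],
                  [[0.0015, 0.0046],
                   [0.3233, 0.1021]]])

# Table 14.3 (cascaded case, R = 0.48): w1[x1, u1] = P(U1=u1|X1=x1),
# w2[x2, u1, u2] = P(U2=u2|X2=x2, U1=u1).
w1 = np.array([[0.0155, 0.9845],
               [0.5829, 0.4171]])
w2 = np.array([[[0.0636, 0.9364],    # x2=0, u1=0
                [0.9727, 0.0273]],   # x2=0, u1=1
               [[0.9898, 0.0102],    # x2=1, u1=0
                [0.0005, 0.9995]]])  # x2=1, u1=1

# Full joint P(x1, x2, y, u1, u2) under the Markov structure of Theorem 14.14.
joint = (p_xxy[:, :, :, None, None]
         * w1[:, None, None, :, None]
         * w2[None, :, None, :, :])

p_u1x1 = joint.sum(axis=(1, 2, 4))        # joint of (x1, u1)
p_u1u2x2 = joint.sum(axis=(0, 2))         # joint of (x2, u1, u2)
p_u1u2y = joint.sum(axis=(0, 1))          # joint of (y, u1, u2)

i_u1_x1 = H(p_u1x1.sum(axis=1)) + H(p_u1x1.sum(axis=0)) - H(p_u1x1)
i_u2_x2_given_u1 = (H(p_u1u2x2.sum(axis=0)) + H(p_u1u2x2.sum(axis=2))
                    - H(p_u1u2x2.sum(axis=(0, 2))) - H(p_u1u2x2))
i_u1u2_y = (H(p_u1u2y.sum(axis=(1, 2))) + H(p_u1u2y.sum(axis=0))
            - H(p_u1u2y))

print("I(U1;X1)    =", i_u1_x1)           # compare with R
print("I(U2;X2|U1) =", i_u2_x2_given_u1)  # compare with R
print("I(U1U2;Y)   =", i_u1u2_y)          # achieved type-2 error exponent
```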
The simulation results for different values of R are shown in Fig. 14.6. From Fig. 14.6, we can see that the type-2 error exponents in both cases increase with R, which makes sense: the more information we can send, the fewer errors we will make.

Figure 14.6 Simulation results: error exponent versus R for the non-interactive and cascaded cases.

We also observe that the type-2 error exponent achieved using our cascaded communication scheme is even larger than an upper bound on the type-2 error exponent of any non-interactive scheme. Hence, we confirm the intuitive idea that the greater amount of information offered by cascaded communication facilitates better decision-making for certain testing against independence cases with positive communication rates.

General PMF with Zero-Rate Data Compression


For hypothesis testing with zero-rate data compression, we show that the cascaded
scheme has the same performance as that of the non-interactive communication scenario.
In the non-interactive communication scenario with zero-rate compression, a matching
upper bound and lower bound on the type-2 error exponent were provided in Theorem
14.5. If we can prove an upper bound on the type-2 error exponent for the cascaded com-
munication case that is no larger than the error exponent shown in Theorem 14.5, then
we can arrive at the conclusion that cascaded communication won’t help in the zero-rate
compression case.
In the following, we provide an upper bound on the type-2 error exponent for the
cascaded case.
theorem 14.15 (Theorem 7 of [17]) Letting P_{X_1···X_LY} be arbitrary and Q_{X_1···X_LY} > 0, for all ε ∈ [0, 1), the best type-2 error exponent for zero-rate compression under the type-1 error constraint (14.8), with L cascaded encoders, satisfies

θ_c(0, . . . , 0, ε) ≤ min_{P̃_{X_1···X_LY} ∈ L} D(P̃_{X_1···X_LY}||Q_{X_1···X_LY}),    (14.49)

where

L = { P̃_{X_1···X_LY} : P̃_{X_i} = P_{X_i}, P̃_Y = P_Y, i = 1, . . . , L }.    (14.50)

On comparing Theorem 14.15 with Theorem 14.5, we can see that the upper bound
on the type-2 error exponent for the cascaded communication scheme is the same as
the type-2 error exponent achievable by the non-interactive communication scheme.
This implies that the performance of the cascaded communication scheme is the same
as that of the non-interactive communication scheme in the zero-rate data compression
case.
The conclusion that cascaded communication does not improve the type-2 error
exponent in the zero-rate data compression case also holds for the scenario with an
exponential-type constraint on the type-1 error probability. In the cascaded communi-
cation case, according to the results in Theorem 14.15, we can use a similar strategy
to that in [5] to convert the problem under the exponential-type constraint (14.9) to the
corresponding problem under the constraint in (14.8). As the converting strategy is independent of the communication style, it is the same as that in Theorem 14.11. An upper bound on the type-2 error exponent under the exponential-type constraint can then be derived readily; it is stated in the following theorem without going into the details.
theorem 14.16 (Theorem 8 of [17]) Letting P_{X_1···X_LY} be arbitrary and Q_{X_1···X_LY} > 0, the best type-2 error exponent for the zero-rate compression case under the type-1 error constraint (14.9), with L cascaded encoders, satisfies

σ_c(0, . . . , 0, r) ≤ min_{P̂_{X_1···X_LY} ∈ H_r} D(P̂_{X_1···X_LY}||Q_{X_1···X_LY}),    (14.51)

with

H_r = { P̂_{X_1···X_LY} : P̂_{X_l} = P̃_{X_l}, P̂_Y = P̃_Y, l = 1, . . . , L, for some P̃_{X_1···X_LY} ∈ ϕ_r },    (14.52)

where

ϕ_r = { P̃_{X_1···X_LY} : D(P̃_{X_1···X_LY}||P_{X_1···X_LY}) ≤ r }.    (14.53)

On comparing Theorem 14.16 with Theorem 14.11, where a matching upper and
lower bound are provided for the non-interactive scheme, we can conclude that there
is no gain in performance on the type-2 error exponent under zero-rate compression
with the exponential-type constraint on the type-1 error probability.

General PMF with Positive-Rate Data Compression


We can also get a lower bound on the type-2 error exponent for the case of a general
PMF with positive-rate data compression.
theorem 14.17 (Theorem 5 of [17]) For the case with a general hypothesis P_{X_1X_2Y} versus Q_{X_1X_2Y} with L = 2 cascaded encoders, the best error exponent of the type-2 error probability satisfies

θ_c(R_1, R_2, ε) ≥ max_{U_1U_2 ∈ ϕ_0} min_{P̃_{U_1U_2X_1X_2Y} ∈ ξ_0} D(P̃_{U_1U_2X_1X_2Y}||Q_{U_1U_2X_1X_2Y}),    (14.54)

where ϕ_0 is defined in Theorem 14.14,

ξ_0 = { P̃_{U_1U_2X_1X_2Y} : P̃_{U_1X_1} = P_{U_1X_1}, P̃_{U_1U_2X_2} = P_{U_1U_2X_2}, P̃_{U_1U_2Y} = P_{U_1U_2Y} },    (14.55)

and Q_{U_1|X_1} = P_{U_1|X_1}, Q_{U_2|U_1X_2} = P_{U_2|U_1X_2}.

14.3.2 Fully Interactive Communication


For fully interactive communication, only the L = 1 case has been studied, under a constant-type constraint on the type-1 error probability [20].

Model
As in Section 14.2, we consider the following hypothesis-testing problem:
H0 : PX1 Y , H1 : QX1 Y . (14.56)
(X_1^n, Y^n) are i.i.d. generated according to one of the above joint PMFs, and are observed at two different terminals. Terminal X_1 first encodes its local information into a message M_{11} and broadcasts it. Terminal Y utilizes the message M_{11} to encode its own information as M_{21} and broadcasts it. This is called one round of interactive communication between X_1 and Y. After receiving M_{21}, terminal X_1 can further encode its local information as M_{12} and broadcast it. This process goes on until N rounds of interactive communication have been carried out. After receiving all messages from terminals X_1 and Y, terminal X_1 acts as the decision-maker and makes a decision about the joint PMF of (X_1, Y). The system model is illustrated in Fig. 14.7.
More specifically, terminal X_1 uses the encoding functions

f_{1i} : {X_1^n, M_{2(i−1)}, . . . , M_{21}} → M_{1i} = {1, . . . , M_{1i}},   i = 1, . . . , N,    (14.57)

each of which maps X_1^n and the messages received so far into M_{1i}. Terminal Y uses the encoders

f_{2i} : {Y^n, M_{1i}, . . . , M_{11}} → M_{2i} = {1, . . . , M_{2i}},   i = 1, . . . , N,    (14.58)

with rate R such that

lim sup_{n→∞} (1/n) Σ_{i=1}^N log(M_{1i} M_{2i}) ≤ R.    (14.59)

Terminal X_1 uses a decision function to decide which hypothesis is true:

ψ : {M_{11}, M_{21}, . . . , M_{1N}, M_{2N}} → {H_0, H_1}.    (14.60)
Given the encoding and decoding function, we can define the acceptance region and
corresponding type-1 error probability, type-2 error probability, and type-2 error expo-
nents under different types of constraints on type-1 error probability in a similar way

Figure 14.7 Model with interactive encoders.



to what we did in Section 14.2. To distinguish this case from the basic model and the
cascaded model, we use θi and σi to denote the type-2 error exponent under a constant-
type constraint and the type-2 error exponent under an exponential-type constraint,
respectively.

Testing against Independence with Positive-Rate Data Compression


theorem 14.18 (Proposition 2 of [20]) For the system S X1 Y with N rounds of
communication, the type-2 error exponent satisfies


lim_{ε→0} θ_i ≥ max_{U_{[1:N]}V_{[1:N]} ∈ ϕ(R)} Σ_{k=1}^N [ I(U_{[k]}; Y|U_{[1:k−1]}V_{[1:k−1]}) + I(V_{[k]}; Y|U_{[1:k]}V_{[1:k−1]}) ],    (14.61)

where

ϕ(R) ≜ { U_{[1:N]}V_{[1:N]} : R ≥ Σ_{k=1}^N [ I(U_{[k]}; Y|U_{[1:k−1]}V_{[1:k−1]}) + I(V_{[k]}; Y|U_{[1:k]}V_{[1:k−1]}) ],
U_{[k]} ↔ (X, U_{[1:k−1]}, V_{[1:k−1]}) ↔ Y, |U_{[k]}| < ∞,
V_{[k]} ↔ (Y, U_{[1:k]}, V_{[1:k−1]}) ↔ X, |V_{[k]}| < ∞,
k = 1, . . . , N }.    (14.62)

For N = 1, which means there is only one round of communication between terminal
X1 and Y, one can also prove a matching upper bound on the type-2 error exponent.
Hence one has the following theorem.
theorem 14.19 (Theorem 3 of [20]) For the system S_{X_1Y} with one round of communication (N = 1), the type-2 error exponent satisfies

lim_{ε→0} θ_i = max_{P_{U|X} P_{V|UY} : I(U;Y) + I(V;Y|U) ≤ R} [ I(U; Y) + I(V; Y|U) ].    (14.63)

To achieve this bound, one can employ the following coding scheme. Terminal X_1 first generates a quantization codebook containing 2^{n(I(U;X)+η)} quantization sequences u^n that are based on x_1^n. After observing x_1^n, terminal X_1 picks one sequence u^n from the quantization codebook which is jointly typical with x_1^n and broadcasts this sequence. After receiving u^n from terminal X_1, terminal Y generates a quantization codebook containing 2^{n(I(V;Y)+η)} quantization sequences v^n that are based on both y^n and u^n. Then, after observing y^n, terminal Y picks one sequence v^n such that it is jointly typical with u^n and y^n and broadcasts this sequence. Upon receiving v^n, the decision-maker will declare that the hypothesis H_0 is true if the description from terminal Y and the information at terminal X are correlated. Otherwise, the decision-maker will declare H_1 true. To prove the converse part, please refer to [20] for more details.
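To make the single-letter expression in Theorem 14.19 concrete, the sketch below evaluates I(U;Y) and I(V;Y|U) for a toy binary joint PMF P_{X_1Y} and hand-picked conditionals P_{U|X} and P_{V|UY}. All numerical values are illustrative assumptions of ours rather than quantities from [20], and base-2 logarithms are used.

```python
import numpy as np

def H(p):
    """Shannon entropy (bits) of a joint PMF given as an array."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Illustrative choices (not taken from [20]).
p_xy = np.array([[0.35, 0.15],
                 [0.10, 0.40]])          # p_xy[x, y] = P_{X1 Y}(x, y)
w_u = np.array([[0.90, 0.10],
                [0.20, 0.80]])           # w_u[x, u] = P(U=u | X1=x)
w_v = np.array([[[0.85, 0.15],           # u=0, y=0
                 [0.30, 0.70]],          # u=0, y=1
                [[0.25, 0.75],           # u=1, y=0
                 [0.95, 0.05]]])         # w_v[u, y, v] = P(V=v | U=u, Y=y)

# Full joint P(x, y, u, v) under the Markov structure of Theorem 14.19.
joint = (p_xy[:, :, None, None]                      # shape (x, y, 1, 1)
         * w_u[:, None, :, None]                     # shape (x, 1, u, 1)
         * w_v.transpose(1, 0, 2)[None, :, :, :])    # shape (1, y, u, v)

p_yuv = joint.sum(axis=0)                            # joint of (y, u, v)
i_u_y = (H(p_yuv.sum(axis=(0, 2))) + H(p_yuv.sum(axis=(1, 2)))
         - H(p_yuv.sum(axis=2)))                     # I(U;Y)
i_v_y_given_u = (H(p_yuv.sum(axis=0)) + H(p_yuv.sum(axis=2))
                 - H(p_yuv.sum(axis=(0, 2))) - H(p_yuv))   # I(V;Y|U)

print("I(U;Y)   =", i_u_y)
print("I(V;Y|U) =", i_v_y_given_u)
print("sum      =", i_u_y + i_v_y_given_u)   # the expression appearing in (14.63)
```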

General PMF with Zero-Rate Data Compression


One can also prove that interactive communication does not help to improve the perfor-
mance with zero-rate data compression. In particular, [20] provides a matching upper
bound and lower bound on the type-2 error exponent for the fully interactive case.
theorem 14.20 (Theorem 4 of [20]) Letting P_{X_1Y} be arbitrary and Q_{X_1Y} > 0, for all ε ∈ [0, 1), the best type-2 error exponent for zero-rate compression under the type-1 error constraint (14.8) satisfies

θ_i(0, ε) = min_{P̃_{X_1Y} ∈ L_0} D(P̃_{X_1Y}||Q_{X_1Y}),    (14.64)

where

L_0 = { P̃_{X_1Y} : P̃_{X_1} = P_{X_1}, P̃_Y = P_Y }.

General PMF with Positive-Rate Data Compression


One can also get a lower bound on the type-2 error exponent for the case of a general
PMF with positive-rate data compression.
theorem 14.21 (Proposition 2 of [20]) In the system S_{X_1Y}, the best error exponent of the type-2 error probability satisfies

θ_i(R, ε) ≥ max_{ϕ(R)} min_{ξ(U_{[1:N]}V_{[1:N]})} D(P̃_{U_{[1:N]}V_{[1:N]}X_1Y}||Q_{U_{[1:N]}V_{[1:N]}X_1Y}),    (14.65)

where ϕ(R) is defined in Theorem 14.18, and

ξ(U_{[1:N]}V_{[1:N]}) = { P̃_{U_{[1:N]}V_{[1:N]}X_1Y} : P̃_{U_{[1:N]}V_{[1:N]}X_1} = P_{U_{[1:N]}V_{[1:N]}X_1}, P̃_{U_{[1:N]}V_{[1:N]}Y} = P_{U_{[1:N]}V_{[1:N]}Y} }.    (14.66)

14.4 Identity Testing

In Sections 14.2 and 14.3, we dealt with the scenarios where the PMF under each hypoth-
esis is fully specified. However, there are practical scenarios where the probabilistic
models are not fully specified. One of these problems is the identity-testing problem,
in which the goal is to determine whether given samples are generated from a certain
distribution or not.
In this section, we discuss the identity-testing problem in the feature partitioning
scenario with non-interactive communication of the encoders. As in Section 14.2, we
consider a setup with L terminals (encoders) Xl , l = 1, . . . , L and a decision-making
terminal Y. (X1n , . . . , XLn , Y n ) are generated according to some unknown PMF PX1 ···XL Y .
Terminals {X_l}_{l=1}^L can send compressed messages related to their own data with limited rates to the decision-maker; the decision-maker then performs statistical inference on the basis of the messages received from terminals {X_l}_{l=1}^L and its local information related to Y. In particular, we focus on the problem in which the decision-maker tries to decide whether P_{X_1···X_LY} is the same as a given distribution Q_{X_1···X_LY}, i.e., P_{X_1···X_LY} = Q_{X_1···X_LY}, or they are λ-far away, i.e., ||P_{X_1···X_LY} − Q_{X_1···X_LY}||_1 ≥ λ (λ > 0), where ||·||_1 denotes the ℓ_1 norm of its argument. This identity-testing problem can be interpreted as the following
norm of its argument. This identity-testing problem can be interpreted as the following
two hypothesis-testing problems.
• Problem 1:

H0 : ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ versus H1 : PX1 ···XL Y = QX1 ···XL Y . (14.67)

• Problem 2:

H0 : PX1 ···XL Y = QX1 ···XL Y versus H1 : ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ. (14.68)

In both problems, our goal is to characterize the type-2 error exponent under the
constraints on the communication rates and type-1 error probability.
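As a small illustration of the λ-far criterion that defines the composite hypothesis, the sketch below computes the ℓ_1 distance between two joint PMFs and checks whether a candidate P belongs to the set {P : ||P − Q||_1 ≥ λ}; the PMFs and λ are illustrative assumptions.

```python
import numpy as np

def l1_distance(p, q):
    """||P - Q||_1 for two joint PMFs given as arrays of the same shape."""
    return float(np.abs(p - q).sum())

def is_lambda_far(p, q, lam):
    """True if P lies in the set {P : ||P - Q||_1 >= lam}."""
    return l1_distance(p, q) >= lam

# Illustrative joint PMFs over a binary X x binary Y alphabet.
Q = np.array([[0.25, 0.25],
              [0.25, 0.25]])
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])
print(l1_distance(P, Q))          # 0.6
print(is_lambda_far(P, Q, 0.3))   # True: P lies in the H0 set of Problem 1
```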
This distributed identity-testing problem with composite hypotheses can be viewed
as a generalization of the problems considered in Section 14.2. Between the two possi-
ble problems defined in (14.67) and (14.68), Problem 2 is relatively simple and it can
be solved using similar schemes to those proposed in Section 14.2. In particular, the
encoding schemes and the definition of the acceptance regions at the decision-maker in
Section 14.2 depend only on the form of the PMF under H0 . Since the form of the PMF
under H0 in Problem 2 is known, we can apply the existing coding/decoding schemes
such as that in Section 14.2 and take the type-2 error probability as the supremum of the
type-2 error probabilities under each PX1 ···XL Y that satisfies ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ.
Moreover, it can be shown that these schemes are optimal for Problem 2. However, in
Problem 1, as H0 is composite, we need to design universal encoding/decoding schemes
so that our schemes can provide a guaranteed performance regardless of the true PMF
under H0 . In this section, we will focus on the more challenging Problem 1 with L = 1,
and, to simplify our presentation, we use terminal X and terminal Y to denote the two
terminals.

14.4.1 Model
To simplify the presentation, we assume that there are only terminal X and terminal Y.
Our goal is to determine whether the true joint distribution is the same as the given
distribution QXY or far away from it. We interpret this problem as a hypothesis-testing
problem with a composite null hypothesis and a simple alternative hypothesis:

H0 : PXY ∈ Π versus H1 : QXY , (14.69)

where Π = {PXY ∈ PXY : ||PXY − QXY ||1 ≥ λ} and λ is some fixed positive number. The
model is shown in Fig. 14.8. In a typical identity-testing problem, one determines which

Figure 14.8 Model: terminal X sends f(X^n) and terminal Y sends g(Y^n) to the decision-maker.

hypothesis is true under the assumption that (X n , Y n ) are fully available at the decision-
maker. We assume terminal X observes only X n and terminal Y observes only Y n .
Terminals X and Y are allowed to send encoded messages to the decision-maker, which
then decides which hypothesis is true using the encoded messages directly. We denote
the system as S XY in what follows.
More formally, the system consists of two encoders, f1 , f2 , and one decision function,
ψ. Terminal X has the encoder f1 , terminal Y has the encoder f2 , and the decision-
maker has the decision function ψ. After observing the data sequences xn ∈ Xn and
yn ∈ Yn , encoders f1 and f2 transform sequences xn and yn into messages f1 (xn ) and
f2 (yn ), respectively, which take values from the message sets Mn and Nn ,

f_1 : X^n → M_1 = {1, 2, . . . , M_1},    (14.70)
f_2 : Y^n → M_2 = {1, 2, . . . , M_2},    (14.71)

with rate constraints


lim sup_{n→∞} (1/n) log M_1 ≤ R_1,    (14.72)
lim sup_{n→∞} (1/n) log M_2 ≤ R_2.    (14.73)

Using the messages f1 (X n ) and f2 (Y n ), the decision-maker will use the decision
function ψ to determine which hypothesis is true:

ψ : (Mn , Nn ) → {H0 , H1 }. (14.74)

For any given decision function ψ, one can define the acceptance region as

An = {(xn , yn ) ∈ Xn × Yn : ψ( f1 (xn ), f2 (yn )) = H0 }.

For any given fi , i = 1, 2, and ψ, the type-1 error probability αn and the type-2 error
probability βn are defined as follows:

α_n = sup_{P_{XY} ∈ Π} P^n_{XY}(A_n^c)   and   β_n = Q^n_{XY}(A_n).    (14.75)

Given the type-1 error probability and type-2 error probability, we define the type-2
error exponents under different types of constraints on the type-1 error probability in a
similar way to that in Section 14.2. To distinguish this case from the basic model, we
use θid and σid to denote the type-2 error exponent under a constant-type constraint and
the type-2 error exponent under an exponential-type constraint, respectively.

14.4.2 Testing against Independence with Positive-Rate Data Compression


We now focus on a special case, namely testing against independence, for which the
hypotheses are

H0 : PXY ∈ Π⊥ versus H1 : QX QY , (14.76)



where Π⊥ = Π ∩ {PXY : PX = QX , PY = QY }. This special case has a two-fold meaning:


testing whether (X, Y) are independent or not, and testing whether the joint distribution
of (X, Y) is QX QY or not. Owing to the fact that the marginal distributions for X and
Y in this special case are the same under both hypotheses, we can get a simple encod-
ing/decoding scheme and derive matching lower and upper bounds on the type-2 error
exponent, which allows us to fully characterize the optimal type-2 error exponent.
theorem 14.22 (Theorem 3 of [26]) For R1 ≥ 0, R2 ≥ 0, we have
θ_id(R_1, R_2, ε) = inf_{P_{XY} ∈ Π⊥} max_{UV ∈ ϕ_{P_{XY}}} I(U; V),    (14.77)

where

ϕ_{P_{XY}} = { UV : I(U; X) ≤ R_1, I(V; Y) ≤ R_2, U ↔ X ↔ Y ↔ V }.    (14.78)
To achieve this lower bound, we design the following encoding/decoding scheme. For a given rate constraint R_1, terminal X first generates a quantization codebook containing 2^{nR_1} quantization sequences u^n. After observing x^n, terminal X picks one sequence u^n from the quantization codebook to describe x^n and sends this sequence to the decision-maker. Terminal Y uses a similar scheme and sends its quantization sequence v^n. After receiving the descriptions from both terminals, the decision-maker employs a universal decoding method to declare either H_0 or H_1 true. More details about the universal decoding method can be found in [26].

14.4.3 General PMF with Zero-Rate Data Compression


Under a Constant-Type Constraint
Shalaby and Papamarcou [8] studied the problem with zero-rate data compression under
a constant-type constraint for a more general case:
H0 : PXY ∈ Π versus H1 : QXY ∈ Ξ, (14.79)
where Π and Ξ are two disjoint subsets of P(X × Y). In this case, the type-2 error
probability is defined as
β_n = sup_{Q_{XY} ∈ Ξ} Q^n_{XY}(A_n).    (14.80)

To show a matching upper bound and lower bound, we assume the uniform positivity constraint

ρ_inf ≜ inf_{Q_{XY} ∈ Ξ} min_{(x,y) ∈ X×Y} Q_{XY}(x, y) > 0.    (14.81)

The result is shown in the following theorem.


theorem 14.23 (Theorem 4 of [8]) Let PXY ∈ Π be arbitrary and (14.81) be satis-
fied. For zero-rate compression in S XY and the type-1 error constraint (14.8), the error
exponent satisfies
θ_id(0, 0, ε) = inf_{P_{XY} ∈ Π, Q_{XY} ∈ Ξ} min_{P̃_{XY} : P̃_X = P_X, P̃_Y = P_Y} D(P̃_{XY}||Q_{XY}).    (14.82)

Under an Exponential-Type Constraint


As H0 is a composite hypothesis, we provide a universal encoding and decoding scheme
to establish a lower bound on the error exponent of the type-2 error. We further establish
a matching upper bound and hence fully characterize the error exponent of the type-2
error for this scenario.
theorem 14.24 (Theorem 1 of [27]) Let PXY ∈ Π be arbitrary and QXY > 0. For zero-
rate compression in S XY and the type-1 error constraint (14.9), the error exponent
satisfies

σ_id(0, 0, r) = inf_{P_{XY} ∈ Π} min_{P̂_{XY} ∈ H_r^{P_{XY}}} D(P̂_{XY}||Q_{XY}),    (14.83)

with

H_r^{P_{XY}} = { P̂_{XY} : P̂_X = P̃_X, P̂_Y = P̃_Y for some P̃_{XY} ∈ ϕ_r^{P_{XY}} },

where

ϕ_r^{P_{XY}} = { P̃_{XY} : D(P̃_{XY}||P_{XY}) ≤ r }.

To achieve this bound, we use the same encoding scheme as in Section 14.2, but
employ a universal decoding scheme at the decision-maker so that the type-1 error con-
straint is satisfied regardless of what the true value of PXY is. One can certainly design
an individual acceptance region that satisfies the type-1 error constraint for each possi-
ble value of PXY ∈ Π using the approach in the simple hypothesis case, and then take
the union of these individual regions as the final acceptance region. This will clearly sat-
isfy the type-1 error constraint regardless of the true value of PXY . This approach might
work if the number of possible PXY s is finite or grows polynomially in n. However, in
our case, there are uncountably infinitely many possible PXY s in Π. This approach will
lead to a very loose performance bound. Hence, we need to design a new approach that
will lead to a performance bound matching with the converse bound. More details can
be found in [26].

14.4.4 General PMF with Positive-Rate Data Compression


In this section, we investigate the identity-testing problem under the positive rate com-
pression constraints (14.72) and (14.73). We first establish a lower bound on the type-2
error exponent under the constant-type constraint (14.8) for a general PMF. Then we pro-
vide a lower bound on the type-2 error exponent under the exponential-type constraint
(14.9).

Under a Constant-Type Constraint


Let U and V be arbitrary finite sets. For each distribution P_X on X, let ω(·|·; P_X) be any stochastic mapping from X to U, i.e., let ω(u|x; P_X) be the conditional probability of u ∈ U given x ∈ X. Similarly, for each distribution P_Y on Y, let ν(·|·; P_Y) be any stochastic mapping from Y to V, i.e., let ν(v|y; P_Y) be the conditional probability of v ∈ V given y ∈ Y.

theorem 14.25 (Theorem 2 of [26]) For R1 ≥ 0, R2 ≥ 0, we have


θ_id(R_1, R_2, ε) ≥ inf_{P_{XY} ∈ Π} max_{(ω, ν) ∈ ϕ_{P_{XY}}} min_{P̃_{UVXY} ∈ ξ_{P_{XY}}} D(P̃_{UVXY}||Q_{UVXY}),    (14.84)

where

ϕ_{P_{XY}} = { (ω, ν) : R̂_1 ≥ I(X; U), R̂_2 ≥ I(Y; V),
R̂_1 − R_1 ≤ I(U; V),
R̂_2 − R_2 ≤ I(U; V),
R̂_1 − R_1 + R̂_2 − R_2 ≤ I(U; V),
P_{U|X} = ω(u|x; P_X), P_{V|Y} = ν(v|y; P_Y),
U ↔ X ↔ Y ↔ V }    (14.85)

and

ξ_{P_{XY}} = { P̃_{UVXY} : P̃_{UX} = P_{UX}, P̃_{VY} = P_{VY}, P̃_{UV} = P_{UV} }.    (14.86)

Here ϕ_{P_{XY}} denotes the set of pairs (ω, ν) when the distribution of (X, Y) is P_{XY}. The notation
ξPXY has a similar interpretation.
To achieve this bound, we employ a universal encoding/decoding scheme with a
binning method. More details can be found in [26].

Under an Exponential-Type Constraint


In this section, we consider the case with an exponential-type constraint, i.e., we require
that the exponent of the type-1 error probability should be larger than r. Owing to the
complexity of the problem for the general case S XY , we here give the result assuming
terminal Y can communicate with a large rate R2 > log |Y| so that the decision-maker
has full information about Y n . We will use σ(R1 , r) to denote the corresponding type-2
error exponent.
Let U be an arbitrary finite set and P(U|X) be the set of all possible condi-
tional probability distributions (PU|X (u|x))(u,x)∈U×X on U given values in X. Let ω
denote a continuous mapping from P(X) to P(U|X) and Φ be the set of all possible
mappings.
theorem 14.26 (Theorem 4 of [26]) For R1 ≥ 0, r ≥ 0, we have
σ_id(R_1, r) ≥ inf_{P_{XY} ∈ Π} sup_{ω ∈ φ_{P_{XY}}(R_1, r)} min_{P̂_{UXY} ∈ Ξ̂_{P_{XY}}(ω)} D(P̂_{UXY}||Q_{UXY}),    (14.87)

where

φ_{P_{XY}}(R_1, r) = { ω ∈ Φ : max_{X̃ : D(X̃||X) ≤ r, P_{Ũ|X̃} = ω(X̃)} I(Ũ; X̃) ≤ R_1 },

Ξ̃_{P_{XY}}(ω) = { P̃_{UXY} : D(P̃_{UXY}||P_{UXY}) ≤ r, P̃_{U|X} = P_{U|X} = ω(X̃), Ũ ↔ X̃ ↔ Ỹ },

Ξ̂_{P_{XY}}(ω) = { P̂_{UXY} : P̂_{UX} = P̃_{UX}, P̂_{UY} = P̃_{UY} for some P̃_{UXY} ∈ Ξ̃_{P_{XY}}(ω) },

and Q_{U|X} = P̃_{U|X}, Q_{UXY} = Q_{U|X}Q_{XY}.

14.5 Conclusion and Extensions

14.5.1 Conclusion
This chapter has explored distributed inference problems from an information-theoretic
perspective. First, we discussed distributed inference problems with non-interactive
encoders. Second, we considered distributed testing problems with interactive encoders.
We investigated the case of cascaded communication among multiple terminals and
then discussed the fully interactive communication between two terminals. Finally, we
studied the distributed identity-testing problem, in which the decision-maker decides
whether the distribution indirectly revealed from the compressed data from multiple
distributed terminals is the same as or λ-far from a given distribution.

14.5.2 Future Directions


Despite the many interesting works reviewed in this chapter, there are still many open
problems in distributed inference with multiterminal data compression that require fur-
ther research efforts. In addition, one can further generalize the models discussed in this
chapter to deal with more realistic scenarios. We list here some of the open problems
and possible generalizations.
• For testing against independence with multiple non-interactive encoders, it remains
to find a matching upper bound on the type-2 error exponent under a constant-type
constraint on the type-1 error probability.
• Deriving an upper bound on the type-2 error exponent for the general PMF case for
all three cases in this chapter remains to be done.
• Under the exponential-type constraint on the type-1 error probability, it remains to
prove the converse part for testing against independence for all three cases in this
chapter.
• In Section 14.3, we discussed cascaded communication among L terminals and fully
interactive communication between two terminals. One possible generalization of
this problem is to allow fully interactive communication among L terminals. This
model is illustrated in Fig. 14.9.
• The models discussed in this chapter can be viewed as models with multiple encoders
and one receiver. Given these results, another interesting topic to consider is models
with multiple encoders and multiple receivers. Different receivers might have differ-
ent learning goals. Hence, it is of interest to investigate how to encode messages to
strike a desirable trade-off among the performances of multiple receivers. There have
been some interesting recent papers that address problems in this direction [28, 29],

Figure 14.9 Model with L fully interacting encoders.

focusing on the special case of testing against (conditional) independence with


multiple receivers.
• As discussed in the introduction, the distributed scenario is very common in real life.
Hence, besides the basic inference problems discussed in this chapter, one can study
more sophisticated statistical learning tasks such as linear regression, clustering, etc.
While there are many interesting works that address these problems for the sample
partitioning scenario [2, 30–33], these problems remain largely unexplored for the
feature partitioning scenario.

Acknowledgments

The work of W. Zhao and L. Lai was supported by the National Science Foundation
under grants CNS-1660128, ECCS-1711468, and CCF-1717943.

References

[1] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower


bounds for distributed statistical estimation with communication constraints,” in Advances
in Neural Information Processing Systems 26, 2013, pp. 2328–2336.
[2] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimiza-
tion using an approximate Newton-type method,” in Proc. International Conference on
Machine Learning, 2014.
[3] T. Berger, “Decentralized estimation and decision theory,” in Proc. IEEE Information
Theory Workshop, 1979.
[4] R. Ahlswede and I. Csiszár, “Hypothesis testing with communication constraints,” IEEE
Trans. Information Theory, vol. 32, no. 4, pp. 533–542, July 1986.
[5] T. S. Han and S. Amari, “Statistical inference under multiterminal data compression,” IEEE
Trans. Information Theory, vol. 44, no. 6, pp. 2300–2324, 1998.
[6] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Trans.
Information Theory, vol. 33, no. 6, pp. 759–772, 1987.

[7] T. S. Han and K. Kobayashi, “Exponential-type error probabilities for multiterminal


hypothesis testing,” IEEE Trans. Information Theory, vol. 35, no. 1, pp. 2–14, 1989.
[8] H. M. H. Shalaby and A. Papamarcou, “Multiterminal detection with zero-rate data
compression,” IEEE Trans. Information Theory, vol. 38, no. 2, pp. 254–267, 1992.
[9] W. Zhao and L. Lai, “Distributed detection with vector quantizer,” IEEE Trans. Signal and
Information Processing over Networks, vol. 2, no. 2, pp. 105–119, 2016.
[10] M. S. Rahman and A. B. Wagner, “The optimality of binning for distributed hypothesis
testing,” IEEE Trans. Information Theory, vol. 58, no. 10, pp. 6282–6303, 2012.
[11] W. Zhao and L. Lai, “Distributed testing with zero-rate compression,” in Proc. IEEE
International Symposium on Information Theory, 2015.
[12] C. Tian and J. Chen, “Successive refinement for hypothesis testing and lossless one-helper
problem,” IEEE Trans. Information Theory, vol. 54, no. 10, pp. 4666–4681, 2008.
[13] G. Katz, P. Piantanida, R. Couillet, and M. Debbah, “On the necessity of binning for
the distributed hypothesis testing problem,” in Proc. IEEE International Symposium on
Information Theory, 2015.
[14] M. Mhanna and P. Piantanida, “On secure distributed hypothesis testing,” in Proc. IEEE
International Symposium on Information Theory, 2015.
[15] W. Zhao and L. Lai, “Distributed testing against independence with multiple terminals,” in
Proc. Allerton Conference on Communication, Control, and Computing, 2014, pp. 1246–
1251.
[16] W. Zhao and L. Lai, “Distributed testing against independence with conferencing
encoders,” in Proc. IEEE Information Theory Workshop, 2015.
[17] W. Zhao and L. Lai, “Distributed testing with cascaded encoders,” IEEE Trans. Information
Theory, vol. 64, no. 11, pp. 7339–7348, 2018.
[18] Y. Xiang and Y. Kim, “Interactive hypothesis testing with communication constraints,” in
Proc. Allerton Conference on Communication, Control, and Computing, 2012, pp. 1065–
1072.
[19] Y. Xiang and Y. Kim, “Interactive hypothesis testing against independence,” in Proc. IEEE
International Symposium on Information Theory, 2013, pp. 2840–2844.
[20] G. Katz, P. Piantanida, and M. Debbah, “Collaborative distributed hypothesis testing,”
arXiv:1604.01292, 2016.
[21] A. El Gamal and Y. Kim, Network information theory. Cambridge University Press, 2011.
[22] R. Ahlswede and J. Körner, “Source coding with side information and a converse for
degraded broadcast channels,” IEEE Trans. Information Theory, vol. 21, no. 6, pp.
629–637, 1975.
[23] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Trans.
Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
[24] R. Ahlswede, P. Gács, and J. Körner, “Bounds on conditional probabilities with appli-
cations in multi-user communication,” Z. Wahrscheinlichkeitstheorie verwandte Gebiete,
vol. 34, no. 2, pp. 157–177, 1976.
[25] T. M. Cover and J. A. Thomas, Elements of information theory. Wiley, 2005.
[26] W. Zhao and L. Lai, “Distributed identity testing with data compression,” submitted to
IEEE Trans. Information Theory, 2017.
[27] W. Zhao and L. Lai, “Distributed identity testing with zero-rate compression,” in Proc.
IEEE International Symposium on Information Theory, 2017, pp. 3135–3139.
[28] M. Wigger and R. Timo, “Testing against independence with multiple decision centers,”
in Proc. IEEE International Conference on Signal Processing and Communications, 2016,
pp. 1–5.

[29] S. Salehkalaibar, M. Wigger, and R. Timo, “On hypothesis testing against conditional inde-
pendence with multiple decision centers,” IEEE Trans. Communications, vol. 66, no. 6,
pp. 2409–2420, 2018.
[30] J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor, “Communication-efficient sparse regression: A
one-shot approach,” arXiv:1503.04337, 2015.
[31] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas,
“Communication-efficient learning of deep networks from decentralized data,”
arXiv:1602.05629, 2016.
[32] J. Konečný, “Stochastic, distributed and federated optimization for machine learning,”
arXiv:1707.01155, 2017.
[33] V. M. A. Martin, K. David, and B. Merlinsuganthi, “Distributed data clustering: A compar-
ative analysis,” Int. J. Sci. Res. Computer Science, Engineering and Information Technol.,
vol. 3, no. 3, article CSEITI83376, 2018.
15 Network Functional Compression
Soheil Feizi and Muriel Médard

Summary

In this chapter,1 we study the problem of compressing for function computation across
a network from an information-theoretic point of view. We refer to this problem as
network functional compression. In network functional compression, computation of a
function (or some functions) of sources located at certain nodes in a network is desired
at receiver(s). The rate region of this problem has been considered in the literature
under certain restrictive assumptions, particularly in terms of the network topology,
the functions, and the characteristics of the sources. In this chapter, we present results
that significantly relax these assumptions. For a one-stage tree network, we character-
ize a rate region by introducing a necessary and sufficient condition for any achievable
coloring-based coding scheme called the coloring connectivity condition (CCC). We
also propose a modularized coding scheme based on graph colorings to perform arbi-
trarily closely to derived rate lower bounds. For a general tree network, we provide a rate
lower bound based on graph entropies and show that this bound is tight in the case of hav-
ing independent sources. In particular, we show that, in a general tree network case with
independent sources, to achieve the rate lower bound, intermediate nodes should per-
form computations. However, for a family of functions and random variables, which we
call chain-rule proper sets, it is sufficient to have no computations at intermediate nodes
in order for the system to perform arbitrarily closely to the rate lower bound. Moreover,
we consider practical issues of coloring-based coding schemes and propose an efficient
algorithm to compute a minimum-entropy coloring of a characteristic graph under some
conditions on source distributions and/or the desired function. Finally, extensions of
these results for cases with feedback and lossy function computations are discussed.

15.1 Introduction

In modern applications, data are often stored in clouds in a distributed way (i.e., differ-
ent portions of data are located in different nodes in the cloud/network). Therefore, to
compute certain functions of the data, nodes in the network need to communicate with
each other. Depending on the desired computation, however, nodes can first compress

1 The content of this chapter is based on [1].


the data before transmitting them in the network, providing a gain in communication
costs. We refer to this problem as network functional compression.
In this chapter, we consider different aspects of this problem from an information-
theoretic point of view. In the network functional compression problem, we would like to
compress source random variables for the purpose of computing a deterministic function
(or some deterministic functions) at the receiver(s), when these sources and receivers
are nodes in a network. Traditional data-compression schemes are special cases of func-
tional compression, where the desired function is the identity function. However, if the
receiver is interested in computing a function (or some functions) of sources, further
compression is possible.
Several approaches have been applied to investigate different aspects of this prob-
lem. One class of works considered the functional computation problem for specific
functions. For example, Kowshik and Kumar [2] investigated computation of symmet-
ric Boolean functions in tree networks, and Shenvi and Dey [3] and Ramamoorthy [4]
studied the sum network with three sources and three terminals. Other authors investi-
gated the asymptotic analysis of the transmission rate in noisy broadcast networks [5],
and also in random geometric graph models (e.g., [6, 7]). Also, Ma et al. [8] investi-
gated information-theoretic bounds for multiround function computation in collocated
networks. Network flow techniques (also known as multi-commodity methods) have
been used to study multiple unicast problems [9, 10]. Shah et al. [11] used this frame-
work, with some modifications, for function computation considering communication
constraints.
A major body of work on in-network computation investigates information-theoretic
rate bounds, when a function of sources is desired to be computed at the receiver. These
works can be categorized into the study of lossless functional compression and that
of functional compression with distortion. By lossless computation, we mean asymp-
totically lossless computation of a function: the error probability goes to zero as the
blocklength goes to infinity. However, there are several works investigating zero-error
computation of functions (e.g., [12, 13]).
Shannon was the first to consider the function computation problem in [14] for a spe-
cial case when f (X1 , X2 ) = (X1 , X2 ) (the identity function) and for the network topology
depicted in Fig. 15.1(a) (the side-information problem). For a general function, Orlitsky
and Roche provided a single-letter characterization of the rate region in [15]. In [16],
Doshi et al. proposed an optimal coding scheme for this problem.
For the network topology depicted in Fig. 15.1(b), and for the case in which the
desired function at the receiver is the identity function (i.e., f (X1 , X2 ) = (X1 , X2 )), Slepian

Figure 15.1 (a) Functional compression with side information. (b) A distributed functional compression problem with two transmitters and a receiver.

and Wolf [17] provided a characterization of the rate region and an optimal achievable
coding scheme. Some other practical but sub-optimal coding schemes were proposed
by Pradhan and Ramchandran [18]. Also, a rate-splitting technique for this problem
was developed in [19, 20]. Special cases when f (X1 , X2 ) = X1 and f (X1 , X2 ) = (X1 + X2 )
mod 2 have been investigated by Ahlswede and Körner [21] and by Körner and Marton
[22], respectively. Under some special conditions on source distributions, Doshi et al.
[16] investigated this problem for a general function and proposed some achievable
coding schemes.
There have been several prior works that studied lossy functional compression where
the function at the receiver is desired to be computed within a distortion level. Wyner and
Ziv [23] considered the side-information problem for computing the identity function at
the receiver within some distortion. Yamamoto [24] solved this problem for a general
function f (X1 , X2 ). Doshi et al. [16] gave another characterization of the rate-distortion
function given by Yamamoto. Feng et al. [25] considered the side-information problem
for a general function at the receiver in the case in which the encoder and decoder have
some noisy information. For the distributed function computation problem and for a
general function, the rate-distortion region remains unknown, but some bounds have
been given by Berger and Yeung [26], Barros and Servetto [27], and Wagner et al. [28],
who considered a specific quadratic distortion function.
In this chapter, we present results that significantly relax previously considered restric-
tive assumptions, particularly in terms of the network topology, the functions, and the
characteristics of the sources. For a one-stage tree network, we introduce a necessary and
sufficient condition for any achievable coloring-based coding scheme called the coloring
connectivity condition (CCC), thus relaxing the previous sufficient zigzag condition of
Doshi et al. [16]. By using the CCC, we characterize a rate region for distributed func-
tional compression and propose a modularized coding scheme based on graph colorings
in order for the system to perform arbitrarily closely to rate lower bounds. These results
are presented in Section 15.3.1.
In Section 15.3.2, we consider a general tree network and provide a rate lower bound
based on graph entropies. We show that this bound is tight in the case with independent
sources. In particular, we show that, to achieve the rate lower bound, intermediate nodes
should perform computations. However, for a family of functions and random variables,
which we call chain-rule proper sets, it is sufficient to have intermediate nodes act like
relays (i.e., no computations are performed at intermediate nodes) in order for the system
to perform arbitrarily closely to the rate lower bound.
In Section 15.3.3, we discuss practical issues of coloring-based coding schemes and
propose an efficient algorithm to compute a minimum-entropy coloring of a character-
istic graph under some conditions on source distributions and/or the desired function.
Finally, extensions of proposed results for cases with feedback and lossy function
computations are discussed in Section 15.4. In particular, we show that, in functional
compression, unlike in the Slepian–Wolf case, by having feedback, one may outperform
the rate bounds of the case without feedback. These results extend those of Bakshi
et al. We also present a practical coding scheme for the distributed lossy functional
compression problem with a non-trivial performance guarantee.

15.2 Problem Setup and Prior Work

In this section, we set up the functional compression problem and review some
prior work.

15.2.1 Problem Setup


Consider k discrete memoryless random processes, {X_1^i}_{i=1}^∞, ..., {X_k^i}_{i=1}^∞, as source processes. Memorylessness is not necessary, and one can approximate a source by a memoryless one with an arbitrary precision [29]. Suppose these sources are drawn from finite sets X_1 = {x_1^1, x_1^2, ..., x_1^{|X_1|}}, ..., X_k = {x_k^1, x_k^2, ..., x_k^{|X_k|}}. These sources have a joint probability distribution p(x_1, ..., x_k). We express n-sequences of these random variables as X_1, ..., X_k with the joint probability distribution p(x_1, ..., x_k). To simplify the notation, n will be implied by the context if no confusion arises. We refer to the ith element of x_j as x_{ji}. We use x_j^1, x_j^2, ... as different n-sequences of X_j. We shall omit the superscript when no confusion arises. Since the sequence (x_1, ..., x_k) is drawn i.i.d. according to p(x_1, ..., x_k), one can write p(x_1, ..., x_k) = ∏_{i=1}^n p(x_{1i}, ..., x_{ki}).
Consider a tree network with k source nodes in its leaves and a receiver in its root.
Other nodes of this tree are referred to as intermediate nodes. Source node j has an input random process {X_j^i}_{i=1}^∞. The receiver wishes to compute a deterministic function f : X_1 × ··· × X_k → Z, or its vector extension f : X_1^n × ··· × X_k^n → Z^n.
Note that sources can be at any nodes of the network. However, without loss of gen-
erality, we can modify the network by adding some fake leaves to source nodes which
are not located in leaves of the network. So, in the achieved network, sources are always
located in leaves. Also, by adding some auxiliary nodes, one can make sources be at the
same distance from the receiver. Nodes of this tree are labeled as i for different is, where
source nodes are denoted by {1, . . . , k} and the outgoing link of node i is denoted by ei .
Node i sends M_i over its outgoing edge e_i with a rate R_i (it maps length-n blocks of M_i, referred to as M_i, to {1, 2, ..., 2^{nR_i}}).
For a source node, M_i = en_{X_i}(X_i), where en_{X_i} is the encoding function of the source node i. For an intermediate node i, i ∉ {1, ..., k}, with incoming edges e_j, ..., e_q, M_i = g_i(M_j, ..., M_q), where g_i(·) is a function to be computed in that node.
The receiver maps incoming messages {M_i, ..., M_j} to Z^n by using a function r(·); i.e., r : ∏_{l=i}^{j} {1, ..., 2^{nR_l}} → Z^n. Thus, the receiver computes r(M_i, ..., M_j) = r(en_{X_1}(X_1), ..., en_{X_k}(X_k)). We refer to this encoding/decoding scheme as an n-distributed functional code. Intermediate nodes are allowed to compute functions, but have no demand of their own. The desired function f(X_1, ..., X_k) at the receiver is the only demand in the network. For any encoding/decoding scheme, the probability of error is defined as

    P_e^n = Pr[(x_1, ..., x_k) : f(x_1, ..., x_k) ≠ r(en_{X_1}(x_1), ..., en_{X_k}(x_k))].        (15.1)

A rate tuple of the network is the set of rates of its edges (i.e., {R_i} for all valid i). We say a rate tuple is achievable iff there exists a coding scheme operating at these rates so that P_e^n → 0 as n → ∞. The achievable rate region is the set closure of the set of all achievable
rates.
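To make these definitions concrete, the following toy sketch instantiates an n-distributed functional code for k = 2 and n = 1 and evaluates the probability of error of Eq. (15.1) by direct enumeration. The specific alphabets, encoders en_{X_1}, en_{X_2}, and receiver map r below are illustrative choices of ours, not a construction taken from this chapter.

```python
# A toy sketch (not the chapter's construction) of an n-distributed functional code
# for k = 2 sources with n = 1, following the definitions above.

X1 = [0, 1, 2, 3]
X2 = [0, 1]
p = {(x1, x2): 1.0 / (len(X1) * len(X2)) for x1 in X1 for x2 in X2}  # uniform joint PMF

f = lambda x1, x2: (x1 + x2) % 2            # desired function at the receiver

enX1 = lambda x1: x1 % 2                    # encoder of source 1: 1 bit instead of 2
enX2 = lambda x2: x2                        # encoder of source 2: 1 bit
r = lambda m1, m2: (m1 + m2) % 2            # receiver map on the received messages

# Empirical probability of error, mirroring Eq. (15.1) for this toy code.
Pe = sum(prob for (x1, x2), prob in p.items() if f(x1, x2) != r(enX1(x1), enX2(x2)))
print(f"probability of error = {Pe}")       # 0.0: this code computes f exactly
```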

15.2.2 Definitions and Prior Results


In this part, we present some definitions and review prior results.
definition 15.1 The characteristic (conflict) graph G_{X_1} = (V_{X_1}, E_{X_1}) of X_1 with respect to X_2, p(x_1, x_2), and the function f(X_1, X_2) is defined as follows: V_{X_1} = X_1, and an edge (x_1^1, x_1^2) ∈ X_1^2 is in E_{X_1} iff there exists an x_2^1 ∈ X_2 such that p(x_1^1, x_2^1)p(x_1^2, x_2^1) > 0 and f(x_1^1, x_2^1) ≠ f(x_1^2, x_2^1).
In other words, in order to avoid confusion about the function f(X_1, X_2) at the receiver, if (x_1^1, x_1^2) ∈ E_{X_1}, the descriptions of x_1^1 and x_1^2 must be different. Shannon first defined
this when studying the zero-error capacity of noisy channels [14]. Witsenhausen [30]
used this concept to study a simplified version of the functional compression problem
where one encodes X1 to compute f (X1 ) = X1 with zero distortion and showed that the
chromatic number of the strong graph-product characterizes the rate. The characteris-
tic graph of X2 with respect to X1 , p(x1 , x2 ), and f (X1 , X2 ) is defined analogously and
denoted by G X2 . One can extend the definition of the characteristic graph to the case with
more than two random variables as follows. Suppose X1 , . . . , Xk are k random variables
as defined in Section 15.2.1.

definition 15.2 The characteristic graph G_{X_1} = (V_{X_1}, E_{X_1}) of X_1 with respect to random variables X_2, ..., X_k, p(x_1, ..., x_k), and f(X_1, ..., X_k) is defined as follows: V_{X_1} = X_1, and an edge (x_1^1, x_1^2) ∈ X_1^2 is in E_{X_1} if there exist x_j^1 ∈ X_j for 2 ≤ j ≤ k such that p(x_1^1, x_2^1, ..., x_k^1)p(x_1^2, x_2^1, ..., x_k^1) > 0 and f(x_1^1, x_2^1, ..., x_k^1) ≠ f(x_1^2, x_2^1, ..., x_k^1).

Example 15.1 To illustrate the idea of confusability and the characteristic graph,
consider two random variables X1 and X2 such that X1 = {0, 1, 2, 3} and X2 = {0, 1},
where they are uniformly and independently distributed on their own supports. Sup-
pose f (X1 , X2 ) = (X1 + X2 ) mod 2 is to be perfectly reconstructed at the receiver. Then,
the characteristic graph of X_1 with respect to X_2, p(x_1, x_2) = 1/8, and f is as shown in
Fig. 15.2(a). Similarly, G X2 is depicted in Fig. 15.2(b).
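The construction in Definition 15.1 is easy to reproduce numerically. The short sketch below (our code, with hypothetical helper names) builds the edge set of G_{X_1} for the setup of Example 15.1 by checking the confusability condition for every pair of X_1 values.

```python
from itertools import combinations

# Build G_X1 of Definition 15.1 for Example 15.1: X1 = {0,1,2,3}, X2 = {0,1},
# uniform p(x1, x2) = 1/8, and f(x1, x2) = (x1 + x2) mod 2.

X1 = [0, 1, 2, 3]
X2 = [0, 1]
p = lambda x1, x2: 1.0 / 8                     # strictly positive joint PMF
f = lambda x1, x2: (x1 + x2) % 2

def characteristic_graph_edges(X1, X2, p, f):
    """Edges (u, v) of G_X1: some x2 has p(u, x2)p(v, x2) > 0 and f(u, x2) != f(v, x2)."""
    edges = set()
    for u, v in combinations(X1, 2):
        if any(p(u, x2) * p(v, x2) > 0 and f(u, x2) != f(v, x2) for x2 in X2):
            edges.add((u, v))
    return edges

print(characteristic_graph_edges(X1, X2, p, f))
# The edge set is {(0, 1), (0, 3), (1, 2), (2, 3)}: a 4-cycle, so {0, 2} and {1, 3}
# are its maximal independent sets, matching Fig. 15.2(a).
```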

The following definition can be found in [31].

definition 15.3 Given a graph G_{X_1} = (V_{X_1}, E_{X_1}) and a distribution on its vertices V_{X_1}, the graph entropy is

    H_{G_{X_1}}(X_1) = min_{X_1 ∈ W_1 ∈ Γ(G_{X_1})} I(X_1; W_1),        (15.2)

where Γ(G_{X_1}) is the set of all maximal independent sets of G_{X_1}.


The notation X1 ∈ W1 ∈ Γ(G X1 ) means that we are minimizing over all distributions
p(w1 , x1 ) such that p(w1 , x1 ) > 0 implies x1 ∈ w1 , where w1 is a maximal independent set
of the graph G X1 .
Figure 15.2 Characteristic graphs (a) G_{X_1} and (b) G_{X_2} for the setup of Example 15.1. (Different letters written over graph vertices indicate different colors.)

Example 15.2 Consider the scenario described in Example 15.1. For the charac-
teristic graph of X1 shown in Fig. 15.2(a), the set of maximal independent sets is
W1 = {{0, 2}, {1, 3}}. To minimize I(X1 ; W1 ) = H(X1 ) − H(X1 |W1 ) = log(4) − H(X1 |W1 ),
one should maximize H(X1 |W1 ). Because of the symmetry of the problem, to maxi-
mize H(X1 |W1 ), p(w1 ) must be uniform over two possible maximal independent sets
of G X1. Since each maximal independent set w1 ∈ W1 has two X1 values, H(X1 |w1 ) =
log(2) bit, and since p(w1 ) is uniform, H(X1 |W1 ) = log(2) bit. Therefore, HG X1 (X1 ) =
log(4) − log(2) = 1 bit. One can see that, if we want to encode X1 ignoring the effect of
the function f , we need H(X1 ) = log(4) = 2 bits. We will show that, for this example,
functional compression saves us 1 bit in every 2 bits compared with the traditional data
compression.
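The graph-entropy value in Example 15.2 can be checked numerically. The sketch below (ours) evaluates I(X_1; W_1) for the distribution that places X_1 deterministically in the unique maximal independent set containing it, which, as argued above, attains the minimum in Definition 15.3 for this symmetric example.

```python
from math import log2

# Numerical check of Example 15.2, assuming (as argued in the text) that the
# deterministic assignment of each x1 to its maximal independent set is optimal.

p_x1 = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}            # uniform X1
mis = {0: "A", 1: "B", 2: "A", 3: "B"}                  # W1: A = {0, 2}, B = {1, 3}

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# I(X1; W1) = H(X1) - H(X1 | W1); here H(X1 | W1) = 1 bit since each class has 2 symbols.
p_w = {}
for x, w in mis.items():
    p_w[w] = p_w.get(w, 0) + p_x1[x]
H_X1 = entropy(p_x1)                                    # 2 bits
H_X1_given_W1 = sum(
    p_w[w] * entropy({x: p_x1[x] / p_w[w] for x in mis if mis[x] == w}) for w in p_w
)
print(H_X1 - H_X1_given_W1)                             # 1.0 bit = H_{G_X1}(X1)
```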

Witsenhausen [30] showed that the chromatic number of the strong graph-product
characterizes the minimum rate at which a single source can be encoded so that the
identity function of that source can be computed with zero distortion. Orlitsky and
Roche [15] defined an extension of Körner’s graph entropy, the conditional graph
entropy.
definition 15.4 The conditional graph entropy is

    H_{G_{X_1}}(X_1|X_2) = min_{X_1 ∈ W_1 ∈ Γ(G_{X_1}), W_1 − X_1 − X_2} I(W_1; X_1|X_2).        (15.3)

The notation W1 −X1 −X2 indicates a Markov chain. If X1 and X2 are independent,
HG X1 (X1 |X2 ) = HG X1 (X1 ). To illustrate this concept, let us consider an example borrowed
from [15].

Example 15.3 When f (X1 , X2 ) = X1 , HG X1 (X1 |X2 ) = H(X1 |X2 ).


To show this, consider the characteristic graph of X1 , denoted as G X1. Since
f (X1 , X2 ) = X1 , then, for every x21 ∈ X2 , the set {x1i : p(x1i , x21 ) > 0} of possible x1i are

connected to each other (i.e., this set is a clique of G X1 ). Since the intersection of a
clique and a maximal independent set is a singleton, X2 and the maximal independent
set W1 containing X1 determine X1 . So,
    H_{G_{X_1}}(X_1|X_2) = min_{X_1 ∈ W_1 ∈ Γ(G_{X_1}), W_1 − X_1 − X_2} I(W_1; X_1|X_2)
                         = H(X_1|X_2) − max_{X_1 ∈ W_1 ∈ Γ(G_{X_1})} H(X_1|W_1, X_2)
                         = H(X_1|X_2).        (15.4)

definition 15.5 A vertex coloring of a graph G_{X_1} = (V_{X_1}, E_{X_1}) is a function c_{G_{X_1}}(X_1) : V_{X_1} → N such that (x_1^1, x_1^2) ∈ E_{X_1} implies c_{G_{X_1}}(x_1^1) ≠ c_{G_{X_1}}(x_1^2). The entropy of a coloring is the entropy of the induced distribution on colors. Here, p(c_{G_{X_1}}(x_1^i)) = p(c_{G_{X_1}}^{-1}(c_{G_{X_1}}(x_1^i))), where c_{G_{X_1}}^{-1}(c_{G_{X_1}}(x_1^i)) = {x_1^j : c_{G_{X_1}}(x_1^j) = c_{G_{X_1}}(x_1^i)} for all valid j. This subset of vertices with the same color is called a color class. We refer to a coloring which minimizes the entropy as a minimum-entropy coloring. We use C_{G_{X_1}} as the set of all valid colorings of a graph G_{X_1}.

Example 15.4 Consider again the random variable X1 described in Example 15.1,
whose characteristic graph G X1 and its valid coloring are shown in Fig. 15.2a. One
can see that, in this coloring, two connected vertices are assigned to different colors.
Specifically, cG X1 (X1 ) = {r, b}. Therefore, p(cG X1 (x1i ) = r) = p(x1i = 0) + p(x1i = 2) and
p(cG X1 (x1i ) = b) = p(x1i = 1) + p(x1i = 3).

We define a power graph of a characteristic graph as its co-normal products.

definition 15.6 The nth power of a graph G_{X_1} is a graph G_{X_1}^n = (V_{X_1}^n, E_{X_1}^n) such that V_{X_1}^n = X_1^n and (x_1^1, x_1^2) ∈ E_{X_1}^n when there exists at least one i such that (x_{1i}^1, x_{1i}^2) ∈ E_{X_1}. We denote a valid coloring of G_{X_1}^n by c_{G_{X_1}^n}(X_1).
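Definition 15.6 (the co-normal, or OR, product) can be implemented directly by brute force for small alphabets, as in the following sketch; the function name power_graph and the example base graph are ours.

```python
from itertools import product, combinations

# Sketch of Definition 15.6: vertices of G^n are n-tuples, and two tuples are adjacent
# when they conflict (are adjacent in the base graph) in at least one coordinate.

def power_graph(vertices, edges, n):
    """Return vertices and edges of G^n given the base graph (vertices, edges)."""
    base = {frozenset(e) for e in edges}
    Vn = list(product(vertices, repeat=n))
    En = {
        (u, v)
        for u, v in combinations(Vn, 2)
        if any(frozenset((ui, vi)) in base for ui, vi in zip(u, v))
    }
    return Vn, En

# Base graph: the 4-cycle G_X1 from Example 15.1.
V, E = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (0, 3)]
V2, E2 = power_graph(V, E, 2)
print(len(V2), len(E2))   # 16 vertices; any valid coloring of G^2 respects G coordinatewise
```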

One may ignore atypical sequences in a sufficiently large power graph of a conflict
graph and then color that graph. This coloring is called an -coloring of a graph and is
defined as follows.

definition 15.7 Given a non-empty set A ⊂ X_1 × X_2, define p̂(x_1, x_2) = p(x_1, x_2)/p(A) when (x_1, x_2) ∈ A, and p̂(x_1, x_2) = 0 otherwise. p̂ is the distribution over (x_1, x_2) conditioned on (x_1, x_2) ∈ A. Denote the characteristic graph of X_1 with respect to X_2, p̂(x_1, x_2), and f(X_1, X_2) as Ĝ_{X_1} = (V̂_{X_1}, Ê_{X_1}), and the characteristic graph of X_2 with respect to X_1, p̂(x_1, x_2), and f(X_1, X_2) as Ĝ_{X_2} = (V̂_{X_2}, Ê_{X_2}). Note that Ê_{X_1} ⊆ E_{X_1} and Ê_{X_2} ⊆ E_{X_2}. Suppose p(A) ≥ 1 − ε. We say that c_{Ĝ_{X_1}}(X_1) and c_{Ĝ_{X_2}}(X_2) are ε-colorings of G_{X_1} and G_{X_2} if they are valid colorings of Ĝ_{X_1} and Ĝ_{X_2}.

In [32], the chromatic entropy of a graph G X1 is defined as follows.

definition 15.8

    H^χ_{G_{X_1}}(X_1) = min_{c_{G_{X_1}} is an ε-coloring of G_{X_1}} H(c_{G_{X_1}}(X_1)).

The chromatic entropy is a representation of the chromatic number of high-probability


subgraphs of the characteristic graph. In [16], the conditional chromatic entropy is
defined as follows.

definition 15.9

    H^χ_{G_{X_1}}(X_1|X_2) = min_{c_{G_{X_1}} is an ε-coloring of G_{X_1}} H(c_{G_{X_1}}(X_1)|X_2).

Regardless of ε, the above optimizations are minima, rather than infima, because there are finitely many subgraphs of any fixed graph G_{X_1}, and therefore there are only finitely many ε-colorings, regardless of ε.
In general, these optimizations are NP-hard [33]. But, depending on the desired func-
tion f , there are some interesting cases for which optimal solutions can be computed
efficiently. We discuss these cases in Section 15.3.3.
Körner showed in [31] that, in the limit of large n, there is a relation between the
chromatic entropy and the graph entropy.
theorem 15.1

    lim_{n→∞} (1/n) H^χ_{G_{X_1}^n}(X_1) = H_{G_{X_1}}(X_1).        (15.5)
This theorem implies that the receiver can asymptotically compute a deterministic
function of a discrete memoryless source. The source first colors a sufficiently large
power of the characteristic graph of the random variable with respect to the func-
tion, and then encodes achieved colors using any encoding scheme that achieves the
entropy bound of the coloring random variable. In the previous approach, to achieve
the encoding rate close to graph entropy of X1 , one should find the optimal distribu-
tion over the set of maximal independent sets of G X1. However, this theorem allows
us to find the optimal coloring of G_{X_1}^n, instead of the optimal distribution on maximal independent sets. One can see that this approach modularizes the encoding scheme
into two parts, a graph-coloring module, followed by a Slepian–Wolf compression
module.
The conditional version of the above theorem is proven in [16].
theorem 15.2

    lim_{n→∞} (1/n) H^χ_{G_{X_1}^n}(X_1|X_2) = H_{G_{X_1}}(X_1|X_2).        (15.6)
This theorem implies a practical encoding scheme for the problem of functional com-
pression with side information where the receiver wishes to compute f (X1 , X2 ), when
X2 is available at the receiver as the side information. Orlitsky and Roche showed in
[15] that HG X1 (X1 |X2 ) is the minimum achievable rate for this problem. Their proof uses

random coding arguments and shows the existence of an optimal coding scheme. This
theorem presents a modularized encoding scheme where one first finds the minimum-
entropy coloring of G_{X_1}^n for large enough n, and then uses a compression scheme on the coloring random variable (such as the Slepian–Wolf scheme in [17]) to achieve a rate arbitrarily close to H(c_{G_{X_1}^n}(X_1)|X_2). This encoding scheme guarantees computation of
the function at the receiver with a vanishing probability of error.
All these results considered only functional compression with side information at the
receiver (Fig. 15.1(a)). In general, the rate region of the distributed functional com-
pression problem (Fig. 15.1(b)) has not been determined. However, Doshi et al. [16]
characterized a rate region of this network when source random variables satisfy a
condition called the zigzag condition, defined below.
We refer to the ε-joint-typical set of sequences of random variables X_1, ..., X_k as T_ε^n. k is implied in this notation for simplicity. T_ε^n can be considered as a strong or weak typical set [29].
definition 15.10 A discrete memoryless source {(X_{1i}, X_{2i})}_{i∈N} with a distribution p(x_1, x_2) satisfies the zigzag condition if, for any ε and some n, for any (x_1^1, x_2^1), (x_1^2, x_2^2) ∈ T_ε^n, there exists some (x_1^3, x_2^3) ∈ T_ε^n such that (x_1^3, x_2^i), (x_1^i, x_2^3) ∈ T_ε^n for each i ∈ {1, 2}, and (x_{1j}^3, x_{2j}^3) = (x_{1j}^i, x_{2j}^{3−i}) for some i ∈ {1, 2} for each j.

In fact, the zigzag condition forces many source sequences to be typical. Doshi et al.
[16] show that, if the source random variables satisfy the zigzag condition, an achievable
rate region for this network is the set of all rates that can be achieved through graph
colorings. The zigzag condition is a restrictive condition which does not depend on the
desired function at the receiver. This condition is not necessary, but it is sufficient. In the
next section, we relax this condition by introducing a necessary and sufficient condition
for any achievable coloring-based coding scheme and characterize a rate region for the
distributed functional compression problem.

15.3 Network Functional Compression

In this section, we present the main results for network functional compression.

15.3.1 A Rate Region for One-Stage Tree Networks


In this section, we compute a rate region for a general one-stage tree network without
any restrictive conditions such as the zigzag condition.
Consider a one-stage tree network with k sources.
definition 15.11 A path with length m between two points Z1 = (x11 , x21 , . . . , xk1 ) and
Zm = (x12 , x22 , . . . , xk2 ) is determined by m − 1 points Zi , 1 ≤ i ≤ m, such that
(i) Pr(Zi ) > 0, for all 1 ≤ i ≤ m; and
(ii) Zi and Zi+1 differ in only one of their coordinates.

Definition 15.11 can be generalized to two length-n vectors as follows.


definition 15.12 A path with length m between two points Z1 = (x11 , x12 , . . . , x1k ) ∈ T n and
Zm = (x21 , x22 , . . . , x2k ) ∈ T n is determined by m − 1 points Zi , 1 ≤ i ≤ m such that
(i) Zi ∈ T n , for all 1 ≤ i ≤ m; and
(ii) Zi and Zi+1 differ in only one of their coordinates.
Note that each coordinate of Zi is a vector with length n.
definition 15.13 A joint coloring family J_C for random variables X_1, ..., X_k with characteristic graphs G_{X_1}, ..., G_{X_k}, and any valid respective colorings c_{G_{X_1}}, ..., c_{G_{X_k}}, is defined as J_C = {j_c^1, ..., j_c^l}, where j_c^i is the collection of points (x_1^{i_1}, x_2^{i_2}, ..., x_k^{i_k}) whose coordinates have the same color. Each j_c^i is called a joint coloring class.
We say a joint coloring class jic is connected if, between any two points in jic , there
exists a path that lies in jic . Otherwise, it is disconnected. Definition 15.11 can be
expressed for random vectors X_1, ..., X_k with characteristic graphs G_{X_1}^n, ..., G_{X_k}^n and any valid respective ε-colorings c_{G_{X_1}^n}, ..., c_{G_{X_k}^n}.
In the following, we present the coloring connectivity condition (CCC) which is a
necessary and sufficient condition for any coloring-based coding scheme.
definition 15.14 Consider random variables X1 , . . . , Xk with characteristic graphs
G X1 , . . . , G Xk , and any valid colorings cG X1 , . . . , cG Xk . We say a joint coloring class
jic ∈ JC satisfies the coloring connectivity condition (CCC) when it is connected, or
its disconnected parts have the same function values. We say colorings cG X1 , . . . , cG Xk
satisfy the CCC when all joint coloring classes satisfy the CCC.
The CCC can be expressed for random vectors X_1, ..., X_k with characteristic graphs G_{X_1}^n, ..., G_{X_k}^n, and any valid respective ε-colorings c_{G_{X_1}^n}, ..., c_{G_{X_k}^n}.

Example 15.5 Suppose we have two random variables X1 and X2 with characteris-
tic graphs G X1 and G X2 . Let us assume cG X1 and cG X2 are two valid colorings of G X1
and G_{X_2}, respectively. Assume c_{G_{X_1}}(x_1^1) = c_{G_{X_1}}(x_1^2) and c_{G_{X_2}}(x_2^1) = c_{G_{X_2}}(x_2^2). Suppose j_c^1 represents this joint coloring class. In other words, j_c^1 = {(x_1^i, x_2^j)}, for all 1 ≤ i, j ≤ 2 when p(x_1^i, x_2^j) > 0. Figure 15.3 considers two different cases. The first case is when p(x_1^1, x_2^2) = 0, and other points have a non-zero probability. It is illustrated in Fig. 15.3(a).

One can see that there exists a path between any two points in this joint coloring class.
Therefore, this joint coloring class satisfies the CCC. If other joint coloring classes of
cG X1 and cG X2 satisfy the CCC, we say cG X1 and cG X2 satisfy the CCC. Now, consider the
second case depicted in Fig. 15.3(b). In this case, we have p(x11 , x22 ) = 0, p(x12 , x21 ) = 0,
and other points have a non-zero probability. One can see that there is no path between
(x_1^1, x_2^1) and (x_1^2, x_2^2) in j_c^1. So, although these two points belong to the same joint coloring class, their corresponding function values can be different from each other. Thus, j_c^1 does
not satisfy the CCC for this example. Therefore, cG X1 and cG X2 do not satisfy the CCC.
Figure 15.3 Two examples of a joint coloring class: (a) satisfying the CCC and (b) not satisfying the CCC. Dark squares indicate points with zero probability. Function values are depicted in the picture.
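The CCC of Definition 15.14 can be checked mechanically for small examples: within each joint coloring class, connect points that differ in exactly one coordinate, and then verify that either the class is connected or its parts share the same function values. The sketch below (our code and data structures, not the chapter's) reproduces the two cases of Example 15.5.

```python
from itertools import combinations
from collections import defaultdict

def ccc_holds(support, fvals, c1, c2):
    """support: set of (x1, x2) with p > 0; fvals[(x1, x2)]: function value;
    c1, c2: colorings of X1 and X2. Returns True iff every joint coloring class
    is connected by single-coordinate moves, or its parts share function values."""
    classes = defaultdict(list)
    for x1, x2 in support:
        classes[(c1[x1], c2[x2])].append((x1, x2))
    for pts in classes.values():
        parent = {q: q for q in pts}              # union-find over the class
        def find(q):
            while parent[q] != q:
                q = parent[q]
            return q
        for a, b in combinations(pts, 2):
            if (a[0] == b[0]) != (a[1] == b[1]):  # differ in exactly one coordinate
                parent[find(a)] = find(b)
        parts = defaultdict(set)
        for q in pts:
            parts[find(q)].add(fvals[q])
        if len(parts) > 1 and len({frozenset(v) for v in parts.values()}) > 1:
            return False
    return True

c1, c2 = {1: "red", 2: "red"}, {1: "blue", 2: "blue"}   # a single joint coloring class
case_a = {(1, 1): 0, (2, 1): 0, (2, 2): 0}               # Fig. 15.3(a): connected
case_b = {(1, 1): 0, (2, 2): 1}                          # Fig. 15.3(b): disconnected, different values
print(ccc_holds(set(case_a), case_a, c1, c2))            # True
print(ccc_holds(set(case_b), case_b, c1, c2))            # False
```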

There are several examples of source distributions and functions that satisfy the
CCC.
lemma 15.1 Consider two random variables X1 and X2 with characteristic graphs G X1
and G X2 and any valid colorings cG X1 (X1 ) and cG X2 (X2 ), respectively, where cG X2 (X2 ) is
a trivial coloring, assigning different colors to different vertices (to simplify the notation,
we use cG X2 (X2 ) = X2 to refer to this coloring). These colorings satisfy the CCC. Also,
c_{G_{X_1}^n}(X_1) and c_{G_{X_2}^n}(X_2) = X_2 satisfy the CCC, for any n.

Proof of this lemma is presented in Section 15.5.1.


The following lemmas demonstrate why the CCC is a necessary and sufficient
condition for any achievable coloring-based coding scheme.
lemma 15.2 Consider random variables X1 , . . . , Xk with characteristic graphs G X1 , . . . ,
G Xk , and any valid colorings cG X1 , . . . , cG Xk with joint coloring class JC = { jic : i}. For
any two points (x11 , . . . , xk1 ) and (x12 , . . . , xk2 ) in jic , f (x11 , . . . , xk1 ) = f (x12 , . . . , xk2 ) if and only
if jic satisfies the CCC.
Proof of this lemma is presented in Section 15.5.2.
lemma 15.3 Consider random variables X_1, ..., X_k with characteristic graphs G_{X_1}^n, ..., G_{X_k}^n, and any valid ε-colorings c_{G_{X_1}^n}, ..., c_{G_{X_k}^n} with joint coloring class J_C = {j_c^i : i}. For any two points (x_1^1, ..., x_k^1) and (x_1^2, ..., x_k^2) in j_c^i, f(x_1^1, ..., x_k^1) = f(x_1^2, ..., x_k^2) if and only if j_c^i satisfies the CCC.
Proof of this lemma is presented in Section 15.5.3.
Next, we show that, if X1 and X2 satisfy the zigzag condition given in Definition
15.10, any valid colorings of their characteristic graphs satisfy the CCC, but not vice
versa. In other words, we show that the zigzag condition used in [16] is sufficient but not
necessary.

lemma 15.4 If two random variables X1 and X2 with characteristic graphs G X1 and G X2
satisfy the zigzag condition, any valid colorings cG X1 and cG X2 of G X1 and G X2 satisfy
the CCC, but not vice versa.
Proof of this lemma is presented in Section 15.5.4.
We use the CCC to characterize a rate region of functional compression for a one-stage
tree network as follows.
definition 15.15 For random variables X_1, ..., X_k with characteristic graphs G_{X_1}, ..., G_{X_k}, the joint graph entropy is defined as follows:

    H_{G_{X_1},...,G_{X_k}}(X_1, ..., X_k) ≜ lim_{n→∞} min_{c_{G_{X_1}^n},...,c_{G_{X_k}^n}} (1/n) H(c_{G_{X_1}^n}(X_1), ..., c_{G_{X_k}^n}(X_k)),        (15.7)

in which c_{G_{X_1}^n}(X_1), ..., c_{G_{X_k}^n}(X_k) are ε-colorings of G_{X_1}^n, ..., G_{X_k}^n satisfying the CCC.
We refer to this joint graph entropy as H[G Xi ]i∈S , where S = {1, 2, . . . , k}. Note that this
limit exists because we have a monotonically decreasing sequence bounded from below.
Similarly, we can define the conditional graph entropy.
definition 15.16 For random variables X_1, ..., X_k with characteristic graphs G_{X_1}, ..., G_{X_k}, the conditional graph entropy can be defined as follows:

    H_{G_{X_1},...,G_{X_i}}(X_1, ..., X_i | X_{i+1}, ..., X_k)
        ≜ lim_{n→∞} min (1/n) H(c_{G_{X_1}^n}(X_1), ..., c_{G_{X_i}^n}(X_i) | c_{G_{X_{i+1}}^n}(X_{i+1}), ..., c_{G_{X_k}^n}(X_k)),        (15.8)

where the minimization is over c_{G_{X_1}^n}(X_1), ..., c_{G_{X_k}^n}(X_k), which are ε-colorings of G_{X_1}^n, ..., G_{X_k}^n satisfying the CCC.

lemma 15.5 For k = 2, Definitions 15.4 and 15.16 are the same.
Proof of this lemma is presented in Section 15.5.5.
Note that, by this definition, the graph entropy does not satisfy the chain rule.
Suppose S(k) denotes the power set of the set {1, 2, . . . , k} excluding the empty subset.
Then, for any S ∈ S(k),

    X_S ≜ {X_i : i ∈ S}.

Let S c denote the complement of S in S(k). For S = {1, 2, . . . , k}, denote S c as the
empty set. To simplify the notation, we refer to a subset of sources by XS . For instance,
S(2) = {{1}, {2}, {1, 2}}, and for S = {1, 2}, we write H[G Xi ]i∈S (XS ) instead of HG X1 ,G X2
(X1 , X2 ).
theorem 15.3 A rate region of a one-stage tree network is characterized by the following conditions:

    ∀S ∈ S(k) =⇒ ∑_{i∈S} R_i ≥ H_{[G_{X_i}]_{i∈S}}(X_S | X_{S^c}).        (15.9)

Proof of this theorem is presented in Section 15.5.6.


If we have two transmitters (k = 2), Theorem 15.3 can be simplified as follows.

corollary 15.1 A rate region of the network shown in Fig. 15.1(b) is determined by
the following three conditions:

R11 ≥ HG X1 (X1 |X2 ),


R12 ≥ HG X2 (X2 |X1 ), (15.10)
R11 + R12 ≥ HG X1 ,G X2 (X1 , X2 ).

Algorithm 15.1
The following algorithm proposes a modularized coding scheme which performs arbitrarily closely to the rate bounds of Theorem 15.3.
• Source nodes compute ε-colorings of a sufficiently large power of their characteristic graphs satisfying the CCC, followed by Slepian–Wolf compression.
• The receiver first uses a Slepian–Wolf decoder to decode transmitted coloring
variables. Then, it uses a look-up table to compute the function values.

The achievability proof of this algorithm directly follows from the proof of
Theorem 15.3.
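The following toy sketch illustrates the two modules of Algorithm 15.1 for the setup of Example 15.1 with n = 1. For brevity the Slepian–Wolf stage is replaced by transmitting the color indices directly, so only the graph-coloring and look-up-table modules are shown; the colorings, helper names, and data layout are ours.

```python
from itertools import product

X1, X2 = [0, 1, 2, 3], [0, 1]
f = lambda x1, x2: (x1 + x2) % 2

# Valid colorings of G_X1 and G_X2 (Fig. 15.2) satisfying the CCC for this example.
cG1 = {0: "r", 1: "b", 2: "r", 3: "b"}
cG2 = {0: "R", 1: "B"}

# Receiver's look-up table: joint color class -> the (unique, by the CCC) function value.
lookup = {}
for x1, x2 in product(X1, X2):
    lookup.setdefault((cG1[x1], cG2[x2]), set()).add(f(x1, x2))
assert all(len(v) == 1 for v in lookup.values())   # colors suffice to determine f
lookup = {key: v.pop() for key, v in lookup.items()}

# Encoding/decoding one realization.
x1, x2 = 3, 1
m1, m2 = cG1[x1], cG2[x2]            # transmitted "messages" (colors)
print(lookup[(m1, m2)], f(x1, x2))   # 0 0 -> the decoder recovers f(x1, x2)
```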

15.3.2 A Rate Lower Bound for a General Tree Network


In this section, we consider a general tree structure with intermediate nodes that are
allowed to perform computations. However, to simplify the notation, we limit the argu-
ments to the tree structure of Fig. 15.4. Note that all discussions can be extended to a
general tree network.

Figure 15.4 An example of a two-stage tree network. (Source nodes X_1, ..., X_4 with outgoing rates R_1, ..., R_4 feed two intermediate nodes with outgoing rates R_5 and R_6, which connect to the receiver computing f(X_1, X_2, X_3, X_4).)

The problem of function computations for a general tree network has been considered
in [13, 34]. Kowshik and Kumar [13] derive a necessary and sufficient condition for the
encoders on each edge of the tree for a zero-error computation of the desired function.
Appuswamy et al. [34] show that, for a tree network with independent sources, a min-
cut rate is a tight upper bound. Here, we consider an asymptotically lossless functional
compression problem. For a general tree network with correlated sources, we derive rate
bounds using graph entropies. We show that these rates are achievable for the case of
independent sources and propose a modularized coding scheme based on graph color-
ings that performs arbitrarily closely to rate bounds. We also show that, for a family of
functions and random variables, which we call chain-rule proper sets, it is sufficient to
have no computations at intermediate nodes in order for the system to perform arbitrarily
closely to the rate lower bound.
In the tree network depicted in Fig. 15.4, nodes {1, . . . , 4} represent source nodes,
nodes {5, 6} are intermediate nodes, and node 7 is the receiver. The receiver wishes to
compute a deterministic function of source random variables. Intermediate nodes have
no demand of their own, but they are allowed to perform computation. Computing the
desired function f at the receiver is the only demand of the network. For this network, we
compute a rate lower bound and show that this bound is tight in the case of independent
sources. We also propose a modularized coding scheme to perform arbitrarily closely to
derived rate lower bounds in this case.
Sources transmit variables M1 , . . . , M4 through links e1 , . . . , e4 , respectively. Interme-
diate nodes transmit variables M5 and M6 over e5 and e6 , respectively, where M5 =
g5 (M1 , M2 ) and M6 = g6 (M3 , M4 ).
Let S(4) and S(5, 6) be the power sets of the set {1, . . . , 4} and the set {5, 6} except the
empty set, respectively.
theorem 15.4 A rate lower bound for the tree network of Fig. 15.4 can be characterized as follows:

    ∀S ∈ S(4) =⇒ ∑_{i∈S} R_i ≥ H_{[G_{X_i}]_{i∈S}}(X_S | X_{S^c}),

    ∀S ∈ S(5, 6) =⇒ ∑_{i∈S} R_i ≥ H_{[G_{X_i}]_{i∈S}}(X_S | X_{S^c}).        (15.11)

Proof of this theorem is presented in Section 15.5.7. Note that the result of
Theorem 15.4 can be extended to a general tree network topology.
In the following, we show that, for independent source variables, the rate bounds
of Theorem 15.4 are tight, and we propose a coding scheme that performs arbitrarily
closely to these bounds.

Tightness of the Rate Lower Bound for Independent Sources


Suppose random variables X_1, ..., X_4 with characteristic graphs G_{X_1}^n, ..., G_{X_4}^n are independent. Assume c_{G_{X_1}^n}, ..., c_{G_{X_4}^n} are valid ε-colorings of these characteristic graphs satisfying the CCC. The following coding scheme performs arbitrarily closely to the rate
bounds of Theorem 15.4.
Source nodes first compute colorings of high-probability subgraphs of their character-
istic graphs satisfying CCC, and then perform source coding on these coloring random
variables. Intermediate nodes first compute their parents’ coloring random variables,
and then, by using a look-up table, find corresponding source values of their received
colorings. Then, they compute -colorings of their own characteristic graphs. The corre-
sponding source values of their received colorings form an independent set in the graph.
If all are assigned to a single color in the minimum-entropy coloring, intermediate nodes
send this coloring random variable followed by a source coding. However, if vertices of
this independent set are assigned to different colors, intermediate nodes send the color-
ing with the lowest entropy followed by source coding (Slepian–Wolf). The receiver first
performs a minimum-entropy decoding [29] on its received information and achieves
coloring random variables. Then, it uses a look-up table to compute its desired function
by using the achieved colorings.
In the following, we summarize this proposed algorithm.

Algorithm 15.2
The following algorithm proposes a modularized coding scheme which performs
arbitrarily closely to the rate bounds of Theorem 15.4 when the sources are independent.
• Source nodes compute ε-colorings of a sufficiently large power of their characteristic graphs satisfying the CCC, followed by Slepian–Wolf compression.
• Intermediate nodes compute ε-colorings of a sufficiently large power of their characteristic graphs by using their parents' colorings.
• The receiver first uses a Slepian–Wolf decoder to decode transmitted coloring
variables. Then, it uses a look-up table to compute the function values.

The achievability proof of this algorithm is presented in Section 15.5.8. Also, in Sec-
tion 15.3.3, we show that minimum-entropy colorings of independent random variables
can be computed efficiently.

A Case When Intermediate Nodes Do Not Need to Compute


Though the proposed coding scheme in Algorithm 15.2 can perform arbitrarily closely
to the rate lower bound, it may require computation at intermediate nodes. Here, we
show that, for a family of functions and random variables, intermediate nodes do not
need to perform computations.
definition 15.17 Suppose f(X_1, ..., X_k) is a deterministic function of random variables X_1, ..., X_k. (f, X_1, ..., X_k) is called a chain-rule proper set when, for any S ∈ S(k), H_{[G_{X_i}]_{i∈S}}(X_S) = ∑_{s∈S} H_{G_{X_s}}(X_s).

theorem 15.5 In a general tree network, if sources X1 , . . . , Xk are independent random


variables and ( f, X1 , . . . , Xk ) is a chain-rule proper set, it is sufficient to have intermedi-
ate nodes as relays (with no computations) in order for the system to perform arbitrarily
closely to the rate lower bound mentioned in Theorem 15.4.
Proof of this theorem is presented in Section 15.5.9.
In the following lemma, we provide a sufficient condition to guarantee that a set is a
chain-rule proper set.
lemma 15.6 Suppose X_1 and X_2 are independent and f(X_1, X_2) is a deterministic function. If, for any x_2^1 and x_2^2 in X_2, we have f(x_1^i, x_2^1) ≠ f(x_1^j, x_2^2) for any possible i and j, then (f, X_1, X_2) is a chain-rule proper set.
Proof of this lemma is presented in Section 15.5.10.

15.3.3 Polynomial Time Cases for Finding the Minimum-Entropy Coloring


of a Characteristic Graph
In the proposed coding schemes of Sections 15.3.1 and 15.3.2, one needs to compute a
minimum-entropy coloring (a coloring random variable which minimizes the entropy)
of a characteristic graph. In general, finding this coloring is an NP-hard problem (as
shown by Cardinal et al. [33]). However, in this section we show that, depending on
the characteristic graph’s structure, there are some interesting cases where finding the
minimum-entropy coloring is not NP-hard, but tractable and practical. In one of these
cases, we show that having a non-zero joint probability condition on random variables’
distributions, for any desired function f , means that the characteristic graphs are formed
of some non-overlapping fully connected maximal independent sets. We will show that,
in this case, the minimum-entropy coloring can be computed in polynomial time, accord-
ing to Algorithm 15.3. In another case, we show that, if the function we seek to compute
is a type of quantization function, this problem is also tractable.
For simplicity, we consider functions with two input random variables, but one can extend all of the discussions to functions with more than two input random variables. Suppose c_{G_{X_1}}^{min} represents a minimum-entropy coloring of a characteristic graph G_{X_1}.

Non-Zero Joint Probability Distribution Condition


Consider the network shown in Fig. 15.1(b). Source random variables have a joint prob-
ability distribution p(x1 , x2 ), and the receiver wishes to compute a deterministic function
of sources (i.e., f (X1 , X2 )). In this section, we consider the effect of the probability
distribution of sources in computations of minimum-entropy colorings.
theorem 15.6 Suppose that, for all (x1 , x2 ) ∈ X1 × X2 , p(x1 , x2 ) > 0. Then, the maxi-
mally independent sets of the characteristic graph G_{X_1} (and its nth power G_{X_1}^n, for any n) are some non-overlapping fully connected sets. Under this condition, the minimum-
entropy coloring can be achieved by assigning different colors to its different maximally
independent sets as described in Algorithm 15.3.
Figure 15.5 Having non-zero joint probability condition is necessary for Theorem 15.6. A dark square represents a zero-probability point. (The depicted function values f(x_1^i, x_2^j), with rows x_1^1, x_1^2, x_1^3 and columns x_2^1, x_2^2, x_2^3, are: 1, 2, 3; 1, [dark square], 3; and 1, 4, 3.)

Here are some remarks about Theorem 15.6.


• The condition p(x1 , x2 ) > 0, for all (x1 , x2 ) ∈ X1 × X2 , is a necessary condition for
Theorem 15.6. In order to illustrate this, consider Fig. 15.5. In this example, x11 , x12 ,
and x13 are in X1 , and x21 , x22 , and x23 are in X2 . Suppose p(x12 , x22 ) = 0. By considering
the value of the function f at these points depicted in the figure, one can see that,
in G X1 , x12 is not connected to x11 and x13 . However, x11 and x13 are connected to each
other. Thus, Theorem 15.6 does not hold here.
• The condition used in Theorem 15.6 merely restricts the probability distribution
and it does not depend on the function f . Thus, for any function f at the receiver,
if we have a non-zero joint probability distribution of source random variables (for
example, when the source random variables are independent), finding the minimum-
entropy coloring is easy and tractable.
• Orlitsky and Roche [15] showed that, for the side-information problem, having a
non-zero joint probability condition yields a simplified graph entropy calculation.
Here, we show that, under this condition, characteristic graphs have certain struc-
tures so that optimal coloring-based coding schemes that perform arbitrarily closely
to rate lower bounds can be designed efficiently.

Quantization Functions
In this section, we consider some special functions which lead to practical minimum-
entropy coloring computation.
An interesting function in this context is a quantization function. A natural quan-
tization function is a function which separates the X1 −X2 plane into some rectangles
so that each rectangle corresponds to a different value of that function. The sides of
these rectangles are parallel to the plane axes. Figure 15.6(a) depicts such a quantization
function.
Figure 15.6 (a) A quantization function. Function values are depicted in the figure on each rectangle. (b) By extending the sides of rectangles, the plane is covered by some function regions.

Given a quantization function, one can extend different sides of each rectangle in
the X1 −X2 plane. This may make some new rectangles. We call each of them a func-
tion region. Each function region can be determined by two subsets of X1 and X2 . For
example, in Fig. 15.6(b), one of the function regions is distinguished by the shaded area.
definition 15.18 Consider two function regions X_1^1 × X_2^1 and X_1^2 × X_2^2. If, for any x_1^1 ∈ X_1^1 and x_1^2 ∈ X_1^2, there exists x_2^1 such that p(x_1^1, x_2^1)p(x_1^2, x_2^1) > 0 and f(x_1^1, x_2^1) ≠ f(x_1^2, x_2^1), we say these two function regions are pair-wise X_1-proper.
theorem 15.7 Consider a quantization function f such that its function regions are
pair-wise X_1-proper. Then, G_{X_1} (and G_{X_1}^n, for any n) is formed of some non-overlapping fully connected maximally independent sets, and its minimum-entropy coloring can be
achieved by assigning different colors to different maximally independent sets.
Proof of this theorem is presented in Section 15.5.12.
Note that, without X1 -proper condition of Theorem 15.7, assigning different colors to
different partitions still leads to an achievable coloring scheme. However, it is not nec-
essarily a minimum-entropy coloring. In other words, without this condition, maximally
independent sets may overlap.
corollary 15.2 If a function f is strictly monotonic with respect to X_1, and p(x_1, x_2) ≠ 0 for all x_1 ∈ X_1 and x_2 ∈ X_2, then G_{X_1} (and G_{X_1}^n, for any n) is a complete graph.

Under the conditions of Corollary 15.2, functional compression does not give us
any gain, because, in a complete graph, one should assign different colors to different
vertices. Traditional compression where f is the identity function is a special case of
Corollary 15.2.
Section 15.3.3 presents conditions on either source probability distributions and/or
the desired function such that characteristic graphs of random variables are composed of

fully connected non-overlapping maximally independent sets. The following algorithm


shows how a minimum-entropy coloring can be computed for these graphs in polynomial
time complexity.

Algorithm 15.3
Suppose G X1 = (V, E) is a graph composed of fully connected non-overlapping maxi-
mally independent sets and Ḡ X1 = (V, Ē) represents its complement, where E and Ē are
partitions of complete graph edges. Say C is the set of used colors formed as follows.
• Choose a node v ∈ V.
• Color node v and its neighbors in the graph Ḡ_{X_1} by a color c_v such that c_v ∉ C.
• Add c_v to C. Repeat until all nodes are colored.
This algorithm finds a minimum-entropy coloring of G_{X_1} in polynomial time with respect to the number of vertices of G_{X_1}.

The achievability proof of this algorithm is presented in Section 15.5.13.
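Algorithm 15.3 amounts to assigning one fresh color to each vertex together with its neighborhood in the complement graph; under the structure guaranteed by Theorems 15.6 and 15.7 these neighborhoods are exactly the maximal independent sets. A minimal sketch (our code and names) is given below.

```python
def min_entropy_coloring(vertices, edges):
    """Color G_X1 as in Algorithm 15.3, assuming G_X1 is made of non-overlapping,
    mutually fully connected maximal independent sets."""
    edge_set = {frozenset(e) for e in edges}
    # Neighbors in the complement graph \bar{G}_X1.
    comp_nbrs = {
        v: {u for u in vertices if u != v and frozenset((u, v)) not in edge_set}
        for v in vertices
    }
    coloring, next_color = {}, 0
    for v in vertices:                     # "choose a node v" step of Algorithm 15.3
        if v in coloring:
            continue
        for u in {v} | comp_nbrs[v]:       # v and its complement-graph neighbors
            coloring[u] = next_color       # all receive one fresh, unused color
        next_color += 1
    return coloring

# Example: the 4-cycle G_X1 of Fig. 15.2(a), whose maximal independent sets
# {0, 2} and {1, 3} are non-overlapping and fully connected to each other.
print(min_entropy_coloring([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (0, 3)]))
# {0: 0, 2: 0, 1: 1, 3: 1}
```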

15.4 Discussion and Future Work

In this section, we discuss other aspects of the functional compression problem such
as the effect of having feedback and lossy computations. First, by presenting an exam-
ple, we show that, unlike in the Slepian–Wolf case, by having feedback in functional
compression, one can outperform the rate bounds of the case without feedback. Then,
we investigate the problem of distributed functional compression with distortion, where
computation of a function within a distortion level is desired at the receiver. Here, we
propose a simple sub-optimal coding scheme with a non-trivial performance guarantee.
Finally, we explain some future research directions.

15.4.1 Feedback in Functional Compression


If the function at the receiver is the identity function, the functional compression prob-
lem is Slepian–Wolf compression with feedback. For this case, having feedback does
not improve the rate bounds. For example, see [12], which considers both zero-error
and asymptotically zero-error Slepian–Wolf compression with feedback. However, here
by presenting an example we show that, for a general desired function at the receiver,
having feedback can improve the rate bounds of the case without feedback.

Example 15.6 Consider a distributed functional compression problem with two sources
and a receiver as depicted in Fig. 15.7(a). Suppose each source has one byte (8 bits) to
transmit to the receiver. Bits are sorted from the most significant bit (MSB) to the least significant bit (LSB). Bits can be 0 or 1 with the same probability. The desired function
at the receiver is f (X1 , X2 ) = max(X1 , X2 ).
Figure 15.7 A distributed functional compression network (a) without feedback and (b) with feedback.

In the case without feedback, the characteristic graphs of sources are trivially com-
plete graphs. Therefore, each source should transmit all bits to the receiver (i.e., the
un-scaled rates are R1 = 8 and R2 = 8).
Now, suppose the receiver can broadcast some feedback bits to sources. In the fol-
lowing, we propose a communication scheme that has a reduced sum transmission rate
compared with the case without feedback:
First, each source transmits its MSB. The receiver compares two received bits. If they
are the same, the receiver broadcasts 0 to sources; otherwise it broadcasts 1. If sources
receive 1 from feedback, they stop transmitting. Otherwise, they transmit their next sig-
nificant bits. For this communication scheme, the un-scaled sum rate of the forward links
can be calculated as follows:

    R_1^f + R_2^f = (1/2)(2 × 1) + (1/4)(2 × 2) + · · · + (1/2^n)(2 × n)
                        (a)            (b)
                  (c)
                   = 2(2 − (n + 2)/2^n),                                (15.12)

where n is the blocklength (in this example, n = 8), and Rf1 and Rf2 are the transmission
rates of sources X1 and X2 , respectively. Sources stop transmitting after the first bit
transmission if these bits are not equal. The probability of this event is 1/2 (term (a)
in equation (15.12)). Similarly, forward transmissions stop in the second round with
probability 1/4 (term (b) in equation (15.12)) and so on. Equality (c) follows from a
closed-form solution for the series ∑_i i/2^i. For n = 8, this rate is around 3.92, which
is less than the sum rate in the case without feedback. With similar calculations, the
feedback rate is around 1.96. Hence, the total forward and feedback transmission rate is
around 5.88, which is less than that in the case without feedback.
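As a quick sanity check of equality (c) in (15.12), the snippet below (ours, not from the chapter) evaluates the partial sum of the series and the closed form for n = 8.

```python
# Numerical check that sum_{i=1}^{n} (2*i)/2^i equals 2*(2 - (n+2)/2^n).

def forward_sum_rate(n):
    partial_sum = sum((2 * i) / 2 ** i for i in range(1, n + 1))
    closed_form = 2 * (2 - (n + 2) / 2 ** n)
    return partial_sum, closed_form

print(forward_sum_rate(8))   # both values are 3.921875, the ~3.92 quoted in the text
```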

15.4.2 A Practical Rate-Distortion Scheme for Distributed Functional Compression


In this section, we consider the problem of distributed functional compression with dis-
tortion. The objective is to compress correlated discrete sources so that an arbitrary
deterministic function of sources can be computed up to a distortion level at the receiver.

Here, we present a practical coding scheme for this problem with a non-trivial perfor-
mance guarantee. All discussions can be extended to more general networks similar to
results of Section 15.3.
Consider two sources as described in Section 15.2.1. Here, we assume that the receiver
wants to compute a deterministic function f : X1 × X2 → Z or f : Xn1 × Xn2 → Zn , its
vector extension up to distortion D with respect to a given distortion function d : Z ×
Z → [0, ∞). A vector extension of the distortion function is defined as follows:

d(z_1, z_2) = (1/n) Σ_{i=1}^{n} d(z_{1i}, z_{2i}),   (15.13)

where z_1, z_2 ∈ Z^n. We assume that d(z_1, z_2) = 0 if and only if z_1 = z_2. This assumption ensures that the vector extension satisfies the same property (i.e., d(z_1, z_2) = 0 if and only if z_1 = z_2).
The probability of error in this case is

P_e^n = Pr[{(x_1, x_2) : d(f(x_1, x_2), r(en_{X_1}(x_1), en_{X_2}(x_2))) > D}].

We say a rate pair (R_1, R_2) is achievable up to distortion D if there exist en_{X_1}, en_{X_2}, and r such that P_e^n → 0 when n → ∞.
Yamamoto gives a characterization of a rate-distortion function for the side-
information functional compression problem (i.e., X2 is available at the receiver) in
[24]. The rate-distortion function proposed in [24] is a generalization of the Wyner–
Ziv side-information rate-distortion function [23]. Another multi-letter characterization
of the rate-distortion function for the side-information problem given by Yamamoto was
discussed in [16]. The multi-letter characterization of [16] can be extended naturally to
a distributed functional compression case by using results of Section 15.3.
Here, we present a practical coding scheme with a non-trivial performance guarantee
for a given distributed lossy functional compression setup.
Define the D-characteristic graph of X1 with respect to X2 , p(x1 , x2 ), and f (X1 , X2 ) as
having vertices V = X1 and the pair (x11 , x12 ) is an edge if there exists some x21 ∈ X2 such
that p(x11 , x21 )p(x12 , x21 ) > 0 and d( f (x11 , x21 ), f (x12 , x21 )) > D as in [16]. Denote this graph as
G X1 (D). Similarly, we define G X2 (D).
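For concreteness, a minimal sketch of how one might build G_X1(D) directly from this definition is given below; the data layout (symbol lists, a joint pmf dictionary) and the example function are illustrative assumptions of ours rather than anything specified in the chapter.

```python
from itertools import combinations

def d_characteristic_graph(X1, X2, p, f, d, D):
    """Return the edge set of G_X1(D) from the definition above.

    X1, X2 : lists of source symbols
    p      : joint pmf as a dict, p[(x1, x2)] >= 0
    f      : desired function f(x1, x2)
    d      : distortion measure on function values
    D      : distortion level
    """
    edges = set()
    for x1a, x1b in combinations(X1, 2):
        for x2 in X2:
            # Edge condition: both pairs have positive probability with a common x2,
            # and the function values differ by more than D.
            if p.get((x1a, x2), 0) * p.get((x1b, x2), 0) > 0 and \
               d(f(x1a, x2), f(x1b, x2)) > D:
                edges.add((x1a, x1b))
                break
    return edges

# Example: the max function with absolute-error distortion and D = 1.
X1 = [0, 1, 2, 3]
X2 = [0, 1, 2, 3]
p = {(a, b): 1 / 16 for a in X1 for b in X2}       # uniform joint distribution
print(d_characteristic_graph(X1, X2, p, max, lambda z1, z2: abs(z1 - z2), D=1))
```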
theorem 15.8 For the network depicted in Fig. 15.1(b) with independent sources, if the
distortion function is a metric, then the following rate pair (R1 , R2 ) is achievable for the
distributed lossy functional compression problem with distortion D:

R1 ≥ HG X1 (D/2) (X1 ),
R2 ≥ HG X2 (D/2) (X2 ),
R1 + R2 ≥ HG X1 (D/2),G X2 (D/2) (X1 , X2 ). (15.14)

The proof of this theorem is presented in Section 15.5.14. A modularized coding scheme similar to Algorithm 15.2 can be used on D-characteristic graphs to perform arbitrarily closely to the rate bounds of Theorem 15.8.

15.4.3 Future Work in Functional Compression


Throughout this chapter, we considered the case of having only one desired function at
the receiver. However, all results can be extended naturally to the case of having several
desired functions at the receiver by considering a vector extension of functions, and
computing characteristic graphs of variables with respect to that vector. In fact, one can
show that the characteristic graph of a random variable with respect to several functions
is equal to the union of individual characteristic graphs (the union of two graphs with
the same vertices is a graph with the same vertex set whose edges are the union of the
edges of the individual graphs).
For possible future work, one may consider a general network topology rather than
tree networks. For instance, one can consider a general multi-source multicast network
in which receivers desire to have a deterministic function of source random variables.
For the case of having the identity function at the receivers, this problem has been well
studied in [35–37] under the name of network coding for multi-source multicast net-
works. Ho et al. [37] show that random linear network coding can perform arbitrarily
closely to min-cut max-flow bounds. To have an achievable scheme for the functional
version of this problem, one may perform random network coding on coloring random
variables satisfying the CCC. If receivers desire different functions, one can use color-
ings of multi-functional characteristic graphs satisfying the CCC, and then use random
network coding for these coloring random variables. This achievable scheme can be
extended to the disjoint multicast and disjoint multicast plus multicast cases described
in [36]. This scheme is an achievable scheme; however, it is not optimal in general. If
the sources are independent, one may use encoding/decoding functions derived for tree
networks at intermediate nodes, along with network coding.
Throughout this chapter, we considered asymptotically lossless or lossy computation
of a function. For possible future work, one may consider this problem for the zero-
error computation of a function, which leads to a communication complexity problem.
One can use the tools and schemes we have developed in this chapter to attain some
achievable schemes in the zero-error computation case as well.

15.5 Proofs

15.5.1 Proof of Lemma 15.1


Proof First, we know that any random variable X_2 by itself is a trivial coloring of G_{X_2} such that each vertex of G_{X_2} is assigned to a different color. So, J_C for c_{G_{X_1}}(X_1) and c_{G_{X_2}}(X_2) = X_2 can be written as J_C = {j_c^1, . . . , j_c^{n_{j_c}}} such that j_c^1 = {(x_1^i, x_2^1) : c_{G_{X_1}}(x_1^i) = σ_1}, where σ_1 is a generic color. Any two points in j_c^1 are connected to each other by a path with length one. So, j_c^1 satisfies the CCC. This argument holds for any j_c^i for any valid i. Thus, all joint coloring classes and, therefore, c_{G_{X_1}}(X_1) and c_{G_{X_2}}(X_2) = X_2 satisfy the CCC. The argument for c_{G_{X_1}^n}(X_1) and c_{G_{X_2}^n}(X_2) = X_2 is similar.

15.5.2 Proof of Lemma 15.2


Proof We first show that if j_c^i satisfies the CCC, then, for any two points (x_1^1, . . . , x_k^1) and (x_1^2, . . . , x_k^2) in j_c^i, f(x_1^1, . . . , x_k^1) = f(x_1^2, . . . , x_k^2). Since j_c^i satisfies the CCC, by definition, either f(x_1^1, . . . , x_k^1) = f(x_1^2, . . . , x_k^2) or there exists a path with length m − 1 between these two points Z_1 = (x_1^1, . . . , x_k^1) and Z_m = (x_1^2, . . . , x_k^2), for some m, where two consecutive points Z_j and Z_{j+1} on this path differ in exactly one of their coordinates. Without loss of generality, suppose Z_j and Z_{j+1} differ in their first coordinate, i.e., Z_j = (x_1^{1_j}, x_2^{2_j}, . . . , x_k^{k_j}) and Z_{j+1} = (x_1^{0_j}, x_2^{2_j}, . . . , x_k^{k_j}). Since these two points belong to j_c^i, c_{G_{X_1}}(x_1^{1_j}) = c_{G_{X_1}}(x_1^{0_j}). If f(Z_j) ≠ f(Z_{j+1}), there would exist an edge between x_1^{1_j} and x_1^{0_j} in G_{X_1}, and they could not have the same color. Therefore, f(Z_j) = f(Z_{j+1}). By applying the same argument inductively for all consecutive points on the path between Z_1 and Z_m, we have f(Z_1) = f(Z_2) = · · · = f(Z_m).
If j_c^i does not satisfy the CCC, by definition there exist at least two points Z_1 and Z_2 in j_c^i with different function values.

15.5.3 Proof of Lemma 15.3


Proof The proof is similar to Lemma 15.2. The only difference is that we use the definition of the CCC for c_{G_{X_1}^n}, . . . , c_{G_{X_k}^n}. Since j_c^i satisfies the CCC, either f(x_1^1, . . . , x_k^1) = f(x_1^2, . . . , x_k^2) or there exists a path with length m − 1 between any two points Z_1 = (x_1^1, . . . , x_k^1) ∈ T^n and Z_m = (x_1^2, . . . , x_k^2) ∈ T^n in j_c^i, for some m. Consider two consecutive points Z_j and Z_{j+1} in this path. They differ in one of their coordinates (suppose they differ in their first coordinate). In other words, suppose Z_j = (x_1^{1_j}, x_2^{2_j}, . . . , x_k^{k_j}) ∈ T^n and Z_{j+1} = (x_1^{0_j}, x_2^{2_j}, . . . , x_k^{k_j}) ∈ T^n. Since these two points belong to j_c^i, c_{G_{X_1}^n}(x_1^{1_j}) = c_{G_{X_1}^n}(x_1^{0_j}). If f(Z_j) ≠ f(Z_{j+1}), there would exist an edge between x_1^{1_j} and x_1^{0_j} in G_{X_1}^n and they could not acquire the same color. Thus, f(Z_j) = f(Z_{j+1}). By applying the same argument for all consecutive points on the path between Z_1 and Z_m, one can get f(Z_1) = f(Z_2) = · · · = f(Z_m). The converse part is similar to Lemma 15.2.

15.5.4 Proof of Lemma 15.4


Proof Suppose X1 and X2 satisfy the zigzag condition, and cG X1 and cG X2 are two valid
colorings of G X1 and G X2 , respectively. We want to show that these colorings satisfy the
CCC. To do this, consider two points (x11 , x21 ) and (x12 , x22 ) in a joint coloring class jic .
The definition of the zigzag condition guarantees the existence of a path with length two
between these two points. Thus, cG X1 and cG X2 satisfy the CCC.
The second part of this lemma says that the converse part is not true. For example, one
can see that in a special case considered in Lemma 15.1, those colorings always satisfy
the CCC in the absence of any condition such as the zigzag condition.

15.5.5 Proof of Lemma 15.5


Proof By using the data-processing inequality, we have

H_{G_{X_1}}(X_1 | X_2) = lim_{n→∞} min_{c_{G_{X_1}^n}, c_{G_{X_2}^n}} (1/n) H(c_{G_{X_1}^n}(X_1) | c_{G_{X_2}^n}(X_2))
                      = lim_{n→∞} min_{c_{G_{X_1}^n}} (1/n) H(c_{G_{X_1}^n}(X_1) | X_2).

Then, Lemma 15.1 implies that c_{G_{X_1}^n}(X_1) and c_{G_{X_2}^n}(X_2) = X_2 satisfy the CCC. A direct application of Theorem 15.2 completes the proof.

15.5.6 Proof of Theorem 15.3


Proof We first show the achievability of this rate region. We also propose a modular-
ized encoding/decoding scheme in this part. Then, for the converse, we show that no
encoding/decoding scheme can outperform this rate region.
(1) Achievability.
lemma 15.7 Consider random variables X_1, . . . , X_k with characteristic graphs G_{X_1}^n, . . . , G_{X_k}^n, and any valid ε-colorings c_{G_{X_1}^n}, . . . , c_{G_{X_k}^n} satisfying the CCC over typical points T^n, for sufficiently large n. There exists

f̂ : c_{G_{X_1}^n}(X_1) × · · · × c_{G_{X_k}^n}(X_k) → Z^n   (15.15)

such that f̂(c_{G_{X_1}^n}(x_1), . . . , c_{G_{X_k}^n}(x_k)) = f(x_1, . . . , x_k), for all (x_1, . . . , x_k) ∈ T^n.

Proof Suppose the joint coloring family for these colorings is J_C = {j_c^i : i}. We proceed by constructing f̂. Assume (x_1^1, . . . , x_k^1) ∈ j_c^i and c_{G_{X_1}^n}(x_1^1) = σ_1, . . . , c_{G_{X_k}^n}(x_k^1) = σ_k. Define f̂(σ_1, . . . , σ_k) = f(x_1^1, . . . , x_k^1).
To show that this function is well defined on elements in its support, we should show that, for any two points (x_1^1, . . . , x_k^1) and (x_1^2, . . . , x_k^2) in T^n, if c_{G_{X_1}^n}(x_1^1) = c_{G_{X_1}^n}(x_1^2), . . . , c_{G_{X_k}^n}(x_k^1) = c_{G_{X_k}^n}(x_k^2), then f(x_1^1, . . . , x_k^1) = f(x_1^2, . . . , x_k^2).
Since c_{G_{X_1}^n}(x_1^1) = c_{G_{X_1}^n}(x_1^2), . . . , c_{G_{X_k}^n}(x_k^1) = c_{G_{X_k}^n}(x_k^2), these two points belong to a joint coloring class such as j_c^i. Since c_{G_{X_1}^n}, . . . , c_{G_{X_k}^n} satisfy the CCC, we have, by using Lemma 15.3, f(x_1^1, . . . , x_k^1) = f(x_1^2, . . . , x_k^2). Therefore, our function f̂ is well defined and has the desired property.

Lemma 15.7 implies that, given ε-colorings of characteristic graphs of random variables satisfying the CCC, at the receiver, we can successfully compute the desired function f with a vanishing probability of error as n goes to infinity. Thus, if the decoder at the receiver is given colors, it can look up f from its table of f̂. It remains to be ascertained at which rates encoders can transmit these colors to the receiver faithfully (with a probability of error less than ε).

lemma 15.8 (Slepian–Wolf theorem) A rate region of a one-stage tree network with the
desired identity function at the receiver is characterized by the following conditions:

∀S ∈ S(k) ⟹ Σ_{i∈S} R_i ≥ H(X_S | X_{S^c}).   (15.16)

Proof See [17].

We now use the Slepian–Wolf (SW) encoding/decoding scheme on the achieved coloring random variables. Suppose the probability of error in each decoder of SW is less than ε/k. Then, the total error in the decoding of colorings at the receiver is less than ε. Therefore, the total error in the coding scheme of first coloring G_{X_1}^n, . . . , G_{X_k}^n and then encoding those colors by using the SW encoding/decoding scheme is upper-bounded by the sum of errors in each stage. By using Lemmas 15.7 and 15.8, we find that the total error is less than ε, and goes to zero as n goes to infinity. By applying Lemma 15.8 on the achieved coloring random variables, we have

∀S ∈ S(k) ⟹ Σ_{i∈S} R_i ≥ (1/n) H(c_{G_{X_S}^n} | c_{G_{X_{S^c}}^n}),   (15.17)

where c_{G_{X_S}^n} and c_{G_{X_{S^c}}^n} are ε-colorings of characteristic graphs satisfying the CCC. Thus,
using Definition 15.6 completes the achievability part.
(2) Converse. Here, we show that any distributed functional source coding scheme
with a small probability of error induces ε-colorings on characteristic graphs of random variables satisfying the CCC. Suppose ε > 0. Define F_ε^n for all (n, ε) as follows:

F_ε^n = { f̂ : Pr[ f̂(X_1, . . . , X_k) ≠ f(X_1, . . . , X_k)] < ε }.   (15.18)

In other words, F_ε^n is the set of all functions that differ from f with probability less than ε. Suppose f̂ is an achievable code with vanishing error probability, where

f̂(x_1, . . . , x_k) = r_n(en_{X_1,n}(x_1), . . . , en_{X_k,n}(x_k)),   (15.19)

where n is the blocklength. Then there exists n_0 such that for all n > n_0, Pr(f̂ ≠ f) < ε. In other words, f̂ ∈ F_ε^n. We call these codes ε-error functional codes.
lemma 15.9 Consider some function f : X1 × · · · × Xk → Z. Any distributed functional
code which reconstructs this function with zero-error probability induces colorings on
G X1 , . . . ,G Xk satisfying the CCC with respect to this function.
Proof Say we have a zero-error distributed functional code represented by encoders
enX1 , . . . , enXk and a decoder r. For any two points (x11 , . . . , xk1 ) and (x12 , . . . , xk2 ) with
positive probabilities, if their encoded values are the same (i.e., enX1 (x11 ) = enX1 (x12 ), . . . ,
enXk (xk1 ) = enXk (xk2 )), their function values will be the same as well since it is an error-
free scheme:

f (x11 , . . . , xk1 ) = f (x12 , . . . , xk2 ). (15.20)



We show that enX1 , . . . , enXk are in fact some valid colorings of G X1 , . . . , G Xk satisfying
the CCC. We demonstrate this argument for X1 . The argument for other random variables
is analogous. First, we show that enX1 induces a valid coloring on G X1 , and then we show
that this coloring satisfies the CCC.
Let us proceed by contradiction. If enX1 did not induce a coloring on G X1 , there must
be some edge in G X1 connecting two vertices with the same color. Let us call these ver-
tices x11 and x12 . Since these vertices are connected in G X1 , there must exist an (x21 , . . . , xk1 )
such that p(x11 , x21 , . . . , xk1 )p(x12 , x21 , . . . , xk1 ) > 0, enX1 (x11 ) = enX1 (x12 ), and f (x11 , x21 , . . . , xk1 ) ≠ f (x12 , x21 , . . . , xk1 ). Taking x21 = x22 , . . . , xk1 = xk2 as in equation (15.20) leads to a contradic-
tion. Therefore, the contradiction assumption is wrong and enX1 induces a valid coloring
on G X1.
Now, we show that these induced colorings satisfy the CCC. If this were not true,
it would mean that there must exist two points (x11 , . . . , xk1 ) and (x12 , . . . , xk2 ) in a joint
coloring class jic so that there is no path between them in jic . So, Lemma 15.2 says
that the function f can acquire different values at these two points. In other words, it is
possible to have f (x11 , . . . , xk1 ) ≠ f (x12 , . . . , xk2 ), where cG X1 (x11 ) = cG X1 (x12 ), . . . , cG Xk (xk1 ) = cG Xk (xk2 ), which is in contradiction with equation (15.20). Thus, these colorings satisfy
the CCC.

In the last step, we must show that any achievable functional code represented by F_ε^n induces ε-colorings on characteristic graphs satisfying the CCC.
lemma 15.10 Consider random variables X_1, . . . , X_k. All ε-error functional codes of these random variables induce ε-colorings on characteristic graphs satisfying the CCC.
Proof Suppose f̂(x_1, . . . , x_k) = r(en_{X_1}(x_1), . . . , en_{X_k}(x_k)) ∈ F_ε^n is such a code. If the function desired to be computed were f̂, then, according to Lemma 15.9, a zero-error reconstruction of f̂ induces colorings on characteristic graphs satisfying the CCC with respect to f̂. Let the set of all points (x_1, . . . , x_k) such that f̂(x_1, . . . , x_k) ≠ f(x_1, . . . , x_k) be denoted by C. Since Pr(f̂ ≠ f) < ε, Pr[C] < ε. Therefore, the functions en_{X_1}, . . . , en_{X_k} restricted to C are ε-colorings of characteristic graphs satisfying the CCC with respect to f.

According to Lemmas 15.9 and 15.10, any distributed functional source code with vanishing error probability induces ε-colorings on characteristic graphs of source variables satisfying the CCC with respect to the desired function f. Then, according to the Slepian–Wolf theorem (Lemma 15.8), we have

∀S ∈ S(k) ⟹ Σ_{i∈S} R_i ≥ (1/n) H(c_{G_{X_S}^n} | c_{G_{X_{S^c}}^n}),   (15.21)

where c_{G_{X_S}^n} and c_{G_{X_{S^c}}^n} are ε-colorings of characteristic graphs satisfying the CCC with respect to f. Using Definition 15.6 completes the converse part.

15.5.7 Proof of Theorem 15.4


Proof Here, we show that no coding scheme can outperform this rate region. Suppose
source nodes {1, . . . , 4} are directly connected to the receiver. By direct application of
Theorem 15.3, the first set of conditions of Theorem 15.4 can be derived. Repeating the
argument for intermediate nodes {5, 6} completes the proof.

15.5.8 Achievability Proof of Algorithm 15.1


Proof To show the achievability, we show that, if the nodes of each stage were directly
connected to the receiver, the receiver could compute its desired function. For nodes
{1, . . . , 4} in the first stage, the argument follows directly from Theorem 15.3. Now we
show that the argument holds for intermediate nodes {5, 6} as well. Consider node 5 in
the second stage of the network. Since the corresponding source values of its received
colorings form an independent set on its characteristic graph and since this node com-
putes the minimum-entropy coloring of this graph, it is equivalent to the case where
it would receive the exact source information, because both of them lead to the same
coloring random variable. Therefore, by having nodes 5 and 6 directly connected to the
receiver, and a direct application of Theorem 15.3, the receiver is able to compute its
desired function by using colorings of characteristic graphs of nodes 5 and 6.

15.5.9 Proof of Theorem 15.5


Proof Here, we present the proof for the tree network structure depicted in Fig. 15.4.
However, all of the arguments can be extended to a general case. Suppose intermediate
nodes 5 and 6 perform no computations and act as relays. Therefore, we have

R5 = R1 + R2 = HG X1 ,G X2 (X1 , X2 |X3 , X4 ).

By using the chain-rule proper set condition, we can rewrite this as

R5 = HG X1 ,X2 (X1 , X2 |X3 , X4 ),

which is the same condition as that of Theorem 15.4. Repeating this argument for R6
and R5 + R6 establishes the proof.

15.5.10 Proof of Lemma 15.6


Proof To prove this lemma, it is sufficient to show that, under the conditions of this
lemma, any colorings of the graph G X1 ,X2 can be expressed as colorings of G X1 and G X2 ,
and vice versa. The converse part is straightforward because any colorings of G X1 and
G X2 can be viewed as a coloring of G X1 ,X2 .

Figure 15.8 An example of G X1,X2 satisfying the conditions of Lemma 15.6, when X2 has two members.

Consider Fig. 15.8, which illustrates the conditions of this lemma. Under these con-
ditions, since all x2 in X2 have different function values, the graph G X1 ,X2 can be
decomposed into subgraphs which have the same topology as G X1 (i.e., are isomorphic to G X1), corresponding to each x2 in X2. These subgraphs are fully connected to each
other under the conditions of this lemma. Thus, any coloring of this graph can be rep-
resented as two colorings of G X1 (within each subgraph) and G X2 (across subgraphs).
Therefore, the minimum-entropy coloring of G X1 ,X2 is equal to the minimum-entropy
coloring of (G X1 ,G X2 ), i.e., HG X1 ,G X2 (X1 , X2 ) = HG X1 ,X2 (X1 , X2 ).

15.5.11 Proof of Theorem 15.6


Proof Suppose Γ(G X1 ) is the set of all maximal independent sets of G X1. Let us proceed
by contradiction. Consider Fig. 15.9(a). Suppose w1 and w2 are two different non-empty
maximally independent sets. Without loss of generality, assume x_1^1 and x_1^2 are in w_1, and x_1^2 and x_1^3 are in w_2. These sets have a common element x_1^2. Since w_1 and w_2 are two different maximally independent sets, x_1^1 ∉ w_2 and x_1^3 ∉ w_1. Since x_1^1 and x_1^2 are in w_1, there is no edge between them in G_{X_1}. The same argument holds for x_1^2 and x_1^3. However, we have an edge between x_1^1 and x_1^3, because w_1 and w_2 are two different maximally independent sets, so at least one such edge must exist between them. Now, we
want to show that this is not possible.

Figure 15.9 Having non-zero joint probability distribution, (a) maximally independent sets cannot overlap with each other (this figure depicts the contradiction); and (b) maximally independent sets should be fully connected to each other. In this figure, a solid line represents a connection, and a dashed line means that no connection exists.

Since there is no edge between x11 and x12 , for any x21 ∈ X2 , p(x11 , x21 )p(x12 , x21 ) > 0 and
f (x11 , x21 ) = f (x12 , x21 ). A similar argument can be expressed for x12 and x13 . In other words,
for any x21 ∈ X2 , p(x12 , x21 )p(x13 , x21 ) > 0 and f (x12 , x21 ) = f (x13 , x21 ). Thus, for all x21 ∈ X2 ,
p(x11 , x21 )p(x13 , x21 ) > 0 and f (x11 , x21 ) = f (x13 , x21 ). However, since x11 and x13 are connected
to each other, there should exist an x21 ∈ X2 such that f (x11 , x21 ) ≠ f (x13 , x21 ), which is
not possible. So, the contradiction assumption is not correct and these two maximally
independent sets do not overlap with each other.
We showed that maximally independent sets cannot have overlaps with each other.
Now, we want to show that they are also fully connected to each other. Again, let us
proceed by contradiction. Consider Fig. 15.9(b). Suppose w1 and w2 are two different
non-overlapping maximally independent sets. Suppose there exists an element in w2 (call
it x13 ) which is connected to one of the elements in w1 (call it x11 ) and is not connected to
another element of w1 (call it x12 ). By using a similar argument to the one in the previous
paragraph, we may show that it is not possible. Thus, x13 should be connected to x11 .
Therefore, if, for all (x1 , x2 ) ∈ X1 × X2 , p(x1 , x2 ) > 0, then the maximally independent
sets of G X1 are some separate fully connected sets. In other words, the complement of
G X1 is formed by some non-overlapping cliques. Finding the minimum-entropy coloring
of this graph is trivial and can be achieved by assigning different colors to these non-
overlapping fully connected maximally independent sets.
This argument also holds for any power of G X1. Suppose x11 , x21 , and x31 are some
typical sequences in Xn1 . If x11 is not connected to x21 and x31 , it is not possible to have x21
and x31 connected. Therefore, one can apply a similar argument to prove the theorem for G_{X_1}^n, for some n. This completes the proof.

15.5.12 Proof of Theorem 15.7


Proof We first prove it for G X1. Suppose X11 × X12 and X21 × X22 are two X1 -proper func-
tion regions of a quantization function f , where X11 ≠ X21 . We show that X11 and X21
are two non-overlapping fully connected maximally independent sets. By definition,
X11 and X21 are two non-equal partition sets of X1 . Thus, they do not have any element in
common.
Now, we want to show that the vertices of each of these partition sets are not con-
nected to each other. Without loss of generality, we show it for X11 . If this partition set of
X1 has only one element, this is a trivial case. So, suppose x11 and x12 are two elements in
X11 . From the definition of function regions, one can see that, for any x21 ∈ X2 such that
p(x11 , x21 )p(x12 , x21 ) > 0, we have f (x11 , x21 ) = f (x12 , x21 ). Thus, these two vertices are not con-
nected to each other. Now, suppose x13 is an element in X21 . Since these function regions
are X1 -proper, there should exist at least one x21 ∈ X2 such that p(x11 , x21 )p(x13 , x21 ) > 0 and
f (x11 , x21 ) ≠ f (x13 , x21 ). Thus, x11 and x13 are connected to each other. Therefore, X11 and X21
are two non-overlapping fully connected maximally independent sets. One can easily
apply this argument to other partition sets. Thus, the minimum-entropy coloring can be
achieved by assigning different colors to different maximally independent sets (partition
sets). The proof for G_{X_1}^n, for any n, is similar to the one mentioned in Theorem 15.6. This completes the proof.

15.5.13 Achievability Proof of Algorithm 15.3


Proof Suppose G X1 has P vertices labeled as {1, . . . , P} and sorted in a list. Say that,
if a vertex v has dv neighbors in Ḡ X1 , the complexity of finding them is on the order of
O(dv ). Therefore, for a vertex v, the complexity of the first two steps of the algorithm
is on the order of O(dv log(P)), where O(log(P)) is the complexity of updating the list
of un-colored vertices. Therefore, the overall worst-case complexity of the algorithm
is on the order of O(P2 log(P)). Since the maximally independent sets of graph G X1
are non-overlapping and also fully connected, any valid coloring scheme should assign
them to different colors. Therefore, the minimum number of required colors is equal to
the number of non-overlapping maximally independent sets of G X1 , which in fact is the
number of colors used in Algorithm 15.2. This completes the proof.

15.5.14 Proof of Theorem 15.8


Proof From Theorem 15.2, we have that, by sending colorings of high-probability subgraphs of the sources' D/2-characteristic graphs satisfying the CCC, one can achieve the rate region described in (15.14). For simplicity, we assume that the power of the
graphs is one. Extensions to an arbitrary power are analogous. Suppose the receiver gets
two colors from sources (say c1 from source 1, and c2 from source 2). To show that
the receiver is able to compute its desired function up to distortion level D, we need
to show that, for every (x11 , x21 ) and (x12 , x22 ) such that CG X1 (D/2) (x11 ) = CG X1 (D/2) (x12 ) and
CG X2 (D/2) (x21 ) = CG X2 (D/2) (x22 ), we have d( f (x11 , x21 ), f (x12 , x22 )) ≤ D. Since the distortion
function d is a metric, we have

d( f (x11 , x21 ), f (x12 , x22 )) ≤ d( f (x11 , x21 ), f (x12 , x21 )) + d( f (x12 , x21 ), f (x12 , x22 ))
≤ D/2 + D/2 = D. (15.22)

This completes the proof.

References

[1] S. Feizi and M. Médard, “On network functional compression,” IEEE Trans. Information
Theory, vol. 60, no. 9, pp. 5387–5401, 2014.
[2] H. Kowshik and P. R. Kumar, “Optimal computation of symmetric Boolean functions in
tree networks,” in Proc. 2010 IEEE International Symposium on Information Theory, ISIT
2010, 2010, pp. 1873–1877.
[3] S. Shenvi and B. K. Dey, “A necessary and sufficient condition for solvability of a 3s/3t
sum-network,” in Proc. 2010 IEEE International Symposium on Information Theory, ISIT
2010, 2010, pp. 1858–1862.
[4] A. Ramamoorthy, “Communicating the sum of sources over a network,” in Proc. 2008
IEEE International Symposium on Information Theory, ISIT 2008, 2008, pp. 1646–1650.
[5] R. Gallager, “Finding parity in a simple broadcast network,” IEEE Trans. Information
Theory, vol. 34, no. 2, pp. 176–180, 1988.

[6] A. Giridhar and P. Kumar, “Computing and communicating functions over sensor net-
works,” IEEE J. Selected Areas in Communications, vol. 23, no. 4, pp. 755–764, 2005.
[7] S. Kamath and D. Manjunath, “On distributed function computation in structure-free ran-
dom networks,” in Proc. 2008 IEEE International Symposium on Information Theory, ISIT
2008, 2008, pp. 647–651.
[8] N. Ma, P. Ishwar, and P. Gupta, “Information-theoretic bounds for multiround function
computation in collocated networks,” in Proc. 2009 IEEE International Symposium on
Information Theory, ISIT 2009, 2009, pp. 2306–2310.
[9] R. Ahuja, T. Magnanti, and J. Orlin, Network flows: Theory, algorithms, and applications.
Prentice Hall, 1993.
[10] F. Shahrokhi and D. Matula, “The maximum concurrent flow problem,” J. ACM, vol. 37,
no. 2, pp. 318–334, 1990.
[11] V. Shah, B. Dey, and D. Manjunath, “Network flows for functions,” in Proc. 2011 IEEE
International Symposium on Information Theory, 2011, pp. 234–238.
[12] M. Bakshi and M. Effros, “On zero-error source coding with feedback,” in Proc. 2010
IEEE International Symposium on Information Theory, ISIT 2010, 2010.
[13] H. Kowshik and P. Kumar, “Zero-error function computation in sensor networks,” in Proc.
48th IEEE Conference on Decision and Control, 2009 held jointly with the 2009 28th
Chinese Control Conference, CDC/CCC 2009, 2009, pp. 3787–3792.
[14] C. E. Shannon, “The zero error capacity of a noisy channel,” IRE Trans. Information
Theory, vol. 2, no. 3, pp. 8–19, 1956.
[15] A. Orlitsky and J. R. Roche, “Coding for computing,” IEEE Trans. Information Theory,
vol. 47, no. 3, pp. 903–917, 2001.
[16] V. Doshi, D. Shah, M. Médard, and M. Effros, “Functional compression through graph
coloring,” IEEE Trans. Information Theory, vol. 56, no. 8, pp. 3901–3917, 2010.
[17] D. Slepian and J. K. Wolf, “Noiseless coding of correlated information sources,” IEEE
Trans. Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
[18] S. S. Pradhan and K. Ramchandran, “Distributed source coding using syndromes (DIS-
CUS): Design and construction,” IEEE Trans. Information Theory, vol. 49, no. 3, pp.
626–643, 2003.
[19] B. Rimoldi and R. Urbanke, “Asynchronous Slepian–Wolf coding via source-splitting,” in
Proc. 1997 IEEE International Symposium on Information Theory, 1997, p. 271.
[20] T. P. Coleman, A. H. Lee, M. Médard, and M. Effros, “Low-complexity approaches
to Slepian–Wolf near-lossless distributed data compression,” IEEE Trans. Information
Theory, vol. 52, no. 8, pp. 3546–3561, 2006.
[21] R. F. Ahlswede and J. Körner, “Source coding with side information and a converse for
degraded broadcast channels,” IEEE Trans. Information Theory, vol. 21, no. 6, pp. 629–
637, 1975.
[22] J. Körner and K. Marton, “How to encode the modulo-two sum of binary sources,” IEEE
Trans. Information Theory, vol. 25, no. 2, pp. 219–221, 1979.
[23] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information
at the decoder,” IEEE Trans. Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
[24] H. Yamamoto, “Wyner–Ziv theory for a general function of the correlated sources,” IEEE
Trans. Information Theory, vol. 28, no. 5, pp. 803–807, 1982.
[25] H. Feng, M. Effros, and S. Savari, “Functional source coding for networks with receiver
side information,” in Proc. Allerton Conference on Communication, Control, and Comput-
ing, 2004, pp. 1419–1427.

[26] T. Berger and R. W. Yeung, “Multiterminal source encoding with one distortion criterion,”
IEEE Trans. Information Theory, vol. 35, no. 2, pp. 228–236, 1989.
[27] J. Barros and S. Servetto, “On the rate-distortion region for separate encoding of correlated sources,” in Proc. 2003 IEEE International Symposium on Information Theory (ISIT), 2003, p. 171.
[28] A. B. Wagner, S. Tavildar, and P. Viswanath, “Rate region of the quadratic Gaussian
two-terminal source-coding problem,” in Proc. 2006 IEEE International Symposium on
Information Theory, 2006.
[29] I. Csiszár and J. Körner, Information theory: Coding theorems for discrete memoryless systems. New York: Academic Press, 1981.
[30] H. S. Witsenhausen, “The zero-error side information problem and chromatic numbers,”
IEEE Trans. Information Theory, vol. 22, no. 5, pp. 592–593, 1976.
[31] J. Körner, “Coding of an information source having ambiguous alphabet and the entropy
of graphs,” in Proc. 6th Prague Conference on Information Theory, 1973, pp. 411–425.
[32] N. Alon and A. Orlitsky, “Source coding and graph entropies,” IEEE Trans. Information
Theory, vol. 42, no. 5, pp. 1329–1339, 1996.
[33] J. Cardinal, S. Fiorini, and G. Joret, “Tight results on minimum entropy set cover,”
Algorithmica, vol. 51, no. 1, pp. 49–60, 2008.
[34] R. Appuswamy, M. Franceschetti, N. Karamchandani, and K. Zeger, “Network coding for
computing: Cut-set bounds,” IEEE Trans. Information Theory, vol. 57, no. 2, pp. 1015–
1030, 2011.
[35] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network information flow,” IEEE
Trans. Information Theory, vol. 46, pp. 1204–1216, 2000.
[36] R. Koetter and M. Médard, “An algebraic approach to network coding,” IEEE/ACM Trans.
Networking, vol. 11, no. 5, pp. 782–795, 2003.
[37] T. Ho, M. Médard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B. Leong, “A random
linear network coding approach to multicast,” IEEE Trans. Information Theory, vol. 52,
no. 10, pp. 4413–4430, 2006.
16 An Introductory Guide to Fano’s
Inequality with Applications in
Statistical Estimation
Jonathan Scarlett and Volkan Cevher

Summary

Information theory plays an indispensable role in the development of algorithm-


independent impossibility results, both for communication problems and for seemingly
distinct areas such as statistics and machine learning. While numerous information-
theoretic tools have been proposed for this purpose, the oldest one remains arguably
the most versatile and widespread: Fano’s inequality. In this chapter, we provide a sur-
vey of Fano’s inequality and its variants in the context of statistical estimation, adopting
a versatile framework that covers a wide range of specific problems. We present a variety
of key tools and techniques used for establishing impossibility results via this approach,
and provide representative examples covering group testing, graphical model selection,
sparse linear regression, density estimation, and convex optimization.

16.1 Introduction

The tremendous progress in large-scale statistical inference and learning in recent years
has been spurred by both practical and theoretical advances, with strong interactions
between the two: algorithms that come with a priori performance guarantees are clearly
desirable, if not crucial, in practical applications, and practical issues are indispensable
in guiding the theoretical studies.
A key role, complementary to that of performance bounds for specific algorithms, is
played by algorithm-independent impossibility results, stating conditions under which
one cannot hope to achieve a certain goal. Such results provide definitive benchmarks
for practical methods, serve as certificates for near-optimality, and help guide practical
developments toward directions where the greatest improvements are possible.
Since its introduction in 1948, the field of information theory has continually provided
such benefits for the problems of storing and transmitting data, and has accordingly
shaped the design of practical communication systems. In addition, recent years have
seen mounting evidence that the tools and methodology of information theory reach
far beyond communication problems, and can provide similar benefits within the entire
data-processing pipeline.


Table 16.1. Examples of applications for which impossibility results have been derived using Fano's inequality

Sparse and low-rank problems: group testing [2, 3]; compressive sensing [4, 5]; sparse Fourier transform [6, 7]; principal component analysis [8, 9]; matrix completion [10, 11].
Other estimation problems: regression [12, 13]; density estimation [13, 14]; kernel methods [15, 16]; distributed estimation [17, 18]; local privacy [19].
Sequential decision problems: convex optimization [20, 21]; active learning [22]; multi-armed bandits [23]; Bayesian optimization [24]; communication complexity [25].
Other learning problems: graph learning [26, 27]; ranking [28, 29]; classification [30, 31]; clustering [32]; phylogeny [33].

While many information-theoretic tools have been proposed for establishing impossi-
bility results, the oldest one remains arguably the most versatile and widespread: Fano’s
inequality [1]. This fundamental inequality is not only ubiquitous in studies of com-
munication, but has also been applied extensively in statistical inference and learning
problems; several examples are given in Table 16.1.
In applying Fano’s inequality to such problems, one typically encounters a number of
distinct challenges different from those found in communication problems. The goal of
this chapter is to introduce the reader to some of the key tools and techniques, explain
their interactions and connections, and provide several representative examples.

16.1.1 Overview of Techniques


Throughout the chapter, we consider the following statistical estimation framework,
which captures a broad range of problems including the majority of those listed in
Table 16.1.
• There exists an unknown parameter θ, known to lie in some set Θ (e.g., a subset of
R p ), that we would like to estimate.
• In the simplest case, the estimation algorithm has access to a set of samples Y =
(Y1 , . . . , Yn ) drawn from some joint distribution Pnθ (y) parameterized by θ. More gener-
ally, the samples may be drawn from some joint distribution Pnθ,X (y) parameterized by
(θ, X), where X = (X1 , . . . , Xn ) are inputs that are either known in advance or selected
by the algorithm itself.
• Given knowledge of Y, as well as X if inputs are present, the algorithm forms an
estimate θ̂ of θ, with the goal of the two being “close” in the sense that some loss function ℓ(θ, θ̂) is small. When referring to this step of the estimation algorithm, we
will use the terms algorithm and decoder interchangeably.
We will initially use the following simple running example to exemplify some of the key
concepts, and then turn to detailed applications in Sections 16.4 and 16.6.

Example 16.1 (1-sparse linear regression) A vector parameter θ ∈ R^p is known to have at most one non-zero entry, and we are given n linear samples of the form Y = Xθ + Z,¹ where X ∈ R^{n×p} is a known input matrix, and Z ∼ N(0, σ²I) is additive Gaussian noise. In other words, the ith sample Y_i is a noisy sample of ⟨X_i, θ⟩, where X_i ∈ R^p is the transpose of the ith row of X. The goal is to construct an estimate θ̂ such that the squared distance ℓ(θ, θ̂) = ‖θ − θ̂‖_2^2 is small.
This example is an extreme case of k-sparse linear regression, in which θ has at most k ≪ p non-zero entries, i.e., at most k columns of X impact the output. The more general k-sparse recovery problem will be considered in Section 16.6.1.
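The following short snippet (an illustration of ours, not part of the chapter) generates data from this observation model; the dimensions, noise level, and the crude correlation-based support estimate at the end are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 100, 1.0

theta = np.zeros(p)                 # 1-sparse parameter: a single non-zero entry
theta[rng.integers(p)] = 2.5

X = rng.standard_normal((n, p))     # known input matrix
Z = sigma * rng.standard_normal(n)  # additive Gaussian noise
Y = X @ theta + Z                   # the n linear samples Y = X theta + Z

# A crude estimate of the support: the column most correlated with Y.
support_estimate = int(np.argmax(np.abs(X.T @ Y)))
print(support_estimate, np.flatnonzero(theta))
```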

We seek to establish algorithm-independent impossibility results, henceforth referred


to as converse bounds, in the form of lower bounds on the sample complexity, i.e.,
the number of samples n required to achieve a certain average target loss. The fol-
lowing aspects of the problem significantly impact this goal, and their differences are
highlighted throughout the chapter.
• Discrete versus continuous. Depending on the application, the parameter set Θ may
be discrete or continuous. For instance, in the 1-sparse linear regression example, one
may consider the case in which θ is known to lie in a finite set Θ ⊆ R p , or one may
consider the general estimation of a vector in the set
Θ = {θ ∈ R^p : ‖θ‖_0 ≤ 1},   (16.1)
where ‖θ‖_0 is the number of non-zero entries in θ.
• Minimax versus Bayesian. In the minimax setting, one seeks a decoder that attains
a small loss for any given θ ∈ Θ, whereas in the Bayesian setting, one considers the
average performance under some prior distribution on θ. Hence, these two variations
respectively consider the worst-case and average-case performance with respect to θ.
We focus primarily on the minimax setting throughout the chapter, and further discuss
Bayesian settings in Section 16.7.2.
• Choice of target goal. Naturally, the target goal can considerably impact the fun-
damental performance limits of an estimation problem. For instance, in discrete
settings, it is common to consider exact recovery, requiring that θ̂ = θ (i.e., the 0-1 loss ℓ(θ, θ̂) = 1{θ̂ ≠ θ}), but it is also of interest to understand to what extent approximate
recovery criteria make the problem easier.
• Non-adaptive versus adaptive sampling. In settings consisting of an input X =
(X1 , . . . , Xn ) as introduced above, one often distinguishes between the non-adaptive
setting, in which X is specified prior to observing any samples, and the adaptive
setting, in which a given input Xi can be designed starting from the past inputs
(X1 , . . . , Xi−1 ) and samples (Y1 , . . . , Yi−1 ). It is of significant interest to understand to
what extent the additional freedom of adaptivity impacts the performance.
With these variations in mind, we proceed by outlining the main steps in obtaining
converse bounds for statistical estimation via Fano’s inequality.
1 Throughout the chapter, we interchange tuple-based notations such as X = (X1 , . . . , Xn ), Y = (Y1 , . . . , Yn )
with vector/matrix notation such as X ∈ Rn×p , Y ∈ Rn .

Figure 16.1 Reduction of minimax estimation to multiple hypothesis testing (block diagram: an index V selects a parameter θ_V, samples Y are generated, possibly based on inputs X, the algorithm outputs an estimate θ̂, and an index estimate V̂ is inferred from θ̂). The gray boxes are fixed as part of the problem statement, whereas the white boxes are constructed to our liking for the purpose of proving a converse bound. The dashed line marked with X is optional, depending on whether inputs are present.

Step 1: Reduction to Multiple Hypothesis Testing


The multiple-hypothesis-testing problem is defined as follows. An index V ∈ {1, . . . , M} is
drawn from a prior distribution PV , and a sequence of samples Y = (Y1 , . . . , Yn ) is drawn
from a probability distribution PY|V parameterized by V. The M possible conditional
distributions are known in advance, and the goal is to identify the index V with high
probability given the samples.
In Fig. 16.1, we provide a general illustration of how an estimation problem can be
reduced to multiple hypothesis testing, possibly with the added twist of including inputs
X = (X1 , . . . , Xn ). Supposing for the time being that we are in the minimax setting, the
idea is to construct a hard subset of parameters {θ1 , . . . , θ M } that are difficult to distinguish
given the samples. We then lower-bound the worst-case performance by the average over
this hard subset. As a concrete example, a good choice for the 1-sparse linear regression
problem is to set M = 2p and consider the set of vectors of the form
θ = (0, . . . , 0, ±, 0, . . . , 0), (16.2)
where  > 0 is a constant. Hence, the non-zero entry of θ has a given magnitude, which
can be selected to our liking for the purpose of proving a converse.
We envision an index V ∈ {1, . . . , M} being drawn uniformly at random and used to
select the corresponding parameter θ_V, and the estimation algorithm being run to produce an estimate θ̂. If the parameters {θ_1, . . . , θ_M} are not too close and the algorithm successfully produces θ̂ ≈ θ_V, then we should be able to infer the index V from θ̂. This
entire process can be viewed as a problem of multiple hypothesis testing, where the vth
hypothesis is that the underlying parameter is θv (v = 1, . . . , M). With this reduction, we
can deduce that if the algorithm performs well then the hypothesis test is successful; the
contrapositive statement is then that if the hypothesis test cannot be successful, then the
algorithm cannot perform well.
In the 1-sparse linear regression example, we find from (16.2) that distinct θ_j, θ_{j′} must satisfy ‖θ_j − θ_{j′}‖_2 ≥ √2 · ε. As a result, we immediately obtain from the triangle inequality that the following holds:

If ‖θ̂ − θ_v‖_2 < (√2/2) · ε, then arg min_{v′=1,...,M} ‖θ̂ − θ_{v′}‖_2 = v.   (16.3)


In other words, if the algorithm yields  θ − θv 22 < ( 2/2) , then V can be identified as
the index corresponding to the closest vector to  θ. Thus, sufficiently accurate estimation
implies success in identifying V.
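As a concrete illustration of this reduction (again our own sketch, not the chapter's), the snippet below builds the hard subset of M = 2p vectors from (16.2) and recovers the index V from an estimate θ̂ by choosing the closest element, as in (16.3).

```python
import numpy as np

def hard_subset(p, eps):
    """The 2p one-sparse vectors of (16.2), with non-zero magnitude eps."""
    vectors = []
    for j in range(p):
        for sign in (+1.0, -1.0):
            v = np.zeros(p)
            v[j] = sign * eps
            vectors.append(v)
    return np.array(vectors)                      # shape (2p, p)

def infer_index(theta_hat, vectors):
    """Nearest-vector decoding of the index, as in (16.3)."""
    return int(np.argmin(np.linalg.norm(vectors - theta_hat, axis=1)))

p, eps = 5, 0.5
vectors = hard_subset(p, eps)
v_true = 3
theta_hat = vectors[v_true] + 0.05 * np.random.default_rng(1).standard_normal(p)
# Per (16.3), recovery succeeds whenever ||theta_hat - theta_v||_2 < (sqrt(2)/2) * eps.
print(infer_index(theta_hat, vectors) == v_true)
```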
Discussion. Selecting the hard subset {θ1 , . . . , θ M } of parameters is often considered
something of an art. While the proofs of existing converse bounds may seem easy in
hindsight when the hard subset is known, coming up with a suitable choice for a new
problem usually requires some creativity and/or exploration. Despite this, there exist
general approaches that have proved to be effective in a wide range of problems, which
we exemplify in Sections 16.4 and 16.6.
In general, selecting the hard subset requires balancing conflicting goals: increasing
M so that the hypothesis test is more difficult, keeping the elements “close” so that
they are difficult to distinguish, and keeping the elements “sufficiently distant” so that
one can recover V from θ̂. Typically, one of the following three approaches is adopted:
(i) explicitly construct a set whose elements are known or believed to be difficult to
distinguish; (ii) prove the existence of such a set using probabilistic arguments; or (iii)
consider packing as many elements as possible into the entire space. We will provide
examples of all three kinds.
In the Bayesian setting, θ is already random, so we cannot use the above-mentioned
method of lower-bounding the worst-case performance by the average. Nevertheless, if
Θ is discrete, we can still use the trivial reduction V = θ to form a multiple-hypothesis-
testing problem with a possibly non-uniform prior. In the continuous Bayesian setting,
one typically requires more advanced methods that are not covered in this chapter; we
provide further discussion in Section 16.7.2.

Step 2: Application of Fano’s Inequality


Once a multiple hypothesis test has been set up, Fano’s inequality provides a lower
bound on its error probability in terms of the mutual information, which is one of the
most fundamental information measures in information theory. The mutual information
can often be explicitly characterized given the problem formulation, and a variety of
useful properties are known for doing so, as outlined below.
We briefly state the standard form of Fano's inequality for the case in which V is uniform on {1, . . . , M} and V̂ is some estimate of V:

P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2)/(log M).   (16.4)

The intuition is as follows. The term log M represents the prior uncertainty (i.e., entropy) of V, and the mutual information I(V; V̂) represents how much information V̂ reveals about V. In order to have a small probability of error, we require that the information revealed is close to the prior uncertainty.
Beyond the standard form of Fano’s inequality (16.4), it is useful to consider other
variants, including approximate recovery and conditional versions. These are the topic
of Section 16.2, and we discuss other alternatives in Section 16.7.2.

Step 3: Bounding the Mutual Information


In order to make lower bounds such as (16.4) explicit, we need to upper-bound the
mutual information therein. This often consists of tedious yet routine calculations, but
there are cases where it is highly non-trivial. The mutual information depends crucially
on the choice of reduction in the first step.
The joint distribution of (V, V̂) is decoder-dependent and usually very complicated, so, to simplify matters, the typical first step is to apply an upper bound known as the data-processing inequality. In the simplest case in which there is no extra input to the sampling mechanism (i.e., X is absent in Fig. 16.1), this inequality takes the form I(V; V̂) ≤ I(V; Y) under the Markov chain V → Y → V̂. Thus, we are left to answer the
question of how much information the samples reveal about the index V.
In Section 16.3, we introduce several useful tools for this purpose, including the
following.
• Tensorization. If the samples Y = (Y1 , . . . , Yn ) are conditionally independent given V,

we have I(V; Y) ≤ Σ_{i=1}^{n} I(V; Y_i). Bounds of this type simplify the mutual information
containing a set of observations to simpler terms containing only a single observation.
• Kullback–Leibler (KL)-divergence-based bounds. Straightforward bounds on the
mutual information reveal that, if {Pnθv }v=1,...,M are close in terms of KL divergence,
then the mutual information is small. Results of this type are useful, as the relevant
KL divergences can often be evaluated exactly or tightly bounded.
In addition to these, we introduce variations for cases in which the input X is present in
Fig. 16.1, distinguishing between non-adaptive and adaptive sampling.
Toy example. To give a simple example of how this step is combined with the pre-
vious one, consider the case in which we wish to identify one of M hypotheses, with
the vth hypothesis being that Y ∼ P_v(y) for some distribution P_v on {0, 1}^n. That is, the n observations (Y_1, . . . , Y_n) are binary-valued. Starting with the above-mentioned bound I(V; V̂) ≤ I(V; Y), we simply write I(V; Y) ≤ H(Y) ≤ n log 2, which follows since Y takes one of at most 2^n values. Substitution into (16.4) yields Pe ≥ 1 − (n + 1)/log_2 M, which means that achieving Pe ≤ δ requires n ≥ (1 − δ) log_2 M − 1. This formalizes the intuitive fact that reliably identifying one of M ≫ 1 hypotheses requires roughly log_2 M binary observations.
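In code, the resulting sample-complexity bound is a one-liner (shown here purely as an illustration):

```python
import math

def min_samples(M, delta):
    # n >= (1 - delta) * log2(M) - 1, from substituting I(V;Y) <= n*log(2) into (16.4).
    return (1 - delta) * math.log2(M) - 1

print(min_samples(M=2 ** 20, delta=0.1))   # 17.0: about 17 samples for ~10^6 hypotheses
```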

16.2 Fano’s Inequality and Its Variants

In this section, we state various forms of Fano’s inequality that will form the basis for
the results in the remainder of the chapter.

16.2.1 Standard Version


We begin with the most simple and widely used form of Fano's inequality. We use the generic notation V for the discrete random variable in a multiple hypothesis test, and we write its estimate as V̂. In typical applications, one has a Markov-chain relation such as V → Y → V̂, where Y is the collection of samples; we will exploit this fact in Section 16.3, but, for now, one can think of V̂ being randomly generated by any means given V.
The two fundamental quantities appearing in Fano's inequality are the conditional entropy H(V|V̂), representing the uncertainty of V given its estimate, and the error probability:

Pe = P[V̂ ≠ V].   (16.5)

Since H(V|V̂) = H(V) − I(V; V̂), the conditional entropy is closely related to the mutual information, representing how much information V̂ reveals about V.
theorem 16.1 (Fano's inequality) For any discrete random variables V and V̂ on a common finite alphabet 𝒱, we have

H(V|V̂) ≤ H_2(Pe) + Pe log(|𝒱| − 1),   (16.6)

where H_2(α) = α log(1/α) + (1 − α) log(1/(1 − α)) is the binary entropy function. In particular, if V is uniform on 𝒱, we have

I(V; V̂) ≥ (1 − Pe) log |𝒱| − log 2,   (16.7)

or equivalently,

Pe ≥ 1 − (I(V; V̂) + log 2)/(log |𝒱|).   (16.8)
Since the proof of Theorem 16.1 is widely accessible in standard references such as [34], we provide only an intuitive explanation of (16.6). To resolve the uncertainty in V given V̂, we can first ask whether the two are equal, which bears uncertainty H_2(Pe). If they differ, which only occurs a fraction Pe of the time, the remaining uncertainty is at most log(|𝒱| − 1).
remark 16.1 For uniform V, we obtain (16.7) by upper-bounding |𝒱| − 1 ≤ |𝒱| and H_2(Pe) ≤ log 2 in (16.6), and subtracting H(V) = log |𝒱| on both sides. While these additional bounds have a minimal impact for moderate to large values of |𝒱|, a notable case where one should use (16.6) is the binary setting, i.e., |𝒱| = 2. In this case, (16.7) is meaningless due to the right-hand side being negative, whereas (16.6) yields the following for uniform V:

I(V; V̂) ≥ log 2 − H_2(Pe).   (16.9)

It follows that the error probability is lower-bounded as

Pe ≥ H_2^{-1}(log 2 − I(V; V̂)),   (16.10)

where H_2^{-1}(·) ∈ [0, 1/2] is the inverse of H_2(·) ∈ [0, log 2] on the domain [0, 1/2].
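The bounds (16.8) and (16.10) are easy to evaluate numerically; the helper functions below are an illustrative sketch (not code from the chapter), with entropies in nats to match the natural logarithms used above and the inverse H_2^{-1} computed by bisection.

```python
import math

def binary_entropy(a):
    if a in (0.0, 1.0):
        return 0.0
    return a * math.log(1 / a) + (1 - a) * math.log(1 / (1 - a))

def fano_lower_bound(mutual_info, M):
    """Pe >= 1 - (I(V;Vhat) + log 2)/log M, for V uniform on M outcomes (eq. (16.8))."""
    return max(0.0, 1 - (mutual_info + math.log(2)) / math.log(M))

def fano_lower_bound_binary(mutual_info):
    """Pe >= H2^{-1}(log 2 - I(V;Vhat)), for the binary case (eq. (16.10))."""
    target = max(0.0, math.log(2) - mutual_info)
    lo, hi = 0.0, 0.5
    for _ in range(60):                  # H2 is increasing on [0, 1/2]
        mid = (lo + hi) / 2
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo

print(fano_lower_bound(mutual_info=1.0, M=100))    # about 0.63
print(fano_lower_bound_binary(mutual_info=0.05))   # about 0.34
```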

16.2.2 Approximate Recovery


The notion of error probability considered in Theorem 16.1 is that of exact recovery,
insisting that 
V = V. More generally, one can consider notions of approximate recovery,

where one only requires 


V to be “close” to V in some sense. This is useful for at least
two reasons.
• Exact recovery is often a highly stringent criterion in discrete statistical estimation
problems, and it is of considerable interest to understand to what extent moving to
approximate recovery makes the problem easier.
• When we reduce continuous estimation problems to the discrete setting (Section
16.5), permitting approximate recovery will provide a useful additional degree of
freedom.
We consider a general setup with a random variable V, an estimate V̂, and an error probability of the form

Pe(t) = P[d(V, V̂) > t]   (16.11)

for some real-valued function d(v, v̂) and threshold t ∈ R. In contrast to the exact recovery setting, there are interesting cases where V and V̂ are defined on different alphabets, so we denote these by 𝒱 and 𝒱̂, respectively.
One can interpret (16.11) as requiring V̂ to be within a “distance” t of V. However, d need not be a true distance function, and it need not even be symmetric or take non-negative values. This definition of the error probability in fact entails no loss of generality, since one can set t = 0 and d(V, V̂) = 1{(V, V̂) ∈ E} for an arbitrary set E containing the pairs that are considered errors.
In the following, we make use of the quantities

Nmax(t) = max_{v̂∈𝒱̂} N_v̂(t),   Nmin(t) = min_{v̂∈𝒱̂} N_v̂(t),   (16.12)

where

N_v̂(t) = Σ_{v∈𝒱} 1{d(v, v̂) ≤ t}   (16.13)

counts the number of v ∈ 𝒱 within a “distance” t of v̂ ∈ 𝒱̂.
theorem 16.2 (Fano's inequality with approximate recovery) For any random variables V, V̂ on the finite alphabets 𝒱, 𝒱̂, we have

H(V|V̂) ≤ H_2(Pe(t)) + Pe(t) log((|𝒱| − Nmin(t))/Nmax(t)) + log Nmax(t).   (16.14)

In particular, if V is uniform on 𝒱, then

I(V; V̂) ≥ (1 − Pe(t)) log(|𝒱|/Nmax(t)) − log 2,   (16.15)

or equivalently,

Pe(t) ≥ 1 − (I(V; V̂) + log 2)/(log(|𝒱|/Nmax(t))).   (16.16)
The proof is similar to that of Theorem 16.1, and can be found in [35].

By setting d(v, v̂) = 1{v ≠ v̂} and t = 0, we find that Theorem 16.2 recovers Theorem 16.1 as a special case. More generally, the bounds (16.15) and (16.16) resemble those for exact recovery in (16.7) and (16.8), except that log |𝒱| is replaced by log(|𝒱|/Nmax(t)). When 𝒱 = 𝒱̂, one can intuitively think of the approximate recovery setting as dividing the space into regions of size Nmax(t), and only requiring the correct region to be identified, thereby reducing the effective alphabet size to |𝒱|/Nmax(t).
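To make these quantities concrete, the following sketch computes Nmax(t), Nmin(t), and the bound (16.16) for a toy example in which V is a length-8 binary string and recovery within Hamming distance t = 1 counts as correct; the alphabet, distance, and mutual-information value are our own illustrative choices, not the chapter's.

```python
import math
from itertools import product

def hamming(v, v_hat):
    return sum(a != b for a, b in zip(v, v_hat))

def counts(V, V_hat, d, t):
    N = [sum(1 for v in V if d(v, v_hat) <= t) for v_hat in V_hat]
    return max(N), min(N)

def approx_fano_bound(mutual_info, V_size, N_max):
    """Pe(t) >= 1 - (I(V;Vhat) + log 2)/log(|V|/Nmax(t)), as in (16.16)."""
    return max(0.0, 1 - (mutual_info + math.log(2)) / math.log(V_size / N_max))

V = list(product([0, 1], repeat=8))
N_max, N_min = counts(V, V, hamming, t=1)
print(N_max, N_min)   # 9 9: each Hamming ball of radius 1 contains 9 strings
print(approx_fano_bound(mutual_info=2.0, V_size=len(V), N_max=N_max))  # about 0.2
```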

16.2.3 Conditional Version


When applying Fano’s inequality, it is often useful to condition on certain random
events and random variables. The following theorem states a general variant of Theo-
rem 16.1 with such conditioning. Conditional forms for the case of approximate recovery
(Theorem 16.2) follow in an identical manner.

theorem 16.3 (Conditional Fano inequality) For any discrete random variables V and V̂ on a common alphabet 𝒱, any discrete random variable A on an alphabet 𝒜, and any subset 𝒜′ ⊆ 𝒜, the error probability Pe = P[V̂ ≠ V] satisfies

Pe ≥ Σ_{a∈𝒜′} P[A = a] (H(V|V̂, A = a) − log 2) / (log(|𝒱_a| − 1)),   (16.17)

where 𝒱_a = {v ∈ 𝒱 : P[V = v | A = a] > 0}. For possibly continuous A, the same holds true with Σ_{a∈𝒜′} P[A = a](···) replaced by E[1{A ∈ 𝒜′}(···)].

Proof We write Pe ≥ Σ_{a∈𝒜′} P[A = a] P[V̂ ≠ V | A = a], and lower-bound the conditional error probability using Fano's inequality (Theorem 16.1) under the joint distribution of (V, V̂) conditioned on A = a.
remark 16.2 Our main use of Theorem 16.3 will be to average over the input X
(Fig. 16.1) in the case in which it is random and independent of V. In such cases, by
setting A = X in (16.17) and letting A contain all possible outcomes, we simply recover
Theorem 16.1 with conditioning on X in the conditional entropy and mutual informa-
tion terms. The approximate recovery version, Theorem 16.2, extends in the same way.
In Section 16.4, we will discuss more advanced applications of Theorem 16.3, including
(i) genie arguments, in which some information about V is revealed to the decoder, and
(ii) typicality arguments, where we condition on V falling in some high-probability set.

16.3 Mutual Information Bounds

We saw in Section 16.2 that the mutual information I(V; V) naturally arises from Fano’s
inequality when V is uniform. More generally, we have H(V| V) = H(V) − I(V;  V), so
we can characterize the conditional entropy by characterizing both the entropy and the
mutual information. In this section, we provide some of the main useful tools for upper-
bounding the mutual information. For brevity, we omit the proofs of standard results
commonly found in information-theory textbooks, or simple variations thereof.
496 Jonathan Scarlett and Volkan Cevher

Throughout the section, the random variables V and  V are assumed to be discrete,
whereas the other random variables involved, including the inputs X = (X1 , . . . , Xn ) and
samples Y = (Y1 , . . . , Yn ), may be continuous. Hence, notation such as PY (y) may repre-
sent either a probability mass function (PMF) or a probability density function (PDF).

16.3.1 Data-Processing Inequality


Recall the random variables V, X, Y, and  V in the multiple-hypothesis-testing reduction
depicted in Fig. 16.1. In nearly all cases, the first step in bounding a mutual information
term such as I(V; V) is to upper-bound it in terms of the samples Y, and possibly the
inputs X. By doing so, we remove the dependence on  V, and form a bound that is
algorithm-independent.
The following lemma provides three variations along these lines. The three are all
essentially equivalent, but are written separately since each will be more naturally suited
to certain settings, as described below. Recall the terminology that X → Y → Z forms
a Markov chain if X and Z are conditionally independent given Y, or equivalently, Z
depends on (X, Y) only through Y.
lemma 16.1 (Data-processing inequality)
(i) If V → Y →  V forms a Markov chain, then I(V; 
V) ≤ I(V; Y).
(ii) If V → Y →  V forms a Markov chain conditioned on X, then I(V; 
V|X) ≤ I(V; Y|X).
 
(iii) If V → (X, Y) → V forms a Markov chain, then I(V; V) ≤ I(V; X, Y).
We will use the first part when X is absent or deterministic, the second part for random
non-adaptive X, and the third when the elements of X can be chosen adaptively on the
basis of the past samples (Section 16.1.1).

16.3.2 Tensorization
One of the most useful properties of mutual information is tensorization. Under
suitable conditional independence assumptions, mutual information terms containing
length-n sequences (e.g., Y = (Y1 , . . . , Yn )) can be upper-bounded by a sum of n mutual
information terms, the ith of which contains the corresponding entry of each associated
vector (e.g., Yi ). Thus, we can reduce a complicated mutual information term containing
sequences to a sum of simpler terms containing individual elements. The following
lemma provides some of the most common scenarios in which such tensorization can
be performed.
lemma 16.2 (Tensorization of mutual information) (i) If the entries of Y = (Y1 , . . . , Yn )
are conditionally independent given V, then


n
I(V; Y) ≤ I(V; Yi ). (16.18)
i=1
An Introductory Guide to Fano’s Inequality with Applications 497

(ii) If the entries of Y are conditionally independent given (V, X), and Yi depends on
(V, X) only through (V, Xi ), then


n
I(V; Y|X) ≤ I(V; Yi |Xi ). (16.19)
i=1

(iii) If, in addition to the assumptions in part (ii), Yi depends on (V, Xi ) only through
Ui = ψi (V, Xi ) for some deterministic function ψi , then


n
I(V; Y|X) ≤ I(Ui ; Yi ). (16.20)
i=1

The proof is based on the sub-additivity of entropy, along with the conditional indepen-
dence assumptions given. We will use the first part of the lemma when X is absent or
deterministic, and the second and third parts for random non-adaptive X. When X can
be chosen adaptively on the basis of the past samples (Section 16.1.1), the following
variant is used.
lemma 16.3 (Tensorization of mutual information for adaptive settings) (i) If Xi is a
function of (X1i−1 , Y1i−1 ), and Yi is conditionally independent of (X1i−1 , Y1i−1 ) given (V, Xi ),
then

n
I(V; X, Y) ≤ I(V; Yi |Xi ). (16.21)
i=1

(ii) If, in addition to the assumptions in part (i), Yi depends on (V, Xi ) only through
Ui = ψi (V, Xi ) for some deterministic function ψi , then


n
I(V; X, Y) ≤ I(Ui ; Yi ). (16.22)
i=1

The proof is based on the chain rule for mutual information, i.e., I(V; X, Y) =
n
i=1 I(Xi , Yi ; V | X1 , Y1 ), as well as suitable simplifications via the conditional
i−1 i−1

independence assumptions.

remark 16.3 The mutual information bounds in Lemma 16.3 are analogous to those
used in the problem of communication with feedback (see Section 7.12 of [34]). A key
difference is that, in the latter setting, the channel input Xi is a function of (V, X1i−1 , Y1i−1 ),
with V representing the message. In statistical estimation problems, the quantity V
being estimated is typically unknown to the decision-maker, so the input Xi is only a
function of (X1i−1 , Y1i−1 ).
remark 16.4 Lemma 16.3 should be applied with care, since, even if V is uniform
on some set a priori, it may not be uniform conditioned on Xi . This is because, in the
adaptive setting, Xi depends on Y1i−1 , which in turn depends on V.
498 Jonathan Scarlett and Volkan Cevher

16.3.3 KL-Divergence-Based Bounds


By definition, the mutual information is the KL divergence between the joint distribu-
tion and the product of marginals, I(V; Y) = D(PVY PV × PY ), and can equivalently be
viewed as a conditional divergence I(V; Y) = D(PY|V PY |PV ). Viewing the mutual infor-
mation in this way leads to a variety of useful bounds in terms of related KL-divergence
quantities, as the following lemma shows.
lemma 16.4 (KL-divergence-based bounds) Let PV , PY , and PY|V be the marginal
distributions corresponding to a pair (V, Y), where V is discrete. For any auxiliary
distribution QY , we have

I(V; Y) = PV (v)D PY|V (· | v) PY (16.23)
v

≤ PV (v)D PY|V (· | v) QY (16.24)
v

≤ max D PY|V (· | v) QY , (16.25)


v

and, in addition,

I(V; Y) ≤ PV (v)PV (v )D PY|V (· | v) PY (· | v ) (16.26)
v,v

≤ max D PY|V (· | v) PY|V (· | v ) . (16.27)


v,v

Proof We obtain (16.23) from the definition of mutual information, and


   
(16.24) from the fact that E log(PY|V (Y|V)/PY (Y)) = E log(PY|V (Y|V)/QY (Y)) −
 
E log(PY (Y)/QY (Y)) ; the second term here is a KL divergence, and is therefore
non-negative. We obtain (16.26) from (16.24) by noting that QY can be chosen to be
any of the PY (· | v ), and the remaining inequalities (16.25) and (16.27) are trivial.

The upper bounds in (16.24)–(16.27) are closely related, and often essentially
equivalent in the sense that they lead to very similar converse bounds. In the authors’
experience, it is usually slightly simpler to choose a suitable auxiliary distribution
QY and apply (16.25), rather than bounding the pair-wise divergences as in (16.27).
Examples will be given in Sections 16.4 and 16.6.
remark 16.5 We have used the generic notation Y in Lemma 16.4, but in applications
this may represent either the entire vector Y, or a single one of its entries Yi . Hence, the
lemma may be used to bound I(V; Y) directly, or one may first apply tensorization and
then use the lemma to bound each I(V; Yi ).
remark 16.6 Lemma 16.4 can also be used to bound conditional mutual information
terms such as I(V; Y|X). Conditioned on any X = x, we can upper-bound I(V; Y|X = x)
using Lemma 16.4, with an auxiliary distribution QY|X=x that may depend on x. For
instance, doing this for (16.25) and then averaging over X, we obtain for any QY|X that
An Introductory Guide to Fano’s Inequality with Applications 499


I(V; Y|X) ≤ max D PY|X,V (· | ·, v) QY|X PX (16.28)
v

≤ max D PY|X,V (· | x, v) QY|X (· | x) . (16.29)


x,v

The bound (16.25) in Lemma 16.4 is useful when there exists a single auxiliary
distribution QY that is “close” to each PY|V (·|v) in KL divergence, i.e., D(PY|V (· | v)  QY )
is small. It is natural to extend this idea by introducing multiple auxiliary distributions,
and requiring only that any one of them is close to a given PY|V (·|v). This can be viewed
as “covering” the conditional distributions {PY|V (·|v)}v∈V with “KL-divergence balls,”
and we will return to this viewpoint in Section 16.5.3.
lemma 16.5 (Mutual information bound via covering) Under the setup of Lemma 16.4,
suppose there exist N distributions Q1 (y), . . . , QN (y) such that, for all v and some  > 0,
it holds that
min D PY|V (· | v) Q j ≤ . (16.30)
j=1,...,N
Then we have
I(V; Y) ≤ log N + . (16.31)

The proof is based on applying (16.24) with QY (y) = (1/N) Nj=1 Q j (y), and then
lower-bounding this summation over j by the value j∗ (v) achieving the minimum in
(16.30). We observe that setting N = 1 in Lemma 16.5 simply yields (16.25).

16.3.4 Relations between KL Divergence and Other Measures


As evidenced above, the KL divergence plays a crucial role in applications of Fano’s
inequality. In some cases, directly characterizing the KL divergence can still be difficult,
and it is more convenient to bound it in terms of other divergences or distances. The
following lemma gives a few simple examples of such relations; the reader is referred
to [36] for a more thorough treatment.
lemma 16.6 (Relations between divergence measures) Fix two distributions P and
 
Q, and consider the KL divergence D(PQ) = EP log(P(Y)/Q(Y)) , total variation
 
(T V) dTV (P, Q) = 12 EQ |P(Y)/Q(Y) − 1| , squared Hellinger distance H 2 (P, Q) =
 √ 2    
EQ P(Y)/Q(Y) − 1 , and χ2 -divergence χ2 (PQ) = EQ P(Y)/Q(Y) − 1 2 . We have
• (KL versus TV) D(PQ) ≥ 2dTV (P, Q)2 , whereas if P and Q are probability mass
functions and each entry of Q is at least η > 0, then D(PQ)
 ≤ (2/η)dTV (P, Q) ;
2

• (Hellinger versus TV) 2 H (P, Q) ≤ dTV (P, Q) ≤ H(P, Q) 1 − H (P, Q)/4;


1 2 2

• (KL versus χ2 -divergence) D(PQ) ≤ log(1 + χ2 (PQ)) ≤ χ2 (PQ).

16.4 Applications − Discrete Settings

In this section, we provide two examples of statistical estimation problems in which


the quantity being estimated is discrete: group testing and graphical model selection.
Our goal is not to treat these problems comprehensively, but rather to study particular
instances that permit a simple analysis while still illustrating the key ideas and tools
500 Jonathan Scarlett and Volkan Cevher

introduced in the previous sections. We consider the high-dimensional setting, in


which the underlying number of parameters being estimated is much higher than the
number of measurements. To simplify the final results, we will often write them using
the asymptotic notation o(1) for asymptotically vanishing terms, but non-asymptotic
variants are easily inferred from the proofs.

16.4.1 Group Testing


The group-testing problem consists of determining a small subset of “defective” items
within a larger set of items on the basis of a number of pooled tests. A given test contains
some subset of the items, and the binary test outcome indicates, possibly in a noisy
manner, whether or not at least one defective item was included in the test. This problem
has a history in medical testing [37], and has regained significant attention following
applications in communication protocols, pattern matching, database systems, and more.
In more detail, the setup is described as follows.
• In a population of p items, there are k unknown defective items. This defective
 p set
is denoted by S ⊆ {1, . . . , p}, and is assumed to be uniform on the set of k subsets
having cardinality k. Hence, in this example, we are in the Bayesian setting with a
uniform prior. We focus on the sparse setting, in which k  p, i.e., defective items
are rare.
• There are n tests specified by a test matrix X ∈ {0, 1}n×p . The (i, j)th entry of X,
denoted by Xi j , indicates whether item j is included in test i. We initially consider
the non-adaptive setting, where X is chosen in advance. We allow this choice to be
random; for instance, a common choice of random design is to let the entries of X be
i.i.d. Bernoulli random variables.
• To account for possible noise, we consider the following observation model:

Yi = Xi j ⊕ Zi , (16.32)
j∈S

 
where Zi ∼ Bernoulli() for some  ∈ 0, 12 , ⊕ denotes modulo-2 addition, and ∨ is
the “OR” operation. In the channel coding terminology, this corresponds to passing

the noiseless test outcome j∈S Xi j through a binary symmetric channel. We assume
that the noise variables Zi are independent of each other and of X, and we define the
vector of test outcomes Y = (Y1 , . . . , Yn ).
• Given X and Y, a decoder forms an estimate  S of S . We initially consider the exact
recovery criterion, in which the error probability is given by

Pe = P[
S  S ], (16.33)

where the probability with respect to S , X, and Y.


In the following sections, we present several results and analysis techniques that are
primarily drawn from [2, 3].
An Introductory Guide to Fano’s Inequality with Applications 501

Exact Recovery with Non-Adaptive Testing


Under the exact recovery criterion (16.33), we have the following lower bound on the
required number of tests. Recall that H2 (α) = α log(1/α) + (1 − α)log(1/(1 − α)) denotes
the binary entropy function.
theorem 16.4 (Group testing with exact recovery) Under the preceding noisy group-
testing setup, in order to achieve Pe ≤ δ, it is necessary that
k log(p/k)
n≥ (1 − δ − o(1)) (16.34)
log 2 − H2 ()
as p → ∞, possibly with k → ∞ simultaneously.
Proof Since S is discrete-valued, we can use the trivial reduction to multiple hypoth-
esis testing with V = S . Applying Fano’s inequality (Theorem 16.1) with conditioning
on X (Section 16.2.3), we obtain
p
I(S ; Y|X) ≥ (1 − δ) log − log 2, (16.35)
k

where we have also upper-bounded I(S ;  S |X) ≤ I(S ; Y|X) using the data-processing
inequality (from the second part of Lemma 16.1), which in turn uses the fact that
S →Y→ S conditioned on X.

Let Ui = j∈S Xi j denote the hypothetical noiseless outcome. Since the noise vari-
ables {Zi }ni=1 are independent and Yi depends on (S , X) only through Ui (see (16.32)),
we can apply tensorization (from the third part of Lemma 16.2) to obtain

n
I(S ; Y|X) ≤ I(Ui ; Yi ) (16.36)
i=1
 
≤ n log 2 − H2 () , (16.37)

where (16.37) follows since Yi is generated from Ui according to a binary


 p symmetric

channel, which has capacity log 2 − H2 (). By substituting (16.37) and k ≥ p/k k into
(16.35) and rearranging, we obtain (16.34).
Theorem 16.4 is known to be tight in terms of scaling laws whenever δ ∈ (0, 1) is fixed
and k = o(p), and, perhaps more interestingly, tight including constant factors as δ → 0
under the scaling k = O(pθ ) for sufficiently small θ > 0. The matching achievability
result in this regime can be proved using maximum-likelihood decoding [38]. However,
achieving such a result using a computationally efficient decoder remains a challenging
problem.

Approximate Recovery with Non-Adaptive Testing


We now move to an approximate recovery criterion. The decoder outputs a list
L ⊆ {1, . . . , p} of cardinality L ≥ k, and we require that at least a fraction (1 − α)k of the
defective items appear in the list, for some α ∈ (0, 1). It follows that the error probability
can be written as

Pe (t) = P[d(S , L) > t], (16.38)


502 Jonathan Scarlett and Volkan Cevher

where d(S , L) = |S \L|, and t = αk. Notice that a higher value of L means that more
non-defective items may be included in the list, whereas a higher value of α means that
more defective items may be absent.
theorem 16.5 (Group testing with approximate recovery) Under the preceding noisy
group-testing setup with list size L ≥ k, in order to achieve Pe (αk) ≤ δ for some α ∈ (0, 1)
(not depending on p), it is necessary that
(1 − α)k log(p/L)
n≥ (1 − δ − o(1)) (16.39)
log 2 − H2 ()
as p → ∞, k → ∞, and L → ∞ simultaneously with L = o(p).
Proof We apply the approximate recovery version of Fano’s inequality (Theorem 16.2)
with d(S , L) = |S \L| and t = αk as above. For any L with cardinality L, the number of
αk  p−L L 
S with d(S , L) ≤ αk is given by Nmax (t) = j=0 j k− j , which follows by counting
the number of ways to place k − j defective items in L, and the remaining j defective
items in the other p − L entries. Hence, using Theorem 16.2 with conditioning on X (see
Section 16.2.3), and applying the data-processing inequality (from the second part of
Lemma 16.1), we obtain
 p
k
I(S ; Y|X) ≥ (1 − δ) log αk  p−L L
 − log 2. (16.40)
j=0 j k− j

By upper-bounding the summation by αk + 1 times the maximum value, and


performing some asymptotic simplifications via the assumption L = o(p), we can
 
simplify the logarithm to k log(p/L) (1 + o(1)) [39]. The theorem is then established by
upper-bounding the conditional mutual information using (16.37).
Theorem 16.5 matches Theorem 16.4 up to the factor of 1 − α and the replacement
of log(p/k) by log(p/L), suggesting that approximate recovery provides a minimal
reduction in the number of tests even for moderate values of α and L. However, under
approximate recovery, a near-matching achievability bound is known under the scaling
k = O(pθ ) for all θ ∈ (0, 1), rather than only for sufficiently small θ [38].

Adaptive Testing
Next, we discuss the adaptive-testing setting, in which a given input vector Xi ∈ {0, 1} p ,
corresponding to a single row of X, is allowed to depend on the previous inputs and
outcomes, i.e., X1i−1 = (X1 , . . . , Xi−1 ) and Y1i−1 = (Y1 , . . . , Yi−1 ). In fact, it turns out that
Theorems 16.4 and 16.5 still apply in this setting. Establishing this simply requires
making the following modifications to the above analysis.
• Apply the data-processing inequality in the form of the third part of Lemma 16.1,
yielding (16.35) and (16.40) with I(S ; X, Y) in place of I(S ; Y|X).
• Apply tensorization via Lemma 16.3 to deduce (16.36) and (16.37) with I(S ; X, Y)
in place of I(S ; Y|X).
An Introductory Guide to Fano’s Inequality with Applications 503

In the regimes where Theorems 16.4 and/or 16.5 are known to have matching upper
bounds with non-adaptive designs, we can clearly deduce that adaptivity provides no
asymptotic gain. However, as with approximate recovery, adaptivity can significantly
broaden the conditions under which matching achievability bounds are known, at least
in the noiseless setting [40].

Discussion: General Noise Models


The preceding analysis can easily be extended to more general group-testing models in
which the observations (Y1 , . . . , Yn ) are conditionally independent given X. A broad class

of such models can be written in the form (Yi |Ni ) ∼ PY|N , where Ni = j∈S 1{Xi j = 1}
denotes the number of defective items in the ith test. In such cases, the preceding results
hold true more generally when log 2 − H2 () is replaced by the capacity maxPN I(N; Y)
of the “channel” PY|N .
For certain models, we can obtain a better lower bound by applying a genie argument,
along with the conditional form of Fano’s inequality in Theorem 16.3. Fix  ∈ {1, . . . , k},
and suppose that a uniformly random subset S (1) ⊆ S of cardinality k −  is revealed
to the decoder. This extra information can only make the group-testing problem easier,
so any converse bound for this modified setting remains valid for the original setting.
Perhaps counter-intuitively, this idea can lead to a better final bound.
We only briefly outline the details of this more general analysis, and refer the
interested reader to [3, 41]. Using Theorem 16.3 with A = S (1) , and applying the
data-processing inequality and tensorization, one can obtain
n
I(Ni(0) ; Yi |Ni(1) ) − log 2
Pe ≥ 1 − i=1   , (16.41)
log p−k+ 

where Ni(1) = j∈S (1) 1{Xi j = 1}, and Ni(0) = Ni − Ni(1) . The intuition is that we condition
on Ni(1) since it is known via the genie, while the remaining information about Yi is
determined by Ni(0) . Once (16.41) has been established, it remains only to simplify the
mutual information terms; see [3, 41] for further details.

16.4.2 Graphical Model Selection


Graphical models provide compact representations of the conditional independence
relations between random variables, and frequently arise in areas such as image
processing, statistical physics, computational biology, and natural-language processing.
The fundamental problem of graphical model selection consists of recovering the graph
structure given a number of independent samples from the underlying distribution.
Graphical model selection has been studied under several different families of joint
distributions, and also several different graph classes. We focus our attention on the
commonly used Ising model with binary observations, and on a simple graph class
known as forests, defined to contain the graphs having no cycles.
504 Jonathan Scarlett and Volkan Cevher

Figure 16.2 Two examples of graphs that are forests (i.e., acyclic graphs); the graph on the right is
also a tree (i.e., a connected acyclic graph).

Formally, the setup is described as follows.


• We are given n independent samples Y1 , . . . , Yn from a p-dimensional joint distribu-
tion: Yi = (Yi1 , . . . , Yip ) for i = 1, . . . , n. This joint distribution is encoded by a graph
G = (V, E), where V = {1, . . . , p} is the vertex set, and E ⊆ V × V is the edge set. We
use the terminology vertex and node interchangeably. We assume that there are no
edges from a vertex to itself, and that the edges are undirected: (i, j) ∈ E and ( j, i) ∈ E
are equivalent, and count as only one edge.
• We focus on the Ising model, in which the observations are binary-valued, and the
joint distribution of a given sample, say Y1 = (Y11 , . . . , Y1p ) ∈ {−1, 1} p , is

1 
PG (y1 ) = exp λ y1i y1 j , (16.42)
Z (i, j)∈E

where Z is a normalizing constant. Here λ > 0 is a parameter to the distribution


dictating the edge strength; a higher value means it is more likely that Y1i = Y1 j for
any given edge (i, j) ∈ E.
• We restrict the graph G = (V, E) to be the set of all forests:

Gforest = G : G has no cycles}, (16.43)

where a cycle is defined to be a path of distinct edges leading back to the start node,
e.g., (1, 4), (4, 2), (2, 1). A special case of a forest is a tree, which is an acyclic graph for
which a path exists between any two nodes. One can view any forest as being a dis-
joint union of trees, each defined on some subset of V. See Fig. 16.2 for an illustration.
• Let Y ∈ {−1, 1}n×p be the matrix whose ith row contains the p entries of the ith
sample. Given Y, a decoder forms an estimate G  of G, or equivalently, an estimate E 
of E. We initially focus on the exact-recovery criterion, in which the minimax error
probability is given by

  G],
Mn (Gforest , λ) = inf sup PG [G (16.44)
 G∈Gforest
G

where PG denotes the probability when the true graph is G, and the infimum is over
all estimators.
To the best of our knowledge, Fano’s inequality has not been applied previously in this
exact setup; we do so using the general tools for Ising models given in [26, 27, 42, 43].
An Introductory Guide to Fano’s Inequality with Applications 505

Exact Recovery
Under the exact-recovery criterion, we have the following.
theorem 16.6 (Exact recovery of forest graphical models) Under the preceding Ising
graphical model selection setup with a given edge parameter λ > 0, in order to achieve
Mn (Gforest , λ) ≤ δ, it is necessary that
 
log p 2 log p
n ≥ max , (1 − δ − o(1)) (16.45)
log 2 λ tanh λ
as p → ∞.
Proof Recall from Section 16.1.1 that we can lower-bound the worst-case error
probability over Gforest by the average error probability over any subset of Gforest . This
gives us an important degree of freedom in the reduction to multiple hypothesis testing,
and corresponds to selecting a hard subset θ1 , . . . , θ M as described in Section 16.1.1. We
refer to a given subset G ⊆ Gforest as a graph ensemble, and provide two choices that
lead to the two terms in (16.45).
For any choice of G ⊆ Gforest , Fano’s inequality (Theorem 16.1) gives
(1 − δ) log |G| − log 2
n≥ , (16.46)
I(G; Y1 )
for G uniform on G, where we used I(G; G)  ≤ I(G; Y) ≤ nI(G; Y1 ) by the data-processing
inequality and tensorization (from the first parts of Lemmas 16.1 and 16.2).
Restricted Ensemble 1. Let G1 be the set of all trees. It is well known from graph
theory that the number of trees on p nodes is |G1 | = p p−2 [44]. Moreover, since Y1 is
a length-p binary sequence, we have I(G; Y1 ) ≤ H(Y1 ) ≤ p log 2. Hence, (16.46) yields
n ≥ ((1 − δ)(p − 2) log p − log 2)/(p log 2), implying the first bound in (16.45).
Restricted
  Ensemble 2. Let G2 be the set of graphs containing a single edge, so that
|G2 | = 2p . We will upper-bound the mutual information using (16.25) in Lemma 16.4,
choosing the auxiliary distribution QY to be PG , with G being the empty graph. Thus,
we need to bound D(PG PG ) for each G ∈ G2 .
We first give an upper bound on D(PG PG ) for any two graphs (G,G). We start with
the trivial bound
D(PG PG ) ≤ D(PG PG ) + D(PG PG ). (16.47)
 
Recall the definition D(PQ) = EP log(P(Y)/Q(Y)) , and consider the substitution of
PG and PG according to (16.42), with different normalizing constants ZG and ZG . We
see that when we sum the two terms in (16.47), the normalizing constants inside the
logarithms cancel out, and we are left with
  
D(PG PG ) ≤ λ EG [Y1i Y1 j ] − EG [Y1i Y1 j ]
(i, j)∈E\E
  
+ λ EG [Y1i Y1 j ] − EG [Y1i Y1 j ] (16.48)
(i, j)∈E\E

for G = (V, E) and G = (V, E).


506 Jonathan Scarlett and Volkan Cevher

In the case that G has a single edge (i.e., G ∈ G2 ) and G is the empty graph, we can
easily compute EG [Y1i Y1 j ] = 0, and (16.48) simplifies to
D(PG PG ) ≤ λEG [Y1i Y1 j ], (16.49)
where (i, j) is the unique edge in G. Since Y1i and Y1 j only take values in {−1, 1}, we
have EG [Y1i Y1 j ] = (+1)P[Y1i = Y1 j ] + (−1)P[Y1i  Y1 j ] = 2P[Y1i = Y1 j ] − 1, and letting
E have a single edge in (16.42) yields PG [(Y1i , Y1 j ) = (yi , y j )] = eλyi y j /(2eλ + 2e−λ ), and
hence PG [Y1i = Y1 j ] = eλ /(eλ + e−λ ). Combining this with EG [Y1i Y1 j ] = 2P[Y1i = Y1 j ] − 1
yields EG [Y1i Y1 j ] = 2eλ /(eλ + e−λ ) − 1 = tanh λ. Hence, using (16.49) along with
(16.25) in Lemma 16.4, we obtain I(G; Y1 ) ≤ λ tanh λ. Substitution into (16.46) (with
log |G| = (2 log p)(1 + o(1))) yields the second bound in (16.45).
Theorem 16.6 is known to be tight up to constant factors whenever λ = O(1) [44, 45].
When λ is constant, the lower bound becomes n = Ω(log p), whereas for asymptotically
 
vanishing λ it simplifies to n = Ω (1/λ2 ) log p .

Approximate Recovery
We consider the approximate recovery of G = (V, E) with respect to the edit distance
 = |E\E|
d(G, G)  + |E\E|,
 which is the number of edge additions and removals needed

to transform G into G or vice versa. Since any forest can have at most p − 1 edges, it is
natural to consider the case in which an edit distance of up to αp is permitted, for some
α > 0. Hence, the minimax risk is given by
 > αp].
Mn (Gforest , λ, α) = inf sup PG [d(G, G) (16.50)
 G∈Gforest
G

In this setting, we have the following.


theorem 16.7 (Approximate recovery of forest graphical models) Under the preceding
Ising graphical model selection setup with a given edge parameter λ > 0 and approx-
 
imate recovery parameter α ∈ 0, 12 (with the latter not depending on p), in order to
achieve Mn (Gforest , λ, α) ≤ δ, it is necessary that
 
(1 − α) log p 2(1 − α) log p
n ≥ max , (1 − δ − o(1)) (16.51)
log 2 λ tanh λ
as p → ∞.
Proof For any G ⊆ Gforest , Theorem 16.2 provides the following analog of (16.46):
(1 − δ) log(|G|/Nmax (αp)) − log 2
n≥ (16.52)
I(G; Y1 )
  ≤ t} implicitly depends on
for G uniform on G, where Nmax (t) = maxG G∈G 1{d(G, G)
G. We again consider two restricted ensembles; the first is identical to the exact recovery
setting, whereas the second is modified due to the fact that learning single-edge graphs
with approximate recovery is trivial.
Restricted Ensemble 1. Once again, let G1 be the set of all trees. We have already
established |G1 | = (p − 2) log p and I(G; Y1 ) ≤ n log 2 for this ensemble, so it remains
only to characterize Nmax (αp).
An Introductory Guide to Fano’s Inequality with Applications 507

While the decoder may output a graph G  not lying in G1 , we can assume with-
 is always selected such that d(G,G
out loss of generality that G  ∗ ) ≤ αp for some
G∗ ∈ G1 ; otherwise, an error would be guaranteed. As a result, for any G,  and

any G ∈ G1 such that d(G, G) ≤ αp, we have from the triangle inequality that
 + d(G,G
d(G,G∗ ) ≤ d(G, G)  ∗ ) ≤ 2αp, which implies that

Nmax (αp) ≤ 1{d(G,G∗ ) ≤ 2αp}. (16.53)
G∈G1

Now observe that, since all graphs in G1 have exactly p − 1 edges, transforming G to G∗
requires removing j edges and adding j different edges, for some j ≤ αp. Hence, we have
αp
 
 p−1 p − p+1
Nmax (αp) ≤ 2
. (16.54)
j=0
j j

By upper-bounding the summation by αp + 1 times the maximum, and performing


 
some asymptotic simplifications, we can show that log Nmax (αp) ≤ αp log p (1 + o(1)).
By substituting into (16.52) and recalling that |G1 | = (p − 2) log p and I(G; Y1 ) ≤ p log 2,
we obtain the first bound in (16.51).
Restricted Ensemble 2a. Let G2a be the set of all graphs on p nodes containing exactly
p/2 isolated edges; if p is an odd number, the same analysis applies with an arbitrary
  |G
single node ignored. We proceed by characterizing  2a|,4
I(G;
 Y1 ), and Nmax (αp). The
number of graphs in the ensemble is |G2a | = 2p p−2 · · · 2
2 2 = p!/2
p/2 , and Stirling’s
  2
approximation yields log |G2a | ≥ p log p (1 + o(1)).
Since the KL divergence is additive for product distributions, and we established in
the exact-recovery case that the KL divergence between the distributions of a single-edge
graph and an empty graph is at most λ tanh λ, we deduce that D(PG PG ) ≤ (p/2)λ tanh λ
for any G ∈ G2a , where G is the empty graph. We therefore obtain from Lemma 16.4
that I(G; Y1 ) ≤ (p/2)λ tanh λ.
αp  (2p)−p/2
A similar argument to that of Ensemble 1 yields Nmax (αp) ≤ j=0 p/2 ,
 j j
in analogy with (16.54). This again simplifies to Nmax (αp) ≤ αp log p (1 + o(1)),
 
and, having established log |G2a | ≥ p log p (1 + o(1)) and I(G; Y1 ) ≤ (p/2)λ tanh λ,
substitution into (16.52) yields the second bound in (16.51).

The bound in Theorem 16.7 matches that of Theorem 16.6 up to a multiplicative


factor of 1 − α, thus suggesting that approximate recovery does not significantly help
in reducing the required number of samples, at least in the minimax sense, for the Ising
model and forest graph class.

Adaptive Sampling
We now return to the exact-recovery setting, and consider a modification in which we
have an added degree of freedom in the form of adaptive sampling.
• The algorithm proceeds in rounds; in round i, the algorithm queries a subset of the
p nodes indexed by Xi ∈ {0, 1} p , and the corresponding sample Yi is generated as
follows.
508 Jonathan Scarlett and Volkan Cevher

◦ The joint distribution of the entries of Yi , corresponding to the entries where


Xi is one, coincides with the corresponding marginal distribution of PG , with
independence between rounds.
◦ The values of the entries of Yi , corresponding to the entries where Xi is zero, are
given by ∗, a symbol indicating that the node was not observed.
We allow Xi to be selected on the basis of past queries and samples, namely,
X1i−1 = (X1 , . . . , Xi−1 ) and Y1i−1 = (Y1 , . . . , Yi−1 ).
• Let n(Xi ) denote the number of ones in Xi , i.e., the number of nodes observed in
round i. While we allow the total number of rounds to vary, we restrict the algorithm
to output an estimate G  after observing at most nnode nodes. This quantity is related
to n in the non-adaptive setting according to nnode = np, since in the non-adaptive
setting we always observe all p nodes in each sample.
• The minimax risk is given by
  G],
Mnnode (Gforest , λ) = inf sup PG [G (16.55)
 G∈Gforest
G

where the infimum is over all adaptive algorithms that observe at most nnode nodes
in total.
theorem 16.8 (Adaptive sampling for forest graphical models) Under the preceding
Ising graphical model selection problem with adaptive sampling and a given parameter
λ > 0, in order to achieve Mnnode (Gforest , λ) ≤ δ, it is necessary that
 
p log p 2p log p
nnode ≥ max , (1 − δ − o(1)) (16.56)
log 2 λ tanh λ
as p → ∞.
Proof We prove the result using Ensemble 1 and Ensemble 2a above. We let N denote
the number of rounds; while this quantity is allowed to vary, we can assume without loss
of generality that N = nnode by adding or removing rounds where no nodes are queried.
For any subset G ⊆ Gforest , applying Fano’s inequality (Theorem 16.1) and tensorization
(from the first part of Theorem 16.3) yields

N
I(G; Yi |Xi ) ≥ (1 − δ) log |G| − log 2, (16.57)
i=1

where G is uniform on G.
Restricted Ensemble 1. We again let G1 be the set of all trees, for which we know
that |G| = p p−2 . Since the n(Xi ) entries of Yi differing from ∗ are binary, and those
equaling ∗ are deterministic given Xi , we have I(G; Yi |Xi = xi ) ≤ n(xi ) log 2. Averaging
N N
over Xi and summing over i yields i=1 I(G; Yi |Xi ) ≤ i=1 E[n(Xi )] log 2 ≤ nnode log 2,
and substitution into (16.57) yields the first bound in (16.56).
Restricted Ensemble 2. We again use the above-defined ensemble G2a of graphs with
 
p/2 isolated edges, for which we know that |G2a | ≥ p log p (1 + o(1)). In this case,
when we observe n(Xi ) nodes, the subgraph corresponding to these observed nodes
has at most n(Xi )/2 edges, all of which are isolated. Hence, using Lemma 16.4, the
An Introductory Guide to Fano’s Inequality with Applications 509

above-established fact that the KL divergence from a single-edge graph to the empty
graph is at most λ tanh λ, and the additivity of KL divergence for product distributions,
we deduce that I(G; Yi |Xi = xi ) ≤ (n(xi )/2)λ tanh λ. Averaging over Xi and summing over
N
i yields i=1 I(G; Yi |Xi ) ≤ 12 nnode λ tanh λ, and substitution into (16.57) yields the second
bound in (16.56).
The threshold in Theorem 16.8 matches that of Theorem 16.6, and, in fact, a similar
analysis under approximate recovery also recovers the threshold in Theorem 16.7. This
suggests that adaptivity is of limited help in the minimax sense for the Ising model
and forest graph class. There are, however, other instances of graphical model selection
where adaptivity provably helps [43, 46].

Discussion: Other Graph Classes


Degree and edge constraints. While the class Gforest is a relatively easy class to handle,
similar techniques have also been used for more difficult classes, notably including
those that place restrictions on the maximal degree d and/or the number of edges k.
Ensembles 2 and 2a above can again be used, and the resulting bounds are tight in
certain scaling regimes where λ → 0, but loose in other regimes due to their lack of
dependence on d and k. To obtain bounds with such a dependence, alternative ensembles
consisting of subgraphs with highly correlated nodes have been proposed [26, 27, 42].
For instance, suppose that a group of d + 1 nodes has all possible edges connected
except one. Unless d or the edge strength λ is small, the high connectivity makes the
nodes very highly correlated, and the subgraph is difficult to distinguish from a fully
connected subgraph. This is in contrast with Ensembles 2 and 2a above, whose graphs
are difficult to distinguish from the empty graph.
Bayesian setting. Beyond minimax estimation, it is also of interest to understand
the fundamental limits of random graphs. A particularly prominent example is the
Erdős–Rényi random graph, in which each edge is independently included with some
probability q ∈ (0, 1). This is a case where the conditional form of Fano’s inequality has
proved useful; specifically, one can apply Theorem 16.3 with A = G, and A equal to the
following typical set of graphs:
 
p p
T = G : (1 − )q ≤ |E| ≤ (1 + )q , (16.58)
2 2
where  > 0 is a constant.  Standard
 properties of typical sets [34] yield that
p   
P[GER ∈ T ] → 1, |T | = e H2 (q)(2) (1+O()) , and H(V|V ∈ T ) = H2 (q) 2p (1 + O())
 
whenever q 2p → ∞, and once these facts have been established, Theorem 16.3 yields
the following necessary condition for Pe ≤ δ:
pH2 (q)
n≥ (1 − δ − o(1)). (16.59)
2 log 2
 
For instance, in the case that q = O 1/p (i.e., there are O(p) edges on average), we have
 
H2 (q) = Θ (log p /p), and we find that n = Ω(log p) samples are necessary. This scaling
is tight when λ is constant [45], whereas improved bounds for other scalings can be
found in [27].
510 Jonathan Scarlett and Volkan Cevher

16.5 From Discrete to Continuous

Thus far, we have focused on using Fano’s inequality to provide converse bounds
for the estimation of discrete quantities. In many, if not most, statistical applications,
one is instead interested in estimating continuous quantities; examples include linear
regression, covariance estimation, density estimation, and so on. It turns out that the
discrete form of Fano’s inequality is still broadly applicable in such settings. The idea,
as outlined in Section 16.1, is to choose a finite subset that still captures the inherent
difficulty in the problem. In this section, we present several tools used for this purpose.

16.5.1 Minimax Estimation Setup


Recall the setup described in Section 16.1.1: a parameter θ is known to lie in some
subset Θ of a continuous domain (e.g., R p ), the samples Y = (Y1 , . . . , Yn ) are drawn from
a joint distribution Pnθ (y), an estimate 
θ is formed, and the loss incurred is (θ, θ). For
clarity of exposition, we focus primarily on the case in which there is no input, i.e., X
in Fig. 16.1 is absent or deterministic. However, the main results (Theorems 16.9 and
16.10 below) extend to settings with inputs as described in Section 16.1.1; the mutual
information I(V; Y) is replaced by I(V; Y|X) in the non-adaptive setting, or I(V; X, Y) in
the adaptive setting.
In continuous settings, the reduction to multiple hypothesis testing (see Fig. 16.1)
requires that the loss function is sufficiently well behaved. We focus here on a widely
considered class of functions that can be written as
 
(θ,
θ) = Φ ρ(θ,
θ) , (16.60)

where ρ(θ, θ ) is a metric, and Φ(·) is an increasing function from R+ to R+ . For instance,
the squared-2 loss (θ, θ ) = θ − θ 22 clearly takes this form.
We focus on the minimax setting, defining the minimax risk as follows:
 
Mn (Θ, ) = inf sup Eθ (θ,
θ) , (16.61)

θ θ∈Θ

where the infimum is over all estimators 


θ =
θ(Y), and Eθ denotes expectation when the
underlying parameter is θ. We subsequently define Pθ analogously.

16.5.2 Reduction to the Discrete Case


We present two related approaches to reducing the continuous estimation problem to
a discrete one. The first, which is based on the standard form of Fano’s inequality in
Theorem 16.1, was discovered much earlier [12], and, accordingly, it has been used in
a much wider range of applications. However, the second approach, which is based on
the approximate recovery version of Fano’s inequality in Theorem 16.2, has recently
been shown to provide added flexibility in the reduction [35].
An Introductory Guide to Fano’s Inequality with Applications 511

Reduction with Exact Recovery


As we discussed in Section 16.1, we seek to reduce the continuous problem to multiple
hypothesis testing in such a way that successful minimax estimation implies success in
the hypothesis test with high probability. To this end, we choose a hard subset θ1 , . . . , θ M ,
for which the elements are sufficiently well separated that the index v ∈ {1, . . . , M} can
be identified from the estimate θ (see Fig. 16.1). This is formalized in the proof of the
following result.
theorem 16.9 (Minimax bound via reduction to exact recovery) Under the preceding
minimax estimation setup, fix  > 0, and let {θ1 , . . . , θ M } be a finite subset of Θ such that

ρ(θv , θv ) ≥ , ∀v, v ∈ {1, . . . , M}, v  v . (16.62)

Then, we have
 I(V; Y) + log 2
Mn (Θ, ) ≥ Φ 1− , (16.63)
2 log M
where V is uniform on {1, . . . , M}, and the mutual information is with respect to
V → θV → Y. Moreover, in the special case M = 2, we have
 −1  
Mn (Θ, ) ≥ Φ H log 2 − I(V; Y) , (16.64)
2 2

where H2−1 (·) ∈ [0, 0.5] is the inverse binary entropy function.
Proof As illustrated in Fig. 16.1, the idea is to reduce the estimation problem to
a multiple-hypothesis-testing problem. As an initial step, we note from Markov’s
inequality that, for any 0 > 0,
 
θ) ≥ sup Φ(0 )Pθ [(θ,
sup Eθ (θ, θ) ≥ Φ(0 )] (16.65)
θ∈Θ θ∈Θ
= Φ(0 ) sup Pθ [ρ(θ,
θ) ≥ 0 ], (16.66)
θ∈Θ

where (16.66) uses (16.60) and the assumption that Φ(·) is increasing.
Suppose that a random index V is drawn uniformly from {1, . . . , M}, the samples
Y are drawn from the distribution Pnθ corresponding to θ = θV , and the estimator is
applied to produce  θ. Let  V correspond to the closest θ j according to the metric ρ, i.e.,
 
V = arg minv=1,...,M ρ(θv , θ). Using the triangle inequality and the assumption (16.62), if
ρ(θv ,
θ) < /2 then we must have  V = v; hence,
 
Pv ρ(θv ,
θ) ≥ ≥ Pv [
V  v], (16.67)
2
where Pv is a shorthand for Pθv .
With the above tools in place, we proceed as follows:
   
sup Pθ ρ(θ, θ) ≥ ≥ max Pv ρ(θv ,
θ) ≥ (16.68)
θ∈Θ 2 v=1,...,M 2

≥ max Pv [V  v] (16.69)
v=1,...,M
512 Jonathan Scarlett and Volkan Cevher

1 
≥ Pv [
V  v] (16.70)
M v=1,...,M
I(V; Y) + log 2
≥ 1− , (16.71)
log M
where (16.68) follows upon maximizing over a smaller set, (16.69) follows from (16.67),
(16.70) lower-bounds the maximum by the average, and (16.71) follows from Fano’s
inequality (Theorem 16.1) and the fact that I(V;  V) ≤ I(V; Y) by the data-processing
inequality (Lemma 16.1).
The proof of (16.63) is concluded by substituting (16.71) into (16.66) with 0 = /2,
and taking the infimum over all estimators 
θ. For M = 2, we obtain (16.64) in the same
way upon replacing (16.71) by the version of Fano’s inequality for M = 2 given in
Remark 16.1.

We return to this result in Section 16.5.3, where we introduce and compare some
of the most widely used approaches to choosing the set {θ1 , . . . , θ M } and bounding the
mutual information.

Reduction with Approximate Recovery


The following generalization of Theorem 16.9, which is based on Fano’s inequality
with approximate recovery (Theorem 16.2), provides added flexibility in the reduction.
An example comparing the two approaches will be given in Section 16.6 for the sparse
linear regression problem.
theorem 16.10 (Minimax bound via reduction to approximate recovery) Under the
preceding minimax estimation setup, fix  > 0, t ∈ R, a finite set V of cardinality M, and
an arbitrary real-valued function d(v, v ) on V × V, and let {θv }v∈V be a finite subset of
Θ such that

d(v, v ) > t ⇒ ρ(θv , θv ) ≥ , ∀v, v ∈ V. (16.72)

Then we have for any  ≥ 0 that

 I(V; Y) + log 2
Mn (Θ, ) ≥ Φ 1− , (16.73)
2 log(M/Nmax (t))
where V is uniform on {1, . . . , M}, the mutual information is with respect to V → θV → Y,

and Nmax (t) = maxv ∈V v∈V 1{d(v, v ) ≤ t}.
The proof is analogous to that of Theorem 16.9, and can be found in [35].

16.5.3 Local versus Global Approaches


Here we highlight two distinct approaches to applying the reduction to exact recovery
as per Theorem 16.9, termed the local and global approaches. We do not make such
a distinction for the approximate-recovery variant in Theorem 16.10, since we are not
aware of a global approach having been used previously for this variant.
An Introductory Guide to Fano’s Inequality with Applications 513

Local approach. The most common approach to applying Theorem 16.9 is to


construct a set {θ1 , . . . , θ M } of elements that are “close” in KL divergence. Specifically,
upper-bounding the mutual information via Lemma 16.4 (with the vector Y playing the
role of Y therein), one can weaken (16.63) as follows.
corollary 16.1 (Local approach to minimax estimation) Under the setup of Theorem
16.9 with a given set {θ1 , . . . , θ M } satisfying (16.62), it holds for any auxiliary distribution
Qn (y) that

 minv=1,...,M D(Pnθv Qn ) + log 2


Mn (Θ, ) ≥ Φ 1− . (16.74)
2 log M

Moreover, the same bound holds true when minv D(Pnθv Qn ) is replaced by any one of
 
(1/M) v D(Pnθv Q), (1/M 2 ) v,v D(Pnθv Pnθ ), or maxv,v D(Pnθv Pnθ ).
v v

Attaining a good bound in (16.74) requires choosing {θ1 , . . . , θ M } to trade off two


competing objectives: (i) a larger value of M means that more hypotheses need to be
distinguished; and (ii) a smaller value of minv D(Pnθv Qn ) means that the hypotheses
are more similar. Generally speaking, there is no single best approach to optimizing
this trade-off, and the size and structure of the set can vary significantly from problem
to problem. Moreover, the construction need not be explicit; one can instead use
probabilistic arguments to prove the existence of a set satisfying the desired properties.
Examples are given in Section 16.6. Naturally, an analog of Corollary 16.1 holds for
M = 2 as per Theorem 16.9, and a counterpart for approximate recovery holds as per
Theorem 16.10.
We briefly mention that Corollary 16.1 has interesting connections with the popular
Assouad method from the statistics literature, as detailed in [47]. In addition, the
counterpart of Corollary 16.1 with M = 2 (using (16.10) in its proof) is similarly related
to an analogous technique known as Le Cam’s method.
Global approach. An alternative approach to applying Theorem 16.9 is the global
approach, which performs the following: (i) construct a subset of Θ with as many
elements as possible subject to the assumption (16.62); and (ii) construct a set that
covers Θ, in the sense of Lemma 16.5, with as few elements as possible. The following
definitions formalize the notions of forming “as many” and “as few” elements as possi-
ble. We write these in terms of a general real-valued function ρ0 (θ, θ ) that need not be a
metric.
definition 16.1 A set {θ1 , . . . , θ M } ⊆ Θ is said to be an  p -packing set of Θ with respect
to a measure ρ0 : Θ×Θ → R if ρ0 (θv , θv ) ≥  p for all v, v ∈ {1, . . . , M} with v  v. The  p -
packing number Mρ∗0 (Θ,  p ) is defined to be the maximum cardinality of any  p -packing.
definition 16.2 A set {θ1 , . . . , θN } ⊆ Θ is said to be an  c -covering set of Θ with
respect to ρ0 : Θ × Θ → R if, for any θ ∈ Θ, there exists some v ∈ {1, . . . , N} such
that ρ0 (θ, θv ) ≤  c . The  c -covering number Nρ∗0 (Θ,  c ) is defined to be the minimum
cardinality of any  c -covering.
514 Jonathan Scarlett and Volkan Cevher

Figure 16.3 Examples of -packing (left) and -covering (right) sets in the case in which ρ0 is the
Euclidean distance in R2 . Since ρ0 is a metric, a set of points is an -packing if and only if their
corresponding /2-balls do not intersect.

Observe that assumption (16.62) of Theorem 16.9 precisely states that {θ1 , . . . , θ M } is
an -packing set, though the result is often applied with M far smaller than the -packing
number. The logarithm of the covering number is often referred to as the metric entropy.
The notions of packing and covering are illustrated in Fig. 16.3. We do not explore
the properties of packing and covering numbers in detail in this chapter; the interested
reader is referred to [48, 49] for a more detailed treatment. We briefly state the following
useful property, showing that the two definitions are closely related in the case in which
ρ0 is a metric.
lemma 16.7 (Packing versus covering numbers) If ρ0 is a metric, then Mρ∗0 (Θ, 2) ≤
Nρ∗0 (Θ, ) ≤ Mρ∗0 (Θ, ).
We now show how to use Theorem 16.9 to construct a lower bound on the minimax
risk in terms of certain packing and covering numbers. For the packing number, we
will directly consider the metric ρ used in Theorem 16.9. On the other hand, for the
covering number, we consider the density Pnθv (y) associated with each θ ∈ Θ, and use
the associated KL divergence measure:

NKL,n (Θ, ) = Nρ∗n (Θ, ), ρnKL (θ, θ ) = D(Pnθ Pnθ ). (16.75)
KL

corollary 16.2 (Global approach to minimax estimation) Under the minimax


estimation setup of Section 16.5.1, we have for any  p > 0 and c,n > 0 that
∗ (Θ,  ) +  + log 2
p log NKL,n c,n c,n
Mn (Θ, ) ≥ Φ 1− . (16.76)
2 log Mρ∗ (Θ,  p )
In particular, if Pnθ (y) is the n-fold product of some single-measurement distribution
Pθ (y) for each θ ∈ Θ, then we have for any  p > 0 and  c > 0 that
∗ (Θ,  ) + n + log 2
p log NKL c c
Mn (Θ, ) ≥ Φ 1− , (16.77)
2 log Mρ∗ (Θ,  p )
∗ (Θ, ) = N ∗ (Θ, ) with ρ (θ, θ ) = D(P P ).
where NKL ρKL KL θ θ
An Introductory Guide to Fano’s Inequality with Applications 515

Proof Since Theorem 16.9 holds for any packing set, it holds for the maximal packing
set. Moreover, using Lemma 16.5, we have I(V; Y) ≤ log NKL,n∗ (Θ, c,n ) + c,n in (16.63),
since covering the entire space Θ is certainly enough to cover the elements in the
packing set. By combining these, we obtain the first part of the corollary. The second
part follows directly from the first part on choosing c,n = n c and noting that the KL
divergence is additive for product distributions.

Corollary 16.2 has been used as the starting point to derive minimax lower bounds for
a wide range of problems [13]; see Section 16.6 for an example. It has been observed
that the global approach is mainly useful for infinite-dimensional problems such as
density estimation and non-parametric regression, with the local approach typically
being superior for finite-dimensional problems such as vector or matrix estimation.

16.5.4 Beyond Estimation – Fano’s Inequality for Optimization


While the minimax estimation framework captures a diverse range of problems of
interest, there are also interesting problems that it does not capture. A notable example,
which we consider in this section, is stochastic optimization. We provide a brief
treatment, and refer the reader to [20] for further details and results.
We consider the following setup.
• We seek to minimize an unknown function f : X → R on some input domain X, i.e.,
to find a point x ∈ X such that f (x) is as low as possible.
• The algorithm proceeds in iterations. At the ith iteration, a point xi ∈ X is queried,
and an oracle returns a sample yi depending on the function, e.g., a noisy function
value, a noisy gradient, or a tuple containing both. The selected point xi can depend
on the past queries and samples.
• After iteratively sampling n points, the optimization algorithm returns a final point
 x) = f (
x, and the loss incurred is  f ( x) − min x∈X f (x), i.e., the gap to the optimal
function value.
• For a given class of functions F , the minimax risk is given by


Mn (F ) = inf sup E f [ f (X)], (16.78)
 f ∈F
X

where the infimum is over all optimization algorithms that iteratively query the
function n times and return a final point 
x as above, and E f denotes the expectation
when the underlying function is f .
In the following, we let X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) denote the queried
locations and samples across the n rounds.
theorem 16.11 (Minimax bound for noisy optimization) Fix  > 0, and let
{ f1 , . . . , f M } ⊆ F be a finite subset of F such that for each x ∈ X, we have  fv (x) ≤  for
at most one value of v ∈ {1, . . . , M}. Then we have

I(V; X, Y) + log 2
Mn (F ) ≥  · 1 − , (16.79)
log M
516 Jonathan Scarlett and Volkan Cevher

where V is uniform on {1, . . . , M}, and the mutual information is with respect to
V → fV → (X, Y). Moreover, in the special case M = 2, we have
 
Mn (F ) ≥  · H2−1 log 2 − I(V; X, Y) , (16.80)

where H2−1 (·) ∈ [0, 0.5] is the inverse binary entropy function.
Proof By Markov’s inequality, we have
 ≥ sup  · P f [ f (X)
sup E f [ f (X)]  ≥ ]. (16.81)
f ∈F f ∈F

Suppose that a random index V is drawn uniformly from {1, . . . , M}, and the triplet
 is generated by running the optimization algorithm on fV . Given X
(X, Y, X) = x,
let 
V index the function among { f1 , . . . , f M } with the lowest corresponding value:

V = arg minv=1,...,M fv (
x).
By the assumption that any x satisfies  fv (x) ≤  for at most one of the M functions,
x) ≤  implies 
we find that the condition  fv ( V = v. Hence, we have
  
Pv  fv (X) >  ≥ P fv [
V  v]. (16.82)

The remainder of the proof follows (16.68)–(16.71) in the proof of Theorem 16.9. We
  
lower-bound the minimax risk sup f ∈F P f  f (X) ≥  by the average over V, and apply
Fano’s inequality (Theorem 16.1 and Remark 16.1) and the data-processing inequality
(from the third part of Lemma 16.3).
remark 16.7 Theorem 16.10 is based on reducing the optimization problem to a
multiple-hypothesis-testing problem with exact recovery. One can derive an analogous
result reducing to approximate recovery, but we are unaware of any works making use
of such a result for optimization.

16.6 Applications − Continuous Settings

In this section, we present three applications of the tools introduced in Section 16.5:
sparse linear regression, density estimation, and convex optimization. Similarly to the
discrete case, our examples are chosen to permit a relatively simple analysis, while still
effectively exemplifying the key concepts and tools.

16.6.1 Sparse Linear Regression


In this example, we extend the 1-sparse linear regression example of Section 16.1.1 to
the more general scenario of k-sparsity. The setup is described as follows.
• We wish to estimate a high-dimensional vector θ ∈ R p that is k-sparse: θ0 ≤ k,
where θ0 is the number of non-zero entries in θ.
• The vector of n measurements is given by Y = Xθ + Z, where X ∈ Rn×p is a known
deterministic matrix, and Z ∼ N(0, σ2 In ) is additive Gaussian noise.
An Introductory Guide to Fano’s Inequality with Applications 517

• Given knowledge of X and Y, an estimate  θ is formed, and the loss is given by the
squared 2 -error, (θ,
θ) = θ −
θ22 , corresponding to (16.60) with ρ(θ,
θ) = θ −
θ2 and
Φ(·) = (·)2 . Overloading the general notation Mn (Θ, ), we write the minimax risk as
Mn (k, X) = inf sup Eθ [θ − 
θ22 ], (16.83)

θ θ∈R p : θ0 ≤k

where Eθ denotes the expectation when the underlying vector is θ.

Minimax Bound
The lower bound on the minimax risk is formally stated as follows. To simplify the
analysis slightly, we state the result in an asymptotic form for the sparse regime
k = o(p); with only minor changes, one can attain a non-asymptotic variant attaining the
same scaling laws for more general choices of k [35].
theorem 16.12 (Sparse linear regression) Under the preceding sparse linear regression
problem with k = o(p) and a fixed regression matrix X, we have
σ2 kp log(p/k)
Mn (k, X) ≥ (1 + o(1)) (16.84)
32X2F
as p → ∞. In particular, under the constraint X2F ≤ npΓ for some Γ > 0, achieving
Mn (k, X) ≤ δ requires n ≥ (σ2 k log(p/k)/(32δΓ)(1 + o(1)).
Proof We present a simple proof based on a reduction to approximate recovery (The-
orem 16.10). In Section 16.6.1, we discuss an alternative proof based on a reduction to
exact recovery (Theorem 16.9).
We define the set
 
V = v ∈ {−1, 0, 1} p : v0 = k , (16.85)
and with each v ∈ V we associate a vector θv =  v for some  > 0. Letting d(v, v )
denote the Hamming distance, we have the following properties.

• For v, v ∈ V, if d(v, v ) > t, thenθv − θv 2 >  t.  
• The cardinality of V is |V| = 2k kp , yielding log |V| ≥ log kp ≥ k log(p/k).
• The quantity Nmax (t) in Theorem 16.10 is the maximum possible number of
v ∈ V such that d(v, v ) ≤ t for a fixed v. Setting t = k/2, a simple counting
  
argument gives Nmax (t) ≤ k/2 2 j pj ≤ k/2 + 1 · 2k/2 · k/2
p
, which simplifies to
 
j=0
log Nmax (t) ≤ (k/2) log(p/k) (1 + o(1)) due to the assumption k = o(p).

From these observations, applying Theorem 16.10 with t = k/2 and  =  k/2 yields
k · ( )2 I(V; Y) + log 2
Mn (k, X) ≥ 1−   . (16.86)
8 (k/2) log(p/k) (1 + o(1))
Note that we do not condition on X in the mutual information, since we have assumed
that X is deterministic.
To bound the mutual information, we first apply tensorization (from the first part

of Lemma 16.2) to obtain I(V; Y) ≤ ni=1 I(V; Yi ), and then bound each I(V; Yi ) using
equation (16.24) in Lemma 16.4. We let QY be the N(0, σ2 ) density function, and we let
518 Jonathan Scarlett and Volkan Cevher

Pv,i denote the density function of N(XiT θv , σ2 ), where XiT is the transpose of the ith row
of X. Since the KL divergence between the N(μ0 , σ2 ) and N(μ1 , σ2 ) density functions
is (μ1 − μ0 )2 /(2σ2 ), we have D(Pv,i QY ) = |XiT θv |2 /(2σ2 ). As a result, Lemma 16.4
  
yields I(V; Yi ) ≤ (1/|V|) v D(Pv,i QY ) = (1/(2σ2 ))E |XiT θV |2 for uniform V. Summing
over i and recalling that θv =  v, we deduce that

( )2
I(V; Y) ≤ E[XV22 ]. (16.87)
2σ2
From the choice of V in (16.85), we can easily compute Cov[V] = (k/p)I p , which
implies that E[XV22 ] = (k/p)X2F . Substitution into (16.87) yields I(V; Y) ≤ ( )2 /
(2σ2 ) · (k/p)X2F , and we conclude from (16.86) that

k · ( )2 ( )2 /2σ2 · (k/p)X2F + log 2


Mn (k, X) ≥ 1−   . (16.88)
8 (k/2) log(p/k) (1 + o(1))

The proof is concluded by setting ( )2 = σ2 p log(p/k)/(2X2F ), which is chosen to


make the bracketed term tend to 12 .

Up to constant factors, the lower bound in Theorem 16.12 cannot be improved without additional knowledge of X beyond its Frobenius norm [5]. For instance, in the case in which X has i.i.d. Gaussian entries, a matching upper bound holds with high probability under maximum-likelihood decoding.

Alternative Proof: Reduction with Exact Recovery
In contrast to the proof given above (which has been adapted from [35]), the first known proof of Theorem 16.12 was based on packing with exact recovery (Theorem 16.9) [5]. For the sake of comparison, we briefly outline this alternative approach, which turns out to be more complicated.

The main step is to prove the existence of a set $\{\theta_1, \ldots, \theta_M\}$ satisfying the following properties (a numerical illustration of the last two properties is sketched below).

• The number of elements satisfies $\log M = \Omega(k \log(p/k))$.
• Each element is $k$-sparse with non-zero entries equal to $\pm 1$.
• The elements are well separated in the sense that $\|\theta_v - \theta_{v'}\|_2^2 = \Omega(k)$ for $v \ne v'$.
• The empirical covariance matrix is close to a scaled identity matrix in the following sense: $\big\|\frac{1}{M}\sum_{v=1}^{M} \theta_v \theta_v^T - \frac{k}{p} I_p\big\|_{2\to 2} = o(k/p)$, where $\|\cdot\|_{2\to 2}$ denotes the $\ell_2/\ell_2$-operator norm, i.e., the largest singular value.

Once this has been established, the proof proceeds along the same lines as the proof we gave above, scaling the vectors down by some $\epsilon > 0$ and using Theorem 16.9 in place of Theorem 16.10.

The existence of the packing set is proved via a probabilistic argument. If one generates $e^{\Omega(k \log(p/k))}$ uniformly random $k$-sparse sequences with non-zero entries equaling $\pm 1$, then these will satisfy the remaining two properties with positive probability.
While it is straightforward to establish the condition of being well separated, the proof
of the condition on the empirical covariance matrix requires a careful application of the
non-elementary matrix Bernstein inequality.
Overall, while the two approaches yield the same result up to constant factors in this
example, the approach based on approximate recovery is entirely elementary and avoids
the preceding difficulties.
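The probabilistic argument can also be illustrated numerically: random $k$-sparse $\pm 1$ vectors are typically well separated, and their empirical covariance concentrates around $(k/p)I_p$ in operator norm. The sketch below uses arbitrary toy parameters and examines a single random draw, so it proves nothing, but it gives a feel for both properties.

```python
# Illustrative numerical check (not a proof) of the two random-packing properties:
# random k-sparse +/-1 vectors are typically well separated, and their empirical
# covariance is close to (k/p) I_p in operator norm. Toy parameter values.
import numpy as np

rng = np.random.default_rng(1)
M, p, k = 50_000, 200, 10

supp = np.argsort(rng.random((M, p)), axis=1)[:, :k]   # random size-k supports
signs = rng.choice([-1.0, 1.0], size=(M, k))
thetas = np.zeros((M, p))
np.put_along_axis(thetas, supp, signs, axis=1)

# Separation on a subset: ||a - b||^2 = 2k - 2<a, b> since every row has norm sqrt(k).
sub = thetas[:1000]
sq_dists = 2 * k - 2 * (sub @ sub.T)
np.fill_diagonal(sq_dists, np.inf)
print("min squared distance:", sq_dists.min(), "(order k =", k, ")")

# Operator-norm deviation of the empirical covariance from (k/p) I_p.
dev = np.linalg.norm(thetas.T @ thetas / M - (k / p) * np.eye(p), ord=2)
print("operator-norm deviation:", dev, "vs. k/p =", k / p)  # shrinks relative to k/p as M grows
```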

16.6.2 Density Estimation


In this section, we consider the problem of estimating an entire probability density
function given samples from its distribution, which is commonly known as density esti-
mation. We consider a non-parametric view, meaning that the density does not take any
specific parametric form. As a result, the problem is inherently infinite-dimensional, and
lends itself to the global packing and covering approach introduced in Section 16.5.3.
While many classes of density functions have been considered in the literature [13],
we focus our attention on a specific setting for clarity of exposition.
• The density function $f$ that we seek to estimate is defined on the domain $[0, 1]$, i.e., $f(y) \ge 0$ for all $y \in [0, 1]$, and $\int_0^1 f(y)\,dy = 1$.
• We assume that $f$ satisfies the following conditions:
$$f(y) \ge \eta, \quad \forall y \in [0, 1], \qquad \|f\|_{\mathrm{TV}} \le \Gamma \tag{16.89}$$
for some $\eta \in (0, 1)$ and $\Gamma > 0$, where the total variation (TV) norm is defined as $\|f\|_{\mathrm{TV}} = \sup_{L} \sup_{0 \le x_1 \le \cdots \le x_L \le 1} \sum_{l=2}^{L} |f(x_l) - f(x_{l-1})|$ (a grid-based numerical approximation of this norm is sketched after this list). The set of all density functions satisfying these constraints is denoted by $\mathcal{F}_{\eta,\Gamma}$.
• Given $n$ independent samples $\mathbf{Y} = (Y_1, \ldots, Y_n)$ from $f$, an estimate $\hat{f}$ is formed, and the loss is given by $\ell(f, \hat{f}) = \|f - \hat{f}\|_2^2 = \int_0^1 (f(x) - \hat{f}(x))^2\,dx$. Hence, the minimax risk is given by
$$M_n(\eta, \Gamma) = \inf_{\hat{f}} \sup_{f \in \mathcal{F}_{\eta,\Gamma}} \mathbb{E}_f\big[\|f - \hat{f}\|_2^2\big], \tag{16.90}$$
where $\mathbb{E}_f$ denotes the expectation when the underlying density is $f$.
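For a smooth density, the TV norm above can be approximated by summing increments over a fine grid. The sketch below (a hypothetical example density and arbitrary values of $\eta$ and $\Gamma$) uses this approximation to check membership in $\mathcal{F}_{\eta,\Gamma}$.

```python
# A grid-based numerical sketch (not from the text) of membership in F_{eta, Gamma};
# the example density and the values of eta, Gamma are hypothetical.
import numpy as np

eta, Gamma = 0.2, 5.0
x = np.linspace(0.0, 1.0, 10_001)

f = 1.0 + 0.5 * np.cos(4 * np.pi * x)        # integrates to 1 and is bounded below by 0.5

integral = np.mean(f)                        # Riemann approximation of the integral over [0, 1]
tv = np.sum(np.abs(np.diff(f)))              # grid approximation of ||f||_TV (here, 4)
print(round(integral, 3), f.min() >= eta, tv <= Gamma)   # 1.0 True True
```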

Minimax Bound
The minimax lower bound is given as follows.
theorem 16.13 (Density estimation) Consider the preceding density estimation setup with some $\eta \in (0, 1)$ and $\Gamma > 0$ not depending on $n$. There exists a constant $c > 0$ (depending on $\eta$ and $\Gamma$) such that, in order to achieve $M_n(\eta, \Gamma) \le \delta$, it is necessary that
$$n \ge c \cdot \left(\frac{1}{\delta}\right)^{3/2} \tag{16.91}$$
when $\delta$ is sufficiently small. In other words, $M_n(\eta, \Gamma) = \Omega\big(n^{-2/3}\big)$.
Proof We specialize the general analysis of [13] to the class $\mathcal{F}_{\eta,\Gamma}$. Recalling the packing and covering numbers from Definitions 16.1 and 16.2, we adopt the shorthand notation $M_2^*(\epsilon_p) = M_\rho^*(\mathcal{F}_{\eta,\Gamma}, \epsilon_p)$ with $\rho(f, \hat{f}) = \|f - \hat{f}\|_2$, and similarly $N_2^*(\epsilon_c) = N_\rho^*(\mathcal{F}_{\eta,\Gamma}, \epsilon_c)$. We first show that $N_{\mathrm{KL}}^*$ (Corollary 16.2) can be upper-bounded in terms of $M_2^*$, which will lead to a minimax lower bound that depends only on the packing number $M_2^*$. For $f_1, f_2 \in \mathcal{F}_{\eta,\Gamma}$, we have
$$D(f_1 \| f_2) \le \int_0^1 \frac{(f_1(x) - f_2(x))^2}{f_2(x)}\,dx \tag{16.92}$$
$$\le \frac{1}{\eta}\int_0^1 (f_1(x) - f_2(x))^2\,dx \tag{16.93}$$
$$= \frac{1}{\eta}\|f_1 - f_2\|_2^2, \tag{16.94}$$
where (16.92) follows since the KL divergence is upper-bounded by the $\chi^2$-divergence (Lemma 16.6), and (16.93) follows from the assumption that the density is lower-bounded by $\eta$. From the definition of $N_{\mathrm{KL}}^*$ in Corollary 16.2, we deduce the following for any $\epsilon_c > 0$:
$$N_{\mathrm{KL}}^*(\epsilon_c) \le N_2^*\big(\sqrt{\eta\epsilon_c}\big) \le M_2^*\big(\sqrt{\eta\epsilon_c}\big), \tag{16.95}$$
where the first inequality holds because any $\sqrt{\eta\epsilon_c}$-covering in the $\ell_2$-norm is also an $\epsilon_c$-covering in the KL divergence due to (16.94), and the second inequality follows from Lemma 16.7.
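The chain (16.92)–(16.94) can be verified numerically for any pair of densities bounded away from zero; the sketch below does so for two hypothetical smooth densities on $[0, 1]$ using Riemann sums.

```python
# Numerical check (illustrative only) of the chain (16.92)-(16.94) for two simple
# hypothetical densities on [0, 1] bounded below by eta.
import numpy as np

eta = 0.2
x = np.linspace(0.0, 1.0, 200_001)
dx = x[1] - x[0]

f1 = 1.0 + 0.5 * np.cos(2 * np.pi * x)      # both densities integrate to 1
f2 = 1.0 - 0.3 * np.sin(2 * np.pi * x)      # and are bounded below by eta = 0.2

kl = np.sum(f1 * np.log(f1 / f2)) * dx      # D(f1 || f2)
chi2 = np.sum((f1 - f2) ** 2 / f2) * dx     # chi^2-divergence, the bound (16.92)
l2_sq = np.sum((f1 - f2) ** 2) * dx         # ||f1 - f2||_2^2

print(kl <= chi2 <= l2_sq / eta)            # the three bounds hold in order: True
```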
Combining (16.95) with Corollary 16.2 and the choice $\Phi(\cdot) = (\cdot)^2$ gives
$$M_n(\eta, \Gamma) \ge \left(\frac{\epsilon_p}{2}\right)^2\left(1 - \frac{\log M_2^*(\sqrt{\eta\epsilon_c}) + n\epsilon_c + \log 2}{\log M_2^*(\epsilon_p)}\right). \tag{16.96}$$
We now apply the following bounds on the packing number of $\mathcal{F}_{\eta,\Gamma}$, which we state from [13] without proof:
$$c \cdot \epsilon^{-1} \le \log M_2^*(\epsilon) \le c' \cdot \epsilon^{-1}, \tag{16.97}$$
for some constants $c, c' > 0$ and sufficiently small $\epsilon > 0$. It follows that
$$M_n(\eta, \Gamma) \ge \left(\frac{\epsilon_p}{2}\right)^2\left(1 - \frac{c' \cdot (\eta\epsilon_c)^{-1/2} + n\epsilon_c + \log 2}{c \cdot \epsilon_p^{-1}}\right). \tag{16.98}$$
The remainder of the proof amounts to choosing $\epsilon_p$ and $\epsilon_c$ to balance the terms appearing in this expression.

First, choosing $\epsilon_c$ to equate the terms $c' \cdot (\eta\epsilon_c)^{-1/2}$ and $n\epsilon_c$ leads to $\epsilon_c = (c''/n)^{2/3}$ with $c'' = c'\eta^{-1/2}$, yielding $\big(c' \cdot (\eta\epsilon_c)^{-1/2} + n\epsilon_c + \log 2\big)/(c \cdot \epsilon_p^{-1}) = \big(2n(c''/n)^{2/3} + \log 2\big)/(c \cdot \epsilon_p^{-1})$. Next, choosing $\epsilon_p$ to make this fraction equal to $\frac{1}{2}$ yields $\epsilon_p^{-1} = (2/c)\big(2(c'')^{2/3} n^{1/3} + \log 2\big)$, which means that $\epsilon_p \ge c''' \cdot n^{-1/3}$ for suitable $c''' > 0$ and sufficiently large $n$. Finally, since we made the fraction equal to $\frac{1}{2}$, (16.98) yields $M_n(\eta, \Gamma) \ge \epsilon_p^2/8 \ge (c''')^2 n^{-2/3}/8$. Setting $M_n(\eta, \Gamma) = \delta$ and solving for $n$ yields the desired result.
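The balancing of $\epsilon_c$ and $\epsilon_p$ at the end of the proof can be traced numerically; the sketch below (with arbitrary placeholder values for the constants in (16.97)) confirms the equated terms and the resulting $n^{-2/3}$ scaling of the final bound.

```python
# A numerical sketch (constants chosen arbitrarily) of the balancing step:
# eps_c = (c''/n)^(2/3) and eps_p^{-1} = (2/c)(2 (c'')^(2/3) n^(1/3) + log 2)
# give a lower bound (eps_p/2)^2 * (1/2) = eps_p^2 / 8 that scales as n^(-2/3).
import numpy as np

c, c_prime, eta = 1.0, 1.0, 0.2            # hypothetical constants in (16.97)
c_dprime = c_prime * eta ** (-0.5)

for n in [10**2, 10**4, 10**6, 10**8]:
    eps_c = (c_dprime / n) ** (2.0 / 3.0)
    # the two balanced terms agree:
    assert np.isclose(c_prime * (eta * eps_c) ** (-0.5), n * eps_c)
    eps_p = 1.0 / ((2.0 / c) * (2.0 * c_dprime ** (2.0 / 3.0) * n ** (1.0 / 3.0) + np.log(2)))
    bound = eps_p ** 2 / 8
    print(n, bound, bound * n ** (2.0 / 3.0))   # last column tends to a constant
```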

The scaling given in Theorem 16.13 cannot be improved; a matching upper bound is
given in [13], and can be achieved even when η = 0.
16.6.3 Convex Optimization

In our final example, we consider the optimization setting introduced in Section 16.5.4. We provide an example that is rather simple, yet has interesting features not present in the previous examples: (i) an example departing from estimation; (ii) a continuous example with adaptivity; and (iii) a case where Fano's inequality with $|\mathcal{V}| = 2$ is used. We consider the following special case of the general setup of Section 16.5.4.

• We let $\mathcal{F}$ be the set of differentiable and strongly convex functions on $\mathcal{X} = [0, 1]$, with strong convexity parameter equal to one:
$$\mathcal{F}_{\mathrm{scv}} = \big\{ f : f \text{ is differentiable} \big\} \cap \big\{ f : f(x) - \tfrac{1}{2}x^2 \text{ is convex} \big\}. \tag{16.99}$$
The analysis that we present can easily be extended to functions on an arbitrary closed interval with an arbitrary strong convexity parameter.
• When we query a point $x \in \mathcal{X}$, we observe a noisy sample of the function value and its gradient:
$$Y = \big(f(x) + Z,\; f'(x) + Z'\big), \tag{16.100}$$
where $Z$ and $Z'$ are independent $N(0, \sigma^2)$ random variables, for some $\sigma^2 > 0$. This is commonly referred to as the noisy first-order oracle.

Minimax Bound
The following theorem lower-bounds the number of queries required to achieve $\delta$-optimality. The proof is taken from [20] with only minor modifications.
theorem 16.14 (Stochastic optimization of strongly convex functions) Under the preceding convex optimization setting with noisy first-order oracle information, in order to achieve $M_n(\mathcal{F}_{\mathrm{scv}}) \le \delta$, it is necessary that
$$n \ge \frac{\sigma^2 \log 2}{40\delta} \tag{16.101}$$
when $\delta$ is sufficiently small.
Proof We construct a set of two functions satisfying the assumptions of Theorem 16.11. Specifically, we fix $(\epsilon, \epsilon')$ such that $0 < \epsilon < \epsilon' < \frac{1}{8}$, define $x_1^* = \frac{1}{2} - \sqrt{2\epsilon'}$ and $x_2^* = \frac{1}{2} + \sqrt{2\epsilon'}$, and set
$$f_v(x) = \frac{1}{2}(x - x_v^*)^2, \quad v = 1, 2. \tag{16.102}$$
These functions are illustrated in Fig. 16.4.

Figure 16.4 Construction of two functions in $\mathcal{F}_{\mathrm{scv}}$ that are difficult to distinguish, and such that any point $x \in [0, 1]$ can be $\epsilon$-optimal for only one of the two functions. [Plot omitted: the two quadratics are shown against their function value over the input $x \in [0, 1]$.]

Since $\epsilon' \in \big(0, \frac{1}{8}\big)$, both $x_1^*$ and $x_2^*$ lie in $(0, 1)$, and hence $\min_{x \in [0,1]} f_1(x) = \min_{x \in [0,1]} f_2(x) = 0$. Moreover, a direct evaluation reveals that $f_1(x) + f_2(x) = \big(x - \frac{1}{2}\big)^2 + 2\epsilon' > 2\epsilon$, which implies that any $\epsilon$-optimal point for one function cannot be $\epsilon$-optimal for the other function. This is the condition needed in order for us to apply Theorem 16.11, yielding from (16.80) that
$$M_n(\mathcal{F}_{\mathrm{scv}}) \ge \epsilon \cdot H_2^{-1}\big(\log 2 - I(V; \mathbf{X}, \mathbf{Y})\big). \tag{16.103}$$
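Before bounding the mutual information, we note that the properties of this two-function construction are easy to check numerically; the sketch below (arbitrary choices of $\epsilon$ and $\epsilon'$) verifies that both functions have minimum value zero on $[0, 1]$, that $f_1 + f_2$ exceeds $2\epsilon$ everywhere, and the identity used above.

```python
# A quick check (illustrative, not part of the proof) of the two-function construction,
# with arbitrary 0 < eps < eps' < 1/8.
import numpy as np

eps, eps_prime = 0.01, 0.02
x1 = 0.5 - np.sqrt(2 * eps_prime)
x2 = 0.5 + np.sqrt(2 * eps_prime)

x = np.linspace(0.0, 1.0, 10_001)
f1 = 0.5 * (x - x1) ** 2
f2 = 0.5 * (x - x2) ** 2

print(f1.min(), f2.min())                                 # both (essentially) 0 on [0, 1]
print(np.all(f1 + f2 > 2 * eps))                          # True: no common eps-optimal point
print(np.allclose(f1 + f2, (x - 0.5) ** 2 + 2 * eps_prime))  # identity used in the proof
```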
To bound the mutual information, we first apply tensorization (from the first part of Lemma 16.3) to obtain $I(V; \mathbf{X}, \mathbf{Y}) \le \sum_{i=1}^{n} I(V; Y_i | X_i)$. We proceed by bounding $I(V; Y_i | X_i)$ for any given $i$. Fix $x \in [0, 1]$, let $P_Y^x$ and $P_{Y'}^x$ be the density functions of the noisy samples of $f_1(x)$ and $f_1'(x)$, and let $Q_Y^x$ and $Q_{Y'}^x$ be defined similarly for $f_0(x) = \frac{1}{2}\big(x - \frac{1}{2}\big)^2$. We have
$$D(P_Y^x \times P_{Y'}^x \| Q_Y^x \times Q_{Y'}^x) = D(P_Y^x \| Q_Y^x) + D(P_{Y'}^x \| Q_{Y'}^x) \tag{16.104}$$
$$= \frac{(f_1(x) - f_0(x))^2}{2\sigma^2} + \frac{(f_1'(x) - f_0'(x))^2}{2\sigma^2}, \tag{16.105}$$
where (16.104) holds since the KL divergence is additive for product distributions, and (16.105) uses the fact that the divergence between the $N(\mu_0, \sigma^2)$ and $N(\mu_1, \sigma^2)$ density functions is $(\mu_1 - \mu_0)^2/(2\sigma^2)$.

Recalling that $f_1(x) = \frac{1}{2}\big(x - \frac{1}{2} + \sqrt{2\epsilon'}\big)^2$ and $f_0(x) = \frac{1}{2}\big(x - \frac{1}{2}\big)^2$, we have
$$(f_1(x) - f_0(x))^2 = \left(\epsilon' + \sqrt{2\epsilon'}\Big(x - \frac{1}{2}\Big)\right)^2 \le \left(\epsilon' + \sqrt{\frac{\epsilon'}{2}}\right)^2 \le 2\epsilon', \tag{16.106}$$
where the first inequality uses the fact that $x \in [0, 1]$, and the second inequality follows since $\epsilon' < \frac{1}{8}$ and hence $\epsilon' = \sqrt{\epsilon'} \cdot \sqrt{\epsilon'} \le \sqrt{\epsilon'/8}$ (note that $\big(1/\sqrt{8} + 1/\sqrt{2}\big)^2 \le 2$). Moreover, taking the derivatives of $f_0$ and $f_1$ gives $(f_1'(x) - f_0'(x))^2 = 2\epsilon'$, and substitution into (16.105) yields $D(P_Y^x \times P_{Y'}^x \| Q_Y^x \times Q_{Y'}^x) \le 2\epsilon'/\sigma^2$.

The preceding analysis applies in a near-identical manner when $f_2$ is used in place of $f_1$, and yields the same KL divergence bound when $(P_Y^x, P_{Y'}^x)$ is defined
with respect to $f_2$. As a result, for any $x \in [0, 1]$, we obtain from (16.25) in Lemma 16.4 that $I(V; Y_i | X_i = x) \le 2\epsilon'/\sigma^2$. Averaging over $X_i$, we obtain $I(V; Y_i | X_i) \le 2\epsilon'/\sigma^2$, and substitution into the above-established bound $I(V; \mathbf{X}, \mathbf{Y}) \le \sum_{i=1}^{n} I(V; Y_i | X_i)$ yields $I(V; \mathbf{X}, \mathbf{Y}) \le 2n\epsilon'/\sigma^2$. Hence, (16.103) yields
$$M_n(\mathcal{F}_{\mathrm{scv}}) \ge \epsilon \cdot H_2^{-1}\left(\log 2 - \frac{2n\epsilon'}{\sigma^2}\right). \tag{16.107}$$
Now observe that if $n \le \sigma^2(\log 2)/(4\epsilon')$ then the argument to $H_2^{-1}(\cdot)$ is at least $(\log 2)/2$. It is easy to verify that $H_2^{-1}\big((\log 2)/2\big) > \frac{1}{10}$, from which it follows that $M_n(\mathcal{F}_{\mathrm{scv}}) > \epsilon/10$. Setting $\epsilon = 10\delta$ and noting that $\epsilon'$ can be chosen arbitrarily close to $\epsilon$, we conclude that the required number of samples $n > \sigma^2(\log 2)/(4\epsilon')$ recovers (16.101).
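The numerical claim about $H_2^{-1}$ is easily verified; the sketch below works in nats and uses the fact that $H_2$ is increasing on $(0, 1/2]$, so it suffices to check $H_2(1/10) < (\log 2)/2$.

```python
# Numerical verification (in nats) that H2^{-1}((log 2)/2) > 1/10, where
# H2(p) = -p*log(p) - (1-p)*log(1-p) and H2^{-1} is its inverse on [0, 1/2].
import numpy as np

def H2(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# H2 is increasing on (0, 1/2], so H2(1/10) < (log 2)/2 implies H2^{-1}((log 2)/2) > 1/10.
print(H2(0.1), np.log(2) / 2, H2(0.1) < np.log(2) / 2)   # ... ... True
```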

Theorem 16.14 provides tight scaling laws, since stochastic gradient descent is known to achieve $\delta$-optimality for strongly convex functions using $O(\sigma^2/\delta)$ queries. Analogous results for the multidimensional setting can be found in [20].

16.7 Discussion

16.7.1 Limitations of Fano’s Inequality


While Fano’s inequality is a highly versatile method with successes in a wide range
of statistical applications (see Table 16.1), it is worth pointing out some of its main
limitations. We briefly mention some alternative methods below, as well as discussing
some suitable generalizations of Fano’s inequality in Section 16.7.2.
Non-asymptotic weakness. Even in scenarios where Fano’s inequality provides
converse bounds with the correct asymptotics including constants, these bounds can be
inferior to alternative methods in the non-asymptotic sense [50, 51]. Related to this issue
is the distinction between the weak converse and strong converse: we have seen that
Fano’s inequality typically provides necessary conditions of the form n ≥ n∗ (1 − δ − o(1))
for achieving Pe ≤ δ, in contrast with strong converse results of the form n ≥ n∗ (1 − o(1))
for any δ ∈ (0, 1). Alternative techniques addressing these limitations are discussed
in the context of communication in [50], and in the context of statistical estimation
in [52, 53].
Difficulties in adaptive settings. While we have provided examples where Fano’s
inequality provides tight bounds in adaptive settings, there are several applications
where alternative methods have proved to be more suitable. One reason for this is that
the conditional mutual information terms I(V; Yi |Xi ) (see Lemma 16.3) often involve
complicated conditional distributions that are difficult to analyze. We refer the reader
to [54–56] for examples in which alternative techniques proved to be more suitable for
adaptive settings.
Restriction to KL divergence. When applying Fano’s inequality, one invariably
needs to bound a mutual information term, which is an instance of the KL divergence.
While the KL divergence satisfies a number of convenient properties that can help in this
process, it is sometimes the case that other divergence measures are more convenient to
work with, or can be used to derive tighter results. Generalizations of Fano’s inequality
have been proposed specifically for this purpose, as we discuss in the following
section.

16.7.2 Generalizations of Fano’s Inequality


Several variations and generalizations of Fano's inequality have been proposed in the literature [57–62]. Most of these are not derived from the most well-known proof of Theorem 16.1, but are instead based on an alternative proof via the data-processing inequality for KL divergence: For any event $\mathcal{E}$, one has
$$I(V; \hat{V}) = D(P_{V\hat{V}} \| P_V \times P_{\hat{V}}) \ge D_2\big(P_{V\hat{V}}[\mathcal{E}] \,\big\|\, (P_V \times P_{\hat{V}})[\mathcal{E}]\big), \tag{16.108}$$
where $D_2(p \| q) = p \log(p/q) + (1 - p)\log\big((1 - p)/(1 - q)\big)$ is the binary KL divergence function. Observe that, if $V$ is uniform and $\mathcal{E}$ is the event that $V \ne \hat{V}$, then we have $P_{V\hat{V}}[\mathcal{E}] = P_e$ and $(P_V \times P_{\hat{V}})[\mathcal{E}] = 1 - 1/|\mathcal{V}|$, and Fano's inequality (Theorem 16.1) follows on substituting the definition of $D_2(\cdot \| \cdot)$ into (16.108) and rearranging. This proof lends itself to interesting generalizations, including the following.
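The inequality (16.108) and the rearrangement leading to Fano's inequality can be illustrated on a toy joint distribution; the sketch below (an arbitrary hypothetical channel with $V$ uniform on a set of size 4) evaluates both sides numerically.

```python
# A toy numerical illustration of the data-processing step (16.108): for a joint
# distribution of (V, V_hat) with V uniform on {0,...,M-1} (values are arbitrary),
# I(V; V_hat) dominates D2(P[V != V_hat] || 1 - 1/M), and rearranging gives Fano.
import numpy as np

def D2(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

M = 4
row = np.array([0.7, 0.15, 0.1, 0.05])          # P[V_hat = v + j (mod M) | V = v], j = 0,...,3
P_joint = np.stack([np.roll(row, v) for v in range(M)]) / M

P_v = P_joint.sum(axis=1)                        # uniform
P_vhat = P_joint.sum(axis=0)                     # also uniform for this channel
I = np.sum(P_joint * np.log(P_joint / np.outer(P_v, P_vhat)))

Pe = 1.0 - np.trace(P_joint)                     # P[V != V_hat]
print(I >= D2(Pe, 1 - 1 / M))                    # the inequality in (16.108): True
print(Pe >= 1 - (I + np.log(2)) / np.log(M))     # rearranged form (Fano's inequality): True
```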
Continuum version. Consider a continuous random variable $V$ taking values on $\mathcal{V} \subseteq \mathbb{R}^p$ for some $p \ge 1$, and an error probability of the form $P_e(t) = \mathbb{P}\big[d(V, \hat{V}) > t\big]$ for some real-valued function $d$ on $\mathbb{R}^p \times \mathbb{R}^p$. This is the same formula as (16.11), which we previously introduced for the discrete setting. Defining the "ball" $\mathcal{B}_d(\hat{v}, t) = \{v \in \mathbb{R}^p : d(v, \hat{v}) \le t\}$ centered at $\hat{v}$, (16.108) leads to the following for $V$ uniform on $\mathcal{V}$:
$$P_e(t) \ge 1 - \frac{I(V; \hat{V}) + \log 2}{\log\big(\mathrm{Vol}(\mathcal{V}) / \sup_{\hat{v} \in \mathbb{R}^p} \mathrm{Vol}(\mathcal{V} \cap \mathcal{B}_d(\hat{v}, t))\big)}, \tag{16.109}$$
where $\mathrm{Vol}(\cdot)$ denotes the volume of a set. This result provides a continuous counterpart to the final part of Theorem 16.2, in which the cardinality ratio is replaced by a volume ratio. We refer the reader to [35] for example applications, and to [62] for the simple proof outlined above.
Beyond KL divergence. The key step (16.108) extends immediately to other measures that satisfy the data-processing inequality. A useful class of such measures is the class of $f$-divergences: $D_f(P \| Q) = \mathbb{E}_Q\big[f\big(P(Y)/Q(Y)\big)\big]$ for some convex function $f$ satisfying $f(1) = 0$. Special cases include KL divergence ($f(z) = z\log z$), total variation ($f(z) = \frac{1}{2}|z - 1|$), squared Hellinger distance ($f(z) = (\sqrt{z} - 1)^2$), and $\chi^2$-divergence ($f(z) = (z - 1)^2$). It was shown in [60] that alternative choices beyond the KL divergence can provide improved bounds in some cases. Generalizations of Fano's inequality beyond $f$-divergences can be found in [61].
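For concreteness, the sketch below evaluates each of these special cases directly from its defining convex function, for two hypothetical discrete distributions.

```python
# The f-divergences listed above, evaluated via E_Q[f(P/Q)] for two hypothetical
# discrete distributions P and Q.
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.4, 0.4])

def f_divergence(P, Q, f):
    return np.sum(Q * f(P / Q))

divs = {
    "KL":            lambda z: z * np.log(z),
    "total var.":    lambda z: 0.5 * np.abs(z - 1),
    "sq. Hellinger": lambda z: (np.sqrt(z) - 1) ** 2,
    "chi-squared":   lambda z: (z - 1) ** 2,
}
for name, f in divs.items():
    print(name, f_divergence(P, Q, f))
```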
Non-uniform priors. The first form of Fano's inequality in Theorem 16.1 does not require $V$ to be uniform. However, in highly non-uniform cases where $H(V) \ll \log|\mathcal{V}|$, the term $P_e \log(|\mathcal{V}| - 1)$ may be too large for the bound to be useful. In such cases, it is often useful to use different Fano-like bounds that are based on the alternative proof above. In particular, the step (16.108) makes no use of uniformity, and continues to hold even in the non-uniform case. In [57], this bound was further weakened to provide simpler lower bounds for non-uniform settings with discrete alphabets. Fano-type lower bounds in continuous Bayesian settings with non-uniform priors arose more recently, and are typically more technically challenging; the interested reader is referred to [18, 63].

Acknowledgments

J. Scarlett was supported by an NUS startup grant. V. Cevher was supported by the
European Research Council (ERC) under the European Union’s Horizon 2020 research
and innovation programme (grant agreement 725594 – time-data).

References

[1] R. M. Fano, “Class notes for MIT course 6.574: Transmission of information,” 1952.
[2] M. B. Malyutov, “The separating property of random matrices,” Math. Notes Academy
Sci. USSR, vol. 23, no. 1, pp. 84–91, 1978.
[3] G. Atia and V. Saligrama, “Boolean compressed sensing and noisy group testing,” IEEE
Trans. Information Theory, vol. 58, no. 3, pp. 1880–1901, 2012.
[4] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-
dimensional and noisy setting,” IEEE Trans. Information Theory, vol. 55, no. 12, pp.
5728–5741, 2009.
[5] E. J. Candès and M. A. Davenport, “How well can we estimate a sparse vector?” Appl.
Comput. Harmonic Analysis, vol. 34, no. 2, pp. 317–323, 2013.
[6] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Nearly optimal sparse Fourier transform,”
in Proc. 44th Annual ACM Symposium on Theory of Computation, 2012, pp. 563–578.
[7] V. Cevher, M. Kapralov, J. Scarlett, and A. Zandieh, “An adaptive sublinear-time
block sparse Fourier transform,” in Proc. 49th Annual ACM Symposium on Theory of
Computation, 2017, pp. 702–715.
[8] A. A. Amini and M. J. Wainwright, “High-dimensional analysis of semidefinite relaxations
for sparse principal components,” Annals Statist., vol. 37, no. 5B, pp. 2877–2921, 2009.
[9] V. Q. Vu and J. Lei, “Minimax rates of estimation for sparse PCA in high dimensions,”
in Proc. 15th International Conference on Artificial Intelligence and Statistics, 2012, pp.
1278–1286.
[10] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix
completion: Optimal bounds with noise,” J. Machine Learning Res., vol. 13, no. 5, pp.
1665–1697, 2012.
[11] M. A. Davenport, Y. Plan, E. Van Den Berg, and M. Wootters, “1-bit matrix completion,”
Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.
[12] I. Ibragimov and R. Khasminskii, “Estimation of infinite-dimensional parameter in
Gaussian white noise,” Soviet Math. Doklady, vol. 236, no. 5, pp. 1053–1055, 1977.
[13] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[14] L. Birgé, “Approximation dans les espaces métriques et théorie de l’estimation,”
Probability Theory and Related Fields, vol. 65, no. 2, pp. 181–237, 1983.
[15] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive
models over kernel classes via convex programming,” J. Machine Learning Res., vol. 13,
no. 2, pp. 389–427, 2012.
[16] Y. Yang, M. Pilanci, and M. J. Wainwright, “Randomized sketches for kernels: Fast and
optimal nonparametric regression,” Annals Statist., vol. 45, no. 3, pp. 991–1023, 2017.
[17] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower
bounds for distributed statistical estimation with communication constraints,” in Advances
in Neural Information Processing Systems, 2013, pp. 2328–2336.
[18] A. Xu and M. Raginsky, “Information-theoretic lower bounds on Bayes risk in decentral-
ized estimation,” IEEE Trans. Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax
rates,” in Proc. 54th Annual IEEE Symposium on Foundations of Computer Science, 2013,
pp. 429–438.
[20] M. Raginsky and A. Rakhlin, “Information-based complexity, feedback and dynamics in
convex programming,” IEEE Trans. Information Theory, vol. 57, no. 10, pp. 7036–7056,
2011.
[21] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic
lower bounds on the oracle complexity of stochastic convex optimization,” IEEE Trans.
Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012.
[22] M. Raginsky and A. Rakhlin, “Lower bounds for passive and active learning,” in Advances
in Neural Information Processing Systems, 2011, pp. 1026–1034.
[23] A. Agarwal, S. Agarwal, S. Assadi, and S. Khanna, “Learning with limited rounds of
adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons,”
in Proc. Conference on Learning Theory, 2017, pp. 39–75.
[24] J. Scarlett, “Tight regret bounds for Bayesian optimization in one dimension,” in Proc.
International Conference on Machine Learning, 2018, pp. 4507–4515.
[25] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar, “Information theory methods
in communication complexity,” in Proc. 17th IEEE Annual Conference on Computational
Complexity, 2002, pp. 93–102.
[26] N. Santhanam and M. Wainwright, “Information-theoretic limits of selecting binary
graphical models in high dimensions,” IEEE Trans. Information Theory, vol. 58, no. 7, pp.
4117–4134, 2012.
[27] K. Shanmugam, R. Tandon, A. Dimakis, and P. Ravikumar, “On the information theoretic
limits of learning Ising models,” in Advances in Neural Information Processing Systems,
2014, pp. 2303–2311.
[28] N. B. Shah and M. J. Wainwright, “Simple, robust and optimal ranking from pairwise
comparisons,” J. Machine Learning Res., vol. 18, no. 199, pp. 1–38, 2018.
[29] A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade,
“Worst-case vs average-case design for estimation from fixed pairwise comparisons,”
http://arxiv.org/abs/1707.06217.
[30] Y. Yang, “Minimax nonparametric classification. I. Rates of convergence,” IEEE Trans.
Information Theory, vol. 45, no. 7, pp. 2271–2284, 1999.
[31] M. Nokleby, M. Rodrigues, and R. Calderbank, “Discrimination on the Grassmann
manifold: Fundamental limits of subspace classifiers,” IEEE Trans. Information Theory,
vol. 61, no. 4, pp. 2133–2147, 2015.
[32] A. Mazumdar and B. Saha, “Query complexity of clustering with side information,” in
Advances in Neural Information Processing Systems, 2017, pp. 4682–4693.
[33] E. Mossel, “Phase transitions in phylogeny,” Trans. Amer. Math. Soc., vol. 356, no. 6, pp.
2379–2404, 2004.
[34] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2006.
[35] J. C. Duchi and M. J. Wainwright, “Distance-based and continuum Fano inequalities with
applications to statistical estimation,” http://arxiv.org/abs/1311.2669.
[36] I. Sason and S. Verdú, “ f -divergence inequalities,” IEEE Trans. Information Theory,
vol. 62, no. 11, pp. 5973–6006, 2016.
[37] R. Dorfman, “The detection of defective members of large populations,” Annals Math.
Statist., vol. 14, no. 4, pp. 436–440, 1943.
[38] J. Scarlett and V. Cevher, “Phase transitions in group testing,” in Proc. ACM-SIAM
Symposium on Discrete Algorithms, 2016, pp. 40–53.
[39] J. Scarlett and V. Cevher, “How little does non-exact recovery help in group testing?” in
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2017,
pp. 6090–6094.
[40] L. Baldassini, O. Johnson, and M. Aldridge, “The capacity of adaptive group testing,” in
Proc. IEEE Int. Symp. Inform. Theory, 2013, pp. 2676–2680.
[41] J. Scarlett and V. Cevher, “Converse bounds for noisy group testing with arbitrary
measurement matrices,” in Proc. IEEE International Symposium on Information Theory,
2016, pp. 2868–2872.
[42] J. Scarlett and V. Cevher, “On the difficulty of selecting Ising models with approximate
recovery,” IEEE Trans. Signal Information Processing over Networks, vol. 2, no. 4, pp.
625–638, 2016.
[43] J. Scarlett and V. Cevher, “Lower bounds on active learning for graphical model selection,”
in Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[44] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, “Learning high-dimensional Markov
forest distributions: Analysis of error rates,” J. Machine Learning Res., vol. 12, no. 5, pp.
1617–1653, 2011.
[45] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky, “High-dimensional structure
estimation in Ising models: Local separation criterion,” Annals Statist., vol. 40, no. 3, pp.
1346–1375, 2012.
[46] G. Dasarathy, A. Singh, M.-F. Balcan, and J. H. Park, “Active learning algorithms
for graphical model selection,” in Proc. 19th International Conference on Artificial
Intelligence and Statistics, 2016, pp. 1356–1364.
[47] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997,
pp. 423–435.
[48] J. Duchi, “Lecture notes for statistics 311/electrical engineering 377 (MIT),”
http://stanford.edu/class/stats311/.
[49] Y. Wu, “Lecture notes for ECE598YW: Information-theoretic methods for
high-dimensional statistics,” www.stat.yale.edu/~yw562/ln.html.
[50] Y. Polyanskiy, V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength
regime,” IEEE Trans. Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[51] O. Johnson, “Strong converses for group testing from finite blocklength results,” IEEE
Trans. Information Theory, vol. 63, no. 9, pp. 5923–5933, 2017.
[52] R. Venkataramanan and O. Johnson, “A strong converse bound for multiple hypothesis
testing, with applications to high-dimensional estimation,” Electron. J. Statistics, vol. 12,
no. 1, pp. 1126–1149, 2018.
[53] P.-L. Loh, “On lower bounds for statistical learning theory,” Entropy, vol. 19, no. 11,
p. 617, 2017.
[54] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances
Appl. Math., vol. 6, no. 1, pp. 4–22, 1985.
[55] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino:
The adversarial multi-armed bandit problem,” in Proc. 26th Annual IEEE Conference on
Foundations of Computer Science, 1995, pp. 322–331.
[56] E. Arias-Castro, E. J. Candès, and M. A. Davenport, “On the fundamental limits of
adaptive sensing,” IEEE Trans. Information Theory, vol. 59, no. 1, pp. 472–481, 2013.
[57] T. S. Han and S. Verdú, “Generalizing the Fano inequality,” IEEE Trans. Information
Theory, vol. 40, no. 4, pp. 1247–1251, 1994.
[58] L. Birgé, “A new lower bound for multiple hypothesis testing,” IEEE Trans. Information
Theory, vol. 51, no. 4, pp. 1611–1615, 2005.
[59] A. A. Gushchin, “On Fano’s lemma and similar inequalities for the minimax risk,”
Probability Theory and Math. Statistics, vol. 2003, no. 67, pp. 26–37, 2004.
[60] A. Guntuboyina, “Lower bounds for the minimax risk using f -divergences, and
applications,” IEEE Trans. Information Theory, vol. 57, no. 4, pp. 2386–2399, 2011.
[61] Y. Polyanskiy and S. Verdú, “Arimoto channel coding converse and Rényi divergence,”
in Proc. 48th Annual Allerton Conference on Communication, Control, and Computing,
2010, pp. 1327–1333.
[62] G. Braun and S. Pokutta, “An information diffusion Fano inequality,”
http://arxiv.org/abs/1504.05492.
[63] X. Chen, A. Guntuboyina, and Y. Zhang, “On Bayes risk lower bounds,” J. Machine
Learning Res., vol. 17, no. 219, pp. 1–58, 2016.
Index

α dimension, 77, 80, 86, 87, 94 binary symmetric community model, 385
χ2 -divergence, 389, 499, 524 binary symmetric community recovery problem,
f -divergence, 524 386–393
bitrate, 45
data compression, 72 unlimited, 57
unrestricted, 53
adaptive composition, 326 Blahut–Arimoto algorithm, 283, 346
ADX, 46 boosting, 327
ADX Bradley–Terry model, 231
achievability, 57 Bregman divergence, 277
converse, 56 Bregman information, 277
Akaike information criterion, 366 Bridge criterion, 371
algorithmic stability, 304, 308
aliasing, 65 capacity, 267, 325, 346
almost exact recovery, 386, 394–401 channel, 346
necessary conditions, 398–401 cavity method, 393, 414
sufficient conditions, 396–398 channel, 3, 325
analog-to-digital binary symmetric, 8
compression, 46 capacity, 10, 11
conversion, 46 discrete memoryless, 10
approximate message passing (AMP), 74, 200, 202, Gaussian, 200, 203, 210
206–207 channel coding, 339
approximation error, 338, 365 achievability, 9
arithmetic coding, 266 blocklength, 10
artificial intelligence, 264 capacity-achieving code, 15
Assouad’s method, 513 codebook, 10, 13
asymptotic efficiency, 363 codewords, 3, 10
asymptotic equipartition property, 6 converse, 11
decoder, 3
barrier-based second-order methods, 110 encoder, 3
basis pursuit, 74 error-correction codes, 10
Bayes decision rule, 333 linear codes, 12
Bayes risk, 337 generator matrix, 13
Bayesian information criterion, 366 parity-check matrix, 13
Bayesian network, 281 maximum-likelihood decoding, 10
belief propagation, 200 polar codes, 14
bias, 365 sphere packing, 10
bilinear problem, 223 syndrome decoding, 12
binary symmetric community detection problem, characteristic (conflict) graph, 459
386–393 characteristic graph, 459

chromatic entropy, 462 number, 514


circular Nyquist density, 168, 169 covering number, 185, 317
classification, 303, 318, 333, 376–379 cross-entropy, 334
clustering, 264, 271, 303 cross-entropy loss, 335, 341–342
clustering memoryless sources, 276 cross-entropy risk, 335
clustering sources with memory, 279 cross-validation, 367
correct, 272 v-fold, 367
dependence-based clustering, 280 leave-one-out, 367
distance-based clustering, 276 multi-fold, 90
Gaussian mixture clustering, 415 paradox, 377
hard clustering algorithm, 277
hard clustering methods, 272 Dantzig selector, 87
hierarchical clustering algorithm, 284 data
independence clustering, 281 empirical data distribution, 336
information-based clustering, 284 labeled data, 264
k-means clustering, 266, 276 training data, 264
soft clustering algorithm, 277 true-data generating distribution, 303, 337, 360
soft clustering methods, 272 unlabeled data, 263
coherence, 164, 170, 174 data acquisition, 18, 72
mutual, 177 analog signal, 18
coloring connectivity condition, 457, 464 digital signal, 18
communication, 267–268 quantization, 18
community detection, 31 rate distortion, 18
community detection problem, 223 scalar, 18
compressed sensing, 163, 237 vector, 18
compression-based, 75 sampling, 18
compressible phase retrieval (COPER), 93 Nyquist rate, 18
compressible signal pursuit, 77 sampling frequency, 18
compression, 266–267 sampling period, 18
compression codes, 75 shift-invariant, 18
compression-based gradient descent, 81 uniform, 18
compression-based (projected) gradient sampling theorem, 18
descent, 91 task-ignorant, 20
computational limits, 401–412 task-oriented, 20
concentration inequality, 171, 175, 179–181 data analysis, 26
conditional T -information, 306 statistical inference, see statistical inference
conditional chromatic entropy, 462 statistical learning, see statistical learning
conditional covariance matrix, 208 data compression, 3, 76, 456
conditional graph entropy, 466 lossless data compression, 76
conditional mutual information, 282, 306 lossy data compression, 76, 95
conjugate gradient method, 110 data-processing inequality, 492, 496
consistency, 361 data representation, 20, 134–135, 137
constrained least-squares problem, 110, analysis operator, 21
113, 119 data-driven, 20, 135
constraint linear, 21, 22
bitrate, 45 nonlinear, 21, 23
contiguity, 389 supervised, 21
convex analysis, 105, 108 unsupervised, 21
convex function, 109 deep learning, see neural networks
convex optimization, 105, 108, 109 dictionary learning, see dictionary learning
convex set, 108 dimensionality reduction, 20
correlated recovery, 385–388, 390–393 features, 20
lower bound, 390–393 independent component analysis, see independent
upper bound, 387–388 component analysis
CoSaMP, 74 kernel trick, 24
covariance matrix, 208 linear, 135
covering, 499, 514 manifold learning, 23
matrix factorization, 25 random sparsity, 152


model based, 20, 135 separable sparsity, 152
nonlinear, 136 statistical risk minimization, 153
principal component analysis, see principal unsupervised, 24
component analysis vector data, 137–149
sparse, see dictionary learning achievability results, 142–148
subspace, 22 empirical risk minimization, 139
synthesis operator, 21 identifiability, 143–144
union of subspaces, 24 K-SVD, 139, 145
data science, 15 minimax risk, see minimax risk
fairness, 35 model, 138
phase transitions, 17 noisy reconstruction, 144–147
pipeline, 16 outliers, 147–148
privacy, 35 sparsity, 141, 142
training label, 21 statistical risk minimization, 139
training sample, 21 differential entropy, 209
training samples, 21 differential privacy, 125, 323
De Bruijn identity, 209 digital
decoder, 52, 335, 339 representation, 46
empirical decoder, 336 distance
maximum mutual information, 267 Lempel–Ziv distance, 274
minimum partition information, 268 Bhattacharya, 279
optimal decoder, 335 edit distance, 274
randomized decoder, 334 Euclidean distance, 274
soft decoder, 335 Hamming distance, 274
deep convolutional network, 264 information distance, 275
deep learning, see neural networks, 264, 326 Itakura–Saito, 276
deep neural network, 331 Mahalanobis distance, 277
multilayer feed-forward network, 331 normalized compression distance, 280
deep recurrent network, 264 normalized information distance, 275
deep representation, 335 distance-based clustering, 276
denoising, 269 distortion function, 339
density estimation, 303, 519 distortion measure, 339
dependence-based clustering, 280 distortion rate, 77
detection, 201, 386–390, 401–412 analog-to-digital, 52
lower bound, 389–390 function, 77, 95
upper bound, 387–388 analog, 55
dictionary, 177 indirect, 55
incomplete, 177 indirect/remote, 48
redundant, 177 remote, 55
dictionary learning, 24, 134–159 sampling, 52
algorithms, 159 vector Gaussian, 48
coherence measures, 142, 147, 155 distributed detection, 427
complete, 136, 140 distributed estimation, 427
distance metric, 138–140, 153 distributed hypothesis testing, 427
Kronecker structured, 152 cascaded communication, 427,
mean-squared error, 138 438–443
minimax lower bound, 24 interactive communication, 427, 443–445
overcomplete, 136, 140 non-interactive communication, 427–437
sample complexity, 138 distributed identity testing, 427, 445–450
supervised, 24 distributed inference, 35
tensor data, 149–157 distributed pattern classification, 427
achievability results, 155–157 distributed source coding, 430
coefficient tensor, 152 distributed statistical inference, 425
coordinate dictionary, 152 feature partition, 426
empirical risk minimization, 153 sample partition, 425
model, 151 distribution over permutations, 232
divergence filter
Bregman divergence, 274, 277 pre-sampling, 63
f -divergence, 274, 278 Wiener, 54, 64
I-divergence, 277 first-moment method, 387–388
Kullback–Leibler divergence, 11, 277 Fisher information, 209
frame, 136, 138, 148
empirical loss, 303, 361 tight, 144, 145
empirical risk, 303, 336 frequency H( f ), 51
empirical risk minimization, 139, 153, 304, 337 frequency response, 51
noisy, 324 functional compression
encoder, 51, 335, 339 distributed functional compression with
randomized, 335 distortion, 474
capacity, 351 feedback in functional compression, 473
entropy, 5, 6 functional compression with distortion,
conditional entropy, 7 456
entropy rate, 96 lossless functional compression, 456
erasure T -information, 306 network functional compression, 456
erasure mutual information, 306 fundamental limit, 201, 265, 330
error
absolute generalization error, 313 Gauss–Markov
approximation error, 338, 365 process, 62
Bayes error, 333 Gaussian
estimation error, 338 process, 50
expected generalization error, 304 vector, 48
generalization error, 303 Gaussian complexity, 111
prediction error, 364, 366, 368, 377, 379 localized Gaussian complexity, 105, 111
testing error, 331 Gaussian mixture clustering, 415
training error, 331 generalization error, 303
error probability, 493 generalization gap, 337
approximate recovery, 494 bounds, 348–349
estimation, 105, 201, 265 information bounds, 348
Bayesian, 489 generalized information criterion, 368
continuous, 510 generalized likelihood ratio, 268
minimax, 489, 510 generalized likelihood ratio test, 268
estimation bias, 365 generalized linear model, 203
estimation error, 338 Gibbs algorithm, 320
estimation phase, 378 good codes, 210
estimation variance, 365 graph
exact recovery, 386, 394–401 forest, 504
necessary conditions, 398–401 tree, 504
sufficient conditions, 396–398 weighted undirected graph, 108
expectation propagation, 200 graph coloring, 461
exponential family of distributions, 232 graph entropy, 459
graph sparsification via sub-sampling, 108
false-positive rate, 201 graphical model, 281
Fano’s inequality, 7, 11, 116, 134, 137, 141, 491 Gaussian, 281
applications, 488, 499, 516 graphical model selection, 503
approximate recovery, 494 adaptive sampling, 507
conditional version, 495 approximate recovery, 506
continuum version, 524 Bayesian, 509
generalizations, 524 graphons, 413
genie argument, 503 group testing, 500
limitations, 523 adaptive, 502
standard version, 493 approximate recovery, 501
Fano’s lemma, 352 non-adaptive, 501
fast iterative shrinkage-thresholding Guerra interpolation method, 384,
algorithm, 74 393, 414
Hamming distance, 9 Kolmogorov complexity, 274


Hamming weight, 13 Kullback-Leibler divergence, 277, 498,
hash function, 108 499, 524
Hellinger distance, 499, 524
Hoeffding’s lemma, 307 labeled data, 264
Huffman coding, 266 large sieve, 165, 168, 170, 187–189
hyperparameter, 345 inequality, 187
hypothesis, 303, 333 large-system limit, 202
hypothesis class, 304 Lasso, 119
hypothesis space, 302 lautum information, 289
hypothesis testing, 117, 268 Le Cam’s method, 513
composite hypothesis testing, 268 learning algorithm, 303
M-ary hypothesis testing, 117 learning distribution over permutations of options,
outlier hypothesis testing, 278 229, 230
universal outlier hypothesis testing, 279 learning from comparisons, 229
learning from first-order marginals, 230
I-MMSE, 209, 391–392 learning model, 337
ICA, see independent component analysis learning problem, 303
illum information, 289 learning rate, 347
image inpainting, 164, 181 least absolute shrinkage and selection operator
image registration, 269 (LASSO), 74
image super-resolution, 164, 181 least-angle regression (LARS), 74
independence clustering, 281 least-squares optimization problem, 110
independent component analysis, 22, 281 Lempel–Ziv algorithm, 266
inference, 198, 265, 268–271, 362 likelihood ratio, 268
Bayesian inference, 198 likelihood ratio test, 268
high-dimensional inference, 198 linear model
information bottleneck, 25, 283, 342–347 generalized, 203
algorithm, 349 standard, 200
method, 345–347 linear regression, 489
information dimension, 97, 98 loss, 360–363
k-th order information dimension, 98 average logarithmic loss, 334
Rényi information dimension, 97 cross-entropy loss, 334, 335, 341–342
information dropout, 349 empirical loss, 303, 361
information sequence, 212 function, 302
information source, 2 in-sample prediction loss, 364
binary, 4 out-sample prediction loss, 364
information theory, 265, 330 prediction loss, 360–363
information-based clustering, 284 validation loss, 367, 376
information-theoretical limits, 386–401 loss function, 488, 510
instance, 303 lossless compression, 266
instance space, 302 lossless source coding, 339
inter-cluster similarity, 274 lossy compression, 267
intra-cluster similarity, 274 lossy source coding, 339–341
invariant representations, 353 low-rank, 73, 120
Ising model, 281 low-rank matrix estimation, 119
iterative hard thresholding (IHT), 74 Luce model, 231
iterative sketching, 120
machine learning, 27
JPEG, 75 Markov chain, 496
JPEG2000, 75 matrix factorization problem, 223
maximum a posteriori, 333
Karhunen–Loève maximum-likelihood estimator (MLE),
expansion, 57 361, 387
kernel function, 122 maximum mutual information decoder, 267
kernel matrix, 122 McDiarmid’s concentration inequality, 354
kernel ridge regression, 122 mean-squared covariance, 219
mean-squared error, 52, 201 normalized mutual information, 275


estimator, 54, 64 tensorization, 492
minimal, 53
minimum, 201 nearest neighbor, 276, 360
prediction, 201 network coding, 476
metric entropy, 116 network functional compression, 456
min-cut max-flow bounds, 476 neural networks, 25, 326, 376
minimal mean-squared error, 47 deep convolutional network, 264
minimax estimation deep neural network, 331
global approach, 513, 519 deep recurrent network, 264
local approach, 513 feed-forward neural network, 376
minimax regret, 352 gradient descent, 25
minimax risk, 116, 510 model, 232
dictionary learning non-parametric, 121, 360
lower bounds, 140–142, 154–155 rectified linear unit, 25
tightness, 141, 155, 158 regression, 121
upper bounds, 142, 155 sparse model, 232, 234, 236
minimax rate, 368 null-space property, 165, 185
minimum description length, 350, 368 set-theoretic, 165, 184–187
minimum mean-squared error (MMSE), 202, Nyquist
208, 219 rate, 50
minimum mean-squared error sequence, 214 Nyquist–Shannon sampling theorem, 73
minimum partition information, 285
minimum partition information decoder, 268 objectives of data analysis, 362
Minkowski dimension, 164, 185 Occam’s principle, 78
modified, 165, 185 optimization, 105, 515
misclassification probability, 333, 341–342 convex, 521
model selection, 326, 348, 350, 359–363 orthogonal matching pursuit (OMP), 74
asymptotic efficiency, 363 orthonormal basis, 172
consistency, 361 outlier hypothesis testing, 278
criteria, 363, 365–375 overfitting, 232, 345, 370, 374
for inference, 362
for prediction, 362 PAC-based generalization bounds, 312, 348
model class, 338, 360 packing, 513, 514, 518
model complexity, 365 number, 513
model parameters, 360 packing number, 116
non-parametric, 360 parametric, 360
optimal model, 360–365 parametric model, 231
parametric, 360 parsimony, 164, 180
restricted model class, 337 fractal structure, 184
true model, 360–365 manifold structure, 184
modeling-procedure selection, 375–379 partition information, 284
MPEG, 75 pattern classification, 333
multi-task learning, 331 PCA, see principal component analysis
multicast networks, 476 peeling algorithm, 239
multiinformation, 287 phase diagram, 198
multinomial logit model, 231, 235 phase retrieval, 92
mixture of multinomial logit models, 232, 256 phase transition, 202, 219, 384, 401
nested multinomial logit model, 232 channel coding, 12
multiple hypothesis testing, 490, 511 computational, 17
mutual information, 11, 124, 202, 208, 306, source coding, 8
341–342, 390–393 statistical, 17
bounds, 492, 495 Plackett model, 231
bounds via KL divergence, 492, 498 planted-clique problem, 402
conditional mutual information, 282 planted dense subgraph (PDS) problem, 404
empirical mutual information, 348 polynomial-time randomized reduction, 403
population risk, 302 recovery algorithm


positive-rate data compression, 431, 437, 439, 442, phase-retrieval recovery algorithm, 93
444, 445, 447, 449 regression, 110, 200, 303, 359, 362–366, 368, 370,
posterior distribution, 197 375–377, 379
posterior marginal distribution, 206 linear, 200, 359, 362, 363, 368
power graph, 461 polynomial, 363
power spectral density, 19 regularization, 91, 113, 320, 326, 345
prediction, 201, 268, 362 reinforcement learning, 264
prediction error, 364, 366, 368, 377, 379 relative entropy, 207, 305
prediction loss, 360–363 replica method, 199
principal component analysis, 22 replica-symmetric formula, 202, 215
sparse, 23 representation learning, 333–338
spiked covariance model, 22 information complexity, 349
statistical phase transition, 23 reproducing kernel Hilbert space, 121
prior distribution, 197 restricted isometry property, 237
Bernoulli–Gaussian prior, 203 risk
Gaussian mixture model, 209 Bayes, 208, 337
privacy, 105, 124 cross-entropy risk, 335
differential privacy, 125 empirical cross-entropy risk, 350
privacy attacks, 125 empirical risk, 303, 336
probabilistic method, 10 excess, 304, 338
probit model, 231, 235 expected empirical risk, 325
process expected excess risk, 304
continuous-time, 49 expected risk, 302
cyclostationary, 64 induced risk, 338
stationary, 49 minimum cross-entropy risk, 337
projected gradient method, 110 population risk, 302
proximal gradient method, 110 risk function, 335
PSD true risk, 335
unimodal, 66
sample, 360
QR decomposition, 110 sample size, 360
sample complexity, 314, 317, 322, 323
Rand index, 275 sampler, 50
random projections, 104 bounded linear, 50
random utility maximization, 231 linear
random utility model, 231, 235, 247 time-invariant, 51
randomized algorithm, 104 multi-branch, 51
rank centrality algorithm, 247 uniform, 51, 63
ranking, 233, 236 sampling
rate adaptive, 489
Landau, 54 bounded linear, 50
rate region branch, 51
achievable rate region, 458 density
rate distortion, 77, 267 symmetric, 51
achievable distortion, 339 multi-branch, 67
achievable fidelity, 339 non-adaptive, 489
achievable rate, 339 optimal, 54, 57
distortion-rate function, 340 rate, 63
rate-distortion function, 77, 95, 340 set, 50
sampling, 60 source coding, 49
rate-distortion coding, 3 uniform
rate-distortion dimension, 98, 99 ADX, 63
rate-relevance function, 343 sampling rate, 73
receiver operating characteristic (ROC) curve, 201 optimal, 59
reconstruction algorithm, 73 sub-Nyquist, 59
sampling theorem, 73 lossy, 3


second-moment method, 389–390 lossy source coding, 339–341
Shannon–Fano–Elias coding, 266 near-lossless, 5
signal noisy lossy source coding, 342–344
analog, 49 remote, 54
bandlimited, 50 sampling , 49
noise, 50 variable length, 5
stationary, 49 source coding with a single helper, 432
signal alignment, 269 source coding with multiple helpers, 432
signal declipping, 164, 181 space
signal recovery, 164 Paley–Wiener, 54
noisy signal recovery, 172, 176 sparse coding, 24
sparse signal recovery, 164 sparse linear regression, 516
stable signal recovery, 172 sparse PCA, 414
signal representation, 177 sparse representation, 73
sparse signal representation, 177 sparsity, 73, 119, 164, 180, 201, 234
signal separation, 164, 180–184 level, 181
noiseless signal separation, 181 block-sparsity, 73
sparse signal separation, 164 spectral methods, 258
signal structure, 164 spectrum
signature condition, 238 density power, 50
similarity density power
function, 274 rectangular, 61
inter-cluster similarity, 274 triangular, 61
intra-cluster similarity, 274 occupancy, 50
score, 274 square-root bottleneck, 184
similarity criterion, 273 stability, 304, 308, 311
single-community detection problem, 401–412 in T -information, 310
single-community model, 384 in erasure mutual information, 309
single-community recovery problem, 394–401 in erasure T -information, 308
single-crossing property, 209, 218 in mutual information, 310
singular value decomposition, 110 standard linear model, 200, 213–215
sketches statistical inference, 26, 28
Gaussian sketches, 105 covariate, 28
Rademacher sketches, 106 estimation, 27, 28
randomized orthonormal systems estimation error, 29
sketches, 106 generalized linear model, 29
sketches based on random row sampling, 107 hypothesis testing, 26, 30
sparse sketches based on hashing, 108 alternate hypothesis, 30
sub-Gaussian sketches, 106 binary, 30
sketching, 104 maximum a posteriori test, 31
iterative sketching, 120 multiple, 30
sketching kernels, 123 Neyman–Pearson test, 30
Slepian–Wolf, 467, 473 null hypothesis, 30
Slepian–Wolf compression with Rényi information measures, 31
feedback, 473 type-I error, 30
source, 339 type-II error, 30
source coding, 339–341 I-MMSE relationship, 29
achievability, 5 model selection, 26, 28
blocklength, 8 Akaike information criterion, 28
converse, 7 minimum description length, 28
decoder, 3 non-parametric, 26
encoder, 3 parametric, 26
indirect, 54 prediction error, 29
lossless, 3, 4 regression, 27, 28
lossless source coding, 339 regression parameters, 28
standard linear model, 29 test phase, 375


statistical learning, 27, 33 test set, 375
clustering, 27 test statistic, 201
density estimation, 27 testing against independence, 431–433,
empirical risk minimization, 34 439–441, 444
generalization error, 27, 33 Thurstone–Mosteller model, 231
loss function, 33 total variation, 499, 524
supervised total variation distance, 305
training data, 33 training data, 264
supervised learning, 27 training phase, 336, 375, 377
test data, 27 training samples, 346
testing, 27 training set, 264, 303, 367, 375, 377
testing error, 27 tree networks, 458
training, 27 general tree networks, 457, 467
training data, 27 one-stage tree networks, 457, 463
training error, 27 true positive rate, 201
unsupervised learning, 27, 34 type 1 error probability, 429
validation data, 27 constant-type constraint, 429
statistical model exponential-type constraint, 429
model class, 360 type 2 error probability, 429
model complexity, 365 type 2 error exponent, 429
model parameters, 360 typical sequences, 6, 430–431
non-parametric model, 360 typical set, 509
parametric model, 360
statistical risk minimization, 139, 153 uncertainty principle, 163
stochastic block model (SBM), 385 uncertainty relation, 163, 165–181, 189
k communities, 416 coherence-based, 164, 170, 174–175,
single community, 385 177–179
two communities, 385 in (Cm ,  · 1 ), 174–181
stochastic complexity, 351 in (Cm ,  · 2 ), 165–173
stochastic gradient descent algortihm, 356 infinite-dimensional, 189
sub-Nyquist sampling, 49 large sieve-based, 168–170
sub-Gaussian, 307 set-theoretic, 164
subjet, 493 unconstrained least-squares problem, 110,
submatrix detection model, 385 112, 118
submatrix detection problem, 401–412 underfitting, 371, 373, 374
subspace pursuit, 74 universal classification, 270
sum-of-squares relaxation hierarchy, 412 universal clustering, 270, 273
supervised learning, 264 universal communication, 267–268
supervised learning problem, 303 universal compressed sensing, 98
support vector machine, 276 universal compression, 266–267
synchronization problem, 393 universal denoising, 269
system universal inference, 268–271
linear universal information theory, 265
time-varying, 50 universal outlier hypothesis testing, 279
universal prediction, 268
T -information, 306 universal similarity, 275
tangent cone, 109 unlabeled data, 263
tensor PCA, 414 unsupervised learning, 264, 331
tensors unsupervised learning problem, 303
CP decomposition, 151 utility, 235
data, 134–136 random utility, 235
multiplication, 151
Tucker decomposition, 151 validation loss, 367, 376
unfolding, 151 validation phase, 375, 377, 378
validation set, 367, 375, 377 water-filling, 55, 58, 65


Vapnik–Chervonenkis theory, 318 water-filling
Vapnik–Chervonenkis class, 318 ADX, 56
Vapnik–Chervonenkis dimension, 318 reverse, 48
variance, 365
variational auto-encoder, 332, 348, zero-rate data compression, 434, 441, 445, 448
349 zigzag condition, 463
