Principles of Digital Communication: A Top-Down Approach
Bixio Rimoldi
www.cambridge.org
Information on this title: www.cambridge.org/9781107116450
© Cambridge University Press 2016
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2016
Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication data
Rimoldi, Bixio.
Principles of digital communication : a top-down approach / Bixio Rimoldi,
School of Computer and Communication Sciences, Ecole Polytechnique
Fédérale de Lausanne (EPFL), Switzerland.
pages cm
Includes bibliographical references and index.
ISBN 978-1-107-11645-0 (Hardback : alk. paper)
1. Digital communications. 2. Computer networks. I. Title.
TK5103.7.R56 2015
621.382–dc23 2015015425
ISBN 978-1-107-11645-0 Hardback
Additional resources for this publication at www.cambridge.org/rimoldi
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
This book is dedicated to my parents,
for their boundless support and trust,
and to the late Professor James L. Massey,
whose knowledge, wisdom, and generosity
have deeply touched generations of students.
Contents
Preface
Acknowledgments
List of symbols
List of abbreviations
Bibliography
Index
Preface
Footnote 1: We have six periods of 45 minutes per week, part of which we have devoted to exercises, for a total of 14 weeks.
In terms of style, I have paid due attention to proofs. The value of a rigorous
proof goes beyond the scientific need of proving that a statement is indeed true.
From a proof we can gain much insight. Once we see the proof of a theorem, we
should be able to tell why the conditions (if any) imposed in the statement are
necessary and what can happen if they are violated. Proofs are also important
because the statements we find in theorems and the like are often not in the exact
form needed for a particular application. Therefore, we might have to adapt the
statement and the proof as needed.
An instructor should not miss the opportunity to share useful tricks. One of
my favorites is the trick I learned from Professor Donald Snyder (Washington
University) on how to label the Fourier transform of a rectangle. (Most students
remember that the Fourier transform of a rectangle is a sinc but tend to forget
how to determine its height and width. See Appendix 5.10.)
The remainder of this preface is about the text organization. We follow a top-
down approach, but a more precise name for the approach is top-down-reversed
with successive refinements. It is top-down in the sense of Figure 1.7 of Chapter
1, which gives a system-level view of the focus of this book. (It is also top-down
in the sense of the OSI model depicted in Figure 1.1.) It is reversed in the sense
that the receiver is treated before the transmitter. The logic behind this reversed
order is that we can make sensible choices about the transmitter only once we are
able to appreciate their impact on the receiver performance (error probability,
implementation costs, algorithmic complexity). Once we have proved that the
receiver and the transmitter decompose into blocks of well-defined tasks (Chapters
2 and 3), we refine our design, changing the focus from “what to do” to “how to
do it effectively” (Chapters 5 and 6). In Chapter 7, we refine the design of the
second layer to take into account the specificity of passband communication. As a
result, the second layer splits into the second and the third layer of Figure 1.7.
In Chapter 2 we acquaint ourselves with the receiver-design problem for channels
that have a discrete output alphabet. In doing so, we hide all but the most
essential aspect of a channel, specifically that the input and the output are related
stochastically. Starting this way takes us very quickly to the heart of digital
communication, namely the decision rule implemented by a decoder that mini-
mizes the error probability. The decision problem is an excellent place to begin
as the problem is new to students, it has a clean-cut formulation in terms of
minimizing an objective function (the error probability), the derivations rely only
on basic probability theory, the solution is elegant and intuitive (the maximum
a posteriori probability decision rule), and the topic is at the heart of digital
communication. After a general start, the receiver design is specialized for the
discrete-time AWGN (additive white Gaussian noise) channel that plays a key
role in subsequent chapters. In Chapter 2, we also learn how to determine (or
upper bound) the probability of error and we develop the notion of a sufficient
statistic, needed in the following chapter. The appendices provide a review of
relevant background material on matrices, on how to obtain the probability density
function of a variable defined in terms of another, on Gaussian random vectors,
and on inner product spaces. The chapter contains a large collection of exercises.
In Chapter 3 we make an important transition concerning the channel used to
communicate, specifically from the rather abstract discrete-time channel to the
more realistic continuous-time AWGN channel. The objective remains the same,
i.e. develop the receiver structure that minimizes the error probability. The theory
of inner product spaces, as well as the notion of sufficient statistic developed in
the previous chapter, give us the tools needed to make the transition elegantly and
swiftly. We discover that the decomposition of the transmitter and the receiver, as
done in the top two layers of Figure 1.7, is general and natural for the continuous-
time AWGN channel. This constitutes the end of the first pass over the top two
layers of Figure 1.7.
Up until Chapter 4, we assume that the transmitter has been given to us.
In Chapter 4, we prepare the ground for the signal-design. We introduce the
design parameters that we care about, namely transmission rate, delay, bandwidth,
average transmitted energy, and error probability, and we discuss how they relate
to one another. We introduce the notion of isometry in order to change the
signal constellation without affecting the error probability. It can be applied to
the encoder to minimize the average energy without affecting the other system
parameters such as transmission rate, delay, bandwidth, error probability; alterna-
tively, it can be applied to the waveform former to vary the signal’s time/frequency
features. The chapter ends with three case studies for developing intuition. In each
case, we fix a signaling family, parameterized by the number of bits conveyed by
a signal, and we determine the probability of error as the number of bits grows to
infinity. For one family, the dimensionality of the signal space stays fixed, and the
conclusion is that the error probability grows to 1 as the number of bits increases.
For another family, we let the signal space dimensionality grow exponentially and,
in so doing, we can make the error probability become exponentially small. Both
of these cases are instructive but have drawbacks that make them unworkable
solutions as the number of bits becomes large. The reasonable choice seems to
be the “middle-ground” solution that consists in letting the dimensionality grow
linearly with the number of bits. We demonstrate this approach by means of what
is commonly called pulse amplitude modulation (PAM). We prefer, however, to
call it symbol-by-symbol on a pulse train because PAM does not convey the idea
that the pulse is used more than once and people tend to associate PAM with a
certain family of symbol alphabets. We find symbol-by-symbol on a pulse train
to be more descriptive and more general. It encompasses, for instance, phase-shift
keying (PSK) and quadrature amplitude modulation (QAM).
Chapter 5 discusses how to choose the orthonormal basis that characterizes the
waveform former (Figure 1.7). We discover the Nyquist criterion as a means to
construct an orthonormal basis that consists of the T -spaced time translates of a
single pulse, where T is the symbol interval. Hence we refine the n-tuple former,
which can then be implemented with a single matched filter. In this chapter we also learn how
to do symbol synchronization (to know when to sample the matched filter output)
and introduce the eye diagram (to appreciate the importance of a correct symbol
synchronization). Because of its connection to the Nyquist criterion, we also derive
the expression for the power spectral density of the communication signal.
In Chapter 6, we design the encoder and refine the decoder. The goal is to expose
the reader to a widely used way of encoding and decoding. Because there are
several coding techniques – numerous enough to justify a graduate-level course –
we approach the subject by means of a case study based on convolutional coding.
The minimum error probability decoder incorporates the Viterbi algorithm. The
content of this chapter was selected to introduce the reader to coding and to
elegant and powerful tools, such as the previously mentioned Viterbi
algorithm and the tools to assess the resulting bit-error probability, notably detour
flow graphs and generating functions.
The material in Chapter 6 could be covered after Chapter 2, but there are some
drawbacks in doing so. First, it unduly delays the transition from the discrete-time
channel model of Chapter 2 to the more realistic continuous-time channel model
of Chapter 3. Second, it makes more sense to organize the teaching into a first
pass where we discover what to do (Chapters 2 and 3), and a refinement where
we focus on how to do it effectively (Chapters 5, 6, and 7). Finally, at the end of
Chapter 2, it is harder to motivate the students to invest time and energy into
coding for the discrete-time AWGN channel, because there is no evidence yet that
the channel plays a key role in practical systems. Such evidence is provided in
Chapter 3. Chapters 5 and 6 could be done in the reverse order, but the chosen
order is preferable for continuity reasons with respect to Chapter 4.
The final chapter, Chapter 7, is where the third layer emerges as a refinement
of the second layer to facilitate passband communication.
The following diagram (not reproduced here) summarizes the main thread throughout the text: we derive the receiver that minimizes the error probability.
Each chapter contains one or more appendices, with either background or com-
plementary material.
I should mention that I have made an important concession to mathematical
rigor. This text is written for people with the mathematical background of an
engineer. To be mathematically rigorous, the integrals that come up in dealing
with Fourier analysis should be interpreted in the Lebesgue sense.2 In most under-
graduate curricula, engineers are not taught Lebesgue integration theory. Hence
some compromise has to be made, and here is one that I find very satisfactory. In
Appendix 5.9, I introduce the difference between the Riemann and the Lebesgue
integrals in an informal way. I also introduce the space of L2 functions and the
notion of L2 equivalence. The ideas are natural and can be understood with-
out technical details. This gives us the language needed to rigorously state the
sampling theorem and Nyquist criterion, and the insight to understand why the
technical conditions that appear in those statements are necessary. The appendix
also reminds us that two signals that have the same Fourier transform are L2
equivalent but not necessarily point-wise equal. Because we introduce the Lebesgue
integral in an informal way, we are not in a position to prove, say, that we
can swap an integral and an infinite sum. In some way, having a good reason
for skipping such details is a blessing, because dealing with all technicalities can
quickly become a major distraction. These technicalities are important at some
level and unimportant at another level. They are important for ensuring that the
theory is consistent and a serious graduate-level student should be exposed to
them. However, I am not aware of a single case where they make a difference in
dealing with finite-support functions that are continuous and have finite energy,
especially with the kinds of signals we encounter in engineering. Details pertaining
to integration theory that are skipped in this text can be found in Gallager’s book
[2], which contains an excellent summary of integration theory for communication
engineers. Lapidoth [3] contains many details that are not found elsewhere. It is
an invaluable text for scholars in the field of digital communication.
The last part of this preface is addressed to instructors. Instructors might
consider taking a bottom-up approach with respect to Figure 1.7. Specifically, one
could start with the passband AWGN channel model and, as the first step in the
development, reduce it to the baseband model by means of the up/down converter.
In this case the natural second step is to reduce the baseband channel to the
discrete-time channel and only then address the communication problem across the
discrete-time channel. I find such an approach to be pedagogically less appealing
as it puts the communication problem last rather than first. As formulated by
Claude Shannon, the father of modern digital communication, “The fundamental
problem of communication is that of reproducing at one point either exactly or
approximately a message selected at another point”. This is indeed the problem
that we address in Chapter 2. Furthermore, randomness is the most important
aspect of a channel. Without randomness, there is no communication problem.
The channels considered in Chapter 2 are good examples to start with, because
they model randomness without additional distractions. However, the choice of
Footnote 2: The same can be said for the integrals involving the noise, but our approach avoids such integrals. See Section 3.2.
Footnote 3: . . . or when we watch a video. But a book can be more useful as a reference, because it is easier to find what you are looking for in a book than on a video, and a book can be annotated (personalized) more easily.
Footnote 4: http://nb.mit.edu.
which I post the reading assignments (essentially all sections). When students
have a question, they go to the site, highlight the relevant part, and type a
question in a pop-up window. The questions are summarized on a list that can be
sorted according to various criteria. Students can “vote” on a question to increase
its importance. Most questions are answered by students, and as an incentive
to interact on Nota Bene, I give a small bonus for posting pertinent questions
and/or for providing reasonable answers.5 The teaching assistants (TAs) and I
monitor the site and intervene as needed. Before I go to class, I take a
look at the questions, ordered by importance; then in class I “fill the gaps” as I
see fit.
Most of the class time is spent doing exercises. I encourage the students to help
each other by working in groups. The TAs and I are there to help. This way,
I see who can do what and where the difficulties lie. Assessing the progress this
way is more reliable than by grading exercises done at home. (We do not grade
the exercises, but we do hand out solutions.) During an exercise session, I often
go to the board to clarify, to help, or to complement, as necessary.
In terms of my own satisfaction, I find it more interesting to interact with the
students in this way, rather than to give ex-cathedra lectures that change little from
year to year. The vast majority of the students also prefer the flipped classroom:
They say so and I can tell that it is the case from their involvement. The exercises
are meant to be completed during the class time,6 so that at home the students
can focus on the reading. By the end of the semester7 we have covered almost all
sections of the book. (Appendices are left to the student’s discretion.) Before a new
reading assignment, I motivate the students to read by telling them why the topic
is important and how it fits into the big picture. If there is something unusual,
e.g. a particularly technical passage, I tell them what to expect and/or I give a
few hints. Another advantage of the flipped classroom is never falling behind the
schedule. At the beginning of the semester, I know which sections will be assigned
which week, and prepare the exercises accordingly. After the midterm, I assign a
MATLAB project to be completed in groups of two and to be presented during the
last day of class. The students like this very much.8
Footnote 5: A pertinent intervention is worth half a percent of the total number of points that can be acquired over the semester and, for each student, I count at most one intervention per week. This limits the maximum amount of bonus points to 7% of the total.
Footnote 6: Six periods of 45 minutes at EPFL.
Footnote 7: Fourteen weeks at EPFL.
Footnote 8: The idea of a project was introduced with great success by my colleague, Rüdiger Urbanke, who taught the course during my sabbatical.
Acknowledgments
This book is the result of a slow process, which began around the year 2000, of
seemingly endless revisions of my notes written for Principles of Digital Commu-
nication – a sixth-semester course that I have taught frequently at EPFL. I would
like to acknowledge that the notes written by Professors Robert Gallager and
Amos Lapidoth, for their MIT course Introduction to Digital Communication, as
well as the notes by Professor James Massey, for his ETHZ course Applied Digital
Information Theory, were of great help to me in writing the first set of notes that
evolved into this text. Equally helpful were the notes written by EPFL Professor
Emre Telatar, on matrices and on complex random variables; they became the
core of some appendices on background and on complementary material.
A big thanks goes to the PhD students who helped me develop new exercises
and write solutions. This includes Mani Bastani Parizi, Sibi Bhaskaran, László
Czap, Prasenjit Dey, Vasudevan Dinkar, Jérémie Ezri, Vojislav Gajic, Michael
Gastpar, Saeid Haghighatshoar, Hamed Hassani, Mahdi Jafari Siavoshani, Javad
Ebrahimi, Satish Korada, Shrinivas Kudekar, Stéphane Musy, Christine Neuberg,
Ayfer Özgür, Etienne Perron, Rajesh Manakkal, and Philippe Roud. Some exer-
cises were created from scratch and some were inspired by other textbooks. Most
of them evolved over the years and, at this point, it would be impossible to give
proper credit to all those involved. The first round of teaching Principles of Digital
Communication required creating a number of exercises from scratch. I was very
fortunate to have Michael Gastpar (PhD student at the time and now an EPFL
colleague) as my first teaching assistant. He did a fabulous job in creating many
exercises and solutions.
I would like to thank my EPFL students for their valuable feedback. Pre-final
drafts of this text were used at Stanford University and at UCLA, by Professors
Ayfer Özgür and Suhas Diggavi, respectively. Professor Rüdiger Urbanke used
them at EPFL during two of my sabbatical leaves. I am grateful to them for their
feedback and for sharing with me their students’ comments.
I am grateful to the following collaborators who have read part of the
manuscript and whose feedback has been very valuable: Emmanuel Abbe, Albert
Abilov, Nicolae Chiurtu, Michael Gastpar, Matthias Grossglauser, Paolo Ienne,
Alberto Jimenez-Pacheco, Olivier Lévêque, Nicolas Macris, Stefano Rosati, Anja
Skrivervik, and Adrian Tarniceriu.
I am particularly indebted to the following people for having read the whole
manuscript and for giving me a long list of suggestions, while noting the typos and
mistakes: Emre Telatar, Urs Niesen, Saeid Haghighatshoar, and Sepand Kashani-
Akhavan.
Warm thanks go to Françoise Behn, who learned LaTeX to type the first version
of the notes, to Holly Cogliati-Bauereis for her infinite patience in correcting my
English, to Emre Telatar for helping with LaTeX-related problems, and to Karol
Kruzelecki and Damir Laurenzi for helping with computer issues.
Finally, I would like to acknowledge many interesting discussions with various
colleagues, in particular those with Emmanuel Abbe, Michael Gastpar, Amos
Lapidoth, Upamanyu Madhow, Emre Telatar, and Rüdiger Urbanke. I would also
like to thank Rüdiger Urbanke for continuously encouraging me to publish my
notes. Without his insistence and his jokes about my perpetual revisions, I might
still be working on them.
List of symbols
A, B, . . . Sets.
N Set of natural numbers: {1, 2, 3, . . . }.
Z Set of integers: {. . . , −2, −1, 0, 1, 2, . . . }.
R Set of real numbers.
C Set of complex numbers.
H := {0, . . . , m − 1} Message set.
C := {c0 , . . . , cm−1 } Codebook (set of codewords).
W := {w0 (t), . . . , wm−1 (t)} Set of waveform signals.
V Vector space or inner product space.
u:A→B Function u with domain A and range B.
H∈H Random message (hypothesis) taking value in H.
N (t) Noise.
NE (t) Baseband-equivalent noise.
R(t) Received (random) signal.
Y = (Y1 , . . . , Yn ) Random n-tuple observed by the decoder.
j √−1.
{} Set of objects.
A^T Transpose of the matrix A. It may be applied to an n-tuple a.
A† Hermitian transpose of the matrix A. It may be applied to an n-tuple a.
E [X] Expected value of X.
⟨a, b⟩ Inner product between a and b (in that order).
‖a‖ Norm of the vector a.
|a| Absolute value of a.
a := b a is defined as b.
1{S} Indicator function. Its value is 1 if the statement S
is true and 0 otherwise.
1A (x) Same as 1{x ∈ A}.
E Average energy.
KN (t + τ, t), KN (τ ) Autocovariance of N (t).
□ Used to denote the end of theorems, definitions, examples, proofs, etc.
ℜ{·} Real part of the enclosed quantity.
ℑ{·} Imaginary part of the enclosed quantity.
∠ Phase of the complex-valued number that follows.
List of abbreviations
AM amplitude modulation.
bps bits per second.
BSS binary symmetric source.
DSB-SC double-sideband modulation with suppressed carrier.
iid independent and identically distributed.
l. i. m. limit in L2 norm.
LNA low-noise amplifier.
MAP maximum a posteriori.
Mbps megabits per second.
ML maximum likelihood.
MMSE minimum mean square error.
PAM pulse amplitude modulation.
pdf probability density function.
pmf probability mass function.
PPM pulse position modulation.
PSK phase-shift keying.
QAM quadrature amplitude modulation.
SSB single-sideband modulation.
WSS wide-sense stationary.
1 Introduction and objectives
Figure 1.1 (not reproduced): the OSI layering model. On the sending side, data passes down through the presentation, session, transport, network, data link, and physical layers; each layer adds its own header (PH, SH, TH, NH, DH), and the data link layer also adds a trailer (DT). The units produced by the transport, network, and data link layers are called segments, packets, and frames. The receiving side reverses the process, and the two physical layers are connected by the physical medium.
Typically, there is no direct physical link between the two application layers.
Instead, the communication between application layers goes through a shared
network, which creates a number of challenges. To begin with, there is no guarantee
of privacy for anything that goes through a shared network. Furthermore, networks
carry data from many users and can get congested. Hence, if possible, the data
should be compressed to reduce the traffic. Finally, there is no guarantee that the
sending and the receiving computers represent letters the same way. Hence, the
application header and the data need to be communicated by using a universal
language. The presentation layer handles the encryption, the compression, and
the translation to/from a universal language. The presentation layer also needs a
protocol to talk to the peer presentation layer at the destination. The protocol is
implemented by means of the presentation header (PH).
For the presentation layers to talk to each other, we need to make sure that
the two hosting computers are connected. Establishing, maintaining, and ending
communication between physical devices is the job of the session layer. The session
layer also manages access rights. Like the other layers, the session layer uses a
protocol to interact with the peer session layer. The protocol is implemented by
means of the session header (SH).
The layers we have discussed so far would suffice if all the machines of interest
were connected by a direct and reliable link. In reality, links are not always reliable.
Making sure that from an end-to-end point of view the link appears reliable
is one of the tasks of the transport layer. By means of parity check bits, the
transport layer verifies that the communication is error-free and if not, it requests
retransmission. The transport layer has a number of other functions, not all of
which are necessarily required in any given network. The transport layer can break
long sequences into shorter ones or it can multiplex several sessions between the
same two machines into a single one. It also provides flow control by queueing up
data if the network is congested or if the receiving end cannot absorb it sufficiently
fast. The transport layer uses the transport header (TH) to communicate with the
peer layer. The transport header followed by the data handed down by the session
layer is called a segment.
Now assume that there are intermediate nodes between the peer processes of
the transport layer. In this case, the network layer provides the routing service.
Unlike the above layers, which operate on an end-to-end basis, the network layer
and the layers below have a process also at intermediate nodes. The protocol
of the network layer is implemented in the network header (NH). The network
header contains, among other things, the source and the destination address.
The network header followed by the segment (of the transport layer) is called a
packet.
The next layer is the data link (DL) layer. Unlike the other layers, the DL puts
a header at the beginning and a trailer at the end of each packet handed down
by the network layer. The result is called a frame. Some of the overhead bits are
parity-check bits meant to determine if errors have occurred in the link between
nodes. If the DL detects errors, it might ask to retransmit or drop the frame
altogether. If it drops the frame, it is up to the transport layer, which operates on
an end-to-end basis, to request retransmission.
The physical layer – the subject of this text – is the bottom layer of the OSI
stack. The physical layer creates a more-or-less reliable “bit pipe” out of the
physical channel between two nodes. It does so by means of a transmitter/receiver
pair, called modem,1 on each side of the physical channel. We will learn that the
physical-layer designer can trade reliability for complexity and delay.
In summary, the OSI model has the following characteristics. Although the
actual data transmission is vertical, each layer is programmed as if the transmission
were horizontal. For a process, whatever is not part of its own header is considered
as being actual data. In particular, a process makes no distinction between the
headers of the higher layers and the actual data segment. For instance, the pre-
sentation layer translates, compresses, and encrypts whatever it receives from the
application layer, attaches the PH, and sends the result to its peer presentation
layer. The peer in turn reads and removes the PH and decrypts, decompresses,
and translates the data which is then passed to the application layer. What the
application layer receives is identical to what the peer application layer has sent,
up to a possible language translation. The DL inserts a trailer in addition to
a header. All layers, except the transport and the DL layer, assume that the
communication to the peer layer is error-free. If it can, the DL layer provides
reliability between successive nodes. Even if the reliability between successive
nodes is guaranteed, nodes might drop packets due to queueing overflow. The
transport layer, which operates at the end-to-end level, detects missing segments
and requests retransmission.
It should be clear that a layering approach drastically simplifies the tasks of
designing and deploying communication infrastructures. For instance, a program-
mer can test the application layer protocol with both applications running on the
same computer – thus bypassing all networking problems. Likewise, a physical-
layer specialist can test a modem on point-to-point links, also disregarding net-
working issues. Each of the tasks of compressing, providing reliability, privacy,
authenticity, routing, flow control, and physical-layer communication requires spe-
cific knowledge. Thanks to the layering approach, each task can be accomplished by
people specialized in their respective domain. Similarly, equipment from different
manufacturers works together, as long as it respects the protocols.
The OSI architecture is a generic model that does not prescribe a specific
protocol. The Internet uses the TCP/IP protocol stack, which is essentially com-
patible with the OSI architecture but uses five instead of seven layers [4]. The
reduction is mainly obtained by combining the OSI application, presentation, and
session layers into a single layer called the application layer. The transport layer
Footnote 1: Modem is the result of contracting the terms modulator and demodulator. In analog
modulation, such as frequency modulation (FM) and amplitude modulation (AM), the
source signal modulates a parameter of a high-frequency oscillation, called the carrier
signal. In AM it modulates the carrier’s amplitude and in FM it modulates the carrier’s
frequency. The modulated signal can be transmitted over the air and in the absence of
noise (which is never the case) the demodulator at the receiver reconstructs an exact
copy of the source signal. In practice, due to noise, the reconstruction only approximates
the source signal. Although modulation and demodulation are misnomers in digital
communication, the term modem has remained in use.
1.2 The topic of this text and some historical perspective
Figure 1.2 (not reproduced): on the left, a pulse p(t) of unit height and duration T0; on the right, a pulse train w(t) built from shifted and scaled replicas of p(t).
chosen to be close to ηBT , where η is some positive number that depends on the
definition of duration and bandwidth. A good value is η = 2.
As an example, consider Figure 1.2. On the left of the figure is a pulse p(t)
that we use as a building block for the communication signal.2 On the right is an
example of a pulse train of the form w(t) = ∑_{i=0}^{3} s_i p(t − iT0), obtained from shifted
and scaled replicas of p(t). We communicate by scaling the pulse replica p(t − iT0 )
by the information-carrying symbol si . If we could substitute p(t) with a narrower
pulse, we could fit more such pulses in a given time interval and therefore we
could send more information-carrying symbols. But a narrower pulse uses more
bandwidth. Hence there is a limit to the pulse width. For a given pulse width,
there is a limit to the number of pulses that we can pack in a given time interval if
we want the receiver to be able to retrieve the symbol sequence from the received
pulse train. Nyquist’s result implies that we can fit essentially 2BT non-interfering
pulses in a time interval of T seconds if the bandwidth is not to exceed B Hz.
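To make the construction concrete, here is a minimal sketch (not from the book) that builds such a pulse train numerically; the rectangular pulse shape, the symbol values, and T0 are assumptions chosen purely for illustration.

```python
# Sketch of a pulse train w(t) = sum_i s_i p(t - i*T0), built from shifted and
# scaled replicas of a pulse p(t).  All numerical values are illustrative.
import numpy as np

T0 = 1.0                              # pulse spacing (symbol interval), assumed
symbols = [0.5, -1.0, 1.5, 1.0]       # information-carrying symbols s_0, ..., s_3

def p(t):
    # One possible choice of pulse: unit height, duration T0 (rectangular).
    return np.where((t >= 0) & (t < T0), 1.0, 0.0)

t = np.linspace(0.0, 5 * T0, 501)
w = sum(s * p(t - i * T0) for i, s in enumerate(symbols))  # the pulse train w(t)
print(w[::100])                       # a few samples of w(t)
```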
In trying to determine the maximum number of bits that can be conveyed with
one signal, Hartley introduced two constraints that make good engineering sense.
First, in a practical realization, the symbols cannot take arbitrarily large values in
R (the set of real numbers). Second, the receiver cannot estimate a symbol with
infinite precision. This suggests that, to avoid errors, symbols should take values
in a discrete subset of some interval [−A, A]. If ±Δ is the receiver’s precision in
determining the amplitude of a pulse, then symbols should take a value in some
alphabet {a0, a1, . . . , a_{m−1}} ⊂ [−A, A] such that |ai − aj| ≥ 2Δ when i ≠ j. This
implies that the alphabet size can be at most m = 1 + A/Δ (see Figure 1.3).
There are m^n distinct n-length sequences that can be formed with symbols taken
from an alphabet of size m. Now suppose that we want to communicate a sequence
Footnote 2: A pulse is not necessarily rectangular. In fact, we do not communicate via rectangular pulses because they use too much bandwidth.
Figure 1.3 (not reproduced): a symbol alphabet {a0, a1, . . . , a5} of equally spaced points on the real line between −A and A, with spacing 2Δ between neighboring points.
of k bits. There are 2^k distinct such sequences and each such sequence should be
mapped into a distinct symbol sequence (see Figure 1.4). This implies

2^k ≤ m^n.    (1.1)
example 1.1 There are 2^4 = 16 distinct binary sequences of length k = 4 and
there are 4^2 = 16 distinct symbol sequences of length n = 2 with symbols taking
value in an alphabet of size m = 4. Hence we can associate a distinct length-2
symbol sequence to each length-4 bit sequence. The following is an example with
symbols taken from the alphabet {a0 , a1 , a2 , a3 }.
bit sequence    symbol sequence
0000            a0 a0
0001            a0 a1
0010            a0 a2
0011            a0 a3
0100            a1 a0
 . . .           . . .
1111            a3 a3
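For illustration, here is a minimal sketch (not from the book) that generates the mapping of Example 1.1 exhaustively; the symbol names and the base-4 digit encoding are assumptions made for the example.

```python
# Map every 4-bit sequence to a distinct pair of symbols from a 4-letter alphabet,
# as in Example 1.1 (2^4 = 16 bit sequences, 4^2 = 16 symbol pairs).
from itertools import product

alphabet = ["a0", "a1", "a2", "a3"]                 # m = 4 symbols
for bits in product("01", repeat=4):                # all 2^4 bit sequences
    value = int("".join(bits), 2)                   # read the 4 bits as an integer
    first, second = divmod(value, len(alphabet))    # its two base-4 digits
    print("".join(bits), alphabet[first], alphabet[second])
```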
Inserting m = 1 + A/Δ and n = 2BT in (1.1) and solving for k/T yields

k/T ≤ 2B log2(1 + A/Δ)    (1.2)
as the highest possible rate in bits per second that can be achieved reliably with
bandwidth B, symbol amplitudes within ±A, and receiver accuracy ±Δ.
Unfortunately, (1.2) does not provide a fundamental limit to the bit rate, because
there is no fundamental limit to how small Δ can be made.
The missing ingredient in Hartley’s calculation was the noise. In 1926 Johnson,
also at Bell Labs, realized that every conductor is affected by thermal noise. The
idea that the received signal should be modeled as the sum of the transmitted signal
plus noise became prevalent through the work of Wiener (1942). Clearly the noise
prevents the receiver from retrieving the symbols’ values with infinite precision,
which is the effect that Hartley wanted to capture with the introduction of Δ, but
unfortunately there is no way to choose Δ as a function of the noise. In fact, in
the presence of thermal noise, error-free communication becomes impossible. (But
we can make the error probability as small as desired.)
Prior to the publication of Shannon’s revolutionary 1948 paper, the common
belief was that the error probability induced by the noise could be reduced
only by increasing the signal’s power (e.g. by increasing A in the example of
Figure 1.3) or by reducing the bit rate (e.g. by transmitting the same bit multiple
times). Shannon proved that the noise can set a limit to the number of bits per
second that can be transmitted reliably, but as long as we communicate below that
limit, the error probability can be made as small as desired without modifying the
signal’s power and bandwidth. The limit to the bit rate is called channel capacity.
For the setup of interest to us, it is the right-hand side of

k/T ≤ B log2(1 + P/(N0 B)),    (1.3)
where P is the transmitted signal’s power and N0 /2 is the power spectral density
of the noise (assumed to be white and Gaussian). If the bit rate of a system is
above channel capacity then, no matter how clever the design, the error probability
is above a certain value. The theory that leads to (1.3) is far more subtle and far
more beautiful than the arguments leading to (1.2); yet, the two expressions are
strikingly similar.
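As a rough numerical comparison, the following sketch (not from the book) evaluates Hartley's rate (1.2) and Shannon's capacity (1.3) side by side; all parameter values are made up for illustration only.

```python
# Evaluate Hartley's rate (1.2) and Shannon's capacity (1.3) for illustrative values.
import math

B = 3000.0       # bandwidth in Hz (assumed)
A = 1.0          # maximum symbol amplitude (assumed)
Delta = 0.01     # receiver precision in Hartley's argument (assumed)
P = 1.0          # transmitted signal power in watts (assumed)
N0 = 1e-5        # noise power spectral density parameter (assumed)

hartley_rate = 2 * B * math.log2(1 + A / Delta)      # bits per second, eq. (1.2)
shannon_capacity = B * math.log2(1 + P / (N0 * B))   # bits per second, eq. (1.3)

print(f"Hartley rate:     {hartley_rate:.0f} bps")
print(f"Shannon capacity: {shannon_capacity:.0f} bps")
```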
What we mentioned here is only a special case of a general formula derived
by Shannon to compute the capacity of a broad class of channels. As he did for
channels, Shannon also posed and answered fundamental questions about sources.
For the purpose of this text, there are two lessons that we should retain about
sources. (1) The essence of a source is its randomness. If a listener knew exactly
what a speaker is about to say, there would be no need to listen. Hence a source
should be modeled by a random variable (or a sequence thereof). In line with
the topic of this text, we assume that the source is digital, meaning that the
random variable takes values in a discrete set. (See Appendix 1.8 for a brief
summary of various kinds of sources.) (2) For every such source, there exists a
source encoder that converts the source output into the shortest (in average)
binary string and a source decoder that reconstructs the source output from the
encoder output. The encoder output, for which no further compression is possible,
has the same statistic as a sequence of unbiased coin flips, i.e. it is a sequence of
independent and uniformly distributed bits. Clearly, we can minimize the storage
and/or communicate more efficiently if we compress the source into the shortest
binary string. In this text, we are not concerned with source coding but, for the
above-mentioned reasons, we model the source as a generator of independent and
uniformly distributed bits.
Like many of the inventors mentioned above, Shannon worked at Bell Labs.
His work appeared one year after the invention of the solid-state transistor, by
Bardeen, Brattain, and Shockley, also at Bell Labs. Figure 1.5 summarizes the
various milestones.
Figure 1.5. Technical milestones leading up to information theory.
Footnote 3: Individual noise sources do not necessarily have Gaussian statistics. However, due to the central limit theorem, their aggregate contribution is often quite well approximated by a Gaussian random process.
1.3 Problem formulation and preview
exercise to check that this physical channel is linear and time-invariant. Thus it
can be modeled by a linear filter as shown in Figure 1.6.4 Additional filtering may
occur due to the limitations of some of the components at the sender and/or at
the receiver. For instance, this is the case for a linear amplifier and/or an antenna
for which the amplitude response over the frequency range of interest is not flat
and/or the phase response is not linear. The filter in Figure 1.6 accounts for all
linear time-invariant transformations that act upon the communication signals as
they travel from the sender to the receiver. The channel model of Figure 1.6 is
meaningful for both wireline and wireless communication channels. It is referred
to as the bandlimited Gaussian channel.
Mathematically, a transmitter implements a one-to-one mapping between the
message set and a set of signals. Without loss of essential generality, we may let
the message set be H = {0, 1, . . . , m − 1} for some integer m ≥ 2. For the channel
model of Figure 1.6, the signal set W = {w0 (t), w1 (t), . . . , wm−1 (t)} consists of
continuous and finite-energy signals. We think of the signals as stimuli used by the
transmitter to excite the channel input. They are chosen in such a way that the
receiver can tell, with high probability, which channel input produced an observed
channel output.
Even if we model the source as producing an index from H = {0, 1, . . . , m − 1}
rather than a sequence of bits, we can still measure the communication rate in
terms of bits per second (bps). In fact the elements of the message set can be labeled
with distinct binary sequences of length log2 m. Every time that we communicate
a message, we equivalently communicate log2 m bits. If we can send a signal from
the set W every T seconds, then the message rate is 1/T [messages per second]
and the bit rate is (log2 m)/T [bits per second].
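As a small worked example (with numbers invented purely for illustration): suppose m = 16 and T = 1 ms. Then log2 m = 4 bits per message, the message rate is 1/T = 1000 messages per second, and the bit rate is (log2 m)/T = 4000 bps.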
Digital communication is a field that has seen many exciting developments and
is still in vigorous expansion. Our goal is to introduce the reader to the field,
with emphasis on fundamental ideas and techniques. We hope that the reader will
develop an appreciation for the trade-offs that are possible at the transmitter, will
understand how to design (at the building-block level) a receiver that minimizes
the error probability, and will be able to analyze the performance of a point-to-
point communication system.
We will discover that a natural way to design, analyze, and implement a trans-
mitter/receiver pair for the channel of Figure 1.6 is to think in terms of the modules
shown in Figure 1.7. As in the OSI layering model, peer modules are designed
as if they were connected by their own channel. The bottom layer reduces the
passband channel to the more basic baseband-equivalent channel. The middle layer
further reduces the channel to a discrete-time channel that can be handled by the
encoder/decoder pair.
We conclude this section with a very brief overview of the chapters.
Chapter 2 addresses the receiver-design problem for discrete-time observations,
in particular in relationship to the channel seen by the top layer of Figure 1.7, which
is the discrete-time additive white Gaussian noise (AWGN) channel. Throughout
Footnote 4: If the scattering and reflecting objects move with respect to the transmitting/receiving antenna, then the filter is time-varying. We do not consider this case.
Figure 1.7 (block diagram, not reproduced): the transmitter consists of an encoder, a waveform former, and an up-converter; the receiver consists of a down-converter, an n-tuple former, and a decoder. Peer blocks exchange, from top to bottom, messages, n-tuples, baseband-equivalent signals, and passband signals. The channel adds noise N(t), and the receiver observes R(t).
The signals used by the transmitter are chosen to facilitate the receiver’s decision.
One of the performance criteria is the error probability, and we can design systems
that have such a small error probability that for all practical purposes it is zero.
The situation is quite different in analog communication. As there is a continuum
of signals that the transmitter could possibly send, there is no chance for the
receiver to reconstruct an exact replica of the transmitted signal from the noisy
received signal. It no longer makes sense to talk about error probability. If we say
that an error occurs every time that there is a difference between the transmitted
signal and the reconstruction provided by the receiver, then the error probability
is always 1.
The difference, which may still seem insignificant at this point, is made signifi-
cant by the notion of channel capacity. For every channel, there is a rate, called
channel capacity, with the following meaning. Digital communication across the
channel can be made as reliable as desired at any rate below channel capacity. At
rates above channel capacity, it is impossible to reduce the error probability below
a certain value. Now we can see where the difference between analog and digital
communication becomes fundamental. For instance, if we want to communicate
at 1 gigabit per second (Gbps) from Zurich to Los Angeles by using a certain
type of cable, we can cut the cable into pieces of length L, chosen in such a
way that the channel capacity of each piece is greater than 1 Gbps. We can then
design a transmitter and a receiver that allow us to communicate virtually error-
free at 1 Gbps over distance L. By concatenating many such links, we can cover
any desired distance at the same rate. By making the error probability over each
link sufficiently small, we can meet the desired end-to-end probability of error.
The situation is very different in analog communication, where every piece of
cable contributes to a degradation of the reconstruction.
Need another example? Compare faxing a text to sending an e-mail over the
same telephone line. The fax uses analog technology. It treats the document as
a continuum of gray levels (in two dimensions). It does not differentiate between
text or images. The receiver prints a degraded version of the original. And if we
repeat the operation multiple times by re-faxing the latest reproduction it will not
take long until the result is dismal. E-mail on the other hand is a form of digital
communication. It is almost certain that the receiver reconstructs an identical
replica of the transmitted text.
Because we can turn a continuous-time source into a discrete one, as described
in Appendix 1.8, we always have the option of doing digital rather than analog
communication. In the conversion from continuous to discrete, there is a deteriora-
tion that we control and can make as small as desired. The result can, in principle,
be communicated over unlimited distance and over arbitrarily poor channels with
no further degradation.
1.5 Notation
In Chapter 2 and Chapter 3 we use a discrete-time and a continuous-time channel
model, respectively. Accordingly, the signals we use to communicate are n-tuples in
Chapter 2 and functions of time in Chapter 3. The transition from one set of signals
to the other is made smoothly via the elegant theory of inner product spaces. This
requires seeing both n-tuples and functions as vectors of an appropriate inner
product space, which is the reason we have opted to use the same fonts for both
kinds of signals. (Many authors use bold-faced fonts for n-tuples.)
Some functions of time are referred to as waveforms. These are functions that
typically represent voltages or currents within electrical circuits. An example of a
waveform is the signal we use to communicate across a continuous-time channel.
Pulses are waveforms that serve as building blocks for more complex waveforms.
An example of a pulse is the impulse response of a linear time-invariant filter
(LTI). From a mathematical point of view it is by no means essential to make
a distinction between a function, a waveform, and a pulse. We use these terms
because they are part of the language used by engineers and because it helps us
associate a physical meaning with the specific function being discussed.
In this text, a generic function such as g : I → B, where I ⊆ R is the domain
and B is the range, is typically a function of time or a function of frequency.
Engineering texts underline the distinction by writing g(t) and g(f ), respectively.
This is an abuse of notation, which can be very helpful. We will make use of this
abuse of notation as we see fit. By writing g(t) instead of g : I → B, we are
effectively seeing t as representing I, rather than representing a particular value
of I. To refer to a particular moment in time, we use a subscript, such as in t0.
So, g(t0 ) refers to the value that the function g takes at t = t0 . Similarly, g(f )
refers to a function of frequency and g(f0 ) is the value that g takes at f = f0 .
Footnote 5: A copy of the book was generously offered by our dean, Martin Vetterli, to each professor as a 2011 Christmas gift.
1.6 A few anecdotes
that two out of three messages arrived within a day during the warm months and
that only one in three arrived in winter. This was the situation when F. B. Morse
proposed to the French government a telegraph that used electrical wires. Morse’s
proposal was rejected because “No one could interfere with telegraph signals in
the sky, but wire could be cut by saboteurs” [5, Chapter 5].
In 1833 the lawyer and philologist John Pickering, referring to the American
version of the French telegraph on Central Wharf (a Chappe-like tower commu-
nicating shipping news with three other stations in a 12-mile line across Boston
Harbor) asserted that “It must be evident to the most common observer, that no
means of conveying intelligence can ever be devised, that shall exceed or even equal
the rapidity of the Telegraph, for, with the exception of the scarcely perceptible
relay at each station, its rapidity may be compared with that of light itself”. In
today’s technology we can communicate over optical fiber at more than 10^12 bits
per second, which may be 12 orders of magnitude faster than the telegraph referred
to by Pickering. Yet Pickering’s flawed reasoning may have seemed correct to most
of his contemporaries.
The electrical telegraph eventually came and was immediately a great success,
yet some feared that it would put newspapers out of business. In 1852 it was
declared that “All ideas of connecting Europe with America, by lines extending
directly across the Atlantic, is utterly impracticable and absurd”. Six years later
Queen Victoria and President Buchanan were communicating via such a line.
Then came the telephone. The first experimental applications of the “electrical
speaking telephone” were made in the US in the 1870s. It quickly became a great
success in the USA, but not in England. In 1876 the chief engineer of the General
Post Office, William Preece, reported to the British Parliament: “I fancy the
descriptions we get of its use in America are a little exaggerated, though there
are conditions in America which necessitate the use of such instruments more
than here. Here we have a superabundance of messengers, errand boys and things
of that kind . . . I have one in my office, but more for show. If I want to send a
message – I use a sounder or employ a boy to take it”.
Compared to the telegraph, the telephone looked like a toy because any child
could use it. In comparison, the telegraph required literacy. Business people first
thought that the telephone was not serious. Where the telegraph dealt in facts
and numbers, the telephone appealed to emotions. Seeing information technology
as a threat to privacy is not new. Already at the time one commentator said, “No
matter to what extent a man may close his doors and windows, and hermetically
seal his key-holes and furnace-registers, with towels and blankets, whatever he may
say, either to himself or a companion, will be overheard”.
In summary, the printing press has been criticized for promoting barbarism; the
electrical telegraph for being vulnerable to vandalism, a threat to newspapers,
and not superior to the French telegraph; the telephone for being childish, of
no business value, and a threat to privacy. We could of course extend the list
with comments about typewriters, cell phones, computers, the Internet, or about
applications such as e-mail, SMS, Wikipedia, Street View by Google, etc. It would
be good to keep some of these examples in mind when attempts to promote new
ideas are met with resistance.
So when we consider a file as being the source signal, the source can be modeled as
a discrete-time random process taking values in the finite alphabet {0, 1, . . . , 255}.
Alternatively, we can consider the file as a sequence of bits, in which case the
stochastic process takes values in {0, 1}.
For another example, consider the sequence of pixel values produced by a digital
camera. The color of a pixel is obtained by mixing various intensities of red, green,
and blue. Each of the three intensities is represented by a certain number of bits.
One way to exchange images is to exchange one pixel at a time, according to some
predetermined way of serializing the pixel’s intensities. Also in this case we can
model the source as a discrete-time process.
A discrete-time sequence taking values in a finite alphabet can always be con-
verted into a binary sequence. The resulting average length depends on the source
statistic and on how we do the conversion. In principle we could find the minimum
average length by analyzing all possible ways of making the conversion. Surpris-
ingly, we can bypass this tedious process and find the result by means of a simple
formula that determines the so-called entropy (of the source). This was a major
result in Shannon’s 1948 paper.
example 1.3 A discrete memoryless source is a discrete source with the addi-
tional property that the output symbols are independent and identically distributed.
For a discrete memoryless source that produces symbols taking values in an m-
letter alphabet, the entropy is

− ∑_{i=1}^{m} p_i log2 p_i ,

where p_i is the probability of the ith letter; the entropy equals the minimum average number of bits per symbol needed to represent the source output.
Any book on information theory will prove the stated relationship between the
entropy of a memoryless source and the minimum average number of bits needed
to represent a source symbol. A standard reference is [19].
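As a quick illustration of the formula in Example 1.3, here is a minimal sketch (not from the book) that evaluates the entropy for a few assumed letter probabilities.

```python
# Entropy of a discrete memoryless source, in bits per symbol:
# H = -sum_i p_i log2(p_i); terms with p_i = 0 contribute nothing.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit/symbol (a binary symmetric source)
print(entropy([0.9, 0.1]))     # about 0.47 bits/symbol (compressible)
print(entropy([0.25] * 4))     # 2.0 bits/symbol (uniform over 4 letters)
```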
If the output of the encoder that produces the shortest binary sequence can no
longer be compressed, it means that it has entropy 1. One can show that to have
entropy 1, a binary source must produce independent and uniformly distributed
symbols. Such a source is called a binary symmetric source (BSS). We conclude
that the binary output of a source encoder can either be further compressed
or it has the same statistic as the output of a BSS. This is the main reason a
communication-link designer typically assumes that the source is a BSS.
1.9 Exercises
Note: The exercises in this first chapter are meant to test if the reader has the
expected knowledge in probability theory.
exercise 1.1 (Probabilities of basic events) Assume that X1 and X2 are inde-
pendent random variables that are uniformly distributed in the interval [0, 1]. Com-
pute the probability of the following events. Hint: For each event, identify the
corresponding region inside the unit square.
(a) 0 ≤ X1 − X2 ≤ 1/3.
(b) X1^3 ≤ X2 ≤ X1^2.
(c) X2 − X1 = 1/2.
(a) A box contains m white and n black balls. Suppose k balls are drawn. Find
the probability of drawing at least one white ball.
(b) We have two coins; the first is fair and the second is two-headed. We pick one
of the coins at random, toss it twice and obtain heads both times. Find the
probability that the coin is fair.
exercise 1.4 (Playing darts) Assume that you are throwing darts at a target.
We assume that the target is one-dimensional, i.e. that the darts all end up on a
line. The “bull’s eye” is in the center of the line, and we give it the coordinate 0.
The position of a dart on the target can then be measured with respect to 0. We
assume that the position X1 of a dart that lands on the target is a random variable
that has a Gaussian distribution with variance σ12 and mean 0. Assume now that
there is a second target, which is further away. If you throw a dart to that target,
the position X2 has a Gaussian distribution with variance σ22 (where σ22 > σ12 ) and
mean 0. You play the following game: You toss a “coin” which gives you Z = 1
with probability p and Z = 0 with probability 1 − p for some fixed p ∈ [0, 1]. If
Z = 1, you throw a dart onto the first target. If Z = 0, you aim for the second
target instead. Let X be the relative position of the dart with respect to the center
of the target that you have chosen.
• A source: The source (not represented in the figure) produces the message to
be transmitted. In a typical application, the message consists of a sequence
of bits but this detail is not fundamental for the theory developed in this
chapter. It is fundamental that the source chooses one “message” from a set of
possible messages. We are free to choose the “label” we assign to the various
messages and our choice is based on mathematical convenience. For now the
mathematical model of a source is as follows. If there are m possible choices,
we model the source as a random variable H that takes values in the message
set H = {0, 1, . . . , (m − 1)}. More often than not, all messages are assumed to
have the same probability but for generality we allow message i to occur with
probability PH (i). The message set H and the probability distribution PH are
assumed to be known to the system designer.
[Block diagram, not reproduced: a message i ∈ H enters the Transmitter, which maps it to a codeword ci ∈ C ⊂ X^n; the Channel produces the output Y ∈ Y^n; the Receiver outputs the decision Ĥ ∈ H.]
• A channel: The system designer needs to be able to cope with a broad class of
channel models. A quite general way to describe a channel is by specifying its
input alphabet X (the set of signals that are physically compatible with the
channel input), the channel output alphabet Y, and a statistical description
of the output given the input. Unless otherwise specified, in this chapter the
output alphabet Y is a subset of R. A convenient way to think about the channel
is to imagine that for each letter x ∈ X that we apply to the channel input, the
channel outputs the realization of a random variable Y ∈ Y whose statistic
depends on x. If Y is a discrete random variable, we describe the probability
distribution (also called probability mass function, abbreviated to pmf) of Y
given x, denoted by PY |X (·|x). If Y is a continuous random variable, we describe
the probability density function (pdf) of Y given x, denoted by fY |X (·|x). In
a typical application, we need to know the statistic of a sequence Y1 , . . . , Yn
of channel outputs, Yk ∈ Y, given a sequence X1 , . . . , Xn of channel inputs,
Xk ∈ X , but our typical channel is memoryless, meaning that
$$P_{Y_1,\dots,Y_n|X_1,\dots,X_n}(y_1,\dots,y_n\mid x_1,\dots,x_n) = \prod_{i=1}^{n} P_{Y_i|X_i}(y_i\mid x_i).$$
example 2.3 (Channel) The channel model that we will use frequently in this
chapter is the one that maps a signal c ∈ Rn into Y = c+Z, where Z is a Gaussian
random vector of independent and identically distributed components. As we will
see later, this is the discrete-time equivalent of the continuous-time additive white
Gaussian noise (AWGN) channel.
The chapter is organized as follows. We first learn the basic ideas behind hypoth-
esis testing, the field that deals with the problem of guessing the outcome of a
random variable based on the observation of another random variable (or random
vector). Then we study the Q function as it is a very valuable tool in dealing with
communication problems that involve Gaussian noise. At that point, we will be
ready to consider the problem of communicating across the discrete-time additive
white Gaussian noise channel. We will first consider the case that involves two
messages and scalar signals, then the case of two messages and n-tuple signals,
and finally the case of an arbitrary number m of messages and n-tuple signals.
Then we study techniques that we use, for instance, to tell if we can reduce the
dimensionality of the channel output signals without undermining the receiver
performance. The last part of the chapter deals with techniques to bound the
error probability when an exact expression is unknown or too difficult to evaluate.
A point about terminology and symbolism needs to be clarified. We are using
ci (and not si ) to denote the signal used for message i because the signals of this
chapter will become codewords in subsequent chapters.
¹ We assume that Y is a continuous random variable (or continuous random vector). If it is discrete, then we use $P_{Y|H}(\cdot|i)$ instead of $f_{Y|H}(\cdot|i)$.
² Pr{·} is a short-hand for the probability of the enclosed event.
when H = 0, $Y \sim P_{Y|H}(y|0) = \dfrac{\lambda_0^y}{y!}\,e^{-\lambda_0}$,
when H = 1, $Y \sim P_{Y|H}(y|1) = \dfrac{\lambda_1^y}{y!}\,e^{-\lambda_1}$,
where 0 ≤ λ0 < λ1 . We read the above as follows: “When H = 0, the observable
Y is Poisson distributed with intensity λ0 . When H = 1, Y is Poisson distributed
with intensity λ1 ”. Once again, the problem of deciding the value of H from the
observable Y is a standard hypothesis testing problem.
$$P_{H|Y}(i|y) = \frac{P_H(i)\,f_{Y|H}(y|i)}{f_Y(y)},$$
where $f_Y(y) = \sum_i P_H(i)\,f_{Y|H}(y|i)$. In the above expression, $P_{H|Y}(i|y)$ is the posterior (also called the a posteriori probability of H given Y). By observing Y = y, the probability that H = i goes from the prior $P_H(i)$ to the posterior $P_{H|Y}(i|y)$.
If the decision is Ĥ = i, the probability that it is the correct decision is the
probability that H = i, i.e. PH|Y (i|y). As our goal is to maximize the probability
of being correct, the optimum decision rule is
$$\hat{H}_{\mathrm{MAP}}(y) = \arg\max_i P_{H|Y}(i|y),$$
where $\arg\max_i g(i)$ stands for “one of the arguments i for which the function g(i)
achieves its maximum”. The above is called the maximum a posteriori (MAP) deci-
sion rule. In case of ties, i.e. if PH|Y (j|y) equals PH|Y (k|y) equals maxi PH|Y (i|y),
then it does not matter if we decide for Ĥ = k or for Ĥ = j. In either case, the
probability that we have decided correctly is the same.
Because the MAP rule maximizes the probability of being correct for each
observation y, it also maximizes the unconditional probability Pc of being cor-
rect. The former is PH|Y (Ĥ(y)|y). If we plug in the random variable Y instead
of y, then we obtain a random variable. (A real-valued function of a random
variable is a random variable.) The expected value of this random variable is the
(unconditional) probability of being correct, i.e.
$$P_c = E\big[P_{H|Y}(\hat{H}(Y)|Y)\big] = \int_y P_{H|Y}(\hat{H}(y)|y)\,f_Y(y)\,dy. \tag{2.2}$$
When the prior $P_H$ is uniform, maximizing $P_{H|Y}(i|y)$ is equivalent to maximizing $f_{Y|H}(y|i)$. The resulting rule, $\hat{H}(y) = \arg\max_i f_{Y|H}(y|i)$, is called the maximum likelihood (ML) decision rule. The name stems from the fact
that fY |H (y|i), as a function of i, is called the likelihood function.
Notice that the ML decision rule is defined even if we do not know PH . Hence
it is the solution of choice when the prior is not known. (The MAP and the ML
decision rules are equivalent only when the prior is uniform.)
The special case in which we have to make a binary decision, i.e. H = {0, 1}, is
both instructive and of practical relevance. We begin with it and generalize in the
next section.
As there are only two alternatives to be tested, the MAP test may now be
written as
$$\frac{f_{Y|H}(y|1)\,P_H(1)}{f_Y(y)} \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; \frac{f_{Y|H}(y|0)\,P_H(0)}{f_Y(y)}.$$
The above notation means that the MAP test decides for Ĥ = 1 when the left is
bigger than or equal to the right, and decides for Ĥ = 0 otherwise. Observe that
the denominator is irrelevant because fY (y) is a positive constant – hence it will
not affect the decision. Thus an equivalent decision rule is
$$f_{Y|H}(y|1)\,P_H(1) \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; f_{Y|H}(y|0)\,P_H(0).$$
The above test is depicted in Figure 2.2 assuming y ∈ R. This is a very important
figure that helps us visualize what goes on and, as we will see, will be helpful to
compute the probability of error.
The above test is insightful as it shows that we are comparing posteriors after
rescaling them by canceling the positive number fY (y) from the denominator.
However, there are alternative forms of the test that, depending on the details, can
be computationally more convenient.
[Figure 2.2: the weighted densities $f_{Y|H}(y|0)P_H(0)$ and $f_{Y|H}(y|1)P_H(1)$ plotted versus y; the decision is $\hat{H}=0$ where the former dominates and $\hat{H}=1$ where the latter dominates.]
An equivalent test is obtained by dividing both sides by the non-negative quantity $f_{Y|H}(y|0)P_H(1)$. This results in the
following binary MAP test:
$$\Lambda(y) = \frac{f_{Y|H}(y|1)}{f_{Y|H}(y|0)} \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; \frac{P_H(0)}{P_H(1)} = \eta. \tag{2.4}$$
The left side of the above test is called the likelihood ratio, denoted by Λ(y),
whereas the right side is the threshold η. Notice that if PH (0) increases, so does
the threshold. In turn, as we would expect, the region {y : Ĥ(y) = 0} becomes
larger.
When PH (0) = PH (1) = 1/2 the threshold η is unity and the MAP test becomes
a binary ML test:
$$f_{Y|H}(y|1) \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; f_{Y|H}(y|0).$$
or, equivalently,
Pe (0) = P r{Λ(Y ) ≥ η|H = 0}. (2.6)
Whether it is easier to work with the right side of (2.5) or that of (2.6) depends
on whether it is easier to work with the conditional density of Y or of Λ(Y ). We
will see examples of both cases.
Similar expressions hold for the probability of error conditioned on H = 1,
denoted by Pe (1). Using the law of total probability, we obtain the (unconditional)
error probability
Pe = Pe (1)PH (1) + Pe (0)PH (0).
In deriving the probability of error we have tacitly used an important technique
that we use all the time in probability: conditioning as an intermediate step.
Conditioning as an intermediate step may be seen as a divide-and-conquer strategy.
The idea is to solve a problem that seems hard by breaking it up into subproblems
that (i) we know how to solve and (ii) once we have the solution to the sub-
problems we also have the solution to the original problem. Here is how it works
in probability. We want to compute the expected value of a random variable Z.
Assume that it is not immediately clear how to compute the expected value of
Z, but we know that Z is related to another random variable W that tells us
something useful about Z: useful in the sense that for every value w we are able
to compute the expected value of Z given W = w. Then, via the law of total
expectation, we compute $E[Z] = \sum_w E[Z\mid W=w]\,P_W(w)$. The same principle
applies for probabilities. (This is not a coincidence: The probability of an event is
the expected value of the indicator function of that event.) For probabilities, the
expression is $\Pr(Z\in A) = \sum_w \Pr(Z\in A\mid W=w)\,P_W(w)$. It is called the law of
total probability.
Let us revisit what we have done in light of the above comments and what else we
could have done. The computation of the probability of error involves two random
variables, H and Y , as well as an event {H = Ĥ}. To compute the probability
of error (2.5) we have first conditioned on all possible values of H. Alternatively,
we could have conditioned on all possible values of Y . This is indeed a viable
alternative. In fact we have already done so (without saying it) in (2.2). Between
the two, we use the one that seems more promising for the problem at hand. We
will see examples of both.
Now we go back to the m-ary hypothesis testing problem. This means that H =
{0, 1, . . . , m − 1}.
Recall that the MAP decision rule, which minimizes the probability of making
an error, is
$$\hat{H}_{\mathrm{MAP}}(y) = \arg\max_{i\in\mathcal{H}} P_H(i)\,f_{Y|H}(y|i),$$
where $f_{Y|H}(\cdot|i)$ is the probability density function of the observable Y when the
hypothesis is i and PH (i) is the probability of the ith hypothesis. This rule is well
defined up to ties. If there is more than one i that achieves the maximum on the
right side of one (and thus all) of the above expressions, then we may decide for
any such i without affecting the probability of error. If we want the decision rule
to be unambiguous, we can for instance agree that in case of ties we choose the
largest i that achieves the maximum.
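To make the rule concrete, here is a minimal sketch, assuming Python with NumPy and a discrete observable, of a MAP decoder that works from the prior and the likelihoods stored as arrays; the numerical values are illustrative and not taken from the text.

```python
# Minimal MAP-decoder sketch for a discrete observable (Python/NumPy assumed).
# prior[i] = P_H(i); likelihood[i, y] = P_{Y|H}(y|i). Values below are illustrative.
import numpy as np

def map_decision(y, prior, likelihood):
    scores = prior * likelihood[:, y]          # proportional to the posterior P_{H|Y}(i|y)
    # Resolve ties by picking the largest i that achieves the maximum, as agreed above.
    best = np.flatnonzero(scores == scores.max())
    return int(best[-1])

prior = np.array([0.5, 0.5])
likelihood = np.array([[0.8, 0.2],             # P_{Y|H}(.|0) for y = 0, 1
                       [0.3, 0.7]])            # P_{Y|H}(.|1) for y = 0, 1
print(map_decision(0, prior, likelihood), map_decision(1, prior, likelihood))
```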
When all hypotheses have the same probability, then the MAP rule specializes
to the ML rule, i.e.
$$\hat{H}_{\mathrm{ML}}(y) = \arg\max_{i\in\mathcal{H}} f_{Y|H}(y|i).$$
[Figure 2.3: the observation space partitioned into the decoding regions R_0, R_1, ..., R_{m−1}, with a generic region R_i highlighted.]
We will always assume that fY |H is either given as part of the problem formulation
or that it can be figured out from the setup. In communication, we typically know
the transmitter and the channel. In this chapter, the transmitter is the map from H
to C ⊂ X n and the channel is described by the pdf fY |X (y|x) known for all x ∈ X n
and all y ∈ Y n . From these two, we immediately obtain fY |H (y|i) = fY |X (y|ci ),
where ci is the signal assigned to i.
Note that the decision (or decoding) function Ĥ assigns an i ∈ H to each y ∈ Rn .
As already mentioned, it can be described by the decision (or decoding) regions
Ri , i ∈ H, where Ri consists of those y for which Ĥ(y) = i. It is convenient to
think of Rn as being partitioned by decoding regions as depicted in Figure 2.3.
We use the decoding regions to express the error probability Pe or, equivalently,
the probability Pc = 1 − Pe of deciding correctly. Conditioned on H = i we have
$$P_e(i) = 1 - P_c(i) = 1 - \int_{R_i} f_{Y|H}(y|i)\,dy.$$
[Figure: geometric construction used in the proof, with a ray at angle θ to the x-axis reaching distance r = ξ/sin(θ).]
To prove (f), we use (e) and the fact that $e^{-\frac{\xi^2}{2\sin^2\theta}} \le e^{-\frac{\xi^2}{2}}$ for $\theta \in [0, \frac{\pi}{2}]$. Hence
$$Q(\xi) \le \frac{1}{\pi}\int_0^{\pi/2} e^{-\frac{\xi^2}{2}}\,d\theta = \frac{1}{2}\,e^{-\frac{\xi^2}{2}}.$$
A plot of the Q function and its bounds is given in Figure 2.4.
[Figure 2.4: Q(α) for α ∈ [0, 5] on a logarithmic scale, together with the upper bounds $\frac{1}{\sqrt{2\pi}\,\alpha}e^{-\frac{\alpha^2}{2}}$ and $\frac{1}{2}e^{-\frac{\alpha^2}{2}}$ and the lower bound $\frac{\alpha}{1+\alpha^2}\frac{1}{\sqrt{2\pi}}e^{-\frac{\alpha^2}{2}}$.]
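As a numerical companion to Figure 2.4, the following sketch (Python assumed; the sampled values of α are arbitrary) tabulates Q(α) together with the bounds plotted in the figure.

```python
# Sketch: numerical values of Q(alpha) and the bounds shown in Figure 2.4.
# (Python assumed; the sampled values of alpha are arbitrary.)
from math import erfc, sqrt, exp, pi

def Q(a):
    # Q(a) = Pr{N(0,1) > a}, via the complementary error function.
    return 0.5 * erfc(a / sqrt(2.0))

for a in [1.0, 2.0, 3.0, 4.0, 5.0]:
    upper1 = exp(-a * a / 2) / (sqrt(2 * pi) * a)               # (1/(sqrt(2*pi)*a)) e^{-a^2/2}
    upper2 = 0.5 * exp(-a * a / 2)                              # (1/2) e^{-a^2/2}
    lower = (a / (1 + a * a)) * exp(-a * a / 2) / sqrt(2 * pi)  # lower bound of Figure 2.4
    print(f"a={a:.0f}: {lower:.3e} <= Q={Q(a):.3e} <= {upper1:.3e}, {upper2:.3e}")
```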
[Figure: the discrete-time AWGN channel: the transmitter maps i ∈ H into c_i ∈ R^n, the channel adds Z ∼ N(0, σ²I_n), and the receiver observes Y = c_i + Z and outputs Ĥ.]
The transmitter maps the message i into a signal $c_i \in \mathbb{R}^n$. The channel adds a random (noise) vector Z which is zero-mean and has
independent and identically distributed Gaussian components of variance σ 2 . In
short, Z ∼ N (0, σ 2 In ). The observable is Y = ci + Z.
We begin with the simplest possible situation, specifically when there are only
two equiprobable messages and the signals are scalar (n = 1). Then we generalize
to arbitrary values for n and finally we consider arbitrary values also for the
cardinality m of the message set.
Let the message H ∈ {0, 1} be equiprobable and assume that the transmitter maps
H = 0 into c0 ∈ R and H = 1 into c1 ∈ R. The output statistic for the various
hypotheses is as follows:
H=0: Y ∼ N (c0 , σ 2 )
H=1: Y ∼ N (c1 , σ 2 ).
An equivalent way to express the output statistic for each hypothesis is
$$f_{Y|H}(y|0) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-c_0)^2}{2\sigma^2}\right), \qquad f_{Y|H}(y|1) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-c_1)^2}{2\sigma^2}\right).$$
We compute the likelihood ratio
$$\Lambda(y) = \frac{f_{Y|H}(y|1)}{f_{Y|H}(y|0)} = \exp\left(\frac{(y-c_0)^2 - (y-c_1)^2}{2\sigma^2}\right).$$
Taking the logarithm on both sides of the MAP test $\Lambda(y) \gtrless \eta$, the test becomes
$$\frac{c_1-c_0}{\sigma^2}\,y + \frac{c_0^2-c_1^2}{2\sigma^2} \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; \ln\eta.$$
The progress consists of the fact that the receiver no longer computes an expo-
nential function of the observation. It has to compute ln(η), but this is done once
and for all.
Without loss of essential generality, assume $c_1 > c_0$. Then we can divide both sides by $\frac{c_1-c_0}{\sigma^2}$ (which is positive) without changing the outcome of the above comparison. We can further simplify by moving the constants to the right. The result is the simple test
$$\hat{H}_{\mathrm{MAP}}(y) = \begin{cases} 1, & y \ge \theta\\ 0, & \text{otherwise,}\end{cases}$$
[Figure 2.6: the densities $f_{Y|H}(y|0)$ and $f_{Y|H}(y|1)$, centered at $c_0$ and $c_1$, with the threshold θ between them. When $P_H(0) = P_H(1)$, the decision threshold θ is the midpoint between $c_0$ and $c_1$. The shaded area represents the probability of error conditioned on H = 0.]
where
$$\theta = \frac{\sigma^2}{c_1-c_0}\ln\eta + \frac{c_0+c_1}{2}.$$
Notice that if $P_H(0) = P_H(1)$, then $\ln\eta = 0$ and the threshold θ becomes the midpoint $\frac{c_0+c_1}{2}$ (Figure 2.6).
We now determine the error probability.
$$P_e(0) = \Pr\{Y > \theta \mid H = 0\} = \int_\theta^\infty f_{Y|H}(y|0)\,dy.$$
This is the probability that a Gaussian random variable with mean $c_0$ and variance $\sigma^2$ exceeds the threshold θ. From our review of the Q function we know immediately that $P_e(0) = Q\!\left(\frac{\theta-c_0}{\sigma}\right)$. Similarly, $P_e(1) = Q\!\left(\frac{c_1-\theta}{\sigma}\right)$. Finally,
$$P_e = P_H(0)\,Q\!\left(\frac{\theta-c_0}{\sigma}\right) + P_H(1)\,Q\!\left(\frac{c_1-\theta}{\sigma}\right).$$
The most common case is when $P_H(0) = P_H(1) = 1/2$. Then $\frac{\theta-c_0}{\sigma} = \frac{c_1-\theta}{\sigma} = \frac{c_1-c_0}{2\sigma} = \frac{d}{2\sigma}$, where d is the distance between $c_0$ and $c_1$. In this case, $P_e(0) = P_e(1) = P_e$, where
$$P_e = Q\!\left(\frac{d}{2\sigma}\right).$$
This result can be obtained straightforwardly without side calculations. As shown in Figure 2.6, the threshold is the midpoint between $c_0$ and $c_1$ and $P_e = P_e(0) = Q\!\left(\frac{d}{2\sigma}\right)$. This result should be known by heart.
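The following is a small simulation sketch, assuming Python with NumPy; the values of $c_0$, $c_1$, σ, and the sample size are illustrative. It applies the midpoint threshold test and compares the empirical error rate with $Q(d/(2\sigma))$.

```python
# Sketch: binary antipodal signaling over the scalar AWGN channel with the ML
# threshold at the midpoint; compare the empirical error rate with Q(d/(2*sigma)).
# (Python/NumPy assumed; c0, c1, sigma, and the sample size are illustrative.)
import numpy as np
from math import erfc, sqrt

c0, c1, sigma, num = -1.0, 1.0, 0.8, 200_000
rng = np.random.default_rng(0)

H = rng.integers(0, 2, num)                 # equiprobable messages
c = np.where(H == 0, c0, c1)                # transmitted signal
Y = c + sigma * rng.standard_normal(num)    # AWGN observation

theta = (c0 + c1) / 2                       # midpoint threshold (equiprobable prior)
H_hat = (Y >= theta).astype(int)

d = abs(c1 - c0)
Q = lambda x: 0.5 * erfc(x / sqrt(2.0))
print("empirical Pe  :", np.mean(H_hat != H))
print("Q(d/(2*sigma)):", Q(d / (2 * sigma)))
```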
As in the previous subsection, we assume that H takes values in {0, 1}. What is
new is that the signals are now n-tuples for n ≥ 1. So when H = 0, the transmitter
sends some c0 ∈ Rn and when H = 1, it sends c1 ∈ Rn . The noise added by the
channel is Z ∼ N (0, σ 2 In ) and independent of H.
From here on, we assume that the reader is familiar with the definitions and basic
results related to Gaussian random vectors. (See Appendix 2.10 for a review.) We
also assume familiarity with the notions of inner product, norm, and affine plane.
(See Appendix 2.12 for a review.) The inner product between the vectors u and v will be denoted by $\langle u, v\rangle$, whereas $\|u\| = \sqrt{\langle u, u\rangle}$ denotes the norm of u. We will make extensive use of these notations.
Even though for now the vector space is over the reals, in Chapter 7 we will
encounter complex vector spaces. Whether the vector space is over R or over C,
the notation is almost identical. For instance, if a and b are (column) n-tuples in $\mathbb{C}^n$, then $\langle a, b\rangle = b^\dagger a$, where † denotes conjugate transpose. The equality holds even if a and b are in $\mathbb{R}^n$, but in this case the conjugation is inconsequential and we could write $\langle a, b\rangle = b^T a$, where T denotes transpose. By default, we will use the more general notation for complex vector spaces. An equality that we will use frequently, therefore should be memorized, is
$$\|a \pm b\|^2 = \|a\|^2 + \|b\|^2 \pm 2\Re\{\langle a, b\rangle\}, \tag{2.8}$$
where ℜ{·} denotes the real part of the enclosed complex number. Of course we can drop the ℜ{·} for elements of a real vector space.
As done earlier, to derive a MAP decision rule, we start by writing down the
output statistic for each hypothesis
H=0: Y = c0 + Z ∼ N (c0 , σ 2 In )
H=1: Y = c1 + Z ∼ N (c1 , σ 2 In ),
or, equivalently,
$$H = 0:\quad Y \sim f_{Y|H}(y|0) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\|y-c_0\|^2}{2\sigma^2}\right)$$
$$H = 1:\quad Y \sim f_{Y|H}(y|1) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\|y-c_1\|^2}{2\sigma^2}\right).$$
$$\Lambda(y) = \frac{f_{Y|H}(y|1)}{f_{Y|H}(y|0)} = \exp\left(\frac{\|y-c_0\|^2 - \|y-c_1\|^2}{2\sigma^2}\right).$$
$$\ln\Lambda(y) = \frac{\|y-c_0\|^2 - \|y-c_1\|^2}{2\sigma^2} \tag{2.9}$$
$$= \left\langle y, \frac{c_1-c_0}{\sigma^2}\right\rangle + \frac{\|c_0\|^2 - \|c_1\|^2}{2\sigma^2}. \tag{2.10}$$
From (2.10), the MAP rule can be written as
$$\left\langle y, \frac{c_1-c_0}{\sigma^2}\right\rangle + \frac{\|c_0\|^2 - \|c_1\|^2}{2\sigma^2} \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; \ln\eta. \tag{2.11}$$
Notice the similarity with the corresponding expression of the scalar case. As for
the scalar case, we move the constants to the right and normalize to obtain
$$\langle y, \psi\rangle \;\overset{\hat{H}=1}{\underset{\hat{H}=0}{\gtrless}}\; \theta, \tag{2.12}$$
where
$$\psi = \frac{c_1-c_0}{d}$$
is the unit-length vector that points in the direction $c_1 - c_0$, $d = \|c_1 - c_0\|$ is the distance between the signals, and
$$\theta = \frac{\sigma^2}{d}\ln\eta + \frac{\|c_1\|^2 - \|c_0\|^2}{2d}$$
is the decision threshold. Hence the decision regions R0 and R1 are delimited by
the affine plane
$$\{y \in \mathbb{R}^n : \langle y, \psi\rangle = \theta\}.$$
For definiteness, we are assigning the points of the delimiting affine plane to R1 ,
but this is an arbitrary decision that has no effect on the error probability because
the probability that Y is on any given affine plane is zero.
We obtain additional geometrical insight by considering those y for which (2.9)
is constant. The situation is depicted in Figure 2.7, where the signed distance p is
positive if the delimiting affine plane lies in the direction pointed by ψ with respect
to c0 and q is positive if the affine plane lies in the direction pointed by −ψ with
respect to c1 . (In the figure, both p and q are positive.) By Pythagoras’ theorem
applied to the two right triangles with common edge, for all y on the affine plane,
$\|y - c_0\|^2 - \|y - c_1\|^2$ equals $p^2 - q^2$.
[Figure 2.7: the affine plane delimiting R_0 and R_1. The plane is orthogonal to ψ; p is the signed distance from c_0 to the plane (along ψ), q the signed distance from c_1 (along −ψ), and y − c_0, y − c_1 form two right triangles with a common edge.]
For y on the delimiting affine plane, we have both
$$\|y-c_0\|^2 - \|y-c_1\|^2 = p^2 - q^2 \quad\text{and}\quad \|y-c_0\|^2 - \|y-c_1\|^2 = 2\sigma^2\ln\eta,$$
the second because (2.9) equals $\ln\eta$ on the plane. Using $p + q = d$ and solving for p and q yields
$$p = \frac{d}{2} + \frac{\sigma^2\ln\eta}{d}, \qquad q = \frac{d}{2} - \frac{\sigma^2\ln\eta}{d}.$$
When PH (0) = PH (1) = 12 , the delimiting affine plane is the set of y ∈ Rn for
which (2.9) equals 0. These are the points y that are at the same distance from
c0 and from c1 . Hence, R0 contains all the points y ∈ Rn that are closer to c0
than to c1 .
A few additional observations are in order.
• The vector ψ is not affected by the prior but the threshold θ is. Hence the prior
affects the position but not the orientation of the delimiting affine plane. As
one would expect, the plane moves away from c0 when PH (0) increases. This is
consistent with our intuition that the decoding region for a hypothesis becomes
larger as the probability of that hypothesis increases.
• The above-mentioned effect of the prior is amplified when σ 2 increases. This is
also consistent with our intuition that the decoder relies less on the observation
and more on the prior when the observation becomes noisier.
• Notice the similarity of (2.9) and (2.10) with (2.7). This suggests a tight relationship between the scalar and the vector case. We can gain additional insight by placing the origin of a new coordinate system at $\frac{c_0+c_1}{2}$ and by letting the first coordinate be in the direction of $\psi = \frac{c_1-c_0}{d}$, where again $d = \|c_1-c_0\|$. In this new coordinate system, H = 0 is mapped into the vector $\tilde{c}_0 = (-\frac{d}{2}, 0, \ldots, 0)^T$ and H = 1 is mapped into $\tilde{c}_1 = (\frac{d}{2}, 0, \ldots, 0)^T$. If $\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_n)^T$ is the channel output in this new coordinate system, $\langle\tilde{y}, \psi\rangle = \tilde{y}_1$. This shows that for a binary decision, the vector case is essentially the scalar case embedded in an n-dimensional space.
As for the scalar case, we compute the probability of error by conditioning
on H = 0 and H = 1 and then remove the conditioning by averaging: Pe =
Pe (0)PH (0) + Pe (1)PH (1).
When H = 0, $Y = c_0 + Z$ and the MAP decoder makes the wrong decision when $\langle Z, \psi\rangle \ge p$, i.e. when the projection of Z onto the directional unit vector ψ has (signed) length that is equal to or greater than p. That this is the condition for an error should be clear from Figure 2.7, but it can also be derived by inserting $Y = c_0 + Z$ into $\langle Y, \psi\rangle \ge \theta$ and using (2.8). Since $\langle Z, \psi\rangle$ is a zero-mean Gaussian random variable of variance $\sigma^2$ (see Appendix 2.10), we obtain
$$P_e(0) = Q\!\left(\frac{p}{\sigma}\right) = Q\!\left(\frac{d}{2\sigma} + \frac{\sigma\ln\eta}{d}\right).$$
[Figure: three codewords c_0, c_1, c_2 in the plane with the corresponding decision regions R_0, R_1, R_2.]
y ∈ R falls outside the decoding region R0 . This is the case if the noise Z ∈ R is
larger than d/2, where d = ci − ci−1 , i = 1, . . . , 5. Thus
$$P_e(0) = \Pr\left\{Z > \frac{d}{2}\right\} = Q\!\left(\frac{d}{2\sigma}\right).$$
By symmetry, $P_e(5) = P_e(0)$. For i ∈ {1, 2, 3, 4}, the probability of error when H = i is the probability that the event $\{Z \ge \frac{d}{2}\} \cup \{Z < -\frac{d}{2}\}$ occurs. This event is the union of disjoint events. Its probability is the sum of the probabilities of the individual events. Hence
$$P_e(i) = \Pr\left\{\left\{Z \ge \frac{d}{2}\right\} \cup \left\{Z < -\frac{d}{2}\right\}\right\} = 2\Pr\left\{Z \ge \frac{d}{2}\right\} = 2Q\!\left(\frac{d}{2\sigma}\right), \qquad i \in \{1, 2, 3, 4\}.$$
Finally, $P_e = \frac{2}{6}Q\!\left(\frac{d}{2\sigma}\right) + \frac{4}{6}\,2Q\!\left(\frac{d}{2\sigma}\right) = \frac{5}{3}Q\!\left(\frac{d}{2\sigma}\right)$. We see immediately how to generalize. For a PAM constellation of m points (m a positive integer), the error probability is
$$P_e = \left(2 - \frac{2}{m}\right)Q\!\left(\frac{d}{2\sigma}\right).$$
[Figure: the 6-PAM constellation c_0, ..., c_5 on the real line, with adjacent points spaced d apart and decision regions R_0, ..., R_5.]
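A simulation sketch of m-PAM with minimum-distance decoding (Python/NumPy assumed; m, d, σ, and the sample size are illustrative choices), compared against the formula $(2 - \frac{2}{m})Q(d/(2\sigma))$ derived above.

```python
# Sketch: error probability of m-PAM with minimum-distance decoding on the
# discrete-time AWGN channel, compared with (2 - 2/m) Q(d/(2*sigma)).
# (Python/NumPy assumed; m, d, sigma, and the sample size are illustrative.)
import numpy as np
from math import erfc, sqrt

m, d, sigma, num = 6, 2.0, 0.7, 300_000
rng = np.random.default_rng(1)
points = d * np.arange(m)                    # c_0, ..., c_{m-1}, spaced d apart

H = rng.integers(0, m, num)
Y = points[H] + sigma * rng.standard_normal(num)

# Minimum-distance decoding: pick the nearest constellation point.
H_hat = np.argmin(np.abs(Y[:, None] - points[None, :]), axis=1)

Q = lambda x: 0.5 * erfc(x / sqrt(2.0))
print("empirical Pe:", np.mean(H_hat != H))
print("formula   Pe:", (2 - 2 / m) * Q(d / (2 * sigma)))
```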
example 2.6 (m-QAM) Figure 2.10 shows the signal set {c0 , c1 , c2 , c3 } ⊂ R2 for
4-ary quadrature amplitude modulation (QAM). We consider signals as points in
R2 . (We could choose to consider signals as points in C, but we have to postpone
this view until we know how to deal with complex valued noise.) The noise is
Z ∼ N (0, σ 2 I2 ) and the observable, when H = i, is Y = ci + Z. We assume that
the receiver implements an ML decision rule, which for the AWGN channel means
minimum-distance decoding. The decoding region for c0 is the first quadrant, for
c1 the second quadrant, etc. When H = 0, the decoder makes the correct decision
if $\{Z_1 > -\frac{d}{2}\} \cap \{Z_2 \ge -\frac{d}{2}\}$, where d is the minimum distance among signal points. This is the intersection of independent events. Hence the probability of the intersection is the product of the probability of each event, i.e.
$$P_c(0) = \Pr\left\{\left\{Z_1 \ge -\frac{d}{2}\right\} \cap \left\{Z_2 \ge -\frac{d}{2}\right\}\right\} = Q^2\!\left(-\frac{d}{2\sigma}\right) = \left(1 - Q\!\left(\frac{d}{2\sigma}\right)\right)^2.$$
[Figure 2.10: the 4-QAM constellation c_0, c_1, c_2, c_3 in the (y_1, y_2) plane with nearest-neighbor spacing d; the right panel shows the noise coordinates (z_1, z_2) relative to the transmitted point.]
$$P_e(0) = \Pr\left\{\left\{Z_1 \le -\frac{d}{2}\right\} \cup \left\{Z_2 \le -\frac{d}{2}\right\}\right\}$$
$$= \Pr\left\{Z_1 \le -\frac{d}{2}\right\} + \Pr\left\{Z_2 \le -\frac{d}{2}\right\} - \Pr\left\{\left\{Z_1 \le -\frac{d}{2}\right\} \cap \left\{Z_2 \le -\frac{d}{2}\right\}\right\}$$
$$= 2Q\!\left(\frac{d}{2\sigma}\right) - Q^2\!\left(\frac{d}{2\sigma}\right).$$
Notice that, in determining Pc (0) (Example 2.6), we compute the probability of
the intersection of independent events (which is the product of the probability of
the individual events) whereas in determining Pe (0) without passing through Pc (0)
(this example), we compute the probability of the union of events that are not
disjoint (which is not the sum of the probability of the individual events).
[Figure: the hypothesis H is observed through two noisy branches affected by Z_1 and Z_2, producing Y_1 and Y_2, both of which are fed to the receiver that outputs Ĥ.]
example 2.12 Regardless of the distribution on H, the binary test (2.4) depends
on Y only through the likelihood ratio Λ(Y ). Hence H → Λ(Y ) → Y must hold,
which makes the likelihood ratio a sufficient statistic. Notice that Λ(y) is a scalar
even when y is an n-tuple.
The following result is a useful tool in verifying that a function T (y) is a sufficient
statistic. It is proved in Exercise 2.22.
We will often use the notion of indicator function. Recall that if A is an arbitrary
set, the indicator function 1{x ∈ A} is defined as
$$\mathbb{1}\{x \in A\} = \begin{cases} 1, & x \in A\\ 0, & \text{otherwise.}\end{cases}$$
Here is a simple and extremely useful bound. Recall that for general events A, B
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \le P(A) + P(B).$$
More generally, using induction, we obtain the union bound
$$P\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i), \tag{UB}$$
Our goal is to bound the error probability
$$P_e(i) = \Pr\{Y \in R_i^c \mid H = i\} = \int_{R_i^c} f_{Y|H}(y|i)\,dy,$$
where $R_i^c$ denotes the complement of $R_i$. If we are able to evaluate the above
integral for every i, then we are able to determine the probability of error exactly.
The bound that we derive is useful if we are unable to evaluate the above integral.
For $i \ne j$ define
$$B_{i,j} = \left\{ y : P_H(j)\,f_{Y|H}(y|j) \ge P_H(i)\,f_{Y|H}(y|i) \right\}.$$
Bi,j is the set of y for which the a posteriori probability of H given Y = y is at
least as high for j as it is for i. Roughly speaking,3 it contains the ys for which a
MAP decision rule would choose j over i.
³ A y for which the a posteriori probability is the same for i and for j is contained in both $B_{i,j}$ and $B_{j,i}$.
$$R_i^c \subseteq \bigcup_{j: j\ne i} B_{i,j}. \tag{2.15}$$
To see that the above inclusion holds, consider an arbitrary $y \in R_i^c$. By definition, there is at least one $k \in \mathcal{H}$ such that $P_H(k)\,f_{Y|H}(y|k) \ge P_H(i)\,f_{Y|H}(y|i)$. Hence $y \in B_{i,k}$.
The reader may wonder why we do not have equality in (2.15). To see that
equality may or may not apply, consider a y that belongs to Bi,l for some l. It
could be so because PH (l)fY |H (y|l) = PH (i)fY |H (y|i) (notice the equality sign).
To simplify the argument, let us assume that for the chosen y there is only one
such l. The MAP decoding rule does not prescribe whether y should be in the
decoding region of i or l. If it is in that of i, then equality in (2.15) does not hold.
If none of the y for which PH (l)fY |H (y|l) = PH (i)fY |H (y|i) for some l has been
assigned to Ri then we have equality in (2.15). In one sentence, we have equality
if all the ties have been resolved against i.
We are now in the position to upper bound Pe (i). Using (2.15) and the union
bound we obtain
$$P_e(i) = \Pr\{Y \in R_i^c \mid H = i\} \le \Pr\Big\{Y \in \bigcup_{j:j\ne i} B_{i,j} \;\Big|\; H = i\Big\}$$
$$\le \sum_{j:j\ne i} \Pr\{Y \in B_{i,j} \mid H = i\} \tag{2.16}$$
$$= \sum_{j:j\ne i} \int_{B_{i,j}} f_{Y|H}(y|i)\,dy.$$
The gain is that it is typically easier to integrate over $B_{i,j}$ than over $R_i^c$. For instance, when the channel is AWGN and the decision rule is ML, $B_{i,j}$ is the set of points in $\mathbb{R}^n$ that are at least as close to $c_j$ as they are to $c_i$. Figure 2.12 depicts this situation. In this case,
$$\int_{B_{i,j}} f_{Y|H}(y|i)\,dy = Q\!\left(\frac{\|c_j - c_i\|}{2\sigma}\right),$$
and the union bound yields the simple expression
$$P_e(i) \le \sum_{j:j\ne i} Q\!\left(\frac{\|c_j - c_i\|}{2\sigma}\right).$$
[Figure 2.12: the shape of B_{i,j} for AWGN channels and ML decision: the half-space of points at least as close to c_j as to c_i.]
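A sketch that evaluates the union bound $P_e(i) \le \sum_{j\ne i} Q(\|c_j - c_i\|/(2\sigma))$ for an arbitrary constellation; Python with NumPy is assumed, and the 4-QAM constellation and σ below are illustrative choices.

```python
# Sketch: union bound Pe(i) <= sum_j Q(||c_j - c_i|| / (2*sigma)) for ML decoding
# on the AWGN channel. (Python/NumPy assumed; constellation and sigma illustrative.)
import numpy as np
from math import erfc, sqrt

Q = lambda x: 0.5 * erfc(x / sqrt(2.0))

sigma = 0.6
C = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])  # codewords c_i

for i, ci in enumerate(C):
    bound = sum(Q(np.linalg.norm(cj - ci) / (2 * sigma))
                for j, cj in enumerate(C) if j != i)
    print(f"union bound on Pe({i}) = {bound:.4e}")
```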
for a general fY |H . Notice that the above integral is the probability of error under
H = i when there are only two hypotheses, the other hypothesis is H = j, and
the priors are proportional to PH (i) and PH (j).
example 2.15 (m-PSK) Figure 2.13 shows a signal set for 8-ary PSK (phase-
shift keying). m-PSK is defined for all integers m ≥ 2. Formally, the signal
transmitted when H = i , i ∈ H = {0, 1, . . . , m − 1}, is
$$c_i = \sqrt{E_s}\begin{pmatrix}\cos\frac{2\pi i}{m}\\ \sin\frac{2\pi i}{m}\end{pmatrix}.$$
For now $\sqrt{E_s}$ is just the radius of the PSK constellation. As we will see, $E_s = \|c_i\|^2$ is (proportional to) the energy required to generate $c_i$.
[Figure 2.13: the 8-PSK constellation c_0, ..., c_7 on a circle, with the corresponding decision regions R_0, ..., R_7.]
The above expression is rather complicated. Let us see what we obtain through the
union bound.
[Figure 2.14: the codeword c_4 with its neighbors c_3 and c_5, the decision region R_4, the half-planes B_{4,3} and B_{4,5}, and their intersection B_{4,3} ∩ B_{4,5}.]
The above expression can be used to upper and lower bound $P_e(i)$. In fact, if we lower bound the last term by setting it to zero, we obtain the upper bound that we have just derived. To the contrary, if we upper bound the last term, we obtain a lower bound to $P_e(i)$. To do so, observe that $R_i^c$ is the union of (m − 1) disjoint cones, one of which is $B_{i,i-1} \cap B_{i,i+1}$ (see again Figure 2.14). The sum of the integrals of $f_{Y|H}(\cdot|i)$ over those cones is $P_e(i)$. If all those integrals gave the same result (which is not the case) each would equal $\frac{P_e(i)}{m-1}$. From the figure, the integral of $f_{Y|H}(\cdot|i)$ over $B_{i,i-1} \cap B_{i,i+1}$ is clearly smaller than that over the other cones. Hence its value must be less than $\frac{P_e(i)}{m-1}$. Mathematically,
$$\Pr\{Y \in (B_{i,i-1} \cap B_{i,i+1}) \mid H = i\} \le \frac{P_e(i)}{m-1}.$$
Inserting in the previous expression, solving for $P_e(i)$ and using the fact that $P_e(i) = P_e$ yields the desired lower bound
$$P_e \ge 2Q\!\left(\sqrt{\frac{E_s}{\sigma^2}}\,\sin\frac{\pi}{m}\right)\frac{m-1}{m}.$$
The ratio between the upper and the lower bound is the constant $\frac{m}{m-1}$. For m large, the bounds become very tight.
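The two m-PSK bounds are easy to evaluate numerically; the sketch below (Python assumed; m and the values of $E_s/\sigma^2$ are illustrative) prints the sandwich obtained above.

```python
# Sketch: upper and lower bounds on the m-PSK error probability derived above.
# (Python assumed; m and Es/sigma^2 are illustrative choices.)
from math import erfc, sqrt, sin, pi

Q = lambda x: 0.5 * erfc(x / sqrt(2.0))

m = 8
for Es_over_sigma2 in [4.0, 10.0, 20.0]:
    upper = 2 * Q(sqrt(Es_over_sigma2) * sin(pi / m))
    lower = upper * (m - 1) / m
    print(f"Es/sigma^2 = {Es_over_sigma2:5.1f}:  {lower:.3e} <= Pe <= {upper:.3e}")
```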
The way we upper-bounded P r{Y ∈ Bi,i−1 ∩ Bi,i+1 |H = i} is not the only way to
proceed. Alternatively, we could use the fact that Bi,i−1 ∩ Bi,i+1 is included in Bi,k
where k is the index of the codeword opposed to $c_i$. (In Figure 2.14, $B_{4,3} \cap B_{4,5} \subset B_{4,0}$.) Hence $\Pr\{Y \in B_{i,i-1} \cap B_{i,i+1} \mid H = i\} \le \Pr\{Y \in B_{i,k} \mid H = i\} = Q\!\left(\sqrt{E_s}/\sigma\right)$.
This goes to zero as Es /σ 2 → ∞. It implies that the lower bound obtained this way
becomes tight as Es /σ 2 becomes large.
It is not surprising that the upper bound to Pe (i) becomes tighter as m or
Es /σ 2 (or both) become large. In fact it should be clear that under those conditions
P r{Y ∈ Bi,i−1 ∩ Bi,i+1 |H = i} becomes smaller.
PAM, QAM, and PSK are widely used in modern communications systems. See
Section 2.7 for examples of standards using these constellations.
and we have used this bound for the AWGN channel. With the bound, instead of
having to compute
$$\Pr\{Y \in R_i^c \mid H = i\} = \int_{R_i^c} f_{Y|H}(y|i)\,dy,$$
which requires integrating over a possibly complicated region $R_i^c$, we have only to compute
$$\Pr\{Y \in B_{i,j} \mid H = i\} = \int_{B_{i,j}} f_{Y|H}(y|i)\,dy.$$
The latter integral is simply $Q(d_{i,j}/\sigma)$, where $d_{i,j}$ is the distance between $c_i$ and the affine plane bounding $B_{i,j}$. For an ML decision rule, $d_{i,j} = \frac{\|c_i - c_j\|}{2}$.
What if the channel is not AWGN? Is there a relatively simple expression for
P r{Y ∈ Bi,j |H = i} that applies for general channels? Such an expression does
exist. It is the Bhattacharyya bound that we now derive.4 We will need it only for
those i for which PH (i) > 0. Hence, for the derivation that follows, we assume that
it is the case.
The definition of $B_{i,j}$ may be rewritten in either of the following two forms
$$\left\{ y : \frac{P_H(j)\,f_{Y|H}(y|j)}{P_H(i)\,f_{Y|H}(y|i)} \ge 1 \right\} = \left\{ y : \sqrt{\frac{P_H(j)\,f_{Y|H}(y|j)}{P_H(i)\,f_{Y|H}(y|i)}} \ge 1 \right\},$$
except that the above fraction is not defined when fY |H (y|i) vanishes. This excep-
tion apart, we see that
#
PH (j)fY |H (y|j)
1{y ∈ Bi,j } ≤
PH (i)fY |H (y|i)
is true when y is inside Bi,j ; it is also true when outside because the left side
vanishes and the right is never negative. We do not have to worry about the
exception because we will use
$$f_{Y|H}(y|i)\,\mathbb{1}\{y \in B_{i,j}\} \le f_{Y|H}(y|i)\sqrt{\frac{P_H(j)\,f_{Y|H}(y|j)}{P_H(i)\,f_{Y|H}(y|i)}} = \sqrt{\frac{P_H(j)}{P_H(i)}}\sqrt{f_{Y|H}(y|i)\,f_{Y|H}(y|j)},$$
which is obviously true when fY |H (y|i) vanishes.
We are now ready to derive the Bhattacharyya bound:
$$\Pr\{Y \in B_{i,j} \mid H = i\} = \int_{y\in B_{i,j}} f_{Y|H}(y|i)\,dy = \int_{y\in\mathbb{R}^n} f_{Y|H}(y|i)\,\mathbb{1}\{y \in B_{i,j}\}\,dy \le \sqrt{\frac{P_H(j)}{P_H(i)}}\int_{y\in\mathbb{R}^n}\sqrt{f_{Y|H}(y|i)\,f_{Y|H}(y|j)}\;dy. \tag{2.17}$$
What makes the last integral appealing is that we integrate over the entire Rn . The
above bound takes a particularly simple form when there are only two hypotheses
of equal probability. In this case,
$$P_e(0) = P_e(1) = P_e \le \int_{y\in\mathbb{R}^n}\sqrt{f_{Y|H}(y|0)\,f_{Y|H}(y|1)}\;dy. \tag{2.18}$$
As shown in Exercise 2.32, for discrete memoryless channels the bound further
simplifies.
⁴ There are two versions of the Bhattacharyya bound. Here we derive the one that has the simpler derivation. The other version, which is tighter by a factor 2, is derived in Exercises 2.29 and 2.30.
As the name indicates, the union Bhattacharyya bound combines (2.16) and
(2.17), namely
$$P_e(i) \le \sum_{j:j\ne i}\Pr\{Y \in B_{i,j}\mid H = i\} \le \sum_{j:j\ne i}\sqrt{\frac{P_H(j)}{P_H(i)}}\int_{y\in\mathbb{R}^n}\sqrt{f_{Y|H}(y|i)\,f_{Y|H}(y|j)}\;dy.$$
[Figure: the binary erasure channel: each input X ∈ {0, 1} is delivered unchanged with probability 1 − p and erased (Y = Δ) with probability p.]
where in (a) we used the fact that the first factor under the square root vanishes if y
contains 0s and the second vanishes if y contains 1s. Hence the only non-vanishing
term in the sum is the one for which $y_i = \Delta$ for all i. The same bound applies for
H = 1. Hence $P_e \le \frac{1}{2}p^n + \frac{1}{2}p^n = p^n$.
If we use the tighter version of the union Bhattacharyya bound, which as men-
tioned earlier is tighter by a factor of 2, then we obtain
$$P_e \le \frac{1}{2}p^n.$$
For the binary erasure channel and the two codewords $c_0$ and $c_1$ we can actually compute the exact probability of error. An error can occur only if $Y = (\Delta, \Delta, \ldots, \Delta)^T$, and in this case it occurs with probability $\frac{1}{2}$. Hence,
$$P_e = \frac{1}{2}\Pr\{Y = (\Delta, \Delta, \ldots, \Delta)^T\} = \frac{1}{2}p^n.$$
The Bhattacharyya bound is tight for the scenario considered in this example!
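A quick Monte Carlo sketch (Python/NumPy assumed; p, n, and the sample size are illustrative) confirming that the exact error probability $\frac{1}{2}p^n$ is met and that the union Bhattacharyya bound $p^n$ is off by exactly a factor of 2 in this scenario.

```python
# Sketch: Monte Carlo check of Pe = 0.5 * p**n for two length-n codewords that
# differ in every position, sent over the binary erasure channel; compared with
# the union Bhattacharyya bound p**n. (Python/NumPy assumed; p, n illustrative.)
import numpy as np

p, n, num = 0.4, 5, 500_000
rng = np.random.default_rng(2)

H = rng.integers(0, 2, num)                  # equiprobable hypotheses
erased = rng.random((num, n)) < p            # erasure pattern of each transmission
all_erased = erased.all(axis=1)
# Any surviving position identifies the codeword; otherwise the receiver guesses.
H_hat = np.where(all_erased, rng.integers(0, 2, num), H)

print("empirical Pe       :", np.mean(H_hat != H))
print("exact (1/2) p**n   :", 0.5 * p**n)
print("Bhattacharyya p**n :", p**n)
```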
2.7 Summary
The maximum a posteriori probability (MAP) rule is a decision rule that does
exactly what the name implies – it maximizes the a posteriori probability – and in
so doing it maximizes the probability that the decision is correct. With hindsight,
the key idea is quite simple and it applies even when there is no observable. Let
us review it.
Assume that a coin is flipped and we have to guess the outcome. We model the
coin by the random variable H ∈ {0, 1}. All we know is PH (0) and PH (1). Suppose
that PH (0) ≤ PH (1). Clearly we have the highest chance of being correct if we
guess Ĥ = 1 every time we perform the experiment of flipping the coin. We will
be correct if indeed H = 1, and this has probability PH (1). More generally, for an
arbitrary number m of hypotheses, we choose (one of) the i that maximizes PH (·)
and the probability of being correct is PH (i).
It is more interesting when there is some “side information”. The side informa-
tion is obtained when we observe the outcome of a related random variable Y .
Once we have made the observation Y = y, our knowledge about the distribution
of H gets updated from the prior distribution PH (·) to the posterior distribution
PH|Y (·|y). What we have said in the previous paragraphs applies with the posterior
instead of the prior.
In a typical example PH (·) is constant whereas for the observed y, PH|Y (·|y) is
strongly biased in favor of one hypothesis. If it is strongly biased, the observable
has been very informative, which is what we hope of course.
Often PH|Y is not given to us, but we can find it from PH and fY |H via Bayes’
rule. Although PH|Y is the most fundamental quantity associated to a MAP test
and therefore it would make sense to write the test in terms of PH|Y , the test is
typically written in terms of PH and fY |H because these are the quantities that
are specified as part of the model.
Ideally a receiver performs a MAP decision. We have emphasized the case in
which all hypotheses have the same probability as this is a common assumption
in digital communication. Then the MAP and the ML rule are identical.
The following is an example of how the posterior becomes more and more
selective as the number of observations increases. The example also shows that
the posterior becomes less selective if the observations are more “noisy”.
example 2.17 Assume H ∈ {0, 1} and PH (0) = PH (1) = 1/2. The outcome of
H is communicated across a binary symmetric channel (BSC) of crossover proba-
bility p < 1/2 via a transmitter that sends n 0s when H = 0 and n 1s when H = 1.
The BSC has input alphabet $\mathcal{X} = \{0, 1\}$, output alphabet $\mathcal{Y} = \mathcal{X}$, and transition probability $p_{Y|X}(y|x) = \prod_{i=1}^{n} p_{Y|X}(y_i|x_i)$, where $p_{Y|X}(y_i|x_i)$ equals $1-p$ if $y_i = x_i$ and p otherwise. (We obtain a BSC, for instance, if we place an appropriately
chosen 1-bit quantizer at the output of the AWGN channel used with a binary
[Figure 2.16: $P_{H|Y}(0|y)$ as a function of the number k of 1s in y, for four cases: (a) p = 0.25, n = 1; (b) p = 0.25, n = 50; (c) p = 0.47, n = 1; (d) p = 0.47, n = 50.]
Figure 2.16 depicts the behavior of PH|Y (0|y) as a function of the number k of 1s
in y. For the top two figures, p = 0.25. We see that when n = 50 (top right figure),
the posterior is very biased in favor of one or the other hypothesis, unless the number k
of observed 1s is nearly n/2 = 25. Comparing to n = 1 (top left figure), we see that
many observations allow the receiver to make a more confident decision. This is
true also for p = 0.47 (bottom row), but we see that with the crossover probability
p close to 1/2, there is a smoother transition between the region in favor of one
hypothesis and the region in favor of the other. If we make only one observation
(bottom left figure), then there is only a slight difference between the posterior for
H = 0 and that for H = 1. This is the worst of the four cases (fewer observations
through noisier channel). The best situation is of course the one of figure (b) (more
observations through a more reliable channel).
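The curves of Figure 2.16 are easy to reproduce; the sketch below (plain Python, with p and n matching the four cases of the figure) evaluates the posterior $P_{H|Y}(0|y)$ as a function of the number k of 1s in y, for the transmitter of Example 2.17.

```python
# Sketch: the posterior P_{H|Y}(0|y) of Example 2.17 as a function of the number
# k of 1s in y, for a BSC with crossover probability p and uniform prior.
# (Python assumed; the values of p and n reproduce the cases of Figure 2.16.)
def posterior_H0(k, n, p):
    like0 = p**k * (1 - p)**(n - k)        # likelihood given H=0 (codeword of n zeros): k flips
    like1 = p**(n - k) * (1 - p)**k        # likelihood given H=1 (codeword of n ones): n-k flips
    return like0 / (like0 + like1)         # uniform prior cancels

for p, n in [(0.25, 1), (0.25, 50), (0.47, 1), (0.47, 50)]:
    samples = [posterior_H0(k, n, p) for k in range(0, n + 1, max(1, n // 5))]
    print(f"p={p}, n={n}:", [round(v, 3) for v in samples])
```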
The following theorem lists a number of handy facts about unitary matrices.
Most of them are straightforward. Proofs can be found in [12, page 67].
proving that λ must be positive, because it is the ratio of two positive numbers.
If A is positive semidefinite, then the numerator of λ = u† Au/u† u can vanish.
A = U DV † ,
(a) The columns of V are the eigenvectors of A† A. The last n − k columns span
the null space of A.
(b) The columns of U are eigenvectors of AA† . The first k columns span the
range of A.
(c) If $m \ge n$ then
$$D = \begin{pmatrix} \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}) \\ 0_{(m-n)\times n} \end{pmatrix},$$
Note 1: Recall that the set of non-zero eigenvalues of AB equals the set of non-zero
eigenvalues of BA, see e.g. [12, Theorem 1.3.29]. Hence the non-zero eigenvalues
in (c) and (d) are the same.
Observe that
$$u_i^\dagger u_j = \frac{1}{\sqrt{\lambda_i\lambda_j}}\, v_i^\dagger A^\dagger A v_j = \frac{\lambda_j}{\sqrt{\lambda_i\lambda_j}}\, v_i^\dagger v_j = \delta_{ij}, \qquad 1 \le i, j \le k.$$
i.e. $A = UDV^\dagger$. For $i = 1, 2, \ldots, m$,
$$AA^\dagger u_i = UDV^\dagger VD^\dagger U^\dagger u_i = UDD^\dagger U^\dagger u_i = u_i\lambda_i,$$
where in the last equality we use the fact that U † ui contains 1 at position i and 0
elsewhere, and DD† = diag(λ1 , λ2 , . . . , λk , 0, . . . , 0). This shows that λi is also an
eigenvalue of AA† . We have also shown that {vi : i = k + 1, . . . , n} spans the null
space of A and from (2.22) we see that {ui : i = 1, . . . , k} spans the range of A.
$$f_Y(y) = \frac{f_X(g^{-1}(y))}{|g'(g^{-1}(y))|}. \tag{2.23}$$
example 2.27 If $g(x) = ax + b$ then $f_Y(y) = \frac{f_X\left(\frac{y-b}{a}\right)}{|a|}$.
[Figure: an invertible transformation y = g(x) and the corresponding densities f_X(x) and f_Y(y).]
$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{|\det J|}\, f_{R,\Theta}(r, \theta),$$
$$f_{R,\Theta}(r, \theta) = \frac{r}{2\pi}\exp\left(-\frac{r^2}{2}\right).$$
Since fR,Θ (r, θ) depends only on r, we infer that R and Θ are independent
random variables and that Θ is uniformly distributed in [0, 2π). Hence
$$f_\Theta(\theta) = \begin{cases}\frac{1}{2\pi}, & \theta \in [0, 2\pi)\\ 0, & \text{otherwise}\end{cases}
\qquad\text{and}\qquad
f_R(r) = \begin{cases} r\, e^{-\frac{r^2}{2}}, & r \ge 0\\ 0, & \text{otherwise.}\end{cases}$$
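A small numerical sketch (Python/NumPy assumed) of the converse direction: sample R from the Rayleigh density above and Θ uniformly, and check that $X_1 = R\cos\Theta$ and $X_2 = R\sin\Theta$ behave like independent N(0, 1) random variables.

```python
# Sketch: sample R with density r*exp(-r^2/2) (via its inverse CDF) and Theta
# uniform on [0, 2*pi), then check that X1 = R*cos(Theta), X2 = R*sin(Theta)
# look like two independent N(0,1) random variables. (Python/NumPy assumed.)
import numpy as np

rng = np.random.default_rng(3)
num = 200_000

U = rng.random(num)
R = np.sqrt(-2.0 * np.log(1.0 - U))        # inverse of F_R(r) = 1 - exp(-r^2/2)
Theta = 2 * np.pi * rng.random(num)

X1, X2 = R * np.cos(Theta), R * np.sin(Theta)
print("means:", X1.mean().round(3), X2.mean().round(3))
print("vars :", X1.var().round(3), X2.var().round(3))
print("corr :", np.corrcoef(X1, X2)[0, 1].round(3))
```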
$$f_W(w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(w-m)^2}{2\sigma^2}\right).$$
Because a Gaussian random variable is completely specified by its mean m and
variance σ 2 , we use the short-hand notation N (m, σ 2 ) to denote its pdf. Hence
W ∼ N (m, σ 2 ).
An n-dimensional random vector X is a mapping X : Ω → Rn . It can be
seen as a collection X = (X1 , X2 , . . . , Xn )T of n random variables. The pdf
of X is the joint pdf of X1 , X2 , . . . , Xn . The expected value of X, denoted by
E[X], is the n-tuple (E[X1 ], E[X2 ], . . . , E[Xn ])T . The covariance matrix of X is
KX = E[(X − E[X])(X − E[X])T ]. Notice that XX T is an n × n random matrix,
i.e. a matrix of random variables, and the expected value of such a matrix is, by
definition, the matrix whose components are the expected values of those random
variables. A covariance matrix is always Hermitian. This follows immediately from
the definitions.
The pdf of a vector W = (W1 , W2 , . . . , Wn )T that consists of independent and
identically distributed (iid) ∼ N (0, 1) components is
$$f_W(w) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{w_i^2}{2}\right) \tag{2.25}$$
$$= \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{w^T w}{2}\right). \tag{2.26}$$
Z = AW, (2.27)
[Figure 2.18: an example in the (z_1, z_2) plane.]
Next we show that if a random vector has density as in (2.28), then it can be
written as in (2.27). Let Z ∈ Rm be such a random vector and let KZ be its
⁵ It is possible to play tricks and define a function that can be considered as being the density of a Gaussian random vector of singular covariance matrix. But what we gain in doing so is not worth the trouble.
KZ = U ΛU † , (2.29)
the diagonal matrix obtained by raising the diagonal elements of Λ to the power α. Then $Z = U\Lambda^{\frac{1}{2}}W = AW$ with $A = U\Lambda^{\frac{1}{2}}$ nonsingular. It remains to be
shown that fW (w) is as on the right-hand side of (2.26). It must be, because the
transformation from Rn to Rn that sends W to Z = AW is one-to-one. Hence the
density of fW (w) that leads to fZ (z) is unique. It must be (2.26), because (2.28)
was obtained from (2.26) assuming Z = AW .
Many authors use (2.28) to define a Gaussian random vector. We favor (2.27)
because it is more general (it does not depend on the covariance being nonsingular),
and because from this definition it is straightforward to prove a number of key
results associated to Gaussian random vectors. Some of these are dealt with in
the examples that follow.
example 2.34 (Gaussian random variables are not necessarily jointly Gaussian)
Let Y1 ∼ N (0, 1), let X ∈ {±1} be uniformly distributed, and let Y2 = Y1 X. Notice
that Y2 has the same pdf as Y1 . This follows from the fact that the pdf of Y1 is an
even function. Hence Y1 and Y2 are both Gaussian. However, they are not jointly
Gaussian. We come to this conclusion by observing that Y = Y1 + Y2 = Y1 (1 + X)
is 0 with probability 1/2. Hence Y cannot be Gaussian.
To prove the first equality relating a and b we consider the distance between the
vertex γ (common to a and b) and its projection onto the extension of c. As shown
in the figure, this distance may be computed in two ways obtaining a sin β and
b sin(π − α), respectively. The latter may be written as b sin(α). Hence a sin β =
b sin(α), which is the first equality. The second equality is proved similarly.
Most readers are familiar with the notion of vector space from a linear algebra
course. Unfortunately, some linear algebra courses for engineers associate vectors
to n-tuples rather than taking the axiomatic point of view – which is what we need.
A vector space (or linear space) consists of the following (see e.g. [10, 11] for more).
(1) A field F of scalars.6
(2) A set V of objects called vectors.7
(3) An operation called vector addition, which associates with each pair of vectors
α and β in V a vector α + β in V, in such a way that
(i) it is commutative: α + β = β + α;
(ii) it is associative: α + (β + γ) = (α + β) + γ for every α, β, γ in V;
(iii) there is a unique vector, called the zero vector and denoted by 0, such
that α + 0 = α for all α in V;
(iv) for each α in V, there is a β in V such that α + β = 0.
(4) An operation called scalar multiplication, which associates with each vector
α in V and each scalar a in F a vector aα in V, in such a way that
(i) 1α = α for every α in V;
(ii) (a1 a2 )α = a1 (a2 α) for every a1 , a2 in F;
⁶ In this book the field is almost exclusively R (the field of real numbers) or C (the field of complex numbers). In Chapter 6, where we talk about coding, we also work with the field F₂ of binary numbers.
⁷ We are concerned with two families of vectors: n-tuples and functions.
Given a vector space and nothing more, one can introduce the notion of a basis for
the vector space, but one does not have the tool needed to define an orthonormal
basis. Indeed the axioms of a vector space say nothing about geometric ideas such
as “length” or “angle”. To remedy this, one endows the vector space with the
notion of inner product.
definition 2.36 Let V be a vector space over C. An inner product on V is a
function that assigns to each ordered pair of vectors α, β in V a scalar $\langle\alpha, \beta\rangle$ in C in such a way that, for all α, β, γ in V and all scalars c in C,
(a) $\langle\alpha + \beta, \gamma\rangle = \langle\alpha, \gamma\rangle + \langle\beta, \gamma\rangle$ and $\langle c\alpha, \beta\rangle = c\langle\alpha, \beta\rangle$;
(b) $\langle\beta, \alpha\rangle = \langle\alpha, \beta\rangle^*$; (Hermitian symmetry)
(c) $\langle\alpha, \alpha\rangle \ge 0$ with equality if and only if α = 0.
It is implicit in (c) that $\langle\alpha, \alpha\rangle$ is real for all α ∈ V. From (a) and (b), we obtain the additional properties
(d) $\langle\alpha, \beta + \gamma\rangle = \langle\alpha, \beta\rangle + \langle\alpha, \gamma\rangle$ and $\langle\alpha, c\beta\rangle = c^*\langle\alpha, \beta\rangle$.
Notice that the above definition is also valid for a vector space over the field of real numbers, but in this case the complex conjugates appearing in (b) and (d) are superfluous. However, over the field of complex numbers they are necessary for any α ≠ 0, otherwise we could write
$$0 < \langle j\alpha, j\alpha\rangle = -1\langle\alpha, \alpha\rangle < 0,$$
where the first inequality follows from condition (c) and the fact that jα is a valid vector ($j = \sqrt{-1}$), and the equality follows from (a) and (d) without the complex conjugate. We see that the complex conjugate is necessary or else we can create the contradictory statement 0 < 0.
On Cn there is an inner product that is sometimes called the standard inner
product. It is defined on a = (a1 , . . . , an ) and b = (b1 , . . . , bn ) by
$$\langle a, b\rangle = \sum_j a_j b_j^*.$$
On $\mathbb{R}^n$, the standard inner product is often called the dot or scalar product.
example 2.37 The vector space Rn equipped with the dot product is an inner
product space and so is the vector space Cn equipped with the standard inner
product.
By means of the inner product, we introduce the notion of length, called norm,
of a vector α, via
$$\|\alpha\| = \sqrt{\langle\alpha, \alpha\rangle}.$$
Proof Statements (a) and (b) follow immediately from the definitions. We
postpone the proof of the Cauchy–Schwarz inequality to Example 2.43 as at
that time we will be able to make a more elegant proof based on the concept of a
projection. To prove the triangle inequality we use (2.31) and the Cauchy–Schwarz
inequality applied to $\Re\{\langle\alpha, \beta\rangle\} \le |\langle\alpha, \beta\rangle|$ to prove that $\|\alpha + \beta\|^2 \le (\|\alpha\| + \|\beta\|)^2$. Notice that $\Re\{\langle\alpha, \beta\rangle\} \le |\langle\alpha, \beta\rangle|$ holds with equality if and only if $\langle\alpha, \beta\rangle$ is a non-negative real. The Cauchy–Schwarz inequality holds with equality if and only if α and β are collinear. Both conditions for equality are satisfied if and only if one of α, β is a non-negative multiple of the other. The parallelogram equality follows immediately from (2.31) used twice, once with each sign.
[Figure: the vectors α, β, α + β, and α − β illustrating the triangle inequality and the parallelogram equality.]
At this point we could use the inner product and the norm to define the angle
between two vectors but we do not have any use for this. Instead, we will make
frequent use of the notion of orthogonality. Two vectors α and β are defined to
be orthogonal if α, β = 0.
example 2.39 This example and the two that follow are relevant for what we
do from Chapter 3 on. Let W = {w0 (t), . . . , wm−1 (t)} be a finite collection of
functions from R to C such that $\int_{-\infty}^{\infty}|w(t)|^2\,dt < \infty$ for all elements of W. Let
V be the complex vector space spanned by the elements of W, where the addition
of two functions and the multiplication of a function by a scalar are defined in
the obvious way. The reader should verify that the axioms of a vector space are
fulfilled. A vector space of functions will be called a signal space. The standard
inner product for functions from R to C is defined as
$$\langle\alpha, \beta\rangle = \int \alpha(t)\beta^*(t)\,dt,$$
but it is not a given that V with the standard inner product forms an inner product
space. It is straightforward to verify that the axioms (a), (b), and (d) of Definition 2.36 are fulfilled for all elements of V but axiom (c) is not necessarily fulfilled (see Example 2.40). If V is such that for all α ∈ V, $\langle\alpha, \alpha\rangle = 0$ implies that α is
the zero vector, then V endowed with the standard inner product forms an inner
product space. All we have said in this example applies also for the real vector
spaces spanned by functions from R to R.
example 2.40 Let V be the set of functions from R to R spanned by the function
that is zero everywhere, except at 0 where it takes value 1. It can easily be checked
that this is a vector space. It contains all the functions that are zero everywhere,
except at 0 where they can take on any value in R. Its zero vector is the function
that is 0 everywhere, including at 0. For all α in V, the standard inner product
$\langle\alpha, \alpha\rangle$ equals 0. Hence V with the standard inner product is not an inner product
space.
The problem highlighted with Example 2.40 is that for a general function α : I → C, $\int|\alpha(t)|^2\,dt = 0$ does not necessarily imply α(t) = 0 for all t ∈ I. It
is important to be aware of this fact. However, this potential problem will never
arise in practice because all electrical signals are continuous. Sometimes we work
out examples using signals that have discontinuities (e.g. rectangles) but even
then the problem will not arise unless we use rather bizarre signals.
example 2.41 Let p(t) be a complex-valued square-integrable function (i.e.
$\int|p(t)|^2\,dt < \infty$) and let $\int|p(t)|^2\,dt > 0$. For instance, p(t) could be the rectangular
pulse 1{t ∈ [0, T ]} for some T > 0. The set V = {cp(t) : c ∈ C} with the standard
inner product forms an inner product space. (In V, only the zero-pulse has zero
norm.)
[Figure: the decomposition of α into its projection $\alpha_{|\beta}$ along β and the orthogonal component $\alpha_{\perp\beta}$.]
Solving for c we obtain $c = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}$. Hence
$$\alpha_{|\beta} = \frac{\langle\alpha, \beta\rangle}{\|\beta\|^2}\,\beta = \langle\alpha, \varphi\rangle\varphi \qquad\text{and}\qquad \alpha_{\perp\beta} = \alpha - \alpha_{|\beta},$$
where $\varphi = \frac{\beta}{\|\beta\|}$ is β scaled to unit norm. Notice that the projection of α on β does not depend on the norm of β. In fact, the norm of $\alpha_{|\beta}$ is $|\langle\alpha, \varphi\rangle|$.
Any non-zero vector β ∈ V defines a hyperplane by the relationship
$$\{\alpha \in V : \langle\alpha, \beta\rangle = 0\}.$$
[Figure: the hyperplane defined by β, passing through 0.]
$$\{\alpha \in V : \langle\alpha, \beta\rangle = c\}.$$
[Figure: the affine plane defined by ϕ.]
The vector β and scalar c that define a hyperplane are not unique, unless we agree that we use only normalized vectors to define hyperplanes. By letting $\varphi = \frac{\beta}{\|\beta\|}$, the above definition of affine plane may equivalently be written as $\{\alpha \in V : \langle\alpha, \varphi\rangle = \frac{c}{\|\beta\|}\}$ or even as $\{\alpha \in V : \langle\alpha - \frac{c}{\|\beta\|}\varphi, \varphi\rangle = 0\}$. The first form shows that an affine plane is the set of vectors that have the same projection $\frac{c}{\|\beta\|}\varphi$ on ϕ. The second form shows that the affine plane is a hyperplane translated by the vector $\frac{c}{\|\beta\|}\varphi$. Some authors make no distinction between affine planes and hyperplanes; in this case both are called hyperplane.
In the example that follows, we use the notion of projection to prove the
Cauchy–Schwarz inequality stated in Theorem 2.38.
[Figure: the projection $\alpha_{|\beta}$ of α onto β, of length $\frac{|\langle\alpha, \beta\rangle|}{\|\beta\|}$, used in the proof of the Cauchy–Schwarz inequality.]
$$\alpha_i = \beta_i - \sum_{j=1}^{i-1}\langle\beta_i, \psi_j\rangle\psi_j, \qquad \psi_i = \frac{\alpha_i}{\|\alpha_i\|}.$$
We have assumed that β1 , . . . , βn is a linearly independent collection. Now assume
that this is not the case. If βj is linearly dependent of β1 , . . . , βj−1 , then at step
i = j the procedure will produce αi = ψi = 0. Such vectors are simply disregarded.
Figure 2.19 gives an example of the Gram–Schmidt procedure applied to a set
of signals.
[Figure 2.19: example of the Gram–Schmidt procedure applied to a set of signals, showing at each step i the signal β_i, the orthonormal functions obtained so far, and the corresponding coefficient n-tuples.]
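A sketch of the Gram–Schmidt procedure for n-tuples (Python/NumPy assumed; the input vectors are illustrative). It follows the two steps above and simply disregards linearly dependent inputs.

```python
# Sketch: the Gram-Schmidt procedure described above, applied to n-tuples.
# Linearly dependent inputs yield a (near-)zero alpha_i and are disregarded.
# (Python/NumPy assumed; the vectors beta_i below are illustrative.)
import numpy as np

def gram_schmidt(betas, tol=1e-10):
    psis = []
    for beta in betas:
        # <beta_i, psi_j> with the convention <a, b> = b^dagger a.
        alpha = beta - sum(np.vdot(psi, beta) * psi for psi in psis)
        norm = np.linalg.norm(alpha)
        if norm > tol:                      # skip linearly dependent vectors
            psis.append(alpha / norm)
    return psis

betas = [np.array([1.0, 1.0, 0.0]),
         np.array([1.0, 0.0, 1.0]),
         np.array([2.0, 1.0, 1.0])]        # dependent on the first two
for i, psi in enumerate(gram_schmidt(betas)):
    print(f"psi_{i+1} =", np.round(psi, 3))
```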
2.13 Exercises
Exercises for Section 2.2
exercise 2.2 (The “Wetterfrosch”) Let us assume that a “weather frog” bases
his forecast of tomorrow’s weather entirely on today’s air pressure. Determining
a weather forecast is a hypothesis testing problem. For simplicity, let us assume
that the weather frog only needs to tell us if the forecast for tomorrow’s weather
is “sunshine” or “rain”. Hence we are dealing with binary hypothesis testing. Let
H = 0 mean “sunshine” and H = 1 mean “rain”. We will assume that both values
of H are equally likely, i.e. $P_H(0) = P_H(1) = \frac{1}{2}$. For the sake of this exercise,
suppose that on a day that precedes sunshine, the pressure may be modeled as a
random variable Y with the following probability density function:
$$f_{Y|H}(y|0) = \begin{cases} A - \frac{A}{2}y, & 0 \le y \le 1\\ 0, & \text{otherwise.}\end{cases}$$
Similarly, the pressure on a day that precedes a rainy day is distributed according to
$$f_{Y|H}(y|1) = \begin{cases} B + \frac{B}{3}y, & 0 \le y \le 1\\ 0, & \text{otherwise.}\end{cases}$$
The weather frog’s purpose in life is to guess the value of H after measuring Y .
(d) Now assume that the weather forecaster does not know about hypothesis testing
and arbitrarily chooses the decision rule Ĥγ (y) for some arbitrary γ ∈ R.
Determine, as a function of γ, the probability that the decision rule decides
Ĥ = 1 given that H = 0. This probability is denoted P r{Ĥ(Y ) = 1|H = 0}.
(e) For the same decision rule, determine the probability of error Pe (γ) as a
function of γ. Evaluate your expression at γ = θ.
(f ) Using calculus, find the γ that minimizes Pe (γ) and compare your result to θ.
(a) Find and draw the density fY |H (y|0) of the observable under hypothesis
H = 0, and the density fY |H (y|1) of the observable under hypothesis H = 1.
(b) Find the decision rule that minimizes the probability of error.
(c) Compute the probability of error of the optimal decision rule.
exercise 2.4 (Poisson parameter estimation) In this example there are two
hypotheses, H = 0 and H = 1, which occur with probabilities PH (0) = p0 and
PH (1) = 1 − p0 , respectively. The observable Y takes values in the set of non-
negative integers. Under hypothesis H = 0, Y is distributed according to a Poisson
law with parameter λ0 , i.e.
$$P_{Y|H}(y|0) = \frac{\lambda_0^y}{y!}\,e^{-\lambda_0}. \tag{2.35}$$
Under hypothesis H = 1,
$$P_{Y|H}(y|1) = \frac{\lambda_1^y}{y!}\,e^{-\lambda_1}. \tag{2.36}$$
This is a model for the reception of photons in optical communication.
(a) Derive the MAP decision rule by indicating likelihood and log likelihood ratios.
Hint: The direction of an inequality changes if both sides are multiplied by a
negative number.
(b) Derive an expression for the probability of error of the MAP decision rule.
(c) For p0 = 1/3, λ0 = 2 and λ1 = 10, compute the probability of error of the
MAP decision rule. You may want to use a computer program to do this.
(d) Repeat (c) with λ1 = 20 and comment.
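For parts (c) and (d), which invite the use of a computer program, a possible starting point is the following sketch (plain Python; the truncation point y_max is an assumption): it sums $\min\big(P_H(0)P_{Y|H}(y|0),\,P_H(1)P_{Y|H}(y|1)\big)$ over y, which is the error probability of the MAP rule.

```python
# Sketch: numerical MAP error probability for the Poisson hypotheses of this
# exercise, summing min(P_H(0) P(y|0), P_H(1) P(y|1)) over y.
# (Python assumed; the truncation point y_max is an illustrative choice.)
from math import exp, log, lgamma

def poisson_pmf(y, lam):
    # Computed in the log domain to avoid overflow for large y.
    return exp(y * log(lam) - lam - lgamma(y + 1))

def map_error(p0, lam0, lam1, y_max=200):
    p1 = 1 - p0
    # For each y the MAP decoder errs with the smaller weighted likelihood.
    return sum(min(p0 * poisson_pmf(y, lam0), p1 * poisson_pmf(y, lam1))
               for y in range(y_max + 1))

print("Pe (lambda1=10):", map_error(1/3, 2, 10))
print("Pe (lambda1=20):", map_error(1/3, 2, 20))
```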
exercise 2.5 (Lie detector) You are asked to develop a “lie detector” and
analyze its performance. Based on the observation of brain-cell activity, your
detector has to decide if a person is telling the truth or is lying. For the purpose
of this exercise, the brain cell produces a sequence of spikes. For your decision you
may use only a sequence of n consecutive inter-arrival times Y1 , Y2 , . . . , Yn . Hence
Y1 is the time elapsed between the first and second spike, Y2 the time between the
second and third, etc. We assume that, a priori, a person lies with some known
probability p. When the person is telling the truth, Y1 , . . . , Yn is an iid sequence of
exponentially distributed random variables with intensity α, (α > 0), i.e.
(a) Describe the decision rule of your lie detector for the special case n = 1. Your
detector should be designed so as to minimize the probability of error.
(b) What is the probability PL|T that your lie detector says that the person is
lying when the person is telling the truth?
(c) What is the probability PT |L that your test says that the person is telling the
truth when the person is lying.
(d) Repeat (a) and (b) for a general n. Hint: When Y1 , . . . , Yn is a collection
of iid random variables that are exponentially distributed with parameter
α > 0, then Y1 + · · · + Yn has the probability density function of the Erlang
distribution, i.e.
$$f_{Y_1+\cdots+Y_n}(y) = \frac{\alpha^n}{(n-1)!}\,y^{n-1}e^{-\alpha y}, \qquad y \ge 0.$$
exercise 2.6 (Fault detector) As an engineer, you are required to design the
test performed by a fault detector for a “black-box” that produces a sequence of iid
binary random variables $\ldots, X_1, X_2, X_3, \ldots$. Previous experience shows that this “black box” has an a priori failure probability of $\frac{1}{1025}$. When the “black box” works
properly, pXi (1) = p. When it fails, the output symbols are equally likely to be 0
or 1. Your detector has to decide based on the observation of the past 16 symbols,
i.e. at time k the decision will be based on Xk−16 , . . . , Xk−1 .
exercise 2.7 (Multiple choice exam) You are taking a multiple choice exam.
Question number 5 allows for two possible answers. According to your first
impression, answer 1 is correct with probability 1/4 and answer 2 is correct with
probability 3/4. You would like to maximize your chance of giving the correct
answer and you decide to have a look at what your neighbors on the left and right
have to say. The neighbor on the left has answered ĤL = 1. He is an excellent
student who has a record of being correct 90% of the time when asked a binary
question. The neighbor on the right has answered ĤR = 2. He is a weaker student
who is correct 70% of the time.
(a) You decide to use your first impression as a prior and to consider ĤL and
ĤR as observations. Formulate the decision problem as a hypothesis testing
problem.
(b) What is your answer Ĥ?
exercise 2.8 (MAP decoding rule: Alternative derivation) Consider the binary
hypothesis testing problem where H takes values in {0, 1} with probabilities PH (0)
and PH (1). The conditional probability density function of the observation Y ∈ R
given H = i, i ∈ {0, 1} is given by fY |H (·|i). Let Ri be the decoding region for
hypothesis i, i.e. the set of y for which the decision is Ĥ = i, i ∈ {0, 1}.
(a) Find the decision rule that minimizes the probability of error. Hint: Write
down a short sample sequence (y1 , . . . , yk ) and determine its probability under
each hypothesis. Then generalize.
(b) Give a simple sufficient statistic for this decision. (For the purpose of this
question, a sufficient statistic is a function of y with the property that a
decoder that observes y can not achieve a smaller error probability than a
MAP decoder that observes this function of y.)
(c) Suppose that the observed sequence alternates between 0 and 1 except for one
string of ones of length s, i.e. the observed sequence y looks something like
exercise 2.10 (SIMO channel with Laplacian noise, exercise from [1]) One
of the two signals c0 = −1, c1 = 1 is transmitted over the channel shown in
Figure 2.20a. The two noise random variables Z1 and Z2 are statistically
independent of the transmitted signal and of each other. Their density functions are
$$f_{Z_1}(\alpha) = f_{Z_2}(\alpha) = \frac{1}{2}e^{-|\alpha|}.$$
[Figure 2.20: (a) the two-branch channel: X ∈ {c_0, c_1} is sent, Z_1 is added to form Y_1 and Z_2 is added to form Y_2; (b) the (y_1, y_2) plane with the point (1, 1) and two observation points labeled a and b.]
exercise 2.12 (Properties of the Q function) Prove properties (a) through (d)
of the Q function defined in Section 2.3. Hint: For property (d), multiply and divide
inside the integral by the integration variable and integrate by parts. By upper- and
lower-bounding the resulting integral, you will obtain the lower and upper bound.
[Figure 2.21: three sketches in the (x_1, x_2) plane.]
exercise 2.13 (16-PAM vs. 16-QAM) The two signal constellations in Figure
2.22 are used to communicate across an additive white Gaussian noise channel.
Let the noise variance be σ 2 . Each point represents a codeword ci for some i.
Assume each codeword is used with the same probability.
[Figure 2.22: a 16-PAM constellation on a line with adjacent spacing a, and a 16-QAM constellation (4 × 4 grid) in the (x_1, x_2) plane with adjacent spacing b.]
(a) For each signal constellation, compute the average probability of error Pe as
a function of the parameters a and b, respectively.
(b) For each signal constellation, compute the average energy per symbol E as a
function of the parameters a and b, respectively:
$$E = \sum_{i=1}^{16} P_H(i)\,\|c_i\|^2. \tag{2.39}$$
In the next chapter it will become clear in what sense E relates to the energy
of the transmitted signal (see Example 3.2 and the discussion that follows).
(c) Plot $P_e$ versus $\frac{E}{\sigma^2}$ for both signal constellations and comment.
exercise 2.14 (QPSK decision regions) Let H ∈ {0, 1, 2, 3} and assume that
when H = i you transmit the codeword ci shown in Figure 2.23. Under H = i, the
receiver observes Y = ci + Z.
Figure 2.23. The QPSK codewords c0, c1, c2, c3 in the (y1, y2) plane.
(a) Draw the decoding regions assuming that Z ∼ N (0, σ 2 I2 ) and that PH (i) =
1/4, i ∈ {0, 1, 2, 3}.
(b) Draw the decoding regions (qualitatively) assuming Z ∼ N (0, σ 2 I2 ) and
PH (0) = PH (2) > PH (1) = PH (3). Justify your answer.
(c) Assume again that PH(i) = 1/4, i ∈ {0, 1, 2, 3}, and that Z ∼ N(0, K), where K = diag(σ², 4σ²). How do you decode now?
exercise 2.15 (Antenna array) The following problem relates to the design
of multi-antenna systems. Consider the binary equiprobable hypothesis testing
problem:
H = 0 :  Y1 = A + Z1,   Y2 = A + Z2
H = 1 :  Y1 = −A + Z1,   Y2 = −A + Z2,
where A is a positive constant and Z1 ∼ N(0, σ1²) and Z2 ∼ N(0, σ2²) are independent of each other and of the hypothesis.
(a) Show that the decision rule that minimizes the probability of error (based on
the observables Y1 and Y2) can be stated as
σ2² y1 + σ1² y2 ≷ 0,
where we decide Ĥ = 0 when the left-hand side is nonnegative and Ĥ = 1 otherwise.
(b) Draw the decision regions in the (Y1 , Y2 ) plane for the special case where
σ1 = 2σ2 .
(c) Evaluate the probability of error for the optimal detector as a function of
σ1², σ2², and A.
Figure 2.24. Six-point constellation in the (x1, x2) plane: two rows of three points, with vertical spacing a and horizontal spacing b.
Figure 2.25. Channel with input X ∈ {c0, c1}, multiplicative factor A, additive noise Z, and output Y.
(a) Find the decision rule that the receiver should implement to minimize the
probability of error. Sketch the decision regions.
(b) Calculate the probability of error Pe , based on the above decision rule.
c0 = (1, 0)T
c1 = (−1, 0)T
c2 = (−1, 1)T .
(a) Show that the MAP decoder Ĥ(T (y)) that decides based on T (y) is equivalent
to the MAP decoder Ĥ(y) that operates based on y.
(b) Compute the probabilities Pr{Y = 0 | T(Y) = 0, H = 0} and Pr{Y = 0 | T(Y) = 0, H = 1}. Is it true that H → T(Y) → Y ?
(a) Show that when the above conditions are satisfied, a MAP decision depends
on the observable Y only through T (Y ). In other words, Y itself is not
necessary. Hint: Work directly with the definition of a MAP decision rule.
fY|Y∈B(y) = fY(y) 1{y ∈ B} / (∫_B fY(y) dy).     (2.41)
(a) Let the hypothesis be H ∈ H (of yet unspecified distribution) and let the
observable V ∈ V be related to H via an arbitrary but fixed channel PV |H .
Show that if V is not independent of H then there are distinct elements
i, j ∈ H and distinct elements k, l ∈ V such that
PV|H(k|i) > PV|H(k|j)  and  PV|H(l|i) < PV|H(l|j).     (2.42)
Hint: For every h ∈ H, Σ_{v∈V} PV|H(v|h) = 1.
(b) Under the condition of part (a), show that there is a distribution PH for
which the observable V affects the decision of a MAP decoder.
(c) Generalize to show that if the observables are U and V , and PU,V |H is fixed
so that H → U → V does not hold, then there is a distribution on H for
which V is not operationally irrelevant. Hint: Argue as in parts (a) and (b)
for the case U = u , where u is as described above.
exercise 2.24 (Antipodal signaling) Consider the signal constellation shown
in Figure 2.26.
Figure 2.26. Antipodal signal constellation: the codewords c0 and c1 in the (x1, x2) plane, with coordinates ±a.
Assume that the codewords c0 and c1 are used to communicate over the discrete-
time AWGN channel. More precisely:
H=0: Y = c0 + Z,
H=1: Y = c1 + Z,
where Z ∼ N (0, σ 2 I2 ). Let Y = (Y1 , Y2 )T .
(a) Argue that Y1 is not a sufficient statistic.
(b) Give a different signal constellation with two codewords c̃0 and c̃1 such that,
when used in the above communication setting, Y1 is a sufficient statistic.
exercise 2.25 (Is it a sufficient statistic?) Consider the following binary
hypothesis testing problem
H=0: Y = c0 + Z
H=1: Y = c1 + Z,
exercise 2.26 (Union bound) Let Z ∼ N (c, σ 2 I2 ) be a random vector that takes
values in R2 , where c = (2, 1)T . Find a non-trivial upper bound to the probability
that Z is in the shaded region of Figure 2.27.
Figure 2.27. The shaded region in the (z1, z2) plane; the point c = (2, 1)T is marked.
exercise 2.27 (QAM with erasure) Consider a QAM receiver that outputs a
special symbol δ (called erasure) whenever the observation falls in the shaded area
shown in Figure 2.28 and does minimum-distance decoding otherwise. (This is
neither a MAP nor an ML receiver.) Assume that c0 ∈ R2 is transmitted and
that Y = c0 + N is received where N ∼ N (0, σ 2 I2 ). Let P0i , i = 0, 1, 2, 3 be the
probability that the receiver outputs Ĥ = i and let P0δ be the probability that it
outputs δ. Determine P00 , P01 , P02 , P03 , and P0δ .
Figure 2.28. The QAM codewords c0, c1, c2, c3 in the (y1, y2) plane; the parameters b and b − a determine the shaded (erasure) region.
Comment: If we choose b − a large enough, we can make sure that the probability
of error is very small (we say that an error occurred if Ĥ = i for some i ∈ {0, 1, 2, 3}
and Ĥ ≠ H). When Ĥ = δ, the receiver can ask for a retransmission of H. This
requires a feedback channel from the receiver to the transmitter. In most practical
applications, such a feedback channel is available.
exercise 2.28 (Repeat codes and Bhattacharyya bound) Consider two equally
likely hypotheses. Under hypothesis H = 0, the transmitter sends c0 = (1, . . . , 1)T
and under H = 1 it sends c1 = (−1, . . . , −1)T , both of length n. The channel
model is AWGN with variance σ 2 in each component. Recall that the probability
of error for an ML receiver that observes the channel output Y ∈ Rn is
Pe = Q( √n / σ ).
Suppose now that the decoder has access only to the sign of Yi, 1 ≤ i ≤ n, i.e. it
observes W = (W1, . . . , Wn)T with Wi = sign(Yi).
(a) Determine the MAP decision rule based on the observable W . Give a simple
sufficient statistic.
(b) Find the expression for the probability of error P̃e of the MAP decoder that
observes W . You may assume that n is odd.
(c) Your answer to (b) contains a sum that cannot be expressed in closed form.
Express the Bhattacharyya bound on P̃e .
(d) For n = 1, 3, 5, 7, find the numerical values of Pe , P̃e , and the Bhattacharyya
bound on P̃e .
H=0: Y ∼ fY |H (y|0)
H=1: Y ∼ fY |H (y|1).
(b) Prove that for a, b ≥ 0, min(a, b) ≤ √(ab) ≤ (a + b)/2. Use this to prove the tighter
version of the Bhattacharyya bound, i.e.
Pe ≤ (1/2) ∫_y √( fY|H(y|0) fY|H(y|1) ) dy.
(c) Compare the above bound to (2.19) when there are two equiprobable hypotheses. How do you explain the improvement by a factor of 1/2?
Let
Bi,j = { y : PH(j) fY|H(y|j) ≥ PH(i) fY|H(y|i) }   for j < i,
Bi,j = { y : PH(j) fY|H(y|j) > PH(i) fY|H(y|i) }   for j > i.
(a) Verify that Bi,j = B^c_{j,i}.
(b) Given H = i, the detector will make an error if and only if y ∈ ∪_{j: j≠i} Bi,j.
The probability of error is Pe = Σ_{i=0}^{M−1} Pe(i) PH(i). Show that:
Pe ≤ Σ_{i=0}^{M−1} Σ_{j>i} [ Pr{Y ∈ Bi,j | H = i} PH(i) + Pr{Y ∈ Bj,i | H = j} PH(j) ]
   = Σ_{i=0}^{M−1} Σ_{j>i} [ ∫_{Bi,j} fY|H(y|i) PH(i) dy + ∫_{B^c_{i,j}} fY|H(y|j) PH(j) dy ]
   = Σ_{i=0}^{M−1} Σ_{j>i} ∫_y min{ fY|H(y|i) PH(i), fY|H(y|j) PH(j) } dy.
So far, we have come across two DMCs, namely the BSC (binary symmetric
channel) and the BEC (binary erasure channel). The purpose of this problem is to
see that for DMCs, the Bhattacharyya bound takes a simple form, in particular
when the channel input alphabet X contains only two letters.
(a) Consider a transmitter that sends c0 ∈ X n and c1 ∈ X n with equal probability.
Justify the following chain of (in)equalities.
Pe ≤(a) Σ_y √( PY|X(y|c0) PY|X(y|c1) )
   =(b) Σ_y √( Π_{i=1}^{n} PY|X(yi|c0,i) PY|X(yi|c1,i) )
   =(c) Σ_{y1,...,yn} Π_{i=1}^{n} √( PY|X(yi|c0,i) PY|X(yi|c1,i) )
   =(d) [ Σ_{y1} √( PY|X(y1|c0,1) PY|X(y1|c1,1) ) ] ⋯ [ Σ_{yn} √( PY|X(yn|c0,n) PY|X(yn|c1,n) ) ]
   =(e) Π_{i=1}^{n} Σ_y √( PY|X(y|c0,i) PY|X(y|c1,i) )
   =(f) Π_{a∈X, b∈X, a≠b} [ Σ_y √( PY|X(y|a) PY|X(y|b) ) ]^{n(a,b)},
⁸ Here we are assuming that the output alphabet is discrete. Otherwise we use densities instead of probabilities.
Notice that z depends only on the channel, whereas its exponent depends only
on c0 and c1 .
(c) Evaluate the channel parameter z for the following.
(i) The binary input Gaussian channel described by the densities
fY|X(·|0) = N(−√E, σ²),   fY|X(·|1) = N(√E, σ²).
(ii) The binary symmetric channel (BSC) with X = Y = {±1} and transition
probabilities described by
PY|X(y|x) = 1 − δ if y = x, and δ otherwise.
(iii) The binary erasure channel (BEC) with X = {±1}, Y = {−1, E, 1}, and
transition probabilities given by
PY|X(y|x) = 1 − δ if y = x;  δ if y = E;  0 otherwise.
exercise 2.33 (Bhattacharyya bound and Laplacian noise) Assuming two
equiprobable hypotheses, evaluate the Bhattacharyya bound for the following
(Laplacian noise) setting:
H=0: Y = −a + Z
H=1: Y = a + Z,
where a ∈ R+ is a constant and Z is a random variable of probability density
function fZ(z) = (1/2) exp(−|z|), z ∈ R.
exercise 2.34 (Dice tossing) You have two dice, one fair and one biased. A
friend tells you that the biased die produces a 6 with probability 1/4, and produces
the other values with uniform probabilities. You do not know a priori which of the
two is the fair die. You choose one of the two dice uniformly at random, and
perform n consecutive tosses. Let Yi ∈ {1, . . . , 6} be the random variable modeling
the ith experiment and let Y = (Y1 , · · ·, Yn ).
(a) Based on the observable Y , find the decision rule to determine whether the
die you have chosen is biased. Your rule should maximize the probability that
the decision is correct.
exercise 2.35 (ML receiver and union bound for orthogonal signaling) Let
H ∈ {1, . . . , m} be uniformly distributed and consider the communication problem
described by:
H=i: Y = ci + Z, Z ∼ N (0, σ 2 Im ),
where c1 , . . . , cm , ci ∈ Rm , is a set of constant-energy orthogonal codewords.
Without loss of generality we assume
ci = √E ei,
where ei is the ith unit vector in Rm , i.e. the vector that contains 1 at position i
and 0 elsewhere, and E is some positive constant.
(a) Describe the maximum-likelihood decision rule.
(b) Find the distances ‖ci − cj‖, i ≠ j.
(c) Using the union bound and the Q function, upper bound the probability Pe (i)
that the decision is incorrect when H = i.
Miscellaneous exercises
(a) On the same graph, plot the two possible output probability density functions.
Indicate, qualitatively, the decision regions.
(b) Determine the optimal receiver in terms of σ0 and σ1 .
(c) Write an expression for the error probability Pe as a function of σ0 and σ1 .
(a) Assume that the receiver observes Y and wants to estimate both H1 and H2 .
Let Ĥ1 and Ĥ2 be the estimates. What is the generic form of the optimal
decision rule?
(b) For the specific set of signals given, what is the set of possible observations,
assuming that σ 2 = 0? Label these signals by the corresponding (joint)
hypotheses.
(c) Assuming now that σ 2 > 0, draw the optimal decision regions.
(d) What is the resulting probability of correct decision? That is, determine the
probability P r{Ĥ1 = H1 , Ĥ2 = H2 }.
(e) Finally, assume that we are interested in only the transmission of user two.
Describe the receiver that minimizes the error probability and determine
P r{Ĥ2 = H2 }.
(a) Write the optimal decision rule as a function of the parameter σ 2 and the
received signal Y .
(b) For the value σ 2 = e4 compute the decision regions.
(c) Give expressions as simple as possible for the error probabilities Pe (0) and
Pe (1).
Figure 2.29. The codewords c0 = (0, 1)T, c1 = (1, 0)T, c2 = (0, −1)T, c3 = (−1, 0)T in the (x1, x2) plane.
the power is proportional to wi²(t), and the energy is proportional to ∫ wi²(t) dt. In both cases the energy is proportional to
‖wi‖² = ∫ |wi(t)|² dt.
As in the above example, the squared norm of a signal wi (t) is generally associ-
ated with the signal’s energy. It is quite natural to assume that we communicate
via finite-energy signals. This is the first restriction on W. A linear combination
of a finite number of finite-energy signals is itself a finite-energy signal. Hence,
every vector of the vector space V spanned by W is a square-integrable function.
The second requirement is that if v ∈ V has a vanishing norm, then v(t) vanishes
for all t. Together, these requirements imply that V is an inner product space of
square-integrable functions. (See Example 2.39.)
All signals that represent real-world communication signals are finite-energy and
continuous. Hence the vector space they span is always an inner product space.
This is a good place to mention the various reasons we are interested in the
signal’s energy or, somewhat equivalently, in the signal’s power, which is the energy
per second. First, for safety and for spectrum reusability, there are regulations that
limit the power of a transmitted signal. Second, for mobile devices, the energy of
the transmitted signal comes from the battery: a battery charge lasts longer if we
decrease the signal’s power. Third, with no limitation to the signal’s power, we can
transmit across a continuous-time AWGN channel at any desired rate, regardless
of the available bandwidth and of the target error probability. Hence, it would be
unfair to compare signaling methods that do not use the same power.
For now, we assume that W is given to us. The problem of choosing a suitable
set W of signals will be studied in subsequent chapters.
The highlight of the chapter is the power of abstraction. The receiver design
for the discrete-time AWGN channel relied on geometrical ideas that can be
Figure 3.2. Decomposition of the transmitter into an encoder (message i ∈ H to codeword ci) followed by a waveform former (ci to wi(t)), and of the receiver into an n-tuple former (R(t) to Y) followed by a decoder (Y to Ĥ); the channel adds white Gaussian noise N(t).
formulated whenever we are in an inner product space. We will use the same
ideas for the continuous-time AWGN channel.
The main result is a decomposition of the sender and the receiver into the
building blocks shown in Figure 3.2. We will see that, without loss of generality,
we can (and should) think of the transmitter as consisting of an encoder that
maps the message i ∈ H into an n-tuple ci , as in the previous chapter, followed
by a waveform former that maps ci into a waveform wi (t). Similarly, we will see
that the receiver can consist of an n-tuple former that takes the channel output
and produces an n-tuple Y . The behavior from the waveform former input to the
n-tuple former output is that of the discrete-time AWGN channel considered in the
previous chapter. Hence we know already what the decoder of Figure 3.2 should
do with the n-tuple former output.
In this chapter (like in the previous one) the vectors (functions) are real-valued.
Hence, we could use the formalism that applies to real inner product spaces. Yet, in
preparation of Chapter 7, we use the formalism for complex inner product spaces.
This mainly concerns the standard inner product between functions, where we
write ⟨a, b⟩ = ∫ a(t) b*(t) dt instead of ⟨a, b⟩ = ∫ a(t) b(t) dt. A similar comment
applies to the definition of covariance, where for zero-mean random variables we
use cov(Zi, Zj) = E[Zi Zj*] instead of cov(Zi, Zj) = E[Zi Zj].
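For numerical work, these inner products are typically approximated from samples. The following Octave/MATLAB sketch (the pulses and the sampling step are arbitrary illustrative choices, not taken from the text) approximates ⟨a, b⟩ = ∫ a(t) b*(t) dt and ‖a‖² by Riemann sums.
% Minimal sketch: numerical inner product and squared norm of sampled
% waveforms. The pulses and the step dt are arbitrary illustrative choices.
dt = 1e-3;                               % sampling step
t  = 0:dt:1-dt;                          % time axis covering [0, 1)
a  = cos(2*pi*3*t);                      % example waveform a(t)
b  = sin(2*pi*3*t);                      % example waveform b(t)
innerProduct = sum(a .* conj(b)) * dt;   % approximates the integral of a(t) b*(t)
squaredNormA = sum(abs(a).^2) * dt;      % approximates the squared norm of a(t)
% Here innerProduct is (numerically) 0 and squaredNormA is 1/2.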
requires measure theory if done rigorously. The good news is that a mathematical
model of N (t) is not needed because N (t) is not observable through physical
experiments. (The reason will become clear shortly.) Our approach is to model
what we can actually measure. We assume a working knowledge of Gaussian
random vectors (reviewed in Appendix 2.10).
A receiver is an electrical instrument that connects to the channel output via
a cable. For instance, in wireless communication, we might consider the channel
output to be the output of the receiving antenna; in which case, the cable is the
one that connects the antenna to the receiver. A cable is a linear time-invariant
filter. Hence, we can assume that all the observations made by the receiver are
through some linear time-invariant filter.
So if N (t) represents the noise introduced by the channel, the receiver sees, at
best, a filtered version Z(t) of N (t). We model Z(t) as a stochastic process and,
as such, it is described by the statistic of Z(t1 ), Z(t2 ), . . . , Z(tk ) for any positive
integer k and any finite collection of sampling times t1 , t2 , . . . , tk .
If the filter impulse response is h(t), then linear system theory suggests that
Z(t) = ∫ N(α) h(t − α) dα
and
Z(ti) = ∫ N(α) h(ti − α) dα,     (3.1)
but the validity of these expressions needs to be justified, because N (t) is not
a deterministic signal. It is possible to define N (t) as a stochastic process and
prove that the (Lebesgue) integral in (3.1) is well defined; but we avoid this path
which, as already mentioned, requires measure theory. In this text, equation (3.1) is
shorthand for the statement “Z(ti ) is the random variable that models the output
at time ti of a linear time-invariant filter of impulse response h(t) fed with white
Gaussian noise N (t)”. Notice that h(ti − α) is a function of α that we can rename
as gi (α). Now we are in the position to define white Gaussian noise.
definition 3.4 N(t) is white Gaussian noise of power spectral density N0/2 if,
for any finite collection of real-valued L2 functions g1(α), . . . , gk(α), the random variables
Zi = ∫ N(α) gi(α) dα,   i = 1, 2, . . . , k,     (3.2)
are zero-mean, jointly Gaussian, with covariance
cov(Zi, Zj) = (N0/2) ∫ gi(α) gj*(α) dα.     (3.3)
If we are not evaluating the integral in (3.2), how do we know if N (t) is white
Gaussian noise? In this text, when applicable, we say that N (t) is white Gaussian
noise, in which case we can use (3.3) as we see fit. In the real world, often we know
enough about the channel to know whether or not its noise can be modeled as
white and Gaussian. This knowledge could come from a mathematical model of
the channel. Another possibility is that we perform measurements and verify that
they behave according to Definition 3.4.
Owing to its importance and frequent use, we formulate the following special
case as a lemma. It is the most important fact that should be remembered about
white Gaussian noise.
lemma 3.5 Let {g1 (t), . . . , gk (t)} be an orthonormal set of real-valued functions.
Then Z = (Z1 , . . . , Zk )T , with Zi defined as in (3.2), is a zero-mean Gaussian
random vector with iid components of variance σ² = N0/2.
Proof The proof is a straightforward application of the definitions.
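The lemma can also be checked by simulation; the sketch below is only an illustration under an explicit discretization (all numerical values are arbitrary). White noise is approximated on a grid of step dt by iid Gaussian samples of variance (N0/2)/dt and projected onto two orthonormal rectangular pulses; the empirical variances should be close to N0/2 and the empirical covariance close to 0.
% Minimal sketch: projections of discretized white Gaussian noise onto an
% orthonormal set, as in Lemma 3.5. All numerical values are arbitrary.
N0 = 2; dt = 1e-3; T = 1;
t  = 0:dt:T-dt;
g1 = (t <  T/2) * sqrt(2/T);             % two orthonormal rectangular pulses
g2 = (t >= T/2) * sqrt(2/T);
numTrials = 10000;
Z = zeros(numTrials, 2);
for k = 1:numTrials
  N = sqrt(N0/(2*dt)) * randn(size(t));  % white-noise approximation on the grid
  Z(k,1) = sum(N .* g1) * dt;            % Z1, the projection onto g1
  Z(k,2) = sum(N .* g2) * dt;            % Z2, the projection onto g2
end
var(Z)                                   % both entries close to N0/2 = 1
mean(Z(:,1) .* Z(:,2))                   % empirical covariance, close to 0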
example 3.6 Consider two bandpass filters that have non-overlapping frequency
responses but are otherwise identical, i.e. if we frequency-translate the frequency
response of one filter by the proper amount we obtain the frequency response of the
other filter. By Parseval’s relationship, the corresponding impulse responses are
orthogonal to one another. If we feed the two filters with white Gaussian noise and
sample their output (even at different times), we obtain two iid Gaussian random
variables. We could extend the experiment (in the obvious way) to n filters of non-
overlapping frequency responses, and would obtain n random variables that are
iid – hence of identical variance. This explains why the noise is called white: like
for white light, white Gaussian noise has its power equally distributed among all
frequencies.
Are there other types of noise? Yes, there are. For instance, there are natural
and man-made electromagnetic noises. The noise produced by electric motors and
that produced by power lines are examples of man-made noise. Man-made noise is
typically neither white nor Gaussian. The good news is that a careful design should
be able to ensure that the receiver picks up a negligible amount of man-made noise
(if any). Natural noise is unavoidable. Every conductor (resistor) produces thermal
(Johnson) noise. (See Appendix 3.10.) The assumption that thermal noise is white
and Gaussian is an excellent one. Other examples of natural noise are solar noise
and cosmic noise. A receiving antenna picks up these noises, the intensity of which
depends on the antenna’s gain and pointing direction. A current in a conductor
gives rise to shot noise. Shot noise originates from the discrete nature of the electric
charges. Wikipedia is a good reference to learn more about various noise sources.
where k is an arbitrary positive integer and g1 (t), . . . , gk (t) are arbitrary finite-
energy waveforms. The complex conjugate operator “∗ ” on gi∗ (α) is superfluous for
real-valued signals but, as we will see in Chapter 7, the baseband representation
of a passband impulse response is complex-valued.
Notice that we assume that we can perform an arbitrarily large but finite number
k of measurements. By disallowing infinite measurements we avoid distracting
mathematical subtleties without losing anything of engineering relevance.
It is important to point out that the kind of measurements we consider is quite
general. For instance, we can pass R(t) through an ideal lowpass filter of cutoff
frequency B for some huge B (say 10^10 Hz) and collect an arbitrarily large number
of samples taken every 1/(2B) seconds so as to fulfill the sampling theorem (Theorem
5.2). In fact, by choosing gi(t) = h(i/(2B) − t), where h(t) is the impulse response of
the lowpass filter, Vi becomes the filter output sampled at time t = i/(2B). As stated
by the sampling theorem, from these samples we can reconstruct the filter output.
If R(t) consists of a signal plus noise, and the signal is bandlimited to less than B
Hz, then from the samples we can reconstruct the signal plus the portion of the
noise that has frequency components in [−B, B].
Let V be the inner product space spanned by the elements of the signal set W
and let {ψ1 (t), . . . , ψn (t)} be an arbitrary orthonormal basis for V. We claim that
the n-tuple Y = (Y1 , . . . , Yn )T with ith component
Yi = ∫ R(α) ψi*(α) dα
It should be clear that we can recover V from Y and U . This is so because, from
the projections onto a basis, we can obtain the projection onto any waveform in
the span of the basis. Mathematically,
Vi = ∫_{−∞}^{∞} R(α) gi*(α) dα
   = ∫_{−∞}^{∞} R(α) [ Σ_{j=1}^{n} ξ_{i,j} ψj(α) + Σ_{j=1}^{ñ} ξ_{i,j+n} φj(α) ]* dα
   = Σ_{j=1}^{n} ξ*_{i,j} Yj + Σ_{j=1}^{ñ} ξ*_{i,j+n} Uj,
where ξi,1 , . . . , ξi,n+ñ is the unique set of coefficients in the orthonormal expansion
of gi (t) with respect to the basis {ψ1 (t), . . . , ψn (t), φ1 (t), φ2 (t), . . . , φñ (t)}.
Hence we can consider (Y, U ) as the observable and it suffices to show that Y is
a sufficient statistic. Note that when H = i,
Yj = ∫ R(α) ψj*(α) dα = ∫ [wi(α) + N(α)] ψj*(α) dα = ci,j + Z|V,j,
where ci,j is the jth component of the n-tuple of coefficients ci that represents
the waveform wi(t) with respect to the chosen orthonormal basis, and Z|V,j is a
zero-mean Gaussian random variable of variance N0/2. The notation Z|V,j is meant
to remind us that this random variable is obtained by “projecting” the noise onto
the jth element of the chosen orthonormal basis for V. Using n-tuple notation, we
obtain the following statistic
H = i :   Y = ci + Z|V.
Similarly, Uj = ∫ R(α) φj*(α) dα = ∫ [wi(α) + N(α)] φj*(α) dα = Z⊥V,j,
where we used the fact that wi(t) is in the subspace spanned by {ψ1(t), . . . , ψn(t)}
and therefore it is orthogonal to φj(t) for each j = 1, 2, . . . , ñ. The notation Z⊥V,j
reminds us that this random variable is obtained by “projecting” the noise onto
the jth element of an orthonormal basis that is orthogonal to V. Using n-tuple
notation, we obtain
H = i, U = Z⊥V ,
where Z⊥V ∼ N(0, (N0/2) Iñ). Furthermore, Z|V and Z⊥V are independent of each
other and of H. The conditional density of Y, U given H = i is then the product fZ|V(y − ci) fZ⊥V(u).
Figure 3.3. The vector of measurements (Y T , U T )T describes the projection
of the received signal R(t) onto U . The vector Y describes the projection of
R(t) onto V.
general (or else we would not bother sending cib ), it follows that the statistic of
Yb depends on i even if we know the realization of Ya .
Figure 3.4. The waveform former builds wi(t) = Σ_j ci,j ψj(t) from the encoder output (ci,1, . . . , ci,n) by means of multipliers and a summer; the channel adds white Gaussian noise N(t); the n-tuple former recovers Yj by multiplying R(t) by ψj*(t) and integrating (a bank of correlators); the decoder outputs the message estimate.
The decomposition of Figure 3.4 is consistent with the layering philosophy of the
OSI model (Section 1.1), in the sense that the encoder and decoder are designed as
if they were talking to each other directly via a discrete-time AWGN channel. In
reality, the channel seen by the encoder/decoder pair is the result of the “service”
provided by the waveform former and the n-tuple former.
The above decomposition is useful for the system conception, for the perfor-
mance analysis, as well as for the system implementation; but of course, we always
have the option of implementing the transmitter as a straight map from the
message set H to the waveform set W without passing through the codebook C.
Although such a straight map is a possibility and makes sense for relatively
unsophisticated systems, the decomposition into an encoder and a waveform former
is standard for modern designs. In fact, information theory, as well as coding
theory, devote much attention to the study of encoder/decoder pairs.
The following example is meant to make two important points that apply when
we communicate across the continuous-time AWGN channel and make an ML
decision. First, sets of continuous-time signals may “look” very different yet they
may share the same codebook, which is sufficient to guarantee that the error
probability be the same; second, for binary constellations, what matters for the
error probability is the distance between the two signals and nothing else.
example 3.7 (Orthogonal signals) The following four choices of W = {w0 (t),
w1(t)} look very different yet, upon an appropriate choice of orthonormal basis,
they share the same codebook C = {c0, c1} with c0 = (√E, 0)T and c1 = (0, √E)T.
To see this, it suffices to verify that ⟨wi, wj⟩ equals E if i = j and equals 0 otherwise.
Hence the two signals are orthogonal to each other and they have squared norm E.
Figure 3.5 shows the signals and the associated codewords.
Figure 3.5. (a) W in the signal space (axes ψ1, ψ2). (b) C in R² (axes x1, x2).
An advantage of sinc pulses is that they have a finite support in the frequency
domain. By taking their Fourier transform, we quickly see that they are orthogonal
to each other. See Appendix 5.10 for details.
Choice 4 (Spread spectrum):
w0(t) = √E ψ1(t),  with ψ1(t) = √(1/T) Σ_{j=1}^{n} s_{0,j} 1{ t − jT/n ∈ [0, T/n) },
w1(t) = √E ψ2(t),  with ψ2(t) = √(1/T) Σ_{j=1}^{n} s_{1,j} 1{ t − jT/n ∈ [0, T/n) },
where (s0,1 , . . . , s0,n ) ∈ {±1}n and (s1,1 , . . . , s1,n ) ∈ {±1}n are orthogonal. This
signaling method is called spread spectrum. It is not hard to show that it uses much
bandwidth but it has an inherent robustness with respect to interfering (non-white
and possibly non-Gaussian) signals.
Now assume that we use one of the above choices to communicate across a
continuous-time AWGN channel and that the receiver implements an ML decision
rule. Since the codebook C is the same in all cases, the decoder and the error
probability will be identical no matter which choice we make.
Computing the error probability is particularly easy when there are only two
codewords. From the previous chapter we know that Pe = Q( ‖c1 − c0‖/(2σ) ), where σ² = N0/2. The distance
‖c1 − c0‖ := √( Σ_{i=1}^{2} (c1,i − c0,i)² ) = √(E + E) = √(2E)
which requires neither an orthonormal basis nor the codebook. Yet another
alternative is to use Pythagoras’ theorem. As we know already that our sig-
nals have squared norm E and are orthogonal to each other, their distance is
√( ‖w0‖² + ‖w1‖² ) = √(2E). Inserting, we obtain
Pe = Q( √(E/N0) ).
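As a numerical illustration (with arbitrary values of E and N0, not tied to any particular system), the expression Pe = Q(√(E/N0)) can be checked with a short Monte Carlo sketch at the codebook level:
% Minimal sketch: Pe = Q(sqrt(E/N0)) for two orthogonal codewords, checked
% by simulation. E and N0 are arbitrary example values.
E = 4; N0 = 2; sigma = sqrt(N0/2);
Qfun = @(x) 0.5*erfc(x/sqrt(2));          % the Q function via erfc
PeTheory = Qfun(sqrt(E/N0));
c0 = [sqrt(E); 0]; c1 = [0; sqrt(E)];     % the two orthogonal codewords
numTrials = 100000; errors = 0;
for k = 1:numTrials
  y = c0 + sigma*randn(2,1);              % H = 0: send c0 over discrete-time AWGN
  if norm(y - c1) < norm(y - c0)          % ML decision: closest codeword
    errors = errors + 1;
  end
end
[PeTheory, errors/numTrials]              % the two values should agree closely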
example 3.9 (Single-shot PSK) Let T and fc be positive numbers and let m be
a positive integer. We speak of single-shot phase-shift keying when the signal set
consists of signals of the form
2E 2π
wi (t) = cos 2πfc t + i 1 t ∈ [0, T ] , i = 0, 1, . . . , m − 1. (3.5)
T m
For mathematical convenience, we assume that 2fc T is an integer, so that
wi 2 = E for all i. (When 2fc T is an integer, wi2 (t) has an integer number
of periods in a length-T interval. This ensures that all wi (t) have the same
norm, regardless of the initial phase. In practice, fc T is very large, which
implies that there are many periods in an interval of length T , in which case
the energy difference due to an incomplete period is negligible.) The signal space
representation can be obtained by using the trigonometric identity cos(α + β) =
cos(α) cos(β) − sin(α) sin(β) to rewrite (3.5) as
wi (t) = ci,1 ψ1 (t) + ci,2 ψ2 (t),
where
ci,1 = √E cos(2πi/m),   ψ1(t) = √(2/T) cos(2πfc t) 1{t ∈ [0, T]},
ci,2 = √E sin(2πi/m),   ψ2(t) = −√(2/T) sin(2πfc t) 1{t ∈ [0, T]}.
The reader should verify that ψ1 (t) and ψ2 (t) are normalized functions and,
because 2fc T is an integer, they are orthogonal to each other. This can easily be
verified using the trigonometric identity sin α cos β = (1/2)[sin(α + β) + sin(α − β)].
Hence the codeword associated to wi (t) is
ci = √E ( cos(2πi/m), sin(2πi/m) )T.
In Example 2.15, we have already studied this constellation for the discrete-time
AWGN channel.
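The codewords of Example 3.9 are straightforward to generate numerically. The sketch below (with arbitrarily chosen m and E) builds the m-PSK codebook and checks that every codeword has squared norm E.
% Minimal sketch: the m-PSK codebook of Example 3.9 (m and E arbitrary).
m = 8; E = 1;
i = 0:m-1;
C = sqrt(E) * [cos(2*pi*i/m); sin(2*pi*i/m)];  % column i+1 is the codeword c_i
sum(C.^2, 1)                                    % each column has squared norm E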
example 3.10 (Single-shot QAM) Let T and fc be positive numbers such that
2fc T is an integer, let m be an even positive integer, and define
ψ1(t) = √(2/T) cos(2πfc t) 1{t ∈ [0, T]}
ψ2(t) = √(2/T) sin(2πfc t) 1{t ∈ [0, T]}.
(We have already established in Example 3.9 that ψ1 (t) and ψ2 (t) are orthogonal
to each other and have unit norm.) If the components of ci = (ci,1, ci,2)T, i =
0, . . . , m² − 1, take values in some discrete subset of the form {±a, ±3a, ±5a, . . . ,
±(m − 1)a} for some positive a, then
wi (t) = ci,1 ψ1 (t) + ci,2 ψ2 (t),
The signaling methods discussed in this section are the building blocks of many
communication systems.
Y = (Y1 , Y2 , . . . , Yn )T , where
Yi = ⟨R, ψi⟩,   i = 1, . . . , n.
We now face a hypothesis testing problem with prior PH (i), i ∈ H, and observ-
able Y distributed according to
fY|H(y|i) = 1/(2πσ²)^{n/2} exp( −‖y − ci‖²/(2σ²) ),
Figure 3.6. Two ways to implement ∫ r(t) b*(t) dt, namely via a correlator (a) and via a matched filter (b) with the output sampled at time T.
The second is obtained from the first by using ‖y − ci‖² = ‖y‖² − 2ℜ{⟨y, ci⟩} + ‖ci‖².
Once we drop the ℜ{·} operator (the vectors are real-valued), remove the constant
‖y‖², and scale by −1/2, we obtain (ii).
Rules (ii) and (iii) are equivalent since ∫ r(t) wi*(t) dt = ∫ r(t) ( Σ_j ci,j ψj(t) )* dt = Σ_j yj c*_{i,j} = ⟨y, ci⟩.
The MAP rules (i)–(iii) require performing operations of the kind
∫ r(t) b*(t) dt,     (3.6)
where b(t) is some function (ψj (t) or wj (t)). There are two ways to implement
(3.6). The obvious way, shown in Figure 3.6a is by means of a so-called correlator .
A correlator is a device that multiplies and integrates two input signals. The other
way to implement (3.6) is via a so-called matched filter . This is a filter that takes
r(t) as the input and has h(t) = b∗ (T − t) as impulse response (Figure 3.6b), where
T is an arbitrary design parameter selected in such a way as to make h(t) a causal
impulse response. The matched filter output y(t) is then
y(t) = ∫ r(α) h(t − α) dα = ∫ r(α) b*(T + α − t) dα,
and at t = T it is
y(T) = ∫ r(α) b*(α) dα.
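For sampled waveforms this identity is easy to verify numerically. The sketch below uses an arbitrary pulse b(t) and an arbitrary "received" waveform r(t) (both illustrative choices); it computes ∫ r(t) b*(t) dt once with a correlator and once by sampling the matched filter output at t = T.
% Minimal sketch: correlator versus matched filter sampled at t = T.
% The pulses below are arbitrary illustrative choices on [0, T].
dt = 1e-3; T = 1;
t  = 0:dt:T-dt;
b  = sin(2*pi*t);                      % pulse the filter is matched to
r  = b + 0.3*randn(size(t));           % an arbitrary noisy received waveform
correlator = sum(r .* conj(b)) * dt;   % integral of r(t) b*(t)
h  = conj(fliplr(b));                  % matched filter: h(t) = b*(T - t)
y  = conv(r, h) * dt;                  % matched filter output on the grid
matchedAtT = y(length(t));             % output sample corresponding to t = T
[correlator, matchedAtT]               % the two numbers coincide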
Figure 3.7. A pulse b(t) supported on [0, 3T], together with the matched filter impulse responses h0(t) and h3T(t).
Figure 3.8. Matched filter response (right) to the input on the left.
plots on the right of the figure show the matched filter response y(t) to the input
on the left. Indeed, at t = T we have a or −a. At any other time we have b or −b,
for some b such that 0 ≤ b ≤ a. This, and the fact that the noise variance does not
depend on the sampling time, implies that t = T is the sampling time at which the
error probability is minimized.
Figure 3.9 shows the block diagrams for the implementation of the three MAP
rules (i)–(iii). In each case the front end has been implemented by using matched
filters, but correlators could also be used, as in Figure 3.4.
Whether we use matched filters or correlators depends on the technology and on
the waveforms. Implementing a correlator in analog technology is costly. But, if the
processing is done by a microprocessor that has enough computational power, then
a correlation can be done at no additional hardware cost. We would be inclined to
use matched filters if there were easy-to-implement filters of the desired impulse
response. In Exercise 3.10 of this chapter, we give an example where the matched
filters can be implemented with passive components.
Figure 3.9. Block diagrams of a MAP receiver for the waveform AWGN channel, with y = (y1, . . . , yn)T and qj = −‖wj‖²/2 + (N0/2) ln PH(j). The dashed boxes can alternatively be implemented via correlators.
But the out-of-band noise increases the chance that the electronic circuits – up to
and including the n-tuple former – saturate, i.e. that the amplitude of the noise
exceeds the range that can be tolerated by the circuits.
The typical next stage is the so-called automatic gain control (AGC) amplifier,
designed to bring the signal’s amplitude into the desired range. Hence the AGC
amplifier introduces a scaling factor that depends on the strength of the input
signal.
For the rest of this text, we ignore hardware imperfections. Therefore, we can
also ignore the presence of the low-noise amplifier, of the noise-reduction filter, and
of the automatic gain control amplifier. If the channel scales the signal by a factor
α, the receiver front end can compensate by scaling the received signal by α−1 , but
the noise is also scaled by the same factor. This explains why, in evaluating the error
probability associated to a signaling scheme, we often consider channel models that
only add noise. In such cases, the scaling factor α−1 is implicitly accounted for
in the noise parameter N0 /2. An example of how to determine N0 /2 is given in
Appendix 3.11, where we work out a case study based on satellite communication.
Propagation delay and clock misalignment Propagation delay refers to the time
it takes a signal to reach a receiver. If the signal set is W = {w0 (t), w1 (t), . . . ,
wm−1 (t)} and the propagation delay is τ , then for the receiver it is as if the signal
set were W̃ = {w0 (t − τ ), w1 (t − τ ), . . . , wm−1 (t − τ )}. The common assumption is
that the receiver does not know τ when the communication starts. For instance,
in wireless communication, a receiver has no way to know that the propagation
delay has changed because the transmitter has moved while it was turned off. It
is the responsibility of the receiver to adapt to the propagation delay. We come to
the same conclusion when we consider the fact that the clocks of different devices
are often not synchronized. If the clock of the receiver reads t − τ when that of
the transmitter reads t then, once again, for the receiver, the signal set is W̃ for
some unknown τ . Accounting for the unknown τ at the receiver goes under the
general name of clock synchronization. For reasons that will become clear, the
clock synchronization problem decomposes into the symbol synchronization and
into the phase synchronization problems, discussed in Sections 5.7 and 7.5. Until
then and unless otherwise specified, we assume that there is no propagation delay
and that all clocks are synchronized.
response h(t). Owing to the channel linearity, the output due to wi (t) at the input
is, once again, R(t) = ∫ wi(t − τ) h(τ) dτ plus noise.
The possibilities we have to cope with the channel filtering depend on whether
the channel impulse response is known to the receiver alone, to both the transmit-
ter and the receiver, or to neither. It is often realistic to assume that the receiver
can measure the channel impulse response. The receiver can then communicate it
to the transmitter via the reversed communication link (if it exists). Hence it is
hardly the case that only the transmitter knows the channel impulse response.
If the transmitter uses the signal set W = {w0 (t), w1 (t), . . . , wm−1 (t)} and the
receiver knows h(t), from the receiver’s point of view, the signal set is W̃ with the
ith signal being w̃i(t) = (wi ⋆ h)(t) and the channel just adds white Gaussian noise.
This is the familiar case. Realistically, the receiver knows at best an estimate h̃(t)
of h(t) and uses it as the actual channel impulse response.
The most challenging situation occurs when the receiver does not know and
cannot estimate h(t). This is a realistic assumption in bursty communication, when
a burst is too short for the receiver to estimate h(t) and the impulse response
changes from one burst to the next.
The most favorable situation occurs when both the receiver and the transmitter
know h(t) or an estimate thereof. Typically it is the receiver that estimates the
channel impulse response and communicates it to the transmitter. This requires
two-way communication, which is typically available. In this case, the transmitter
can adapt the signal constellation to the channel characteristic. Arguably, the
best strategy is the so-called water-filling (see e.g. [19]) that can be implemented
via orthogonal frequency division multiplexing (OFDM).
We have assumed that the channel impulse response characterizes the channel
filtering for the duration of the transmission. If the transmitter and/or the receiver
move, which is often the case in mobile communication, then the channel is still
linear but time-varying. Excellent graduate-level textbooks that discuss this kind
of channel are [2] and [17].
Colored Gaussian noise We can think of colored noise as filtered white noise. It
is safe to assume that, over the frequency range of interest, i.e. the frequency range
occupied by the information-carrying signals, there is no positive-length interval
over which there is no noise. (A frequency interval with no noise is physically
unjustifiable and, if we insist on such a channel model, we no longer have an
interesting communication problem because we can transmit infinitely many bits
error-free by signaling where there is no noise.) For this reason, we assume that
the frequency response of the noise-shaping filter cannot vanish over a positive-
length interval in the frequency range of interest. In this case, we can modify
the aforementioned noise-reduction filter in such a way that, in the frequency
range of interest, it has the inverse frequency response of the noise-shaping filter.
The noise at the output of the modified noise-reduction filter, called whitening
filter , is zero-mean, Gaussian, and white (in the frequency range of interest).
The minimum error probability with the whitening filter cannot be higher than
without, because the filter is invertible in the frequency range of interest. What
we gain with the noise-whitening filter is that we are back to the familiar situation
where the noise is white and the signal set is W̃ = {w̃0 (t), w̃1 (t), . . . , w̃m−1 (t)},
where w̃i(t) = (wi ⋆ h)(t) and h(t) is the impulse response of the whitening filter.
3.7 Summary
In this chapter we have addressed the problem of communicating a message across
a waveform AWGN channel. The importance of the continuous-time AWGN
channel model comes from the fact that every conductor is a linear time-invariant
system that smooths out and adds up the voltages created by the electron’s motion.
Owing to the central limit theorem, the result of adding up many contributions can
be modeled as white Gaussian noise. No conductor can escape this phenomenon,
unless it is cooled down to zero kelvin. Hence every channel adds Gaussian noise.
This does not imply that the continuous-time AWGN channel is the only channel
model of interest. Depending on the situation, there can be other impairments
such as fading, nonlinearities, and interference, that should be considered in the
channel model, but they are outside the scope of this text.
As in the previous chapter, we have focused primarily on the receiver that
minimizes the error probability assuming that the signal set is given to us. We
were able to move forwards swiftly by identifying a sufficient statistic that reduces
the receiver design problem to the one studied in Chapter 2. The receiver consists
of an n-tuple former and a decoder. We have seen that the sender can also be
decomposed into an encoder and a waveform former. This decomposition nat-
urally fits the layering philosophy discussed in the introductory chapter: The
waveform former at the sender and the n-tuple former at the receiver can be
seen as providing a “service” to the encoder–decoder pair. The service consists
in making the continuous-time AWGN channel look like a discrete-time AWGN
channel.
Having established the link between the continuous-time and the discrete-
time AWGN channel, we are in the position to evaluate the error probability
of a communication system for the AWGN channel by means of simulation. An
example is given in Appendix 3.8.
How do we proceed from here? First, we need to introduce the performance
parameters we care mostly about, discuss how they relate to one another, and
understand what options we have to control them. We start this discussion in the
next chapter where we also develop some intuition about the kind of signals we
want to use to transmit many bits.
Second, we need to start paying attention to cost and complexity because they
can quickly get out of hand. For a brute-force implementation, the n-tuple former
requires n correlators or matched filters and the decoder needs to compute and
compare y, cj + qj for m codewords. With k = 100 (a very modest number of
transmitted bits) and n = 2k (a realistic relationship), the brute-force approach
requires 200 matched filters or correlators and the decoder needs to evaluate
roughly 1030 inner products. These are staggering numbers. In Chapter 5 we will
learn how to choose the waveform former in such a way that the n-tuple former
can be implemented with a single matched filter. In Chapter 6 we will see that
there are encoders for which the decoder needs to explore a number of possibilities
that grows linearly rather than exponentially in k.
% encode
c = encodingFunction(message);
% decode
[distances, message_estimate] = min(abs(repmat(y',1,m) - repmat(encodingFunction,k,1)), [], 2);
noiseVariance = 1
k = 1000
errorRate = 0.2660
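A self-contained variation in the same spirit is sketched below: 4-PAM codewords over the discrete-time AWGN channel with a minimum-distance decoder. The constellation, the noise variance, and the number of symbols are illustrative choices; they are not meant to reproduce the error rate printed above.
% Minimal self-contained sketch of a simulation of this kind. All parameter
% values and the 4-PAM codebook below are arbitrary illustrative choices.
noiseVariance = 1;
k = 1000;                                   % number of transmitted symbols
codebook = [-3 -1 1 3];                     % 4-PAM codewords (one per hypothesis)
m = length(codebook);
message = randi(m, 1, k);                   % iid uniform messages in {1, ..., m}
c = codebook(message);                      % encode: map each message to a codeword
y = c + sqrt(noiseVariance)*randn(1, k);    % discrete-time AWGN channel
% decode: for each received value, pick the closest codeword
[distances, message_estimate] = min(abs(repmat(y',1,m) - repmat(codebook,k,1)), [], 2);
errorRate = mean(message_estimate' ~= message)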
example 3.13 Let g1 (t) and g2 (t) be two finite-energy pulses and for i = 1, 2,
define
Zi = ∫ N(α) gi(α) dα,     (3.7)
where N(t) is white Gaussian noise as we just defined. We compute the covariance cov(Zi, Zj) as follows:
cov(Zi, Zj) = E[Zi Zj*]
  = E[ ∫ N(α) gi(α) dα ∫ N*(β) gj*(β) dβ ]
  = ∫∫ E[N(α) N*(β)] gi(α) gj*(β) dα dβ
  = (N0/2) ∫∫ δ(α − β) gi(α) gj*(β) dα dβ
  = (N0/2) ∫ gi(β) gj*(β) dβ.
example 3.14 Let N (t) be white Gaussian noise at the input of a lin-
ear time-invariant circuit of impulse response h(t) and let Z(t) be the
filter's output. Compute the autocovariance of the output Z(t) = ∫ N(α) h(t − α) dα.
Solution: The definition of autocovariance is KZ (τ ) := E [Z(t + τ )Z ∗ (t)]. We
proceed two ways. The computation using the definition of N (t) given in this
appendix mimics the derivation in Example 3.13. The result is KZ(τ) = (N0/2) ∫ h(t + τ) h*(t) dt. If we use the definition of white Gaussian noise given in Section 3.2,
we do not need to calculate (but we do need to know (3.3), which is part of the
definition). In fact, the Zi and Zj defined in (3.2) and used in (3.3) become
Z(t + τ ) and Z(t) if we set gi (α) = h(t + τ − α) and gj (α) = h(t − α), respectively.
Hence we can read the result directly out of (3.3), namely
KZ(τ) = (N0/2) ∫ h(t + τ − α) h*(t − α) dα = (N0/2) ∫ h(β + τ) h*(β) dβ.
By defining the self-similarity function¹ of h(t) as
Rh(τ) = ∫ h(t + τ) h*(t) dt
¹ Also called the autocorrelation function. We reserve the term autocorrelation function for stochastic processes and use self-similarity function for deterministic pulses.
² Recall that a Dirac delta function is defined through what happens when we integrate it against a function, i.e. through the relationship ∫ δ(t) g(t) dt = g(0).
white Gaussian noise. (But then, why not bypass the mathematical description of
N (t) as we do in Section 3.2?)
As a final remark, note that defining an object indirectly through its behavior,
as we have done in Section 3.2, is not new to us. We do something similar when
we introduce the Dirac delta function by saying that it fulfills the relationship
∫ f(t) δ(t) dt = f(0). In both cases, we introduce the object of interest by saying how
it behaves when integrated against a generic function.
Satellites and the corresponding Earth stations use antennas that have direc-
tivity (typically a parabolic or a horn antenna for a satellite, and a parabolic
antenna for an Earth station). Their directivity is specified by their gain G in the
pointing direction. If the transmitting antenna has gain GT , the power density in
the pointing direction at distance d is PT GT / (4πd²) watts/m².
The received power is PR = (PT GT/(4πd²)) AR, where AR is the effective area of the receiving antenna, related to its gain by GR = 4π AR/λ² (the gain GR is dimension-free). Solving for AR and plugging into PR yields
PR = PT GT GR / (4πd/λ)².     (3.9)
The factor LS = (4πd/λ)2 is commonly called the free-space path loss, but this
is a misnomer. In fact the free-space attenuation is independent of the wavelength.
It is the relationship between the antenna’s effective area and its gain that brings
in the factor λ². Nevertheless, being able to write
PR = PT GT GR / LS     (3.10)
has the advantage of underlining the “gains” and the “losses”. Notice also that
LS is a factor on which the system designer has little control (for a geostationary
satellite the distance is fixed and the carrier frequency is often dictated by
regulations), whereas PT , GT , and GR are parameters that a designer might be
able to choose (within limits).
Now suppose that the receiving antenna is connected to the receiver via a
lossless coaxial cable. The antenna and the receiver input have an impedance and
the connecting cable has a characteristic impedance. For best power transfer, the
three impedances should be resistive and have the same value, typically 50 ohms
(see, e.g., Wikipedia, impedance matching). We assume that it is indeed the case
and let R ohms be its value. Then, the impedance seen by the antenna looking
into the cable is also R as if the receiver were connected directly to the antenna
(see, e.g., Wikipedia, transmission line, or [14]). Figure 3.11 shows the electrical
model for the receiving antenna and its load.3 It shows the voltage source W (t)
that represents the intended signal, the voltage source VN (t) that represents all
noise sources, the antenna impedance R and the antenna’s load R.
³ The circuit of Figure 3.11 is a suitable model for determining the voltage (and the current) at the receiver input (the load in the figure). There is a more complete model [26] that enables us to associate the power dissipated by the antenna's internal impedance with the power that the antenna radiates back to space.
Figure 3.11. Electrical model for the receiving antenna and the load
it sees looking into the first amplifier.
The advantage of having all the noise sources be represented by a single source
which is co-located with the signal source W (t) is that the signal-to-noise ratio at
that point is the same as the signal-to-noise-ratio at the input of the n-tuple former.
(Once all noise sources are accounted for at the input, the electronic circuits are
considered as noise free). So, the E/N0 of interest to us is the signal energy absorbed
by the load divided by the noise-power density absorbed by the same load.
The power harvested by the antenna is passed onto the load. This power is PR ,
hence the energy is PR τ , where τ is the duration of the signals (assumed to be
the same for all signals).
As mentioned in Appendix 3.10, it is customary to describe the noise-power
density by the temperature TN of a fictitious resistor that transfers the same noise-
power density to the same load. This density is kB TN . If we know (for instance
from measurements) the power density of each noise source, we can determine the
equivalent density at the receiver input, sum all the densities, and divide by kB to
obtain the noise temperature TN . Here we assume that this number is provided to
us by the manufacturer of the receiver (see Example 3.17 for a numerical value).
Putting things together, we obtain
E/N0 = PR τ / (kB TN) = PT τ GT GR / (LS kB TN).     (3.11)
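To get a feel for (3.11), the sketch below evaluates it for a set of purely hypothetical numbers; the distance, wavelength, gains, power, duration, and noise temperature are illustrative placeholders and are not the values of this case study.
% Minimal sketch: numerical evaluation of (3.11) with hypothetical values.
kB     = 1.38e-23;            % Boltzmann constant, J/K
d      = 3.6e7;               % distance, m (geostationary-like, illustrative)
lambda = 0.025;               % wavelength, m (roughly 12 GHz, illustrative)
LS     = (4*pi*d/lambda)^2;   % the factor called free-space path loss
PT     = 100;                 % transmitted power, W (illustrative)
GT     = 10^(30/10);          % transmit antenna gain, 30 dB (illustrative)
GR     = 10^(37/10);          % receive antenna gain, 37 dB (illustrative)
TN     = 150;                 % noise temperature, K (illustrative)
tau    = 3.6e-8;              % signal duration, s (illustrative)
EoverN0    = (PT*tau*GT*GR) / (LS*kB*TN);
EoverN0_dB = 10*log10(EoverN0)   % roughly 14 dB for these particular numbers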
To go one step further, we characterize the two voltage sources of Figure 3.11.
This is a calculation that the hardware designer might want to do to determine
the range of voltages and currents at the antenna output.
Recall that a voltage of v volts applied to a resistor of R ohms dissipates the
power P = v²/R watts. When H = i, W(t) = αwi(t) for some scaling factor α.
We determine α by computing the resulting average power dissipated by the load
and by equating it to PR. Thus PR = α²E/(4Rτ). Inserting the value of PR and solving for
α yields
α = √( 4R PT GT GR / (LS E/τ) ).
Figure 3.12. (a) Electrical circuit under H = i: source αwi(t) in series with the noise source VN(t) and the antenna resistance R, loaded by R. (b) Equivalent system view: signal αwi(t) plus noise VN(t) of power spectral density N0/2 = 2R kB TN. (c) Channel model: signal wi(t) plus noise N(t) of power spectral density N0/2 = kB TN LS E/(2 PT τ GT GR).
Figure 3.12a summarizes the equivalent electrical circuit under the hypothesis
H = i. As determined in Appendix 3.10, the mean square voltage of the noise
source VN (t) per Hz of (single-sided) bandwidth is N0 = 4RkB TN . Figure 3.12b
is the equivalent representation from the point of view of a system designer. The
usefulness of these models is that they give us actual voltages. As long as we are not
concerned with hardware limitations, for the purpose of the channel model, we are
allowed to scale the signal and the noise by the same factor. Specifically, if we divide
the signal by α and divide the noise-power density by α2 , we obtain the channel
model of Figure 3.12c. Observe that the impedance R has fallen out of the picture.
3.12 Exercises
Exercises for Section 3.1
Figure 3.13. The waveforms w0(t) and w1(t).
Figure 3.14. Three waveforms.
(a) By means of the Gram–Schmidt procedure, find an orthonormal basis for the
space spanned by the waveforms in Figure 3.14.
(b) In your chosen orthonormal basis, let w0 (t) and w1 (t) be represented by the
codewords c0 = (3, −1, 1)T and c1 = (−1, 2, 3)T , respectively. Plot w0 (t) and
w1 (t).
(c) Compute the (standard) inner products ⟨c0, c1⟩ and ⟨w0, w1⟩ and compare them.
(d) Compute the norms ‖c0‖ and ‖w0‖ and compare them.
exercise 3.4 (Orthonormal expansion) For the signal set of Figure 3.15, do
the following.
(a) Find the orthonormal basis ψ1 (t), . . . , ψn (t) that you would find by following
the Gram–Schmidt (GS) procedure. Note: No need to work out the intermedi-
ate steps of the GS procedure. The purpose of this exercise is to check, with
hardly any calculation, your understanding of what the GS procedure does.
(b) Find the codeword ci ∈ Rn that describes wi (t) with respect to your orthonor-
mal basis. (No calculation needed.)
Figure 3.15. The four waveforms of the signal set.
exercise 3.5 (Noise in regions) Let N(t) be white Gaussian noise of power spectral density N0/2. Let g1(t), g2(t), and g3(t) be waveforms as shown in Figure 3.16. For i = 1, 2, 3, let Zi = ∫ N(t) gi*(t) dt, Z = (Z1, Z2)T, and U = (Z1, Z3)T.
Figure 3.16. The waveforms g1(t), g2(t), and g3(t).
Figure 3.17. Three regions, (a), (b), and (c), in the (Z1, Z2) and (Z1, Z3) planes.
exercise 3.6 (Two-signals error probability) The two signals of Figure 3.18 are
used to communicate one bit across the continuous-time AWGN channel of power
spectral density N0 /2 = 6 W/Hz. Write an expression for the error probability of
an ML receiver.
exercise 3.7 (On–off signaling) Consider the binary hypothesis testing problem
specified by:
H=0: R(t) = w(t) + N (t)
H=1: R(t) = N (t),
Figure 3.18. The waveforms w0(t) and w1(t).
where N (t) is additive white Gaussian noise of power spectral density N0 /2 and
w(t) is the signal shown on the left of Figure 3.19.
(a) Describe the maximum likelihood receiver for the received signal R(t), t ∈ R.
(b) Determine the error probability for the receiver you described in (a).
(c) Sketch a block diagram of your receiver of part (a) using a filter with impulse
response h(t) (or a scaled version thereof ) shown in the right-hand part of
Figure 3.19.
Figure 3.19. The signal w(t) (left) and the impulse response h(t) (right).
Figure 3.20. The signal w(t).
Sketch a block diagram of a receiver that, based on R(t), decides on the value of
H with least probability of error. (See Example 4.6 for the probability of error.)
exercise 3.10 (Matched filter implementation) In this problem, we consider the
implementation of matched filter receivers. In particular, we consider frequency-
shift keying (FSK) with the following signals:
wj(t) = √(2/T) cos( 2π (nj/T) t )  for 0 ≤ t ≤ T, and 0 otherwise,     (3.12)
(a) Determine the impulse response hj (t) of a causal matched filter for the signal
wj (t). Plot hj (t) and specify the sampling time.
(b) Sketch the matched filter receiver. How many matched filters are needed?
(c) Sketch the output of the matched filter with impulse response hj (t) when the
input is wj (t).
(d) Consider the ideal resonance circuit shown in Figure 3.21.
Figure 3.21. Ideal parallel LC resonance circuit driven by the current i(t); u(t) is the voltage across it.
For this circuit, the voltage response to the input current i(t) = δ(t) is
h(t) = (1/C) cos( t/√(LC) )  for t ≥ 0, and 0 otherwise.
Show how this can be used to implement the matched filter for wj (t).
Determine how L and C should be chosen. Hint: Suppose that i(t) = wj (t).
In this case, what is u(t)?
SNR = |⟨w, φ⟩|² / E[ |⟨N, φ⟩|² ].
Notice that the SNR remains the same if we scale φ(t) by a constant factor. Notice also that
E[ |⟨N, φ⟩|² ] = N0/2.     (3.14)
(a) Use the Cauchy–Schwarz inequality to give an upper bound on the SNR. What
is the condition for equality in the Cauchy–Schwarz inequality? Find the φ(t)
that maximizes the SNR. What is the relationship between the maximizing
φ(t) and the signal w(t)?
(b) Let us verify that we would get the same result using a pedestrian approach.
Instead of waveforms we consider tuples. So let c = (c1 , c2 )T ∈ R2 and use cal-
culus (instead of the Cauchy–Schwarz inequality) to find the φ = (φ1 , φ2 )T ∈
R2 that maximizes c, φ subject to the constraint that φ has unit norm.
(c) Verify with a picture (convolution) that the output at time T of a filter with input w(t) and impulse response h(t) = w(T − t) is indeed ⟨w, w⟩ = ∫_{−∞}^{∞} w²(t) dt.
where β1 , β2 , τ1 , τ2 are constants known to the receiver and N1 (t) and N2 (t) are
white Gaussian noise of power spectral density N0 /2. We assume that N1 (t) and
N2 (t) are independent of each other (in the obvious sense) and independent of X.
We also assume that ∫ w(t − τ1) w(t − τ2) dt = γ, where −1 ≤ γ ≤ 1.
(a) Describe an ML receiver for X that observes both R1 (t) and R2 (t) and deter-
mine its probability of error in terms of the Q function, β1 , β2 , γ, and N0 /2.
(b) Repeat part (a) assuming that the receiver has access only to the sum-signal
R(t) = R1 (t) + R2 (t).
exercise 3.13 (Receiver) The signal set
w0(t) = sinc²(t)
w1(t) = √2 sinc²(t) cos(4πt)
Figure 3.22. The waveform w1(t).
(a) Describe an ML receiver that decides which pulse was transmitted. We ask
that the n-tuple former contains a single matched filter. Make sure that the
filter is causal and plot its impulse response.
(b) Express the probability of error in terms of T, A, Td , N0 .
exercise 3.15 (Delayed signals) One of the two signals shown in Figure
3.23 is selected at random and is transmitted over the additive white Gaussian
noise channel of noise spectral density N0/2. Draw a block diagram of a maximum
likelihood receiver that uses a single matched filter and express its error probability.
Figure 3.23. The waveforms w0(t) and w1(t).
exercise 3.16 (ML decoder for AWGN channel) The signal of Figure 3.24 is
fed to an ML receiver designed for a transmitter that uses the four signals of Figure
3.15 to communicate across the AWGN channel. Determine the receiver output Ĥ.
Figure 3.24. The received signal R(t).
exercise 3.17 (AWGN channel and sufficient statistic) Let W = {w0 (t), w1 (t)}
be the signal constellation used to communicate an equiprobable bit across an
additive Gaussian noise channel. In this exercise, we verify that the projection of
the channel output onto the inner product space V spanned by W is not necessarily
a sufficient statistic, unless the noise is white. Let ψ1 (t), ψ2 (t) be an orthonormal
basis for V. We choose the additive noise to be N (t) = Z1 ψ1 (t)+Z2 ψ2 (t)+Z3 ψ3 (t)
for some normalized ψ3 (t) that is orthogonal to ψ1 (t) and ψ2 (t) and choose
Z1 , Z2 , and Z3 to be zero-mean jointly Gaussian random variables of identical
variance σ 2 . Let ci = (ci,1 , ci,2 , 0)T be the codeword associated to wi (t) with
respect to the extended orthonormal basis ψ1 (t), ψ2 (t), ψ3 (t). There is a one-to-one
correspondence between the channel output R(t) and Y = (Y1 , Y2 , Y3 )T , where
Yi = ⟨R, ψi⟩. In terms of Y, the hypothesis testing problem is
H = i :  Y = ci + Z,  i = 0, 1,  where Z = (Z1, Z2, Z3)^T.
(a) As a warm-up exercise, let us first assume that Z1 , Z2 , and Z3 are inde-
pendent. Use the Fisher–Neyman factorization theorem (Exercise 2.22 of
Chapter 2) to show that Y1 , Y2 is a sufficient statistic.
(b) Now assume that Z1 and Z2 are independent but Z3 = Z2 . Prove that in this
case Y1 , Y2 is not a sufficient statistic.
(c) To check a specific case, consider c0 = (1, 0, 0)T and c1 = (0, 1, 0)T . Determine
the error probability of an ML receiver that observes (Y1 , Y2 )T and that of
another ML receiver that observes (Y1 , Y2 , Y3 )T .
exercise 3.18 (Mismatched receiver) Let a channel output be
R(t) = c X w(t) + N (t), (3.15)
where c > 0 is some deterministic constant, X is a uniformly distributed random
variable that takes values in {3, 1, −1, −3}, w(t) is the deterministic waveform
w(t) = 1 for 0 ≤ t < 1, and w(t) = 0 otherwise, (3.16)
and N(t) is white Gaussian noise of power spectral density N0/2.
(a) Describe the receiver that, based on the channel output R(t), decides on the
value of X with least probability of error.
(b) Find the error probability of the receiver you have described in part (a).
(c) Suppose now that you still use the receiver you have described in part (a),
but that the received signal is actually
R(t) = (3/4) c X w(t) + N(t), (3.17)
i.e. you were unaware that the channel was attenuating the signal. What is
the probability of error now?
(d) Suppose now that you still use the receiver you have found in part (a) and that
R(t) is according to equation (3.15), but that the noise is colored. In fact, N (t)
is a zero-mean stationary Gaussian noise process of auto-covariance function
KN(τ) = (1/(4α)) e^{−|τ|/α},
where 0 < α < ∞ is some deterministic real parameter. What is the
probability of error now?
4 Signal design trade-offs
4.1 Introduction
In Chapters 2 and 3 we have focused on the receiver, assuming that the signal set
was given to us. In this chapter we introduce the signal design.
The problem of choosing a convenient signal constellation is not as clean-cut as
the receiver-design problem. The reason is that the receiver-design problem has
a clear objective, to minimize the error probability, and one solution, namely the
MAP rule. In contrast, when we choose a signal constellation we make trade-offs
among conflicting objectives.
We have two main goals for this chapter: (i) to introduce the design parameters
we care mostly about; and (ii) to sharpen our intuition about the role played by the
dimensions of the signal space as we increase the number of bits to be transmitted.
The continuous-time AWGN channel model is assumed.
4.2 Isometric transformations applied to the codebook
example 4.1 Figure 4.1 shows an original codebook C = {c0 , c1 , c2 , c3 } and three
variations obtained by applying to C a reflection, a rotation, and a translation,
respectively. In each case the isometry a : Rn → Rn sends ci to c̃i = a(ci ).
Figure 4.1. (The original codebook c0, c1, c2, c3 and the codebooks obtained from it via a reflection, a rotation, and a translation.)
The average energy E = E[‖Y‖²] can be decreased by a translation if and only if the mean E[Y] is non-zero.
example 4.2 Let w0 (t) and w1 (t) be rectangular pulses with support [0, T ] and
[T, 2T ], respectively, as shown on the left of Figure 4.2a. Assuming that PH (0) =
PH (1) = 12 , we calculate the average m(t) = 12 w0 (t) + 12 w1 (t) and see that it is
non-zero (center waveform). Hence we can save energy by using the new signal set
defined by w̃i (t) = wi (t) − m(t), i = 0, 1 (right). In Figure 4.2b we see the signals
in the signal space, where ψi(t) = w_{i−1}(t)/‖w_{i−1}(t)‖, i = 1, 2. As we see from the figures,
w̃0 (t) and w̃1 (t) are antipodal signals. This is not a coincidence: After we remove
the mean, any two signals become the negative of each other. As an alternative to
representing the elements of W in the signal space, we could have represented the
elements of the codebook C in R2 , as we did in Figure 4.1. The two representations
are equivalent.
Figure 4.2. (a) The waveforms w0(t), w1(t), their mean m(t), and the translated waveforms w̃0(t), w̃1(t). (b) Signal space viewpoint.
the number m of messages goes to infinity. Recall that the waveform associated to
message i is
wi (t) = ci ψ(t),
where σ² = N0/2 is the variance of the noise in each coordinate. If we insert E = kEb and m = 2^k, we see that the lower bound goes to 1 as k goes to infinity. This happens because the circumference of the PSK constellation grows as √k whereas the number of points grows as 2^k. Hence, the minimum distance between points goes to zero (indeed exponentially fast).
As they are, the signal constellations used in the above two examples are not
suitable to transmit a large amount k of bits by letting the constellation size
m = 2k grow exponentially with k. The problem with the above two examples
is that, as m grows, we are trying to pack an exponentially increasing number of
points into a space that also grows in size but not fast enough. The space becomes
“crowded” as m grows, meaning that the minimum distance becomes smaller and
the probability of error increases.
We should not conclude that PAM and PSK are not useful to send many bits.
On the contrary, these signaling methods are widely used. In the next chapter we
will see how. (See also the comment after the next example.)
example 4.5 (Bit-by-bit on a pulse train) The idea is to use a different dimen-
sion for each bit. Let (bi,1 , bi,2 , . . . , bi,k ) be the binary sequence corresponding to
message i. For mathematical convenience, we assume these bits to take value in
{±1} rather than {0, 1}. We let the associated codeword ci = (ci,1 , ci,2 , . . . , ci,k )T
be defined by
ci,j = bi,j √Eb,
where Eb = E/k is the energy per bit. The transmitted signal is
wi(t) = Σ_{j=1}^{k} ci,j ψj(t),  t ∈ R, (4.2)
where ψj(t) = ψ(t − jTs) for some unit-norm pulse ψ(t) chosen so that {ψj(t)} forms an orthonormal set.
The above expression justifies the name bit-by-bit on a pulse train given to this
signaling method (see Figure 4.3). As we will see in Chapter 5, there are many
other possible choices for the pulse ψ(t).
Figure 4.3. Example of (4.2) for k = 4 and ci = √Eb (1, 1, −1, 1)^T: (a) the pulse ψ(t); (b) the signal wi(t).
It should be clear from the figure what the decoding regions of an ML decoder are, but let us proceed analytically and find an ML decoding rule that works for any k. The ML receiver decides that the constellation point used by the sender is the ci ∈ {±√Eb}^k that maximizes ⟨y, ci⟩ − ‖ci‖²/2. Since ‖ci‖² is the same for all i, the ML decision is the ci whose jth component has the same sign as yj, for each j.
(Figure: the codebooks of bit-by-bit on a pulse train for (a) k = 1, (b) k = 2, and (c) k = 3.)
We now compute the error probability. As usual, we first compute the error
probability conditioned on a specific ci. From the codebook symmetry, we expect that the error probability will not depend on i. If ci,j is positive, Yj = √Eb + Zj and a maximum likelihood decoder will make the correct decision if Zj > −√Eb. (The statement is an "if and only if" if we ignore the zero-probability event that Zj = −√Eb.) This happens with probability 1 − Q(√Eb/σ). Based on similar reasoning, it is
straightforward to verify that the probability of error is the same if ci,j is negative.
Now let Cj be the event that the decoder makes the correct decision about the
jth bit. The probability of Cj depends only on Zj . The independence of the noise
components implies the independence of C1 , C2 , . . . , Ck . Thus, the probability that
all k bits are decoded correctly when H = i is
Pc(i) = [1 − Q(√Eb/σ)]^k,
which is the same for all i and, therefore, it is also the average Pc. Notice that Pc → 0 as k → ∞. However, the probability that any specific bit is decoded incorrectly is Pb = Q(√Eb/σ), which does not depend on k.
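A quick numeric check of these two expressions, using Q(x) = erfc(x/√2)/2; the values of Eb and N0 below are arbitrary choices made only for illustration.

import math

def Q(x):                       # Q-function via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

Eb, N0 = 1.0, 0.25              # assumed values
sigma = math.sqrt(N0 / 2.0)
Pb = Q(math.sqrt(Eb) / sigma)
for k in (1, 10, 100, 1000):
    Pc = (1.0 - Pb) ** k
    print(k, Pb, Pc)            # Pb stays fixed while Pc -> 0 as k grows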
This is called block-orthogonal signaling. The name stems from the fact that in practice a block of k bits is collected and then mapped into one of m orthogonal waveforms (see Figure 4.5). Notice that ‖wi‖² = E for all i.
There are many ways to choose the 2^k waveforms ψi(t). One way is to choose ψi(t) = ψ(t − iT) for some normalized pulse ψ(t) such that ψ(t − iT) and ψ(t − jT) are orthogonal when i ≠ j. In this case the requirement for ψ(t) is the same as that in bit-by-bit on a pulse train, but now we need 2^k rather than k shifted versions,
and we send one pulse rather than a train of k weighted pulses. For obvious reasons
this signaling scheme is called pulse position modulation.
Another example is to choose
wi(t) = √(2E/T) cos(2πfi t) 1{t ∈ [0, T]}. (4.4)
For an appropriate choice of the frequencies fi,
⟨wi, wj⟩ = (2E/T) ∫_0^T { (1/2) cos[2π(fi + fj)t] + (1/2) cos[2π(fi − fj)t] } dt = E 1{i = j},
as desired.
Figure 4.5. (The codewords of block-orthogonal signaling: (a) m = n = 2; (b) m = n = 3.)
Ĥ_ML(y) = arg max_i ⟨y, ci⟩ − E/2
        = arg max_i ⟨y, ci⟩
        = arg max_i yi,
where yi is the ith component of y. To compute (or bound) the error probability,
we start as usual with a fixed ci . We choose i = 1. When H = 1,
Yj = √E + Zj if j = 1, and Yj = Zj if j ≠ 1.
Pe < exp(−k(Eb/(2N0) − ln 2)).
We see that Pe → 0 as k → ∞, provided that Eb/N0 > 2 ln 2. (It is possible to prove that the weaker condition Eb/N0 > ln 2 is sufficient. See Exercise 4.3.)
The result of the above example is quite surprising at first. The more bits we
send, the larger is the probability Pc that they will all be decoded correctly. Yet
what goes on is quite clear. In setting all but one component of each codeword
to zero, we can make the non-zero component as large as √(kEb). The decoder
looks for the largest component. Because the variance of the noise is the same
in all components and does not grow with k, when k is large it becomes almost
impossible for the noise to alter the position of the largest component.
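The following Python sketch simulates the arg max decoder for block-orthogonal signaling. The parameter values are assumptions chosen so that Eb/N0 exceeds 2 ln 2, and the estimated error probability is seen to shrink as k grows.

import numpy as np

rng = np.random.default_rng(0)
N0 = 1.0
EbN0 = 4.0 * np.log(2.0)            # above the 2 ln 2 threshold of the bound (assumption)
sigma = np.sqrt(N0 / 2.0)
for k in (2, 4, 8, 12):
    m = 2 ** k
    E = k * EbN0 * N0               # E = k Eb with Eb = EbN0 * N0
    trials, errors = 2000, 0
    for _ in range(trials):
        y = rng.normal(0.0, sigma, size=m)
        y[0] += np.sqrt(E)          # H = 1 was sent: Y_1 = sqrt(E) + Z_1, Y_j = Z_j otherwise
        errors += (np.argmax(y) != 0)
    print(k, errors / trials)       # estimated Pe decreases with k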
1
The Shannon Award is the most prestigious award bestowed by the Information
Theory Society. Slepian was the first, after Shannon himself, to receive the award. The
recipient presents the Shannon Lecture at the next IEEE International Symposium on
Information Theory.
4.5. Duration, bandwidth, and dimensionality 143
In essence, this result says that for an arbitrary time interval (a, b) of length T
and an arbitrary frequency interval (c, d) of width W , in the limit of large T and
W , the set of finite-energy signals that are time-limited to (a, b) and frequency-
limited to (c, d) is spanned by T W orthonormal functions. For later reference we
summarize this by the expression
n ≐ TW, (4.5)
where the “·” on top of the equal sign is meant to remind us that the relationship
holds in the limit of large values for W and T .
Unlike Slepian’s bandwidth definition, which applies also to complex-valued
signals, the bandwidth definitions of Appendix 4.9 have been conceived with real-
valued signals in mind. If s(t) is real-valued, the conjugacy constraint implies
that |sF (f )| is an even function.3 If, in addition, the signal is baseband, then
it is frequency-limited to some interval of the form (−B, B) and, according to a
well-established practice, we say that the signal’s bandwidth is B (not 2B). To
avoid confusions, we use the letter W for bandwidths that account for positive and
2
We do not require that this signal set be closed under addition and under multiplication
by scalars, i.e. we do not require that it forms a vector space.
3
See Section 7.2 for a review of the conjugacy constraint.
negative frequencies and use B for so-called single-sided bandwidths. (We may call
W a double-sided bandwidth.)
A result similar to (4.5) can be formulated for other meaningful definitions of
time and frequency limitedness. The details depend on the definitions but the
essence does not. What remains true for many meaningful definitions is that,
asymptotically, there is a linear relationship between W T and n.
Two illustrative examples of this relationship follow. To avoid annoying calcula-
tions, for each example, we take the liberty of using the most convenient definition
of duration and bandwidth.
example 4.9 Let ψ(t) = (1/√Ts) sinc(t/Ts) and ψF(f) = √Ts 1{f ∈ [−1/(2Ts), 1/(2Ts)]}
be a normalized pulse and its Fourier transform. Let ψl (t) = ψ(t−lTs ), l = 1, . . . , n.
The collection B = {ψ1 (t), . . . , ψn (t)} forms an orthonormal set. One way to see
that ψi(t) and ψj(t) are orthogonal to each other when i ≠ j is to go to the
Fourier domain and use Parseval’s relationship. (Another way is to evoke Theorem
5.6 of Chapter 5.) Let G be the space spanned by the orthonormal basis B. It
has dimension n by construction. All signals of G are strictly frequency-limited to
(−W/2, W/2) for W = 1/Ts and time-limited (for some η) to (0, T ) for T = nTs .
For this example W T = n.
example 4.10 If we substitute an orthonormal basis {ψ1 (t), . . . , ψn (t)} with the
related orthonormal basis {ϕ1(t), . . . , ϕn(t)} obtained via the relationship ϕi(t) = √b ψi(bt) for some b ≥ 1, i = 1, . . . , n, then all signals are time-compressed and
frequency-expanded by the same factor b. This example shows that we can trade W
for T without changing the dimensionality of the signal space, provided that W T
is kept constant.
Note that, in this section, n is the dimensionality of the signal space that may
or may not be related to a codeword length (also denoted by n).
The relationship between n and W T establishes a fundamental relationship
between the discrete-time and the continuous-time channel model. It says that
if we are allowed to use a frequency interval of width W Hz during T seconds,
then we can make approximately (asymptotically exactly) up to W T uses of the
equivalent discrete-time channel model. In other words, we get to use the discrete-
time channel at a rate of up to W channel uses per second.
The symmetry of (4.5) implies that time and frequency are on an equal footing
in terms of providing the degrees of freedom exploited by the discrete-time channel.
It is sometimes useful to think of T and W as the width and height of a rectangle
in the time–frequency plane, as shown in Figure 4.6. We associate such a rectangle
with the set of signals that have the corresponding time and frequency limitations.
Like a piece of land, such a rectangle represents a natural resource and what
matters for its exploitation is its area.
The fact that n can grow linearly with W T and not faster is bad news for block-
orthogonal signaling. This means that n cannot grow exponentially in k unless
W T does the same. In a typical system, W is fixed by regulatory constraints
and T grows linearly with k. (T is essentially the time it takes to send k bits.)
Hence W T cannot grow exponentially in k, which means that block-orthogonal
is not scalable. Of the four examples studied in Section 4.4, only bit-by-bit on
a pulse train seems to be a viable candidate for large values of k, provided that
we can make it more robust to additive white Gaussian noise. The purpose of
the next section is to gain valuable insight into what it takes to achieve this
goal.
4.6 Bit-by-bit versus block-orthogonal
For large values of Eb/N0, the error probability conditioned on ci is well approximated by the dominant terms of the union bound, Pe(i) ≈ Nd Q(dm/(2σ)), where Nd is the number of dominant terms, i.e. the number of nearest neighbors to ci, and dm is the minimum distance, i.e. the distance to a nearest neighbor.
For bit-by-bit on a pulse train, there are k closest neighbors, each neighbor
obtained by changing ci in exactly one component, and each of them is at distance 2√Eb from ci. As k increases, Nd increases and Q(dm/(2σ)) stays constant. The increase of Nd makes Pe(i) increase.
Now consider block-orthogonal signaling. All signals are at the same distance
from each other. Hence there are Nd = 2^k − 1 nearest neighbors to ci, all at distance dm = √(2E) = √(2kEb). Hence
Q(dm/(2σ)) ≤ (1/2) exp(−dm²/(8σ²)) = (1/2) exp(−kEb/(4σ²)),
Nd = 2^k − 1 = exp(k ln 2) − 1.
We see that the probability that the noise carries a signal closer to a specific neighbor decreases as exp(−kEb/(4σ²)), whereas the number of nearest neighbors increases as exp(k ln 2). For Eb/(4σ²) > ln 2 the product decreases, otherwise it increases.
In essence, to reduce the error probability we need to increase the minimum
distance. If the number of dimensions remains constant, as in the first two examples
of Section 4.4, the space occupied by the signals becomes crowded, the minimum
distance decreases, and the error probability increases. For block-orthogonal signaling, the signal's norm increases as √(kEb) and, by Pythagoras, the distance is a factor √2 larger than the norm – hence the distance grows as √(2kEb). In bit-by-bit on a
pulse train, the minimum distance remains constant. As we will see in Chapter 6,
sophisticated coding techniques in conjunction with a generalized form of bit-by-bit
on a pulse train can reduce the error probability by increasing the distance profile.
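To see the two effects numerically, one can evaluate the dominant term Nd Q(dm/(2σ)) for both schemes. The sketch below does so for an assumed Eb/σ² chosen so that Eb/(4σ²) > ln 2; the numbers are only an estimate based on the nearest-neighbor term, not the exact error probability.

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

Eb, sigma2 = 3.0, 1.0                                      # assumed values; Eb/(4 sigma^2) > ln 2
sigma = math.sqrt(sigma2)
for k in (2, 8, 32, 128):
    bitbybit = k * Q(math.sqrt(Eb) / sigma)                # Nd = k, dm = 2 sqrt(Eb)
    blockorth = (2 ** k - 1) * Q(math.sqrt(2 * k * Eb) / (2 * sigma))  # Nd = 2^k - 1, dm = sqrt(2 k Eb)
    print(k, bitbybit, blockorth)                          # first grows with k, second shrinks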
4.7 Summary
In this chapter we have introduced new design parameters and performance meas-
ures. The ones we are mostly concerned with are as follows.
• The cardinality m of the message set H. Since in most cases the message consists
of bits, typically we choose m to be a power of 2. Whether m is a power of 2
or not, we say that a message carries k = log2 m bits of information (assuming
that all messages are equiprobable).
• The message error probability Pe and the bit error rate Pb . The former, also
called block error probability, is the error probability we have considered so
far. The latter can be computed, in principle, once we specify the mapping
between the set of k-bit sequences and the set of messages. Until then, the
only statement we can make about Pb is that Pe/k ≤ Pb ≤ Pe. The left bound applies with equality if a message error always translates into exactly 1 out of the k bits being incorrectly reproduced. The right bound is an equality if all k bits are incorrectly reproduced each time that there is a message error. Whether we care more
about Pe or Pb depends on the application. If we send a file that contains a
computer program, every single bit of the file has to be received correctly in
order for the transmission to be successful. In this case we clearly want Pe to be
small. However, there are sources that are more tolerant to occasional errors.
This is the case of a digitized voice signal. For voice it is sufficient to have Pb
small. To appreciate the difference between Pe and Pb , consider the hypothetical
situation in which one message corresponds to k = 10³ bits and 1 bit of every message is incorrectly reconstructed. Then the message error probability is 1 (every message is incorrectly reconstructed), whereas the bit-error probability is 10⁻³.
• The average signal's energy E and the average energy per bit Eb, where Eb = E/k.
We are typically willing to double the energy to send twice as many bits. In
this case we fix Eb and let E be a function of k.
• The transmission rate Rb = k/T = log2(m)/T [bits/second].
• The single-sided bandwidth B and the two-sided bandwidth W . There are
several meaningful criteria to determine the bandwidth.
• Scalability, in the sense that we ought to be able to communicate bit sequences
of any length (provided we let W T scale in a sustainable way).
• The implementation cost and computational complexity. To keep the discussion
as simple as possible, we assume that the cost is determined by the number of
matched filters in the n-tuple former and the complexity is that of the decoder.
Clearly we desire scalability, high transmission rate, little energy spent per
bit, small bandwidth, small error probability (message or bit, depending on the
application), low cost and low complexity. As already mentioned, some of these
goals conflict. For instance, starting from a given codebook we can trade energy
for error probability by scaling down all the codewords by some factor. In so
doing the average energy will decrease and so will the distance between codewords,
which implies that the error probability will increase. Alternatively, once we have
reduced the energy by scaling down the codewords we can add new codewords at
the periphery of the codeword constellation, choosing their location in such a way
that new codewords do not further increase the error probability. We keep doing
this until the average energy has returned to the original value. In so doing we
trade bit rate for error probability. By removing codewords at the periphery of the
codeword constellation we can trade bit rate for energy. All these manipulations
pertain to the encoder. By acting inside the waveform former, we can boost the bit rate at the expense of bandwidth. For instance, we can substitute ψi(t) with φi(t) = √b ψi(bt) for some b > 1. This scales the duration of all signals by 1/b with two consequences. First, the bit rate is multiplied by b. (It takes a fraction 1/b of the time to send the same number of bits.) Second, the signal's bandwidth expands by b. (The scaling property of the Fourier transform asserts that the Fourier transform of ψ(bt) is (1/|b|) ψF(f/b).) These examples are meant to show that there
is considerable margin for trading among bit rate, bandwidth, error probability,
and average energy.
We have seen that, rather surprisingly, it is possible to transmit an increasing
number k of bits at a fixed energy per bit Eb and to make the probability that even
a single bit is decoded incorrectly go to zero as k increases. However, the scheme
we used to prove this has the undesirable property of requiring an exponential
growth of the time–bandwidth product. Such a growth would make us quickly
run out of time and/or bandwidth even with moderate values of k. In real-world
applications, we are given a fixed bandwidth and we let the duration grow linearly
with k. It is not a coincidence that most signaling methods in use today can be
seen one way or another as refinements of bit-by-bit on a pulse train. This line of
signaling technique will be pursued in the next two chapters.
Information theory is a field that searches for the ultimate trade-offs, regardless
of the signaling method. A main result from information theory is the famous
formula
C = (W/2) log2(1 + 2P/(N0 W)) (4.6)
  = B log2(1 + P/(N0 B)).
It gives a precise value to the ultimate rate C bps at which we can transmit reliably
over a waveform AWGN channel of noise power spectral density N0 /2 watts/Hz
if we are allowed to use signals of power not exceeding P watts and absolute
(single-sided) bandwidth not exceeding B Hz.
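As a small numerical illustration of (4.6), the following evaluates the capacity for a few assumed values of the power P, with N0 and B also chosen arbitrarily.

import math

N0 = 1.0                           # noise power spectral density parameter (assumption)
B = 1.0e6                          # single-sided bandwidth in Hz (assumption)
for P in (1.0e5, 1.0e6, 1.0e7):    # signal power in watts (assumption)
    C = B * math.log2(1.0 + P / (N0 * B))
    print(P, C)                    # capacity in bits per second; the W = 2B form gives the same value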
This is a good time to clarify our non-standard use of the words coding, encoder,
codeword, and codebook. We have seen that no matter which waveform signals
we use to communicate, we can always break down the sender into a block that
provides an n-tuple and one that maps the n-tuple into the corresponding wave-
form. This view is completely general and serves us well, whether we analyze or
implement a system. Unfortunately there is no standard name for the first block.
Calling it an encoder is a good name, but the reader should be aware that the
current practice is to say that there is coding when the mapping from bits to
codewords is non-trivial, and to say that there is no coding when the map is trivial
as in bit-by-bit on a pulse train. Making such a distinction is not a satisfactory
solution in our view. An example of a non-trivial encoder will be studied in depth
in Chapter 6.
Calling the second block a waveform former is definitely non-standard, but we
find this name to be more appropriate than calling it a modulator, which is the
most common name used for it. The term modulator has been inherited from the
old days of analog communication techniques such as amplitude modulation (AM)
for which it was an appropriate name.
so that for Z ∼ N(0, σ²In) we can write fZ(z) = g(z). Then for any codebook C = {c0, . . . , cm−1}, decoding regions R0, . . . , Rm−1, and isometry a : Rn → Rn we have
∫_{y∈Ri} g(y − ci) dy
(a) = ∫_{y∈Ri} g(a(y) − a(ci)) dy
(b) = ∫_{y: a(y)∈a(Ri)} g(a(y) − a(ci)) dy
(c) = ∫_{α∈a(Ri)} g(α − a(ci)) dα
• 3-dB bandwidth This is the smallest positive number B such that |hF(f)|² ≤ |hF(0)|²/2 outside I = [−B, B]. In other words, outside I the value of |hF(f)|² is at least 3 dB smaller than at f = 0.
• η-bandwidth For any number η ∈ (0, 1), the η-bandwidth is the smallest positive number B such that
∫_{−B}^{B} |hF(f)|² df ≥ (1 − η) ∫_{−∞}^{∞} |hF(f)|² df.
• Equivalent noise bandwidth This is the number B that satisfies 2B|hF(0)|² = ∫_{−∞}^{∞} |hF(f)|² df. The name comes from the fact that if we feed with white noise a filter of impulse response h(t) and we feed with the same input an ideal lowpass filter of frequency response |hF(0)| 1{f ∈ [−B, B]}, then the output power is the same in both situations.
• Root-mean-square (RMS) bandwidth This is defined if ∫_{−∞}^{∞} f²|hF(f)|² df < ∞, in which case it is
B = [ ∫_{−∞}^{∞} f²|hF(f)|² df / ∫_{−∞}^{∞} |hF(f)|² df ]^{1/2}.
To understand this definition, notice that the function g(f) := |hF(f)|² / ∫_{−∞}^{∞} |hF(f)|² df is non-negative, even, and integrates to 1. Hence it is the density of some zero-mean random variable and B = (∫ f² g(f) df)^{1/2} is the standard deviation of that random variable.
4.10 Exercises
Exercises for Section 4.3
exercise 4.1 (Signal translation) Consider the signals w0 (t) and w1 (t) shown in
Figure 4.7, used to communicate 1 bit across the AWGN channel of power spectral
density N0 /2.
Figure 4.7. (The waveforms w0(t) and w1(t).)
(a) Determine an orthonormal basis {ψ0 (t), ψ1 (t)} for the space spanned by
{w0 (t), w1 (t)} and find the corresponding codewords c0 and c1 . Work out
two solutions, one obtained via Gram–Schmidt and one in which the second
element of the orthonormal basis is a delayed version of the first. Which of
the two solutions would you choose if you had to implement the system?
(b) Let X be a uniformly distributed binary random variable that takes values
in {0, 1}. We want to communicate the value of X over an additive white
Gaussian noise channel. When X = 0, we send w0 (t), and when X = 1,
we send w1 (t). Draw the block diagram of an ML receiver based on a single
matched filter.
(c) Determine the error probability Pe of your receiver as a function of T and N0 .
(d) Find a suitable waveform v(t), such that the new signals w̃0 (t) = w0 (t) −
v(t) and w̃1 (t) = w1 (t) − v(t) have minimal energy and plot the resulting
waveforms.
(e) What is the name of the kind of signaling scheme that uses w̃0 (t) and w̃1 (t)?
Argue that one obtains this kind of signaling scheme independently of the
initial choice of w0 (t) and w1 (t).
exercise 4.2 (Orthogonal signal sets) Consider a set W = {w0 (t), . . . , wm−1 (t)}
of mutually orthogonal signals with squared norm E each used with equal
probability.
(a) Find the minimum-energy signal set W̃ = {w̃0 (t), . . . , w̃m−1 (t)} obtained by
translating the original set.
(b) Let Ẽ be the average energy of a signal picked at random within W̃. Determine
Ẽ and the energy saving E − Ẽ.
(c) Determine the dimension of the inner product space spanned by W̃.
exercise 4.3 (Suboptimal receiver for orthogonal signaling) This exercise takes
a different approach to the evaluation of the performance of block-orthogonal sig-
naling (Example 4.6). Let the message H ∈ {1, . . . , m} be uniformly distributed
and consider the communication problem described by
H=i: Y = ci + Z, Z ∼ N (0, σ 2 Im ),
where Y = (Y1 , . . . , Ym )T ∈ Rm is the received vector and {c1 , . . . , cm } ⊂ Rm is
the codebook consisting of constant-energy codewords that are orthogonal to each
other. Without loss of essential generality, we can assume
ci = √E ei,
where ei is the ith unit vector in Rm , i.e. the vector that contains 1 at position i
and 0 elsewhere, and E is some positive constant.
(a) Describe the statistic of Yj for j = 1, . . . , m given that H = 1.
(b) Consider a suboptimal receiver that uses a threshold t = α√E where 0 < α < 1. The receiver declares Ĥ = i if i is the only integer such that Yi ≥ t.
If there is no such i or there is more than one index i for which Yi ≥ t, the
receiver declares that it cannot decide. This will be viewed as an error. Let
Ei = {Yi ≥ t}, Ei^c = {Yi < t}, and describe, in words, the meaning of the event
E1 ∩ E2^c ∩ E3^c ∩ · · · ∩ Em^c.
(c) Find an upper bound to the probability that the above event does not occur
when H = 1. Express your result using the Q function.
(d) Now let m = 2^k and let E = kEb for some fixed energy per bit Eb. Prove that the error probability goes to 0 as k → ∞, provided that Eb/σ² > (2 ln 2)/α². (Notice that because we can choose α² as close to 1 as we wish, if we insert σ² = N0/2, the condition becomes Eb/N0 > ln 2, which is a weaker condition than the one obtained in Example 4.6.) Hint: Use m − 1 < m = exp(ln m) and Q(x) < (1/2) exp(−x²/2).
(a) Describe an orthonormal basis for the inner product space W spanned by
wi (t), i = 0, . . . , 3 and plot the signal constellation in Rn , where n is the
dimensionality of W.
(b) Determine an assignment between pairs of bits and waveforms such that the
bit-error probability is minimized and derive an expression for Pb .
(c) Draw a block diagram of the receiver that achieves the above Pb using a single
causal filter.
(d) Determine the energy per bit Eb and the power of the transmitted signal.
Figure 4.8. (The four waveforms w0(t), w1(t), w2(t), w3(t).)
wi(t) = √(2E/T) cos(2π(fc + iΔf)t) 1{t ∈ [0, T]},  i = 0, . . . , m − 1,
where E, T, fc, and Δf are fixed parameters, with Δf ≪ fc.
(a) Determine the average energy E. (You can assume that fc T is an integer.)
(b) Assuming that fc T is an integer, find the smallest value of Δf that makes
wi(t) orthogonal to wj(t) when i ≠ j.
(c) In practice the signals wi (t), i = 0, 1, . . . , m−1 can be generated by changing
the frequency of a single oscillator. In passing from one frequency to another a
phase shift θ is introduced. Again, assuming that fc T is an integer, determine
the smallest value Δf that ensures orthogonality between cos(2π(fc + iΔf )t +
θi) and cos(2π(fc + jΔf)t + θj) whenever i ≠ j regardless of θi and θj.
(d) Sometimes we do not have complete control over fc either, in which case it
is not possible to set fc T to an integer. Argue that if we choose fc T ≫ 1 then
for all practical purposes the signals will be orthogonal to one another if the
condition found in part (c) is met.
(e) Give an approximate value for the bandwidth occupied by the signal constel-
lation. How does the W T product behave as a function of k = log2 (m)?
(a) For the set G spanned by the above orthonormal basis, determine the relation-
ship between n and W T .
(b) Compare with Example 4.9 and explain the difference.
exercise 4.9 (Time- and frequency-limited orthonormal sets) Complement
Example 4.9 and Exercise 4.8 with similar examples in which the shifts occur in
the frequency domain. The corresponding time-domain signals can be complex-
valued.
exercise 4.10 (Root-mean-square bandwidth) The root-mean-square band-
width (abbreviated rms bandwidth) of a lowpass signal g(t) of finite-energy is
defined by
Brms = [ ∫_{−∞}^{∞} f²|gF(f)|² df / ∫_{−∞}^{∞} |gF(f)|² df ]^{1/2},
where |gF (f )|2 is the energy spectral density of the signal. Correspondingly, the
root-mean-square (rms) duration of the signal is defined by
Trms = [ ∫_{−∞}^{∞} t²|g(t)|² dt / ∫_{−∞}^{∞} |g(t)|² dt ]^{1/2}.
(b) In the above inequality, insert g1(t) = t g(t) and g2(t) = dg(t)/dt and show that
[ ∫_{−∞}^{∞} t (d/dt)[g(t)g*(t)] dt ]² ≤ 4 ∫_{−∞}^{∞} t²|g(t)|² dt ∫_{−∞}^{∞} |dg(t)/dt|² dt.
(c) Integrate the left-hand side by parts and use the fact that |g(t)| → 0 faster than 1/√|t| as |t| → ∞ to obtain
[ ∫_{−∞}^{∞} |g(t)|² dt ]² ≤ 4 ∫_{−∞}^{∞} t²|g(t)|² dt ∫_{−∞}^{∞} |dg(t)/dt|² dt.
exercise 4.11 (Real basis for complex space) Let G be a complex inner product
space of finite-energy waveforms with the property that g(t) ∈ G implies g ∗ (t) ∈ G.
(a) Let GR be the subset of G that contains only real-valued waveforms. Argue
that GR is a real inner product space.
(b) Prove that if g(t) = a(t) + jb(t) is in G, then both a(t) and b(t) are in GR .
(c) Prove that if {ψ1 (t), . . . , ψn (t)} is an orthonormal basis for the real inner
product space GR then it is also an orthonormal basis for the complex inner
product space G.
Comment: In this exercise we have shown that we can always find a real-valued
orthonormal basis for an inner product space G such that g(t) ∈ G implies g ∗ (t)
∈ G. An equivalent condition is that if g(t) ∈ G then also the inverse Fourier transform of gF*(−f) is in G. The set G of complex-valued finite-energy waveforms that are strictly time-limited to (−T/2, T/2) and bandlimited to (−B, B) (for any
of the bandwidth definitions given in Appendix 4.9) fulfills the stated conjugacy
condition.
Miscellaneous exercises
fA (a) = (4.7)
0, otherwise.
We assume that, unlike the transmitter, the receiver knows the realization of A.
We also assume that the receiver implements a maximum likelihood decision, and
that the signal’s energy is Eb .
(a) Describe the receiver.
(b) Determine the error probability conditioned on the event A = a.
(c) Determine the unconditional error probability Pf . (The subscript stands for
fading.)
(d) Compare Pf to the error probability Pe achieved by an ML receiver that observes R(t) = m wi(t) + N(t), where m = E[A]. Comment on the different behavior of the two error probabilities. For each of them, find the Eb/N0 value necessary to obtain the probability of error 10⁻⁵. (You may use (1/2) exp(−x²/2) as an approximation of Q(x).)
exercise 4.15 (Non-white Gaussian noise) Consider the following transmit-
ter/receiver design problem for an additive non-white Gaussian noise channel.
(a) Let the hypothesis H be uniformly distributed in H = {0, . . . , m−1} and when
H = i, i ∈ H, let wi (t) be the channel input. The channel output is then
R(t) = wi (t) + N (t),
where N (t) is Gaussian noise of power spectral density G(f ), where we assume
that G(f) ≠ 0 for all f. Describe a receiver that, based on the channel output
R(t), decides on the value of H with least probability of error. Hint: Find a
way to transform this problem into one that you can solve.
(b) Consider the setting as in part (a) except that now you get to design the
signal set with the restrictions that m = 2 and that the average energy cannot
exceed E. We also assume that G2 (f ) is constant in the interval [a, b], a < b,
where it also achieves its global minimum. Find two signals that achieve the
smallest possible error probability under an ML decoding rule.
exercise 4.16 (Continuous-time AWGN capacity) To prove the formula for the
capacity C of the continuous-time AWGN channel of noise power density N0 /2
when signals are power-limited to P and frequency-limited to (−W/2, W/2), we first
derive the capacity Cd for the discrete-time AWGN channel of noise variance σ 2
and symbols constrained to average energy not exceeding Es . The two expressions
are:
Cd = (1/2) log2(1 + Es/σ²)  [bits per channel use],
C = (W/2) log2(1 + P/(W N0/2))  [bps].
To derive Cd we need tools from information theory. However, going from Cd to
C using the relationship n = W T is straightforward. To do so, let Gη be the set
of all signals that are frequency-limited to (− W W T T
2 , 2 ) and time-limited to (− 2 , 2 )
at level η. We choose η small enough that for all practical purposes all signals of
Gη are strictly frequency-limited to (− W W T T
2 , 2 ) and strictly time-limited to (− 2 , 2 ).
Each waveform in Gη is represented by an n-tuple and as T goes to infinity n
approaches W T . Complete the argument assuming n = W T and without worrying
about convergence issues.
(a) Using the capacity formula, determine the energy per symbol EsC (k) needed
to transmit k bits per channel use. (The superscript C stands for channel
capacity.) At any rate below capacity it is possible to make the error prob-
ability arbitrarily small by increasing the codeword length. This implies that
there is a way to achieve the desired error probability at energy per symbol
EsC (k).
(b) Using single-shot m-PAM, we can achieve an arbitrarily small error probability by making the parameter a sufficiently large. As the size m of the constellation increases, the edge effects become negligible, and the average error probability approaches 2Q(a/σ), which is the probability of error conditioned on an interior point being transmitted. Find the numerical value of the parameter a for which 2Q(a/σ) = 10⁻⁵. (You may use (1/2) exp(−x²/2) as an approximation of Q(x).)
(c) Having fixed the value of a, we can use equation (4.1) to determine the average
energy EsP (k) needed by PAM to send k bits at the desired error probability.
(The superscript P stands for PAM.) Find and compare the numerical values
of EsP (k) and EsC (k) for k = 1, 2, 4.
(d) Find lim_{k→∞} EsC(k + 1)/EsC(k) and lim_{k→∞} EsP(k + 1)/EsP(k).
(e) Comment on PAM’s efficiency in terms of energy per bit for small and large
values of k. Comment also on the relationship between this exercise and
Example 4.3.
5 Symbol-by-symbol on a pulse
train: Second layer revisited
5.1 Introduction
In this and the following chapter, we focus on the signal design problem. This
chapter is devoted to the waveform former and its receiver-side counterpart, the
n-tuple former. In Chapter 6 we focus on the encoder/decoder pair.1
In principle, the results derived in this chapter can be applied to both baseband
and passband communication. However, for reasons of flexibility, hardware costs,
and robustness, we design the waveform former for baseband communication and
assign to the up-converter, discussed in Chapter 7, the task of converting the
waveform-former output into a signal suitable for passband communication.
Symbol-by-symbol on a pulse train will emerge as a natural signaling technique.
To keep the notation to the minimum, we write
w(t) = Σ_{j=1}^{n} sj ψ(t − jT) (5.1)
instead of wi(t) = Σ_{j=1}^{n} ci,j ψ(t − jT). We drop the message index i from wi(t)
because we will be studying properties of the pulse ψ(t), as well as properties of
the stochastic process that models the transmitter output signal, neither of which
depends on a particular message choice. Following common practice, we refer to
sj as a symbol .
example 5.1 (PAM signaling) PAM signaling (PAM for short) is indeed symbol-
by-symbol on a pulse train, with the symbols taking value in a PAM alphabet as
described in Figure 2.9. It depends on the encoder whether or not all sequences
with symbols taking value in the given PAM alphabet are allowed. As we will see
in Chapter 6, we can decrease the error probability by allowing only a subset of the
sequences.
We have seen the acronym PAM in three contexts that are related but should
not be confused. Let us review them. (i) PAM alphabet as the constellation of
1
The two chapters are essentially independent and could be studied in the reverse order,
but the results of Section 5.3 (which is independent of the other sections) are needed
for a few exercises in Chapter 6. The chosen order is preferable for continuity with the
discussion in Chapter 4.
points of Figure 2.9. (ii) Single-shot PAM as in Example 3.8. We have seen that
this signaling method is not appropriate for transmitting many bits. Therefore we
will not discuss it further. (iii) PAM signaling as in Example 5.1. This is symbol-
by-symbol on a pulse train with symbols taking value in a PAM alphabet. Similar
comments apply to QAM and PSK, provided that we view their alphabets as
subsets of C rather than of R2 . The reason it is convenient to do so will become
clear in Chapter 7.
As already mentioned, most modern communication systems rely on PAM,
QAM, or PSK signaling. In this chapter we learn the main tool to design the
pulse ψ(t).
The chapter is organized as follows. In Section 5.2, we develop an instructive
special case where the channel is strictly bandlimited and we rediscover symbol-
by-symbol on a pulse train as a natural signaling technique for that situation.
This also forms the basis for software-defined radio. In Section 5.3 we derive the
expression for the power spectral density of the transmitted signal for an arbitrary
pulse when the symbol sequence constitutes a discrete-time wide-sense stationary
process. As a preview, we discover that when the symbols are uncorrelated, which
is frequently the case, the spectrum is proportional to |ψF (f )|2 . In Section 5.4, we
derive the necessary and sufficient condition on |ψF (f )|2 in order for {ψj (t)}j∈Z
to be an orthonormal set when ψj (t) = ψ(t − jT ). (The condition is that |ψF (f )|2
fulfills the so-called Nyquist criterion.)
(Figure: the waveform w(t) passes through a filter h(t); white Gaussian noise N(t) of power spectral density N0/2 is added, producing R(t).)
numbers. The idea is to let the encoder produce these numbers and let the wave-
form former do the “interpolation” that converts the samples into the desired w(t).
theorem 5.2 (Sampling theorem) Let w(t) be a continuous L2 function (possibly complex-valued) and let its Fourier transform wF(f) vanish for f ∉ [−B, B]. Then w(t) can be reconstructed from the sequence of T-spaced samples w(nT), n ∈ Z, provided that T ≤ 1/(2B). Specifically,
w(t) = Σ_{n=−∞}^{∞} w(nT) sinc(t/T − n), (5.2)
where sinc(t) = sin(πt)/(πt).
Figure 5.2. (The waveform former ψ(t), the channel filter h(t) with additive noise N(t), and the n-tuple former: the matched filter ψ*(−t) sampled at t = jT to produce Yj.)
where sj = w(jT)√T. Hence a signal w(t) that fulfills the conditions of the sampling theorem is one that lives in the inner product space spanned by {ψ(t − jT)}_{j∈Z}. When we sample such a signal, we obtain (up to a scaling factor) the coefficients of its orthonormal expansion with respect to the orthonormal basis {ψ(t − jT)}_{j∈Z}.
Now let us go back to our communication problem. We have just seen that any
physical (continuous and L2 ) signal w(t) that has no energy outside the frequency
range [−B, B] can be synthesized as w(t) = Σ_j sj ψ(t − jT). This signal has exactly the form of symbol-by-symbol on a pulse train. To implement this signaling method we let the jth encoder output be sj = w(jT)√T, and let the waveform former be defined by the pulse ψ(t) = (1/√T) sinc(t/T). The waveform former, the channel, and
the n-tuple former are shown in Figure 5.2.
It is interesting to observe that we use the sampling theorem somewhat back-
wards, in the following sense. In a typical application of the sampling theorem,
the first step consists of sampling the source signal, then the samples are stored or
transmitted, and finally the original signal is reconstructed from the samples. To
the contrary, in the diagram of Figure 5.2, the transmitter does the (re)construction
as the first step, the (re)constructed signal is transmitted, and finally the receiver
does the sampling.
Notice also that ψ ∗ (−t) = ψ(t) (the sinc function is even and real-valued) and
its Fourier transform is
ψF(f) = √T for |f| ≤ 1/(2T), and ψF(f) = 0 otherwise
(Appendix 5.10 explains an effortless method for relating the rectangle and the sinc
as Fourier pairs). Therefore the matched filter at the receiver is a lowpass filter.
It does exactly what seems to be the right thing to do – remove the out-of-band
noise.
example 5.3 (Software-defined radio) The sampling theorem is the theoretical
underpinning of software-defined radio. No matter what the communications stan-
dard is (GSM, CDMA, EDGE, LTE, Bluetooth, 802.11, etc.), the transmitted
signal can be described by a sequence of numbers. In a software-defined-radio
implementation of a transmitter, the encoder that produces the samples is a com-
puter program. Only the program is aware of the standard being implemented.
The hardware that converts the sequence of numbers into the transmitted signal
(the waveform former of Figure 5.2) can be the same off-the-shelf device for all
standards. Similarly, the receiver front end that converts the received signal into
a sequence of numbers (the n-tuple former of Figure 5.2) can be the same for
all standards. In a software-defined-radio receiver, the decoder is implemented in
software. In principle, any past, present, and future standard can be implemented
by changing the encoder/decoder program. The sampling theorem was brought to
the engineering community by Shannon [24] in 1948, but only recently have we had the technology and the tools needed to make software-defined radio a viable solution. In particular, computers are becoming fast enough, real-time operating systems such as RT Linux make it possible to schedule critical events with precision, and the prototyping is greatly facilitated by the availability of high-level programming languages for signal processing such as MATLAB.
5.3 Power spectral density

We study the power spectral density of the transmitted process
X(t) = Σ_{i=−∞}^{∞} Xi ξ(t − iT − Θ), (5.4)
where {Xj}_{j∈Z} is a zero-mean wide-sense stationary (WSS) sequence of symbols, ξ(t) is a finite-energy pulse, and Θ is a random delay, independent of the symbols and uniformly distributed in [0, T). We denote by KX[i] = E[X_{j+i} Xj*] the autocovariance of the symbol sequence, which depends only on i since, by assumption, {Xj}_{j∈Z} is WSS. We use also
the self-similarity function of the pulse ξ(t), defined as
Rξ(τ) := ∫_{−∞}^{∞} ξ(α + τ) ξ*(α) dα. (5.5)
(Think of the definition of an inner product if you tend to forget where to put the * in the above definition.)
The process X(t) is zero-mean. Indeed, using the independence between Xi and Θ and the fact that E[Xi] = 0, we obtain E[X(t)] = Σ_{i=−∞}^{∞} E[Xi] E[ξ(t − iT − Θ)] = 0.
The autocovariance of X(t) is
KX(t + τ, t) = E[(X(t + τ) − E[X(t + τ)]) (X(t) − E[X(t)])*]
            = E[X(t + τ) X*(t)]
            = E[ Σ_{i=−∞}^{∞} Xi ξ(t + τ − iT − Θ) Σ_{j=−∞}^{∞} Xj* ξ*(t − jT − Θ) ]
            = E[ Σ_i Σ_j Xi Xj* ξ(t + τ − iT − Θ) ξ*(t − jT − Θ) ]
        (a) = Σ_i Σ_j E[Xi Xj*] E[ξ(t + τ − iT − Θ) ξ*(t − jT − Θ)]
            = Σ_i Σ_j KX[i − j] E[ξ(t + τ − iT − Θ) ξ*(t − jT − Θ)]
        (b) = Σ_k KX[k] Σ_i (1/T) ∫_0^T ξ(t + τ − iT − θ) ξ*(t − iT + kT − θ) dθ
        (c) = Σ_k KX[k] (1/T) ∫_{−∞}^{∞} ξ(t + τ − θ) ξ*(t + kT − θ) dθ
            = (1/T) Σ_k KX[k] Rξ(τ − kT),
where in (a) we use the fact that Xi Xj∗ and Θ are independent random variables,
in (b) we make the change of variable k = i − j, and in (c) we use the fact that
for an arbitrary function u : R → R, an arbitrary number a ∈ R, and a positive
(interval length) b,
Σ_{i=−∞}^{∞} ∫_a^{a+b} u(x + ib) dx = ∫_{−∞}^{∞} u(x) dx. (5.6)
(If (5.6) is not clear to you, picture integrating from a to a + 2b by integrating first
from a to a + b, then from a + b to a + 2b, and summing the results. This is the
right-hand side. Now consider integrating both times from a to a + b, but before
you perform the second integration you shift the function to the left by b. This is
the left-hand side.)
We see that KX (t + τ, t) depends only on τ . Hence we simplify notation and
write KX (τ ) instead of KX (t + τ, t). We summarize:
KX(τ) = (1/T) Σ_k KX[k] Rξ(τ − kT). (5.7)
The process X(t) is WSS because neither its mean nor its autocovariance
depend on t.
For the last step in the derivation of the power spectral density, we use the
fact that the Fourier transform of Rξ (τ ) is |ξF (f )|2 . This follows from Parseval’s
relationship,
Rξ(τ) = ∫_{−∞}^{∞} ξ(α + τ) ξ*(α) dα
      = ∫_{−∞}^{∞} ξF(f) ξF*(f) exp(j2πτf) df
      = ∫_{−∞}^{∞} |ξF(f)|² exp(j2πτf) df.
Now we can take the Fourier transform of KX(τ) to obtain the power spectral density
SX(f) = (|ξF(f)|²/T) Σ_k KX[k] exp(−j2πkfT). (5.8)
The above expression is in a form that suits us. In many situations, the infinite sum has only a small number of non-zero terms. Note that the summation in (5.8) is the discrete-time Fourier transform of {KX[k]}_{k∈Z}, evaluated at fT. This is the power spectral density of the discrete-time process {Xi}_{i∈Z}. If we think of |ξF(f)|²/T as being the power spectral density of ξ(t), we can interpret SX(f) as being the product of two PSDs, that of ξ(t) and that of {Xi}_{i∈Z}.
In many cases of interest, KX[k] = E 1{k = 0}, where E = E[|Xi|²]. In this case we say that the zero-mean WSS process {Xi}_{i∈Z} is uncorrelated. Then (5.7) and (5.8) simplify to
KX(τ) = E Rξ(τ)/T, (5.9)
SX(f) = E |ξF(f)|²/T. (5.10)
example 5.4 Suppose that {Xi}_{i∈Z} is an independent and uniformly distributed sequence taking values in {±√E}, and ξ(t) = (1/√T) sinc(t/T). Then Rξ(τ) = sinc(τ/T) and, by (5.9) and (5.10), KX(τ) = (E/T) sinc(τ/T) and SX(f) = E 1{f ∈ [−1/(2T), 1/(2T)]}.
In the next example, we work out a case where KX[k] ≠ E 1{k = 0}. In this case, we say that the zero-mean WSS process {Xi}_{i∈Z} is correlated.
When we compare this example to Example 5.4, we see that this encoder shapes
the power spectral density from a rectangular shape to a squared sinusoid. Notice
that the spectral density vanishes at f = 0. This is desirable if the channel blocks
very low frequencies, which happens for instance for a cable that contains ampli-
fiers. To avoid amplifying offset voltages and leaky currents, amplifiers are AC
(alternating current) coupled. This means that amplifiers have a highpass filter at
the input, often just a capacitor, that blocks DC (direct current) signals. Notice
that the encoder is a linear time-invariant system (with respect to addition and
multiplication in R). Hence the cascade of the encoder and the pulse forms a
linear time-invariant system. It is immediate to verify that its impulse response is
ξ̃(t) = ξ(t) − ξ(t − 2T). Hence in this case we can write
X(t) = Σ_l Xl ξ(t − lT) = Σ_l Bl ξ̃(t − lT).
The encoder in Example 5.5 is linear with respect to (the field) R and this is
the reason its effect can be incorporated into the pulse but this is not the case in
general (see Exercise 5.14 and see Chapter 6).
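To visualize the shaping effect, the sketch below evaluates (5.8) for the sinc pulse of Example 5.4, once with uncorrelated symbols and once with the autocovariance KX[0] = 2E, KX[±2] = −E that would result if, for instance, the symbols were formed as Xl = Bl − B_{l−2} from uncorrelated Bl of energy E. This encoder is an assumption used only for illustration; it yields a squared-sinusoid shaping that vanishes at f = 0.

import numpy as np
import matplotlib.pyplot as plt

T, E = 1.0, 1.0
f = np.linspace(-1.0 / T, 1.0 / T, 1000)
xiF2 = T * (np.abs(f) <= 1.0 / (2 * T))                  # |xi_F(f)|^2 for xi(t) = (1/sqrt(T)) sinc(t/T)
S_uncorr = (xiF2 / T) * E                                 # (5.10): uncorrelated symbols
shaping = 2 * E - 2 * E * np.cos(4 * np.pi * f * T)       # sum_k K_X[k] e^{-j2 pi k f T} for K_X[0]=2E, K_X[±2]=-E
S_shaped = (xiF2 / T) * shaping                           # (5.8)
plt.plot(f, S_uncorr, label="uncorrelated symbols")
plt.plot(f, S_shaped, label="shaped (vanishes at f = 0)")
plt.xlabel("f"); plt.legend(); plt.show()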
5.4 Nyquist criterion for orthonormal bases

We seek a condition on the pulse ψ(t) under which the set {ψ(t − jT)}_{j∈Z} is orthonormal, i.e. such that ⟨ψ(t − nT), ψ(t)⟩ = 1{n = 0} for all integers n.
The form of the left-hand side suggests using Parseval's relationship. Doing so yields
1{n = 0} = ∫_{−∞}^{∞} ψ(t − nT) ψ*(t) dt = ∫_{−∞}^{∞} ψF(f) ψF*(f) e^{−j2πnTf} df
         = ∫_{−∞}^{∞} |ψF(f)|² e^{−j2πnTf} df
     (a) = Σ_{k∈Z} ∫_{−1/(2T)}^{1/(2T)} |ψF(f − k/T)|² e^{−j2πnT(f − k/T)} df
     (b) = Σ_{k∈Z} ∫_{−1/(2T)}^{1/(2T)} |ψF(f − k/T)|² e^{−j2πnTf} df
     (c) = ∫_{−1/(2T)}^{1/(2T)} g(f) e^{−j2πnTf} df,
where in (a) we use again (5.6) (but in the other direction), in (b) we use the fact that e^{−j2πnT(f − k/T)} = e^{−j2πnTf}, and in (c) we introduce the function
g(f) = Σ_{k∈Z} |ψF(f − k/T)|².
Notice that g(f ) is a periodic function of period 1/T and the right-hand side of (c)
is 1/T times the nth Fourier series coefficient An of the periodic function g(f ). (A
review of Fourier series is given in Appendix 5.11.) Because A0 = T and Ak = 0 for
k ≠ 0, the Fourier series of g(f) is the constant T. Up to a technicality discussed
below, this proves the following result.
l.i.m. Σ_{k=−∞}^{∞} |ψF(f − k/T)|² = T,  f ∈ R. (5.13)
set ψ̃F (0) = 0. The inverse Fourier transform of ψ̃F (f ) is still ψ(t). Hence ψ(t) is
orthogonal to its T -spaced time translates. Yet (5.13) is no longer fulfilled if we
omit the l. i. m. For our specific example, the left and the right differ at exactly
one point of each period. Equality still holds in the l. i. m. sense. In all practical
applications, ψF (f ) is a smooth function and we can ignore the l. i. m. in (5.13).
Notice that the left side of the equality in (5.13) is periodic with period 1/T .
Hence to verify that |ψF (f )|2 fulfills Nyquist’s criterion with parameter T , it is
sufficient to verify that (5.13) holds over an interval of length 1/T .
example 5.7 The following functions satisfy Nyquist’s criterion with param-
eter T .
(a) (Constant but not T ) ψ(t) is orthogonal to its T -spaced time translates even
when the left-hand side of (5.13) is L2 equivalent to a constant other than T ,
but in this case ‖ψ(t)‖ ≠ 1. This is a minor issue, we just have to scale the
pulse to make it unit-norm.
(b) (Minimum bandwidth) A function |ψF(f)|² cannot fulfill Nyquist's criterion with parameter T if its support is contained in an interval of the form [−B, B] with 0 < B < 1/(2T). Hence, the minimum bandwidth to fulfill Nyquist's criterion is 1/(2T).
(c) (Test for bandwidths between 1/(2T) and 1/T) If |ψF(f)|² vanishes outside [−1/T, 1/T], the Nyquist criterion is satisfied if and only if |ψF(1/(2T) − ε)|² + |ψF(−1/(2T) − ε)|² = T for ε ∈ [−1/(2T), 1/(2T)] (see Figure 5.3). If, in addition, ψ(t) is real-valued, which is typically the case, then |ψF(−f)|² = |ψF(f)|². In this case, it is sufficient that we check the positive frequencies, i.e. Nyquist's criterion is met if
|ψF(1/(2T) − ε)|² + |ψF(1/(2T) + ε)|² = T,  ε ∈ [0, 1/(2T)].
This means that |ψF(1/(2T))|² = T/2 and the amount by which the function |ψF(f)|² varies when we go from f = 1/(2T) to f = 1/(2T) − ε is compensated by the function's variation in going from f = 1/(2T) to f = 1/(2T) + ε. For examples of such a band-edge symmetry see Figure 5.3, Figure 5.6a, and the functions (b) and (c) in Example 5.7. The bandwidth BN = 1/(2T) is sometimes called the Nyquist bandwidth.
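As a numeric sanity check of (5.13), the sketch below verifies the criterion for the raised-cosine spectrum, a standard choice of |ψF(f)|² that fulfills it; the roll-off value is an arbitrary assumption.

import numpy as np

T, beta = 1.0, 0.5

def psiF2(f):                                    # raised-cosine spectrum, which satisfies the Nyquist criterion
    f = np.abs(f)
    flat = f <= (1 - beta) / (2 * T)
    roll = (f > (1 - beta) / (2 * T)) & (f <= (1 + beta) / (2 * T))
    out = np.where(flat, T, 0.0)
    out = np.where(roll, (T / 2) * (1 + np.cos(np.pi * T / beta * (f - (1 - beta) / (2 * T)))), out)
    return out

f = np.linspace(-0.5 / T, 0.5 / T, 1001)         # one period suffices, see the remark above
total = sum(psiF2(f - k / T) for k in range(-3, 4))
print(np.max(np.abs(total - T)))                 # essentially zero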
Figure 5.3. Band-edge symmetry for a pulse |ψF(f)|² that vanishes outside [−1/T, 1/T] and fulfills Nyquist's criterion.
(d) (Test for arbitrary finite bandwidths) When the support of |ψF(f)|² is wider than 1/T, it is harder to see whether or not Nyquist's criterion is met with parameter T. A convenient way to organize the test goes as follows. Let I be the set of integers i for which the frequency interval of width 1/T centered at fi = 1/(2T) + i/T intersects with the support of |ψF(f)|². For the example of Figure 5.4, I = {−3, −2, 1, 2}, and the frequencies fi, i ∈ I, are marked with a "×". For each i ∈ I, we consider the function |ψF(fi + ε)|², ε ∈ [−1/(2T), 1/(2T)], as shown in Figure 5.5. Nyquist's criterion is met if and only if the sum of these functions,
g(ε) = Σ_{i∈I} |ψF(fi + ε)|²,  ε ∈ [−1/(2T), 1/(2T)],
is constant and equal to T.
Figure 5.4. (The support of |ψF(f)|², with the frequencies fi, i ∈ I, marked by ×.)
Figure 5.5. (The functions |ψF(fi + ε)|², ε ∈ [−1/(2T), 1/(2T)], for i = −3, −2, 1, 2.)
2
A root-raised-cosine impulse response should not be confused with a root-raised-cosine
function.
5.5 Root-raised-cosine family
Figure 5.6. ((a) |ψF(f)|² and (b) ψ(t) for a pulse of the root-raised-cosine family.)
Figure 5.7. (The functions applied at the matched filter input, panels (a) and (b).)
Figure 5.8 shows the matched filter outputs obtained when the functions of
Figure 5.7 are applied at the filter’s input. Specifically, Figure 5.8a shows the
train of symbol-scaled self-similarity functions. From the figure we see that (5.15)
is satisfied. (When a pulse achieves its maximum, which has value 1, the other
pulses vanish.) We see it also from Figure 5.8b, in that the signal y(t) takes values
in the symbol alphabet {±1} at the sampling times t = 0, 10, 20, 30.
If ψ(t) is not orthogonal to ψ(t−iT ), which can happen for instance if a truncated
pulse is made too short, then Rψ (iT ) will be non-zero for several integers i. If we
define li = Rψ (iT ), then we can write
Figure 5.8. ((a) The train of symbol-scaled self-similarity functions; (b) the resulting matched filter output y(t).)
y(jT) = Σ_i si l_{j−i}.
The fact that the noiseless y(jT) depends on multiple symbols is referred to as inter-symbol interference (ISI).
There are two main causes of ISI. We have already mentioned one, specifically when Rψ(τ) is non-zero for more than one sample. ISI happens also if the matched-filter output is not sampled at the correct times. In this case, we obtain y(jT + Δ) = Σ_i si Rψ((j − i)T + Δ), which is again of the form Σ_i si l_{j−i} for li = Rψ(iT + Δ).
5.6 Eye diagrams

The eye diagram is a technique that allows us to visualize if there is ISI and to see how critical it is that the sampling time be precise. The eye diagram is obtained from the matched-filter output before sampling. Let y(t) = Σ_i si Rψ(t − iT) be the noiseless matched filter output, with symbols taking value in some discrete set
S. For the example that follows, S = {±1}. To obtain the eye diagram, we plot
the superposition of traces of the form y(t − iT ), t ∈ [−T, T ], for various integers
i. Figure 5.9 gives examples of eye diagrams for various roll-off factors and pulse
truncation lengths. Parts (a), (c), and (d) show no sign of ISI. Indeed, all traces
go through ±1 at t = 0, which implies that y(iT ) ∈ S. We see that truncating
the pulse to length 20T does not lead to ISI for either roll-off factor. However,
ISI is present when β = 0.25 and the pulse is truncated to 4T (part (b)). We
see its presence from the fact that the traces go through various values at t = 0.
This means that y(iT ) takes on values outside S. These examples are meant to
illustrate the point made in the last paragraph of the previous section.
Note also that the eye, the untraced space in the middle of the eye diagram, is
wider in (c) than it is in (a). The advantage of a wider eye is that the system is more
tolerant to small variations (jitter) in the sampling time. This is characteristic of
a larger β and it is a consequence of the fact that as β increases, the pulse decays
faster as a function of |t|. For the same reason, a pulse with larger beta can be
truncated to a shorter length, at the price of a larger bandwidth.
Figure 5.9. Eye diagrams of Σ_i si Rψ(t − iT) for si ∈ {±1} and pulse of the
root-raised-cosine family with T = 10. The abscissa is the time. The roll-off
factor is β = 0.25 for the top figures and β = 0.9 for the bottom ones. The
pulse is truncated to length 20T for the figures on the left, and to length 4T
for those on the right. The eye diagram of part (b) shows the presence of ISI.
The MATLAB program used to produce the above plots can be downloaded from
the book web page.
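The following is a minimal Python sketch (distinct from the MATLAB program mentioned above) of how such a plot can be generated. It assumes the raised-cosine shape for the self-similarity function Rψ of an untruncated root-raised-cosine pulse, forms the noiseless matched filter output y(t) = Σ_i s_i Rψ(t − iT) for random symbols s_i ∈ {±1}, and overlays traces of length 2T; because truncation effects are ignored, the eye is open, as in part (a).

import numpy as np
import matplotlib.pyplot as plt

T, beta = 10.0, 0.25          # symbol period and roll-off factor (as in part (a))
sps, n_sym = 50, 120          # samples per symbol period and number of symbols (assumed)

def rc(t):
    # Raised cosine: self-similarity function of an (untruncated) root-raised-cosine pulse.
    t = np.asarray(t, dtype=float)
    den = 1.0 - (2.0 * beta * t / T) ** 2
    ok = np.abs(den) > 1e-8
    out = np.empty_like(t)
    out[ok] = np.sinc(t[ok] / T) * np.cos(np.pi * beta * t[ok] / T) / den[ok]
    out[~ok] = (np.pi / 4) * np.sinc(1 / (2 * beta))   # limiting value at t = +-T/(2 beta)
    return out

rng = np.random.default_rng(0)
s = rng.choice([-1.0, 1.0], size=n_sym)
t = np.arange(n_sym * sps) * (T / sps)
y = np.zeros_like(t)
for i, si in enumerate(s):                 # noiseless matched filter output
    y += si * rc(t - i * T)

for i in range(5, n_sym - 5):              # overlay traces y(t - iT), t in [-T, T)
    idx = slice((i - 1) * sps, (i + 1) * sps)
    plt.plot(t[idx] - i * T, y[idx], color="C0", alpha=0.3)
plt.xlabel("t"); plt.ylabel("y"); plt.show()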
The popularity of the eye diagram lies in the fact that it can be obtained quite
easily by looking at the matched-filter output with an oscilloscope triggered by
the clock that produces the sampling time. The eye diagram is very informative
even if the channel has attenuated the signal and/or has added noise.
5.7 Symbol synchronization
Suppose that the transmission starts with a training signal s(t) known to the
receiver. For the moment, we assume that s(t) is real-valued. Extension to a
complex-valued s(t) is done in Section 7.5.
The channel output signal is
R(t) = αs(t − θ) + N(t),
where N(t) is white Gaussian noise of power spectral density N0/2, α is an unknown
scaling factor (channel attenuation and receiver front-end amplification), and θ is
the unknown parameter to be estimated. We can assume that the receiver knows
that θ is in some interval, say [0, θmax ], for some possibly large constant θmax .
To describe the ML estimate of θ, we need a statistical description of the received
signal as a function of θ and α. Towards this goal, suppose that we have an
orthonormal basis φ1 (t), φ2 (t), . . . , φn (t) that spans the set {s(t−θ̂) : θ̂ ∈ [0, θmax ]}.
To simplify the notation, we assume that the orthonormal basis is finite, but an
infinite basis is also a possibility. For instance, if s(t) is continuous, has finite
duration, and is essentially bandlimited, then the sampling theorem tells us that
we can use sinc functions for such a basis.
For i = 1, . . . , n, let Yi = ⟨R(t), φi(t)⟩ and let yi be the observed sample value of
Yi. The random vector Y = (Y1, . . . , Yn)^T consists of independent random variables
with Yi ∼ N(αmi(θ), σ²), where mi(θ) = ⟨s(t − θ), φi(t)⟩ and σ² = N0/2. Hence,
the density of Y parameterized by θ and α is

f(y; θ, α) = (2πσ²)^{−n/2} exp( − Σ_{i=1}^{n} (yi − αmi(θ))² / (2σ²) ).   (5.17)
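The maximization that yields the ML estimate is carried out in the text and leads to (5.18), not reproduced here; as discussed at the end of this section, it essentially amounts to correlating the received signal with shifted copies of the training signal and picking the shift with the largest correlation. A minimal discrete-time sketch of that idea, with an assumed rectangular-pulse training signal and illustrative parameter values:

import numpy as np

rng = np.random.default_rng(1)
fs = 100                        # samples per symbol period (assumed discretization)
T, L = 1.0, 32                  # symbol period and number of training symbols (assumed)
Es, alpha, theta = 1.0, 0.7, 0.3137   # alpha and theta are unknown to the receiver
sigma = 0.5                     # noise standard deviation per sample (assumed)

c = np.sqrt(Es) * (-1.0) ** np.arange(L)     # alternating +-sqrt(Es) training symbols
def train(delay, n):
    # s(t - delay) = sum_l c_l psi(t - lT - delay) with a unit-norm rectangular pulse.
    t = np.arange(n) / fs
    s = np.zeros(n)
    for l, cl in enumerate(c):
        s += cl * ((t >= l * T + delay) & (t < (l + 1) * T + delay)) / np.sqrt(T)
    return s

n = int((L + 1) * fs * T)
r = alpha * train(theta, n) + sigma * rng.standard_normal(n)   # received samples

cands = np.arange(0, T, 1 / fs)                     # candidate delays on a grid
corr = [np.dot(r, train(th, n)) for th in cands]    # correlation with shifted copies
theta_hat = cands[int(np.argmax(corr))]
print(f"true theta = {theta:.3f}, estimate = {theta_hat:.3f}")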
The shape of the training signal did not matter for the derivation of the ML
estimate, but it does matter for the delay locked loop approach. We assume that
the training signal takes the same form as the communication signal, namely
s(t) = Σ_{l=0}^{L−1} c_l ψ(t − lT),
where c0 , . . . , cL−1 are training symbols known to the receiver. The easiest way to
see how the delay locked loop works is to assume that ψ(t) is a rectangular pulse
ψ(t) = (1/√T) 1{0 ≤ t ≤ T}
and let the training symbol sequence c0, . . . , cL−1 be an alternating sequence of √Es
and −√Es. The corresponding received signal R(t) is α Σ_{l=0}^{L−1} c_l ψ(t − lT − θ) plus
white Gaussian noise, where α is the unknown scaling factor. If we neglect the noise
for the moment, the matched filter output (before sampling) is the convolution of
R(t) with ψ ∗ (−t), which can be written as
M(t) = α Σ_{l=0}^{L−1} c_l Rψ(t − lT − θ),
where
Rψ(τ) = (1 − |τ|/T) 1{−T ≤ τ ≤ T}
is the self-similarity function of ψ(t). Figure 5.10 plots a piece of M (t).
The desired sampling times of the form θ + lT , l integer, correspond to the
maxima and minima of M (t). Let tk be the kth sampling point. Until symbol
synchronization is achieved, M (tk ) is not necessarily near a maximum or a
minimum of M (t). For every sample point tk , we also collect an early sample
at tk^E = tk − Δ and a late sample at tk^L = tk + Δ, where Δ is some small
positive value (smaller than T/2). The dots in Figure 5.11 are examples of sample
values. Consider the cases when M(tk) is positive (parts (a) and (b)). We see that
M(tk^L) − M(tk^E) is negative when tk is late with respect to the target, and it is
positive when tk is early. The opposite is true when M(tk) is negative (parts (c)
and (d)). Hence, in general, [M(tk^L) − M(tk^E)] M(tk) can be used as a feedback
signal to the clock that determines the sampling times. A positive feedback signal
is a sign for the clock to speed up, and a negative value is a sign to slow down. This
can be implemented via a voltage-controlled oscillator (VCO), with the feedback
signal as the controlling voltage.
Now consider the effect of noise. The noise added to M (t) is zero-mean. Intu-
itively, if the VCO does not react too quickly to the feedback signal, or equivalently
if the feedback signal is lowpass filtered, then we expect the sampling point to settle
Figure 5.11. DLL sampling points. The three consecutive dots of each part
are examples of M(tk^E), M(tk), and M(tk^L), respectively.
at the correct position even when the feedback signal is noisy. A rigorous analysis
is possible but it is outside the scope of this text. For a more detailed introduction
on synchronization we recommend [14, Chapters 14–16], [15, Chapter 4], and the
references therein.
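A minimal sketch of this early-late mechanism (with the rectangular pulse and alternating training symbols assumed above, illustrative parameter values, and a crude noise model in which independent noise is added to each sample):

import numpy as np

rng = np.random.default_rng(2)
T, Es, alpha = 1.0, 1.0, 0.8      # assumed values; alpha is unknown to the DLL
theta = 0.37                      # true delay, unknown to the receiver
Delta, gain = 0.1 * T, 0.1        # early/late offset and loop gain (assumed)
sigma = 0.05                      # noise standard deviation per sample (crude model)
L = 400                           # number of training symbols

c = np.sqrt(Es) * (-1.0) ** np.arange(L)     # alternating training symbols

def tri(tau):                                 # R_psi for the rectangular pulse
    return np.maximum(1.0 - np.abs(tau) / T, 0.0)

def M(t):                                     # noiseless matched filter output
    l = np.arange(L)
    return alpha * np.sum(c * tri(t - l * T - theta))

tau_hat = 0.0
for k in range(2, L - 2):                     # one early/on-time/late triple per symbol
    tk = k * T + tau_hat
    mE = M(tk - Delta) + sigma * rng.standard_normal()
    mk = M(tk)         + sigma * rng.standard_normal()
    mL = M(tk + Delta) + sigma * rng.standard_normal()
    e = (mL - mE) * mk                        # early-late feedback signal
    tau_hat += gain * e                       # positive: sample later; negative: earlier

print(f"true delay {theta:.3f}, final estimate {tau_hat:.3f}")

With a small gain, the estimate drifts toward θ and then stays near it despite the noise, which is the lowpass-filtering effect mentioned above.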
Notice the similarity between the ML and the DLL solution. Ultimately they
both make use of the fact that a self-similarity function achieves its maximum at
the origin. It is also useful to think of a correlation such as
∫ r(t)s(t − θ̂)dt / ( ‖r(t)‖ ‖s(t)‖ )
as a measure of the degree of similarity between two functions, where the denom-
inator serves to make the result invariant to a scaling of either function. In this
case the two functions are r(t) = αs(t − θ) + n(t) and s(t − θ̂) and the maximum
is achieved when s(t − θ̂) lines up with s(t − θ). The solution to the ML approach
correlates with the entire training signal s(t), whereas the DLL correlates with
the pulse ψ(t); but it does so repeatedly and averages the results. The DLL is
designed to work with a VCO, and together they provide a complete and easy-to-
implement solution that tracks the sampling times.3 It is easy to see that the DLL
provides valuable feedback even after the transition from the training symbols
to the regular symbols, provided that the symbols change polarity sufficiently
often. To implement the ML approach, we still need a good way to find the
maximum of (5.18) and to offset the clock accordingly. This can easily be done
if the receiver is implemented in a digital signal processor (DSP) but could be
costly in terms of additional hardware if the receiver is implemented with analog
technology.
3
The DLL can be interpreted as a stochastic gradient descent method that seeks the ML
estimate of θ.
5.8 Summary
The signal design consists of choosing the finite-energy signals w0 (t), . . . , wm−1 (t)
that represent the messages. Rather than choosing the signal set directly, we choose
an orthonormal basis {ψ1 (t), . . . , ψn (t)} and a codebook {c0 , . . . , cm−1 } ⊂ Rn and,
for i = 0, . . . , m − 1, we define
w_i(t) = Σ_{j=1}^{n} c_{i,j} ψ_j(t).
In doing so, we separate the signal design problem into two subproblems: finding
an appropriate orthonormal basis and codebook. In this chapter, we have focused
on the choice of the orthonormal basis. The choice of the codebook will be the
topic of the next chapter.
Particularly interesting are those orthonormal bases that consist of an appropri-
ately chosen unit-norm function ψ(t) and its T -spaced translates. Then the generic
form of a signal is
w(t) = Σ_{j=1}^{n} s_j ψ(t − jT).
In this case, the n inner products performed by the n-tuple former can be obtained
by means of a single matched filter with the output sampled at n time instants.
We call sj the jth symbol. In theory, a symbol can take values in R or in C; in
practice, the symbol alphabet is some discrete subset S of R or of C. For instance,
PAM symbols are in R, and QAM or PSK symbols, viewed as complex-valued
numbers, are standard examples of symbols in C (see Chapter 7).
If the symbol sequence is a realization of an uncorrelated WSS process, then the
power spectral density of the transmitted signal is SX (f ) = E|ψF (f )|2 /T , where E
is the average energy per symbol. The pulse ψ(t) has a unit norm and is orthogonal
to its T -spaced translates if and only if |ψF (f )|2 fulfills Nyquist’s criterion with
parameter T (Theorem 5.6).
Typically ψ(t) is real-valued, in which case |ψF(f)|² is an even function. To save
bandwidth, we often choose |ψF(f)|² in such a way that it vanishes for f ∉ [−1/T, 1/T].
When these two conditions are satisfied, Nyquist’s criterion is fulfilled if and only
if |ψF (f )|2 has the so-called band-edge symmetry.
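As a quick numerical check of the criterion, the following sketch uses the standard raised-cosine shape for |ψF(f)|² (which has the band-edge symmetry; T and β are illustrative values) and verifies that Σ_l |ψF(f − l/T)|² = T on a grid of frequencies.

import numpy as np

T, beta = 1.0, 0.5          # assumed symbol period and roll-off factor

def G(f):
    # |psi_F(f)|^2 for the standard raised-cosine shape with roll-off beta.
    f = np.abs(np.asarray(f, dtype=float))
    f1, f2 = (1 - beta) / (2 * T), (1 + beta) / (2 * T)
    out = np.zeros_like(f)
    out[f <= f1] = T
    band = (f > f1) & (f <= f2)
    out[band] = (T / 2) * (1 + np.cos(np.pi * T / beta * (f[band] - f1)))
    return out

f = np.linspace(-0.5 / T, 0.5 / T, 1001)           # checking one period suffices
total = sum(G(f - l / T) for l in range(-3, 4))    # a few shifts suffice (finite support)
print("max deviation from T:", np.max(np.abs(total - T)))

The printed deviation is at the level of rounding errors, confirming the criterion for this pulse.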
It is instructive to compare the sampling theorem to the Nyquist criterion.
Both are meant for signals of the form Σ_i s_i ψ(t − iT), where ψ(t) is orthogonal
to its T -spaced time translates. Without loss of generality, we can assume that
ψ(t) has a unit norm. In this case, |ψF (f )|2 fulfills the Nyquist criterion with
parameter T . Typically we use the Nyquist criterion to select a pulse that leads
to symbol-by-symbol on a pulse train of a desired power spectral density. In the
sampling theorem, we choose ψ(t) = sinc(t/T)/√T because the orthonormal basis
B = {sinc[(t − iT)/T]/√T}_{i∈Z} spans the inner product space of continuous L2
functions that have a vanishing Fourier transform outside [−1/(2T), 1/(2T)].
in this space can be represented by the coefficients of its orthonormal expansion
with respect to B.
sj → ⊕ → Yj = sj + Zj,   Zj ∼ N(0, N0/2)
Figure 5.12. Equivalent discrete-time channel used at the rate of one channel
use every T seconds. The noise is iid.
The eye diagram is a field test that can easily be performed to verify that the
matched filter output at the sampling times is as expected. It is also a valuable
tool for designing the pulse ψ(t).
Because the matched filter output is sampled at a rate of one sample every T
seconds, we say that the discrete-time AWGN channel seen by the top layer is used
at a rate of one symbol every T seconds. This channel is depicted in Figure 5.12.
Questions that pertain to the encoder/decoder pair (such as the number of bits
per symbol, the average energy per symbol, and the error probability) can be
answered assuming that the top layer communicates via the discrete-time channel.
Questions that pertain to the signal’s time/frequency characteristics need to take
the waveform former into consideration. Essentially, the top two layers can be
designed independently.
example 5.8 The Dirichlet function f : [0, 1] → {0, 1} is 0 where its argument
is irrational and 1 otherwise. Its Lebesgue integral is well defined and equal to 0.
But to see why, we need to introduce the notion of measure. We will do so shortly.
The Riemann integral of the Dirichlet function is undefined. The problem is that
in each interval of the abscissa, no matter how small, we can find both rational and
irrational numbers. So to each approximating rectangle, we can assign height 0 or
1. If we assign height 0 to all approximating rectangles, the integral is approximated
by 0; if we assign height 1 to all rectangles, the integral is approximated by 1. And
we can choose either, no matter how narrow we make the vertical slices. Clearly, we can
create approximation sequences that do not converge. This could not happen with
a continuous function or a function that has a finite number of discontinuities, but
the Dirichlet function is discontinuous everywhere.
The Dirichlet is such a function. When we apply the Lebesgue construction to the
Dirichlet function, we effectively partition the domain of the function, namely the
unit interval [0, 1], into two subsets, say the set A that contains the rationals and
the set B that contains the irrationals. If we could assign a “total length” L(A)
and L(B) to these sets as we do for countable unions of disjoint intervals of the
real line, then we could say that the Lebesgue integral of the Dirichlet function
is 1 × L(A) + 0 × L(B) = L(A). The Lebesgue measure does precisely this. The
Lebesgue measure of A is 0 and that of B is 1. This is not surprising, given that
[0, 1] contains a countable number of rationals and an uncountable number of
irrationals. Hence, the Lebesgue integral of the Dirichlet function is 0.
Note that it is not possible to assign the equivalent of a “length” to every subset
of R. Every attempt to do so leads to contradictions, whereby the measure of the
union of certain disjoint sets is not the sum of the individual measures. The subsets
of R to which we can assign the Lebesgue measure are called Lebesgue measurable
sets. It is hard to come up with non-measurable sets; for an example see the end of
Appendix 4.9 of [2]. A function is said to be Lebesgue measurable if Lebesgue's
construction partitions the abscissa into Lebesgue measurable sets.
We would not mention the Lebesgue integral if it were just to integrate bizarre
functions. The real power of Lebesgue integration theory comes from a number
of theorems that give precise conditions under which a limit and an integral
can be exchanged. Such operations come up frequently in the study of Fourier
transforms and Fourier series. The reader should not be alarmed at this point.
We will interchange integrals, swap integrals and sums, swap limits and integrals,
all without a fuss: but we do this because we know that the Lebesgue integration
theory allows us to do so, in the cases of interest to us.
In introducing the Lebesgue integral, we have assumed that the function being
integrated is non-negative. The same idea applies to non-positive functions. (In
fact, we can integrate the negative of the function and then take the negative of
the result.) If the function takes on positive as well as negative values, we split
the function into its positive and negative parts, we integrate each part separately,
and we add the two results. This works as long as the two intermediate results
are not +∞ and −∞, respectively. If they are, the Lebesgue integral is undefined;
otherwise the Lebesgue integral is defined.
A complex-valued function g : R → C is integrated by separately integrating its
real and imaginary parts. If the Lebesgue integral over both parts is defined and
finite, the function is said to be Lebesgue integrable. The set of Lebesgue-integrable
functions is denoted by L1 . The notation L1 comes from the easy-to-verify fact
that for a Lebesgue measurable function g, the Lebesgue integral is finite if and
only if ∫_{−∞}^{∞} |g(x)|dx < ∞. This integral is the L1 norm of g(x).
Every bounded Riemann-integrable function of bounded support is Lebesgue
integrable and the values of the two integrals agree. This statement does not extend
to functions that are defined over the real line. To see why, consider integrating
sinc(t) over the real line. The Lebesgue integral of sinc(t) is not defined, because
the integral of the positive part of sinc(t) and that over the negative part are
+∞ and −∞, respectively. The Riemann integral of sinc(t) exists because, by
definition, Riemann integrates from −T to T and then lets T go to infinity.
5.9. Appendix: L2 , and Lebesgue integral: A primer 183
All functions that model physical processes are finite-energy and measurable,
hence L2 functions. All finite-energy functions that we encounter in this text are
measurable, hence L2 functions. There are examples of finite-energy functions that
are not measurable, but it is hard to imagine an engineering problem where such
functions would arise.
The set of L2 functions forms a complex vector space with the zero vector being
the all-zero function. If we modify an L2 function in a countable number of points,
the result is also an L2 function and the (Lebesgue) integral over the two functions
is the same. (More generally, the same is true if we modify the function over a set
of measure zero.) The difference between the two functions is an L2 function ξ(t)
such that the Lebesgue integral ∫|ξ(t)|² dt = 0. Two L2 functions that have this
property are said to be L2 equivalent.
Unfortunately, L2 with the (standard) inner product that maps a(t), b(t) ∈ L2 to
⟨a, b⟩ = ∫ a(t)b*(t)dt   (5.19)
does not form an inner product space because axiom (c) of Definition 2.36 is not
fulfilled. In fact, ⟨a, a⟩ = 0 implies that a(t) is L2 equivalent to the zero function,
but this is not enough: to satisfy the axiom, a(t) must be the all-zero function.
There are two obvious ways around this problem if we want to treat finite-
energy functions as vectors of an inner product space. One way is to consider
only subspaces V of L2 such that there is only one vector ξ(t) in V that has the
property ∫|ξ(t)|² dt = 0. This will always be the case when V is spanned by a set
W of waveforms that represent electrical signals.
example 5.9 The set V that consists of the continuous functions of L2 is a
vector space. Continuity is sufficient to ensure that there is only one function
ξ(t) ∈ V for which the integral ∫|ξ(t)|² dt = 0, namely the zero function. Hence V
equipped with the standard inner product is an inner product space.4
Another way is to form equivalence classes. Two signals that are L2 equivalent
cannot be distinguished by means of a physical experiment. Hence the idea of
partitioning L2 into equivalence classes, with the property that two functions
are in the same equivalence class if and only if they are L2 equivalent. With an
appropriately defined vector addition and multiplication of a vector by a scalar,
the set of equivalence classes forms a complex vector space. We can use (5.19)
to define an inner product over this vector space. The inner product between two
equivalence classes is the result of applying (5.19) with an element of the first class
and an element of the second class. The result does not depend on which element
of a class we choose to perform the calculation. This way L2 can be transformed
into an inner product space denoted by L2 . As a “sanity check”, suppose that we
want to compute the inner product of a vector with itself. Let a(t) be an arbitrary
element of the corresponding class. If ⟨a, a⟩ = 0 then a(t) is in the equivalence class
that contains all the functions that have 0 norm. This class is the zero vector.
4
However, one can construct a sequence of continuous functions that converges to a
discontinuous function. In technical terms, this inner product space is not complete.
An L2 function f(t) defined over a finite-length interval is also an L1 function, i.e. ∫|f(t)|dt is
finite, which excludes the ∞ − ∞ problem mentioned in Appendix 5.9. Then also
∫|f(t)e^{j2παt}|dt = ∫|f(t)|dt is finite. Hence (5.21) is finite.
If the L2 function is defined over R, then the ∞ − ∞ problem mentioned in
Appendix 5.9 can arise. This is the case with the sinc(t) function. In such cases,
we truncate the function to the interval [−T, T ], compute its Fourier transform,
and let T go to infinity. It is in this sense that the Fourier transform of sinc(t)
is defined. An important result of Fourier analysis says that we are allowed to do
so (see Plancherel’s theorem, [2, Section 4.5.2]). Fortunately we rarely have to do
this, because the Fourier transform of most functions of interest to us is tabulated.
Thanks to Plancherel’s theorem, we can make the sweeping statement that the
Fourier transform is defined for all L2 functions. The transformed function is in
L2, hence its inverse transform is also defined. Be aware, though, that when we compute the
transform and then the inverse transform of the result, we do not necessarily obtain
the original function. However, what we obtain is L2 equivalent to the original. As
already mentioned, no physical experiment will ever be able to detect the difference
between the original and the modified function.
example 5.10 The function g(t) that has value 1 at t = 0 and 0 everywhere
else is an L2 function. Its Fourier transform gF (f ) is 0 everywhere. The inverse
Fourier transform of gF (f ) is also 0 everywhere.
One way to remember whether (5.20) or (5.21) has the minus sign in the
exponent is to think of the Fourier transform as a tool that allows us to write
a function g(u) as a linear combination of complex exponentials. Hence we are
writing g(u) = ∫ gF(α)φ(α, u)dα with φ(α, u) = exp(j2πuα) viewed as a function
of u with parameter α. Technically this is not an orthonormal expansion, but
it looks like one, where gF (α) is the coefficient of the function φ(α, u). Like for
an orthonormal expansion, the coefficient is obtained from an expression that
takes the form gF(u) = ⟨g(α), φ(α, u)⟩ = ∫ g(α)φ*(α, u)dα. It is the complex
conjugate in the computation of the inner product that brings in the minus sign in
the exponent. We emphasize that we are working by analogy here. The complex
exponential has infinite energy – hence not a unit-norm function (at least not with
respect to the standard inner product).
A useful formula is Parseval’s relationship
∫ a(t)b*(t)dt = ∫ aF(f) b*F(f) df,   (Parseval)   (5.22)
transform. These properties follow directly from the definition of Fourier transform,
namely
g(0) = ∫_{−∞}^{∞} gF(α)dα   (5.23)
gF(0) = ∫_{−∞}^{∞} g(α)dα.   (5.24)
Everyone knows how to compute the area of a rectangle. But how do we compute
the area under a sinc? Here is where the second and not-so-well-known trick comes
in handy. The area under a sinc is the area of the triangle inscribed in its main
lobe. Hence the integral under the two curves of Figure 5.14 is identical and equal
to ab, and this is true for all positive values a, b.
Let us consider a specific example of how to use the above two tricks. It does
not matter if we start from a rectangle or from a sinc and whether we want to
find its Fourier transform or its inverse Fourier transform. Let a, b, c, and d be as
shown in Figure 5.15.
We want to relate a, b to c, d (or vice versa). Since b must equal the area under
the sinc and d the area under the rectangle, we have
b = cd
d = 2ab,
Figure 5.14. ∫_{−∞}^{∞} b sinc(x/a)dx equals the area under the triangle on the right.
Figure 5.15. (Left: the rectangle b·1{−a ≤ x ≤ a}; right: the sinc d·sinc(x/c).)
e^{−a|t|}  ⇐⇒  2a / (a² + (2πf)²)
e^{−at}, t ≥ 0  ⇐⇒  1 / (a + j2πf)
e^{−πt²}  ⇐⇒  e^{−πf²}
1{−a ≤ t ≤ a}  ⇐⇒  2a sinc(2af)
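These pairs can be spot-checked numerically; the following sketch approximates the Fourier integral by a Riemann sum for the rectangle and Gaussian entries (the value a = 0.7 and the frequencies are arbitrary illustrative choices).

import numpy as np

# Rectangle pair: 1{-a <= t <= a}  <=>  2a sinc(2af)
a = 0.7
t = np.linspace(-a, a, 400001); dt = t[1] - t[0]
for f in (0.0, 0.3, 1.1):
    numeric = np.sum(np.exp(-2j * np.pi * f * t)).real * dt
    print(f"rect  f={f:3.1f}: numeric {numeric:+.5f}, formula {2*a*np.sinc(2*a*f):+.5f}")

# Gaussian pair: e^{-pi t^2}  <=>  e^{-pi f^2}  (a wide window is enough, the tails are tiny)
t = np.linspace(-6, 6, 400001); dt = t[1] - t[0]
for f in (0.0, 0.5, 1.5):
    numeric = np.sum(np.exp(-np.pi * t**2) * np.exp(-2j * np.pi * f * t)).real * dt
    print(f"gauss f={f:3.1f}: numeric {numeric:+.5f}, formula {np.exp(-np.pi*f**2):+.5f}")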
See Exercise 5.16 for a list of useful Fourier transform relations and see Table
5.1 for the Fourier transform of a few L2 functions.
f(x) = Σ_{i∈Z} A_i e^{j2πxi/p}   (5.25)

f(x) 1{−p/2 ≤ x ≤ p/2} = Σ_{i∈Z} √p A_i (e^{j2πxi/p}/√p) 1{−p/2 ≤ x ≤ p/2},   (5.26)
where we have multiplied and divided by √p to make φ_i(x) = (e^{j2πxi/p}/√p) 1{−p/2 ≤ x ≤ p/2} a unit-norm function.
Notice that the right-hand side of (5.25) is periodic, and that of (5.26) has finite
support. In fact we can use the Fourier series for both kinds of functions, periodic
and finite-support. That Fourier series can be used for finite-support functions is
an obvious consequence of the fact that a finite-support function can be seen as
one period of a periodic function. In communication, we are more interested in
functions that have finite support. (Once we have seen one period of a periodic
function, we have seen them all: from that point on there is no information being
conveyed.)
In terms of mathematical rigor, there are two things that can go wrong in what
we have said. (1) For some i, the integral in (5.27) might not be defined or might
not be finite. We can show that neither is the case if f(x)1{−p/2 ≤ x ≤ p/2} is
an L2 function. (2) For a specific x, the truncated series Σ_{i=−l}^{l} √p A_i φ_i(x) might
not converge as l goes to infinity, or might converge to a value that differs from
f(x)1{−p/2 ≤ x ≤ p/2}. It is not hard to show that the norm of the function

f(x)1{−p/2 ≤ x ≤ p/2} − Σ_{i=−l}^{l} √p A_i φ_i(x)

goes to zero as l goes to infinity. Hence the two functions are L2 equivalent. We
write this as follows

f(x)1{−p/2 ≤ x ≤ p/2} = l.i.m. Σ_{i∈Z} √p A_i φ_i(x),
We summarize with the following rigorous statement. The details that we have
omitted in our “proof” can be found in [2, Section 4.4].
w(t) = Σ_k A_k 2b sinc(2bt + k) = Σ_k (A_k/T) sinc(t/T + k),

where T = 1/(2b).
We still need to determine A_k/T. It is straightforward to determine A_k from the
definition of the Fourier series, but it is even easier to plug t = nT in both sides
of the above expression to obtain w(nT) = A_{−n}/T. This completes the proof. To see
that we can easily obtain A_k from the definition (5.27), we write

A_k = (1/(2b)) ∫_{−b}^{b} wF(f) e^{−jπkf/b} df = T ∫_{−∞}^{∞} wF(f) e^{−j2πTkf} df = T w(−kT),

where the first equality is the definition of the Fourier coefficient A_k, the second
uses the fact that wF(f) = 0 for f ∉ [−b, b], and the third is the inverse Fourier
transform evaluated at t = −kT.
is defined for all (x1 , . . . , xk ) ∈ Rk . In words, the statistic is defined for every finite
collection Xt1 , Xt2 , . . . , Xtk of samples of {Xt : t ∈ R}.
The mean mX (t), the autocorrelation RX (s, t), and the autocovariance KX (s, t)
of a continuous-time stochastic process {Xt : t ∈ R} are, respectively,
mX (t) := E [Xt ] (5.28)
RX (s, t) := E [Xs Xt∗ ] (5.29)
KX (s, t) := E [(Xs − E [Xs ])(Xt − E [Xt ])∗ ] = RX (s, t) − mX (s)m∗X (t), (5.30)
where the “∗ ” denotes complex conjugation and can be omitted for real-valued
stochastic processes.5 For a zero-mean process, which is usually the case in our
applications, KX (s, t) = RX (s, t).
5
To remember that the “∗ ” in (5.29) goes on the second random variable, it helps to
observe the similarity between the definition of RX (s, t) and that of an inner product
such as ⟨a(t), b(t)⟩ = ∫ a(t)b*(t)dt.
This proves that KY (τ ) is the inverse Fourier transform of |hF (f )|2 SX (f ). Hence,
SY (f ) = |hF (f )|2 SX (f ).
cF(f) = √T cos( (πT/(2β)) (f + β/(2T)) ) 1{f ∈ [−β/(2T), β/(2T)]}.
After some manipulations of ψ(t) = a(t) + b(t) we obtain the desired expression

ψ(t) = 4β [ cos((1 + β)πt/T) + ((1 − β)π/(4β)) sinc((1 − β)t/T) ] / ( π√T [1 − (4βt/T)²] ).
The picket fence miracle refers to the fact that the Fourier transform of a picket
fence is again a (scaled) picket fence. Specifically,
F[ Σ_{n=−∞}^{∞} δ(t − nT) ] = (1/T) Σ_{n=−∞}^{∞} δ(f − n/T),
where F[·] stands for the Fourier transform of the enclosed expression.
The above relationship can be derived by expanding the periodic function Σ_n δ(t − nT) as a
Fourier series, namely

Σ_{n=−∞}^{∞} δ(t − nT) = (1/T) Σ_{n=−∞}^{∞} e^{j2πtn/T}.
(The careful reader should wonder in which sense the above equality holds. We
are indeed being informal here.)
Taking the Fourier transform on both sides yields
F[ Σ_{n=−∞}^{∞} δ(t − nT) ] = (1/T) Σ_{n=−∞}^{∞} δ(f − n/T).
Using this notation, the relationship that we just proved can be written as
F[E_T(t)] = (1/T) E_{1/T}(f).
The picket fence miracle is a practical tool in engineering and physics, but in
the stated form it is not appropriate to obtain results that are mathematically
rigorous. An example follows.
example 5.15 We give an informal proof of the sampling theorem by using
the picket fence miracle. Let s(t) be such that sF(f) = 0 for f ∉ [−B, B] and
let T ≤ 1/(2B). We want to show that s(t) can be reconstructed from the T-spaced
samples {s(nT)}_{n∈Z}. Define

s|(t) = Σ_{n=−∞}^{∞} s(nT) δ(t − nT).
(Note that s| is just a name for the expression on the right-hand side of the
equality.) Using the fact that s(t)δ(t − nT ) = s(nT )δ(t − nT ), we can also write
s|(t) = s(t)ET (t).
6
The choice of the letter E is suggested by the fact that it looks like a picket fence when
rotated 90 degrees.
Figure 5.16. Fourier transform of a function s(t) (top) and of
s|(t) = Σ_n s(nT)δ(t − nT) (bottom).
From the figure, it is obvious that we can reconstruct the original signal s(t) by
filtering s|(t) with a filter that scales (1/T)sF(f) by T and blocks (1/T)sF(f − n/T)
for n ≠ 0. Such a filter exists if, like in the figure, the support of sF(f) does not
intersect with the support of sF(f − n/T) for n ≠ 0. This is the case if T ≤ 1/(2B).
(We allow equality because the output of a filter is unchanged if the filter's input is
modified at a countable number of points.) If h(t) is the impulse response of such
a filter, the filter output y(t) when the input is s|(t) satisfies

yF(f) = [ (1/T) Σ_{n=−∞}^{∞} sF(f − n/T) ] hF(f) = sF(f).
After taking the inverse Fourier transform, we obtain the reconstruction (also
called interpolation) formula
y(t) = [ Σ_{n=−∞}^{∞} s(nT)δ(t − nT) ] ⋆ h(t) = Σ_{n=−∞}^{∞} s(nT) h(t − nT) = s(t).
A specific filter that has the desired properties is the lowpass filter of frequency
response
hF(f) = T for f ∈ [−1/(2T), 1/(2T)], and 0 otherwise.
Its impulse response is sinc(t/T). Inserting into the reconstruction formula yields

s(t) = Σ_{n=−∞}^{∞} s(nT) sinc(t/T − n),   (5.31)
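A small sketch of (5.31) in action: a signal bandlimited to [−B, B] (an assumed two-tone example) is sampled with period T = 1/(2B) and rebuilt from a finite number of samples; the residual error is due to truncating the infinite sum.

import numpy as np

B, T = 2.0, 0.25                       # bandwidth and sampling period, T = 1/(2B)
def s(t):                              # a signal bandlimited to [-B, B] (assumed example)
    return 0.7 * np.cos(2 * np.pi * 0.6 * t) + 0.3 * np.sin(2 * np.pi * 1.7 * t)

n = np.arange(-400, 401)               # finitely many samples (truncation of (5.31))
samples = s(n * T)

t = np.linspace(-5, 5, 1001)           # evaluate well inside the sampled window
s_hat = samples @ np.sinc(t[None, :] / T - n[:, None])
print("max reconstruction error:", np.max(np.abs(s_hat - s(t))))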
The picket fence miracle is useful for computing the spectrum of certain signals
related to sampling. Examples are found in Exercises 5.1 and 5.8.
5.16 Exercises
Exercises for Section 5.2
exercise 5.1 (Sampling and reconstruction) Here we use the picket fence mir-
acle to investigate practical ways to approximate sampling and/or reconstruction.
We assume that for some positive B, s(t) satisfies sF(f) = 0 for f ∉ [−B, B]. Let
T be such that 0 < T ≤ 1/(2B).
exercise 5.2 (Sampling and projections) We have seen that the reconstruction
formula of the sampling theorem can be rewritten in such a way that it becomes
an orthonormal expansion (expression (5.3)). If ψj (t) is the jth element of an
orthonormal set of functions used to expand w(t), then the jth coefficient cj equals
the inner product ⟨w, ψj⟩. Explain why we do not need to explicitly perform an
inner product to obtain the coefficients used in the reconstruction formula (5.3).
Figure 5.17.
exercise 5.5 (Nyquist’s criterion) For each function |ψF (f )|2 in Figure 5.18,
indicate whether the corresponding pulse ψ(t) has unit norm and/or is orthogonal
to its time-translates by multiples of T. The function in Figure 5.18d is sinc²(fT).
Figure 5.18.
exercise 5.6 (Nyquist pulse) A communication system uses signals of the form
Σ_{l∈Z} s_l p(t − lT),
where sl takes values in some symbol alphabet and p(t) is a finite-energy pulse.
The transmitted signal is first filtered by a channel of impulse response h(t) and
then corrupted by additive white Gaussian noise of power spectral density N0/2. The
receiver front end is a filter of impulse response q(t).
(a) Neglecting the noise, show that the front-end filter output has the form
y(t) = Σ_{l∈Z} s_l g(t − lT),
A function g(t) that fulfills this condition is called a Nyquist pulse of param-
eter T . Prove the following theorem:
theorem 5.16 (Nyquist criterion for Nyquist pulses) The L2 pulse g(t) is
a Nyquist pulse (of parameter T ) if and only if its Fourier transform gF (f )
fulfills Nyquist’s criterion (with parameter T ), i.e.
l.i.m. Σ_{l∈Z} gF(f − l/T) = T,   f ∈ R.
Figure 5.19. (Left: pF(f); right: qF(f).)
exercise 5.7 (Pulse orthogonal to its T -spaced time translates) Figure 5.20
shows part of the plot of a function |ψF (f )|2 , where ψF (f ) is the Fourier transform
of some pulse ψ(t).
Figure 5.20.
Complete the plot (for positive and negative frequencies) and label the ordinate,
knowing that the following conditions are satisfied:
• For every pair of integers k, l, ∫ ψ(t − kT)ψ(t − lT)dt = 1{k = l};
• ψ(t) is real-valued;
• ψF(f) = 0 for |f| > 1/T.
exercise 5.8 (Nyquist criterion via picket fence miracle) Give an informal proof
of Theorem 5.6 (Nyquist criterion for orthonormal pulses) using the picket fence
miracle (Appendix 5.15). Hint: A function p(t) is a Nyquist pulse of parameter T
if and only if p(t)ET (t) = δ(t).
Figure 5.21. (The function g_F^{(3)}(f).)
(a) Show that for every n ≥ 1, g_F^{(n)}(f) fulfills Nyquist's criterion with parameter
T. Hint: It is sufficient that you verify that Nyquist's criterion is fulfilled for
f ∈ [0, 1/T]. Towards that end, first check what happens to the central rectangle
when you perform the operation Σ_{l∈Z} g_F^{(n)}(f − l/T). Then see how the small
rectangles fill in the gaps.
(b) As n goes to infinity, g_F^{(n)}(f) converges to g_F^{(0)}(f). (It converges for every f
and it converges also in L2, i.e. the squared norm of the difference g_F^{(n)}(f) −
g_F^{(0)}(f) goes to zero.) The peculiarity is that the limiting function g_F^{(0)}(f) fulfills
Nyquist's criterion with parameter T^{(0)} ≠ T. What is T^{(0)}?
(c) Suppose that we use symbol-by-symbol on a pulse train to communicate across
the AWGN channel. To do so, we choose a pulse ψ(t) such that |ψF(f)|² = g_F^{(n)}(f)
for some n, and we choose n sufficiently large that T/(2n) is much smaller
than the noise power spectral density N0/2. In this case, we can argue that our
bandwidth B is only 1/(3T). This means a 30% bandwidth reduction with respect
to the minimum absolute bandwidth 1/(2T). This reduction is non-negligible if we
pay for the bandwidth we use. How do you explain that such a pulse is not
used in practice? Hint: What do you expect ψ(t) to look like?
(d) Construct a function gF (f ) that looks like Figure 5.21 in the interval shown
by the figure except for the heights of the rectangles. Your function should have
infinitely many smaller rectangles on each side of the central rectangle and
(like g_F^{(n)}(f)) shall satisfy Nyquist's criterion. Hint: One such construction is
suggested by the infinite geometric series Σ_{i=1}^{∞} (1/2)^i, which adds to 1.
ties:
(i) it is T for f ∈ [0, 1/(2T) − β/(2T));
(ii) it equals s(f) for f ∈ [1/(2T) − β/(2T), 1/(2T) + β/(2T)];
(iii) it is 0 for f ∈ (1/(2T) + β/(2T), ∞);
(iv) it is an even function.
exercise 5.11 (Peculiarity of the sinc pulse) Let {U_k}_{k=0}^{n} be an iid sequence
of uniformly distributed bits taking value in {±1}. Prove that for certain values of
t and for n sufficiently large, s(t) = Σ_{k=0}^{n} U_k sinc(t − k) can become larger than
any given constant. Hint: The series Σ_{k=1}^{∞} 1/k diverges and so does Σ_{k=1}^{∞} 1/(k − a) for
any constant a ∈ (0, 1). Note: This implies that the eye diagram of s(t) is closed.
Miscellaneous exercises
(b) Now suppose that the (noiseless) channel outputs the input plus a delayed and
scaled replica of the input. That is, the channel output is w(t) + ρw(t − T ) for
some T and some ρ ∈ [−1, 1]. At the receiver, the channel output is filtered
by ψ(−t). The resulting waveform ỹ(t) is again sampled at multiples of T .
Determine the samples ỹ(mT ), for 1 ≤ m ≤ K.
(c) Suppose that the kth received sample is Yk = dk + αdk−1 + Zk , where Zk ∼
N (0, σ 2 ) and 0 ≤ α < 1 is a constant. Note that dk and dk−1 are realizations
of independent random variables that take on the values 1 and −1 with equal
probability. Suppose that the receiver decides dˆk = 1 if Yk > 0, and decides
dˆk = −1 otherwise. Find the probability of error for this receiver.
exercise 5.13 (Communication link design) Specify the block diagram for a
digital communication system that uses twisted copper wires to connect devices that
are 5 km apart from each other. The cable has an attenuation of 16 dB/km. You
are allowed to use the spectrum between −5 and 5 MHz. The noise at the receiver
input is white and Gaussian, with power spectral density N0 /2 = 4.2×10−21 W/Hz.
The required bit rate is 40 Mbps (megabits per second) and the bit-error probability
should be less than 10−5 . Be sure to specify the symbol alphabet and the waveform
former of the system you propose. Give precise values or bounds for the bandwidth
used, the power of the channel input signal, the bit rate, and the error probability.
Indicate which bandwidth definition you use.
where ψ(t) is normalized and orthogonal to its T -spaced time-translates. The signal
is sent over the AWGN channel of power spectral density N0 /2 and at the receiver
is passed through the matched filter of impulse response ψ ∗ (−t). Let Yi be the filter
output at time iT .
(a) Determine RX[k], k ∈ Z, assuming an infinite sequence {Xi}_{i=−∞}^{∞}.
(b) Describe a method to estimate Di from Yi and Yi−1 , such that the performance
is the same if the polarity of Yi is inverted for all i. We ask for a simple
decoder, not necessarily ML.
(c) Determine (or estimate) the error probability of your decoder.
exercise 5.15 (Mixed questions)
(a) Consider the signal x(t) = cos(2πt) (sin(πt)/(πt))². Assume that we sample x(t)
with sampling period T. What is the maximum T that guarantees signal
recovery?
(b) You are given a pulse p(t) with spectrum pF(f) = T(1 − |f|T), |f| ≤ 1/T.
What is the value of ∫ p(t)p(t − 3T)dt?
exercise 5.16 (Properties of the Fourier transform) Prove the following properties
of the Fourier transform. The sign ⇐⇒ relates Fourier transform pairs,
with the function on the right being the Fourier transform of that on the left. The
Fourier transforms of v(t) and w(t) are denoted by vF(f) and wF(f), respectively.
(a) Linearity: αv(t) + βw(t) ⇐⇒ αvF(f) + βwF(f).
(b) Time-shifting: v(t − t0) ⇐⇒ vF(f) e^{−j2πf t0}.
(c) Frequency-shifting: v(t) e^{j2πf0 t} ⇐⇒ vF(f − f0).
(d) Convolution in time: (v ⋆ w)(t) ⇐⇒ vF(f) wF(f).
(e) Time scaling by α ≠ 0: v(αt) ⇐⇒ (1/|α|) vF(f/α).
(f) Conjugation: v*(t) ⇐⇒ v*F(−f).
(g) Time-frequency duality: vF(t) ⇐⇒ v(−f).
Hint: Use Parseval’s relationship on the expression on the right and interpret
the result.
6 Convolutional coding
and Viterbi decoding:
First layer revisited
6.1 Introduction
In this chapter we shift focus to the encoder/decoder pair. The general setup is
that of Figure 6.1, where N (t) is white Gaussian noise of power spectral density
N0 /2. The details of the waveform former and the n-tuple former are immaterial
for this chapter. The important fact is that the channel model from the encoder
output to the decoder input is the discrete-time AWGN channel of noise variance
σ 2 = N0 /2.
The study of encoding/decoding methods has been an active research area
since the second half of the twentieth century. It is called coding theory. There
are many coding techniques, and a general introduction to coding can easily
occupy a one-semester graduate-level course. Here we will just consider an example
of a technique called convolutional coding. By considering a specific example,
we can considerably simplify the notation. As seen in the exercises, applying
the techniques learned in this chapter to other convolutional encoders is fairly
straightforward. We choose convolutional coding for two reasons: (i) it is well
suited to the discrete-time AWGN channel; (ii) it allows us to
introduce various instructive and useful tools, notably the Viterbi algorithm to
do maximum likelihood decoding and generating functions to upper bound the
bit-error probability.
Figure 6.1. (Block diagram: (b1, . . . , bk) → Encoder → √E (x1, . . . , xn) → Waveform Former → waveform channel with additive noise N(t) → R(t) → Baseband Front End → (y1, . . . , yn) → Decoder → (b̂1, . . . , b̂k).)
The source symbols enter the encoder sequentially, at regular intervals deter-
mined by the encoder clock. During the jth epoch, j = 1, 2, . . . , the encoder takes
bj and produces two output symbols, x2j−1 and x2j , according to the encoding
map 1
x2j−1 = bj bj−2
x2j = bj bj−1 bj−2 .
To produce x1 and x2 the encoder needs b−1 and b0 , which are assumed to be 1
by default.
The circuit that implements the convolutional encoder is depicted in Figure 6.2,
where “×” denotes multiplication in R. A shift register stores the past two inputs.
As implied by the indices of x2j−1 , x2j , the two encoder outputs produced during
an epoch are transmitted sequentially.
Notice that the encoder output has length n = 2k. The following is an example of
a source sequence of length k = 5 and the corresponding encoder output sequence
of length n = 10.
j                 1      2       3      4      5
bj                1     −1      −1      1      1
x2j−1, x2j       1, 1   −1, −1  −1, 1   −1, 1  −1, −1
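A direct transcription of the encoding map (a sketch; symbols take values in {±1} and b−1 = b0 = 1 as stated above) reproduces this table:

def encode(b):
    """Map source symbols b (values +-1) to the encoder output, two symbols per input."""
    b = [1, 1] + list(b)                     # b_{-1} = b_0 = 1 by default
    x = []
    for j in range(2, len(b)):
        x.append(b[j] * b[j - 2])            # x_{2j-1} = b_j b_{j-2}
        x.append(b[j] * b[j - 1] * b[j - 2]) # x_{2j}   = b_j b_{j-1} b_{j-2}
    return x

print(encode([1, -1, -1, 1, 1]))
# -> [1, 1, -1, -1, -1, 1, -1, 1, -1, -1], i.e. the output sequence of the table.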
1
We are choosing this particular encoding map because it is the simplest one that is
actually used in practice.
Figure 6.2. (The encoder circuit: a shift register stores b_{j−1} and b_{j−2}; the two outputs are x_{2j−1} = b_j b_{j−2} and x_{2j} = b_j b_{j−1} b_{j−2}.)
(The encoder state diagram: edges are labeled "input | output pair"; the state labels are t = (−1, −1), l = (−1, 1), r = (1, −1), b = (1, 1).)
The choice of letting the encoder input and output symbols be the elements
of {±1} is not standard. Most authors choose the input/output alphabet to be
{0, 1} and use addition modulo 2 instead of multiplication over R. In this case, a
memoryless mapping at the encoder output transforms the symbol alphabet from
{0, 1} to {±√Es}. The notation is different but the end result is the same. The
choice we have made is better suited for the AWGN channel. The drawback of
this choice is that it is less evident that the encoder is linear. In Exercise 6.12 we
establish the link between the two viewpoints and in Exercise 6.5 we prove from
first principles that the encoder is indeed linear.
In each epoch, the convolutional encoder we have chosen has k0 = 1 symbol
entering and n0 = 2 symbols exiting the encoder. In general, a convolutional
encoder is specified by (i) the number k0 of source symbols entering the encoder in
each epoch; (ii) the number n0 of symbols produced by the encoder in each epoch,
where n0 > k0 ; (iii) the constraint length m0 defined as the number of input
k0 -tuples used to determine an output n0 -tuple; and (iv) the encoding function,
specified for instance by a k0 × m0 matrix of 1s and 0s for each component of the
output n0 -tuple. The matrix associated to an output component specifies which
inputs are multiplied to obtain that output. In our example, k0 = 1, n0 = 2,
m0 = 3, and the encoding function is specified by [1, 0, 1] (for the first component
of the output) and [1, 1, 1] (for the second component). (See the connections that
determine the top and bottom output in Figure 6.2.) In our case, the elements of
the output n0 tuple are serialized into a single sequence that we consider to be
the actual encoder output, but there are other possibilities. For instance, we could
take the pair x2j−1 , x2j and map it into an element of a 4-PAM constellation.
To find an x that maximizes ⟨x, y⟩, we could in principle compute ⟨x, y⟩ for all 2^k
sequences that can be produced by the encoder. This brute-force approach would
be quite impractical. As already mentioned, if k = 100 (which is a relatively modest
value for k), 2^k = (2^10)^10, which is approximately (10^3)^10 = 10^30. A VLSI chip that
makes 10^9 inner products per second takes 10^21 seconds to check all possibilities.
This is roughly 4 × 10^13 years. The universe is "only" roughly 2 × 10^10 years old!
We wish for a method that finds a maximizing x with a number of operations
that grows linearly (as opposed to exponentially) in k. We will see that the so-called
Viterbi algorithm achieves this.
To describe the Viterbi algorithm (VA), we introduce a fourth way of describing
a convolutional encoder, namely the trellis. The trellis is an unfolded transition
diagram that keeps track of the passage of time. For our example, if we assume
that we start at state (1, 1), that the source sequence is b1 , b2 , . . . , b5 , and that
we complete the transmission by feeding the encoder with two “dummy bits”
b6 = b7 = 1 that make the encoder stop in the initial state, we obtain the trellis
description shown on the top of Figure 6.4, where an edge (transition) from a state
at depth j to a state at depth j+1 is labeled with the corresponding encoder output
x2j−1 , x2j . The encoder input that corresponds to an edge is the first component
of the next state.
There is a one-to-one correspondence between an encoder input sequence b, an
encoder output sequence x, and a path (or state sequence) that starts at the initial
state (1, 1) (left state) and ends at the final state (1, 1) (right state) of the trellis.
Hence we can refer to a path by means of an input sequence, an output sequence
or a sequence of states.
To decode using the Viterbi algorithm, we replace the label of each edge with
the edge metric (also called branch metric) computed as follows. The edge with
x2j−1 = a and x2j = b, where a, b ∈ {±1}, is assigned the edge metric ay2j−1 +by2j .
Now if we add up all the edge metrics along a path, we obtain the path metric
⟨x, y⟩.
example 6.1 Consider the trellis on the top of Figure 6.4 and let the decoder
input sequence be y = (1, 3), (−2, 1), (4, −1), (5, 5), (−3, −3), (1, −6), (2, −4). For
convenience, we chose the components of y to be integers, but in reality they are
real-valued. Also for convenience, we use parentheses to group the components of y
into pairs (y2j−1 , y2j ) that belong to the same trellis section. The edge metrics are
shown on the second trellis (from the top) of Figure 6.4. Once again, by adding the
edge metric along a path, we obtain the path metric ⟨x, y⟩, where x is the encoder
output associated to the path.
example 6.2 Our starting point is the second trellis of Figure 6.4, which has
been labeled with the edge metrics. We construct the third trellis in which every
state is labeled with the metric of the surviving path to that state obtained as
follows. We use j = 0, 1, . . . , k + 2 to run over the trellis depth. Depth j = 0 refers
Figure 6.4. The Viterbi algorithm. Top figure: Trellis representing the encoder
where edges are labeled with the corresponding output symbols. Second figure:
Edges are re-labeled with the edge metric corresponding to the received
sequence (1, 3), (−2, 1), (4, −1), (5, 5), (−3, −3), (1, −6), (2, −4) (parentheses have
been inserted to facilitate parsing). Third figure: Each state has been labeled
with the metric of a survivor to that state and non-surviving edges are pruned
(dashed). Fourth figure: Tracing back from the end, we find the decoded path
(bold); it corresponds to the source sequence 1, 1, 1, 1, −1, 1, 1.
to the initial state (leftmost) and depth j = k + 2 to the final state (rightmost)
after sending the k bits and the 2 “dummy bits”. Let j = 0 and to the single state
at depth j assign the metric 0. Let j = 1 and label each of the two states at depth
j with the metric of the only subpath to that state. (See the third trellis from the
top.) Let j = 2 and label the four states at depth j with the metric of the only
subpath to that state. For instance, the label to the state (−1, −1) at depth j = 2 is
obtained by adding the metric of the single state and the single edge that precedes
it, namely −1 = −4 + 3. From j = 3 on the situation is more interesting, because
now every state can be reached from two previous states. We label the state under
consideration with the largest of the two subpath metrics to that state and make
sure to remember to which of the two subpaths it corresponds. In the figure, we
make this distinction by dashing the last edge of the other path. (If we were doing
this by hand we would not need a third trellis. Rather we would label the states on
the second trellis and cross out the edges that are dashed in the third trellis.) The
subpath with the highest edge metric (the one that has not been dashed) is called
survivor. We continue similarly for j = 4, 5, . . . , k + 2. At depth j = k + 2 there is
only one state and its label maximizes ⟨x, y⟩ over all paths. By tracing back along
the non-dashed path, we find the maximum likelihood path. From it, we can read
out the corresponding bit sequence. The maximum likelihood path is shown in bold
on the fourth and last trellis of Figure 6.4.
From the above example, it is clear that, starting from the left and working
its way to the right, the Viterbi algorithm visits all states and keeps track of the
subpath that has the largest metric to that state. In particular, the algorithm finds
the path that has the largest metric between the initial state and the final state.
The complexity of the Viterbi algorithm is linear in the number of trellis sections,
i.e. in k. Recall that the brute-force approach has complexity exponential in k. The
saving of the Viterbi algorithm comes from not having to compute the metric of
non-survivors. When we dash an edge at depth j, we are in fact eliminating 2k−j
possible extensions of that edge. The brute-force approach computes the metric of
all those extensions but not the Viterbi algorithm.
A formal definition of the VA (one that can be programmed on a computer) and
a more formal argument that it finds the path that maximizes ⟨x, y⟩ is given in
the Appendix (Section 6.6).
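A compact sketch of the algorithm for this particular encoder (states are (b_{j−1}, b_{j−2}); the two dummy bits are forced to 1 so that every path ends in state (1, 1)); run on the received sequence of Example 6.1, it returns the path found in Figure 6.4.

def viterbi(y, k):
    """ML decoding for the encoder x_{2j-1} = b_j b_{j-2}, x_{2j} = b_j b_{j-1} b_{j-2}.
    y is a list of (y_{2j-1}, y_{2j}) pairs; k is the number of information bits."""
    metric = {(1, 1): 0.0}                      # survivor metric per state, start in (1, 1)
    back = []                                   # per section: next state -> (previous state, input)
    for j, (y1, y2) in enumerate(y):
        inputs = [1, -1] if j < k else [1]      # the last two inputs are the dummy bits
        new_metric, choice = {}, {}
        for (p, q), m in metric.items():
            for a in inputs:
                x1, x2 = a * q, a * p * q       # encoder output on this edge
                cand = m + x1 * y1 + x2 * y2    # accumulated path metric <x, y>
                s = (a, p)                      # next state
                if s not in new_metric or cand > new_metric[s]:
                    new_metric[s], choice[s] = cand, ((p, q), a)
        metric = new_metric
        back.append(choice)
    state, bits = (1, 1), []                    # trace back from the final state (1, 1)
    for choice in reversed(back):
        state, a = choice[state]
        bits.append(a)
    return bits[::-1], metric[(1, 1)]

y = [(1, 3), (-2, 1), (4, -1), (5, 5), (-3, -3), (1, -6), (2, -4)]
print(viterbi(y, k=5))   # -> ([1, 1, 1, 1, -1, 1, 1], 31.0), the bold path of Figure 6.4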
reference path
a sequence of k ones with initial encoder state (1, 1). The encoder output is a
sequence of 1s of length n = 2k.
The task of the decoder is to find (one of) the paths in the trellis that has the
largest ⟨x, y⟩, where x is the encoder output that corresponds to that path. The
encoder input b that corresponds to this x is the maximum likelihood message
chosen by the decoder.
The concept of a detour plays a key role in upper-bounding the bit-error
probability. We start with an analogy. Think of the trellis as a road map,
of the path followed by the encoder as the itinerary you have planned for a
journey, and of the decoded path as the actual route you follow on your journey.
Typically the itinerary and the actual route differ due to constructions that
force you to take detours. Similarly, the detours taken by the Viterbi decoder
are those segments of the decoded path that share with the reference path
only their initial and final state. Figure 6.5 illustrates a reference path and two
detours.
Errors are produced when the decoder follows a detour. To the trellis path
selected by the decoder, we associate a sequence ω0 , ω1 , . . . , ωk−1 defined as follows.
If there is a detour that starts at depth j, j = 0, 1, . . . , k − 1, we let ωj be the
number of bit errors produced by that detour. It is determined by comparing the
corresponding segments of the two encoder input sequences and by letting ωj be
the number of positions in which they differ.
If depth j does not correspond to the start of a detour, then ωj = 0. Now Σ_{j=0}^{k−1} ωj
is the number of bits that are incorrectly decoded and (1/(k k0)) Σ_{j=0}^{k−1} ωj is the
fraction of such bits (k0 = 1 in our running example; see Section 6.2).
example 6.3 Consider the example of Figure 6.4 where k = 5 bits are transmit-
ted (followed by the two dummy bits 1, 1). The reference path is the all-one path
and the decoded path is the one marked by the solid line on the bottom trellis. There
is one detour, which starts at depth 4 in the trellis. Hence ωj = 0 for j = 0, 1, 2, 3
whereas ω4 ≠ 0. To determine the value of ω4, we need to compare the encoder
input bits over the span of the detour. The input bits that correspond to the detour
are −1, 1, 1 and those that correspond to the reference path are 1, 1, 1. There is
one disagreement,
hence ω4 = 1. The fraction of bits that are decoded incorrectly
is (1/k) Σ_{j=0}^{k−1} ωj = 1/5.
Pb = E[ (1/(k k0)) Σ_{j=0}^{k−1} Ωj ] = (1/(k k0)) Σ_{j=0}^{k−1} E[Ωj].
To upper bound the above expression, we need to learn how many detours of a
certain kind there are. We do so in the next section.
In this subsection, we consider the infinite trellis obtained by extending the finite
trellis in both directions. Each path of the infinite trellis corresponds to an infinite
encoder input sequence b = . . . b−1 , b0 , b1 , b2 , . . . and an infinite encoder output
sequence x = . . . x−1 , x0 , x1 , x2 , . . .. These are sequences that belong to {±1}∞ .
Given any two paths in the trellis, we can take one as the reference and consider
the other as consisting of a number of detours with respect to the reference. To
each of the two paths there corresponds an encoder input and an encoder output
sequence. For every detour we can compare the two segments of encoder output
sequences and count the number of positions in which they differ. We denote this
number by d and call it the output distance (over the span of the detour). Similarly,
we can compare the segments of encoder input sequences and call input distance
(over the span of the detour) the number i of positions in which they differ.
example 6.4 Consider again the example of Figure 6.4 and let us choose the
all-one path as the reference. Consider the detour that starts at depth j = 0 and
ends at j = 3. From the top trellis, comparing labels, we see that d = 5. (There are
two disagreements in the first section of the trellis, one in the second, and two in
the third.) To determine the input distance i we need to label the transitions with
the corresponding encoder input. If we do so and compare we see that i = 1. As
another example, consider the detour that starts at depth j = 0 and ends at j = 4.
For this detour, d = 6 and i = 2.
We seek the answer to the following question: For any given reference path and
depth j ∈ {0, 1, . . . }, what is the number a(i, d) of detours that start at depth
j and have input distance i and output distance d, with respect to the reference
path? This number depends neither on j nor on the reference path. It does not
depend on j because the encoder is a time-invariant machine, i.e. all the sections
of the infinite trellis are identical. (This is the reason why we are considering the
infinite trellis in this section.) We will see that it does not depend on the reference
path either, because the encoder is linear in a sense that we will discuss.
example 6.5 Using the top trellis of Figure 6.4 with the all-one path as the
reference and j = 0, we can verify by inspection that there are two detours that
have output distance d = 6. One ends at j = 4 and the other ends at j = 5. The
input distance is i = 2 in both cases. Because there are two detours with parameters
d = 6 and i = 2, a(2, 6) = 2.
Figure 6.6. (The detour flow graph: start state s, intermediate states l, t, r, and end state e; the edges are labeled with terms of the form I^i D^d, namely ID², ID, I, D, and D².)
example 6.6 In Figure 6.6, the shortest path that connects s to e has length 3.
It consists of the edges labeled ID2 , D, and D2 , respectively. The product of these
labels is the path label ID5 . This path tells us that there is a detour with i = 1 (the
exponent of I) and d = 5 (the exponent of D). There is no other path with path
label ID5 . Hence, as we knew already, a(1, 5) = 1.
Our next goal is to determine the generating function T (I, D) of a(i, d) defined as
T(I, D) = Σ_{i,d} I^i D^d a(i, d).
The letters I and D in the above expression should be seen as “place holders” with-
out any physical meaning. It is like describing a set of coefficients a0 , a1 , . . . , an−1
by means of the polynomial p(x) = a0 + a1x + ··· + a_{n−1}x^{n−1}. To determine
T (I, D), we introduce auxiliary generating functions, one for each intermediate
state of the detour flow graph, namely
Tl(I, D) = Σ_{i,d} I^i D^d al(i, d),
Tt(I, D) = Σ_{i,d} I^i D^d at(i, d),
Tr(I, D) = Σ_{i,d} I^i D^d ar(i, d),
Te(I, D) = Σ_{i,d} I^i D^d ae(i, d),
where in the first line we define al (i, d) as the number of paths in the detour flow
graph that start at state s, end at state l, and have path label I i Dd . Similarly, for
x = t, r, e, ax (i, d) is the number of paths in the detour flow graph that start at
state s, end at state x, and have path label I i Dd . Notice that Te (I, D) is indeed
the T (I, D) of interest to us.
From the detour flow graph, we see that the various generating functions are
related as follows, where to simplify notation we drop the two arguments (I and
D) of the generating functions:
Tl = ID² + Tr I
Tt = Tl ID + Tt ID
Tr = Tl D + Tt D
Te = Tr D².
To write down the above equations, the reader might find it useful to apply
the following rule. The Tx of a state x is the sum of a product: the sum is over
all states y that have an edge into x and the product is Ty times the label on the
edge from y to x. The reader can verify that this rule applies to all of the above
equations except the first. When used in an attempt to find the first equation, it
yields Tl = Ts ID2 + Tr I, but Ts is not defined because there is no detour starting
at s and ending at s. If we define Ts = 1 by convention, the rule applies without
exception.
The above system can be solved for Te (hence for T ) by pure formal manipula-
tions, like solving a system of equations. The result is
T(I, D) = ID⁵ / (1 − 2ID).
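This formal manipulation can be checked symbolically; a small sketch using sympy:

import sympy as sp

I, D, Tl, Tt, Tr, Te = sp.symbols('I D T_l T_t T_r T_e')
eqs = [sp.Eq(Tl, I * D**2 + Tr * I),      # the four flow-graph equations above
       sp.Eq(Tt, Tl * I * D + Tt * I * D),
       sp.Eq(Tr, Tl * D + Tt * D),
       sp.Eq(Te, Tr * D**2)]
sol = sp.solve(eqs, [Tl, Tt, Tr, Te], dict=True)[0]
print(sp.simplify(sol[Te] - I * D**5 / (1 - 2 * I * D)))   # 0: T(I, D) = ID^5/(1 - 2ID)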
As we will see shortly, the generating function T (I, D) of a(i, d) is more useful than
a(i, d) itself. However, to show that we can indeed obtain a(i, d) from T (I, D) we
use the expansion2 1−x 1
= 1 + x + x2 + x3 + · · · to write
2
We do not need to worry about convergence issues at this stage, because for now, xi is
just a “place holder”. In other words, we are not adding up the powers of x for some
number x.
$$T(I,D) = \frac{ID^5}{1-2ID} = ID^5\left(1 + 2ID + (2ID)^2 + (2ID)^3 + \cdots\right)$$
$$= ID^5 + 2I^2D^6 + 2^2I^3D^7 + 2^3I^4D^8 + \cdots$$
This means that there is one path with parameters i = 1, d = 5, that there are
two paths with i = 2, d = 6, etc. The general expression for i = 1, 2, . . . is
$$a(i,d) = \begin{cases} 2^{i-1}, & d = i + 4\\ 0, & \text{otherwise.}\end{cases}$$
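For readers who wish to experiment, the following sketch (Python with the sympy package; the variable names are arbitrary, only the equations come from the text) solves the generating-function equations of the detour flow graph for $T_e$ and expands the result to read off a(i, d):

```python
import sympy as sp

I, D = sp.symbols('I D')
Tl, Tt, Tr, Te = sp.symbols('T_l T_t T_r T_e')

# Generating-function equations read off the detour flow graph (T_s = 1 by convention).
eqs = [sp.Eq(Tl, I*D**2 + Tr*I),
       sp.Eq(Tt, Tl*I*D + Tt*I*D),
       sp.Eq(Tr, Tl*D + Tt*D),
       sp.Eq(Te, Tr*D**2)]
T = sp.simplify(sp.solve(eqs, [Tl, Tt, Tr, Te], dict=True)[0][Te])
print(T)  # I*D**5/(1 - 2*I*D), i.e. T(I, D) of the text

# Recover a(i, d) from the power series T = I*D**5 * sum_k (2*I*D)**k.
print(sp.expand(sum(I*D**5 * (2*I*D)**k for k in range(5))))
# I*D**5 + 2*I**2*D**6 + 4*I**3*D**7 + 8*I**4*D**8 + 16*I**5*D**9
```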
$$w(t) = \sqrt{E_s}\,\sum_{j=1}^{n} x_j\, \psi_j(t),$$
$$P_b = \frac{1}{k k_0}\sum_{j=0}^{k-1} E[\Omega_j], \qquad (6.1)$$
$$E[\Omega_j] = \sum_h i(h)\,\pi(h),$$
where the sum is over all detours h that start at depth j with respect to the
reference path, π(h) stands for the probability that detour h is taken, and i(h) for
the input distance between detour h and the reference path.
Next we upper bound π(h). If a detour starts at depth j and ends at depth l = j + m, then the corresponding encoder-output symbols form a 2m-tuple $\bar u \in \{\pm 1\}^{2m}$. Let $u = x_{2j+1}, \ldots, x_{2l} \in \{\pm 1\}^{2m}$ and $\rho = y_{2j+1}, \ldots, y_{2l}$ be the corresponding sub-sequences of the reference path and of the channel output, respectively; see Figure 6.7.
A necessary (but not sufficient) condition for the Viterbi algorithm to take a detour is that the subpath metric along the detour is at least as large as the corresponding subpath metric along the reference path. An equivalent condition is that ρ is at least as close to $\sqrt{E_s}\,\bar u$ as it is to $\sqrt{E_s}\,u$. Observe that ρ has the statistic of $\sqrt{E_s}\,u + Z$ where $Z \sim \mathcal{N}\left(0, \frac{N_0}{2} I_{2m}\right)$ and 2m is the common length of u, ū, and ρ.
The probability that ρ is at least as close to $\sqrt{E_s}\,\bar u$ as it is to $\sqrt{E_s}\,u$ is $Q\left(\frac{d_E}{2\sigma}\right)$, where $d_E = 2\sqrt{E_s d}$ is the Euclidean distance between $\sqrt{E_s}\,u$ and $\sqrt{E_s}\,\bar u$. Using $d_E(h)$ to denote the Euclidean distance of detour h to the reference path, we obtain
$$\pi(h) \le Q\left(\frac{d_E(h)}{2\sigma}\right) = Q\left(\sqrt{\frac{E_s\, d(h)}{\sigma^2}}\right),$$
where d(h) is the output distance of detour h with respect to the reference path.
Figure 6.7. Detour and reference path, labeled with the corresponding
output subsequences.
$$E[\Omega_j] = \sum_h i(h)\,\pi(h)$$
$$\le \sum_h i(h)\, Q\left(\sqrt{\frac{E_s\, d(h)}{\sigma^2}}\right)$$
$$\stackrel{(a)}{=} \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, Q\left(\sqrt{\frac{E_s d}{\sigma^2}}\right)\tilde{a}(i,d)$$
$$\stackrel{(b)}{\le} \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, Q\left(\sqrt{\frac{E_s d}{\sigma^2}}\right) a(i,d)$$
$$\stackrel{(c)}{\le} \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, z^d\, a(i,d).$$
To obtain equality (a) we group the terms of the sum that have the same i and
d and introduce ã(i, d) to denote the number of such terms in the finite trellis.
Note that ã(i, d) is the finite-trellis equivalent to a(i, d) introduced in Section
6.4.1. As the infinite trellis contains all the detours of the finite trellis and more,
ã(i, d) ≤ a(i, d). This justifies (b). In (c) we use
$$Q\left(\sqrt{\frac{E_s d}{\sigma^2}}\right) \le e^{-\frac{E_s d}{2\sigma^2}} = z^d, \qquad \text{for } z = e^{-\frac{E_s}{2\sigma^2}}.$$
For the final step towards the upper bound to Pb , we use the relationship
$$\sum_{i=1}^{\infty} i f(i) = \left.\frac{\partial}{\partial I}\sum_{i=1}^{\infty} I^i f(i)\right|_{I=1},$$
which holds for any function f and can be verified by taking the derivative of $\sum_{i=1}^{\infty} I^i f(i)$ with respect to I and then setting I = 1. Hence
$$E[\Omega_j] \le \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, z^d\, a(i,d) \qquad (6.3)$$
$$= \left.\frac{\partial}{\partial I}\sum_{i=1}^{\infty}\sum_{d=1}^{\infty} I^i D^d\, a(i,d)\right|_{I=1,\,D=z}$$
$$= \left.\frac{\partial}{\partial I} T(I,D)\right|_{I=1,\,D=z}.$$
Plugging into (6.1) and using the fact that the above bound does not depend on
j yields
$$P_b = \frac{1}{k k_0}\sum_{j=0}^{k-1} E[\Omega_j] \le \left.\frac{1}{k_0}\frac{\partial}{\partial I} T(I,D)\right|_{I=1,\,D=z}. \qquad (6.4)$$
In our specific example we have $k_0 = 1$ and $T(I,D) = \frac{ID^5}{1-2ID}$, hence $\frac{\partial T}{\partial I} = \frac{D^5}{(1-2ID)^2}$. Thus
$$P_b \le \frac{z^5}{(1-2z)^2}. \qquad (6.5)$$
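As a quick numerical illustration of (6.5) (a sketch only; the dB range is an arbitrary choice, matching the range used in Exercise 6.11):

```python
import math

def pb_bound(es_over_sigma2_db):
    """Evaluate the upper bound (6.5) with z = exp(-Es/(2*sigma^2))."""
    snr = 10.0 ** (es_over_sigma2_db / 10.0)   # Es/sigma^2, linear scale
    z = math.exp(-snr / 2.0)
    assert z < 0.5, "bound (6.5) requires 0 <= z < 1/2"
    return z**5 / (1.0 - 2.0*z)**2

for db in range(2, 7):
    print(f"{db} dB: Pb <= {pb_bound(db):.3e}")
```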
The bit-error probability depends on the encoder and on the channel. Bound
(6.4) nicely separates the two contributions. The encoder is accounted for by
$T(I,D)/k_0$ and the channel by z. More precisely, $z^d$ is an upper bound to the probability that a maximum likelihood receiver makes a decoding error when the choice is between two encoder output sequences that have Hamming distance d. As
shown in Exercise 2.32(b) of Chapter 2, we can use the Bhattacharyya bound to
determine z for any binary-input discrete memoryless channel. For such a channel,
$$z = \sum_y \sqrt{P(y|a)P(y|b)}, \qquad (6.6)$$
where a and b are the two letters of the input alphabet and y runs over all the
elements of the output alphabet. Hence, the technique used in this chapter is
applicable to any binary input discrete memoryless channel.
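A minimal sketch (Python) of (6.6), applied to two standard binary-input channels; the crossover and erasure probabilities below are arbitrary illustrative values:

```python
import math

def bhattacharyya(p_a, p_b):
    """z of (6.6): p_a[y] = P(y|a), p_b[y] = P(y|b) for a binary-input DMC."""
    return sum(math.sqrt(p_a[y] * p_b[y]) for y in p_a)

# Binary symmetric channel with crossover probability 0.05: z = 2*sqrt(p*(1-p))
print(bhattacharyya({'0': 0.95, '1': 0.05}, {'0': 0.05, '1': 0.95}))   # ~0.436
# Binary erasure channel with erasure probability 0.3: z = 0.3
print(bhattacharyya({'0': 0.7, '?': 0.3, '1': 0.0},
                    {'0': 0.0, '?': 0.3, '1': 0.7}))
```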
It should be mentioned that the upper bound (6.5) is valid under the condition that there is no convergence issue associated to the various sums following (6.3). This is the case when $0 \le z \le \frac{1}{2}$, which is the case when the numerator and the denominator of (6.5) are non-negative. The z from (6.6) fulfills $0 \le z \le 1$. However, if we use the tighter Bhattacharyya bound discussed in Exercise 2.29 of Chapter 2 (which is tighter by a factor $\frac{1}{2}$), then it is guaranteed that $0 \le z \le \frac{1}{2}$.
6.5 Summary
To assess the impact of the convolutional encoder, let us compare two situations.
In both cases, the transmitted signal looks identical to an observer, namely it has
the form
$$w(t) = \sum_{i=1}^{2l} s_i\, \psi(t - iT)$$
for some positive integer l and some unit-norm pulse ψ(t) that is orthogonal to its T-spaced translates. In both cases, the symbols take values in $\{\pm\sqrt{E_s}\}$ for some fixed energy-per-symbol $E_s$, but the way the symbols are obtained differs in the two cases. In one case, the symbols are obtained from the output of the convolutional encoder studied in this chapter. We call this the coded case. In the other case, the symbols are simply the source bits, which take value in {±1}, scaled by $\sqrt{E_s}$. We
call this the uncoded case.
For the coded case, the number of symbols is twice the number of bits. Hence,
letting Rb , Rs , and Eb be the bit rate, the symbol rate, and the energy per bit,
respectively, we obtain
$$R_b = \frac{R_s}{2} = \frac{1}{2T}\ \text{[bits/s]},$$
$$E_b = 2E_s,$$
$$P_b \le \frac{z^5}{(1-2z)^2},$$
where $z = e^{-\frac{E_s}{2\sigma^2}}$. As $\frac{E_s}{2\sigma^2}$ becomes large, the denominator of the above bound for $P_b$ approaches 1.
Figure 6.8. Bit-error probability $P_b$ versus $E_s/\sigma^2$ in dB: the uncoded curve $Q\left(\sqrt{E_s/\sigma^2}\right)$, the upper bound $\frac{z^5}{(1-2z)^2}$, the simulated convolutional encoder, and an LDPC code.
From a high-level point of view, non-trivial coding is about using only selected
sequences to form the codebook. In this chapter, we have fixed the channel-input alphabet to $\{\pm\sqrt{E_s}\}$. Then our only option to introduce non-trivial coding is to
increase the codeword length from n = k to n > k. For a fixed bit rate, increasing n
implies increasing the symbol rate. To increase the symbol rate we time-compress
the pulse ψ(t) by the appropriate factor and the bandwidth expands by the same
factor. If we fix the bandwidth, the symbol rate stays the same and the bit rate
has to decrease.
It would be wrong to conclude that non-trivial coding always requires reducing
the bit rate or increasing the bandwidth. Instead of keeping the channel-input
alphabet constant, for the coded system we could have used, say, 4-PAM. Then
each pair of binary symbols produced by the encoder can be mapped into a single
4-PAM symbol. In so doing, the bit rate, the symbol rate, and the bandwidth
remain unchanged.
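As an aside, one possible mapping from pairs of encoder output symbols to 4-PAM is sketched below (Python); the particular labeling and amplitude set are arbitrary illustrative choices, not prescribed by the text.

```python
# Map each pair of binary symbols from {-1, +1} to one 4-PAM amplitude.
PAM4 = {(-1, -1): -3, (-1, +1): -1, (+1, +1): +1, (+1, -1): +3}

def pairs_to_4pam(symbols):
    """symbols: even-length sequence over {-1, +1}, e.g. a convolutional encoder output."""
    assert len(symbols) % 2 == 0
    return [PAM4[(symbols[i], symbols[i + 1])] for i in range(0, len(symbols), 2)]

print(pairs_to_4pam([+1, +1, -1, +1, -1, -1]))   # [1, -1, -3]: half as many symbols as inputs
```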
The ultimate answer comes from information theory (see e.g. [19]). Information
theory tells us that, by means of coding, we can achieve an error probability as
small as desired, provided that we send fewer bits per symbol than the channel
capacity C, which for the discrete-time AWGN channel is $C = \frac{1}{2}\log_2\left(1 + \frac{E_s}{\sigma^2}\right)$ bits/symbol. According to this expression, to send 1/2 bits per symbol as we do in
our example, we need Es /σ 2 = 1, which means 0 dB. We see that the performance
of the LDPC code is quite good. Even with the channel-input alphabet restricted to $\{\pm\sqrt{E_s}\}$ (no such restriction is imposed in the derivation of C), the LDPC code
achieves the kind of error probability that we typically want in applications at an
Es /σ 2 which is within 1 dB from the ultimate limit of 0 dB required for reliable
communication.
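A two-line check of the quoted numbers (Python; a sketch, with arbitrary naming):

```python
import math

# Capacity of the discrete-time AWGN channel in bits per symbol.
capacity = lambda es_over_sigma2: 0.5 * math.log2(1.0 + es_over_sigma2)

print(capacity(1.0))            # 0.5 bit/symbol at Es/sigma^2 = 1
print(10 * math.log10(1.0))     # Es/sigma^2 = 1 corresponds to 0 dB
```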
Convolutional codes were invented by Elias in 1955 and have been used in many
communication systems, including satellite communication, and mobile communi-
cation. In 1993, Berrou, Glavieux, and Thitimajshima captured the attention of
the communication engineering community by introducing a new class of codes,
called turbo codes, that achieved a performance breakthrough by concatenating
two convolutional codes separated by an interleaver. Their performance is not far
from that of the low-density parity-check codes (LDPC) – today’s state-of-the-art
in coding.
Thanks to its tremendous success, coding is in every modern communication
system. In this chapter we have only scratched the surface. Recommended books
on coding are [22] for a classical textbook that covers a broad spectrum of coding
techniques and [23] for the reference book on LDPC coding.
where $x_{2j-1}, x_{2j}$ is the encoder output of the corresponding edge. If there is no such edge, we let $\mu_{j-1,j}(\alpha, \beta) = -\infty$.
Since $\mu_{j-1,j}(\alpha, \beta)$ is the jth term of $\langle x, y\rangle$ for any path that goes through state α at depth j − 1 and state β at depth j, $\langle x, y\rangle$ is obtained by adding the edge metrics along the path specified by x.
The path metric is the sum of the edge metrics taken along the edges of a path.
A longest path from state (1, 1) at depth j = 0, denoted (1, 1)0 , to a state α at
depth j, denoted αj , is one of the paths that has the largest path metric. The
Viterbi algorithm works by constructing, for each j, a list of the longest paths
to the states at depth j. The following observation is key to understanding the
Viterbi algorithm. If path ∗ α_{j−1} ∗ β_j is a longest path to state β of depth j, where path ∈ Γ_{j−2} and ∗ denotes concatenation, then path ∗ α_{j−1} must be a longest path to state α of depth j − 1: if some alternatepath ∗ α_{j−1}, with alternatepath ∈ Γ_{j−2}, had a larger metric than path ∗ α_{j−1}, then alternatepath ∗ α_{j−1} ∗ β_j would have a larger metric than path ∗ α_{j−1} ∗ β_j, contradicting the assumption that the latter is a longest path. So the longest depth-j path to a state can be obtained by checking the one-edge extensions of the longest depth-(j − 1) paths.
The following notation is useful for the formal description of the Viterbi algo-
rithm. Let μj (α) be the metric of a longest path to state αj and let Bj (α) ∈
{±1}j be the encoder input sequence that corresponds to this path. We call
Bj (α) ∈ {±1}j the survivor because it is the only path through state αj that
will be extended. (Paths through αj that have a smaller metric have no chance of
extending into a maximum likelihood path.) For each state, the Viterbi algorithm
computes two things: a survivor and its metric. The formal algorithm follows,
where B(β, α) is the encoder input that corresponds to the transition from state
β to state α if there is such a transition and is undefined otherwise.
(1) Initially set $\mu_0(1, 1) = 0$, $\mu_0(\alpha) = -\infty$ for all $\alpha \ne (1, 1)$, $B_0(1, 1) = \emptyset$, and j = 1.
The reader should have no difficulty verifying (by induction on j) that μj (α) as
computed by Viterbi’s algorithm is indeed the metric of a longest path from (1, 1)0
to state α at depth j and that Bj (α) is the encoder input sequence associated to it.
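The description above translates directly into code. The following sketch (Python) implements the algorithm for a rate-1/2, four-state encoder over {±1}; the output map used here, $x_{2j-1} = b_j b_{j-2}$ and $x_{2j} = b_j b_{j-1} b_{j-2}$, is taken from the encoder description of Exercise 6.2 and is only an illustrative choice; substitute the map of the encoder at hand.

```python
from itertools import product

def outputs(b, state):
    """Encoder output for input b in state (b_{j-1}, b_{j-2}).
    Illustrative map: x_{2j-1} = b_j b_{j-2}, x_{2j} = b_j b_{j-1} b_{j-2}."""
    b1, b2 = state
    return (b * b2, b * b1 * b2)

def viterbi(y, n_steps, start=(1, 1), end=(1, 1)):
    """Maximum likelihood input sequence from the channel output y (two samples per step)."""
    states = list(product((1, -1), repeat=2))
    metric = {s: (0.0 if s == start else float('-inf')) for s in states}
    survivor = {s: [] for s in states}
    for j in range(n_steps):
        y1, y2 = y[2 * j], y[2 * j + 1]
        new_metric, new_survivor = {}, {}
        for s in states:                      # s = (b_j, b_{j-1})
            b, b1 = s
            best_m, best_prev = float('-inf'), None
            for b2 in (1, -1):                # previous state (b_{j-1}, b_{j-2})
                prev = (b1, b2)
                x1, x2 = outputs(b, prev)
                m = metric[prev] + x1 * y1 + x2 * y2     # add the edge metric
                if best_prev is None or m > best_m:
                    best_m, best_prev = m, prev
            new_metric[s] = best_m
            new_survivor[s] = survivor[best_prev] + [b]  # extend the survivor
        metric, survivor = new_metric, new_survivor
    return survivor[end]

def encode(bits, state=(1, 1)):
    out = []
    for b in bits:
        out.extend(outputs(b, state))
        state = (b, state[0])
    return out

bits = [1, -1, -1, 1, 1, 1]                       # the last two 1s return the encoder to (1, 1)
print(viterbi(encode(bits), n_steps=len(bits)))   # recovers `bits` in the noiseless case
```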
6.7 Exercises
Exercises for Section 6.2
where Ts and Es are fixed positive numbers, ψ(t) is some unit-energy function, T0
is a uniformly distributed random variable taking values in $[0, T_s)$, and $\{X_i\}_{i=-\infty}^{\infty}$
is the output of the convolutional encoder described by
X2n = Bn Bn−2
X2n+1 = Bn Bn−1 Bn−2
(a) Express the power spectral density of X(t) for a general ψ(t).
(b) Plot the power spectral density of X(t) assuming that ψ(t) is a unit-norm
rectangular pulse of width Ts .
exercise 6.3 (Viterbi algorithm) An output sequence x1 , . . . , x10 from the con-
volutional encoder of Figure 6.9 is transmitted over the discrete-time AWGN chan-
nel. The initial and final state of the encoder is (1, 1). Using the Viterbi algorithm,
find the maximum likelihood information sequence b̂1 , . . . , b̂4 , 1, 1, knowing that
b1 , . . . , b4 are drawn independently and uniformly from {±1} and that the channel
output y1 , . . . , y10 = 1, 2, −1, 4, −2, 1, 1, −3, −1, −2. (It is for convenience that we
are choosing integers rather than real numbers.)
Figure 6.9. Convolutional encoder: input $b_j \in \{\pm 1\}$, memory cells $b_{j-1}$, $b_{j-2}$, and multipliers producing the outputs $x_{2j-1}$ and $x_{2j}$.
where Bi is the ith information bit, h0 , . . . , hL are coefficients that describe the
inter-symbol interference, and Zi is zero-mean, Gaussian, of variance σ 2 , and
statistically independent of everything else. Relationship (6.7) can be described by
a trellis, and the ML decision rule can be implemented by the Viterbi algorithm.
(a) Draw the trellis that describes all sequences of the form X1 , . . . , X6 resulting
from information sequences of the form B1 , . . . , B5 , 0, Bi ∈ {0, 1}, assuming
$$h_i = \begin{cases} 1, & i = 0\\ -2, & i = 1\\ 0, & \text{otherwise.}\end{cases}$$
To determine the initial state, you may assume that the preceding informa-
tion sequence terminated with 0. Label the trellis edges with the input/output
symbols.
(b) Specify a metric $f(x_1, \ldots, x_6) = \sum_{i=1}^{6} f(x_i, y_i)$ whose minimization or maximization with respect to the valid $x_1, \ldots, x_6$ leads to a maximum likelihood
decision. Specify if your metric needs to be minimized or maximized.
(c) Assume y1 , . . . , y6 = 2, 0, −1, 1, 0, −1. Find the maximum likelihood estimate
of the information sequence B1 , . . . , B5 .
exercise 6.5 (Linearity) In this exercise, we establish in what sense the encoder
of Figure 6.2 is linear.
(a) For this part you might want to review the axioms of a field. Consider the set
F0 = {0, 1} with the following addition and multiplication tables.
+ 0 1 × 0 1
0 0 1 0 0 0
1 1 0 1 0 1
(The addition in F0 is the usual addition over R with result taken modulo 2.
The multiplication is the usual multiplication over R and there is no need to
take the modulo 2 operation because the result is automatically in F0 .) F0 ,
“+”, and “×” form a binary field denoted by F2 . Now consider F− = {±1}
and the following addition and multiplication tables.
+ 1 −1 × 1 −1
1 1 −1 1 1 1
−1 −1 1 −1 1 −1
(The addition in F− is the usual multiplication over R.) Argue that F− , “+”,
and “×” form a binary field as well. Hint: The second set of operations can be
obtained from the first set via the transformation T : F0 → F− that sends 0 to
1 and 1 to −1. Hence, by construction, for a, b ∈ F0 , T (a + b) = T (a) + T (b)
and T (a × b) = T (a) × T (b). Be aware of the double meaning of “+” and “×”
in the previous sentence.
(b) For this part you might want to review the notion of a vector space. Let
F0 , “+” and “×” be as defined in (a). Let V = F0∞ . This is the set of
infinite sequences taking values in F0 . Does V, F0 , “+” and “×” form a
vector space? (Addition of vectors and multiplication of a vector with a scalar
is done component-wise.) Repeat using F− .
(c) For this part you might want to review the notion of linear transformation.
Let f : V → V be the transformation that sends an infinite sequence b ∈ V to
an infinite sequence x ∈ V according to
x2j−1 = bj−1 + bj−2 + bj−3
x2j = bj + bj−2 ,
where the “+” is the one defined over the field of scalars implicit in V. Argue that this f is linear. Comment: When $V = F_-^{\infty}$, this encoder is the one used
throughout Chapter 6, with the only difference that in the chapter we multiply
over R rather than adding over F− , but this is just a matter of notation,
the result of the two operations on the elements of F− being identical. The
standard way to describe a convolutional encoder is to choose F0 and the
corresponding addition, namely addition modulo 2. See Exercise 6.12 for the
reason we opt for a non-standard description.
exercise 6.6 (Independence of the distance profile from the reference path)
We want to show that a(i, d) does not depend on the reference path. Recall that
in Section 6.4.1 we define a(i, d) as the number of detours that leave the reference
path at some arbitrary but fixed trellis depth j and have input distance i and output
distance d with respect to the reference path.
(a) Let b and b̄, both in {±1}∞ , be two infinite-length input sequences to the
encoder of Figure 6.2 and let f be the encoding map. The encoder is linear
in the sense that the componentwise product over the reals bb̄ is also a valid
input sequence and the corresponding output sequence is f (bb̄) = f (b)f (b̄)
(see Exercise 6.5). Argue that the distance between b and b̄ equals the distance
between bb̄ and the all-one input sequence. Similarly, argue that the distance
between f (b) and f (b̄) equals the distance between f (bb̄) and the all-one output
sequence (which is the output to the all-one input sequence).
(b) Fix an arbitrary reference path and an arbitrary detour that splits from the
reference path at time 0. Let b and b̄ be the corresponding input sequences.
Because the detour starts at time 0, bi = b̄i for i < 0 and b0 = b̄0 . Argue that
b̄ uniquely defines a detour b̃ that splits from the all-one path at time 0 and
such that:
(i) the distance between b and b̄ is the same as that between b̃ and the all-one
input sequence;
(ii) the distance between f (b) and f (b̄) is the same as that between f (b̃) and
the all-one output sequence.
(c) Conclude that a(i, d) does not depend on the reference path.
exercise 6.7 (Rate 1/3 convolutional code) For the convolutional encoder of
Figure 6.10 do the following.
× x3n = bn bn−2
bn ∈ {±1}
bn−1 bn−2
Figure 6.10.
(a) Draw the state diagram and the detour flow graph.
(b) Suppose that the serialized encoder output symbols are scaled so that the
resulting energy per bit is Eb and are sent over the discrete-time AWGN
channel of noise variance σ 2 = N0 /2. Derive an upper bound to the bit-error
probability assuming that the decoder implements the Viterbi algorithm.
exercise 6.8 (Rate 2/3 convolutional code) The following equations describe
the output sequence of a convolutional encoder that in each epoch takes k0 = 2
input symbols from {±1} and outputs n0 = 3 symbols from the same alphabet.
x3n = b2n b2n−1 b2n−2
x3n+1 = b2n+1 b2n−2
x3n+2 = b2n+1 b2n b2n−2
(a) Draw an implementation of the encoder based on delay elements and multi-
pliers.
(b) Draw the state diagram.
(c) Suppose that the serialized encoder output symbols are scaled so that the
resulting energy per bit is Eb and are sent over the discrete-time AWGN
channel of noise variance σ 2 = N0 /2. Derive an upper bound to the bit-error
probability assuming that the decoder implements the Viterbi algorithm.
exercise 6.9 (Convolutional encoder, decoder, and error probability) For the
convolutional code described by the state diagram of Figure 6.11:
(a) draw the encoder;
(b) as a function of the energy per bit Eb , upper bound the bit-error probability of
the Viterbi algorithm when the scaled encoder output sequence is transmitted
over the discrete-time AWGN channel of noise variance σ 2 = N0 /2.
Figure 6.11. State diagram with states t = (−1, −1), l = (−1, 1), r = (1, −1), b = (1, 1); each edge is labeled with the input symbol and the corresponding pair of output symbols (e.g. 1 | −1, −1).
exercise 6.10 (Viterbi for the binary erasure channel) Consider the convolu-
tional encoder of Figure 6.12 with inputs and outputs over {0, 1} and addition
modulo 2. Its output is sent over the binary erasure channel described by
PY |X (0|0) = PY |X (1|1) = 1 − ,
PY |X (?|0) = PY |X (?|1) = ,
PY |X (1|0) = PY |X (0|1) = 0,
Figure 6.12. Convolutional encoder over {0, 1}: input $b_j$, memory cells $b_{j-1}$, $b_{j-2}$, and modulo-2 adders producing $x_{2j-1}$ and $x_{2j}$.
exercise 6.11 (Bit-error probability) In the process of upper bounding the bit-
error probability, in Section 6.4.2 we make the following step
$$E[\Omega_j] \le \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, Q\left(\sqrt{\frac{E_s d}{\sigma^2}}\right) a(i,d) \le \sum_{i=1}^{\infty}\sum_{d=1}^{\infty} i\, z^d\, a(i,d).$$
(a) Instead of upper bounding the Q function as done above, use the results of
Section 6.4.1 to substitute a(i, d) and d with explicit functions of i and get
rid of the second sum. You should obtain
$$P_b \le \sum_{i=1}^{\infty} i\, Q\left(\sqrt{\frac{E_s (i+4)}{\sigma^2}}\right) 2^{i-1}.$$
(b) Truncate the above sum to the first five terms and evaluate it numerically for
Es /σ 2 between 2 and 6 dB. Plot the results and compare to Figure 6.8.
Miscellaneous exercises
Figure 6.13. (a) Encoder over $F_0 = \{0, 1\}$: input $\bar b_j$, memory cells $\bar b_{j-1}$, $\bar b_{j-2}$, modulo-2 adders, and output maps T producing $\bar x_{2j-1}$ and $\bar x_{2j}$. (b) Encoder over $F_- = \{\pm 1\}$: input $b_j$, memory cells $b_{j-1}$, $b_{j-2}$, and multipliers producing $x_{2j-1}$ and $x_{2j}$.
Comment: The encoder of Figure 6.13b is linear over the field F− (see Exercise
6.5), whereas the encoder of Figure 6.13a is linear over F0 only if we omit the
output map T . The comparison of the two figures should explain why in this chapter
we have opted for the description of part (b) even though the standard description
of a convolutional encoder is as in part (a).
exercise 6.13 (Trellis with antipodal signals) Figure 6.14a shows a trellis
section labeled with the output symbols x2j−1 , x2j of a convolutional encoder. Notice
how branches that are the mirror-image of each other have antipodal output symbols
(symbols that are the negative of each other). The purpose of this exercise is to see
that when the trellis has this particular structure and codewords are sent through the
discrete-time AWGN channel, the maximum likelihood sequence detector further
simplifies (with respect to the Viterbi algorithm).
Figure 6.14b shows two consecutive trellis sections labeled with the branch metric.
Notice that the mirror symmetry of part (a) implies the same kind of symmetry
for part (b). The maximum likelihood path is the one that has the largest path
metric. To avoid irrelevant complications we assume that there is only one path
that maximizes the path metric.
Figure 6.14. (a) Trellis section labeled with the output symbols $x_{2j-1}, x_{2j}$; mirror-image branches carry antipodal symbols. (b) Two consecutive trellis sections labeled with the branch metrics (a, b, c, d and their negatives).
(a) Let σj ∈ {±1} be the state visited by the maximum likelihood path at depth j.
Suppose that a genie informs the decoder that σj−1 = σj+1 = 1. Write down
the necessary and sufficient condition for the maximum likelihood path to go
through σj = 1.
(b) Repeat for the remaining three possibilities of σj−1 and σj+1 . Does the nec-
essary and sufficient condition for σj = 1 depend on the value of σj−1 and
σj+1 ?
(c) The branch metric for the branch with output symbols x2j−1 , x2j is
where yj is xj plus noise. Using the result of the previous part, specify a
maximum likelihood sequence decision for σj = 1 based on the observation
y2j−1 , y2j , y2j+1 , y2j+2 .
where $\{B_i\}_{i=-\infty}^{\infty}$, $B_i \in \{1, -1\}$, is a sequence of independent and uniformly distributed bits and ψ(t) is a centered and unit-energy rectangular pulse of width T. The communication channel between the transmitter and the receiver is the AWGN channel of power spectral density $\frac{N_0}{2}$. At the receiver, the channel output Z(t) is
passed through a filter matched to ψ(t), and the output is sampled, ideally at times
tk = kT , k integer.
(a) Consider that there is a timing error, i.e. the sampling time is $t_k = kT - \tau$ where $\frac{\tau}{T} = 0.25$. Ignoring the noise, express the matched filter output
observation wk at time tk = kT − τ as a function of the bit values bk and
bk−1 .
(b) Extending to the noisy case, let rk = wk + zk be the kth matched filter
output observation. The receiver is not aware of the timing error. Compute
the resulting error probability.
(c) Now assume that the receiver knows the timing error τ (same τ as above) but
it cannot correct for it. (This could be the case if the timing error becomes
known once the samples are taken.) Draw and label four sections of a trellis
that describes the noise-free sampled matched filter output for each input
sequence $b_1, b_2, b_3, b_4$. In your trellis, take into consideration the fact that the matched filter is “at rest” before $x(t) = \sum_{i=1}^{4} b_i\, \psi(t - iT)$ enters the filter.
(d) Suppose that the sampled matched filter output consists of 2, 0.5, 0, −1. Use
the Viterbi algorithm to decide on the transmitted bit sequence.
exercise 6.15 (Simulation) The purpose of this exercise is to determine, by
simulation, the bit-error probability of the communication system studied in this
chapter. For the simulation, we recommend using MATLAB, as it has high-level func-
tions for the various tasks, notably for generating a random information sequence,
for doing convolutional encoding, for simulating the discrete-time AWGN channel,
and for decoding by means of the Viterbi algorithm. Although the actual simulation
is on the discrete-time AWGN, we specify a continuous-time setup. It is part of
your task to translate the continuous-time specifications into what you need for the
simulation. We begin with the uncoded version of the system of interest.
(a) By simulation, determine the minimum obtainable bit-error probability Pb of
bit-by-bit on a pulse train transmitted over the AWGN channel. Specifically,
the channel input signal has the form
$$X(t) = \sum_j X_j\, \psi(t - jT),$$
where the symbols are iid and take value in $\{\pm\sqrt{E_s}\}$, the pulse ψ(t) has unit
norm and is orthogonal to its T -spaced time translates. Plot Pb as a function
of Es /σ 2 in the range from 2 to 6 dB, where σ 2 is the noise variance. Verify
your results with Figure 6.8.
(b) Repeat with the symbol sequence being the output of the convolutional encoder of Figure 6.2 multiplied by $\sqrt{E_s}$. The decoder shall implement the Viterbi
algorithm. Also in this case you can verify your results by comparing with
Figure 6.8.
7 Passband communication
via up/down conversion:
Third layer
7.1 Introduction
We speak of baseband communication when the signals have their energy in some
frequency interval [−B, B] around the origin (Figure 7.1a). Much more common
is the situation where the signal’s energy is concentrated in [fc − B, fc + B] and
[−fc − B, −fc + B] for some carrier frequency fc greater than B. In this case,
we speak of passband communication (Figure 7.1b). The carrier frequency fc is
chosen to fulfill regulatory constraints, to avoid interference from other signals, or
to make the best possible use of the propagation characteristics of the medium
used to communicate.
Figure 7.1. (a) Baseband signal: energy concentrated in [−B, B]. (b) Passband signal: energy concentrated in $[f_c - B, f_c + B]$ and $[-f_c - B, -f_c + B]$.
The purpose of this chapter is to introduce a third and final layer responsible for
passband communication. With this layer in place, the upper layers are designed
for baseband communication even when the actual communication happens in
passband.
example 7.1 (Regulatory constraints) Figure 7.2 shows the radio spectrum allo-
cation for the United States (October 2003). To get an idea about its complexity, the
chart is presented in its entirety even if it is too small to read. The interested reader
can find the original on the website of the (US) National Telecommunications and
Information Administration.
Figure 7.2. Radio spectrum allocation in the United States, produced by the US Department of Commerce, National Telecommunications and Information Administration, Office of Spectrum Management (October 2003).
example 7.3 (Refraction) Radio signals are refracted by the ionosphere sur-
rounding the Earth. Different layers of the ionosphere have different ionization
densities, hence different refraction indices. As a result, signals can be bent by a
layer or can be trapped between layers. This phenomenon concerns mainly the MF
and HF range (300 kHz to 30 MHz) but can also affect the MF through the LF and
VLF range. As a consequence, radio signals emitted from a ground station can be
bent back to Earth, sometimes after traveling a long distance trapped between layers
of the ionosphere. This mode of propagation, called sky wave (as opposed to ground
wave) is exploited, for instance, by amateur radio operators to reach locations on
Earth that could not be reached if their signals traveled in straight lines. In fact,
under particularly favorable circumstances, the communication between any two
regions on Earth can be established via sky waves. Although the bending caused by
the ionosphere is desirable for certain applications, it is a nuisance for Earth-to-
satellite communication. This is why satellites use higher frequencies for which the
ionosphere is essentially transparent (typically GHz range).
= xF (−f ).
1
In principle, the notation x∗F (f ) could mean (xF )∗ (f ) or (x∗ )F (f ), but it should be
clear that we mean the former because the latter is not useful when x(t) is real-valued,
in which case (x∗ )F (f ) = xF (f ).
2
Note that h>, F (f ) is not an L2 function, but it can be made into one by setting it to
zero at all frequencies that are outside the support of xF (f ). Note also that we can
arbitrarily choose the value of h>, F (f ) at f = 0, because two functions that differ at a
single point are L2 equivalent.
Figure 7.3. The magnitude spectra $|x_F(f)|$ (passband, centered at $\pm f_c$), $|\hat x_F(f)|$ (analytic signal, supported on positive frequencies around $f_c$, height scaled by $\sqrt{2}$), and $|x_{E,F}(f)|$ (baseband-equivalent, centered at 0).
xE (t) = x̂(t)e−j2πfc t
xE,F (f ) = x̂F (f + fc ).
Figure 7.3 depicts the relationship between |xF (f )|, |x̂F (f )|, and |xE,F (f )|. We
plot the absolute value to avoid plotting the real and the imaginary components.
We use dashed lines to plot |xF (f )| for f < 0 as a reminder that it is completely
determined by |xF (f )|, f > 0.
The operation that recovers x(t) from its baseband-equivalent xE (t) is
√
x(t) = 2 xE (t)ej2πfc t . (7.6)
The circuits to go from x(t) to xE (t) and back to x(t) are depicted in Figure 7.4,
where double arrows denote complex-valued signals. Exercises 7.3 and 7.5 derive
equivalent circuits that require only operations over the reals.
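The two circuits can also be checked numerically. The sketch below (Python with numpy) uses an FFT in place of the filter $h_>(t)$; the carrier frequency, bandwidth and sampling rate are arbitrary illustrative values.

```python
import numpy as np

fs, fc = 1000.0, 100.0                         # sampling rate and carrier (arbitrary values)
t = np.arange(0, 1.0, 1 / fs)
x = np.cos(2*np.pi*95*t) + 0.5*np.sin(2*np.pi*105*t)   # a real passband signal around fc

# Analytic signal: keep positive frequencies only, scale by sqrt(2).
f = np.fft.fftfreq(len(x), 1 / fs)
x_hat = np.fft.ifft(np.sqrt(2) * np.fft.fft(x) * (f > 0))

# Baseband-equivalent and reconstruction as in (7.6).
x_E = x_hat * np.exp(-2j*np.pi*fc*t)
x_rec = np.sqrt(2) * np.real(x_E * np.exp(2j*np.pi*fc*t))

print(np.max(np.abs(x - x_rec)))               # essentially zero (numerical precision)
```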
The following theorem and the two subsequent corollaries are important in that
they establish a geometrical link between baseband and passband signals.
Figure 7.4. (a) From x(t) to the baseband-equivalent $x_E(t)$: filter with $\sqrt{2}\,h_>(t)$, then multiply by $e^{-j2\pi f_c t}$. (b) From $x_E(t)$ back to x(t): multiply by $e^{j2\pi f_c t}$, then take $\sqrt{2}\,\Re\{\cdot\}$.
theorem 7.5 (Inner product of passband signals) Let x(t) and y(t) be (real-valued) passband signals, let x̂(t) and ŷ(t) be the corresponding analytic signals, and let $x_E(t)$ and $y_E(t)$ be the baseband-equivalent signals (with respect to a common carrier frequency $f_c$). Then
$$\langle x, y\rangle = \Re\{\langle \hat x, \hat y\rangle\} = \Re\{\langle x_E, y_E\rangle\}.$$
Note 1: $\langle x, y\rangle$ is real-valued, whereas $\langle \hat x, \hat y\rangle$ and $\langle x_E, y_E\rangle$ are complex-valued in general. This helps us see/remember why the theorem cannot hold without taking the real part of the last two inner products. The reader might prefer to remember the more symmetric (and more redundant) form $\Re\{\langle x, y\rangle\} = \Re\{\langle \hat x, \hat y\rangle\} = \Re\{\langle x_E, y_E\rangle\}$.
Note 2: From the proof that follows, we see that the second equality holds also for the imaginary parts, i.e. $\langle \hat x, \hat y\rangle = \langle x_E, y_E\rangle$.
Proof Let $\hat x(t) = x_E(t)e^{j2\pi f_c t}$. Showing that $\langle \hat x, \hat y\rangle = \langle x_E, y_E\rangle$ is immediate:
$$\langle \hat x, \hat y\rangle = \langle x_E(t)e^{j2\pi f_c t}, y_E(t)e^{j2\pi f_c t}\rangle = e^{j2\pi f_c t}e^{-j2\pi f_c t}\langle x_E(t), y_E(t)\rangle = \langle x_E, y_E\rangle.$$
To prove $\langle x, y\rangle = \Re\{\langle \hat x, \hat y\rangle\}$, we use Parseval's relationship (first and last equality below), the fact that the Fourier transform of $x(t) = \frac{1}{\sqrt{2}}[\hat x(t) + \hat x^*(t)]$ is $x_F(f) = \frac{1}{\sqrt{2}}[\hat x_F(f) + \hat x_F^*(-f)]$ (second equality), the fact that $\hat x_F(f)\hat y_F(-f) = 0$ because the two functions have disjoint support and similarly $\hat x_F^*(-f)\hat y_F^*(f) = 0$ (third equality), and finally that the integral over a function is the same as the integral over the time-reversed function (fourth equality):
$$\langle x, y\rangle = \int x_F(f)\, y_F^*(f)\, df$$
$$= \frac{1}{2}\int \left[\hat x_F(f) + \hat x_F^*(-f)\right]\left[\hat y_F^*(f) + \hat y_F(-f)\right] df$$
$$= \frac{1}{2}\int \left[\hat x_F(f)\hat y_F^*(f) + \hat x_F^*(-f)\hat y_F(-f)\right] df$$
$$= \frac{1}{2}\int \left[\hat x_F(f)\hat y_F^*(f) + \hat x_F^*(f)\hat y_F(f)\right] df$$
$$= \Re\left\{\int \hat x_F(f)\,\hat y_F^*(f)\, df\right\}$$
$$= \Re\{\langle \hat x, \hat y\rangle\}.$$
3
It would be a misnomer to call xE (t) a baseband signal if x(t) is not passband.
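Theorem 7.5 is also easy to verify numerically; the sketch below (Python with numpy) approximates the inner products by Riemann sums, with arbitrary illustrative signals and parameter values.

```python
import numpy as np

fs, fc, N = 1000.0, 100.0, 1000                # arbitrary illustrative values
t = np.arange(N) / fs
f = np.fft.fftfreq(N, 1 / fs)
analytic = lambda s: np.fft.ifft(np.sqrt(2) * np.fft.fft(s) * (f > 0))
inner = lambda a, b: np.vdot(b, a) / fs        # <a, b> = integral of a(t) b*(t) dt

x = np.cos(2*np.pi*96*t) + 0.3*np.cos(2*np.pi*103*t + 0.7)
y = np.sin(2*np.pi*98*t) - 0.5*np.cos(2*np.pi*104*t)
x_hat, y_hat = analytic(x), analytic(y)
x_E, y_E = x_hat * np.exp(-2j*np.pi*fc*t), y_hat * np.exp(-2j*np.pi*fc*t)

print(inner(x, y).real)                        # <x, y>
print(inner(x_hat, y_hat).real)                # Re{<x_hat, y_hat>}: same value
print(inner(x_E, y_E).real)                    # Re{<x_E, y_E>}: same value
```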
Proof If g(t) satisfies the stated condition, then $g(t)e^{j2\pi f_c t}$ has no negative frequencies. Hence $g(t)e^{j2\pi f_c t}$ is the analytic signal x̂(t) of $x(t) = \sqrt{2}\,\Re\{g(t)e^{j2\pi f_c t}\}$, which implies that g(t) is the baseband-equivalent $x_E(t)$ of x(t).
Hereafter all passband signals are assumed to be real-valued as they represent
actual communication signals. Baseband signals can be signals that we use for
baseband communication on real-world channels or can be baseband-equivalents
of passband signals. In the latter case, they are complex-valued in general.
Figure 7.5. (a) Information signal $|b_F(f)|$, supported on [−B, B]. (b) DSB-SC modulated signal $|x_F(f)|$, consisting of $\frac{1}{\sqrt{2}}|b_F(f + f_c)|$ and $\frac{1}{\sqrt{2}}|b_F(f - f_c)|$.
baseband information signal b(t) and the modulated signal x(t). The dashed parts
of the plots are meant to remind us that they can be determined from the solid
parts.
This modulation scheme is called “double-sideband” because of the two bands on
the left and right of fc , only one is needed to recover b(t). Specifically, we could
eliminate the sideband below fc ; and to preserve the conjugacy symmetry required
by real-valued signals, we would eliminate also the sideband above −fc and still
be able to recover the information signal b(t) from the resulting passband signal.
Hence, we could eliminate one of the sidebands and thereby reduce the bandwidth
and the energy by a factor 2. (See Example 7.11.) The SC (suppressed carrier)
part of the name distinguishes this modulation technique from AM (amplitude
modulation, see next example), which is indeed a double-sideband modulation with
carrier (at ±fc ).
example 7.10 (AM modulation) AM modulation is by far the most popular
member of the family of amplitude modulations. Let b(t) be the source signal,
and assume that it is zero-mean and |b(t)| ≤ 1 for all t. AM modulation of
b(t) is DSB-SC modulation of 1 + mb(t) for some modulation index m such
that 0 < m ≤ 1. Notice that 1 + mb(t) is always non-negative. By using this
fact, the receiver can be significantly simplified (see Exercise 7.7). The possibility
of building inexpensive receivers is what made AM modulation the modulation of
choice in early radio broadcasting. AM is also a double-sideband modulation but,
unlike DSB-SC, it has a carrier at $\pm f_c$. We see the carrier by expanding $x(t) = (1 + m\,b(t))\sqrt{2}\cos(2\pi f_c t) = m\,b(t)\sqrt{2}\cos(2\pi f_c t) + \sqrt{2}\cos(2\pi f_c t)$. The carrier consumes
energy without carrying any information. It is the “price” that broadcasters are
willing to pay to reduce the cost of the receiver. The trade-off seems reasonable
given that there is one sender and many receivers.
The following two examples are bandwidth-efficient variants of double-sideband
modulation.
example 7.11 (Single-sideband modulation (SSB)) As in the previous example,
let b(t) be the real-valued baseband information signal. Let $\hat b(t) = (b * h_>)(t)$ be the analytic-equivalent of b(t). We define x(t) to be the passband signal that has $\hat b(t)$ as
its baseband-equivalent (with respect to the desired carrier frequency). Figure 7.6
shows the various frequency-domain signals. A comparison with Figure 7.5 should
suffice to understand why this process is called single-sideband modulation. Single-
sideband modulation is widely used in amateur radio communication. Instead of
removing the negative frequencies of the original baseband signal we could remove
the positive frequencies. The two alternatives are called SSB-USB (USB stands for
upper side-band) and SSB-LSB (lower side-band), respectively. A drawback of SSB
is that it requires a sharp filter to remove the negative frequencies. Amateur radio
people are willing to pay this price to make efficient use of the limited spectrum
allocated to them.
example 7.12 (Quadrature amplitude modulation (QAM)) The idea consists
of taking two real-valued baseband information signals, say bR (t) and bI (t), and
forming the signal b(t) = bR (t) + jbI (t). As b(t) is complex-valued, its Fourier
Figure 7.6. (a) Information signal $|b_F(f)|$. (b) Analytic-equivalent $|\hat b_F(f)|$ of b(t) (up to scaling), supported on [0, B]. (c) SSB modulated signal $|x_F(f)|$, consisting of $\sqrt{2}|\hat b_F(-f - f_c)|$ and $\sqrt{2}|\hat b_F(f - f_c)|$.
Figure 7.7. (a) Information signal $|b_F(f)|$. (b) Modulated signal $|x_F(f)|$, consisting of $\frac{1}{\sqrt{2}}|b_F(-f - f_c)|$ and $\frac{1}{\sqrt{2}}|b_F(f - f_c)|$.
The advantage of QAM over SSB is that it does not require a sharp filter to remove one of the two
sidebands. The drawback is that typically a sender has one, not two, analog signals
to send. QAM is not popular as an analog modulation technique. However, it is
a very popular technique for digital communication. The idea is to split the bits
into two streams, with each stream doing symbol-by-symbol on a pulse train to
obtain, say, bR (t) and bI (t) respectively, and then proceeding as described above.
(See Example 7.15.)
desired frequency band. This can be done quite effectively; we will see how. But if
we re-design the receiver starting with a new arbitrarily-selected orthonormal basis
for the new signal set, then we see that the n-tuple former as well as the decoder
could end up being totally different from the original ones. Using the results of
Section 7.2, we can find a flexible and elegant solution to this problem, so that we
can frequency-translate the signal’s band to any desired location without affecting
the n-tuple former and the decoder. (The encoder and the waveform former are
not affected either.)
Let wE,0 (t), . . . , wE,m−1 (t) be the baseband-equivalent signal constellation. We
assume that they belong to a complex inner product space and let ψ1 (t), . . . , ψn (t)
be an orthonormal basis for this space. Let ci = (ci,1 , . . . , ci,n )T ∈ Cn be the
codeword associated to $w_{E,i}(t)$, i.e.
$$w_{E,i}(t) = \sum_{l=1}^{n} c_{i,l}\, \psi_l(t),$$
and let
$$w_i(t) = \sqrt{2}\,\Re\{w_{E,i}(t)\, e^{j2\pi f_c t}\}$$
be the corresponding passband signal.
The orthonormal basis for the baseband-equivalent signal set can be lifted up to
an orthonormal basis for the passband signal set as follows.
$$w_i(t) = \sqrt{2}\,\Re\{w_{E,i}(t)\, e^{j2\pi f_c t}\}$$
$$= \sqrt{2}\,\Re\Big\{\sum_{l=1}^{n} c_{i,l}\, \psi_l(t)\, e^{j2\pi f_c t}\Big\}$$
$$= \sqrt{2}\sum_{l=1}^{n} \Re\{c_{i,l}\, \psi_l(t)\, e^{j2\pi f_c t}\}$$
$$= \sqrt{2}\sum_{l=1}^{n} \Re\{c_{i,l}\}\,\Re\{\psi_l(t)\, e^{j2\pi f_c t}\} - \sqrt{2}\sum_{l=1}^{n} \Im\{c_{i,l}\}\,\Im\{\psi_l(t)\, e^{j2\pi f_c t}\}$$
$$= \sum_{l=1}^{n} \Re\{c_{i,l}\}\,\psi_{1,l}(t) + \sum_{l=1}^{n} \Im\{c_{i,l}\}\,\psi_{2,l}(t), \qquad (7.7)$$
where $\psi_{1,l}(t) = \sqrt{2}\,\Re\{\psi_l(t)e^{j2\pi f_c t}\}$ and $\psi_{2,l}(t) = -\sqrt{2}\,\Im\{\psi_l(t)e^{j2\pi f_c t}\}$.
From (7.7), we see that the set {ψ1,1 (t), . . . , ψ1,n (t), ψ2,1 (t), . . . , ψ2,n (t)} spans
a vector space that contains the passband signals. As stated by the next
theorem, this set forms an orthonormal basis, provided that the carrier frequency
is sufficiently high.
$$w_i(t) = \sqrt{2}\,\Re\{w_{E,i}(t)e^{j2\pi f_c t}\} = \sum_{l=1}^{n} \Re\{c_{i,l}\}\,\psi_{1,l}(t) + \sum_{l=1}^{n} \Im\{c_{i,l}\}\,\psi_{2,l}(t).$$
Proof The last statement is (7.7). Hence (7.10) spans a vector space that contains the passband signals. It remains to be shown that this set is orthonormal. From Lemma 7.8, the baseband-equivalent signal of $\psi_{1,l}(t)$ is $\psi_l(t)$. Similarly, by writing $\psi_{2,l}(t) = \sqrt{2}\,\Re\{[j\psi_l(t)]e^{j2\pi f_c t}\}$, we see that the baseband-equivalent of $\psi_{2,l}(t)$ is $j\psi_l(t)$. From Corollary 7.7, $\langle\psi_{1,k}(t), \psi_{1,l}(t)\rangle = \Re\{\langle\psi_k(t), \psi_l(t)\rangle\} = \mathbb{1}\{k = l\}$, showing that the set $\{\psi_{1,l}(t) : l = 1, \ldots, n\}$ is made of orthonormal functions. Similarly, $\langle\psi_{2,k}(t), \psi_{2,l}(t)\rangle = \Re\{\langle j\psi_k(t), j\psi_l(t)\rangle\} = \Re\{\langle\psi_k(t), \psi_l(t)\rangle\} = \mathbb{1}\{k = l\}$, showing that also $\{\psi_{2,l}(t) : l = 1, \ldots, n\}$ is made of orthonormal functions.
To conclude the proof, it remains to be shown that functions from the first set are orthogonal to functions from the second set. Indeed $\langle\psi_{1,k}(t), \psi_{2,l}(t)\rangle = \Re\{\langle\psi_k(t), j\psi_l(t)\rangle\} = \Re\{\langle -j\psi_k(t), \psi_l(t)\rangle\} = 0$. The last equality holds for $k \ne l$ because $\psi_k$ and $\psi_l$ are orthogonal, and it holds for $k = l$ because $\langle\psi_k(t), \psi_k(t)\rangle = \|\psi_k(t)\|^2$ is real.
From the above theorem, we see that if the vector space spanned by the
baseband-equivalent signals has dimensionality n, the vector space spanned by the
corresponding passband signals has dimensionality 2n. However, the number of
real-valued “degrees of freedom” is the same in both spaces. In fact, the coefficients
used in the orthonormal expansion of the baseband signals are complex, hence
with two degrees of freedom per coefficient, whereas those used in the orthonormal
expansion of the passband signals are real.
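As a quick numerical illustration of the lifting ψ1,l (t) = √2 ℜ{ψl (t)e^{j2πfc t}} and ψ2,l (t) = √2 ℜ{jψl (t)e^{j2πfc t}}, the following sketch (not from the text; the rectangular pulse, carrier frequency, and sampling rate are arbitrary choices) checks by numerical integration that the two lifted functions are essentially orthonormal when fc is large compared to the pulse bandwidth.

```python
import numpy as np

# Time grid (dense enough to resolve the carrier). All choices are illustrative:
# unit-energy rectangular pulse on [0, T], carrier fc >> 1/T.
T, fc, fs = 1.0, 20.0, 4000.0
t = np.arange(0.0, T, 1.0 / fs)

psi = np.where((t >= 0) & (t < T), 1.0 / np.sqrt(T), 0.0)   # baseband pulse, unit norm

# Lifted passband functions: psi1 = sqrt(2) Re{psi e^{j2 pi fc t}},
#                            psi2 = sqrt(2) Re{j psi e^{j2 pi fc t}} = -sqrt(2) psi sin(2 pi fc t).
psi1 = np.sqrt(2) * psi * np.cos(2 * np.pi * fc * t)
psi2 = -np.sqrt(2) * psi * np.sin(2 * np.pi * fc * t)

def inner(a, b):
    """Riemann-sum approximation of the L2 inner product."""
    return np.sum(a * b) / fs

print(inner(psi1, psi1))   # ~1
print(inner(psi2, psi2))   # ~1
print(inner(psi1, psi2))   # ~0
```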
Next we re-design the receiver using Theorem 7.13 to construct an orthonormal basis for the passband signals. The 2n-tuple former now computes Y1 = (Y1,1 , . . . , Y1,n )T and Y2 = (Y2,1 , . . . , Y2,n )T , where for l = 1, . . . , n
Y1,l = ⟨R(t), ψ1,l (t)⟩   (7.11)
= ℜ{⟨√2 e^{−j2πfc t} R(t), ψl (t)⟩}   (7.12)
and similarly
Y2,l = ⟨R(t), ψ2,l (t)⟩   (7.13)
= ℑ{⟨√2 e^{−j2πfc t} R(t), ψl (t)⟩}.   (7.14)
Figure 7.8. (a) Transmitter back end: the waveform former maps ci ∈ Cn onto wE,i (t) using the basis ψl (t), l = 1, . . . , n; the up-converter multiplies by e^{j2πfc t} and takes √2 ℜ{·} to produce wi (t). (b) Receiver front end: the down-converter multiplies R(t) by √2 e^{−j2πfc t}; the n-tuple former projects onto ψl (t), l = 1, . . . , n, to produce Y ∈ Cn .
Figure 7.9. Real-valued implementation for a real-valued orthonormal basis. Transmitter: ℜ{ci } ∈ Rn and ℑ{ci } ∈ Rn each drive a waveform former with pulses ψl (t), l = 1, . . . , n; the two outputs are multiplied by √2 cos(2πfc t) and −√2 sin(2πfc t), respectively, and summed to form wi (t). Receiver: R(t) is multiplied by √2 cos(2πfc t) and by −√2 sin(2πfc t); each product is projected onto ψl (t), l = 1, . . . , n, to produce ℜ{Y } ∈ Rn and ℑ{Y } ∈ Rn .
If implemented in a DSP, the programmer might be able to rely on functions that can cope with
complex numbers. If done with analog electronics, the real and the imaginary parts
are kept separate. This is shown in Figure 7.9, for the common situation where
the orthonormal basis is real-valued. There is no loss in performance in choosing
a real-valued basis and, if we do so, the implementation complexity using analog
circuitry is essentially halved (see Exercise 7.9).
We have reached a conceptual milestone, namely the point where working with
complex-valued signals becomes natural. It is worth being explicit about how and
why we make this important transition. In principle, we are only combining two
real-valued vectors of equal length into a single complex-valued vector of the same
length (see (7.15)). Because it is a reversible operation, we can always pack a pair of real-valued n-tuples into a complex-valued n-tuple and unpack it when needed.
example 7.14 (PSK signaling via complex-valued symbols) Consider the signals
wE (t) = Σ_l sl ψ(t − lT ),
w(t) = √2 ℜ{wE (t) e^{j2πfc t}},
where the symbols are sl = √E e^{jϕl}. Then
w(t) = √2 ℜ{ Σ_l √E e^{j(2πfc t+ϕl )} ψ(t − lT ) }
= √2 √E Σ_l ℜ{ e^{j(2πfc t+ϕl )} } ψ(t − lT )
= √(2E) Σ_l cos(2πfc t + ϕl ) ψ(t − lT ).
Figure 7.10 shows a sample w(t) with ψ(t) = √(1/T) 1{0 ≤ t < T }, T = 1, fc T = 3 (there are three periods in a symbol interval T ), E = 1/2, ϕ0 = 0, ϕ1 = π, ϕ2 = π/2, and ϕ3 = 3π/2.
If we plug sl = ℜ{sl } + jℑ{sl } into wE (t) we obtain
w(t) = √2 ℜ{ [ Σ_l (ℜ{sl } + jℑ{sl }) ψ(t − lT ) ] e^{j2πfc t} }
= √2 Σ_l ℜ{ (ℜ{sl } + jℑ{sl }) e^{j2πfc t} } ψ(t − lT )
= √2 Σ_l ( ℜ{sl } ℜ{e^{j2πfc t}} − ℑ{sl } ℑ{e^{j2πfc t}} ) ψ(t − lT )
= √2 Σ_l ℜ{sl } ψ(t − lT ) cos(2πfc t)
− √2 Σ_l ℑ{sl } ψ(t − lT ) sin(2πfc t).   (7.18)
Figure 7.10. Sample PSK modulated signal.
For a rectangular pulse ψ(t), √2 ψ(t − lT ) cos(2πfc t) is orthogonal to √2 ψ(t − iT ) sin(2πfc t) for all integers l and i, provided that 2fc T is an integer.4 From (7.18), we see that the PSK signal is the superposition of two PAM signals. This view is not very useful for PSK, because ℜ{sl } and ℑ{sl } cannot be chosen independently of each other.5 Hence the two superposed signals cannot be decoded independently. It is more useful for QAM. (See next example.)
example 7.15 (QAM signaling via complex-valued symbols) Suppose that the
signaling method is as in Example 7.14 but that the symbols take value in a QAM
alphabet. As in Example 7.14, it is instructive to write the symbols in two ways.
If we write sl = al e^{jϕl }, then proceeding as in Example 7.14, we obtain
w(t) = √2 ℜ{ [ Σ_l al e^{jϕl } ψ(t − lT ) ] e^{j2πfc t} }
= √2 ℜ{ Σ_l al e^{j(2πfc t+ϕl )} ψ(t − lT ) }
= √2 Σ_l al ℜ{ e^{j(2πfc t+ϕl )} } ψ(t − lT )
= √2 Σ_l al cos(2πfc t + ϕl ) ψ(t − lT ).
4
See the argument in Example 3.10. In practice, the integer condition can be ignored because 2fc T is large, in which case the inner product between the two functions is negligible compared to 1 – the norm of both functions. For a general bandlimited ψ(t), the orthogonality between √2 ψ(t − lT ) cos(2πfc t) and √2 ψ(t − iT ) sin(2πfc t), for a sufficiently large fc , follows from Theorem 7.13.
5
Except for 2-PSK, for which ℑ{sl } is always 0.
Figure 7.11 shows a sample w(t) with ψ(t) and fc as in Example 7.14, with s0 = 1 + j = √2 e^{jπ/4}, s1 = 3 + j = √10 e^{j tan⁻¹(1/3)}, s2 = −3 + j = √10 e^{j(tan⁻¹(−1/3)+π)}, s3 = −1 + j = √2 e^{j3π/4}.
Figure 7.11. Sample QAM signal.
As for PSK, the QAM signal is the superposition of two PAM signals, but unlike for PSK, the ℜ{sl } and the ℑ{sl } of QAM can be selected independently. Hence, the two superposed PAM signals can be decoded independently, with no interference between the two because √2 ψ(t − lT ) cos(2πfc t) is orthogonal to √2 ψ(t − iT ) sin(2πfc t). Using (5.10), it is straightforward to verify that the bandwidth of the QAM signal is the same as that of the individual PAM signals. We conclude that the bandwidth efficiency (bits per Hz) of QAM is twice that of PAM.
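As a rough numerical check of this observation (a sketch with arbitrarily chosen pulse, carrier, and sampling rate, not code from the text; the symbol values are those of Figure 7.11), the following builds a QAM signal from complex symbols and recovers ℜ{sl } and ℑ{sl } independently by correlating against √2 ψ(t − lT ) cos(2πfc t) and −√2 ψ(t − lT ) sin(2πfc t).

```python
import numpy as np

# Illustrative parameters: rectangular unit-energy pulse, 2*fc*T an integer.
T, fc, fs = 1.0, 3.0, 1000.0
symbols = np.array([1 + 1j, 3 + 1j, -3 + 1j, -1 + 1j])   # symbol values as in Figure 7.11
t = np.arange(0.0, len(symbols) * T, 1.0 / fs)

def pulse(t0):
    """Unit-norm rectangular pulse psi(t - t0)."""
    return np.where((t >= t0) & (t < t0 + T), 1.0 / np.sqrt(T), 0.0)

# w(t) = sqrt(2) sum_l [Re{s_l} psi(t-lT) cos(2 pi fc t) - Im{s_l} psi(t-lT) sin(2 pi fc t)]
w = np.zeros_like(t)
for l, s in enumerate(symbols):
    w += np.sqrt(2) * (s.real * np.cos(2*np.pi*fc*t) - s.imag * np.sin(2*np.pi*fc*t)) * pulse(l * T)

# Recover the two PAM streams independently by correlation.
for l in range(len(symbols)):
    re = np.sum(w * np.sqrt(2) * pulse(l*T) * np.cos(2*np.pi*fc*t)) / fs
    im = np.sum(w * (-np.sqrt(2)) * pulse(l*T) * np.sin(2*np.pi*fc*t)) / fs
    print(l, round(re, 3), round(im, 3))   # ~ Re{s_l}, Im{s_l}
```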
Stepping back and looking at the big picture, we now view the physical layer of
the OSI model (Figure 1.1) for the AWGN channel as consisting of the three sub-
layers shown in Figure 7.12. We are already familiar with all the building blocks
of this architecture. The channel models “seen” by the first and second sub-layer,
respectively, still need to be discussed. New in these channel models is the fact
that the noise is complex-valued. (The signals are complex-valued as well, but we
are already familiar with complex-valued signals.)
Under hypothesis H = i, the discrete-time channel seen by the first (top) sub-
layer has input ci ∈ Cn and output
Y = ci + Z,
where, according to (7.16), (7.11) and (7.13), the lth component of Y is Y1,l +
jY2,l = ci,l + Zl and Zl = Z1,l + jZ2,l , where Z1,1 , . . . , Z1,n , Z2,1 , . . . , Z2,n is
a collection of iid zero-mean Gaussian random variables of variance N0 /2. We
have all the ingredients to describe the statistical behavior of Y via the pdf of
Z1,1 , . . . , Z1,n , Z2,1 , . . . , Z2,n , but it is more elegant to describe the pdf of the
complex-valued random vector Y . To find the pdf of Y , we introduce the random
Figure 7.12. The three sub-layers of the physical layer. Top: encoder (input i) and decoder (output ı̂), communicating over the discrete-time channel with input ci ∈ Cn , output Y , and noise Z ∼ NC (0, N0 In ). Middle: waveform former and n-tuple former, communicating over the baseband-equivalent channel with input wE,i (t), output RE (t), and noise NE (t). Bottom: up-converter and down-converter, communicating over the passband channel with input wi (t), output R(t), and noise N (t).
vector Ŷ that consists of the (column) n-tuple Y1 = ℜ{Y } on top of the (column) n-tuple Y2 = ℑ{Y }. This notation extends to any complex n-tuple: if a ∈ Cn (seen as a column n-tuple), then â is the element of R2n consisting of ℜ{a} on top of ℑ{a} (see Appendix 7.8 for an in-depth treatment of the hat operator). By definition, the pdf of a complex random vector Y evaluated at y is the pdf of Ŷ at ŷ (see Appendix 7.9 for a summary on complex-valued random vectors).
Hence,
fY |H (y|i) = fŶ |H (ŷ|i)
= fY1 ,Y2 |H (ℜ{y}, ℑ{y}|i)
= fY1 |H (ℜ{y}|i) fY2 |H (ℑ{y}|i)
= (1/(√(πN0 ))ⁿ) exp( − Σ_{l=1}^{n} (ℜ{yl } − ℜ{ci,l })² / N0 )
× (1/(√(πN0 ))ⁿ) exp( − Σ_{l=1}^{n} (ℑ{yl } − ℑ{ci,l })² / N0 )
= (1/(πN0 )ⁿ) exp( − ‖y − ci ‖² / N0 ).   (7.19)
(In the last step of (7.19) we use the fact that the squared norm of a complex n-tuple can be obtained by adding the squares of the real components and the squares of the imaginary components.) Similarly, ⟨ŷ, ĉi ⟩ = ℜ{⟨y, ci ⟩}. In fact, if y = yR + jyI and c = cR + jcI are (column vectors) in Cn , then ℜ{⟨y, c⟩} = yR^T cR + yI^T cI , but this is exactly the same as ⟨ŷ, ĉ⟩.6
We conclude that an ML decision rule for the complex-valued decoder-input
y ∈ Cn of Figure 7.12 is
ĤML (y) = arg min_i ‖y − ci ‖
= arg max_i ℜ{⟨y, ci ⟩} − ‖ci ‖²/2.
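A minimal simulation sketch of the discrete-time channel Y = ci + Z and of the ML rule above (the QPSK codebook, noise level, and number of trials are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: the four QPSK points in C^1 (n = 1).
codebook = np.array([[1 + 1j], [1 - 1j], [-1 + 1j], [-1 - 1j]]) / np.sqrt(2)
N0 = 0.2

def transmit(i):
    """Y = c_i + Z with Z ~ N_C(0, N0 I_n): real and imaginary parts iid N(0, N0/2)."""
    c = codebook[i]
    z = rng.normal(0, np.sqrt(N0/2), c.shape) + 1j * rng.normal(0, np.sqrt(N0/2), c.shape)
    return c + z

def ml_decode(y):
    """ML rule: minimize ||y - c_i||, i.e. maximize Re<y, c_i> - ||c_i||^2 / 2."""
    metrics = [np.real(np.vdot(c, y)) - 0.5 * np.linalg.norm(c)**2 for c in codebook]
    return int(np.argmax(metrics))

errors, trials = 0, 10000
for _ in range(trials):
    i = rng.integers(len(codebook))
    if ml_decode(transmit(i)) != i:
        errors += 1
print("error rate ~", errors / trials)
```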
Describing the baseband-equivalent channel model, as seen by the second sub-layer
of Figure 7.12, requires slightly more work. We do this in the next section for
completeness, but it is not needed in order to prove that the receiver structure of
Figure 7.12 is completely general (for the AWGN channel) and that it minimizes
the error probability. That part is done.
6
For an alternative proof that ⟨ŷ, ĉi ⟩ = ℜ{⟨y, ci ⟩}, subtract the two equations ‖y − ci ‖² = ‖y‖² + ‖ci ‖² − 2ℜ{⟨y, ci ⟩} and ‖ŷ − ĉi ‖² = ‖ŷ‖² + ‖ĉi ‖² − 2⟨ŷ, ĉi ⟩ and use the fact that the hat on a vector has no effect on the vector's norm.
7.4 Baseband-equivalent channel model
Figure 7.13. (a) Passband channel; (b) baseband-equivalent channel.
[w(t) ⋆ h(t)] e^{−j2πfc t} = [w(t) e^{−j2πfc t}] ⋆ [h(t) e^{−j2πfc t}],   (7.20)
which says that if a signal w(t) is passed through a filter of impulse response h(t)
and the filter output is multiplied by e−j2πfc t , we obtain the same as passing the
signal w(t)e−j2πfc t through the filter with impulse response h(t)e−j2πfc t . A direct
(time-domain) proof of this result is a simple exercise,7 but it is more insightful if
we take a look at what it means in the frequency domain. In fact, in the frequency
domain, the convolution on the left becomes wF (f )hF (f ), and the subsequent
multiplication by e−j2πfc t leads to wF (f + fc )hF (f + fc ). On the right side we
multiply wF (f + fc ) with hF (f + fc ).
The above relationship should not be confused with the following equalities that
hold for any constant c ∈ C
[w(t) ⋆ h(t)]c = [w(t)c] ⋆ h(t) = w(t) ⋆ [h(t)c].   (7.21)
This holds because the left-hand side at an arbitrary time t is c times the integral
of the product of two functions. If we bring the constant inside the integral and
use it to scale the first function, we obtain the expression in the middle; whereas
we obtain the expression on the right if we use c to scale the second function. In
the derivation that follows, we use both relationships.
The up-converter, the actual channel, and the down-converter perform linear
operations, in the sense that their action on the sum of two signals is the sum of
the individual actions. Linearity implies that we can consider the signal and the
noise separately. We start with the signal part (assuming that there is no noise).
7
Relationship (7.20) is a form of distributivity law, like [a + b]c = [ac] + [bc].
Ul = ⟨[wi (t) ⋆ h(t)] √2 e^{−j2πfc t}, gl (t)⟩
= ⟨[wi (t) ⋆ √2 h(t)] e^{−j2πfc t}, gl (t)⟩
= ⟨[wi (t) √2 e^{−j2πfc t}] ⋆ [h(t) e^{−j2πfc t}], gl (t)⟩
= ⟨[wi (t) √2 e^{−j2πfc t}] ⋆ h0 (t), gl (t)⟩
= ⟨[wE,i (t) + w*E,i (t) e^{−j4πfc t}] ⋆ h0 (t), gl (t)⟩
= ⟨wE,i (t) ⋆ h0 (t), gl (t)⟩,   (7.22)
where in the second line we use (7.21), in the third we use (7.20), in the fourth we introduce the notation h0 (t) = h(t) e^{−j2πfc t}, in the fifth we use
wi (t) = (1/√2) [ wE,i (t) e^{j2πfc t} + w*E,i (t) e^{−j2πfc t} ],
and in the sixth we remove the term [w*E,i (t) e^{−j4πfc t}] ⋆ h0 (t), which is bandlimited to [−2fc − B, −2fc + B] and therefore has no frequencies in
common with gl (t). By Parseval’s relationship, the inner product of functions that
have disjoint frequency support is zero.
From (7.22), for all wE,i (t) and all gl (t) that are bandlimited to [−B, B], the
noiseless output of Figure 7.13a is identical to that of Figure 7.13b.
Notice that, not surprisingly, the Fourier transform of h0 (t) is hF (f +fc ), namely
hF (f ) frequency-shifted to the left by fc .
The reader might wonder if h0 (t) is the same as the baseband-equivalent hE (t) of h(t) (with respect to fc ). In fact it is not, but we can use hE (t)/√2 instead of h0 (t). The two functions are not the same, but it is straightforward to verify that their Fourier transforms agree for f ∈ [−B, B].
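The identity behind (7.20) also holds sample-by-sample for discrete convolutions, which makes it easy to check numerically. The sketch below (an illustration with arbitrary input, filter, and carrier; not code from the text) verifies that down-converting after the filter h gives the same result as filtering the down-converted signal with h0 (t) = h(t)e^{−j2πfc t}.

```python
import numpy as np

# Discrete-time analogue of (7.20): [w * h](t) e^{-j2 pi fc t} equals
# [w(t) e^{-j2 pi fc t}] * [h(t) e^{-j2 pi fc t}].  Parameters are arbitrary.
fs, fc = 2000.0, 200.0
t = np.arange(0, 1.0, 1/fs)

rng = np.random.default_rng(1)
w = rng.standard_normal(t.size)            # any input signal
h = np.sinc(20 * (t - 0.5)) / fs           # some real-valued impulse response

carrier = np.exp(-2j * np.pi * fc * t)
h0 = h * carrier

lhs = np.convolve(w, h)[: t.size] * carrier       # down-convert after the filter
rhs = np.convolve(w * carrier, h0)[: t.size]      # filter the down-converted signal with h0

print(np.max(np.abs(lhs - rhs)))   # ~0 up to floating-point error
```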
Next we consider the noise alone. To specify NE (t), we need the following notion
of independent noises.8
definition 7.16 (Independent white Gaussian noises) NR (t) and NI (t) are
independent white Gaussian noises if the following two conditions are satisfied.
(i) NR (t) and NI (t) are white Gaussian noises in the sense of Definition 3.4.
(ii) For any two real-valued functions h1 (t) and h2 (t) (possibly the same), the Gaussian random variables ∫ NR (t)h1 (t)dt and ∫ NI (t)h2 (t)dt are independent.
The noise at the output of the down-converter has the form
ÑE (t) = ÑR (t) + jÑI (t) (7.23)
8
The notion of independence is well-defined for stochastic processes, but we do not model
the noise as a stochastic process (see Definition 3.4).
with
ÑR (t) = N (t) √2 cos(2πfc t)   (7.24)
ÑI (t) = −N (t) √2 sin(2πfc t).   (7.25)
ÑR (t) and ÑI (t) are not independent white Gaussian noises in the sense of
Definition 7.16 (as can be verified by setting fc = 0), but we now show that they
do fulfill the conditions of Definition 7.16 when the functions used in the definition
are bandlimited to [−B, B] and B < fc .
Let hi (t), i = 1, 2, be real-valued L2 functions that are bandlimited to [−B, B] and define
Zi = ∫ ÑR (t)hi (t)dt.
Zi , i = 1, 2, is Gaussian, zero-mean, and of variance (N0 /2) ‖√2 cos(2πfc t)hi (t)‖². The function √2 cos(2πfc t)hi (t) is passband with baseband-equivalent hi (t). By Definition 3.4 and Theorem 7.5,
cov(Z1 , Z2 ) = (N0 /2) ⟨√2 cos(2πfc t)h1 (t), √2 cos(2πfc t)h2 (t)⟩
= (N0 /2) ℜ{⟨h1 (t), h2 (t)⟩}
= (N0 /2) ⟨h1 (t), h2 (t)⟩.
This proves that under the stated bandwidth limitation, ÑR (t) behaves as white Gaussian noise of power spectral density N0 /2. The proof that the same is true for ÑI (t) follows similar patterns, using the fact that −√2 sin(2πfc t)hi (t) is passband with baseband-equivalent jhi (t). It remains to be shown that ÑR (t) and ÑI (t) are independent noises in the sense of Definition 7.16. Let
Z3 = ∫ ÑI (t)h3 (t)dt.
For the purpose of computing the n-tuple former output, we can thus model the noise at the down-converter output as NE (t) = NR (t) + jNI (t), where NR (t) and NI (t) are independent white Gaussian noises of spectral density N0 /2.
This last characterization of NE (t) suffices to describe the statistic of U ∈ Ck ,
even when the gl (t) are complex-valued, provided they are bandlimited as specified.
For the statistical description of a complex random vector, the reader is referred
to Appendix 7.9 where, among other things, we introduce and discuss circularly
symmetric Gaussian random vectors (which are complex-valued) and prove that
the U at the output of Figure 7.13b is always circularly symmetric (even when the
gl (t) are not bandlimited to [−B, B]).
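A Monte Carlo sketch of the preceding claims (illustrative parameters only, not from the text): projecting ÑR (t) and ÑI (t) onto unit-norm functions bandlimited to B ≪ fc yields, empirically, uncorrelated Gaussians of variance close to N0 /2.

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete-time Monte Carlo check: projections of N_R~(t) = sqrt(2) N(t) cos(2 pi fc t)
# and N_I~(t) = -sqrt(2) N(t) sin(2 pi fc t) onto unit-norm functions bandlimited to
# B << fc behave like uncorrelated Gaussians of variance N0/2.  Parameters are arbitrary.
fs, fc, B, N0, Tobs = 8000.0, 1000.0, 50.0, 2.0, 1.0
t = np.arange(0, Tobs, 1/fs)

def unit_norm(g):
    return g / np.sqrt(np.sum(g**2) / fs)

h1 = unit_norm(np.sinc(2*B*(t - 0.4)))     # approximately bandlimited to [-B, B]
h2 = unit_norm(np.sinc(2*B*(t - 0.6)))

trials, Z_R, Z_I = 2000, [], []
for _ in range(trials):
    N = rng.normal(0, np.sqrt(N0/2 * fs), t.size)     # white noise of PSD N0/2
    NR = np.sqrt(2) * N * np.cos(2*np.pi*fc*t)
    NI = -np.sqrt(2) * N * np.sin(2*np.pi*fc*t)
    Z_R.append(np.sum(NR * h1) / fs)
    Z_I.append(np.sum(NI * h2) / fs)

Z_R, Z_I = np.array(Z_R), np.array(Z_I)
print("var Z_R:", Z_R.var(), " (N0/2 =", N0/2, ")")
print("var Z_I:", Z_I.var(), " (N0/2 =", N0/2, ")")
print("cov(Z_R, Z_I):", np.mean(Z_R * Z_I), " (~0)")
```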
7.5 Parameter estimation
The baseband-equivalent signal at the receiver is modeled as
RE (t) = a e^{jϕ} sE (t − θ) + NE (t),
where NE (t) is complex white Gaussian noise of power spectral density N0 (N0 /2 in both real and imaginary parts), a > 0 and ϕ account for an unknown attenuation and phase rotation, and θ ∈ [0, θmax ] accounts for the channel delay and the time offset. For this section, the function sE (t) represents a training signal known to the receiver, used to estimate θ, a, ϕ. Once estimated, these channel parameters are used as the true values for the communication that follows. Next we derive the joint ML estimates of θ, a, ϕ. The good news is that the solution to this joint estimation problem essentially decomposes into three separate ML estimation problems.
The derivation that follows is a straightforward generalization of what we have
done in Section 5.7, with the main difference being that signals are now complex-
valued. Accordingly, let Y = (Y1 , . . . , Yn )T be the random vector obtained by
projecting RE (t) onto the elements of an orthonormal basis9 for an inner product
space that contains sE (t − θ̂) for all possible values of θ̂ ∈ [0, θmax ]. The likelihood
function with parameters θ̂, â, ϕ̂ is
f (y; θ̂, â, ϕ̂) = (1/(πN0 )ⁿ) exp( − ‖y − â e^{jϕ̂} m(θ̂)‖² / N0 ),
where m(θ̂) is the n-tuple of coefficients of sE (t − θ̂) with respect to the chosen
orthonormal basis.
A joint maximum likelihood estimation of θ, a, ϕ is a choice of θ̂, â, ϕ̂ that
maximizes the likelihood function or, equivalently, that maximizes any of the
following three expressions
−‖y − â e^{jϕ̂} m(θ̂)‖²,
−( ‖y‖² + ‖â e^{jϕ̂} m(θ̂)‖² − 2ℜ{⟨y, â e^{jϕ̂} m(θ̂)⟩} ),
ℜ{⟨y, â e^{jϕ̂} m(θ̂)⟩} − (|â e^{jϕ̂}|²/2) ‖m(θ̂)‖².   (7.27)
Notice that ‖m(θ̂)‖² = ‖sE (t − θ̂)‖² = ‖sE (t)‖². Hence, for a fixed â, the second term in (7.27) is independent of θ̂, ϕ̂. Thus, for any fixed â, we can maximize over θ̂, ϕ̂ by maximizing any of the following three expressions
ℜ{⟨y, â e^{jϕ̂} m(θ̂)⟩},
ℜ{e^{−jϕ̂} ⟨y, m(θ̂)⟩},
ℜ{e^{−jϕ̂} ⟨rE (t), sE (t − θ̂)⟩},   (7.28)
where the last line is justified by the argument preceding (5.18). The maximum of ℜ{e^{−jϕ̂} ⟨rE (t), sE (t − θ̂)⟩} is achieved when θ̂ is such that the absolute value of ⟨rE (t), sE (t − θ̂)⟩ is maximized and ϕ̂ is such that e^{−jϕ̂} ⟨rE (t), sE (t − θ̂)⟩ is real-valued and positive. The latter happens when ϕ̂ equals the phase of ⟨rE (t), sE (t − θ̂)⟩. Thus
θ̂ML = arg max_θ̂ |⟨rE (t), sE (t − θ̂)⟩|,   (7.29)
ϕ̂ML = ∠⟨rE (t), sE (t − θ̂ML )⟩.   (7.30)
Finally, for θ̂ = θ̂ML and ϕ̂ = ϕ̂ML , the maximum of (7.27) with respect to â is achieved by
âML = arg max_â ℜ{⟨y, â e^{jϕ̂ML} m(θ̂ML )⟩} − (|â e^{jϕ̂ML}|²/2) ‖m(θ̂ML )‖²
9
As in Section 5.7.1, for notational simplicity we assume that the orthonormal basis has
finite dimension n. The final result does not depend on the choice of the orthonormal
basis.
= arg max_â â ℜ{e^{−jϕ̂ML} ⟨rE (t), sE (t − θ̂ML )⟩} − â² E/2
= arg max_â â |⟨rE (t), sE (t − θ̂ML )⟩| − â² E/2,
where E is the energy of sE (t), and in the last line we use the fact that e^{−jϕ̂ML} ⟨rE (t), sE (t − θ̂ML )⟩ is real-valued and positive (by the choice of ϕ̂ML ). Taking the derivative of â |⟨rE (t), sE (t − θ̂ML )⟩| − â² E/2 with respect to â and equating to zero yields
âML = |⟨rE (t), sE (t − θ̂ML )⟩| / E.
In the typical case that the training signal is constructed via symbol-by-symbol on a pulse train, sE (t) has the form
sE (t) = Σ_{l=0}^{K−1} cl ψ(t − lT ),   (7.31)
where the symbols c0 , . . . , cK−1 and the pulse ψ(t) can be real- or complex-valued
and ψ(t) has unit norm and is orthogonal to its T -spaced translates.
The essence of what follows applies whether the n-tuple former incorporates
a correlator or a matched filter. For the sake of exposition, we assume that it
incorporates the matched filter of impulse response ψ ∗ (−t).
Once we have determined θ̂ML according to (7.29), we sample the matched filter output at times t = θ̂ML + kT , k integer. The kth sample is
yk = ⟨rE (t), ψ(t − kT − θ̂ML )⟩.   (7.32)
With α = a e^{jϕ}, the sample can be modeled as
yk = αck + Zk .   (7.33)
If N0 is not too large compared to the signal’s power, θ̂M L should be sufficiently
close to θ for (7.33) to be a valid model.
Next we re-derive the ML estimates of ϕ and a in terms of the matched filter
output samples. We do so because it is easier to implement the estimator in a
DSP that operates on the matched filter output samples rather than by analog
technology operating on the continuous-time signals. Using (7.31) and the linearity
of the inner product, we obtain ⟨rE (t), sE (t − θ̂ML )⟩ = Σ_{l=0}^{K−1} c*l ⟨rE (t), ψ(t − lT − θ̂ML )⟩, and using (7.32) we obtain
⟨rE (t), sE (t − θ̂ML )⟩ = Σ_{l=0}^{K−1} yl c*l .
Hence
ϕ̂ML = ∠ Σ_{l=0}^{K−1} yl c*l .
It is instructive to interpret ϕ̂ML without noise. In the absence of noise, θ̂ML = θ and yk = αck . Hence Σ_{l=0}^{K−1} yl c*l = a e^{jϕ} Σ_{l=0}^{K−1} cl c*l = a e^{jϕ} E, where E is the energy of the training sequence. From (7.30), we see that ϕ̂ML is the angle of a e^{jϕ} E, i.e. ϕ̂ML = ϕ.
Proceeding similarly, we obtain
âML = |Σ_{l=0}^{K−1} yl c*l | / E.
It is immediate to check that if there is no noise and ϕ̂M L = ϕ, then âM L = a.
Notice that both ϕ̂ML and âML depend on the observation y0 , . . . , yK−1 only through the inner product Σ_{l=0}^{K−1} yl c*l .
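A small sketch of the sample-level estimators just derived, assuming the timing estimate is already correct so that the matched filter output obeys (7.33); the training sequence, channel parameters, and noise level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check of phi_ML = angle(sum y_l c_l*) and a_ML = |sum y_l c_l*| / E, assuming
# the timing is already correct so that y_l = a e^{j phi} c_l + Z_l as in (7.33).
K, N0 = 64, 0.1
c = rng.choice([1+1j, 1-1j, -1+1j, -1-1j], size=K) / np.sqrt(2)   # training symbols
E = np.sum(np.abs(c)**2)                                          # training energy

a_true, phi_true = 0.7, 1.1
Z = rng.normal(0, np.sqrt(N0/2), K) + 1j * rng.normal(0, np.sqrt(N0/2), K)
y = a_true * np.exp(1j * phi_true) * c + Z

s = np.sum(y * np.conj(c))       # inner product sum_l y_l c_l*
phi_hat = np.angle(s)
a_hat = np.abs(s) / E

print(phi_hat, "vs", phi_true)
print(a_hat, "vs", a_true)
```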
Depending on various factors and, in particular, on the duration of the trans-
mission, the stability of the oscillators and the possibility that the delay and/or
the attenuation vary over time, a one-time estimate of θ, a, and ϕ might not be
sufficient.
In Section 5.7.2, we have presented the delay locked loop to track θ, assuming
real-valued signals. The technique can be adapted to the situation of this section.
In particular, if the symbol sequence c0 , . . . , cK−1 that forms the training signal is
as in Section 5.7.2, once ϕ has been estimated and accounted for, the imaginary
part of the matched filter output contains only noise and the real part is as in
Section 5.7.2. Thus, once again, θ can be tracked with a delay locked loop.
The most critical parameter is ϕ because it is very sensitive to channel delay
variations and to instabilities of the up/down-converter oscillators.
example 7.17 A communication system operates at a symbol rate of 10 Msps
(mega symbols per second) with a carrier frequency fc = 1 GHz. The local oscillator
that produces ej2πfc t is based on a crystal oscillator and a phase locked loop (PLL).
The frequency of the crystal oscillator can only be guaranteed up to a certain
precision and it is affected by the temperature. Typical precisions are in the range
where the approximation holds for small values of |Δϕ|. Assuming that |Δϕ| is indeed small, the idea is to decode yk ignoring the rotation by Δϕ. With high probability the decoded symbol ĉk equals ck and ℑ{yk ĉ*k } ≈ Δϕ|ck |². The feedback signal ℑ{yk ĉ*k } can be used by the local oscillator to correct the phase error. Alternatively, the decoder can use the feedback signal to find an estimate Δϕ̂ of Δϕ and to rotate yk by −Δϕ̂. This method works well also in the presence of noise, assuming that the noise is zero-mean and independent from sample to sample. Averaging over subsequent samples helps to mitigate the effect of the noise.
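The decision-directed idea can be sketched in a few lines (illustrative constellation, drift rate, noise level, and loop gain; this is not a prescribed design): decode while ignoring the residual rotation, use ℑ{yk ĉ*k } as the error signal, and correct the phase with a small first-order gain.

```python
import numpy as np

rng = np.random.default_rng(4)

# Decision-directed tracking of a slowly drifting phase (illustrative sketch).
alphabet = np.array([1+1j, 1-1j, -1+1j, -1-1j]) / np.sqrt(2)
n, N0, gain = 400, 0.02, 0.1

phase = 0.0                 # receiver's running phase estimate
true_phase = 0.0
errors = 0
for k in range(n):
    true_phase += 0.01      # slow phase drift per symbol (illustrative)
    c = alphabet[rng.integers(4)]
    z = rng.normal(0, np.sqrt(N0/2)) + 1j * rng.normal(0, np.sqrt(N0/2))
    y = c * np.exp(1j * true_phase) + z

    y_corr = y * np.exp(-1j * phase)                        # de-rotate with current estimate
    c_hat = alphabet[np.argmin(np.abs(alphabet - y_corr))]  # decode ignoring residual rotation
    errors += c_hat != c
    phase += gain * np.imag(y_corr * np.conj(c_hat))        # Im{y chat*} ~ residual phase error

print("symbol errors:", int(errors), " final phase error:", true_phase - phase)
```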
Another possibility of tracking ϕ is to use a phase locked loop – a technique
similar to the delay locked loop discussed in Section 5.7.2 to track θ.
Differential encoding is a different technique to deal with a constant or slowly
changing phase. It consists in encoding the information in the phase difference
between consecutive symbols.
When the phase ϕk is either constant or varies slowly, as assumed in this section,
we say that the phase comes through coherently. In the next section, we will see
what we can do when this is not the case.
10
There is much literature on spread spectrum. The interested reader can find introduc-
tory articles on the Web.
The steps to maximize the likelihood function mimic what we have done in the
previous section. Let ci be the codeword associated with wE,i (t) (with respect
to some orthonormal basis). Let y be the n-tuple former output. The likelihood
function is
f (y; ı̂, â, ϕ̂) = (1/(πN0 )ⁿ) exp( − ‖y − â e^{jϕ̂} cı̂ ‖² / N0 ).
We seek the ı̂ that maximizes
g(cı̂ ) = max_{â,ϕ̂} ℜ{⟨y, â e^{jϕ̂} cı̂ ⟩} − (1/2) ‖â e^{jϕ̂} cı̂ ‖²
= max_{â,ϕ̂} â ℜ{e^{−jϕ̂} ⟨y, cı̂ ⟩} − (â²/2) ‖cı̂ ‖².
The ϕ̂ that achieves the maximum is the one that makes e^{−jϕ̂} ⟨y, cı̂ ⟩ real-valued and positive. Let ϕ̂ML be the maximizing ϕ̂ and observe that ℜ{e^{−jϕ̂ML} ⟨y, cı̂ ⟩} = |⟨y, cı̂ ⟩|. Hence,
g(cı̂ ) = max_â â |⟨y, cı̂ ⟩| − (â²/2) ‖cı̂ ‖².   (7.34)
By taking the derivative of â |⟨y, cı̂ ⟩| − (â²/2) ‖cı̂ ‖² with respect to â and setting to zero, we obtain the maximizing â
âML = |⟨y, cı̂ ⟩| / ‖cı̂ ‖².
Inserting into (7.34) yields
g(cı̂ ) = (1/2) |⟨y, cı̂ ⟩|² / ‖cı̂ ‖².
Hence
ı̂ML = arg max_ı̂ |⟨y, cı̂ ⟩| / ‖cı̂ ‖.   (7.35)
If the channel only scales the signal by an unknown positive factor (no phase rotation), a similar derivation leads to the decoder that chooses the ı̂ that maximizes ℜ{⟨y, cı̂ ⟩}/‖cı̂ ‖. Next, assume that the channel can also rotate the signal by an arbitrary phase ϕ (i.e. the channel multiplies the signal by e^{jϕ}). As we increase the phase by π/2, the imaginary part of the new inner product becomes the real part of the old (with a possible sign change). One way to make the decoder insensitive to the phase is to substitute ℜ{⟨y, cı̂ ⟩} with |⟨y, cı̂ ⟩|. The result is the decoding rule (7.35).
example 7.19 (A bad choice of signals) Consider m-ary phase-shift keying, i.e. wE,i (t) = ci ψ(t), where ci = √Es e^{j2πi/m}, i = 0, . . . , m − 1, and ψ(t) is a unit-norm pulse. If we plug into (7.35), we obtain
ı̂ML = arg max_ı̂ |⟨y, √Es e^{j2πı̂/m}⟩| / √Es
= arg max_ı̂ |e^{−j2πı̂/m} ⟨y, 1⟩|
= arg max_ı̂ |⟨y, 1⟩|
= arg max_ı̂ |y| ,
which means that the decoder has no preference among the ı̂ ∈ H, i.e. the error
probability is the same independently of the decoder’s choice. In fact, a PSK
constellation is a bad choice for a codebook, because it conveys information in
the phase and the phase information is destroyed by the channel.
example 7.20 (A good choice) Two vectors in Cn that are orthogonal to each other cannot be made equal by multiplying one of the two by a scalar a e^{jϕ}, which was the underlying issue in Example 7.19. Complex-valued orthogonal signals remain orthogonal after we multiply them by a e^{jϕ}. This suggests that they are a good choice for the channel model assumed in this section. Specifically, suppose that the ith codeword ci ∈ Cm , i = 1, . . . , m, has √Es at position i and is zero elsewhere. In this case, |⟨y, ci ⟩| = √Es |yi | and
ı̂ML = arg max_i |⟨y, ci ⟩| / √Es
= arg max_i |yi |.
(Compare this rule to the decision rule of Example 4.6, where the signaling scheme
is the same but there is neither amplitude nor phase uncertainty.)
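The rule (7.35) with the orthogonal codebook of Example 7.20 is easy to simulate. In the sketch below (illustrative values of m, Es , N0 , and of the gain/phase ranges, not taken from the text), the channel applies an unknown a e^{jϕ} and the decoder simply picks arg max_i |yi |.

```python
import numpy as np

rng = np.random.default_rng(5)

# Decision rule (7.35) with the orthogonal codebook of Example 7.20:
# c_i has sqrt(Es) in position i, so |<y, c_i>|/||c_i|| reduces to |y_i|.
m, Es, N0, trials = 8, 4.0, 0.5, 5000
codebook = np.sqrt(Es) * np.eye(m, dtype=complex)

errors = 0
for _ in range(trials):
    i = rng.integers(m)
    a, phi = rng.uniform(0.5, 1.5), rng.uniform(0, 2*np.pi)   # unknown gain and phase
    z = rng.normal(0, np.sqrt(N0/2), m) + 1j * rng.normal(0, np.sqrt(N0/2), m)
    y = a * np.exp(1j*phi) * codebook[i] + z
    i_hat = int(np.argmax(np.abs(y)))
    errors += i_hat != i

print("error rate with orthogonal codewords:", errors / trials)
```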
7.7 Summary
The fact that each passband signal (real-valued by definition) has an equivalent
baseband signal (complex-valued in general) makes it possible to separate the
communication system into two parts: a part (top two layers) that processes base-
band signals and a part (bottom layer) that implements the conversion to/from
passband. With the bottom layer in place, the top two layers are designed to
communicate over a complex-valued baseband AWGN channel. This separation
has several advantages: (i) it simplifies the design and the analysis of the top two
layers, where most of the system complexity lies; (ii) it reduces the implementation
costs; and (iii) it provides greater flexibility by making it possible to choose
the carrier frequency, simply by changing the frequency of the oscillator in the
up/down-converter. For instance, for frequency hopping (Example 7.18), as long
as the down-converter is synchronized with the up-converter, the top two layers
are unaware that the carrier frequency is hopping. Without the third layer in
place, we change the carrier frequency by changing the pulse, and the options that
we have in choosing the carrier frequency are limited. (If the Nyquist criterion is fulfilled for |ψF (f + f1 )|² + |ψF (f − f1 )|², it is not necessarily fulfilled for |ψF (f + f2 )|² + |ψF (f − f2 )|², f1 ≠ f2 .)
Theorem 7.13 tells us how to transform an orthonormal basis of size n for the
baseband-equivalent signals into an orthonormal basis of size 2n for the corres-
ponding passband signals. The factor 2 is due to the fact that the former is
used with complex-valued coefficients, whereas the latter is used with real-valued
coefficients.
For mathematical convenience, we assume that neither the up-converter nor the
down-converter modifies the signal’s norm. This is not what happens in reality,
but the system-level designer (as opposed to the hardware designer) can make this
assumption because all the scaling factors can be accounted for by a single factor
in the channel model. Even this factor can be removed (i.e. it can be made to
be 1) without affecting the system-level design, provided that the power spectral
density of the noise is adjusted accordingly so as to keep the signal-energy to
noise-power-density ratio unchanged.
In practice, the up-converter, as well as the down-converter, amplifies the signals,
and the down-converter contains a noise-reduction filter that removes the out-of-
band noise (see Section 3.6). The transmitter back end (the physical embodi-
ment of the up-converter) deals with high power, high frequencies, and a variable
carrier frequency fc . The skills needed to design it are quite specific. It is very
convenient that the transmitter back end can essentially be designed and built
separately from the rest of the system and can be purchased as an off-the-shelf
device.
With the back end in place, the earlier stages of the transmitter, which perform
the more sophisticated signal processing, can be implemented under the most
favorable conditions, namely in baseband and using voltages and currents that are
in the range of standard electronics, rather than being tied to the power of the
transmitted signal. The advantage of working in baseband is two-fold: the carrier
frequency is fixed and working with low frequencies is less tricky.
11
Realistically, a specific back/front end implementation has certain characteristics that
limit its usage to certain applications. In particular, for the back end we consider its
gain, output power, and bandwidth. For the front end, we consider its bandwidth,
sensitivity, gain, and noise temperature.
To remember the form of Â, observe that the top half of û is the real part of
Av, i.e. AR vR − AI vI . This explains the top half of Â. Similarly, the bottom half
of û is the imaginary part of Av, i.e. AR vI + AI vR , which explains the bottom half
of Â. The following lemma summarizes a number of useful properties.
lemma 7.21 The following properties hold (we write (·)^ for the hat operator applied to the expression within the parentheses)
(Au)^ = Â û   (7.39a)
(u + v)^ = û + v̂   (7.39b)
ℜ{u† v} = û^T v̂   (7.39c)
‖u‖² = ‖û‖²   (7.39d)
(AB)^ = Â B̂   (7.39e)
(A + B)^ = Â + B̂   (7.39f)
(A†)^ = (Â)^T   (7.39g)
(In )^ = I2n   (7.39h)
(A⁻¹)^ = (Â)⁻¹   (7.39i)
det(Â) = |det(A)|² = det(AA†)   (7.39j)
u† Qu = ℜ{u† (Qu)} = û^T (Qu)^ = û^T Q̂ û,
where in the last two equalities we use (7.39c) and (7.39a), respectively.
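The hat operator and the properties in Lemma 7.21 can be spot-checked numerically. The following sketch (random test matrices, purely illustrative) implements the hat of a vector and of a matrix as described above and verifies a few of the identities (7.39).

```python
import numpy as np

rng = np.random.default_rng(6)

def hat_vec(u):
    """Stack Re{u} on top of Im{u}: C^n -> R^{2n}."""
    return np.concatenate([u.real, u.imag])

def hat_mat(A):
    """The real 2n x 2n matrix with blocks [[Re A, -Im A], [Im A, Re A]]."""
    return np.block([[A.real, -A.imag], [A.imag, A.real]])

n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
u = rng.standard_normal(n) + 1j * rng.standard_normal(n)
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)

print(np.allclose(hat_vec(A @ u), hat_mat(A) @ hat_vec(u)))               # (7.39a)
print(np.allclose((np.conj(u) @ v).real, hat_vec(u) @ hat_vec(v)))        # (7.39c)
print(np.allclose(np.linalg.norm(u)**2, np.linalg.norm(hat_vec(u))**2))   # (7.39d)
print(np.allclose(hat_mat(A @ B), hat_mat(A) @ hat_mat(B)))               # (7.39e)
print(np.allclose(np.linalg.det(hat_mat(A)), abs(np.linalg.det(A))**2))   # (7.39j)
```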
A complex random variable is of the form
Z = X + jY,
where X and Y are real-valued random variables. Associating with the pair (x, y) the complex number z = x + jy, we call
fℜ{Z},ℑ{Z} (x, y) = ∂²/(∂x∂y) Fℜ{Z},ℑ{Z} (x, y)
the joint density of (ℜ{Z}, ℑ{Z}). Similarly, for a complex random vector Z = (Z1 , . . . , Zn )^T we will call the function
FZ (z) = Pr(ℜ{Z1 } ≤ ℜ{z1 }, . . . , ℜ{Zn } ≤ ℜ{zn }, ℑ{Z1 } ≤ ℑ{z1 }, . . . , ℑ{Zn } ≤ ℑ{zn })
the cumulative distribution function of Z and
fZ (x1 + jy1 , . . . , xn + jyn ) = ∂^{2n}/(∂x1 · · · ∂xn ∂y1 · · · ∂yn ) FZ (x1 + jy1 , . . . , xn + jyn )
its probability density function.
fV (v) = (1/πⁿ) e^{−v† v}.   (7.52)
Notice that although fV (v) is derived via v̂, it can be expressed in compact form as
a function of v. Notice also that fV (v) only depends on ‖v‖. Hence e^{jθ} V , which is
V with each component rotated by the angle θ, has the same pdf as V . Gaussian
random vectors that have this property are of particular interest to us for two
reasons: (i) all the noise vectors of interest to us are of this kind, and (ii) the pdf
of a Gaussian random vector that has this property takes on a simplified form. For
these two reasons, it is worthwhile investing in the study of such random vectors,
called circularly symmetric.
The above two definitions are related. To see how, let Z be circularly symmetric.
Then it is zero-mean and, for every θ ∈ [0, 2π), E[ZZ^T] = E[(e^{jθ} Z)(e^{jθ} Z)^T] = e^{j2θ} E[ZZ^T], which implies E[ZZ^T] = 0; since the pseudo-covariance matrix of e^{jθ} Z is e^{j2θ} times that of Z, we see that the pseudo-covariance matrix of e^{jθ} Z vanishes as well. Finally, the
covariance matrix of e^{jθ} Z is
E[(e^{jθ} Z)(e^{jθ} Z)†] = E[ZZ †] = KZ ,
proving that the covariance matrices are identical. We summarize this result in a
lemma.
12
A real-valued matrix A is skew-symmetric if AT = −A.
Proof From
E[Z̃] = AE[Z] + b,
it follows that
Z̃ − E[Z̃] = A(Z − E[Z]).
Hence we have
JZ̃ = E[(Z̃ − E[Z̃])(Z̃ − E[Z̃])T ]
= E{A(Z − E[Z])(Z − E[Z])T AT }
= AJZ AT = 0.
where Λ^α is the diagonal matrix obtained by raising to the power α the diagonal elements of Λ. Clearly V is a zero-mean Gaussian random vector and Z = U Λ^{1/2} V has the form (7.53) for the nonsingular matrix A = U Λ^{1/2} ∈ Cn×n . The covariance matrix
of V is
KV = E[V V †]
= E[Λ^{−1/2} U † ZZ † U Λ^{−1/2}]
= Λ^{−1/2} U † E[ZZ †] U Λ^{−1/2}
= Λ^{−1/2} U † KZ U Λ^{−1/2}
= Λ^{−1/2} U † U ΛU † U Λ^{−1/2}
= Λ^{−1/2} Λ Λ^{−1/2}
= In .
Finally, V is proper (by Lemma 7.32) and circularly symmetric (by Lemma 7.28).
This completes the proof for the case that KZ is nonsingular. If KZ is singular,
then some of its components are linearly dependent on other components. In this
case, we can write Z = B Z̃ for some B ∈ Cn×m , where Z̃ ∈ Cm consists of linearly
independent components of Z. The covariance matrix of Z̃ is nonsingular. Hence we
can find a nonsingular matrix à ∈ Cm×m such that Z̃ = ÃV with V ∼ NC (0, Im ).
Finally, Z = B Z̃ = B ÃV = AV has the desired form with A = B Ã ∈ Cn×m .
We are now in the position to derive a general expression for a circularly
symmetric Gaussian random vector Z of nonsingular covariance matrix.
theorem 7.34 The probability density function of a circularly symmetric Gaus-
sian random vector Z ∈ Cn of nonsingular covariance matrix KZ can be written as
fZ (z) = (1/(πⁿ det(KZ ))) e^{−z† KZ⁻¹ z}.   (7.54)
Furthermore,
(A⁻¹ z)† (A⁻¹ z) = z † (A⁻¹)† A⁻¹ z
= z † (AA†)⁻¹ z
= z † KZ⁻¹ z,   (7.57)
where in the second equality we use the fact that, for nonsingular n × n matrices, (AB)⁻¹ = B⁻¹A⁻¹ and (A†)⁻¹ = (A⁻¹)†. Inserting (7.56) and (7.57) into (7.55) yields (7.54).
The above theorem justifies one of the two claims we have made at the beginning
of this appendix, specifically that the pdf of a circularly symmetric Gaussian
random vector takes on a simplified form. (Compare (7.54) and (7.41) when m̂ = 0,
keeping in mind (7.42) to compute KẐ from KX , KY , and KXY .) The next
theorem justifies the other claim: that the complex-valued noise vectors of interest
to us, those at the output of Figure 7.13b, are Gaussian and circularly symmetric.
theorem 7.35 Let NE (t) = NR (t) + jNI (t), where NR (t) and NI (t) are independent white Gaussian noises of spectral density N0 /2. For any collection of L2 functions gl (t), l = 1, . . . , k that belong to a finite-dimensional inner product space V, the complex-valued random vector Z = (Z1 , . . . , Zk )^T, Zl = ⟨NE (t), gl (t)⟩, is circularly symmetric and Gaussian.
Proof Let ψ1 (t), . . . , ψn (t) be an orthonormal basis for V. First consider the random vector V = (V1 , . . . , Vn )^T, where Vi = ⟨NE (t), ψi (t)⟩. It is straightforward to check that V ∼ NC (0, In ). Every gl (t) can be written as gl (t) = Σ_{j=1}^{n} cl,j ψj (t), where cl,j ∈ C. By the linearity of the inner product, Z = AV , where A ∈ Ck×n is the matrix that has (c*l,1 , . . . , c*l,n ) in its lth row. By Lemma 7.33, Z is circularly symmetric with KZ = AA†.
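To illustrate (7.54) and the construction Z = AV numerically, the sketch below (with an arbitrary matrix A, not from the text) samples Z = AV with V ∼ NC (0, In ), compares the empirical covariance with AA†, and checks that the density (7.54) is unchanged when its argument is rotated by e^{jθ}.

```python
import numpy as np

rng = np.random.default_rng(7)

# Sample a circularly symmetric Gaussian Z = A V with V ~ N_C(0, I_n); check that
# the empirical covariance matches K_Z = A A^dagger and that (7.54) is rotation-invariant.
n, samples = 2, 200_000
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
KZ = A @ A.conj().T

V = (rng.standard_normal((samples, n)) + 1j * rng.standard_normal((samples, n))) / np.sqrt(2)
Z = V @ A.T                                   # each row is one realization of A V

print(np.round(Z.T @ Z.conj() / samples, 2))  # empirical covariance ~ K_Z
print(np.round(KZ, 2))

def pdf(z):
    """Density (7.54) of a circularly symmetric Gaussian with covariance K_Z."""
    q = (z.conj() @ np.linalg.inv(KZ) @ z).real
    return np.exp(-q) / (np.pi**n * np.linalg.det(KZ).real)

z0 = np.array([0.3 + 0.1j, -0.2 + 0.4j])
print(pdf(z0), pdf(np.exp(1.0j) * z0))        # equal: the pdf is rotation-invariant
```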
7.10 Exercises
Exercises for Section 7.2
(b) Draw the box diagram of an implementation that uses only real-valued
signals.
exercise 7.3 (Equivalent representations) A real-valued passband signal x(t) can be written as x(t) = √2 ℜ{xE (t) e^{j2πfc t}}, where xE (t) is the baseband-equivalent signal (complex-valued in general) with respect to the carrier frequency fc . Also, a general complex-valued signal xE (t) can be written in terms of two real-valued signals, either as xE (t) = u(t) + jv(t) or as α(t) exp(jβ(t)).
(a) Show that a real-valued passband signal x(t) can always be written as
xEI (t) cos(2πfc t) − xEQ (t) sin(2πfc t)
and relate xEI (t) and xEQ (t) to xE (t). Note: This formula can be used at the
sender to produce x(t) without doing complex-valued operations. The signals
xEI (t) and xEQ (t) are called the in-phase and the quadrature components,
respectively.
(b) Show that a real-valued passband signal x(t) can always be written as
a(t) cos[2πfc t + θ(t)]
and relate xE (t) to a(t) and θ(t). Note: This explains why sometimes people
make the claim that a passband signal is modulated in amplitude and in phase.
(c) Use part (b) to find the baseband-equivalent of the signal
x(t) = A(t) cos(2πfc t + ϕ),
where A(t) is a real-valued lowpass signal. Verify your answer with Example
7.9 where we assumed ϕ = 0.
exercise 7.4 (Passband) Let fc be a positive carrier frequency and consider
an arbitrary real-valued function w(t). You can visualize its Fourier transform as
shown in Figure 7.14.
(a) Argue that there are two different functions, a1 (t) and a2 (t), such that, for i = 1, 2,
w(t) = √2 ℜ{ai (t) exp(j2πfc t)}.
This shows that, without some constraint on the input signal, the operation
performed by the circuit of Figure 7.4b is not reversible, even in the absence
of noise. This was already pointed out in the discussion preceding Lemma 7.8.
(b) Argue that if we limit the input of Figure 7.4b to signals a(t) such that
aF (f ) = 0 for f < −fc , then the circuit of Figure 7.4a will retrieve a(t)
when fed with the output of Figure 7.4b.
(c) Find an example showing that the condition of part (b) is necessary. (Can
you find an example with a real-valued a(t)?)
(d) Argue that if we limit the input of Figure 7.4b to signals a(t) that are real-
valued, then the input of Figure 7.4b can be retrieved from the output. Hint
1: we are not claiming that the circuit of Figure 7.4a will retrieve a(t).
Hint 2: You may argue in the time domain or in the frequency domain.
7.10. Exercises 277
If you argue in the time domain, you can assume that a(t) is continuous.
In the frequency-domain argument, you can assume that a(t) has finite
bandwidth.
Figure 7.14. |wF (f )| versus f , with spectral content around ±fc .
exercise 7.5 (From passband to baseband via real-valued operations) Let the signal xE (t) be bandlimited to [−B, B] and let x(t) = √2 ℜ{xE (t) e^{j2πfc t}}, where 0 < B < fc . Show that the circuit of Figure 7.15, when fed with x(t), recovers the real and imaginary part of xE (t). (The two boxes are ideal lowpass filters of cutoff frequency B.) Note: The circuit uses only real-valued operations.
Figure 7.15. x(t) is multiplied by √2 cos(2πfc t) and by −√2 sin(2πfc t); each product is passed through an ideal lowpass filter 1{−B ≤ f ≤ B}, producing ℜ{xE (t)} and ℑ{xE (t)}, respectively.
exercise 7.6 (Reverse engineering) Figure 7.16 shows a toy passband signal. (Its carrier frequency is unusually low, to make the figure readable.) The horizontal time scale is 1 ms per square and the vertical scale is 1 unit per square. Specify the three layers of a transmitter that generates the given signal, namely the following.
(a) The carrier frequency fc used by the up-converter.
(b) The orthonormal basis used by the waveform former to produce the baseband-
equivalent signal wE (t).
(c) The symbol alphabet, seen as a subset of C.
(d) An encoding map, the encoder input sequence that leads to w(t), the bit rate,
the encoder output sequence, and the symbol rate.
Figure 7.16. The toy passband signal w(t).
exercise 7.7 (AM receiver) Let x(t) = (1 + mb(t)) √2 cos(2πfc t) be an AM modulated signal as described in Example 7.10. We assume that 1 + mb(t) > 0, that b(t) is bandlimited to [−B, B], and that fc > 2B.
(a) Argue that the envelope of |x(t)| is (1 + mb(t)) √2 (a drawing will suffice).
(b) Argue that with a suitable choice of components, the output in Figure 7.17 is
essentially b(t). Hint: Draw, qualitatively, the voltage on top of R1 and that
on top of R2 .
(c) As an alternative approach, prove that if we pass the signal |x(t)| through
an ideal lowpass filter of cutoff frequency f0 , we obtain 1 + mb(t) scaled by
some factor. Specify a suitable interval for f0 . Hint: Expand | cos(2πfc t)|
as a Fourier series. No need to find explicit values for the Fourier series
coefficients.
Figure 7.17. Circuit with input x(t), components C1 , R1 , C2 , R2 , and an output terminal.
exercise 7.8 (Alternative down-converter) Assuming that all the ψl (t) are
bandlimited to [−B, B] and that 0 < B < fc , show that the n-tuple former output
remains unchanged if we substitute the down-converter of Figure 7.8b with the
block diagram of Figure 7.4a.
exercise 7.9 (Real-valued implementation) Draw a block diagram for the imple-
mentation of the transmitter and receiver of Figure 7.8 by means of real-valued
operations. Unlike in Figure 7.9, do not assume that the orthonormal basis is real-
valued.
(a) Suppose X and Y are real-valued iid random variables with probability density
function fX (s) = fY (s) = c exp(−|s|α ), where α is a parameter and c = c(α)
is the normalizing factor.
(i) Draw the contour of the joint density function for α = 0.5, α = 1, α = 2,
and α = 3. Hint: For simplicity, draw the set of points (x, y) for which
fX,Y (x, y) equals the constant c²(α) e⁻¹.
(ii) For which value of α is the joint density function invariant under rota-
tion? What is the corresponding distribution?
(b) In general we can show that if X and Y are iid random variables and
fX,Y (x, y) is circularly symmetric, then X and Y are Gaussian. Use the
following steps to prove this.
(i) Show that if X and Y are iid and fX,Y (x, y) is circularly symmetric then fX (x) fY (y) = ψ(r), where ψ is a univariate function and r = √(x² + y²).
(ii) Take the partial derivative with respect to x and y to show that
fX′ (x)/(x fX (x)) = ψ′(r)/(r ψ(r)) = fY′ (y)/(y fY (y)).
(iii) Argue that the only way for the above equalities to hold is that they be equal to a constant value, i.e. fX′ (x)/(x fX (x)) = ψ′(r)/(r ψ(r)) = fY′ (y)/(y fY (y)) = −1/σ².
(iv) Integrate the above equations and show that X and Y should be Gaus-
sian random variables.
Miscellaneous exercises
The symbols are mapped into a signal via symbol-by-symbol on a pulse train,
where the pulse is real-valued, normalized, and orthogonal to its shifts by multiples
of T . The channel adds white Gaussian noise of power spectral density N0 /2.
receiver implements an ML decoder. For the two systems, determine (if possible)
and compare the following.
(c) The variance σ 2 of the noise seen by the decoder. Note: when the symbols are
real-valued, the decoder disregards the imaginary part of Y . In this case, what
matters is the variance of the real part of the noise.
(d) The symbol-to-noise power ratio Es /σ². Write them also as a function of the
power P and N0 .
(e) The bandwidth.
(f ) The expression for the signals at the output of the waveform former as a
function of the bit sequence produced by the source.
(g) The bit rate R.
Summarize, by comparing the two systems from a user’s point of view.
exercise 7.12 (Smoothness of bandlimited signals) We show that a continuous
signal of small bandwidth cannot vary much over a small interval. (This fact is used
in Exercise 7.13.) Let w(t) be a finite-energy continuous-time passband signal and
let wE (t) be its baseband-equivalent signal. We assume that wE (t) is bandlimited
to [−B, B] for some positive B.
(a) Show that the baseband-equivalent of w(t − τ ) can be modeled as wE (t − τ )ejφ
for some φ.
(b) Let hF (f ) be the frequency response of the ideal lowpass-filter, i.e. hF (f ) = 1
for |f | ≤ B and 0 otherwise. Show that
wE (t + τ ) − wE (t) = ∫ wE (ξ)[h(t + τ − ξ) − h(t − ξ)]dξ.   (7.58)
Figure 7.18. Five points, labeled 5, 4, 3, 2, 1, with spacing d.
plus noise. Choose the vector β = (β1 , β2 , . . . , βL )^T that maximizes the energy ∫ |rE (t)|² dt, subject to the constraint ‖β‖ = 1. Hint: Use the Cauchy–Schwarz inequality.
(c) Assume that f0 − B/2 = B and consider the infinite set of functions {ψ(t − lT )}l∈Z . Do they form an orthonormal set for T = 1/(2B)? (Explain.)
(d) Determine all possible values of f0 − B/2 so that {ψ(t − lT )}l∈Z forms an orthonormal set for T = 1/(2B).
Figure 7.19. pF (f ): height 1 over the bands [−f0 − B/2, −f0 + B/2] and [f0 − B/2, f0 + B/2].
where the latter is the periodic extension of wF (f ). Prove that for all f ∈ R,
wF (f ) = w̃F (f )hF (f ).
Hint: Write wF (f ) = wF⁻(f ) + wF⁺(f ), where wF⁻(f ) = 0 for f ≥ 0 and wF⁺(f ) = 0 for f < 0, and consider the support of wF⁺(f − k/(2T )) and that of wF⁻(f − k/(2T )), k integer.
where
h(t) = (1/T ) sinc(t/(2T )) cos(2πfc t)
is the inverse Fourier transform of hF (f ) and fc = l/(2T ) + 1/(4T ) is the center frequency of the interval [l/(2T ), (l + 1)/(2T )]. Hint: Neglect convergence issues, use the Fourier series to write
w̃F (f ) = l.i.m. Σ_k Ak e^{j2πf T k}
Figure 7.20. hF (f ) versus f [MHz], with marks at ±10 and ±15 MHz.