INFORMATION THEORY
CLAUDE E. SHANNON
The Father of Information Theory
(Photo by Frank Ross, Cape May Courthouse, New Jersey)
INTRODUCTION TO
INFORMATION THEORY
Masud Mansuripur
College of Engineering
Boston University
PRENTICE-HALL, INC.
Englewood Cliffs, New Jersey 07632
ISBN 0-13-484668-0
CONTENTS

Preface

Information Theory and the Modern Scientific Outlook
by Professor L. B. Levitin

Chapter 1  MATHEMATICAL PRELIMINARIES
    1.1  Review of Probability
    1.2  Chebyshev Inequality and the Weak Law of Large Numbers
    1.3  Convex Sets and Functions: Jensen's Inequality
    1.4  Stirling's Formula
    Problems

Chapter 2  ENTROPY, DISCRETE MEMORYLESS SOURCE, AND BLOCK ENCODING

Chapter 3  VARIABLE-LENGTH SOURCE CODING

Chapter 4  UNIVERSAL SOURCE CODING

Chapter 5  MUTUAL INFORMATION, DISCRETE MEMORYLESS CHANNEL, AND CHANNEL CAPACITY

Chapter 6  CHANNEL THEOREMS

Chapter 7  RATE-DISTORTION THEORY

Chapter 8  ERROR-CORRECTING CODES FOR THE BINARY SYMMETRIC CHANNEL

Chapter 9  ADVANCED TOPICS

Appendix

Bibliography

Index
PREFACE
This book has been developed out of a course in information theory that I have taught at Boston University during the past several years. The course is intended for beginning graduate students, as well as advanced undergraduates who desire a conceptual understanding of the foundations of information theory. It is thus designed to cover the major ideas involved in this important branch of modern science and to give the student an overview of the subject without getting bogged down in more advanced and sometimes less intuitive issues. The required background is probability theory at an undergraduate level. Although some basic concepts of probability are reviewed in the beginning, it is essential that the reader be familiar with the techniques of probability theory in general. A certain degree of mathematical maturity and capacity for abstract thinking on the part of the reader is also assumed.
Several excellent textbooks on information theory are available at this time, and it may seem inappropriate to write yet another one on the subject. I found it difficult, however, to find a text that addressed all the major concerns of the theory at the proper level for my intended audience. Some of the books were written for Ph.D.-level students and researchers and thus required a much higher knowledge base than introductory probability theory, while others, although appropriate for the less advanced student, were not comprehensive enough in their coverage. The present book is thus the outcome of my effort to bridge this gap; although it contains some of the more recent developments in information theory, which, to my knowledge, have not yet appeared in any textbook, it remains an introductory exposition of the basic concepts at the core. In organizing the book, I have been guided by the original paper of C. E. Shannon, "A Mathematical Theory of Communication," and by the monumental work of R. Gallager, Information Theory and Reliable Communication. I also benefited from the books Information Theory and Coding, by N. Abramson, Information Theory, by R. Ash, The Theory of Information and Coding, by R. McEliece, Coding and Information Theory, by R. Hamming, and Rate Distortion Theory, by T. Berger. The chapter on universal source coding is based on the original papers of B. Fitingof, and the section on numerical computation of channel capacity is based on the papers of R. Blahut and S. Arimoto.
If I were to teach the course in a school with the quarter system, I would probably teach the
first six chapters in the first quarter and use some additional material with the
last three chapters for the second quarter.
I would like to thank Professor Lev Levitin, who inspired me and spent many of his precious hours discussing the more subtle points of information theory with me. He also honored me by agreeing to write an introductory chapter for this book. I would also like to thank Dean Louis Padulo of the College of Engineering, who encouraged me in writing this book and provided me with the time needed for completing it. Thanks are due to Mr. Tim Bazik, the editor, for his support of the project. Finally, I would like to thank my wife,
Annegret, without whose patience and encouragement this book would not
have become a reality.
Acknowledgments
Problems 1.1, 2.1, 2.2, 2.6, 3.1, 3.4, 3.9, 3.10, 3.11, 3.12, 5.4, 5.5, 5.6, 6.4, 6.9, 7.4, 8.8, 8.9, 8.10, and 9.5 are from R. G. Gallager, INFORMATION THEORY AND RELIABLE COMMUNICATION. New York: Wiley, 1968. Copyright 1968 by John Wiley & Sons, Inc. Reprinted with permission.

Problems 2.4, 2.5, 2.9, 6.8, 8.2, 8.5, and 8.6 are from R. Ash, INFORMATION THEORY. New York: Interscience, 1965. Copyright 1965 by John Wiley & Sons, Inc. Reprinted with permission.

Problems 2.11, 2.13, 3.15 (adapted), and 8.16 are from R. J. McEliece, THE THEORY OF INFORMATION AND CODING: A MATHEMATICAL FRAMEWORK FOR COMMUNICATION. New York: Cambridge, 1977. Copyright 1977 by Cambridge University Press. Reprinted with permission.
One of the greatest revolutions in the scientific world outlook in our century is the turn from Laplacian determinism to a probabilistic picture of nature.
The development of statistical mechanics and (even in a more categorical way)
of quantum theory has brought us to the appreciation of the fact that the world
we live in is essentially probabilistic. A natural extension of this point of view
is an understanding that our knowledge is of a probabilistic nature, too. Any
information we obtain affects the probabilities of possible alternatives , rather
than indicates uniquely one particular outcome (as "genuine" deterministic
knowledge was supposed to do).
Therefore , it seems to be not just a sheer coincidence that information
theory emerged after statistical and quantum mechanics had been developed,
and that it shares with statistical physics the fundamental concept of entropy.
Mathematically, information theory is a branch of the theory of probabilities and stochastic processes. It has won its first victories by answering the most basic questions concerning information transmission over communication channels. In the past, communication engineers believed that the rate of information transmission over a noisy channel had to decline to zero if we require
the error probability to approach zero. Shannon was the first to show that the
information-transmission rate can be kept constant for an arbitrarily small probability of error. Besides its technological importance, this result has a remarkable philosophical meaning. Information theory not only gives a quantitative
measure of information common for both deterministic and probabilistic cases,
but it also shows the qualitative equivalence of these two kinds of knowledge in
the following sense: even if the input and the output of a communication channel are only statistically dependent, it is possible to transmit an amount of data
that is arbitrarily close to the amount of information in the output about the
input, with .a vanishing error probability (i.e., in almost deterministic way) .
The development of information theory is an excellent illustration of the
statement that "nothing is more practical than a good theory." Indeed , at the
time when the basic results of information theory related to communication
channels had first been formulated, communication technology was not at all
Another important area of the application of information theory to computer science is fault-tolerant computing. Although the first attempts in this direction were unsuccessful, more thorough investigations [5,6,7] have shown that it is possible to achieve arbitrarily small probability of error in data storage and computation at the expense of limited redundancy, exactly as in the case of communications.
Essential interconnections between information theory and statistics have been found [8], and new methods of statistical analysis based on information theory have been suggested [9].
A number of interesting attempts have been made to apply information
theory in political economy and economics . For instance , the theory of optimal
investment appears to be exactly parallel to the theory of optimal source coding [10].
Applications of information theory to linguistics seem to be highly relevant (e.g., [11-15]). Indeed, a natural language gives us a remarkable example of a system used for generating long sequences of symbols (i.e., texts) that can be considered as realizations of a random process. But, in contrast with other random processes that exist in nature, this random process was developed, modified, and selected during a long period of evolution and "natural selection," being specially intended for meaningful communication between human beings. Information-theoretical studies of texts have revealed a number of significant linguistic features. For instance, they provide objective criteria for the characterization of different styles and different schools in poetry and prose, and even
for identification of individual authors [16]. An illustration of the importance of
the information-theoretical characteristics of a language is given by the fact
(noted first by Shannon) that large crossword puzzles are only possible if the
redundancy of a language does not exceed 50 percent (on the vocabulary level) .
If the entropy of a language were half its actual value, poetry in its usual form (with rhymes and meters) would be impossible.
The application of information theory to experimental psychology has made it possible to discover some remarkable facts related to sensory organs and neural systems [17]. It was found, for example, that the reaction time of a subject
is a linear function of the amount of information contained in the stimulus
[4,18,19] . Moreover, our sensory organs can be characterized by a certain
information capacity, as engineering communication lines are .
Perhaps the most important and meaningful are the interconnections between information theory and statistical physics. Long before information theory was founded, L. Boltzmann and later L. Szilard [20] attributed an informational meaning to the thermodynamic notion of entropy. On the other hand, D. Gabor [21] pointed out that "the communication theory should be considered as a branch of physics." In the classical work of L. Brillouin [22], a profound relationship between physical entropy and information was first formulated in a general form. Later the "entropy defect principle" was established in the quasi-
classical [23,24] and the quantum [25,26] form. According to this principle, any information is represented by a certain ensemble of states of a physical system and is associated with its deviation from thermodynamic equilibrium. Thus the basic concepts of information theory can be defined on the basis of statistical physics, and a way is open to develop a consistent physical information theory. The subject of this theory is investigation of the physical nature of information transmission, storage, retrieval, and processing (the so-called physics of communication and physics of computation) (e.g., [27-29]). On the other hand, the information-theoretical approach has been applied to statistical physics [30-34]. Information theory has shed new light on the classical problems of Maxwell's demon and Gibbs' paradox [22,34], and on the foundations of statistical physics in general. There is hope that further development in this direction will eventually bridge the gap between the physical and cybernetical descriptions of a complex system and will lead to the formulation of a physical theory of highly organized systems, both artificial and natural (biological). Progress along this way is slow and difficult [35], but the goal is worth all the efforts.
Here we have to touch on another philosophical problem. Since the time of Newton (or even Democritus), scientists have believed that the most fundamental laws, the most hidden secrets of nature, are those of elementary particles and the elementary forces acting between them. Indeed, if everything that happens in the universe is no more than a combination of these elementary acts, then isn't the knowledge of the laws that govern the interactions between elementary particles sufficient to describe any phenomenon in the world, to predict theoretically the outcomes of any experiment? Today we have achieved incredible progress in discovering and describing the nature of elementary particles and interactions, and have learned that this knowledge is incredibly insufficient for the ambitious purpose of "explaining everything," of building an accomplished scientific picture of the world.
It seems that we know, indeed, how to derive the behavior of a system,
even as large as stars and galaxies, from the "first principles," if we deal with a
low-organized, chaotic system. But we find our knowledge almost irrelevant
when we have to face a system at a higher level of organization (whatever that means, I should add, since we still lack even a good definition of "level of organization"). For instance, we have a wonderful theory of electromagnetic interactions that predicts the experimental results with an accuracy of 15 decimal digits, and we know that the processes in a human body are mostly electromagnetic, but this perfect theory tells us very little, if anything, about how our bodies function.
There has occurred another revolutionary shift in the minds of scientists: to become aware that the greatest secrets of nature, the hardest to discover, and the most important for us, are the laws of organization, the understanding of how a certain complex "combination" of particles and processes can emerge and persist as an organized system among the chaos of myriads of elementary
References
1. C. E. Shannon, "The Bandwagon," Trans. IRE, IT-2, No. 1, 1956.
2. C. E. Shannon, "Communication Theory of Secrecy Systems," BSTJ, 28, No. 4, 1949.
3. C. R. P. Hartmann, and others, "Application of Information Theory to the Construction of Efficient Decision Trees," IEEE Trans., IT-28, No. 4, 1982.
4. A. M. Yaglom and I. M. Yaglom, Probability and Information. Hingham, Mass.: D. Reidel Publishing Co., 1983.
5. S. Winograd and J. D. Cowan, Reliable Computation in the Presence of Noise. Cambridge, Mass.: MIT Press, 1963.
6. M. C. Taylor, "Reliable Information Storage in Memories Designed from Unreliable Components" and "Reliable Computation in Computing Systems Designed from Unreliable Components," BSTJ, 47, No. 10, 1968.
7. A. V. Kuznetsov, "Information Storage in a Memory Assembled from Unreliable Components," Problems in Information Transmission, 9, No. 3, 1973.
8. S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.
9. S. Watanabe, "Information-Theoretical Analysis of Multivariate Correlation," IBM J. Res. Develop., 4, No. 1, 1960.
10. T. M. Cover, "Information Theory and Investment," 1985 IEEE Intern. Symp. Information Theory, Brighton, England, June 1985. Also "An Algorithm for Maximizing Expected Log Investment," IEEE Trans., IT-30, No. 2, 1984.
11. C. E. Shannon, "Prediction and Entropy of Printed English," BSTJ, 30, No. 1, 1951.
34. L. B. Levitin, "Quantum Amount of Information and Maximum Work," Proc. 13th IUPAP Conf. Statistical Physics, D. Cabib, D. G. Kuper, I. Riess, eds., Bristol, Eng.: A. Hilger, 1978.
35. H. P. Yockey, R. L. Platzman, and H. Quastler, eds., Information Theory in Biology. Elmsford, N.Y.: Pergamon Press, 1958.
36. N. Wiener, Cybernetics, or Control and Communication in the Animal and the Machine, 2nd ed. Cambridge, Mass.: MIT Press, 1961.
37. J. von Neumann, "The General and Logical Theory of Automata," The Hixon Symposium, 1948, L. Jeffress, ed., Wiley, 1951.
38. C. E. Shannon, "Computers and Automata," Proc. IRE, 41, No. 10, 1953.
39. C. E. Shannon, "Von Neumann's Contributions to Automata Theory," Bull. Amer. Math. Soc., 64, No. 2, 1958.
1
MATHEMATICAL PRELIMINARIES
1.1  Review of Probability
1.2  Chebyshev Inequality and the Weak Law of Large Numbers
1.3  Convex Sets and Functions: Jensen's Inequality
1.4  Stirling's Formula
Problems
probability theory. For the most part, in this book we assume either that the set E is countable or that it is divided into a countable number of elementary events.

Example 1

In a coin-flipping experiment, E = {H, T}. If the coin is fair, P{H} = P{T} = 0.5.
Example 2
If the experiment is to throw a die, then E = {1, 2, 3, 4, 5, 6}. If the die is fair, P{1} = P{2} = ··· = P{6} = 1/6, P{2, 5} = 1/3, and P{1, 3, 5} = 1/2.
Example 3
Let the experiment be the weather forecast for a certain day; then E = {sunny, cloudy, rainy, snowy}. A possible probability assignment is P{sunny} = 1/2, P{cloudy} = 1/4, P{rainy} = 1/8, and P{snowy} = 1/8.
Example 4
Take a typical English text and pick a letter at random; then E = {a, b, c, ..., x, y, z}. The probability distribution will be P{a} = 0.0642, ..., P{e} = 0.103, ..., P{z} = 0.0005.
If the outcome of an experiment is always the same, it is said to be a degenerate (or certain) experiment.
Example 5
In the English language, the letter that follows q in a word is always u. The experiment is to open a book at random and to pick the letter following the first q encountered. The sample space is E = {a, b, c, ..., x, y, z}, but P{u} = 1 and P{a} = ··· = P{z} = 0 for every letter other than u.
Var(x) = E((x − x̄)²) = E(x² + x̄² − 2x̄·x) = E(x²) − x̄²
P{Coin = H} = P{(H, 1)} + ··· + P{(H, 6)} = 1/2.
Now assume that the coin and the die are somehow interacting and, as a result, their outcomes are not independent. The joint density may have the distribution shown in Figure 1.2. The following statements can be directly verified from the diagram.
Figure 1.1  Joint density of the coin and die outcomes when the two are independent; each of the 12 entries equals 1/12.
Figure 1.2  Joint density of the coin and die outcomes when the two are not independent.
(i)  P{(H, 5)} = 1/18
(ii)  P{Coin = H}
(iii)  P{Coin = T}
(iv)  P{Die = 5}
(v)  P{Die = 5 given Coin = H} = P{(H, 5)}/P{Coin = H}
The conditional probability of y_j given x_i is defined as

P(y_j | x_i) = P(x_i, y_j)/P(x_i)

where the marginal density P(x_i) is assumed to be nonzero. The marginal densities are defined as follows:

P(x_i) = Σ_j P(x_i, y_j)        P(y_j) = Σ_i P(x_i, y_j)

The covariance of x and y is

Cov(x, y) = Σ_i Σ_j (x_i − x̄)(y_j − ȳ) P(x_i, y_j) = E(xy) − x̄·ȳ
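As a concrete illustration of these definitions, the short sketch below computes marginal and conditional probabilities from a small joint distribution. The joint table in the sketch is hypothetical (it is not the table of Figure 1.2); it is chosen only so that the rows and columns sum correctly.

```python
# Sketch: marginal and conditional probabilities from a joint density.
# The joint table is hypothetical; it only illustrates the definitions
# P(x_i) = sum_j P(x_i, y_j) and P(y_j | x_i) = P(x_i, y_j) / P(x_i).

joint = {
    ('H', 1): 0.25,  ('H', 2): 0.125, ('H', 3): 0.125,
    ('T', 1): 0.125, ('T', 2): 0.125, ('T', 3): 0.25,
}

def marginal_x(joint):
    """P(x_i) obtained by summing the joint density over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px

def conditional_y_given_x(joint, x_value):
    """P(y_j | x_i) = P(x_i, y_j) / P(x_i), assuming P(x_i) > 0."""
    px = marginal_x(joint)[x_value]
    return {y: p / px for (x, y), p in joint.items() if x == x_value}

if __name__ == "__main__":
    print(marginal_x(joint))                  # {'H': 0.5, 'T': 0.5}
    print(conditional_y_given_x(joint, 'H'))  # {1: 0.5, 2: 0.25, 3: 0.25}
```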
1.2 CHEBYSHEV INEQUALITY AND THE WEAK LAW OF LARGE NUMBERS

Let x be a random variable with mean x̄ and standard deviation σ_x. The Chebyshev inequality states that, for any δ > 0,

P{|x − x̄| ≥ δ} ≤ σ_x²/δ²
A simple proof of this result is given here (for another proof, see Problem 1.4). Define the function f(x) as follows:

f(x) = 1        if |x − x̄| ≥ δ
f(x) = 0        if |x − x̄| < δ

Then

P{|x − x̄| ≥ δ} = Σ_i f(x_i) p(x_i)

From Figure 1.3 it is seen that f(x) ≤ [(x − x̄)/δ]² for all x. Therefore,

P{|x − x̄| ≥ δ} ≤ Σ_i [(x_i − x̄)/δ]² p(x_i) = σ_x²/δ²
Figure 1.3  The function f(x) together with the bounding parabola ((x − x̄)/δ)².

Now let x^(1), x^(2), ..., x^(N) be the outcomes of N independent trials of the experiment corresponding to x, and define the sample average

ȳ_N = (1/N) Σ_{n=1}^{N} x^(n)

Then

E(ȳ_N) = (1/N) Σ_{n=1}^{N} E(x^(n)) = x̄

Var(ȳ_N) = (1/N²) Σ_{n=1}^{N} E((x^(n) − x̄)²) = σ_x²/N

Applying the Chebyshev inequality to ȳ_N yields

P{|ȳ_N − x̄| ≥ ε} ≤ σ_x²/(N ε²)

Notice that the right side approaches zero with increasing N. The weak law of large numbers thus asserts that the sample average of x approaches the statistical average with high probability as N → ∞.
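A quick numerical experiment makes the statement concrete. The sketch below (a simulation added here for illustration, with an arbitrarily chosen random variable) compares the frequency with which the sample average deviates from the mean against the Chebyshev bound σ_x²/(Nε²).

```python
# Sketch: the weak law of large numbers checked by simulation.
# The random variable and the parameters are chosen arbitrarily.
import random

def sample_average(N):
    # x takes the value 1 with probability 0.3 and 0 otherwise,
    # so x_bar = 0.3 and sigma_x^2 = 0.3 * 0.7 = 0.21.
    return sum(1 if random.random() < 0.3 else 0 for _ in range(N)) / N

def empirical_violation_rate(N, eps, trials=10000):
    """Fraction of trials in which |sample average - 0.3| >= eps."""
    hits = sum(abs(sample_average(N) - 0.3) >= eps for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    eps = 0.05
    for N in (10, 100, 1000):
        bound = 0.21 / (N * eps**2)            # Chebyshev: sigma^2 / (N eps^2)
        print(N, empirical_violation_rate(N, eps), min(bound, 1.0))
```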
1.3 CONVEX SETS AND FUNCTIONS: JENSEN'S INEQUALITY

In Euclidean space, a set S is convex if, for every pair of points P₁, P₂ in S, the straight line connecting P₁ to P₂ is completely contained in S. Figure 1.4a shows two convex sets in the two-dimensional Euclidean space. The sets of Figure 1.4b are not convex.
If P₁ = (x₁, ..., x_n) and P₂ = (y₁, ..., y_n) are points in the Euclidean n-space, then the straight line connecting them is represented by the set of points λP₁ + (1 − λ)P₂, where 0 ≤ λ ≤ 1.

Figure 1.4  (a) Convex sets; (b) sets that are not convex.
For a convex ∩ function f and any weights λ₁, ..., λ_N with λ_n ≥ 0 and Σ λ_n = 1,

f(Σ_{n=1}^{N} λ_n P_n) = f(λ_N P_N + (1 − λ_N) Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] P_n)
  ≥ λ_N f(P_N) + (1 − λ_N) f(Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] P_n)
  ≥ λ_N f(P_N) + (1 − λ_N) Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] f(P_n)
  = Σ_{n=1}^{N} λ_n f(P_n)

In particular, if the P_n are the values of a random variable x and the λ_n are their probabilities, this is Jensen's inequality:

f(E(x)) ≥ E(f(x))
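Since ln(x) is convex ∩, Jensen's inequality gives ln(E(x)) ≥ E(ln(x)) for any positive random variable. The following sketch (an illustration added here, with an arbitrary distribution) verifies this numerically.

```python
# Sketch: numerical check of Jensen's inequality f(E(x)) >= E(f(x))
# for the convex-cap function f = ln.  The distribution is arbitrary.
import math

values = [1.0, 2.0, 5.0, 10.0]
probs  = [0.4, 0.3, 0.2, 0.1]

mean_x = sum(p * x for p, x in zip(probs, values))
f_mean = math.log(mean_x)                                        # f(E(x))
mean_f = sum(p * math.log(x) for p, x in zip(probs, values))     # E(f(x))

print(f_mean, mean_f, f_mean >= mean_f)                          # True
```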
Figure 1.5  A convex ∩ function f; its value at λx₁ + (1 − λ)x₂ lies above the chord joining f(x₁) and f(x₂).

1.4 STIRLING'S FORMULA

In this section we derive Stirling's formula, which provides upper and lower bounds to N!. Consider the function ln(x), whose integral from 1 to N is given
by

∫₁^N ln(x) dx = [x ln(x) − x]₁^N = N ln(N) − N + 1

Since ln(N!) = ln 2 + ln 3 + ··· + ln N, the sum can be bounded by areas under the curve ln(x), as shown in Figure 1.6. Comparing these areas with the integral yields

N ln(N) + (1/2) ln(N) − N + 7/8 ≤ ln(N!) ≤ N ln(N) + (1/2) ln(N) − N + 1

Figure 1.6  Bounding ln(N!) by areas under the curve ln(x).

Therefore,

N^N e^{−N} √N e^{7/8} ≤ N! ≤ N^N e^{−N} √N e
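The quality of these bounds is easy to check numerically. The sketch below (added for illustration) compares ln(N!) with the logarithms of the two bounds just derived.

```python
# Sketch: comparing N! with the Stirling-type bounds derived above,
#   sqrt(N) N^N e^(-N) e^(7/8)  <=  N!  <=  sqrt(N) N^N e^(-N) e.
# Logarithms are used to avoid overflow for moderate N.
import math

def log_factorial(N):
    return sum(math.log(n) for n in range(2, N + 1))

def log_bounds(N):
    base = N * math.log(N) + 0.5 * math.log(N) - N
    return base + 7.0 / 8.0, base + 1.0        # lower, upper (natural logs)

if __name__ == "__main__":
    for N in (5, 10, 50, 100):
        lo, hi = log_bounds(N)
        lf = log_factorial(N)
        print(N, lo <= lf <= hi, round(lo, 3), round(lf, 3), round(hi, 3))
```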
PROBLEMS
1.1. Three events E₁, E₂, and E₃, defined on the same sample space, have probabilities P(E₁) = P(E₂) = P(E₃) = 1/4. Let E₀ be the event that one or more of the events E₁, E₂, E₃ occurs.
(a) Find P(E₀) when:
  (1) E₁, E₂, E₃ are disjoint.
  (2) E₁, E₂, E₃ are statistically independent.
  (3) E₁, E₂, E₃ are three names for the same event.
(b) Find the maximum values that P(E₀) can assume when:
  (1) Nothing is known about the independence or disjointness of E₁, E₂, E₃.
  (2) It is known that E₁, E₂, E₃ are pairwise independent; that is, the probability of realizing both E_i and E_j is P(E_i)P(E_j), but nothing is known about the probability of realizing all three events together.
Hint: Use Venn diagrams.
1.2. A box contains two dice, one fair and the other loaded, so that for the first one P(1) = P(2) = ··· = P(6) = 1/6 and for the second one P(6) = 1/2, P(1) = ··· = P(5) = 1/10. We choose one die from the box at random and roll it. If the outcome is the number 6, what is the probability that the loaded die has been selected? What if the die is rolled twice and the outcome of both trials is the number 6?
1.3. Let x and y be discrete random variables.
(a) Prove that E(x + y) = E(x) + E(y).
(b) If x and y are independent, prove that E(xy) = E(x)·E(y); that is, x and y are uncorrelated.
(c) Is it possible for x and y to be dependent but uncorrelated? Give an example.
(d) If x and y are independent, prove that Var(x + y) = Var(x) + Var(y). Is this relationship valid when x and y are dependent but uncorrelated?
1.4. (a) For any random variable y that assumes only nonnegative values, prove the following inequality:

P{y ≥ δ} ≤ ȳ/δ

(b) Use part (a) to prove the Chebyshev inequality P{|x − x̄| ≥ δ} ≤ σ_x²/δ².
1.5. Let ȳ_N = (1/N) Σ_{n=1}^{N} y_n be the sample average of N independent observations y₁, ..., y_N of the random variable y. Prove the weak law of large numbers:

P{|ȳ_N − ȳ| ≥ δ} ≤ σ_y²/(N δ²)
1.8. The function f(x) is defined and has a second derivative on the interval (a, b). Prove that the necessary and sufficient condition for f(x) to be convex ∩ is

d²f(x)/dx² ≤ 0
2
ENTROPY, DISCRETE MEMORYLESS
SOURCE, AND BLOCK ENCODING
Introduction   The concept of entropy is the central concept in information theory. The entropy of a random variable is defined in terms of its probability distribution and can be shown to be a good measure of randomness or uncertainty. This chapter begins by introducing the entropy for a discrete random variable in Section 2.1 and describes some of its properties in Section 2.2. Other important properties of entropy are discussed in the problems at the end of the chapter.
The importance of entropy in practical applications has much to do with its relationship with long sequences of a random variable. It turns out that repeated trials of a probabilistic experiment give rise to outcomes that can, under quite general circumstances, be divided into two categories. The first category consists of sequences that have a very small probability of occurrence; in fact, the probability of occurrence of this category as a whole approaches zero with the increasing length of the sequences. The second category consists of the remaining sequences; these sequences, which are known as likely or typical sequences, all have more or less the same probability and, as their length increases, they become closer and closer to being equally likely. The number of typical sequences and their individual probabilities are functions of the entropy, and it is this relationship that gives entropy a major role in the theory
of data compression and error control coding. Sections 2.3 and 2.4 explore this relationship.
The concept of entropy as developed in the original work of C. E. Shannon [1] has been used in several branches of science, but its most significant applications to date have been in the areas of data communication and signal processing. We will therefore restrict our discussion in this book to situations that arise in source coding (data compression) and channel coding (error correction). This chapter and the next focus on the problem of source coding for discrete memoryless sources, where the goal is to accurately represent (encode) the output of the information source with as few code words as possible.
This chapter deals with situations where a fixed-length block of data is mapped onto a fixed-length block of code letters. Section 2.5 considers the case where the code letters are of equal duration in time, while Section 2.6 considers situations such as telegraphy, where different code letters have different durations. In both cases, typical sequences produced by the information source are encoded (represented) by sequences of code letters, and the minimum required rate of generation of code letters is shown to be equal to the source entropy. In Chapter 3, fixed-length blocks of data are mapped onto variable-length blocks of code letters, and encoding schemes are described for the minimization of the average code-word length. It is shown that the minimum achievable code-word length, on the average, is equal to the source entropy.
Let x be a random variable (r.v.) with sample space X = {x₁, ..., x_N} and probability measure P(x_n) = p_n. The entropy of x is defined as

H(x) = −Σ_{n=1}^{N} p_n log(p_n)        (2.1)

The base of the logarithm is arbitrary and amounts to a constant multiplicative factor. [Remember that log_a(x) = (log_a b)(log_b x).] If this base is 2, the entropy is said to be in bits (for binary digits); if the base is e, the entropy is said to be in nats (for natural units). A plot of the function −p ln(p) is shown in Figure 2.1.
Note that at p = 0 we have set the function equal to zero, in agreement with the fact that x ln(x) → 0 as x → 0. Also notice that the function −p ln(p) is a convex ∩ function of p.
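The definition in Eq. (2.1) translates directly into a few lines of code. The sketch below (an illustration added here, not part of the text) computes the entropy of a discrete distribution in bits or in nats.

```python
# Sketch: entropy of a discrete distribution, Eq. (2.1).
# H(x) = -sum_n p_n log(p_n); terms with p_n = 0 contribute zero.
import math

def entropy(probs, base=2.0):
    """Entropy in bits (base 2) or nats (base e)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

if __name__ == "__main__":
    print(entropy([0.5, 0.5]))             # 1.0 bit
    print(entropy([1/3, 2/3]))             # about 0.918 bits
    print(entropy([0.25] * 4, math.e))     # ln 4, about 1.386 nats
```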
Figure 2.1  The function −p ln(p); its maximum value, e⁻¹ ≈ 0.37, occurs at p = e⁻¹.

Figure 2.2  The entropy H(x) of a binary source as a function of p; the maximum occurs at p = 0.5.
The entropy as defined by Eq. (2.1) has several properties. These properties make H(x) a good measure of the uncertainty about the outcome of a probabilistic experiment. In this section, we describe some of these properties.

(i) If p₁ = 1 and p_n = 0 for n ≠ 1, then H(x) = 0; that is, the uncertainty about an experiment with deterministic outcome is zero.
(ii) If p₁ = p₂ = ··· = p_N = 1/N, then H(x) = log N. Note that, if the number of equiprobable outcomes of an experiment increases (i.e., if N increases), then the entropy of the experiment increases.
(iii) H(x) ≤ log N with equality iff† p_n = 1/N for all n. The proof of this statement is based on the important inequality ln(x) ≤ x − 1, whose validity becomes obvious upon inspection of Figure 2.3. In this figure the functions ln(x) and x − 1 are plotted on the same set of axes. Note that the two functions are equal only at x = 1.

†iff means if and only if.
Figure 2.3  The functions ln(x) and x − 1; the two are equal only at x = 1.
To prove property (iii), write (in nats)

H(x) − ln(N) = Σ_{n=1}^{N} p_n ln(1/(N p_n)) ≤ Σ_{n=1}^{N} p_n (1/(N p_n) − 1) = Σ_{n=1}^{N} (1/N) − Σ_{n=1}^{N} p_n = 1 − 1 = 0

Therefore, H(x) ≤ ln(N), with equality iff N p_n = 1 for all n; that is, p_n = 1/N for all n. This implies that the uncertainty of an experiment is highest when the outcomes are equiprobable.
(iv) Let the r.v. x have sample space X = {x₁, ..., x_N} and the r.v. y have sample space Y = {y₁, ..., y_M}. Then the joint r.v. z = (x, y) has sample space Z = {(x₁, y₁), ..., (x₁, y_M), (x₂, y₁), ..., (x₂, y_M), ..., (x_N, y₁), ..., (x_N, y_M)} with NM elements. Let x and y be independent and have equiprobable outcomes. Then H(x) = log(N), H(y) = log(M), and H(z) = log(MN) = log(M) + log(N) = H(x) + H(y); that is, the uncertainty of the joint experiment (x, y) is the sum of the uncertainties of the independent experiments x and y. This property remains valid even if x and y have nonuniform distributions. In this case,

H(z) = −Σ_{n=1}^{N} Σ_{m=1}^{M} p(x_n, y_m) log p(x_n, y_m)
     = −Σ_{n=1}^{N} Σ_{m=1}^{M} p(x_n) p(y_m) [log p(x_n) + log p(y_m)]
     = H(x) + H(y)
(v) If an experiment is decomposed into two subexperiments, the original uncertainty will be the uncertainty associated with the first experiment plus the average uncertainty associated with the second experiment. This is the grouping property of the entropy. To explain this property, consider the r.v. x with sample space X = {x₁, ..., x_n, x_{n+1}, ..., x_N} and probability measure P(x_i) = p_i. Let us decompose X into two subspaces, Y = {x₁, ..., x_n} and Z = {x_{n+1}, ..., x_N}. The probabilities associated with Y and Z are given by P(Y) = Σ_{i=1}^{n} p_i and P(Z) = Σ_{i=n+1}^{N} p_i. Furthermore, let us define random variables y and z by P(y_i) = P(x_i)/P(Y), i = 1, 2, ..., n, and P(z_i) = P(x_{n+i})/P(Z), i = 1, 2, ..., N − n. H(x) can now be written as

H(x) = −Σ_{i=1}^{N} p_i log p_i = −Σ_{i=1}^{n} p_i log p_i − Σ_{i=n+1}^{N} p_i log p_i
     = [−P(Y) log P(Y) − P(Z) log P(Z)] + [P(Y) H(y) + P(Z) H(z)]

In the final expression, the first bracket represents the uncertainty associated with the first experiment and the second bracket is the average uncertainty associated with the second experiment.
Note: The preceding properties of the function H(x) defined by Eq. (2.1) are reasonable properties for a measure of uncertainty. It is possible to prove, however, that this function is the only one that satisfies a small set of reasonable requirements. In reference [4], it is shown that the entropy must be defined by Eq. (2.1) if properties (ii), (iv), and (v), plus a continuity property, are required.
The following example is designed to bring out the significance of H(x) as related to long sequences of the random variable x.
Example 2
Consider the r.v. x with sample space X = {x₁, x₂}, P(x₁) = 1/3, and P(x₂) = 2/3. The entropy of x is given by

H(x) = −(1/3) log(1/3) − (2/3) log(2/3) = 0.918 bits
'"
L lllIUt-'y,
LJI;:)\",I C:a C
VVU . ..........,
loAl . ....
' tJ
Consider now a sequence of N independent trials of x, and let n denote the number of times x₁ appears in the sequence. Each such sequence has probability p₁ⁿ(1 − p₁)^{N−n}. Since there are C(N, n) = N!/[n!(N − n)!] such sequences, their total probability is equal to C(N, n) p₁ⁿ(1 − p₁)^{N−n}.
Table 2.1 shows the probabilities of the various sequences for N = 15. Notice that the most probable sequences are those for which n is close to Np₁ = 5. In fact, the probability of 2 ≤ n ≤ 8 is 0.95. In other words,

The probability of occurrence of a sequence for which n is significantly different from Np₁ is very small.

Also notice that the individual probabilities of these sequences are between 2^{−15×0.718} and 2^{−15×1.118}, which are fairly close to 2^{−NH(x)} = 2^{−15×0.918}. In other words,

All the likely sequences are more or less equiprobable, with probability about 2^{−NH(x)}.

Finally, notice that the total number of sequences with 2 ≤ n ≤ 8 is 22,803 ≈ 2^{15×0.965}, which is not far from 2^{NH(x)}. In other words,

The total number of likely sequences is about 2^{NH(x)}.
TABLE 2.1

n     C(15, n)    p₁ⁿ(1 − p₁)^{15−n}    Total Probability C(15, n) p₁ⁿ(1 − p₁)^{15−n}
0     1           2^{−15×0.585}          0.002284
1     15          2^{−15×0.652}          0.017127
2     105         2^{−15×0.718}          0.059946
3     455         2^{−15×0.785}          0.129883
4     1365        2^{−15×0.852}          0.194825
5     3003        2^{−15×0.918}          0.214307
6     5005        2^{−15×0.985}          0.178589
7     6435        2^{−15×1.052}          0.114807
8     6435        2^{−15×1.118}          0.057404
9     5005        2^{−15×1.185}          0.022324
10    3003        2^{−15×1.252}          0.006697
11    1365        2^{−15×1.318}          0.001522
12    455         2^{−15×1.385}          0.000254
13    105         2^{−15×1.452}          0.000029
14    15          2^{−15×1.518}          0.000002
15    1           2^{−15×1.585}          0.000000
The last conclusion is consistent with the previous two conclusions: if a subset of nearly equiprobable elements has a total probability close to 1, then the number of elements in the subset must be approximately the inverse of the probability of each element.
The three conclusions of Example 2 are quite general. In the next section we will show that the degree of approximation can be arbitrarily improved by letting N assume larger and larger values. Later we will see that the sequences that behave in this manner are far more general than the memoryless binary sequences considered in this section.
The practical significance of these results in data compression should be quite clear. To represent a binary sequence of N digits in our example, we only need NH(x) = 0.918N bits, which constitutes an 8.2 percent reduction in the volume of data. Any degree of accuracy can then be achieved by choosing sufficiently large values of N.
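The numbers in Table 2.1 are easy to reproduce. The sketch below (added for illustration) computes, for the binary memoryless source with p₁ = 1/3 and N = 15, the probability exponent of each sequence type and the total probability of the sequences with 2 ≤ n ≤ 8.

```python
# Sketch: reproducing the quantities of Table 2.1 for p1 = 1/3, N = 15.
# A sequence containing n occurrences of x1 has probability
# p1^n (1-p1)^(N-n) = 2^(-N*c_n), and there are C(N, n) such sequences.
import math

p1, N = 1.0 / 3.0, 15

for n in range(N + 1):
    count = math.comb(N, n)                               # C(N, n)
    prob_each = p1**n * (1 - p1)**(N - n)                 # p1^n (1-p1)^(N-n)
    exponent = -math.log2(prob_each) / N                  # c_n in 2^(-N*c_n)
    print(n, count, round(exponent, 3), round(count * prob_each, 6))

typical = sum(math.comb(N, n) * p1**n * (1 - p1)**(N - n) for n in range(2, 9))
print("P{2 <= n <= 8} =", round(typical, 3))              # about 0.95
```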
We now put the conclusions of the previous section on a firm basis. We will need the weak law of large numbers, which asserts that, if the experiment corresponding to the r.v. x is repeated N independent times with x^(j) the outcome of the jth trial, then

P{|(1/N) Σ_{j=1}^{N} x^(j) − x̄| ≥ δ} ≤ σ_x²/(N δ²)

Here x̄ and σ_x are the average and standard deviation of x, respectively, and δ is an arbitrary positive number. For a proof of the weak law, the reader is referred to Section 1.2 and Problem 1.5.
Suppose a source generates a letter belonging to the alphabet {a₁, ..., a_K} every T seconds. The probability of a_k is p(a_k), and each letter in the sequence is generated independently. Starting at time t = 0, the source will have produced N = t/T symbols at time t. The number of possible sequences generated in this period is thus K^N, and each sequence S_i has probability of occurrence P(S_i). P(S_i) is the product of N terms p(a^(j)), where a^(j) is the jth letter of the sequence.
We now define a random variable z corresponding to the set of sequences S_i. The value z_i assigned to S_i is given by

z_i = −log P(S_i) = −Σ_{j=1}^{N} log p(a^(j))
The quantity (1/N) z_i = −(1/N) log P(S_i) is the sample average of the random variable −log p(a), whose expected value is

−Σ_{k=1}^{K} p(a_k) log p(a_k) = H(x)

Applying the weak law of large numbers, we conclude that, with probability greater than 1 − ε,

−δ ≤ −(1/N) log P(S_i) − H(x) ≤ δ        for large N        (2.2)
The number of sequences for which Eq. (2.2) is valid can now be bounded. Denoting by ν the number of such sequences, we will have

(a) P(S_i) ≥ 2^{−N[H(x)+δ]}    and    ν [P(S_i)]_min ≤ 1
(b) P(S_i) ≤ 2^{−N[H(x)−δ]}    and    ν [P(S_i)]_max > 1 − ε

Therefore,

(1 − ε) 2^{N[H(x)−δ]} ≤ ν ≤ 2^{N[H(x)+δ]}        (2.3)
Equations (2.2) and (2.3) imply that the probability of each likely sequence is about 2^{−NH(x)} and the number of these sequences is about 2^{NH(x)}. Furthermore, since H(x) ≤ log K (see Section 2.2, property (iii) of the entropy function), we conclude that

ν ≈ 2^{NH(x)} ≤ 2^{N log(K)} = K^N        (2.4)

where K^N is the maximum possible number of sequences S_i. Equality holds iff p(a_k) = 1/K for all k.
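The same counting can be carried out numerically for a small source. The sketch below (an illustration under assumed alphabet, probabilities, N, and δ) enumerates the sequences that satisfy Eq. (2.2) and checks them against the bounds of Eq. (2.3).

```python
# Sketch: counting the sequences that satisfy Eq. (2.2) for a small
# memoryless source and comparing their number with the bounds of Eq. (2.3).
# The alphabet, probabilities, N, and delta below are assumptions.
import itertools
import math

probs = {'a': 0.7, 'b': 0.2, 'c': 0.1}
N, delta = 8, 0.2
H = -sum(p * math.log2(p) for p in probs.values())

count, total_prob = 0, 0.0
for seq in itertools.product(probs, repeat=N):
    P = math.prod(probs[s] for s in seq)
    if abs(-math.log2(P) / N - H) <= delta:        # Eq. (2.2)
        count += 1
        total_prob += P

lower = total_prob * 2**(N * (H - delta))          # (1 - eps) 2^{N(H - delta)}
upper = 2**(N * (H + delta))
print(count, round(total_prob, 4), lower <= count <= upper)
```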
To encode the output of the source described in the preceding section, consider a binary encoder that can create binary sequences at a rate of β bits per second. After t = NT seconds, the total number of sequences generated by the encoder is 2^{NβT} (βT can be interpreted as the number of bits generated by the encoder per source symbol). We wish to encode the source into binary sequences, that is, to assign a unique binary sequence to each source sequence. We also wish to use a machine with finite (and otherwise unlimited) memory so that, aside from an initial delay, data can be encoded as they are generated by the source. Under these circumstances, if a unique binary word is assigned to each of the likely sequences and the remaining (unlikely) sequences are ignored, an error will be committed with probability less than ε provided that enough binary sequences are available to cover all the likely source sequences. Thus, if 2^{NβT} ≥ 2^{N[H(x)+δ]}, or equivalently βT > H(x), the probability of error will be less than ε.

As long as the binary sequences are generated with a rate (per source symbol) greater than the entropy of the source H(x), the probability of error can be made as small as desired.
Conversely, if βT < H(x), say βT = H(x) − 2γ, set δ = γ and choose an arbitrarily small positive number ε. There exists a time t₀ after which, with probability greater than 1 − ε, the source sequence S_i will be a likely sequence. The probability of S_i is bounded by

P(S_i) ≤ 2^{−N[H(x)−γ]}

Since the encoder can produce at most 2^{NβT} = 2^{N[H(x)−2γ]} distinct binary sequences, the total probability of the likely source sequences that can be assigned code words is at most 2^{N[H(x)−2γ]} · 2^{−N[H(x)−γ]} = 2^{−Nγ}. We could also have assigned some codes to the unlikely sequences; but since these sequences have probability less than ε, the probability of having a code for a source sequence will, in general, be less than ε + 2^{−Nγ}. For very long sequences, this probability will be extremely small.
If the binary sequences are generated with a rate (per source symbol) less than the entropy of the source H(x), the probability of error will be as close to 1 as desired, irrespective of the manner in which the sequences are encoded.

Note: If βT is larger than H(x) but very close to it, the output of the binary encoder, in the long run, will have 2^{βt} sequences with nearly equal probabilities ≈ 2^{−NH(x)} ≈ 2^{−βt}. This means that the encoder is working close to its maximum entropy, with P(0) ≈ P(1) ≈ 0.5.
Exercise
The results of this section are not restricted to binary encoders. It is quite straightforward to generalize the results to an encoder with D different symbols in its alphabet. Carry out this generalization.
To determine the number of different sequences the device can generate in t seconds, we define the function N(t) as the number of possible sequences in the interval [0, t]:

N(t) = 0,        if t < 0
N(t) = 1,        if 0 ≤ t < T_min = min{T_i}
N(t) = N(t − T₁) + N(t − T₂) + ··· + N(t − T_D),        if t ≥ T_min        (2.5)
Thus N(t) is the solution of a linear difference equation. Before proceeding further, let us consider a simple example .
Example 3
Let D = 2, T₁ = 1, and T₂ = 2. It can readily be verified that Figure 2.4 is the correct representation of N(t) for the first few units of time. The difference equation N(t) = N(t − 1) + N(t − 2) has continuous solutions of the form λᵗ, where λ is found by substitution in the equation: λ² = λ + 1, which yields λ = (1 ± √5)/2. Matching the solution to the initial values gives

N(t) = [(√5 + 1)/(2√5)]·[(1 + √5)/2]ᵗ + [(√5 − 1)/(2√5)]·[(1 − √5)/2]ᵗ

At noninteger values of t, the preceding function is quite different from the function represented in Figure 2.4. It assumes correct values, however, at integer times t = n. Since our main interest is in the integer values of t, we ignore these differences and take the preceding function to represent N(t).
We can now write

log N(t) = t log[(√5 + 1)/2] + log[(√5 + 1)/(2√5)] + log{1 + [(√5 − 1)/(√5 + 1)]^{t+1} e^{jπt}}

For large t, the last term of this equation is very close to zero and may be ignored, yielding

β = (1/t) log N(t) ≈ log[(√5 + 1)/2] = 0.694 bits/unit time
Figure 2.4  The staircase function N(t) for D = 2, T₁ = 1, T₂ = 2.

β is sometimes referred to as the capacity of the encoder. If the entropy of the source is less than β, long blocks of data can be encoded with high accuracy. If the entropy is larger than β, however, the probability of error approaches 1 with increasing length of the sequences.
Returning now to Eq. (2.5), the general equation for N(t), the characteristic equation can be written as

λ^{−T₁} + λ^{−T₂} + ··· + λ^{−T_D} = 1        (2.6)

This equation can have both real and complex solutions, but the solution with the largest absolute value, λ_max, is always real and positive (see Problem 2.8). Furthermore, λ_max can be determined graphically, as shown in Figure 2.5.

Figure 2.5  Graphical determination of λ_max: the function Σ_{i=1}^{D} λ^{−T_i} equals D at λ = 1 and decreases with increasing λ, crossing the value 1 at λ = λ_max.
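Eq. (2.6) can also be solved numerically. The sketch below (added for illustration) finds λ_max by bisection for a given set of letter durations and reports the capacity β = log₂ λ_max; for T₁ = 1, T₂ = 2 it recovers the value 0.694 bits per unit time of Example 3.

```python
# Sketch: solving the characteristic equation sum_i lambda^(-T_i) = 1
# for its largest root lambda_max by bisection, and computing the
# encoder capacity beta = log2(lambda_max).
import math

def lambda_max(durations, tol=1e-12):
    f = lambda lam: sum(lam**(-T) for T in durations) - 1.0
    lo, hi = 1.0, 2.0
    while f(hi) > 0:           # f decreases with lambda; push hi up until f(hi) < 0
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    lam = lambda_max([1, 2])           # D = 2, T1 = 1, T2 = 2
    print(lam, math.log2(lam))         # 1.618..., about 0.694 bits/unit time
```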
PROBLEMS
2.1. Let E = {e₁, ..., e_N} be a set of disjoint events, and let p₁, ..., p_N and q₁, ..., q_N be two different probability assignments on E, with Σp_n = Σq_n = 1. Prove that:
(a) Σ_{n=1}^{N} p_n log(p_n/q_n) ≥ 0
(b)
2.2. A random variable x assumes nonnegative integer values 0, 1, 2, .... Find the probability assignment p_n that maximizes H(x) subject to the constraint that the mean value of x is a fixed value A; that is,

Σ_{n=0}^{∞} n p_n = A

Evaluate the resulting H(x). (Hint: Use Lagrange multipliers. See, for example, F. B. Hildebrand, Advanced Calculus for Applications, 2nd ed., Prentice-Hall, Inc., Englewood Cliffs, N.J., 1976, Chap. 7.)
2.3. Prove that any transfer of probability from one member of a discrete sample space to another that makes their probabilities more nearly equal will increase the entropy.
2.4. Let a_ij ≥ 0 for all i, j, with

Σ_{i=1}^{N} a_ij = 1   for all j,        and        Σ_{j=1}^{N} a_ij = 1   for all i

and define

q_i = Σ_{j=1}^{N} a_ij p_j

Show that H(q₁, ..., q_N) ≥ H(p₁, ..., p_N), with equality iff q₁, ..., q_N is a rearrangement of p₁, ..., p_N.
2.5. Let x₁, ..., x_n be arbitrary positive numbers, and let a₁, ..., a_n be positive numbers whose sum is unity. Prove that

x₁^{a₁} x₂^{a₂} ··· x_n^{a_n} ≤ Σ_{i=1}^{n} a_i x_i

with equality if and only if all the x_i are equal. Note that the inequality still holds if some of the a_i are allowed to be zero. However, the condition for equality becomes: all x_i corresponding to positive a_i are equal.
2.6. A source produces a sequen ce of statistically independent binary digits with the
probabilities P(I ) = 0.005 and P (O) = 0.995. These digit s are taken 100 at a
time and a binary code word is provided for every sequence of 100 digits contain ing three or fewer I's .
(a) If the code words are all of the same length, find the minimum length required to provide the specified set of code words .
(b) Find the probab ility of getting a source sequence for which no code word has
been provided.
(c) Use the Chebyshev inequality to bound the probabilit y of getting a sequence
for which no code word has been provided and compare with part (b).
2.7. Let x be a random variable with probabilities p₁ ≤ p₂ ≤ ··· ≤ p_N. Show that
(a) H(x) ≥ Σ_{n=1}^{N} p_n(1 − p_n) ≥ 1 − p_N   (in nats)
(b) H(x) ≥ H(p_N), where H(p_N) = −p_N log(p_N) − (1 − p_N) log(1 − p_N)
2.10. (a) Show that the function log x is convex ∩ in the region 0 < x < ∞.
(b) Using Jensen's inequality, prove that

H(x) = −Σ_{n=1}^{N} p_n log p_n ≤ log N
2.11. Let f(y) be an arbitrary function defined for y ≥ 1. Let x be a discrete random variable with range {x₁, ..., x_N} and probability distribution {p₁, ..., p_N}. Define the f-entropy of x by

H_f(x) = Σ_{n=1}^{N} p_n f(1/p_n)

If f(y) is convex ∩, show that H_f(x) ≤ f(N).
2.12. Let

P{x = n} = 1/(A n log₂² n),        n = 2, 3, 4, ...

where

A = Σ_{n=2}^{∞} 1/(n log₂² n)

(a) Show that A exists; that is, the sum defining A converges.
(b) Find the entropy of the random variable x.
3
VARIABLE-LENGTH SOURCE CODING
Introduction   In Chapter 2 we studied the problem of source coding using fixed-length code words for fixed-length blocks of data. The likely (typical) sequences were encoded into equal-length code words and, as long as the encoder was capable of producing letters above a certain rate, the encoding process was apt to go smoothly. There are two problems with this scheme. First, the atypical or unlikely sequences cannot be reproduced from the code, and, second, the encoding/decoding procedures are not practical. The first problem is not a serious one because the probability of occurrence of an atypical sequence is extremely small and, in practice, small failure rates are almost always acceptable. The second problem arises from the fact that a code table must be stored in the memory of the encoder and decoder, relating each block of data to a corresponding code word. Moreover, this table must be searched for each block until a match is found. The required amount of memory and the search times thus grow exponentially with the block length and soon become impractically large.
Variable-length source coding, which is the subject of this chapter, is, in a sense, the answer to the preceding problems. Here again one chooses a fixed-length block of data but assigns a variable-length code to it. In practice, the
""
V al la lJ l'IJ - L. 'lJ I I ~ L I I
\-Jvu, ..... v
_VUI I'~
V'
length of the block of data is short, and more often than not this block is composed of only one or two letters from the source alphabet. The code-word lengths are related to the probability of their occurrence and, in general, shorter code words are assigned to more probable blocks of data. The average code-word length will be shown to be bounded from below by the source entropy. Moreover, we will show that this lower bound can be approached in the limit of long sequences.
In Section 3.1 we address the problem of unique decodability. Prefix-free codes will be defined in this section and their properties explored in Sections 3.2 and 3.3. The Shannon-Fano coding scheme is described in Section 3.4, and upper and lower bounds on the average code-word length are established in this section. Finally, in the remainder of the chapter, an optimum encoding scheme known as Huffman coding is described.
Consider an information source with alphabet {a₁, ..., a_K}. The source produces a sequence of independent, identically distributed letters with probability distribution P(a_i) = p_i (1 ≤ i ≤ K). Next, consider an encoder with alphabet {x₁, ..., x_D}, where all x_i have the same duration. Depending on the duration of the source letters, there may be a waiting line at the input to the encoder. We assume that the system has infinite memory and thereby ignore the waiting-line problem. The encoder encodes a sequence of source letters by assigning a predetermined sequence of x_i's to each incoming letter. Suppose the code word assigned to a_i is x_{i1} ··· x_{i n_i}, where n_i is the length of the code word. The average length of this code is given by

n̄ = Σ_{i=1}^{K} p_i n_i

From the law of large numbers, we know that for long source sequences the average code-word length will be close to n̄ with high probability. Our goal in this chapter is to provide insight into methods of encoding that minimize n̄.
Obviously, not all codes that can be imagined are useful. The most general requirement is that all finite sequences of code words must be uniquely decodable. In other words,

For each source sequence of finite length, the sequence of code letters corresponding to that source sequence must be different from the sequence of code letters corresponding to any other source sequence.†
†R. G. Gallager, Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968, p. 45. Copyright 1968 by John Wiley & Sons. Reprinted with permission.
3.2 KRAFT INEQUALITY

Let n₁, ..., n_K represent the lengths of the code words in a prefix-free code, and let D be the size of the encoder's alphabet. Then

Σ_{i=1}^{K} D^{−n_i} ≤ 1        (3.1)
TABLE 3.1

Source Letter    P(a_i)    Code I    Code II    Code III    Code IV
a₁               0.5       0         0          0           0
a₂               0.25      0         1          10          01
a₃               0.125     1         00         110         011
a₄               0.125     10        11         111         0111

Source: R. G. Gallager, Information Theory and Reliable Communication, New York: John Wiley & Sons, 1968, p. 45. Copyright 1968 by John Wiley & Sons. Reprinted with permission.
\",o llcqJ .
Root
Level l
Level 2
. . .
.. .
. . .
Figure 3.1
node, in turn, creates D new branches and the process continues. The nodes at level i can be associated with code words of length i. Assuming that the word lengths are arranged in increasing order (i.e., n₁ ≤ n₂ ≤ ··· ≤ n_K), we assign a node at level n₁ to the first code word. The code being prefix free, no other code word should stem from the node associated with n₁. Therefore, the first word disables a fraction D^{−n₁} of the branches. When the second code word is assigned to a node at level n₂, the total fraction of disabled branches becomes D^{−n₁} + D^{−n₂}. This process can continue as long as a fraction of branches remains. It is therefore necessary that Eq. (3.1) be satisfied in order to have a node associated with each and every code word.
Conversely, if integers n₁ ≤ n₂ ≤ ··· ≤ n_K satisfy the Kraft inequality, one can assign nodes to code words beginning with n₁. The Kraft inequality then assures that at each stage of assignment the necessary nodes are available.
Next consider an arbitrary uniquely decodable code with word lengths n₁, ..., n_K, and raise the Kraft sum to the Nth power:

(Σ_{i=1}^{K} D^{−n_i})^N = Σ_{i₁=1}^{K} ··· Σ_{i_N=1}^{K} D^{−(n_{i₁} + ··· + n_{i_N})} = Σ_{j=N n₁}^{N n_K} A_j D^{−j}

where A_j denotes the number of sequences of N code words whose total length is j. Since the code is uniquely decodable, sequences of N code words and total length j must be distinct. The maximum possible number of sequences of length j is D^j; therefore, A_j ≤ D^j and

(Σ_{i=1}^{K} D^{−n_i})^N ≤ Σ_{j=N n₁}^{N n_K} 1 ≤ N n_K

If Σ_{i=1}^{K} D^{−n_i} is greater than unity, the left side of the preceding inequality grows exponentially with N, whereas the right side grows linearly. Since N can be arbitrarily large, this cannot happen, and therefore the Kraft inequality must be satisfied.
In conclusion, any uniquely decodable code with word lengths n₁, ..., n_K satisfies the Kraft inequality. In Section 3.2 we learned that if integers n₁, ..., n_K satisfied the Kraft inequality, then a prefix-free code with those lengths could be constructed. Hence

A uniquely decodable code can always be replaced with a prefix-free code with the same word lengths.
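Both directions of this argument are easy to exercise in code. The sketch below (an illustration, not from the text) checks the Kraft inequality for a set of word lengths and, when it holds, builds a D-ary prefix-free code by assigning nodes in order of increasing length, as in Section 3.2.

```python
# Sketch: Kraft inequality test and construction of a prefix-free code
# with prescribed word lengths over a D-ary alphabet.

def kraft_sum(lengths, D=2):
    """Left side of the Kraft inequality, sum_i D^(-n_i)."""
    return sum(D**(-n) for n in lengths)

def prefix_free_code(lengths, D=2):
    """Assign code words of the given lengths, shortest first.

    Each code word is the D-ary expansion of a running position at its
    own level; advancing the position by one after each assignment
    disables all descendants of the node just used."""
    assert kraft_sum(lengths, D) <= 1.0 + 1e-12
    code, position, prev_len = [], 0, 0
    for n in sorted(lengths):
        position *= D**(n - prev_len)          # re-express the position at level n
        digits, v = [], position
        for _ in range(n):
            digits.append(str(v % D))
            v //= D
        code.append("".join(reversed(digits)))
        position += 1
        prev_len = n
    return code

if __name__ == "__main__":
    print(kraft_sum([1, 2, 3, 3]))             # 1.0, satisfies the inequality
    print(prefix_free_code([1, 2, 3, 3]))      # ['0', '10', '110', '111']
```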
3.4 SHANNON-FANO CODING

Theorem 1.  The average code-word length of any uniquely decodable code satisfies

n̄ ≥ H(x)/log D        (3.2)

where H(x) is the source entropy and the base of the logarithm is arbitrary.
Proof

H(x) − n̄ ln D = −Σ_{i=1}^{K} p_i ln p_i − Σ_{i=1}^{K} p_i n_i ln D = Σ_{i=1}^{K} p_i ln(D^{−n_i}/p_i)
  ≤ Σ_{i=1}^{K} p_i (D^{−n_i}/p_i − 1) = (Σ_{i=1}^{K} D^{−n_i}) − 1

Since the code is uniquely decodable, the Kraft inequality must be satisfied. Therefore, H(x) − n̄ ln D ≤ 0 and the proof is complete.
Theorem 2.  Given a source with alphabet {a₁, ..., a_K} and probabilities p₁, ..., p_K, it is possible to construct a prefix-free code from an alphabet of size D such that

n̄ < H(x)/log D + 1        (3.3)
Proof.  Choose the length n_i of the code word for a_i according to the rule n_i = ⌈−log_D p_i⌉, where ⌈a⌉ is the smallest integer greater than or equal to a. Now

−log_D p_i + 1 > n_i ≥ −log_D p_i

The right-hand inequality gives D^{−n_i} ≤ p_i, so that

Σ_{i=1}^{K} D^{−n_i} ≤ Σ_{i=1}^{K} p_i = 1

and the Kraft inequality is satisfied; a prefix-free code with these lengths therefore exists. The left-hand inequality gives

Σ_{i=1}^{K} p_i n_i < −Σ_{i=1}^{K} p_i log_D p_i + Σ_{i=1}^{K} p_i

The last inequality is the same as Eq. (3.3), and the proof is therefore complete.
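The rule n_i = ⌈−log_D p_i⌉ of Theorem 2 is simple to implement. The sketch below (an illustration, not from the text, using a five-letter source chosen for the purpose) computes the Shannon-Fano lengths and the resulting average length, which lies between H(x)/log D and H(x)/log D + 1.

```python
# Sketch: Shannon-Fano code-word lengths n_i = ceil(-log_D p_i) and the
# resulting average length, bracketed as in Theorems 1 and 2.
import math

def shannon_fano_lengths(probs, D=2):
    # The small epsilon guards against floating-point roundoff when a
    # probability is an exact negative power of D.
    return [math.ceil(-math.log(p, D) - 1e-12) for p in probs]

def average_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))

if __name__ == "__main__":
    probs = [0.3, 0.25, 0.25, 0.1, 0.1]           # illustrative five-letter source
    H = -sum(p * math.log2(p) for p in probs)     # about 2.1855 bits
    n = shannon_fano_lengths(probs)               # [2, 2, 2, 4, 4]
    nbar = average_length(probs, n)               # 2.4
    print(n, nbar, H <= nbar < H + 1)             # True
```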
It is possible to encode a source and to achieve the lower bound H(x)/log D on n̄ as closely as desired. This is done by encoding sequences of N source letters at a time according to the procedure described in Theorem 2. In other words, the source is assumed to have a new alphabet of size K^N, each member of which is a sequence of N independently generated letters from {a₁, ..., a_K}. The entropy of the new source is NH(x) (Problem 3.3), and its average code-word length is, by definition, N times the average code-word length for the original source. Applying Theorems 1 and 2 to the new source then yields

NH(x)/log D ≤ N n̄ < NH(x)/log D + 1

Or, equivalently,

H(x)/log D ≤ n̄ < H(x)/log D + 1/N

Since N can be arbitrarily large, values of n̄ that are arbitrarily close to H(x)/log D can be achieved.
Note that by encoding long sequences at a time one cannot improve the lower bound on n̄. However, if, due to the particular distribution of probabilities, the lower bound is not automatically achievable, one may resort to the preceding technique to bring the average code-word length closer to its minimum.
The change in the average code-word length caused by the exchange is

Δn̄ = p_i n_j + p_j n_i − p_i n_i − p_j n_j = (p_i − p_j)(n_j − n_i)

which is less than or equal to zero. But the original code is optimum, which means that its average length cannot be improved by the exchange of two words; therefore, Δn̄ = 0. The new code obtained by the exchange is thus optimum, and it is readily observed that an optimum prefix-free code exists for which n₁ ≤ n₂ ≤ ··· ≤ n_K. For this code we prove the following two theorems.
Theorem 1.  In an optimum prefix-free code, the two least probable code words X_{K−1} and X_K have the same length, n_{K−1} = n_K, and, with proper rearrangement, differ in only their last digit.
Proof.  If n_{K−1} < n_K, then the last digit of X_K can be dropped without violating the prefix condition. This results in a new code with smaller n̄, which is contrary to the assumption that the code under consideration is optimum. Therefore, n_{K−1} = n_K.
If X_K and X_{K−1} are identical except for the last digit, the theorem is proved. If not, there may be another code word with length n_K (such as X_{K−2}, X_{K−3}, ...) that differs from X_K in only the last digit. This code word can be exchanged with X_{K−1}, and the resulting code will satisfy the theorem. If such a code word does not exist, the last digit of X_K can be dropped without violating the prefix condition. This, however, results in a smaller n̄, which contradicts the assumption. Therefore, with proper rearrangement, X_K and X_{K−1} differ in only the last digit.
Theorem 2.  Let S be a source with letters a₁, ..., a_K and probabilities p₁ ≥ p₂ ≥ ··· ≥ p_K, and let S′ be the reduced source with K − 1 letters obtained by combining a_{K−1} and a_K into a single letter whose probability is p_{K−1} + p_K. If {X′₁, ..., X′_{K−1}} is an optimum prefix-free code for S′, then the code for S given by

X_i = X′_i        for i ≤ K − 2
X_{K−1} = X′_{K−1} 0        and        X_K = X′_{K−1} 1

is optimum for S.

Proof.  Since n_i = n′_i for i ≤ K − 2 and n_{K−1} = n_K = n′_{K−1} + 1, we have

n̄ = p₁n₁ + ··· + p_K n_K = p₁n′₁ + ··· + p_{K−2} n′_{K−2} + (p_{K−1} + p_K)(1 + n′_{K−1}) = n̄′ + (p_{K−1} + p_K)
That is, the difference between n̄ and n̄′ is a constant. If the optimum code for S is better than the preceding code, the code derived from it for S′ must be better than optimum, which is impossible. Therefore, the code obtained for S according to the preceding rule is optimum.
Theorems 1 and 2 reduce the problem of finding an optimum code for a source with K letters to the problem of finding an optimum code for a new source with K − 1 letters. The procedure, however, can be repeated until the new source has only two letters, in which case the optimum code is trivial. The procedure of binary Huffman coding is summarized in the following steps.

1. Arrange the source letters in decreasing order of probabilities (i.e., p₁ ≥ p₂ ≥ ··· ≥ p_K).
2. Assign 0 to the last digit of X_K and 1 to the last digit of X_{K−1} (or vice versa).
3. Combine p_K and p_{K−1} to form a new set of probabilities, p₁, ..., p_{K−2}, (p_{K−1} + p_K).
4. Repeat all the steps for this new set.
Example 1

The Huffman coding procedure for a source with a five-letter alphabet is shown in Table 3.2. The optimum code is shown in the rightmost column, and its average length is given by

n̄ = Σ_{i=1}^{5} p_i n_i = 2.2 bits/letter

TABLE 3.2

Source Letter    P(a_i)    Code Word
a₁               0.3       11
a₂               0.25      10
a₃               0.25      01
a₄               0.1       001
a₅               0.1       000

(Reduction stages: 0.1 + 0.1 = 0.2;  0.25 + 0.2 = 0.45;  0.3 + 0.25 = 0.55;  0.55 + 0.45 = 1.)
Notice that H(x) = −Σ_{i=1}^{5} p_i log p_i = 2.1855 bits/symbol is less than n̄; therefore, Eq. (3.2) is satisfied.
If we choose n_i = ⌈−log p_i⌉ according to the Shannon-Fano scheme, we will have n₁ = n₂ = n₃ = 2, n₄ = n₅ = 4, and n̄ = 2.4, which satisfies Eq. (3.3) but is larger than the average code-word length for the Huffman code.
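The four steps above translate directly into a short program. The sketch below (an illustration, not from the text) builds a binary Huffman code with a priority queue and reproduces the average length of 2.2 bits per letter found in Example 1; the particular code words it prints may differ from Table 3.2, but the lengths agree.

```python
# Sketch: binary Huffman coding following the four steps above.
# Repeatedly combine the two smallest probabilities; each merge prepends
# one digit to every code word in the two merged groups.
import heapq
import itertools

def huffman_code(probs):
    """Return a list of binary code words, one per input probability."""
    counter = itertools.count()                  # tie-breaker for equal probabilities
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    codes = [""] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)      # smallest probability
        p2, _, group2 = heapq.heappop(heap)      # second smallest
        for i in group1:
            codes[i] = "0" + codes[i]
        for i in group2:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (p1 + p2, next(counter), group1 + group2))
    return codes

if __name__ == "__main__":
    probs = [0.3, 0.25, 0.25, 0.1, 0.1]
    codes = huffman_code(probs)
    nbar = sum(p * len(c) for p, c in zip(probs, codes))
    print(codes, nbar)                           # average length 2.2 bits/letter
```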
3.6 GEOMETRICAL INTERPRETATION OF HUFFMAN'S TECHNIQUE

Figure 3.2  A code tree with the code-word probabilities attached to their nodes; p₄ and p₅ occupy the final level.
From the foregoing discussion we conclude that the final level contains the two least probable code words. These code words have the same parent node and therefore differ in only the last digit. To complete our geometrical interpretation of Huffman coding, we need to know the effect of rearrangement of probabilities on n̄. This is the subject of our final question.

4. Can nodes at a given level be exchanged? The answer is yes. The code words associated with the nodes at a given level have the same lengths and, as a result, can be exchanged without increasing n̄ or violating the prefix condition. For instance, at level 3 in Figure 3.2 we can exchange the node associated with p₃ and the node that is parent to p₄ and p₅. This kind of rearrangement is essential to Huffman coding since at every stage of the coding process we must be able to shuffle the probabilities until the two smallest ones have the same parent node. Also notice that, once the two smallest probabilities are brought together under the same parent, any reshuffling at a lower level does not destroy this arrangement. For example, any rearrangement at level 1, 2, or 3 cannot assign p₄ and p₅ to different parents.
We are now in a position to generalize Huffman's technique to nonbinary situations where the encoder alphabet size D is larger than 2. Visualizing an optimum code tree with D branches stemming from each node, we ask the same four questions as in the previous section. The answers to questions 1, 2, and 4 remain the same. The answer to question 3, however, must be slightly modified since free nodes in the final level can now exist. As long as there is more than one final-level code word associated with a parent node, that parent node can have other offspring with no code words. The situation for D = 3, for example, is shown in Figure 3.3. The free node in the final level cannot be eliminated by assigning p₅ or p₆ to their parent node, since the other probability then remains unassigned.
By exchanging nodes in the final level, all the free nodes should be brought together under the same parent. We then assign zero probabilities to the free nodes and proceed with the general Huffman coding procedure, that is, combining the D smallest probabilities and continuing with a set that has D − 1 fewer elements.

Figure 3.3  A code tree for D = 3 with a free node in the final level.
TABLE 3.3

Source Letter    P(a_i)    Code Word
a₁               0.4       0
a₂               0.3       1
a₃               0.2       20
a₄               0.05      21
a₅               0.03      220
a₆               0.02      221

(Reduction stages: 0.02 + 0.03 + 0 = 0.05;  0.05 + 0.05 + 0.2 = 0.3;  0.3 + 0.3 + 0.4 = 1.)
With a complete tree, the set of probabilities is reduced by D − 1 elements at each stage until D elements remain. Hence, the initial number of elements in the set must be D + n(D − 1), where n is an arbitrary integer. Zeros must therefore be added to the set to bring the total number of elements to D + n(D − 1).
Example 2
If D = 3, then D + ntD - I) = 3 + 2n , which can be any odd integer. A zero is
therefore added to the set of probab ilities if the number of elements is even. (If the number is already odd , nothing needs to be done .) The procedure of ternary Huffman coding
for a set with six letters is shown in Table 3.3 .
PROBLEMS
3.1. A code is not uniquely decodable if there exists a finite sequence of code letters
that can be resolved in two different ways into sequences of code words . That is,
a situation such as
I AI
B,
I
A2
I A) Am
I B I B) I ... I B.
must occur where each Ai and each B, is a code word. Note that B , must be a prefix of A l with some resulting dan glin g suffix . Each dangling suffix in this sequence in turn must either be a prefix of a code word or have a code word as
prefix, resulting in another dangling suffix. Finally, the last dangling suffix in the
sequence must itself be a code word . Thus one can set up a test for unique decodability in the followin g way. Construct a set S of all possible dangling suffixes .
The code is uniquely decodable if and only if S contains no code word.
(a) State the preci se rules for building the set S.
(b) Suppose the code word length s are n" .. . , nK. Find a good upper bound on
the number of elements in the set S.
{OO,OI, 10, nj
{O,OI,lO}
{O ,OI, II}
{O,OI}
(d) For each uniquely decodable code in part (c), construct, if possible, an infinitely long encoded sequence with known starting point such that it can be
resolved into code words in two different ways. (Unique decadability does
not always imply finite decodability .) Can this situation arise for a prefix-free
code?
3.2. A DMS is encoded by a prefix -free ternary code and the average code-word
length is found to be 11 = H 3(x ) (the subscript 3 indicates the base of the logarithm). Can the source have an even number of letters in its alphabet?
3.3. Let a source SI have alphabet {ah .. . , ad and probability distribution
PI> ... ,PK' Construct a new source SN with alphabet size K N where each member
of the new alphabet is a sequence of N independent, identically distributed letters
from SI' Prove that the entropy of SN is N times the entropy of SI' [Hint: Use
property (iv) of the function H(x) in Section 2.2.]
3.4. Consider the following method of constructing a binary code for a source S with
alphabet {aI> ... , aK} and probability distribution PI ~ pz ~ ... ~ PK ' Define
i-I
qi
LPj,
for i > 1
j ~1
(b) Prove that the method yields in all cases a prefix-free code whose average
length 11 satisfies the inequality H(x) :s 11 < H(x) + 1.
3.5. Use the concept of code tree in solving this problem: There are N coins of which
N - 1 are identical and one may have the same weight as others , may be heavier,
or may be lighter. A balance and a standard coin (same as the N - I identical
coins) are also available. Each use of the balance can establish a relationship between two groups of coins as to whether one group is equal to, heavier than, or
lighter than the other. The goal is to identify the odd coin, if any, and to determine its nature (heavy or light) after n uses of the balance. What is the maximum
value of N for a given n? Could you solve the problem without the standard coin?
3.6. For each of the following discrete memoryless sources, construct a binary and a
ternary Huffman code and find the corresponding 11 in each case.
(a) A source with a six-letter alphabet and probabilities 0.33, 0.23, 0.12, 0.12,
O.I,O.1.
(b) A source with a seven-letter alphabet and probabilities 0.35, 0.2, 0.15, 0.15,
0.1, 0.03, 0.02.
3.7. For the source in Problem 3.6(b), construct two different binary Huffman codes
with the same (minimum) average length but different variances . Which code is
preferable in practice and why?
Chap. 3
Problems
37
3.8. (a) Give an example in which a source with equiprobable letters is encoded into
an optimum (Huffman) code with unequal code-word lengths.
(b) Give an example in which a source with unequiprobable letters is encoded
into an optimum (Huffman) code with equal code-word lengths.
3.9. Consider a discrete memoryless source with an alphabet of seven letters. The
probabilities are 0.3 , 0.25, 0.15, O. I , O. I , 0.05, 0.05. Find a prefix-free code of
minimum average length under the constraint that the first letter of each word
must be 0 or I and each successive code letter can be 0, I , or 2. Find a general
rule for constructing prefix-free codes of minimum average length with this constraint and outline why it works.
3.10. A source with K equiprobable letters is encoded into a binary Huffman code . Let
K = r, where J is an integer and I ::;:; ex < 2.
(a) Are there any code words with length not equal to either J or J + I?
(b) In terms of ex and J , how many code words have length J?
(c) What is the average code-word length?
3. 11. (a) A source has five letters with the probabilities 0 .3, 0 .2, 0 .2, 0.15, 0 . 15.
These letters are to be coded into binary digits and transmitted over a channel. It takes 1 second to transmit a 0 and 3 seconds to transmit a 1. Using
cut-and -try techniqu es, find a prefix-free code that minimizes the avera ge
transmission time and calculate this minimum average time.
(b) Any such code can be represented by a tree in which the length of a branch
is proportional to the time required to transmit the associated digit. Show
that, for a code to minimize the average transmission time, the probabilities
associated with intermediate and ter min al nodes must be non increasing
with length.
3.12. Run Length Coding. A source produces a sequence of independent binary digits
with the probabilities P (O) = 0.9 , P(l ) = 0.1. We shall encode this sequence in
two stages, first counting the number of zeros between successive ones in the
source output and then encoding their run lengths into binary code words. The
first stage of encoding maps source sequences into intermediate digits by the following rule:
01
00 1
1
2
0001
00000001
00000000
7
8
VUI IU ...... l
3.13.
v - L. vl l ~ L I I
...... V U I VV
_VUIII~
The final stage of encoding assigns a code word of one binary digit to the intermediate integer 8 and code words of four binary digits to the other intermediate integers.
(a) Prove that the overall code is uniquely decodable.
(b) Find the average number nj of source digits per intermediate digit.
(c) Find the average number nz of encoded binary digits per intermediate digit.
(d) Show, by appeal to the law of large numbers, that for a very long sequence of
source digits the ratio of the number of encoded binary digits to the number
of source digits will, with high probability, be close to nz/nj . Compare this
ratio to the average number of code letters per source letter for a Huffman
code encoding four source digits at a time.
A discrete memoryless source has a six-letter alphabet with probabilities {0.3,
0.2,0.15,0.15, 0.15, 0.05}.
(a) Construct an optimum (Huffman) binary code for this source. What is the average code-word length
(b) A code is said to satisfy the parity check condition if all its code words have
an even (or odd) number of 1's. Does the code in part (a) satisfy the parity
check condition? If not, can you construct another Huffman code that does
so?
Consider a discrete memoryless source S with alphabet {ah . . . , ad and probability distribution {Ph ' . . , pd. A new source S' is formed by adding a zeroprobability letter to the alphabet of S, forming the alphabet {ah ... , aK+j} with
probability distribution {Ph ' .. ,PK, O}. Each source is then encoded into binary
digits (D = 2) using Huffman's technique . Is it true, in general, that the average code-word lengths of Sand S' are the same? If not, which code is better
and why?
A pair of independent, honest dice are rolled and the sum of the outcomes is denoted U" . Obviously, U" is a member of the set S = {2, 3, 4, ... , 11 , 12}. A student
is asked to determine U" after asking questions that can be answered Yes or No. If
he asked, Is it 2?, Is it 3?, and so on, he would average a little under six questions. It is possible to do better, however.
(a) Using the concepts of code tree and Huffman coding, design an optimum
strategy for asking questions .
(b) What is the minimum average number of questions?
(a) Consider a DMS, Sh with probability distribution {PI,' .. ,PK}. Assuming
that PI ~ P: ~ ... ~ PK, we divide PK in two equal halves and form a new
source Sz with probability distribution {Ph ' " ,PK- h O.5PK, 0.5PK }. If op,
represents the average code-word length for the binary Huffman code, prove
that the redundancy R = Op! - H(x) is the same for both sources S, and Sz.
(b) Show that for a source with the two-letter alphabet {aj,az} it is possible to
find a probability distribution {p, 1 - p} such that any amount of redundancy
in the range 0 < R < 1 is achieved.
(c) Combining the conclusions in parts (a) and (b), show that it is always
possible to find a probability distribution {Ph' . . , PK} for a source with a
K-letter alphabet such that any amount of redundancy in the range 0 < R <
1 is achieved .
n?
3.14.
3.15.
3.16.
Chap. 3
Problems
39
3.17. A discrete memoryless source S\ has alphabet A = {at> a-, a-; a4,as} and a certain
probability distribution . The following code C, is suggested for this source.
C1 = {1,000,001,010,0 11}
A second source , S2' has alphabet B = {bt> b-;b3 , b4, bs} and a probability distribution different from that of St. Two alternative codes C2 and C ~ are suggested
for S2' as follows:
Figure P3.17
4
UNIVERSAL SOURCE CODING
I n t r o d u c t i o n In this chapter we introduce the concept of universal coding. Unlike much of information theory , this is an area that was not pioneered
by Shannon himself and represents a fundamentall y different approach to
source coding. The ideas presented here are due to B. M. Fitingof [9, 10].
Shannon' s theory is based on the knowledge of probability distribution
functions, without which optimum encoding is impossible. Universal coding ,
on the other hand , utilizes the structure of the sequences and arrives at the same
optimum answer. In situations where the probab ility distribution functions are
not available or the source statistics are time dependent, the universal coding
techniques are clearly the right choice.
The material covered in this chapter should serve as an introductory exposition of the idea of universal coding. A growing number of papers are now
published in this area each year, and universal coding constitutes an active field
of research in information theory.
II IVC' I.:lO I
t,JUUI\.oC'
..... UUI
I I~
..... IIOtJ .
"'T
knowledge of the probability distribution in order to achieve optimum or nearoptimum encoding . For example, in the block-encoding scheme of Chapter 2,
the likely sequences are classified according to their probability of occurrence,
and in the Shannon-Fano technique of Chapter 3, the code-word length n,
associated with a given letter (or sequence of letters) is [log Pil, where Pi is the
probability of the letter (or sequence) . In practice , however, it is often the
case that the probability distribution is not known, or, at best, only an approximation to it is available. It may also happen that the source statistics vary
slowly so that for a period of time that is long compared to the block length the
statistics are fixed, but change from one period to another. In such situations, a
probabilistic coding scheme that is efficient for some time frames will not be so
for others.
To appreciate the effect of inexact knowledge of the sourc e statistics
on the coding efficiency , consider a binary DMS with probability distribution
P(O) = P, P(l) = 1 - p. If the encoder 's approximation to P is denoted by Po ,
then Shannon-Fane coding of long blocks effectively assigns a code word of
length - log Po to 0 and a code word of length - loge1 - Po) to 1. The average
code-word length will then be
n=
n
A plot of versus P for a fixed Po is shown in Figure 4.1 . For comparison with the optimum in the ca se whe re P is known, a plot of op , =
- P log P - (l - p) log(l - p) is also shown in the figure. Note that is tangent to Op! at p = Po, but as p deviates from Po, the difference between the two
curves increases rather sharply.
In this chapter we develop the basic ideas of universal coding, a scheme
that relies not on the probability of the sequences but on their structure. We will
show that for any positive number e it is possible to encode a source such
that
is less than or equ al to H (x) + e for all possible sourc e statistics
{ Ph ... ,pd The counterp art of Figure 4.1 for the universal scheme is shown
in Figure 4. 2. The distance e between the upper bound to n and n Op! can be
made as small as desired by choosing a large enough block length .
n
n
Po
Figure 4.1
Sec. 4.2
43
Upper bound of
"
~- --,I
'",
\
\
I
I
,
\
..
p
Figure 4.2
E(qk) =
1/= 1
w(Q)
N'
K
'
TIk=INk .
Next we inquire about the total number of different frequency vectors that
represent source sequences of length N. The number of such vectors is a function of K and N and will be denot ed by cjJ(K, N) . Lemma 2 gives an explicit
formula for this function .
Lemma 2.
cjJ(K,N)
(N+~-I)
The frequency vector Q(Sj) = (qli, . . . , qKj) is now used to define the
quasi-entropy 'l' of the sequen ce Sj. By definition
K
'l'(Sj)
= - L qkj log s
k=1
Realizing that Qi = (ql;, ' .. , qKj) can in fact be a probability vector, one can
say that the quasi-entropy 'l'(Qj)t has all the propert ies of the entropy function
H (Q j) that depend solely on Qj. For example , 0 ~ 'l'(Qj) ~ log K. Not ice ,
however, that the entrop y is a quantity defined for the source and depends on
the probability distribution of letters , whereas the quasi-entropy is defined for
each source sequence se pa rately and for a gi ven seque nce is indep endent
of the probability distr ibuti on. The follo wing theorem es tablishes the connecti on between the qu asi -entrop y 'l'(ql , ' . . , qK) and the so urce entropy
H (p " . . . ,PK)'
Theorem.
E('l'(Q ))
Ht p, . . . ,PK)'
Proof.
E('l'(Q))
E( -
tSince 'I' (S i) is only a fun ction of Q(S i ) or Qi ' we use the not ati on s '1 / (5 ,) and 'I'(Q ,)
inlerchangeably.
Sec. 4.3
45
= p i . Therefore,
E('I'(Q))
:S
- Pk log Pk
k= 1
We are now in a position to outline an optimal encodin g scheme for the source
sequences based on their structure without requiring an a priori knowledge of
the probab ilities involved. The code for a given sequence S, is compo sed of two
parts. The first part identifies the frequency vector Qi associated with Sj. The
second part identifie s S, among all the sequences that have the same Q. Since
the total number of different frequency vectors is cP(K, N), the number of bits
required for the first part is flog cP(K,N)l Simi larly, the number of bits required for the second part is flog W(Qi
The base of the logarithm is 2 for a
binary encoder and D for an encoder with alphab et size D. To prove the optimality of the coding scheme , we need the following lemma .
Lemma 3
n!
n il
where nj ~ 0
L nj
----;--- <: - --
I 1nj'I
IIj=
IIIj= 1n/
_ '_1!_ <:
IIf=,nj! -
II ' ,
and
j=1
[e
2, we use
l _ (71/8)( _ _
II]=,nj
IIf=,n?
The proof is complete if it can be shown that the bracketed term is less than or
equal to 1. Since 2:.]=1nj = n, at least one of the nj' s is greater than or equal to
nil. Consequentl y, II]=I nj ~ nil and
log w(Q;)
'to
I I IVC; 1 ~al
.JUU I
""c:::
,",UUIlI~
viICt-', ....
But
log 4;(K , N )
(N + K - I) !
log N! (K - I)!
::5
(N + K - I) N+Klog N N . (K - I I
IOg(I + K ~ I) + (K
- I)
IOg(1+ K ~ I)
Similarly,
As a result,
IOg( I + K ~
I) +
IOg( I + K ~
I) + 2
+N
IOg(1
+K
IOg(1+ K ~ 1)
~ 1) + 2
The ave rage code- wo rd len gth per so urce lett er is , by defin iti on ,
n = E(L(Si))IN. Therefore,
[-K--N l1og ( 1 + -K N)
- 1
I) 2]
K + log ( 1 + ----;:;+
00.
Thus, for a
Cha p . 4
Probl ems
47
TABLE 4.1
Qi
(0, ~)
CH)
CU)
21
CH)
35
CH)
35
d,~)
(H)
'I'(S;)
in bits
s,
w(Q ;)
1 1
10
o1
o1
0
0.592
000
001 000
001 001
1
1 00
0 0
0.863
001 110
010 00000
010 00001
00
1 1 I 1
1 1 1 1 000
1 1 1o10 0
0.985
010 10100
0 11 000000
011 000001
0 1 1 I 1
1 1 1 000 0
1 1 0 1 000
0.985
011 100010
100 000000
100 000001
21
000 0 1 1 1
11 0 0 0 0 0
1 0 10 0 0 0
0.863
100 100010
101 00000
101 00001
00000 1 1
1 0 0 0 0 00
o 1 0 0 000
0.592
101 10100
110 000
110 001
III
1 1
1 1
o0
C~, 0)
Code Word
000000 1
000000 0
110 11 0
encode the frequency vector; these are the first 3 bits in every code word. The
remaining bits identify each sequence in a given class .
PRO B LE MS
4.1. Consider a binary memoryless source with P(O) = Po and assume that sequences of
length N are encod ed accor ding to the Sha nnon-Fano algor ithm. If the source
statistics now change so that P(D) = p, what can be said about the new average
code-word length per source letter?
4.2 . The vector 11 = (Il l, . . . , 11 K ) co ns ist s of integer s sat isfying Il k ~ I for k =
1, ... ,KandL r=l llk =N.
(a) Show that , for fixed Nand K , there exist (%
=:) different vectors 11.
..0
U IIIVt; I;;)C1 1
vUUIL.~
vUU III~
\,.,IICltJ6-t
(b) If nk's are allowed to be zero as well, show that the total number of vectors
becomes
~
K
(K) (N - 1)
j
j - 1
2:
nm'!'(sm) s; '!'(S)
m ~' N
4.4. In a modified version of the universal encoding algorithm, all frequency classes
that correspond to different permutations of the same frequency vector are combined. For example, if Q(Si) = (q" q2,q3) and Q(Sj) = (q2' q-, ql), then S, and S,
belong to the same frequency class.
(a) Encode binary sequences of length 7 according to the modified algorithm.
(b) Prove that the modified algorithm is universal; that is, for any 8 > 0 there
exists No such that for N > No the average code-word length per source
letter Ii satisfies the relationship Ii < H(Ph '" ,PK) + 8.
4.5. Consider a DMS with alphabet {a" . .. ,ad and unidentified probability distribution P = {Ph' . . ,pd. p is known, however, to be one of the M distribut ions
r, . . . , pCMl. The K N sequences S, of length N are encoded according to the following scheme:
(i) Construct M prefix-free codes C h . . . , CM , one code for each distribution
r, . . . ,p CM l . The length of the code word assigned to S, in Cm is
I - log p(Si lm)l, where p(Silm) is the probability of S, assuming P = p Cm) .
(ii) Upon observing Si, compare the M codes and choose the one that contains
the shortest code word for S, .
(iii) Identify S, by an identifier for the chosen code, followed by the code word
corresponding to S, in that code.
Prove that in the limit when N ~ 00 this scheme becomes optimum for each of
the M possible distributions .
5
MUTUAL INFORMATION,
DISCRETE MEMORYLESS
CHANNEL, AND CHANNEL CAPACITY
5.1
5.2
5.3
5.4
5.5
Introduct ion In this chapter we extend the concept of entropy and introduce an important information theoretic function known as the average mutual information. The imme diate appl ication of these new concepts is in the
area of comm unication over noisy channels to which the present chapter and
the following chapter are devoted. The ideas, however, are quite general and
find application in other areas of informat ion theory.
The simplest class of noisy channe ls is the class of discrete memoryless
channels (DMC) described in Section 5 .2. Di screte memory less channels
provide reasonably good models for many practical commu nication situations;
moreover, extensions of the theory to more comp licated systems are possible
only when this simplest class is well understood. It is for this class that the
theory of reliable communication will be developed in this book .
In the remaining part of the chapter, we invoke the concept of equivocation to justify the definition for channel capacity given in Section 5.5. The capacity can be analytically determined only in special cases and, in general,
numerical methods must be employed for the calculation of capacity. One such
method is described in Chapter 9. The significance of capacity and its role in
the theory of reliable communication over noisy channels are the subjects of the
next chapter.
"IUI"UUIIIIIVIIIIU\.IV.I
...... 1.....
1-" . ....
Con sider the random variables x and y with joint probability distribution
p(x; ,Yj), 1 ::5 i ::5 N, 1 ::5 j ::5 M. The condit ional entrop y of x given y is de-
fined as
N
H (x ly)
= -L
;=1 j = !
has been revealed . In this context, the following version of the same definition
is perhaps more informative.
H(xly) =
The inner summation is now a measure of the uncertainty about x when the outcome of y is known to be Yj' The outer summation is the average of this uncertainty over y. In the following, we derive some important properties of the
conditional entropy.
Theorem 1.
H(x Iy) ::5 H(x) with equality iff x and yare independent.
Proof
'" '"
p(x i)
LJ LJP(Xi ,Yj ) In (
i
j
P x, Yj
I )
L
i
LP(x;,Yj) [
j
=L L
i
'"
f(~i\ -
P X, Y;
O. Thus
I]
[p(x;)p(Yj) - p(x;,Yj)]
The inequality is therefore proved. Equality holds if two conditions are satisfied. First, P(Xi) = p(xiIYj) for all pairs (i,j) such that P(Xi,Yj) '" 0 and,
second, the sum of P(Xi)P(Yj) over all such pairs is equal to I. The second condition will be valid iff p(x;)p( Yj) = 0 whenever P(Xi' yJ = O. The conclusion
is that H(x Iy) = H(x) iff x and yare independent.
Theorem 2.
H(x , y)
H(y)
Sec. 5.2
51
Proof
H(x,y)
= - 2:
i
= H(y) + H(xly)
The second part follows from symmetry, completing the proof.
It follows from T heorems I and 2 that H(x, y) ~ H(x ) + H(y) with
equality iff x and yare independent.
The average amount of informat ion about x contained in y can now be
defined in terms of the reduction in the uncertainty of x upon disclosure of y.
Denot ing this information by l ex, y) , we define
l(x, y)
l (y , x)
l(x, y)
That is, the information about x contained in y is equal to the informati on abou t
y contained in x. For this reason , l ex, y) is called the average mutual information between x and y.
From Theorem 1 we obse rve that lex, y) 2:: 0 with equality iff x and yare
independent.
In Problem 5.3 the reader is asked to show that the follow ing relationship
is a direct consequence of the definition of l ex , y) :
_ ~ ~ (
)I
p(Xj,Yj)
L..-P <,, og ( .) ( .)
j=Jj= 1
pX,P YJ
One can thus interpret the functio n 10g{p (x;' Yj )/[ p(xi)p(Yj)]} as the mutual informa tion between x , and Yj . This functio n, however, can be positive, zero, or
negative , whereas its average, l ex , y), is always greater than or equal to zero.
I (x,y) -
L..-
'-I"
....
t-".....
a sequence of input letters is equal to the product of the conditional probabilities of the individual output letters given the corresponding input letter. In
other words,
N
[lp(YJIlIXkll)
n= l
p(YJ)
L P(Xk)P(YJ IXk)
k=!
= I
[ P( YJIXk)] =
[P(YJIXk)] =
0.2 0.3
0.3 0.5
(
0.5 0.2
0.5)
0.2
0.3
[P(YJIXk)] =
C: e
~ e) ,
0.3)
0.2 '
k = I
k = 2
The last example is the binary symmetric channel (BSC), which repre sents an important practical situation and has been studied extensively. In this
channel, either or I is transmitted and either or I is received. The probability of error is e and is the same for both transmitted letters .
An important property of a symmetric channel is that its H(y Ix) is independent of the input distribution and is solely determined by the channel matrix. This can be proved as follows:
Sec. 5.3
Symm etric, Lossless, Dete rministi c, and Use less Chann els
H(y Ix)
= -
L
i
53
L P(Xj,Yj) log p( Yj lx J
i
- L P(xJ L Pi log p/
j
Therefore , H(y Ix) = - 2:;=1pi log p/ ' independ ent of input distribution.
Lossless Channel
Figure 5.1 is a graphical representation of the lossless channel. A connection between Xk and Yj means that P(Yj IXk) is nonzero. In a lossless channel the
output uniquely specifies the input; therefore , H (x Iy) = O.
Deterministic Channel
In this channel the input uniquely specifies the output (see Figure 5.2);
therefore , H(y Ix) = O.
Useless Channel
A channel is useless iff x and y are independent for all input distributions . For a useless channel , H(x Iy) = H (x); that is, knowledge of the output
does not reduce the uncertainty about the input. Thus, for the purpose of determining the input , we may ignore the output altogether. In the follow ing we
prove that a DMC is useless iff its matrix has identical rows.
(i) Pro of of s uffici enc y : A ssum e the matrix has identic al row s
p;, . . . ,pl. Then , for every output Yj,
Figure 5.1
Y,
Y2
Figure 5.2
IVIUlUdlllllUlllldL IUl 1
v lldJJ . ;,)
K K K
k= l
k=l
Therefore, the input and output are independent, irrespective of the input distribution.
(ii) Proof of necessity: Assume the rows of the matr ix are not identical.
As a result, there exists a column, say lo, whose elements are not identical.
Assume p(YjO IXkO) is the largest element in this column . Then, for a uniform input distribution,
K
p(yjO)
2:P(Xk)P(YjO!Xk)
k= l
2:P(YjOI Xk)
= -
<
P(YjOIXkO)
k= l
That is, x and y, at least for the uniform distribution of x, are not independent
or, what is the same, the channel is not useless .
llx
= 0) + P(x = I)P(y = 0 Ix = l )
= -+ - =s
P(z = 0) = I - P(z = I ) = I - e
H(z)
- s log e - (I - e) log (I - e)
bits/symbol
For a given output sequence y<lly(2) . . . , the receiver can exactly reconstruct the
input sequence X(l\ (2) . . only when the observer output Z (I )Z (2 ) . is made
available. It is therefore natural to define the rate of transmission of information
Sec. 5.4
55
Figure 5.3
over the channel R as the rate of generation of information H(x) minus the rate
of generation of supplemental information H(z) .
= H(x) - H(z)
bits/symbol
For example, if the input data are generated at the rate of 1000 symbols/second
and e = 0.01, we have
H(x)
H(z)
One may argue that in the long run, since e = 0 .01, only 1 percent of the
transmitted bits will be in error (with probability close to 1), and therefore the
rate of information transmission must be 990 bits/second. The answer is that
the knowledge of the number of erroneous bits alone is not sufficient for the reconstruction of the data. One also needs to know the location of the erroneous
bits, and for this reason the information transmission rate is actually at the
lower value of 919 bits/second.
In the extreme case when e = 0, we have H(z) = 0 and consequently
R = 1000 bits/second. On the other hand, if e = then H(z) = 1, resulting in
zero transmission rate . Both conclusions are consistent with our expectations.
For the binary symmetric channel with equiprobable inputs, it is easy to
verify that H(z) = H(x Iy). In general, we will show that exact reconstruction
of the input sequence from the output sequence is possible only when the observer can generate supplemental information at a rate greater than or equal to
H(x Iy) . To see this intuit ively, observe that for long sequences of length N
there are roughly 2NH (x 1y) input sequences that can produce a particular output
sequence . Only when the supplemental information is generated at the rate
of H(x Iy) or faster does it become feasible to distinguish among these possibilities. For this reason, H(x Iy) is usually referred to as the equivocation of
the channel.
The remainder of this section contains a proof of the preceding statements. Specifically, we prove that, in general, exact reconstruction is possible
if H(z), the entropy of the supplemental information, is allowed to be larger
than H(x Iy), the equivocation of the channel.
Consider, a DMC with matrix [p(Yj I xdJ and input distribution
{PI , ... ,pd The observer is capable of observing both the input to the chan-
...u
\..1li;!fJ
::>
nel x, and the corresponding output Yj' For every output letter, the observer has
a code for the input alphabet. Upon observation of Yj, the observer transmits
the code word for Xk, chosen from the code that corresponds to Yj' Table 5.1
shows the code-word lengths for all the input-output pairs (Xb Yj)'
Since code words from different codes may now follow each other, it is
no longer enough to require that all the J different codes be uniquely decodable. Prefix-free codes, however, are sufficient to guarantee unique decoda bility of the encoded sequences, and attention will be confined to them in this
discussion . The following theorems concern the average code-word length
generated by the observer.
Theorem 1.
alphabet.
Proof
H(xly) -
n In D
= -
LP(XbyJ In P(Xk/Yj) -
=L
k
:s;
L
j
L
k
LP(XbYj) In[
j
I]
~ -In kj )
j= !
k= 1
~ -I"k})]
P x, Y1
LP(Xb Yj) [
k
P Xk Y1
= LP(Yj) L o
LllkjP(XbyJ In D
Since all codes are prefix free, they satisfy Kraft's inequality: Lf=l D for all j. Therefo re,
H(x ly) -
n In D
:s;
Il kj
:s;
completing the proof. Equality holds iff - Iogv P(Xk IYj) is integer for all k, j .
Theorem 2. There exists a set of prefix-free codes for which the average code-word length satisfies
H(xly)
log D
n < - --+1
TABLE 5.1
YI
Yz
YJ
XI
II"
1112
IlIJ
X2
1121
1122
1l 2J
XK
Il KI
lin
Il KJ
Sec. 5.5
Channe l Capacity
57
:5
L P(XkIYj)
2:
for every j . Therefore , a prefix-free code can be constructed for each Yj . Since
nkj < - log p(xklYJ + 1, we have
Ii
LP(xb Yj)nkj
< -
L
k
H(xl y)
log D
LP(Xb yJ
j
HD(xly)
:5
Ii < HD(xly) +
where N is the length of the encoded sequences . It is seen therefore that the
minimum rate of supplemental information required for exact reconstruct ion of
the input at the receiver is equal to the equivocation H(x Iy) of the channel; this
is what we had initially set out to prove.
5.5
CHANNEL CAPACITY
In Section 5.4 we saw that the rate of transfer of informati on over a DMC can
be defined as
where H(x) is the rate of generation of information at the input and H(x Iy) is
the channel's equivocation . It is seen that R is equal to the mutual information
l ex, y) between the input and output of the channel. l ex, y) is, in general, a
function of input probability distribution {p" . . . ,pd. It is possible , therefore,
to find a distribution that maximizes l ex , y). [Since the set of possible distrib utions is closed and bounded, l ex, y) has a maximum rather than a least upper
bound on this set. ] The maximum value of l ex , y) is defined as the channel capacity C and is a function of the channel matrix alone.
C
= Maximum
vila.., .
;.J
For this class of channels, H(y Ix) was shown to be independent of input
distribution. To maximize lex, y) = H(y) - H(y Ix), therefore, we have to
maximize H(y) over the input distributions . The sample space of y has J elements, and the maximum possible value of H(y) is log J, which is achieved for
uniform distribution of y. This, in tum, is achieved by uniform distribution of
x, the reason being that the columns of a symmetric channel 's matrix are permutations of the same set of numbers q;, . . , ,q~ and, consequently,
K
p(Yj)
2,P(XbYj)
k~ l
2,p(xk)p(Yjlxd
k~ l
2, q~
k=1
log J
+ 2, P; log P;
j= l
+ e log, e + (l - e)
logi l - e)
bits/symbol
Figure 5.4 shows a plot of C versus e for the BSC. At e = 0 the capacity is
1 bit/symbol; that is, all the information is being transferred. As e ~ 0,5
the channel becomes less and less efficient, and the capacity is reduced until
at e = 0.5 the channel becomes useless. For 0.5 < e :s I, the efficiency increases again, since a channel that lies consistently is just as informative as a
consistently honest channel.
0,5
Figure 5.4
Sec. 5.5
Channel Capacity
59
Lossless Channel
= O. Therefore ,
C = Max{H(y)
- H(y Ix)}
log J
where J is the size of the output alphabet. The capacity is achieved by assigning the probabilities 1/ J to each group of input letters that correspond to the
same output letter and distributing this probability arbitrarily among members
of each group .
Useless Channel
= H(x).
Therefore,
Figure 5.5
ou
I V I U l U dl ll ll U l l Jl d ll U l 1
vlldJJ .
and 2J~1 P
I, the space of input distributions is a subset of the K dimensional Euclidean space . It is not difficult to show that the set of all possible input distributions in this space is a convex set. Figure 5.5 shows this set
for the special case of K = 3. For simplicity of notation, let us refer to I(x, y)
as I(P), where P = {PI, . . . , pd is a po int in the space of input probability
distributions . We prove that, for every pair of points Ph P 2 and for every real
number A in the interval [0 ,1], the relationship I[AP, + (l - A)P 2] 2:
M(P I) + (I - A)/(P2) is valid .
Proof
M(P,) + (I - A)/(P2) - I[AP, + (I - A)P2]
p(Yjlx;)
= AL LP,(x;)p(Yjlxi) log
()
; j
PI Yj
p(Yj lx;)
+ (I - A) L L P2(x;)p( v, Ix;) log ( )
; j
P2 Yj
p(Yjlx;)
- L L [API(X j) + (I - A)pix;)]P(Yj lx j) log A ( ) + (1 - A) ( )
j
j
PI Yl
P2 Yl
I
; j
og
Ap,(yJ + (1 - A)P2(yJ
( )
P, Yj
:S ALPI(Yj) [APJYj)
+ (1 -
P2 Yj
+ (1 -
A) LP2(Yj) [API(Yj) + (1 j
A)P2(Yj) - IJ
pJYj)
A)P2(Yj) -
1]
P2(Yj)
PROBLEMS
5.1. Prove that conditioning cannot increase entropy; that is,
H(xIY,z):S H(x IY)
Chap . 5
Problems
61
5.2. Let x be a discrete random variable and f( ) an arbitrary function. Define the random variable y as y = I(x ).
(a) Prove that H (y Ix) = o.
(b) Show that H (y ) :S H (x ). Under what conditions will equality hold?
5.3. Show that l ex , y) = H (x) - H (x Iy) can also be written as
l (x ,y) =
2: 2:p(x
j
i
Yj ) log
P Xi P Yj
5.4. Let X, Y, and Z be ensembl es with two element s in each ensemble so that the
eight elements in the joint XYZ ensemble can be taken as the vertices of a unit
cube.
(a ) Find a joint pr ob ab ility assignment p (x , y , z) such that l ex, y ) = 0 and
lex , y I z) = 1 bit.
(b) Find a joint probability assignment p(x , y, z) such that l ex, y) = 1 bit and
l ex , y !z) = o.
5.5 . A source has an alphab et of four letters . The probabilities of the letters and two
possible binary codes for the sou rce are given next.
Letter
peak)
Code I
Code II
a,
a2
a,
a4
0.4
0.3
0.2
0.1
1
01
001
000
1
10
100
1000
For each code, what is the average mutual information provided about the source
letter by the specification of the first letter of the code word?
5.6. Consider the ensemble of seq uen ces of N binary di gits, x , . .. , XN. Each
seque nce containing an even number of I's has probability Z- N+' , and each
seque nce with an odd number of I 's has prob ability zero. Find the average
mutual informat ions
l (x2, x ,),l(x3, x2!x, ) , . .. ,l(XN, XN- l l x" . . . ,XN- 2)
IVIUW 81 Information
UL
I.-n ap . ::>
1.
3/4
Figure PS.9
each of the following classes of channels, give an example of the channel maand calculate the capacity.
Lossless but not deter ministic and not symmetric.
Deterministic but not lossless and not symmetric .
Lossless and determini stic.
Symmetric and lossless but not determi nistic .
Useless and determini stic.
5.11 . Consider a DMC with x and y as the input and output random variables, respectively. Show that H(y), the entrop y of the output, is a convex n function of the
input probability distribution and that H(y Ix) , the conditional entropy of the output give n the input, is a linear func tion of the input probability di stributi on.
Combining these results , show that I(x , y) is convex n on the space of the input
probability distribution.
6
CHANNEL THEOREMS
6.1
6.2
6.3
6.4
6.5
6.6
6.7
6.8
I XbY = .6 P
j =1
(v,I)
P(Yj IXk)
Xk log
( .)
P Y1
vl lUlll lvl
I I l v V
vlll~
.... I I U p .
l(x , y)
L p(xk)l(xb y)
k= 1
In this section we derive a useful relationship between the channel capacity C and the conditional mutual information. The following theorem is helpful
in this derivation.
af(p *)
---a;;:-
Y,
af(p*)
apk -
-- < y
Proof
(i) Sufficiency: Assume that for some P * and y the preceding relations
are satisfied. We show for every P that f(P) - f(P *) ~ and therefore prove
that P * maximizes f(P) . From convexity
Af(P)
(l - A)f(P *)
f[AP
+ (I - A)P*]
Therefore ,
f(P) - f (P *) ~f[P *
f(P) - f(P *)
af(p *) (Pk - p t)
apk
Now, if p i #- 0, by assumption af(p*) japk = y. If p i = 0, then af(p *)j
apk -s y and Pk - pi ~ 0, yielding (af(p *)japk)(Pk - pJ) ~ Y(P k - p i).
Consequently,
k= 1
f(P) - f(P*) -s
#1 Y(Pk -
pt)
= Y(~Pk
~pt)
(ii) Necessity: Let P> maximize the functionf(P) . Then for every P and
every A in the interval [0, 1]
f[P *
Sec. 6.1
65
~ af(p *) (
L..-
k=!
apk
p,
*) < 0
Pk -
Since P * is a probability vec tor, at least one of its components must be strictly
positive . For simplicity of notation , let p i > O. Let P = ( p i - e, p f, . . . ,
p J"o + s , . . . ,pi) ; th at is, P differs from P in only PI and PkO. Any e satisfying -p to :5 e :5 p i is acc eptable . Now
> 0,
If p J"o
have
ap,
apkO
apkO
:5
apt
af(p *)
af(p *)
apkO
apt
af(p *)
af(p *)
apkO apt
- - <--
Taking y = af( p *)/ aph we see that the conditions of the theorem are satisfied .
The proof is thus complete .
Theorem I is now applied to the mutual information l ex, y) betw een the
input and output of a DMC . l ex, y) is a convex n function of the inpu t distribution P = (Ph' . . ,PK ), and its partial derivative with respect to P is obtained
as follo ws:
al(p) _ a ~ ~ (
) I P(YjIXi)
L..- L..-P Xi,Yj og
apk
apk ;~ 1 j~t
p( Yj)
a K J
p( Yjlx;)
= - LPiLP(Yjlxi) log ----='---'-'-"-'---;-apk ;= t j =1
L mPmP( YjJxm)
_
J
p(Yj lxd
K J
P(Yj IXk) log e
- LP(Yj lxk) log
( .) - LPiLP(Yjl x;)""
( .1)
j= t
P Y1
;= 1 j =1
L.mPmP Y1 Xm
-- - -
Here y' = y
l(x b y)
= y',
l (xk, y)
:5
y ',
t- o
for all k such that P t = 0
for all k suc h that P
l ex , y)
L P t i.. y)
x,
- - - - - - ---:::;:ooe
Yl
-------........:~ Y2
Figure 6.1
Theorem 2. A necessa ry and sufficient condition for an input distribution {PI , .. . ,PK} to achieve capacity is that, for some number C ,
l(xk' y) = C ,
l(xk ' y) :s C,
Since l(x2 'Y) < l (x j,y) = l(x3'y) , the conditions of theorem 2 are satisfied . The assumed distribution therefo re achieve s capacity.
6 .2 CASCADED CHANNELS
_I
D MC1
-I
Figure 6.2
D MC2
Sec. 6.2
Cascaded Channe ls
67
'" L..
'" L..
'" P(xi; Yj, z; ) In P(Xk
I (x, z) - I (x, Y) -_ L..
( IZm)
)
k j m
P Xk
'"
'" L..P
'" (XbYj , z; ) In P(Xk
IYj)
L.. L..
()
k
j
m
P x,
-s
L.. L..
Now P(XkIYj, zm) = P(XkIyJ since the knowledge of z.; does not provide additional information about x, once )'j is identified . This is a consequence of the
definition of cascaded channels . Therefore,
I(x,z) - I(x,y)::s
LP(Yj,zm)LP(Xkl zm) - 1 = 1 - 1 = 0
m
Equality holds iff P(Xk IYj) = P(Xk Iz.,) for all triplets (x, ; Yj, z.,) with nonzero
joint probability. If the second channel is lossless, this requirement is obviously
satisfied, making I(x, y) = I(x, z). Losslessness , howeve r, is not a necessary
condition for the equality of I(x, y) and I(x, z), as shown in the following
example.
Example
For the cascaded channels in Figure 6 .3
P(Xl !Yl)
= p(xl l zl) =
I,
Also ,
(t)P(X l)
(1) [ I - P(Xl)]
2P(Xl)
3 - P(X l)
p
(X l I22 ) =
P(XI ,Y2,22)
(t)p(X))
2p(x))
3 - p(x))
Therefore , pix , Y2) = p(x I Z2) ' Other equalities ca n be checked similarly. Note that in
this example lex , y) = l ex , z) for all inpu t distribution s. Th ere are examples where the
equality hold s for only some of the distributions.
z,
Y,
Y3
Figure 6.3
PI ,j
P I ,j +1
PI,J
P2 ,l
P2 ,j
P2 ,j + 1
PZ,J
PK ,I
PK ,j
PK ,j+1
PK ,J
and if the elementary reduction is applied to letters Yj and Yj + I, the matrix of the
reduced channel will be
P 2,j
+ PI ,j+l
+ P2 ,j + I
PZ,J
PK ,j
+ PK ,j +1
PK ,J
PI ,!
P I,j
P2 ,1
PK , I
PI,J
Sec. 6.3
--- ---
Xl
69
z,
Y3 I---">----t- - - - - - -
DMC
YJ f---<o--+- -
- - - --
Figure 6.4
P(XkIYI) = P(XkIZI) for all x, that can reach Y" and P(XkIyz)
Xk that can reach Yz. Now
p(xklzl)
P(Xk)P(Zll xk)
P(ZI)
_ p ( YI)P(XkIYI)
P(YI)
= P(XkIZI) for
all
+ p ( Y2)P(XkIY2)
+ P( Y2)
P(XkIy .) J
p(xk)p( Yj lxk)
L ;p(x;)p(Yjlx;)
It is easy to verify that this relation will hold for all input distribution s iff
p( YII Xd ! P(Y21xd is a constant for all k , that is , iff column s I and 2 of the
channel matrix differ by a multiplicative constant. The two columns can then
be comb ined and the resulting reduction is a sufficient one .
IV
l..rl dl ll l t:::= 1
1 1 1t:::=Urt:' ll l ~
.... "Cll-' . 0
Example
The matrix
[i t
1
12
0]
kI
I
6
I
2
can be reduced to
0]
kk
["4 "4
1
'2
and then to
Both reductions are sufficient, and therefore the final two-output channel is as good as
the original four-output channel.
6.4
Consider the cascade of two discrete memoryless channels shown in Figure 6.5. The second channel is a special channel whose output alphabet Z has a
one-to-one correspondence with the first channel's input alphabet X. DMC2
can be interpreted as a decision device, which, upon observation of Yj, decides
which input letter Xk has been transmitted . In general , the decision is not free
from errors, and the probability of error PE is given by
K
PE =
2: 2: P(Xb 2
m)
k=l m= 1
m=;r6k
+ PE log(K - 1)
H(x I z) - H(PE )
PE log(K - 1)
K K K
- 2: 2: P(Xb 2
m)
k=l m=l
log P(Xk
12 +
m)
2: 2: P(Xb 2
m)
k= l m = 1
k"' m
log PE
2::
Sec. 6.5
.1
DMCl
71
y 1
DMC2
Figure 6.5
Con sid er a binary symmetric ch annel (BSC) with cro ssover probab ilit y
e = 0.01, where data are transmitted at the rate of 1 bit/ second . If each bit is
repeated three times during transmission, the useful rate will obviously drop to
1 bit every 3 seconds but the error rate will improve . This scheme is capable of
correcting for one transmission error if the receiver applies the majority rule in
decoding 3-bit sequences. The majority rule is to decide 1 if the majority of bits
are l's and to decide 0 otherwise. The probability of error per information bit is
now 3(1 - e)e2 + e3 = 2.98 x 10 - 4, which is better than the original value
v lICl I Il It:1
11 1 t: U I t:I I I ~
v llCltJ . U
of 0.01. It is possible to improve the error rate even further by repeating each
letter 2N + 1 times, but in the limit when N becomes large the rate approaches
zero and the communication system becomes useless .
The Hamming distance between two equal-length binary sequences is defined as the number of positions in which the sequences differ from each other.
The distance between 001011 and 001101, for example, is 2. The important
feature of the repetition scheme just described is that the input letters, 0 and 1,
which have a distance of I from each other, are replaced with input sequences
II . . . 1 and 00 . . . 0, which are a distance of 2N + 1 apart. This feature allows
the channel to commit a maximum of N transmission errors without hindering the ability of the receiver to make a correct decision . The setback is, of
course, the reduced transmission rate.
The essence of the noisy channel theorem of the next section is that it is
not necessary to have excessively small data rates in order to have arbitrarily
accurate communication . The trick is to handpick a set of appropriate sequences from the input alphabet and to confine transmitted signals to these sequences. The distance between each and every pair of selected sequences must
be large to ensure that a large number of transmission errors can be tolerated.
At the same time, the number of these sequences must also be large so that a
reasonable data rate is maintained .
The probability of error is a function of the decision scheme used at the
receiver. This is a decision as to which input sequence should be assigned to a
given received sequence . Before proceeding to the noisy channel theorem, it is
appropriate to say a few words about the decision schemes . Let Wi = XilXi2'
XiN and Vj = Yj1Yj2 ' . YjN represent the input and output sequences, respec tively. Wi must belong to the set of selected code words, whereas Vj can be any
output sequence . The optimum decision scheme upon observation of Vj is to
choose the particular Wi that maximizes the backward channel probability
p(Wi I Vj). This, however, requires the knowledge of the input distribution
{peW;)}. A suboptimum scheme that is generally used, known as the maximum
likelihood (ML) scheme, chooses the sequence Wi that maximizes the forward
channel probability p(Vj I W;). Since p(Wi I Vj) = p(W;)p(Vj I Wi)/p(Vj), it is easy
to show that the ML rule is optimum in the special case where the Wi'S are
equally likely.
For the BSC with crossover probability e < 0.5,
p(Vj IW;) = eD(I - et- D
where D is the number of positions in which Wi and Vj differ (i.e., the Hamming distance between them). Consequently, the ML rule requires minimization of D (i.e., choosing the code word Wi that is closest to the received
sequence Vj in the Hamming sense). Although this is not an optimum decision
rule, it is sufficiently good for the fulfillment of the promise of the noisy channel theorem.
Sec. 6.6
73
Consider a BSC with crossover probability p . The channel capa city in unit s
of bits /s ymbol is C = I - H(p ) , wh e re H (p) = -p log p - (l - p )
log(l - p ) . Ass uming that ea ch sy mbo l ha s durat ion T , the ca paci ty in
bits/ second is C = [I - H(p )] /T . Consider a source that creates information
at the rate of R bits/second for transmission over the channel. The noisy channel theorem states that, as long as R is less than C , communication with negligible error rate is possible. This is achieved by encoding sequences of data into
certain sequences (code words) from the channel input alphabet and treating the
code words as elementary channel inputs. Any degree of reliability is obta ined
provided that these code words are (I ) long enough and (2) far from each other
in the Hamming sense.
The first step in the proof of the theorem is the derivation of error probability P, for a given set of code words {WI. ' . . , WM } . Each word has length N ,
and the total number of words is given by
M = 2NRT
The justificat ion for the preceding equation is that each word has duration N T,
and therefore the total number of bits produced by the source in this interval is
NRT . This corre spond s to 2NRT source sequences , each of which must be assigned a unique Wi'
In the second step , P e is averaged over all possible code-w ord sets of
length N and size M to yield Fe. This is equivalent to random ly selecting M
code words from the set of 2N possible binary sequences and calculating the
corresponding error probability. The random coding argument enables us to obtain a closed form for Fe' which is then used to identify the condit ions under
which Fe can be made arbitrarily small. Finally, we argue that, since Fe is the
average error probabi lity over the set of all possible codes {WI, . . . , WM}, there
must exist at least one code for which P, is less than or equal to Fe. We thus
conclude that under certain conditions a set of code words exists for which the
probability of error is arbitrarily small. These steps are now described in detail.
Assume the code {WI. ... , WM}, consisting of code words of length N , is
used over a BSC with crossover probability p . Since the channel is memoryless, the average number of transmi ssion errors for each Wi is Np . Given two
arbitrar y positive numbers e and 0, the law of large numbers asserts that for
sufficiently large N the probabi lity of receiving a sequence Vj whose distance
from Wi is larger than N( p + e) is less than o. Define the test set as the set of
all binary sequences of length N whose distance from Vj is less than N(p + e) .
The decision strategy is the follow ing: If a code word such as Wi belongs to the
test set and no other code word is in the test set, annou nce Wi as the transmitted
sequence; otherwise, announce that an error has occurred . An error thus occurs
if (I ) the transmitted word Wi is not in the test set or (2) if some other code
word , say Wj , is in the test set. An upper bound on error prob ability P, is therefore given by
M
2: P {W
P, ::; 0
j= 1
j"' i
The average of P, over all possible code s of length N and size M can now
be calculated using the rand om coding argument. Assume the code words
WI, . . . , WM are cho sen independentl y and with equal prob ability from the set
of all possible sequences of N binary digit s. (The random selection is done with
replacement ; therefore , the possibil ity of two or more identical code words in a
given code exists. ) The probability that Wj belongs to the test set is now equal
to the number of sequences in the test set divided by the total number of sequence s of length N , the reason being that Wj is selected independently of Wi
and according to a uniform distr ibution .
N( p+c)
~2
(N)
k
k -O N
The summ ation on the right side gives the number of sequences of length N that
differ from a given sequence in at most N( p + e) places. Noting that the preceding is independent of i . the upper bound on the average error probability can
be written as
.
Pe
(N)
N(p+"j
::;
+ (M - 1)
2:
k~O
k
N
~ (N)
k=O
::; 2
NH
(a ) ,
Q'
<
0.5
we have
Pe
::;
0+M .
rN[ I - H (p +c)]
Since e and 0 can be arbitr arily small, it is seen that, as long as R is less than
[I - H ( p) ] /T, the average error probability can be made arbitrari ly small by
increasing N. As we have argued earlier, a small Pe means that at least one
code exists that has a small error proba bility P e . The proof is thus complete.
Sec. 6.7
75
Con sider a DMC w ith input alphabet {Xl, .. . , XK}, output alphabet
{y" ... , YJ}, transition matrix [pC Yj IxdL and capacity C. All input-output letters have equal duration 7 . The input distribution that achieves capacity is denoted P x* and, without loss of generality, we assume that p *(xd = 0 for
I :S k :S K. [If P *(Xi) = 0 for some letter, omit Xi from the list of the input
alphabet. The omission is of no consequence since the promise of the noisy
channel theorem can still be fulfilled.] The output distribution corresponding to
P ;'" is denoted p }*. If an information source generates data at the rate of
R bits/second with R < C, then the noisy channel theorem states that the data
can be transmitted over the channel with neglig ible probability of error. The
proof is similar to the one given for the BSC in Section 6.6, and only major
differences are outlined in this section .
First, a general distance between sequences must be defined since the
Hamming distance is only applicab le to the binary case. The maximum likelihood rule compares p(Vj IWi) for all code words Wi upon arrival of Vj and
chooses the code word with the largest forward probability. The following definition for the distance between Wi and Vj therefore seems reasonable:
p *(V)
D(Wi , Vj) = log p(Vjl ~i)
p * (Vj ) is the probability of Vj when the input letters are chosen independently
P:.
p(V j WiO)
= -
p(Vj WiO)
= TI p(Yjll IXii.)
n= !
76
Chap. 6
p *(Vj )
TI P *(Yj,,)
Il = I
Therefore ,
D WiO, Vj
)=
~ ~
- LJ LJp
,,= 1 j = l
P(Yj !Xi,,)
*( .)
P Y1
is the nth letter of W iO, and J is the size of the output alphabet. The inner
summation is the conditional mutu al information for Xi" and, according to a previous theorem , is equ al to C, indep endent of Xi". Thu s
X i"
D (WiO, Vj )
- NC
for all i o. The follow ing decision scheme is now equivalent to the one used in
Secti on 6.6 . Given a received sequ ence Vj , calculate the distance between Vj
and each code word Wi' If there is one and only one code word with a distance
less than - NC + N e from Vj , choo se it as the tran smitted sequence . Otherwise, announce that an error has occu rred . The probability of error P, upon
transmi ssion of W i is now written as
M
P, ::s
a + 2 P{W
j= l
i foi
Using the random coding argument, P, can now be averaged over all poss ible
code-word sets. The rando m select ion of code words, however, is according to
the capacity-achieving distribution
P:.
P {Wj
= 2 p*(V)
p *(W)
all V
all IV with
D(IV, V)< -N(C-e )
P *(V )p *(W )
All pairs (W, V) with D (W, V) < - N(C - e) satisfy the followi ng:
p *(V)
log p(V IW)
< - N(C - e)
Therefore ,
N(C - e)
Consequently,
2
all pairs .
N(C - e)
2
all pairs .
P*(W, V) ::s
N(C - el
Sec . 6.8
Converse Theorem
77
+ 2 - N(C-e- R.)
Since e and 8 can be arbitrarily small, the average probability of error approaches zero with increasing N, pro vided that Rr < C. A small Fe means that
at least one appropriate code exists, and thus the proof is compl ete .
6.8 CONVERSE THEOREM
The noisy channel theorem states that a source that generates information at a
rate below channel capacity can be transmitted with negligib le error probability.
The con verse theorem is concerned with the opposite situation and states that at
rates above channel capacity arbitrarily accurate communication is impossible .
There are several versions of the converse theorem in the literature. The strong
convers e theorem asserts that at rates above capacity the probability of message
error approaches I with increasing length of the message sequences . This version has only been proved for block codes and, in addition , does not pro vide
any inform ation about the probability of single-letter errors. Another version of
the theorem states that the average probability of error per source letter is
bounded away from zero when the source maintain s a rate above channel capacity. In other words , the probability of error per source letter (on the average)
cannot be made to go to zero no matter what coding techniqu e is emplo yed .
These theorems will not be prov ed here . Instead, we consider a simple
version of the converse theorem that is based on Fano 's inequ ality. Consider
M messages WI, ' .. , WM of length N that are used with equal probabi lity for
transmission over a DMC. If the input ensemb le is Wand the output ensemble
V, then
H (wlv)
-I(W,V)
+ H(W) =
- I (W , V)
logM
But
I(W , V)
= H(V)
- H(V IW)
H(V) -
11=
Therefore ,
H(WIV)
2:
NC
+ log M
Therefore,
-NC
1)
Replacing M with 2
+ H(P e )
::;
P, log M
log 2
yields,
RT
NRT
- - - --
PROBLEMS
6.1. The channel shown in Figure P6. I is known as the binary erasure channel. (A bit
transmitted through this channel may be erased but can never be received in
error.) Asuming that P(x = 0) = Po, calculate I(x, y) in terms of Po. What is the
channe l capacity?
x
1-
Figure P6.1
6.2. Suppose that the transmitter in the binary erasure channel of Problem 6.1 is connected to the receiver via a feedback loop and can repeat the erased bits until they
are correctly received . Assuming that the input stream of data is i.i .d ., determine
the average number of transmitted information bits per use of the channel.
6.3. Find a discrete memory less channel together with an input distribution such that
y(N) is independent even though the corresponding
the sequence of outputs
sequence of inputs x(l) . . . X(N) might be dependent.
r'" ...
6.4. In ajoint ensemble XY, the mutua l information I(x i, Yj) = log [p(xi' Yj)/p(x;)p(Yj)]
is a random variable . In this problem we are concerned with the variance of that
random variable.
(a) Prove that Var[I(x;, Yj)] = 0 iff there is a constant a such that, for all Xi, Yj
with P(Xi,Yj) # 0,
P(Xi,Yj) = ap(x;)p(Yj)
Problems
Chap. 6
x
79
x
a,
b,
Figur e P6.4
6.5. A DMC is ca lle d additive modul o K if it has the inp ut and output alphabet
0, I, .. . , K - I and the output Y is given in terms of the input x and a noise variable z by Y = x Ell z. The noise variable takes on values 0 , I , . .. , K - I and is
statistically independent of the input , and the addition x Ell z is taken modulo K
(i.e ., as x + z or x + z - K , whichever is between 0 and K - I).
(a) Show that I (x , y) = H (y ) - H (z) .
(b) Find the capacity in terms of H (z) and find the maximizing input distributi on.
6.6. Two discrete memory less channels A and B are connected in cascade . Assuming
that MA and MB are the individual channel matrices, determine the matrix of the
cascade channel.
6 .7. Consider a DMC wi th inp ut alphabet {Xl ,' " ,XK} and ou tput alphabet
{ Yb ' . . ,y}} . Given the input distribution P(Xk) for I ::; k -s K and the channel
transition matrix [P ( Yj IXk)], the optimum detection rule is as follows: For a reIYj) be the
ceived letter Yj, choose x, such that P(XkIYj) is maximum . Let
probability of error when Yj is received.
(a) Express p eEIYj) in terms of the backward channel probabilities P(XkIYj)'
(b) Show thatH(x IYj ) 2: 2P(EIYj) when entropy is in bits [see Problem 2.7(d)].
(c) Prove the following upper bound on the total probability of error, peE), in
terms of equivocation:
ro:
(in bits)
i,
If p(x, ) = ~ and P(X2) = P(X3) = find the optimum decision scheme and calculate the associated probability of error.
6.9 . A source produces statistically indepe ndent, equally probable letters from an alphabet (ab a2) at a rate of one letter each 3 seconds. These letters are transmitted
over a binary symmet ric channe l that is used once each second by encoding the
source letter a , into the code word 000 and encod ing a2 into the code word Ill.
If, in the corre sponding 3-second interval at the channel output , any of the se-
Channe l Theorems
80
Chap . 6
6.10. Let the discrete memoryless channels A and B have matrices M A and M B and capacities CA and CB , respectively.
(a) The sum of the channels is a new channel with the following matrix:
0]
MA
[
MB
2c =
2CA
+ 2 CB
(b) The product of the channels is defined as the channel whose inputs are the
pairs (Xia, Xjb) and whose outputs are the pairs (Yma, Yllb), where Xia and
Yma are the input and output letters of channel A, while Xjb and Yllb are the
input and output letters of channel B. The elements of the channel matrix are
given by
a
(b)
(I --p)
2
(l -2
-P
I - : -
(3)
P)
Chap. 6
Prob lems
81
6.12. Consider a BSC with crosso ver probab ility I> and a DMS that produces binary
data at the rate of R = 3 bits per channel symbol [p(O) = p(l ) = kJ. The following two strategies for the reprodu ction of source data at the channel output are
suggested:
(i) Of every 3 bits produced by the source , transmit I bit and at the channel output reconstru ct the other 2 bits by coin flipping .
(ii) For every 3-bit block of source data, transmit I bit that represents the majority and at the channel output repeat each received bit three times.
Compute the average probability of error in each case and compare the two strategies. Which is better ?
6.13
Y,
x2
Y2
x3
Y3
Y4
Ys
Figure P6.13
6.14. Consider a binary channel in which noise affects the input in blocks of seven
symbols. A block of seven is either transmitted without error or exactly one sym-
bol out of seven is incorrect. These eight possibilities are equally likely. Define W
and Vas ensembles of the 7-bit input and output sequences, respectively.
(a) Calculate H(V IW).
(b) What is the maximum value of I(W, V)?
(c) What is the capacity of the original binary channel?
6.15. Consider all binary sequences of length n = 7. We wish to choose M code words
WI' ... , WM for use on a binary channel. To correct all single errors that occur
during transmission , the Hamming distance between all code words Wi and Wj ,
i 'I' j , must be ~ 3.
(a) How many binary sequences have a distance less than or equal to I from a
given sequence Wi?
(b) What is the maximum possible value of M ?
(c) Assuming that the maximum value of M can be achieved, what is the rate of
generation of information that can be transmitted over the channel?
7
RATE-DISTORTION THEORY
7.1
7.2
7.3
7.4
84
Chap. 7
Con sider a DMS with alpha be t {a I , . . . , ad and pro bab ility di str ibu tion
. .. , pd Sequ ences of length N from thi s source will be encoded into
sequences of length N from the reprodu cing alphabet {bl> .. . , bj } . The singleletter distortion measure is defined as d(ak' bj ) , which assigns a real, nonnegative value to every pair (at, bj ) . Let Wi = ailai2. . . aiN be a source sequence
whose reproduced version in a give n encoding scheme is Vj = bj )bj 2 bjN
The distortion per letter of Wi is then give n by
{p 1,
1 N
L dia .; , bjn )
d(W i , Vj ) = -
(7. 1)
n= 1
The average distortion will then be the statistical average of d(Wi , Vj ) over all
source sequences; that is,
p(Wi)d(W i , Vj )
(7.2)
all IV;
In general, a code consists of M code words {V I , ' . . , VM } , and each Wi is assigned a code word Vj such that d(W i , Vj ) is minimi zed . The average distortio n
is given by Eq. (7 .2) , and the rate of the code is defined as
I
N log2M
(7.3)
in bits per source letter. A code is said to be D-admissible if the average distortion d associated with the code is less than or equal to D. The smallest value of
R amo ng all D-admissible codes is denoted R(D ) and is know n as the ratedistortio n function.
The minimum distortion is achieved if each letter a, is assigned a letter bjO
with the property that d(ab bjO) :5 d(at, bj ) for all j . The minimum distortion is
then given by
K
dmin =
L p(ak)d(ak' bjo)
(7.4)
k ~1
dmin will be zero if for every a, there exists at least one bj such that d(at, b) =
0; otherwise, dmin will be strictly positive . In any event , R(D ) is defined only
for D 2: dmin For D =
each letter a, is assigned a certain bjO' In the worst
case, all the bjo will be different and the code will have a rate equ al to the
source entropy . Thus , in general ,
K
R(dOlin )
:5 -
L Pk log Pk
(7.5)
k= )
The function R(D ) is a non inc reasing function of D . To see this, let D ) >
D 2 2: dmin Then the set of Dr admissible codes will be a subset of the set of
D)-admissible codes and, consequentl y, R(D ) :5 R(D 2) .
Sec. 7.1
85
dma"
_
d
2: p(Wi )d(W
j ,
yo)
=-
all Wi
n= I
(7 .6)
To minimize d , one must minim ize the inner sum in Eq. (7.6) by choo sing the
proper bOil" This choice is independent of n, however, and thus the minimum
value of d for a code of size M = I is
K
min
)
2: peak)d(ah bJ
(7.7 )
k= l
This is the smallest value of D at which R(D ) become s equal to zero, and we
denote it by d max .
From the foregoing discu ssion it is co ncluded that the rate -di stortion
function R(D ) is a nonincreasing funct ion of D , and the interv al [dmin , dmaxl
whose boundaries are defined by Eqs . (7.4) and (7.7) is the range of interest.
Example 1
Consider a memoryless binar y source with alphabet {D, I} and probabili ty distribution
{Po, I - Po}, where Po :::; ~ . The reprod ucing alphabet is also {D, I}, and the distortion
measure is given in Figure 7.1 . The minimum distortion is achieved when D is mapped
to D and 1 mapped to 1; thus dmin = O. This is obviously the case of perfec t reconstruction , and therefore
R (d min) = H(p o) = "P log Po - (l - Po) log(l - Po)
d(O, 0)
=0
d( 1, 1)
=0
Figure 7.1
Reproduct ion
00
ndlt:: -UI:>lUllIUII
~ ( ) (
LJP a, d abbj
k~ l
Since Po :s
! :s
) _ {P O X 0
Po X I
+ (I - Po)
(I - Po)
I II CUI
v llOtJ
I = I - Po,
_
0 - Po,
We are not yet in a position to calculate R(D ) for values of D in the interval [(fmin, d maxL but we shall give an exa mple of methods that are generally
used in data compression . Thi s exa mple will bring out the esse nce of source
coding with a fide lity criter ion and will shed light on the intimate relation
betwee n the noisy channel theorem of Chapter 6 and the fundamental theorem
of the rate-distortion theory. The method is based on the (7,4) Hamming error correcting-code, and we digress at this point to introdu ce this code .
The (7,4) Hamming Code
Consider the circle diagram in Figure 7 .2 where the three circles are drawn such that
their intersections create the regions numb ered I through 7. Using this diagram , it is
possible to construct 16 binary sequences of length 7 according to the following rules:
(l ) Bits I, 2 , 3, and 4 are chosen arbitr arily and placed in the corresponding regions of
the circ le diagram, regions I to 4 . (2) Bit s 5 , 6 , and 7 are chose n such that when
placed in the corresponding regions of the diagram the total number of l 's in each circle
becomes an even number. These are thus the parity chec k bits.
We now show that the 16 seq uences constructed according to these rules form
a single-erro r-correcting code . If a code word were transmitt ed over a BSC and the
receive d sequence satisfied all the parity checks , we would say that no errors have
occ urre d . A single error in bits 5 , 6, or 7 would violate one of the par ity checks, a
single error in bits 2, 3, or 4 would violate two parity checks, while an error in bit I
would violate all three parity checks . In eac h case it is possible to detect the erro neous
bit by consulting the circle diagram. Thus, whether the received sequence is error free
or contains a single error, it is always possible to recover the transmitted sequence .
Now consider the space of all binary sequences of length 7. There are 27 = 128
points in this space . The 16 code words of the Hamming code can be considered as the
centers of 16 nonoverlapping spheres of radius I in this space . Each sphere contains
the code word plus seven other sequences that have a unit distance from it. The total
number of points covered by the spheres is 16 X 8 = 128, which is all the points that
belong to the space . Therefore the spheres cove r the entire space .
Figur e 7.2
Sec. 7.1
87
"7I
d =
[ 16
128 x 0
112]
+ 128 x 1
8"I
The rate is obtained from Eq . (7.3) with N = 7 and M = 16. Thus R = 0.571
bit/source letter.
The rate distortion function for this problem is known to be (see
Problem 7. I)
= H(p o)
R(D)
- H(D),
0::; D ::; Po
(7.8)
Thus, for Po = 0.5, we have R(l /8) = 0.456 bit/ source letter, which is somewhat better than that achieved by the (7, 4) Hamming code
Example 2
Con sider a DMS with alphabet {a i, a2, a 3} and probability distribution P I = P2 =
P3 =
The reproducing alphabet is {bl , b2}, and the distortion mea sure is given in
Figure 7.3 . The minimum distortion is achieved by assigning al to b-, a2 to b2, and a3 to
either b, or b2 The average distortion is then given by d min = I. To obtain dmax> observe
that
"
( ) d(
L.JP a,
k= I
ai ,
b)
j
{!3
!3
X
X
I +!
3
2 +!
X
X
I +!
3
I +!
X
X
2
I
= .13 ,
b, = b,
b. = b
= .1
3'
and thus d max = 4/3 according to Eq. (7.7). The rate-distortion function for this source
is known to be (see Problem 7 .5)
R(D) =
f [I -
He(D
2- I)], I
:5
:5
in bits per source symbol. Notice that R(dmin ) = ~ is much less than H(1)
which one would obtain by assigning al to b., a2 to b2, and a3 to b. ,
Source
Reproduct ion
a,
b,
= 0 .9183,
Rate-Distortion Th eo ry
88
Chap. 7
In this secti on we define the rate-distortion functi on from a comp letely different point of view. It turn s out, however, that the definition given here will result in the same function R(D ) as descri bed in Sectio n 7 .1 . We continue to use
the nota tions d and R(D ) for the average distort ion and the rate-distortion function , respectively, but the reader should bear in mind that these quantities are
now defined fro m a new perspecti ve and mayor may not have the previo us
connotations.
,a K},
Con sider a discrete mem oryless chann el with input alphab et {a] ,
input probability distribution { Ph ' .. ,pd, and output alph abet {bl ,
, bJ } .
Let an arb itra ry se t of transition prob abilities [p (bj Iad] be assigned to the
channel. We define the average distorti on d associated with this channel in the
follow ing way:
K
(7.9)
k= l j = l
A channel matrix is said to be D- admi ssibl e if the average distortion d associated with that mat rix is less than or equ al to D .
W ith eac h cha nne l matrix is associated an average mutual inform ation
l ea , b ) between the cha nnel input and output. We de fine the rate-distoration
function R(D ) as the minimum of l ea, b ) over all D- admi ssible channel matrices; that is,
R(D)
= all D-admmin
{l ea , b)}
issible channels
(7.10)
Since the set of D-admi ssible channel matr ices is a closed and bound ed set and
since l ea , b) is a real-valued , continuous , and boun ded function of [ p(bj Iad ],
the minimum indica ted in Eq . (7. 10) exists .
It is not difficult to show that the minimum achieva ble d in Eq . (7 .9) is
the same as d min in Eq . (7.4) . Th erefore , R(D) is only defined for D 2':: dmin'
Moreover, since lea , b) ::5 H (a), we have
K
R(d min)
::5 -
2": Pk log Pk
(7. 11)
k~ l
similar to Eq. (7.5). R(D) is a nonincreasing functio n of D beca use the set of
D-admissi ble channels can only expand with increasing D . Th e following theorem establishes the smallest value of D at which R(D) becom es equal to zero .
Theorem 1.
Th en R(D)
= 0 for D
2'::
k= l
(7. 12)
Sec . 7.2
89
LP(at, bj)d(at, bJ
k= ' j= 1
J
2:
LP(bj)d max
j=1
k= 1
s.:
j='
2:
The next theorem show s that R(D) is a con vex U function. Convexity
implies continuity at all point s except possibly at D = d min. [It has been shown
however that R(D ) is also continuous at D = dmin . ] The fact that R(D ) is a nonincreasing co nvex function of D also implies that it is strictly decreasin g .
Therefore , the gen eral behavior of R(D) is simil ar to the function shown in
Figure 7.4 .
Theorem 2.
Proof" Let 'A E [0, I] and D" D 2 belong to [d min,d max ] . We will show that
R[AD,
+ (I -
'A)D 2]
::5
AR(D 1)
+ (I -
'A)R(D2)
(7. 13)
Let [p ,(bj Iak)] be the D ,-adm issible chann el with / ,(a , b) = R(D, ) and , similarly, let [pi bj Iad ] be the Dradmissible channel with / 2(a , b) = R(D 2) . Consider a new channel matrix defined as
R(D)
d min
d max
Figure 7.4
I(a , b)
+ (I - A)lz(a, b)
A1 1(a , b)
(7.16)
Combining Eqs. (7.15) and (7.16) and replacing I l(a, b) and Iz(a, b) with R(D,)
and R(D z)' we obtain Eq . (7. 13), thus completing the proof.
As was mention ed ea rl ier, co nvexi ty impl ies th at R (D ) is a strictly
de cr easin g fun ction of D . This , in tu rn , implies th at th e cha nne l mat ri x
that achieves R(D ) for a given D is not merely D- admi ssible , but its average
distorti on d is exactl y equal to D ; for if d were less than D , then the ratedistorti on function would have been constant on the interval [d , Dl .
In this section we establi sh the relation ship between the rate-distorit on function,
which was formally defined in Secti on 7 .2 , and its practical, more intuitive version, which was describ ed in Section 7 . 1. In part icul ar, we will show that for
eve ry D in the interva l [d min, d maxl there ex ists a so urce code {V" . . . , VM}
whose ave rage distorti on d is arbitrarily close to D and whose rate R is arbitrarily clo se to R(D ). The co nce pt of rand om coding, whic h was ce ntra l to the
proof of the noisy channel theorem, will again be used here, and by showing
that the average distortion over all randomly selected code s of rate less than
R(D) + e is less than D + e , we will establish the existence of at least one
code with the desired properties.
Consider the discrete memoryless channel that achieves the minimum in
Eq . (7.10 ), and let its matrix be [p(bj Iad ]. From the discussions of Sec tion 7.2
we know that
K
j )
= D
(7. 17)
k = l j =l
I(a, b)
~ ~
p (b [ak)
p (b )
= R(D)
(7. 18)
Sec. 7.3
91
This will be the probability distribution for random coding; in other words, the
M code words will be selected independently and according to the probability
distribution in Eq. (7.19).
Now consider a source sequence Wi = a;lai2. . . aiN and define the set S;
as the set of all sequences Vj such that d(W; , Vj ) < D + e . If S; happens to
contain at least one of the code words VI,' .. , VM , then the distortion associated with W; will be less than D + e. We are therefore interested in the probability that a code word, chosen randomly according to the probability
distribution ofEq . (7.19), belongs to S; . We call this probability peS;):
(7.20)
We also define the set T; as the set of all sequences Vj such that
(lIN) log2[p(V j l WJ lp(Vj)]
<
R(D )
+ e.
Thus
(7.21)
Now
ns, n T;) = 2:
2:
2:
(7.23)
If the code words are randomly selected and at least one of them belongs
to S;, the average distortion for W; will be less than D + e . The probability
that none of the M code words belongs to S; is given by
[I - p(Si)r < [1 -
rN[R(D)+ e]
vE~nrP(Vj l WJ
J
(7.24)
where the upper bound is a direct consequence of Eq. (7.22) . We can further
upper bound Eq. (7.24) by using the following inequality:
(l - af3}M ::5 1 - (3
::5 a , (3 ::5
(7.25)
0'------ -- - -- -'-- - - - - o
f3
Figure 7.5
eM InO -all)
1 -
f3 +
e - Ma
P(Si)]M
<
I -
2:
p(Vjl
W i)
e - M' 2- NIRlD)+, ]
(7. 26)
Vj Es ,n T,
Now if M = 2N [RlD)+ 2e] , the last term in Eq. (7.26) can be made extremely small
for sufficiently large N . Similarly, the second term can be made arbitrarily
close to 1 according to Eq. (7.23) . Consequently, the probability of having no
code words in S; is arbitrarily small. In such situation s where W i has to be encoded into some Vj outside of S i, the distortion, although larger than D + e ,
will still remain below some finite value . [The maximum possib le distortion is
given by max k.j{d(a b bj )} .] Therefore , the contribution to the average distortion
by codes that have no code words in S; can be made less than e. From the foregoing discussion , we conclude that for M = 2N [R(D ) + 2eJ , or equi valently for
R = R(D) + 2e , there exists at least one code whose average distortion is less
than D + 2e. The proof is thus complete.
The converse to the fundamental theorem states that no D-admissible code can
have a rate less than R(D). To prove this, let {V" . . . , VM } be aD-admissible
code and consider a deterministic channel that assigns to each W i the closest Vj
in the code . Obviously, H(V IW) = 0, and consequently
I(W, V)
::5
log M
(7 .27)
Problems
Chap. 7
93
:z: H(a(II) -
H(W IV)
(7.28)
11 =1
Now
H(W IV)
:z: H(a(lI)Ib(II)
(7.29)
11= 1
Therefore,
N
I(W, V) ;::=
(7.30)
11 =1
l i N
log M ;::= I(a (II), b(II)
N
N 11=1
:z:
=-
(7.31)
Let dll be the average distortion with which a'" is reproduced; then
R(dll ) -s I(a (II), b(II)
(7.32)
and consequently
R ;::= N
(7.33)
The last inequality is a consequence of the fact that R(D) is a convex U function. Now the average distortion d is equal to (l IN) 2:~=1 dll , and since the code
is D-admissible by definition, we have d ~ D. From Eq. (7.33) and the fact
that R(D) is a decreasing function of D, we conclude that R ;::= R(D) . The proof
is thus complete .
PROBLEMS
7.1. A discrete memoryless source has alphabet {a, I} and probability distribution
{Po, I - Po}. The reproducing alphabet is also {a, I}, and the distortion measure is
d(O, 0) = d(l , I) =
and d(O, I) = d(l,O) = I. Show that the rate -distortion
function is given by R(D) = H(po) - H(D) for ::s; D ::s; Po.
7.2. Prove that the average mutual information lex, y) between the input and output of a
DMC is convex U in the transition probabilit y matrix [P( Yj IXk)] '
7.3. A source produces independent, equiprobable binary digits. The reproducing alphabet is ternary, and the distortion measure is shown in Figure P7.3 (omitted transitions in the diagram correspond to infinite distortion).
(a) Find and sketch the rate-distortion function R(D) .
(b) Find a simple encoding scheme that achieves any desired rate R at the distortion level D determined from R = R(D) .
nate-tnstorno n I neo ry
"
Sou rce
Reproduct ion
cnap .
d (l , 1) = O
Figure P7.3
7. 4. A source produc es independent equiprobable letters from an alphabet of five letters. The distortion measure is as shown in Figure P7.4 (omitted transitions correspond to infinite distorti on and included transi tions to zero distort ion) . Find the
rate-distortion function for this source and distortion measure.
Source
Repro duction
Figure P7.4
k = j
k ~j
What is the rate -di stortion functi on R(D ) for thi s source? [Hint : You may use
Fano 's inequality to bound R(D). ]
8
LINEAR ERROR-CORRECTING
CODES FOR THE BINARY
SYMMETRIC CHANNEL
8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8
8.9
8.10
8.11
8.12
8.13
In t r o d u c t ion
In this chapter we introduce some elementary concepts
from the theory of error-correcting codes. We do not intend to get involved in a
detailed study of the techniques that have been developed over the past four
decades for encoding and decoding of information. Rather, we want to introduce the reader to the practic al problem s involved in achieving the promise of
information theory . Shannon 's idea of random coding has been instrumental in
proving the existence of good code s, but it does not help us find codes that can
be implemented on systems with speed and memory constraints. The problem
is that a randoml y selected code , which with a high probability is a good code ,
lacks any useful structure that can be utilized for encoding and decoding purposes. Consequently, the brute-force method of looking up a table is about the
best technique for its implementation . But to achieve significant improvements
in performance , one must use long codes; the resulting tables will then be excessively large and the search times impractically long.
By imposing various constraints on the code, researchers have been able
to develop highly structured and yet powerful error-correcting codes. The most
significant of these constraints is the constraint of linearity, which will be discussed throughout this chapter in the context of codes for the binary symmetric
channel. More general treatments of the subject, as well as detailed description
of various error -correcting codes, can be found in the extensive literature of
coding theory.
It has been shown that linear codes fulfill the promise of Shannon's theory; that is, the noisy channel theorem can be proved even when the randomly
selected code words are confined to the class of linear codes . The price one
generally pays for the convenience of linearity is the longer code words one has
to use for a given error-correcting power. Many classes of linear error-correcting codes have been found to date , but unfortunately a general technique of designing good codes for arbitrary channel s is yet to be discovered .
Our discussion of linear block codes for the binary symmetric channel
begins with the introduction of linear block codes, also known as parity check
codes . We discuss methods for encoding and decoding arbitrary parity check
codes and study their error-correcting power. Later in the chapter, we introduce
the idea of convolutional cod ing and describe an algor ithm, kno wn as the
Viterbi algorithm , for decoding line ar con volutional codes . Both block and
convolutional codes are widely used in practice , and the decision as to which
code is more appropriate in a given application depend s on the specific problem
at hand , as well as the cost/performance requirements.
Consider a binary symmetric channel with crossover probability e < 0.5 . Let a
code of block length N consisting of code words WI, ... , WM be chosen for use
over this channel. We saw in Section 6.5 that if a sequence V) is received at the
output the maximum likelihood recei ver decodes it as Wi, where Wi is the
closest code word to V) in the Hamming sense. For example , if the code words
are WI = 00000, Wz = 10011 , W3 = 11100, and W4 = 01111 , then V) = 01011
is decoded as W4 since D(W4 , V)) = 1 and D(Wi , V)) > 1 for i 4. Similarly,
V) = 00110 may be decoded as either WI or W4 since D(WI, V)) = D(W4 , V)) =
2 and D(Wz, V)) = D(W3 , V)) = 3. *
*From R. Asn, Inf ormation Theory. New York: Interscience , 1965, p. 88. Reprinted by
permission of John Wiley & Sons, Inc.
Sec. 8.1
97
Vz.
2:
2v
+ 1
(8. 1)
Then the code can correct all errors of magnitud e :::; u, Conversely, any code
with this error-correcting power must satisfy Eq . (8. l) .
Proof.
Assume some code word, say Wi o' is transmitted . If the magnitud e of erro r is :::; v, th e recei ved seque nce Vj sati sfie s the relation
D(Wio' \tj) :::; v , By the triangle inequality, we then have, for all i - io,
D (Wio' Wi) :::; D (Wio' \tj) + D (Wi , \tj)
Consequently,
D (Wi , \tj)
2:
2:
2v
I - v
= v
Thus Wio is closest to \tj and as a result \tj is decoded as Wio. All errors of magnitude :::; v are thus correctable.
Conversely, assume that two code words, say WI and Wz, have a distance
less than 2v + 1 [i.e., D(W!> Wz) :::; 2v ]. Upo n transmission of WJ, an error
pattern of magnitud e u that only affects the positions where WI and Wz differ
will result in a received sequence \tj, where D(Wz, \tj) :::; v , Therefore, in this
case at least, an error of magnitude v cannot be corrected. The proof is now
complete .
The next theorem is similar to the previous one, and its proof is left as an
exercise .
Theorem 2. Let WI, . .. , WM be binary code words of length N, and assume that a positive integer v exists such that, for all i - j,
D(Wj,H-j )
2:
2v
(8.2)
98
Chap. 8
Then the code can correct all errors of magnitude ::::; v - I and detect all errors of magnitude v , Conversely, any code with this error correcting/detecting
capability must satisfy Eq. (8.2).
8.2 HAMMING UPPER BOUND ON THE NUMBER OF
CODE WORDS
M ::::;
2N
- v - N
L k=o (k)
(8.3)
Proof. For each code word W;, define a set B, as the set of all sequences
Vj that differ from Wi in at most u places . The total number of sequences in B, is
then given by
Since the code is capable of correcting all errors of magnitude v and less, the
sets B I> ' .. ,BM corresponding to the code words WI, ... , WM must be disjoint.
The reason is that if B , and B , have a common element V then D (Wi , W;) -s
D(W;, V ) + D(W;, V ) ::::; u + v = 2v , contradicting Theorem I of the previous
section. The total number of elements in the sets B I> .. ,BM must now be less
than or equal to the total number of sequences of length N ; that is,
M
(N) : : ; 2
k=O k
N
So far in our discussions of coding we have not addressed the practical problems of encoding and decoding . We have assumed that a "table" or "code
Sec. 8.3
99
book" exists that at the transmitter associates a code word with every source
sequence, and at the recei ver asso ciates every received sequence with a cod e
word. In the case of minimum-distanc e decoding , for instance , every received
sequence is associated with the closest code word in the Hamming sense . In
general , whe n the code words are long , the storage and pro ce ssing of data
become excessively prohibitive . We now introduce the clas s of parit y check
codes that, thanks to their structure , have substantially improved the efficiency
of encoding and decoding procedures.
Assuming that r\, r i , . .. , r N represent the digits of a binary sequence of
length N , a linear (parity check) code is defi ned as one whose code words are
solutions of the follo wing set of simultaneous linear equations:
(8.4)
The coe fficient s a ij are given bina ry numb ers (either 0 or 1), and the additions
and multiplications are all modulo 2 operations . All solutions of Eqs . (8.4)
must be included in the code.
Example
Consider the set of simultaneous linear equation s*
,.\
"1
+
,.z + "3+
"4+ "5
"4 +
+ "3 + r4
=0
"6 = 0
r6 = 0
(mod 2)
(8 .5)
Any sequence of binary digits r lrz . . . r6 that satisfies these equations will be an acceptable cod e word. For instance, WI = 001001 and W z = 111010 are acceptable code
words, whereas the sequence 101101 is not.
*From R. Ash, Inf ormation Theory. New York: Interscience, 1965, p. 164. Reprinted by
permission of John Wiley & Sons, Inc.
a mN
(~I)
.
(mo d 2)
(8.6)
rN
A =
1 0 0 1
0 1 1 1
(
101 1
(8 .7)
Sec. 8.4
101
A' =
1 10)
1 0 0
0 1 1 101
(
001 o 1 1
(8.8)
IVL
cn ap. !l
Example
The following sequence shows the process of constructing a triangular variant.
A = [ 0:
HH:,
~ ! ~ i ~].c
0
1 0]
10 01
~ !~ :
'>I
'l
01001
(8. 9)
'>I
i ~ i :!].c
01
001
'>I
A4 is now a triangular matrix. Its lower left triangle is all D's, and its columns C" C2 ,
C3 , and Cs are linearly independent. This is due to the fact that these columns have the
"diagonal" elements of the triangular matrix. If, for some set of numbers a , b, C, d, we
happen to have
(8.10)
then d = 0 because of the last row. Consequently, C = 0 from the third row, b = 0
from the second row, and, eventually, a = 0 from the first row. Since Eq. (8. 10) is only
valid for a = b = C = d = 0, we conclude that the preceding columns of the matrix
Sec. 8.5
103
are linearly independent. This is obviously quite general and applies to triangular variants of any matrix . Also observe that the remaining columns C4 and C6 of A4 can be
obtained by a linear combination of C r , C2 , C3 , and Cs . To obtain C6 , for example
we write
{] + {~l +{rl+{] m
(8 .11)
and find that d = 1, c = I , b = 0 , and a = I; that is, C6 is the sum of C" C3 , and C s.
Again , this is quite general and could be applied to triangular variants of any matrix .
Using these results together with property (vi), we conclude that the number of indepen dent columns of the matrix A is exactly equal to In, the number of (independent) rows of
the matrix .
I 0 0
A = 0 1 I
(
1 0 1
(8.12)
Since columns C" C2 , and C 3 are linear ly independent, we choose 1'" 1'2, and 1'3 as check
digits. 1'4 , I's, and 1'6 will then be information digits and can be chosen arbitrarily. There
are eight possible cho ices for the sequence 1'41'S1'6, and therefore there are eight code
words altogether, as follow s:*
*From R. Ash, Information Theory. New York: Interscience, 1965, p. 93. Reprinted by
permission of John Wiley & Sons, Inc.
....,,'~~ ,
rl
r2
r,
r4
r5
r6
WI
W2
0
0
0
0
0
0
W3
W4
W5
0
0
0
0
0
0
1
0
0
W6
W7
0
0
0
0
Ws
11 . ....... "
........ "
.... ' 01
A binary group code is a code {WI, ... , WM} whose code words are bin ary
sequences of equal length that form an algebraic group under modulo 2 addition .
a for all a, {3 in G ,
The set of integ ers (positive, zero, and negative) under ordinary add ition
forms an Abelian group, as doe s the set of all binary sequences of fixed length
under modulo 2 addition.
The following theorems establish the equivalence of group codes and
parity check codes .
Sec. 8.6
105
lUb
r3 r4 r5 r,
rl
rz
WI
Wz
I
I
W3
W4
0
0
0
0
W5
W6
W7
w,
0
0
0
0
0
0
0
0
0
Chap. 8
There are eight code words; therefore, there must be three independent rows. It can
easily be checked that W" Wz , and W 3 are linearly independent. Therefore, we must find
a set of linear equations for r" ri . .. . , r that is satisfied by W" Wz , and W 3. Observe
that in the generating matrix
001
o I 0
I
+
rz +
r3 +
r4
+
r4 +
r4 +
rs
(~ ~ ~
001
r = 0
r5 = 0
r = 0
1I)
I
Sec. 8.7
107
IUO
L nap.
1:1
the decoder can then add the proper error pattern to Vi to obtain the transmitted
code word.
Example
Consider the following parity check matrix:
A =
(Ia
a1
I 0)1
(8.13)
The code words associated with A are WI = 0000, W2 = 0101 , W3 = 1110 , and
W4 = 1011. All cosets of this code are listed next and the coset leaders are underlined .
S = {OOOO, 0101,1 110, lOl l}
S, = {OOOI, 0100 , 1111, IOIO}
S2 = {IOOI, lIOO,OIII,OOIO}
Thus all elements of Vi E9 S have the same corrector Ci . On the other hand, if
Vk does not belong to Vi EB S, then C, = AVJ is different from C, = AvL for if
C, = Ci , then A(VJ E9 VJ) = 0, which means tha t Vi E9 V k = Wi or V k =
Vi EB Wi (i.e ., Vk belongs to Vi E9 S ) which is a contradiction.
The corrector vectors are binary sequences of length m (m is the rank
of A) and there are 2m such sequences . The number of cosets associated with
the matrix A was also found to be T" , Therefore, there is a one-to-one correspondence between the cosets of the code assoc iated with A and all binary
sequences of length m.
The parity check decoder must keep a table that associates all binary
sequences of length m (the correctors) with the corre sponding error-pattern
tIn the literature, the corrector is also known as syndrome.
Sec. 8.7
109
vectors (coset leaders) . Upon receiving a sequence Vj , the decoder computes its
corrector C, = AVr It then searches the table to find the error-pattern vector Z,
corresponding to the corrector Cj . The minimum-di stance decision will then be
the code word Wi = Vj EEl Zj.
The amount of storage for the preceding decoding technique is on the
order of 2m , which could be significantly lower than 2N , the storage needed for
a direct table-lookup approach.
For the code corresponding to the matrix A in Eq. (8.13) , the correc tors and their
error-pattern vectors are as follow s:
00
0000
01
0100
10
1000
II
0010
This code can thus correct some of the single error s. All other errors will be neither
detected nor corrected .
Example
A =
(~o ~ ~ ~
o
0
0
I
0
0
I
I
0
0;1)
(8. 14)
(8. 15)
Here I - P E is the probabil ity of correct detection, the first term on the right side is the
probability of correc t detection corresponding to correct transmission , the second term is
the probability of correct detection corresponding to single errors in transmission, the
third term is the probability of correct detection corresponding to double errors in transmission, and so on . If coding is not used, the transm ission of four messages requires
2 bits and the probability of correct transmission would be
1 - P E = (I - e)2
(8. 16)
For e = 0.01, Eq. (8.15) yields PE = 7.86 X 10- , whereas from Eq. (8.16), PE =
I. 99 X 10- 2 This improvement in error probability is achieved at the cost of reducing
the data rate to one-third .
4
"V
o rnarv
ovnunernc \..11i:H1Ilt:1
\..flap .
1)
TABLE 8.1
000000
111010
101101
010111
Correctors
0000
Single
errors
100000
010000
001000
000100
000010
000001
011010
101010
110010
111110
111000
111011
001101
111101
100101
101001
101111
101100
110111
000111
011111
010011
010101
010110
1000
0100
0010
0001
1110
1011
Double
errors
110000
101000
100100
100010
100001
010100
010001
001010
010010
0 11110
011000
011011
101110
101011
0 11101
000101
001001
001111
001100
111001
111100
100111
111111
110011
110101
110110
000011
000110
1100
1010
1001
0110
0011
0101
1111
Triple
errors
110100
110001
001110
001011
011001
011100
100011
100110
1101
0111
Code
Words
In Section 8.1 we saw that to correct all errors of magnitude v and less, a code
{WI, . .. , WM} must satisfy
D(Wj , Wj ) ~ 2v
+ 1,
for all i -# j
For parity check codes, this condition can be stated in terms of the minimum
code-word weight, where the weight of a code word is the number of l's in it.
For instance, the weight of Wi = 1001100 is 3.
The minimum distance between code words of a parity check code is
equal to the weight of the code word with the least number of l 's (excluding
the trivial word W = 000 . . . 0). The reason is that, if the distance between Wi
and Wj is D, then the code word W k = W j EEl Wj has a weight equal to D that
cannot be less than the minimum weight. On the other hand, the minimumweight code word itself has a distance from the trivial code word equal to its
own weight. As an example, consider the parity check code {0000,0101, 1110,
lOll} . The minimum-weight code word is 0101, which has a weight of 2, and
the minimum distance is indeed equal to 2.
In light of the preceding result, Theorem 1 of Section 8.1 can be stated as
follows: A necessary and sufficient condition for a parity check code to correct
all errors of magnitude u and less is to have a minimum weight greater than or
equal to 2v + 1.
Sec. 8.9
111
If the code word with minimum weight has at least 2v + lones, then no
code word (except the trivial one) can have 2v or less 1's. Since the l's in a
code word represent a linear dependence among the columns of the matrix A
[see Section 8.4 , property (v)], the preceding theorem can be restated as follows: A necessary and sufficient condition for a parity check code to correct all
errors of magnitude v and less is that every set of 2v columns of the matrix A
be linearly independent.
Example
The following parity check matrix represents a double-error-correcting code with seven
check digits and three information digits .
0 0 0
0 0
0 0
0 0 0 0
0 0
0 0
0 0
0 0 0
0 0
0 0
0
0
0 0
0 0
(8.17)
It may be verified that any four columns of this matrix are linearly independent. *
k, we need (1) distinct correctors . To correct all errors of magnitude v and less,
we need L ~= o (1 ) distinct correctors. But there are 2'" correctors altogether;
11<:
l-nap.~
is necessary. However, it may be verified (by trial and error, for example) that
the smallest number of check digits for a double-error-correcting parity check
code of block length 10 is m = 7.
The Hamming lower bound on the number of check digits for a parity
check code is, in fact, identical to the Hamming upper bound on the number of
. code words for a general binary code, which was described in Section 8.2 .
To see this, observe that M = 2N - m and therefore inequalities (8.3) and (8.18)
are equivalent.
2 >
2v -1
k=O
(N - I)
(8.19)
Sec. 8.11
113
CN ' 0,
CN ' C;,
CN ' C; +
i =I,2 ,
c.,
i ,j
,N - l
1, 2,
,N - 1
i , j , k = 1, 2,
i.j, k distinct
CN ' C; + C, + Ci,
il>iz,
i I> i z,
,N
and
and
At worst , all the se combinations will be distinct, in which case the total number
of excl uded vectors will be
+ ... +
(N1)
2v - 1
Z~l
(N- 1)
k=O
Since there are 2 vec tors of len gth m , we can choo se the last column C N if
Eq . (8. 19) is satisife d . Th e pro of is thu s co mplete .
Example
If N = 10 and v = 2, the upper bound for m is 8, which means that for m 2: 8 and
N = 10 a double-error-correcting parity check matrix can be constructed. The preceding
theorem does not exclude, however, the possibility of finding a parity check code with
m < 8. From the Hamming bound, we know that m must be larger than or equal to 6.
In fact, a double-error-correcting parity check code of length N = 10 exists with m = 7
check digits. [See the matrix in Eq. (8. 17) .]
We clo se our discu ssion of parity chec k codes by notin g that in the spe cia l case of v = 1 (sing le-error-co rrec ting codes) the Hamming lower bound is
2m ~ 1 + N and the Varsharmov-Gilbert upp er bound is 2m > N . Since 2m is
an integer, the two bounds are equal, and the minimum value of m is uniquel y
determ ined by m = rlogz(N + 1 Th is ca n be seen in a much simpler way by
observing that the columns of a sing le-error-co rrecting code must satisfy onl y
two requirements: (1) all columns are di stinct and (2) no column is entire ly
zero . For a further discu ssion of sing le-erro r-correcting code s , see Probl em 8 . 1.
~I"""UI
V V ..... ""..,
IVI
L II\:;
LJII IU I
""'y"l l l I C Lll\"
ll O l l ll C I
Vl la~ .
. X13x12Xl1
Figure 8.1
code sequence, however, can be thought of as one giant block code, and it is in
this sense that linear convolutional codes and linear block codes are similar.
Convolutional codes are powerful error-correcting codes with relatively simple
encoding and decoding schemes and have found application in many communication systems . We introduce the concept of convolutional codes by means of
simple examples and describe an optimum decoding technique known as the
Viterbi algorithm. We will not discuss topics such as the error-correcting power
of convolutional codes and sequential decoding techniques; the interested reader
is referred to the vast literature on coding theory for further information [3, 5].
The following example of a linear binary convolutional code will bring
out the important features of the convolutional technique . Consider a sequence
of binary information digits VI V 2 V 3 at the output of an information source ,
with 'T s being the fixed duration of each digit. The information sequence
enters a three-digit shift register as shown in Figure 8.1. The shift register is
initially reset to 000. At t = (n - 1)'T" the nth information digit U; enters the
shift register and creates two code letters X o" = V" EB V"-2 and XI" = V" EB
V n - I EB V"-2' which are then transmitted during the time intervals [(n - 1)'T"
(n - 0.5)'T.] and [(n - 0.5)'T" nTs ], respectively. Since for every information
digit there are two code digits, this code is a (2, I) convolutional code; in other
words, the rate of the code is R = ~. Since at every stage the encoder remembers the previous two input letters, the code is said to have a memory or constraint length of 3. Notice that the stream of data cannot be partitioned into
nonoverlapping blocks with each block corresponding to a block of code letters. It is in this respect that convolutional codes differ from block codes.
Sec. 8.12
115
00
Figure 8.2 State diagram. (From R. J. McEliece, The Theory of Inf ormation and Coding, Addison-Wesley, Reading, Mass., 1977. )
states corresponding to four different values of Un - 1 Un - 2 ; these states are labeled a, b, c, d and are represented by boxes in Figure 8.2. The lines that join
these boxes show possible transitions from state to state upon arrival of a new
information digit. A solid line represents a 0 input , while a broken line represents a 1 input. Each line has a two-digit label that represents the resulting output X OnX 1n. For example , if the input sequence is 110100 and the initial state of
the system is So = a, the state follo ws the path acdbcba and the encoder output
will be 111010000111.
The ladder diagram is the extension of the state diagram in time. The ladder diagram for the encod er of Figure 8. 1 is shown in Figure 8.3. Each column
of four dots in Figure 8.3 represents the states of the system at t = n, with
j = 0, 1,2, ... . In going from S, to S;+l s a solid line represents a 0 input and a
broken line represe nts a 1 input. The 2-bit sequences on the lines correspond to
the output at each transition . The encoder output stream can be found by tracing the appropriate path through the ladder. For exa mple , the input strea m
110100. . . corresponds to the path aocld2b3c4bsa6' . . and the output sequence
111010000111 . . . .
In practice, it is necessary to work with a truncated version of a convolutional code. If the stream of informati on is truncated after N digits, it is usually
preferred to input a suffic ient numb er of zeros so that the last bit can pass
through the shift register. For the encoder of Figure 8. 1, the required number of
zeros is 2, and the last few stages of the ladder diagram for a code truncated
after N input letters are shown in Figure 8.4 . Notice that in the last two stages
all the lines are solid; this is because the last two input letters are zeros . The
truncated convolution al code can, in fact, be thought of as a single-block code
St at e
t =
...
I V'
.. , . ....
LJII I UI
0
00
. ,,.,.. .
.....
,,
Figure 8.3 Ladder diagra m. (Fro m R. J . McEl iece , The Theory of Inf ormation and
Coding, Addison-Wesley, Reading, Mass. , 1977 .)
State
t=(N - 1),s
t=N ' s
t=(N+1),s
t =(N+2},s
Figure 8.4
of length 2N + 4 with N inform ation digits. The rate, R = N j(2N + 4), approaches ! as N ~ 00 . Moreo ver, it is not difficult to show that this linear block
code is in fact a parity check code (see Problem 8.17).
8.13 MAXIMUM LIKELIHOOD DECODING AND THE
VITERBI ALGORITHM
Sec. 8.13
117
with minimum distance from Vj. The sequence of states So SI .. . SN+Z is not
arbitrary; it represents a continuous path in the ladder diagram. Moreover, the
initial and final states are always known in advance : So = SN+Z = a.
For a given received sequence Vj , we assign a "length" to each branch of
the ladder diagram. A branch that connects the state S in level (j - I) (i.e .,
Sj-I) to state S ' in level j (i.e. , S/) will have a length equal to the Hamming
distance between its corresponding encoder output (XOjx lj) and the correspond ing 2 bits in the recei ved sequence (YOjY lj)' Figure 8.5 shows the ladder diagram of the encoder of Figure 8.1, truncated at N = 6. A received sequence
Vj = 10110011101 11100 is also shown.
As an example, consider the bran ch that connects d z to b 3 . For this
branch , X0 3 XI 3 = 10 and Y 03 Y13 = 00; therefore, the length of d zb3 is 1. Figure 8.6 shows the length s of all the branches of Figure 8.5.
V.=
[ 10
11
00
11
11
10
00 J
11
00
\ .....
\ .....
\
\
\
\
\
\
\
j = 0
Figure 8.5 (From R. J. McEliece , The Theory of Inf ormation and Coding, AddisonWesley, Reading, Mass. , 1977.)
ao
a8
\
b
j =O
\ 1
Figure 8.6 (From R . J. McE liece , The Theory of Information and Coding, AddisonWP<] P V Readino_Mass.. 1977.)
LII' tJi:ll
"0
"UUtJ~
l,nap. ts
3. If j
=N +
=j +
1 and go to step 2.
Figure 8.7 shows this process for the ladder diagram of Figure 8.6. The
numbers appearing on the states represent the length of the shortest path that
leads to them. The shortest path to a 8 is readily seen to be a oajc2b3a4c5b6a7a8.
The decoding is now complete .
\
\
\\
d
j =
4
as
,
)\;\
,
-,
\
c
1
ao \
\ 1
-,
,2
-,
-,
"-
"2
'2
'-
\ 3
Figure 8.7 The Viterbi algorithm. (From R. J. McEliece, The Theory of Information
and Coding, Addison-Wesley, Reading, Mass., 1977.)
Chap. 8
Problems
119
PROBLEMS
8.1. The Hamming code is a parity check code whose matrix A has the following
property: column n of A is the binary number n . For example, the matrix of a
Hamming code with m = 3, N = 7 is
0 0 0 I 1 1
A=OIIOOI
(
I 0 1 0 I 0
(a) For a given m, what is the maximum code-word length N ?
(b) Construct the largest Hamming matrix for m = 4 and show that it corrects all
single errors of transmission .
L. II I G O I
'-'VU'c;) IV I
L.nap.
(c) Show that errors of magnitude 2 and lar ger cannot be corrected by the
Hamming code .
8.2. (a) Show that a parity check code will correct v-tuple (and all smaller) errors and
detect (but not necessarily correct) (v + I)-tuple errors if and only if every
set of 2v + 1 columns of the parity check matrix is linearly independent.
(b) Given a v-tuple error correcting parity check code with parity check matrix
A = [1mIAI], where 1m is an identity matrix of order m, show that the code
defined by the matrix
0
0
Ao =
t;
AI
0
will correct v-tuple errors and detect (v + I)-tuple errors . (This corresponds
to adding one check digit that performs a parity check on all the digits of a
code word .)
(c) Find a parity check matrix of a single-error-correcting, double-error-detecti ng
code with five check digits and eight information digits.
(d) What is the analog of the Varsharmov-Gilbert condition for codes that correct i--tuple and detect (v + I)-tuple errors?
8.3. Consider the following parity check matrix:
(~q ~ !)
Chap. 8
Problems
121
(b) Find a convenient condition on the code words that is equivalent to the condi tion of part (a) .
(c) Given a parity check code with 2K words of length N, find the number of
error patterns that can be detected by the code . [Note that this number is the
same for all (N, K) codes.]
8.6 . A generator matrix for an (N , K ) binary code S is a matrix G whose rows consist of K linearly independent code words of S. Every code word of S may be
obta ined as a linear co mbination of the row s of G . Assumi ng that the last K
column s of G are linearl y independent (if not we can interchange columns) , we
may reduce G by elementary row transformations to a new generator matrix
G * = [B lh]
where B is K by m (m = rank of the parity check matrix = N - K ) and IK is an
identity matrix of order K.
(a) Show the matrix A = [1mIB T] is a parity check matrix for S.
(b) Find parity check matrices for the codes whose generator matrices are
(i) G
[~ ~ ~ ~ ~ ~J
0
000
(ii) G = [I
I]
8.7. The parity check matri x A represe nts a single-error-correcting, double-errordetecting code . By eliminating a row and a column from A, it is always possible
to obtain the matrix of a sing le-error -correcting code . How do you go abo ut
selecting the row and column that must be eliminated for this purpose?
8.8. The circuit in Figure P8.8 is used to encode binary digits for transmission over a
binary symmetric channel with crossover probability e < !. The shift register is
initially filled with O's; then four information digits are shifted into the shift register and simultaneously transmitted . Next three check digits are transmitted; the
four information digits in the shift regi ster shift one place to the right between
each check digit calculation .
Find the parity check matrix , the genera tor matrix, a decoding table , and the
probabilit y of decoding error for the code .
Source
7X e
x u u u
5
Channel
u,
Figure P8.8
8.9 . (a) Show that in a parity check code either all the code words contain an even
number of ones or half the code words contain an odd number of ones and
half an even number.
,.<:.<:
L lllt::dl
\,.,UUt::::; lUI
li l t::
o u rcn v
vlldlJ. 0
(b) Let x m , 1I be the nth digit in the mth code word of a parity check code, Show
that, for any given n, either exactly half or all the x m , 1I are zero, If all the X m , 1I
are zero for a given n , explain how the code could be improved,
(c) Show that the average number of ones per code word, averaged over all code
words in a parity check code of block length N, must be at most N /2.
8.10. Consider two parity check codes. Code I is generated by the rule
Xz
Uz
X4
UI
Xs
UI
X6
= u:
X7
UI
EB
EB
EB
EB
Uz
U3
U3
Uz
EB
U3
4.
8.11.
8.12.
8.13.
8.14.
(a) Can you obtain the code words of Al from the code words of A? If the answer
is positive, describe a procedure.
(b) In terms of Nand m, how many code words belong to AI?
(c) If A is a v-tuple error-correcting code, prove that Al is at least v-tuple error
correcting, (v + I)-tuple error detecting.
8.15. Let the m X N parity check matrix A be expressed as A = [B 11m ], where B is
m X (N - m) and I", is the m X m identity matrix. Construct a new matrix Al by
eliminating the first column of A (i.e. , Al = [Bi ll",]) .
Chap. 8
Problems
123
(a) Describe a procedure for obtaining the code words of Al from the code words
of A.
(b) In terms of Nand m, how many code words belong to AI?
(c) Show that the minimum distance (d min) for the new code is ;:=: d min for the
original code .
8.16. Suppose you were approached by a customer who told you' that his (binary) channel accepts words of length n = 7 and that the only kind of error pattern ever
observed is one of the eight patterns 0000000, 1000000, 1100000 , 1110000,
1111000, 1111100, 1111110, 1111111. Design a parity check code that will correct all such patterns with as large a rate as possible.
8.17. Using the equivalence between parity check codes and group codes, show that a
truncated linear convolutional code is a parity check block code.
8.18. The convolution of a sequence {ao, aI, ...} with another sequence {bo, b., . .. } is
defined as the sequence {co, CI> }, where c; = 2,;~oa ;bn -;. Using modulo 2
addition and multiplication, show that each of the sequences XOIXQ2X03 and
XllX12X13 . of Figure 8.1 can be expressed as the convolution of the input
sequence VI V 2 V 3 . . with an appropriate (finite) sequence .
9
ADVANCED TOPICS
9.1
9.2
9.3
9.4
9.5
Introduction
In this final chapter we introduce some of the more advanced topics of information theory . This presentation is neither detailed nor
exhaustive; its only aim is to convey to the reader the sense that the concepts of
information theory that have been developed in this book in connection with
simple discrete memoryless sources and channels are far more general and can
be extended in a variety of directions . In Section 9.1, for example, we remove
the restriction of memorylessness from the discrete source and show that entropy can be defined in a natural way for discrete stationary sources . We then
show that this ge nera lize d entro py has the same practical sig nifica nce that
was associated with its restricted versio n for the discrete memoryless source;
that is, the entropy of a stationary source is the minimum average code-word
length that one has to use in order to represent the source accurately. In Section 9.2 , we generalize the concept of likely or typical sequences by proving
the Shannon- McMillan theorem for a broad subclass of discret e stationary
sources . By imposing the constraint of ergodicity, we will be able to show the
generality of the relationship between entropy and typical sequences that was
studied in Chapter 2.
Another direction in which the results of information theory can be extended is in relation to continuou s messages and channels. Although some informatio n generation, storage, and tra nsmiss ion sys tems in use tod ay are
1"-0
MUVQI I\..CU
IUIJI\...~
'-"IICp . oJ
discrete in nature, the vast majority of such systems remain continuous. Microwave channels, telephone lines, fiber -optic networks, and magnetic disk and
tape drives are all examples of continuous systems. A great deal of effort has
been expended to extend the results of information theory to such practical situations. The cornerstone of all these efforts is the concept of differential entropy,
which will be introduced in Section 9.3. Although we shall not build upon this
concept here to investigate the properties of continuous (or waveform) messages and channels, we will point out its major differences with conventional
(discrete) entropy. Our analysis of a continuous channel in Section 9.4 is based
on intuition rather than rigor, but the derivation of the formula for the capacity
of a band-limited Gaussian channel in this section is expected to give the reader
a great deal of insight.
Section 9.5 describes a numerical algorithm for the calculation of the
capacity of discrete memoryless channels . This is a simple, yet elegant method ,
which was not discovered until later in the development of information theory
[13,14]. The discussion here takes advantage of the knowledge developed in
Chapters 5 and 6 with regard to the information theoretic functions .
Sec. 9.1
HN(u)
Obviously, HN(u)
function of N .
?:
~{ - ~ P(SN) IOgp(SN)}
(9 .1)
H (UNIU IU Z .
Lemma 1.
127
. . U N- I )
Proof. Using first the fact that cond itioning cannot increase entropy and
then the stationarity of the source, we obtain
H(UNI UIU z
"
U N-I) :5
H (UNIUz... U N-
I)
H (UN-I [ Ul
'"
UN- Z)
(9.2)
Lemma 2.
(9.3)
Proof
HN(u)
I
N H (u,u z, . . UN)
I
=N
{H( ul)
,UN- l)}
From Lemma I , the last term on the right side of the preceding equation is a
lower bound to each of the other term s. Inequal ity (9. 3) then follows .
Theorem 1.
HN(u) = N H (u l .. . UN)
I
= N H (u , . . . U N-I ) +
N - I
:5 ~ HN- I (U)
I
N H (UNIu
, . .. U N- I )
N HN(u)
(9 .4)
(9.5)
lim HN(u)
(9.6)
N --> x
MUVClI IL.t;U
UIJ Il,,;~
\.,.,IIClIJ ;;,
Lemma 3.
for all N
(9 .7)
Proof
1
HN+M(U)
= N + MH(UI . . . UN+M)
1
+ MH(UI
. {H(UN UI
+ ... +
UN- I)
+N +M
+ H(UN+I l UI . . . UN)
UN- I)
H(UN+M UI . . . UN+M-I)}
+ MH(u l UN- I) +
00 ,
M
M
+ 1
+ N H( UNl UI . .. UN- I)
we have
Theorem 2.
equal to H (u) .
00
ex ists and is
Pro of. From Lemma I and the fact that H(UNl UI . . . UN-I ) is larger than
or equal to zero for all N , we conclude that the limit exists. Lemm a 2 asserts
that the lim it is ::; H (u ), while from Lemma 3 the limit mu st be ~ H (u ).
Therefore , H(u ) = limN_X H (UNl UI . .. UN-I ). Thi s completes the proof.
The following theorem establishes the relationship between the entropy of
a discrete stationary source and the average code-word length required for its
perfect reco nstruction .
Theorem 3. Let H(u) be the entropy per letter for a discret e stationary
source with a finite alphabet, and let 0 > 0 be arbitrary. Given a code alphabet
of D symbols , it is possible to encode source sequences into a prefix-free code
with the average code -word length per source letter satisfying the following
relation:
Sec. 9.2
H (u )
-
:5
log D
129
H (u )
< -- + 0
(9.8)
log D
n cannot violate the left inequality for any uniqu ely decodable code.
Proof. The proof is similar to the proof of the corresponding theorem for
memoryless sources given in Chapte r 2. By encoding sequences of length N at
a time , we can achieve
HN(u )
log D
_
:5
H N(u)
n < log D
+ N
(9.9)
Since H N(u ) ::::: H(u ) , the left inequ ality in Eq. (9 .9) yields H(u) /I og D :5 1l ,
and any uniquely decodabl e code must satisfy this inequ ality. By choosing a
large enough value for N, the term on the right side of Eq . (9.9) can be made
less than H (u )/Iog D + o. The proof is thus complete.
9 .2 ERGODICITV AND THE
SHANNON-MCMILLAN THEOREM
Consider a source with alphabet {a], a i , a3} and two modes of behavior, each
occurring with probab ility ]. In the first mode the source produce s an infinite
sequence of repetitions of al. In the second mode the source produce s an infinite sequence of statistically independent, equiprobable selection s of the letters a2 and a-; This is obviously a stationary source, since the probability of a
given sequence does not depend on the starting point. A sequence of length N
has either probability! (if it is all ai's) or G)N+l (if it is composed of a2's and
a3's). We then have
HN(u)
~ (- t
~{ ~
log 2
2N(N2':11 log 2) }
= N ;; 2
log 2
The entrop y per source letter is thus given by H (u ) = limN--+oo HN(u) = ! bit.
If we encode sequences of length N into binary code (u sing either
Shannon- Fano coding or Huffman codin g) , it is not hard to see that the sequence of a I ' S is encoded into a single binary digit, and each of the 2N sequences compo sed of a2's and a3's is encoded into a binary sequence of length
N + I. The average code-wo rd length is ! + (N + 1)/2 = (N + 2)/2 , and
the average number of bits per source letter is (N + 2)/2N , which is close to
H (u ) for large N . However, it is clear that neither Ii nor H (u ) are quantiti es of
great significance in this case .
Source s that cannot be separated into different persisting modes of behavior are known as ergodic sources . A stationary ergodic source has the property
that statistical averages of functions defined on its output sequences are arbitrarily close to the time averages with probability close to 1; that is, the proba-
. .... W .... I I . . . . . . . . . . . . . .
t-" ......
bility that the time average differs significantly from the statistical average is
almost zero. For the stationary ergodic source, the following theorem, which is
a generalization of the results of Section 2.4, can be proved .
The Shannon-McMillan Theorem. Let a discrete stationary ergodic
source have entropy H(u) . For arbitrary e > 0, a > 0, there exists No such
that, for N 2: No,
(9.10)
Proof. The proof will be given in two steps. First we prove the theorem
for sources with finite memory; then we show that a stationary source can be
arbitrarily closely approximated by a finite-memory source.
Part 1: Let
N
= piu, . .. U m ) 11
P(U I . . . UN)
piu; U n- m Un-I)
n=m +l
. . . UN)
I
N log P(U I
= -
um )
. . .
I
]
- N-m[
--- --2: N
log ptu; I U n- m
Un-I)
N
N - m
n=IIl+ !
(9.11)
Since m is fixed, in the limit N → ∞ the first term on the right side of
Eq. (9.11) approaches zero and the coefficient of the second term approaches
unity. The second term is now the time average of −log p(u_n | u_{n−m} … u_{n−1})
and, because of ergodicity, it approaches the statistical average with probability
1. The statistical average of this quantity is
E\{-\log p(u_n \mid u_{n-m} \ldots u_{n-1})\}
  = -\sum_{\text{all sequences } u_{n-m}\ldots u_n} p(u_{n-m} \ldots u_n)\,\log p(u_n \mid u_{n-m} \ldots u_{n-1})            (9.12)
Since the memory is finite and equal to m, the conditional entropy in Eq. (9.12)
is equal to H(u) . Therefore, in the limit of large N, Eq. (9.10) will be satisfied.
Part 2: We now show that, for sufficiently large m, the quantity

\hat{P}(u_1 \ldots u_N) = P(u_1 \ldots u_m)\prod_{n=m+1}^{N} p(u_n \mid u_{n-m} \ldots u_{n-1})

is, with high probability, close to P(u_1 … u_N) in the sense that

P\Big\{\Big|\frac{1}{N}\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big| > \varepsilon\Big\}
  = P\Big\{\Big|\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big| > N\varepsilon\Big\}
  \le \frac{1}{N\varepsilon}\,E\Big\{\Big|\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big|\Big\}            (9.13)
The last inequality follows from a version of the Chebyshev inequality for positive
random variables (see Problem 1.4).
Let us digress momentarily to prove the following inequality concerning
the absolute value of the function log x:

|\log x| \le \frac{2\log e}{e}\,x - \log x            (9.14)

The function ½(log x + |log x|) is shown in Figure 9.1. It is equal to log x
for x ≥ 1 and is zero otherwise. The tangent to this function passing through
the origin is (log e/e)x, which is also shown in the figure. Thus ½(log x +
|log x|) ≤ (log e/e)x, which is equivalent to Eq. (9.14).

Figure 9.1  [The function ½(log x + |log x|) and its tangent (log e/e)x through the origin.]
Using Eq. (9.14) in Eq. (9.13), we obtain

P\Big\{\Big|\frac{1}{N}\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big| > \varepsilon\Big\}
  \le \frac{1}{N\varepsilon}\Big\{\frac{2\log e}{e}\,E\Big[\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big]
      - E\Big[\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big]\Big\}            (9.15)
However,

E\Big[\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big]
  = \sum_{\text{all sequences } u_1\ldots u_N} P(u_1 \ldots u_N)\,\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}
  = \sum \hat{P}(u_1 \ldots u_N) = 1            (9.16)
And

E\Big[\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big]
  = -m\,[H_m(u) - H(U_{m+1} \mid U_1 \ldots U_m)]
    - N\,[H(U_{m+1} \mid U_1 \ldots U_m) - H_N(u)]            (9.17)

Therefore,

P\Big\{\Big|\frac{1}{N}\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big| > \varepsilon\Big\}
  \le \frac{1}{N\varepsilon}\Big\{\frac{2\log e}{e}
    + m\,[H_m(u) - H(U_{m+1} \mid U_1 \ldots U_m)]
    + N\,[H(U_{m+1} \mid U_1 \ldots U_m) - H_N(u)]\Big\}            (9.18)
For sufficiently large m, the terms in the brackets on the right side of Eq. (9.18)
will be as close to zero as desired; and since N is always larger than m, the first
term can be made arbitrarily small as well. Therefore, for sufficiently large m,

P\Big\{\Big|\frac{1}{N}\log\frac{\hat{P}(u_1 \ldots u_N)}{P(u_1 \ldots u_N)}\Big| > \varepsilon\Big\} < \delta            (9.19)

for any ε > 0, δ > 0 chosen arbitrarily. This result, combined with the result
of Part 1, completes the proof of the Shannon-McMillan theorem.
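The theorem can also be observed empirically. The sketch below (an added illustration with arbitrarily chosen transition probabilities) draws long sequences from a two-state Markov source, a finite-memory source of the kind treated in Part 1, and compares −(1/N) log₂ P(u₁ … u_N) with the entropy rate H(u) = H(U₂ | U₁).

    import math
    import random

    random.seed(0)

    # Hypothetical two-state Markov source (memory m = 1); parameters are arbitrary.
    P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
    pi = {0: 2.0 / 3.0, 1: 1.0 / 3.0}     # stationary distribution of this chain

    # Entropy rate H(u) = H(U_2 | U_1) in bits.
    H_rate = sum(pi[s] * -sum(P[s][t] * math.log2(P[s][t]) for t in (0, 1))
                 for s in (0, 1))

    def sample_and_score(N):
        """Draw u_1 ... u_N from the stationary chain; return -(1/N) log2 P(u_1...u_N)."""
        u = random.choices((0, 1), weights=(pi[0], pi[1]))[0]
        logp = math.log2(pi[u])
        for _ in range(N - 1):
            v = random.choices((0, 1), weights=(P[u][0], P[u][1]))[0]
            logp += math.log2(P[u][v])
            u = v
        return -logp / N

    print(f"entropy rate H(u) = {H_rate:.4f} bits/letter")
    for N in (100, 1000, 10000, 100000):
        print(f"N={N:6d}: -(1/N) log2 P(u_1...u_N) = {sample_and_score(N):.4f}")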
9.3 DIFFERENTIAL ENTROPY
Consider the continuous random variable x with density function p(x). The differential entropy of x is defined as follows:

H(x) = -\int_{-\infty}^{\infty} p(x)\log p(x)\,dx            (9.20)
Unlike the entropy for a discrete r.v., the preceding function is not necessarily
positive, not necessarily finite, and not necessarily invariant under transformations of x. Therefore, its interpretation as the amount of uncertainty associated
with x is no longer valid. The following examples demonstrate these facts
clearly.
Example 1
Let x be uniformly distributed between a and b. Then
H(x) = -\int_a^b \frac{1}{b-a}\log\frac{1}{b-a}\,dx = \log(b-a)            (9.21)

As b − a assumes values in the interval (0, ∞), the function H(x) in Eq. (9.21) takes on
negative, zero, and positive values.
Example 2
Let x have the following density function:
p(x) = \begin{cases} 0, & x < e \\ \dfrac{1}{x\ln^2 x}, & x \ge e \end{cases}            (9.22)

Then

H(x) = -\int_e^{\infty}\frac{1}{x\ln^2 x}\,\log\frac{1}{x\ln^2 x}\,dx = +\infty            (9.23)

since the contribution of the log x term alone, \int_e^{\infty}\frac{\log x}{x\ln^2 x}\,dx, already diverges.
Example 3
Let x be a Gaussian random variable with average μ and variance σ². Then

H(x) = -\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2/2\sigma^2}
         \log\Big[\frac{1}{\sqrt{2\pi}\,\sigma}e^{-(x-\mu)^2/2\sigma^2}\Big]dx
     = \log(\sqrt{2\pi}\,\sigma) + \frac{\sigma^2}{2\sigma^2}\log e
     = \frac{1}{2}\log(2\pi e\sigma^2)            (9.24)

The entropy in Eq. (9.24) is independent of μ and, depending on the value of σ, could
be positive, zero, or negative.

Now consider the reversible (one-to-one) transformation y = ax. Since
E(y) = aμ and Var(y) = a²σ², we have H(y) = ½ log(2πe a²σ²). Clearly, the differential entropy is not invariant under this reversible transformation.
The differences between entropy for a discrete r.v. and differential entropy for a continuous r.v. can be traced to the fact that differential entropy is
not a limiting case obtained by approximating a continuous distribution with a
discrete one. If the real axis is divided into intervals of length Δ and the nth interval is assigned the probability p(nΔ)Δ, then the discrete approximation to the
entropy will be
-\sum_{n=-\infty}^{\infty} p(n\Delta)\Delta\,\log[p(n\Delta)\Delta]
  = -\sum_{n=-\infty}^{\infty} p(n\Delta)\Delta\,\log p(n\Delta) \;-\; \log\Delta\sum_{n=-\infty}^{\infty} p(n\Delta)\Delta
  \;\to\; H(x) - \log\Delta            (9.25)
where −log Δ approaches infinity as Δ → 0. It is the absence of this infinitely large term
from the differential entropy that is responsible for its odd behavior.
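Eq. (9.25) can be checked numerically: a discretization of x with cell width Δ has ordinary (discrete) entropy close to H(x) − log Δ. The sketch below (an added illustration) does this for a unit-variance Gaussian, whose differential entropy is ½ log₂(2πe) ≈ 2.05 bits.

    import math

    def gaussian_pdf(x, sigma=1.0):
        return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

    H_diff = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy, sigma = 1

    for delta in (1.0, 0.1, 0.01):
        # Cell n is assigned probability approximately p(n*delta) * delta.
        n_max = int(10.0 / delta)                    # truncate the far tails
        cells = [gaussian_pdf(n * delta) * delta for n in range(-n_max, n_max + 1)]
        total = sum(cells)
        cells = [p / total for p in cells]           # renormalize the truncated grid
        H_discrete = -sum(p * math.log2(p) for p in cells if p > 0)
        print(f"delta={delta}: H_discrete={H_discrete:.4f}  "
              f"H_discrete + log2(delta)={H_discrete + math.log2(delta):.4f}  "
              f"H(x)={H_diff:.4f}")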
Among the continuous random variables with fixed standard deviation σ,
the Gaussian r.v. has the largest differential entropy. To prove this, we need the
following lemma.
Lemma. For any two probability density functions p(x) and q(x),

-\int_{-\infty}^{\infty} p(x)\log p(x)\,dx \;\le\; -\int_{-\infty}^{\infty} p(x)\log q(x)\,dx            (9.26)

Proof.

-\int_{-\infty}^{\infty} p(x)\log p(x)\,dx + \int_{-\infty}^{\infty} p(x)\log q(x)\,dx
  = \int_{-\infty}^{\infty} p(x)\log\frac{q(x)}{p(x)}\,dx
  \le \log e\int_{-\infty}^{\infty} p(x)\Big[\frac{q(x)}{p(x)} - 1\Big]dx
  = \log e\Big[\int_{-\infty}^{\infty} q(x)\,dx - \int_{-\infty}^{\infty} p(x)\,dx\Big] = 0
Now let p(x) be an arbitrary density with mean μ and variance σ², and choose

q(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{(x-\mu)^2}{2\sigma^2}\Big]

Then, from the lemma,

H(x) = -\int_{-\infty}^{\infty} p(x)\log p(x)\,dx
  \le -\int_{-\infty}^{\infty} p(x)\log q(x)\,dx
  = \log(\sqrt{2\pi}\,\sigma) + \int_{-\infty}^{\infty} p(x)\,\frac{(x-\mu)^2}{2\sigma^2}\log e\,dx
  = \frac{1}{2}\log(2\pi e\sigma^2)            (9.27)
The right side of Eq. (9.27) is the entropy for a Gaussian r.v. with variance σ²
(see Example 3). Thus the Gaussian distribution has the largest differential entropy, as stated before. Moreover, in Eq. (9.27) equality holds iff p(x) = q(x)
everywhere.
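As a numerical illustration of this maximum-entropy property (added here, not part of the original text), the sketch below compares the differential entropies, in bits, of three common densities scaled to the same variance σ² = 1, using their standard closed-form entropies.

    import math

    sigma = 1.0  # common standard deviation

    # Gaussian: H = (1/2) log2(2*pi*e*sigma^2).
    H_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
    # Uniform on an interval of length L: variance L^2/12, entropy log2 L.
    L = math.sqrt(12) * sigma
    H_uniform = math.log2(L)
    # Laplace with scale b: variance 2*b^2, entropy log2(2*b*e).
    b = sigma / math.sqrt(2)
    H_laplace = math.log2(2 * b * math.e)

    print(f"Gaussian : {H_gauss:.4f} bits")
    print(f"Laplace  : {H_laplace:.4f} bits")
    print(f"Uniform  : {H_uniform:.4f} bits")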
Joint and conditional entropies as well as average mutual information for
continuous random variables can be defined in a similar way. For instance, if
x and y are continuous random variables with joint density p(x, y), conditional
densities p(x|y) and p(y|x), and marginal densities p(x) and p(y), we define
H(x, y) = -\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p(x, y)\log p(x, y)\,dx\,dy            (9.28)

H(x \mid y) = -\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p(x, y)\log p(x \mid y)\,dx\,dy            (9.29)

I(x, y) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}\,dx\,dy            (9.30)
In this section we derive the expression for the capacity of a band-limited channel with average power constraint and signal-independent, additive, Gaussian
noise. The arguments used here are not mathematically rigorous but are intuitively acceptable, and in fact the final result, which is the expression for the
channel capacity, is an exact result. The channel is diagrammatically shown
in Figure 9.2. The Fourier spectrum of the input waveform x(t) is limited to
frequencies between −W and +W cycles/second, and its average power,
defined as
P = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{+T} x^2(t)\,dt            (9.31)
is confined to values below P₀. The noise z(t) is white, Gaussian, and independent of x(t). Its spectral density is ½N₀, independent of frequency.
To analyze the system, we first sample the waveforms at the Nyquist rate
of 2W samples per second. This preserves the information content of the waveforms and allows us to confine attention to the samples alone. Let us denote the
total number of samples in the time interval [−T, T] by K; then K = 4WT. For
sufficiently large K, the input power constraint can be expressed as
sufficiently large K, the input power constraint can be expressed as
1 K
- 2: xf s. Po
K
(9.32)
i= 1
Figure 9.2  [Block diagram of the band-limited channel: input x(t), band-limiting filter H(f), additive noise z(t), output y(t).]
where x_i are the samples of x(t) in the interval [−T, T]. The noise samples z_i
are zero-mean, Gaussian random variables with variance σ_z² = N₀W. These
samples are independent of the samples of x(t) and of each other. From the law
of large numbers, one can deduce that
\frac{1}{K}\sum_{i=1}^{K} z_i^2 \approx \sigma_z^2            (9.33)
for sufficiently large K. Finally, the output samples y_i must satisfy the following inequality:
\frac{1}{K}\sum_{i=1}^{K} y_i^2 \le P_0 + \sigma_z^2            (9.34)
Now consider a K-dimensional Euclidean space and assign the K coordinates of the space to the K samples of the waveforms. In this way, each waveform is uniquely identified with a point in the K-dimensional space. Inequality
(9.32) then states that the input signal x(t) must be within a sphere of radius
√P₀. Similarly, the output signal y(t) is within a sphere of radius √(P₀ + σ_z²),
and each realization of the noise waveform z(t) is, with a probability close to
1, inside a sphere of radius σ_z. As the diagram in Figure 9.3 shows, when an
input signal such as x₀(t) is transmitted, the output signal y₀(t) must be somewhere within a sphere of radius σ_z around x₀(t). If the input signals that are
used for communication over the channel are far enough from each other, their
corresponding noise spheres will not overlap and, consequently, the receiver
can correctly identify the transmitted signals.

Figure 9.3  [Noise spheres of radius σ_z around the transmitted signals, packed inside the output sphere of radius √(P₀ + σ_z²); Axis 1 and Axis 2 are two coordinates of the K-dimensional signal space.]
The problem of calculating the channel capacity is now reduced to that of
calculating the maximum number of spheres of radius σ_z that can be packed inside a sphere of radius √(P₀ + σ_z²) without overlap. Since the dimensionality K
of the space is large, this is approximately the ratio of the volumes of the two
spheres. Thus, if M denotes the maximum number of input signals that can be
correctly identified at the receiver, we can write
M \approx \frac{\big(\sqrt{P_0+\sigma_z^2}\,\big)^K}{\sigma_z^K} = \Big(1 + \frac{P_0}{\sigma_z^2}\Big)^{K/2}            (9.35)
The maximum rate of transfer of data over the channel is then given by

C = \frac{\log_2 M}{2T} = \frac{K}{4T}\log_2\Big(1 + \frac{P_0}{\sigma_z^2}\Big)  bits/second

But K = 4WT and σ_z² = N₀W; therefore,

C = W\log_2\Big(1 + \frac{P_0}{N_0 W}\Big)  bits/second            (9.36)
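Eq. (9.36) is straightforward to evaluate. The sketch below (an added illustration with arbitrary example values for P₀ and N₀) computes C for several bandwidths and also prints the limiting value P₀/(N₀ ln 2) approached as W → ∞.

    import math

    def capacity(W, P0, N0):
        """Capacity of the band-limited Gaussian channel, Eq. (9.36), in bits/second."""
        return W * math.log2(1.0 + P0 / (N0 * W))

    P0 = 1e-3   # average signal power constraint (watts) -- example value
    N0 = 1e-9   # noise spectral density parameter (watts/Hz) -- example value

    for W in (1e3, 1e4, 1e5, 1e6, 1e7):
        print(f"W = {W:10.0f} Hz : C = {capacity(W, P0, N0):12.1f} bits/second")

    # As W grows without bound, C approaches P0/(N0 * ln 2).
    print(f"limit W -> infinity : {P0 / (N0 * math.log(2)):12.1f} bits/second")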
In this section we describe an iterative method of calculating the channel capacity for a discrete memoryless channel and prove its convergence. The algorithm
was discovered by S. Arimoto [13] and, independently, by R. Blahut [14]; our
proof of convergence here follows that of Arimoto. This algorithm can be applied without any restrictions to a DMC with transition matrix [p(y_j | x_k)] and
can easily be programmed on a personal computer. To begin with, let us rewrite
the mutual information between the input and output of the channel in the following form:
I(x, y) = \sum_{k=1}^{K} p(x_k)\,a_k - \sum_{k=1}^{K} p(x_k)\log p(x_k)            (9.37)

where

a_k = \sum_{j=1}^{J} p(y_j \mid x_k)\log p(x_k \mid y_j)            (9.38)
is a function of both the channel matrix and the input distribution P = {p(x₁),
…, p(x_K)}. It will be assumed throughout the section that the input distribution
P is strictly positive; that is, p(x_k) > 0 for 1 ≤ k ≤ K. Under this condition
the conditional probabilities p(x_k | y_j), and hence the coefficients a_k of Eq. (9.38),
are well defined. (In the remainder of this section all logarithms are natural, so that
mutual information and capacity are measured in nats.) The iteration starts with the
uniform distribution p^(1)(x₁) = ⋯ = p^(1)(x_K) = 1/K. With input distribution p^(n) at
the nth step, the coefficients a_k^(n) are obtained from Eq. (9.38), and the mutual
information is

I^{(n)}(x, y) = \sum_{k=1}^{K} p^{(n)}(x_k)\,a_k^{(n)} - \sum_{k=1}^{K} p^{(n)}(x_k)\ln p^{(n)}(x_k)            (9.39)

The next input distribution p^(n+1) is chosen as the one that maximizes

\sum_{k=1}^{K} p(x_k)\,a_k^{(n)} - \sum_{k=1}^{K} p(x_k)\ln p(x_k)            (9.40)

over all input distributions P, and the corresponding maximum value is denoted I^(n):

I^{(n)} = \max_P\Big[\sum_{k=1}^{K} p(x_k)\,a_k^{(n)} - \sum_{k=1}^{K} p(x_k)\ln p(x_k)\Big]            (9.41)

To carry out the maximization, define

q_k = \frac{e^{a_k}}{\sum_{i=1}^{K} e^{a_i}}            (9.42)

Obviously, q_k > 0 and \sum_k q_k = 1, and the quantity to be maximized can be
written as

\sum_{k=1}^{K} p(x_k)\,a_k - \sum_{k=1}^{K} p(x_k)\ln p(x_k)
  = \sum_{k=1}^{K} p(x_k)\ln\frac{q_k}{p(x_k)} + \ln\Big(\sum_{k=1}^{K} e^{a_k}\Big)
  \le \ln\Big(\sum_{k=1}^{K} e^{a_k}\Big)            (9.43)
In Eq. (9.43), equality holds iff p(x_k) = q_k for all k (that is, the maximizing
distribution is Q). Therefore, p^(n+1) = Q and I^(n) = ln(Σ_k e^{a_k^(n)}); the values of
a_k^(n) are obtained from Eq. (9.38) with P = p^(n).
If at some stage in the computation p^(n+1) becomes equal to p^(n), the algorithm has come to successful completion, because p^(n) is then the distribution
that achieves the channel capacity. In general, the sequence {I^(n)(x, y)} increases
from one iteration to the next. To see this, note that the maximum in Eq. (9.41) is
attained at P = Q = p^(n+1), so that I^(n) = Σ_k p^(n+1)(x_k) a_k^(n) − Σ_k p^(n+1)(x_k) ln p^(n+1)(x_k),
and therefore

I^{(n)} - I^{(n+1)}(x, y) = \sum_{k=1}^{K} p^{(n+1)}(x_k)\,[a_k^{(n)} - a_k^{(n+1)}]            (9.44)

  = \sum_{j=1}^{J}\sum_{k=1}^{K} p^{(n+1)}(x_k)\,p(y_j \mid x_k)\ln\frac{p^{(n)}(x_k \mid y_j)}{p^{(n+1)}(x_k \mid y_j)}
  \le \sum_{j=1}^{J} p^{(n+1)}(y_j)\Big[\sum_{k=1}^{K} p^{(n)}(x_k \mid y_j) - 1\Big] = 0            (9.45)
This proves the validity of Eq. (9.45) and consequently the claim that the sequence {I^(n)(x, y)} is strictly increasing. Since the sequence is bounded from
above by the channel capacity C, it must have a limit C̄ that is less than or
equal to C. Our task is now reduced to proving that C̄ is in fact equal to C.
Lemma. Define δ^(n) = C − I^(n), where C is the capacity, and denote
the input distribution that achieves capacity by P* = {p*(x₁), …, p*(x_K)}.
Then

\delta^{(n)} \le \sum_{k=1}^{K} p^*(x_k)\ln\frac{p^{(n+1)}(x_k)}{p^{(n)}(x_k)}            (9.46)
Proof. From Eq. (9.42) and the discussion following Eq. (9.43), we
can write

I^{(n)} = \ln\Big(\sum_{k=1}^{K} e^{a_k^{(n)}}\Big) = \ln\frac{e^{a_k^{(n)}}}{p^{(n+1)}(x_k)}
       = a_k^{(n)} - \ln p^{(n+1)}(x_k)            (9.47)

which holds for every k. Substituting for a_k^(n) from Eq. (9.38), with
p^(n)(x_k | y_j) = p^(n)(x_k) p(y_j | x_k)/p^(n)(y_j), yields

I^{(n)} = \sum_{j=1}^{J} p(y_j \mid x_k)\ln\frac{p^{(n)}(x_k)\,p(y_j \mid x_k)}{p^{(n+1)}(x_k)\,p^{(n)}(y_j)}            (9.48)
Multiplying Eq. (9.48) by p*(x_k), summing over k, and comparing the result
with C = Σ_k Σ_j p*(x_k) p(y_j | x_k) ln[p*(x_k | y_j)/p*(x_k)], one arrives, after an
application of the inequality ln t ≤ t − 1, at Eq. (9.46). This completes the proof
of the lemma.

Theorem. The limit C̄ of the sequence {I^(n)(x, y)} is C.

Proof. Summing Eq. (9.46) over n from 1 to N, the right side telescopes, and we obtain

\sum_{n=1}^{N}\delta^{(n)} \le \sum_{k=1}^{K} p^*(x_k)\ln\frac{p^{(N+1)}(x_k)}{p^{(1)}(x_k)}            (9.50)

The right side of Eq. (9.50) is bounded for all N, while every δ^(n) is nonnegative;
therefore δ^(n) → 0, and the limit of the sequence is C.
Finally, we remark on the speed of convergence. Assuming the initial distribution p^(1) to be uniform, Eq. (9.50) can be written as

\sum_{n=1}^{N}\delta^{(n)} \le \ln K + \sum_{k=1}^{K} p^*(x_k)\ln p^{(N+1)}(x_k)
  \le \ln K - H^*(x) \le \ln K            (9.51)

Since {I^(n)(x, y)} is increasing, the sequence {δ^(n)} is decreasing, and therefore

\delta^{(N)} \le \frac{1}{N}\sum_{n=1}^{N}\delta^{(n)}            (9.52)

C - I^{(N)} = \delta^{(N)} \le \frac{\ln K}{N}            (9.53)
The inequality in Eq. (9.53) provides a simple criterion for terminating the
algorithm. A more sophisticated criterion is discussed in Problem 9.6.
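A minimal sketch of this iteration, written in Python with NumPy (an illustration, not the authors' program): starting from the uniform input distribution, each pass computes the conditional mutual informations I(x_k, y), updates the input distribution in proportion to p(x_k) exp[I(x_k, y)], which is equivalent to the update p^(n+1)(x_k) ∝ e^{a_k} described above, and stops when the gap max_k I(x_k, y) − I(x, y) of Problem 9.6 is sufficiently small. The binary symmetric channel at the end is an arbitrary test case.

    import numpy as np

    def blahut_arimoto(Q, tol=1e-9, max_iter=10000):
        """Iterative capacity computation for a DMC.

        Q[k, j] = p(y_j | x_k) is the channel transition matrix.
        Returns (capacity in bits, capacity-achieving input distribution).
        Illustrative sketch of the Arimoto-Blahut iteration."""
        K, J = Q.shape
        p = np.full(K, 1.0 / K)                # start from the uniform distribution
        for _ in range(max_iter):
            py = p @ Q                          # output distribution p(y_j)
            # Conditional mutual informations I(x_k, y) = D(p(y|x_k) || p(y)), in nats.
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(Q > 0, Q / py, 1.0)
            I_k = np.sum(np.where(Q > 0, Q * np.log(ratio), 0.0), axis=1)
            I = p @ I_k                         # mutual information I(x, y) for input p
            # Termination criterion of Problem 9.6: I(x, y) <= C <= max_k I(x_k, y).
            if np.max(I_k) - I < tol:
                break
            # Arimoto update: p^(n+1)(x_k) proportional to p(x_k) * exp(I(x_k, y)),
            # which equals e^{a_k} up to normalization.
            p = p * np.exp(I_k)
            p /= p.sum()
        return I / np.log(2.0), p               # convert nats to bits

    # Example: binary symmetric channel with crossover probability 0.1;
    # its capacity is 1 - H(0.1) = 0.5310 bits per use.
    Q_bsc = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
    C, p_star = blahut_arimoto(Q_bsc)
    print(f"capacity = {C:.4f} bits, input distribution = {p_star}")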
PROBLEMS
9.1. One method of obtaining the entropy for a discrete stationary source
u₁, u₂, …, u_N, … is to consider
9.2. For two continuous random variables x and y with joint density p(x, y) and
marginal densities p(x) and p(y), the average mutual information is

I(x, y) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}\,dx\,dy

(a) Prove that, in general, I(x, y) ≥ 0 with equality iff x and y are independent.
(b) Assuming that H(x), the differential entropy of x, exists, is it always true that
I(x, y) ≤ H(x)?
9.3. Let x and z be independent, Gaussian random variables with E(x) = E(z) = 0 and
var(x) = σ_x², var(z) = σ_z², and let y = x + z. What is the average mutual information between x and y?
9.4. The input x to a channel has uniform distribution on the interval (0, 2π). The noise
z is independent of x with an arbitrary distribution p(z). The output y is the sum
modulo 2π of x and z (i.e., the addition takes place on the unit circle). Determine
I(x, y) in terms of p(z) and evaluate it in the special case where z is uniformly distributed on the interval (α, β), with β − α ≤ 2π.
9.5. A channel has an input ensemble X consisting of the numbers +1 and −1 used
with the probabilities p(+1) = p(−1) = ½. The output y is the sum of the input x
and an independent noise random variable z with the probability density p(z) = ¼
for −2 < z ≤ 2 and p(z) = 0 elsewhere.
(a) Find and sketch the output probability density for the channel.
(b) Find I(x, y).
(c) Suppose the output is transformed into a discrete processed output u defined
by u = 1 for y > 1, u = 0 for −1 < y ≤ 1, and u = −1 for y ≤ −1. Find
I(x, u) and interpret your result.
9.6. Let I(x_k, y) be the conditional mutual information between the input and output of
a DMC for an arbitrary input distribution P = {p(x₁), …, p(x_K)}. Prove that

Max_k I(x_k, y) ≥ C

where C is the channel capacity. This provides a way to terminate the Arimoto-Blahut algorithm by calculating
the quantity

Max_k I(x_k, y) − I(x, y)

at each iteration and stopping the search when this value is sufficiently small.
[Hint: Let P₁ and P₂ be two arbitrary input distributions and show that

I_2(x, y) \le \sum_{k=1}^{K} P_2(x_k)\,I_1(x_k, y)

where I₁(x_k, y) is the conditional mutual information computed with input distribution P₁.]
APPENDIX
In this appendix we derive an upper bound on the partial binomial sum
\sum_{k=0}^{K}\binom{N}{k}, where Nα = K and 0 < α < ½, in terms of 2^{NH(α)} with
H(α) = −α log₂ α − (1 − α) log₂(1 − α).

The binomial coefficients below \binom{N}{K} decrease at least geometrically:

\binom{N}{K-1} = \binom{N}{K}\frac{K}{N-K+1} < \binom{N}{K}\frac{\alpha}{1-\alpha}, \qquad
\binom{N}{K-2} = \binom{N}{K-1}\frac{K-1}{N-K+2} < \binom{N}{K}\Big(\frac{\alpha}{1-\alpha}\Big)^2, \;\ldots

so that

\sum_{k=0}^{K}\binom{N}{k} < \binom{N}{K}\sum_{i=0}^{\infty}\Big(\frac{\alpha}{1-\alpha}\Big)^i = \binom{N}{K}\,\frac{1-\alpha}{1-2\alpha}

Next, bounding N!, K!, and (N − K)! by Stirling's formula, we write

\binom{N}{K} = \frac{N!}{K!\,(N-K)!}
  < \sqrt{\frac{N}{K(N-K)}}\,\Big(\frac{N}{K}\Big)^{K}\Big(\frac{N}{N-K}\Big)^{N-K}
  = \frac{\alpha^{-N\alpha}\,(1-\alpha)^{-N(1-\alpha)}}{\sqrt{N\alpha(1-\alpha)}}
  = \frac{2^{N[-\alpha\log_2\alpha - (1-\alpha)\log_2(1-\alpha)]}}{\sqrt{N\alpha(1-\alpha)}}

Therefore,

\sum_{k=0}^{K}\binom{N}{k} < \frac{1-\alpha}{(1-2\alpha)\sqrt{N\alpha(1-\alpha)}}\;2^{NH(\alpha)}

and, whenever the coefficient multiplying 2^{NH(α)} is less than 1 (which is the case for
all sufficiently large N),

\sum_{k=0}^{K}\binom{N}{k} < 2^{NH(\alpha)}
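The bound derived in this appendix can be spot-checked numerically. The sketch below (an added illustration) compares the partial binomial sum with 2^{NH(α)} for α = 0.25; the ratio stays below 1 and shrinks as N grows. It uses math.comb, available in Python 3.8 and later.

    import math

    def H2(a):
        """Binary entropy H(a) = -a*log2(a) - (1-a)*log2(1-a)."""
        return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

    def partial_binomial_sum(N, K):
        return sum(math.comb(N, k) for k in range(K + 1))

    alpha = 0.25
    for N in (16, 64, 256, 1024):
        K = int(alpha * N)
        lhs = partial_binomial_sum(N, K)
        rhs = 2.0 ** (N * H2(alpha))
        print(f"N={N:5d}, K={K:4d}: sum={lhs:.3e}  2^(N H(alpha))={rhs:.3e}  "
              f"ratio={lhs / rhs:.3f}")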
BIBLIOGRAPHY
1. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana (1964).
2. N. Abramson, Information Theory and Coding, McGraw-Hill, New York (1963).
3. R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York (1968).
4. R. Ash, Information Theory, Wiley-Interscience, New York (1965).
INDEX
Arimoto, S., 137
Arimoto-Blahut algorithm, 137, 141
Ash, R., 96, 99, 103, 110, 111
Binary erasure channel, 78
Binary symmetric channel (BSC), 52, 58
Bits, 12
Blahut, R., 137
(See also Arimoto-Blahut algorithm)
Block encoder
with fixed-length alphabet, 18
with variable-length alphabet, 19
BSC, 52
Capacity
of continuous channel, 135
of discrete memoryless channel, 57
of encoder with variable-length alphabet, 21, 22
Channel
binary erasure, 78
binary symmetric, 52, 58
capacity, 49, 57, 64, 66
coding, 12
deterministic, 53, 59, 68
discrete memoryless, 49
equivocation, 71
lossless, 53, 59, 67
matrix, 51
useless, 53, 59
Chebyshev inequality, 4, 9, 131
Check digits, 103
Code book (or table), 98
Conditional entropy, 98
Conditional mutual information, 63, 76
Constraint length, 114
Convex
function, 6
set, 6
Convolutional code, 113
constraint length of, 114
maximum likelihood decoding of, 116
truncated, 115
Convolutional encoder, 113, 119
ladder diagram for, 115
state diagram for, 114
Corrector, 108
Correlation, 4
Coset, 107
Coset leader, 107
D-admissible, 84
Data compression, 17, 86
Decoding set, 106
Differential entropy, 126, 132
Discrete memoryless
channel (DMC), 49
source (DMS), 17
Distortion
measure, 83
average, 84, 88
single-letter, 84
(See also Rate-distortion)
DMC, 49
DMS, 17
Double-error-correcting code, 111
Entropy, 11
conditional, 50
differential, 126, 132
grouping property of, 15
(See also Quasi-entropy)
Equivocation, 54, 57
Ergodicity, 125, 130
Error-correcting code
linear convolutional, 113
linear block or parity check, 98
power of, 96, 110
Error-pattern, 97
Error-pattern vector, 107
Fano
inequality (or bound), 70, 77, 94
(See also Shannon-Fano code)
Fidelity criterion, single-letter, 83
Fitingof, B. M., 41
Frequency vector, 43
Gallager, R. G., 26, 27, 127
Gaussian random variable, 133
Generator matrix, 105, 106, 121
Gilbert (See Varsharmov-Gilbert bound)
Group code, 104
Hamming
code, 86
distance, 72, 75, 96
lower bound, 111
upper bound, 98
Huffman code
binary, 31
generalization, 34
geometrical interpretation, 33
Independence, 4
Information
average or average mutual or mutual, 51
conditional mutual, 63, 76
Information digits, 103
Jensen inequality, 6, 44
Joint random variable, 3
Kraft inequality, 27, 28, 29, 30, 56
Ladder diagram, 115
Law of large numbers, weak, 4, 136
Likely sequence, 17, 42
Majority rule, 71
Maximum likelihood (ML), 72, 75, 116
McEliece, R. J., 115, 116, 117, 118
Source coding
fixed-length, 18
variable-length, 25
with fidelity criterion, 83
State diagram, 114
Stationary source, discrete, 125, 126
entropy of, 127
ergodic, 129
Stirling's formula, 7, 45, 143
Stochastic process, 126
Sufficient reduction, 69
Syndrome, 108
Threshold decoding, 119
Transition matrix (See Channel matrix)
Triangle inequality, 97
Typical sequence (See Likely sequence)
Unique decodability, 26
Universal coding, 41, 45
Variant of matrix, 101
Varsharmov-Gilbert bound, 112
Viterbi algorithm, 116
Weight of code-word, 110