Introduction To Information Theory


INTRODUCTION TO

INFORMATION THEORY

CLAUDE E. SHANNON
The Father of Information Theory
(Photo by Frank Ross, Cape May Courthouse, New Jersey)

INTRODUCTION TO
INFORMATION THEORY

Masud Mansuripur
College of Engineering
Boston University

PRENTICE-HALL, INC.
Englewood Cliffs, New Jersey 07632

Masud Mansuripur (date)
Introduction to information theory.
Bibliography: p.
Includes index.
1. Information theory. I. Title.
Q360.M3215 1987        001.53'9        86-17033
ISBN 0-13-484668-0

Cover Design: Diane Saxe


Manufacturing Buyer: Rhett Conklin

Copyright 1987 by Prentice-Hall, Inc.
A division of Simon & Schuster
Englewood Cliffs, New Jersey 07632

All rights reserved. No part of this book may be reproduced,
in any form or by any means,
without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2

ISBN 0-13-484668-0 025

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Prentice-Hall of Southeast Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro

To Anne and Kaveh

CONTENTS

Preface

Information Theory and the Modern Scientific Outlook,
by Professor L. B. Levitin

Chapter 1  MATHEMATICAL PRELIMINARIES
1.1  Review of Probability
1.2  Chebyshev Inequality and the Weak Law of Large Numbers
1.3  Convex Sets and Functions: Jensen's Inequality
1.4  Stirling's Formula
     Problems

Chapter 2  ENTROPY, DISCRETE MEMORYLESS SOURCE, AND BLOCK ENCODING
2.1  Entropy of a Discrete Random Variable
2.2  H(x) as a Measure of Uncertainty
2.3  Entropy and Sequences of a Random Variable
2.4  Discrete Memoryless Source and Likely Sequences
2.5  Block Encoder with Fixed-length Alphabet
2.6  Block Encoder with Variable-length Alphabet
     Problems

Chapter 3  VARIABLE-LENGTH SOURCE CODING
3.1  Unique Decodability and Prefix Condition
3.2  Kraft Inequality
3.3  Significance of the Prefix Condition
3.4  Shannon-Fano Coding
3.5  Optimum Binary Coding: Huffman's Technique
3.6  Geometrical Interpretation of Huffman's Technique
3.7  Generalization of Huffman's Technique
     Problems

Chapter 4  UNIVERSAL SOURCE CODING
4.1  Discrete Memoryless Source with Unknown Statistics
4.2  Frequency Vectors and Quasi-entropy
4.3  A Universal Encoding Scheme for the DMS
     Problems

Chapter 5  MUTUAL INFORMATION, DISCRETE MEMORYLESS CHANNEL, AND CHANNEL CAPACITY
5.1  Conditional Entropy and Mutual Information
5.2  Discrete Memoryless Channel and the Channel Matrix
5.3  Symmetric, Lossless, Deterministic, and Useless Channels
5.4  Equivocation and the Rate of Transfer of Information
5.5  Channel Capacity
     Problems

Chapter 6  CHANNEL THEOREMS
6.1  Conditional Mutual Information and Capacity
6.2  Cascaded Channels
6.3  Reduced Channels and Sufficient Reductions
6.4  Fano's Bound on Equivocation
6.5  Improving Channel Reliability
6.6  Noisy Channel Theorem for the BSC
6.7  Noisy Channel Theorem for the General DMC
6.8  Converse Theorem
     Problems

Chapter 7  RATE-DISTORTION THEORY
7.1  Source Coding with a Fidelity Criterion
7.2  Rate-distortion Function and Its Properties
7.3  Fundamental Theorem of Rate-distortion Theory
7.4  Converse Theorem
     Problems

Chapter 8  ERROR-CORRECTING CODES FOR THE BINARY SYMMETRIC CHANNEL
8.1  Hamming Distance and Error-correcting Power
8.2  Hamming Upper Bound on the Number of Code Words
8.3  Parity Check Codes
8.4  Properties of the Parity Check Matrix
8.5  Construction of Parity Check Codes from the Matrix
8.6  Equivalence of Group Codes and Parity Check Codes
8.7  Decoding of Parity Check Codes: Decoding Sets
8.8  Error-correcting Power of Parity Check Codes
8.9  Hamming Lower Bound on the Number of Check Digits
8.10 Varsharmov-Gilbert Upper Bound on the Number of Check Digits
8.11 Convolutional Encoder
8.12 State and Ladder Diagrams
8.13 Maximum Likelihood Decoding and the Viterbi Algorithm
     Problems

Chapter 9  ADVANCED TOPICS
9.1  Discrete Stationary Sources
9.2  Ergodicity and the Shannon-McMillan Theorem
9.3  Differential Entropy
9.4  A Simple Continuous Channel
9.5  The Arimoto-Blahut Algorithm
     Problems

Appendix

Bibliography

Index

PREFACE

This book has been developed out of a course in information theory that I
have taught at Boston University during the past several years. The course is
intended for beginning graduate students, as well as advanced undergraduates
who desire a conceptual understanding of the foundations of information theory.
It is thus designed to cover the major ideas involved in this important branch of
modern science and to give the student an overview of the subject without getting bogged down in more advanced and sometimes less intuitive issues. The
required background is probability theory at an undergraduate level. Although
some basic concepts of probability are reviewed in the beginning, it is essential
that the reader be familiar with the techniques of probability theory in general.
A certain degree of mathematical maturity and capacity for abstract thinking on
the part of the reader is also assumed.
Several excellent textbooks on information theory are available at this
time, and it may seem inappropriate to write yet another one on the subject. I
found it difficult, however, to find a text that addressed all the major concerns
of the theory at the proper level for my intended audience. Some of the books
were written for Ph.D.-level students and researchers and thus required a
much higher knowledge base than introductory probability theory, while others,
although appropriate for the less advanced student, were not comprehensive
enough in their coverage. The present book is thus the outcome of my effort to
bridge this gap; although it contains some of the more recent developments in
information theory, which, to my knowledge, have not yet appeared in any
textbook, it remains an introductory exposition of the basic concepts at the
core. In organizing the book, I have been guided by the original paper of C. E.
Shannon, "A Mathematical Theory of Communication," and by the monumental
work of R. Gallager, Information Theory and Reliable Communication. I also
benefited from the books Information Theory and Coding, by N. Abramson,
Information Theory, by R. Ash, The Theory of Information and Coding, by
R. McEliece, Coding and Information Theory, by R. Hamming, and Rate Distortion Theory, by T. Berger. The chapter on universal source coding is based
on the original papers of B. Fitingof, and the section on numerical computation
of channel capacity is based on the papers of R. Blahut and S. Arimoto.

Although intuitive development of concepts has been my goal in this
book, I have tried not to achieve this goal at the expense of mathematical rigor.
All the results are derived from the basic principles, and every step in the
derivation is carefully described. The problems are an important part of the
book in that they either give a different perspective on a subject that has already
been developed in the text or try to extend and generalize the concepts.
Chapter 1 is a review of the mathematical tools that are required for the
understanding of the book. It is not necessary, however, to cover this material
in the beginning; the reader can refer to it as the need arises. Chapter 2 concerns entropy and its properties; typical or likely sequences are introduced here,
and a first hint of the usefulness of the concept in data-compression applications is given. Chapter 3 expands on the idea of source coding for data compression and introduces various properties of variable-length source codes. In
Chapter 4 we look at source coding from a different point of view and introduce the idea of universal coding. This is a departure from the classical information theory in that a previous knowledge of the probabilistic structure of the
source is no longer required for the construction of an optimum universal code.
Chapters 5 and 6 are concerned with the discrete memoryless channel; after
defining conditional entropy and mutual information and familiarizing the
reader with some elementary discrete memoryless channels in Chapter 5, we
prove the noisy channel theorem and its converse in Chapter 6. Chapter 7 is an
elementary treatment of the rate-distortion theory. Here the concept of source
coding with a fidelity criterion is introduced and the rate-distortion function is
defined; the fundamental theorem of rate-distortion theory for the discrete
memoryless source with single-letter fidelity criterion and its converse are then
established.
I found it hard to completely ignore the problems associated with the
more practical aspects of information theory, which, more often than not, are
treated independently under the name of Coding Theory. I thus included some
material on error-correcting codes, addressing the elementary aspects of linear
codes for error correction in Chapter 8. Both block codes and convolutional
codes are discussed here. It is hoped that through this chapter the reader will
see the relation between theory and practice. I deliberately avoided any mention
of cyclic codes in order to stay away from algebraic field theory; that would
have been beyond the scope of this book. Finally, in Chapter 9, some advanced
topics relating to stationary and ergodic sources and continuous channels are
discussed. This can be viewed as an extension of some of the ideas developed
in the previous chapters and is aimed at students who would like to have a glimpse at
what could be the subject of a second course in information theory.
I have used the book for a one-semester, four-credit course in information
theory at Boston University. Chapters 1 through 6 are usually covered in the
beginning and then, depending on the level of understanding and the interest of
students, I have selected topics from the remaining three chapters. Chapter 8 on
linear codes has always been a favorite of engineering students. If I were to


teach the course in a school with the quarter system, I would probably teach the
first six chapters in the first quarter and use some additional material with the
last three chapters for the second quarter.
I would like to thank Professor Lev Levitin, who inspired me and spent
many of his precious hours discussing the more subtle points of information
theory with me. He also honored me by agreeing to write an introductory
chapter for this book. I would also like to thank Dean Louis Padulo of the
College of Engineering, who encouraged me in writing this book and provided
me with the time needed for completing it. Thanks are due to Mr. Tim Bazik,
the editor, for his support of the project. Finally, I would like to thank my wife,
Annegret, without whose patience and encouragement this book would not
have become a reality.

Acknowledgments
Problems 1.1, 2.1, 2.2, 2.6, 3.1, 3.4, 3.9, 3.10, 3.11, 3.12, 5.4, 5.5, 5.6, 6.4, 6.9,
7.4, 8.8, 8.9, 8.10, and 9.5 are from R. G. Gallager, INFORMATION THEORY AND
RELIABLE COMMUNICATION. New York: Wiley, 1968. Copyright 1968 by John
Wiley & Sons, Inc. Reprinted with permission.

Problems 2.4, 2.5, 2.9, 6.8, 8.2, 8.5, and 8.6 are from R. Ash, INFORMATION THEORY. New York: Interscience, 1965. Copyright 1965 by John Wiley & Sons, Inc.
Reprinted with permission.

Problems 2.11, 2.13, 3.15 (adapted), and 8.16 are from R. J. McEliece, THE THEORY
OF INFORMATION AND CODING: A MATHEMATICAL FRAMEWORK FOR
COMMUNICATION. New York: Cambridge, 1977. Copyright 1977 by Cambridge
University Press. Reprinted with permission.

INFORMATION THEORY AND THE
MODERN SCIENTIFIC OUTLOOK

Lev B. Levitin

One of the greatest revolutions in the scientific world outlook in our century is the turn from Laplacian determinism to a probabilistic picture of nature.
The development of statistical mechanics and (even in a more categorical way)
of quantum theory has brought us to the appreciation of the fact that the world
we live in is essentially probabilistic. A natural extension of this point of view
is an understanding that our knowledge is of a probabilistic nature, too. Any
information we obtain affects the probabilities of possible alternatives, rather
than indicates uniquely one particular outcome (as "genuine" deterministic
knowledge was supposed to do).
Therefore, it seems to be not just a sheer coincidence that information
theory emerged after statistical and quantum mechanics had been developed,
and that it shares with statistical physics the fundamental concept of entropy.
Mathematically, information theory is a branch of the theory of probabilities and stochastic processes. It has won its first victories by answering the
most basic questions concerning information transmission over communication
channels. In the past, communication engineers believed that the rate of information transmission over a noisy channel had to decline to zero if we require
the error probability to approach zero. Shannon was the first to show that the
information-transmission rate can be kept constant for an arbitrarily small probability of error. Besides its technological importance, this result has a remarkable philosophical meaning. Information theory not only gives a quantitative
measure of information common to both deterministic and probabilistic cases,
but it also shows the qualitative equivalence of these two kinds of knowledge in
the following sense: even if the input and the output of a communication channel are only statistically dependent, it is possible to transmit an amount of data
that is arbitrarily close to the amount of information in the output about the
input, with a vanishing error probability (i.e., in an almost deterministic way).
The development of information theory is an excellent illustration of the
statement that "nothing is more practical than a good theory." Indeed, at the
time when the basic results of information theory related to communication
channels had first been formulated, communication technology was not at all
able to implement them in practice. It took a long time, about a quarter of a
century, until the development of practical methods of data encoding and decoding together with new computer technology made it possible to process
information in real time in accordance with the recommendations following
from information theory, and thus to make the theoretical limits attainable. The
results regarded once by some too "practical" people as "academic exercises"
have become a matter of today's engineering practice. We are witnessing now
the ongoing information revolution, the vigorous development of new means
of information transmission, storage, and retrieval that may lead to profound
changes in the nature of our society in ways that could hardly be envisioned
by the most daring utopians. And information theory plays a crucial role in this
development by providing not only a theoretical basis, but also a deep philosophical insight into new and challenging problems we have to encounter today and tomorrow.
However, the importance and generality of information theory concepts
and approaches go far beyond the area of communication engineering. The
ideas of information theory were applied in a variety of diverse fields from
physics to linguistics, from biology to computer science, and from psychology
to chemistry, and proved to be productive and innovative, of course, in the
hands of those who knew how to handle them properly. Thus information
theory has become not just another special branch, but an indispensable part
of the modern scientific outlook. It should be borne in mind, however, that
"seldom do more than a few of nature's secrets give way at one time," as the
founder of information theory, Claude E. Shannon, observed, warning against
immature attempts to use information theory just because it had become
"something of a scientific bandwagon."

A thorough understanding of the mathematical foundation and its communication
application is surely a prerequisite to other applications. I personally believe that
many of the concepts of information theory will prove useful in these other fields, and, indeed, some results are already quite promising, but the establishing of
such applications is not a trivial matter of translating words to a new domain, but
rather the slow tedious process of hypothesis and experimental verification [1].

Historically, the basic concepts of information theory, such as entropy,
mutual information, equivocation, and redundancy, were first introduced by
Shannon in connection with cryptographic systems [2], rather than the usual
communication channels. The modern development of cryptography added an
important aspect of complexity to be taken into consideration. Information-theoretical analysis plays a significant role in the theory of computational and
structural complexity and in the design of effective decision algorithms (e.g.,
[3]). (An elementary example of such a decision algorithm is that related to the
famous counterfeit coin problem [4].)


Another important area of the application of information theory to computer science is fault-tolerant computing. Although the first attempts in this
direction were unsuccessful, more thorough investigations [5,6,7] have shown
that it is possible to achieve arbitrarily small probability of error in data storage
and computation at the expense of limited redundancy, exactly as in the case of
communications.
Essential interconnections between information theory and statistics have
been found [8], and new methods of statistical analysis based on information
theory have been suggested [9].
A number of interesting attempts have been made to apply information
theory in political economy and economics. For instance, the theory of optimal
investment appears to be exactly parallel to the theory of optimal source coding [10].
Applications of information theory to linguistics seem to be highly relevant
(e.g., [11-15]). Indeed, a natural language gives us a remarkable example of a
system used for generating long sequences of symbols (i.e., texts) that can be
considered as realizations of a random process. But, in contrast with other random processes that exist in nature, this random process was developed, modified, and selected during a long period of evolution and "natural selection,"
being specially intended for meaningful communication between human beings.
Information-theoretical studies of texts have revealed a number of significant
linguistic features. For instance, they provide objective criteria for the characterization of different styles and different schools in poetry and prose, and even
for the identification of individual authors [16]. An illustration of the importance of
the information-theoretical characteristics of a language is given by the fact
(noted first by Shannon) that large crossword puzzles are only possible if the
redundancy of a language does not exceed 50 percent (on the vocabulary level).
If the entropy of a language were half its actual value, poetry in
its usual form (with rhymes and meters) would be impossible.
Application of information theory to experimental psychology made it possible to discover some remarkable facts related to sensory organs and neural
systems [17]. It was found, for example, that the reaction time of a subject
is a linear function of the amount of information contained in the stimulus
[4,18,19]. Moreover, our sensory organs can be characterized by a certain
information capacity, as engineering communication lines are.
Perhaps the most important and meaningful are the interconnections between
information theory and statistical physics. Long before information theory was
founded, L. Boltzmann and later L. Szilard [20] attributed an information meaning to the thermodynamical notion of entropy. On the other hand, D. Gabor
[21] pointed out that "the communication theory should be considered as a
branch of physics." In the classical work of L. Brillouin [22], a profound relationship between physical entropy and information was first formulated in a
general form. Later the "entropy defect principle" was established in the quasi-classical [23,24] and the quantum [25,26] form. According to this principle,
any information is represented by a certain ensemble of states of a physical system and associated with its deviation from the thermodynamic equilibrium.
Thus the basic concepts of information theory can be defined on the basis of
statistical physics, and a way is open to develop a consistent physical information theory. The subject of this theory is the investigation of the physical nature of
information transmission, storage, retrieval, and processing (the so-called physics
of communication and physics of computation) (e.g., [27-29]). On the other
hand, the information-theoretical approach has been applied to statistical
physics [30-34]. Information theory has shed new light on the classical problems
of Maxwell's demon and Gibbs' paradox [22,34], and on the foundations of statistical physics in general. There is hope that further development in this
direction will eventually bridge the gap between physical and cybernetical
descriptions of a complex system and will lead to the formulation of a physical
theory of highly organized systems, both artificial and natural (biological). Progress along this way is slow and difficult [35], but the goal is worth all
the efforts.
Here we have to touch on another philosophical problem. Since the time of
Newton (or even Democritus), scientists believed that the most fundamental
laws, the most hidden secrets of nature, are those of elementary particles and
elementary forces acting between them. Indeed, if everything that happens in
the universe is no more than a combination of these elementary acts, then isn't
the knowledge of the laws that govern the interactions between elementary particles
sufficient to describe any phenomenon in the world, to predict theoretically the outcomes of any experiment? Today we have achieved incredible progress in discovering and describing the nature of elementary particles and interactions, and
have learned that this knowledge is incredibly insufficient for the ambitious
purpose of "explaining everything," of building an accomplished scientific picture of the world.
It seems that we know, indeed, how to derive the behavior of a system,
even as large as stars and galaxies, from the "first principles," if we deal with a
weakly organized, chaotic system. But we find our knowledge almost irrelevant
when we have to face a system at a higher level of organization (whatever that
means, I should add, since we still lack even a good definition of "level of
organization"). For instance, we have a wonderful theory of electromagnetic
interactions that predicts the experimental results with an accuracy of 15 decimal digits, and we know that the processes in a human body are mostly electromagnetic, but this perfect theory tells us very little, if anything, about how our
bodies function.
There has occurred another revolutionary shift in the minds of scientists: to
become aware that the greatest secrets of nature, the hardest to discover, and
the most important for us, are the laws of organization: the understanding of
how a certain complex "combination" of particles and processes can emerge
and persist as an organized system among the chaos of myriads of elementary

interactions; how it can be formed and sustained by those interactions; how a
"more organized" system can be built from "less organized" parts; why, when,
and how such a system becomes able to display the properties of "high organization" such as self-regulation, self-organization, learning, adaptivity, expedient behavior, self-reproduction, and, eventually, intelligence. These are the
crucial problems of life and death. Until we solve them, we remain, with all
our knowledge and technology, helpless children in the cradle of nature.
Three geniuses of our time, J. von Neumann, N. Wiener, and C. E. Shannon, were the most influential in recognizing the problem with all its generality
and consequence and bringing it to our attention [36-39]. And all of them
stressed the importance of the concepts of entropy, information, and control for
developing a general theory of highly organized systems, a new science for
which Wiener coined the name cybernetics, but which is still waiting to be created. Today, information theory provides the only existing narrow bridge
between the two different worlds of chaotic and organized systems. And I
strongly believe that information theory will win its new triumphs by helping us
on our way to the ultimate knowledge intellectual beings desire: the knowledge of ourselves.

References
1. C. E. Shannon, "The Bandwagon," Trans. IRE, IT-2, No. 1, 1956.
2. C. E. Shannon, "Communication Theory of Secrecy Systems," BSTJ, 28, No. 4, 1949.
3. C. R. P. Hartmann and others, "Application of Information Theory to the Construction of Efficient Decision Trees," IEEE Trans., IT-28, No. 4, 1982.
4. A. M. Yaglom and I. M. Yaglom, Probability and Information. Hingham, Mass.: D. Reidel Publishing Co., 1983.
5. S. Winograd and J. D. Cowan, Reliable Computation in the Presence of Noise. Cambridge, Mass.: MIT Press, 1963.
6. M. C. Taylor, "Reliable Information Storage in Memories Designed from Unreliable Components" and "Reliable Computation in Computing Systems Designed from Unreliable Components," BSTJ, 47, No. 10, 1968.
7. A. V. Kuznetsov, "Information Storage in a Memory Assembled from Unreliable Components," Problems in Information Transmission, 9, No. 3, 1973.
8. S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.
9. S. Watanabe, "Information-Theoretical Analysis of Multivariate Correlation," IBM J. Res. Develop., 4, No. 1, 1960.
10. T. M. Cover, "Information Theory and Investment," 1985 IEEE Intern. Symp. Information Theory, Brighton, England, June 1985. Also "An Algorithm for Maximizing Expected Log Investment," IEEE Trans., IT-30, No. 2, 1984.
11. C. E. Shannon, "Prediction and Entropy of Printed English," BSTJ, 30, No. 1, 1951.
12. B. Mandelbrot, "An Informational Theory of the Statistical Structure of Language." In Communication Theory, W. Jackson, ed. New York: Academic Press, 1953.
13. I. M. Yaglom, R. L. Dobrushin, and A. M. Yaglom, "Information Theory and Linguistics," Voprosy yazykoznaniya (Problems of Linguistics), No. 1, 1960 (in Russian).
14. T. M. Cover and R. C. King, "A Convergent Gambling Estimate of the Entropy of English," IEEE Trans., IT-24, No. 4, 1978.
15. L. Levitin and Z. Reingold, "Evaluation of the Entropy of a Language by an Improved Prediction Method," Proc. 11th IEEE Convention in Israel, Tel-Aviv, 1979.
16. A. M. Kondratov, "Information Theory and Prosody (Entropy of the Rhythm of Russian Speech)," Problemy Kibernetiki (Problems of Cybernetics), 9, 1963 (in Russian).
17. H. Quastler, ed., Information Theory in Psychology. New York: Free Press, 1955.
18. J. A. Leonard, "Choice Reaction Time Experiments and Information Theory." In Information Theory, C. Cherry, ed. London: Butterworth, 1961.
19. A. T. Welford, "The Measurement of Sensory-Motor Performance: Survey and Reappraisal of Twelve Years' Progress," Ergonomics, 3, No. 3, 1960.
20. L. Szilard, "Über die Entropieverminderung in einem thermodynamischen System bei Eingriff intelligenter Wesen," Z. Physik, 53, No. 5, 1929.
21. D. Gabor, "Communication Theory and Physics," Phil. Mag., 41, No. 7, 1950.
22. L. Brillouin, Science and Information Theory. New York: Academic Press, 1956.
23. D. S. Lebedev and L. B. Levitin, "Information Transmission by Electromagnetic Field," Information and Control, 9, No. 1, 1966.
24. L. B. Levitin, "A Thermodynamic Characterization of Ideal Physical Information Channels," Journal of Information and Optimization Sciences, 2, No. 3, 1981.
25. L. B. Levitin, "On the Quantum Measure of the Amount of Information," Proc. IV National Conf. Information Theory, Tashkent, 1969 (in Russian).
26. L. B. Levitin, "The Amount of Information and the Quantum-Mechanical Irreversibility of Measurements," Proc. II Intern. Symp. Information Theory, Yerevan, 1971 (in Russian).
27. V. V. Mityugov, "Physical Grounds of Information Theory," Sovietskoe Radio, Moscow, 1976 (in Russian).
28. R. Landauer, "Fundamental Physical Limitation of the Computational Process." In Noise in Physical Systems, P. H. E. Meijer, R. D. Mountain, and R. J. Soulen, eds., NBS Spec. Pub. 614, Washington, D.C., 1981.
29. L. B. Levitin, "Physical Limitations of Rate, Depth and Minimum Energy in Information Processing," Intern. J. Theoretical Physics, 21, No. 3/4, 1982.
30. E. T. Jaynes, "Information Theory and Statistical Mechanics," Phys. Rev., Part I, 106, 620-630, 1957; Part II, 108, 171-190, 1957.
31. A. Katz, Principles of Statistical Mechanics: The Information Theory Approach. San Francisco: W. H. Freeman, 1967.
32. R. S. Ingarden, "Information Theory and Thermodynamics," Part I, Torun, Poland, 1974; Part II, Torun, Poland, 1975.
33. R. S. Ingarden, "Quantum Information Theory," Torun, Poland, 1975.
34. L. B. Levitin, "Quantum Amount of Information and Maximum Work," Proc. 13th IUPAP Conf. Statistical Physics, D. Cabib, D. G. Kuper, and I. Riess, eds., Bristol, Eng.: A. Hilger, 1978.
35. H. P. Yockey, R. L. Platzman, and H. Quastler, eds., Information Theory in Biology. Elmsford, N.Y.: Pergamon Press, 1958.
36. N. Wiener, Cybernetics, or Control and Communication in the Animal and the Machine, 2nd ed. Cambridge, Mass.: MIT Press, 1961.
37. J. von Neumann, "The General and Logical Theory of Automata," The Hixon Symposium, 1948, L. Jeffress, ed., Wiley, 1951.
38. C. E. Shannon, "Computers and Automata," Proc. IRE, 41, No. 10, 1953.
39. C. E. Shannon, "Von Neumann's Contributions to Automata Theory," Bull. Amer. Math. Soc., 64, No. 2, 1958.

1
MATHEMATICAL PRELIMINARIES

1.1 Review of Probability
1.2 Chebyshev Inequality and the Weak Law of Large Numbers
1.3 Convex Sets and Functions: Jensen's Inequality
1.4 Stirling's Formula
Problems

Introduction   The purpose of this chapter is to familiarize the reader
with some of the mathematical tools that are needed for the study of information theory. Although the tools are simple and their mastering requires no
knowledge of advanced mathematics, the reader must have a certain level of
mathematical maturity and sophistication in order to use them effectively.
Knowledge of probability theory at an introductory level and familiarity with
combinatorial techniques are the only prerequisites for the study of this book;
although Section 1.1 is devoted to a review of probability, it is no substitute for
formal education in this area. Section 1.2 describes the Chebyshev inequality
and the weak law of large numbers. In Section 1.3 we define convexity and
prove an important relationship, known as the Jensen inequality, for convex
functions. A simple derivation of Stirling's approximation to the factorial function is given in Section 1.4.

1.1 REVIEW OF PROBABILITY

In a probability experiment the outcome can be any member of a given set E.
The set E is known as the sample space of the experiment. Certain subsets of E
are defined as events. If the outcome of a trial belongs to the event E_i, then E_i
is said to have occurred. To each E_i, a probability P_i is assigned. The assignment is not arbitrary, and the collection of P_i's must satisfy the axioms of
probability theory. For the most part, in this book we assume either that the set
E is countable or that it is divided into a countable number of elementary
events. In either case, we have a discrete probability space.


Example 1

In a coin-flipping experiment, E = {H, T}. If the coin is fair, P{H} = P{T} = 0.5.
Example 2
If the experiment is to throw a die, then E = {1, 2, 3, 4, 5, 6}. If the die is fair,
P{1} = P{2} = · · · = P{6} = 1/6, P{2, 5} = 1/3, and P{1, 3, 5} = 1/2.

Example 3

Let the experiment be the weather forecast for a certain day; then E = {sunny, cloudy,
rainy, snowy}. A possible probability assignment is P{sunny} = 1/2, P{cloudy} = 1/4,
P{rainy} = 1/8, P{snowy} = 1/8.

Example 4

Take a typical English text and pick a letter at random; then E = {a, b, c, . . . , x, y, z}.
The probability distribution will be P{a} = 0.0642, . . . , P{e} = 0.103, . . . ,
P{z} = 0.0005.

If the outcome of an experiment is always the same, it is said to be a degenerate (or certain) experiment.
Example 5

In the English language, the letter that follows q in a word is always u. The experiment
is to open a book at random and to pick the letter following the first q encountered. The
sample space is E = {a, b, c, . . . , x, y, z}, but P{u} = 1 and P{a} = · · · = P{z} = 0.

A discrete random variable x is defined by assigning a real number x_i to
each elementary event E_i in a discrete sample space. The probability of x_i is
then denoted by p(x_i). The average (or expected value) and the variance of x
are defined as follows:

x̄ = E(x) = Σ_i x_i p(x_i)

σ_x² = E((x − x̄)²) = E(x² − 2x̄x + x̄²) = E(x²) − x̄²

where E(x²) is the expected value of x². In general, the expected value of a
function of x, such as f(x), is given by

E(f(x)) = Σ_i f(x_i) p(x_i)
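As a quick numerical illustration of these definitions (a sketch added here, not part of the original text), the Python fragment below computes the average, the variance, and the expected value of an arbitrary function for a small, made-up discrete distribution.

```python
# Illustrative sketch: average, variance, and E(f(x)) for a discrete r.v.
# The pmf below is hypothetical; any list of (value, probability) pairs works.

values = [1, 2, 3, 4]
probs  = [0.1, 0.2, 0.3, 0.4]          # must sum to 1

mean = sum(x * p for x, p in zip(values, probs))                # x-bar = E(x)
var  = sum(x**2 * p for x, p in zip(values, probs)) - mean**2   # E(x^2) - x-bar^2

def expected(f):
    """E(f(x)) = sum_i f(x_i) p(x_i)."""
    return sum(f(x) * p for x, p in zip(values, probs))

print(mean, var, expected(lambda x: (x - mean) ** 2))  # the last value equals var
```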


A pair of random variables (x, y) associated with an experiment forms a
joint random variable. If x, y are discrete, the joint probability distribution is
defined as {P_ij}, where P_ij = P{x = x_i, y = y_j}.
Example 6

The experiment is to throw a coin and a die. The possible outcomes are

(H, 1), (H, 2), . . . , (H, 6), (T, 1), (T, 2), . . . , (T, 6)

If the coin and the die are fair and independent, P_ij = 1/12 for all i, j. The diagram of
Figure 1.1 is often helpful in visualizing the situation. Using the diagram, it is easy to
verify that

(i) The sum of the probabilities P_ij is equal to 1.
(ii) P{Coin = Head} = P{(H, 1)} + P{(H, 2)} + · · · + P{(H, 6)} = 1/2.
(iii) P{Die = 5} = P{(H, 5)} + P{(T, 5)} = 1/6.

Now assume that the coin and the die are somehow interacting and as a
result their outcomes are not independent. The joint density may have the distribution shown in Figure 1.2. The following statements can be directly verified
from the diagram.
Figure 1.1  [joint probabilities of the fair, independent coin and die: every entry equals 1/12]
Figure 1.2  [joint probabilities of the interacting coin and die: a nonuniform table whose twelve entries (values among 1/6, 1/9, 1/12, 1/18, 1/24) sum to 1]

(i) P{(H, 5)} = 1/18
(ii) P{Coin = H} = 5/9
(iii) P{Coin = T} = 4/9
(iv) P{Die = 5} = 7/72
(v) P{Die = 5 given Coin = H} = (1/18) / [(1/12) + (1/18) + (1/9) + (1/9) + (1/9) + (1/12)] = 1/10

The conditional density of y given x is defined as

p(y_j | x_i) = P(x_i, y_j) / P(x_i)

where the marginal density P(x_i) is assumed to be nonzero. The marginal densities are defined as follows:

P(x_i) = Σ_j P(x_i, y_j),     P(y_j) = Σ_i P(x_i, y_j)

The random variables x and y are said to be independent if P(x_i, y_j) =
P(x_i)P(y_j) for all i, j.
The correlation C between x and y is defined as the expected value of
(x − x̄)(y − ȳ):

C(x, y) = E((x − x̄)(y − ȳ)) = Σ_i Σ_j (x_i − x̄)(y_j − ȳ) P(x_i, y_j) = E(xy) − x̄ ȳ

In the special case where x and y are independent,

E(xy) = Σ_i Σ_j x_i y_j P(x_i)P(y_j) = x̄ ȳ,   and therefore   C(x, y) = 0

Independence thus results in a lack of correlation. The opposite, however, is
not true (see Problem 1.3).
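The Python sketch below (an added illustration using a made-up joint distribution, not the one of Figure 1.2) computes the marginal densities, a conditional density, and the correlation C(x, y) = E(xy) − x̄ȳ, and confirms that an independent joint distribution gives zero correlation.

```python
# Marginals, conditional density, and correlation for a discrete joint pmf.
# p[i][j] = P{x = xs[i], y = ys[j]}; the numbers are illustrative only.

xs, ys = [0.0, 1.0], [1.0, 2.0, 3.0]
p = [[1/12, 1/6, 1/4],
     [1/6, 1/4, 1/12]]                                 # entries sum to 1

px = [sum(row) for row in p]                           # P(x_i) = sum_j P(x_i, y_j)
py = [sum(p[i][j] for i in range(2)) for j in range(3)]

def cond_y_given_x(j, i):
    """P(y_j | x_i) = P(x_i, y_j) / P(x_i)."""
    return p[i][j] / px[i]

mx = sum(x * q for x, q in zip(xs, px))
my = sum(y * q for y, q in zip(ys, py))
exy = sum(xs[i] * ys[j] * p[i][j] for i in range(2) for j in range(3))
print(px, py, cond_y_given_x(2, 0), exy - mx * my)     # correlation C(x, y)

# For an independent joint pmf, P(x_i, y_j) = P(x_i) P(y_j), the correlation is 0.
q = [[px[i] * py[j] for j in range(3)] for i in range(2)]
exy_ind = sum(xs[i] * ys[j] * q[i][j] for i in range(2) for j in range(3))
print(exy_ind - mx * my)                               # 0 (up to rounding)
```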

1.2 CHEBYSHEV INEQUALITY AND THE WEAK LAW OF LARGE NUMBERS

Given a random variable x with average x̄ and variance σ_x², the Chebyshev inequality for an arbitrary positive number δ is

P{|x − x̄| ≥ δ} ≤ σ_x²/δ²


A simple proof of this result is given here (for another proof, see Problem 1.4).
Define the function f(x) as follows:

f(x) = 1   if |x − x̄| ≥ δ
f(x) = 0   if |x − x̄| < δ

Then

P{|x − x̄| ≥ δ} = Σ_i f(x_i) p(x_i)

Upon inspection of Figure 1.3, it is obvious that

f(x) ≤ [(x − x̄)/δ]²

Therefore,

P{|x − x̄| ≥ δ} ≤ Σ_i [(x_i − x̄)/δ]² p(x_i) = σ_x²/δ²


The Chebyshev inequality can now be used to derive the weak law of
large numbers. Consider a binary experiment where the outcomes are 0 and 1
with probabilities p₀ and 1 − p₀, respectively. This experiment is repeated N
times independently, and the average output is defined as y_N; that is, y_N is equal
to the total number of 1's in the N trials divided by N. Obviously, y_N is a random variable with sample space {0, 1/N, 2/N, . . . , 1}. Defining x⁽ⁿ⁾ as the r.v.
associated with the outcome of the nth trial, we have

y_N = (1/N) Σ_{n=1}^{N} x⁽ⁿ⁾
The average and variance of YN are obtained as follows:


_
YN

I ~

=-

L.,

()
E (x " )

11 =1

I ~_

=-

L., X

=x

11 = 1

Figure 1.3  [the function f(x) together with the parabola ((x − x̄)/δ)²]


σ²_{y_N} = E((y_N − ȳ_N)²) = (1/N²) E([Σ_{n=1}^{N} (x⁽ⁿ⁾ − x̄)]²) = (1/N²) Σ_{n=1}^{N} E((x⁽ⁿ⁾ − x̄)²) = σ_x²/N

(the cross terms vanish because the trials are independent). For an arbitrary positive number ε, the Chebyshev inequality

P{|y_N − ȳ_N| ≥ ε} ≤ σ²_{y_N}/ε²

leads to the following statement of the weak law of large numbers:

P{ |(1/N) Σ_{n=1}^{N} x⁽ⁿ⁾ − x̄| ≥ ε } ≤ σ_x²/(Nε²)

Notice that the right side approaches zero with increasing N. The weak law of
large numbers thus asserts that the sample average of x approaches the statistical average with high probability as N → ∞.
1.3 CONVEX SETS AND FUNCTIONS: JENSEN'S INEQUALITY

In the Euclidean space, a set S is convex if, for every pair of points P₁, P₂ in S,
the straight line connecting P₁ to P₂ is completely contained in S. Figure 1.4a
shows two convex sets in the two-dimensional Euclidean space. The sets of
Figure 1.4b are not convex.

Figure 1.4  [(a) convex sets; (b) nonconvex sets]

If P₁ = (x₁, . . . , x_n) and P₂ = (y₁, . . . , y_n) are points in the Euclidean n-space, then the straight line connecting them is represented by the set of points


P, where P = λP₁ + (1 − λ)P₂ = [λx₁ + (1 − λ)y₁, . . . , λx_n + (1 − λ)y_n],
and λ is a real number in the interval [0, 1]. An important example of a convex
set is the set of all n-dimensional probability distributions {p₁, . . . , p_n} (see
Problem 1.6).
The real function f(P), defined on a convex set S, is convex cap (∩) if
for every pair of points P₁, P₂ in the set and for every λ in [0, 1] the following
inequality is satisfied:

f[λP₁ + (1 − λ)P₂] ≥ λf(P₁) + (1 − λ)f(P₂)

This property is shown graphically in Figure 1.5 for a function defined on
a one-dimensional space. If the direction of inequality is reversed (for all
P₁, P₂, λ), then the function is convex cup (∪).

Figure 1.5  [a convex ∩ function of one variable, evaluated at x₁, x₂, and λx₁ + (1 − λ)x₂]
Theorem. If λ₁, . . . , λ_N are nonnegative numbers whose sum is unity,
then, for every set of points P₁, . . . , P_N in the domain of the convex ∩ function
f(P), the following inequality is valid:

f(Σ_{n=1}^{N} λ_n P_n) ≥ Σ_{n=1}^{N} λ_n f(P_n)

Proof. We use induction on N. For N = 2, the inequality is valid by
definition of convexity. Assuming the validity of the theorem for N − 1, we
prove it for N.

f(Σ_{n=1}^{N} λ_n P_n) = f(λ_N P_N + (1 − λ_N) Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] P_n)
    ≥ λ_N f(P_N) + (1 − λ_N) f(Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] P_n)
    ≥ λ_N f(P_N) + (1 − λ_N) Σ_{n=1}^{N−1} [λ_n/(1 − λ_N)] f(P_n)
    = Σ_{n=1}^{N} λ_n f(P_n)

The proof is thus complete.


A direct consequence of the preceding theorem is Jensen 's inequality for
discrete random variable s. Let r. v. x assume the values XI. . . . .x, with probabilities Ph . . . ,PII ' Let f(x) be a convex n functio n whose domain includes
X), ... , XII ' Now E(x) = 2:i p.x, and E(f(x)) = 2:; pJ(Xi)' Therefore,

f(E(x))

2:

E(J(x))

This is known as Jensen's inequality.
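A quick numerical check of Jensen's inequality (added here for illustration) with the convex ∩ function f(x) = ln(x) and an arbitrary distribution:

```python
# Jensen's inequality for a convex-cap (concave) function: f(E(x)) >= E(f(x)).
import math

xs = [1.0, 2.0, 4.0, 8.0]
ps = [0.4, 0.3, 0.2, 0.1]           # probabilities summing to 1

f = math.log                        # ln(x) is convex cap on (0, infinity)
ex = sum(x * p for x, p in zip(xs, ps))
efx = sum(f(x) * p for x, p in zip(xs, ps))

print(f(ex), ">=", efx)             # f(E(x)) >= E(f(x)) holds
```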

1.4 STIRLING'S FORMULA

In this section we derive Stirling's formula, which provides upper and lower
bounds to N!. Consider the function ln(x), whose integral from 1 to N is given
by

∫₁^N ln(x) dx = [x ln(x) − x]₁^N = N ln(N) − N + 1

This integral is underapproximated by the trapezoid method of Figure 1.6a and
overapproximated by the midpoint method of Figure 1.6b. Consequently,

∫₁^N ln(x) dx ≥ ln 2 + ln 3 + · · · + ln(N − 1) + (1/2) ln(N) = ln(N!) − (1/2) ln(N)

∫₁^N ln(x) dx ≤ 1/8 + ln 2 + ln 3 + · · · + ln(N − 1) + (1/2) ln(N) = ln(N!) + 1/8 − (1/2) ln(N)

Therefore,

N ln(N) + (1/2) ln(N) − N + 7/8 ≤ ln(N!) ≤ N ln(N) + (1/2) ln(N) − N + 1

Figure 1.6  [(a) trapezoid approximation to ∫ ln(x) dx; (b) midpoint approximation]


which yields Stirling's formula as

N^N e^(−N) √N e^(7/8) ≤ N! ≤ N^N e^(−N) √N e
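For illustration (not in the original text), the following sketch compares N! with the two bounds just derived for a few values of N:

```python
# Stirling bounds: N^N e^(-N) sqrt(N) e^(7/8)  <=  N!  <=  N^N e^(-N) sqrt(N) e
import math

for N in (5, 10, 20, 50):
    lower = N**N * math.exp(-N) * math.sqrt(N) * math.exp(7 / 8)
    upper = N**N * math.exp(-N) * math.sqrt(N) * math.e
    print(N, lower <= math.factorial(N) <= upper, lower, math.factorial(N), upper)
```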

PROBLEMS
1.1. Three events E₁, E₂, and E₃, defined on the same space, have probabilities P(E₁) =
P(E₂) = P(E₃) = 1/3. Let E₀ be the event that one or more of the events E₁, E₂, E₃
occurs.
(a) Find P(E₀) when:
(1) E₁, E₂, E₃ are disjoint.
(2) E₁, E₂, E₃ are statistically independent.
(3) E₁, E₂, E₃ are three names for the same event.
(b) Find the maximum values that P(E₀) can assume when:
(1) Nothing is known about the independence or disjointness of E₁, E₂, E₃.
(2) It is known that E₁, E₂, E₃ are pairwise independent; that is, the probability
of realizing both E_i and E_j is P(E_i)P(E_j), but nothing is known about the
probability of realizing all three events together.
Hint: Use Venn diagrams.
1.2. A box contains two dice, one fair and the other loaded, so that for the first one
P(1) = P(2) = · · · = P(6) = 1/6 and for the second one P(6) = 1/2, P(1) = · · · =
P(5) = 1/10. We choose one die from the box at random and roll it. If the outcome is
the number 6, what is the probability that the loaded die has been selected? What if
the die is rolled twice and the outcome of both trials is the number 6?
1.3. Let x and y be discrete random variables.
(a) Prove that E(x + y) = E(x) + E(y).
(b) If x and y are independent, prove that E(xy) = E(x) · E(y); that is, x and y are
uncorrelated.
(c) Is it possible for x and y to be dependent but uncorrelated? Give an example.
(d) If x and y are independent, prove that Var(x + y) = Var(x) + Var(y). Is this
relationship valid when x and y are dependent but uncorrelated?
1.4. (a) For any random variable y that assumes only nonnegative values, prove the
following inequality:

P{y ≥ δ} ≤ ȳ/δ

where δ is an arbitrary positive number.
(b) Let x be a random variable with average x̄ and variance σ_x². Define a nonnegative random variable y = (x − x̄)² and show that

P{|x − x̄| ≥ δ} ≤ σ_x²/δ²

This is the Chebyshev inequality derived in Section 1.2.


1.5. A sequence of independent identically distributed random variables y₁, . . . , y_N with
average ȳ and standard deviation σ_y is given. Define the random variable x as
follows:

x = (1/N) Σ_{n=1}^{N} y_n

Show that the Chebyshev inequality for x is

P{|x − ȳ| ≥ δ} ≤ σ_y²/(Nδ²)

and investigate the limit when N → ∞.


1.6. Prove that the set of all n-dimensional probability distributions {p₁, . . . , p_n} is convex. For n = 3, show the set graphically.
1.7. The function f(x) is defined on the open interval (a, b) and is convex ∩. Prove that
f(x) is continuous. Would this conclusion be valid if the domain of the function
were a closed interval?
1.8. The function f(x) is defined and has a second derivative on the interval (a, b). Prove
that the necessary and sufficient condition for f(x) to be convex ∩ is

d²f(x)/dx² ≤ 0

for all points in (a, b).

2
ENTROPY, DISCRETE MEMORYLESS SOURCE, AND BLOCK ENCODING

2.1 Entropy of a Discrete Random Variable
2.2 H(x) as a Measure of Uncertainty
2.3 Entropy and Sequences of a Random Variable
2.4 Discrete Memoryless Source and Likely Sequences
2.5 Block Encoder with Fixed-length Alphabet
2.6 Block Encoder with Variable-length Alphabet
Problems

Introduction   The concept of entropy is the central concept in information theory. The entropy of a random variable is defined in terms of its probability distribution and can be shown to be a good measure of randomness or
uncertainty. This chapter begins by introducing the entropy for a discrete random variable in Section 2.1 and describes some of its properties in Section 2.2.
Other important properties of entropy are discussed in the problems at the end
of the chapter.
The importance of entropy in practical applications has much to do with
its relationship with long sequences of a random variable. It turns out that repeated trials of a probabilistic experiment give rise to outcomes that can, under
quite general circumstances, be divided into two categories. The first category
consists of sequences that have a very small probability of occurrence; in fact,
the probability of occurrence of this category as a whole approaches zero with
the increasing length of the sequences. The second category consists of the
remaining sequences; these sequences, which are known as likely or typical
sequences, all have more or less the same probability and, as their length
increases, they become closer and closer to being equally likely. The number
of typical sequences and their individual probabilities are functions of the
entropy, and it is this relationship that gives entropy a major role in the theory
of data compression and error control coding. Sections 2.3 and 2.4 explore this
relationship.
The concept of entropy as developed in the original work of C. E. Shannon [1] has been used in several branches of science, but its most significant
applications to date have been in the areas of data communication and signal
processing. We will therefore restrict our discussion in this book to situations
that arise in source coding (data compression) and channel coding (error correction). This chapter and the next focus on the problem of source coding for discrete memoryless sources, where the goal is to accurately represent (encode)
the output of the information source with as few code words as possible.
This chapter deals with situations where a fixed-length block of data is
mapped onto a fixed-length block of code letters. Section 2.5 considers the
case where the code letters are of equal duration in time, while Section 2.6
considers situations such as telegraphy, where different code letters have different durations. In both cases, typical sequences produced by the information
source are encoded (represented) by sequences of code letters, and the minimum required rate of generation of code letters is shown to be equal to the
source entropy. In Chapter 3, fixed-length blocks of data are mapped onto
variable-length blocks of code letters, and encoding schemes are described
for the minimization of the average code-word length. It is shown that the
minimum achievable code-word length, on the average, is equal to the source
entropy.

2.1 ENTROPY OF A DISCRETE RANDOM VARIABLE

Let x be a random variable (r.v.) with sample space X = {x₁, . . . , x_N} and
probability measure P(x_n) = p_n. The entropy of x is defined as

H(x) = −Σ_{n=1}^{N} p_n log(p_n)     (2.1)

The base of the logarithm is arbitrary and amounts to a constant multiplicative
factor. [Remember that log_b(x) = (log_b a)(log_a x).] If this base is 2, the entropy is
said to be in bits (for binary digits); if the base is e, the entropy is said to be in
nats (for natural units). A plot of the function −p ln(p) is shown in Figure 2.1.
Note that at p = 0 we have set the function equal to zero, in agreement with
the fact that x ln(x) → 0 as x → 0. Also notice that the function −p ln(p) is a
convex ∩ function of p.
Example 1

Let X = {0, 1}, P(0) = p, and P(1) = 1 − p. Then

H(x) = −p log(p) − (1 − p) log(1 − p)

A plot of H(x) (in bits) versus p is shown in Figure 2.2. This function is convex ∩ and
has symmetry about p = 0.5. Note that at p = 0 and p = 1 the function is tangent to
the vertical axis.

Figure 2.1  [the function −p ln(p); its maximum value is e⁻¹ = 0.37]

Figure 2.2  [H(x) in bits versus p for the binary random variable]
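The following Python sketch (added for illustration) computes the entropy of Eq. (2.1) in bits for an arbitrary distribution and evaluates the binary entropy of Example 1 at a few values of p:

```python
# Entropy of a discrete random variable, Eq. (2.1), in bits.
import math

def entropy(probs):
    """H(x) = -sum p_n log2(p_n), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Binary entropy H = -p log2(p) - (1 - p) log2(1 - p) from Example 1.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, entropy([p, 1 - p]))     # symmetric about p = 0.5, maximum of 1 bit
```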

2.2 H(x) AS A MEASURE OF UNCERTAINTY

The entropy as defined by Eq. (2.1) has several properties. These properties
make H(x) a good measure of the uncertainty about the outcome of a probabilistic experiment. In this section, we describe some of these properties.
(i) If p₁ = 1 and p_n = 0 for n ≠ 1, then H(x) = 0; that is, the uncertainty about an experiment with deterministic outcome is zero.
(ii) If p₁ = p₂ = · · · = p_N = 1/N, then H(x) = log N. Note that, if
the number of equiprobable outcomes of an experiment increases
(i.e., if N increases), then the entropy of the experiment increases.
(iii) H(x) ≤ log N with equality iff† p_n = 1/N for all n. The proof of this
statement is based on the important inequality ln(x) ≤ x − 1, whose
validity becomes obvious upon inspection of Figure 2.3. In this figure the functions ln(x) and x − 1 are plotted on the same set of axes.
Note that the two functions are equal only at x = 1.

†iff means if and only if.


Figure 2.3

We now proceed with the proof:

H(x) − ln(N) = −Σ_{n=1}^{N} p_n ln(p_n) − Σ_{n=1}^{N} p_n ln(N) = Σ_{n=1}^{N} p_n ln(1/(N p_n))
    ≤ Σ_{n=1}^{N} p_n [1/(N p_n) − 1] = Σ_{n=1}^{N} (1/N) − Σ_{n=1}^{N} p_n = 1 − 1 = 0

Therefore, H(x) ≤ ln(N) with equality iff N p_n = 1 for all n; that is,
p_n = 1/N for all n. This implies that the uncertainty of an experiment is highest when the outcomes are equiprobable.
(iv) Let the r.v. x have sample space X = {x₁, . . . , x_N} and the r.v. y
have sample space Y = {y₁, . . . , y_M}. Then the joint r.v. z = (x, y)
has sample space Z = {(x₁, y₁), . . . , (x₁, y_M), (x₂, y₁), . . . , (x₂, y_M),
. . . , (x_N, y₁), . . . , (x_N, y_M)} with NM elements. Let x and y be independent and have equiprobable outcomes. Then H(x) = log(N),
H(y) = log(M), and H(z) = log(MN) = log(M) + log(N) =
H(x) + H(y); that is, the uncertainty of the joint experiment (x, y) is
the sum of the uncertainties of the independent experiments x and y.
This property remains valid even if x and y have nonuniform distributions. In this case,

H(z) = −Σ_{n=1}^{N} Σ_{m=1}^{M} P(x_n, y_m) log P(x_n, y_m)
     = −Σ_{n=1}^{N} Σ_{m=1}^{M} P(x_n)P(y_m) [log P(x_n) + log P(y_m)]
     = −Σ_{n=1}^{N} P(x_n) log P(x_n) Σ_{m=1}^{M} P(y_m) − Σ_{m=1}^{M} P(y_m) log P(y_m) Σ_{n=1}^{N} P(x_n)
     = H(x) + H(y)


(v) If an experiment is decomposed into two subexperiments, the original uncertainty will be the uncertainty associated with the first
experiment plus the average uncertainty associated with the second
experiment. This is the grouping property of the entropy. To explain
this property, consider the r.v. x with sample space X = {x₁, . . . , x_n,
x_{n+1}, . . . , x_N} and probability measure P(x_i) = p_i. Let us decompose X
into two subspaces, Y = {x₁, . . . , x_n} and Z = {x_{n+1}, . . . , x_N}.
The probabilities associated with Y and Z are given by P(Y) =
Σ_{i=1}^{n} p_i and P(Z) = Σ_{i=n+1}^{N} p_i. Furthermore, let us define random
variables y and z by P(y_i) = P(x_i)/P(Y), i = 1, 2, . . . , n, and P(z_i) =
P(x_{n+i})/P(Z), i = 1, 2, . . . , N − n. H(x) can now be written as

H(x) = −Σ_{i=1}^{N} p_i log p_i = −Σ_{i=1}^{n} p_i log p_i − Σ_{i=n+1}^{N} p_i log p_i
     = −P(Y) Σ_{i=1}^{n} P(y_i) [log P(y_i) + log P(Y)] − P(Z) Σ_{i=1}^{N−n} P(z_i) [log P(z_i) + log P(Z)]
     = −[P(Y) log P(Y) + P(Z) log P(Z)] + [P(Y)H(y) + P(Z)H(z)]

In the final expression, the first bracket represents the uncertainty associated with the first experiment and the second bracket is the average uncertainty associated with the second experiment.
Note: The preceding properties of the function H(x) defined by Eq. (2.1)
are reasonable properties for a measure of uncertainty. It is possible to prove,
however, that this function is the only one that satisfies a small set of reasonable requirements. In reference [4], it is shown that the entropy must be defined by Eq. (2.1) if properties (ii), (iv), and (v), plus a continuity property,
are required.
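As a numerical check of properties (iii) and (v) (an illustration added here, with an arbitrary six-point distribution), the sketch below verifies that H(x) ≤ log N and that the grouping decomposition reproduces H(x) exactly:

```python
# Check of H(x) <= log2(N) and of the grouping property of the entropy.
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.05, 0.15, 0.30, 0.10, 0.25, 0.15]      # illustrative distribution (sums to 1)
print(H(p) <= math.log2(len(p)))               # property (iii)

# Split X into Y = first three outcomes, Z = last three outcomes.
PY, PZ = sum(p[:3]), sum(p[3:])
y = [q / PY for q in p[:3]]                    # P(y_i) = P(x_i)/P(Y)
z = [q / PZ for q in p[3:]]                    # P(z_i) = P(x_{n+i})/P(Z)
grouped = H([PY, PZ]) + PY * H(y) + PZ * H(z)  # property (v)
print(abs(H(p) - grouped) < 1e-12)
```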

2.3 ENTROPY AND SEQUENCES OF A RANDOM VARIABLE

The following example is designed to bring out the significance of H(x) as related to long sequences of the random variable x.

Example 2

Consider the r.v. x with sample space X = {x₁, x₂}, P(x₁) = 1/3 and P(x₂) = 2/3. The entropy of x is given by

H(x) = −(1/3) log(1/3) − (2/3) log(2/3) = 0.918 bits


Let us repeat this experiment N times to obtain a sequence of N independent, identically
distributed (i.i.d.) random variables. In general, there are 2^N possible sequences. If the
number of occurrences of x₁ in a certain sequence is n, the probability of that sequence
is given by

p₁ⁿ (1 − p₁)^(N−n)

Since there are (N choose n) = N!/[n!(N − n)!] such sequences, their total probability is equal to

(N choose n) p₁ⁿ (1 − p₁)^(N−n)

Table 2.1 shows probabilities of various sequences for N = 15. Notice that the
most probable sequences are those for which n is close to Np₁ = 5. In fact, the probability of 2 ≤ n ≤ 8 is 0.95. In other words,

The probability of occurrence of a sequence for which n is significantly different
from Np₁ is very small.

Also notice that the individual probabilities of these sequences are between 2^(−15×0.718) and
2^(−15×1.118), which are fairly close to 2^(−NH(x)) = 2^(−15×0.918). In other words,

All the likely sequences are more or less equiprobable with probability about
2^(−NH(x)).

Finally, notice that the total number of sequences with 2 ≤ n ≤ 8 is 22,803 = 2^(15×0.965),
which is not far from 2^(NH(x)). In other words,

The total number of likely sequences is about 2^(NH(x)).

TABLE 2.1

 n    No. of Sequences    Probability of Each Sequence    Total Probability
      (N choose n)        p₁ⁿ(1 − p₁)^(N−n)               (N choose n) p₁ⁿ(1 − p₁)^(N−n)
 0          1             2^(−15×0.585)                   0.002284
 1         15             2^(−15×0.652)                   0.017127
 2        105             2^(−15×0.718)                   0.059946
 3        455             2^(−15×0.785)                   0.129883
 4       1365             2^(−15×0.852)                   0.194825
 5       3003             2^(−15×0.918)                   0.214307
 6       5005             2^(−15×0.985)                   0.178589
 7       6435             2^(−15×1.052)                   0.114807
 8       6435             2^(−15×1.118)                   0.057404
 9       5005             2^(−15×1.185)                   0.022324
10       3003             2^(−15×1.252)                   0.006697
11       1365             2^(−15×1.318)                   0.001522
12        455             2^(−15×1.385)                   0.000254
13        105             2^(−15×1.452)                   0.000029
14         15             2^(−15×1.518)                   0.000002
15          1             2^(−15×1.585)                   0.000000


The last conclusion is consistent with the previous two conclusions: if a subset of nearly
equiprobable elements has a total probability close to 1, then the number of elements in
the subset must be approximately the inverse of the probability of each element.

The three conclusions of Example 2 are quite general. In the next section
we will show that the degree of approximation can be arbitrarily improved by
letting N assume larger and larger values. Later we will see that the sequences
that behave in this manner are far more general than the memoryless binary sequences considered in this section.
The practical significance of these results in data compression should be
quite clear. To represent a binary sequence of N digits in our example, we only
need NH(x) = 0.918N bits, which constitutes an 8.2 percent reduction in the
volume of data. Any degree of accuracy can then be achieved by choosing sufficiently large values of N.

2.4 DISCRETE MEMORYLESS SOURCE AND LIKELY SEQUENCES

We now put the conclusions of the previous section on a firm basis. We will
need the weak law of large numbers, which asserts that, if the experiment corresponding to the r.v. x is repeated N independent times with x⁽ʲ⁾ the outcome
of the jth trial, then

P{ |(1/N) Σ_{j=1}^{N} x⁽ʲ⁾ − x̄| ≥ δ } ≤ σ_x²/(Nδ²)

Here x̄ and σ_x are the average and standard deviation of x, respectively, and δ
is an arbitrary positive number. For a proof of the weak law, the reader is referred to Section 1.2 and Problem 1.5.
Suppose a source generates a letter belonging to the alphabet {a₁, . . . , a_K}
every T seconds. The probability of a_k is p(a_k), and each letter in the sequence
is generated independently. Starting at time t = 0, the source will have produced N = t/T symbols at time t. The number of possible sequences generated
in this period is thus K^N, and each sequence S_i has probability of occurrence
P(S_i). P(S_i) is the product of N terms p(a⁽ʲ⁾), where a⁽ʲ⁾ is the jth letter of
the sequence.
We now define a random variable z corresponding to the set of sequences
S_i. The value z_i assigned to S_i is given by

z_i = −log P(S_i) = −Σ_{j=1}^{N} log p(a⁽ʲ⁾)

z is thus the sum of N i.i.d. random variables x with outcomes x_k =
−log p(a_k). The weak law of large numbers asserts that, if δ and ε are arbitrary
positive numbers, there exists N₀ such that for N ≥ N₀

    P{ | (1/N) z - H(x) | ≥ δ } ≤ ε

Notice that

    -\sum_{k=1}^{K} p(a_k) log p(a_k) = H(x)

has been used in the preceding equation. We conclude that, with probability greater than 1 - ε,

    -δ ≤ -(1/N) log P(S_i) - H(x) ≤ δ,    for large N

Using 2 as the base of the logarithm, this yields

    2^{-N[H(x)+δ]} ≤ P(S_i) ≤ 2^{-N[H(x)-δ]}                                    (2.2)

The number of sequences for which Eq. (2.2) is valid can now be bounded. Denoting by ν the number of such sequences, we will have

    (a) P(S_i) ≥ 2^{-N[H(x)+δ]}   and   ν [P(S_i)]_min ≤ 1
    (b) P(S_i) ≤ 2^{-N[H(x)-δ]}   and   ν [P(S_i)]_max > 1 - ε

Therefore,

    (1 - ε) 2^{N[H(x)-δ]} ≤ ν ≤ 2^{N[H(x)+δ]}                                   (2.3)

Equations (2.2) and (2.3) imply that the probability of each likely sequence is about 2^{-NH(x)} and the number of these sequences is about 2^{NH(x)}. Furthermore, since H(x) ≤ log K (see Section 2.2, property (iii) of the entropy function), we conclude that

    ν ≈ 2^{NH(x)} ≤ 2^{N log K} = K^N                                           (2.4)

where K^N is the maximum possible number of sequences S_i. Equality holds iff p(a_k) = 1/K for all k.
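The bound (2.2) can be checked empirically. The sketch below (an illustration with an assumed three-letter source distribution, not from the text) draws random sequences and reports how often -(1/N) log_2 P(S_i) falls within δ of H(x).

```python
import random
from math import log2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}        # assumed source distribution
H = -sum(p * log2(p) for p in probs.values()) # source entropy in bits
N, delta, trials = 1000, 0.05, 2000

letters, weights = zip(*probs.items())
inside = 0
for _ in range(trials):
    seq = random.choices(letters, weights=weights, k=N)
    z_over_N = -sum(log2(probs[a]) for a in seq) / N   # -(1/N) log2 P(S_i)
    if abs(z_over_N - H) <= delta:
        inside += 1

print(f"H(x) = {H:.3f} bits; fraction of sequences with |z/N - H| <= {delta}: {inside/trials:.3f}")
```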

2.5 BLOCK ENCODER WITH FIXED-LENGTH ALPHABET

To encode the output of the source described in the preceding section, consider a binary encoder that can create binary sequences at a rate of β bits per second. After t = NT seconds, the total number of sequences generated by the encoder is 2^{NβT} (βT can be interpreted as the number of bits generated by the encoder per source symbol). We wish to encode the source into binary sequences, that is, to assign a unique binary sequence to each source sequence. We also wish to use a machine with finite (but otherwise unlimited) memory so that, aside from an initial delay, data can be encoded as they are generated by the source. Under these circumstances, if a unique binary word is assigned to each of the likely sequences and the remaining (unlikely) sequences are ignored, an error will be

committed with probability less than ε provided that enough binary sequences are available to cover all the likely source sequences. Thus, if 2^{NβT} ≥ 2^{N[H(x)+δ]}, or equivalently βT > H(x), the probability of error will be less than ε.

    As long as the binary sequences are generated with a rate (per source symbol) greater than the entropy of the source H(x), the probability of error can be made as small as desired.

Conversely, if βT < H(x), say βT = H(x) - 2γ, set δ = γ and choose an arbitrarily small positive number ε. There exists a time t_0 after which, with probability greater than 1 - ε, the source sequence S_i will be a likely sequence. The probability of S_i is bounded by

    P(S_i) ≤ 2^{-N[H(x)-γ]}

The number of available binary sequences, however, is

    2^{NβT} = 2^{N[H(x)-2γ]}

Therefore, the probability of having a code for a likely sequence is at most

    2^{N[H(x)-2γ]} · 2^{-N[H(x)-γ]} = 2^{-Nγ}

We could also have assigned some codes to the unlikely sequences; but since these sequences have probability less than ε, the probability of having a code for a source sequence will, in general, be less than ε + 2^{-Nγ}. For very long sequences, this probability will be extremely small.

    If the binary sequences are generated with a rate (per source symbol) less than the entropy of the source H(x), the probability of error will be as close to 1 as desired, irrespective of the manner in which sequences are encoded.

Note: If βT is larger than H(x) but very close to it, the output of the binary encoder, in the long run, will have about 2^{βt} sequences with equal probabilities ≈ 2^{-NH(x)} ≈ 2^{-βt}. This means that the encoder is working close to its maximum entropy, with P(0) ≈ P(1) ≈ 0.5.

Exercise
The results of this section are not restricted to binary encoders. It is quite straightforward to generalize the results to an encoder with D different symbols in its alphabet. Carry out this generalization.

2.6 BLOCK ENCODER WITH VARIABLE-LENGTH ALPHABET
Consider an encoder with alphabet {x_1, ..., x_D}, where the duration of x_i is T_i. It is assumed that the encoder can generate every possible sequence of x_i's. To


determine the number of different sequences the device can generate in t seconds, we define the function N(t) as

    N(t) = { 0,                                     if t < 0
             1,                                     if 0 ≤ t < T_min = min{T_i}
             no. of possible sequences in [0, t],   if t ≥ T_min

It is not difficult to verify that, for t ≥ T_min,

    N(t) = N(t - T_1) + N(t - T_2) + ... + N(t - T_D)                           (2.5)

Thus N(t) is the solution of a linear difference equation. Before proceeding further, let us consider a simple example.
Example 3
Let D = 2, T_1 = 1, and T_2 = 2. It can readily be verified that Figure 2.4 is the correct representation of N(t) for the first few units of time. The difference equation N(t) = N(t - 1) + N(t - 2) has continuous solutions of the form λ^t, where λ is found by substitution in the equation. From the characteristic equation λ^2 - λ - 1 = 0, the values of λ are found to be (1 ± √5)/2. The general solution is therefore given by

    N(t) = A ((1 + √5)/2)^t + B ((1 - √5)/2)^t

The initial conditions N(1) = 1 and N(2) = 2 are satisfied with A = (√5 + 1)/(2√5) and B = (√5 - 1)/(2√5). Therefore,

    N(t) = (1/√5) { ((√5 + 1)/2)^{t+1} + ((√5 - 1)/2)^{t+1} e^{iπt} }

At noninteger values of t, the preceding function is quite different from the function represented in Figure 2.4. It assumes correct values, however, at integer times t = n. Since our main interest is in the integer values of t, we ignore these differences and take the preceding function to represent N(t).

We can now write

    N(t) = (1/√5) ((√5 + 1)/2)^{t+1} { 1 + ((√5 - 1)/(√5 + 1))^{t+1} e^{iπt} }

which can equivalently be written as

    log N(t) = t log((√5 + 1)/2) + log((√5 + 1)/(2√5)) + log{ 1 + ((√5 - 1)/(√5 + 1))^{t+1} e^{iπt} }

For large t, the last term of this equation is very close to zero and may be ignored, yielding

    β = (1/t) log N(t) ≈ log((√5 + 1)/2) = 0.694    bits/unit time



β is sometimes referred to as the capacity of the encoder. If the entropy of the source is less than β, long blocks of data can be encoded with high accuracy. If the entropy is larger than β, however, the probability of error approaches 1 with increasing length of the sequences.

Returning now to Eq. (2.5), the general equation for N(t), the characteristic equation can be written as

    λ^{-T_1} + λ^{-T_2} + ... + λ^{-T_D} = 1                                    (2.6)

This equation can have both real and complex solutions, but the solution with the largest absolute value, λ_max, is always real and positive (see Problem 2.8). Furthermore, λ_max can be determined graphically, as shown in Figure 2.5.

Figure 2.5  (plot of \sum_{i=1}^{D} λ^{-T_i} versus λ, with λ_max marked on the horizontal axis)


From Example 3 it must be clear that λ_max determines the capacity of the encoder and that this capacity is given by β = log(λ_max). To achieve reliable encoding, the entropy of the source must be less than the capacity of the encoder.
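As a numerical illustration (not part of the text), the following sketch finds λ_max for Eq. (2.6) by bisection and compares β = log_2(λ_max) with the growth rate (1/t) log_2 N(t) computed from the recursion (2.5), using the durations of Example 3.

```python
from math import log2

durations = [1, 2]   # encoder letter durations T_i (Example 3); assumed to be integers here

def char_sum(lam):
    # Left-hand side of Eq. (2.6): sum over i of lam**(-T_i)
    return sum(lam ** (-T) for T in durations)

# char_sum decreases from D (at lam = 1) toward 0, so bisect for the root of char_sum(lam) = 1.
lo, hi = 1.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if char_sum(mid) > 1.0:
        lo = mid
    else:
        hi = mid
lam_max = (lo + hi) / 2.0

# Growth of N(t) via the recursion (2.5), with N(t) = 0 for t < 0 and N(0) = 1.
Tmax, horizon = max(durations), 60
N = [0] * (Tmax + horizon + 1)   # index Tmax + t holds N(t)
N[Tmax] = 1
for t in range(1, horizon + 1):
    N[Tmax + t] = sum(N[Tmax + t - T] for T in durations)

print(f"lambda_max = {lam_max:.6f}")
print(f"beta = log2(lambda_max) = {log2(lam_max):.3f} bits/unit time")
print(f"(1/t) log2 N(t) at t = {horizon}: {log2(N[Tmax + horizon]) / horizon:.3f}")
```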

PROBLEMS
2.1. Let E = {e_1, ..., e_N} be a set of disjoint events, and let p_1, ..., p_N and q_1, ..., q_N be two different probability assignments on E, with Σp_n = Σq_n = 1. Prove that:
    (a) \sum_{n=1}^{N} p_n log(p_n/q_n) ≥ 0
    (b) \sum_{n=1}^{N} (p_n)^2/q_n ≥ 1
2.2. A random variable x assumes nonnegative integer values 0, 1, 2, .... Find the probability assignment p_n that maximizes H(x) subject to the constraint that the mean value of x is a fixed value A; that is,

        \sum_{n=0}^{∞} n p_n = A

    Evaluate the resulting H(x). (Hint: Use Lagrange multipliers. See, for example, F. B. Hildebrand, Advanced Calculus for Applications, 2nd ed., Prentice-Hall, Inc., Englewood Cliffs, N.J., 1976, Chap. 7.)
2.3. Prove that any transfer of probability from one member of a discrete sample space to another that makes their probabilities more nearly equal will increase the entropy.
2.4. Let A = [a_ij] be an N × N doubly stochastic matrix; that is,

        a_ij ≥ 0 for all i, j,    \sum_{i=1}^{N} a_ij = 1 for all j,    and    \sum_{j=1}^{N} a_ij = 1 for all i

    Given a set of probabilities p_1, ..., p_N, we define a new set of probabilities q_1, ..., q_N by

        q_i = \sum_{j=1}^{N} a_ij p_j

    Show that H(q_1, ..., q_N) ≥ H(p_1, ..., p_N) with equality iff q_1, ..., q_N is a rearrangement of p_1, ..., p_N.
2.5. Let x_1, ..., x_n be arbitrary positive numbers, and let a_1, ..., a_n be positive numbers whose sum is unity. Prove that

        x_1^{a_1} x_2^{a_2} ... x_n^{a_n} ≤ \sum_{i=1}^{n} a_i x_i

    with equality if and only if all x_i are equal. Note that the inequality still holds if some of the a_i are allowed to be zero. However, the condition for equality becomes: All x_i corresponding to positive a_i are equal.


2.6. A source produces a sequence of statistically independent binary digits with the probabilities P(1) = 0.005 and P(0) = 0.995. These digits are taken 100 at a time and a binary code word is provided for every sequence of 100 digits containing three or fewer 1's.
    (a) If the code words are all of the same length, find the minimum length required to provide the specified set of code words.
    (b) Find the probability of getting a source sequence for which no code word has been provided.
    (c) Use the Chebyshev inequality to bound the probability of getting a sequence for which no code word has been provided and compare with part (b).
2.7. Let x be a discrete random variable with probabilities p_1 ≤ p_2 ≤ ... ≤ p_N. Show that
    (a) H(x) ≥ \sum_{n=1}^{N} p_n (1 - p_n) ≥ 1 - p_N    (in nats)
    (b) H(x) ≥ H(p_N), where H(p_N) = -p_N log(p_N) - (1 - p_N) log(1 - p_N)
    (c) H(x) ≥ -log(p_N)
    (d) H(x) ≥ 2(1 - p_N)    (in bits)
    (e) Compare the lower bounds obtained for H(x) in terms of p_N in parts (a) and (d). Which bound is stronger and why?
    Hint: Use part (b) when p_N ≥ 0.5 and part (c) when p_N ≤ 0.5 to prove part (d).
2.8. The construction of Figure 2.5 guarantees that a real, positive solution can always be found for Eq. (2.6). Denoting this solution by λ_0, show that if λ is a complex solution of the equation its magnitude must be less than λ_0. (Hint: Represent the complex numbers λ^{-T_i} as vectors in the complex plane and use the fact that a straight line connecting two points in this plane is the shortest path between them.)
2.9. For the following binary code, let N(k) be the number of messages that can be formed using exactly k code characters. For example, N(1) = 1 (i.e., x_1), N(2) = 3 (x_1x_1, x_2, x_3), and N(3) = 5 (x_1x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_3x_1). Find a general expression for N(k) (k = 1, 2, ...).

        x_1 = 0,    x_2 = 10,    x_3 = 11

2.10. (a) Show that the function log x in the region 0 < x < ∞ is convex ∩.
    (b) Using Jensen's inequality, prove that

        H(x) = -\sum_{n=1}^{N} p_n log p_n ≤ log N

2.11. Let f(y) be an arbitrary function defined for y ≥ 1. Let x be a discrete random variable with range {x_1, ..., x_N} and probability distribution {p_1, ..., p_N}. Define the f-entropy of x by

        H_f(x) = \sum_{n=1}^{N} p_n f(1/p_n)

    If f(y) is convex ∩, show that the following inequality is always satisfied:

        H_f(x) ≤ f(N)

2.12. Let

        A = \sum_{n=2}^{∞} 1/(n log_2^2 n)

    and define the random variable x by

        P{x = n} = 1/(A n log_2^2 n),    for n = 2, 3, ..., ∞

    (a) Show that A exists; that is, the sum defining A converges.
    (b) Find the entropy of the random variable x.
2.13. Let P = (p_1, p_2, ...) be a countable probability distribution; that is, p_n ≥ 0 for n = 1, 2, ... and \sum_{n=1}^{∞} p_n = 1.
    (a) Show that if \sum_{n=1}^{∞} p_n log n converges then H(P) is finite.
    (b) Assuming that p_1 ≥ p_2 ≥ ... ≥ p_n ≥ ..., show that the existence of H(P) implies the existence of \sum_{n=1}^{∞} p_n log n.

3
VARIABLE-LENGTH SOURCE CODING

3.1 Unique Decodability and Prefix Condition
3.2 Kraft Inequality
3.3 Significance of the Prefix Condition
3.4 Shannon-Fano Coding
3.5 Optimum Binary Coding: Huffman's Technique
3.6 Geometrical Interpretation of Huffman's Technique
3.7 Generalization of Huffman's Technique
Problems

Introduction  In Chapter 2 we studied the problem of source coding using fixed-length code words for fixed-length blocks of data. The likely (typical) sequences were encoded into equal-length code words and, as long as the encoder was capable of producing letters above a certain rate, the encoding process was apt to go smoothly. There are two problems with this scheme. First, the atypical or unlikely sequences cannot be reproduced from the code, and, second, the encoding/decoding procedures are not practical. The first problem is not a serious one because the probability of occurrence of an atypical sequence is extremely small and, in practice, small failure rates are almost always acceptable. The second problem arises from the fact that a code table must be stored in the memory of the encoder and decoder, relating each block of data to a corresponding code word. Moreover, this table must be searched for each block until a match is found. The required amount of memory and the search times thus grow exponentially with the block length and soon become impractically large.

Variable-length source coding, which is the subject of this chapter, is, in a sense, the answer to the preceding problems. Here again one chooses a fixed-length block of data but assigns a variable-length code to it. In practice, the

""

V al la lJ l'IJ - L. 'lJ I I ~ L I I

\-Jvu, ..... v

_VUI I'~

....." ..... 1-'.

V'

length of the block of data is short, and more often than not this block is composed of only one or two letters from the source alphabet. The code-word lengths are related to the probability of their occurrence and, in general, shorter code words are assigned to more probable blocks of data. The average code-word length will be shown to be bounded from below by the source entropy. Moreover, we will show that this lower bound can be approached in the limit of long sequences.

In Section 3.1 we address the problem of unique decodability. Prefix-free codes will be defined in this section and their properties explored in Sections 3.2 and 3.3. The Shannon-Fano coding scheme is described in Section 3.4, and upper and lower bounds on the average code-word length are established in this section. Finally, in the remainder of the chapter, an optimum encoding scheme known as Huffman coding is described.

3.1 UNIQUE DECODABILITY AND PREFIX CONDITION

Consider an information source with alphabet {a_1, ..., a_K}. The source produces a sequence of independent, identically distributed letters with probability distribution P(a_i) = p_i (1 ≤ i ≤ K). Next, consider an encoder with alphabet {x_1, ..., x_D}, where all x_i have the same duration. Depending on the duration of source letters, there may be a waiting line at the input to the encoder. We assume that the system has infinite memory and thereby ignore the waiting-line problem. The encoder encodes a sequence of source letters by assigning a predetermined sequence of x_i's to each incoming letter. Suppose the code word assigned to a_i is x_{i1} ... x_{i n_i}, where n_i is the length of the code word. The average length of this code is given by

    n̄ = \sum_{i=1}^{K} p_i n_i

From the law of large numbers, we know that for long source sequences the average code-word length will be close to n̄ with high probability. Our goal in this chapter is to provide insight into methods of encoding that minimize n̄.

Obviously, not all codes that can be imagined are useful. The most general requirement is that all finite sequences of code words must be uniquely decodable. In other words,

    For each source sequence of finite length, the sequence of code letters corresponding to that source sequence must be different from the sequence of code letters corresponding to any other source sequence.†

†R. G. Gallager, Information Theory and Reliable Communication. New York: John Wiley & Sons, 1968, p. 45. Copyright 1968 by John Wiley & Sons. Reprinted with permission.


We restrict attention here to the class of all uniquely decodable codes. There is a subset of this class with which we will be mainly concerned. This is the class of prefix-free codes.

    A prefix-free code is a code in which no code word is the prefix of any other code word.

A string of code words from a prefix-free code is uniquely decoded by starting from the first letter and decoding one word at a time. Table 3.1 is a collection of several binary (D = 2) codes for a given source. Code I is not uniquely decodable because, among other things, the code words for a_1 and a_2 are the same. Code II is not uniquely decodable because a_1a_1 is not distinguishable from a_3. Code III is uniquely decodable because it is a prefix-free code. Code IV, although not prefix free, is uniquely decodable. In this code, 0 signals the beginning of a new word and the number of 1's that follow identifies the word uniquely.
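As a quick illustration (not from the text), the following sketch checks the prefix-free condition for the codes of Table 3.1; only Code III passes, while Code IV, though uniquely decodable, does not.

```python
def is_prefix_free(code_words):
    """Return True if no code word is a prefix of another (distinct) code word."""
    words = list(code_words)
    for i, w in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(w):
                return False
    return True

codes = {
    "Code I":   ["0", "0", "1", "10"],
    "Code II":  ["0", "1", "00", "11"],
    "Code III": ["0", "10", "110", "111"],
    "Code IV":  ["0", "01", "011", "0111"],
}
for name, words in codes.items():
    print(name, "prefix-free:", is_prefix_free(words))
```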

3.2 KRAFT INEQUALITY

Let n_1, ..., n_K represent the lengths of code words in a prefix-free code, and let D be the size of the encoder's alphabet. Then

    \sum_{i=1}^{K} D^{-n_i} ≤ 1                                                 (3.1)

Conversely, if integers n_1, ..., n_K satisfy Eq. (3.1), a prefix-free code can be found whose word lengths are given by n_1, ..., n_K.

For a proof of this theorem, consider the tree depicted in Figure 3.1. Starting at the root, the tree branches in D directions to create D nodes. Each
TABLE 3.1

    Source Letter    P(a_i)    Code I    Code II    Code III    Code IV
    a_1              0.5       0         0          0           0
    a_2              0.25      0         1          10          01
    a_3              0.125     1         00         110         011
    a_4              0.125     10        11         111         0111

Source: R. G. Gallager, Information Theory and Reliable Communication, New York: John Wiley & Sons, 1968, p. 45. Copyright 1968 by John Wiley & Sons. Reprinted with permission.

\",o llcqJ .

Figure 3.1  (a D-ary tree: root, level 1, level 2, ...)

node, in turn, creates D new branches and the process continues. The nodes at level i can be associated with code words of length i. Assuming that the word lengths are arranged in increasing order (i.e., n_1 ≤ n_2 ≤ ... ≤ n_K), we assign a node at level n_1 to the first code word. The code being prefix free, no other code word should stem from the node associated with n_1. Therefore, the first word disables a fraction D^{-n_1} of the branches. When the second code word is assigned to a node at level n_2, the total fraction of disabled branches becomes D^{-n_1} + D^{-n_2}. This process can continue as long as a fraction of branches remains. It is therefore necessary that Eq. (3.1) be satisfied in order to have a node associated with each and every code word.

Conversely, if integers n_1 ≤ n_2 ≤ ... ≤ n_K satisfy the Kraft inequality, one can assign nodes to code words beginning with n_1. The Kraft inequality then assures that at each stage of assignment the necessary nodes are available.
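The construction in the proof can be turned into a small program. The sketch below (an illustration, not the author's algorithm) checks the Kraft inequality for a list of lengths and, when it holds, builds a binary prefix-free code by assigning code words in order of increasing length.

```python
def kraft_sum(lengths, D=2):
    return sum(D ** (-n) for n in lengths)

def prefix_free_code(lengths):
    """Assign binary code words with the given lengths (must satisfy Kraft's inequality)."""
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft inequality violated; no prefix-free code exists")
    code = []
    next_value = 0          # next available node, tracked as an integer at the current depth
    current_len = 0
    for n in sorted(lengths):
        next_value <<= (n - current_len)   # descend to depth n
        current_len = n
        code.append(format(next_value, f"0{n}b"))
        next_value += 1                    # disable the subtree rooted at this code word
    return code

print(prefix_free_code([1, 2, 3, 3]))   # e.g. ['0', '10', '110', '111']
```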

3.3 SIGNIFICANCE OF THE PREFIX CONDITION

In Section 3.2 we proved that a prefix-free code satisfies the Kraft inequality. We now take this one step further and prove that any uniquely decodable code satisfies this inequality. This will help us establish the important result that for any uniquely decodable code there exists a prefix-free code with the same word lengths.

Let the word lengths of a uniquely decodable code be n_1 ≤ n_2 ≤ ... ≤ n_K, and let the code have D symbols in its alphabet. With N an arbitrary integer, we can write

    (\sum_{i=1}^{K} D^{-n_i})^N = \sum_{i_1=1}^{K} ... \sum_{i_N=1}^{K} D^{-(n_{i_1} + ... + n_{i_N})}

Now, n_{i_1} + ... + n_{i_N} is the length of a sequence of N code words and can be anywhere between N n_1 and N n_K. Define A_j as the number of sequences of N code words with total length j. Then


    (\sum_{i=1}^{K} D^{-n_i})^N = \sum_{j=N n_1}^{N n_K} A_j D^{-j}

Since the code is uniquely decodable, sequences of N code words and total length j must be distinct. The maximum possible number of sequences of length j is D^j; therefore, A_j ≤ D^j and

    (\sum_{i=1}^{K} D^{-n_i})^N ≤ \sum_{j=N n_1}^{N n_K} D^j D^{-j} = N(n_K - n_1) + 1

If \sum_{i=1}^{K} D^{-n_i} is greater than unity, the left side of the preceding inequality grows exponentially with N, whereas the right side grows linearly. Since N can be arbitrarily large, this cannot happen, and therefore the Kraft inequality must be satisfied.

In conclusion, any uniquely decodable code with word lengths n_1, ..., n_K satisfies the Kraft inequality. In Section 3.2 we learned that if integers n_1, ..., n_K satisfied the Kraft inequality then a prefix-free code with those lengths could be constructed. Hence

    A uniquely decodable code can always be replaced with a prefix-free code with the same word lengths.

3.4 SHANNON-FANO CODING

In this section we establish a lower bound on the average code-word length in terms of the source entropy. We then show that this lower bound can be achieved arbitrarily closely by the Shannon-Fano coding technique.

Theorem 1.  Given a source with alphabet {a_1, ..., a_K} and probabilities p_1, ..., p_K, the average length n̄ of a uniquely decodable code with alphabet size D satisfies the inequality

    n̄ ≥ H(x)/log D                                                              (3.2)

where H(x) is the source entropy and the base of the logarithm is arbitrary.

Proof.

    H(x) - n̄ ln D = -\sum_{i=1}^{K} p_i ln p_i - \sum_{i=1}^{K} p_i n_i ln D = \sum_{i=1}^{K} p_i ln(D^{-n_i}/p_i)
                  ≤ \sum_{i=1}^{K} p_i (D^{-n_i}/p_i - 1) = (\sum_{i=1}^{K} D^{-n_i}) - 1

Since the code is uniquely decodable, the Kraft inequality must be satisfied. Therefore, H(x) - n̄ ln D ≤ 0 and the proof is complete.

Note: The equality n̄ = H(x)/log D is achieved iff p_i = D^{-n_i} for all i; that is, iff -log_D p_i is an integer for all i.

Theorem 2.  Given a source with alphabet {a_1, ..., a_K} and probabilities p_1, ..., p_K, it is possible to construct a prefix-free code from an alphabet of size D such that

    n̄ < H(x)/log D + 1                                                          (3.3)

Proof.  Choose the length n_i of the code word for a_i according to the rule n_i = ⌈-log_D p_i⌉, where ⌈a⌉ is the smallest integer greater than or equal to a. Now

    n_i = ⌈-log_D p_i⌉  ⟹  n_i ≥ -log_D p_i  ⟹  D^{-n_i} ≤ p_i  ⟹  \sum_{i=1}^{K} D^{-n_i} ≤ \sum_{i=1}^{K} p_i = 1

Since the Kraft inequality is satisfied, it is possible to find a prefix-free code with word lengths n_1, ..., n_K. Next observe that

    n_i = ⌈-log_D p_i⌉  ⟹  n_i < -log_D p_i + 1

Therefore,

    \sum_{i=1}^{K} p_i n_i < -\sum_{i=1}^{K} p_i log_D p_i + \sum_{i=1}^{K} p_i

The last inequality is the same as Eq. (3.3) and the proof is therefore complete.

It is possible to encode a source and to achieve the lower bound H(x)/log D on n̄ as closely as desired. This is done by encoding sequences of N source letters at a time according to the procedure described in Theorem 2. In other words, the source is assumed to have a new alphabet of size K^N, each member of which is a sequence of N independently generated letters from {a_1, ..., a_K}. The entropy of the new source is NH(x) (Problem 3.3), and its average code-word length is, by definition, N times the average code-word length for the original source. Applying Theorems 1 and 2 to the new source then yields

    NH(x)/log D ≤ N n̄ < NH(x)/log D + 1

Or, equivalently,

    H(x)/log D ≤ n̄ < H(x)/log D + 1/N

Since N can be arbitrarily large, values of n̄ that are arbitrarily close to H(x)/log D can be achieved.

Note that by encoding long sequences at a time one cannot improve the lower bound on n̄. However, if, due to the particular distribution of probabili-


ties, the lower bound is not automatically achievable, one may resort to the preceding technique to bring the average code-word length closer to its minimum.
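A minimal sketch of the length assignment used in Theorem 2 (illustrative only, with an assumed distribution and D = 2): choose n_i = ⌈-log_2 p_i⌉, verify the Kraft inequality, and compare the average length with the entropy.

```python
from math import ceil, log2

probs = [0.5, 0.25, 0.125, 0.125]            # assumed source distribution
lengths = [ceil(-log2(p)) for p in probs]    # Shannon-Fano lengths n_i = ceil(-log2 p_i)

H = -sum(p * log2(p) for p in probs)
avg = sum(p * n for p, n in zip(probs, lengths))
kraft = sum(2 ** (-n) for n in lengths)

print("lengths:", lengths)                   # [1, 2, 3, 3] for this distribution
print(f"Kraft sum = {kraft:.3f} (<= 1), H(x) = {H:.3f}, average length = {avg:.3f}")
```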

3.5 OPTIMUM BINARY CODING: HUFFMAN'S TECHNIQUE

Consider a source S with alphabet {a_1, ..., a_K} and probability distribution p_1, ..., p_K. We wish to assign a binary sequence X_i to each source letter a_i to create a uniquely decodable binary code (D = 2) with minimum average length n̄. Such a code exists since the choice can be confined to a finite number of possibilities. Consequently, a prefix-free code with minimum average length exists.

Let us assume that p_1 ≥ p_2 ≥ ... ≥ p_K. Let the word lengths of the optimum prefix-free code be n_1, ..., n_K. If n_i > n_j for some i < j, exchange the corresponding code words. The resulting improvement in the average code-word length will be

    Δn̄ = p_i n_j + p_j n_i - p_i n_i - p_j n_j = (p_j - p_i)(n_i - n_j)

which is less than or equal to zero. But the original code is optimum, which means that its average length cannot be improved by the exchange of two words; therefore, Δn̄ = 0. The new code obtained by the exchange is thus optimum, and it is readily observed that an optimum prefix-free code exists for which n_1 ≤ n_2 ≤ ... ≤ n_K. For this code we prove the following two theorems.

Theorem 1.  n_K = n_{K-1}, and the code words X_K and X_{K-1} corresponding to a_K and a_{K-1} differ in only the last digit.

Proof.  If n_{K-1} < n_K, then the last digit of X_K can be dropped without violating the prefix condition. This results in a new code with smaller n̄, which is contrary to the assumption that the code under consideration is optimum. Therefore, n_{K-1} = n_K.

If X_K and X_{K-1} are identical except for the last digit, the theorem is proved. If not, there may be another code word with length n_K (such as X_{K-2}, X_{K-3}, ...) that differs from X_K in only the last digit. This code word can be exchanged with X_{K-1} and the resulting code will satisfy the theorem. If such a code word does not exist, the last digit of X_K can be dropped without violating the prefix condition. This, however, results in a smaller n̄, which contradicts the assumption. Therefore, with proper rearrangement, X_K and X_{K-1} differ in only the last digit.

Theorem 2.  Consider a new source S' with alphabet {a'_1, ..., a'_{K-1}} and probability distribution p'_1, ..., p'_{K-1}, where p'_i = p_i for 1 ≤ i ≤ K - 2 and p'_{K-1} = p_{K-1} + p_K. If {X'_1, ..., X'_{K-1}} is an optimum prefix-free code for S', then the code obtained according to the following rule is optimum for S:

    X_i = X'_i,            1 ≤ i ≤ K - 2
    X_{K-1} = X'_{K-1} 0
    X_K = X'_{K-1} 1

Proof.  Since n_K = n_{K-1} = 1 + n'_{K-1}, we have

    n̄ = p_1 n_1 + ... + p_K n_K = p'_1 n'_1 + ... + p'_{K-2} n'_{K-2} + (p_{K-1} + p_K)(1 + n'_{K-1})
      = n̄' + (p_{K-1} + p_K)

That is, the difference between n̄ and n̄' is a constant. If the optimum code for S is better than the preceding code, the code derived from it for S' must be better than optimum, which is impossible. Therefore, the code obtained for S according to the preceding rule is optimum.
Theorems 1 and 2 reduce the problem of finding an optimum code for a source with K letters to the problem of finding an optimum code for a new source with K - 1 letters. The procedure, however, can be repeated until the new source has only two letters, in which case the optimum code is trivial. The procedure of binary Huffman coding is summarized in the following steps.

1. Arrange the source letters in decreasing order of probabilities (i.e., p_1 ≥ p_2 ≥ ... ≥ p_K).
2. Assign 0 to the last digit of X_K and 1 to the last digit of X_{K-1} (or vice versa).
3. Combine p_K and p_{K-1} to form a new set of probabilities p_1, ..., p_{K-2}, (p_{K-1} + p_K).
4. Repeat all the steps for this new set.

Example 1
The Huffman coding procedure for a source with a five-letter alphabet is shown in Table 3.2. The optimum code is shown in the rightmost column and its average length is given by

    n̄ = \sum_{i=1}^{5} p_i n_i = 2.2 bits/source letter

TABLE 3.2

    Source Letter    P(a_i)    Code Word
    a_1              0.3       10
    a_2              0.25      01
    a_3              0.25      11
    a_4              0.1       001
    a_5              0.1       000

(The intermediate columns P(a'_i), P(a''_i), ... of the original table show the successive combination of the two smallest probabilities: 0.1 + 0.1 = 0.2, then 0.2 + 0.25 = 0.45, and finally 0.25 + 0.3 = 0.55.)


Notice that H(x) = -\sum_{i=1}^{5} p_i log p_i = 2.1855 bits/symbol is less than n̄; therefore, Eq. (3.2) is satisfied.

If we choose n_i = ⌈-log p_i⌉ according to the Shannon-Fano scheme, we will have n_1 = n_2 = n_3 = 2, n_4 = n_5 = 4, and n̄ = 2.4, which satisfies Eq. (3.3) but is larger than the average code-word length for the Huffman code.
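For reference, a compact Huffman implementation in the spirit of the four steps above (a sketch, not the author's program); applied to the probabilities of Example 1 it reproduces an average length of 2.2 bits, although the specific code words may differ from those in Table 3.2.

```python
import heapq

def huffman_code(probs):
    """Return a dict letter -> binary code word for the given probability dict."""
    heap = [(p, i, [a]) for i, (a, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {a: "" for a in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # two smallest probabilities
        p2, _, group2 = heapq.heappop(heap)
        for a in group1:                      # prepend one more digit to every letter in each group
            code[a] = "0" + code[a]
        for a in group2:
            code[a] = "1" + code[a]
        heapq.heappush(heap, (p1 + p2, counter, group1 + group2))
        counter += 1
    return code

probs = {"a1": 0.3, "a2": 0.25, "a3": 0.25, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
avg = sum(probs[a] * len(w) for a, w in code.items())
print(code, f"average length = {avg:.2f} bits")
```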

3.6 GEOMETRICAL INTERPRETATION OF HUFFMAN'S TECHNIQUE

The concepts of Huffman coding can be easily understood by analyzing a code tree. Consider the tree in Figure 3.2, in which the nodes corresponding to a code word are identified with the probability p_i associated with that word. Since prefix-free codes are being considered, none of the nodes that represent a code word should branch to higher levels. Let us explore the conditions under which the selected nodes would represent an optimum code for the probability distribution p_1, ..., p_6. To clarify the situation by way of examples, we ask the following questions.

1. Can p_3 be larger than p_1? The answer is no. If p_3 > p_1, then n̄ can be reduced by exchanging the corresponding code words. Therefore, longer code words have a probability less than or equal to that of shorter words.

2. Can a free node exist in an intermediate level (levels 1, 2, and 3 in Figure 3.2)? The answer is no. The free node in level 3, for example, can be associated with p_4, p_5, or p_6, resulting in a smaller n̄ without violating the prefix condition.

3. Can a free node exist in the final level (level 4 in Figure 3.2)? The answer is no. The branching in level 3 that creates the node associated with p_6 and the free node is unnecessary and in fact harmful. p_6 should have been assigned to its parent node in level 3 to give a smaller n̄ without violating the prefix condition.
Figure 3.2  (code tree with code words at levels 1 through 4; p_4 and p_5 occupy level 4 under a common parent)

"..

v ari able-Len gt h sour ce Cod ing

Chap. 3

From the foregoing discussion we conclude that the final level contains the two least-probable code words. These code words have the same parent node and therefore differ in only the last digit. To complete our geometrical interpretation of Huffman coding, we need to know the effect of rearrangement of probabilities on n̄. This is the subject of our final question.

4. Can nodes at a given level be exchanged? The answer is yes. The code words associated with the nodes at a given level have the same lengths and, as a result, can be exchanged without increasing n̄ or violating the prefix condition. For instance, at level 3 in Figure 3.2 we can exchange the node associated with p_3 and the node that is parent to p_4 and p_5. This kind of rearrangement is essential to Huffman coding since at every stage of the coding process we must be able to shuffle the probabilities until the two smallest ones have the same parent node. Also notice that, once the two smallest probabilities are brought together under the same parent, any reshuffling at a lower level does not destroy this arrangement. For example, any rearrangement at level 1, 2, or 3 cannot assign p_4 and p_5 to different parents.

3.7 GENERALIZATION OF HUFFMAN'S TECHNIQUE

We are now in a position to generalize Huffman's technique to nonbinary situations where the encoder alphabet size D is larger than 2. Visualizing an optimum code tree with D branches stemming from each node, we ask the same four questions as in the previous section. The answers to questions 1, 2, and 4 remain the same. The answer to question 3, however, must be slightly modified since free nodes in the final level can now exist. As long as there is more than one final-level code word associated with a parent node, that parent node can have other offspring with no code words. The situation for D = 3, for example, is shown in Figure 3.3. The free node in the final level cannot be eliminated by assigning p_5 or p_6 to their parent node since the other probability then remains unassigned.

By exchanging nodes in the final level, all the free nodes should be brought together under the same parent. We then assign zero probabilities to the free nodes and proceed with the general Huffman coding procedure, that is,
Figure 3.3


combining the D smallest probabilities and continuing with a set that has D - 1 fewer elements.

With a complete tree, the set of probabilities is reduced by D - 1 elements at each stage until D elements remain. Hence, the initial number of elements in the set must be D + n(D - 1), where n is an arbitrary integer. Zeros must therefore be added to the set to bring the total number of elements to D + n(D - 1).

Example 2
If D = 3, then D + n(D - 1) = 3 + 2n, which can be any odd integer. A zero is therefore added to the set of probabilities if the number of elements is even. (If the number is already odd, nothing needs to be done.) The procedure of ternary Huffman coding for a set with six letters is shown in Table 3.3.

TABLE 3.3

    Source Letter    P(a_i)    Code Word
    a_1              0.40      1
    a_2              0.30      0
    a_3              0.20      20
    a_4              0.05      21
    a_5              0.03      220
    a_6              0.02      221

(A zero-probability letter is added to complete the ternary tree; the intermediate columns of the original table show the successive combination of the three smallest probabilities: 0.02 + 0.03 + 0.00 = 0.05, then 0.20 + 0.05 + 0.05 = 0.30.)
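A sketch of the D-ary variant (illustrative, not the author's code): pad the probability list with zeros until its size has the form D + n(D - 1), then repeatedly merge the D smallest probabilities. Applied to the probabilities of Example 2 it yields the code-word lengths of Table 3.3 and an average length of 1.35.

```python
import heapq

def huffman_lengths_dary(probs, D=3):
    """Code-word lengths of a D-ary Huffman code; zero-probability letters are padded in
    so that the number of elements has the form D + n*(D-1)."""
    padded = list(probs)
    while (len(padded) - D) % (D - 1) != 0:
        padded.append(0.0)
    heap = [(p, i, [i]) for i, p in enumerate(padded)]
    heapq.heapify(heap)
    lengths = [0] * len(padded)
    counter = len(padded)
    while len(heap) > 1:
        total, members = 0.0, []
        for _ in range(D):                  # merge the D smallest probabilities
            p, _, ids = heapq.heappop(heap)
            total += p
            members += ids
            for i in ids:
                lengths[i] += 1             # each member moves one level deeper in the tree
        heapq.heappush(heap, (total, counter, members))
        counter += 1
    return lengths

p = [0.4, 0.3, 0.2, 0.05, 0.03, 0.02]       # probabilities of Example 2 / Table 3.3
lengths = huffman_lengths_dary(p, D=3)[:len(p)]
print(lengths, "average length =", sum(pi * n for pi, n in zip(p, lengths)))
```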

PROBLEMS
3.1. A code is not uniquely decodable if there exists a finite sequence of code letters that can be resolved in two different ways into sequences of code words. That is, a situation such as

        | A_1 | A_2 | A_3 |  ...  | A_m |
        | B_1 | B_2 | B_3 |   ...   | B_n |

    must occur, where each A_i and each B_i is a code word and both rows spell out the same string of code letters. Note that B_1 must be a prefix of A_1 with some resulting dangling suffix. Each dangling suffix in this sequence in turn must either be a prefix of a code word or have a code word as prefix, resulting in another dangling suffix. Finally, the last dangling suffix in the sequence must itself be a code word. Thus one can set up a test for unique decodability in the following way. Construct a set S of all possible dangling suffixes. The code is uniquely decodable if and only if S contains no code word.
    (a) State the precise rules for building the set S.
    (b) Suppose the code word lengths are n_1, ..., n_K. Find a good upper bound on the number of elements in the set S.
    (c) Determine which of the following codes are uniquely decodable.
        {0, 10, 11}
        {00, 01, 10, 11}
        {0, 01, 10}
        {110, 11, 10}
        {0, 01, 11}
        {110, 11, 100, 00, 10}
        {0, 01}
    (d) For each uniquely decodable code in part (c), construct, if possible, an infinitely long encoded sequence with a known starting point such that it can be resolved into code words in two different ways. (Unique decodability does not always imply finite decodability.) Can this situation arise for a prefix-free code?
3.2. A DMS is encoded by a prefix-free ternary code and the average code-word length is found to be n̄ = H_3(x) (the subscript 3 indicates the base of the logarithm). Can the source have an even number of letters in its alphabet?
3.3. Let a source S_1 have alphabet {a_1, ..., a_K} and probability distribution p_1, ..., p_K. Construct a new source S_N with alphabet size K^N where each member of the new alphabet is a sequence of N independent, identically distributed letters from S_1. Prove that the entropy of S_N is N times the entropy of S_1. [Hint: Use property (iv) of the function H(x) in Section 2.2.]
3.4. Consider the following method of constructing a binary code for a source S with alphabet {a_1, ..., a_K} and probability distribution p_1 ≥ p_2 ≥ ... ≥ p_K. Define

        q_i = \sum_{j=1}^{i-1} p_j,    for i > 1

    The code word assigned to a_i is formed by finding the binary expansion of q_i (i.e., 1/2 → .100..., 1/4 → .0100..., 5/8 → .10100...) and then truncating this expansion to the first n_i digits, where n_i = ⌈-log_2 p_i⌉.
    (a) Construct the binary code for an eight-letter source with the probabilities 1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16.
    (b) Prove that the method yields in all cases a prefix-free code whose average length n̄ satisfies the inequality H(x) ≤ n̄ < H(x) + 1.
3.5. Use the concept of code tree in solving this problem: There are N coins of which N - 1 are identical and one may have the same weight as the others, may be heavier, or may be lighter. A balance and a standard coin (same as the N - 1 identical coins) are also available. Each use of the balance can establish a relationship between two groups of coins as to whether one group is equal to, heavier than, or lighter than the other. The goal is to identify the odd coin, if any, and to determine its nature (heavy or light) after n uses of the balance. What is the maximum value of N for a given n? Could you solve the problem without the standard coin?
3.6. For each of the following discrete memoryless sources, construct a binary and a ternary Huffman code and find the corresponding n̄ in each case.
    (a) A source with a six-letter alphabet and probabilities 0.33, 0.23, 0.12, 0.12, 0.1, 0.1.
    (b) A source with a seven-letter alphabet and probabilities 0.35, 0.2, 0.15, 0.15, 0.1, 0.03, 0.02.
3.7. For the source in Problem 3.6(b), construct two different binary Huffman codes with the same (minimum) average length but different variances. Which code is preferable in practice and why?


3.8. (a) Give an example in which a source with equiprobable letters is encoded into an optimum (Huffman) code with unequal code-word lengths.
    (b) Give an example in which a source with unequiprobable letters is encoded into an optimum (Huffman) code with equal code-word lengths.
3.9. Consider a discrete memoryless source with an alphabet of seven letters. The probabilities are 0.3, 0.25, 0.15, 0.1, 0.1, 0.05, 0.05. Find a prefix-free code of minimum average length under the constraint that the first letter of each word must be 0 or 1 and each successive code letter can be 0, 1, or 2. Find a general rule for constructing prefix-free codes of minimum average length with this constraint and outline why it works.
3.10. A source with K equiprobable letters is encoded into a binary Huffman code. Let K = α · 2^J, where J is an integer and 1 ≤ α < 2.
    (a) Are there any code words with length not equal to either J or J + 1?
    (b) In terms of α and J, how many code words have length J?
    (c) What is the average code-word length?
3.11. (a) A source has five letters with the probabilities 0.3, 0.2, 0.2, 0.15, 0.15. These letters are to be coded into binary digits and transmitted over a channel. It takes 1 second to transmit a 0 and 3 seconds to transmit a 1. Using cut-and-try techniques, find a prefix-free code that minimizes the average transmission time and calculate this minimum average time.
    (b) Any such code can be represented by a tree in which the length of a branch is proportional to the time required to transmit the associated digit. Show that, for a code to minimize the average transmission time, the probabilities associated with intermediate and terminal nodes must be nonincreasing with length.
3.12. Run-Length Coding. A source produces a sequence of independent binary digits with the probabilities P(0) = 0.9, P(1) = 0.1. We shall encode this sequence in two stages, first counting the number of zeros between successive ones in the source output and then encoding their run lengths into binary code words. The first stage of encoding maps source sequences into intermediate digits by the following rule:

        Source Sequence      Intermediate Digit (no. of zeros)
        1                    0
        01                   1
        001                  2
        0001                 3
        ...                  ...
        00000001             7
        00000000             8

    Thus the source sequence

        1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1

    is encoded into the intermediate digits

        0, 2, 8, 2, 0, 4
    The final stage of encoding assigns a code word of one binary digit to the intermediate integer 8 and code words of four binary digits to the other intermediate integers.
    (a) Prove that the overall code is uniquely decodable.
    (b) Find the average number n̄_1 of source digits per intermediate digit.
    (c) Find the average number n̄_2 of encoded binary digits per intermediate digit.
    (d) Show, by appeal to the law of large numbers, that for a very long sequence of source digits the ratio of the number of encoded binary digits to the number of source digits will, with high probability, be close to n̄_2/n̄_1. Compare this ratio to the average number of code letters per source letter for a Huffman code encoding four source digits at a time.
3.13. A discrete memoryless source has a six-letter alphabet with probabilities {0.3, 0.2, 0.15, 0.15, 0.15, 0.05}.
    (a) Construct an optimum (Huffman) binary code for this source. What is the average code-word length n̄?
    (b) A code is said to satisfy the parity check condition if all its code words have an even (or odd) number of 1's. Does the code in part (a) satisfy the parity check condition? If not, can you construct another Huffman code that does so?
3.14. Consider a discrete memoryless source S with alphabet {a_1, ..., a_K} and probability distribution {p_1, ..., p_K}. A new source S' is formed by adding a zero-probability letter to the alphabet of S, forming the alphabet {a_1, ..., a_{K+1}} with probability distribution {p_1, ..., p_K, 0}. Each source is then encoded into binary digits (D = 2) using Huffman's technique. Is it true, in general, that the average code-word lengths of S and S' are the same? If not, which code is better and why?
3.15. A pair of independent, honest dice are rolled and the sum of the outcomes is denoted u. Obviously, u is a member of the set S = {2, 3, 4, ..., 11, 12}. A student is asked to determine u after asking questions that can be answered Yes or No. If he asked, Is it 2?, Is it 3?, and so on, he would average a little under six questions. It is possible to do better, however.
    (a) Using the concepts of code tree and Huffman coding, design an optimum strategy for asking questions.
    (b) What is the minimum average number of questions?
3.16. (a) Consider a DMS, S_1, with probability distribution {p_1, ..., p_K}. Assuming that p_1 ≥ p_2 ≥ ... ≥ p_K, we divide p_K in two equal halves and form a new source S_2 with probability distribution {p_1, ..., p_{K-1}, 0.5p_K, 0.5p_K}. If n̄_opt represents the average code-word length for the binary Huffman code, prove that the redundancy R = n̄_opt - H(x) is the same for both sources S_1 and S_2.
    (b) Show that for a source with the two-letter alphabet {a_1, a_2} it is possible to find a probability distribution {p, 1 - p} such that any amount of redundancy in the range 0 < R < 1 is achieved.
    (c) Combining the conclusions in parts (a) and (b), show that it is always possible to find a probability distribution {p_1, ..., p_K} for a source with a K-letter alphabet such that any amount of redundancy in the range 0 < R < 1 is achieved.



3.17. A discrete memoryless source S_1 has alphabet A = {a_1, a_2, a_3, a_4, a_5} and a certain probability distribution. The following code C_1 is suggested for this source:

        C_1 = {1, 000, 001, 010, 011}

    A second source, S_2, has alphabet B = {b_1, b_2, b_3, b_4, b_5} and a probability distribution different from that of S_1. Two alternative codes C_2 and C'_2 are suggested for S_2, as follows:

        C_2 = {1, 01, 001, 0001, 0000}
        C'_2 = {1, 10, 100, 1000, 0000}

    Notice that C_2 and C'_2 have identical code-word lengths.
    (a) Prove that all three codes C_1, C_2, and C'_2 are uniquely decodable. Which ones are prefix free?
    (b) The sources are multiplexed as follows:

        Figure P3.17

    Here b belongs to B and a belongs to A. The initial position of the switch is always fixed. (You can assume that the first letter of the multiplexed sequence is always b.) If the multiplexed sequence bababa... is encoded using C_1 for a and C_2 for b, is the resulting code uniquely decodable? What if C'_2 is used for b?

4
UNIVERSAL SOURCE CODING

4.1 Discrete Memoryless Source with Unknown Statistics


4.2 Frequency Vectors and Quasi-entropy
4.3 A Universal Encoding Scheme for the DMS
Problems

Introduction  In this chapter we introduce the concept of universal coding. Unlike much of information theory, this is an area that was not pioneered by Shannon himself and represents a fundamentally different approach to source coding. The ideas presented here are due to B. M. Fitingof [9, 10].

Shannon's theory is based on the knowledge of probability distribution functions, without which optimum encoding is impossible. Universal coding, on the other hand, utilizes the structure of the sequences and arrives at the same optimum answer. In situations where the probability distribution functions are not available or the source statistics are time dependent, the universal coding techniques are clearly the right choice.

The material covered in this chapter should serve as an introductory exposition of the idea of universal coding. A growing number of papers are now published in this area each year, and universal coding constitutes an active field of research in information theory.

4.1 DISCRETE MEMORYLESS SOURCE WITH UNKNOWN STATISTICS

Consider a source with alphabet {a_1, ..., a_K} and probability distribution {p_1, ..., p_K}, producing a sequence of independent, identically distributed letters. We assume that the probability distribution {p_1, ..., p_K} is fixed but unknown to the encoder. All the techniques we have seen so far require the


knowledge of the probability distribution in order to achieve optimum or near-optimum encoding. For example, in the block-encoding scheme of Chapter 2, the likely sequences are classified according to their probability of occurrence, and in the Shannon-Fano technique of Chapter 3, the code-word length n_i associated with a given letter (or sequence of letters) is ⌈-log p_i⌉, where p_i is the probability of the letter (or sequence). In practice, however, it is often the case that the probability distribution is not known, or, at best, only an approximation to it is available. It may also happen that the source statistics vary slowly so that for a period of time that is long compared to the block length the statistics are fixed, but change from one period to another. In such situations, a probabilistic coding scheme that is efficient for some time frames will not be so for others.

To appreciate the effect of inexact knowledge of the source statistics on the coding efficiency, consider a binary DMS with probability distribution P(0) = p, P(1) = 1 - p. If the encoder's approximation to p is denoted by p_0, then Shannon-Fano coding of long blocks effectively assigns a code word of length -log p_0 to 0 and a code word of length -log(1 - p_0) to 1. The average code-word length will then be

    n̄ = -p log p_0 - (1 - p) log(1 - p_0)

A plot of n̄ versus p for a fixed p_0 is shown in Figure 4.1. For comparison with the optimum in the case where p is known, a plot of n̄_opt = -p log p - (1 - p) log(1 - p) is also shown in the figure. Note that n̄ is tangent to n̄_opt at p = p_0, but as p deviates from p_0, the difference between the two curves increases rather sharply.
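The gap between the two curves in Figure 4.1 is just the mismatch penalty. A small sketch (illustrative, with an assumed p_0) computes n̄ and n̄_opt for several values of p:

```python
from math import log2

def avg_length(p, p0):
    """Average code-word length when the encoder designs for p0 but the source has P(0) = p."""
    return -p * log2(p0) - (1 - p) * log2(1 - p0)

p0 = 0.3                                   # encoder's (possibly wrong) estimate of P(0)
for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    n_bar = avg_length(p, p0)
    n_opt = avg_length(p, p)               # entropy H(p): the optimum when p is known
    print(f"p={p:.1f}  n_bar={n_bar:.3f}  n_opt={n_opt:.3f}  penalty={n_bar - n_opt:.3f}")
```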
In this chapter we develop the basic ideas of universal coding, a scheme that relies not on the probability of the sequences but on their structure. We will show that for any positive number ε it is possible to encode a source such that n̄ is less than or equal to H(x) + ε for all possible source statistics {p_1, ..., p_K}. The counterpart of Figure 4.1 for the universal scheme is shown in Figure 4.2. The distance ε between the upper bound to n̄ and n̄_opt can be made as small as desired by choosing a large enough block length.

Figure 4.1  (n̄ and n̄_opt versus p for a fixed p_0)


Figure 4.2  (upper bound of n̄ versus p for the universal scheme)

4.2 FREQUENCY VECTORS AND QUASI-ENTROPY

Consider a source sequence S_i of length N. Since there are K letters in the source alphabet and individual letters are generated independently, there are K^N sequences of length N and, therefore, the index i of S_i ranges from 1 to K^N. We denote the number of appearances of a_k in the sequence S_i by N_ki and define the frequency of a_k in S_i by q_ki = N_ki/N. The vector Q(S_i) = (q_1i, ..., q_Ki) is the frequency vector associated with S_i. It is obvious that the q_ki's are nonnegative and their sum over k is equal to 1.

Lemma 1.  The expected value of q_k is the probability p_k of a_k:

    E(q_k) = \sum_{i=1}^{K^N} P(S_i) q_ki = p_k

Proof.  Define the random variable x_k^{(n)} to be equal to 1/N if the source outputs the letter a_k as the nth letter of the sequence, and equal to zero otherwise. Since the source is memoryless, the sequence x_k^{(1)}, ..., x_k^{(N)} is independent and identically distributed. The expected value of x_k^{(n)} is p_k/N for all n. Now q_k = \sum_{n=1}^{N} x_k^{(n)}; therefore

    E(q_k) = \sum_{n=1}^{N} E(x_k^{(n)}) = p_k

and the proof is complete.

Each sequence S_i is associated with a frequency vector Q_i, but the relationship is not one to one; in other words, a given vector Q = (q_1, ..., q_K) may correspond to more than one sequence. Let w(Q) denote the number of sequences that have the same frequency vector Q (i.e., sequences in which the number of occurrences of a_k is fixed at N_k = Nq_k for k = 1, ..., K). It is not difficult to show that

    w(Q) = N! / (N_1! N_2! ... N_K!)

Next we inquire about the total number of different frequency vectors that represent source sequences of length N. The number of such vectors is a function of K and N and will be denoted by φ(K, N). Lemma 2 gives an explicit formula for this function.

Lemma 2.

    φ(K, N) = \binom{N+K-1}{K-1}

Proof.  Consider a row of N + K - 1 empty slots. If N identical objects are used to fill N of these slots, K - 1 of them remain empty. The empty slots thus divide the N objects into K groups. If the kth group contains N_k objects, we identify it by q_k = N_k/N. A given distribution of objects therefore corresponds to a vector Q = (q_1, ..., q_K). There is a one-to-one correspondence between the set of possible distributions of the objects and the set of possible vectors Q, and, since the objects can be distributed in \binom{N+K-1}{K-1} different ways, the lemma is proved.
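Both counting formulas are easy to check numerically; the sketch below (illustrative, using K = 2 and N = 7 as in Table 4.1 later in this chapter) verifies that the class sizes w(Q) sum to K^N and that the number of frequency vectors equals φ(K, N).

```python
from itertools import product
from math import comb, factorial, prod
from collections import Counter

K, N = 2, 7                      # binary alphabet, blocks of length 7
alphabet = list(range(K))

# Group all K**N sequences of length N by their composition (N_1, ..., N_K).
classes = Counter(tuple(seq.count(a) for a in alphabet) for seq in product(alphabet, repeat=N))

# Lemma 2: the number of distinct frequency vectors is C(N+K-1, K-1).
assert len(classes) == comb(N + K - 1, K - 1)

# w(Q) = N! / (N_1! ... N_K!) for every class, and the class sizes sum to K^N.
for counts, size in sorted(classes.items()):
    assert size == factorial(N) // prod(factorial(n) for n in counts)
print("number of frequency classes:", len(classes), " total sequences:", sum(classes.values()))
```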

The frequency vector Q(S_i) = (q_1i, ..., q_Ki) is now used to define the quasi-entropy Ψ of the sequence S_i. By definition,

    Ψ(S_i) = -\sum_{k=1}^{K} q_ki log q_ki

Realizing that Q_i = (q_1i, ..., q_Ki) can in fact be a probability vector, one can say that the quasi-entropy Ψ(Q_i)† has all the properties of the entropy function H(Q_i) that depend solely on Q_i. For example, 0 ≤ Ψ(Q_i) ≤ log K. Notice, however, that the entropy is a quantity defined for the source and depends on the probability distribution of letters, whereas the quasi-entropy is defined for each source sequence separately and for a given sequence is independent of the probability distribution. The following theorem establishes the connection between the quasi-entropy Ψ(q_1, ..., q_K) and the source entropy H(p_1, ..., p_K).

Theorem.  E(Ψ(Q)) ≤ H(p_1, ..., p_K).

Proof.

    E(Ψ(Q)) = E(-\sum_{k} q_k log q_k) = \sum_{k} E(-q_k log q_k)

The function -x log x is convex ∩ and therefore Jensen's inequality applies. That is,

    E(-q_k log q_k) ≤ -E(q_k) log E(q_k)

†Since Ψ(S_i) is only a function of Q(S_i) or Q_i, we use the notations Ψ(S_i) and Ψ(Q_i) interchangeably.


From Lemma 1, we have E(q_k) = p_k. Therefore,

    E(Ψ(Q)) ≤ -\sum_{k=1}^{K} p_k log p_k

and the proof is complete.

The concepts developed here will now be used to describe a universal coding scheme and prove its optimality.

4.3 A UNIVERSAL ENCODING SCHEME FOR THE DMS

We are now in a position to outline an optimal encoding scheme for the source sequences based on their structure without requiring an a priori knowledge of the probabilities involved. The code for a given sequence S_i is composed of two parts. The first part identifies the frequency vector Q_i associated with S_i. The second part identifies S_i among all the sequences that have the same Q_i. Since the total number of different frequency vectors is φ(K, N), the number of bits required for the first part is ⌈log φ(K, N)⌉. Similarly, the number of bits required for the second part is ⌈log w(Q_i)⌉. The base of the logarithm is 2 for a binary encoder and D for an encoder with alphabet size D. To prove the optimality of the coding scheme, we need the following lemma.

Lemma 3.

    n! / (n_1! n_2! ... n_J!) ≤ n^n / (n_1^{n_1} n_2^{n_2} ... n_J^{n_J}),    where n_j ≥ 0 and \sum_{j=1}^{J} n_j = n

Proof.  For J = 1, the inequality is easily verified. For J ≥ 2, we use Stirling's formula to obtain

    n! / (n_1! ... n_J!) ≤ [ e^{1-(7J/8)} (n / \prod_{j=1}^{J} n_j)^{1/2} ] · n^n / (n_1^{n_1} ... n_J^{n_J})

The proof is complete if it can be shown that the bracketed term is less than or equal to 1. Since \sum_{j=1}^{J} n_j = n, at least one of the n_j's is greater than or equal to n/J. Consequently, \prod_{j=1}^{J} n_j ≥ n/J and

    e^{1-(7J/8)} (n / \prod_{j=1}^{J} n_j)^{1/2} ≤ √J · e^{1-(7J/8)}

It is a trivial matter to prove that the right side of the preceding inequality is less than 1 for J ≥ 2, and the proof of the lemma is thus complete.

Now let L(S_i) represent the length of the code word assigned to S_i. From the previous discussion we have

    L(S_i) < log φ(K, N) + log w(Q_i) + 2


But, by Lemma 3,

    log φ(K, N) = log [(N + K - 1)! / (N!(K - 1)!)]
                ≤ log [(N + K - 1)^{N+K-1} / (N^N (K - 1)^{K-1})]
                = N log(1 + (K - 1)/N) + (K - 1) log(1 + N/(K - 1))

Similarly, applying Lemma 3 to w(Q_i) = N!/(N_1! ... N_K!) gives

    log w(Q_i) ≤ N Ψ(S_i)

As a result,

    L(S_i) < N Ψ(S_i) + (K - 1) log(1 + N/(K - 1)) + N log(1 + (K - 1)/N) + 2

The expected value of L(S_i) now satisfies

    E(L(S_i)) < N H(p_1, ..., p_K) + (K - 1) log(1 + N/(K - 1)) + N log(1 + (K - 1)/N) + 2

The average code-word length per source letter is, by definition, n̄ = E(L(S_i))/N. Therefore,

    n̄ < H(p_1, ..., p_K) + [ ((K - 1)/N) log(1 + N/(K - 1)) + log(1 + (K - 1)/N) + 2/N ]

The bracketed term in this inequality approaches zero as N → ∞. Thus, for a given ε, it is possible to find N_0 such that, for N > N_0,

    n̄ < H(p_1, ..., p_K) + ε

This completes the proof of universality of the encoding algorithm. Notice that ε approaches zero as log N/N. This is slow compared to the probabilistic encoding schemes, where ε approaches zero as 1/N. This is the price one generally has to pay for a universal scheme.

Table 4.1 describes the universal encoding of a binary source in blocks of length 7. There are φ(2, 7) = 8 frequency classes and thus 3 bits are used to


TABLE 4.1

    Q_i            Ψ(S_i) (bits)    w(Q_i)    Code Words
    (0, 1)         0                 1        000
    (1/7, 6/7)     0.592             7        001 000, 001 001, ..., 001 110
    (2/7, 5/7)     0.863            21        010 00000, 010 00001, ..., 010 10100
    (3/7, 4/7)     0.985            35        011 000000, 011 000001, ..., 011 100010
    (4/7, 3/7)     0.985            35        100 000000, 100 000001, ..., 100 100010
    (5/7, 2/7)     0.863            21        101 00000, 101 00001, ..., 101 10100
    (6/7, 1/7)     0.592             7        110 000, 110 001, ..., 110 110
    (1, 0)         0                 1        111

(The S_i column of the original table lists the source sequences belonging to each frequency class.)

encode the frequency vector; these are the first 3 bits in every code word. The
remaining bits identify each sequence in a given class .
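The encoding of Table 4.1 can be written out directly. The sketch below (an illustration of the two-part scheme for N = 7, not the author's program) maps each binary block to a 3-bit class index, determined by the number of 1's, followed by an index identifying the block within its class.

```python
from math import comb, ceil, log2
from itertools import combinations

N = 7                                        # block length, as in Table 4.1

def universal_encode(block):
    """Encode a binary string of length N as (class bits, index-within-class bits)."""
    ones = [i for i, b in enumerate(block) if b == "1"]
    n1 = len(ones)                           # the frequency class is determined by the number of 1's
    class_bits = format(n1, f"0{ceil(log2(N + 1))}b")
    # Rank of this block among all blocks with the same number of 1's (lexicographic order).
    rank = list(combinations(range(N), n1)).index(tuple(ones))
    w = comb(N, n1)                          # w(Q): size of the class
    index_bits = format(rank, f"0{ceil(log2(w))}b") if w > 1 else ""
    return class_bits + index_bits

for block in ("0000000", "0100000", "1110100"):
    print(block, "->", universal_encode(block))
```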

PROBLEMS

4.1. Consider a binary memoryless source with P(0) = p_0 and assume that sequences of length N are encoded according to the Shannon-Fano algorithm. If the source statistics now change so that P(0) = p, what can be said about the new average code-word length per source letter?
4.2. The vector n = (n_1, ..., n_K) consists of integers satisfying n_k ≥ 1 for k = 1, ..., K and \sum_{k=1}^{K} n_k = N.
    (a) Show that, for fixed N and K, there exist \binom{N-1}{K-1} different vectors n.


(b) If nk's are allowed to be zero as well, show that the total number of vectors
becomes

~
K

(K) (N - 1)
j

j - 1

Notice that this must be the same as cjJ(K,N) in Lemma 2.


4.3. (a) Show that the quasi-entropy ,!,(Q) is a convex n function of the frequency
vectorQ = (q" ... ,qK)'
(b) Consider the sequence S of length N composed of letters from an alphabet
of size K . This sequence is broken into subsequences Sh .. . ,SM of lengths
nl, . . . , n, where L~~, n m = N. Prove that
M

2:

nm'!'(sm) s; '!'(S)
m ~' N
4.4. In a modified version of the universal encoding algorithm, all frequency classes that correspond to different permutations of the same frequency vector are combined. For example, if Q(S_i) = (q_1, q_2, q_3) and Q(S_j) = (q_2, q_3, q_1), then S_i and S_j belong to the same frequency class.
(a) Encode binary sequences of length 7 according to the modified algorithm.
(b) Prove that the modified algorithm is universal; that is, for any ε > 0 there exists N_0 such that for N > N_0 the average code-word length per source letter n̄ satisfies n̄ < H(p_1, ..., p_K) + ε.
4.5. Consider a DMS with alphabet {a_1, ..., a_K} and unidentified probability distribution P = {p_1, ..., p_K}. P is known, however, to be one of the M distributions P^(1), ..., P^(M). The K^N sequences S_i of length N are encoded according to the following scheme:
(i) Construct M prefix-free codes C_1, ..., C_M, one code for each distribution P^(1), ..., P^(M). The length of the code word assigned to S_i in C_m is ⌈-log p(S_i | m)⌉, where p(S_i | m) is the probability of S_i assuming P = P^(m).
(ii) Upon observing S_i, compare the M codes and choose the one that contains the shortest code word for S_i.
(iii) Identify S_i by an identifier for the chosen code, followed by the code word corresponding to S_i in that code.
Prove that in the limit when N → ∞ this scheme becomes optimum for each of the M possible distributions.

5
MUTUAL INFORMATION,
DISCRETE MEMORYLESS
CHANNEL, AND CHANNEL CAPACITY

5.1 Conditional Entropy and Mutual Information
5.2 Discrete Memoryless Channel and the Channel Matrix
5.3 Symmetric, Lossless, Deterministic, and Useless Channels
5.4 Equivocation and the Rate of Transfer of Information
5.5 Channel Capacity
Problems

Introduction  In this chapter we extend the concept of entropy and introduce an important information-theoretic function known as the average mutual information. The immediate application of these new concepts is in the area of communication over noisy channels, to which the present chapter and the following chapter are devoted. The ideas, however, are quite general and find application in other areas of information theory.
The simplest class of noisy channels is the class of discrete memoryless channels (DMC) described in Section 5.2. Discrete memoryless channels provide reasonably good models for many practical communication situations; moreover, extensions of the theory to more complicated systems are possible only when this simplest class is well understood. It is for this class that the theory of reliable communication will be developed in this book.
In the remaining part of the chapter, we invoke the concept of equivocation to justify the definition of channel capacity given in Section 5.5. The capacity can be determined analytically only in special cases; in general, numerical methods must be employed for its calculation. One such method is described in Chapter 9. The significance of capacity and its role in the theory of reliable communication over noisy channels are the subjects of the next chapter.

5.1 CONDITIONAL ENTROPY AND MUTUAL INFORMATION

Consider the random variables x and y with joint probability distribution p(x_i, y_j), 1 ≤ i ≤ N, 1 ≤ j ≤ M. The conditional entropy of x given y is defined as

    H(x|y) = -Σ_{i=1}^{N} Σ_{j=1}^{M} p(x_i, y_j) log p(x_i|y_j)

H(x|y) can be interpreted as the average amount of uncertainty about x after y has been revealed. In this context, the following version of the same definition is perhaps more informative:

    H(x|y) = Σ_j p(y_j) [ -Σ_i p(x_i|y_j) log p(x_i|y_j) ]

The inner summation is a measure of the uncertainty about x when the outcome of y is known to be y_j. The outer summation is the average of this uncertainty over y. In the following, we derive some important properties of the conditional entropy.

Theorem 1.  H(x|y) ≤ H(x), with equality iff x and y are independent.

Proof

    H(x|y) - H(x) = Σ_i Σ_j p(x_i, y_j) ln [ p(x_i) / p(x_i|y_j) ]

The summation is over all pairs (i, j) for which p(x_i, y_j) ≠ 0. Thus, using ln z ≤ z - 1,

    H(x|y) - H(x) ≤ Σ_i Σ_j p(x_i, y_j) [ p(x_i)/p(x_i|y_j) - 1 ]
                  = Σ_i Σ_j [ p(x_i) p(y_j) - p(x_i, y_j) ] ≤ 0

The inequality is therefore proved. Equality holds if two conditions are satisfied. First, p(x_i) = p(x_i|y_j) for all pairs (i, j) such that p(x_i, y_j) ≠ 0 and, second, the sum of p(x_i)p(y_j) over all such pairs is equal to 1. The second condition is valid iff p(x_i)p(y_j) = 0 whenever p(x_i, y_j) = 0. The conclusion is that H(x|y) = H(x) iff x and y are independent.

Theorem 2.  H(x, y) = H(y) + H(x|y) = H(x) + H(y|x).


Proof

    H(x, y) = -Σ_i Σ_j p(x_i, y_j) log p(x_i, y_j)
            = -Σ_i Σ_j p(x_i, y_j) [ log p(y_j) + log p(x_i|y_j) ]
            = H(y) + H(x|y)

The second part follows from symmetry, completing the proof.
It follows from Theorems 1 and 2 that H(x, y) ≤ H(x) + H(y), with equality iff x and y are independent.
The average amount of information about x contained in y can now be defined in terms of the reduction in the uncertainty of x upon disclosure of y. Denoting this information by I(x, y), we define

    I(x, y) = H(x) - H(x|y)

With the aid of Theorem 2, it is easy to show that

    I(y, x) = H(y) - H(y|x) = I(x, y)

That is, the information about x contained in y is equal to the information about y contained in x. For this reason, I(x, y) is called the average mutual information between x and y.
From Theorem 1 we observe that I(x, y) ≥ 0, with equality iff x and y are independent.
In Problem 5.3 the reader is asked to show that the following relationship is a direct consequence of the definition of I(x, y):

    I(x, y) = Σ_{i=1}^{N} Σ_{j=1}^{M} p(x_i, y_j) log [ p(x_i, y_j) / ( p(x_i) p(y_j) ) ]

One can thus interpret the function log{ p(x_i, y_j) / [p(x_i)p(y_j)] } as the mutual information between x_i and y_j. This function, however, can be positive, zero, or negative, whereas its average, I(x, y), is always greater than or equal to zero.
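These quantities are simple to evaluate numerically. The following minimal Python sketch (an illustration of mine; the joint distribution is an arbitrary example, not one from the text) computes H(x), H(x|y), and I(x, y) from a matrix of joint probabilities p(x_i, y_j).

    import numpy as np

    def entropies(p_xy):
        # p_xy[i, j] = p(x_i, y_j); rows index x, columns index y.
        p_x = p_xy.sum(axis=1)
        p_y = p_xy.sum(axis=0)
        mask = p_xy > 0
        H_x = -np.sum(p_x[p_x > 0] * np.log2(p_x[p_x > 0]))
        p_x_given_y = p_xy / p_y                     # p(x_i | y_j), broadcast over columns
        H_x_given_y = -np.sum(p_xy[mask] * np.log2(p_x_given_y[mask]))
        I_xy = H_x - H_x_given_y                     # average mutual information
        return H_x, H_x_given_y, I_xy

    # Example: x is a fair bit and y is x passed through a BSC with crossover 0.1.
    p_xy = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    print(entropies(p_xy))                           # H(x) = 1, H(x|y) ~ 0.469, I ~ 0.531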

5.2 DISCRETE MEMORYLESS CHANNEL AND THE CHANNEL MATRIX

A discrete memoryless channel (DMC) is identified by a finite input alphabet X = {x_1, ..., x_K}, a finite output alphabet Y = {y_1, ..., y_J}, and a conditional probability distribution p(y_j|x_k), where 1 ≤ k ≤ K, 1 ≤ j ≤ J. The word memoryless implies that the probability of any sequence of output letters given


a sequence of input letters is equal to the product of the conditional probabilities of the individual output letters given the corresponding input letters. In other words,

    p{ y_{j1} ... y_{jN} | x_{k1} ... x_{kN} } = Π_{n=1}^{N} p( y_{jn} | x_{kn} )

for any integer N. A DMC is therefore completely identified by its channel matrix [p(y_j|x_k)], which is a K × J matrix. If a probability distribution on X is given, the distribution of Y can be obtained as follows:

    p(y_j) = Σ_{k=1}^{K} p(x_k) p(y_j|x_k)
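In matrix form the last relation is just a vector-matrix product, as the short sketch below illustrates (the numbers are an arbitrary example of mine, not a channel from the text).

    import numpy as np

    # Channel matrix: rows are inputs x_k, columns are outputs y_j, entries p(y_j | x_k).
    P = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])

    p_x = np.array([0.6, 0.4])        # an input distribution on X
    p_y = p_x @ P                     # p(y_j) = sum_k p(x_k) p(y_j | x_k)

    assert np.allclose(P.sum(axis=1), 1.0)   # each row of a channel matrix sums to 1
    print(p_y)                               # -> [0.52 0.1  0.38]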

In the following section we introduce and discuss some special classes of


DMC . We then define the rate of transfer of information over a channel, which
leads in a natural way to the concept of capacity . Determination of capacity for
different channels is then covered in the remainder of the chapter.

5.3 SYMMETRIC, LOSSLESS, DETERMINISTIC, AND USELESS CHANNELS
Symmetric Channel

By definition, a DMC is symmetric if every row of the channel matrix contains the same set of numbers p'_1, ..., p'_J and every column contains the same set of numbers q'_1, ..., q'_K. The following matrices represent symmetric channels.
    [p(y_j|x_k)] = | 0.2  0.3  0.5 |
                   | 0.3  0.5  0.2 |
                   | 0.5  0.2  0.3 |

    [p(y_j|x_k)] = | 1-e   e  |
                   |  e   1-e |

The last example is the binary symmetric channel (BSC), which represents an important practical situation and has been studied extensively. In this channel, either 0 or 1 is transmitted and either 0 or 1 is received. The probability of error is e and is the same for both transmitted letters.
An important property of a symmetric channel is that its H(y|x) is independent of the input distribution and is solely determined by the channel matrix. This can be proved as follows:

    H(y|x) = -Σ_k Σ_j p(x_k, y_j) log p(y_j|x_k)
           = -Σ_k p(x_k) Σ_j p'_j log p'_j

Therefore, H(y|x) = -Σ_{j=1}^{J} p'_j log p'_j, independent of the input distribution.
Lossless Channel

Figure 5.1 is a graphical representation of the lossless channel. A connection between x_k and y_j means that p(y_j|x_k) is nonzero. In a lossless channel the output uniquely specifies the input; therefore, H(x|y) = 0.

Deterministic Channel

In this channel the input uniquely specifies the output (see Figure 5.2); therefore, H(y|x) = 0.
Useless Channel

A channel is useless iff x and y are independent for all input distributions. For a useless channel, H(x|y) = H(x); that is, knowledge of the output does not reduce the uncertainty about the input. Thus, for the purpose of determining the input, we may ignore the output altogether. In the following we prove that a DMC is useless iff its matrix has identical rows.
(i) Proof of sufficiency: Assume the matrix has identical rows p'_1, ..., p'_J. Then, for every output y_j,

Figure 5.1 (lossless channel)

Figure 5.2 (deterministic channel)


    p(y_j) = Σ_{k=1}^{K} p(x_k, y_j) = Σ_{k=1}^{K} p(x_k) p(y_j|x_k) = p'_j Σ_{k=1}^{K} p(x_k) = p'_j

For every input-output pair (x_k, y_j),

    p(x_k, y_j) = p(x_k) p(y_j|x_k) = p(x_k) p'_j = p(x_k) p(y_j)

Therefore, the input and output are independent, irrespective of the input distribution.
(ii) Proof of necessity: Assume the rows of the matrix are not identical. As a result, there exists a column, say j_0, whose elements are not identical. Assume p(y_{j0}|x_{k0}) is the largest element in this column. Then, for a uniform input distribution,

    p(y_{j0}) = Σ_{k=1}^{K} p(x_k) p(y_{j0}|x_k) = (1/K) Σ_{k=1}^{K} p(y_{j0}|x_k) < p(y_{j0}|x_{k0})

That is, p(y_{j0}) ≠ p(y_{j0}|x_{k0}). Therefore,

    p(x_{k0}, y_{j0}) = p(x_{k0}) p(y_{j0}|x_{k0}) ≠ p(x_{k0}) p(y_{j0})

That is, x and y, at least for the uniform distribution of x, are not independent or, what is the same, the channel is not useless.

5.4 EQUIVOCATION AND THE RATE OF TRANSFER OF INFORMATION

Consider a binary symmetric channel with crossover probability e. Assuming that at the input P(0) = P(1) = 1/2, the rate of generation of input information is equal to H(x) = 1 bit/symbol. A device, called the observer, receives every pair of input/output letters (x, y) and produces an output z that is either 0 (if x = y) or 1 (if x ≠ y), as shown in Figure 5.3. The distribution of z is found as follows:

    P(z = 1) = P(x = 0) P(y = 1 | x = 0) + P(x = 1) P(y = 0 | x = 1) = e/2 + e/2 = e
    P(z = 0) = 1 - P(z = 1) = 1 - e

The rate of generation of information by the observer is therefore given by

    H(z) = -e log e - (1 - e) log(1 - e)    bits/symbol

For a given output sequence y^(1) y^(2) ..., the receiver can exactly reconstruct the input sequence x^(1) x^(2) ... only when the observer output z^(1) z^(2) ... is made available. It is therefore natural to define the rate of transmission of information


Figure 5.3

over the channel, R, as the rate of generation of information H(x) minus the rate of generation of supplemental information H(z):

    R = H(x) - H(z)    bits/symbol

For example, if the input data are generated at the rate of 1000 symbols/second and e = 0.01, we have

    H(x) = 1      →  input data rate = 1000 bits/second
    H(z) = 0.081  →  supplemental data rate = 81 bits/second
    R = 0.919     →  information transmission rate = 919 bits/second

One may argue that in the long run, since e = 0 .01, only 1 percent of the
transmitted bits will be in error (with probability close to 1), and therefore the
rate of information transmission must be 990 bits/second. The answer is that
the knowledge of the number of erroneous bits alone is not sufficient for the reconstruction of the data. One also needs to know the location of the erroneous
bits, and for this reason the information transmission rate is actually at the
lower value of 919 bits/second.
In the extreme case when e = 0, we have H(z) = 0 and consequently R = 1000 bits/second. On the other hand, if e = 0.5, then H(z) = 1, resulting in zero transmission rate. Both conclusions are consistent with our expectations.
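The figures quoted above are easy to verify; a quick check in Python (my own, using the binary entropy function) reproduces them.

    from math import log2

    def h2(p):
        # Binary entropy in bits.
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    e = 0.01
    rate = 1000                       # input symbols (bits) per second
    H_z = h2(e)                       # supplemental information per symbol
    print(H_z)                        # ~0.0808 bits/symbol -> ~81 bits/second
    print(rate * (1 - H_z))           # ~919 bits/second of information transfer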
For the binary symmetric channel with equiprobable inputs, it is easy to verify that H(z) = H(x|y). In general, we will show that exact reconstruction of the input sequence from the output sequence is possible only when the observer can generate supplemental information at a rate greater than or equal to H(x|y). To see this intuitively, observe that for long sequences of length N there are roughly 2^{NH(x|y)} input sequences that can produce a particular output sequence. Only when the supplemental information is generated at the rate of H(x|y) or faster does it become feasible to distinguish among these possibilities. For this reason, H(x|y) is usually referred to as the equivocation of the channel.
The remainder of this section contains a proof of the preceding statements. Specifically, we prove that, in general, exact reconstruction is possible if H(z), the entropy of the supplemental information, is allowed to be larger than H(x|y), the equivocation of the channel.
Consider a DMC with matrix [p(y_j|x_k)] and input distribution {p_1, ..., p_K}. The observer is capable of observing both the input to the channel x_k and the corresponding output y_j. For every output letter, the observer has a code for the input alphabet. Upon observation of y_j, the observer transmits the code word for x_k, chosen from the code that corresponds to y_j. Table 5.1 shows the code-word lengths n_{kj} for all the input-output pairs (x_k, y_j).
Since code words from different codes may now follow each other, it is no longer enough to require that each of the J different codes be uniquely decodable. Prefix-free codes, however, are sufficient to guarantee unique decodability of the encoded sequences, and attention will be confined to them in this discussion. The following theorems concern the average code-word length generated by the observer.

Theorem 1.  n̄ ≥ H(x|y)/log D, where D is the size of the observer's alphabet.

Proof

    H(x|y) - n̄ ln D = -Σ_k Σ_j p(x_k, y_j) ln p(x_k|y_j) - Σ_k Σ_j n_{kj} p(x_k, y_j) ln D
                    = Σ_k Σ_j p(x_k, y_j) ln [ D^{-n_{kj}} / p(x_k|y_j) ]
                    ≤ Σ_k Σ_j p(x_k, y_j) [ D^{-n_{kj}} / p(x_k|y_j) - 1 ]
                    = Σ_j p(y_j) Σ_k D^{-n_{kj}} - 1

Since all codes are prefix free, they satisfy Kraft's inequality: Σ_{k=1}^{K} D^{-n_{kj}} ≤ 1 for all j. Therefore,

    H(x|y) - n̄ ln D ≤ 0

completing the proof. Equality holds iff -log_D p(x_k|y_j) is an integer for all k, j.
Theorem 2.  There exists a set of prefix-free codes for which the average code-word length satisfies

    n̄ < H(x|y)/log D + 1
TABLE 5.1

           y_1     y_2    ...    y_J
    x_1    n_11    n_12   ...    n_1J
    x_2    n_21    n_22   ...    n_2J
    ...
    x_K    n_K1    n_K2   ...    n_KJ


Proof.  Use base D logarithms and let n_{kj} = ⌈-log p(x_k|y_j)⌉. Since n_{kj} ≥ -log p(x_k|y_j), we have

    Σ_k D^{-n_{kj}} ≤ Σ_k p(x_k|y_j) = 1

for every j. Therefore, a prefix-free code can be constructed for each y_j. Since n_{kj} < -log p(x_k|y_j) + 1, we have

    n̄ = Σ_k Σ_j p(x_k, y_j) n_{kj} < -Σ_k Σ_j p(x_k, y_j) log p(x_k|y_j) + Σ_k Σ_j p(x_k, y_j) = H(x|y)/log D + 1

and the proof is complete.


The difference between Ii and H (x Iy)/log D can be arbitraril y reduced
provided that long sequences (instead of single letters) are encoded at a time .
This yields

HD(xly)

:5

Ii < HD(xly) +

where N is the length of the encoded sequences . It is seen therefore that the
minimum rate of supplemental information required for exact reconstruct ion of
the input at the receiver is equal to the equivocation H(x Iy) of the channel; this
is what we had initially set out to prove.
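The construction used in Theorem 2 is mechanical to carry out. The minimal sketch below (my own, with an arbitrary joint distribution) builds the length table n_kj = ⌈-log2 p(x_k|y_j)⌉ and compares the resulting average length with the equivocation.

    import numpy as np

    p_xy = np.array([[0.40, 0.10],        # p(x_k, y_j): rows x, columns y (an arbitrary example)
                     [0.05, 0.45]])
    p_y = p_xy.sum(axis=0)
    p_x_given_y = p_xy / p_y                              # p(x_k | y_j)

    n_kj = np.ceil(-np.log2(p_x_given_y))                 # observer code-word lengths, D = 2
    H_x_given_y = -np.sum(p_xy * np.log2(p_x_given_y))    # equivocation in bits

    n_bar = np.sum(p_xy * n_kj)                           # average code-word length
    print(H_x_given_y, n_bar)
    assert H_x_given_y <= n_bar < H_x_given_y + 1         # the bound of Theorems 1 and 2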

5.5 CHANNEL CAPACITY

In Section 5.4 we saw that the rate of transfer of information over a DMC can be defined as

    R = H(x) - H(x|y)

where H(x) is the rate of generation of information at the input and H(x|y) is the channel's equivocation. It is seen that R is equal to the mutual information I(x, y) between the input and output of the channel. I(x, y) is, in general, a function of the input probability distribution {p_1, ..., p_K}. It is possible, therefore, to find a distribution that maximizes I(x, y). [Since the set of possible distributions is closed and bounded, I(x, y) has a maximum rather than a least upper bound on this set.] The maximum value of I(x, y) is defined as the channel capacity C and is a function of the channel matrix alone:

    C = maximum (over input distributions) of I(x, y)


In general, calculating the channel capacity in closed form is a difficult problem that has not been solved. For the channels introduced in Section 5.3, however, simple solutions exist. In the following paragraphs we derive the capacity for these channels.
Symmetric Channel

For this class of channels, H(y|x) was shown to be independent of the input distribution. To maximize I(x, y) = H(y) - H(y|x), therefore, we have to maximize H(y) over the input distributions. The sample space of y has J elements, and the maximum possible value of H(y) is log J, which is achieved for a uniform distribution of y. This, in turn, is achieved by a uniform distribution of x, the reason being that the columns of a symmetric channel's matrix are permutations of the same set of numbers q'_1, ..., q'_K and, consequently,

    p(y_j) = Σ_{k=1}^{K} p(x_k, y_j) = Σ_{k=1}^{K} p(x_k) p(y_j|x_k) = (1/K) Σ_{k=1}^{K} q'_k

which is a constant, independent of j. Therefore, p(y_j) = 1/J for all j, and the capacity is given by

    C = log J + Σ_{j=1}^{J} p'_j log p'_j

where p'_1, ..., p'_J are the constituents of the rows of the matrix.


For the special case of a binary symmetric channe l with crossover probability s,
C = 1

+ e log, e + (l - e)

logi l - e)

bits/symbol

Figure 5.4 shows a plot of C versus e for the BSC. At e = 0 the capacity is
1 bit/symbol; that is, all the information is being transferred. As e ~ 0,5
the channel becomes less and less efficient, and the capacity is reduced until
at e = 0.5 the channel becomes useless. For 0.5 < e :s I, the efficiency increases again, since a channel that lies consistently is just as informative as a
consistently honest channel.

Figure 5.4 (capacity of the BSC versus the crossover probability e)
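A few lines of Python (mine, not the text's) reproduce the curve of Figure 5.4 and the endpoint values quoted above.

    from math import log2

    def bsc_capacity(e):
        # C = 1 - H(e) for the binary symmetric channel, in bits/symbol.
        if e in (0.0, 1.0):
            return 1.0
        return 1.0 + e * log2(e) + (1.0 - e) * log2(1.0 - e)

    for e in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(e, round(bsc_capacity(e), 4))   # symmetric about e = 0.5, zero at e = 0.5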


Lossless Channel

Here H(x|y) = 0. Therefore,

    C = max{H(x) - H(x|y)} = max{H(x)} = log K

where K is the size of the input alphabet. Capacity is achieved for a uniform input distribution.
Deterministic Channel

Here H(y|x) = 0. Therefore,

    C = max{H(y) - H(y|x)} = max{H(y)} = log J

where J is the size of the output alphabet. The capacity is achieved by assigning the probability 1/J to each group of input letters that corresponds to the same output letter and distributing this probability arbitrarily among the members of each group.
Useless Channel

Here H(x|y) = H(x). Therefore,

    C = max{H(x) - H(x|y)} = max{H(x) - H(x)} = 0

A useless channel thus has zero capacity, as expected.
As stated earlier, the problem of calculating the capacity C of a general DMC has no simple solution. The convexity of I(x, y), however, makes the problem amenable to numerical solutions. For a given channel matrix [p(y_j|x_k)], we now prove that the mutual information I(x, y) is convex over the space of input distributions {p_1, ..., p_K}. Because of the restrictions p_k ≥ 0 and Σ_{k=1}^{K} p_k = 1, the space of input distributions is a subset of the K-dimensional Euclidean space. It is not difficult to show that the set of all possible input distributions in this space is a convex set. Figure 5.5 shows this set for the special case of K = 3. For simplicity of notation, let us refer to I(x, y) as I(P), where P = {p_1, ..., p_K} is a point in the space of input probability distributions. We prove that, for every pair of points P_1, P_2 and for every real number λ in the interval [0, 1], the relationship

    I[λP_1 + (1 - λ)P_2] ≥ λ I(P_1) + (1 - λ) I(P_2)

is valid.

Figure 5.5

Proof

    λ I(P_1) + (1 - λ) I(P_2) - I[λP_1 + (1 - λ)P_2]
      = λ Σ_i Σ_j p_1(x_i) p(y_j|x_i) log [ p(y_j|x_i)/p_1(y_j) ]
        + (1 - λ) Σ_i Σ_j p_2(x_i) p(y_j|x_i) log [ p(y_j|x_i)/p_2(y_j) ]
        - Σ_i Σ_j [ λ p_1(x_i) + (1 - λ) p_2(x_i) ] p(y_j|x_i) log { p(y_j|x_i) / [ λ p_1(y_j) + (1 - λ) p_2(y_j) ] }
      = λ Σ_i Σ_j p_1(x_i) p(y_j|x_i) log { [ λ p_1(y_j) + (1 - λ) p_2(y_j) ] / p_1(y_j) }
        + (1 - λ) Σ_i Σ_j p_2(x_i) p(y_j|x_i) log { [ λ p_1(y_j) + (1 - λ) p_2(y_j) ] / p_2(y_j) }
      ≤ λ Σ_j p_1(y_j) [ ( λ p_1(y_j) + (1 - λ) p_2(y_j) ) / p_1(y_j) - 1 ] log e
        + (1 - λ) Σ_j p_2(y_j) [ ( λ p_1(y_j) + (1 - λ) p_2(y_j) ) / p_2(y_j) - 1 ] log e
      = [ λ² + λ(1 - λ) - λ + λ(1 - λ) + (1 - λ)² - (1 - λ) ] log e = 0

The proof is thus complete.


The convexity of I(x, y) in the space of input distributions makes it convenient to use numerical techniques to determine the channel capacity, since any local maximum of a convex ∩ function is also its global maximum.
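Because of this property, even a crude search over input distributions converges to the capacity. The minimal sketch below (my own illustration; the channel matrix is an arbitrary stand-in, and the grid search is not the algorithm of Chapter 9) scans the input distributions of a two-input channel and reports the largest mutual information found.

    import numpy as np

    def mutual_information(p_x, P):
        # I(x, y) in bits for input distribution p_x and channel matrix P (rows: inputs).
        p_xy = p_x[:, None] * P               # joint distribution p(x_k, y_j)
        p_y = p_xy.sum(axis=0)
        m = p_xy > 0
        return np.sum(p_xy[m] * np.log2(p_xy[m] / (p_x[:, None] * p_y)[m]))

    P = np.array([[0.9, 0.1],                 # an arbitrary binary-input channel
                  [0.2, 0.8]])

    best = max(((mutual_information(np.array([p, 1 - p]), P), p)
                for p in np.linspace(0.0, 1.0, 1001)))
    print(best)                               # (approximate capacity, maximizing p(x_1))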

PROBLEMS
5.1. Prove that conditioning cannot increase entropy; that is,

    H(x|y, z) ≤ H(x|y)

You may assume that x, y, and z are discrete random variables.


5.2. Let x be a discrete random variable and f(·) an arbitrary function. Define the random variable y as y = f(x).
(a) Prove that H(y|x) = 0.
(b) Show that H(y) ≤ H(x). Under what conditions will equality hold?
5.3. Show that I(x, y) = H(x) - H(x|y) can also be written as

    I(x, y) = Σ_i Σ_j p(x_i, y_j) log [ p(x_i, y_j) / ( p(x_i) p(y_j) ) ]

5.4. Let X, Y, and Z be ensembles with two elements in each ensemble so that the eight elements in the joint XYZ ensemble can be taken as the vertices of a unit cube.
(a) Find a joint probability assignment p(x, y, z) such that I(x, y) = 0 and I(x, y|z) = 1 bit.
(b) Find a joint probability assignment p(x, y, z) such that I(x, y) = 1 bit and I(x, y|z) = 0.
5.5. A source has an alphabet of four letters. The probabilities of the letters and two possible binary codes for the source are given next.

    Letter    p(a_k)    Code I    Code II
    a_1       0.4       1         1
    a_2       0.3       01        10
    a_3       0.2       001       100
    a_4       0.1       000       1000

For each code, what is the average mutual information provided about the source letter by the specification of the first letter of the code word?
5.6. Consider the ensemble of sequences of N binary digits, x_1, ..., x_N. Each sequence containing an even number of 1's has probability 2^{-N+1}, and each sequence with an odd number of 1's has probability zero. Find the average mutual informations

    I(x_2, x_1), I(x_3, x_2|x_1), ..., I(x_N, x_{N-1}|x_1, ..., x_{N-2})

Check your result for N = 3.


5.7. Let y and z be discrete random variables and define x = y + z. Show that, in general, H(x|y) ≤ H(z), with equality iff y and z are independent. Can we say the same thing about H(y|x)?
5.8. Consider a random variable x that assumes the values -2, -1, +1, +2 with equal probability. A second random variable y is defined as y = x².
(a) Find the covariance of x and y; that is, Cov(x, y) = E{(x - x̄)(y - ȳ)}.
(b) Find the average mutual information I(x, y).
(c) Are x and y independent? Is there an inconsistency between the results of parts (a) and (b)?

5.9. Consider the DMC of Figure P5.9, with the input probabilities p(x = 0) and p(x = 1) as given there.

Figure P5.9

(a) Compute H(x), H(x|y = 0), H(x|y = 1), H(x|y = 2), and H(x|y).
(b) If you have not made any mistakes, your results must indicate that H(x|y = 2) > H(x); that is, the uncertainty about x becomes larger after y is observed to be equal to 2. Does this represent an inconsistency in information theory?
5.10. For each of the following classes of channels, give an example of the channel matrix and calculate the capacity.
(a) Lossless but not deterministic and not symmetric.
(b) Deterministic but not lossless and not symmetric.
(c) Lossless and deterministic.
(d) Symmetric and lossless but not deterministic.
(e) Useless and deterministic.

5.11. Consider a DMC with x and y as the input and output random variables, respectively. Show that H(y), the entropy of the output, is a convex ∩ function of the input probability distribution and that H(y|x), the conditional entropy of the output given the input, is a linear function of the input probability distribution. Combining these results, show that I(x, y) is convex ∩ on the space of the input probability distributions.

6
CHANNEL THEOREMS

6.1 Conditional Mutual Information and Capacity
6.2 Cascaded Channels
6.3 Reduced Channels and Sufficient Reductions
6.4 Fano's Bound on Equivocation
6.5 Improving Channel Reliability
6.6 Noisy Channel Theorem for the BSC
6.7 Noisy Channel Theorem for the General DMC
6.8 Converse Theorem
Problems

Introduction  In this chapter we discuss some of the most important properties of the discrete memoryless channel. The culmination of the chapter is in the last three sections, where these properties are put together to prove the noisy channel theorem, which is perhaps the most basic and surprising result in information theory. The noisy channel theorem states that, at rates below capacity, information can be transmitted over the channel with negligible probability of error. The converse theorem states that, at rates above capacity, arbitrarily accurate communication is impossible. The noisy channel theorem and its converse thus give the concept of capacity its practical significance.

6.1 CONDITIONAL MUTUAL INFORMATION AND CAPACITY

The conditional mutual information I(x_k, y) is defined as

    I(x_k, y) = Σ_{j=1}^{J} p(y_j|x_k) log [ p(y_j|x_k) / p(y_j) ]


It is the information between the output and a particular input letter x_k of a DMC. The conditional mutual information is a function of both the channel matrix and the input distribution, and its average over all input letters is equal to I(x, y); that is,

    I(x, y) = Σ_{k=1}^{K} p(x_k) I(x_k, y)

In this section we derive a useful relationship between the channel capacity C and the conditional mutual information. The following theorem is helpful
in this derivation.

Theorem 1.  Let f(P) be a convex ∩ function of P = (p_1, ..., p_K), where P is a probability vector. Assuming that the partial derivatives ∂f(P)/∂p_k are continuous, the necessary and sufficient conditions for a vector P* to maximize f(P) are

    ∂f(P*)/∂p_k = γ,    for all k such that p*_k ≠ 0
    ∂f(P*)/∂p_k ≤ γ,    for all k such that p*_k = 0

where γ is a constant.

Proof
(i) Sufficiency: Assume that for some P* and γ the preceding relations are satisfied. We show for every P that f(P) - f(P*) ≤ 0 and therefore prove that P* maximizes f(P). From convexity,

    λ f(P) + (1 - λ) f(P*) ≤ f[λP + (1 - λ)P*]

Therefore,

    f(P) - f(P*) ≤ { f[P* + λ(P - P*)] - f(P*) } / λ

for all λ in the interval (0, 1]. In the limit when λ → 0,

    f(P) - f(P*) ≤ Σ_{k=1}^{K} [∂f(P*)/∂p_k] (p_k - p*_k)

Now, if p*_k ≠ 0, by assumption ∂f(P*)/∂p_k = γ. If p*_k = 0, then ∂f(P*)/∂p_k ≤ γ and p_k - p*_k ≥ 0, yielding [∂f(P*)/∂p_k](p_k - p*_k) ≤ γ(p_k - p*_k). Consequently,

    f(P) - f(P*) ≤ Σ_{k=1}^{K} γ (p_k - p*_k) = γ ( Σ_k p_k - Σ_k p*_k ) = 0

(ii) Necessity: Let P* maximize the function f(P). Then, for every P and every λ in the interval [0, 1],

    f[P* + λ(P - P*)] - f(P*) ≤ 0


In the limit when λ → 0, this inequality yields

    Σ_{k=1}^{K} [∂f(P*)/∂p_k] (p_k - p*_k) ≤ 0

Since P* is a probability vector, at least one of its components must be strictly positive. For simplicity of notation, let p*_1 > 0. Let P = (p*_1 - ε, p*_2, ..., p*_{k0} + ε, ..., p*_K); that is, P differs from P* only in p_1 and p_{k0}. Any ε satisfying -p*_{k0} ≤ ε ≤ p*_1 is acceptable. Now

    Σ_{k=1}^{K} [∂f(P*)/∂p_k] (p_k - p*_k) = -ε ∂f(P*)/∂p_1 + ε ∂f(P*)/∂p_{k0} = ε [ ∂f(P*)/∂p_{k0} - ∂f(P*)/∂p_1 ] ≤ 0

If p*_{k0} > 0, then ε could be either positive or negative, in which case we must have

    ∂f(P*)/∂p_{k0} = ∂f(P*)/∂p_1

If p*_{k0} = 0, then ε must be positive, in which case we must have

    ∂f(P*)/∂p_{k0} ≤ ∂f(P*)/∂p_1

Taking γ = ∂f(P*)/∂p_1, we see that the conditions of the theorem are satisfied. The proof is thus complete.
Theorem 1 is now applied to the mutual information I(x, y) between the input and output of a DMC. I(x, y) is a convex ∩ function of the input distribution P = (p_1, ..., p_K), and its partial derivative with respect to p_k is obtained as follows:

    ∂I(P)/∂p_k = ∂/∂p_k Σ_{i=1}^{K} Σ_{j=1}^{J} p(x_i, y_j) log [ p(y_j|x_i) / p(y_j) ]
               = ∂/∂p_k Σ_{i=1}^{K} p_i Σ_{j=1}^{J} p(y_j|x_i) log [ p(y_j|x_i) / Σ_m p_m p(y_j|x_m) ]
               = Σ_{j=1}^{J} p(y_j|x_k) log [ p(y_j|x_k) / p(y_j) ] - Σ_{i=1}^{K} p_i Σ_{j=1}^{J} p(y_j|x_i) [ p(y_j|x_k) log e / Σ_m p_m p(y_j|x_m) ]
               = Σ_{j=1}^{J} p(y_j|x_k) log [ p(y_j|x_k) / p(y_j) ] - Σ_{j=1}^{J} p(y_j|x_k) log e
               = I(x_k, y) - log e

The maximizing distribution must therefore have the following property:

    I(x_k, y) = γ',    for all k such that p_k ≠ 0
    I(x_k, y) ≤ γ',    for all k such that p_k = 0

Here γ' = γ + log e is a constant. Moreover, for the maximizing distribution,

    I(x, y) = Σ_k p_k I(x_k, y) = γ'

Therefore, γ' = C. These results are summarized in the following theorem.

Figure 6.1

Theorem 2.  A necessary and sufficient condition for an input distribution {p_1, ..., p_K} to achieve capacity is that, for some number C,

    I(x_k, y) = C,    for all k such that p_k ≠ 0
    I(x_k, y) ≤ C,    for all k such that p_k = 0

C is then the capacity of the channel.


Example
For the channel shown in Figure 6.1, it is reasonable to assume that {p(x_1) = 0.5, p(x_2) = 0, p(x_3) = 0.5} is a good choice for the maximizing input distribution, since x_2 seems to be a poor choice for transmission. Straightforward calculations then yield

    I(x_1, y) = 1 bit,    I(x_2, y) = 0,    I(x_3, y) = 1 bit

Since I(x_2, y) < I(x_1, y) = I(x_3, y), the conditions of Theorem 2 are satisfied. The assumed distribution therefore achieves capacity.
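Theorem 2 is convenient for checking a candidate distribution numerically. The sketch below is my own; since the diagram of Figure 6.1 is not reproduced here, the three-input channel matrix is a stand-in chosen only to illustrate the test.

    import numpy as np

    P = np.array([[1.0, 0.0],        # a stand-in channel: x_1 always yields y_1
                  [0.5, 0.5],        # x_2 is split evenly between the two outputs
                  [0.0, 1.0]])       # x_3 always yields y_2

    p_x = np.array([0.5, 0.0, 0.5])  # candidate capacity-achieving distribution
    p_y = p_x @ P

    def cond_mutual_info(k):
        # I(x_k, y) = sum_j p(y_j|x_k) log2 [ p(y_j|x_k) / p(y_j) ]
        row = P[k]
        m = row > 0
        return np.sum(row[m] * np.log2(row[m] / p_y[m]))

    I_k = np.array([cond_mutual_info(k) for k in range(3)])
    C = I_k[p_x > 0][0]                        # common value on the support of p_x
    print(I_k)                                 # [1.0, 0.0, 1.0] for this stand-in channel
    assert np.all(I_k[p_x == 0] <= C + 1e-12)  # condition of Theorem 2 is satisfied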

6 .2 CASCADED CHANNELS

Consider a DMC with input alphabet {x_1, ..., x_K}, output alphabet {y_1, ..., y_J}, and channel matrix [p(y_j|x_k)]. A second DMC with input alphabet {y_1, ..., y_J}, output alphabet {z_1, ..., z_M}, and channel matrix [p(z_m|y_j)] is cascaded with the first channel, as shown in Figure 6.2. DMC2 may be considered as a signal-processing unit operating on the output of DMC1, but in general it can be any channel. In the following we prove that the processing of y does not increase the mutual information between the input and output of a channel; that is, I(x, y) ≥ I(x, z) for all input distributions {p(x_1), ..., p(x_K)}. A consequence of this result is that the channel capacity does not increase by processing the output.

Figure 6.2 (DMC1 followed by DMC2)


    I(x, z) - I(x, y) = Σ_k Σ_j Σ_m p(x_k, y_j, z_m) ln [ p(x_k|z_m) / p(x_k) ]
                        - Σ_k Σ_j Σ_m p(x_k, y_j, z_m) ln [ p(x_k|y_j) / p(x_k) ]
                      = Σ_k Σ_j Σ_m p(x_k, y_j, z_m) ln [ p(x_k|z_m) / p(x_k|y_j) ]
                      ≤ Σ_k Σ_j Σ_m p(x_k, y_j, z_m) [ p(x_k|z_m) / p(x_k|y_j) - 1 ]
                      = Σ_k Σ_j Σ_m p(x_k|y_j, z_m) p(y_j, z_m) [ p(x_k|z_m) / p(x_k|y_j) ] - 1

Now p(x_k|y_j, z_m) = p(x_k|y_j), since the knowledge of z_m does not provide additional information about x_k once y_j is identified. This is a consequence of the definition of cascaded channels. Therefore,

    I(x, z) - I(x, y) ≤ Σ_j Σ_m p(y_j, z_m) Σ_k p(x_k|z_m) - 1 = 1 - 1 = 0

Equality holds iff p(x_k|y_j) = p(x_k|z_m) for all triplets (x_k, y_j, z_m) with nonzero joint probability. If the second channel is lossless, this requirement is obviously satisfied, making I(x, y) = I(x, z). Losslessness, however, is not a necessary condition for the equality of I(x, y) and I(x, z), as shown in the following example.
Example
For the cascaded channels in Figure 6.3,

    p(x_1|y_1) = p(x_1|z_1) = 1

Also,

    p(x_1|y_2) = 2 p(x_1) / [3 - p(x_1)]

and

    p(x_1|z_2) = [ p(x_1, y_2, z_2) + p(x_1, y_3, z_2) ] / [ p(x_1, y_2, z_2) + p(x_1, y_3, z_2) + p(x_2, y_2, z_2) + p(x_2, y_3, z_2) ]
               = 2 p(x_1) / [3 - p(x_1)]

Therefore, p(x_1|y_2) = p(x_1|z_2). Other equalities can be checked similarly. Note that in this example I(x, y) = I(x, z) for all input distributions. There are examples where the equality holds for only some of the distributions.

Figure 6.3
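The inequality I(x, z) ≤ I(x, y) is easy to check by direct computation. The following sketch is mine; both channel matrices are arbitrary examples, not those of Figure 6.3.

    import numpy as np

    def mutual_info(p_x, P):
        # I in bits between the input (distribution p_x) and the output of channel P.
        p_xy = p_x[:, None] * P
        p_y = p_xy.sum(axis=0)
        m = p_xy > 0
        return np.sum(p_xy[m] * np.log2(p_xy[m] / (p_x[:, None] * p_y)[m]))

    P1 = np.array([[0.9, 0.1, 0.0],      # DMC1: 2 inputs, 3 outputs
                   [0.0, 0.2, 0.8]])
    P2 = np.array([[1.0, 0.0],           # DMC2: 3 inputs, 2 outputs
                   [0.5, 0.5],
                   [0.0, 1.0]])

    p_x = np.array([0.4, 0.6])
    I_xy = mutual_info(p_x, P1)
    I_xz = mutual_info(p_x, P1 @ P2)     # cascade matrix: p(z|x) = sum_y p(y|x) p(z|y)
    print(I_xy, I_xz)                    # I(x, z) never exceeds I(x, y)
    assert I_xz <= I_xy + 1e-12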

6.3 REDUCED CHANNELS AND SUFFICIENT REDUCTIONS

In many practical situations it is desired to reduce the number of output letters


of a channel to simplify the processing of data. A reduction can be considered
as cascading with a deterministic channel. An elementary reduction is defined
as a reduction in which two output letters are combined to form a single letter.
In this case the matrix of the new channel can be obtained from the matrix of
the original channel by adding the columns that correspond to the combined letters. If the original matrix is
    | p_{1,1}  ...  p_{1,j}  p_{1,j+1}  ...  p_{1,J} |
    | p_{2,1}  ...  p_{2,j}  p_{2,j+1}  ...  p_{2,J} |
    |   ...                                          |
    | p_{K,1}  ...  p_{K,j}  p_{K,j+1}  ...  p_{K,J} |

and if the elementary reduction is applied to letters y_j and y_{j+1}, the matrix of the reduced channel will be

    | p_{1,1}  ...  p_{1,j} + p_{1,j+1}  ...  p_{1,J} |
    | p_{2,1}  ...  p_{2,j} + p_{2,j+1}  ...  p_{2,J} |
    |   ...                                           |
    | p_{K,1}  ...  p_{K,j} + p_{K,j+1}  ...  p_{K,J} |

Since any reduction can be achieved by successive application of elementary reductions, we confine our analysis here to elementary reductions.
Figure 6.4 shows an elementary reduction where y_1 and y_2 are combined into z_1. From the discussion of cascaded channels in Section 6.2, we know that I(x, z) ≤ I(x, y) with equality iff p(x_k|y_j) = p(x_k|z_m) for all (x_k, y_j, z_m) with p(x_k, y_j, z_m) ≠ 0. For the elementary reduction of Figure 6.4, equality holds iff

Figure 6.4

p(x_k|y_1) = p(x_k|z_1) for all x_k that can reach y_1, and p(x_k|y_2) = p(x_k|z_1) for all x_k that can reach y_2. Now

    p(x_k|z_1) = p(x_k) p(z_1|x_k) / p(z_1)
               = p(x_k) [ p(y_1|x_k) + p(y_2|x_k) ] / [ p(y_1) + p(y_2) ]
               = [ p(y_1) p(x_k|y_1) + p(y_2) p(x_k|y_2) ] / [ p(y_1) + p(y_2) ]

It is now easy to see that I(x, y) = I(x, z) iff p(x_k|y_1) = p(x_k|y_2) for all k.


Note: The reason that p(x_k|y_1) and p(x_k|y_2) must be equal for all x_k, and not just the ones that can reach both y_1 and y_2, is as follows: If x_k does not reach y_1 and y_2, then p(x_k|y_1) = p(x_k|y_2) = 0 and the equality is satisfied. If an x_k exists that reaches y_1 but not y_2, then, in general, p(x_k|z_1) ≠ p(x_k|y_1) and the reduction does not preserve information.
The preceding results are summarized by the following statement:
The outputs y_{j1} and y_{j2} of a DMC with given input distribution {p(x_1), ..., p(x_K)} can be combined without loss of average mutual information between input and output iff p(x_k|y_{j1}) = p(x_k|y_{j2}) for all x_k.

A reduction in which I(x, y) is preserved for all input distributions is called a sufficient reduction. Considering that

    p(x_k|y_j) = p(x_k) p(y_j|x_k) / Σ_i p(x_i) p(y_j|x_i)

equality of p(x_k|y_1) and p(x_k|y_2) requires that

    p(y_1|x_k) / p(y_2|x_k) = Σ_i p(x_i) p(y_1|x_i) / Σ_i p(x_i) p(y_2|x_i)

It is easy to verify that this relation will hold for all input distributions iff p(y_1|x_k)/p(y_2|x_k) is a constant for all k, that is, iff columns 1 and 2 of the channel matrix differ by a multiplicative constant. The two columns can then be combined, and the resulting reduction is a sufficient one.


Example
A four-output channel matrix in which a pair of columns differs by a multiplicative constant can be reduced to a three-output matrix by combining that pair; when the resulting matrix again contains such a pair of columns, a second reduction yields a two-output matrix. Both reductions are sufficient, and therefore the final two-output channel is as good as the original four-output channel.
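A sufficient reduction is mechanical to find: scan the channel matrix for a pair of proportional columns and add them. The sketch below is my own, with an arbitrary matrix chosen so that two sufficient reductions exist; it is not the matrix of the example above.

    import numpy as np

    def sufficient_reductions(P, tol=1e-12):
        # Repeatedly merge any two columns of P that differ by a multiplicative constant.
        P = P.astype(float).copy()
        merged = True
        while merged:
            merged = False
            J = P.shape[1]
            for a in range(J):
                for b in range(a + 1, J):
                    # Columns a and b are proportional iff the K x 2 submatrix has rank < 2.
                    if np.linalg.matrix_rank(P[:, [a, b]], tol=tol) < 2:
                        P[:, a] += P[:, b]
                        P = np.delete(P, b, axis=1)
                        merged = True
                        break
                if merged:
                    break
        return P

    P = np.array([[1/4, 1/8, 1/2, 1/8],
                  [1/8, 1/16, 3/4, 1/16]])
    print(sufficient_reductions(P))   # two successive merges leave a two-output channel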

6.4 FANO'S BOUND ON EQUIVOCATION

Consider the cascade of two discrete memoryless channels shown in Figure 6.5. The second channel is a special channel whose output alphabet Z has a one-to-one correspondence with the first channel's input alphabet X. DMC2 can be interpreted as a decision device, which, upon observation of y_j, decides which input letter x_k has been transmitted. In general, the decision is not free from errors, and the probability of error P_E is given by

    P_E = Σ_{k=1}^{K} Σ_{m=1, m≠k}^{K} p(x_k, z_m)

According to Fano, the following inequality is always satisfied:

    H(x|y) ≤ H(P_E) + P_E log(K - 1)

where H(P_E) = -P_E log P_E - (1 - P_E) log(1 - P_E). Since H(x|z) ≥ H(x|y), it is sufficient to prove Fano's inequality for H(x|z).
Proof

    H(x|z) - H(P_E) - P_E log(K - 1)
      = -Σ_{k=1}^{K} Σ_{m=1}^{K} p(x_k, z_m) log p(x_k|z_m)
        + Σ_{k=1}^{K} Σ_{m≠k} p(x_k, z_m) log [ P_E / (K - 1) ]
        + Σ_{k=1}^{K} p(x_k, z_k) log(1 - P_E)
      = Σ_{k=1}^{K} Σ_{m≠k} p(x_k, z_m) log { P_E / [ (K - 1) p(x_k|z_m) ] }
        + Σ_{k=1}^{K} p(x_k, z_k) log [ (1 - P_E) / p(x_k|z_k) ]
      ≤ log e { Σ_{k=1}^{K} Σ_{m≠k} [ P_E p(z_m)/(K - 1) - p(x_k, z_m) ] + Σ_{k=1}^{K} [ (1 - P_E) p(z_k) - p(x_k, z_k) ] }
      = log e [ P_E - P_E + (1 - P_E) - (1 - P_E) ] = 0

Figure 6.5

The proof is now complete.
Notice that Fano's inequality is independent of the decision scheme being employed. P_E is, of course, a function of the decision scheme, and the bound on equivocation can be improved by reducing the probability of error. P_E can be minimized by assigning to each y_j the input letter x_k that maximizes p(x_k|y_j).
There is an intuitive justification for the validity of Fano's inequality. Consider an observer that has access to both x and z and provides supplemental information to the receiver by transmitting a signal to indicate whether an error has occurred and another signal, in case of error, to identify the correct input. Since the probability of error is P_E, the first signal requires a rate of at least H(P_E). For the second signal, the observer treats all the input letters equally and encodes them into equal-length code words. Since the letter to be identified in the case of an incorrect transmission is among the K - 1 remaining letters, the code-word lengths must be log(K - 1). These code words are transmitted with probability P_E, and therefore the total rate of supplemental information is H(P_E) + P_E log(K - 1), which must be larger than or equal to the channel equivocation H(x|y), since equivocation is the minimum rate at which complete supplemental information can be provided. Fano's inequality is thus justified.
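As a quick numerical sanity check (my own example, not one from the text), the sketch below builds a small channel, applies the MAP decision rule, and confirms that H(P_E) + P_E log(K - 1) ≥ H(x|y).

    import numpy as np

    K = 3
    P = np.array([[0.80, 0.10, 0.10],
                  [0.15, 0.70, 0.15],
                  [0.10, 0.20, 0.70]])        # p(y_j | x_k), an arbitrary noisy channel
    p_x = np.array([0.5, 0.3, 0.2])

    p_xy = p_x[:, None] * P
    p_y = p_xy.sum(axis=0)

    # Equivocation H(x|y).
    H_x_given_y = -np.sum(p_xy * np.log2(p_xy / p_y))

    # MAP decision: for each output choose the most probable input; P_E is the leftover mass.
    P_E = 1.0 - np.sum(p_xy.max(axis=0))

    h = lambda p: 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)
    fano_bound = h(P_E) + P_E * np.log2(K - 1)
    print(H_x_given_y, P_E, fano_bound)
    assert H_x_given_y <= fano_bound + 1e-12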

6.5 IMPROVING CHANNEL RELIABILITY

Consider a binary symmetric channel (BSC) with crossover probability e = 0.01, where data are transmitted at the rate of 1 bit/second. If each bit is repeated three times during transmission, the useful rate will obviously drop to 1 bit every 3 seconds, but the error rate will improve. This scheme is capable of correcting one transmission error if the receiver applies the majority rule in decoding 3-bit sequences. The majority rule is to decide 1 if the majority of bits are 1's and to decide 0 otherwise. The probability of error per information bit is now 3(1 - e)e² + e³ = 2.98 × 10⁻⁴, which is better than the original value


of 0.01. It is possible to improve the error rate even further by repeating each letter 2N + 1 times, but in the limit when N becomes large the rate approaches zero and the communication system becomes useless.
The Hamming distance between two equal-length binary sequences is defined as the number of positions in which the sequences differ from each other. The distance between 001011 and 001101, for example, is 2. The important feature of the repetition scheme just described is that the input letters, 0 and 1, which have a distance of 1 from each other, are replaced with input sequences 11...1 and 00...0, which are a distance of 2N + 1 apart. This feature allows the channel to commit a maximum of N transmission errors without hindering the ability of the receiver to make a correct decision. The setback is, of course, the reduced transmission rate.
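The figure 2.98 × 10⁻⁴ quoted above is the probability that two or three of the three transmitted copies are flipped; the short simulation below (mine) confirms it against the closed-form expression.

    import random

    e = 0.01
    p_error_exact = 3 * (1 - e) * e**2 + e**3      # two or three flips defeat majority voting

    random.seed(1)
    trials, errors = 200_000, 0
    for _ in range(trials):
        flips = sum(random.random() < e for _ in range(3))
        if flips >= 2:                              # the majority of the three copies is wrong
            errors += 1

    print(p_error_exact, errors / trials)           # ~2.98e-4 in both cases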
The essence of the noisy channel theorem of the next section is that it is
not necessary to have excessively small data rates in order to have arbitrarily
accurate communication . The trick is to handpick a set of appropriate sequences from the input alphabet and to confine transmitted signals to these sequences. The distance between each and every pair of selected sequences must
be large to ensure that a large number of transmission errors can be tolerated.
At the same time, the number of these sequences must also be large so that a
reasonable data rate is maintained .
The probability of error is a function of the decision scheme used at the receiver. This is a decision as to which input sequence should be assigned to a given received sequence. Before proceeding to the noisy channel theorem, it is appropriate to say a few words about decision schemes. Let W_i = x_{i1} x_{i2} ... x_{iN} and V_j = y_{j1} y_{j2} ... y_{jN} represent the input and output sequences, respectively. W_i must belong to the set of selected code words, whereas V_j can be any output sequence. The optimum decision scheme upon observation of V_j is to choose the particular W_i that maximizes the backward channel probability p(W_i|V_j). This, however, requires knowledge of the input distribution {p(W_i)}. A suboptimum scheme that is generally used, known as the maximum likelihood (ML) scheme, chooses the sequence W_i that maximizes the forward channel probability p(V_j|W_i). Since p(W_i|V_j) = p(W_i) p(V_j|W_i)/p(V_j), it is easy to show that the ML rule is optimum in the special case where the W_i's are equally likely.
For the BSC with crossover probability e < 0.5,

    p(V_j|W_i) = e^D (1 - e)^{N-D}

where D is the number of positions in which W_i and V_j differ (i.e., the Hamming distance between them). Consequently, the ML rule requires minimization of D (i.e., choosing the code word W_i that is closest to the received sequence V_j in the Hamming sense). Although this is not an optimum decision rule, it is sufficiently good for the fulfillment of the promise of the noisy channel theorem.


6.6 NOISY CHANNEL THEOREM FOR THE BSC

Consider a BSC with crossover probability p. The channel capacity in units of bits/symbol is C = 1 - H(p), where H(p) = -p log p - (1 - p) log(1 - p). Assuming that each symbol has duration τ, the capacity in bits/second is C = [1 - H(p)]/τ. Consider a source that creates information at the rate of R bits/second for transmission over the channel. The noisy channel theorem states that, as long as R is less than C, communication with negligible error rate is possible. This is achieved by encoding sequences of data into certain sequences (code words) from the channel input alphabet and treating the code words as elementary channel inputs. Any degree of reliability is obtained provided that these code words are (1) long enough and (2) far from each other in the Hamming sense.
The first step in the proof of the theorem is the derivation of the error probability P_e for a given set of code words {W_1, ..., W_M}. Each word has length N, and the total number of words is given by

    M = 2^{NRτ}

The justification for the preceding equation is that each word has duration Nτ, and therefore the total number of bits produced by the source in this interval is NRτ. This corresponds to 2^{NRτ} source sequences, each of which must be assigned a unique W_i.
In the second step, P_e is averaged over all possible code-word sets of length N and size M to yield P̄_e. This is equivalent to randomly selecting M code words from the set of 2^N possible binary sequences and calculating the corresponding error probability. The random coding argument enables us to obtain a closed form for P̄_e, which is then used to identify the conditions under which P̄_e can be made arbitrarily small. Finally, we argue that, since P̄_e is the average error probability over the set of all possible codes {W_1, ..., W_M}, there must exist at least one code for which P_e is less than or equal to P̄_e. We thus conclude that under certain conditions a set of code words exists for which the probability of error is arbitrarily small. These steps are now described in detail.
Assume the code {W_1, ..., W_M}, consisting of code words of length N, is used over a BSC with crossover probability p. Since the channel is memoryless, the average number of transmission errors for each W_i is Np. Given two arbitrary positive numbers ε and δ, the law of large numbers asserts that for sufficiently large N the probability of receiving a sequence V_j whose distance from W_i is larger than N(p + ε) is less than δ. Define the test set as the set of all binary sequences of length N whose distance from V_j is less than N(p + ε). The decision strategy is the following: If a code word such as W_i belongs to the test set and no other code word is in the test set, announce W_i as the transmitted sequence; otherwise, announce that an error has occurred. An error thus occurs if (1) the transmitted word W_i is not in the test set or (2) some other code

word, say W_j, is in the test set. An upper bound on the error probability P_e is therefore given by

    P_e ≤ δ + Σ_{j=1, j≠i}^{M} P{W_j belongs to the test set}

The average of P_e over all possible codes of length N and size M can now be calculated using the random coding argument. Assume the code words W_1, ..., W_M are chosen independently and with equal probability from the set of all possible sequences of N binary digits. (The random selection is done with replacement; therefore, the possibility of two or more identical code words in a given code exists.) The probability that W_j belongs to the test set is now equal to the number of sequences in the test set divided by the total number of sequences of length N, the reason being that W_j is selected independently of W_i and according to a uniform distribution:

    P{W_j belongs to the test set} = 2^{-N} Σ_{k=0}^{N(p+ε)} (N choose k)

The summation on the right side gives the number of sequences of length N that differ from a given sequence in at most N(p + ε) places. Noting that the preceding is independent of i, the upper bound on the average error probability can be written as
    P̄_e ≤ δ + (M - 1) 2^{-N} Σ_{k=0}^{N(p+ε)} (N choose k)

Using the following well-known inequality (see the Appendix),

    Σ_{k=0}^{αN} (N choose k) ≤ 2^{NH(α)},    α < 0.5

we have

    P̄_e ≤ δ + M · 2^{-N[1 - H(p+ε)]}

Replacing for M in this equation then yields

    P̄_e ≤ δ + 2^{-N[1 - H(p+ε) - Rτ]}

Since ε and δ can be arbitrarily small, it is seen that, as long as R is less than [1 - H(p)]/τ, the average error probability can be made arbitrarily small by increasing N. As we have argued earlier, a small P̄_e means that at least one code exists that has a small error probability P_e. The proof is thus complete.
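To get a feel for the exponent, the sketch below (mine) evaluates the bound P̄_e ≤ δ + 2^{-N[1 - H(p+ε) - Rτ]} for a BSC with p = 0.05 and a rate safely below capacity, showing how fast it decays with the block length N.

    from math import log2

    def H(p):
        return 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1-p)*log2(1-p)

    p, eps, delta = 0.05, 0.01, 1e-6
    R_tau = 0.5                                    # bits per channel use; capacity is 1 - H(0.05) ~ 0.71
    exponent = 1 - H(p + eps) - R_tau              # positive, since the rate is below capacity

    for N in (100, 500, 1000, 5000):
        bound = delta + 2.0 ** (-N * exponent)
        print(N, bound)                            # the second term shrinks exponentially in N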


6.7 NOISY CHANNEL THEOREM FOR THE GENERAL DMC

Consider a DMC with input alphabet {x_1, ..., x_K}, output alphabet {y_1, ..., y_J}, transition matrix [p(y_j|x_k)], and capacity C. All input-output letters have equal duration τ. The input distribution that achieves capacity is denoted P*_X and, without loss of generality, we assume that p*(x_k) ≠ 0 for 1 ≤ k ≤ K. [If p*(x_i) = 0 for some letter, omit x_i from the list of the input alphabet. The omission is of no consequence since the promise of the noisy channel theorem can still be fulfilled.] The output distribution corresponding to P*_X is denoted P*_Y. If an information source generates data at the rate of R bits/second with R < C, then the noisy channel theorem states that the data can be transmitted over the channel with negligible probability of error. The proof is similar to the one given for the BSC in Section 6.6, and only the major differences are outlined in this section.
First, a general distance between sequences must be defined, since the Hamming distance is only applicable to the binary case. The maximum likelihood rule compares p(V_j|W_i) for all code words W_i upon arrival of V_j and chooses the code word with the largest forward probability. The following definition of the distance between W_i and V_j therefore seems reasonable:

    D(W_i, V_j) = log [ p*(V_j) / p(V_j|W_i) ]

p*(V_j) is the probability of V_j when the input letters are chosen independently and according to the capacity-achieving distribution P*_X. The reason for the appearance of p*(V_j) in this definition will become clear later. As far as the ML rule is concerned, however, p*(V_j) is a constant and does not affect the decision scheme. Since the logarithm is a monotonically increasing function, the inequality p(V_j|W_i) ≥ p(V_j|W_k) is equivalent to D(W_i, V_j) ≤ D(W_k, V_j). Therefore, the ML rule yields the minimum-distance decision scheme in this case.
Assuming that a given code word W_{i0} has been transmitted, the average distance between W_{i0} and the received sequence V_j is

    D̄(W_{i0}, V_j) = Σ_j p(V_j|W_{i0}) D(W_{i0}, V_j) = -Σ_j p(V_j|W_{i0}) log [ p(V_j|W_{i0}) / p*(V_j) ]

The summation is over all output sequences V_j = (y_{j1}, ..., y_{jN}). Since the channel is memoryless,

    p(V_j|W_{i0}) = Π_{n=1}^{N} p(y_{jn}|x_{in})


and since p*(V_j) is obtained by independent selection of input letters,

    p*(V_j) = Π_{n=1}^{N} p*(y_{jn})

Therefore,

    D̄(W_{i0}, V_j) = -Σ_{n=1}^{N} Σ_{j=1}^{J} p(y_j|x_{in}) log [ p(y_j|x_{in}) / p*(y_j) ]

x_{in} is the nth letter of W_{i0}, and J is the size of the output alphabet. The inner summation is the conditional mutual information for x_{in} and, according to a previous theorem, is equal to C, independent of x_{in}. Thus

    D̄(W_{i0}, V_j) = -NC

for all i_0. The following decision scheme is now equivalent to the one used in Section 6.6. Given a received sequence V_j, calculate the distance between V_j and each code word W_i. If there is one and only one code word with a distance less than -NC + Nε from V_j, choose it as the transmitted sequence. Otherwise, announce that an error has occurred. The probability of error P_e upon transmission of W_i is now written as

    P_e ≤ δ + Σ_{j=1, j≠i}^{M} P{W_j closer than -N(C - ε) to the received sequence}

Using the random coding argument, P_e can now be averaged over all possible code-word sets. The random selection of code words, however, is according to the capacity-achieving distribution P*_X:

    P{W_j closer than -N(C - ε) to the received sequence}
      = Σ_{all V} p*(V) Σ_{all W with D(W,V) < -N(C-ε)} p*(W)
      = Σ_{all pairs (W,V) with D(W,V) < -N(C-ε)} p*(V) p*(W)

All pairs (W, V) with D(W, V) < -N(C - ε) satisfy the following:

    log [ p*(V) / p(V|W) ] < -N(C - ε)

Therefore,

    p*(V) < p(V|W) · 2^{-N(C-ε)}

Consequently,

    Σ_{all pairs} p*(V) p*(W) < 2^{-N(C-ε)} Σ_{all pairs} p*(W) p(V|W) = 2^{-N(C-ε)} Σ_{all pairs} p*(W, V) ≤ 2^{-N(C-ε)}


Combining all the results, we conclude that

    P̄_e ≤ δ + M · 2^{-N(C-ε)} = δ + 2^{-N(C-ε-Rτ)}

Since ε and δ can be arbitrarily small, the average probability of error approaches zero with increasing N, provided that Rτ < C. A small P̄_e means that at least one appropriate code exists, and thus the proof is complete.
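The distance D(W, V) = log[p*(V)/p(V|W)] and the identity D̄ = -NC are easy to verify numerically for a small channel. The sketch below is my own; it uses a BSC, for which the capacity-achieving distribution is uniform and p*(V) = 2^{-N} for every output sequence.

    import numpy as np
    from itertools import product

    p = 0.1                                    # BSC crossover probability
    C = 1 + p*np.log2(p) + (1-p)*np.log2(1-p)  # capacity; achieved by uniform inputs
    N = 6
    W = (0, 1, 1, 0, 1, 0)                     # an arbitrary transmitted code word

    def p_given(V, W):
        d = sum(v != w for v, w in zip(V, W))  # Hamming distance
        return p**d * (1-p)**(N-d)

    avg_D = 0.0
    for V in product((0, 1), repeat=N):
        p_star_V = 2.0**(-N)                   # uniform inputs make every output sequence equally likely
        p_V_W = p_given(V, W)
        avg_D += p_V_W * np.log2(p_star_V / p_V_W)

    print(avg_D, -N * C)                       # the two agree: the average distance is -NC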
6.8 CONVERSE THEOREM

The noisy channel theorem states that the output of a source generating information at a rate below channel capacity can be transmitted with negligible error probability. The converse theorem is concerned with the opposite situation and states that at rates above channel capacity arbitrarily accurate communication is impossible. There are several versions of the converse theorem in the literature. The strong converse theorem asserts that at rates above capacity the probability of message error approaches 1 with increasing length of the message sequences. This version has only been proved for block codes and, in addition, does not provide any information about the probability of single-letter errors. Another version of the theorem states that the average probability of error per source letter is bounded away from zero when the source maintains a rate above channel capacity. In other words, the probability of error per source letter (on the average) cannot be made to go to zero no matter what coding technique is employed.
These theorems will not be proved here. Instead, we consider a simple version of the converse theorem that is based on Fano's inequality. Consider M messages W_1, ..., W_M of length N that are used with equal probability for transmission over a DMC. If the input ensemble is W and the output ensemble is V, then

    H(W|V) = -I(W, V) + H(W) = -I(W, V) + log M

But

    I(W, V) = H(V) - H(V|W) = H(V) - Σ_{n=1}^{N} H(y^(n)|x^(n))
            ≤ Σ_{n=1}^{N} H(y^(n)) - Σ_{n=1}^{N} H(y^(n)|x^(n)) = Σ_{n=1}^{N} I(x^(n), y^(n)) ≤ NC

Therefore,

    H(W|V) ≥ -NC + log M

On the other hand, according to Fano's inequality,

    H(W|V) ≤ P_e log(M - 1) - P_e log P_e - (1 - P_e) log(1 - P_e)

Therefore,

    -NC + log M ≤ P_e log(M - 1) + H(P_e) ≤ P_e log M + log 2

Replacing M with 2^{NRτ} yields

    P_e ≥ (Rτ - C)/(Rτ) - 1/(NRτ)

It is readily observed that, for Rτ > C and sufficiently large N, the probability of error is bounded away from zero. This completes the proof of the converse theorem.

PROBLEMS
6.1. The channel shown in Figure P6.1 is known as the binary erasure channel. (A bit transmitted through this channel may be erased but can never be received in error.) Assuming that P(x = 0) = p_0, calculate I(x, y) in terms of p_0. What is the channel capacity?

Figure P6.1

6.2. Suppose that the transmitter in the binary erasure channel of Problem 6.1 is connected to the receiver via a feedback loop and can repeat the erased bits until they
are correctly received . Assuming that the input stream of data is i.i .d ., determine
the average number of transmitted information bits per use of the channel.
6.3. Find a discrete memoryless channel together with an input distribution such that the sequence of outputs y^(1) ... y^(N) is independent even though the corresponding sequence of inputs x^(1) ... x^(N) might be dependent.


6.4. In a joint ensemble XY, the mutual information I(x_i, y_j) = log [p(x_i, y_j)/(p(x_i)p(y_j))] is a random variable. In this problem we are concerned with the variance of that random variable.
(a) Prove that Var[I(x_i, y_j)] = 0 iff there is a constant α such that, for all x_i, y_j with p(x_i, y_j) ≠ 0,

    p(x_i, y_j) = α p(x_i) p(y_j)

(b) Express I(x, y) in terms of α and interpret the special case α = 1.
(c) For each of the channels in Figure P6.4, find an input distribution such that I(x, y) ≠ 0 and Var[I(x_i, y_j)] = 0. Calculate I(x, y).

Figure P6.4

6.5. A DMC is called additive modulo K if it has the input and output alphabet 0, 1, ..., K - 1 and the output y is given in terms of the input x and a noise variable z by y = x ⊕ z. The noise variable takes on values 0, 1, ..., K - 1 and is statistically independent of the input, and the addition x ⊕ z is taken modulo K (i.e., as x + z or x + z - K, whichever is between 0 and K - 1).
(a) Show that I(x, y) = H(y) - H(z).
(b) Find the capacity in terms of H(z) and find the maximizing input distribution.
6.6. Two discrete memory less channels A and B are connected in cascade . Assuming
that MA and MB are the individual channel matrices, determine the matrix of the
cascade channel.
6.7. Consider a DMC with input alphabet {x_1, ..., x_K} and output alphabet {y_1, ..., y_J}. Given the input distribution p(x_k) for 1 ≤ k ≤ K and the channel transition matrix [p(y_j|x_k)], the optimum detection rule is as follows: For a received letter y_j, choose x_k such that p(x_k|y_j) is maximum. Let P(E|y_j) be the probability of error when y_j is received.
(a) Express P(E|y_j) in terms of the backward channel probabilities p(x_k|y_j).
(b) Show that H(x|y_j) ≥ 2P(E|y_j) when entropy is in bits [see Problem 2.7(d)].
(c) Prove the following upper bound on the total probability of error, P(E), in terms of equivocation:

    P(E) ≤ 0.5 H(x|y)    (in bits)
6.8. A DMC is characterized by the matrix

If p(x_1) = ~ and p(x_2) = p(x_3) = ~, find the optimum decision scheme and calculate the associated probability of error.
6.9. A source produces statistically independent, equally probable letters from an alphabet (a_1, a_2) at a rate of one letter each 3 seconds. These letters are transmitted over a binary symmetric channel that is used once each second by encoding the source letter a_1 into the code word 000 and encoding a_2 into the code word 111. If, in the corresponding 3-second interval at the channel output, any of the sequences 000, 001, 010, 100 is received, a_1 is decoded; otherwise, a_2 is decoded. Let e < 0.5 be the channel crossover probability.
(a) For each possible received three-digit sequence in the interval corresponding to a given source letter, find the probability that a_1 came out of the source given that received sequence.
(b) Using part (a), show that the preceding decoding rule minimizes the probability of an incorrect decision.
(c) Find the probability of an incorrect decision [using part (a) is not the easy way here].
(d) Suppose the source is slowed down to produce one letter each 2n + 1 seconds, a_1 being encoded into 2n + 1 zeros and a_2 into 2n + 1 ones. What decision rule minimizes the probability of an incorrect decision at the decoder? Find this probability of incorrect decision in the limit as n → ∞. (Hint: Use the law of large numbers.)

6.10. Let the discrete memoryless channels A and B have matrices M_A and M_B and capacities C_A and C_B, respectively.
(a) The sum of the channels is a new channel with the following matrix:

    | M_A   0  |
    |  0   M_B |

It is operated by choosing either A or B at a given time and transmitting a letter through it. The channel used can always be identified by the input-output symbols. Show that the capacity C of the sum channel is given by

    2^C = 2^{C_A} + 2^{C_B}
(b) The product of the channels is defined as the channel whose inputs are the
pairs (xia, xjb) and whose outputs are the pairs (yma, ynb), where xia and
yma are the input and output letters of channel A, while xjb and ynb are the
input and output letters of channel B. The elements of the channel matrix are
given by

        P(yma, ynb | xia, xjb) = P(yma | xia) P(ynb | xjb)

The product channel is thus operated by using the two channels simultaneously for each transmission. Show that the capacity of the product channel
is C = CA + CB.
6.11. For each of the following channels, determine the capacity and a maximizing
input distribution.


6.12. Consider a BSC with crossover probability e and a DMS that produces binary
data at the rate of R = 3 bits per channel symbol [p(0) = p(1) = 1/2]. The following two strategies for the reproduction of source data at the channel output are
suggested:
(i) Of every 3 bits produced by the source, transmit 1 bit and at the channel output reconstruct the other 2 bits by coin flipping.
(ii) For every 3-bit block of source data, transmit 1 bit that represents the majority and at the channel output repeat each received bit three times.
Compute the average probability of error in each case and compare the two strategies. Which is better?
6.13. (a) Find the capacity of the channel of Figure P6.13.
(b) If the code words W1 = x1, W2 = x3 constitute the set of permissible code
words for this channel, determine the probability of error per message (i.e.,
per code word). What is the rate of generation of information R (in bits per
channel symbol) that could be transmitted over this channel with the preceding choice of code words?
(c) Same as part (b) except for the code words. The code in this case consists of
W1 = x1x1, W2 = x2x3, W3 = x3x5, W4 = x4x2, and W5 = x5x4.
Figure P6.13
6.14. Consider a binary channel in which noise affects the input in blocks of seven
symbols. A block of seven is either transmitted without error or exactly one sym-

bol out of seven is incorrect. These eight possibilities are equally likely. Define W
and V as ensembles of the 7-bit input and output sequences, respectively.
(a) Calculate H(V IW).
(b) What is the maximum value of I(W, V)?
(c) What is the capacity of the original binary channel?
6.15. Consider all binary sequences of length n = 7. We wish to choose M code words
W1, ..., WM for use on a binary channel. To correct all single errors that occur
during transmission, the Hamming distance between all code words Wi and Wj,
i ≠ j, must be ≥ 3.
(a) How many binary sequences have a distance less than or equal to 1 from a
given sequence Wi?
(b) What is the maximum possible value of M?
(c) Assuming that the maximum value of M can be achieved, what is the rate of
generation of information that can be transmitted over the channel?

7
RATE-DISTORTION THEORY

7.1 Source Coding with a Fidelity Criterion
7.2 Rate-distortion Function and Its Properties
7.3 Fundamental Theorem of Rate-distortion Theory
7.4 Converse Theorem
    Problems

Introduction    In Chapters 2 and 3 we studied the problem of source
coding for discrete memoryless sources and learned that the minimum rate at
which data could be accurately reproduced was the source entropy H(x). In this
chapter we address the problem of source coding with a fidelity criterion. Here
a distortion measure is defined and, accordingly, the deviation of the reproduced data from the original (source) data is quantified. The problem is then to
minimize the data rate for a given level of distortion or, what amounts to the
same thing, to minimize the level of distortion for a given data rate. It turns out
that, for a broad class of sources and corresponding distortion measures, a rate-distortion function R(D) can be obtained that has the following properties:
(i) For any given level of distortion D, it is possible to find a coding
scheme with rate arbitrarily close to R(D) and average distortion arbitrarily close to D.
(ii) It is impossible to find a code that achieves reproduction with fidelity
D (or better) at a rate below R(D).
In the following discussion we confine our attention to discrete memoryless
sources and single-letter fidelity criteria for which the rate-distortion function
can be easily defined and its properties are easy to prove. The results, however,
are applicable to a wide variety of sources, and the interested reader is referred
to the vast literature on rate-distortion theory for further details [3, 5, 7, 8].

7.1 SOURCE CODING WITH A FIDELITY CRITERION

Consider a DMS with alphabet {a1, ..., aK} and probability distribution
{p1, ..., pK}. Sequences of length N from this source will be encoded into
sequences of length N from the reproducing alphabet {b1, ..., bJ}. The single-letter distortion measure is defined as d(ak, bj), which assigns a real, nonnegative value to every pair (ak, bj). Let Wi = ai1 ai2 ... aiN be a source sequence
whose reproduced version in a given encoding scheme is Vj = bj1 bj2 ... bjN.
The distortion per letter of Wi is then given by

        d(Wi, Vj) = (1/N) Σ_{n=1}^{N} d(a_in, b_jn)        (7.1)

The average distortion will then be the statistical average of d(Wi, Vj) over all
source sequences; that is,

        d̄ = Σ_{all Wi} p(Wi) d(Wi, Vj)        (7.2)

In general, a code consists of M code words {V1, ..., VM}, and each Wi is assigned a code word Vj such that d(Wi, Vj) is minimized. The average distortion
is given by Eq. (7.2), and the rate of the code is defined as

        R = (1/N) log2 M        (7.3)

in bits per source letter. A code is said to be D-admissible if the average distortion d̄ associated with the code is less than or equal to D. The smallest value of
R among all D-admissible codes is denoted R(D) and is known as the rate-distortion function.
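As a concrete illustration of Eqs. (7.1) to (7.3), the short Python sketch below computes the average distortion and the rate of a given code by exhaustive evaluation. The source distribution, distortion measure, and code words in it are hypothetical placeholders chosen only for the example, not values from the text.

    # A minimal sketch of Eqs. (7.1)-(7.3): average distortion and rate of a block code.
    # The source, distortion matrix, and code words below are illustrative placeholders.
    from itertools import product
    from math import log2

    p = {'a1': 0.5, 'a2': 0.5}                      # hypothetical DMS
    d = {('a1', 'b1'): 0, ('a1', 'b2'): 1,          # hypothetical single-letter distortion
         ('a2', 'b1'): 1, ('a2', 'b2'): 0}
    code = [('b1', 'b1'), ('b2', 'b2')]             # M = 2 code words of length N = 2

    def letter_distortion(W, V):
        """Eq. (7.1): per-letter distortion between a source block W and a reproduction V."""
        return sum(d[a, b] for a, b in zip(W, V)) / len(W)

    N = len(code[0])
    avg_distortion = 0.0
    for W in product(p, repeat=N):                  # every source sequence Wi
        pW = 1.0
        for a in W:
            pW *= p[a]
        best = min(letter_distortion(W, V) for V in code)   # encoder picks the closest Vj
        avg_distortion += pW * best                 # Eq. (7.2)

    rate = log2(len(code)) / N                      # Eq. (7.3), bits per source letter
    print(avg_distortion, rate)                     # here: distortion 0.25 at rate 0.5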
The minimum distortion is achieved if each letter ak is assigned a letter bj0
with the property that d(ak, bj0) ≤ d(ak, bj) for all j. The minimum distortion is
then given by

        dmin = Σ_{k=1}^{K} p(ak) d(ak, bj0)        (7.4)

dmin will be zero if for every ak there exists at least one bj such that d(ak, bj) =
0; otherwise, dmin will be strictly positive. In any event, R(D) is defined only
for D ≥ dmin. For D = dmin, each letter ak is assigned a certain bj0. In the worst
case, all the bj0 will be different and the code will have a rate equal to the
source entropy. Thus, in general,

        R(dmin) ≤ - Σ_{k=1}^{K} pk log pk        (7.5)

The function R(D) is a nonincreasing function of D. To see this, let D1 >
D2 ≥ dmin. Then the set of D2-admissible codes will be a subset of the set of
D1-admissible codes and, consequently, R(D1) ≤ R(D2).


As D increases, the function R(D) decreases until at some value of D, say
dmax, it becomes equal to zero and remains zero afterward. To determine the
value of dmax we notice in Eq. (7.3) that R = 0 if M = 1. Therefore, we must
look for a code with a single code word that has the smallest distortion among
all codes of size M = 1. Let us call this code word V0 = b01 b02 ... b0N. The
average distortion associated with {V0} is given by

        d̄ = Σ_{all Wi} p(Wi) d(Wi, V0)
          = Σ_{all Wi} p(ai1 ... aiN) (1/N) Σ_{n=1}^{N} d(a_in, b_0n)
          = (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} p(ak) d(ak, b_0n)        (7.6)

To minimize d̄, one must minimize the inner sum in Eq. (7.6) by choosing the
proper b_0n. This choice is independent of n, however, and thus the minimum
value of d̄ for a code of size M = 1 is

        min_j Σ_{k=1}^{K} p(ak) d(ak, bj)        (7.7)

This is the smallest value of D at which R(D) becomes equal to zero, and we
denote it by dmax.
From the foregoing discussion it is concluded that the rate-distortion
function R(D) is a nonincreasing function of D, and the interval [dmin, dmax],
whose boundaries are defined by Eqs. (7.4) and (7.7), is the range of interest.
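The two endpoints are simple to evaluate numerically. The sketch below computes dmin and dmax from Eqs. (7.4) and (7.7); the probability vector and distortion matrix in it are hypothetical numbers used only for illustration.

    # Sketch of Eqs. (7.4) and (7.7): the endpoints of the interval of interest for R(D).
    # The probability vector and distortion matrix are hypothetical examples.
    p = [0.5, 0.3, 0.2]                       # p(a_k), k = 1..K
    d = [[0.0, 1.0],                          # d(a_k, b_j): rows are source letters,
         [1.0, 0.0],                          # columns are reproduction letters
         [0.5, 0.5]]

    # Eq. (7.4): each a_k is mapped to its individually best b_j.
    d_min = sum(pk * min(row) for pk, row in zip(p, d))

    # Eq. (7.7): a single reproduction letter b_j serves every source letter.
    d_max = min(sum(pk * d[k][j] for k, pk in enumerate(p)) for j in range(len(d[0])))

    print(round(d_min, 3), round(d_max, 3))   # 0.1 and 0.4 for the numbers above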
Example 1
Consider a memoryless binary source with alphabet {0, 1} and probability distribution
{p0, 1 - p0}, where p0 ≤ 1/2. The reproducing alphabet is also {0, 1}, and the distortion
measure is given in Figure 7.1. The minimum distortion is achieved when 0 is mapped
to 0 and 1 is mapped to 1; thus dmin = 0. This is obviously the case of perfect reconstruction, and therefore

        R(dmin) = H(p0) = -p0 log p0 - (1 - p0) log(1 - p0)

dmax is obtained from Eq. (7.7) by noting that

        Σ_{k=1}^{K} p(ak) d(ak, bj) = { p0 × 0 + (1 - p0) × 1 = 1 - p0,   bj = 0
                                        p0 × 1 + (1 - p0) × 0 = p0,       bj = 1

Since p0 ≤ 1/2 ≤ 1 - p0, we conclude that dmax = p0.

Figure 7.1

We are not yet in a position to calculate R(D) for values of D in the interval [dmin, dmax], but we shall give an example of methods that are generally
used in data compression. This example will bring out the essence of source
coding with a fidelity criterion and will shed light on the intimate relation
between the noisy channel theorem of Chapter 6 and the fundamental theorem
of rate-distortion theory. The method is based on the (7,4) Hamming error-correcting code, and we digress at this point to introduce this code.
The (7,4) Hamming Code
Consider the circle diagram in Figure 7.2, where the three circles are drawn such that
their intersections create the regions numbered 1 through 7. Using this diagram, it is
possible to construct 16 binary sequences of length 7 according to the following rules:
(1) Bits 1, 2, 3, and 4 are chosen arbitrarily and placed in the corresponding regions of
the circle diagram, regions 1 to 4. (2) Bits 5, 6, and 7 are chosen such that when
placed in the corresponding regions of the diagram the total number of 1's in each circle
becomes an even number. These are thus the parity check bits.
We now show that the 16 sequences constructed according to these rules form
a single-error-correcting code. If a code word were transmitted over a BSC and the
received sequence satisfied all the parity checks, we would say that no errors have
occurred. A single error in bits 5, 6, or 7 would violate one of the parity checks, a
single error in bits 2, 3, or 4 would violate two parity checks, while an error in bit 1
would violate all three parity checks. In each case it is possible to detect the erroneous
bit by consulting the circle diagram. Thus, whether the received sequence is error free
or contains a single error, it is always possible to recover the transmitted sequence.
Now consider the space of all binary sequences of length 7. There are 2^7 = 128
points in this space. The 16 code words of the Hamming code can be considered as the
centers of 16 nonoverlapping spheres of radius 1 in this space. Each sphere contains
the code word plus seven other sequences that have a unit distance from it. The total
number of points covered by the spheres is 16 × 8 = 128, which is all the points that
belong to the space. Therefore the spheres cover the entire space.

Figure 7.2
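The sphere-covering claim is easy to check by brute force. The sketch below builds the 16 code words from three even-parity checks of the kind just described, assuming one concrete assignment of regions to circles (circle 1 = {1,2,3,5}, circle 2 = {1,2,4,6}, circle 3 = {1,3,4,7}, with bit 1 in all three circles, bits 2 to 4 in two each, and the parity bits in one each, since the diagram itself is not reproduced here), and verifies that the unit-radius spheres cover all 128 sequences.

    # Sketch of the (7,4) Hamming construction described above, under an assumed
    # assignment of regions to circles.
    from itertools import product

    circles = [(1, 2, 3, 5), (1, 2, 4, 6), (1, 3, 4, 7)]

    codewords = []
    for info in product((0, 1), repeat=4):            # bits 1..4 chosen freely
        bits = dict(zip((1, 2, 3, 4), info))
        for parity_bit, circle in zip((5, 6, 7), circles):
            bits[parity_bit] = sum(bits[i] for i in circle if i != parity_bit) % 2
        codewords.append(tuple(bits[i] for i in range(1, 8)))

    # Every length-7 sequence should lie within Hamming distance 1 of some code word.
    covered = set()
    for w in codewords:
        covered.add(w)
        for i in range(7):                            # the 7 sequences at distance 1 from w
            covered.add(w[:i] + (1 - w[i],) + w[i + 1:])
    assert len(codewords) == 16 and len(covered) == 2 ** 7   # 16 spheres of size 8 cover all 128 points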


Let us now go back to the source introduced in Example 1 and encode
source sequences of length N = 7 into code words of the (7,4) Hamming
code. Since spheres of unit radius around the code words cover the entire
space, each source sequence is reproduced with at most one error. For
p0 = 0.5, the average distortion per letter will be given by

        d̄ = (1/7)[(16/128) × 0 + (112/128) × 1] = 1/8

The rate is obtained from Eq. (7.3) with N = 7 and M = 16. Thus R = 0.571
bit/source letter.
The rate-distortion function for this problem is known to be (see
Problem 7.1)

        R(D) = H(p0) - H(D),    0 ≤ D ≤ p0        (7.8)

Thus, for p0 = 0.5, we have R(1/8) = 0.456 bit/source letter, which is somewhat better than that achieved by the (7,4) Hamming code.
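The two numbers compared above are easy to reproduce: the code rate is (1/7) log2 16 = 4/7, while Eq. (7.8) with p0 = 0.5 gives 1 - H(1/8). A quick numerical check:

    # Numerical check of the comparison above: code rate of the (7,4) Hamming scheme
    # versus the rate-distortion bound of Eq. (7.8) at D = 1/8 and p0 = 0.5.
    from math import log2

    def H(p):                         # binary entropy in bits
        return -p * log2(p) - (1 - p) * log2(1 - p)

    code_rate = log2(16) / 7          # Eq. (7.3): 4/7 ≈ 0.571 bit/source letter
    rd_bound  = H(0.5) - H(1 / 8)     # Eq. (7.8): 1 - H(1/8) ≈ 0.456 bit/source letter
    print(round(code_rate, 3), round(rd_bound, 3))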
Example 2
Consider a DMS with alphabet {a1, a2, a3} and probability distribution p1 = p2 =
p3 = 1/3. The reproducing alphabet is {b1, b2}, and the distortion measure is given in
Figure 7.3. The minimum distortion is achieved by assigning a1 to b1, a2 to b2, and a3 to
either b1 or b2. The average distortion is then given by dmin = 1. To obtain dmax, observe
that

        Σ_{k=1}^{K} p(ak) d(ak, bj) = { (1/3) × 1 + (1/3) × 2 + (1/3) × 1 = 4/3,   bj = b1
                                        (1/3) × 2 + (1/3) × 1 + (1/3) × 1 = 4/3,   bj = b2

and thus dmax = 4/3 according to Eq. (7.7). The rate-distortion function for this source
is known to be (see Problem 7.5)

        R(D) = (2/3)[1 - H(3(D - 1)/2)],    1 ≤ D ≤ 4/3

in bits per source symbol. Notice that R(dmin) = 2/3 is much less than H(1/3) = 0.9183,
which one would obtain by assigning a1 to b1, a2 to b2, and a3 to b1.

Figure 7.3


7.2 RATE-DISTORTION FUNCTION AND ITS PROPERTIES

In this section we define the rate-distortion function from a completely different point of view. It turns out, however, that the definition given here will result in the same function R(D) as described in Section 7.1. We continue to use
the notations d̄ and R(D) for the average distortion and the rate-distortion function, respectively, but the reader should bear in mind that these quantities are
now defined from a new perspective and may or may not have the previous
connotations.
Consider a discrete memoryless channel with input alphabet {a1, ..., aK},
input probability distribution {p1, ..., pK}, and output alphabet {b1, ..., bJ}.
Let an arbitrary set of transition probabilities [p(bj|ak)] be assigned to the
channel. We define the average distortion d̄ associated with this channel in the
following way:

        d̄ = Σ_{k=1}^{K} Σ_{j=1}^{J} p(ak) p(bj|ak) d(ak, bj)        (7.9)

A channel matrix is said to be D-admissible if the average distortion d̄ associated with that matrix is less than or equal to D.
With each channel matrix is associated an average mutual information
I(a, b) between the channel input and output. We define the rate-distortion
function R(D) as the minimum of I(a, b) over all D-admissible channel matrices; that is,

        R(D) = min over all D-admissible channels of {I(a, b)}        (7.10)

Since the set of D-admissible channel matrices is a closed and bounded set and
since I(a, b) is a real-valued, continuous, and bounded function of [p(bj|ak)],
the minimum indicated in Eq. (7.10) exists.
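For small alphabets the minimization in Eq. (7.10) can be explored directly. The sketch below performs a crude grid search over channel matrices for the binary source of Example 1 (p0 = 0.5, Hamming distortion) and should land near 1 - H(1/8) at D = 1/8, in agreement with Eq. (7.8). It is only an illustration of the definition, not an efficient method; in practice the Blahut-Arimoto algorithm is the standard tool for computing R(D).

    # Brute-force illustration of Eq. (7.10) for the binary source of Example 1.
    import numpy as np

    def mutual_information(p_a, Q):
        """I(a,b) in bits for input distribution p_a and channel Q[k][j] = p(b_j|a_k)."""
        p_ab = p_a[:, None] * Q
        p_b = p_ab.sum(axis=0)
        mask = p_ab > 0
        return float((p_ab[mask] * np.log2(p_ab[mask] / (p_a[:, None] * p_b[None, :])[mask])).sum())

    p_a = np.array([0.5, 0.5])
    d = np.array([[0.0, 1.0], [1.0, 0.0]])          # Hamming distortion
    D = 1 / 8

    best = np.inf
    grid = np.linspace(0, 1, 201)
    for q0 in grid:                                  # p(b = 1 | a = 0)
        for q1 in grid:                              # p(b = 0 | a = 1)
            Q = np.array([[1 - q0, q0], [q1, 1 - q1]])
            if (p_a[:, None] * Q * d).sum() <= D:    # D-admissible channel, Eq. (7.9)
                best = min(best, mutual_information(p_a, Q))

    print(best)   # close to 1 - H(1/8) ≈ 0.456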
It is not difficult to show that the minimum achievable d̄ in Eq. (7.9) is
the same as dmin in Eq. (7.4). Therefore, R(D) is only defined for D ≥ dmin.
Moreover, since I(a, b) ≤ H(a), we have

        R(dmin) ≤ - Σ_{k=1}^{K} pk log pk        (7.11)

similar to Eq. (7.5). R(D) is a nonincreasing function of D because the set of
D-admissible channels can only expand with increasing D. The following theorem establishes the smallest value of D at which R(D) becomes equal to zero.

Theorem 1.    Let dmax be defined as follows:

        dmax = min_j Σ_{k=1}^{K} p(ak) d(ak, bj)        (7.12)

Then R(D) = 0 for D ≥ dmax and R(D) > 0 otherwise.


Proof: Let j0 achieve the minimum in Eq. (7.12); then the deterministic
channel that connects all ak to bj0 has distortion equal to dmax and I(a, b) = 0.
Since zero is the smallest value that the average mutual information can assume, we conclude that R(dmax) = 0. Conversely, let R(D) = 0 for some value
of D. Then there must exist a D-admissible channel for which I(a, b) = 0. This
is equivalent to the statement that a and b are independent random variables;
consequently,

        d̄ = Σ_{k=1}^{K} Σ_{j=1}^{J} p(ak, bj) d(ak, bj)
          = Σ_{j=1}^{J} p(bj) Σ_{k=1}^{K} p(ak) d(ak, bj)
          ≥ Σ_{j=1}^{J} p(bj) dmax = dmax

But the channel is D-admissible; therefore, D ≥ dmax and the proof is thus
complete.

The next theorem shows that R(D) is a convex U function. Convexity
implies continuity at all points except possibly at D = dmin. [It has been shown,
however, that R(D) is also continuous at D = dmin.] The fact that R(D) is a nonincreasing convex function of D also implies that it is strictly decreasing.
Therefore, the general behavior of R(D) is similar to the function shown in
Figure 7.4.
Theorem 2.    R(D) is a convex U function of D.

Proof: Let λ ∈ [0, 1] and let D1, D2 belong to [dmin, dmax]. We will show that

        R[λD1 + (1 - λ)D2] ≤ λR(D1) + (1 - λ)R(D2)        (7.13)

Let [p1(bj|ak)] be the D1-admissible channel with I1(a, b) = R(D1) and, similarly, let [p2(bj|ak)] be the D2-admissible channel with I2(a, b) = R(D2). Consider a new channel matrix defined as

        [p(bj|ak)] = λ[p1(bj|ak)] + (1 - λ)[p2(bj|ak)]        (7.14)

Figure 7.4

It is easy to show that this new channel is λD1 + (1 - λ)D2-admissible. If the
mutual information between the input and output of the new channel is I(a, b),
then

        R[λD1 + (1 - λ)D2] ≤ I(a, b)        (7.15)

But the average mutual information is a convex U function of the transition
matrix (see Problem 7.2); therefore,

        I(a, b) ≤ λI1(a, b) + (1 - λ)I2(a, b)        (7.16)

Combining Eqs. (7.15) and (7.16) and replacing I1(a, b) and I2(a, b) with R(D1)
and R(D2), we obtain Eq. (7.13), thus completing the proof.
As was mentioned earlier, convexity implies that R(D) is a strictly
decreasing function of D. This, in turn, implies that the channel matrix
that achieves R(D) for a given D is not merely D-admissible, but its average
distortion d̄ is exactly equal to D; for if d̄ were less than D, then the rate-distortion function would have been constant on the interval [d̄, D].

7.3 FUNDAMENTAL THEOREM OF RATE-DISTORTION THEORY

In this section we establish the relationship between the rate-distortion function,
which was formally defined in Section 7.2, and its practical, more intuitive version, which was described in Section 7.1. In particular, we will show that for
every D in the interval [dmin, dmax] there exists a source code {V1, ..., VM}
whose average distortion d̄ is arbitrarily close to D and whose rate R is arbitrarily close to R(D). The concept of random coding, which was central to the
proof of the noisy channel theorem, will again be used here, and by showing
that the average distortion over all randomly selected codes of rate less than
R(D) + e is less than D + e, we will establish the existence of at least one
code with the desired properties.
Consider the discrete memoryless channel that achieves the minimum in
Eq. (7.10), and let its matrix be [p(bj|ak)]. From the discussions of Section 7.2
we know that

        Σ_{k=1}^{K} Σ_{j=1}^{J} p(ak) p(bj|ak) d(ak, bj) = D        (7.17)

        I(a, b) = Σ_{k=1}^{K} Σ_{j=1}^{J} p(ak) p(bj|ak) log [p(bj|ak) / p(bj)] = R(D)        (7.18)

If the source produces a sequence of length N and this sequence is transmitted
over the channel, the probability of the received sequence Vj = bj1 ... bjN will
be given by

        p(Vj) = Π_{n=1}^{N} p(b_jn)        (7.19)


This will be the probability distribution for random coding; in other words, the
M code words will be selected independently and according to the probability
distribution in Eq. (7.19).
Now consider a source sequence Wi = ai1 ai2 ... aiN and define the set Si
as the set of all sequences Vj such that d(Wi, Vj) < D + e. If Si happens to
contain at least one of the code words V1, ..., VM, then the distortion associated with Wi will be less than D + e. We are therefore interested in the probability that a code word, chosen randomly according to the probability
distribution of Eq. (7.19), belongs to Si. We call this probability P(Si):

        P(Si) = Σ_{Vj ∈ Si} p(Vj)        (7.20)

We also define the set Ti as the set of all sequences Vj such that
(1/N) log2[p(Vj|Wi)/p(Vj)] < R(D) + e. Thus

        p(Vj) > 2^{-N[R(D)+e]} p(Vj|Wi)    for every Vj in Ti        (7.21)

Now

        P(Si) ≥ P(Si ∩ Ti) = Σ_{Vj ∈ Si∩Ti} p(Vj) > 2^{-N[R(D)+e]} Σ_{Vj ∈ Si∩Ti} p(Vj|Wi)        (7.22)

and we will show that Σ_{Vj ∈ Si∩Ti} p(Vj|Wi) is arbitrarily close to 1, provided that N
is sufficiently large. In fact, this is a consequence of the law of large numbers.
If Wi is transmitted and Vj received, then d(Wi, Vj) = (1/N) Σ_{n=1}^{N} d(a_in, b_jn) is
the time average of the r.v. d(ak, bj) and must therefore be close to the statistical average D of the r.v. [see Eq. (7.17)] with probability close to 1. Consequently, when Wi is transmitted, the probability of Vj belonging to Si will be
close to 1. Similarly, the probability of Vj belonging to Ti will be close to 1, because (1/N) log[p(Vj|Wi)/p(Vj)] = (1/N) Σ_{n=1}^{N} log[p(b_jn|a_in)/p(b_jn)] is the
time average of a random variable whose statistical average is R(D) according
to Eq. (7.18). Thus we conclude that, when Wi is transmitted, the received sequence Vj will be in Si ∩ Ti with probability close to unity. Therefore,

        Σ_{Vj ∈ Si∩Ti} p(Vj|Wi) → 1    as N → ∞        (7.23)
If the code words are randomly selected and at least one of them belongs
to Si, the average distortion for Wi will be less than D + e. The probability
that none of the M code words belongs to Si is given by

        [1 - P(Si)]^M < [1 - 2^{-N[R(D)+e]} Σ_{Vj ∈ Si∩Ti} p(Vj|Wi)]^M        (7.24)

where the upper bound is a direct consequence of Eq. (7.22). We can further
upper bound Eq. (7.24) by using the following inequality:

        (1 - αβ)^M ≤ 1 - β + e^{-Mα},    0 ≤ α ≤ 1,  0 ≤ β ≤ 1        (7.25)

Figure 7.5

which is obtained from

        (1 - αβ)^M = e^{M ln(1 - αβ)} ≤ e^{-Mαβ} ≤ 1 - β + e^{-Mα}

The last inequality is obtained by examining Figure 7.5.


Using Eq . (7.25) in Eq . (7.24), we will have
[1 -

P(Si)]M

<

I -

2:

p(Vjl

W i)

e - M' 2- NIRlD)+, ]

(7. 26)

Vj Es ,n T,

Now if M = 2N [RlD)+ 2e] , the last term in Eq. (7.26) can be made extremely small
for sufficiently large N . Similarly, the second term can be made arbitrarily
close to 1 according to Eq. (7.23) . Consequently, the probability of having no
code words in S; is arbitrarily small. In such situation s where W i has to be encoded into some Vj outside of S i, the distortion, although larger than D + e ,
will still remain below some finite value . [The maximum possib le distortion is
given by max k.j{d(a b bj )} .] Therefore , the contribution to the average distortion
by codes that have no code words in S; can be made less than e. From the foregoing discussion , we conclude that for M = 2N [R(D ) + 2eJ , or equi valently for
R = R(D) + 2e , there exists at least one code whose average distortion is less
than D + 2e. The proof is thus complete.
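The random-coding argument can be mimicked experimentally. In the Monte Carlo sketch below, code words are drawn i.i.d. from the output marginal of the R(D)-achieving channel for the binary source of Example 1 (that marginal is uniform there), and the average distortion of randomly selected codes of rate R(D) + e is measured. All parameter values are illustrative; at this modest block length the measured distortion sits somewhat above D, and the gap shrinks as N grows.

    # Monte Carlo sketch of the random-coding argument for the binary source of
    # Example 1 (p0 = 0.5, Hamming distortion).  Parameters are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def H(p):
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    D, e, N, trials = 1 / 8, 0.05, 30, 50
    M = int(2 ** (N * (1 - H(D) + e)))            # rate R(D) + e, with R(D) = 1 - H(D) here

    total = 0.0
    for _ in range(trials):
        source = rng.integers(0, 2, N)            # one source sequence of length N
        code = rng.integers(0, 2, (M, N))         # a randomly selected code of M words
        total += (code != source).sum(axis=1).min() / N   # distortion of the closest code word
    print(total / trials)   # about 0.14 here, above D = 0.125; the gap closes as N grows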

7.4 CONVERSE THEOREM

The converse to the fundamental theorem states that no D-admissible code can
have a rate less than R(D). To prove this, let {V1, ..., VM} be a D-admissible
code and consider a deterministic channel that assigns to each Wi the closest Vj
in the code. Obviously, H(V|W) = 0, and consequently

        I(W, V) = H(V) - H(V|W) = H(V) ≤ log M        (7.27)


On the other hand,

        I(W, V) = H(W) - H(W|V) = Σ_{n=1}^{N} H(a^(n)) - H(W|V)        (7.28)

Now

        H(W|V) = H(a^(1) ... a^(N) | b^(1) ... b^(N))
               ≤ Σ_{n=1}^{N} H(a^(n) | b^(1) ... b^(N))
               ≤ Σ_{n=1}^{N} H(a^(n) | b^(n))        (7.29)

Therefore,

        I(W, V) ≥ Σ_{n=1}^{N} I(a^(n), b^(n))        (7.30)

Combining Eqs. (7.27) and (7.30), we obtain

        R = (1/N) log M ≥ (1/N) Σ_{n=1}^{N} I(a^(n), b^(n))        (7.31)

Let d̄n be the average distortion with which a^(n) is reproduced; then

        R(d̄n) ≤ I(a^(n), b^(n))        (7.32)

and consequently

        R ≥ (1/N) Σ_{n=1}^{N} R(d̄n) ≥ R((1/N) Σ_{n=1}^{N} d̄n)        (7.33)

The last inequality is a consequence of the fact that R(D) is a convex U function. Now the average distortion d̄ is equal to (1/N) Σ_{n=1}^{N} d̄n, and since the code
is D-admissible by definition, we have d̄ ≤ D. From Eq. (7.33) and the fact
that R(D) is a decreasing function of D, we conclude that R ≥ R(D). The proof
is thus complete.

PROBLEMS
7.1. A discrete memoryless source has alphabet {0, 1} and probability distribution
{p0, 1 - p0}. The reproducing alphabet is also {0, 1}, and the distortion measure is
d(0, 0) = d(1, 1) = 0 and d(0, 1) = d(1, 0) = 1. Show that the rate-distortion
function is given by R(D) = H(p0) - H(D) for 0 ≤ D ≤ p0.
7.2. Prove that the average mutual information I(x, y) between the input and output of a
DMC is convex U in the transition probability matrix [P(yj|xk)].
7.3. A source produces independent, equiprobable binary digits. The reproducing alphabet is ternary, and the distortion measure is shown in Figure P7.3 (omitted transitions in the diagram correspond to infinite distortion).
(a) Find and sketch the rate-distortion function R(D).
(b) Find a simple encoding scheme that achieves any desired rate R at the distortion level D determined from R = R(D).

Figure P7.3

7.4. A source produces independent equiprobable letters from an alphabet of five letters. The distortion measure is as shown in Figure P7.4 (omitted transitions correspond to infinite distortion and included transitions to zero distortion). Find the
rate-distortion function for this source and distortion measure.

Figure P7.4

7.5. For the source of Example 2 in Section 7.1:
(a) Calculate and plot the rate-distortion function R(D).
(b) Let the source code have length N = 2 and consist of two code words b1b2
and b2b1. Find the average distortion d̄ and the rate R for this code. Compare R
with R(d̄) obtained in part (a).
7.6. Consider a DMS with alphabet {a1, ..., aK} and uniform distribution p1 = p2 =
... = 1/K. The reproducing alphabet is also {a1, ..., aK} and the distortion measure is given by

        d(ak, aj) = { 0,   k = j
                      1,   k ≠ j

What is the rate-distortion function R(D) for this source? [Hint: You may use
Fano's inequality to bound R(D).]

8
LINEAR ERROR-CORRECTING CODES FOR THE BINARY SYMMETRIC CHANNEL

8.1  Hamming Distance and Error-correcting Power
8.2  Hamming Upper Bound on the Number of Code Words
8.3  Parity Check Codes
8.4  Properties of the Parity Check Matrix
8.5  Construction of Parity Check Codes from the Matrix
8.6  Equivalence of Group Codes and Parity Check Codes
8.7  Decoding of Parity Check Codes: Decoding Sets
8.8  Error-correcting Power of Parity Check Codes
8.9  Hamming Lower Bound on the Number of Check Digits
8.10 Varsharmov-Gilbert Upper Bound on the Number of Check Digits
8.11 Convolutional Encoder
8.12 State and Ladder Diagrams
8.13 Maximum Likelihood Decoding and the Viterbi Algorithm
     Problems

Introduction    In this chapter we introduce some elementary concepts
from the theory of error-correcting codes. We do not intend to get involved in a
detailed study of the techniques that have been developed over the past four
decades for encoding and decoding of information. Rather, we want to introduce the reader to the practical problems involved in achieving the promise of
information theory. Shannon's idea of random coding has been instrumental in
proving the existence of good codes, but it does not help us find codes that can
be implemented on systems with speed and memory constraints. The problem
is that a randomly selected code, which with a high probability is a good code,
lacks any useful structure that can be utilized for encoding and decoding purposes. Consequently, the brute-force method of looking up a table is about the
best technique for its implementation. But to achieve significant improvements
in performance, one must use long codes; the resulting tables will then be excessively large and the search times impractically long.
By imposing various constraints on the code, researchers have been able
to develop highly structured and yet powerful error-correcting codes. The most
significant of these constraints is the constraint of linearity, which will be discussed throughout this chapter in the context of codes for the binary symmetric
channel. More general treatments of the subject, as well as detailed descriptions
of various error-correcting codes, can be found in the extensive literature of
coding theory.
It has been shown that linear codes fulfill the promise of Shannon's theory; that is, the noisy channel theorem can be proved even when the randomly
selected code words are confined to the class of linear codes. The price one
generally pays for the convenience of linearity is the longer code words one has
to use for a given error-correcting power. Many classes of linear error-correcting codes have been found to date, but unfortunately a general technique of designing good codes for arbitrary channels is yet to be discovered.
Our discussion of linear block codes for the binary symmetric channel
begins with the introduction of linear block codes, also known as parity check
codes. We discuss methods for encoding and decoding arbitrary parity check
codes and study their error-correcting power. Later in the chapter, we introduce
the idea of convolutional coding and describe an algorithm, known as the
Viterbi algorithm, for decoding linear convolutional codes. Both block and
convolutional codes are widely used in practice, and the decision as to which
code is more appropriate in a given application depends on the specific problem
at hand, as well as the cost/performance requirements.

8.1 HAMMING DISTANCE AND ERROR-CORRECTING POWER

Consider a binary symmetric channel with crossover probability e < 0.5. Let a
code of block length N consisting of code words W1, ..., WM be chosen for use
over this channel. We saw in Section 6.5 that if a sequence Vj is received at the
output the maximum likelihood receiver decodes it as Wi, where Wi is the
closest code word to Vj in the Hamming sense. For example, if the code words
are W1 = 00000, W2 = 10011, W3 = 11100, and W4 = 01111, then Vj = 01011
is decoded as W4 since D(W4, Vj) = 1 and D(Wi, Vj) > 1 for i ≠ 4. Similarly,
Vj = 00110 may be decoded as either W1 or W4 since D(W1, Vj) = D(W4, Vj) =
2 and D(W2, Vj) = D(W3, Vj) = 3.*

*From R. Ash, Information Theory. New York: Interscience, 1965, p. 88. Reprinted by
permission of John Wiley & Sons, Inc.


The Hamming distance has the characteristic properties of a distance
function. For arbitrary sequences V1, V2, V3, the properties are as follows:
(i) D(V1, V2) ≥ 0 with equality iff V1 = V2.
(ii) D(V1, V2) = D(V2, V1).
(iii) D(V1, V2) ≤ D(V1, V3) + D(V3, V2)    (triangle inequality).
Another useful property is the following: D(V1, V2) = D(V1 ⊕ V3,
V2 ⊕ V3), where ⊕ is modulo 2 addition. For example, if V1 = 0111, V2 =
1010, and V3 = 1100, then V1 ⊕ V3 = 1011 and V2 ⊕ V3 = 0110. It is easy to
check that D(0111, 1010) = D(1011, 0110) = 3. The proof of the preceding relations is simple and the reader can take them up as an exercise.
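The distance function and the minimum-distance decoder translate directly into a few lines of code. The sketch below reproduces the decisions made in the five-bit example quoted earlier in this section.

    # Hamming distance and minimum-distance decoding, on the five-bit code words above.
    def hamming(u, v):
        """Number of positions in which the binary strings u and v differ."""
        return sum(a != b for a, b in zip(u, v))

    def decode(received, codewords):
        """Return the code word(s) closest to the received sequence."""
        dmin = min(hamming(received, w) for w in codewords)
        return [w for w in codewords if hamming(received, w) == dmin]

    W = ['00000', '10011', '11100', '01111']
    print(decode('01011', W))    # ['01111']            -> decoded as W4
    print(decode('00110', W))    # ['00000', '01111']   -> a tie between W1 and W4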
The following theorem relates the error-correcting power of a code to the
minimum Hamming distance between its code words.
Theorem 1.    Let W1, ..., WM be binary code words of length N, and assume that a positive integer v exists such that, for all i ≠ j,

        D(Wi, Wj) ≥ 2v + 1        (8.1)

Then the code can correct all errors of magnitude ≤ v. Conversely, any code
with this error-correcting power must satisfy Eq. (8.1).
Proof. Assume some code word, say Wi0, is transmitted. If the magnitude of error is ≤ v, the received sequence Vj satisfies the relation
D(Wi0, Vj) ≤ v. By the triangle inequality, we then have, for all i ≠ i0,

        D(Wi0, Wi) ≤ D(Wi0, Vj) + D(Wi, Vj)

Consequently,

        D(Wi, Vj) ≥ D(Wi0, Wi) - D(Wi0, Vj) ≥ 2v + 1 - v = v + 1

Thus Wi0 is closest to Vj and as a result Vj is decoded as Wi0. All errors of magnitude ≤ v are thus correctable.
Conversely, assume that two code words, say W1 and W2, have a distance
less than 2v + 1 [i.e., D(W1, W2) ≤ 2v]. Upon transmission of W1, an error
pattern of magnitude v that only affects the positions where W1 and W2 differ
will result in a received sequence Vj where D(W2, Vj) ≤ v. Therefore, in this
case at least, an error of magnitude v cannot be corrected. The proof is now
complete.
The next theorem is similar to the previous one, and its proof is left as an
exercise.
Theorem 2.    Let W1, ..., WM be binary code words of length N, and assume that a positive integer v exists such that, for all i ≠ j,

        D(Wi, Wj) ≥ 2v        (8.2)


Then the code can correct all errors of magnitude ≤ v - 1 and detect all errors of magnitude v. Conversely, any code with this error-correcting/detecting
capability must satisfy Eq. (8.2).
8.2 HAMMING UPPER BOUND ON THE NUMBER OF CODE WORDS

The following theorem is due to Hamming.

Theorem.    If the code {W1, ..., WM} consisting of binary sequences of
length N corrects all errors of magnitude v and less, then the number of code
words M must satisfy the following inequality:

        M ≤ 2^N / Σ_{k=0}^{v} C(N, k)        (8.3)

Proof. For each code word Wi, define a set Bi as the set of all sequences
Vj that differ from Wi in at most v places. The total number of sequences in Bi is
then given by

        Σ_{k=0}^{v} C(N, k)

Since the code is capable of correcting all errors of magnitude v and less, the
sets B1, ..., BM corresponding to the code words W1, ..., WM must be disjoint.
The reason is that if Bi and Bj have a common element V then D(Wi, Wj) ≤
D(Wi, V) + D(Wj, V) ≤ v + v = 2v, contradicting Theorem 1 of the previous
section. The total number of elements in the sets B1, ..., BM must now be less
than or equal to the total number of sequences of length N; that is,

        M Σ_{k=0}^{v} C(N, k) ≤ 2^N

which is equivalent to Eq. (8.3). The proof is thus complete.


The Hamming upper bound in Eq . (8.3) is a necessary but not sufficient
condition. For instance, if N = 4 and v = I , then M = 3 satisfies the Hamming condition . However, no single-error-correcting code of length N = 4 and
size M = 3 exists. The reason is that the distance between all pairs of code
words in such a code must be greater than or equal to 2v + I = 3, and with
sequences of length 4 , three such code words are impossible to find. The maximum possible value of M for the preceding values of N and u is 2.
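Equation (8.3) is easy to evaluate. The sketch below computes the bound and reproduces the N = 4, v = 1 discussion above, together with the (7,4) Hamming code case, where the bound is met with equality.

    # Eq. (8.3): the Hamming upper bound on the number of code words.
    from math import comb

    def hamming_bound(N, v):
        return 2 ** N // sum(comb(N, k) for k in range(v + 1))

    print(hamming_bound(4, 1))    # 3: the bound allows M = 3, although only M = 2 is achievable
    print(hamming_bound(7, 1))    # 16: met with equality by the (7,4) Hamming code of Chapter 7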

8.3 PARITY CHECK CODES

So far in our discussions of coding we have not addressed the practical problems of encoding and decoding . We have assumed that a "table" or "code


book" exists that at the transmitter associates a code word with every source
sequence, and at the receiver associates every received sequence with a code
word. In the case of minimum-distance decoding, for instance, every received
sequence is associated with the closest code word in the Hamming sense. In
general, when the code words are long, the storage and processing of data
become excessively prohibitive. We now introduce the class of parity check
codes that, thanks to their structure, have substantially improved the efficiency
of encoding and decoding procedures.
Assuming that r1, r2, ..., rN represent the digits of a binary sequence of
length N, a linear (parity check) code is defined as one whose code words are
solutions of the following set of simultaneous linear equations:

        a11 r1 + a12 r2 + ... + a1N rN = 0
        a21 r1 + a22 r2 + ... + a2N rN = 0
        .......................................
        am1 r1 + am2 r2 + ... + amN rN = 0        (mod 2)        (8.4)

The coefficients aij are given binary numbers (either 0 or 1), and the additions
and multiplications are all modulo 2 operations. All solutions of Eqs. (8.4)
must be included in the code.

Example
Consider the set of simultaneous linear equations*

        r1           + r4 + r5        = 0
             r2 + r3 + r4        + r6 = 0        (mod 2)        (8.5)
        r1      + r3 + r4        + r6 = 0

Any sequence of binary digits r1 r2 ... r6 that satisfies these equations will be an acceptable code word. For instance, W1 = 001001 and W2 = 111010 are acceptable code
words, whereas the sequence 101101 is not.
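Checking whether a given sequence is a code word is a single matrix computation modulo 2. The sketch below verifies the three sequences mentioned in the example, using the matrix form of Eqs. (8.5) [cf. Eq. (8.7) below, as reconstructed here].

    # Code-word membership for the parity check code of Eqs. (8.5):
    # a sequence r is a code word iff A r^T = 0 (mod 2).
    import numpy as np

    A = np.array([[1, 0, 0, 1, 1, 0],      # matrix form of Eqs. (8.5), cf. Eq. (8.7)
                  [0, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 0, 1]])

    def is_codeword(bits):
        r = np.array([int(b) for b in bits])
        return not np.any(A.dot(r) % 2)

    for seq in ('001001', '111010', '101101'):
        print(seq, is_codeword(seq))       # True, True, False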

In Eqs. (8.4), if an equation is a linear combination of others, it may be
deleted without affecting the solutions. The reason is that any solution that satisfies a certain set of equations will also satisfy linear combinations of those
equations. We thus assume, without loss of generality, that the equations that
describe a parity check code are linearly independent.
An immediate consequence of the preceding definition is that the sequence with r1 = r2 = ... = rN = 0 always satisfies the equations. Therefore,
a parity check code will always contain the trivial code word W = 000 ... 0.

*From R. Ash, Information Theory. New York: Interscience, 1965, p. 164. Reprinted by
permission of John Wiley & Sons, Inc.

Using the notation of linear algebra, Eqs. (8.4) can be written as

        ( a11  a12  ...  a1N ) ( r1 )   ( 0 )
        ( a21  a22  ...  a2N ) ( r2 ) = ( 0 )        (mod 2)        (8.6)
        ( ...                ) ( .. )   ( . )
        ( am1  am2  ...  amN ) ( rN )   ( 0 )

The m by N matrix A = [aij] is called the parity check matrix; it identifies
the code uniquely, although it is not the only matrix that could give rise to the
same code. Properties of the parity check matrix are discussed in the following section.

8.4 PROPERTIES OF THE PARITY CHECK MATRIX


Some important properties of the parity check matrix A that will be useful later
in the chapter are as follows:
(i) No row of A is entirely zero. If such a row exists, it could be eliminated without affecting the code.
(ii) No row of A is a linear combination of other rows. This is a consequence of the linear independence of Eqs. (8.4) discussed in the previous section. The rank of A is equal to m, the number of linearly
independent rows of the matrix. According to a well-known theorem
of linear algebra, the number of linearly independent columns of A
is also equal to m. A proof of this theorem will be given later in
this section.
(iii) The rows of A could be arbitrarily rearranged without affecting the
code. This should be obvious considering the fact that the solution of
a set of simultaneous equations is independent of the order in which
the equations are arranged.
(iv) Any pair of rows could be combined (modulo 2 addition) and the
resulting vector could replace either one of the rows. This will leave
the code intact, since every solution of the new set of equations satisfies the old set, and vice versa. As an example, consider the matrix
corresponding to the set of Eqs. (8.5):

        A = ( 1 0 0 1 1 0 )
            ( 0 1 1 1 0 1 )        (8.7)
            ( 1 0 1 1 0 1 )


Upon adding the first row to the third row, we obtain

        A' = ( 1 0 0 1 1 0 )
             ( 0 1 1 1 0 1 )        (8.8)
             ( 0 0 1 0 1 1 )

Although A and A' are different in appearance, they represent the
same code. This could be checked by finding all the solutions of the
equations AW^T = 0 and A'W^T = 0 and verifying that the two sets of
solutions are indeed the same. (W is a 1 × N row vector and T
denotes the transpose.)
(v) A solution W = (r1, r2, ..., rN) represents a linear dependence
among the columns of the matrix A. For example, if the nonzero elements of W are r1, r2, r3, and r5, then the sum of columns C1, C2, C3,
and C5 is equal to zero. On the other hand, if a group of columns is
linearly independent, then no solution can express a linear dependence among them. As an example, consider the matrix A in
Eq. (8.7). Columns 1, 2, and 3 of this matrix are linearly independent; consequently, none of the sequences 100000, 010000, 001000,
110000, 101000, 011000, 111000 are solutions of AW^T = 0.
(vi) By definition, variants of a matrix are those obtained by rearranging
the rows [property (iii)] and/or combining the rows [property (iv)] in
any order any number of times. The code that corresponds to the
matrix A is the same as the code corresponding to any of its variants.
Since the rows of A are linearly independent, we can easily show
that the rows of any variant of A are also linearly independent. Furthermore, it is possible to show, using property (v), that any linear
dependence or independence of the columns of A remains valid for
all its variants.
(vii) Triangular variants of the parity check matrix are useful in explaining several properties of the codes. A triangular variant of A can be
constructed as follows:
(a) Start with the first column. If the column has all 0's, skip it. If
the first element of the first column is not a 1, rearrange the
rows such that the first element becomes a 1.
(b) If the first column contains 1's other than the first element, make
them equal to 0 by adding the first row to the corresponding
rows.
(c) Eliminate the first row and the first column from consideration.
Repeat steps (a) and (b) for the remaining (m - 1) × (N - 1)
matrix.

Example
The following sequence shows the process of constructing a triangular variant. Starting
from the parity check matrix A: (1) move the second row to the top to obtain A1;
(2) add the first row to the third row to obtain A2; (3) exchange the second and third
rows to obtain A3; (4) add the third row to the fourth row to obtain A4.        (8.9)
A4 is now a triangular matrix. Its lower left triangle is all 0's, and its columns C1, C2,
C3, and C5 are linearly independent. This is due to the fact that these columns have the
"diagonal" elements of the triangular matrix. If, for some set of numbers a, b, c, d, we
happen to have

        a C1 + b C2 + c C3 + d C5 = 0        (8.10)

then d = 0 because of the last row. Consequently, c = 0 from the third row, b = 0
from the second row, and, eventually, a = 0 from the first row. Since Eq. (8.10) is only
valid for a = b = c = d = 0, we conclude that the preceding columns of the matrix


are linearly independent. This is obviously quite general and applies to triangular variants of any matrix. Also observe that the remaining columns C4 and C6 of A4 can be
obtained by a linear combination of C1, C2, C3, and C5. To obtain C6, for example,
we write

        a C1 + b C2 + c C3 + d C5 = C6        (8.11)

and find that d = 1, c = 1, b = 0, and a = 1; that is, C6 is the sum of C1, C3, and C5.
Again, this is quite general and could be applied to triangular variants of any matrix.
Using these results together with property (vi), we conclude that the number of independent columns of the matrix A is exactly equal to m, the number of (independent) rows of
the matrix.

8.5 CONSTRUCTION OF PARITY CHECK CODES FROM THE MATRIX

The m × N matrix A has m independent columns, say, Cj1, Cj2, ..., Cjm. All
other columns can be expressed as linear combinations of the independent
columns. To determine the solutions W = (r1, r2, ..., rN) of the equation
AW^T = 0, let us define the elements rj1, rj2, ..., rjm as check digits and the
remaining r's as information digits. We can then choose the information digits
arbitrarily and solve the m independent equations for the m remaining unknowns (i.e., the m check digits). In general there are N - m information
digits and hence 2^(N-m) code words. The rate of transfer of information is thus
given by R = 1 - (m/N) bits/symbol.
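For small N the counting argument can be verified by brute force: enumerate all 2^N binary sequences and keep the solutions of AW^T = 0. The sketch below does this for the running example matrix as reconstructed here and confirms the 2^(N-m) count.

    # Brute-force enumeration of a parity check code: keep every length-N sequence W
    # with A W^T = 0 (mod 2).  For an m x N matrix of rank m this yields 2^(N-m) words.
    from itertools import product
    import numpy as np

    def codewords(A):
        m, N = A.shape
        words = [w for w in product((0, 1), repeat=N)
                 if not np.any(A.dot(np.array(w)) % 2)]
        assert len(words) == 2 ** (N - m)          # holds when the m rows are independent
        return words

    A = np.array([[1, 0, 0, 1, 1, 0],              # the running example matrix, cf. Eq. (8.7)
                  [0, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 0, 1]])
    for w in codewords(A):
        print(''.join(map(str, w)))                # 2^(6-3) = 8 code words, rate 1 - 3/6 = 1/2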
Example

Consider the 3 × 6 matrix

        A = ( 1 0 0 1 1 0 )
            ( 0 1 1 1 0 1 )        (8.12)
            ( 1 0 1 1 0 1 )

Since columns C1, C2, and C3 are linearly independent, we choose r1, r2, and r3 as check
digits. r4, r5, and r6 will then be information digits and can be chosen arbitrarily. There
are eight possible choices for the sequence r4 r5 r6, and therefore there are eight code
words altogether, as follows:*

*From R. Ash, Information Theory. New York: Interscience, 1965, p. 93. Reprinted by
permission of John Wiley & Sons, Inc.


              r1   r2   r3   r4   r5   r6
        W1     0    0    0    0    0    0
        W2     0    0    1    0    0    1
        W3     1    1    1    0    1    0
        W4     1    1    0    0    1    1
        W5     1    1    0    1    0    0
        W6     1    1    1    1    0    1
        W7     0    0    1    1    1    0
        W8     0    0    0    1    1    1

8.6 EQUIVALENCE OF GROUP CODES AND PARITY CHECK CODES

A binary group code is a code {W1, ..., WM} whose code words are binary
sequences of equal length that form an algebraic group under modulo 2 addition.

Definition. A group is a set G with an operation, usually referred to as
addition, defined on the set. A group must have the following properties:
(i) Closure: If α and β belong to G, then α + β must belong to G.
(ii) Associativity: For all α, β, γ belonging to G, we must have
(α + β) + γ = α + (β + γ).
(iii) Identity element: There exists an element θ in G such that, for all α
in G, α + θ = θ + α = α.
(iv) Inverse: For each α in G, there exists an element α' in G such that
α + α' = α' + α = θ. α' is called the inverse of α.

If the group has the additional property that α + β = β + α for all α, β in G,
then the group is said to be Abelian or commutative.

The set of integers (positive, zero, and negative) under ordinary addition
forms an Abelian group, as does the set of all binary sequences of fixed length
under modulo 2 addition.
The following theorems establish the equivalence of group codes and
parity check codes.

Theorem 1.    The set S of code words of a parity check code is an
Abelian group under modulo 2 addition.


Proof. Let A be the matrix of the code. If Wi and Wj belong to S, then
AWi^T = AWj^T = 0. Consequently, A(Wi^T ⊕ Wj^T) = AWi^T ⊕ AWj^T = 0; that is,
the modulo 2 sum of Wi and Wj is itself a code word and thus belongs to S
(closure). Associativity and commutativity follow from the definition of
modulo 2 addition. The word W = 000 ... 0 is the identity element, and each
element Wi is its own inverse (i.e., Wi ⊕ Wi = 0). S is therefore an Abelian
group.

Theorem 2.    The set S of M binary sequences of length N that forms
a group under modulo 2 addition is a parity check code; that is, there exists a
parity check matrix for S.
Proof. Arrange the code words in an M × N matrix where each row is a
code word of the group code S. Assuming that the matrix has rank K, there will
be K independent rows (columns). All other rows (columns) can be expressed
as a linear combination of the K independent rows (columns). For simplicity of
notation, we assume that the independent rows are W1, W2, ..., WK. Since S is
a group code, every linear combination of its code words must be a code word;
that is, W = λ1W1 ⊕ λ2W2 ⊕ ... ⊕ λKWK, where the λi's are binary constants,
must be a row of the matrix. On the other hand, every row of the matrix must
be expressible as a linear combination of W1, ..., WK. Since every distinct
choice of (λ1, ..., λK) results in a distinct code word (a consequence of linear
independence of W1, ..., WK), we conclude that the total number of code
words M is 2^K.
To prove that S is a parity check code, we must find the parity check
Eqs. (8.4). If we find a set of equations satisfied by W1, ..., WK, then all other
code words will also satisfy the equations because they are linear combinations
of the first K code words. Now consider the K × N matrix whose rows are
W1, ..., WK. This is usually known as the generator matrix. Since there are K
independent rows, there are K independent columns. The remaining N - K
columns can be expressed as linear combinations of the K independent columns. This will give us N - K equations that are simultaneously satisfied by
the columns or, equivalently, by r1, r2, ..., rN. These equations are linearly independent since each contains an element rj that does not appear in any other
equation. Since the total number of solutions to these equations is 2^(N-m) =
2^(N-(N-K)) = 2^K and there are 2^K code words in S, the equations cannot have a
solution outside S. Therefore, the parity check code obtained from the equations and the group code S are equivalent.
Example
Consider the group code S whose code words are as follows (you can check that the
code words indeed form a group).


r3 r4 r5 r,

rl

rz

WI
Wz

I
I

W3
W4

0
0

0
0

W5
W6
W7

w,

0
0

0
0
0

0
0

0
0


There are eight code words; therefore, there must be three independent rows. It can
easily be checked that W1, W2, and W3 are linearly independent. Therefore, we must find
a set of linear equations for r1, r2, ..., r6 that is satisfied by W1, W2, and W3. Observe
that in the generating matrix
001
o I 0
I

columns 4, 5, and 6 are independent. Therefore, columns 1, 2, and 3 can be expressed
in terms of the independent columns as follows:
rl

+
rz +
r3 +
r4

+
r4 +
r4 +
rs

The parity check matrix is thus given by


A =

(~ ~ ~
001

r = 0
r5 = 0
r = 0

1I)
I

8.7 DECODING OF PARITY CHECK CODES: DECODING SETS
Consider the set S of the code words W1, W2, ..., WM of a parity check code
with matrix A. The matrix is m × N, and therefore all code words have length
N with M = 2^(N-m). Consider an arbitrary binary sequence Z of length N and
form the set Z ⊕ S = {Z ⊕ W1, Z ⊕ W2, ..., Z ⊕ WM}. The set Z ⊕ S has
the following properties:
(i) Z itself belongs to Z ⊕ S. The reason is that the trivial code word
W = 000 ... 0 is always a member of S and Z ⊕ 000 ... 0 = Z.


(ii) If Z belongs to S (i.e., if Z is a code word itself), then Z ⊕ S = S.
The reason is that the modulo 2 sum of any two code words must be
a code word. Moreover, Z ⊕ Wi ≠ Z ⊕ Wj if Wi ≠ Wj; therefore,
no two elements of Z ⊕ S can be identical.
(iii) If V belongs to Z ⊕ S, then V ⊕ S = Z ⊕ S. The reason is that
V = Z ⊕ Wi for some code word Wi. Therefore, V ⊕ S = Z ⊕
Wi ⊕ S. But Wi ⊕ S = S; therefore, V ⊕ S = Z ⊕ S.
(iv) If V does not belong to Z ⊕ S, then V ⊕ S and Z ⊕ S are disjoint.
The reason is that if the two sets were not disjoint there would
be a sequence U belonging to both. Then U ⊕ S = Z ⊕ S and
U ⊕ S = V ⊕ S from property (iii). Therefore, Z ⊕ S and V ⊕ S
must be equal. But V belongs to V ⊕ S from property (i) and does
not belong to Z ⊕ S by assumption. This is a contradiction, and thus
Z ⊕ S and V ⊕ S must be disjoint.
For reasons to become clear later, the set Z ⊕ S is called a decoding set.
Using the terminology of group theory, Z ⊕ S is sometimes referred to as the
coset of Z with respect to S. For a given code S, one can construct several distinct cosets. Such cosets are mutually disjoint and, since every binary sequence
of length N must belong to one coset or another, the total number of cosets is
equal to 2^N / M = 2^N / 2^(N-m) = 2^m, where M is the number of elements of each
coset and m is the rank of the parity check matrix.
The minimum-distance decoding scheme for parity check codes is intimately related to the concept of cosets. Consider a sequence Vj of length N at
the receiving end of the channel. To compare Vj with the code word Wi, we can
form the error-pattern vector Vj ⊕ Wi. The error-pattern vector is zero at positions where Vj and Wi are the same and is 1 where Vj and Wi are different. For
example, if Vj = 01011 and Wi = 10010, then the error-pattern vector would
be Vj ⊕ Wi = 11001. The number of 1's in the error-pattern vector is the distance between Vj and Wi. In the preceding example, the distance between Vj
and Wi is 3. The minimum-distance decoding can thus be described as follows:
Given a received sequence Vj, (1) form all the sequences Vj ⊕ Wi (i.e., form
the coset Vj ⊕ S). (2) Choose the member of this coset with the least number
of 1's; this member is known as the coset leader. (3) Announce the transmitted
sequence as Z ⊕ Vj, where Z is the coset leader.
Obviously, the coset leader is the error-pattern vector that the minimum-distance decoder associates with Vj. Remember that all members of the coset
Vj ⊕ S have the same coset as Vj (property iii). Therefore, all members of a
coset are associated with the same error-pattern vector by the minimum-distance decoder. It is now clear why the cosets are called decoding sets. The
decoder does not have to know everything about the received sequence Vj; all
it needs is the knowledge of the coset (or decoding set) to which Vj belongs.
Since each coset is associated with a unique error-pattern vector (coset leader),


the decoder can then add the proper error pattern to Vj to obtain the transmitted
code word.
Example
Consider the following parity check matrix:

        A = ( 1 0 1 0 )        (8.13)
            ( 0 1 1 1 )

The code words associated with A are W1 = 0000, W2 = 0101, W3 = 1110, and
W4 = 1011. All cosets of this code are listed next and the coset leaders are underlined.

        S  = {0000, 0101, 1110, 1011}
        S1 = {0001, 0100, 1111, 1010}
        S2 = {1001, 1100, 0111, 0010}
        S3 = {0110, 0011, 1000, 1101}

Notice that in the case of S1 we have a choice between 0001 and 0100 for the coset
leader. As far as the decoding algorithm is concerned, there is no difference between the
two and we could have selected either.
It is seen that the cosets are disjoint and, together, they contain all possible
sequences of length 4. If the sequence Vj = 0111, for example, is received, the decoder
must somehow realize that Vj belongs to S2. Once this is achieved, the corresponding
error sequence 0010 will be added to Vj and the decision announced as 0010 ⊕ 0111 =
0101 = W2. You can check that W2 is in fact the closest code word to Vj.

We now turn our attention to the problem of identification of the coset
for a received sequence Vj. Obviously, this can be done through a table lookup,
which is a slow and usually impractical technique. Fortunately, there is an
easier way. It turns out that each coset is uniquely identified by an m × 1 vector called the corrector.† The corrector Cj for a given Vj is defined as Cj = AVj^T.
The same corrector belongs to other members of the coset Vj ⊕ S since

        A(Vj^T ⊕ Wi^T) = AVj^T ⊕ AWi^T = AVj^T = Cj

Thus all elements of Vj ⊕ S have the same corrector Cj. On the other hand, if
Vk does not belong to Vj ⊕ S, then Ck = AVk^T is different from Cj = AVj^T, for if
Ck = Cj, then A(Vj^T ⊕ Vk^T) = 0, which means that Vj ⊕ Vk = Wi or Vk =
Vj ⊕ Wi (i.e., Vk belongs to Vj ⊕ S), which is a contradiction.
The corrector vectors are binary sequences of length m (m is the rank
of A) and there are 2^m such sequences. The number of cosets associated with
the matrix A was also found to be 2^m. Therefore, there is a one-to-one correspondence between the cosets of the code associated with A and all binary
sequences of length m.
The parity check decoder must keep a table that associates all binary
sequences of length m (the correctors) with the corresponding error-pattern

†In the literature, the corrector is also known as the syndrome.


vectors (coset leaders). Upon receiving a sequence Vj, the decoder computes its
corrector Cj = AVj^T. It then searches the table to find the error-pattern vector Zj
corresponding to the corrector Cj. The minimum-distance decision will then be
the code word Wi = Vj ⊕ Zj.
The amount of storage for the preceding decoding technique is on the
order of 2^m, which could be significantly lower than 2^N, the storage needed for
a direct table-lookup approach.
For the code corresponding to the matrix A in Eq. (8.13), the correctors and their
error-pattern vectors are as follows:

        00    0000
        01    0100
        10    1000
        11    0010

This code can thus correct some of the single errors. All other errors will be neither
detected nor corrected.
Example

Consider the parity check matrix

        A = ( 1 0 0 0 1 1 )
            ( 0 1 0 0 1 0 )        (8.14)
            ( 0 0 1 0 1 1 )
            ( 0 0 0 1 0 1 )

The code words are W1 = 000000, W2 = 111010, W3 = 101101, and W4 = 010111.


The cosets associated with this code are listed in Table 8.1. The coset leaders are all
in the leftmost column. The correctors associated with the cosets are shown on the
right. Note that this code is capable of correcting all single errors along with some (but
not all) double and triple errors. If this code were used on a binary symmetric channel
with crossover probability e, the probability of message error PE would have been obtained from

        1 - PE = (1 - e)^6 + 6e(1 - e)^5 + 7e^2(1 - e)^4 + 2e^3(1 - e)^3        (8.15)

Here 1 - PE is the probability of correct detection, the first term on the right side is the
probability of correct detection corresponding to correct transmission, the second term is
the probability of correct detection corresponding to single errors in transmission, the
third term is the probability of correct detection corresponding to double errors in transmission, and so on. If coding is not used, the transmission of four messages requires
2 bits and the probability of correct transmission would be

        1 - PE = (1 - e)^2        (8.16)

For e = 0.01, Eq. (8.15) yields PE = 7.86 × 10^-4, whereas from Eq. (8.16), PE =
1.99 × 10^-2. This improvement in error probability is achieved at the cost of reducing
the data rate to one-third.
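The numbers quoted above follow from the weights of the coset leaders in Table 8.1 (one leader of weight 0, six of weight 1, seven of weight 2, two of weight 3). A quick check of Eqs. (8.15) and (8.16):

    # Numerical check of Eqs. (8.15) and (8.16) for crossover probability e = 0.01.
    e = 0.01
    leaders_by_weight = {0: 1, 1: 6, 2: 7, 3: 2}     # coset-leader weights from Table 8.1

    p_correct_coded = sum(n * e ** w * (1 - e) ** (6 - w)
                          for w, n in leaders_by_weight.items())
    PE_coded = 1 - p_correct_coded                   # Eq. (8.15): about 7.86e-4
    PE_uncoded = 1 - (1 - e) ** 2                    # Eq. (8.16): about 1.99e-2
    print(PE_coded, PE_uncoded)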

"V


TABLE 8.1

                                                                   Correctors

Code       000000    111010    101101    010111                      0000
Words

Single     100000    011010    001101    110111                      1000
errors     010000    101010    111101    000111                      0100
           001000    110010    100101    011111                      0010
           000100    111110    101001    010011                      0001
           000010    111000    101111    010101                      1110
           000001    111011    101100    010110                      1011

Double     110000    001010    011101    100111                      1100
errors     101000    010010    000101    111111                      1010
           100100    011110    001001    110011                      1001
           100010    011000    001111    110101                      0110
           100001    011011    001100    110110                      0011
           010100    101110    111001    000011                      0101
           010001    101011    111100    000110                      1111

Triple     110100    001110    011001    100011                      1101
errors     110001    001011    011100    100110                      0111

Source: R. Ash, Information Theory, Wiley-Interscience, New York, 1965.

8.8 ERROR-CORRECTING POWER OF PARITY CHECK CODES

In Section 8.1 we saw that to correct all errors of magnitude v and less, a code {W1, ..., WM} must satisfy

D(Wi, Wj) ≥ 2v + 1,     for all i ≠ j

For parity check codes, this condition can be stated in terms of the minimum code-word weight, where the weight of a code word is the number of 1's in it. For instance, the weight of Wi = 1001100 is 3.
The minimum distance between code words of a parity check code is equal to the weight of the code word with the least number of 1's (excluding the trivial word W = 000...0). The reason is that, if the distance between Wi and Wj is D, then the code word Wk = Wi ⊕ Wj has a weight equal to D, which cannot be less than the minimum weight. On the other hand, the minimum-weight code word itself has a distance from the trivial code word equal to its own weight. As an example, consider the parity check code {0000, 0101, 1110, 1011}. The minimum-weight code word is 0101, which has a weight of 2, and the minimum distance is indeed equal to 2.
In light of the preceding result, Theorem 1 of Section 8.1 can be stated as follows: A necessary and sufficient condition for a parity check code to correct all errors of magnitude v and less is to have a minimum weight greater than or equal to 2v + 1.


If the code word with minimum weight has at least 2v + 1 ones, then no code word (except the trivial one) can have 2v or fewer 1's. Since the 1's in a code word represent a linear dependence among the columns of the matrix A [see Section 8.4, property (v)], the preceding theorem can be restated as follows: A necessary and sufficient condition for a parity check code to correct all errors of magnitude v and less is that every set of 2v columns of the matrix A be linearly independent.
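Both criteria are easy to check by brute force for small codes. The sketch below is illustrative only: the helper names are assumptions, and the 4 × 6 matrix is the one reconstructed above for Eq. (8.14), not necessarily the book's exact rendition. It computes the minimum weight of the nonzero code words (which equals the minimum distance) and tests whether every set of 2v columns is linearly independent over GF(2).

```python
# Sketch: error-correcting power of a parity check code (assumed example matrix).
from itertools import combinations, product

A = [[1, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1]]

def syndrome(A, w):
    return [sum(a * b for a, b in zip(row, w)) % 2 for row in A]

def min_weight(A):
    """Minimum weight of the nonzero code words (= minimum distance)."""
    N = len(A[0])
    return min(sum(w) for w in product([0, 1], repeat=N)
               if any(w) and not any(syndrome(A, w)))

def corrects_all(A, v):
    """True iff every set of 2v columns of A is linearly independent over GF(2),
    i.e., no nonzero pattern of weight <= 2v has zero syndrome."""
    N = len(A[0])
    for t in range(1, 2 * v + 1):
        for pos in combinations(range(N), t):
            e = [1 if i in pos else 0 for i in range(N)]
            if not any(syndrome(A, e)):
                return False
    return True

print(min_weight(A))        # 4: minimum weight >= 2*1 + 1, so all single errors
print(corrects_all(A, 1))   # True
print(corrects_all(A, 2))   # False: minimum weight 4 < 2*2 + 1
```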
Example

The following parity check matrix represents a double-error-correcting code with seven check digits and three information digits.

    A = [7 × 10 binary parity check matrix; entries not legible in this reproduction]          (8.17)

It may be verified that any four columns of this matrix are linearly independent.*

8.9 HAMMING LOWER BOUND ON THE NUMBER OF CHECK DIGITS

The following theorem is due to Hamming:

Theorem. If a parity check code of block length N corrects all errors of magnitude v and less, then the number of check digits m must satisfy the following inequality:

2^m ≥ Σ_{k=0}^{v} (N choose k)          (8.18)

Proof. To correct an error pattern Z, we must have a distinct corrector C = A·Z^T corresponding to that error pattern. To correct all errors of magnitude k, we need (N choose k) distinct correctors. To correct all errors of magnitude v and less, we need Σ_{k=0}^{v} (N choose k) distinct correctors. But there are 2^m correctors altogether; therefore inequality (8.18) must be satisfied. The proof is now complete.

Notice that the preceding lower bound on m is necessary but not sufficient. For instance, if N = 10 and v = 2, we find from Eq. (8.18) that m ≥ 6

*From R. Ash, Information Theory. New York: Interscience, 1965, p. 183. Reprinted by permission of John Wiley & Sons, Inc.

11<:

Linear uooes TOr me emarv ::.ymmemc unarmer

l-nap.~

is necessary. However, it may be verified (by trial and error, for example) that the smallest number of check digits for a double-error-correcting parity check code of block length 10 is m = 7.
The Hamming lower bound on the number of check digits for a parity check code is, in fact, identical to the Hamming upper bound on the number of code words for a general binary code, which was described in Section 8.2. To see this, observe that M = 2^(N−m), and therefore inequalities (8.3) and (8.18) are equivalent.

8.10 VARSHARMOV-GILBERT UPPER BOUND ON THE NUMBER OF CHECK DIGITS

Theorem. A sufficient condition for the constructability of a parity check code with block length N that corrects all errors of magnitude v and less is that m, the number of check digits, satisfy the strict inequality

2^m > Σ_{k=0}^{2v−1} (N−1 choose k)          (8.19)

Proof. We describe a construction procedure for the parity check matrix and show that a v-error-correcting code can be constructed if Eq. (8.19) is satisfied. The construction procedure consists of selecting successive columns of the matrix A in a way that makes every set of 2v columns linearly independent. As we have shown in Section 8.8, a matrix with this property corresponds to a v-error-correcting code. Obviously, none of the columns can be entirely zero; otherwise, a single error will be confused with correct transmission. The first column C1 is thus chosen arbitrarily from the set of all nonzero vectors of length m. The second column C2 is chosen such that C2 ≠ 0 and C2 ≠ C1. This can be done provided 2^m > 2. The third column C3 is chosen such that C3 ≠ 0, C3 ≠ C1, and C3 ≠ C2. If v ≥ 2, then we must also have C3 ≠ C1 + C2 to ensure that the three columns are linearly independent. This can be done provided 2^m > 4. The procedure continues until all columns are selected. Each column is chosen with the aid of the previously selected columns in such a way that all combinations of 2v columns are forced to be linearly independent. At each stage of the process, the previously selected columns bar a number of vectors from candidacy for the next selection. The excluded vectors increase in number as we proceed; therefore, in the last stage, when we must choose CN, the number of excluded vectors is the largest, and if at least one vector is still left to choose for CN, the process will end successfully. The excluded vectors for the final stage are as follows:


CN ≠ 0
CN ≠ Ci,                          i = 1, 2, ..., N − 1
CN ≠ Ci + Cj,                     i, j = 1, 2, ..., N − 1 and i, j distinct
CN ≠ Ci + Cj + Ck,                i, j, k = 1, 2, ..., N − 1 and i, j, k distinct
  ⋮
CN ≠ C_{i1} + C_{i2} + ... + C_{i(2v−1)},   i1, i2, ..., i(2v−1) = 1, 2, ..., N − 1 and distinct

At worst, all these combinations will be distinct, in which case the total number of excluded vectors will be

(N−1 choose 0) + (N−1 choose 1) + ... + (N−1 choose 2v−1) = Σ_{k=0}^{2v−1} (N−1 choose k)

Since there are 2^m vectors of length m, we can choose the last column CN if Eq. (8.19) is satisfied. The proof is thus complete.

Example
If N = 10 and v = 2, the upper bound for m is 8, which means that for m ≥ 8 and N = 10 a double-error-correcting parity check matrix can be constructed. The preceding theorem does not exclude, however, the possibility of finding a parity check code with m < 8. From the Hamming bound, we know that m must be larger than or equal to 6. In fact, a double-error-correcting parity check code of length N = 10 exists with m = 7 check digits. [See the matrix in Eq. (8.17).]
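The two bounds in this example are quick to evaluate numerically. The short sketch below is illustrative only (the function names are assumptions, not from the book); it computes the smallest m allowed by the Hamming bound of Eq. (8.18) and the smallest m guaranteed to suffice by the Varsharmov-Gilbert bound of Eq. (8.19), reproducing m ≥ 6 and m = 8 for N = 10, v = 2.

```python
# Sketch: Hamming and Varsharmov-Gilbert bounds on the number of check digits m.
from math import comb

def hamming_lower_bound(N, v):
    """Smallest m with 2^m >= sum_{k=0..v} C(N, k)   [Eq. (8.18)]."""
    need = sum(comb(N, k) for k in range(v + 1))
    m = 0
    while 2**m < need:
        m += 1
    return m

def varsharmov_gilbert_bound(N, v):
    """Smallest m with 2^m > sum_{k=0..2v-1} C(N-1, k)   [Eq. (8.19)]."""
    need = sum(comb(N - 1, k) for k in range(2 * v))
    m = 0
    while 2**m <= need:
        m += 1
    return m

print(hamming_lower_bound(10, 2))       # 6  (necessary)
print(varsharmov_gilbert_bound(10, 2))  # 8  (sufficient for constructability)
```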

We close our discussion of parity check codes by noting that in the special case of v = 1 (single-error-correcting codes) the Hamming lower bound is 2^m ≥ 1 + N and the Varsharmov-Gilbert upper bound is 2^m > N. Since 2^m is an integer, the two bounds are equal, and the minimum value of m is uniquely determined by m = ⌈log₂(N + 1)⌉. This can be seen in a much simpler way by observing that the columns of a single-error-correcting code must satisfy only two requirements: (1) all columns are distinct and (2) no column is entirely zero. For a further discussion of single-error-correcting codes, see Problem 8.1.

8.11 CONVOLUTIONAL ENCODER


In the remaining part of this chapter we will discuss convolutional codes, which are different from block codes in the sense that here one cannot divide the stream of data into separate blocks and associate independent code words with them. The distinction will become clear once we introduce a convolutional encoder later in the section. The mapping of the entire sequence of data into a

~I"""UI

V V ..... ""..,

IVI

L II\:;

LJII IU I

""'y"l l l I C Lll\"

ll O l l ll C I

Vl la~ .

. X13x12Xl1

Figure 8.1

Linear binary convolutional encoder.

code sequence, however, can be thought of as one giant block code, and it is in this sense that linear convolutional codes and linear block codes are similar. Convolutional codes are powerful error-correcting codes with relatively simple encoding and decoding schemes and have found application in many communication systems. We introduce the concept of convolutional codes by means of simple examples and describe an optimum decoding technique known as the Viterbi algorithm. We will not discuss topics such as the error-correcting power of convolutional codes and sequential decoding techniques; the interested reader is referred to the vast literature on coding theory for further information [3, 5].
The following example of a linear binary convolutional code will bring out the important features of the convolutional technique. Consider a sequence of binary information digits U1U2U3... at the output of an information source, with τs being the fixed duration of each digit. The information sequence enters a three-digit shift register as shown in Figure 8.1. The shift register is initially reset to 000. At t = (n − 1)τs, the nth information digit Un enters the shift register and creates two code letters X0n = Un ⊕ Un−2 and X1n = Un ⊕ Un−1 ⊕ Un−2, which are then transmitted during the time intervals [(n − 1)τs, (n − 0.5)τs] and [(n − 0.5)τs, nτs], respectively. Since for every information digit there are two code digits, this code is a (2, 1) convolutional code; in other words, the rate of the code is R = ½. Since at every stage the encoder remembers the previous two input letters, the code is said to have a memory or constraint length of 3. Notice that the stream of data cannot be partitioned into nonoverlapping blocks with each block corresponding to a block of code letters. It is in this respect that convolutional codes differ from block codes.
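A minimal software model of this encoder is easy to write. The sketch below is illustrative (not code from the book; the function name is an assumption): it implements the two generator equations X0n = Un ⊕ Un−2 and X1n = Un ⊕ Un−1 ⊕ Un−2 with the register initially reset, and it reproduces the output 111010000111 for the input 110100 used in the next section.

```python
# Sketch of the (2, 1) convolutional encoder of Figure 8.1 (rate 1/2, constraint length 3).
def convolve(bits):
    u1, u2 = 0, 0                 # register contents U_{n-1}, U_{n-2}
    out = []
    for u in bits:
        x0 = u ^ u2               # X0n = Un xor U_{n-2}
        x1 = u ^ u1 ^ u2          # X1n = Un xor U_{n-1} xor U_{n-2}
        out += [x0, x1]
        u1, u2 = u, u1            # shift the register
    return out

print(convolve([1, 1, 0, 1, 0, 0]))   # -> [1,1, 1,0, 1,0, 0,0, 0,1, 1,1]
```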

8.12 STATE AND LADDER DIAGRAMS

A convolutional encoder can be identified with a state diagram. In Figure 8.1 the contents of the leftmost two registers, Un−1 and Un−2, determine the state of the encoder in the time interval (n − 2)τs < t < (n − 1)τs; thus the state is Un−1Un−2 just before Un arrives, and upon arrival of Un the state moves to UnUn−1. The knowledge of the previous state and the present state is therefore sufficient information for determining the encoder input and output at each stage. Figure 8.2 shows the state diagram of the encoder of Figure 8.1. There are four


Figure 8.2  State diagram. (From R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass., 1977.)

states corresponding to four different values of Un−1Un−2; these states are labeled a, b, c, d and are represented by boxes in Figure 8.2. The lines that join these boxes show possible transitions from state to state upon arrival of a new information digit. A solid line represents a 0 input, while a broken line represents a 1 input. Each line has a two-digit label that represents the resulting output X0nX1n. For example, if the input sequence is 110100 and the initial state of the system is S0 = a, the state follows the path acdbcba and the encoder output will be 111010000111.
The ladder diagram is the extension of the state diagram in time. The ladder diagram for the encoder of Figure 8.1 is shown in Figure 8.3. Each column of four dots in Figure 8.3 represents the states of the system at t = jτs, with j = 0, 1, 2, .... In going from Sj to Sj+1, a solid line represents a 0 input and a broken line represents a 1 input. The 2-bit sequences on the lines correspond to the output at each transition. The encoder output stream can be found by tracing the appropriate path through the ladder. For example, the input stream 110100... corresponds to the path a0c1d2b3c4b5a6... and the output sequence 111010000111....
In practice, it is necessary to work with a truncated version of a convolutional code. If the stream of information is truncated after N digits, it is usually preferred to input a sufficient number of zeros so that the last bit can pass through the shift register. For the encoder of Figure 8.1, the required number of zeros is 2, and the last few stages of the ladder diagram for a code truncated after N input letters are shown in Figure 8.4. Notice that in the last two stages all the lines are solid; this is because the last two input letters are zeros. The truncated convolutional code can, in fact, be thought of as a single-block code

Figure 8.3  Ladder diagram. (From R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass., 1977.)
Figure 8.4  Final stages of a truncated ladder diagram at level N.

of length 2N + 4 with N information digits. The rate, R = N/(2N + 4), approaches ½ as N → ∞. Moreover, it is not difficult to show that this linear block code is in fact a parity check code (see Problem 8.17).
8.13 MAXIMUM LIKELIHOOD DECODING AND THE VITERBI ALGORITHM

The Viterbi algorithm is a maximum likelihood (minimum distance) decision scheme for convolutional codes. If the sequence Vj = (y01 y11 ... y0,N+2 y1,N+2) is received at the channel output, the decoder must identify a sequence of states S0 S1 ... SN+2 that corresponds to the code word Wi = x01 x11 ... x0,N+2 x1,N+2


with minimum distance from Vj. The sequence of states S0 S1 ... SN+2 is not arbitrary; it represents a continuous path in the ladder diagram. Moreover, the initial and final states are always known in advance: S0 = SN+2 = a.
For a given received sequence Vj, we assign a "length" to each branch of the ladder diagram. A branch that connects the state S in level (j − 1) (i.e., Sj−1) to state S′ in level j (i.e., S′j) will have a length equal to the Hamming distance between its corresponding encoder output (x0j x1j) and the corresponding 2 bits in the received sequence (y0j y1j). Figure 8.5 shows the ladder diagram of the encoder of Figure 8.1, truncated at N = 6. A received sequence Vj = 1011001110111100 is also shown.
As an example, consider the branch that connects d2 to b3. For this branch, x03x13 = 10 and y03y13 = 00; therefore, the length of d2b3 is 1. Figure 8.6 shows the lengths of all the branches of Figure 8.5.
Vj = [ 10  11  00  11  10  11  11  00 ]

Figure 8.5  (From R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass., 1977.)

Figure 8.6  (From R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass., 1977.)


The minimum-distance decoding scheme can now be stated as follows: For a given received sequence Vj, determine the sequence of states a0S1S2...SNSN+1aN+2 that connects a0 to aN+2 through a continuous path of minimum total length.

Example
It is possible to verify in Figure 8.6 that the path a0a1c2b3a4c5b6a7a8 is the shortest path between a0 and a8. The total length of this path is equal to 4.

The problem of maximum likelihood decoding is now reduced to the problem of identification of the shortest path between a0 and aN+2. Viterbi recognized that if a0S1S2...Sj...SN+1aN+2 is such a path, then the sequence a0S1S2...Sj must also correspond to the shortest path between a0 and Sj. Now assume that for all states Sj at level j, that is, aj, bj, cj, and dj, the shortest path leading to Sj is known (one path for each state). Then for every state at level (j + 1), say Sj+1, determination of the shortest path between a0 and Sj+1 is as simple as comparing two possible paths and choosing the shorter one. This is because every path that leads to Sj+1 must first visit some state in level j, and there are only two states in level j from which Sj+1 can be reached. Considering the preceding arguments, the Viterbi algorithm can be described as follows:

1. Set j = 1.
2. For every state Sj of level j, find the shortest path leading to Sj from a0.
3. If j = N + 2, stop. Otherwise, set j = j + 1 and go to step 2.

Figure 8.7 shows this process for the ladder diagram of Figure 8.6. The numbers appearing on the states represent the length of the shortest path that leads to them. The shortest path to a8 is readily seen to be a0a1c2b3a4c5b6a7a8. The decoding is now complete.
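The same recursion is straightforward to code. The sketch below is an illustration under stated assumptions, not code from the book: each state is represented by the register contents (Un−1, Un−2), the branch length is the Hamming distance between the branch output and the received 2-bit pair, and the received sequence is the one of Figure 8.5. It reproduces the minimum total length of 4; ties between equally short paths may be broken differently from the particular path highlighted in the text.

```python
# Sketch: Viterbi (minimum-distance) decoding for the (2, 1) encoder of Figure 8.1.
def branch_output(state, u):
    """Encoder output (X0, X1) when input u arrives in state (U_{n-1}, U_{n-2})."""
    u1, u2 = state
    return (u ^ u2, u ^ u1 ^ u2)

def viterbi(received_pairs):
    # shortest-path metric and survivor input sequence for each state; start in a = (0, 0)
    metrics = {(0, 0): (0, [])}
    for y in received_pairs:
        new = {}
        for state, (metric, path) in metrics.items():
            for u in (0, 1):
                x = branch_output(state, u)
                d = (x[0] ^ y[0]) + (x[1] ^ y[1])   # Hamming distance of the branch
                nxt = (u, state[0])                  # next register contents
                cand = (metric + d, path + [u])
                if nxt not in new or cand[0] < new[nxt][0]:
                    new[nxt] = cand
        metrics = new
    return metrics[(0, 0)]        # final state must be a = (0, 0)

received = [(1,0), (1,1), (0,0), (1,1), (1,0), (1,1), (1,1), (0,0)]
total_length, inputs = viterbi(received)
print(total_length)   # 4, the shortest-path length found in Figure 8.7
print(inputs)         # one minimum-distance input sequence (last two digits are the flushing zeros)
```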

Figure 8.7  The Viterbi algorithm. (From R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass., 1977.)


Figure 8.8 (3, 2) convolutional encoder with eight states.

So far we have restricted our study of convolutional codes to a simple (2, 1) code with a small number of states. The ideas, however, can be readily generalized to include (N, K) convolutional codes with any number of states. Figure 8.8 shows a (3, 2) convolutional encoder with eight states. The ladder diagram for this encoder has eight dots in each column, representing the eight states. Each state can go to four states in the next level and can be reached from four states in the previous level. The lines must be of four different types (say solid, dashed, dash-dot, and dotted) corresponding to four different inputs u2j−1u2j = 00, 01, 10, 11. Each line in the ladder diagram will be labeled with a 3-bit output sequence. It is easy to see that the Viterbi algorithm is applicable to the general case of (N, K) codes with an arbitrary number of states. In practice, however, as the number of states grows, the storage and computation requirements become prohibitive. Decoding techniques that are almost as good as Viterbi's can be applied in such situations. We mention sequential decoding and threshold decoding techniques but will not discuss the specifics here.

PROBLEMS

8.1. The Hamming code is a parity check code whose matrix A has the following property: column n of A is the binary number n. For example, the matrix of a Hamming code with m = 3, N = 7 is

        ( 0 0 0 1 1 1 1 )
    A = ( 0 1 1 0 0 1 1 )
        ( 1 0 1 0 1 0 1 )

(a) For a given m, what is the maximum code-word length N?
(b) Construct the largest Hamming matrix for m = 4 and show that it corrects all single errors of transmission.


(c) Show that errors of magnitude 2 and larger cannot be corrected by the Hamming code.
8.2. (a) Show that a parity check code will correct v-tuple (and all smaller) errors and detect (but not necessarily correct) (v + 1)-tuple errors if and only if every set of 2v + 1 columns of the parity check matrix is linearly independent.
(b) Given a v-tuple error correcting parity check code with parity check matrix A = [Im | A1], where Im is an identity matrix of order m, show that the code defined by the matrix

    A0 = ( Im   A1      0 )
         ( 1  1  ⋯  1   1 )

(the last row consists entirely of 1's, and the last column has a single 1 in the new row) will correct v-tuple errors and detect (v + 1)-tuple errors. (This corresponds to adding one check digit that performs a parity check on all the digits of a code word.)
(c) Find a parity check matrix of a single-error-correcting, double-error-detecting code with five check digits and eight information digits.
(d) What is the analog of the Varsharmov-Gilbert condition for codes that correct v-tuple and detect (v + 1)-tuple errors?
8.3. Consider the following parity check matrix:

(~q ~ !)

(a) Find the code words of the code corresponding to A .


(b) Show that it is possible to replace the rightmost column of A such that the
resulting code will correct single errors and detect double errors. Show that
not all double errors can be corrected .
8.4. The simplest error-control code is obtained by adding a single parity check bit at the end of a binary sequence of length N − 1 to make the total number of 1's in the sequence even. For this code:
(a) Determine the parity check matrix A and the rate R.
(b) What is the minimum distance of the code? How many errors can be corrected? How many errors can be detected?
(c) Describe the decoding sets (cosets) and the corrector vectors. How many error patterns can be corrected?
8.5. It is desired to construct a parity check code to detect v-tuple and all smaller errors. If an error of magnitude ≤ v is made, the decoder must decide that an error has occurred; it is not required that the decoder provide any information about the nature of the error.
(a) Find the necessary and sufficient conditions on the parity check matrix for this type of error detection.


(b) Find a convenient condition on the code words that is equivalent to the condition of part (a).
(c) Given a parity check code with 2^K words of length N, find the number of error patterns that can be detected by the code. [Note that this number is the same for all (N, K) codes.]

8.6. A generator matrix for an (N, K) binary code S is a matrix G whose rows consist of K linearly independent code words of S. Every code word of S may be obtained as a linear combination of the rows of G. Assuming that the last K columns of G are linearly independent (if not, we can interchange columns), we may reduce G by elementary row transformations to a new generator matrix

G* = [B | IK]

where B is K by m (m = rank of the parity check matrix = N − K) and IK is an identity matrix of order K.
(a) Show that the matrix A = [Im | B^T] is a parity check matrix for S.
(b) Find parity check matrices for the codes whose generator matrices are

(i) G

[~ ~ ~ ~ ~ ~J
0

000

(ii) G = [I

I]

8.7. The parity check matrix A represents a single-error-correcting, double-error-detecting code. By eliminating a row and a column from A, it is always possible to obtain the matrix of a single-error-correcting code. How do you go about selecting the row and column that must be eliminated for this purpose?
8.8. The circuit in Figure P8.8 is used to encode binary digits for transmission over a binary symmetric channel with crossover probability ε < ½. The shift register is initially filled with 0's; then four information digits are shifted into the shift register and simultaneously transmitted. Next three check digits are transmitted; the four information digits in the shift register shift one place to the right between each check digit calculation.
Find the parity check matrix, the generator matrix, a decoding table, and the probability of decoding error for the code.

Figure P8.8
8.9 . (a) Show that in a parity check code either all the code words contain an even
number of ones or half the code words contain an odd number of ones and
half an even number.

,.<:.<:

L lllt::dl

\,.,UUt::::; lUI

li l t::

o u rcn v

vY l l ll l ltal lL: \... 1Id l II 1t::1

vlldlJ. 0

(b) Let x_{m,n} be the nth digit in the mth code word of a parity check code. Show that, for any given n, either exactly half or all the x_{m,n} are zero. If all the x_{m,n} are zero for a given n, explain how the code could be improved.
(c) Show that the average number of ones per code word, averaged over all code words in a parity check code of block length N, must be at most N/2.
8.10. Consider two parity check codes. Code I is generated by the rule

    x1 = u1
    x2 = u2
    x3 = u3
    x4 = u1 ⊕ u2
    x5 = u1 ⊕ u3
    x6 = u2 ⊕ u3
    x7 = u1 ⊕ u2 ⊕ u3

Code II is the same except that x6 = u2.
(a) Write down the generator matrix and parity check matrix for code I.
(b) Write out a decoding table for code I, assuming a BSC with crossover probability ε < ½.
(c) Give an exact expression for the probability of decoding error for codes I and II. Which is larger?
(d) Find dmin for codes I and II.
(e) Give a counterexample to the conjecture that, if one (N, K) parity check code has a larger minimum distance than another (N, K) parity check code, it has a smaller error probability on a BSC.
8.11. Consider a parity check code of block length N, and let W0 = 000...0 be the all-zero code word and W1 be an arbitrary code word of this code. Show that the number of code words that have a given distance, say n, from W0 is the same as the number of code words at the same distance n from W1.
8.12. A binary code has minimum distance dmin. Show that if this code is used on a binary erasure channel (see Problem 6.1), as long as the number of erasures is less than dmin, the error can be corrected.
8.13. Let S1 and S2 be two binary parity check codes of equal block length. Show that the intersection of S1 and S2, that is, the set of all code words that belong to both codes, is also a parity check code. Describe a method for obtaining the matrix of the new code from the matrices of the original codes.
8.14. Consider the m × N parity check matrix A. Define a new matrix A1 by adding a row of all 1's to A; that is,

    A1 = (      A      )
         ( 1  1  ⋯  1  )

(a) Can you obtain the code words of A1 from the code words of A? If the answer is positive, describe a procedure.
(b) In terms of N and m, how many code words belong to A1?
(c) If A is a v-tuple error-correcting code, prove that A1 is at least v-tuple error correcting, (v + 1)-tuple error detecting.
8.15. Let the m × N parity check matrix A be expressed as A = [B | Im], where B is m × (N − m) and Im is the m × m identity matrix. Construct a new matrix A1 by eliminating the first column of A (i.e., A1 = [B1 | Im], where B1 is B without its first column).


(a) Describe a procedure for obtaining the code words of A1 from the code words of A.
(b) In terms of N and m, how many code words belong to A1?
(c) Show that the minimum distance dmin for the new code is ≥ dmin for the original code.
8.16. Suppose you were approached by a customer who told you that his (binary) channel accepts words of length n = 7 and that the only kind of error pattern ever observed is one of the eight patterns 0000000, 1000000, 1100000, 1110000, 1111000, 1111100, 1111110, 1111111. Design a parity check code that will correct all such patterns with as large a rate as possible.
8.17. Using the equivalence between parity check codes and group codes, show that a truncated linear convolutional code is a parity check block code.
8.18. The convolution of a sequence {a0, a1, ...} with another sequence {b0, b1, ...} is defined as the sequence {c0, c1, ...}, where cn = Σ_{i=0}^{n} ai bn−i. Using modulo 2 addition and multiplication, show that each of the sequences x01x02x03... and x11x12x13... of Figure 8.1 can be expressed as the convolution of the input sequence u1u2u3... with an appropriate (finite) sequence.

9
ADVANCED TOPICS

9.1 Discrete Stationary Sources
9.2 Ergodicity and the Shannon-McMillan Theorem
9.3 Differential Entropy
9.4 A Simple Continuous Channel
9.5 The Arimoto-Blahut Algorithm
    Problems

Introduction
In this final chapter we introduce some of the more advanced topics of information theory. This presentation is neither detailed nor exhaustive; its only aim is to convey to the reader the sense that the concepts of information theory that have been developed in this book in connection with simple discrete memoryless sources and channels are far more general and can be extended in a variety of directions. In Section 9.1, for example, we remove the restriction of memorylessness from the discrete source and show that entropy can be defined in a natural way for discrete stationary sources. We then show that this generalized entropy has the same practical significance that was associated with its restricted version for the discrete memoryless source; that is, the entropy of a stationary source is the minimum average code-word length that one has to use in order to represent the source accurately. In Section 9.2, we generalize the concept of likely or typical sequences by proving the Shannon-McMillan theorem for a broad subclass of discrete stationary sources. By imposing the constraint of ergodicity, we will be able to show the generality of the relationship between entropy and typical sequences that was studied in Chapter 2.
Another direction in which the results of information theory can be extended is in relation to continuous messages and channels. Although some information generation, storage, and transmission systems in use today are

1"-0

MUVQI I\..CU

IUIJI\...~

'-"IICp . oJ

discrete in nature, the vast majority of such systems remain continuous. Microwave channels, telephone lines, fiber-optic networks, and magnetic disk and tape drives are all examples of continuous systems. A great deal of effort has been expended to extend the results of information theory to such practical situations. The cornerstone of all these efforts is the concept of differential entropy, which will be introduced in Section 9.3. Although we shall not build upon this concept here to investigate the properties of continuous (or waveform) messages and channels, we will point out its major differences with conventional (discrete) entropy. Our analysis of a continuous channel in Section 9.4 is based on intuition rather than rigor, but the derivation of the formula for the capacity of a band-limited Gaussian channel in this section is expected to give the reader a great deal of insight.
Section 9.5 describes a numerical algorithm for the calculation of the capacity of discrete memoryless channels. This is a simple, yet elegant method, which was not discovered until later in the development of information theory [13, 14]. The discussion here takes advantage of the knowledge developed in Chapters 5 and 6 with regard to the information theoretic functions.

9.1 DISCRETE STATIONARY SOURCES

In Chapter 2 we studied sources that produced a sequence of letters u1u2u3... from a finite alphabet. There we required the letters to be independent and identically distributed. It was shown that the entropy H(u) of the source was intimately related to the optimum encoding of information sequences produced by the source, and that H(u) represented the minimum number of bits per source letter that was required for perfect reconstruction. In this section we generalize this result by removing the requirement of independence among the letters.
Consider the sequence of letters u1u2u3... produced by the information source S. Each letter un is a selection from a discrete alphabet. A complete description of this stochastic process requires the assignment of probabilities p(uj+1 uj+2 ... uj+N) for all starting points j ≥ 0, all sequence lengths N ≥ 1, and all realizations of the sequence uj+1 ... uj+N. The source is defined as stationary if its probabilistic description is independent of a time origin, that is, if p(uj+1 uj+2 ... uj+N) is independent of j for all values of N and for all realizations of the sequence. According to this definition, the individual letters are identically distributed but are not necessarily independent. A source with i.i.d. letters is a stationary source, but so is a source whose sequences are formed by mere repetition of the first letter. For example, a binary source that produces the output 11111... with probability P and 00000... with probability 1 − P is a stationary source.
Let SN be a sequence of N letters from a discrete stationary source, and let u1u2...uN be the joint ensemble for SN. We define the entropy per letter in a sequence of N letters as

HN(u) = (1/N){ −Σ_{SN} p(SN) log p(SN) }          (9.1)

Obviously, HN(u) ≥ 0 for all N. We now prove that HN(u) is a nonincreasing function of N.

Lemma 1.  H(uN | u1u2...uN−1) is nonincreasing with N.*

Proof. Using first the fact that conditioning cannot increase entropy and then the stationarity of the source, we obtain

H(uN | u1u2...uN−1) ≤ H(uN | u2...uN−1) = H(uN−1 | u1...uN−2)          (9.2)

Lemma 2.
H(uN | u1u2...uN−1) ≤ HN(u)          (9.3)

Proof.
HN(u) = (1/N) H(u1u2...uN)
      = (1/N){ H(u1) + H(u2|u1) + H(u3|u1u2) + ... + H(uN|u1u2...uN−1) }

From Lemma 1, the last term on the right side of the preceding equation is a lower bound to each of the other terms. Inequality (9.3) then follows.

Theorem 1.  HN(u) is nonincreasing with N.

Proof. Using Lemma 2, we write

HN(u) = (1/N) H(u1...uN)
      = (1/N) H(u1...uN−1) + (1/N) H(uN | u1...uN−1)
      ≤ ((N−1)/N) HN−1(u) + (1/N) HN(u)          (9.4)

Rearranging Eq. (9.4) now yields

HN(u) ≤ HN−1(u)          (9.5)

The proof is thus complete.


Using Theorem 1 and the fact that HN(u) ≥ 0 for all N, we conclude that the limit of HN(u) as N → ∞ exists. The entropy per letter for a stationary source can now be defined as follows:

H(u) = lim_{N→∞} HN(u)          (9.6)

*From R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968, p. 57. Reprinted by permission of John Wiley & Sons, Inc.


In the special case of independent identically distributed letters, we have H(u1...uN) = H(u1) + ... + H(uN) = N·H(u), and thus the preceding definition is equivalent to the earlier definition for memoryless sources.
The entropy for the discrete stationary source can be defined from several other perspectives, but all definitions turn out to be equivalent. One such definition is given in Theorem 2, which follows, and another is the subject of Problem 9.1.
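As a concrete illustration of Eqs. (9.1) and (9.6), the sketch below computes HN(u) for a stationary binary source with memory. The two-state Markov source, its parameters, and all names are assumptions for illustration only, not an example from the book; the point is simply that the computed HN(u) is nonincreasing and approaches the entropy per letter.

```python
# Sketch: H_N(u) for a stationary binary Markov source (assumed example).
from itertools import product
from math import log2

p_stay = 0.9                  # assumed P(next symbol = current symbol)
pi = (0.5, 0.5)               # stationary distribution of the symmetric chain

def seq_prob(seq):
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= p_stay if a == b else 1 - p_stay
    return p

def H_N(N):
    """Entropy per letter of blocks of length N, Eq. (9.1)."""
    return -sum(p * log2(p) for s in product((0, 1), repeat=N)
                for p in [seq_prob(s)]) / N

for N in (1, 2, 4, 8):
    print(N, round(H_N(N), 4))
# Output decreases from 1.0 toward the conditional entropy
# -0.9*log2(0.9) - 0.1*log2(0.1) ~ 0.469 bits, the entropy per letter.
```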

Lemma 3.
H(u) ≤ H(uN | u1...uN−1)     for all N          (9.7)

Proof.
HN+M(u) = (1/(N+M)) H(u1...uN+M)
        = (1/(N+M)) H(u1...uN−1) + (1/(N+M)){ H(uN | u1...uN−1) + H(uN+1 | u1...uN) + ... + H(uN+M | u1...uN+M−1) }

Using Lemma 1, this can be written as

HN+M(u) ≤ (1/(N+M)) H(u1...uN−1) + ((M+1)/(N+M)) H(uN | u1...uN−1)

In the limit when M → ∞, we have

H(u) ≤ H(uN | u1...uN−1)

The proof is now complete.

Theorem 2. The limit of H(uN | u1...uN−1) as N → ∞ exists and is equal to H(u).

Proof. From Lemma 1 and the fact that H(uN | u1...uN−1) is larger than or equal to zero for all N, we conclude that the limit exists. Lemma 2 asserts that the limit is ≤ H(u), while from Lemma 3 the limit must be ≥ H(u). Therefore, H(u) = lim_{N→∞} H(uN | u1...uN−1). This completes the proof.

The following theorem establishes the relationship between the entropy of a discrete stationary source and the average code-word length required for its perfect reconstruction.

Theorem 3. Let H(u) be the entropy per letter for a discrete stationary source with a finite alphabet, and let δ > 0 be arbitrary. Given a code alphabet of D symbols, it is possible to encode source sequences into a prefix-free code with the average code-word length per source letter n̄ satisfying the following relation:

H(u)/log D ≤ n̄ < H(u)/log D + δ          (9.8)

Moreover, n̄ cannot violate the left inequality for any uniquely decodable code.
Proof. The proof is similar to the proof of the corresponding theorem for memoryless sources given in Chapter 2. By encoding sequences of length N at a time, we can achieve

HN(u)/log D ≤ n̄ < HN(u)/log D + 1/N          (9.9)

Since HN(u) ≥ H(u), the left inequality in Eq. (9.9) yields H(u)/log D ≤ n̄, and any uniquely decodable code must satisfy this inequality. By choosing a large enough value for N, the term on the right side of Eq. (9.9) can be made less than H(u)/log D + δ. The proof is thus complete.
9.2 ERGODICITY AND THE SHANNON-McMILLAN THEOREM

Consider a source with alphabet {a1, a2, a3} and two modes of behavior, each occurring with probability ½. In the first mode the source produces an infinite sequence of repetitions of a1. In the second mode the source produces an infinite sequence of statistically independent, equiprobable selections of the letters a2 and a3. This is obviously a stationary source, since the probability of a given sequence does not depend on the starting point. A sequence of length N has either probability ½ (if it is all a1's) or (½)^(N+1) (if it is composed of a2's and a3's). We then have

HN(u) = (1/N){ −Σ_{SN} p(SN) log p(SN) } = (1/N){ ½ log 2 + 2^N [ (1/2^(N+1))(N+1) log 2 ] } = ((N+2)/(2N)) log 2

The entropy per source letter is thus given by H(u) = lim_{N→∞} HN(u) = ½ bit. If we encode sequences of length N into binary code (using either Shannon-Fano coding or Huffman coding), it is not hard to see that the sequence of a1's is encoded into a single binary digit, and each of the 2^N sequences composed of a2's and a3's is encoded into a binary sequence of length N + 1. The average code-word length is ½ + (N + 1)/2 = (N + 2)/2, and the average number of bits per source letter is (N + 2)/2N, which is close to H(u) for large N. However, it is clear that neither n̄ nor H(u) are quantities of great significance in this case.
Sources that cannot be separated into different persisting modes of behavior are known as ergodic sources. A stationary ergodic source has the property that statistical averages of functions defined on its output sequences are arbitrarily close to the time averages with probability close to 1; that is, the probability that the time average differs significantly from the statistical average is almost zero. For the stationary ergodic source, the following theorem, which is a generalization of the results of Section 2.4, can be proved.
The Shannon-McMillan Theorem. Let a discrete stationary ergodic source have entropy H(u). For arbitrary ε > 0, δ > 0, there exists N0 such that, for N ≥ N0,

P{ | −(1/N) log p(u1...uN) − H(u) | > ε } < δ          (9.10)

Proof. The proof will be given in two steps. First we prove the theorem for sources with finite memory; then we show that a stationary source can be arbitrarily closely approximated by a finite-memory source.
Part 1: Let

p(u1...uN) = p(u1...um) Π_{n=m+1}^{N} p(un | un−m ... un−1)

This represents a source with a memory of length m. Now


−(1/N) log p(u1...uN) = −(1/N) log p(u1...um) − [(N−m)/N]·[1/(N−m)] Σ_{n=m+1}^{N} log p(un | un−m ... un−1)          (9.11)

Since m is fixed, in the limit N → ∞ the first term on the right side of Eq. (9.11) approaches zero and the coefficient of the second term approaches unity. The second term is now the time average of −log p(un | un−m ... un−1) and, because of ergodicity, it approaches the statistical average with probability 1. The statistical average of this quantity is

E(−log p(un | un−m ... un−1)) = −Σ_{all sequences un−m...un} p(un−m ... un) log p(un | un−m ... un−1) = H(un | un−m ... un−1)          (9.12)

Since the memory is finite and equal to m, the conditional entropy in Eq. (9.12)
is equal to H(u) . Therefore, in the limit of large N, Eq. (9.10) will be satisfied.
Part 2: We now show that, for sufficiently large m, the quantity
p̄(u1...uN) = p(u1...um) Π_{n=m+1}^{N} p(un | un−m ... un−1)

closely approximates p(u1...uN) with probability close to 1.


Figure 9.1

P{ (1/N) |log p(u1...uN) − log p̄(u1...uN)| > ε }
   = P{ |log [ p(u1...uN)/p̄(u1...uN) ]| > Nε }
   ≤ (1/(Nε)) E( |log [ p(u1...uN)/p̄(u1...uN) ]| )          (9.13)

The last inequality follows from a version of the Chebyshev inequality for positive random variables (see Problem 1.4).
Let us digress momentarily to prove the following inequality concerning the absolute value of the function log x:

|log x| ≤ (2 log e / e) x − log x          (9.14)

The function ½(log x + |log x|) is shown in Figure 9.1. It is equal to log x for x ≥ 1 and is zero otherwise. The tangent to this function passing through the origin is (log e / e) x, which is also shown in the figure. Thus ½(log x + |log x|) ≤ (log e / e) x, which is equivalent to Eq. (9.14).
Using Eq. (9.14) in Eq. (9.13), we obtain

P{ (1/N) |log p(u1...uN) − log p̄(u1...uN)| > ε }
   ≤ (1/(Nε)) { (2 log e / e) E[ p̄(u1...uN)/p(u1...uN) ] − E[ log ( p̄(u1...uN)/p(u1...uN) ) ] }          (9.15)

However,

E[ p̄(u1...uN)/p(u1...uN) ] = Σ_{all sequences} p(u1...uN) [ p̄(u1...uN)/p(u1...uN) ] = Σ p̄(u1...uN)
   = Σ_{u1...um} p(u1...um) Σ_{um+1} p(um+1 | u1...um) ⋯ Σ_{uN} p(uN | uN−m ... uN−1) = 1          (9.16)

And

E[ log ( p(u1...um) Π_{n=m+1}^{N} p(un | un−m ... un−1) / p(u1...uN) ) ]
   = N·HN(u) − m·Hm(u) − (N − m) H(um+1 | u1...um)          (9.17)

Therefore,

P{ (1/N) |log p(u1...uN) − log p̄(u1...uN)| > ε }
   ≤ (1/(Nε)) { (2 log e)/e − N[ HN(u) − H(um+1 | u1...um) ] + m[ Hm(u) − H(um+1 | u1...um) ] }          (9.18)

For sufficiently large m, the terms in the brackets on the right side of Eq. (9.18) will be as close to zero as desired; and since N is always larger than m, the first term can be made arbitrarily small as well. Therefore, for sufficiently large m,

P{ (1/N) |log p(u1...uN) − log p̄(u1...uN)| > ε } < δ          (9.19)

for any ε > 0, δ > 0 chosen arbitrarily. This result, combined with the result of part 1, completes the proof of the Shannon-McMillan theorem.

9.3 DIFFERENTIAL ENTROPY

Consider the continuous random variable x with density function p(x). The differential entropy of x is defined as follows:

H(x) = −∫_{−∞}^{∞} p(x) log p(x) dx          (9.20)

Unlike the entropy for a discrete r.v., the preceding function is not necessarily positive, not necessarily finite, and not necessarily invariant under transformations of x. Therefore, its interpretation as the amount of uncertainty associated with x is no longer valid. The following examples demonstrate these facts clearly.

Example 1
Let x be uniformly distributed between a and b. Then

H(x) = −∫_a^b (1/(b − a)) log (1/(b − a)) dx = log(b − a)          (9.21)

As b − a assumes values in the interval (0, ∞), the function H(x) in Eq. (9.21) takes on negative, zero, and positive values.


Example 2
Let x have the following density function:

p(x) = 0,                x < e
p(x) = 1/(x ln²x),       x ≥ e          (9.22)

Then

H(x) = ∫_e^∞ (ln x + 2 ln ln x)/(x ln²x) dx ≥ ∫_e^∞ dx/(x ln x) = ∫_e^∞ d(ln x)/ln x = ∫_1^∞ dy/y = ∞          (9.23)

Differential entropy can thus be infinite.

Example 3
Let x be a Gaussian random variable with average μ and variance σ². Then

H(x) = −∫_{−∞}^{∞} (1/(√(2π)σ)) exp[ −(x − μ)²/(2σ²) ] [ −log(√(2π)σ) − ((x − μ)²/(2σ²)) log e ] dx
     = log(√(2π)σ) + ½ log e = ½ log(2πeσ²)          (9.24)

The entropy in Eq. (9.24) is independent of μ and, depending on the value of σ, could be positive, zero, or negative.
Now consider the reversible (one-to-one) transformation y = ax. Since E(y) = aμ and Var(y) = a²σ², we have H(y) = ½ log(2πea²σ²). Clearly, the differential entropy is not invariant under this reversible transformation.
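The sign behavior noted in Examples 1 and 3 is easy to check numerically. The short sketch below is illustrative only; the particular widths and standard deviations are arbitrary assumptions chosen to make Eqs. (9.21) and (9.24) pass through negative, zero, and positive values.

```python
# Sketch: differential entropy of uniform and Gaussian densities, Eqs. (9.21) and (9.24).
from math import log2, pi, e

def h_uniform(width):             # H(x) = log2(b - a)
    return log2(width)

def h_gaussian(sigma):            # H(x) = 0.5 * log2(2*pi*e*sigma^2)
    return 0.5 * log2(2 * pi * e * sigma**2)

for width in (0.5, 1.0, 2.0):
    print("uniform ", width, round(h_uniform(width), 3))      # -1.0, 0.0, 1.0
for sigma in (0.1, 1 / (2 * pi * e)**0.5, 1.0):
    print("gaussian", round(sigma, 3), round(h_gaussian(sigma), 3))  # <0, 0.0, ~2.047
```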

The differences between entropy for a discrete r.v. and differential entropy for a continuous r.v. can be traced to the fact that differential entropy is not a limiting case obtained by approximating a continuous distribution with a discrete one. If the real axis is divided into intervals of length Δ and each interval is assigned the probability p(nΔ)Δ, then the discrete approximation to the entropy will be

H(x) ≈ −Σ_{n=−∞}^{∞} p(nΔ)Δ log[ p(nΔ)Δ ] = −log Δ Σ_{n=−∞}^{∞} p(nΔ)Δ − Σ_{n=−∞}^{∞} [ p(nΔ) log p(nΔ) ]Δ

In the limit of small Δ, the approximation becomes

H(x) ≈ −log Δ − ∫_{−∞}^{∞} p(x) log p(x) dx          (9.25)

where −log Δ approaches infinity. It is the absence of this infinitely large term from the differential entropy that is responsible for its odd behavior.
Among the continuous random variables with fixed standard deviation σ, the Gaussian r.v. has the largest differential entropy. To prove this, we need the following lemma.


Lemma. Let p(x) and q(x) be two continuous density functions. Then, assuming the integrals exist, we will have

−∫_{−∞}^{∞} p(x) log p(x) dx ≤ −∫_{−∞}^{∞} p(x) log q(x) dx          (9.26)

with equality iff p(x) and q(x) are equal.

Proof.
−∫ p(x) log p(x) dx + ∫ p(x) log q(x) dx = ∫ p(x) log[ q(x)/p(x) ] dx
   ≤ log e ∫ p(x) [ q(x)/p(x) − 1 ] dx = log e [ ∫ q(x) dx − ∫ p(x) dx ] = 0

Equality holds iff p(x) = q(x). The proof is now complete.


Now let the r. v. x have average u., variance a-2 , and probability density
function p(x) . Define

q(x)

~a- exp[ -

(x

~f)2]

Application of Eq . (9.26) yields

H(x)

= -

f"P(x) log p(x) dx

- f",p(x) [ - log(y!2; a-) - (x

log(y!2;a-)

+-

a-

2a-

~f)2 log e] dx

log e = - log(27Tea- 2)
2

(9.27)

The right side of Eq. (9 .27) is the entropy for a Gaussian r.v. with variance a-2
(see Example 3). Thus the Gaussian distribution has the largest differential entropy, as stated before . Moreover, in Eq . (9 .27) equality holds iff p(x) = q(x)
everywhere.
Joint and conditional entropies as well as average mutual information for
continuous random variables can be defined in a similar way. For instance, if
x and yare continuous random variables with joint density p(x, y), conditional
densities p(x Iy) and p(y Ix), and marginal densities p(x) and p(y), we define
cc

H(x,y) =

-JJ

p(x, y) logp(x , y)dxdy

(9.28)

= - JJ ptx, y) log p(x Iy) dx dy

(9.29)

co

H(x Iy)

- '"


I(x, y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy          (9.30)

It is not difficult to verify that most of the relationships between information theoretic functions that were derived in the previous chapters for discrete random variables also hold among their continuous counterparts defined here. We do not intend to pursue this subject any further in this book since the problems of information theory in the continuous domain are beyond the scope of our treatment. Let us just mention that much of the theory of continuous waveforms and channels parallels the discrete theory with the central role played by the differential entropy.

9.4 A SIMPLE CONTINUOUS CHANNEL

In this section we derive the expression for the capacity of a band-limited channel with average power constraint and signal-independent, additive, Gaussian noise. The arguments used here are not mathematically rigorous but are intuitively acceptable, and in fact the final result, which is the expression for the channel capacity, is an exact result. The channel is diagrammatically shown in Figure 9.2. The Fourier spectrum of the input waveform x(t) is limited to frequencies between −W and +W cycles/second, and its average power, defined as

P = lim_{T→∞} (1/(2T)) ∫_{−T}^{+T} x²(t) dt          (9.31)

is confined to values below P0. The noise z(t) is white, Gaussian, and independent of x(t). Its spectral density is ½N0, independent of frequency.
To analyze the system, we first sample the waveforms at the Nyquist rate of 2W samples per second. This preserves the information content of the waveforms and allows us to confine attention to the samples alone. Let us denote the total number of samples in the time interval [−T, T] by K; then K = 4WT. For sufficiently large K, the input power constraint can be expressed as

(1/K) Σ_{i=1}^{K} xi² ≤ P0          (9.32)

Figure 9.2  Schematic diagram of an ideal band-limited channel with additive noise.


where xi are the samples of x(t) in the interval [−T, T]. The noise samples zi are zero-mean, Gaussian random variables with variance σz² = N0W. These samples are independent of the samples of x(t) and of each other. From the law of large numbers, one can deduce that

(1/K) Σ_{i=1}^{K} zi² ≈ σz²          (9.33)

for sufficiently large K. Finally, the output samples yi must satisfy the following inequality:

(1/K) Σ_{i=1}^{K} yi² ≤ P0 + σz²          (9.34)

Now consider a K-dimensional Euclidean space and assign the K coordinates of the space to the K samples of the waveforms. In this way, each waveform is uniquely identified with a point in the K-dimensional space. Inequality (9.32) then states that the input signal x(t) must be within a sphere of radius √P0. Similarly, the output signal y(t) is within a sphere of radius √(P0 + σz²), and each realization of the noise waveform z(t) is, with a probability close to 1, inside a sphere of radius σz. As the diagram in Figure 9.3 shows, when an input signal such as x0(t) is transmitted, the output signal y0(t) must be somewhere within a sphere of radius σz around x0(t). If the input signals that are used for communication over the channel are far enough from each other, their
Figure 9.3


corresponding noise spheres will not overlap and, consequently, the receiver can correctly identify the transmitted signals.
The problem of calculating the channel capacity is now reduced to that of calculating the maximum number of spheres of radius σz that can be packed inside a sphere of radius √(P0 + σz²) without overlap. Since the dimensionality K of the space is large, this is approximately the ratio of the volumes of the two spheres. Thus, if M denotes the maximum number of input signals that can be correctly identified at the receiver, we can write

M ≈ ( √(P0 + σz²) / σz )^K          (9.35)

The maximum rate of transfer of data over the channel is then given by

C = (1/(2T)) log₂ M = (K/(4T)) log₂( 1 + P0/σz² )     bits/second

But K = 4WT and σz² = N0W; therefore,

C = W log₂( 1 + P0/(N0W) )     bits/second          (9.36)

As a numerical example, consider a telephone channel with a bandwidth of 3 kHz, maximum useful power of 1 W, and noise spectral density of 1 mV/√Hz. From Eq. (9.36), the capacity is found to be C ≈ 25 kbits/second.

9.5 THE ARIMOTO-BLAHUT ALGORITHM

In this section we describe an iterative method of calculating the channel capacity for a discrete memoryless channel and prove its convergence. The algorithm was discovered by S. Arimoto [13] and, independently, by R. Blahut [14]; our proof of convergence here follows that of Arimoto. This algorithm can be applied without any restrictions to a DMC with transition matrix [p(yj|xk)] and can easily be programmed on a personal computer. To begin with, let us rewrite the mutual information between the input and output of the channel in the following form:

I(x, y) = H(x) − H(x|y) = −Σ_{k=1}^{K} p(xk) log p(xk) + Σ_{k=1}^{K} p(xk) ak          (9.37)

where

ak = Σ_{j=1}^{J} p(yj|xk) log p(xk|yj)          (9.38)

is a function of both the channel matrix and the input distribution P = {p(x1), ..., p(xK)}. It will be assumed throughout the section that the input distribution P is strictly positive; that is, p(xk) > 0 for 1 ≤ k ≤ K. Under this condition

. . . . .......... ....

the conditional probabilities p(xk|yj) can always be defined, and consequently the coefficients ak exist for all k. We define A = {a1, a2, ..., aK} for simplicity of future reference.
To calculate the capacity C, one must find an input distribution P* for which I(x, y) is maximum. The Arimoto-Blahut algorithm achieves this goal by starting with an arbitrary distribution P^(1) and calculating a sequence of distributions P^(1), P^(2), ..., P^(n), ... for which the corresponding sequence of I(x, y)'s, that is, I^(1)(x, y), I^(2)(x, y), ..., I^(n)(x, y), ..., converges to the capacity. The algorithm is iterative and P^(n+1) can be directly calculated from P^(n). The initial distribution P^(1) must be strictly positive, but is otherwise arbitrary. In general, a good choice for P^(1) is the uniform distribution p(x1) = ... = p(xK) = 1/K.

To obtain P^(n+1) from P^(n), one calculates A = {a1, ..., aK} from Eq. (9.38). Since A thus obtained depends on P^(n), it will be referred to as A^(n). With A fixed at A^(n), the right side of Eq. (9.37) is maximized over P, and the distribution that achieves the maximum is chosen as P^(n+1). The maximum value of the expression on the right side of Eq. (9.37), which is obtained by setting P = P^(n+1), is denoted Ī^(n), while the value of I(x, y) at the nth iteration, which is obtained by setting P = P^(n), is denoted I^(n)(x, y). Therefore,

I^(n)(x, y) = −Σ_{k=1}^{K} P^(n)(xk) log P^(n)(xk) + Σ_{k=1}^{K} P^(n)(xk) ak^(n)          (9.39)

Ī^(n) = −Σ_{k=1}^{K} P^(n+1)(xk) log P^(n+1)(xk) + Σ_{k=1}^{K} P^(n+1)(xk) ak^(n)          (9.40)

I^(n+1)(x, y) = −Σ_{k=1}^{K} P^(n+1)(xk) log P^(n+1)(xk) + Σ_{k=1}^{K} P^(n+1)(xk) ak^(n+1)          (9.41)

We now show how P^(n+1) can be derived in terms of P^(n). Let Q = {q1, ..., qK} be defined in terms of A = {a1, ..., aK} in the following way:

qk = e^{ak} / Σ_{k} e^{ak}          (9.42)

Obviously, qk > 0 and Σ_k qk = 1. The right side of Eq. (9.37) can then be written as

−Σ_k p(xk) log p(xk) + Σ_k p(xk) ak = Σ_k p(xk) log( qk / p(xk) ) + log( Σ_k e^{ak} ) ≤ log( Σ_k e^{ak} )          (9.43)

In Eq. (9.43), equality holds iff p(xk) = qk for all k (that is, the maximizing distribution is Q). Therefore, P^(n+1) = Q and Ī^(n) = log( Σ_k e^{ak^(n)} ); the values of ak are obtained from Eq. (9.38) with P = P^(n).
If at some stage in the computation P^(n+1) becomes equal to P^(n), the algorithm has come to successful completion, because now P^(n) is the distribution that maximizes the mutual information. In this case, C = I^(n)(x, y). We are thus interested in situations where P^(n+1) ≠ P^(n) for all n. In such cases the inequality in Eq. (9.43) is strict and we have

I^(n)(x, y) < Ī^(n)          (9.44)

We now prove that the sequence {I^(n)(x, y)} is strictly increasing. This is done by proving the following inequality:

I^(n+1)(x, y) ≥ Ī^(n)          (9.45)
Using Eqs. (9.40) and (9.41) and replacing for ak from Eq. (9.38), we have

Ī^(n) − I^(n+1)(x, y) = Σ_{k=1}^{K} P^(n+1)(xk) [ ak^(n) − ak^(n+1) ]
   = Σ_{k=1}^{K} Σ_{j=1}^{J} P^(n+1)(xk) p(yj|xk) log [ P^(n)(xk|yj) / P^(n+1)(xk|yj) ]
   ≤ (log e) Σ_{k=1}^{K} Σ_{j=1}^{J} P^(n+1)(xk) p(yj|xk) [ P^(n)(xk|yj) / P^(n+1)(xk|yj) − 1 ]
   = (log e) [ Σ_{j=1}^{J} P^(n+1)(yj) Σ_{k=1}^{K} P^(n)(xk|yj) − 1 ] = 0

This proves the validity of Eq. (9.45) and consequently the claim that the sequence {I^(n)(x, y)} is strictly increasing. Since the sequence is bounded from above by the channel capacity C, it must have a limit C̃ that is less than or equal to C. Our task is now reduced to proving that C̃ is in fact equal to C.
Lemma. Define δ^(n) = C − Ī^(n), where C is the capacity, and denote the input distribution that achieves capacity by P* = {P*(x1), ..., P*(xK)}. Then

δ^(n) ≤ Σ_{k=1}^{K} P*(xk) ln [ P^(n+1)(xk) / P^(n)(xk) ]          (9.46)

Proof. From Eq. (9.42) and the discussion following Eq. (9.43), we can write

Ī^(n) = ln( Σ_k e^{ak^(n)} ) = ln [ e^{ak^(n)} / P^(n+1)(xk) ] = ak^(n) − ln P^(n+1)(xk)          (9.47)

Replacing for ak^(n) from Eq. (9.38) yields

Ī^(n) = Σ_{j=1}^{J} p(yj|xk) ln [ P^(n)(xk) p(yj|xk) / P^(n)(yj) ] − ln P^(n+1)(xk)
      = Σ_{j=1}^{J} p(yj|xk) ln [ P^(n)(xk) p(yj|xk) / ( P^(n+1)(xk) P^(n)(yj) ) ]          (9.48)

Equation (9.48) is valid for all k. Thus we can write

δ^(n) = C − Ī^(n) = Σ_{k=1}^{K} P*(xk) Σ_{j=1}^{J} p(yj|xk) ln [ p(yj|xk) / P*(yj) ] − Ī^(n)
      = Σ_{k=1}^{K} P*(xk) ln [ P^(n+1)(xk) / P^(n)(xk) ] + Σ_{j=1}^{J} P*(yj) ln [ P^(n)(yj) / P*(yj) ]
      ≤ Σ_{k=1}^{K} P*(xk) ln [ P^(n+1)(xk) / P^(n)(xk) ]          (9.49)

where the last step follows because the second sum on the preceding line is nonpositive. This establishes Eq. (9.46).

Theorem. The limit of I^(n)(x, y) as n → ∞ is C.

Proof. We prove that δ^(n) → 0 as n → ∞ (i.e., the limit of Ī^(n) is C). In light of inequalities (9.44) and (9.45), this is equivalent to the statement of the theorem.
{δ^(n)} is a positive, monotonically decreasing sequence. If it does not converge to zero it will converge to some positive value, and consequently Σ_{n=1}^{∞} δ^(n) will be infinite. But this is impossible because

Σ_{n=1}^{N} δ^(n) ≤ Σ_{k=1}^{K} P*(xk) ln [ P^(N+1)(xk) / P^(1)(xk) ] ≤ Σ_{k=1}^{K} P*(xk) ln [ P*(xk) / P^(1)(xk) ]          (9.50)

and the right side is independent of N. This completes the proof of the theorem.

Finally, we remark on the speed of convergence. Assuming the initial distribution P^(1) to be uniform, Eq. (9.50) can be written as

Σ_{n=1}^{N} δ^(n) ≤ ln K − H*(x) ≤ ln K          (9.51)

But {δ^(n)} is a positive, monotonically decreasing sequence; therefore,

Σ_{n=1}^{N} δ^(n) ≥ N δ^(N)          (9.52)

Combining Eqs. (9.51) and (9.52), we obtain

δ^(N) ≤ (ln K)/N          (9.53)

The inequality in Eq. (9.53) provides a simple criterion for terminating the algorithm. A more sophisticated criterion is discussed in Problem 9.6.
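A direct implementation of the iteration described above fits in a few lines. The sketch below is an illustrative implementation, not the authors' code: it starts from the uniform distribution, applies the update P^(n+1) = Q of Eq. (9.42) with the coefficients ak of Eq. (9.38), and stops when the a priori bound of Eq. (9.53) indicates the desired accuracy. The binary symmetric channel used as a test case is an arbitrary assumption. Natural logarithms are used, so the capacity is obtained in nats (divide by ln 2 for bits).

```python
# Sketch: Arimoto-Blahut algorithm for the capacity of a DMC (natural logs, nats).
from math import log, exp

def capacity(channel, tol=1e-4):
    """channel[k][j] = p(y_j | x_k).  Returns (C_estimate, P_estimate)."""
    K = len(channel)
    P = [1.0 / K] * K                       # uniform initial distribution
    n, I_bar = 0, 0.0
    while n == 0 or log(K) / n > tol:       # stopping rule based on Eq. (9.53)
        # output distribution under the current input distribution
        Py = [sum(P[k] * channel[k][j] for k in range(K))
              for j in range(len(channel[0]))]
        # a_k = sum_j p(y_j|x_k) ln p(x_k|y_j), Eq. (9.38)
        a = [sum(pj * log(P[k] * pj / Py[j])
                 for j, pj in enumerate(channel[k]) if pj > 0)
             for k in range(K)]
        Z = sum(exp(ak) for ak in a)
        P = [exp(ak) / Z for ak in a]       # P^(n+1) = Q, Eq. (9.42)
        I_bar = log(Z)                      # running value of I-bar^(n)
        n += 1
    return I_bar, P

bsc = [[0.9, 0.1], [0.1, 0.9]]              # assumed example: BSC with epsilon = 0.1
C, Pstar = capacity(bsc)
print(C / log(2), Pstar)                    # ~0.531 bits/use, input ~(0.5, 0.5)
```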

PROBLEMS
9.1. One method of obtaining the entropy for a discrete stationary source
U N . is to consider

U ,U 2 ' . .


(a) Show that H_{N|N}(u) is nonincreasing with N.
(b) Prove that, in the limit N → ∞, H_{N|N}(u) approaches H(u).
9.2. Let x and y be continuous random variables with joint density p(x, y) and individual (marginal) densities p(x) and p(y). The average mutual information between x and y is defined as follows:

$$I(x, y) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$

(a) Prove that, in general, I(x, y) ≥ 0, with equality iff x and y are independent.
(b) Assuming that H(x), the differential entropy of x, exists, is it always true that I(x, y) ≤ H(x)?
9.3. Let x and z be independent Gaussian random variables with E(x) = E(z) = 0 and var(x) = σ_x², var(z) = σ_z², and let y = x + z. What is the average mutual information between x and y?
9.4. The input x to a channel has a uniform distribution on the interval (0, 2π). The noise z is independent of x with an arbitrary distribution p(z). The output y is the sum modulo 2π of x and z (i.e., the addition takes place on the unit circle). Determine I(x, y) in terms of p(z) and evaluate it in the special case where z is uniformly distributed on the interval (α, β), with β − α ≤ 2π.
9.5. A channel has an input ensemble X consisting of the numbers +1 and −1, used with the probabilities p(+1) = p(−1) = ½. The output y is the sum of the input x and an independent noise random variable z with the probability density p(z) = ¼ for −2 < z ≤ 2 and p(z) = 0 elsewhere.
(a) Find and sketch the output probability density for the channel.
(b) Find I(x, y).
(c) Suppose the output is transformed into a discrete processed output u defined by u = 1 for y > 1, u = 0 for −1 < y ≤ 1, and u = −1 for y ≤ −1. Find I(x, u) and interpret your result.

9.6. Let I(x_k, y) be the conditional mutual information between the input and output of a DMC for an arbitrary input distribution P = {p(x_1), ..., p(x_K)}. Prove that

$$\max_{k}\, I(x_k, y) \ge C$$

This provides a way to terminate the Arimoto-Blahut algorithm by calculating the quantity

$$\max_{k}\, I(x_k, y) - I(X, Y)$$

at each iteration and stopping the search when this value is sufficiently small.
[Hint: Let P_1 and P_2 be two arbitrary input distributions and show that

$$I_2(X, Y) \le \sum_{k=1}^{K} P_2(x_k)\,I_1(x_k, y)$$

where the subscripts indicate the input distribution under which each quantity is computed, by considering the partial derivative of I(X, Y) with respect to P_1.]
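The stopping quantity of Problem 9.6 is easy to evaluate numerically. The short sketch below is an illustration only; the channel matrix Q and the input distribution P are arbitrary choices, not taken from the text. It computes max_k I(x_k, y) − I(X, Y) for one non-optimal input; by the result of the problem this quantity is an upper bound on C − I(X, Y), so it can serve as the termination test described above.

import numpy as np

Q = np.array([[0.7, 0.2, 0.1],       # p(y_j | x_k) for a hypothetical 2-input, 3-output DMC
              [0.1, 0.3, 0.6]])
P = np.array([0.8, 0.2])             # an arbitrary (non-optimal) input distribution

q = P @ Q                            # output probabilities p(y_j)
cond_I = np.sum(Q * np.log(Q / q), axis=1)   # conditional mutual informations I(x_k, y), in nats
I = float(P @ cond_I)                # average mutual information I(X, Y)
print(cond_I.max() - I)              # the termination quantity; about 0.43 here, and never negative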

APPENDIX

We prove that, for N sufficiently large and a < ½, the following inequality is always valid:

$$\sum_{k=0}^{aN}\binom{N}{k} < 2^{NH(a)}$$

where aN = K, an integer, and H(a) = −a log₂ a − (1 − a) log₂(1 − a).

Proof. The following inequalities are easily verified:

$$\binom{N}{K-1} = \frac{N!}{(K-1)!\,(N-K+1)!} = \binom{N}{K}\cdot\frac{K}{N-K+1} < \binom{N}{K}\cdot\frac{a}{1-a}$$

$$\binom{N}{K-2} = \frac{N!}{(K-2)!\,(N-K+2)!} = \binom{N}{K}\cdot\frac{K(K-1)}{(N-K+1)(N-K+2)} < \binom{N}{K}\left(\frac{a}{1-a}\right)^{2}$$

and so on. Therefore, using the fact that a < ½, we write

$$\sum_{k=0}^{K}\binom{N}{k} < \binom{N}{K}\sum_{i=0}^{K}\left(\frac{a}{1-a}\right)^{i} < \binom{N}{K}\sum_{i=0}^{\infty}\left(\frac{a}{1-a}\right)^{i} = \binom{N}{K}\cdot\frac{1-a}{1-2a}$$

Using Stirling's formula, we derive a bound on $\binom{N}{K}$ as follows:

$$\binom{N}{K} = \frac{N!}{K!\,(N-K)!} \le \frac{N^{N}e^{-N}\sqrt{N}\,e}{K^{K}e^{-K}\sqrt{K}\,(N-K)^{N-K}e^{-(N-K)}\sqrt{N-K}\;e^{7/4}} < \sqrt{\frac{N}{K(N-K)}}\left(\frac{N}{K}\right)^{K}\left(\frac{N}{N-K}\right)^{N-K}$$

Since K = aN, the right side is

$$\frac{1}{\sqrt{Na(1-a)}}\cdot a^{-Na}(1-a)^{-N(1-a)} = \frac{2^{N[-a\log_2 a-(1-a)\log_2(1-a)]}}{\sqrt{Na(1-a)}} = \frac{2^{NH(a)}}{\sqrt{Na(1-a)}}$$

Therefore,

$$\sum_{k=0}^{K}\binom{N}{k} < \frac{1-a}{(1-2a)\sqrt{Na(1-a)}}\cdot 2^{NH(a)}$$

Since a is a constant, for sufficiently large N the coefficient of 2^{NH(a)} on the right side will be less than unity. Consequently,

$$\sum_{k=0}^{K}\binom{N}{k} < 2^{NH(a)}$$

The proof is now complete.
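As a quick numerical check of this bound (not part of the proof itself), the Python snippet below compares Σ_{k=0}^{aN} (N choose k) with 2^{NH(a)} for the arbitrarily chosen value a = 1/4 and a few values of N for which aN is an integer; the ratio of the two sides stays below one and shrinks roughly like 1/√N, as the coefficient derived above suggests.

from math import comb, log2

def binary_entropy(a):
    # H(a) = -a log2(a) - (1 - a) log2(1 - a)
    return -a * log2(a) - (1 - a) * log2(1 - a)

a = 0.25
for N in (40, 80, 160, 320):                     # chosen so that aN is an integer
    K = int(a * N)
    lhs = sum(comb(N, k) for k in range(K + 1))  # sum of binomial coefficients up to aN
    rhs = 2 ** (N * binary_entropy(a))           # 2^{N H(a)}
    print(N, lhs < rhs, lhs / rhs)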

BIBLIOGRAPHY

1. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana (1964).
2. N. Abramson, Information Theory and Coding, McGraw-Hill, New York (1963).
3. R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York (1968).
4. R. Ash, Information Theory, Wiley-Interscience, New York (1965).
5. R. J. McEliece, The Theory of Information and Coding, Addison-Wesley, Reading, Mass. (1977).
6. R. W. Hamming, Coding and Information Theory, Prentice-Hall, Englewood Cliffs, N.J. (1980).
7. T. Berger, Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, N.J. (1971).
8. D. Slepian, ed., Key Papers in the Development of Information Theory, IEEE Press, New York (1974).
9. B. M. Fitingof, "Optimal Coding in the Case of Unknown and Changing Message Statistics," Problemy Peredachi Informatsii, 2, 3-11 (1966).
10. B. M. Fitingof, "The Compression of Discrete Information," Problemy Peredachi Informatsii, 3, 28-36 (1967).
11. A. N. Kolmogorov, "Three Approaches to the Quantitative Definition of Information," Problemy Peredachi Informatsii, 1, 3-11 (1965).
12. L. D. Davisson, "Universal Noiseless Coding," IEEE Trans. Info., 19, 783-795 (1973).
13. S. Arimoto, "An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels," IEEE Trans. Info., 18, 14-20 (1972).
14. R. Blahut, "Computation of Channel Capacity and Rate Distortion Functions," IEEE Trans. Info., 18, 460-473 (1972).

INDEX

Arimoto, S., 137
Arimoto-Blahut algorithm, 137, 141
Ash, R., 96, 99, 103, 110, 111
Binary erasure channel, 78
Binary symmetric channel (BSC), 52, 58
Bits, 12
Blahut, R., 137
  (See also Arimoto-Blahut algorithm)
Block encoder
  with fixed-length alphabet, 18
  with variable-length alphabet, 19
BSC, 52
Capacity
  of continuous channel, 135
  of discrete memoryless channel, 57
  of encoder with variable-length alphabet, 21, 22
Channel
  binary erasure, 78
  binary symmetric, 52, 58
  capacity, 49, 57, 64, 66
  coding, 12
  deterministic, 53, 59, 68
  discrete memoryless, 49
  equivocation, 71
  lossless, 53, 59, 67
  matrix, 51
  useless, 53, 59
Chebyshev inequality, 4, 9, 131
Check digits, 103
Code book (or table), 98
Conditional entropy, 98
Conditional mutual information, 63, 76
Constraint length, 114

Continuous channels, 125, 135
  capacity of, 135
Continuous messages, 125
Convex
  cap (∩), 7
  cup (∪), 7
  function, 7
  set, 6
Convolutional code, 113
  constraint length of, 114
  maximum likelihood decoding of, 116
  truncated, 115
Convolutional encoder, 113, 119
  ladder diagram for, 115
  state diagram for, 114
Corrector, 108
Correlation, 4
Coset, 107
Coset leader, 107
D-admissible, 84
Data compression, 17, 86
Decoding set, 106
Differential entropy, 126, 132
Discrete memoryless
  channel (DMC), 49
  source (DMS), 17
Distortion
  measure, 83
  average, 84, 88
  single-letter, 84
  (See also Rate-distortion)
DMC, 49
DMS, 17
Double-error-correcting code, 111

Entropy, 11
  conditional, 50
  differential, 126, 132
  grouping property of, 15
  (See also Quasi-entropy)
Equivocation, 54, 57
Ergodicity, 125, 130
Error-correcting code
  linear convolutional, 113
  linear block or parity check, 98
  power of, 96, 110
Error-pattern, 97
Error-pattern vector, 107
Fano
  inequality (or bound), 70, 77, 94
  (See also Shannon-Fano code)
Fidelity criterion, single-letter, 83
Fitingof, B. M., 41
Frequency vector, 43
Gallager, R. G., 26, 27, 127
Gaussian random variable, 133
Generator matrix, 105, 106, 121
Gilbert (See Varsharmov-Gilbert bound)
Group code, 104
Hamming
  code, 86
  distance, 72, 75, 96
  lower bound, 111
  upper bound, 98
Huffman code
  binary, 31
  generalization, 34
  geometrical interpretation, 33
Independence, 4
Information
  average or average mutual or mutual, 51
  conditional mutual, 63, 76
Information digits, 103
Jensen inequality, 6, 44
Joint random variable, 3
Kraft inequality, 27, 28, 29, 30, 56
Ladder diagram, 115
Law of large numbers, weak, 4, 136
Likely sequence, 17, 42
Majority rule, 71
Maximum likelihood (ML), 72, 75, 116
McEliece, R. J., 115, 116, 117, 118

McMillan (See Shannon-McMillan theorem)
Nats, 12
Noisy channel, 49
  converse theorem, 63, 77
  theorem, 63, 72, 73, 75
Observer, 54, 71
Optimum decision scheme, 72
Parity check codes, 98
  construction, 103
  decoder, 108
  matrix, 100
Power of error-correcting codes, 96, 110
Prefix condition, 26, 28
Prefix-free codes, 27, 28, 56
Probability experiment, 1
Quasi-entropy, 43
Random coding, 73, 76, 90, 95
Random variable(s)
  average, 2
  continuous, 132
  correlation, 4
  discrete, 2
  Gaussian, 133
  independent, 4
  joint, 3
  variance, 2
Rate-distortion
  converse theorem, 92
  function, 83, 84, 87, 88
  fundamental theorem, 86, 90
Rate of transfer of information, 54
Reduced channel, 68
Reduction
  elementary, 68
  sufficient, 69
Reproducing alphabet, 84
Run-length coding, 37
Sequential decoding, 119
Shannon, C. E., 12, 41, 95
Shannon-Fano code, 29, 33, 42, 129
Shannon-McMillan theorem, 125, 129
Single-error-correcting code, 86, 113
Source
  discrete memoryless, 17
  discrete stationary, 125, 126
  ergodic, 129
Source coding, 12


  fixed-length, 18
  variable-length, 25
  with fidelity criterion, 83
State diagram, 114
Stationary source, discrete, 125, 126
  entropy of, 127
  ergodic, 129
Stirling's formula, 7, 45, 143
Stochastic process, 126
Sufficient reduction, 69
Syndrome, 108


Threshold decoding, 119
Transition matrix (See Channel matrix)
Triangle inequality, 97
Typical sequence (See Likely sequence)
Unique decodability, 26
Universal coding, 41, 45
Variant of matrix, 101
Varsharmov-Gilbert bound, 112
Viterbi algorithm, 116
Weight of code-word, 110
