
6

RELIABLE MESSAGES THROUGH UNRELIABLE CHANNELS
6-1. Introduction

In Chapter 6 we shall prove Shannon's second theorem, the most surprising as well as the most important single result of information theory. Because of the significance of this theorem, it would be well to stand back and survey the main results we have already obtained. We have been able to justify our use of entropy and entropy-derived measures of information in two instances: Shannon's first theorem (Section 4-3) and the generalization of this theorem dealing with equivocation (Section 5-5). Shannon's first theorem
provided us with a yardstick with which to measure the information emerging from a source. Using this theorem, we were able to rate symbols from a source in terms of an equivalent number of binits (or r-ary symbols) necessary to represent these symbols. The generalization of this theorem showed that we could use a quantity related to entropy (equivocation) as a yardstick to measure the results of transmitting information through a channel.

In order to encode symbols from a source alphabet A, we know that we must provide, on the average, H(A) binits per source symbol. If the symbols from A are transmitted through a channel, however, and we are allowed to observe the output symbols from alphabet B, we need only H(A/B) binits per A symbol in order to represent these input symbols. Hence, the reception of the channel outputs has provided us with H(A) - H(A/B) binits in the sense just described. The equivocation H(A/B) can vary from H(A) (when the channel input and output are statistically independent) to zero (when the channel is noiseless). As it does, the number of binits per A symbol that we receive varies from zero to H(A).

Transmitting H(A) - H(A/B) binits of information is an achievement of some importance. Still, the form in which these binits appear at the output of our information channel leaves much to be desired. Let us examine this point in greater detail. Assume that we transmit a block of n symbols from a source A over an information channel. Then, if our channel is noiseless, H(A/B) is zero, and each output symbol contains H(A) bits of information; we may reconstruct the sequence of n channel inputs from the sequence of n channel outputs, and it is clear that we have received H(A) bits of error-free information. If our channel is not noiseless, however, the equivocation will not, in general, be zero, and each output symbol will contain only H(A) - H(A/B) bits of information. Furthermore, note the crucial way in which this information differs from the information out of a noiseless channel. We cannot reconstruct the channel input sequence perfectly from a knowledge of the channel output sequence. All we can say is that we can encode the channel inputs, using H(A) - H(A/B) fewer binits per symbol if we know the outputs. Hence, although we get information through the channel, we do not have error-free knowledge of the message transmitted. This is the fly in the ointment, which we shall remove by Shannon's second theorem.
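The quantity H(A) - H(A/B) is easy to compute for any small channel. The sketch below is an illustration we add (not part of the original text); it evaluates these quantities for the BSC of (6-1) under an assumed equiprobable input distribution, and the variable names are ours.

```python
# A minimal sketch (not from the text): H(A), the equivocation H(A/B), and the
# information H(A) - H(A/B) received per symbol, for the BSC of (6-1) with an
# assumed equiprobable input alphabet.
from math import log2

p_a = [0.5, 0.5]                         # assumed a priori probabilities P(a)
p_b_given_a = [[0.9, 0.1],               # channel matrix of (6-1); rows are P(b/a)
               [0.1, 0.9]]

p_ab = [[p_a[i] * p_b_given_a[i][j] for j in range(2)] for i in range(2)]
p_b = [sum(p_ab[i][j] for i in range(2)) for j in range(2)]

H_A = -sum(x * log2(x) for x in p_a)
H_A_given_B = -sum(p_ab[i][j] * log2(p_ab[i][j] / p_b[j])
                   for i in range(2) for j in range(2))

print(H_A, H_A_given_B, H_A - H_A_given_B)   # about 1.000, 0.469, 0.531 bits
```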

Shannon's second theorem was first published in 1948. The birth of information theory may be dated from Shannon's publication of this fundamental paper. Nevertheless, Shannon's original proof of this theorem contained certain difficulties (McMillan, 1953). The first rigorous proof of Shannon's second theorem was presented by Feinstein in 1955. Subsequently, different proofs were provided by Shannon (1957a); Blackwell, Breiman, and Thomasian (1959); and Fano (1961). The proof presented in this chapter is somewhat simpler than any of the proofs mentioned above.

6-2. Error Probability and Decision Rules

Shannon's second theorem deals with the amount of error-free information we can get through a channel. In order to appreciate more fully the significance of this theorem, let us look into the question of the error probability of a channel. For some of the simple channels we have used, such as the BSC and the rSC, it is intuitively clear what it is reasonable to call the error probability of the channel. Nevertheless, we shall see that, even in these cases, the error probability will depend upon a factor not yet considered in our study of information channels. For example, consider the BSC

\[
\begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix} \tag{6-1}
\]

Ordinarily we would say that the probability of error of this channel is 0.1. Note, however, that this statement depends upon the assumption that the channel is used in a "reasonable" manner. If the receiver were to examine the channel output and decide that a one was sent when a zero is received and vice versa, the probability of error would be 0.9. Of course, this is not a reasonable way to use such a channel, but it is a possibility we must consider. The error probability depends upon how the receiver interprets the output symbols of the channel.

To bring out this point in a more meaningful case, take the channel

\[
\begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.3 & 0.5 \\ 0.3 & 0.3 & 0.4 \end{bmatrix} \tag{6-2}
\]
This channel has three inputs a1, a2, a3 and three outputs b1, b2, b3. When a particular channel output is received, which input should we say was sent? This question prompts the following definition.

Definition. Consider a channel with an r-symbol input alphabet A = {a_i}, i = 1, 2, . . . , r, and an s-symbol output alphabet B = {b_j}, j = 1, 2, . . . , s. A decision rule, d(b_j), is any function specifying a unique input symbol for each output symbol.
Example 6-1. Two possible decision rules for the channel of (6-2) are
\[
d(b_1) = a_1 \qquad d(b_2) = a_1 \qquad d(b_3) = a_2 \tag{6-3}
\]
and
\[
d(b_1) = a_1 \qquad d(b_2) = a_2 \qquad d(b_3) = a_3 \tag{6-4}
\]
For a channel with r inputs and s outputs, there are r^s different possible decision rules. The question which prompted our definition of a decision rule may, therefore, be rephrased as "Which one of these r^s decision rules should we use?" The answer to this question will depend upon what we are trying to accomplish, but a reasonable goal for our purposes is the minimization of the error probability of the channel. Hence, we seek a decision rule which minimizes the error probability of our channel. To find such a rule, we calculate the probability of error P_E. This probability may be written as the average of P(E/b_j), the conditional probability of error given that the output of the channel is b_j,

\[
P_E = \sum_{B} P(E/b)P(b) \tag{6-5}
\]

Equation (6-5) expresses the error probability as a sum of nonnegative terms. Therefore, in order to minimize P_E by choice of a decision rule d(b_j), we may select d(b_j) to minimize each term in the sum separately. P(b_j) does not depend upon the decision rule we use; so it is equivalent to choose d(b_j) to minimize the conditional probability of error P(E/b_j).

For a fixed decision rule, d(b_j) = a_i,

\[
P(E/b_j) = 1 - P[d(b_j)/b_j] \tag{6-6}
\]
where, since our decision rule is fixed, P[d(b_j)/b_j] is the backward probability P(a_i/b_j). Finally, in order to minimize (6-6) for each b_j, we choose

\[
d(b_j) = a^* \tag{6-7a}
\]

where a^* is defined by

\[
P(a^*/b_j) \geq P(a_i/b_j) \qquad \text{for all } i \tag{6-7b}
\]

In other words, the channel error probability is minimized if we use that decision rule which chooses for each output symbol the input symbol with the highest probability. This decision rule is sometimes called a conditional maximum-likelihood decision rule. The conditional maximum-likelihood decision rule depends upon the a priori probabilities P(a_i). We may use Bayes' law to write (6-7b) as

\[
\frac{P(b_j/a^*)P(a^*)}{P(b_j)} \geq \frac{P(b_j/a_i)P(a_i)}{P(b_j)} \qquad \text{for all } i \tag{6-8}
\]

Hence, when the a priori probabilities are all equal, the conditional maximum-likelihood decision rule may be written

\[
d(b_j) = a^* \tag{6-9a}
\]

where

\[
P(b_j/a^*) \geq P(b_j/a_i) \qquad \text{for all } i \tag{6-9b}
\]

The decision rule defined by (6-9b) is known as the maximum-likelihood decision rule. The maximum-likelihood decision rule does not depend upon the a priori probabilities. When the a priori probabilities are all equal, the maximum-likelihood decision rule results in a minimum value for the probability of error. Even if the a priori probabilities are not equal, however (or even if these probabilities are unknown), we may still employ this decision procedure; in such cases, of course, we are not assured of obtaining a minimum value of the channel error probability.
Example 6-2. From (6-9) we may immediately write a maximum-likelihood decision rule for the channel of (6-2). Such a rule is
\[
d(b_1) = a_1 \qquad d(b_2) = a_1 \qquad d(b_3) = a_2
\]
Note that this rule is not unique. There are, in fact, three maximum-likelihood decision rules for this channel.

The error probability using any given decision rule is easily obtained with the aid of (6-5) and (6-6).

\[
P_E = \sum_{B} P(E/b)P(b) = \sum_{B} P(b) - \sum_{B} P[d(b)/b]P(b) = 1 - \sum_{B} P[d(b), b] \tag{6-10}
\]

The terms in the summation of (6-10) are just the joint probabilities that d(b_j) = a^* is transmitted and b_j is received (for each j). Hence, defining \bar{P}_E = 1 - P_E, we may rewrite (6-10) as

\[
\bar{P}_E = \sum_{B} P(a^*, b) \tag{6-11}
\]

Since

\[
\sum_{A,B} P(a, b) = 1 \tag{6-12}
\]

we may also rewrite (6-10) as

\[
P_E = \sum_{B,\,A-a^*} P(a, b) \tag{6-13}
\]

The notation \sum_{A-a^*} is meant to indicate a sum over all members of the A alphabet except d(b_j) = a^*. An alternative way of writing (6-13) is

\[
P_E = \sum_{B,\,A-a^*} P(b/a)P(a) \tag{6-14}
\]

If the a priori probabilities P(a) are all equal, then (6-14) becomes

\[
P_E = \frac{1}{r} \sum_{B,\,A-a^*} P(b/a) \tag{6-15}
\]

Equation (6-15) is of some interest since (for the special case of equal a priori probabilities) it provides an expression for the channel error probability in terms of a sum over the elements of the channel matrix P(b/a). The sum is over all elements of the channel matrix, except that one term [corresponding to d(b_j)] is omitted from each column.
Example 6-3. We calculate the error probability of the channel used in Examples 6-1 and 6-2.

\[
\begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.2 & 0.3 & 0.5 \\ 0.3 & 0.3 & 0.4 \end{bmatrix} \tag{6-16}
\]

We assume that all three input symbols are chosen with equal probabilities and that a maximum-likelihood decision rule is used. (Recall that this rule results in a minimum P_E for equal a priori probabilities.)

\[
P_E = \tfrac{1}{3}[(0.2 + 0.3) + (0.3 + 0.3) + (0.2 + 0.4)] = 0.567
\]
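The steps of Examples 6-2 and 6-3 are easy to mechanize. The following sketch is an illustration we add (not part of the original text); it builds a maximum-likelihood rule from the channel matrix of (6-2) and evaluates the error probability of (6-15), assuming equal a priori probabilities.

```python
# A minimal sketch (assumed, not from the text): construct a maximum-likelihood
# decision rule d(b_j) from the channel matrix of (6-2) and evaluate (6-15)
# for equal a priori probabilities.
channel = [[0.5, 0.3, 0.2],      # rows: P(b_j / a_i) for inputs a1, a2, a3
           [0.2, 0.3, 0.5],
           [0.3, 0.3, 0.4]]
r = len(channel)                 # number of inputs
s = len(channel[0])              # number of outputs

# maximum-likelihood rule: for each output b_j pick an input maximizing P(b_j/a_i)
d = [max(range(r), key=lambda i: channel[i][j]) for j in range(s)]

# (6-15): P_E = (1/r) * sum over b of sum over a != d(b) of P(b/a)
P_E = sum(channel[i][j] for j in range(s) for i in range(r) if i != d[j]) / r
print(d, P_E)    # d = [0, 0, 1], i.e. (a1, a1, a2); P_E = 0.5666...
```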

6-3. The Fano Bound

The error probability has been presented in the previous section without reference to entropy, equivocation, or mutual information. The purpose of Chapter 6 is to show a connection between these two separate sets of ideas. As a first step in this direction, we provide upper and lower bounds on the equivocation in terms of the error probability.

In what follows we shall make repeated use of (6-11) and (6-13),

\[
\bar{P}_E = \sum_{B} P(a^*, b) \tag{6-11}
\]
\[
P_E = \sum_{B,\,A-a^*} P(a, b) \tag{6-13}
\]

Using these two relationships, we construct the identity

\[
H(P_E) + P_E \log (r - 1) = P_E \log \frac{r - 1}{P_E} + \bar{P}_E \log \frac{1}{\bar{P}_E}
= \sum_{B,\,A-a^*} P(a, b) \log \frac{r - 1}{P_E} + \sum_{B} P(a^*, b) \log \frac{1}{\bar{P}_E} \tag{6-17}
\]

The equivocation H(A/B) may be written in terms of the same sort of summations:

\[
H(A/B) = \sum_{B,\,A-a^*} P(a, b) \log \frac{1}{P(a/b)} + \sum_{B} P(a^*, b) \log \frac{1}{P(a^*/b)} \tag{6-18}
\]


Subtracting (6-17) from (6-18) yields

\[
H(A/B) - H(P_E) - P_E \log (r - 1)
= \sum_{B,\,A-a^*} P(a, b) \log \frac{P_E}{(r - 1)P(a/b)} + \sum_{B} P(a^*, b) \log \frac{\bar{P}_E}{P(a^*/b)} \tag{6-19}
\]

Now we employ (2-2) to change the base of the logarithms in the right side of (6-19),

\[
(\log e)^{-1}[H(A/B) - H(P_E) - P_E \log (r - 1)]
= \sum_{B,\,A-a^*} P(a, b) \ln \frac{P_E}{(r - 1)P(a/b)} + \sum_{B} P(a^*, b) \ln \frac{\bar{P}_E}{P(a^*/b)} \tag{6-20}
\]

so that we may use the inequality

\[
\ln x \leq x - 1 \tag{6-21}
\]

on each term in the summations. The right side of (6-20) is less than or equal to

\[
\sum_{B,\,A-a^*} P(a, b)\left[\frac{P_E}{(r - 1)P(a/b)} - 1\right] + \sum_{B} P(a^*, b)\left[\frac{\bar{P}_E}{P(a^*/b)} - 1\right]
= \left[\frac{P_E}{r - 1}\sum_{B,\,A-a^*} P(b)\right] - P_E + \left[\bar{P}_E \sum_{B} P(b)\right] - \bar{P}_E = 0 \tag{6-22}
\]

and we have the inequality we seek,

\[
H(A/B) \leq H(P_E) + P_E \log (r - 1) \tag{6-23}
\]

This important inequality was first derived by Fano. It is valid, no matter what decision rule we use, although the error probability can depend drastically upon the decision rule. The form of the inequality suggests an interesting interpretation. Assume that we have some fixed decision rule. When we receive an output symbol, we need H(P_E) bits of information to specify whether our decision rule has produced an error. It produces an error with probability P_E, and then we may specify which of the remaining
r - 1 input symbols was sent with at most log (r - 1) bits. The above interpretation of (6-23), unfortunately, is not adequate as a proof, although it may be used as the basis of a proof somewhat different from the one we have used.

Let us examine the condition for equality in the Fano bound (6-23). The inequality

\[
\ln x \leq x - 1 \tag{6-21}
\]

becomes an equality if and only if x = 1. Using this condition in (6-23), we find that the Fano bound becomes an equality if and only if

\[
P(a/b) = \frac{P_E}{r - 1} \qquad \text{for all } b \text{ and } a \neq a^* \tag{6-24a}
\]
and
\[
P(a^*/b) = \bar{P}_E \qquad \text{for all } b \tag{6-24b}
\]
Since
\[
\sum_{A} P(a/b) = 1 \qquad \text{for all } b
\]
the second condition, (6-24b), follows from the first, (6-24a). Equation (6-24a) implies, for each b, that all input symbols except the one selected by our decision rule are equally probable. This condition thus serves to reinforce our interpretation of the Fano bound.
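As a numerical aside (our own illustration, not part of the original text), the bound (6-23) can be checked directly for the channel of (6-2) with equal a priori probabilities and the maximum-likelihood rule of Example 6-2.

```python
# A minimal numerical check (assumed, not from the text) of the Fano bound
# (6-23), H(A/B) <= H(P_E) + P_E log(r - 1), for the channel of (6-2) with
# equal a priori probabilities and the maximum-likelihood decision rule.
from math import log2

channel = [[0.5, 0.3, 0.2],
           [0.2, 0.3, 0.5],
           [0.3, 0.3, 0.4]]
r, s = 3, 3
p_a = [1 / 3] * r
p_ab = [[p_a[i] * channel[i][j] for j in range(s)] for i in range(r)]
p_b = [sum(p_ab[i][j] for i in range(r)) for j in range(s)]

d = [max(range(r), key=lambda i: channel[i][j]) for j in range(s)]     # ML rule
P_E = sum(p_ab[i][j] for j in range(s) for i in range(r) if i != d[j])

H_A_given_B = -sum(p_ab[i][j] * log2(p_ab[i][j] / p_b[j])
                   for i in range(r) for j in range(s))
H_PE = -P_E * log2(P_E) - (1 - P_E) * log2(1 - P_E)
fano_bound = H_PE + P_E * log2(r - 1)
print(H_A_given_B, fano_bound)      # about 1.52 <= 1.55, so (6-23) is satisfied
```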

6-4. Reliable Messages and Unreliable Channels

The object of Shannon's second theorem is to describe the fundamental limitations on the transmission of reliable error-free messages through an unreliable channel. Consider first the use of a BSC to transmit reliable messages (Figure 6-1).

Figure 6-1. A BSC.

To be even more specific, we assume that p, the probability of error of this BSC, is 0.01. Thus, 99 per cent of the binits transmitted are received correctly. For many modern data-transmission systems, however, this level of reliability is far from adequate.

Probability of error requirements of 10^-6, or even lower, are often necessary. In order to increase the reliability of our channel we may have to repeat our message several times. For example, suppose we decide to send each message (0 or 1) three times. One method of viewing this procedure is illustrated in Figure 6-2.

Figure 6-2. A method of increasing reliability. [The two messages 000 and 111 are sent over the (BSC)^3; the remaining six binary sequences are unused signals.]

The output of the channel under these circumstances is an element of the (BSC)^3, a binary sequence of length 3. The probability that no errors occur in the transmission of our three digits is (1 - p)^3 = p̄^3. The probability of just one error is 3p p̄^2; the probability of two errors is 3p^2 p̄; while the probability that all three binits will be received in error is p^3.

When p is less than 1/2 (i.e., whenever the probability that a binit is received correctly is greater than the probability that it is received incorrectly), it seems reasonable to decide that the message received was 000 or 111 according to the majority vote of the three received binits. This decision rule need not be justified on purely democratic grounds; it is easy to see that this is the maximum-likelihood decision rule. In any event, such a decision rule leads to a prob-

ability of interpreting the message in error,† P_E (equal to the sum of the probabilities of three binit errors and two binit errors), of

\[
P_E = p^3 + 3p^2\bar{p} \tag{6-25}
\]

For p = 0.01, this yields

\[
P_E \approx 3 \times 10^{-4} \tag{6-26}
\]

Thus we have been able to reduce the probability of error from a value of 10^-2 (when sending 0 or 1) to a value of 3 x 10^-4 (when sending 000 or 111). Having gone this far, it is not difficult to see how to increase the reliability even further. We may send five binits over the channel for each binary message we wish to transmit. This is represented in Figure 6-3.

Figure 6-3. A method of increasing reliability. [The two messages 00000 and 11111 are sent over the (BSC)^5; the remaining 30 binary sequences are unused signals.]

The probability of zero, one, two, three, four, or five binit errors in transmission is

\[
\bar{p}^5, \quad 5\bar{p}^4 p, \quad 10\bar{p}^3 p^2, \quad 10\bar{p}^2 p^3, \quad 5\bar{p}p^4, \quad p^5
\]

† The probability of message error will ordinarily depend upon the a priori message probabilities. Because of the symmetry in the situation we are describing, however, the error probability is independent of the a priori probabilities.
respectively. If we again use a majority rule (i.e., maximum likelihood) to decide whether 00000 or 11111 was transmitted, we obtain a probability of error

\[
P_E = p^5 + 5p^4\bar{p} + 10p^3\bar{p}^2 \tag{6-27}
\]

(i.e., the sum of the probabilities of five, four, or three binit errors). For p = 0.01, this yields

\[
P_E \approx 10^{-5} \tag{6-28}
\]

There is, of course, no limit to this crude method of increasing reliability. In Table 6-1 we give the probability of message error when 1, 3, 5, 7, 9, and 11 binits per message are used in a BSC with single-binit probability of error of p = 0.01.
Table 6-1. Probability of Message Error in a BSC

Binits per binary message    Probability of message error
1                            10^-2
3                            3 x 10^-4
5                            10^-5
7                            4 x 10^-7
9                            10^-8
11                           5 x 10^-10
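Table 6-1 can be checked directly from the binomial distribution of binit errors. The short sketch below is an illustration we add (not part of the original text), using majority-vote decoding; its output agrees with Table 6-1 in order of magnitude.

```python
# A minimal sketch (assumed, not from the text) reproducing Table 6-1: the
# message error probability of an n-binit repetition code over a BSC with
# p = 0.01, decoded by majority vote.
from math import comb

p = 0.01
for n in (1, 3, 5, 7, 9, 11):
    # a message error occurs when more than n/2 of the n binits are in error
    P_E = sum(comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range((n // 2) + 1, n + 1))
    print(n, P_E)   # about 1.0e-2, 3.0e-4, 9.9e-6, 3.4e-7, 1.2e-8, 4.4e-10
```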

The improvement displayed in Table 6-1 is not achieved without penalty. The price we pay for increased message reliability is increased redundancy in the transmitted binits. In other words, although we may decrease the probability of error from 10^-2 to 5 x 10^-10 by going from 1 binit per binary message to 11 binits per binary message, we must also decrease the message rate from 1 message per binit to 1/11 message per binit. In general, the simple repetitive method we have described can lead to an exchange of message rate for message reliability. Typically such an exchange will appear as plotted in Figure 6-4.

6-5. An Example of Coding to Correct Errors

An important question is suggested by Figure 6-4. The coding scheme that we have investigated so far, simple repetition, is the most obvious method of exchanging rate for reliability. Do there exist more sophisticated and more efficient methods of making this exchange? That is, for a given value of probability of message error in Figure 6-4, do there exist coding methods which give us a greater message rate than that achieved by simple repetition? The answer to this question is an emphatic "Yes!"

Figure 6-4. Exchange of rate for reliability in a BSC using repetition.

The answer to this question is provided by Shannon's second theorem (Section 6-10). It not only states that we can do better than is indicated in Figure 6-4, but it also tells us how much better we can do. Indeed the "how much better" answer provided by Shannon's theorem is the most amazing part of the story we have to tell. We have sketched this answer in Figure 6-5.

Shannon's second theorem tells us that for any message rate less than the channel capacity C, we can find codes such that the probability that a message will be in error is less than any positive number ε, i.e., as small as we want it. The theorem presents us with the unexpected result that it is not necessary to require the message
rate to approach 0 in order to achieve more and more reliable operation of the channel.

Figure 6-5. Two possible exchanges of rate for reliability in a BSC.

In Section 6-4 we discussed the possibility of achieving virtually error-free information transmission through an unreliable channel, the BSC. Let us now investigate the exchange of message rate for reliability a little more carefully. In the previous section we decreased our rate by simply repeating the binary message we wished to transmit. As indicated in Figures 6-1 and 6-2, this can be viewed as increasing the order of the channel extension we use and selecting only two of the possible extension input symbols, a_i, as messages. A more powerful method of varying the message rate, the method we shall employ in proving Shannon's second theorem, is to fix the order of the extension we use but to vary the number of channel input symbols, a_i, used as messages. Again we illustrate this for the case of the BSC in Figure 6-6.

Suppose we can transmit binary symbols through a BSC at the rate of one per second. Then the a_i, consisting of sequences of
3 binits, can be transmitted at the rate of one every 3 sec. If we select only the two sequences 000 and 111 as allowable messages, as shown in the previous section, we can obtain a probability of error

\[
P_E \approx 3 \times 10^{-4} \tag{6-29}
\]

while the message rate is 1/3 binit per sec. If, on the other hand, we use all eight of the a_i as messages, the probability that a message

Figure 6-6. Third extension of the BSC.

(not a binit) will be transmitted correctly is p̄^3. The probability of a message error then is 1 - p̄^3. For p = 0.01, this yields

\[
P_E \approx 3 \times 10^{-2} \tag{6-30}
\]

The message rate corresponding to this probability of error is 1 binit per sec. There are possibilities between these two extremes, of course. We might select four of the a_i as code words representing four equiprobable messages. For example, let the four messages be represented by

\[
000 \qquad 011 \qquad 101 \qquad 110 \tag{6-31}
\]

If these four a_i are chosen, we may use the maximum-likelihood decision rule† shown in Figure 6-7.

† As shown in Example 6-2, the maximum-likelihood rule is not unique. In this example, there are several maximum-likelihood rules, in addition to the one shown in Figure 6-7.

Figure 6-7. A maximum-likelihood decision rule. [Outputs 000 and 001 are decoded as 000; 010 and 011 as 011; 100 and 101 as 101; 110 and 111 as 110.]

Then the probability that a message will be interpreted correctly, P̄_E, is just the probability that the first two binits are transmitted without error, or

\[
\bar{P}_E = \bar{p}^2 \tag{6-32}
\]

For p = 0.01, this yields

\[
P_E \approx 2 \times 10^{-2} \tag{6-33}
\]

Since the four binary sequences we use correspond to two binary messages and we take 3 sec to transmit each message, the message rate is just 2/3 binit per sec. Comparing the results of selecting two, four, or eight messages from the eight possible inputs of the (BSC)^3, we see that, in general, the more messages we use, the higher the probability of a message error.

The nth extension of a channel with r input symbols has a total of r^n input symbols which we may use. If we use only M of these possibilities as messages, however, we may decrease the probability of error. The trick is to decrease the probability of error without requiring M to be so small that the message rate,† (log M)/n, becomes too small. Shannon's second theorem tells us that the probability of error can be made arbitrarily small as long as M is less than 2^{nC}.

† We measure the message rate in equivalent binary messages per symbol. Thus, sending one of M possible messages using n symbols is equivalent to sending log M binary messages in n symbols and a message rate of (log M)/n binary messages per symbol.
For this M, the message rate is

\[
\frac{\log M}{n} = C \tag{6-34}
\]

and the channel capacity is seen to correspond to the upper limit of the error-free message rate!

6-6. Hamming Distance

In Sections 6-7 and 6-8 we prove Shannon's second theorem for the special case of the BSC. We can take advantage of the binary nature of the input and output symbols to simplify this theorem for the BSC. Hamming has introduced the useful idea of the distance between two binary sequences (Hamming, 1950). The hamming distance between two binary sequences of the same length, α_i and β_j, is defined as the number of places in which α_i and β_j differ. For example, let

α_i = 101111
β_j = 111100

and let D(α_i, β_j) be the hamming distance between α_i and β_j. Then D(α_i, β_j) = 3.

The concept of hamming distance may be applied to the three different codes for the (BSC)^3 discussed in the previous section.

Table 6-2. Three Codes for the (BSC)^3

Code A    Code B    Code C
000       000       000
001       011       111
010       101
011       110
100
101
110
111

No. of messages M:   8    4    2

The code words of the three codes given in Table 6-2 may be shown as vertices of three-dimensional cubes. Then the hamming
distance between any two code points can be seen to be the number of steps we must take to go from one point to the other. Note that the minimum distances between code points in codes A, B, and C are 1, 2, and 3, respectively.

The minimum distance between code points of a code is closely related to the probability of error of the code. In general, the greater the minimum distance, the lower the error probability we can expect when using the code. Of course, the greater the minimum distance, the fewer code points we can pack onto the vertices of an n-dimensional cube. This is but another expression of an effect noted in the previous section. The advantages of being able to represent a large number of messages with a code must be weighed against the advantages of a low channel error probability when using that code.

Figure 6-8. Three codes for the (BSC)^3.
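A short sketch (our own addition, not part of the original text) computes the hamming distance and verifies the minimum distances 1, 2, and 3 of codes A, B, and C.

```python
# A minimal sketch (assumed, not from the text): the hamming distance and the
# minimum distances of codes A, B, and C of Table 6-2.
def hamming(x, y):
    """Number of places in which two equal-length binary strings differ."""
    return sum(a != b for a, b in zip(x, y))

codes = {
    "A": ["000", "001", "010", "011", "100", "101", "110", "111"],
    "B": ["000", "011", "101", "110"],
    "C": ["000", "111"],
}
for name, words in codes.items():
    d_min = min(hamming(u, v) for u in words for v in words if u != v)
    print(name, d_min)     # 1, 2, and 3, respectively
```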
Errors arising in the transmission of α_i, a sequence of n binits, over a (BSC)^n will cause the received sequence β_j to differ from the transmitted sequence. If D errors occur during transmission, the hamming distance between α_i and β_j will be D:

\[
D(\alpha_i, \beta_j) = D \tag{6-35}
\]

The average number of errors occurring in a block of n binits will be np, where p is the error probability of the BSC. Thus the average hamming distance between a transmitted and a received sequence is also np. The actual distance between a transmitted and a received sequence, of course, will rarely be equal to this average. Hence, we must look into the problem of determining which code word was transmitted when an output sequence β_j is received from the channel, i.e., which decision rule to use.

Throughout this chapter we may assume that our messages (and hence our code words) are equiprobable. In Section 6-2 we showed that the maximum-likelihood decision rule minimized the error probability when all possible inputs were equiprobable. We now show that the maximum-likelihood decision rule has a simple interpretation in terms of hamming distance. Let α_i be the transmitted code word and β_j be any possible channel output sequence. As before, let the hamming distance between these two binary sequences of length n be D. Then α_i and β_j differ in exactly D places, and the probability that β_j will be received if α_i is sent is just the probability that errors will occur in the D places where they differ and that no errors will occur in the n - D remaining places:

\[
P(\beta_j/\alpha_i) = p^D \bar{p}^{\,n-D} \tag{6-36}
\]

For p < 1/2 (the only case of any interest), P(β_j/α_i) decreases with increasing D. The farther β_j is from the transmitted binary sequence, the less likely it is to be received. The maximum-likelihood decision rule selects that code word which maximizes P(β_j/α_i); hence, for any received sequence β_j, the maximum-likelihood rule selects the code word closest to β_j in the hamming-distance sense.

Figure 6-9. Maximum-likelihood decision rule for the (BSC)^n.
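This equivalence suggests the usual implementation of the maximum-likelihood rule for the (BSC)^n: decode a received sequence to the nearest code word. The sketch below is our own illustration (not from the text); ties are broken arbitrarily, as the non-uniqueness of the rule allows.

```python
# A minimal sketch (assumed, not from the text): the maximum-likelihood rule for
# the (BSC)^n realized as minimum-hamming-distance decoding.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def ml_decode(received, code_words):
    # choose a code word closest to the received sequence; ties go to the
    # first such word in the list (any tie-breaking rule is maximum-likelihood)
    return min(code_words, key=lambda w: hamming(received, w))

code_C = ["000", "111"]                      # code C of Table 6-2
print(ml_decode("101", code_C))              # '111': distance 1 versus distance 2
code_B = ["000", "011", "101", "110"]        # code B of Table 6-2
print(ml_decode("010", code_B))              # '000': three code words tie at distance 1
```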

6-7. Shannon's Second Theorem for the BSC-The First Step

We now prove Shannon's second theorem for the special case of the BSC. The more general theorem valid for any zero-memory information channel with a finite number of symbols will be proved in Section 6-10.
Shannon's Second Theorem (Special Case)

Consider a BSC with error probability p, and hence capacity C = 1 - H(p). Let ε be an arbitrarily small positive number, and let M = 2^{n(C-ε)}. Then, for n sufficiently large, it is possible to select a subset of M code words (to represent M equiprobable messages) from the set of 2^n possible inputs to the (BSC)^n, such that the probability of error in decoding the channel output can be made as small as we wish.

In Figure 6-10 we have indicated the 2^n possible inputs and 2^n outputs of the nth extension of a BSC with probability of error p.

Figure 6-10. The nth extension of a BSC.

The inputs and outputs of this channel consist of sequences of n binary digits. In order to send M messages through this channel, we select M of the 2^n possible inputs as code words. In Section 6-5 we indicated how the probability of message error P_E increases as M increases. The question we must now answer is, "How many messages is it possible to send and still have the message error probability remain small?"

Of course, the answer to this question must depend upon how we select our input symbols for the messages. If we select the code words so that they cluster together, we can expect a higher probability of error than if we construct a code with the same number of code words more or less equally spaced from one another. The method of coding will be crucial to the probability of error we obtain and, therefore, to the maximum number of messages we may use. Let us defer this important question, however, and assume that somehow we have selected a code consisting of M code words of n binits each for use with the (BSC)^n.
When one of these code words, say α_0, is sent through the channel, some other binary sequence of length n, say β_j, is received (Figure 6-11).

Figure 6-11. The channel.

We know that the maximum-likelihood decision rule described in the previous section will minimize our probability of error if all M messages are sent with equal probability. The maximum-likelihood rule is difficult to analyze, however. Accordingly, we shall use another closely related type of decision rule. Although the rule we describe will not be so good as the maximum-likelihood rule, we shall find that we can still attain a probability of error as small as we wish.

We have already noted that the average distance between the transmitted sequence α_0 and the received sequence β_j will be np, where n is the order of the BSC we use (or the block length of the code), and p is the probability of error of the BSC. When we receive a symbol β_j out of our channel, therefore, our natural inclination is to hunt for the transmitted code symbol among those (if any) code symbols at a distance np or less from β_j. In geometrical terms we can say that we draw a sphere of radius np about β_j and hunt for α_0 in this sphere. The quantity np is only the average distance of α_0 from β_j, however, and it might be prudent to enlarge our sphere a bit, as insurance that α_0 will be inside the sphere with high probability. Mathematicians are wont to call such insurance ε, and we follow their time-honored custom. Consider a sphere of radius np_ε about β_j, where p_ε = p + ε (Figure 6-12).

Figure 6-12. A sphere about the received symbol.

Our decision procedure will be to draw the sphere of radius np_ε about β_j and, if there is a unique code point inside this sphere, we shall decide that such a point was the one transmitted. If there is no unique code point inside the sphere (either because there are
none or because there are more than one), we just throw up our hands and make an error! Now the reader may object that, in the latter circumstances, we are giving up too easily. The reader is correct in this objection. Nevertheless, we shall still be able to show that this procedure leads to a negligible probability of error.
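The decision procedure just described is easy to state in code. The sketch below is our own illustration (not from the text): it returns the unique code word inside the sphere of radius np_ε about the received sequence, or gives up (an error) when there is none or more than one.

```python
# A minimal sketch (assumed, not from the text) of the sphere-decoding rule:
# decode to the unique code word within the given radius of the received
# sequence; otherwise give up and count an error.
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def sphere_decode(received, code_words, radius):
    inside = [w for w in code_words if hamming(received, w) <= radius]
    return inside[0] if len(inside) == 1 else None    # None means "declare an error"

# assumed toy example with n = 3 and radius n * p_eps taken to be 1
print(sphere_decode("001", ["000", "111"], radius=1))   # '000': unique code word in the sphere
print(sphere_decode("001", ["000", "011"], radius=1))   # None: two code words in the sphere
```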
Using the procedure just described, there are two ways in which an error in decoding a received symbol may occur. Let S(np_ε) indicate a sphere of radius np_ε about the received symbol (Figure 6-13). The first way is if α_0, the transmitted code word, is not in S(np_ε); the only other way an error can occur is if α_0 is in S(np_ε) but some other code word is also in S(np_ε). Hence we may write the probability of error as

\[
P_E = \Pr\{\alpha_0 \notin S(np_\epsilon)\} + \Pr\{\alpha_0 \in S(np_\epsilon)\}\,\Pr\{\text{at least one other code word} \in S(np_\epsilon)\} \tag{6-37}
\]

where ∈ and ∉ are read "is contained in" and "is not contained in," respectively. Since Pr{α_0 ∈ S(np_ε)} ≤ 1, (6-37) implies

\[
P_E \leq \Pr\{\alpha_0 \notin S(np_\epsilon)\} + \Pr\{\text{at least one other code word} \in S(np_\epsilon)\} \tag{6-38}
\]

The probability that at least one of two possible events will occur is never greater than the sum of the individual probabilities that each will occur. A generalization of this law implies

\[
\Pr\{\text{at least one other code word} \in S(np_\epsilon)\} \leq \sum_{\alpha_i \neq \alpha_0} \Pr\{\alpha_i \in S(np_\epsilon)\} \tag{6-39}
\]

Figure 6-13. Correct decoding of β_j.
where the summation on the right is over the M - 1 code words not transmitted. Substitution of (6-39) into (6-38) gives us the inequality we need:

\[
P_E \leq \Pr\{\alpha_0 \notin S(np_\epsilon)\} + \sum_{\alpha_i \neq \alpha_0} \Pr\{\alpha_i \in S(np_\epsilon)\} \tag{6-40}
\]

Equation (6-40) expresses a simple bound on the probability of error for a specific set of M code words. The first term of (6-40) is the probability that the received word and the transmitted word will not be within a hamming distance of n(p + ε) from each other; the second term is the sum of probabilities (one for each word not transmitted) that the received word and each word not transmitted will be within a hamming distance of n(p + ε) from each other.

The first term of (6-40) is easy to evaluate. It is just the probability that more than n(p + ε) errors were made in the transmission of n binits through a BSC with probability of error p. The average number of errors occurring in a block of n binits is np. For any finite value of n, there will be some finite probability that the number of errors will exceed its mean value by nε or more. As n increases, however, this becomes less and less probable; more precisely, the weak law of large numbers (Parzen, 1961) tells us that for any two positive numbers ε and δ there exists an n_0 such that for any n > n_0 the probability that the number of errors exceeds its mean value by more than nε is less than δ. This means that by taking n large enough we may be sure that

\[
\Pr\{\alpha_0 \notin S(np_\epsilon)\} < \delta \tag{6-41}
\]

with δ as small as we wish.

With this equation, half the work in evaluating the error probability (6-40), and thus half the work in proving Shannon's second theorem, is done. Using (6-41) in (6-40) yields

\[
P_E < \delta + \sum_{\alpha_i \neq \alpha_0} \Pr\{\alpha_i \in S(np_\epsilon)\} \tag{6-42}
\]

Note that δ was independent of the set of M code words we chose to represent our M messages. The last term of (6-42), on the other hand, depends quite strongly on the code we happen to choose. How then can we use (6-42) to provide a bound for the probability
of error without going into the difficult question of which code to use?

The answer to this dilemma is provided by an ingenious argument due to Shannon. Instead of evaluating (6-42) for some particular code, Shannon showed that it is possible to evaluate (6-42) averaged over all possible codes. The first term of (6-42) does not depend upon the code we use. The summation of M - 1 terms in (6-42) does; if, however, we average the summation over all possible codes, we shall obtain the average probability of error, averaged over all possible codes. This is not exactly what we started out to calculate, but we shall see that it is enough to prove the fundamental theorem.

6-8. Random Coding-The Second Step

Shannon's argument, sometimes called random coding, proceeds as follows. In order to pick M input code words for our code, we select them at random from the set of 2^n possible channel inputs. We may think of the 2^n input symbols as written on 2^n separate slips of paper and mixed together in a large bowl. We are then blindfolded and told to select M slips from the bowl. After each slip is selected, it is read and replaced in the bowl before the next slip is selected. The M slips of paper then define M code words for us.†

Each time we select a code word there are 2^n possibilities. We make M separate selections; hence, there are a total of 2^{nM} different codes we might end up with. Each of these codes is selected with probability 2^{-nM}. The error probability obtained with any specific code is given in (6-42). We now average (6-42) over our 2^{nM} possible codes to obtain the average error probability \tilde{P}_E. We have already noted that δ, the first term on the right side of (6-42), does not depend upon the code we choose. Hence, we need only average M - 1 terms of the form Pr{α_i ∈ S(np_ε)}, where α_i ≠ α_0. Using a wavy bar to indicate an averaging over the 2^{nM} different

† The coding procedure described may result in a singular code, that is, we may select the same slip of paper more than once, and therefore use the same code word for more than one message. For M << 2^n, such an occurrence is unlikely but possible. For M > 2^n such an occurrence is inevitable.
codes, we rewrite (6-42) as

\[
\tilde{P}_E < \delta + (M - 1)\,\widetilde{\Pr}\{\alpha_i \in S(np_\epsilon)\} \leq \delta + M\,\widetilde{\Pr}\{\alpha_i \in S(np_\epsilon)\} \qquad \alpha_i \neq \alpha_0 \tag{6-43}
\]

To evaluate \widetilde{\Pr}\{\alpha_i \in S(np_\epsilon)\}, α_i ≠ α_0, in (6-43), we make use of the simplicity of the coding procedure used to generate the α_i. The α_i were selected from the 2^n possible code words at random; hence the average probability that α_i, a code word not equal to the transmitted code word α_0, is contained in a sphere of radius np_ε about the received sequence β_j is equal to the ratio of N(np_ε), the total number of different binary sequences in the sphere, to 2^n, the total number of different binary sequences of length n.

\[
\widetilde{\Pr}\{\alpha_i \in S(np_\epsilon)\} = \frac{N(np_\epsilon)}{2^n} \qquad \alpha_i \neq \alpha_0 \tag{6-44}
\]

Finally, we obtain a bound for N(np_ε). The number of binary sequences of length n at a distance of exactly k from β_j is just the number of possible ways a binary sequence of length n can differ from β_j in exactly k places, or just the binomial coefficient \binom{n}{k}. Summing over all values of k less than or equal to np_ε, we obtain†

\[
N(np_\epsilon) = 1 + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{np_\epsilon} = \sum_{k=0}^{np_\epsilon} \binom{n}{k} \tag{6-45}
\]

We can bound this sum with an often used inequality of information theory (Peterson, 1961, p. 246; Wozencraft and Reiffen, 1961, p. 71):

\[
\sum_{k=0}^{np_\epsilon} \binom{n}{k} \leq 2^{nH(p_\epsilon)} \qquad \text{for } p_\epsilon \leq \tfrac{1}{2} \tag{6-46}
\]

Hence, on combining (6-44), (6-45), and (6-46), we obtain

\[
\widetilde{\Pr}\{\alpha_i \in S(np_\epsilon)\} \leq 2^{-n[1 - H(p_\epsilon)]} \qquad \alpha_i \neq \alpha_0 \tag{6-47}
\]

† Of course, np_ε need not be an integer. In this case we must replace np_ε in the last binomial coefficient of (6-45) with the largest integer less than np_ε. The proof, however, is changed in no respect.
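The inequality (6-46) is easy to check numerically; the sketch below (our own addition, not from the original text) does so for one assumed pair of values of n and p_ε.

```python
# A minimal numerical check (assumed, not from the text) of (6-46):
# sum_{k=0}^{n p_eps} C(n, k) <= 2^(n H(p_eps)) for p_eps <= 1/2.
from math import comb, log2, floor

def H(x):
    return -x * log2(x) - (1 - x) * log2(1 - x)

n, p_eps = 100, 0.2
N = sum(comb(n, k) for k in range(floor(n * p_eps) + 1))
print(log2(N), n * H(p_eps))    # about 69.2 versus 72.2, so the bound holds
```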
Using this result in (6-43) produces the bound

\[
\tilde{P}_E \leq \delta + M\,2^{-n[1 - H(p_\epsilon)]} \tag{6-48}
\]

Equation (6-48) contains the essence of Shannon's second theorem (for the special case of the BSC). For the parameter δ can be made as small as we wish by increasing the block length n. Hence, the total right side of (6-48) can be made as small as we want, as long as

\[
\log M < n[1 - H(p_\epsilon)] < n[1 - H(p)] \tag{6-49}
\]

This is the expression we seek. By taking ε small,

\[
H(p_\epsilon) = H(p + \epsilon)
\]

can be made arbitrarily close to H(p), and we may choose a number of messages M as close to 2^{n[1 - H(p)]} as we want. But 1 - H(p) is the capacity C of the BSC. Hence, we may choose M messages, where M is any number less than 2^{nC}, and the average probability of error can be made less than any preassigned value. There must be at least one code which does as well as the average, and so we are assured that there exists a code with M < 2^{nC} code words with error probability arbitrarily small.

This is the result we promised at the end of Section 6-5. If we use a long enough block length n, we may choose any number of code words M up to 2^{nC} for use with a BSC, and still have the probability of misrecognizing a code word as small as we wish. Hence, we may send up to [see (6-34)]

\[
\frac{\log 2^{nC}}{n} = C \tag{6-50}
\]

essentially error-free binary messages with each binit over a BSC of capacity C.
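To see the bound (6-48) at work numerically, the sketch below (our own illustration, not part of the original text) evaluates the code-dependent term M 2^{-n[1 - H(p_ε)]} for an assumed rate below 1 - H(p_ε); the particular values of p, ε, and R are ours.

```python
# A minimal numerical illustration (assumed, not from the text) of (6-48): for a
# fixed rate R below 1 - H(p_eps), the term M * 2^(-n[1 - H(p_eps)]), with
# M = 2^(nR), goes to zero as the block length n grows.
from math import log2

def H(x):
    return -x * log2(x) - (1 - x) * log2(1 - x)

p, eps = 0.01, 0.02
p_eps = p + eps
R = 0.79                              # assumed rate, below 1 - H(p_eps), about 0.806
for n in (100, 1000, 10000):
    exponent = n * (R - (1 - H(p_eps)))        # log2 of M * 2^(-n[1 - H(p_eps)])
    print(n, 2.0 ** exponent)                  # about 0.34, 2.0e-5, 1.0e-47
```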

6-9. Shannon's Second Theorem-Discussion

The theorem proved in the last two sections is admittedly of a very special nature. The channel considered is the simplest nontrivial channel, the BSC. Nevertheless, all the important ideas necessary for the proof of the more general theorem and all the important consequences which follow from the more general
theorem are illustrated in these sections. We use the present section to discuss these ideas and implications before proving the general theorem in Section 6-10.

The first idea we discuss is random coding, introduced by Shannon. It is necessary to understand what is meant by such a coding procedure if we are to appreciate the limitations of the theorem. Because of the fact that the code words were selected at random, we could use Equation (6-47) to bound the probability that an arbitrary code word would be in the sphere of radius np_ε surrounding β_j. If we had fixed the code words in some deterministic fashion, we would not be able to speak of the probability that a code word is within a distance np_ε of the received sequence β_j in this way. Looking at this matter even more closely, it is possible to view the coding procedure we have described as really no coding procedure at all.

As a practical method of coding or finding a good set of code words, random coding leaves much to be desired. If we use this procedure, on the average we can make the probability of error as small as we wish. The average referred to here, unfortunately, is the average over the ensemble of possible codes. Thus we are not assured, once we arrive at some fixed code, that we have a good code. As an extreme example, note that it is possible to obtain the highly singular code where all M messages are mapped into the same code word.

Shannon's second theorem can, therefore, be characterized as a little more than an existence proof, but a little less than a constructive proof. The theorem does not exactly show us how to construct a good code, and so we do not quite have a method of finding codes. On the other hand, the theorem does provide us with a method which will generate good codes on the average; so it is not as bad as a pure existence proof.

In the more general version of Shannon's second theorem proved in the next section, we show that we may select M = 2^{n(C-ε)}, ε > 0, code words (where C is the capacity of the channel) and still have the message error probability as small as we wish. We shall also prove a partial converse to this theorem: namely, if we choose M = 2^{n(C+ε)}, ε > 0, code words, it is not possible to find a decision rule resulting in an arbitrarily small probability of error P_E by increasing n, the block length of the code. This form of
the converse will be sufficient for our purposes. We note, however, that several stronger forms may be proved. Wolfowitz (1959) has shown that if we choose M = 2^{n(C+ε)} code words (where C is the channel capacity and ε > 0), then the best probability of error approaches one as n increases!

The coding theorem states that we can make the probability of misinterpreting a word sent through a noisy channel to represent one of our messages as small as we wish, while keeping the message rate of the channel fixed. The important point of the previous sentence is that the theorem deals with the probability of error of messages, or code words. For example, in the case of the BSC, the theorem says that the probability of misinterpreting a sequence of n zeros and ones is arbitrarily small. This is much stronger than merely saying that the probability of misinterpreting a single binit is arbitrarily small. This distinction has led to misunderstandings in interpreting the results of the various forms of the converse of Shannon's second theorem. In terms of the BSC treated in the previous section, the converse states that if the number of equiprobable messages M is greater than 2^{nC} (where again C is the capacity of the BSC), the probability of error in a word approaches 1 as n increases. This conclusion will hold for any set of code words (not just on the average with respect to an ensemble of codes) and for any decision rule. The theorem is of great mathematical interest, but its relevance to the communication problem is often overemphasized. The theorem does not state that effective communication is impossible if M > 2^{nC}. To clarify this point let us plot the probability of a binit error in a BSC versus the message rate when the binits are chosen to be 0 or 1 with equal probability.

For any message rate R binary messages per binit, less than C, the capacity of the channel, we know that the probability of a binit error can also be made as small as we wish. For a message rate R greater than C, think of the following procedure. We again employ the nth extension of the BSC to transmit messages and let n increase. To achieve a message rate of R binary messages per binit, we must have 2^{nR} messages to send through the nth extension of the BSC. Alternatively, we may send nR binits through the nth extension, and just under nC of these binits may be sent through the channel with an arbitrarily small probability of
error. For the remaining nR - nC binits we must transmit, we shall require the receiver to decide on 0s or 1s by merely flipping a coin: heads for a 0, tails for a 1. For these binits the probability of error will be 1/2. The probability of binit error averaged over both the reliable and unreliable binits will be slightly greater than (R - C)/(2R). This result is portrayed in Figure 6-14.
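The coin-tossing argument gives the curve of Figure 6-14 for R > C directly; the sketch below (our own addition, not from the original text) tabulates the limiting binit error rate for a BSC with p = 0.01.

```python
# A minimal sketch (assumed, not from the text) of the coin-tossing argument:
# nC of the nR binits are delivered essentially error-free and the remaining
# nR - nC are guessed with a fair coin, giving an average binit error rate
# slightly greater than (R - C)/(2R); the limiting value is tabulated here.
from math import log2

def H(x):
    return -x * log2(x) - (1 - x) * log2(1 - x)

def binit_error_rate(R, C):
    return 0.0 if R <= C else (R - C) / (2 * R)

C = 1 - H(0.01)                        # capacity of the BSC, about 0.919
for R in (0.5, 1.0, 1.5, 2.0):
    print(R, binit_error_rate(R, C))   # 0, then about 0.04, 0.19, 0.27
```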
That part of Figure 6-14 which gives the probability of error for
R > C is obtained by use of the procedure we have outlined. We
have not shown that this procedure is the best possible procedure,

Figure 6-14. Binit error rate in the BSC as a function of message rate.

however. In fact, the calculation of the smallest possible binit error probability for message rates R > C is still an open question. In addition, note that although we have terminated the abscissa in Figure 6-14 at R = 1, we can obtain even higher message rates by using the coin-tossing procedure we have described. For example, consider a noiseless BSC (p = 0). Then our coin-tossing procedure leads to a probability of binit error of 0.25 at a message rate of 2 binary messages per binit.

A final point to be discussed before we turn to a general proof of Shannon's second theorem is that of probability of error bounds. In both the proof for the BSC and the general proof, we are interested only in demonstrating that the probability of error can be made arbitrarily small when M < 2^{n(C-ε)}. A large number of results,
however, have been obtained, detailing exactly how fast the probability of error can be made to approach 0 as n, the order of the extension used, is increased. We do not treat this subject other than to remark that many authors have been able to obtain exponential (or almost exponential) behavior of the probability of error with n. Some of these results are referenced in the notes at the end of this chapter.

6-10. Shannon's Second Theorem-General Case

Let us now turn to a proof of Shannon's second theorem for the general discrete channel without memory. In concept, the proof will be remarkably similar to the proof presented for the BSC in Sections 6-7 and 6-8.

Shannon's Second Theorem

Consider a channel with r inputs, s outputs, and channel capacity C. Let ε be an arbitrarily small number, and let M = 2^{n(C-ε)}. Then, for n sufficiently large, it is possible to select a subset of M code words (to represent M equiprobable messages) from the set of r^n possible inputs to the nth extension of the channel, such that the probability of error in decoding the channel output can be made as small as we wish.

In Figure 6-15 we have indicated the r^n possible inputs and the s^n outputs of the nth extension of our channel.

Figure 6-15. The nth extension of a channel.

In order to send M messages through this channel we select M of the r^n possible inputs as code words. Again we ask the question, "How many messages is it possible to send and still have the message error probability remain small?"