
Channel Capacity

The Characterization of the Channel Capacity

Radu T. Trîmbiţaş
November 2012

Outline

Contents

1 Channel Capacity
  1.1 Introduction
  1.2 Discrete Channels

2 Examples of Channel Capacity
  2.1 Noiseless Binary Channel
  2.2 Noisy Channel with non-overlapping outputs
  2.3 Permutation Channel
  2.4 Noisy Typewriter
  2.5 Binary Symmetric Channel (BSC)
  2.6 Binary Erasure Channel

3 Symmetric Channels

4 Properties of Channel Capacity

5 Shannon's 2nd Theorem
  5.1 Shannon's 2nd Theorem - Intuition
  5.2 Definitions
  5.3 Jointly Typical Sequences

6 Channel Coding Theorem
  6.1 Channel Coding Theorem
  6.2 Zero-Error Codes
  6.3 Fano's Lemmas
  6.4 Converse to the Channel Coding Theorem
  6.5 Equality in the Converse to the Channel Coding Theorem

7 Feedback Capacity

8 Source-Channel Separation Theorem

9 Coding
  9.1 Introduction to coding
  9.2 Hamming Codes

1 Channel Capacity
1.1 Introduction
Towards Channels

• So far, we have been talking about compression. I.e., we have some


source p( x ) with information H ( X ) (the limits of compression) and the
goal is to compress it down to H bits per source symbol in a representa-
tion Y.
• In some sense, the compressed representation has a ”capacity” which is
the total amount of bits that can be represented. I.e., with n bits, we can
obviously represent no more than n bits of information.
• Compression can be seen as a process where we want to fully utilize the
capacity of the compressed representation, i.e., if we have n bits of code
word, we ideally (in a perfect compression scheme) would like no fewer
than n bits of information to be represented. Recall efficiency: H(X) = E(ℓ),
the expected codeword length.
• We now want to transmit information over a channel.

• From Claude Shannon: The fundamental problem of communication is


that of reproducing at one point either exactly or approximately a mes-
sage selected at another point. Frequently the messages have meaning . .
. [which is] irrelevant to the engineering problem

• Is there a limit to the rate of communication over a channel?


• If the channel is noisy, can we achieve (essentially) perfect error-free trans-
mission at a reasonable rate?

The 1930s America

• Stock-market crash of 1929

• The great depression


• Terrible conditions in textile and mining industries
• deflated crop prices, soil depletion, farm mechanization

• The rise of fascism and the Nazi state in Europe.


• Analog radio, and urgent need for secure, precise, and efficient commu-
nications
• Radio communication, noise always in data transmission (except for Morse
code, which is slow).

Radio Communications

• Place yourself back in the 1930s.


• Analog communication model of the 1930s.

• Q: Can we achieve perfect communication with an imperfect communi-


cation channel?

• Q: Is there an upper bound on the information capable of being sent un-
der different noise conditions?
• Key: If we increase the transmission rate over a noisy channel will the
error rate increase?

• Perhaps the only way to achieve error free communication is to have a


rate of zero.
• The error profile we might expect to see is the following:

• Here, probability of error Pe goes up linearly with the rate R, with an


intercept at zero.
• This was the prevailing wisdom at the time. Shannon was critical in
changing that.

Simple Example

• Consider representing a signal by a sequence of numbers.


• We now know that any signal (either inherently discrete or continuous,
under the right conditions) can be perfectly represented (or at least ar-
bitrarily well) by a sequence of discrete numbers, and they can even be
binary digits.

• Now consider speaking such a sequence over a noisy AM channel.


• Very possible one number will be masked by noise.
• In such a case, we repeat each number k times, where k is sufficiently large
to ensure we can ”decode” the original sequence with very small proba-
bility of error.
• Rate of this code decreases but we can communicate reliably even if the
channel is very noisy.
• Compare this idea to the figure on the following page.

A key idea

• If we choose the messages carefully at the sender, then with very high
probability, they will be uniquely identifiable at the receiver.
• The idea is that we choose the source messages that (tend to) not have
any ambiguity (or have any overlap) at the receiver end. I.e.,

• This might restrict our possible set of source messages (in some cases
severely, and thereby decrease our rate R), but if any message received in
a region corresponds to only one source message, ”perfect” communica-
tion can be achieved.

1.2 Discrete Channels


Discrete Channels
Definition 1. A discrete channel is a system consisting of an input alphabet X ,
an output alphabet Y , and a distribution (probability transition matrix) p(y| x )
which is the probability of observing output symbol y given that we send the
symbol x. A discrete channel is memoryless if the probability distribution of yt ,
the output at time t, depends only on the input xt and is independent of all
previous inputs x1:t−1 .

• We will see many instances of discrete memoryless channels (or just DMC).
• Recall back from lecture 1 our general model of communications:

Model of Communication

• Source message W, one of M messages.


• Encoder transforms this into a length-n string of source symbols X n

• Noisy channel distorts this message into a length-n string of receiver sym-
bols Y n .
• Decoder attempts to reconstruct the original message as best as possible,
coming up with Ŵ, one of the M possible sent messages.

Rates and Capacities

• So we have a source X governed by p( x ) and channel that transforms X


symbols to Y symbols and which is governed by the conditional distri-
bution p(y| x )
• These two items p( x ) and p(y| x ) are sufficient to compute the mutual
information between X and Y. That is, we compute

I(X;Y) = I_{p(x)}(X;Y) = ∑_{x,y} p(x) p(y|x) log [ p(y|x) / p(y) ]   (1)
       = ∑_{x,y} p(x) p(y|x) log [ p(y|x) / ∑_{x'} p(y|x') p(x') ],   (2)

where p(x, y) = p(x) p(y|x) is the joint distribution.

• We write this as I ( X; Y ) = I p( x) ( X; Y ), meaning implicitly the MI quan-


tity is a function of the entire distribution p( x ), for a given fixed channel
p ( y | x ).
• We will often be optimizing over the input distribution p( x ) for a given
fixed channel p(y| x ).
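
Because p(x) and p(y|x) fully determine I(X;Y), the quantity in (1)–(2) is easy to evaluate numerically. Below is a minimal sketch (my own addition, not part of the original notes); the function name mutual_information and the use of numpy are illustrative choices.

# A minimal sketch of computing the information flow I(X;Y) in bits from an
# input distribution p(x) and a channel p(y|x).
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """p_x: shape (|X|,); p_y_given_x: shape (|X|, |Y|), rows sum to 1."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * p_y_given_x          # joint p(x, y)
    p_y = p_xy.sum(axis=0)                     # marginal p(y)
    mask = p_xy > 0                            # avoid log(0); 0 log 0 = 0
    return float(np.sum(p_xy[mask] *
                        np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask])))

# Example: a BSC with crossover probability 0.1 and a uniform input.
bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
print(mutual_information([0.5, 0.5], bsc))     # ~= 0.531 = 1 - H(0.1)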

Definition 2 (Information flow). The rate of information flow through a channel


is given by I ( X; Y ), the mutual information between X and Y, in units of bits
per channel use.
Definition 3 (Capacity). The information capacity of a channel is the maximum
information flow
C = max_{p(x)} I(X;Y),   (3)

where the maximum is taken over all possible input distributions p( x ).

Definition 4 (Rate). The rate R of a code is measured in the number of bits per
channel use.

• We shall soon give an operational definition of channel capacity as the


highest rate in bits per channel use at which information can be sent with
arbitrarily low probability of error.
• Shannon’s second theorem establishes that the information channel ca-
pacity is equal to the operational channel capacity.
• There is a duality between the problems of data compression and data
transmission.

– During compression, we remove all the redundancy in the data to


form the most compressed version possible
– During data transmission, we add redundancy in a controlled fash-
ion to combat errors in the channel.

• We show that a general communication system can be broken into two


parts and that the problems of data compression and data transmission
can be considered separately.

Fundamental Limits of Compression

• For compression, if the error exponent is positive, then the error → 0 exponentially
fast as the block length n → ∞. Note, Pe ∼ e^{−nE(R)}.
• That is,

• Only hope of reducing error was if R > H. Something ”funny” happens


at the entropy rate of the source distribution. Can’t compress below this
without incurring error.

Fundamental Limits of Data Transmission/Communication

• For communication, lower bound on probability of error becomes bounded


away from 0 as the rate of the code R goes above a fundamental quantity
C. Note, Pe ∼ e−nE( R) .

• That is,

• only possible way to get low error is if R < C. Something funny happens
at the point C, the capacity of the channel.
• Note that C is not 0, so can still achieve ”perfect” communication over a
noisy channel as long as R < C.

2 Examples of Channel Capacity


2.1 Noiseless Binary Channel
Noiseless Binary Channel

• Noiseless binary channel, diagram shows p(y| x )

• So, p(y = 0| x = 0) = 1 = 1 − p(y = 1| x = 0) and p(y = 1| x = 1) = 1 =


1 − p(y = 0| x = 1), so channel is just an input copy.
• One bit sent at a time is received without error, so capacity should be 1
bit (intuitively, we can reliably send one bit per channel usage).
• I ( X; Y ) = H ( X ) − H ( X |Y ) = H ( X ) in this case, so C = max p( x) I ( X; Y ) =
max p( x) H ( X ) = 1.

• Clearly, p(0) = p(1) = 1/2 achieves this capacity.

• Also, p(0) = 1 = 1 − p(1) has I ( X; Y ) = 0, so achieves zero information


flow.

2.2 Noisy Channel with non-overlapping outputs
Noisy Channel with non-overlapping outputs
This channel has 2 possible outputs corresponding to each of the 2 inputs.
The channel appears to be noisy, but really is not.

• Here, p(y = 0|x = 0) = p(y = 1|x = 0) = 1/2 and p(y = 2|x = 1) =
p(y = 3|x = 1) = 1/2.

• If we receive a 0 or 1, we know 0 was sent. If we receive a 2 or 3, a 1 was


sent.
• Thus, C = 1 since only two possible error free messages.
• Same argument applies

I(X;Y) = H(X) − H(X|Y) = H(X), since H(X|Y) = 0 (the input is determined by the output).

• Again uniform distribution p(0) = p(1) = 1/2 achieves the capacity.

2.3 Permutation Channel


Permutation Channel

• Here, p(Y = 1| X = 0) = p(Y = 0| X = 1) = 1.


• So output is a binary permutation (swap) of input.
• Thus, C = 1; no information lost.

• In general, for an alphabet of size k = |X| = |Y|, let σ be a permutation, so
that Y = σ(X).

• Then C = log k.

Aside: on the optimization to compute the value C

• To maximize a given function f ( x ), it is sufficient to show that f ( x ) ≤ α


for all x, and then find an x ∗ such that f ( x ∗ ) = α.
• We’ll be doing this over the next few slides when we want to compute
C = max p( x) I ( X; Y ) for fixed p(y| x ).

• The solution p∗ ( x ) that we find that achieves this maximum won’t nec-
essarily be unique.
• Also, the solution p∗ ( x ) that we find won’t necessarily be the one that we
end up, say, using when we wish to do channel coding.

• Right now C is just the result of a given optimization.


• We’ll see that C, as computed, is also the critical point for being able to
channel code with vanishingly small error probability.
• The resulting p∗ ( x ) that we obtain as part of the optimization in order
to compute C won’t necessarily be the one that we use for actual coding
(example forthcoming).

2.4 Noisy Typewriter

Figure 1: Noisy Typewriter, C = log 13 bits

Noisy Typewriter

• In this case the channel input is either received unchanged at the output
with probability 1/2 or is transformed into the next letter with probabil-
ity 1/2 (Figure 1).
• So 26 input symbols, and each symbol maps probabilistically to itself or
its lexicographic neighbor.

• I.e., p( A → A) = p( A → B) = 1/2, etc.


• Each symbol always has some chance of error, so how can we communi-
cate without error?
• Choose subset of symbols that can be uniquely disambiguated on re-
ceiver side. Choose every other source symbol, A, C, E, etc.

• Thus A → { A; B}, C → {C; D }, E → { E; F }, etc. so that each received


symbols has only one unique source symbol.
• Capacity C = log 13
• Q: what happens to C when probabilities are not all 1/2?

• We can also compute the capacity more mathematically.


• For example:

C = max_{p(x)} I(X;Y) = max_{p(x)} (H(Y) − H(Y|X))
  = max_{p(x)} H(Y) − 1        // for each X = x there are 2 equally likely outputs
  = log 26 − 1 = log 13

• The max p( x) H (Y ) = log 26 can be achieved by using the uniform distri-


bution for p( x ), for which when we choose any x symbol, there is equal
likelihood of two Ys being received.

• An alternative p(x) would put zero probability on the alternate letters (B, D,
F, etc.) and be uniform on the rest. In this case, we still have H(Y) = log 26.
• So the capacity is the same in each case (∃ two p(x) that achieve it), but only
the second is what we would use, say, for error-free coding.
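
As a quick numerical sanity check (my addition, assuming the transition probabilities of 1/2 stated above), we can build the 26 × 26 transition matrix and evaluate I(X;Y) at the uniform input; the value matches log₂ 13.

import numpy as np

K = 26
P = np.zeros((K, K))
for x in range(K):
    P[x, x] = 0.5               # received unchanged...
    P[x, (x + 1) % K] = 0.5     # ...or as the next letter

p_x = np.full(K, 1.0 / K)       # uniform input distribution
p_xy = p_x[:, None] * P         # joint p(x, y)
p_y = p_xy.sum(axis=0)          # marginal p(y)
mask = p_xy > 0
I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]))
print(I, np.log2(13))           # both ~= 3.7004 bits per channel use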

2.5 Binary Symmetric Channel (BSC)
Binary Symmetric Channel (BSC)

• A bit that is sent will be flipped with probability p.


• P(Y = 1|X = 0) = p = 1 − P(Y = 0|X = 0) and P(Y = 0|X = 1) = p =
1 − P(Y = 1|X = 1).
• BSC is the simplest model of channel with errors, yet it captures most of
the complexity of the general problem
• Q: can we still achieve reliable (”guaranteed” error free) communication
with this channel? A: Yes, if p < 1/2 and if we do not ask for too high
a transmission rate (which would be R > C), then we can. Actually, any
p 6= 1/2 is sufficient.
• Intuition: think about AEP and/or block coding.
• But how to compute C, the capacity?
I(X;Y) = H(Y) − H(Y|X) = H(Y) − ∑_x p(x) H(Y|X = x)
       = H(Y) − ∑_x p(x) H(p) = H(Y) − H(p) ≤ 1 − H(p).

• When is H (Y ) = 1? Note that


P(Y = 1) = P(Y = 1|X = 1) P(X = 1) + P(Y = 1|X = 0) P(X = 0)
         = (1 − p) P(X = 1) + p P(X = 0),
which equals 1/2 when P(X = 1) = P(X = 0) = 1/2.

• So H (Y ) = 1 if H ( X ) = 1.
• Thus, we get that C = 1 − H ( p) which happens when X is uniform.
• If p = 1/2 then C = 0, so if it randomly flips bits, then no information
can be sent.
• If p 6= 1/2, then we can communicate, albeit potentially slowly. E.g., if
p = 0.499 then C = 2.8854 × 10−6 bits per channel use. So to send one
bit, need to use the channel quite a bit.
• If p = 0 or p = 1, then C = 1 and we can get maximum possible rate (i.e.,
the capacity is one bit per channel use).
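
A small sketch (my own addition) that tabulates C = 1 − H(p) for a few crossover probabilities, reproducing the values quoted above.

import numpy as np

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.0, 0.11, 0.25, 0.499, 0.5]:
    print(f"p = {p:<5}  C = 1 - H(p) = {1 - binary_entropy(p):.6g} bits/use")
# p = 0.499 gives C ~= 2.89e-06, and p = 0.5 gives C = 0.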

Decoding
• Let's temporarily look ahead to address this problem.
• We can ”decode” the source using the received string, the source distribution,
and the channel model p(y|x) via Bayes' rule, i.e.,
P(X|Y) = P(Y|X) P(X) / P(Y) = P(Y|X) P(X) / ∑_{x'} P(Y|X = x') P(X = x').

• If we get a particular y, we can compute p(x|y) and make a decision based
on that, i.e., x̂ = argmax_x p(x|y).
• This is optimal decoding in that it minimizes the error.
• An error occurs if x ≠ x̂, and P(error) = P(x ≠ x̂).
• This is minimal if we choose x̂ = argmax_x p(x|y), since the error 1 − P(x̂|y) is
then minimal.
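
A minimal sketch (my own illustration, not a prescribed algorithm from the notes) of this MAP rule for a single BSC use; map_decode_bsc and its arguments are hypothetical names.

# MAP decoding of one BSC output y via Bayes' rule, given a prior P(X = 0).
def map_decode_bsc(y, prior0, p):
    """Return argmax_x P(x | y) for a BSC with crossover probability p."""
    lik = {0: p if y != 0 else 1 - p,     # P(y | x = 0)
           1: p if y != 1 else 1 - p}     # P(y | x = 1)
    post_unnorm = {0: lik[0] * prior0, 1: lik[1] * (1 - prior0)}
    return max(post_unnorm, key=post_unnorm.get)

# With a strongly biased prior, even y = 1 is decoded as 0 when p is large.
print(map_decode_bsc(1, prior0=0.95, p=0.4))   # -> 0
print(map_decode_bsc(1, prior0=0.5,  p=0.4))   # -> 1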

Minimum Error Decoding


• Note: Computing quantities such as P( x |y) is a task of probabilistic in-
ference.
• Often this problem is difficult (NP-hard). This means that doing mini-
mum error decoding might very well be exponentially expensive (unless
P = NP).
• For many real-world codes, exact decoding must be approximated (i.e., there
is no known fast algorithm for minimum-error or maximum-likelihood
decoding).
• Instead we do approximate inference algorithms (e.g., loopy belief prop-
agation, message passing, etc.). These algorithms tend still to work very
well in practice (achieve close to the capacity C).
• But before doing that, we need first to study more channels and the the-
oretical properties of the capacity C.

2.6 Binary Erasure Channel


Binary Erasure Channel

• e is an erasure symbol, if that happens we don’t have access to the trans-
mitted bit.
• The probability of dropping a bit is then α.
• We want to compute capacity. Obviously, C = 1 if α = 0.

C = max_{p(x)} I(X;Y) = max_{p(x)} (H(Y) − H(Y|X))
  = max_{p(x)} H(Y) − H(α).

• So while H (Y ) ≤ log 3, we want actual value of the capacity.


• let E = {Y = e}. Then

H (Y ) = H (Y, E) = H ( E) + H (Y | E).

• Let π = P(X = 1). Then, since Y takes the values 0, e, 1 with probabilities
(1 − π)(1 − α), α, π(1 − α) respectively,

H(Y) = H( (1 − π)(1 − α), α, π(1 − α) ) = H(α) + (1 − α) H(π).

• This last equality follows since H(E) = H(α), and

H(Y|E) = α H(Y|Y = e) + (1 − α) H(Y|Y ≠ e) = α · 0 + (1 − α) H(π).

• Then we get

C = max_{p(x)} H(Y) − H(α)
  = max_π ((1 − α) H(π) + H(α)) − H(α)
  = max_π (1 − α) H(π) = 1 − α

• Best capacity when π = 1/2 = P( X = 1) = P( X = 0).


• This makes sense: we lose a fraction α of the bits, and hence of the original capacity.
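
A quick numerical check (my addition) that maximizing I(X;Y) over π = P(X = 1) for the erasure transition matrix indeed gives 1 − α; the grid search below is only for illustration.

import numpy as np

def capacity_bec(alpha, grid=1001):
    """Maximize I(X;Y) over pi = P(X=1) by a simple grid search."""
    best = 0.0
    for pi in np.linspace(0.0, 1.0, grid):
        p_x = np.array([1 - pi, pi])
        # outputs ordered as (0, e, 1)
        P = np.array([[1 - alpha, alpha, 0.0],
                      [0.0, alpha, 1 - alpha]])
        p_xy = p_x[:, None] * P
        p_y = p_xy.sum(axis=0)
        mask = p_xy > 0
        I = np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y[None, :])[mask]))
        best = max(best, I)
    return best

print(capacity_bec(0.25))   # ~= 0.75 = 1 - alpha, attained at pi = 1/2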

3 Symmetric Channels
Symmetric Channels

Definition 5. A channel is symmetric if rows of the channel transition matrix


p(y| x ) are permutations of each other, and columns of this matrix are permu-
tations of each other. A channel is weakly symmetric if every row of the tran-
sition matrix p(.| x ) is a permutation of every other row, and all column sums
∑ x p(y| x ) are equal.
Theorem 6. For a weakly symmetric channel,

C = log |Y | − H (r ) (4)

where r is a row of the transition matrix. This is achieved by a uniform distribution on


the input alphabet.

Proof.

I ( X; Y ) = H (Y ) − H (Y | X ) = H (Y ) − H (r ) ≤ log |Y | − H (r ) (5)

with equality if the output distribution is uniform. But p(x) = 1/|X| achieves
a uniform distribution on Y, as seen from

p(y) = ∑_{x∈X} p(y|x) p(x) = (1/|X|) ∑_{x∈X} p(y|x) = c/|X| = 1/|Y|,   (6)

where c is the sum of the entries in one column of the transition matrix.
Example 7. The channel with transition matrix
 
0.3 0.2 0.5
p(y| x ) =  0.5 0.3 0.2 
0.2 0.5 0.3

is symmetric. Its capacity is C = log 3 − H(0.5, 0.2, 0.3) ≈ 0.0995 bits per channel use.


Example 8. Y = X + Z (mod c), where X = Z = {0, 1, . . . , c − 1}, and X and
Z are independent.
Example 9. The channel with transition matrix

p(y|x) = [ 1/3  1/6  1/2 ]
         [ 1/3  1/2  1/6 ]

is weakly symmetric, but not symmetric.
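
A small check (my addition) of Theorem 6 on Examples 7 and 9, using C = log|Y| − H(r); the function name is illustrative.

import numpy as np

def weakly_symmetric_capacity(P):
    r = np.asarray(P, dtype=float)[0]              # any row (all rows are permutations of each other)
    H_r = -np.sum(r[r > 0] * np.log2(r[r > 0]))    # entropy of a row
    return np.log2(P.shape[1]) - H_r

P7 = np.array([[0.3, 0.2, 0.5],
               [0.5, 0.3, 0.2],
               [0.2, 0.5, 0.3]])      # symmetric (Example 7)
P9 = np.array([[1/3, 1/6, 1/2],
               [1/3, 1/2, 1/6]])      # weakly symmetric (Example 9)
print(weakly_symmetric_capacity(P7))  # ~= 0.0995 bits
print(weakly_symmetric_capacity(P9))  # = log2(3) - H(1/3, 1/6, 1/2) ~= 0.126 bits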

4 Properties of Channel Capacity
Properties of Channel Capacity

1. C ≥ 0 since I ( X; Y ) ≥ 0.
2. C ≤ log |X | since C = max p( x) I ( X; Y ) ≤ max H ( X ) = log |X |.

3. C ≤ log |Y | for the same reason. Thus, the alphabet sizes limit the trans-
mission rate.
4. I ( X; Y ) = I p( x) ( X; Y ) is a continuous function of p( x ).

5. I ( X; Y ) is a concave function of p( x ) for fixed p(y| x ).


6. Thus, I_{λp₁+(1−λ)p₂}(X;Y) ≥ λ I_{p₁}(X;Y) + (1 − λ) I_{p₂}(X;Y). Interestingly,
since concave, this makes computing something like the capacity easier.
I.e., a local maximum is a global maximum, and computing the capacity
for a general channel model is a convex optimization procedure.
7. Recall also, I ( X; Y ) is a convex function of p(y| x ) for fixed p( x ).

5 Shannon's 2nd Theorem

5.1 Shannon's 2nd Theorem - Intuition
Shannon’s 2nd Theorem

• One of the most important theorems of the last century.


• We’ll see it in various forms, but we state it here somewhat informally to
start acquiring intuition.

Theorem 10 (Shannon’s 2nd Theorem). C is the maximum number of bits (on


average, per channel use) that we can transmit over a channel reliably.

• Here, ”reliably” means with vanishingly small and exponentially decreas-


ing probability of error as the block length gets longer. We can easily
make this probability essentially zero.
• Conversely, if we try to push > C bits through the channel, error quickly
goes to 1.
• Intuition of this we’ve already seen in the noisy typewriter and the region
partitioning.
• Slightly more precisely, this is a sort of bin packing problem.
• We’ve got a region of possible codewords, and we pack as many smaller
non-overlapping bins into the region as possible.

• The smaller bins correspond to the noise in the channel, and the packing
problem depends on the underlying ”shape”
• Not really a partition, since there might be wasted space, also depending
on the bin and region shapes.

• Intuitive idea: use typicality argument.

• There are ≈ 2^{nH(X)} typical sequences, each with probability ≈ 2^{−nH(X)}, and
P(A_ε^{(n)}) ≈ 1, so the only thing with ”any” probability is the typical
set, and it has all the probability.
• The same thing is true for conditional entropy.

• That is, for a typical input X, there are 2nH (Y | X ) output sequences.

• Overall, there are 2nH (Y ) typical output sequences, and we know that
2nH (Y ) ≥ 2nH (Y | X ) .

Shannon’s 2nd Theorem: Intuition

• Goal: find a non-confusable subset of the inputs that produce disjoint


output sequences (as in picture).

• There are ≈ 2nH (Y ) (typical) outputs (i.e., the marginally typical Y se-
quences).

• There are 2nH (Y | X ) (X-conditionally typical Y sequences) outputs. ≡ the


average possible number of outputs for a possible input, so this many
could be confused with each other. I.e., on average, for a given X = x,
this is approximately how many outputs there might be.
• So the number of non-confusable inputs is

≤ 2^{nH(Y)} / 2^{nH(Y|X)} = 2^{n(H(Y)−H(Y|X))} = 2^{nI(X;Y)}   (7)

• Note, in non-ideal case, there could be overlap of the typical Y-given-X


sequences, but the best we can do (in terms of maximizing the number of
non-confusable inputs) is when there is no overlap on the output. This is
assumed in the above.

• We can view the number of non-confusable inputs (7) as a volume. 2nH (Y )
is the total number of possible slots, while 2nH (Y | X ) is the number of slots
taken up (on average) for a given source word. Thus, the number of
source words that can be used is the ratio.

• To maximize the number of non-confusable inputs (7), for a fixed channel


p(y| x ), we find the best p( x ) which gives I ( X; Y ) = C, which is the log
of the maximum number of inputs possible to use.
• This is the capacity.

5.2 Definitions
Definitions

Definitions 11.
• Message W ∈ {1, . . . , M}, requiring log M bits per message.
• Signal sent through the channel: X^n(W), a random codeword.
• Received signal from the channel: Y^n ∼ p(y^n | x^n).
• Decoding via the guess Ŵ = g(Y^n).
• Discrete memoryless channel (DMC): (X, p(y|x), Y).
Definitions 12.
• The n-th extension of the channel is (X^n, p(y^n|x^n), Y^n), where

p(y_k | x^k, y^{k−1}) = p(y_k | x_k),  k = 1, . . . , n.
• Feedback if xk can use both previous inputs and outputs.

• No feedback if p( xk | x1:k−1 ; y1:k−1 ) = p( xk | x1:k−1 ). We’ll analyze feedback
a bit later.
Remark. If the channel is used without feedback, the channel transition
function reduces to

p(y^n | x^n) = ∏_{i=1}^n p(y_i | x_i).   (8)

(M,n) code
Definition 13 (( M, n) code). An ( M, n) code for channel (X ; p(y| x ); Y ) is
1. An index set {1, 2, . . . , M } .
2. An encoding function X n : {1, 2, . . . , M } → X n yielding codewords
X n (1), X n (2), X n (3), . . . , X n ( M ). Each source message has a codeword,
and each codeword is n code symbols.
3. Decoding function, i.e., g : Y n → {1, 2, . . . , M } which makes a ”guess”
about original message given channel output.
Remark. In an ( M, n) code, M = the number of possible messages to be
sent, and n = number of channel uses by the codewords of the code.

Error
Definition 14 (Probability of Error λi for message i ∈ {1, . . . , M}).
λ_i := P(g(Y^n) ≠ i | X^n = x^n(i)) = ∑_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i)   (9)

I (·) is the indicator function; the conditional probability of error given that
index i was sent.
Definition 15 (Max probability of Error λ(n) for ( M, n) code).

λ^{(n)} := max_{i∈{1,2,...,M}} λ_i.   (10)

Definition 16 (Average probability of error P_e^{(n)}).

P_e^{(n)} = (1/M) ∑_{i=1}^M λ_i = P(Z ≠ g(Y^n)),

where Z is a r.v. uniformly distributed on {1, . . . , M}. Equivalently,

P_e^{(n)} = E(I(Z ≠ g(Y^n))) = ∑_{i=1}^M P(g(Y^n) ≠ i | X^n = X^n(i)) p(i),

where p(i) = 1/M.
Remark. A key step in Shannon's argument is that a small average probability of error
implies we can also obtain a small maximum probability of error (by discarding the worst codewords)!

Rate
Definition 17. The rate R of an ( M, n) code is

log M
R= bits per transmission
n
Definition 18. A rate R is achievable if there exists a sequence of (⌈2^{nR}⌉, n)
codes such that λ^{(n)} → 0 as n → ∞.

Remark. To simplify the notation we write (2^{nR}, n) instead of (⌈2^{nR}⌉, n).

Capacity
Definition 19. The capacity of a channel is the supremum of all achievable rates.

• So the capacity of a DMC is the rate beyond which the error won’t tend
to zero with increasing n.

• Note: this is a different notion of capacity that we encountered before.


• Before we defined C = max p( x) I ( X, Y ).
• Here we are defining something called the ”capacity of a DMC”.
• We have not yet compared the two (but of course we will ).

5.3 Jointly Typical Sequences


Jointly Typical Sequences
Definition 20. The set of jointly typical sequences with respect to the distribu-
tion p( x, y) is defined by
A_ε^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :

|−(1/n) log p(x^n) − H(X)| < ε,   (11)
|−(1/n) log p(y^n) − H(Y)| < ε,   (12)
|−(1/n) log p(x^n, y^n) − H(X,Y)| < ε },   (13)

where

p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i).   (14)

Jointly Typical Sequences: Picture

Intuition for Jointly Typical Sequences
• So intuitively,

(num. jointly typical seqs.) / (num. independently chosen typical seqs.)
= 2^{nH(X,Y)} / (2^{nH(X)} 2^{nH(Y)}) = 2^{n(H(X,Y)−H(X)−H(Y))} = 2^{−nI(X;Y)}

• So if we independently at random choose two (singly) typical sequences


for X and Y, then the chance that it will be an ( X, Y ) jointly typical se-
quence decreases exponentially with n, as long as I ( X, Y ) > 0.
• Choosing p(x) to make this chance as small as possible (i.e., maximizing I(X;Y)) makes it 2^{−nC}.
Theorem 21 (Joint AEP). Let (X^n, Y^n) ∼ p(x^n, y^n) = ∏_{i=1}^n p(x_i, y_i) i.i.d. Then:

1. P( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1 as n → ∞.

2. (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.

3. If (X̃^n, Ỹ^n) ∼ p(x^n) p(y^n), i.e., X̃^n and Ỹ^n are drawn independently, then

P( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≤ 2^{−n(I(X;Y)−3ε)}.   (15)

Also, for sufficiently large n,

P( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≥ (1 − ε) 2^{−n(I(X;Y)+3ε)}.   (16)

Key property: we have a bound on the probability that independently drawn
sequences are jointly typical, and it falls off exponentially fast with n whenever
I(X;Y) > 0.

Joint AEP proof


 
Proof of P( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1.
• We have, by the weak law of large numbers,

−(1/n) log P(X^n) → −E(log P(X)) = H(X),   (17)

so ∀ε > 0 ∃ m₁ such that for n > m₁

P( |−(1/n) log P(X^n) − H(X)| > ε ) < ε/3.   (18)

Call this event S₁.

• So, S1 is a non-typical event.

Joint AEP proof


 
Proof of P( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1 (continued).
• Also ∃ m₂, m₃ such that ∀n > m₂ we have

P( |−(1/n) log P(Y^n) − H(Y)| > ε ) < ε/3   (19)   (call this event S₂),

and ∀n > m₃ we have

P( |−(1/n) log P(X^n, Y^n) − H(X,Y)| > ε ) < ε/3   (20)   (call this event S₃).

• So all events S1 , S2 and S3 are non-typical events.

Joint AEP proof
 
Proof of P( (X^n, Y^n) ∈ A_ε^{(n)} ) → 1 (continued).
• For n > max(m₁, m₂, m₃), we have that

P(S₁ ∪ S₂ ∪ S₃) ≤ 3 · ε/3 = ε.

• So non-typicality has probability ≤ ε, meaning P( (A_ε^{(n)})^c ) ≤ ε, giving
P(A_ε^{(n)}) ≥ 1 − ε, as desired.

Joint AEP proof


Proof of (1 − ε) 2^{n(H(X,Y)−ε)} ≤ |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.
• We have

1 = ∑_{x^n, y^n} p(x^n, y^n) ≥ ∑_{(x^n,y^n)∈A_ε^{(n)}} p(x^n, y^n) ≥ |A_ε^{(n)}| 2^{−n(H(X,Y)+ε)}
⟹ |A_ε^{(n)}| ≤ 2^{n(H(X,Y)+ε)}.

• Also, for n large enough, P(A_ε^{(n)}) > 1 − ε, so

1 − ε ≤ ∑_{(x^n,y^n)∈A_ε^{(n)}} p(x^n, y^n) ≤ |A_ε^{(n)}| 2^{−n(H(X,Y)−ε)}
⟹ |A_ε^{(n)}| ≥ (1 − ε) 2^{n(H(X,Y)−ε)}.

Joint AEP proof


Proof that two independent sequences are unlikely to be jointly typical. We have the following two
derivations:

P( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) = ∑_{(x^n,y^n)∈A_ε^{(n)}} p(x^n) p(y^n)
  ≤ 2^{n(H(X,Y)+ε)} 2^{−n(H(X)−ε)} 2^{−n(H(Y)−ε)}
  = 2^{−n(I(X;Y)−3ε)},

P( (X̃^n, Ỹ^n) ∈ A_ε^{(n)} ) ≥ (1 − ε) 2^{n(H(X,Y)−ε)} 2^{−n(H(X)+ε)} 2^{−n(H(Y)+ε)}
  = (1 − ε) 2^{−n(I(X;Y)+3ε)}.

More Intuition
• There are ≈ 2nH (X ) typical X sequences
• There are ≈ 2nH (Y ) typical Y sequences.
• The total number of independent typical pairs is ≈ 2^{nH(X)} 2^{nH(Y)}, but not
all of them are jointly typical; only ≈ 2^{nH(X,Y)} of them are.
• The fraction of independent typical pairs that are jointly typical is

2^{nH(X,Y)} / (2^{nH(X)} 2^{nH(Y)}) = 2^{n(H(X,Y)−H(X)−H(Y))} = 2^{−nI(X;Y)},

and this is essentially the probability that a randomly chosen pair of
(marginally) typical sequences is jointly typical.
• So if we use typicality to decode (which we will), we can use about
2^{nI(X;Y)} codewords before randomly chosen pairs start to be jointly
typical by accident.
• Ex: if p( x ) = 1/M then we can choose about M samples before we see a
given x, on average.

Channel Coding Theorem (Shannon 1948[2])


• The basic idea is to use joint typicality.
• Given a received sequence y^n, find a codeword x^n that is jointly typical with y^n.
• The transmitted x^n will be jointly typical with y^n with probability ≈ 1, for large enough n.
• Also, the probability that some other codeword x̃^n is jointly typical with y^n is about
2^{−nI(X;Y)},
• so if we use fewer than 2^{nI(X;Y)} codewords, the event that some other codeword is
jointly typical with y^n has vanishingly small probability for large n.

6 Channel Coding Theorem


6.1 Channel Coding Theorem
Channel Coding Theorem (Shannon 1948): more formally
Theorem 22 (Channel coding theorem). For a discrete memoryless channel, all
rates below capacity C are achievable. Specifically, for every rate R < C, there exists a
sequence of (2^{nR}, n) codes with maximum probability of error λ^{(n)} → 0 as n → ∞.
Conversely, any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 as n → ∞ must have
R ≤ C.

Channel Theorem

• Implications: as long as we do not code above capacity we can, for all


intents and purposes, code with zero error.
• This is true for all noisy channels representable under this model. We’re
talking about discrete channels now, but we generalize this to continuous
channels in the coming lectures.
• We could look at error for a particular code and bound its errors.
• Instead, we look at the average probability of errors of all codes generated
randomly.
• We then prove that this average error is small.
• This implies ∃ many good codes to make the average small.
• To show that the maximum probability of error is also small, we throw
away the worst 50% of the codewords.
• Recall: idea is, for a given channel ( X, p(y| x ), Y ) come up with a (2nR , n)
code of rate R which means we need:

1. Index set {1, . . . , M}


2. Encoder: X n : {1, . . . , M} → X n maps to codewords X n (i )
3. Decoder: g : Y n → {1, . . . , M } .

• Two parts to prove: 1) all rates R < C are achievable (there exists a code with
vanishing error); and 2) conversely, if the error goes to zero, then we must have
R ≤ C.

All rates R < C are achievable


Proof that all rates R < C are achievable.
• Given R < C, assume use of some p(x) and generate 2^{nR} random codewords
using p(x^n) = ∏_{i=1}^n p(x_i).
• Choose p(x) arbitrarily for now, and change it later to get C.
• The set of random codewords (the codebook) can be seen as a matrix:

C = [ x₁(1)        x₂(1)        ···   x_n(1)
      ⋮            ⋮                  ⋮
      x₁(2^{nR})   x₂(2^{nR})   ···   x_n(2^{nR}) ]   (21)

• So, there are 2^{nR} codewords, each of length n, generated via p(x).
• To send any message ω ∈ {1, 2, . . . , M = 2^{nR}}, we send the codeword
x_{1:n}(ω) = (x₁(ω), x₂(ω), . . . , x_n(ω)).

All rates R < C are achievable
Proof that all rates R < C are achievable.
• We can compute the probability of a given codeword for ω:

P(x^n(ω)) = ∏_{i=1}^n p(x_i(ω)),  ω ∈ {1, 2, . . . , M}.

• . . . or even of the entire codebook:

P(C) = ∏_{ω=1}^{2^{nR}} ∏_{i=1}^n p(x_i(ω)).   (22)

All rates R < C are achievable


Proof that all rates R < C are achievable. Consider the following encoding/decoding
scheme:

1. Generate a random codebook as above according to p( x )


2. Codebook known to both sender/receiver (who also knows p(y| x )).
3. Generate messages W according to the uniform distribution (we’ll see
why shortly), P(W = ω ) = 2−nR , for ω = 1, . . . , 2nR .
4. Send X n (ω ) over the channel.
5. Receiver receives Y^n according to the distribution

Y^n ∼ P(y^n | x^n(ω)) = ∏_{i=1}^n p(y_i | x_i(ω)).   (23)

6. The signal is decoded using typical set decoding (to be described).

All rates R < C are achievable


Proof that all rates R < C are achievable. Typical set decoding: decode the message
as ω̂ if

1. (x^n(ω̂), y^n) is jointly typical, and

2. ∄ k ≠ ω̂ such that (x^n(k), y^n) ∈ A_ε^{(n)} (i.e., ω̂ is unique).

Otherwise output a special invalid integer ”0” (error). Three types of errors
might occur (type A, B, or C).

A. ∃ k ≠ ω̂ such that (x^n(k), y^n) ∈ A_ε^{(n)} (i.e., more than one possible typical message).

B. There is no ω̂ such that (x^n(ω̂), y^n) is jointly typical.

C. ω̂ ≠ ω, i.e., the wrong codeword is jointly typical.

Note: maximum likelihood decoding is optimal and typical set decoding is not,
but the latter will be good enough to prove the result.
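
The following toy simulation (my own construction, under simplifying assumptions) illustrates random coding with typical-set decoding over a BSC(p) with uniform inputs. For this particular channel and input distribution, joint typicality of (x^n, y^n) essentially reduces to the empirical crossover fraction between x^n and y^n lying within ε of p, which is the check used below.

import numpy as np

rng = np.random.default_rng(0)

def simulate(n=200, R=0.05, p=0.1, eps=0.05, trials=100):
    M = int(2 ** (n * R))                            # 2^(nR) messages/codewords
    errors = 0
    for _ in range(trials):
        codebook = rng.integers(0, 2, size=(M, n))   # random codebook ~ Bernoulli(1/2)
        w = rng.integers(M)                          # message to send
        noise = (rng.random(n) < p).astype(int)      # BSC(p) bit flips
        y = codebook[w] ^ noise                      # received word
        flips = np.mean(codebook != y, axis=1)       # empirical crossover per codeword
        typical = np.flatnonzero(np.abs(flips - p) < eps)
        if len(typical) != 1 or typical[0] != w:     # decode only if exactly one match
            errors += 1
    return errors / trials

# R = 0.05 is far below C = 1 - H(0.1) ~= 0.53, so the error rate comes out small.
print(simulate())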

All rates R < C are achievable

Proof that all rates R < C are achievable. There are three types of quality measures
we might be interested in.

1. Code-specific error:

P_e^{(n)}(C) = P(ω̂ ≠ ω | C) = (1/2^{nR}) ∑_{i=1}^{2^{nR}} λ_i,

where

λ_i = P(g(Y^n) ≠ i | X^n = x^n(i)) = ∑_{y^n} p(y^n | x^n(i)) I(g(y^n) ≠ i),

but we would like something easier to analyze.

2. Average error over all randomly generated codes (an average of averages):

P(E) = ∑_C P(C) P(Ŵ ≠ W | C) = ∑_C P(C) P_e^{(n)}(C).

Surprisingly, this is much easier to analyze than P_e^{(n)}(C).

All rates R < C are achievable


Proof that all rates R < C are achievable (continued). The third quality measure:

3. Max error of the code, ultimately what we want to bound:

P_max(C) = max_{i∈{1,...,M}} λ_i.

We want to show that if R < C, then there exists a codebook C s.t. this error
→ 0 (and that if R > C the error must → 1).

Our method is to:

1. Expand average error (bullet 2 above) and show that it is small.
2. deduce that ∃ at least one code with small error
3. show that this can be modified to have small maximum probability of
error.

All rates R < C are achievable


Proof that all rates R < C are achievable.

P(E) = ∑_C P(C) P_e^{(n)}(C) = ∑_C P(C) (1/2^{nR}) ∑_{ω=1}^{2^{nR}} λ_ω(C)   (24)
     = (1/2^{nR}) ∑_{ω=1}^{2^{nR}} ∑_C P(C) λ_ω(C),   (25)

but

∑_C P(C) λ_ω(C) = ∑_C P(g(Y^n) ≠ ω | X^n = x^n(ω)) P(x^n(1), . . . , x^n(2^{nR})) =: ∑ T,

where P(x^n(1), . . . , x^n(2^{nR})) = ∏_{i=1}^{2^{nR}} P(x^n(i)), and the sum over codebooks C
is a sum over all choices of x^n(1), . . . , x^n(2^{nR}).

All rates R < C are achievable


Proof that all rates R < C are achievable (continued).

∑_C P(C) λ_ω(C)
 = ∑_{x^n(i), i≠ω} ∏_{i≠ω} P(x^n(i)) · ∑_{x^n(ω)} P(g(Y^n) ≠ ω | X^n = x^n(ω)) P(x^n(ω))
 = ∑_{x^n(ω)} P(g(Y^n) ≠ ω | X^n = x^n(ω)) P(x^n(ω))     [the first factor sums to 1]
 = ∑_{x^n(1)} P(g(Y^n) ≠ 1 | X^n = x^n(1)) P(x^n(1)) = ∑_C P(C) λ₁(C) = β.

The last sum is the same regardless of ω; call it β. Thus, we can arbitrarily
assume that ω = 1.

All rates R < C are achievable

Proof that all rates R < C are achievable.


So the error is equal to:
the prob. of choosing x₁ as the codeword for ω and then decoding incorrectly,
plus the prob. of choosing x₂ as the codeword for ω and then decoding incorrectly,
plus . . .
This is the same for all ω ∈ {1, . . . , M}, so we may just pick ω = 1.

All rates R < C are achievable


Proof that all rates R < C are achievable. So we get

P(E) = ∑_C P(C) P_e^{(n)}(C) = (1/2^{nR}) ∑_{ω=1}^{2^{nR}} β = β = ∑_C P(C) λ₁(C) = P(E | W = 1).

• Next, define the random events (again considering ω = 1):

E_i := { (X^n(i), Y^n) ∈ A_ε^{(n)} },  i ∈ {1, 2, . . . , 2^{nR}}.

• Assume that the input is x^n(1) (i.e., the first message is sent).

• Then the no-error event is E₁ ∩ (E₂ ∪ E₃ ∪ ··· ∪ E_M)^c = E₁ ∩ (E₂^c ∩ E₃^c ∩ ··· ∩ E_M^c).

All rates R < C are achievable


Proof that all rates R < C are achievable. Various flavors of error:

• E₁^c means that the transmitted and received codewords are not jointly typical
(this is error type B from before).
• E₂ ∪ E₃ ∪ ··· ∪ E_{2^{nR}}: this is either

– Type C: the wrong codeword is jointly typical with the received sequence, or
– Type A: more than one codeword is jointly typical with the received sequence,

so this covers both types C and A.


Our goal is to bound the probability of error.

All rates R < C are achievable


Proof that all rates R < C are achievable. Goal: bound the probability of error:

P(E | W = 1) = P(E₁^c ∪ E₂ ∪ E₃ ∪ ···)
             ≤ P(E₁^c) + ∑_{i=2}^{2^{nR}} P(E_i).

We have that

P(E₁^c) = P( (A_ε^{(n)})^c ) → 0 as n → ∞.

So, ∀ε > 0 ∃ n₀ such that P(E₁^c) ≤ ε, ∀n > n₀.

All rates R < C are achievable


Proof that all rates R < C are achievable.
• Also, because of the random code generation process (and recall ω = 1), X^n(1)
and X^n(i) are independent random variables, hence X^n(i) and Y^n are independent
for i ≠ 1.
• This gives, for i ≠ 1,

P( (X^n(i), Y^n) ∈ A_ε^{(n)} ) ≤ 2^{−n(I(X;Y)−3ε)}

by the joint AEP (applied to the independent pair).
• This will allow us to bound the error, as long as I(X;Y) > 3ε.

All rates R < C are achievable

Proof that all rates R < C are achievable. Consequently,

P(E) = P(E | W = 1) ≤ P(E₁^c | W = 1) + ∑_{i=2}^{2^{nR}} P(E_i | W = 1)
     ≤ ε + ∑_{i=2}^{2^{nR}} 2^{−n(I(X;Y)−3ε)}
     = ε + (2^{nR} − 1) 2^{−n(I(X;Y)−3ε)}
     ≤ ε + 2^{3nε} 2^{−n(I(X;Y)−R)}
     = ε + 2^{−n[(I(X;Y)−3ε)−R]}
     ≤ 2ε   for n sufficiently large.

The last step holds only if I(X;Y) − 3ε > R.

All rates R < C are achievable

Proof that all rates R < C are achievable.
• So if we choose R < I(X;Y) (strictly), we can find an ε and n so that the average
probability of error P(E) ≤ 2ε can be made as small as we want.
• But we need to get from an average to a max probability of error, and bound that.
• First, choose p(x) = argmax_{p(x)} I(X;Y) rather than an arbitrary p(x), to change
the condition from R < I(X;Y) to R < C. This gives us the highest possible rate limit.

• If P(E) ≤ 2ε, the bound on the average error is small, so there must exist
some specific code, say C*, s.t.

P_e^{(n)}(C*) ≤ 2ε.

All rates R < C are achievable

Proof that all rates R < C are achievable.
• Let's break apart this error probability:

P_e^{(n)}(C*) = (1/2^{nR}) ∑_{i=1}^{2^{nR}} λ_i(C*)
             = (1/2^{nR}) ∑_{i: λ_i < 4ε} λ_i(C*) + (1/2^{nR}) ∑_{i: λ_i ≥ 4ε} λ_i(C*).

• Now suppose more than half of the indices had error ≥ 4ε (i.e., suppose
|{i : λ_i ≥ 4ε}| > 2^{nR−1}). Under this assumption

(1/2^{nR}) ∑_{i: λ_i ≥ 4ε} λ_i(C*) ≥ (1/2^{nR}) ∑_{i: λ_i ≥ 4ε} 4ε = (1/2^{nR}) |{i : λ_i ≥ 4ε}| · 4ε > 2ε.

All rates R < C are achievable


Proof that all rates R < C are achievable.
• This cannot be, since these terms alone would exceed our 2ε upper bound.
• Hence, at most half of the codewords can have error ≥ 4ε, and we get

|{i : λ_i ≥ 4ε}| ≤ 2^{nR−1} ⟹ |{i : λ_i < 4ε}| ≥ 2^{nR−1}.

• Create a new codebook that eliminates all bad codewords (i.e., those with
index in {i : λ_i ≥ 4ε}). There are at most half of them.
• The remaining codewords number ≥ 2^{nR}/2 = 2^{nR−1} = 2^{n(R−1/n)} (at
least half of them). They all have error probability ≤ 4ε.
• We now code with rate R′ = R − 1/n → R as n → ∞, and for this new
sequence of codes the maximum error probability λ^{(n)} ≤ 4ε, which can be made as
small as we wish.

Discussion

• To summarize, random coding is the method of proof to show that if


R < C, there exists a sequence of (2nR , n) codes with λ(n) → 0 as n → ∞.
• This might not be the best code, but it is sufficient. It is an existence proof.
• Huge literature on coding theory. We’ll discuss Hamming codes.
• But many good codes exist today: Turbo codes, Gallager (or low-density-
parity-check) codes, and new ones are being proposed often.

• But we have yet to prove the converse . . .
• We next need to show that any sequence of (2^{nR}, n) codes with λ^{(n)} → 0
must have R ≤ C.
• Let's first consider the case P_e^{(n)} = 0; in this case it is easy to show that
R ≤ C.

6.2 Zero-Error Codes


Zero-Error Codes
• P_e^{(n)} = 0 ⟹ H(W | Y^n) = 0 (no uncertainty).
• For the sake of an easy proof, assume H(W) = nR = log M (i.e., a uniform
distribution over {1, 2, . . . , M}).
• Then we get

nR = H(W) = H(W | Y^n) + I(W; Y^n) = I(W; Y^n)
   ≤ I(X^n; Y^n)              // W → X^n → Y^n is a Markov chain; data-processing inequality
   ≤ ∑_{i=1}^n I(X_i; Y_i)    // Lemma 24, which follows next
   ≤ nC                       // definition of capacity

• Hence
R ≤ C.

6.3 Fano’s Lemmas


Fano’s Lemmas
Lemma 23. For a DMC with a codebook C and the input message W uniformly
distributed over {1, 2, . . . , 2^{nR}},

H(W | Ŵ) ≤ 1 + P_e^{(n)} nR.   (26)

Proof. W uniformly distributed ⟹ P_e^{(n)} = P(W ≠ Ŵ). We apply Fano's inequality,
H(X | Y) ≤ 1 + P_e log |X|,
to W and an alphabet of size 2^{nR}.
The next lemma shows that the capacity per transmission is not increased if we
use a DMC many times.

Lemma 24. Let Y n be the result of passing X n through a memoryless channel of
capacity C. Then
I ( X n ; Y n ) ≤ nC, ∀ p ( x n ). (27)
Proof.

I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
            = H(Y^n) − ∑_{i=1}^n H(Y_i | Y_1, . . . , Y_{i−1}, X^n)
            = H(Y^n) − ∑_{i=1}^n H(Y_i | X_i)   // definition of a DMC

Fano’s Lemmas

Proof (continued).

I(X^n; Y^n) = H(Y^n) − ∑_{i=1}^n H(Y_i | X_i)
            ≤ ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_i)
            = ∑_{i=1}^n I(X_i; Y_i) ≤ nC,

since each I(X_i; Y_i) ≤ C.

6.4 Converse to the Channel Coding Theorem


Converse to the Channel Coding Theorem
The converse states: any sequence of (2^{nR}, n) codes with λ^{(n)} → 0 must
have R ≤ C.

Proof that λ^{(n)} → 0 as n → ∞ ⇒ R ≤ C.
• The average probability of error goes to zero if the max probability does:
λ^{(n)} → 0 ⇒ P_e^{(n)} → 0, where P_e^{(n)} = (1/2^{nR}) ∑_{i=1}^{2^{nR}} λ_i.

• Set H(W) = nR for now (i.e., W uniform on {1, 2, . . . , M = 2^{nR}}). Again, this
makes the proof a bit easier and doesn't affect the relationship between R and C.
• So, P(Ŵ ≠ W) = P_e^{(n)} = (1/M) ∑_{i=1}^M λ_i.

Converse to the Channel Coding Theorem

Proof that λ^{(n)} → 0 as n → ∞ ⇒ R ≤ C (continued).

nR = H(W) = H(W | Ŵ) + I(W; Ŵ)          // uniformity of W
   ≤ 1 + P_e^{(n)} nR + I(W; Ŵ)          // by Fano's Lemma 23
   ≤ 1 + P_e^{(n)} nR + I(X^n; Y^n)       // data-processing inequality
   ≤ 1 + P_e^{(n)} nR + nC                // Lemma 24

R ≤ P_e^{(n)} R + C + 1/n.   (28)

Now as n → ∞, P_e^{(n)} → 0 and 1/n → 0 as well. Thus R ≤ C.
Remarks on the converse.

• We rewrite (28) as

P_e^{(n)} ≥ 1 − C/R − 1/(nR).

• If n → ∞ and R > C, the error lower bound is strictly positive and approaches
1 − C/R.
• Even for small n, P_e^{(n)} > 0, since otherwise, if P_e^{(n₀)} = 0 for some code, we
could concatenate that code with itself to get a large-n code of the same rate,
contradicting the bound above.
• Hence we cannot achieve an arbitrarily low probability of error at rates
above capacity.
• This converse is called the weak converse.
• Strong converse: if R > C, then P_e^{(n)} → 1.

6.5 Equality in the Converse to the Channel Coding Theorem


Equality in the Converse to the Channel Coding Theorem
What if we insist on R = C and P_e = 0? In such a case, what are the requirements
on any such code?

nR = H(W) = H(X^n(W))                    // if all codewords are distinct
   = H(W | Ŵ) + I(W; Ŵ) = I(W; Ŵ)        // H(W|Ŵ) = 0, since P_e = 0
   = H(Y^n) − H(Y^n | X^n)
   = H(Y^n) − ∑_{i=1}^n H(Y_i | X_i)
   = ∑_i H(Y_i) − ∑_i H(Y_i | X_i)       // if all Y_i independent
   = ∑_i I(X_i; Y_i)
   = nC                                   // p*(x) ∈ argmax_{p(x)} I(X_i; Y_i)

So there are 3 conditions for equality, R = C, namely

1. all codewords must be distinct


2. Yi ’s are independent
3. distribution on x is p∗ ( x ), a capacity achieving distribution.

7 Feedback Capacity
Feedback Capacity
Consider a sequence of channel uses.

Another way of looking at it is:

Can this help, i.e., can this increase C?

Does feedback help for DMC

• A: No.
• Intuition: w/o memory, feedback tells us nothing more than what we
already know, namely p(y| x ).
• Can feedback make decoding easier? Yes: consider the binary erasure channel,
where on receiving Y = e we just ask for a re-transmission (see the sketch after this list).
• Can feedback help for channels with memory? In general, yes.
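
A small sketch (my addition) of the erasure-channel remark above: with feedback, simply retransmitting each erased bit delivers every bit, and the throughput approaches 1 − α, the capacity; so feedback simplifies coding here without increasing C. The simulation parameters are arbitrary.

import random

def bec_feedback_throughput(n_bits=100_000, alpha=0.3, seed=1):
    rng = random.Random(seed)
    channel_uses = 0
    for _ in range(n_bits):           # retransmit each bit until it is not erased
        while True:
            channel_uses += 1
            if rng.random() >= alpha: # bit got through (not erased)
                break
    return n_bits / channel_uses      # bits delivered per channel use

print(bec_feedback_throughput())      # ~= 0.7 = 1 - alpha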

Feedback for DMC

Definition 25. A (2^{nR}, n) feedback code is an encoder X_i(W, Y_{1:i−1}), a decoder
g : Y^n → {1, 2, . . . , 2^{nR}}, and P_e^{(n)} = P(g(Y^n) ≠ W), for H(W) = nR
(W uniform).
Definition 26 (Capacity with feedback). The capacity with feedback CFB of a
DMC is the supremum of all rates achievable by feedback codes.
Theorem 27 (Feedback capacity).

C_FB = C = max_{p(x)} I(X;Y).   (29)

Feedback codes for DMC

Proof.
• Clearly, C_FB ≥ C, since a feedback code is a generalization of a code without feedback.
• Next, we use W instead of X and bound R.
• We have

H(W) = H(W | Ŵ) + I(W; Ŵ)
     ≤ 1 + P_e^{(n)} nR + I(W; Ŵ)     // by Fano
     ≤ 1 + P_e^{(n)} nR + I(W; Y^n)   // data-processing inequality

• We next bound I(W; Y^n).

Feedback codes for DMC
. . . proof continued.

I(W; Y^n) = H(Y^n) − H(Y^n | W) = H(Y^n) − ∑_{i=1}^n H(Y_i | Y_{1:i−1}, W)
          = H(Y^n) − ∑_{i=1}^n H(Y_i | Y_{1:i−1}, W, X_i)   // X_i = f(W, Y_{1:i−1})
          = H(Y^n) − ∑_{i=1}^n H(Y_i | X_i) ≤ ∑_i [H(Y_i) − H(Y_i | X_i)]
          = ∑_i I(X_i; Y_i) ≤ nC.

Feedback codes for DMC


. . . proof continued. Thus we have

nR = H(W) ≤ 1 + P_e^{(n)} nR + nC
⟹ R ≤ 1/n + P_e^{(n)} R + C, and letting n → ∞ gives R ≤ C, hence C_FB ≤ C, so C_FB = C.

Thus, feedback does not help for a DMC.

8 Source-Channel Separation Theorem


Joint Source/Channel Theorem

• Data compression: We now know that it is possible to achieve error free


compression if our average rate of compression, R, measured in units of
bits per source symbol, is such that R > H where H is the entropy of the
generating source distribution.
• Data Transmission: We now know that it is possible to achieve error free
communication and transmission of information if R < C, where R is the
average rate of information sent (units of bits per channel use), and C is
the capacity of the channel.
• Q: Does this mean that if H < C, we can reliably send a source of entropy
H over a channel of capacity C?

• This seems intuitively reasonable.

Joint Source/Channel Theorem: process
The process would go something as follows:

1. Compress a source down to its entropy, using Huffman, LZ, arithmetic


coding, etc.

2. Transmit it over a channel.


3. If all sources could share the same channel, would be very useful.
4. I.e., perhaps the same channel coding scheme could be used regardless
of the source, if the source is first compressed down to the entropy. The
channel encoder/decoder need not know anything about the original
source (or how to encode it).
5. Joint source/channel decoding as in the following figure:

6. Maybe obvious now, but at the time (1940s) it was a revolutionary idea!

Source/Channel Separation Theorem

• Source: V ∈ V that satisfies AEP (e.g., stationary ergodic).


• Send V1:n = V1 , V2 , . . . , Vn over channel, entropy rate H (V ) of stochastic
process (if i.i.d., H (V ) = H (Vi ), ∀i).

• Error probability and setup:

P_e^{(n)} = P(V_{1:n} ≠ V̂_{1:n})
         = ∑_{y_{1:n}} ∑_{v_{1:n}} P(v_{1:n}) P(y_{1:n} | X^n(v_{1:n})) I{ g(y_{1:n}) ≠ v_{1:n} },

where I is the indicator function and g the decoding function.

Source/Channel Coding Theorem
Theorem 28 (Source-channel coding theorem). If V_{1:n} satisfies the AEP and H(V) < C,
then there exists a sequence of (2^{nR}, n) source-channel codes with P_e^{(n)} → 0. Conversely,
if H(V) > C, then P_e^{(n)} is bounded away from 0, and we cannot send the process with
arbitrarily low probability of error.

Proof.
• If V satisfies the AEP, then ∃ a set A_ε^{(n)} with |A_ε^{(n)}| < 2^{n(H(V)+ε)} (and A_ε^{(n)}
has almost all the probability).
• We only encode the typical set, and signal an error otherwise. This contributes at most ε
to P_e.
• We index the elements of A_ε^{(n)} as {1, 2, . . . , 2^{n(H+ε)}}, so we need n(H + ε) bits.
• This gives a rate of R = H(V) + ε. If R < C, then the channel-decoding error is < ε,
which we can make as small as we wish.

Source/Channel Coding Theorem


. . . proof continued.
• Then

P_e^{(n)} = P(V_{1:n} ≠ V̂_{1:n})
         ≤ P(V^n ∉ A_ε^{(n)}) + P(g(Y^n) ≠ V^n | V^n ∈ A_ε^{(n)})   [second term < ε, since R < C]
         ≤ ε + ε = 2ε,

• and the first part of the theorem is proved.
• To show the converse, we show that P_e^{(n)} → 0 ⇒ H(V) ≤ C for source-channel codes.

Source/Channel Coding Theorem


. . . proof continued.
• Define:

X^n(V^n) : V^n → X^n    // encoder
g_n(Y^n) : Y^n → V^n    // decoder

• Now recall, the original Fano inequality says H(X|Y) ≤ 1 + P_e log |X|.
• Here we have

H(V^n | V̂^n) ≤ 1 + P_e^{(n)} log |V^n| = 1 + n P_e^{(n)} log |V|.
Source/Channel Coding Theorem
. . . proof continued.
• We get the following derivation:

H(V) ≤ H(V₁, V₂, . . . , V_n)/n = H(V_{1:n})/n
     = (1/n) H(V_{1:n} | V̂_{1:n}) + (1/n) I(V_{1:n}; V̂_{1:n})
     ≤ (1/n)(1 + P_e^{(n)} log |V^n|) + (1/n) I(V_{1:n}; V̂_{1:n})    // by Fano
     ≤ (1/n)(1 + P_e^{(n)} log |V^n|) + (1/n) I(X_{1:n}; Y_{1:n})     // V → X → Y → V̂
     ≤ 1/n + P_e^{(n)} log |V| + C                                     // memoryless channel

• Letting n → ∞, both 1/n → 0 and P_e^{(n)} → 0, which leaves us with H(V) ≤ C.

9 Coding
9.1 Introduction to coding
Coding and codes
• Shannon’s theorem says that there exists a sequence of codes such that if
R < C the error goes to zero.
• It doesn’t provide such a code, nor does it offer much insight on how to
find one.
• Typical set coding is not practical. Why? Exponentially large sized blocks.
• In all cases, we add enough redundancy to a message so that the original
message can be decoded unambiguously

Physical Solution to Improve Coding


• It is possible to communicate more reliably by changing physical proper-
ties to decrease the noise (e.g., decrease p in a BSC).
• Use more reliable and expensive circuitry
• improve environment (e.g., control thermal conditions, remove dust par-
ticles or even air molecules)
• In compression, use more physical area/volume for each bit.
• In communication, use higher power transmitter, use more energy thereby
making noise less of a problem.
• These are not IT solutions which is what we want.

Repetition Repetition Repetition Code Code Code

• Rather than send message x1 x2 . . . xk we repeat each symbol K times re-


dundantly.
• Recall our example of repeating each word in a noisy analog radio con-
nection.
• Message becomes x₁x₁. . .x₁ x₂x₂. . .x₂ . . . (each symbol repeated K times).

• For many channels (e.g., a BSC with p < 1/2), the error goes to zero as K → ∞.
• Easy decoding: when K is odd, take a majority vote (which is optimal for a BSC).
• On the other hand, R ∝ 1/K → 0 as K → ∞.

• This is really a pre-1948 way of thinking about coding.


• Thus, this is not a good code.
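
A minimal sketch (my own addition) of the K-fold repetition code over a BSC(p) with majority-vote decoding, illustrating the trade-off just described: the error rate falls with K while the rate R = 1/K goes to zero.

import random

def repetition_error_rate(K=3, p=0.1, n_bits=50_000, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        bit = rng.randint(0, 1)
        received = [bit ^ (rng.random() < p) for _ in range(K)]   # K noisy copies
        decoded = int(sum(received) > K // 2)                     # majority vote (K odd)
        errors += decoded != bit
    return errors / n_bits

for K in (1, 3, 5, 7):
    print(K, repetition_error_rate(K))   # error falls with K; the rate is 1/K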

Repetition Code Example

• (From D. MacKay [2]) Consider sending the message s = 0010110 with the K = 3 repetition code.

• One scenario (s = source, t = transmitted, n = noise, r = received):
  s:   0    0    1    0    1    1    0
  t:  000  000  111  000  111  111  000
  n:  000  001  000  000  101  000  000
  r:  000  001  111  000  010  111  000

• Decoding each block by majority vote gives
  ŝ:   0    0    1    0    0    1    0
  The single error in the second block is corrected; the double error in the fifth
  block goes undetected and produces a wrong bit.

• Thus, the code can only correct one bit error per block, not two.

Simple Parity Check Code

• Binary input/output alphabets X = Y = {0, 1}.


• Block sizes of n − 1 bits: x1:n−1 .

• The nth bit indicates whether there is an odd number of 1 bits among x_{1:n−1}.
• I.e., x_n ← (∑_{i=1}^{n−1} x_i) mod 2 (a small sketch follows this list).
• Thus a necessary condition for a valid codeword is (∑_{i=1}^{n} x_i) mod 2 = 0.
• Any instance of an odd number of errors (bit flips) fails this condition
(although an even number of errors will pass it).
• Far from perfect: it cannot correct errors, and moreover it only detects some
kinds of errors (an odd number of flips).

• On the other hand, parity checks form the basis for many sophisticated
coding schemes (e.g., low-density parity check (LDPC) codes, Hamming
codes etc.).
• We study Hamming codes next.
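
Before moving on, here is the tiny parity-check sketch promised above (my own illustration); add_parity and parity_ok are made-up helper names.

def add_parity(bits):
    return bits + [sum(bits) % 2]          # x_n = (sum of x_1..x_{n-1}) mod 2

def parity_ok(word):
    return sum(word) % 2 == 0              # necessary condition for a codeword

w = add_parity([1, 0, 1, 1])               # -> [1, 0, 1, 1, 1]
print(parity_ok(w))                        # True
w[2] ^= 1                                  # a single bit flip is detected...
print(parity_ok(w))                        # False
w[3] ^= 1                                  # ...but a second flip goes unnoticed
print(parity_ok(w))                        # True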

9.2 Hamming Codes


(7; 4; 3) Hamming Codes

• Best illustrated by an example.


• Let X = Y = {0, 1}.

• Fix the desired rate at R = 4/7 bit per channel use.


• Thus, in order to send 4 data bits, we need to use the channel 7 times.
• Let the four data bits be denoted x0 , x1 , x2 , x3 ∈ {0, 1}.
• When we send these 4 bits, we are also going to send 3 additional parity
or redundancy bits, named x4 , x5 , x6 .
• Note: all arithmetic in the following will be mod 2, i.e. 1 + 1 = 0, 1 + 0 =
1, 1 = 0 − 1 = −1, etc.
• Parity bits determined by the following equations:

x4 ≡ x1 + x2 + x3 mod 2
x5 ≡ x0 + x2 + x3 mod 2
x6 ≡ x0 + x1 + x3 mod 2

• I.e., if ( x0 , x1 , x2 , x3 ) = (0110) then ( x4 , x5 , x6 ) = (011) and complete 7-bit


codeword sent over channel would be (0110011).

• We can also describe this using linear equalities as follows (all mod 2).

x1 + x2 + x3 + x4 =0
x0 + x2 + x3 + x5 =0
x0 + x1 + x3 + x6 = 0

• Or, alternatively, as Hx = 0, where x^T = (x₀, x₁, . . . , x₆) and

H = [ 0 1 1 1 1 0 0
      1 0 1 1 0 1 0
      1 1 0 1 0 0 1 ]

• Notice that H is a column permutation of all non-zero length-3 column


vectors.

• H is called parity-check matrix


• Thus the code words are defined by the null-space of H, i.e., { x : Hx =
0}.
• Since the rank of H is 3, the null-space dimension is 4, and we expect
there to be 16 = 24 binary vectors in this null space.
• These 16 vectors are:
0000000   0100101   1000011   1100110
0001111   0101010   1001100   1101001
0010110   0110011   1010101   1110000
0011001   0111100   1011010   1111111

Hamming Codes : weight

• Thus, any codeword is in C = { x : Hx = 0}.

• Thus, if v1 , v2 ∈ C then v1 + v2 ∈ C and v1 − v2 ∈ C due to linearity


(codewords closed under addition and subtraction).
• weight of a code is 3, which is minimum number of ones in any non-zero
codeword.

• Why is there no codeword of weight 1 or 2? A weight-1 codeword would require a
zero column of H, and a weight-2 codeword would require two equal columns; since
the columns of H are all distinct and non-zero, neither can happen.
• The minimum weight is exactly 3: the sum of two columns of H equals some other
column, so adding that third column gives zero, producing a weight-3 codeword.

Hamming Codes : Distance

• Thus, any codeword is in C = { x : Hx = 0}.


• minimum distance of a code is also 3, which is minimum number of dif-
ferences between any two codewords.
• Why? Given v₁, v₂ ∈ C with v₁ ≠ v₂, we have (v₁ − v₂) ∈ C (and over GF(2),
v₁ − v₂ = v₁ + v₂). This is a non-zero codeword, so it has weight ≥ 3; hence v₁ and
v₂ differ in at least 3 places.
• Another way of saying this: if v₁, v₂ ∈ C then d_H(v₁, v₂) ≥ 3, where d_H(·,·) is the
Hamming distance.
• In general, codes with large minimum distance are good because it is then possible to
correct errors: if v̂ is the received word, we decode to the codeword v_i with
i ∈ argmin_i d_H(v̂, v_i).

Hamming Codes : BSC

• Now a BSC(p) (crossover probability p) will change some of the bits (noise), meaning
a 0 might change to a 1 and vice versa.
• So if x = (x₀, x₁, . . . , x₆) is transmitted, what is received is y = x + z =
(x₀ + z₀, x₁ + z₁, . . . , x₆ + z₆), where z = (z₀, z₁, . . . , z₆) is the noise vector.
• Receiver knows y but wants to know x. We then compute

s = Hy = H(x + z) = Hx + Hz = Hz   (since Hx = 0)

• s is called the syndrome of y. s = 0 means that all parity checks are


satisfied by y and is a necessary condition for a correct codeword.
• Moreover, we see that s is a linear combination of the columns of H:

s = z₀ (0,1,1)^T + z₁ (1,0,1)^T + z₂ (1,1,0)^T + ··· + z₆ (0,0,1)^T

• Since y = x + z, we know y, so if we know z we know x.


• We only need to solve for z in s = Hz, 16 possible solutions.
• Ex: Suppose that y T = 0111001 is received, then s T = (101) and the 16
solutions are:
0100000 0010011 0101111 1001001
1100011 0001010 1000110 1111010
0000101 0111001 1110101 0011100
0110110 1010000 1101100 1011111

• 16 is better than 128 (possible z vectors) but still many.
• What is the probability of each solution? Since we are assuming a BSC(p) with
p < 1/2, the most probable solution has the least weight: a solution with weight k has
probability p^k (1 − p)^{7−k}, which decreases with k.
• Notice that there is only one possible solution with weight 1, and this is
most probable solution.
• In the previous example, the most probable solution is z^T = (0100000), and
y = x + z with y^T = 0111001 leads to the codeword x = 0011001 and
information bits 0011.
• In fact, for any s, there is a unique minimum weight solution for z in
s = Hz (in fact, this weight is no more than 1)!
• If s = (000) then the unique solution is z = (0000000).
• For any other s, then s must be equal to one of the columns of H, so we
can generate s by flipping the corresponding bit of z on (giving weight 1
solution).

Hamming Decoding Procedure


Here is the final decoding procedure on receiving y:
1. Compute the syndrome s = Hy.
2. If s = (000) set z ← (0000000) and goto 4.
3. Otherwise, locate the unique column of H equal to s and form z as all zeros but with
a 1 in that position.
4. Set x ← y + z.
5. output ( x0 , x1 , x2 , x3 ) as the decoded string.
This procedure can correct any single bit error, but fails when there is more
than one error.
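
A compact sketch (my own addition) of the (7, 4, 3) Hamming code just described, using the same parity-check matrix H, bit ordering x₀ . . . x₆, and syndrome-decoding procedure; the function names are illustrative.

import numpy as np

H = np.array([[0, 1, 1, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, 1]])

def encode(data):                              # data = (x0, x1, x2, x3)
    x0, x1, x2, x3 = data
    return np.array([x0, x1, x2, x3,
                     (x1 + x2 + x3) % 2,       # x4
                     (x0 + x2 + x3) % 2,       # x5
                     (x0 + x1 + x3) % 2])      # x6

def decode(y):
    s = H @ y % 2                              # syndrome
    z = np.zeros(7, dtype=int)
    if s.any():                                # flip the bit whose column of H matches s
        z[np.flatnonzero((H.T == s).all(axis=1))[0]] = 1
    x = (y + z) % 2
    return x[:4]

x = encode([0, 1, 1, 0])                       # codeword 0110011 from the example above
y = x.copy(); y[1] ^= 1                        # one bit error
print(decode(y))                               # -> [0 1 1 0], the error is corrected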

Hamming Decoding: Venn Diagrams


• We can visualize the decoding procedure using Venn diagrams

• Here, first four bits to be sent ( x0 , x1 , x2 , x3 ) are set as desired and parity
bits ( x4 , x5 , x6 ) are also set. Figure shows ( x0 ; x1 , . . . , x6 ) = (1, 0, 0, 0, 1, 0, 1)
with parity check bits:

x4 ≡ x0 + x1 + x2 mod 2
x5 ≡ x1 + x2 + x3 mod 2
x6 ≡ x0 + x2 + x3 mod 2

Hamming Decoding: Venn Diagrams

• The syndrome can be seen as a condition where the parity conditions are
not satisfied.

• Above we argued that for s ≠ (0, 0, 0) there is always a one-bit flip that will
satisfy all parity conditions.

Hamming Decoding: Venn Diagrams

• Example: Here, z1 can be flipped to achieve parity.

Hamming Decoding: Venn Diagrams

• Example: Here, z4 can be flipped to achieve parity.

Hamming Decoding: Venn Diagrams

• Example: And here, z2 can be flipped to achieve parity.

Hamming Decoding: Venn Diagrams

• Example: And here, there are two errors, y6 and y2 (marked with a *).

• Flipping y1 will achieve parity, but this will lead to three errors (i.e., we
will switch to a wrong codeword, and since codewords have minimum
Hamming distance of 3, we’ll get 3 bit errors).

Coding

• Many other coding algorithms.


• Reed Solomon Codes (used by CD players).
• Bose, Ray-Chaudhuri, Hocquenghem (BCH) codes.

• Convolutional codes
• Turbo codes (two convolutional codes with permutation network)
• Low Density Parity Check (LDPC) codes.

• All developed on the journey to find good, practical codes that achieve
Shannon's promise.

References
[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, 2nd edi-
tion, Wiley, 2006.

[2] David J.C. MacKay, Information Theory, Inference, and Learning Algorithms,
Cambridge University Press, 2003.
[3] Robert M. Gray, Entropy and Information Theory, Springer, 2009

References
[1] D. A. Huffman, A method for the construction of minimum-redundancy
codes, Proc. IRE, 40: 1098–1101, 1952.
[2] C. E. Shannon, A mathematical theory of communication, Bell System
Technical Journal, 27: 379–423, 623–656, 1948.
[3] R. W. Hamming, Error detecting and error correcting codes, Bell System
Technical Journal, 29: 147–160, 1950.
