HW 2 Sol
1. Huffman coding.
Consider the random variable
    Symbol x:   x1     x2     x3     x4     x5     x6     x7
    p(x):       0.50   0.26   0.11   0.04   0.04   0.03   0.02
(a) Find a binary Huffman code for X.
(b) Find the expected codelength for this encoding.
(c) Extend the binary Huffman method to a ternary alphabet (an alphabet of size 3)
and apply it to X.
Solution: Huffman coding.
(a) The Huffman tree for this distribution is
Codeword  Symbol  Probability (after successive merges)
1         x1      0.50  0.50  0.50  0.50  0.50  0.50  1.00
01        x2      0.26  0.26  0.26  0.26  0.26  0.50
001       x3      0.11  0.11  0.11  0.11  0.24
00011     x4      0.04  0.04  0.08  0.13
00010     x5      0.04  0.04  0.05
00001     x6      0.03  0.05
00000     x7      0.02
(b) The expected length of the codewords for the binary Huffman code
is 2 bits. (H(X) = 1.99 bits)
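For concreteness, here is a small Python sketch (an illustration, not part of the original solution) that rebuilds a binary Huffman code for this pmf with the standard heap-based construction and checks the expected length; the exact codewords depend on tie-breaking, but the lengths match the table above.

import heapq

def huffman_code(pmf):
    # heap entries: (probability, tie-breaker, list of symbols in the subtree)
    heap = [(p, i, [s]) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    code = {s: "" for s in pmf}
    tie = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # smallest probability: prepend '0'
        p1, _, group1 = heapq.heappop(heap)   # second smallest:      prepend '1'
        for s in group0:
            code[s] = "0" + code[s]
        for s in group1:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (p0 + p1, tie, group0 + group1))
        tie += 1
    return code

pmf = {"x1": 0.50, "x2": 0.26, "x3": 0.11, "x4": 0.04,
       "x5": 0.04, "x6": 0.03, "x7": 0.02}
code = huffman_code(pmf)
L = sum(p * len(code[s]) for s, p in pmf.items())
print(code)
print(L)   # approximately 2.0 bits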
(c) The ternary Huffman tree is
Codeword  Symbol  Probability (after successive merges)
0         x1      0.50  0.50  0.50  1.00
1         x2      0.26  0.26  0.26
20        x3      0.11  0.11  0.24
21        x4      0.04  0.04
222       x5      0.04  0.09
221       x6      0.03
220       x7      0.02
This code has an expected length of 1.33 ternary symbols (H3(X) = 1.25 ternary
symbols).
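For reference, the ternary entropy quoted above is just the binary entropy converted
to base 3: H3(X) = H(X)/log2(3) ≈ 1.99/1.585 ≈ 1.25 ternary symbols.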
2. Codes.
Let X1 , X2 , . . . be i.i.d. with

    X =  1, with probability 1/2
         2, with probability 1/4
         3, with probability 1/4,

and consider the code C(1) = 0, C(2) = 01, C(3) = 11. The entropy rate of the
process is defined as

    H(X) ≜ lim_{n→∞} H(X^n)/n.                                            (1)

What is the entropy rate of the process?
Solution: Codes.
(a) Yes, this code is nonsingular because C(x) is different for every
x.
(b) Yes, this code is uniquely decodable. Reversing the codewords gives

    C′(x) =  0,  if x = 1
             10, if x = 2
             11, if x = 3,

which is prefix-free, so the reversed string can be decoded instantaneously; hence
the original code is uniquely decodable (decode by reading the string backwards).
(c) No, this code is not instantaneous because C(1) is a prefix of
C(2).
(d) The expected codeword length is

    L(C(X)) = 0.5 × 1 + 0.25 × 2 + 0.25 × 2 = 3/2.

Further, since the process is i.i.d., its entropy rate is

    H(X) = H(0.5, 0.25, 0.25) = 3/2 bits.
So the code is a uniquely decodable code with L = H(X), and therefore the
sequence is maximally compressed, with H(Z) = 1 bit (here Z^n denotes the binary
sequence obtained by concatenating the codewords of X1, X2, . . .). If H(Z) were
less than its maximum of 1 bit, then the Z^n sequence could be compressed further
to its entropy rate, and X^m could also be compressed further by block coding.
However, this would result in L_m < H(X), which contradicts Theorem 5.4.2 of
the text. So H(Z) = 1 bit.
Note that the Z^n sequence is not i.i.d. Bernoulli(1/2), even though H(Z) = 1 bit.
For example, P{Z1 = 1} = 1/4, and a sequence starting with 10 . . . is not allowed.
However, once Zi = 0 for some i, then Zk is Bernoulli(1/2) for k > i, so Z^n is
asymptotically Bernoulli(1/2) and gives the entropy rate of 1 bit.
(b) Assume you have a file of size 1,000 symbols where the symbols are
distributed i.i.d. according to the pmf above. After applying the
Huffman code, what would be the pmf of the compressed binary
file and what would be the expected length?
(c) Generate a sequence (using MATLAB or any other software) of length 10,000
symbols of X drawn i.i.d. according to PX . Assume the alphabet of X is
X = {0, 1, . . . , 6}.
(d) What is the percentage of each symbol (0, 1, . . . , 6) in the sequence that was
generated by MATLAB? Explain this result using the law of large numbers.
(e) Represent each symbol of X using the simple binary representation. Namely,
X = 0 is represented as '000', X = 1 as '001', X = 2 as '010', . . . , X = 6 as '110'.
(f) What is the length of the simple representation? What percentage of '0' and
'1' do you have in this representation?
(g) Now, compress the 10,000 symbols of X into bits using the Huffman code.
(h) What is the length of the compressed file? What percentage of '0' and '1' do
you have in this representation?
(i) Explain the results.
Solution:
Figure 1: Huffman tree for PX = (1/2, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16); the resulting
codewords are 1, 011, 010, 0011, 0010, 0001, 0000 for x = 0, 1, . . . , 6.
(b) The Huffman code is an optimal code and achieves the entropy for a dyadic
distribution. If the distribution of the binary digits were not Bernoulli(1/2), the
file could be compressed further. Hence the binary digits are equally distributed
after applying the Huffman code, i.e., p0 = p1 = 1/2.
The expected length per symbol is

    E[l] = (1/2)·1 + (1/8)·3 + (1/8)·3 + (1/16)·4 + (1/16)·4 + (1/16)·4 + (1/16)·4 = 2.25.

Therefore, the expected length of the compressed file for 1000 symbols is 2250 bits.
(c-i) The results from simulation should be as follows:
Drawn symbols
-------------
Size: 10000
Percentage of x=0: 0.4984
Percentage of x=1: 0.125
Percentage of x=2: 0.1262
Percentage of x=3: 0.0636
Percentage of x=4: 0.0575
Percentage of x=5: 0.0647
Percentage of x=6: 0.0646
Before compression
------------------
Size: 30000
Percentage of b=0: 0.77213
Percentage of b=1: 0.22787
After compression
------------------
Size: 22463
Percentage of b=0: 0.49958
Percentage of b=1: 0.50042
Explanation: The symbols X(i), i ∈ {1, 2, . . . , 10,000}, are drawn i.i.d. according
to PX . Therefore, by the law of large numbers, the fraction of appearances of each
symbol x ∈ X should be approximately PX (x). Next, we represent each symbol
with 3 binary symbols. Since this file is uncompressed, the percentage of 0's is not
one half. The expected percentage can be calculated from the pmf: the expected
number of 0's per symbol is

    3·(1/2) + 2·(1/8) + 2·(1/8) + 1·(1/16) + 2·(1/16) + 1·(1/16) + 1·(1/16) = 2.3125,

so, out of the 30,000 bits of the simple representation, the expected percentage of
0's is 2.3125/3 ≈ 77.08%. After the Huffman code is applied, the numbers of 0's
and 1's are almost equal. This is because optimal lossless compression produces
(nearly) maximum-entropy output, so the bits 0 and 1 in the compressed file are
approximately uniform.
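A possible Python sketch of this experiment (the original was presumably done in MATLAB; the codebook below is the one of Figure 1):

import random
from collections import Counter

random.seed(0)
pX = [1/2, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]            # pmf over the symbols 0..6
huffman = {0: "1", 1: "011", 2: "010", 3: "0011",
           4: "0010", 5: "0001", 6: "0000"}              # lengths 1, 3, 3, 4, 4, 4, 4

symbols = random.choices(range(7), weights=pX, k=10000)
counts = Counter(symbols)
print({x: counts[x] / 10000 for x in range(7)})          # close to pX by the LLN

plain = "".join(format(x, "03b") for x in symbols)       # simple 3-bit representation
compressed = "".join(huffman[x] for x in symbols)        # Huffman representation
for name, bits in [("before", plain), ("after", compressed)]:
    print(name, len(bits), "bits,", round(bits.count("0") / len(bits), 4), "zeros")
# expected: 30000 bits with about 77% zeros, about 22500 bits with about 50% zeros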
Solution
(a)
    H(X) = − Σ_{x∈X} p(x) log p(x)
         = − Σ_{i=1}^∞ 2^{−i} log2(2^{−i})
         = Σ_{i=1}^∞ i · 2^{−i}
         = 2 bits.
(b) Coding Scheme:

    i    codeword
    1    0
    2    10
    3    110
    4    1110
    5    11110
    ...  ...
Average Length:

    L* = Σ_{i=1}^∞ p(X = i) L(i) = Σ_{i=1}^∞ i · 2^{−i} = 2 = H(X).
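Both sums above are evaluated with the standard identity Σ_{i≥1} i x^i = x/(1 − x)^2;
at x = 1/2 this gives Σ_{i≥1} i · 2^{−i} = (1/2)/(1/2)^2 = 2.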
5. Bad wine.
One is given 6 bottles of wine. It is known that precisely one bottle
has gone bad (tastes terrible). From inspection of the bottles it is
determined that the probability pi that the ith bottle is bad is given by
(p1 , p2 , . . . , p6 ) = (7/26, 5/26, 4/26, 4/26, 3/26, 3/26). Tasting will determine the
bad wine.
Suppose you taste the wines one at a time. Choose the order of tasting
to minimize the expected number of tastings required to determine the
bad bottle. Remember, if the first 5 wines pass the test you don’t have
to taste the last.
Now you get smart. For the first sample, you mix some of the wines in
a fresh glass and sample the mixture. You proceed, mixing and tasting,
stopping when the bad bottle has been determined.
Solution: Bad wine.
(b) The first bottle to be tasted should be the one with probability 7/26.
(c) The idea is to use Huffman coding.
Codeword   26·pi (after successive merges)
(01)       7   7   8   11  15  26
(11)       5   6   7   8   11
(000)      4   5   6   7
(001)      4   4   5
(100)      3   4
(101)      3
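Each tasting in the mixing strategy corresponds to one code symbol, so (although not
stated explicitly above) the resulting expected number of tastings is

    E[# tastings] = (7·2 + 5·2 + 4·3 + 4·3 + 3·3 + 3·3)/26 = 66/26 ≈ 2.54.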
6. Relative entropy is cost of miscoding.
Let the random variable X have five possible outcomes {1, 2, 3, 4, 5}.
Consider two distributions on this random variable
Symbol p(x) q(x) C1 (x) C2 (x)
1 1/2 1/2 0 0
2 1/4 1/8 10 100
3 1/8 1/8 110 101
4 1/16 1/8 1110 110
5 1/16 1/8 1111 111
(a) Calculate H(p), H(q), D(p||q) and D(q||p).
(b) The last two columns above represent codes for the random vari-
able. Verify that the average length of C1 under p is equal to the
entropy H(p). Thus C1 is optimal for p. Verify that C2 is optimal
for q.
(c) Now assume that we use code C2 when the distribution is p. What is the
average length of the codewords? By how much does it exceed the entropy H(p)?
(d) What is the loss if we use code C1 when the distribution is q?
Solution:
(a) H(p) = 15/8 bits, H(q) = 2 bits, and D(p||q) = 1/8 bit. Similarly, D(q||p) = 1/8 bit.
(b) The average codeword length for C1 under p is

    E l1 = (1/2)·1 + (1/4)·2 + (1/8)·3 + 2·(1/16)·4 = 15/8 = H(p).

Similarly, the average codeword length for C2 under q is 2 = H(q).
(c) The average length of C2 under p is

    E_p l2 = (1/2)·1 + (1/4)·3 + (1/8)·3 + 2·(1/16)·3 = 2,

which exceeds H(p) by D(p||q) = 1/8.
(d) Similarly, E_q l1 = 17/8, which exceeds H(q) by D(q||p) = 1/8.
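The numbers in parts (a)-(d) can be checked with a few lines of Python (an illustration, not part of the original solution):

from math import log2

p = [1/2, 1/4, 1/8, 1/16, 1/16]
q = [1/2, 1/8, 1/8, 1/8, 1/8]
len_C1 = [1, 2, 3, 4, 4]    # lengths of C1 = 0, 10, 110, 1110, 1111
len_C2 = [1, 3, 3, 3, 3]    # lengths of C2 = 0, 100, 101, 110, 111

def H(r):
    return -sum(ri * log2(ri) for ri in r)

def D(r, s):
    return sum(ri * log2(ri / si) for ri, si in zip(r, s))

def avg_len(r, lens):
    return sum(ri * li for ri, li in zip(r, lens))

print(H(p), H(q))                               # 1.875, 2.0
print(D(p, q), D(q, p))                         # 0.125, 0.125
print(avg_len(p, len_C1), avg_len(q, len_C2))   # 1.875 = H(p), 2.0 = H(q)
print(avg_len(p, len_C2), avg_len(q, len_C1))   # 2.0 = H(p) + 1/8, 2.125 = H(q) + 1/8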
Let F_i = Σ_{k<i} p_k be the sum of the probabilities of all symbols less than i. Then
the codeword for i is the number F_i ∈ [0, 1] rounded off to l_i bits, where
l_i = ⌈log(1/p_i)⌉.
(a) Show that the code constructed by this process is prefix-free and the average
length satisfies

    log(1/p_i) ≤ l_i < log(1/p_i) + 1,                                     (4)

which implies that

    H(X) ≤ L = Σ_i p_i l_i < H(X) + 1.                                     (5)

(b) Construct the code for the probability distribution (0.5, 0.25, 0.125, 0.125).
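A short Python sketch of the construction in part (b) (an illustration, under the assumption that the probabilities are listed in decreasing order, which is what the prefix property requires):

from math import ceil, log2

def shannon_code(probs):
    # probs must be sorted in decreasing order
    codewords = []
    F = 0.0
    for p in probs:
        l = max(1, ceil(log2(1 / p)))
        bits, frac = "", F
        for _ in range(l):             # first l bits of the binary expansion of F
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        codewords.append(bits)
        F += p
    return codewords

print(shannon_code([0.5, 0.25, 0.125, 0.125]))   # ['0', '10', '110', '111']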
9. AEP. Let X1 , X2 , . . . be independent identically distributed random variables
drawn according to the probability mass function p(x), x ∈ {1, 2, . . . , m}. Thus
p(x1 , x2 , . . . , xn ) = ∏_{i=1}^n p(xi ). We know that −(1/n) log p(X1 , X2 , . . . , Xn ) → H(X)
in probability. Let q(x1 , x2 , . . . , xn ) = ∏_{i=1}^n q(xi ), where q is another probability
mass function on {1, 2, . . . , m}. Evaluate lim −(1/n) log[q(X1 , . . . , Xn )/p(X1 , . . . , Xn )],
where X1 , X2 , . . . are i.i.d. according to p(x).
Solution: AEP.
    lim −(1/n) log [ q(X1 , . . . , Xn ) / p(X1 , . . . , Xn ) ]
        = lim −(1/n) Σ_i log [ q(Xi ) / p(Xi ) ]
        = −E[ log (q(X)/p(X)) ]   w.p. 1   (by the strong law of large numbers)
        = − Σ_x p(x) log (q(x)/p(x))
        = Σ_x p(x) log (p(x)/q(x))
        = D(p||q).
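A quick numerical illustration of this convergence (with an arbitrary choice of p and q, not from the original solution):

import random
from math import log2

random.seed(1)
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
n = 100000
sample = random.choices(range(3), weights=p, k=n)

empirical = -sum(log2(q[x] / p[x]) for x in sample) / n
Dpq = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
print(empirical, Dpq)    # the two values should be close for large n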
10. Empirical distribution of a sequence. Before starting the question, below are
two facts that you may find useful:
• Stirling approximation: n! ≈ √(2πn) (n/e)^n .
• Consider a sequence of length n that consists of two different numbers. The
  first number appears n1 times and the second number appears n2 times such
  that n1 + n2 = n. The number of different combinations of such sequences is
  given by (n choose n1, n2) = n!/(n1! n2!).
A fair die with 6 faces was thrown n times, where n is a very large number.
(a) Find how many different sequences exist with an empirical pmf (p1 , p2 , ..., p6 ),
where pi is the fraction of the sequence that is equal to i ∈ {1, 2, ..., 6}.
In this part you may assume that n! ≈ (n/e)^n , since only the exponential order
(the power of n/e) matters.
(b) Now, we are told that the fraction of odd numbers in the sequence is 2/3
(i.e., p1 + p3 + p5 = 2/3). For n very large, what is the most likely empirical pmf
of the sequence? Hint: Define

    X = { 1 w.p. p1 ; 3 w.p. p3 ; 5 w.p. p5 },   Y = { 2 w.p. p2 ; 4 w.p. p4 ; 6 w.p. p6 },
    Z = { X w.p. 2/3 ; Y w.p. 1/3 }.
Solution
(a) The number of combinations N is given by:

    N = ( n choose n1 , n2 , n3 , n4 , n5 , n6 )
      = n! / (n1 ! n2 ! n3 ! n4 ! n5 ! n6 !)
      ≈(a) (n/e)^n / [ (n1 /e)^{n1} (n2 /e)^{n2} (n3 /e)^{n3} (n4 /e)^{n4} (n5 /e)^{n5} (n6 /e)^{n6} ].

Finally, we get:

    N ≈ 2^{ −Σ_{i=1}^6 ni log(ni /n) }
      = 2^{ −n Σ_{i=1}^6 (ni /n) log(ni /n) }
      = 2^{ n H(n1 /n, n2 /n, n3 /n, n4 /n, n5 /n, n6 /n) }
      ≈(b) 2^{ n H(1/6, 1/6, 1/6, 1/6, 1/6, 1/6) }
      = 2^{ n log 6 },

where:
(a) follows from n! ≈ (n/e)^n ;
(b) follows since, for sufficiently large n, ni /n ≈ 1/6 (the throws are i.i.d. and the
die is fair, so the law of large numbers applies).
The result is plausible, since the typical set and the set defined by the law of large
numbers coincide as the number of samples goes to infinity.
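A small numeric sanity check (an illustration, not part of the original solution): the exact log2 of the multinomial coefficient is already close to n·H(empirical pmf) for moderate n.

from math import lgamma, log, log2

def log2_multinomial(counts):
    n = sum(counts)
    return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

counts = [100] * 6                                  # n = 600, empirical pmf (1/6, ..., 1/6)
n = sum(counts)
H = -sum((c / n) * log2(c / n) for c in counts)     # = log2(6)
print(log2_multinomial(counts), n * H)              # roughly 1530 vs 1551: same exponential order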
(b) First, observe that:

    (n1 + n3 + n5 )/n = 2/3   ⇔   p1 + p3 + p5 = 2/3.
We would like to find {p1 , p2 , p3 , p4 , p5 , p6 } which, under the given constraint,
maximizes the entropy. This gives the largest typical set, which means the
corresponding empirical pmf is the most likely one. Equivalently, define

    X = { 1 w.p. p1 ; 3 w.p. p3 ; 5 w.p. p5 },   Y = { 2 w.p. p2 ; 4 w.p. p4 ; 6 w.p. p6 },
    Z = { X w.p. 2/3 ; Y w.p. 1/3 }.
Define the following auxiliary indicator variable:

    I = 1 if Z = X,   I = 0 if Z = Y.
Now, let us maximize the entropy of Z:

    H(Z) =(a) H(Z, I)
          = H(I) + H(Z|I)
          = H(2/3) + (1/3) H(Y) + (2/3) H(X),

where (a) follows since

    H(Z, I) = H(Z) + H(I|Z) = H(Z),

because I is deterministic given Z (the values of X and Y do not overlap).
Accordingly, maximizing H(Z) means maximizing H(X) and H(Y). As shown in
class, the uniform distribution maximizes entropy. Thus X and Y should be
uniform, which gives

    p1 = p3 = p5 = 2/9,    p2 = p4 = p6 = 1/9.
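A quick check in Python (an illustration only) that moving probability mass away from this pmf, while keeping p1 + p3 + p5 = 2/3, lowers the entropy:

from math import log2

def H(p):
    return -sum(x * log2(x) for x in p if x > 0)

best = [2/9, 1/9, 2/9, 1/9, 2/9, 1/9]                       # (p1, p2, ..., p6)
perturbed = [2/9 + 0.05, 1/9, 2/9 - 0.05, 1/9, 2/9, 1/9]    # odd mass still 2/3
print(H(best), H(perturbed))                                 # H(best) is larger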
11. Drawing a codebook. Let Xi be i.i.d. random variables distributed according
to P (x). We draw a codebook of 2^{nR} codewords, each a sequence of length n
drawn i.i.d. according to P (x). We would like to answer the question: what is
the probability that the first codeword is identical to another codeword in the
codebook, as n goes to infinity?
• Let x^n be a sequence in the typical set A_ε^{(n)} (X). What is the asymptotic
probability (you may provide an upper and lower bound), as n → ∞, that we draw
a sequence X^n i.i.d. according to P (x) and obtain exactly x^n ?
• Using your answer from the previous sub-question, find an α such that if R < α,
the probability that the first codeword in the codebook appears twice or more in
the codebook goes to zero as n → ∞.
• Find an α such that if R > α, the probability that the first codeword in the
codebook appears twice or more in the codebook goes to 1 as n → ∞.
where (a) follows from the union bound and (b) follows from the first sub-question.
If we set α = H(X) − ε, this probability goes to zero.
(c) We want to find α such that if R > α, then P (A) = 1 − Pr(∀ i ≠ 1 : x^n (i) ≠ x^n (1))
goes to 1. Consider the following derivation:
    Pr(∀ i ≠ 1 : x^n (i) ≠ x^n (1))
        = ∏_{i=2}^{2^{nR}} Pr(x^n (i) ≠ x^n (1))
        = ∏_{i=2}^{2^{nR}} ( 1 − Pr(x^n (i) = x^n (1)) )
      ≤(a) ∏_{i=2}^{2^{nR}} ( 1 − 2^{−n(H(X)+ε)} )
        = ( 1 − 2^{−n(H(X)+ε)} )^{2^{nR} − 1}
      ≤(b) exp( −(2^{nR} − 1) · 2^{−n(H(X)+ε)} ),

where (a) follows from the lower bound on the probability of a typical sequence
(first sub-question) and (b) follows from 1 − x ≤ e^{−x} . Since
(2^{nR} − 1) · 2^{−n(H(X)+ε)} ≈ 2^{n(R − H(X) − ε)} → ∞ whenever R > H(X) + ε, the
probability above goes to 0, and hence P (A) → 1. So we can take α = H(X) + ε.
12. Saving the princess. A princess was abducted and put in one of K rooms.
The rooms are labeled 1, 2, . . . , K, and room i has size si , i = 1, 2, . . . , K. The
probability pi that the princess is in room i is proportional to the size of the room,
namely, pi = α si , where α is a constant.
(a) Find α.
(b) In order to save the princess you need to find which room she is in. You may
ask the demon yes/no questions, such as "Is she in room 1?", "Is she in room 2
or 5?", "Is she in an odd-numbered room?", and so on. You will save the princess
if and only if the expected number of questions is the minimum possible. What
questions should you ask the demon in order to save the princess?
Solution
(a) Since p is a probability mass function, it must satisfy

    1 = Σ_{i=1}^K pi = Σ_{i=1}^K α si = α Σ_{i=1}^K si .

Therefore, α = 1 / Σ_{i=1}^K si .
(b) The idea is to use a Huffman code built from the probabilities that the princess
is in each room, following the solution of the Bad wine question. Here, instead of
mixing wines we ask questions such as "Is the princess in room 1 or 2?", and we
look for the princess instead of the bad bottle.
Figure 2: Lossless source coding with side information at the encoder and
decoder.
Show that a code with rate R < H(X|Y ) cannot be achievable, and interpret the
result.
Hint: Let T ≜ f (X^n , Y^n ). Consider

    nR ≥ H(T )
       ≥ H(T |Y^n ),                                                       (7)
Solution: Sketch of the solution (please fill in the explanation for each step):

    nR ≥ H(T )
       ≥ H(T |Y^n )
       ≥ I(X^n ; T |Y^n )
       = H(X^n |Y^n ) − H(X^n |T, Y^n )
       = nH(X|Y ) − ε_n ,

where ε_n → 0.
ii.  D( P_{A,B} || P_A P_B ) = D( P_{B|A} || P_B | P_A ).
iii. I(A; B) = D( P_{B|A} || P_B | P_A ).
iv.  D( P_{A|B} || Q_{A|B} | P_B ) = Σ_{b∈B} P_B (b) D( P_{A|B=b} || Q_{A|B=b} ).
(Figure: X^n → Encoder → T (X^n , Z^n ) → Decoder → X̂^n , with side information
Z^n available at both the encoder and the decoder.)
(b) What are the conditions under which the equality I(X; Y ) = H(Y ) − H(Z)
holds?
(c) Assume that the conditions for I(X; Y ) = H(Y ) − H(Z) are satisfied. Is it
true that there exists a function g such that Z = g(Y )?
Solution
    0 = H(Y |X, Z)
      = H(X, Y, Z) − H(X, Z)
      = H(X) + H(Y |X) + H(Z|X, Y ) − H(X) − H(Z|X)
      = H(Y |X) − I(Z; Y |X).

It follows that H(Y |X) = I(Z; Y |X).
17. True or False: Copy each relation and write true or false.
(b) For two probability distributions, pXY and qXY , that are defined
on X × Y, the following holds:
(c) If X and Y are dependent and also Y and Z are dependent, then
X and Z are dependent.
Solution
    I(X; W ) ≤ I(X; Z) ≤ I(Y ; Z),

where

    D( P_{Y|X} || Q_{Y|X} | P_X ) = Σ_{x∈X} P_X (x) D( P_{Y|X=x} || Q_{Y|X=x} ).
x 0 1 2 3
pX (x) 0.5 0.25 0.125 0.125
(b) Build a binary Huffman code for the source X.
(c) What is the expected length of the resulting compressed sequence?
(d) What is the expected number of zeros in the resulting compressed sequence?
(e) Let X̃^n be another source distributed i.i.d. according to pX̃ .
x̃ 0 1 2 3
pX̃ (x̃) 0.3 0.4 0.1 0.2
Solution
    H(X) = −0.5 · log 0.5 − 0.25 · log 0.25 − 2 · 0.125 · log 0.125 = 1.75 bits.
(b)
    c(x)   x   pX (x)
    1      0   0.5
    01     1   0.25
    001    2   0.125
    000    3   0.125
(The tree merges 0.125 + 0.125 = 0.25, then 0.25 + 0.25 = 0.5, then 0.5 + 0.5 = 1.)
(c) Denote the length of a codeword by L(c(xi )). Then the expected length of the
compressed sequence is

    E[L(c^n (X^n ))] = Σ_{i=1}^n E[L(c(Xi ))]
                    = Σ_{i=1}^n [ pX (0)·L(c(0)) + pX (1)·L(c(1)) + pX (2)·L(c(2)) + pX (3)·L(c(3)) ]
                    = n (0.5 · 1 + 0.25 · 2 + 0.125 · 3 + 0.125 · 3)
                    = 1.75n.

(d) Since the code is optimal, the expected number of zeros is half of the expected
length, i.e., 0.875n (see the following sub-question).
(e) Denote the length of a codeword by L(c(x̃i )). Then

    E[L(c^n (X̃^n ))] = Σ_{i=1}^n E[L(c(X̃i ))]
                     = Σ_{i=1}^n [ pX̃ (0)·L(c(0)) + pX̃ (1)·L(c(1)) + pX̃ (2)·L(c(2)) + pX̃ (3)·L(c(3)) ]
                     = n (0.3 · 1 + 0.4 · 2 + 0.1 · 3 + 0.2 · 3)
                     = 2n.

Note that here the expected number of zeros is not half of the expected length,
which implies that the code is not optimal for X̃.
The optimal compression rate for X̃ is R∗ = H(X̃) = 1.846 bits per symbol.
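The expected lengths and zero fractions in parts (c)-(e) can be checked with a short Python sketch (an illustration, not part of the original solution):

from math import log2

code = {0: "1", 1: "01", 2: "001", 3: "000"}
pX      = [0.5, 0.25, 0.125, 0.125]
pXtilde = [0.3, 0.4, 0.1, 0.2]

def stats(p):
    exp_len   = sum(pi * len(code[i]) for i, pi in enumerate(p))
    exp_zeros = sum(pi * code[i].count("0") for i, pi in enumerate(p))
    return exp_len, exp_zeros / exp_len

print(stats(pX))        # (1.75, 0.5): optimal, half of the bits are zeros
print(stats(pXtilde))   # (2.0, 0.6): not optimal for the second source
print(-sum(p * log2(p) for p in pXtilde))    # H(Xtilde) = 1.846...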