Lecture Notes Part II
Lecture 11
Instructor: Arya Mazumdar Scribe: Nanwei Yao
2. Conditional entropy:
$$H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)$$
3. Joint entropy:
$$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)$$
4. Mutual information:
$$I(X;Y) = H(X) - H(X|Y)$$
The previous proof was somewhat involved, so we prove this identity again in a simpler way to make it clearer.
Proof:
$$\begin{aligned}
H(X,Y) &= -\sum_{x,y} p(x,y)\log p(x,y) \\
&= -\sum_{x,y} p(x,y)\log\big[p(x)p(y|x)\big] \\
&= -\sum_{x,y} p(x,y)\log p(x) \;-\; \sum_{x,y} p(x,y)\log p(y|x) \\
&= -\sum_{x}\log p(x)\sum_{y} p(x,y) \;+\; H(Y|X) \\
&= -\sum_{x} p(x)\log p(x) + H(Y|X) \\
&= H(X) + H(Y|X).
\end{aligned}$$
Definition: For random variables $(X,Y)$ whose probabilities are given by $p(x,y)$, the conditional entropy $H(Y|X)$ is defined by
$$\begin{aligned}
H(Y|X) &= \sum_{x\in\mathcal{X}} p(x)\,H(Y|X=x) \\
&= -\sum_{x\in\mathcal{X}} p(x)\sum_{y\in\mathcal{Y}} p(y|x)\log p(y|x) \\
&= -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log p(y|x).
\end{aligned}$$
One important thing to notice is that $H(X) \ge H(X|Y)$; this means that conditioning always reduces entropy. The entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other. This is the chain rule:
$$H(X,Y) = H(X) + H(Y|X).$$
Thus, $I(X;Y) = H(X) + H(Y) - H(X,Y)$. We can understand the mutual information $I(X;Y)$ as the reduction in the uncertainty of $X$ due to our knowledge of $Y$. By symmetry, we also have $I(X;Y) = I(Y;X)$; thus the information $X$ provides about $Y$ is the "same amount" as the information $Y$ provides about $X$.
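These identities are easy to check numerically. Below is a minimal sketch in Python (the joint pmf is an arbitrary example chosen only for illustration, not something from the lecture) verifying the chain rule and the expression for $I(X;Y)$.

```python
# Numerical check of H(X,Y) = H(X) + H(Y|X) and I(X;Y) = H(X) + H(Y) - H(X,Y)
# on an arbitrary 2x3 joint pmf (illustrative only).
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.25, 0.10]])   # p(x, y), rows indexed by x

def H(p):
    """Entropy in bits of a probability array (0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                  # marginal p(x)
p_y = p_xy.sum(axis=0)                  # marginal p(y)

H_xy = H(p_xy)                          # joint entropy H(X,Y)
# H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H_xy, H(p_x) + H_y_given_x)       # equal: chain rule
print(H(p_x) + H(p_y) - H_xy)           # I(X;Y), which is nonnegative
```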
[Figure: block diagram of the rate-distortion encoder $f_n$ and decoder.]
Above is a figure of the rate-distortion encoder and decoder, where $f_n : \mathcal{X}^n \to \{1, 2, \ldots, M_n\}$ is the encoder and $\hat{X}^n$ is the reconstruction we choose.
[Figure: there are $M_n$ points in the entire space; each point is a chosen codeword (center). We map each data point to the nearest center, incurring some distortion.]
Thus, the distortion of a sequence is the average of the symbol-to-symbol distortions. We would like $d(x^n,\hat{x}^n) \le D$. The compression rate we achieve is $R = \lim_{n\to\infty}\frac{\log M_n}{n}$ bits/symbol. The rate-distortion function $R(D)$ is the minimum of $\frac{\log M_n}{n}$ such that $\mathbb{E}\,d(x^n,\hat{x}^n)\le D$, in the limit $n\to\infty$.
Theorem (Fundamental theorem of source coding): The information rate-distortion function $R(D)$ for a source $X$ with distortion measure $d(x,\hat{x})$ is defined as
$$R(D) = \min_{p(\hat{x}|x):\ \mathbb{E}\,d(X,\hat{X})\le D} I(X;\hat{X}),$$
and $R(D)$ is non-increasing in $D$ and convex. Recall also the chain rule for entropy:
$$H(X_1, X_2, X_3,\ldots,X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1,X_2) + \cdots + H(X_n|X_1, X_2,\ldots,X_{n-1}).$$
For the first property, we can argue intuitively: if $R(D)$ were increasing, allowing more distortion would require a higher rate, i.e., the more the distortion, the worse the compression. That cannot be, since a code for a smaller distortion is automatically a code for a larger one.
Proof: Choose two points $(R_1, D_1)$, $(R_2, D_2)$ on the boundary of the $R(D)$ curve, achieved by conditional distributions $P_{\hat{X}_1|X}$ and $P_{\hat{X}_2|X}$. Then we can construct another distribution
$$P_{\hat{X}_\lambda|X} = \lambda P_{\hat{X}_1|X} + (1-\lambda)P_{\hat{X}_2|X}, \qquad 0\le\lambda\le 1.$$
We know that $I(X;\hat{X})$ is a convex function of $p(\hat{x}|x)$ for a given $p(x)$. Therefore,
$$R\big(\lambda D_1 + (1-\lambda)D_2\big) \le \lambda R(D_1) + (1-\lambda)R(D_2).$$
For the above proof, we need the fact that the distortion $D$ is a linear function of $p(\hat{x}|x)$. We know that $D$ is the expected distortion, given by $D = \sum_{x,\hat{x}} p(x,\hat{x})d(x,\hat{x}) = \sum_{x,\hat{x}} p(x)p(\hat{x}|x)d(x,\hat{x})$. If we treat $p(\hat{x}|x)$ as a variable and both $p(x)$ and $d(x,\hat{x})$ as known quantities, then $D$ is a linear function of $p(\hat{x}|x)$. Combined with the convexity of $I(X;\hat{X})$ in $p(\hat{x}|x)$, this shows that $R(D)$ is a convex function of $D$. The proof that $I(X;\hat{X})$ is a convex function of $p(\hat{x}|x)$ will not be shown here.
Converse Argument of R(D)
The converse argument of $R(D)$ tells us that for any coding scheme whose expected distortion is at most $D$, the rate cannot be less than $R(D)$. Now, let's prove it.
Proof:
$$\begin{aligned}
\log M_n &\ge H(\hat{X}^n) \\
&\ge H(\hat{X}^n) - H(\hat{X}^n|X^n) \\
&= I(\hat{X}^n; X^n) \\
&= H(X^n) - H(X^n|\hat{X}^n) \\
&= \sum_{i=1}^{n} H(X_i) - \sum_{i=1}^{n} H(X_i\,|\,\hat{X}^n, X_1, X_2, \ldots, X_{i-1}) \\
&\ge \sum_{i=1}^{n} H(X_i) - \sum_{i=1}^{n} H(X_i|\hat{X}_i) \qquad\text{(conditioning reduces entropy)}\\
&= \sum_{i=1}^{n} I(X_i; \hat{X}_i).
\end{aligned}$$
Hence,
$$\begin{aligned}
\log M_n &\ge \sum_{i=1}^{n} I(X_i; \hat{X}_i) \\
&\ge \sum_{i=1}^{n} R\big(\mathbb{E}\,d(X_i, \hat{X}_i)\big) \\
&= n\cdot\frac{1}{n}\sum_{i=1}^{n} R\big(\mathbb{E}\,d(X_i, \hat{X}_i)\big) \\
&\ge n\,R\!\left(\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\,d(X_i, \hat{X}_i)\right) \qquad\text{(convexity of }R(D)\text{)}\\
&= n\,R\!\left(\mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^{n} d(X_i, \hat{X}_i)\Big]\right) \\
&\ge n\,R(D) \qquad\text{(}R(D)\text{ is non-increasing and the expected distortion is at most }D\text{)}.
\end{aligned}$$
We see from the proof that $R = \lim_{n\to\infty}\frac{\log M_n}{n} \ge R(D)$; thus our proof is finished.
For a Bernoulli($p$) source with Hamming distortion,
$$\begin{aligned}
I(X;\hat{X}) &= H(X) - H(X|\hat{X}) = h(p) - H(X\oplus\hat{X}\,|\,\hat{X}) \\
&\ge h(p) - H(X\oplus\hat{X}) \qquad\text{(conditioning reduces entropy)} \\
&= h(p) - H(Y),
\end{aligned}$$
where $Y = X\oplus\hat{X}$. Recall from previous lectures that for $D \le \tfrac12$ the binary entropy function $h(D)$ is increasing, and $\Pr(Y=1) = \Pr(X\ne\hat{X}) \le D$; thus $h(p) - H(Y) \ge h(p) - h(D)$ for $D \le \tfrac12$. We have shown that $R(D) \ge h(p) - h(D)$, and we will show $R(D) = h(p) - h(D)$. When $p = \tfrac12$, $R(D) = 1 - h(D)$.
[Figure: the binary entropy function $h(D)$ versus the normalized distortion $D$.]
Up to this point we have the lower bound; to achieve it, we want $H(X\oplus\hat{X}\,|\,\hat{X})$ to have the same value as $h(D)$.
[Figure: test channel between $\hat{X}$ and $X$: a binary symmetric channel with crossover probability $D$ (direct edges $1-D$, cross edges $D$), with $\Pr(X=0)=1-p$, $\Pr(X=1)=p$, and $\Pr(X\ne\hat{X})=D$.]
In this case, $I(X;\hat{X}) = H(X) - H(Y) = h(p) - h(D)$; thus we have proved both directions of the main theorem for the binary source.
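As a quick illustration, the following sketch evaluates $R(D) = h(p) - h(D)$ for a Bernoulli($p$) source at a few distortion values (the particular choices of $p$ and $D$ are arbitrary).

```python
# Binary rate-distortion function R(D) = h(p) - h(D) for D < min(p, 1-p), else 0.
import math

def h(x):
    """Binary entropy in bits; h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def R_binary(p, D):
    if D >= min(p, 1 - p):
        return 0.0
    return h(p) - h(D)

p = 0.5
for D in [0.0, 0.05, 0.1, 0.25, 0.45]:
    print(D, R_binary(p, D))   # for p = 1/2 this is 1 - h(D)
```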
Midterm Solutions
1. $1-h(D)$ is the optimal rate of compression that is achievable: $1-h(D) = 1-h(1/20)$, where $h(x) = -x\log x - (1-x)\log(1-x)$.
2. $\dfrac{\pi(1/4)^2}{(1/2)^2} = \dfrac{\pi}{4}$.
[Figure: the unit square with corners (0,0), (1,0), (0,1), (1,1), with the value 0.25 marked.]
3. For the source with probabilities $p_1, p_2, p_3, \ldots$, where $p_i = \frac{9}{10}\left(\frac{1}{10}\right)^{i-1}$:
$$\begin{aligned}
\sum_{i=n+1}^{\infty} p_i &= \sum_{i=n+1}^{\infty}\frac{9}{10}\left(\frac{1}{10}\right)^{i-1} \\
&= \frac{9}{10}\left(\frac{1}{10}\right)^{n} + \frac{9}{10}\left(\frac{1}{10}\right)^{n+1} + \cdots \\
&= \frac{9}{10}\left(\frac{1}{10}\right)^{n}\left[1 + \frac{1}{10} + \left(\frac{1}{10}\right)^{2} + \cdots\right] \\
&= \frac{9}{10}\left(\frac{1}{10}\right)^{n}\cdot\frac{1}{1-\frac{1}{10}} \\
&= \frac{1}{10}\left(\frac{1}{10}\right)^{n-1}.
\end{aligned}$$
Since $p_n = \frac{9}{10}\left(\frac{1}{10}\right)^{n-1}$, we have $p_n > \sum_{i=n+1}^{\infty} p_i$.
[Figure: Huffman coding tree for this source; $p_1$ is assigned codeword 0, $p_2$ codeword 10, $p_3$ codeword 110, and so on.]
4. $\frac{n(n+1)}{2} = 5050 \;\Rightarrow\; n = 100$. The length is $100\times\lceil\log(100)+1\rceil = 800$ bits. In general,
$$R = \frac{m\,(\log(m)+1)}{m(m+1)/2} = \frac{2(\log(m)+1)}{m+1}.$$
EE5585 Data Compression March 5, 2013
Lecture 12
Instructor: Arya Mazumdar Scribe: Shreshtha Shukla
1 Review
In the last class we studied the fundamental theorem of source coding, where R(D) is the optimal rate
of compression, given the normalized average distortion D. Our task was to find this optimal rate, for
which we had the theorem stated as
$$R(D) = \min_{p(\hat{x}|x):\ \mathbb{E}\,d(X,\hat{X})\le D} I(X;\hat{X}).$$
We also proved $R(D) \ge \min I(X;\hat{X})$, known as the converse part of the theorem. The direct part of the theorem, known as the achievability result, is proved using the random coding method. Also recall that for Bernoulli($p$) random variables, $R(D) = h(p) - h(D)$, where $h(\cdot)$ is the binary entropy function; when $D = 0$, $R(D) = h(p)$, that is, the source can be compressed to $h(p)$ bits per symbol.
Now consider a continuous source producing $X_1, X_2, X_3, \ldots, X_n$, for instance i.i.d. Gaussian samples.
Figure 1: Gaussian pdf with a 1-bit partition.
Now, how can you represent this using 1 bit? With 1 bit, at most we can get the information whether $x \ge 0$ or $x < 0$.
After compressing each sample to 1 bit, the next question is: how do we decode it? For this, some criterion has to be selected; for example, we set our codewords based on MSE (mean squared error), i.e., we find the value of $a$ that minimizes the expected squared error
$$\begin{aligned}
\mathbb{E}[(X-\hat{X})^2] &= \int_{-\infty}^{0} f(x)(x+a)^2\,dx + \int_{0}^{\infty} f(x)(x-a)^2\,dx \\
&= 2\int_{0}^{\infty} f(x)(x-a)^2\,dx \\
&= 2\left[\int_{0}^{\infty} x^2 f(x)\,dx - 2a\int_{0}^{\infty} x f(x)\,dx + a^2\int_{0}^{\infty} f(x)\,dx\right] \\
&= (a^2 + \sigma^2) - 4a\int_{0}^{\infty} x\,\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/2\sigma^2}\,dx.
\end{aligned}$$
Let $y = x^2/2\sigma^2$, so $dy = x\,dx/\sigma^2$. Then
$$4a\int_{0}^{\infty} x\,\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/2\sigma^2}\,dx = 2a\sqrt{2\sigma^2/\pi}\int_{0}^{\infty} e^{-y}\,dy = 2a\sqrt{2\sigma^2/\pi}.$$
Hence $\mathbb{E}[(X-\hat{X})^2] = a^2 + \sigma^2 - 2a\sqrt{2\sigma^2/\pi}$. To minimize this, differentiate with respect to $a$ and set the derivative to 0:
$$2a - 2\sqrt{2\sigma^2/\pi} = 0 \quad\Longrightarrow\quad a = \sqrt{\frac{2}{\pi}}\,\sigma,$$
which is what we choose as our codeword.
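The value $a = \sigma\sqrt{2/\pi}$ can be sanity-checked by simulation. The sketch below (the value of $\sigma$ and the sample size are arbitrary) estimates the MSE of the 1-bit quantizer $\hat{X} = \mathrm{sign}(X)\,a$ by Monte Carlo and confirms the minimum is near $a = \sigma\sqrt{2/\pi}$.

```python
# Monte Carlo check that a = sigma*sqrt(2/pi) minimizes E[(X - Xhat)^2]
# for the 1-bit quantizer of a zero-mean Gaussian (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=1_000_000)

def mse(a):
    xhat = np.where(x >= 0, a, -a)      # 1-bit quantizer output
    return np.mean((x - xhat) ** 2)

a_star = sigma * np.sqrt(2 / np.pi)
for a in [0.5 * a_star, a_star, 1.5 * a_star]:
    print(a, mse(a))                    # smallest MSE at a = a_star
print("theory:", (np.pi - 2) / np.pi * sigma**2)   # distortion at the optimum
```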
Further, it is always good to take a vector instead of a single random variable (i.e., a scalar RV). The vector then lives in an $n$-dimensional space. In this case also, we need to find the appropriate regions and their associated optimal reconstruction points.
Figure 3: Voronoi regions in $\mathbb{R}^n$.
These regions are called Voronoi regions and the partitions are known as Dirichlet partitions. To find the optimal partitions and associated centers, there is an algorithm known as Lloyd's algorithm. Briefly, Lloyd's algorithm works as follows: we start with some initial set of quantized points (e.g., for 10 bits, $2^{10} = 1024$ points) and find the Voronoi regions for these points. The expected distortion is then optimized; we update the points to those optimal points and again find the Voronoi regions for them. Doing this iteratively converges to optimal values. This algorithm has several names; Lloyd's algorithm is also known as the expectation-maximization algorithm. (We know the optimal values from rate-distortion theory, so we can compare with the convergence result.) A more detailed study of this will be done in later classes.
Similarly, for continuous RVs, instead of the pmf we have the pdf $f_X(x)$; e.g., $\mathcal{X} = \mathbb{R}$, $X\sim\mathcal{N}(0,\sigma^2)$. We define the differential entropy of a continuous random variable as
$$H(X) = -\int_{-\infty}^{\infty} f_X(x)\ln\big(f_X(x)\big)\,dx.$$
3.1 Examples
Example 1 (Entropy of a Gaussian RV): Let $X \sim \mathcal{N}(0, \sigma^2)$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}$. Then
$$\begin{aligned}
H(X) &= -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}\,\log_e\!\left(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}\right) dx \\
&= -\int_{-\infty}^{\infty} f(x)\left(\log_e\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2}\right) dx \\
&= -\left[\log_e\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\int_{-\infty}^{\infty} x^2 f(x)\,dx\right] \\
&= -\left[\log_e\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{\sigma^2}{2\sigma^2}\right] \\
&= \frac{1}{2}\log_e(2\pi\sigma^2) + \frac{1}{2} \\
&= \frac{1}{2}\ln(2\pi e\sigma^2).
\end{aligned}$$
That is, $H(X) = \frac{1}{2}\ln(2\pi e\sigma^2)$ nats (equivalently $\frac{1}{2}\log_2(2\pi e\sigma^2)$ bits), which is the differential entropy of a Gaussian random variable.
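As a quick numerical check of this formula, the sketch below estimates $\mathbb{E}[-\ln f(X)]$ for Gaussian samples by Monte Carlo and compares it with $\frac{1}{2}\ln(2\pi e\sigma^2)$; the value of $\sigma$ and the sample size are arbitrary.

```python
# Monte Carlo estimate of the differential entropy of N(0, sigma^2), in nats.
import numpy as np

rng = np.random.default_rng(1)
sigma = 3.0
x = rng.normal(0.0, sigma, size=1_000_000)

log_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -np.mean(log_f)                        # estimate of E[-ln f(X)]
h_theory = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_mc, h_theory)                         # the two agree closely
```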
3.2 Theorem
Claim: A Gaussian random variable $X\sim\mathcal{N}(0,\sigma^2)$ maximizes the differential entropy among all continuous random variables that have variance $\sigma^2$.
Proof: Suppose $Z$ is any random variable with $\operatorname{var}(Z)=\sigma^2$ and pdf $g(z)$. Then $H(X)-H(Z)$ can be written as
$$\begin{aligned}
H(X)-H(Z) &= -\int f(x)\ln f(x)\,dx + \int g(x)\ln g(x)\,dx \\
&= -\int f(x)\left(\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2}\right)dx + \int g(x)\ln g(x)\,dx \\
&= -\int g(x)\ln f(x)\,dx + \int g(x)\ln g(x)\,dx \qquad\text{(since $\ln f(x)$ is quadratic in $x$ and $f$, $g$ have the same second moment)}\\
&= \int g(x)\ln\frac{g(x)}{f(x)}\,dx \\
&= D(g\,\|\,f) \;\ge\; 0.
\end{aligned}$$
4 Rate Distortion for Gaussian Random Variables
Suppose we have a Gaussian source $X$: $X_1, X_2, X_3,\ldots,X_n$ are i.i.d. To compress this data, according to the rate-distortion theorem,
$$R(D) = \min_{f(\hat{x}|x)} I(X;\hat{X}) \quad\text{s.t.}\quad \int_{x}\int_{\hat{x}} f(\hat{x}|x)f(x)\,d(x,\hat{x})\,d\hat{x}\,dx \le D,$$
where $d(\hat{x},x)$ is the squared-error (Euclidean) distortion. (The proof of this is similar to the discrete case.)
To evaluate $R(D)$:
Step 1: Find a lower bound for $I(X;\hat{X})$.
Step 2: Find some $f(\hat{x}|x)$ that achieves the bound and hence is optimal.
Since the Gaussian maximizes differential entropy for a given variance, $H(X|\hat{X}) = H(X-\hat{X}\,|\,\hat{X}) \le H(X-\hat{X}) \le \frac12\ln\!\big(2\pi e\,\mathbb{E}[(X-\hat{X})^2]\big)$, so
$$I(X;\hat{X}) \;\ge\; \frac{1}{2}\ln(2\pi e\sigma^2) - \frac{1}{2}\ln\!\big(2\pi e\,\mathbb{E}[(X-\hat{X})^2]\big) \;=\; \frac{1}{2}\ln\frac{\sigma^2}{\mathbb{E}[(X-\hat{X})^2]} \;\ge\; \frac{1}{2}\ln\frac{\sigma^2}{D},$$
and $R(D) = 0$ for $D > \sigma^2$.
After finding a lower bound, we now show that there exists one $f(\hat{x}|x)$ for which this bound is achievable (which is the limit of compression for the Gaussian RV); i.e., we would like to back-track and find how the inequalities are met with equality.
Suppose we have the construction shown below.
Figure 4: Generating X, where X' plays the role of X̂.
For this case $I(X;\hat{X}) = \frac{1}{2}\ln\frac{\sigma^2}{D}$, i.e., it achieves the bound. Hence, if there is a Gaussian source producing i.i.d. symbols, then to encode a vector of length $n$ with resulting quantization error $nD$, we need at least $\frac{n}{2}\ln(\sigma^2/D)$ nats, or $\frac{n}{2}\log_2(\sigma^2/D)$ bits, to represent it.
EE5585 Data Compression March 8, 2013
Lecture 13
Instructor: Arya Mazumdar Scribe: Artem Mosesov
Scalar Quantization
Basics
Being a subset of vector quantization, scalar quantization deals with quantizing a string of symbols
(random variables) by addressing one symbol at a time (as opposed to the entire string of symbols). Although, as one would expect, this is not ideal and will not approach the theoretical limits, scalar quantization is a rather simple technique that can be easily implemented in hardware. The simplest
form of scalar quantization is uniform quantization.
Given a string, x1 ,x2 ,...xn , we first pick a symbol to quantize and disregard the rest. We then quantize
this continuous variable to a uniform set of points, as follows:
So we have $M+1$ boundaries $b_i$ and $M$ quantization levels $y_i$ (which fall in the middle of the boundary points). A continuous number that falls between the boundaries $b_{i-1}$ and $b_i$ gets assigned a quantized value of $y_i$. Naturally, this introduces signal distortion - an error. The error measure typically used for
this is mean squared error (Euclidean distance, as opposed to Hamming distance that’s used for binary
strings). We call this the quantization error, and recognize that it takes log2 (M ) bits to store the
symbol.
Optimization
We note that uniform quantization is only optimal (in the minimum MSE sense) for a uniform distribu-
tion. Given an arbitrary PDF (not necessarily uniform), we would like to find an optimal quantization. Let us consider a random variable $X$ with a PDF $f_X(x)$.
The MSE is
$$\int_{-\infty}^{\infty}\big(x - Q(x)\big)^2 f_X(x)\,dx, \qquad Q(x) = y_i \ \text{ if } \ b_{i-1}\le x\le b_i,$$
so
$$\sigma_q^2 \equiv \mathrm{MSE} = \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i}(x-y_i)^2 f_X(x)\,dx.$$
This, then, becomes an optimization problem: given a maximum allowed distortion, we would like to find the optimal locations of the quantization points ($y_i$'s and $b_i$'s). Of course, we can always use a very large number of quantization points to keep the distortion low; however, we would like to keep this number low, so as to save memory space when storing these values.
Referring back to a uniform distribution, we note that for a non-uniform pdf the probabilities of the different $y_i$'s are not the same. That is, at the quantizer output we may see a lot more of a certain quantization point than another. This makes the points a candidate for Huffman coding, as seen earlier in the course. The probability of a particular quantization point is
$$P\big(Q(x) = y_i\big) = \int_{b_{i-1}}^{b_i} f_X(x)\,dx.$$
Now we can begin to optimize the average length of the code for the quantization points, which is
$$\sum_{i=1}^{M} l_i\int_{b_{i-1}}^{b_i} f_X(x)\,dx,$$
where $l_i$ is the length of the code for $y_i$. This optimization must occur subject to the following two constraints:
Constraint 1: the $l_i$'s satisfy Kraft's inequality.
Constraint 2: $\sigma_q^2 \equiv \mathrm{MSE} = \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i}(x-y_i)^2 f_X(x)\,dx \le D$.
To see how to simplify this problem, we look again at a uniform quantizer. Let's assume that $X$ (the symbol we want to quantize) is a uniform $\sim U[-L, L]$ variable. The quantization step is then $\Delta = \frac{2L}{M}$, as shown in Figure 2, and
$$\sigma_q^2 = \sum_{i=1}^{M}\int_{-L+(i-1)\Delta}^{-L+i\Delta} (x-y_i)^2\,\frac{1}{2L}\,dx.$$
The optimal $y_i$ is then $\frac{b_{i-1}+b_i}{2}$. Of course, this is only for a uniform random variable, as initially assumed. We may also notice that the quantization error plot is merely a sawtooth wave with wavelength $\Delta$ and amplitude $\frac{\Delta}{2}$. Integrating, $\sigma_q^2 = \frac{\Delta^2}{12}$.
We may think of the quantization error produced by the system as an additive noise, the "quantization noise." The power of this noise is then $\sigma_q^2$. The idea is shown in Figure 3, below. From the figure, we note that the power of the input signal is
$$\sigma_x^2 = \int_{-L}^{L} x^2 f_X(x)\,dx = \frac{L^2}{3}.$$
Hence we have $\mathrm{SNR} = 10\log_{10}\!\big(\sigma_x^2/\sigma_q^2\big) = 20\log_{10}M$, where $M$ is, as before, the number of quantization levels. Since this is a uniform distribution, Huffman coding will not get us anywhere (all $M$ levels are equally likely, so the quantizer output already has maximum entropy), and the SNR stays at $20\log_{10}M$. For an $n$-bit quantizer, then, we get $20\log_{10}2^n = 20n\log_{10}2 \approx 6n$ dB. So the SNR is directly proportional to the number of bits used for quantization, with an increase of one bit corresponding to about a 6 dB increase in SNR.
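The 6 dB-per-bit rule is easy to verify by simulation. The sketch below quantizes Uniform$[-L,L]$ samples with an $n$-bit uniform quantizer and prints the measured SNR (the parameter values are arbitrary).

```python
# Simulation of the "6 dB per bit" rule for an n-bit uniform quantizer
# of a Uniform[-L, L] source (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
L = 1.0
x = rng.uniform(-L, L, size=1_000_000)

for n_bits in [2, 4, 6, 8]:
    M = 2 ** n_bits
    delta = 2 * L / M
    # map each sample to the midpoint of its quantization cell
    idx = np.clip(np.floor((x + L) / delta), 0, M - 1)
    xq = -L + (idx + 0.5) * delta
    snr_db = 10 * np.log10(np.mean(x**2) / np.mean((x - xq)**2))
    print(n_bits, round(snr_db, 2), "dB  (approx. 6 *", n_bits, ")")
```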
Now we take a look at optimum quantization for non-uniform distributions. Similarly, we have
$$\sigma_q^2 = \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i}(x-y_i)^2 f_X(x)\,dx,$$
which we would like to minimize. Often, however, we don’t know the exact PDF of the symbols, nor do
we know the variance. To overcome this, we use adaptive quantization. As we’ve seen before, one way
to do this is to estimate the PDF by observing a string of symbols. This is known as forward adaptive
quantization.
Going back to minimizing $\sigma_q^2$, we want
$$\begin{aligned}
\frac{\partial\sigma_q^2}{\partial y_i} &= \frac{\partial}{\partial y_i}\int_{b_{i-1}}^{b_i}(x-y_i)^2 f_X(x)\,dx \\
&= \frac{\partial}{\partial y_i}\left[\int_{b_{i-1}}^{b_i}x^2 f_X(x)\,dx - 2y_i\int_{b_{i-1}}^{b_i}x f_X(x)\,dx + y_i^2\int_{b_{i-1}}^{b_i}f_X(x)\,dx\right] \\
&= -2\int_{b_{i-1}}^{b_i}x f_X(x)\,dx + 2y_i\int_{b_{i-1}}^{b_i}f_X(x)\,dx = 0,
\end{aligned}$$
$$y_i = \frac{\int_{b_{i-1}}^{b_i}x f_X(x)\,dx}{\int_{b_{i-1}}^{b_i}f_X(x)\,dx}. \tag{1}$$
So this is the optimal location of the reconstruction points, given the boundaries. Now we have to find the boundaries. We do this similarly:
$$\frac{\partial\sigma_q^2}{\partial b_i} = 0 \quad\Longrightarrow\quad b_{i-1} = \frac{y_{i-1}+y_i}{2}. \tag{2}$$
So what we can do with this is an iterative procedure, where we first initialize the variables, then go
back and forth optimizing each one, and (ideally) arriving very close to an optimality point.
Lloyd-Max Algorithm
The Lloyd-Max algorithm is an iterative method that does just that. The crude steps (of one of the
versions of this algorithm) are as follows:
1. Knowing b0 , assume y1 .
2. Using (1), find b1 .
3. Using (2), find y2 .
and so on...
We also note that since we know the (approximate) signal statistics, we know $b_M$. Then we have an idea of how much error the algorithm made by seeing how close it is to the known value of $b_M$ after the last iteration. If it is too far off, we reinitialize and try again until we are within the accepted tolerance.
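A compact way to realize the same idea in code is to alternate the two optimality conditions (1) and (2) directly, rather than literally sweeping from $y_1$ and adjusting as described above. The sketch below does this for a zero-mean, unit-variance Gaussian; it assumes scipy is available for the Gaussian pdf/cdf, and the number of levels and the initial guess are arbitrary.

```python
# Sketch of a Lloyd-Max style iteration for a unit-variance Gaussian:
# alternate the centroid condition (1) and the midpoint condition (2).
import numpy as np
from scipy.stats import norm

def lloyd_max_gaussian(M=8, iters=200):
    y = np.linspace(-2.0, 2.0, M)                 # initial reconstruction levels
    for _ in range(iters):
        b = np.empty(M + 1)
        b[0], b[-1] = -np.inf, np.inf
        b[1:-1] = 0.5 * (y[:-1] + y[1:])          # condition (2): midpoints
        # condition (1): centroid of each cell, E[X | b_{i-1} < X < b_i]
        num = norm.pdf(b[:-1]) - norm.pdf(b[1:])  # integral of x*phi(x) over a cell
        den = norm.cdf(b[1:]) - norm.cdf(b[:-1])
        y = num / den
    return b, y

b, y = lloyd_max_gaussian()
print(np.round(y, 3))
```

For $M = 2$ this recovers the reproduction points $\pm\sqrt{2/\pi}$ found for the 1-bit Gaussian quantizer earlier.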
Later, we will see a more complex, but better performing method of vector quantization.
EE5585 Data Compression March 12, 2013
Lecture 14
Instructor: Arya Mazumdar Scribe: Cheng-Yu Hung
$$\Rightarrow\quad \sigma_q^2 = \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i}(x-y_i)^2 f_X(x)\,dx \tag{4}$$
Thus, we can pose the optimal quantizer design problem as follows:
Given an input pdf fX (x) and the number of quantization levels M in the quantizer, find the decision
boundaries {bi } and the reconstruction levels {yi } so as to minimize the mean squared quantization error.
If we know the pdf of X, a direct approach to find the {bi } and {yi } that minimize the mean squared
quantization error is to set the derivative of (4) with respect to bj and yj to zero, respectively. Then,
$$\frac{\partial\sigma_q^2}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\int_{b_{j-1}}^{b_j}(x-y_j)^2 f_X(x)\,dx\right] \tag{5}$$
$$= \frac{\partial}{\partial y_j}\left[\int_{b_{j-1}}^{b_j}x^2 f_X(x)\,dx - 2y_j\int_{b_{j-1}}^{b_j}x f_X(x)\,dx + y_j^2\int_{b_{j-1}}^{b_j}f_X(x)\,dx\right] \tag{6}$$
$$= -2\int_{b_{j-1}}^{b_j}x f_X(x)\,dx + 2y_j\int_{b_{j-1}}^{b_j}f_X(x)\,dx = 0 \tag{7}$$
$$\Rightarrow\quad y_j = \frac{\int_{b_{j-1}}^{b_j}x f_X(x)\,dx}{\int_{b_{j-1}}^{b_j}f_X(x)\,dx} \tag{8}$$
$$\frac{\partial\sigma_q^2}{\partial b_j} = \frac{\partial}{\partial b_j}\left[\int_{b_{j-1}}^{b_j}(x-y_j)^2 f_X(x)\,dx + \int_{b_j}^{b_{j+1}}(x-y_{j+1})^2 f_X(x)\,dx\right] \tag{9}$$
Then, setting (9) to zero and simplifying gives
$$b_j = \frac{y_j + y_{j+1}}{2}. \tag{14}$$
The decision boundary is the midpoint of the two neighboring reconstruction levels. Solving these two
equations (8) and (14) listed above will give us the values for the reconstruction levels and decision
boundaries that minimize the mean squared quantization error. Unfortunately, to solve for yj , we need
the values of $b_j$ and $b_{j-1}$, and to solve for $b_j$, we need the values of $y_j$ and $y_{j+1}$. Therefore, the Lloyd-Max algorithm is introduced to solve these two equations (8) and (14) iteratively.
Lloyd-Max Algorithm
Suppose $f_X(x)$ and the end boundaries $b_0$ and $b_M$ are given; find $\{b_i\}_{i=0}^{M}$ and $\{y_i\}_{i=1}^{M}$. Assume a value for $y_1$, then
From (8), find $b_1$
From (14), find $y_2$
From (8), find $b_2$
From (14), find $y_3$
$\vdots$
From (8), find $b_{M-1}$
From (14), find $y_M$. Since we know $b_M$, we can also directly compute
$$y_M' = \frac{\int_{b_{M-1}}^{b_M} x f_X(x)\,dx}{\int_{b_{M-1}}^{b_M} f_X(x)\,dx}$$
and compare it with the previously computed value of $y_M$. If the difference is less than some tolerance threshold, we can stop. Otherwise, we adjust the estimate of $y_1$ in the direction indicated by the sign of the difference and repeat the procedure.
2. $\mathbb{E}[Q(X)] = \mathbb{E}X$. Indeed, $\mathbb{E}[Q(X)] = \sum_{i=1}^{M} y_i\Pr(Q(X)=y_i)$ and, substituting the centroid condition (8),
$$\mathbb{E}[Q(X)] = \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i} x f_X(x)\,dx \tag{18}$$
$$= \int_{b_0}^{b_M} x f_X(x)\,dx \tag{19}$$
$$= \int_{-\infty}^{+\infty} x f_X(x)\,dx \tag{20}$$
$$= \mathbb{E}X. \tag{21}$$
The step from (19) to (20) holds because the value of $f_X(x)$ beyond $b_0$ and $b_M$ is zero.
3. $\mathbb{E}[Q(X)^2] \le \mathbb{E}[X^2]$.
Proof: On each cell, let $g_X(x) = f_X(x)\big/\int_{b_{i-1}}^{b_i} f_X(x)\,dx$; then $\int_{b_{i-1}}^{b_i} g_X(x)\,dx = 1$, $\int_{b_{i-1}}^{b_i} x\,g_X(x)\,dx = \mathbb{E}_g X$, and $\mathbb{E}_g(X-\mathbb{E}_g X)^2 \ge 0 \Rightarrow (\mathbb{E}_g X)^2 \le \mathbb{E}_g X^2$. Thus,
$$\mathbb{E}[Q(X)^2] = \sum_{i=1}^{M} y_i^2\Pr(Q(X)=y_i) \tag{22}$$
$$= \sum_{i=1}^{M}\left(\frac{\int_{b_{i-1}}^{b_i} x f_X(x)\,dx}{\int_{b_{i-1}}^{b_i} f_X(x)\,dx}\right)^{2}\int_{b_{i-1}}^{b_i} f_X(x)\,dx \tag{23}$$
$$= \sum_{i=1}^{M}\left(\int_{b_{i-1}}^{b_i} x\,\frac{f_X(x)}{\int_{b_{i-1}}^{b_i} f_X(x)\,dx}\,dx\right)^{2}\int_{b_{i-1}}^{b_i} f_X(x)\,dx \tag{24}$$
$$\le \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i} x^2\,\frac{f_X(x)}{\int_{b_{i-1}}^{b_i} f_X(x)\,dx}\,dx\,\int_{b_{i-1}}^{b_i} f_X(x)\,dx \tag{25}$$
$$= \sum_{i=1}^{M}\int_{b_{i-1}}^{b_i} x^2 f_X(x)\,dx \tag{26}$$
$$= \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx \tag{27}$$
$$= \mathbb{E}[X^2]. \tag{28}$$
4. $\sigma_q^2 = \mathbb{E}[X^2] - \mathbb{E}[Q(X)^2]$.
Lloyd Algorithm
The Lloyd algorithm is another method to find $\{b_i\}_{i=0}^{M}$ and $\{y_i\}_{i=1}^{M}$. The distribution $f_X(x)$ is assumed known.
Assume $y_1^{(0)}, y_2^{(0)},\ldots,y_M^{(0)}$ is an initial sequence of reconstruction values $\{y_i\}_{i=1}^{M}$. Select a threshold $\epsilon$.
1. By Eqn. (14), find $b_0^{(1)}, b_1^{(1)},\ldots,b_M^{(1)}$.
2. By Eqn. (8), find $y_1^{(1)}, y_2^{(1)},\ldots,y_M^{(1)}$, and compute $\sigma_q^{2\,(1)} = \sum_{i=1}^{M}\int_{b_{i-1}^{(1)}}^{b_i^{(1)}}(x-y_i^{(1)})^2 f_X(x)\,dx$.
3. By Eqn. (14), find $b_0^{(2)}, b_1^{(2)},\ldots,b_M^{(2)}$.
4. By Eqn. (8), find $y_1^{(2)}, y_2^{(2)},\ldots,y_M^{(2)}$, and compute $\sigma_q^{2\,(2)} = \sum_{i=1}^{M}\int_{b_{i-1}^{(2)}}^{b_i^{(2)}}(x-y_i^{(2)})^2 f_X(x)\,dx$.
At step $j$, $\sigma_q^{2\,(j)}$ is calculated and compared with the previous error value $\sigma_q^{2\,(j-1)}$. Stop if $|\sigma_q^{2\,(j-1)} - \sigma_q^{2\,(j)}| < \epsilon$; otherwise, continue the same procedure of computing $b_0^{(j+1)}, b_1^{(j+1)},\ldots,b_M^{(j+1)}$ and $y_1^{(j+1)}, y_2^{(j+1)},\ldots,y_M^{(j+1)}$ by Eqns. (14) and (8) at the next step $j+1$.
Vector Quantization
The idea of vector quantization is that encoding sequences of outputs can provide an advantage over the
encoding of individual samples. This indicates that a quantization strategy that works with sequences
or blocks of outputs would provide some improvement in performance over scalar quantization. Here
is an example. Suppose we have two uniform random variables, height $X_1 \sim \mathrm{Unif}[40, 80]$ and weight $X_2 \sim \mathrm{Unif}[40, 240]$, and 3 bits are allowed to represent each random variable. Thus, the weight range is
divided into 8 equal intervals and with reconstruction levels {52, 77, · · · , 227}; the height range is divided
into 8 equal intervals and with reconstruction levels {42, 47, · · · , 77}. The two dimensional representation
of these two quantizers is shown in Figure 2(a).
Figure 2: (a) The dimensions of the height/weight scalar quantization. (b) The height-weight vector quantization.
However, the height and weight are correlated. For example, a quantizer output for a person who is 80 inches tall and weighs 40 pounds, or who is 42 inches tall and weighs 200 pounds, is never used. A more
sensible approach will use a quantizer like the one shown in Figure 2(b). Using this quantizer, we can no
longer quantize the height and weight separately. We will consider them as the coordinates of a point
in two dimensions in order to find the closest quantizer output point.
EE5585 Data Compression March 14, 2013
Lecture 15
Instructor: Arya Mazumdar Scribe: Khem Chapagain
In the previous lectures, we looked at the uniform and nonuniform scalar quantization and moved on to
vector quantizers. One thing that's worth mentioning about scalar quantization is that, for high rate,
uniform quantizers are optimal to use. That can be seen from the following derivation. As its name
implies, scalar quantizer inputs are scalar values (each input symbol is treated separately in producing
the output), and each quantizer output (codeword) represents a single sample of the source output.
• R.V. to quantize: X,
• Set of (decision) boundaries: {bi }M
i=0 ,
1. $y_i = \dfrac{\int_{b_{i-1}}^{b_i} x\,f_X(x)\,dx}{\int_{b_{i-1}}^{b_i} f_X(x)\,dx}$
2. $b_i = \dfrac{y_i + y_{i+1}}{2}$
If we want a high-rate quantizer, the number of quantization levels ($M$) is very large. Consequently, the difference between boundaries $b_{i-1}$ and $b_i$ is very small, as they are very closely spaced. When $M$ is sufficiently large, the probability density function (pdf) $f_X(x)$ doesn't change much between $b_{i-1}$ and $b_i$. In that case, the first property above can be written as
$$y_i \approx \frac{\int_{b_{i-1}}^{b_i} f_X(a)\,x\,dx}{\int_{b_{i-1}}^{b_i} f_X(a)\,dx}, \qquad b_{i-1}\le a\le b_i,$$
$$= \frac{(b_i^2 - b_{i-1}^2)/2}{b_i - b_{i-1}} = \frac{b_i + b_{i-1}}{2}.$$
This says that for very large $M$, each reconstruction level is approximately midway between its two neighboring boundaries, but we know from the second property above that each boundary is exactly midway between its two neighboring reconstruction levels. This means that we have a uniform quantizer, since all reconstruction levels are equally spaced from one another and the quantization steps (intervals) are equal. Thus, when we are operating with a large $M$, it makes sense just to go for the uniform quantizer.
Lloyd Algorithm
In the last class, we talked about Lloyd-Max algorithm to iteratively compute the optimized quantizer
levels (yi ’s) and boundaries (bi ’s). There is another algorithm called Lloyd algorithm, which also
iteratively calculates $y_i$'s and $b_i$'s, but in a different way. The Lloyd quantizer that is constructed using
an iterative procedure provides the minimum distortion for a given number of reconstruction levels, in
other words, it generates the pdf -optimized scalar quantizer. The Lloyd algorithm functions as follows
(source distribution fX (x) is assumed to be known):
• Initialization:
– Assume an initial set of reconstruction values/levels $\{y_i^{(0)}\}_{i=1}^{M}$, where $^{(0)}$ denotes the 0th iteration.
– Find the decision boundaries $\{b_i^{(0)}\}_{i=0}^{M}$ from the equation $b_i^{(0)} = \dfrac{y_i^{(0)} + y_{i+1}^{(0)}}{2}$.
– Compute the distortion (variance), $D^{(0)} = \sum_{i=1}^{M}\int_{b_{i-1}^{(0)}}^{b_i^{(0)}}(x - y_i^{(0)})^2 f_X(x)\,dx$.
• Iteration ($k\to k+1$):
– $y_i^{(k+1)} = \dfrac{\int_{b_{i-1}^{(k)}}^{b_i^{(k)}} x f_X(x)\,dx}{\int_{b_{i-1}^{(k)}}^{b_i^{(k)}} f_X(x)\,dx}$, new set of reconstruction levels (the centroids of the current cells).
– $b_i^{(k+1)} = \dfrac{y_i^{(k+1)} + y_{i+1}^{(k+1)}}{2}$, new set of boundaries.
– $D^{(k+1)} = \sum_{i=1}^{M}\int_{b_{i-1}^{(k+1)}}^{b_i^{(k+1)}}(x - y_i^{(k+1)})^2 f_X(x)\,dx$, new distortion.
• Stop the iteration if $|D^{(k+1)} - D^{(k)}| < \epsilon$, where $\epsilon$ is a small tolerance $> 0$; i.e., stop when the distortion is not changing much anymore.
We will see later that this algorithm can be generalized to vector quantization.
Vector Quantization
The set of inputs and outputs of a quantizer can be scalars or vectors. If they are scalars, we call the
quantizers scalar quantizers. If they are vectors, we call the quantizers vector quantizers. By grouping
source outputs together and encoding them as a single block, we can obtain efficient compression
algorithms. Many of the lossless compression algorithms take advantage of this fact such as clubbing
symbols together or parsing longest phrases. We can do the same with quantization. We will look at
the quantization techniques that operate on blocks of data. We can view these blocks as vectors, hence
the name “vector quantization.” For a given rate (in bits per sample), use of vector quantization results
in a lower distortion than when scalar quantization is used at the same rate. If the source output is
correlated, vectors of source output values will tend to fall in clusters. By selecting the quantizer
output points to lie in these clusters, we have a more accurate representation of the source output.
We have already seen in the prior class that, if we have the data that are correlated, it doesn’t make
sense to do scalar quantization. We looked at the height(H), weight(W ) quantization scheme in two
dimensions. We have two random variables (H, W ) and will plot the height values along the x-axis
and the weight values along the y-axis. In this particular example, the height values are being
uniformly quantized to five different scalar values, as are the weight values. So, we have a total of
25 quantization levels (M = 25) denoted by •. The two-dimensional representation of these two
quantizers is shown below.
Figure 2: The height, weight scalar quantizers when viewed in two dimensions.
From the figure, we can see that we effectively have a quantizer output for a person who is very tall
and weighs minimal, as well as a quantizer output for an individual who is very short but weighs a lot.
Obviously, these outputs will never be used, as is the case for many of the other unnatural outputs. A
more sensible approach would be to use a quantizer with many reconstruction points inside the shaded
region shown in the figure, where we take account of the fact that the height and weight are correlated.
This quantizer will have almost all quantization levels (say 23) within this shaded area and no or very
few (say 2) quantization levels outside of it. With this approach, the output points are clustered in the
area occupied by the most likely inputs. Using this methodology provides a much finer quantization of
the input. However, we can no longer quantize the height and weight separately - we have to consider
them as the coordinates of a point in two dimensions in order to find the closest quantizer output point.
When data are correlated, we don’t have any doubt that quantizing data together (vector
quantization) is going to buy us something. It will be shown, when data are independent (no
correlation), we still gain from using vector quantization. Scalar quantization always gives something
like a rectangular grid similar to what we have seen in the above example. Note that for pdf -optimized
or optimal quantizer, unlike the above example, grids/cells can be of different sizes, meaning boundaries
don’t necessarily have to be uniformly spaced in either dimension. Nevertheless, the fact is that these
rectangles don’t fill up the space most efficiently. There might be other shapes, like a hexagon or
something else, that can fill up the space in a better way with the same number of quantization levels.
This idea is captured in vector quantization, where we have this flexibility of using any shape,
contrasting the scalar case where only choice is rectangle like shapes.
The area of the hexagon is six times the area of the inside triangle shown:
$$A' = 6\,D'\cos\!\left(\frac{\pi}{3}\right)D'\sin\!\left(\frac{\pi}{3}\right) = 6\cdot\frac{\sqrt{3}}{4}D'^2 = \frac{3\sqrt{3}}{2}D'^2.$$
If the areas are equal,
$$A' = A,\qquad \frac{3\sqrt{3}}{2}D'^2 = 2D^2,\qquad D'^2 = \frac{4}{3\sqrt{3}}D^2 = \frac{4}{5.196}D^2,\qquad D' = 0.877\,D < D.$$
Even though both the square and the hexagon occupy the same area, the worst-case distortion in the hexagon ($D'$) is less than the worst-case distortion in the square ($D$). Hence, we can intuitively see that hexagons can fill up the same space with less distortion (better packing), which is also known as the shape gain. This is illustrated in the following figure.
Figure 4: 20 square cells and 20 hexagonal cells packing the equal area.
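The numbers above are easy to reproduce; the short check below equates the two areas and recovers $D' \approx 0.877\,D$.

```python
# Equal-area comparison of a square (worst-case corner distance D)
# and a regular hexagon (worst-case distance D').
import math

D = 1.0
square_area = 2 * D**2                     # square with half-diagonal D
# hexagon with circumradius D': area = (3*sqrt(3)/2) * D'^2
D_prime = math.sqrt(square_area / (3 * math.sqrt(3) / 2))
print(D_prime / D)                         # about 0.877, i.e. D' < D
```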
From earlier discussions, it is clear that for correlated data, vector quantizers are better. We just saw
that, even if the data are not correlated, vector quantizers still offer this shape gain among other
advantages. Without having all this intuition and geometry, this fact was captured previously in the
rate distortion theory as well.
We can try to extend the Lloyd algorithm to vector quantizers. Now we have a space to quantize instead of a real line. Analogous to the scalar case, this time we have:
• $\{V_i\}_{i=1}^{M}$: the quantization regions, whose boundaries are hyper-planes partitioning $\mathbb{R}^n$, i.e., $\bigcup_{i=1}^{M} V_i = \mathbb{R}^n$.
Given a set of quantization/reconstruction levels (points), we want to find the optimal partitions. Suppose the quantization levels are $\{Y_i\}_{i=1}^{M}$. $V_i$, the Voronoi region of $Y_i$, is the region such that any point in $V_i$ is closer to $Y_i$ than to any other $Y_j$:
$$V_i = \{X = (x_1,\ldots,x_n)\in\mathbb{R}^n \mid d(Y_i, X) < d(Y_j, X)\ \forall\, j\ne i\}.$$
The distortion is
$$D = \mathbb{E}\|X - Q(X)\|_2^2 = \sum_{i=1}^{M}\int_{V_i}(X - Y_i)^T(X - Y_i)\,f_X(X)\,dX,$$
where $d(\cdot)$ is some distance metric, $Q(\cdot)$ is the quantizer function that maps any value in $V_i$ to $Y_i$, and $\|\cdot\|_2^2$ is the squared Euclidean norm (we can use other norms also).
Proceeding as in the scalar Lloyd algorithm, at iteration $k+1$:
– $Y_i^{(k+1)} = \dfrac{\int_{V_i^{(k)}} X f_X(X)\,dX}{\int_{V_i^{(k)}} f_X(X)\,dX}$, the centroid of $V_i^{(k)}$: new set of reconstruction points.
– $V_i^{(k+1)} = \{X : d(X, Y_i^{(k+1)}) < d(X, Y_j^{(k+1)})\ \forall\, j\ne i\}$: new set of quantization regions.
– $D^{(k+1)} = \sum_{i=1}^{M}\int_{V_i^{(k+1)}}(X - Y_i^{(k+1)})^T(X - Y_i^{(k+1)})\,f_X(X)\,dX$: updated distortion.
This algorithm is not very practical because the integrals required to compute the distortions and
centroids are over odd-shaped regions (hyper-space) in n dimensions, where n is the dimension of the
input vectors. Generally, these integrals are extremely difficult to compute and trying every vector to
decide whether it’s in the Voronoi region or not is also unfeasible, making this particular algorithm
more of an academic interest. Of more practical interest is the algorithm for the case where we have a
training set available. In this case, the algorithm looks very much like the k-means algorithm.
• Initialization:
– Start with a large set of training vectors $X_1, X_2, X_3,\ldots,X_L$ from the same source.
– Assume an initial set of reconstruction values $\{Y_i^{(0)}\}_{i=1}^{M}$.
– Find the quantization/Voronoi regions $V_i^{(0)} = \{X_j : d(X_j, Y_i^{(0)}) < d(X_j, Y_k^{(0)})\ \forall\, k\ne i\}$; here we need to partition only the $L$ training vectors instead of doing it for all vectors.
– Compute the distortion, $D^{(0)} = \sum_{i=1}^{M}\sum_{X\in V_i^{(0)}}(X - Y_i^{(0)})^T(X - Y_i^{(0)})$.
We didn't know the pdf of the data or source; all we had was a set of training data. Yet, when this
algorithm stops, we have a set of reconstruction levels (Yi ’s) and quantization regions (Vi ’s), which
gives us a vector quantizer. This algorithm is a more practical version of the LBG algorithm.
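A minimal sketch of this training-set iteration is given below. The two-dimensional "height/weight-like" training data are synthetic and only for illustration, and the empty-cell problem discussed next is handled crudely by leaving an empty cell's codeword unchanged.

```python
# Sketch of the training-set (LBG / k-means style) vector quantizer design.
import numpy as np

rng = np.random.default_rng(3)
L, M = 2000, 8
h = rng.uniform(40, 80, size=L)
w = 2.5 * h + rng.normal(0, 10, size=L)          # correlated 2-D training data
X = np.column_stack([h, w])

Y = X[rng.choice(L, size=M, replace=False)]      # initial codebook: random training vectors
for _ in range(50):
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # squared distances
    assign = d.argmin(axis=1)                                # nearest codeword (Voronoi)
    for i in range(M):
        pts = X[assign == i]
        if len(pts) > 0:                          # skip empty cells
            Y[i] = pts.mean(axis=0)               # centroid update

d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
print(np.round(Y, 1))
print("distortion per sample:", d.min(axis=1).mean())
```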
Although this algorithm forms the basis of most vector quantizer designs, there are a few issues with it, for example:
1. Initializing the LBG Algorithm: What would be the good set of initial quantization points
that will guarantee the convergence? The LBG algorithm guarantees that the distortion from one
iteration to the next will not increase. However, there is no guarantee that the procedure will
converge to the optimal solution. The solution to which the algorithm converges is heavily
dependent on the initial conditions, and by picking different subsets of the input as our initial codebook (quantization points), we can generate different vector quantizers.
2. The Empty Cell Problem: How do we take care of a situation when one of the
reconstruction/quantization regions in some iteration is empty? There might be no points which
are closer to a given reconstruction point than any other reconstruction points. This is a problem
because in order to update an output point (centroid), we need to take the average value of the
input vectors.
Obviously, some strategy is needed to deal with these circumstances. This will be the topic of next
class after the Spring Break.
The practical version of the LBG algorithm described above is surprisingly similar to the k-means algorithm used in data clustering. The most popular approach to designing vector quantizers is a clustering procedure known as the k-means algorithm, which was developed for pattern recognition applications. The
k-means algorithm functions as follows: Given a large set of output vectors from the source, known as
the training set, and an initial set of k representative patterns, assign each element of the training set to
the closest representative pattern. After an element is assigned, the representative pattern is updated
by computing the centroid of the training set vectors assigned to it. When the assignment process is
complete, we will have k groups of vectors clustered around each of the output points (codewords).
Figure 6: Initial state (left) and final state (right) of a vector quantizer.
We have seen briefly how we can make use of the structure exhibited by groups, or vectors, of values to
obtain compression. Since there are different kinds of structure in different kinds of data, there are a number of different ways to design vector quantizers. Because data from many sources, when viewed as
vectors, tend to form clusters, we can design quantizers that essentially consist of representations of
these clusters.
EE5585 Data Compression March 26, 2013
Lecture 16
Instructor: Arya Mazumdar Scribe: Fangying Zhang
1 Review of Homework 6
The review is omitted from this note.
H (in) W (lb)
65 170
70 170
56 130
80 203
50 153
76 169
Figure 1:
We need to find a line such that most points are around this line, i.e., the distances from the points to the line are very small. Suppose $W = 2.5H$ is the equation of the straight line. Let $A$ be a matrix that projects values onto this line and onto its perpendicular direction. We have
$$A = \begin{pmatrix} 0.37 & 0.92 \\ -0.92 & 0.37 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},
\qquad
X = \begin{pmatrix} 65 \\ 170 \end{pmatrix}, \begin{pmatrix} 70 \\ 170 \end{pmatrix}, \begin{pmatrix} 56 \\ 130 \end{pmatrix}, \ldots$$
Then, by rotating the axes, we obtain new coordinates
$$Y = AX.$$
From the above equation, we get a new H-W table as follows:
H (new) W (new)
182 3
184 -2
191 -4
218 1
161 10
181 -9
We can get back the original data by rotating the axes again, $X = A^{-1}Y$; for example,
$$\begin{pmatrix} 65 \\ 170 \end{pmatrix} = A^{-1}\begin{pmatrix} 182 \\ 3 \end{pmatrix},$$
where
$$A^{-1} = \begin{pmatrix} 0.37 & -0.92 \\ 0.92 & 0.37 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = A^{T}.$$
Note,
$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
Now, neglect the right column of the new table and set it to 0, that is:
H (new) W (new)
182 0
184 0
191 0
218 0
161 0
181 0
Transforming back with $X = A^{-1}Y$ applied to this zeroed table, we obtain approximately:
H W
68 169
68 171
53 131
81 201
60 151
67 168
Compare this table with the original table. We find that the variations are not large. So this new
table can be used.
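The rotation experiment above can be reproduced with a few lines of code. The sketch below builds the rotation from the slope 2.5, zeroes the second coordinate, and inverts the transform; the data are the table's six (H, W) pairs.

```python
# Rotate the (H, W) data onto the line W = 2.5 H, keep only the first
# coordinate, rotate back, and compare with the original data.
import numpy as np

theta = np.arctan(2.5)                       # direction of the line W = 2.5 H
A = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # approx [[0.37, 0.92], [-0.92, 0.37]]

X = np.array([[65, 170], [70, 170], [56, 130],
              [80, 203], [50, 153], [76, 169]], dtype=float).T   # columns are (H, W)

Y = A @ X                                    # rotated coordinates
Y_hat = Y.copy()
Y_hat[1, :] = 0.0                            # discard the second coordinate
X_hat = A.T @ Y_hat                          # A^{-1} = A^T for a rotation
print(np.round(X_hat.T).astype(int))         # close to the original (H, W) pairs
```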
3 Transform Coding
1. For an orthonormal transformation matrix $A$,
$$A^T = A^{-1}, \qquad A^T A = A A^T = I. \tag{1}$$
Suppose $y = Ax$ and $\hat{x} = A^{-1}\hat{y}$, where $\hat{y}$ is the stored/compressed value. The error we introduce is
$$\begin{aligned}
\|y - \hat{y}\|_2^2 &= \|Ax - A\hat{x}\|_2^2 \\
&= \|A(x-\hat{x})\|_2^2 \\
&= [A(x-\hat{x})]^T[A(x-\hat{x})] \\
&= (x-\hat{x})^T A^T A\,(x-\hat{x}) \\
&= (x-\hat{x})^T(x-\hat{x}) \\
&= \|x - \hat{x}\|_2^2.
\end{aligned}$$
If we do not introduce a lot of error in $y$, then there won't be a lot of error in $x$.
2.
$$\mathbb{E}[y^T y] = \mathbb{E}[x^T A^T A x] = \mathbb{E}[x^T x], \tag{9}$$
which means the input and output energies are the same. This is known as Parseval's identity.
4 Hadamard Transform
Suppose
$$A = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. \tag{10}$$
The $-1$ entries are shaded black in the figure below.
Figure 2: The $2\times 2$ Hadamard matrix pattern ($-1$ shaded black).
Larger Hadamard matrices are built blockwise (a Kronecker product): for a $2\times 2$ matrix $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$ and a matrix $M = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$,
$$B\otimes M = \begin{pmatrix} b_{11}M & b_{12}M \\ b_{21}M & b_{22}M \end{pmatrix}.$$
$A_4$ can be represented as below.
Figure 3:
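A sketch of this construction in code, assuming (as above) that the larger matrices are formed by the Kronecker-product block rule:

```python
# Recursive construction of normalized Hadamard transform matrices.
import numpy as np

def hadamard(n):
    """Return the n x n normalized Hadamard matrix, n a power of 2."""
    if n == 1:
        return np.array([[1.0]])
    A2 = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]])
    return np.kron(A2, hadamard(n // 2))      # block form [[H, H], [H, -H]] / sqrt(2)

A4 = hadamard(4)
print(np.round(A4, 3))
print(np.allclose(A4 @ A4.T, np.eye(4)))      # orthonormal, as required by (1)
```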
EE5585 Data Compression March 28, 2013
Lecture 17
Instructor: Arya Mazumdar Scribe: Chendong Yang
Solution to Assignment 2
Problem 4: Let $X$ be a zero-mean Gaussian random variable with variance $\sigma^2$; that is,
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{x^2}{2\sigma^2}},$$
and let the distortion measure be squared error. Find the optimum reproduction points for 1-bit scalar
quantization and the expected distortion for 1-bit quantization. Compare this with the rate-distortion
function of a Gaussian random variable.
Solution: Let $x$ be the source and $\hat{x}$ the quantizer output. Our goal is to find optimal reproduction points $-a$ and $a$ (Figure 1) such that the distortion is minimized in terms of MSE, i.e., minimize $D = \mathbb{E}[(x-\hat{x})^2]$.
$$\begin{aligned}
D = \mathbb{E}[(x-\hat{x})^2] &= \int_{-\infty}^{0} f_X(x)(x+a)^2\,dx + \int_{0}^{\infty} f_X(x)(x-a)^2\,dx \\
&= 2\int_{0}^{\infty} f_X(x)(x-a)^2\,dx \\
&= 2\left[a^2\int_{0}^{\infty} f_X(x)\,dx + \int_{0}^{\infty} x^2 f_X(x)\,dx - 2a\int_{0}^{\infty} x f_X(x)\,dx\right] \\
&= a^2 + \sigma^2 - \frac{4a\sigma^2}{\sqrt{2\pi\sigma^2}}\int_{0}^{\infty} e^{-y}\,dy \qquad\left(\text{let } y = \frac{x^2}{2\sigma^2}\right) \\
&= a^2 + \sigma^2 - \frac{4a\sigma^2}{\sqrt{2\pi\sigma^2}}.
\end{aligned}$$
Now we take the partial derivative of $\mathbb{E}[(x-\hat{x})^2]$ with respect to $a$ and set it to zero:
$$\frac{\partial\,\mathbb{E}[(x-\hat{x})^2]}{\partial a} = 2a - \frac{4\sigma^2}{\sqrt{2\pi\sigma^2}} = 0 \quad\Rightarrow\quad a^{*} = \sqrt{\frac{2}{\pi}}\,\sigma.$$
So the two optimum reproduction points are $a^{*} = \sqrt{2/\pi}\,\sigma$ and $-a^{*} = -\sqrt{2/\pi}\,\sigma$. Substituting $a^{*}$ back into the expected-distortion expression we just found, the expected distortion is
$$D = \mathbb{E}[(x-\hat{x})^2] = \frac{2}{\pi}\sigma^2 + \sigma^2 - \frac{4}{\pi}\sigma^2 = \frac{\pi-2}{\pi}\,\sigma^2.$$
We know that the rate-distortion function of a Gaussian random variable is
$$R(D) = \frac{1}{2}\log_2\frac{\sigma^2}{D} \quad\Rightarrow\quad D_{\mathrm{opt}} = \frac{\sigma^2}{2^{2R(D)}}.$$
In our case the rate is $R = 1$, so $D_{\mathrm{opt}} = \frac{\sigma^2}{4} < \frac{\pi-2}{\pi}\sigma^2$, which means the distortion-rate bound is smaller than the distortion in our case.
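The gap between the 1-bit quantizer and the rate-distortion bound can be seen numerically; the snippet below evaluates both for $\sigma^2 = 1$.

```python
# 1-bit scalar quantizer distortion vs. the distortion-rate bound at R = 1.
import math

sigma2 = 1.0
D_scalar = (math.pi - 2) / math.pi * sigma2    # distortion of the 1-bit quantizer
D_bound = sigma2 / 2 ** (2 * 1)                # sigma^2 / 2^(2R) with R = 1
print(D_scalar, D_bound, D_scalar > D_bound)   # 0.3634 > 0.25
```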
Solution:
$$\begin{aligned}
F(01110) &= \Pr(0.X_1X_2X_3X_4X_5 < 0.01110) \\
&= \Pr(X_1=0, X_2<1) + \Pr(X_1=0, X_2=1, X_3<1) + \Pr(X_1=0, X_2=1, X_3=1, X_4<1) \\
&= 0.7^2 + 0.7^2\cdot 0.3 + 0.7^2\cdot 0.3^2 \\
&= 0.6811.
\end{aligned}$$
From the source we have only observed the first five bits, 01110, which can be continued with an arbitrary sequence. However, for any sequence that starts with 01110, the corresponding value lies between $F(01110)$ and $F(01111)$. So, comparing the binary representations of $F(01110)$ and $F(01111)$, we observe that we are sure about 3 bits: 101.
Problem 3: Consider the following compression scheme for binary sequences. We divide the binary sequence into blocks of size 16. For each block, if the number of zeros is greater than or equal to 8 we store a 0; otherwise we store a 1. If the sequence is random with probability of zero 0.9, compute the rate and average distortion (Hamming metric). Compare your result with the corresponding value of the rate-distortion function for binary sources.
Solution: It is easy to see that the rate is $R(D) = \frac{1}{16}$. We can find the optimum distortion $D_{\mathrm{opt}}$ when the rate is $\frac{1}{16}$ according to the binary rate-distortion function, where $h(\cdot)$ is the binary entropy function and $p$ is the source probability of zero:
$$D_{\mathrm{opt}} = h^{-1}\big[h(p) - R(D)\big] \;\Rightarrow\; D_{\mathrm{opt}} = h^{-1}\!\left[h(0.1) - \frac{1}{16}\right] \;\Rightarrow\; D_{\mathrm{opt}} \approx 0.08.$$
In our source compression scheme, the distortion is summarized in Table 1; assume that the two quantization levels are $000\ldots$ and $111\ldots$ respectively. [Table 1: the average Hamming distortion per block works out to $D_{\mathrm{ave}} \approx 1.6$.] So the normalized distortion is $D_{\mathrm{norm}} = \frac{D_{\mathrm{ave}}}{16} = 0.1$, which is larger than the $D_{\mathrm{opt}}$ we found before.
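The normalized distortion of this block scheme can also be estimated by simulation; the sketch below (block count arbitrary) draws random 16-bit blocks with $\Pr(0)=0.9$ and measures the average Hamming distortion.

```python
# Monte Carlo estimate of the distortion of the 16-bit block scheme:
# reconstruct all-zeros if the block has >= 8 zeros, else all-ones.
import numpy as np

rng = np.random.default_rng(4)
n_blocks, block = 200_000, 16
bits = (rng.random((n_blocks, block)) < 0.1).astype(int)   # P(1) = 0.1, P(0) = 0.9

ones = bits.sum(axis=1)
rec = (ones > block // 2).astype(int)          # store 1 only if fewer than 8 zeros
# Hamming distortion per block: mismatches against the all-0 or all-1 word
dist = np.where(rec == 0, ones, block - ones)
print(dist.mean() / block)                     # about 0.1, versus D_opt ~ 0.08
```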
Problem 1: Design a 3-bit uniform quantizer (by specifying the decision boundaries and reconstruction levels) for a source with the following pdf:
$$f_X(x) = \frac{1}{6}e^{-\frac{|x|}{3}}.$$
Solution: Let $\Delta$ be the quantization step size and $Q(\cdot)$ the uniform quantizer. Since it is a 3-bit uniform quantizer, we need $2^3 = 8$ reconstruction levels, as follows (note that we only consider the positive region here because the distribution is symmetric):
$$Q(x) = \begin{cases}
\frac{\Delta}{2} & 0\le x\le \Delta \\[2pt]
\frac{3\Delta}{2} & \Delta < x\le 2\Delta \\[2pt]
\frac{5\Delta}{2} & 2\Delta < x\le 3\Delta \\[2pt]
\frac{7\Delta}{2} & 3\Delta < x < \infty,
\end{cases}$$
with $\Delta \approx 3.101$. The optimal 3-bit uniform quantizer is shown in Figure 2.
Figure 2: Black dots represent the optimal quantization boundaries, spaced by $\Delta \approx 3.101$. Red dots represent the optimal reconstruction levels, which are set in the middle of two adjacent boundaries.
Problem 2: What are the mean and variance of the random variable of Problem 1 above? Derive the mean of the output of the uniform quantizer you designed for the above problem. What is the mean of the optimum quantizer for this distribution?
Solution:
$$\text{Mean: } \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{-\infty}^{\infty} x\cdot\frac{1}{6}e^{-\frac{|x|}{3}}\,dx = 0.$$
$$\text{Variance: } \mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = 2\int_{0}^{\infty} x^2\cdot\frac{1}{6}e^{-\frac{x}{3}}\,dx = 18.$$
Optimum quantizer mean: $\mathbb{E}[Q(X)] = \mathbb{E}[X] = 0$.
Problem 5 Consider a source X uniformly distributed on the set {1,2,...m}. Find the rate distortion
function for this source with Hamming distance; that is, $d(x,\hat{x}) = 0$ when $x = \hat{x}$ and $d(x,\hat{x}) = 1$ when $x \ne \hat{x}$.
1 Kolmogorov Complexity
Let us briefly talk about the concept of Kolmogorov complexity. First, we should ask ourselves "what is a computer?" Generally, a computer is a finite state machine (FSM) consisting of a read tape, an output tape, and a work tape (Figure 3). This is called a Turing machine (after Alan Turing). A Turing machine is able to simulate any other computer (Church-Turing thesis): any real-world computation can be translated into an equivalent computation involving a Turing machine.
For example, the sequence 010101... ("01" repeated 10000 times) can be described by the sentence "Repeat '01' 10000 times." As another example, the sequence 141592... can be described by the sentence "The first 6 digits after the point in π."
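This is the essence of Kolmogorov complexity: a highly regular string has a description much shorter than the string itself. A toy illustration (not a formal definition) in Python:

```python
# A 20000-character string produced by a ~16-character program: its
# Kolmogorov complexity is far below its length.
s = "01" * 10000               # the whole string: length 20000
program = 's = "01" * 10000'   # a short description of it
print(len(s), len(program))
```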
Now let us define Kolmogorov complexity. For any sequence $x$, the Kolmogorov complexity $K_U(x)$ with respect to a universal computer $U$ is the length of the shortest program that makes $U$ print $x$ and halt.
EE5585 Data Compression April 02, 2013
Lecture 18
Instructor: Arya Mazumdar Scribe: Shashanka Ubaru
Solutions for problems 1-4 and 6 of HW2 were provided in the previous lecture, and the concepts of Turing machines and Kolmogorov complexity were introduced. In this lecture, we complete problem 5 (the rate-distortion function of a uniform source) and continue with Kolmogorov complexity.
(Reference for these topics: Chapter 14, "Elements of Information Theory," 2nd ed., T. Cover and J. Thomas, Wiley, 2006.)
$$R(D) = \min_{p(\hat{x}|x):\ \mathbb{E}\{d(x,\hat{x})\}\le D} I(X;\hat{X})$$
This optimization seems difficult to solve. So a good trick is to find a lower bound for $I(X;\hat{X})$ subject to the stated constraint, and then come up with an example that achieves this lower bound given the constraint on $p(\hat{x}|x)$. (Recall: this technique was used to find $R(D)$ for binary and Gaussian random variables also.)
By definition,
$$I(X;\hat{X}) = H(X) - H(X|\hat{X}) = \log m - H(X|\hat{X}),$$
since $X$ is uniform. For a binary random variable we could equate $H(X|\hat{X})$ with $H(X\oplus\hat{X}\,|\,\hat{X})$, but here this is not true. So we define a new random variable
$$Y = \begin{cases} 0 & \text{if } X = \hat{X} \\ 1 & \text{if } X \ne \hat{X}. \end{cases}$$
$H(X|\hat{X})$ is the uncertainty in $X$ if $\hat{X}$ is known, and we have
$$H(X|\hat{X}) = H(X, Y\,|\,\hat{X}) = H(Y|\hat{X}) + H(X\,|\,\hat{X}, Y).$$
Substituting,
$$I(X;\hat{X}) = H(X) - H(Y|\hat{X}) - H(X|\hat{X},Y) \ge \log m - H(Y) - H(X|\hat{X},Y).$$
Now,
$$H(X|\hat{X},Y) = \Pr(Y=0)\,H(X|\hat{X},Y=0) + \Pr(Y=1)\,H(X|\hat{X},Y=1).$$
If $Y=0$ then $X=\hat{X}$, so $H(X|\hat{X},Y=0)=0$: there is no uncertainty. If $Y=1$, then for a given $\hat{X}$ there are only $m-1$ choices for $X$, so
$$H(X|\hat{X},Y) \le \Pr(Y=1)\log(m-1) = \Pr(X\ne\hat{X})\log(m-1),$$
and $H(Y) = h\big(\Pr(X\ne\hat{X})\big)$. Then,
$$I(X;\hat{X}) \ge \log m - h\big(\Pr(X\ne\hat{X})\big) - \Pr(X\ne\hat{X})\log(m-1).$$
Also,
$$\mathbb{E}\,d(X,\hat{X}) = 1\cdot\Pr(X\ne\hat{X}) + 0\cdot\Pr(X=\hat{X}) = \Pr(X\ne\hat{X}) \le D,$$
and since both $h(\cdot)$ and $(\cdot)\log(m-1)$ are increasing functions in the relevant range, replacing $\Pr(X\ne\hat{X})$ by $D$ only weakens the bound:
$$I(X;\hat{X}) \ge \log m - D\log(m-1) - h(D).$$
To show that this bound is achievable, let $\hat{X}$ be uniform on $\{1,\ldots,m\}$ and use the test channel $\Pr(X=i\,|\,\hat{X}=i) = 1-D$ and $\Pr(X=i\,|\,\hat{X}=j) = \frac{D}{m-1}$ for $j\ne i$. Then
$$\begin{aligned}
\Pr(X=i) &= \Pr(\hat{X}=i)\Pr(X=i|\hat{X}=i) + \sum_{j\ne i}\Pr(\hat{X}=j)\Pr(X=i|\hat{X}=j) \\
&= \frac{1}{m}(1-D) + \sum_{j\ne i}\frac{1}{m}\,\frac{D}{m-1} \\
&= \frac{1-D}{m} + \frac{D}{m} \\
&= \frac{1}{m}.
\end{aligned}$$
So $X$ is equiprobable, as required. The mutual information is given by
$$I(X;\hat{X}) = \log m - H(X|\hat{X}),$$
with
$$H(X|\hat{X}) = \sum_{i=1}^{m}\Pr(\hat{X}=i)\,H(X|\hat{X}=i) = \frac{1}{m}\sum_{i=1}^{m} H(X|\hat{X}=i) = H(X|\hat{X}=i).$$
We have $\Pr(X=i|\hat{X}=i) = 1-D$ and $\Pr(X=j|\hat{X}=i) = \frac{D}{m-1}$ for all $j\ne i$. So
$$\begin{aligned}
H(X|\hat{X}=i) &= -(1-D)\log(1-D) - \sum_{j\ne i}\frac{D}{m-1}\log\frac{D}{m-1} \\
&= -(1-D)\log(1-D) - D\log\frac{D}{m-1} \\
&= h(D) + D\log(m-1).
\end{aligned}$$
Substituting,
$$I(X;\hat{X}) = \log m - h(D) - D\log(m-1),$$
and therefore
$$R(D) = \log m - h(D) - D\log(m-1).$$
Digression:
What happens if we use a scalar quantizer for the above system (quantize $X\in\{1,2,\ldots,m\}$)? Suppose we use a uniform quantizer: group the values $1,2,\ldots,m$ into $m/\Delta$ bins of $\Delta$ consecutive values each, and take the reconstruction point of each bin to be its middle element. The average distortion $D$ is the same for each bin (uniform quantizer). We are using Hamming distortion, $d(x,\hat{x}) = 1$ if $\hat{x}\ne x$ and $0$ if $\hat{x}=x$; within a bin, the reconstruction matches the source value with probability $1/\Delta$, so
$$D = \left(1-\frac{1}{\Delta}\right)\cdot 1 + \frac{1}{\Delta}\cdot 0 = 1-\frac{1}{\Delta}.$$
Rate-distortion trade-off: we have $m/\Delta$ possible outputs, so we need $\log\frac{m}{\Delta}$ bits, and the rate of this scheme is $R = \log\frac{m}{\Delta}$. But $D = \frac{\Delta-1}{\Delta} = 1-\frac{1}{\Delta}$ implies $\frac{1}{\Delta} = 1-D$. Thus the rate-distortion performance of this scheme is
$$R = \log\big(m(1-D)\big) = \log m - \log\frac{1}{1-D}.$$
From Figure 2, it is evident that the rate of the scalar quantizer is always higher than the rate-distortion function. Thus, this scheme is suboptimal.
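The comparison in Figure 2 can be reproduced numerically; the sketch below evaluates $R(D)$ and the scalar-quantizer rate for an arbitrary alphabet size $m$ (logs in bits).

```python
# R(D) = log m - h(D) - D*log(m-1) vs. the scalar-quantizer rate
# R_sq = log m - log(1/(1-D)) for a uniform source under Hamming distortion.
import math

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x*math.log2(x) - (1-x)*math.log2(1-x)

m = 64
for D in [0.1, 0.3, 0.5, 0.7]:
    R = math.log2(m) - h(D) - D * math.log2(m - 1)
    R_sq = math.log2(m) - math.log2(1 / (1 - D))
    print(D, round(R, 3), round(R_sq, 3))      # R_sq exceeds R(D) at every D
```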
Kolmogorov Complexity
Introduction
So far, a given sequence of data (object) $X$ was treated as a random variable with probability mass function $p(x)$, and the attributes (properties) defined for the sequence, like the entropy $H(X)$, the average length $L(C)$, the relative entropy (divergence) $D(p\,\|\,q)$, the rate-distortion function $R(D)$, etc., depended on the probability distribution of the sequence. Most of the coding and quantization techniques that we saw also depended on the probability distribution of the sequence. We can define a descriptive complexity of the event $X=x$ as $\log\frac{1}{p(x)}$. But Andrey Kolmogorov (a Soviet mathematician) defined the algorithmic (descriptive) complexity of an object $X$ to be the length of the shortest binary computer program that describes the object. He also observed that this definition of complexity is essentially computer independent. Here, the object $X$ is treated as a string of data that enters a computer, and Kolmogorov complexity is analogous to the entropy of this sequence.
An acceptable model for computers that is universal, in the sense that it can mimic the actions of other computers, is the Turing machine model. This model considers a computer as a finite-state machine operating on a finite symbol set. A computer program is fed left to right into this finite-state machine as a program tape (shown in Figure 2). The machine inspects the program tape, writes some symbols on a work tape, changes its state according to its transition table, and outputs a sequence $Y$. In this model, we consider only program tapes containing a halt command (i.e., an indication of when to stop the program). No program leading to a halting computation can be the prefix of another program; this forms a prefix-free set of programs. Now the question is: given a string/sequence, can we compress it or not? Answer: Kolmogorov complexity and its properties.
The Kolmogorov complexity $K_U(x)$ of a string $x$ with respect to a universal computer $U$ is defined as
$$K_U(x) = \min_{p:\,U(p)=x} l(p),$$
the minimum length over all programs that print $x$ and halt. Thus, $K_U(x)$ is the shortest description length of $x$ over all descriptions interpreted by computer $U$. The conditional version $K_U(x\,|\,l(x))$ is defined analogously over programs that are also given $l(x)$; this is the shortest description length if the computer $U$ has the length of $x$ made available to it.
Property 1:
If $U$ is a universal computer, then for any other computer $A$ there exists a constant $C$ such that
$$K_U(x) \le K_A(x) + C.$$
The constant $C$ does not depend on $x$. All universal computers give essentially the same $K(x)$; they differ by at most a constant $C$.
Property 2:
$$K(x\,|\,l(x)) \le l(x) + c.$$
The conditional complexity is at most the length of the sequence (plus a constant); that is, the length of a program will be at most the length of our string $x$. Example: "Print the following $l$-length sequence: $x_1\ldots x_l$." Here $l$ is given, so the program knows when to stop.
Property 3:
$$K(x) \le K(x\,|\,l(x)) + \log^{*}(l(x)) + c,$$
where $\log^{*}(n) = \log n + \log\log n + \log\log\log n + \cdots$. Here we do not know the length $l(x)$ of the sequence. The length of a sequence is represented with $\log l(x)$ bits, but $\log l(x)$ is itself unknown, so we need $\log\log l(x)$, and so on; hence the term $\log^{*}l(x)$.
Property 4:
The number of binary sequences $x$ with complexity $K(x) < k$ is less than $2^k$:
$$\big|\{x\in\{0,1\}^{*} : K(x) < k\}\big| < 2^k,$$
since there are at most $1 + 2 + 2^2 + \cdots + 2^{k-1} = 2^k - 1 < 2^k$ programs of length less than $k$.
Property 5:
The Kolmogorov complexity of a binary string $x$ is bounded by
$$K(x_1 x_2\cdots x_n\,|\,n) \le nH\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) + \log^{*}n + c.$$
Suppose our sequence has $k$ ones. Can we compress a sequence of $n$ bits with $k$ ones? Given a table of all length-$n$ strings with $k$ ones, our computer only needs an index of length $\log\binom{n}{k}$. But we do not know $k$, so to describe $k$ we need another $\log^{*}k \le \log^{*}n$ bits. So the worst-case length will be
$$\log\binom{n}{k} + \log^{*}n + c.$$
By Stirling's approximation,
$$\log\binom{n}{k} \approx nH\!\left(\frac{k}{n}\right) = nH\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right),$$
so
$$K(x_1 x_2\cdots x_n\,|\,n) \le nH\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) + \log^{*}n + c.$$
Property 6:
The halting programs form a prefix-free set, and their lengths satisfy the Kraft inequality:
$$\sum_{p:\,p\ \text{halts}} 2^{-l(p)} \le 1.$$
Theorem:
Suppose $\{X_i\}$ is an i.i.d. sequence with $X\in\mathcal{X}$. Then
$$\frac{1}{n}\sum_{x_1 x_2\cdots x_n} K(x_1 x_2\cdots x_n\,|\,n)\Pr(x_1 x_2\cdots x_n) \;\to\; H(X).$$
Proof:
$$\sum_{x_1 x_2\cdots x_n} K(x_1 x_2\cdots x_n\,|\,n)\Pr(x_1 x_2\cdots x_n) \;\ge\; H(X_1, X_2,\ldots,X_n) = nH(X).$$
Here $K(x_1 x_2\cdots x_n\,|\,n)$ is the smallest length of any program that prints the string, and because the programs are prefix-free, these are the lengths of a prefix-free code. The left-hand side above is then nothing but the average codeword length, and we know $L(C)\ge H$. Thus
$$\frac{1}{n}\sum_{x_1 x_2\cdots x_n} K(x_1 x_2\cdots x_n\,|\,n)\Pr(x_1 x_2\cdots x_n) \;\ge\; H(X).$$
For the other direction, Property 5 gives
$$\frac{1}{n}K(x_1 x_2\cdots x_n\,|\,n) \le H\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) + \frac{1}{n}\log^{*}n + \frac{c}{n}.$$
Taking expectations,
$$\mathbb{E}\left\{\frac{1}{n}K(x_1 x_2\cdots x_n\,|\,n)\right\} \le \mathbb{E}\left[H\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\right] + \frac{1}{n}\log^{*}n + \frac{c}{n},$$
but by Jensen's inequality (concavity of $H$), $\mathbb{E}\big[H\big(\frac{1}{n}\sum_i x_i\big)\big] \le H\big(\mathbb{E}\big[\frac{1}{n}\sum_i x_i\big]\big) = H[\Pr(x=1)]$; then
$$\mathbb{E}\left\{\frac{1}{n}K(x_1 x_2\cdots x_n\,|\,n)\right\} \le H[\Pr(x=1)] + \frac{1}{n}\log^{*}n + \frac{c}{n} = H[X] + \frac{1}{n}\log^{*}n + \frac{c}{n} \;\to\; H(X).$$
Incompressible Sequences
There are certain large numbers that are simple to describe, like $2^{2^{2^{2}}}$ or $(100!)$. But most large sequences do not have a simple description; such sequences are incompressible. The condition for a sequence to be incompressible is
$$\lim_{n\to\infty}\frac{K(x_1 x_2 x_3\ldots x_n\,|\,n)}{n} = 1.$$
Thus, Kolmogorov complexity tells us, for a given sequence, how much we can compress it (answering the question posed in the Introduction). That is, if $K(X)$ is of the order of the length $n$, then clearly the sequence is incompressible.
Theorem:
For a binary incompressible sequence $X = \{x_1, x_2, x_3,\ldots,x_n\}$,
$$\frac{1}{n}\sum_{i=1}^{n}x_i \;\to\; \frac{1}{2},$$
i.e., there are approximately the same number of 1's and 0's, or the proportions of 0's and 1's in any incompressible string are almost equal.
Proof:
By the definition of incompressibility we have
$$K(x_1 x_2 x_3\ldots x_n\,|\,n) \ge n - c_n,$$
where $c_n$ is some number with $c_n/n \to 0$. Then by Property 5, we have
$$n - c_n \le K(x_1 x_2\cdots x_n\,|\,n) \le nH\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) + \log^{*}n + c,$$
$$1 - \frac{c_n}{n} \le H\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) + \frac{\log^{*}n}{n} + \frac{c}{n},$$
$$H\!\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) \ge 1 - \frac{c_n + c + \log^{*}n}{n} = 1 - \varepsilon_n,$$
and $\varepsilon_n \to 0$ as $n\to\infty$. Since the binary entropy $H(p)$ is close to 1 only for $p$ close to $\frac{1}{2}$ (see Figure 4), this forces $\frac{1}{n}\sum_i x_i \to \frac{1}{2}$.
Figure 4: $H(p)$ vs. $p$.
References
[1] T. Cover and J. Thomas, "Elements of Information Theory," 2nd ed., Wiley, 2006, Chapter 14.
[2] http://en.wikipedia.org/wiki/Kolmogorov_complexity