CSEP 590 Data Compression
Autumn 2007

Course Policies
Introduction to Data Compression
Entropy
Variable Length Codes
Instructors
• Instructor
– Richard Ladner
– ladner@cs.washington.edu
– 206 543-9347
• TA
– Rahul Vanam
– rahulv@u.washington.edu
[Diagram: input x → Encoder → compressed stream y → Decoder → reconstruction x̂]
• Lossless compression: x = x̂
– Also called entropy coding or reversible coding.
• Lossy compression: x ≠ x̂
– Also called irreversible coding.
• Compression ratio = |x| / |y|
– |x| is the number of bits in x.
– For example, a 16-bit input coded in 14 bits has compression ratio 16/14 ≈ 1.14.
• Example symbol alphabets: single letters a, b, c, …, z; letter pairs such as th, ch, gh.
[Plot: entropy in bits versus the probability of the first symbol; for a two-symbol alphabet the curve is 0 at probability 0 or 1 and peaks at 1 bit when the probability is 0.5.]
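The curve above is easy to reproduce; here is a minimal sketch (not from the lecture) of the binary entropy function it depicts:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-symbol source with P(first symbol) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# The maximum of 1 bit occurs when both symbols are equally likely.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: H = {binary_entropy(p):.3f} bits")
```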
Coding example with the code a → 00, b → 01, c → 1:
ccabccbccc → 1 1 00 01 1 1 01 1 1 1
Binary Tree Terminology
[Diagram: a binary tree with the root at the top, internal nodes below it, and leaves at the bottom]
Decoding with the code tree for a → 00, b → 01, c → 1: start at the root, follow the branch labeled by each input bit, emit the symbol on reaching a leaf, and return to the root.

Input: 11000111100
Steps: 1 → c, 1 → c, 00 → a, 01 → b, 1 → c, 1 → c, 1 → c, 00 → a
Decoded output: ccabccca

[Diagram, repeated on the slides once per step: code tree whose 1-branch is the leaf c and whose 0-branch is an internal node with children a (on 0) and b (on 1)]
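The tree walk above translates almost line for line into code; a small sketch (the representation and names are mine, not the lecture's):

```python
# Code tree for a -> 00, b -> 01, c -> 1, written as nested pairs:
# index 0 is the 0-branch, index 1 is the 1-branch; leaves are symbols.
tree = (("a", "b"), "c")

def decode(bits, tree):
    """Walk the tree one bit at a time; emit a symbol at each leaf
    and restart at the root (prefix codes make this unambiguous)."""
    out, node = [], tree
    for b in bits:
        node = node[int(b)]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree            # back to the root
    assert node is tree, "input ended in the middle of a codeword"
    return "".join(out)

print(decode("11000111100", tree))  # -> ccabccca
```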
[Diagram: code tree for a → 0, b → 100, c → 101, d → 11]
[Diagram: code tree with leaf probabilities 1/8, 1/4, and 5/8]
Variable Rate Code Example
• Example: a → 0, b → 100, c → 101, d → 11
• Coding:
– aabddcaa = 16 bits with a fixed 2-bit code
– 0 0 100 11 11 101 0 0 = 14 bits with the variable rate code
• Prefix code ensures unique decodability.
– 00100111110100
– aabddcaa
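Encoding with such a codeword table is a direct lookup; a minimal sketch (the function name is mine) that reproduces the 14-bit result:

```python
code = {"a": "0", "b": "100", "c": "101", "d": "11"}

def encode(text, code):
    """Concatenate the codeword for each input symbol."""
    return "".join(code[s] for s in text)

bits = encode("aabddcaa", code)
print(bits, len(bits))  # 00100111110100 14
```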
Optimality Principles
• Principle 1: in an optimal code tree, the smallest probability p can be assumed to sit at maximum depth. If p is at depth k and q is at depth h with k < h and p < q, swapping p and q to form T' changes the cost by (k − h)(q − p) < 0, so T' costs no more than T.
• Principle 2: the two smallest probabilities p and q can be assumed to be siblings at maximum depth. With p smallest and q second smallest, if p's sibling at depth h is some r with q < r and q sits at depth k < h, swapping q and r again does not increase the cost.
• Principle 3: replacing sibling leaves p and q at depth h by a single leaf of probability q + p at depth h − 1 gives a tree T' for the reduced alphabet with
C(T') = C(T) + (h-1)(p+q) - hp - hq = C(T) - (p+q)
Optimality Principle 3 (cont.)
• If T' were not optimal, then we could find a lower cost tree T''. Expanding the merged leaf back into siblings q and p would then give a lower cost tree T''' for the original alphabet, contradicting the optimality of T.
[Diagram: T' with its merged leaf q+p; a cheaper tree T''; T''' obtained from T'' by splitting the leaf q+p into sibling leaves q and p]
Huffman Tree Example
• Probabilities: a: .4, b: .1, c: .3, d: .1, e: .1
• Repeatedly merge the two smallest probabilities:
– merge b (.1) and e (.1) into .2, leaving .4 .2 .3 .1
– merge d (.1) and (be) (.2) into .3, leaving .4 .3 .3
– merge c (.3) and (dbe) (.3) into .6, leaving .4 .6
– merge a (.4) and (cdbe) (.6) into the root with probability 1
[Diagram: the finished tree; codeword lengths are a: 1, c: 2, d: 3, b: 4, e: 4]
Huffman Code
HC = .4 x 1 + .1 x 4 + .3 x 2 + .1 x 3 + .1 x 4
   = 2.1 bits per symbol
pretty good! (the entropy is about 2.05 bits per symbol)
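The construction above is a priority-queue loop; a minimal sketch (mine, not the lecture's) computing codeword lengths and the 2.1 bits/symbol average:

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths by repeatedly merging the two smallest
    probabilities; ties are broken arbitrarily."""
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    depth = dict.fromkeys(probs, 0)
    count = len(heap)
    while len(heap) > 1:
        p, _, sp = heapq.heappop(heap)
        q, _, sq = heapq.heappop(heap)
        for s in sp + sq:      # every symbol under the merged node
            depth[s] += 1      # sinks one level deeper
        heapq.heappush(heap, (p + q, count, sp + sq))
        count += 1
    return depth

probs = {"a": .4, "b": .1, "c": .3, "d": .1, "e": .1}
lengths = huffman_lengths(probs)
rate = sum(probs[s] * lengths[s] for s in probs)
print(lengths)             # one optimal set of lengths; ties may vary
print(f"{rate:.2f} bits")  # 2.10
```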
In Class Exercise
• P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16, P(e) = 1/16
• Compute the optimal Huffman tree and its average bit rate.
• Compute the entropy.
• Compare.
• Hint: for the tree, change the probabilities to integers (a: 8, b: 4, c: 2, d: 1, e: 1) and normalize at the end.
• If the two smallest probabilities are p1 = p2 = 2^-k, the entropy of the reduced distribution (p1 and p2 merged into one symbol) is

H' = -(p1+p2) log2(p1+p2) - ∑i=3..n pi log2(pi)
   = -2^-k log2(2^-k) - 2^-k log2(2^-k) - ∑i=3..n pi log2(pi) - 2^-k - 2^-k
   = -∑i=1..n pi log2(pi) - (p1 + p2)
   = H - (p1 + p2)
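The identity is easy to check numerically; a quick sketch (the example distribution is mine):

```python
import math

def H(ps):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Two smallest probabilities equal to 2^-k, here k = 3.
k = 3
ps = [0.5, 0.25, 2**-k, 2**-k]
merged = [0.5, 0.25, 2**-k + 2**-k]  # the two smallest merged

print(H(ps) - H(merged))  # 0.25, exactly p1 + p2
print(2**-k + 2**-k)      # 0.25
```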
H ≤ HC ≤ H + 1/k
• Pros and cons of extending the alphabet, i.e. coding blocks of k symbols at a time (see the sketch below):
+ better compression
- 2^k symbols in the extended alphabet
- padding needed to make the length of the input divisible by k
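A sketch of the idea (the example source and all names are mine): a binary source with P(0) = 0.9 is stuck at 1 bit/symbol when coded one symbol at a time, but Huffman coding pairs of symbols already beats that:

```python
import heapq
from itertools import product

def huffman_lengths(probs):
    """Same merge-two-smallest sketch as above."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    depth = dict.fromkeys(probs, 0)
    count = len(heap)
    while len(heap) > 1:
        p, _, sp = heapq.heappop(heap)
        q, _, sq = heapq.heappop(heap)
        for s in sp + sq:
            depth[s] += 1
        heapq.heappush(heap, (p + q, count, sp + sq))
        count += 1
    return depth

# Binary source with P(0) = 0.9; its entropy is only ~0.47 bits.
p = {"0": 0.9, "1": 0.1}
k = 2  # block length

# Extended alphabet: all 2^k blocks with product probabilities
# (assuming an i.i.d. source).
blocks = {"".join(t): 1.0 for t in product(p, repeat=k)}
for b in blocks:
    for s in b:
        blocks[b] *= p[s]

lengths = huffman_lengths(blocks)
rate = sum(blocks[b] * lengths[b] for b in blocks) / k
print(f"{rate:.3f} bits/symbol")  # 0.645 with k = 2
```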
[Diagrams: three code trees labeled a, b, c with branch probabilities .9/.1, .8/.2, and .4/.4/.1/.1; worked coding example: abbacc → 00 00 0 1 01 0]
[Diagram: a code tree with leaves labeled 1, 2, 3, 4]
Example
• n = qm + r is represented by:
11…1 (q ones) followed by 0, followed by r̂
– where r̂ is the fixed prefix code for r
• Example (m = 5):
2 → 010, 6 → 1001, 9 → 10111, 10 → 11000, 27 → 11111010
• m = -1/log2(127/128) ≈ 89
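The encoder just described is short to write down; a sketch (the function name is mine) using the usual truncated binary code for the remainder, which reproduces the m = 5 examples above:

```python
def golomb(n, m):
    """Golomb code for n >= 0: q ones, a zero, then the remainder r
    as a fixed prefix (truncated binary) code."""
    q, r = divmod(n, m)
    b = m.bit_length() if m & (m - 1) else m.bit_length() - 1  # ceil(log2 m)
    short = (1 << b) - m           # this many remainders get b-1 bits
    if r < short:
        tail = format(r, f"0{b - 1}b")
    else:
        tail = format(r + short, f"0{b}b")
    return "1" * q + "0" + tail

for n in (2, 6, 9, 10, 27):
    print(n, golomb(n, 5))  # 010, 1001, 10111, 11000, 11111010
```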
[Plot: Golomb code (GC) average bit rate compared with the (first) order entropy]
Tunstall (variable-to-fixed) code example, symbol probabilities .7, .2, .1:

a   → 000
b   → 001
ca  → 010
cb  → 011
cca → 100
ccb → 101
ccc → 110

Unused output code is 111.
[Parse tree diagram: the root branches to a, b, c; node c extends to ca, cb, cc; node cc extends to cca, ccb, ccc]
∑i=1..n pi ri (pi, ri: probability and length of the i-th dictionary string)
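A sketch of encoding with this dictionary (all names are mine): parse the input greedily into dictionary strings and emit the fixed 3-bit output word for each.

```python
# Dictionary from the table above: input strings -> fixed 3-bit words.
table = {"a": "000", "b": "001", "ca": "010", "cb": "011",
         "cca": "100", "ccb": "101", "ccc": "110"}

def tunstall_encode(text, table):
    """Greedy longest-match parse; works because the dictionary
    forms a parse tree (every 'c...' prefix can be extended)."""
    out, i = [], 0
    longest = max(map(len, table))
    while i < len(text):
        for length in range(longest, 0, -1):  # longest match first
            if text[i:i + length] in table:
                out.append(table[text[i:i + length]])
                i += length
                break
        else:
            raise ValueError(f"input not parsable at position {i}")
    return " ".join(out)

print(tunstall_encode("ccacbab", table))  # 100 011 000 001
```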