Text Compression: Examples Huffman Coding: Go Go Gophers

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Text Compression: Examples Huffman coding: go go gophers

“abcde” in the different formats ASCII 3 bits Huffman


Encodings ASCII: g 103 1100111 000 00
01100001011000100110001101100100… o 111 1101111 001 01
13
ASCII: 8 bits/character p 112 1110000 010 1100
Fixed: h 104 1101000 011 1101 6 7
Unicode: 16 bits/character
000001010011100 e 101 1100101 100 1110 3 4
10 g o
r 114 1110010 101 1111
Symbol ASCII Fixed Var. 3 3 s * 2 2
s 115 1110011 110 101
length length 0 1 sp. 32 1000000 111 101 1
0 0 2
a 01100001 000 000 p h e r
0 1 0 1
0
 Encoding uses tree: 1 1 1 1
b 01100010 001 11  0 left/1 right
a b c d e 6
Var:
c 01100011 010 01  How many bits? 37!! 4
000110100110 g o
0 1  Savings? Worth it?
d 01100100 011 001 2 2 3 3
3
e 01100101 100 10 0 1 0 1
p h e r
c e b 1 1 1 s *
0 1 1
1 2
CPS 100 a d 8.1 CPS 100 8.2

Huffman Coding Building a Huffman tree


 D.A Huffman in early 1950’s  Begin with a forest of single-node trees (leaves)
 Before compressing data, analyze the input stream  Each node/tree/leaf is weighted with character count
 Represent data using variable length codes  Node stores two values: character and count
 Variable length codes though Prefix codes  There are n nodes in forest, n is size of alphabet?
 Each letter is assigned a codeword
 Codeword is for a given letter is produced by traversing  Repeat until there is only one node left: root of tree
the Huffman tree  Remove two minimally weighted trees from forest
 Property: No codeword produced is the prefix of another  Create new tree with minimal trees as children,
 Letters appearing frequently have short codewords, • New tree root's weight: sum of children (character ignored)
while those that appear rarely have longer ones
 Huffman coding is optimal per-character coding method  Does this process terminate? How do we get minimal trees?
 Remove minimal trees, hummm……

CPS 100 8.3 CPS 100 8.4


Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

1 1
F C

11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 1 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 2 1
CPS 100 8.5 CPS 100 8.6
I E N S M A B O T G D L R U P F C I E N S M A B O T G D L R U P

Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

2 3 2 3 2 2
L R
1 1 1 2 1 1 1 2
C F P U C F P U

11 6 5 5 4 4 3 3 3 3 3 2 2 2 2 2 11 6 5 5 4 4 4 3 3 3 3 3 2 2 2
CPS 100 8.7 CPS 100 8.8
I E N S M A B O T G D L R I E N S M A B O T G D
Building a tree Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U

23 21 16 37 23
CPS 100 8.9 CPS 100 8.10

Building a tree Mary Shaw


 Software engineering and
software architecture
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
 Tools for constructing
60 large software systems
 Development is a small
37 23 piece of total cost,
maintenance is larger,
21 16 12 11
depends on well-designed
and developed techniques
10 11 8 8 6 6
I  Interested in computer
5 5 5 6 4 4 4 4 3 3
science, programming,
E N S M A B curricula, and canoeing,
3 2 3 3 2 2 2 2 health-care costs
O T G D L R  ACM Fellow, Alan Perlis
1 1 1 2
Professor of Compsci at CMU
C F P U

60
CPS 100 8.11 CPS 100 8.12
Huffman Complexities Writing code out to file
 How do we measure? Size of input file, size of alphabet  How do we go from characters to encodings?
 Which is typically bigger?  Build Huffman tree
 Root-to-leaf path generates encoding
 Accumulating character counts: ______
 How can we do this in O(1) time, though not really  Need way of writing bits out to file
 Building the heap/priority queue from counts ____  Platform dependent?
 Initializing heap guaranteed  Complicated to write bits and read in same ordering
 Building Huffman tree ____
 Why?  See BitInputStream and BitOutputStream classes
 Create table of encodings from tree ____  Depend on each other, bit ordering preserved
 Why?
 Write tree and compressed file _____  How do we know bits come from compressed file?
 Store a magic number

CPS 100 8.13 CPS 100 8.14

Decoding a message Decoding a message

01100000100001001101 1100000100001001101
60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U

CPS 100 8.15 CPS 100 8.16


Decoding a message Decoding a message

100000100001001101 00000100001001101
60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U

CPS 100 8.17 CPS 100 8.18

Decoding a message Decoding a message

0000100001001101 1
60 60

37 23 37 23

21 16 12 11 21 16 12 11

10 11 8 8 6 6 10 11 8 8 6 6
I I
5 5 5 6 4 4 4 4 3 3 5 5 5 6 4 4 4 4 3 3
E N S M A B E N S M A B
3 2 3 3 2 2 2 2 3 2 3 3 2 2 2 2
O T G D L R O T G D L R
1 1 1 2 1 1 1 2
C F P U C F P U

CPS 100 G 8.19 CPS 100 GOOD 8.20


Decoding a message Huffman coding: go go gophers
ASCII 3 bits Huffman
g 103 1100111 000 ?? g o p h e r s *
01100000100001001101 o 111 1101111 001 ?? 3 3 1 1 1 1 1 2
p 112 1110000 010
60 h 104 1101000 011 2 2 3
e 101 1100101 100
37 23 r 114 1110010 101 p h e r s *
s 115 1110011 110 1 1
21 16 12 11 1 1 1 2
sp. 32 1000000 111
 choose two smallest weights
10 11 8 8 6 6
 combine nodes + weights
I
 Repeat
5 5 5 6 4 4 4 4 3 3
 Priority queue? 4 6
E N S M A B
3 2 3 3 2 2 2 2  Encoding uses tree: 2 2 g o
O T G D L R  0 left/1 right
3 3
1 1 1 2  How many bits? p h e r
C F P U 1 1 1 1

CPS 100 GOOD 8.21 CPS 100 8.22

Huffman Tree 2 Huffman Tree 2


 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.23 CPS 100 8.24


Huffman Tree 2 Huffman Tree 2
 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.25 CPS 100 8.26

Huffman Tree 2 Huffman Tree 2


 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.27 CPS 100 8.28


Huffman Tree 2 Huffman Tree 2
 “A SIMPLE STRING TO BE ENCODED USING A  “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” MINIMAL NUMBER OF BITS”
 E.g. “ A SIMPLE” ⇔  E.g. “ A SIMPLE” ⇔
“10101101001000101001110011100000” “10101101001000101001110011100000”

CPS 100 8.29 CPS 100 8.30

Huffman Tree 2 Other methods


 “A SIMPLE STRING TO BE ENCODED USING A  Adaptive Huffman coding
MINIMAL NUMBER OF BITS”  Lempel-Ziv algorithms
 E.g. “ A SIMPLE” ⇔  Build the coding table on the fly while reading
“10101101001000101001110011100000” document
 Coding table changes dynamically
 Protocol between encoder and decoder so that everyone
is always using the right coding scheme
 Works well in practice (compress, gzip, etc.)
 More complicated methods
 Burrows-Wheeler (bunzip2)
 PPM statistical methods

CPS 100 8.31 CPS 100 8.32

You might also like