Huffman Coding Algorithm
All information in computer science is encoded as strings of 1s and 0s. A central objective of information theory is to transmit information using the fewest possible bits, in such a way that every encoding can be decoded unambiguously. This tutorial discusses fixed-length and variable-length encoding, along with Huffman encoding, which forms the basis of many data compression schemes.
Encoding, in computers, is the process of storing or transmitting sequences of characters efficiently. Fixed-length and variable-length encoding are two such schemes, explained as follows-
Fixed-length encoding - Every character is assigned a binary code using the same number of bits. Thus, a string like "aabacdad" requires 64 bits (8 bytes) for storage or transmission, assuming that each character uses 8 bits.
Variable-length encoding - As opposed to fixed-length encoding, this scheme uses a variable number of bits per character, depending on each character's frequency in the given text. For the string "aabacdad", the frequencies of the characters 'a', 'b', 'c' and 'd' are 4, 1, 1 and 2 respectively. Since 'a' occurs more frequently than 'b', 'c' and 'd', it gets the shortest code, followed by 'd', then 'b' and 'c'. Suppose we assign binary codes to each character as follows-
a 0
b 011
c 111
d 11
Thus, the string "aabacdad" gets encoded to 00011011111011 (0 | 0 | 011 | 0 | 111 | 11 | 0 | 11), using only 14 bits compared to the 64 bits of the fixed-length scheme.
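The encoding step above can be sketched in Python as a simple table lookup, using the code table from the text:

```python
# Encode "aabacdad" with the (ambiguous) code table from the text.
codes = {"a": "0", "b": "011", "c": "111", "d": "11"}
text = "aabacdad"
encoded = "".join(codes[ch] for ch in text)
print(encoded)       # 00011011111011
print(len(encoded))  # 14 bits, versus 64 bits for fixed-length
```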
But the real problem arises in the decoding phase. If we try to decode the string 00011011111011, the result is ambiguous, since it can be decoded into multiple strings, a few of which are-
aaadacdad (0 | 0 | 0 | 11 | 0 | 111 | 11 | 0 | 11)
aaadbcad (0 | 0 | 0 | 11 | 011 | 111 | 0 | 11)
aabbcb (0 | 0 | 011 | 011 | 111 | 011)
… and so on
To prevent such ambiguity during decoding, the encoding phase must satisfy the "prefix rule", which states that no binary code may be a prefix of another code. This produces uniquely decodable codes. The above codes for 'a', 'b', 'c' and 'd' do not follow the prefix rule, since the binary code for 'a', i.e. 0, is a prefix of the binary code for 'b', i.e. 011, making the encoded string ambiguous to decode.
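The prefix rule is easy to check mechanically. A small sketch (the helper name `is_prefix_free` is ours, not from the text):

```python
# Check whether a code table satisfies the prefix rule:
# no codeword may be a prefix of another codeword.
def is_prefix_free(codes):
    words = list(codes.values())
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )

print(is_prefix_free({"a": "0", "b": "011", "c": "111", "d": "11"}))   # False: "0" prefixes "011"
print(is_prefix_free({"a": "0", "b": "11", "c": "101", "d": "100"}))   # True
```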
Let's reassign the binary codes to the characters 'a', 'b', 'c' and 'd'.
a 0
b 11
c 101
d 100
Using the above codes, the string "aabacdad" gets encoded to 001101011000100 (0 | 0 | 11 | 0 | 101 | 100 | 0 | 100). Because these codes are prefix-free, the bitstring can be decoded back to "aabacdad" unambiguously.
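With a prefix-free table, decoding is a greedy left-to-right scan: accumulate bits until they match a codeword, emit the character, and continue. A sketch using the table from the text:

```python
# Decode a prefix-free bitstring by greedy matching: read bits until the
# accumulated buffer equals a codeword, emit its character, repeat.
codes = {"a": "0", "b": "11", "c": "101", "d": "100"}
decode_map = {v: k for k, v in codes.items()}

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in decode_map:  # the prefix rule guarantees no ambiguity
            out.append(decode_map[buf])
            buf = ""
    return "".join(out)

encoded = "".join(codes[ch] for ch in "aabacdad")
print(encoded)          # 001101011000100
print(decode(encoded))  # aabacdad
```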
Advantages of Huffman Encoding-
- This encoding scheme saves a lot of storage space, since the binary codes generated are variable in length.
- It generates shorter binary codes for symbols/characters that appear more frequently in the input string.
- The binary codes generated are prefix-free.
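The properties above come from how Huffman codes are built: repeatedly merge the two least frequent nodes into one until a single tree remains. A compact sketch using Python's heap module (the tiebreaker field is our addition, so heap tuples always compare; for the text's string it yields a 14-bit encoding):

```python
import heapq
from collections import Counter

# Huffman code construction: repeatedly merge the two least frequent
# nodes; symbols in the first popped node get a leading "0", the second "1".
def huffman_codes(text):
    freq = Counter(text)
    # Each heap entry: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aabacdad")
encoded = "".join(codes[ch] for ch in "aabacdad")
print(len(encoded))  # 14 bits for "aabacdad"
```

Note that the most frequent character 'a' receives a 1-bit code, and the resulting table is prefix-free by construction, since every symbol sits at a leaf of the merge tree.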
Disadvantages of Huffman Encoding-
- Lossless data encoding schemes, like Huffman encoding, achieve a lower compression ratio than lossy techniques. On its own, Huffman encoding is therefore best suited to text and program files, where no data may be lost; for digital images it is typically combined with lossy transforms rather than used alone.
- Huffman encoding is a relatively slow process, since it makes two passes over the data: one to build the statistical model (the frequency table) and another to encode. Compression schemes that rely on Huffman encoding are therefore considerably slower than some alternatives.
- Since the binary codes vary in length, it is difficult for the decoding software to detect whether the encoded data is corrupt. Corruption can therefore lead to incorrect decoding and wrong output.
Real-life applications of Huffman Encoding-
Huffman coding is a data compression technique that can be used in a variety of applications.
Here are just a few examples:
1. Image compression: By compressing image data, Huffman coding can reduce the
amount of storage space required for digital images. This can be especially beneficial
for large images or images with a lot of detail.
2. Audio compression: Similar to image compression, Huffman coding can also be used
to compress audio data. This can save storage space and bandwidth when streaming or
downloading audio files.
3. Text compression: Text files can also be compressed using Huffman coding. This can
be useful for reducing the size of text documents or email attachments.
4. Data transmission: When data is transmitted over a network, it is often compressed
using Huffman coding to reduce the amount of bandwidth required. This can help to
speed up data transfer and reduce costs associated with transmitting data.