Huffman Code
Example
Suppose we have a data file consisting of 100,000 characters that we want to
compress. The characters in the file occur with the following frequencies.
Character           a        b        c        d        e       f
Frequency           45,000   13,000   12,000   16,000   9,000   5,000
Fixed-length code   000      001      010      011      100     101
Conclusion
The fixed-length code requires 300,000 bits (3 bits per character × 100,000
characters), while the variable-length code below requires only 224,000 bits.
Prefix Codes
A prefix code is one in which no codeword is a prefix of any other codeword.
The reason prefix codes are desirable is that they simplify encoding
(compression) and decoding.
Can we do better?
Character              a        b        c        d        e       f
Frequency              45,000   13,000   12,000   16,000   9,000   5,000
Variable-length code   0        101      100      111      1101    1100
Character 'a' occurs 45,000 times and is assigned a 1-bit codeword, so it
contributes 1 × 45,000 = 45,000 bits. The 3-bit codewords for b, c, and d
contribute 3 × (13,000 + 12,000 + 16,000) = 123,000 bits, and the 4-bit
codewords for e and f contribute 4 × (9,000 + 5,000) = 56,000 bits.
This implies that the total is: 45,000 + 123,000 + 56,000 = 224,000 bits.
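To double-check this arithmetic, here is a minimal Python sketch (the variable
names are illustrative) that recomputes both totals from the frequency table:

freq = {'a': 45_000, 'b': 13_000, 'c': 12_000,
        'd': 16_000, 'e': 9_000, 'f': 5_000}
fixed_len = {c: 3 for c in freq}                            # every fixed codeword is 3 bits
var_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}  # variable codeword lengths

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq)
var_bits = sum(freq[c] * var_len[c] for c in freq)
print(fixed_bits)  # 300000
print(var_bits)    # 224000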
String Encoding
A string is encoded by concatenating the codewords of its characters. For
example, with the codewords T = 10, E = 00, A = 010, S = 011, N = 110:
TEA → 10 00 010
SEA → 011 00 010
TEN → 10 00 110
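The following minimal Python sketch reproduces this encoding step; the
codeword table is read off from the example above and the function name is
illustrative:

code = {'T': '10', 'E': '00', 'A': '010', 'S': '011', 'N': '110'}

def encode(s: str) -> str:
    """Concatenate the codeword of each character in the string."""
    return ''.join(code[ch] for ch in s)

print(encode("TEA"))  # 1000010   (10 00 010)
print(encode("SEA"))  # 01100010  (011 00 010)
print(encode("TEN"))  # 1000110   (10 00 110)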
Decoding
Since no codeword is a prefix of any other, the codeword that begins an
encoded file is unambiguous. To decode (translate back to the original
characters), identify the initial codeword, translate it back to its
character, remove it from the encoded file, and repeat on the remainder.
For example, with the variable-length code in the table below, the string
001011101 parses uniquely as 0·0·101·1101, which decodes to aabe. (A decoding
sketch in code follows the table.)
Character              a        b        c        d        e       f
Frequency              45,000   13,000   12,000   16,000   9,000   5,000
Fixed-length code      000      001      010      011      100     101
Variable-length code   0        101      100      111      1101    1100
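Below is a minimal Python sketch of this decoding loop for the variable-length
code in the table; since the code is prefix-free, the first codeword matched
by the accumulated bits is always the correct one:

code = {'a': '0', 'b': '101', 'c': '100',
        'd': '111', 'e': '1101', 'f': '1100'}
decode_table = {w: ch for ch, w in code.items()}

def decode(bits: str) -> str:
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in decode_table:   # prefix property: first match is the codeword
            out.append(decode_table[buf])
            buf = ''
    assert buf == '', "input ended in the middle of a codeword"
    return ''.join(out)

print(decode("001011101"))  # aabe

The greedy match is safe precisely because of the prefix property; with a
non-prefix code, this loop could split the bit string incorrectly.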
Figure
If C is the alphabet from which the characters are drawn, then the tree for an
optimal prefix code has exactly |C| leaves (one for each letter) and exactly
|C| − 1 internal nodes. Given a tree T corresponding to the prefix code, we can
compute the number of bits required to encode a file. For each character c in
C, let f(c) be the frequency of c and let dT(c) denote the depth of c's leaf.
Note that dT(c) is also the length of c's codeword. The number of bits required
to encode the file is

B(T) = Σ_{c ∈ C} f(c) dT(c),

which we define as the cost of the tree T.
Therefore, the cost of the tree corresponding to the optimal prefix code above
is 224 in units of 1,000 bits (224 × 1,000 = 224,000 bits).
Analysis
• Q is implemented as a binary heap.
• Line 2 can be performed using BUILD-HEAP (p. 145, CLR) in O(n) time.
• The FOR loop is executed n − 1 times, and each heap operation requires
  O(lg n) time
  ⇒ the FOR loop contributes (n − 1) · O(lg n)
  ⇒ O(n lg n).
• Thus the total running time of HUFFMAN on a set of n characters is
  O(n lg n).
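To make the heap-based analysis concrete, here is a minimal Python sketch of
the procedure built on the standard library's binary heap (heapq); the
tuple-based tree representation and the names are illustrative, not the
book's pseudocode:

import heapq
from itertools import count

def huffman(freq):
    """Build a codeword table for a {character: frequency} map,
    using a binary min-heap as the priority queue Q."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), ch) for ch, w in freq.items()]
    heapq.heapify(heap)                      # BUILD-HEAP: O(n)
    for _ in range(len(freq) - 1):           # n - 1 merges, O(lg n) each
        w1, _, left = heapq.heappop(heap)    # EXTRACT-MIN
        w2, _, right = heapq.heappop(heap)   # EXTRACT-MIN
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]
    codes = {}
    def walk(node, prefix):                  # codeword = edge labels, root to leaf
        if isinstance(node, tuple):
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:
            codes[node] = prefix or '0'      # single-character edge case
    walk(tree, '')
    return codes

print(huffman({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
# bit patterns may differ from the table above, but the codeword lengths agree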
Since there are 8 letters in the alphabet, the initial queue size is n = 8,
and 7 merge steps are required to build the tree. The final tree represents
the optimal prefix code.
Figure
The codeword for a letter is the sequence of the edge labels on the path from
the root to the letter. Thus, the optimal Huffman code is as follows:
h: 1
g: 10
f: 110
e: 1110
d: 11110
c: 111110
b: 1111110
a: 1111111
As we can see, the tree is one long limb with the leaves hanging off it. This
is true for Fibonacci weights in general, because the Fibonacci recurrence
implies

F_0 + F_1 + ... + F_i = F_{i+2} − 1.

To prove this, write F_j as F_{j+1} − F_{j−1} (taking F_{−1} = 0) and sum from
j = 0 to i; the sum telescopes. Consequently, after the leaves with weights
F_0, ..., F_i have been merged, the resulting subtree has weight F_{i+2} − 1,
which is at least F_{i+1} and less than F_{i+2}; so the subtree and the leaf
with weight F_{i+1} are always the two smallest items in the queue, and every
merge step extends the same limb.
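A quick numerical check of this identity, assuming the indexing
F_0 = F_1 = 1 used in the example:

# Verify sum(F_0..F_i) == F_{i+2} - 1 for the first few i.
F = [1, 1]
while len(F) < 20:
    F.append(F[-1] + F[-2])

for i in range(18):
    assert sum(F[:i + 1]) == F[i + 2] - 1
print("identity holds for i = 0..17")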
Step 1: Show that the problem satisfies the greedy-choice property; that is,
if a greedy choice is made by Huffman's algorithm, an optimal solution
remains possible.
Step 2: Show that the problem has the optimal-substructure property; that is,
an optimal solution contains within it optimal solutions to
subproblems.
Proof Idea
Take the tree T representing an optimal prefix code and transform T into a
tree T′ representing another optimal prefix code in which the two characters
x and y appear as sibling leaves of maximum depth in T′. If we can do
this, then their codewords will have the same length and differ only in the
last bit.
Figures
Proof
Let b and c be sibling leaves of maximum depth in tree T.
Without loss of generality, assume that f[b] ≤ f[c] and f[x] ≤ f[y].
Since f[x] and f[y] are the two lowest leaf frequencies, in order, and f[b]
and f[c] are arbitrary leaf frequencies, in order, we have f[x] ≤ f[b] and
f[y] ≤ f[c]. As shown in the figure above, exchange the positions of the
leaves x and b to get T′, and then exchange y and c to get T′′. By the
formula B(T) = Σ_{c ∈ C} f(c) dT(c), the difference in cost between T and T′
is

B(T) − B(T′) = f[x] dT(x) + f[b] dT(b) − f[x] dT(b) − f[b] dT(x)
             = (f[b] − f[x]) (dT(b) − dT(x)) ≥ 0,

since f[b] − f[x] ≥ 0 (x has the lowest frequency) and dT(b) − dT(x) ≥ 0
(b is a leaf of maximum depth). Similarly, exchanging y and c does not
increase the cost, so B(T′′) ≤ B(T′) ≤ B(T). Since T is optimal,
B(T) ≤ B(T′′); hence B(T′′) = B(T), and T′′ is an optimal tree in which x and
y are sibling leaves of maximum depth.
Proof Idea
Figure
Proof
We show that the cost B(T) of tree T can be expressed in terms of the cost
B(T′) of the tree T′, by considering the component costs in the equation
B(T) = Σ_{c ∈ C} f(c) dT(c). For each c ∈ C − {x, y}, we have
dT(c) = dT′(c), and hence f(c) dT(c) = f(c) dT′(c). Since
dT(x) = dT(y) = dT′(z) + 1, we have

f[x] dT(x) + f[y] dT(y) = (f[x] + f[y]) (dT′(z) + 1)
                        = f[z] dT′(z) + (f[x] + f[y]),

from which we conclude that B(T) = B(T′) + f[x] + f[y].
If T′ represents a non-optimal prefix code for the alphabet C′, then there
exists a tree T′′ whose leaves are the characters of C′ such that
B(T′′) < B(T′). Now, if x and y are added to T′′ as children of z, then we
get a prefix code for the alphabet C with cost B(T′′) + f[x] + f[y] < B(T),
contradicting the optimality of T. This implies that the tree T′ must be
optimal for the alphabet C′. □
Proof
Let S be the set of integers n ≥ 2 for which the Huffman procedure
produces a tree representing an optimal prefix code for frequencies f and an
alphabet C with |C| = n.
If C = {x, y}, then Huffman produces one of the following optimal trees.
Figure
Proof
Let T be a full binary tree with n leaves. Apply the induction hypothesis on
the number of leaves in T. When n = 2 (the case n = 1 is trivially true),
there are two leaves x and y (say) with the same parent z, and the cost of T
is B(T) = f(x) + f(y), since each of the two codewords has length 1.
Correctness
The proof is immediate from the greedy-choice property and the
optimal-substructure property established above. In other words, the proof
parallels the correctness proof of Huffman's algorithm in CLR.