Huffman Coding
Motivation
To compress or not to compress, that is the question!
Example:
A file with 100K characters
Character   Frequency (in 1000s)   Fixed-length codeword   Variable-length codeword
a           45                     000                     0
b           13                     001                     101
c           12                     010                     100
d           16                     011                     111
e           9                      100                     1101
f           5                      101                     1100

Fixed-length code:    Space = (45 + 13 + 12 + 16 + 9 + 5) x 3 x 1000 = 300K bits
Variable-length code: Space = (45x1 + 13x3 + 12x3 + 16x3 + 9x4 + 5x4) x 1000 = 224K bits
(Savings = 25%)

No ambiguity! The variable-length code is a prefix code: no codeword is a prefix of any other.
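As a quick sanity check (not part of the original slides), a few lines of Python reproduce both space figures from the table:

    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}  # in 1000s
    fixed_len = {s: 3 for s in freq}                   # 3 bits per character
    var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}
    print(sum(freq[s] * fixed_len[s] for s in freq))   # 300 (K bits)
    print(sum(freq[s] * var_len[s] for s in freq))     # 224 (K bits)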
Representation:
The Huffman code is represented as a binary tree in which each edge represents either 0 ("go to the left child") or 1 ("go to the right child").
[Figure: the Huffman tree for the example. The root (weight 100) has children a:45 (edge 0) and an internal node of weight 55 (edge 1); node 55 splits into node 25 (children c:12 and b:13) and node 30 (children node 14, with leaves f:5 and e:9, and leaf d:16).]
Optimal Code
[Figure: two code trees for the 100K-character example, each with root weight 100. Left: the tree of the fixed-length code, with all six leaves a:45, b:13, c:12, d:16, e:9, f:5 at depth 3 (codewords 000-101) and one unused branch, so it is not full. Right: the optimal variable-length code tree shown above, which is full.]
An optimal code is always represented by a full binary tree, with one leaf for each letter of the alphabet.
Constructing a Huffman code
Build the tree T in a bottom-up manner:
- Begin with a set of |C| leaves.
- Perform |C| - 1 upward "merging" operations.
Greedy choice? The two nodes of smallest weight are chosen at each step (see the heap-based sketch after the running-time analysis below).
[Figure: the merge sequence for the example. Step 1: f:5 + e:9 -> 14. Step 2: c:12 + b:13 -> 25. Step 3: 14 + d:16 -> 30. Step 4: 25 + 30 -> 55. Step 5: a:45 + 55 -> 100, the root.]
The priority queue Q is implemented as a binary min-heap. The merge operation is executed exactly n - 1 times, and each heap operation requires O(log n) time, for a total of O(n log n).
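A minimal Python sketch of this construction, using the standard-library heapq module as the min-heap Q; the function name build_huffman and the nested-pair tree representation are illustrative choices, not the lecture's notation:

    import heapq

    def build_huffman(freq):
        # Leaves are symbols; internal nodes are (left, right) pairs.
        # The counter breaks ties so the heap never compares trees.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)  # greedy choice: the two
            f2, _, t2 = heapq.heappop(heap)  # smallest-weight nodes
            heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
            count += 1
        return heap[0][2]

    # Frequencies (in thousands) from the running example:
    tree = build_huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
    # tree == ('a', (('c', 'b'), (('f', 'e'), 'd'))), matching the figure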
Huffman code
Input: ACE
Codebook: E = 01, I = 00, C = 10, A = 111, H = 110
Output: (111)(10)(01) = 1111001
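Assuming the nested-pair trees from the sketch above, the code table falls out of a single walk of the tree; the assertion replays the ACE example with the slide's codebook written out directly:

    def code_table(tree, prefix=""):
        # An edge to the left child appends "0", to the right child "1";
        # a non-tuple node is a leaf, labeled with its symbol.
        if not isinstance(tree, tuple):
            return {tree: prefix}
        left, right = tree
        table = code_table(left, prefix + "0")
        table.update(code_table(right, prefix + "1"))
        return table

    codes = {"E": "01", "I": "00", "C": "10", "A": "111", "H": "110"}
    assert "".join(codes[ch] for ch in "ACE") == "1111001"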
Decoding
1. Read the compressed file & the binary tree.
2. Use the binary tree to decode the file: for each codeword, follow the path from the root to a leaf.
[Figure: Huffman decoding, steps 1-7. The decoding tree has root weight 25 with children node 15 (leaves I:7 and E:8) and node 10 (leaf C:5 and node 5, with leaves H:2 and A:3). Reading 1111001 bit by bit: 111 -> A, 10 -> C, 01 -> E, so the output grows A, AC, ACE.]
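A small Python sketch of this root-to-leaf walk, on the same nested-pair trees as above; the tree literal below transcribes the decoding figure, and huffman_decode is an illustrative name:

    def huffman_decode(bits, root):
        out, node = [], root
        for b in bits:
            node = node[int(b)]              # 0 = left child, 1 = right child
            if not isinstance(node, tuple):  # reached a leaf: emit its symbol
                out.append(node)
                node = root                  # restart at the root
        return "".join(out)

    # The figure's tree: root 25 = (15 = (I, E), 10 = (C, 5 = (H, A))).
    tree = (("I", "E"), ("C", ("H", "A")))
    assert huffman_decode("1111001", tree) == "ACE"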
Correctness proof
Induction on the number of codewords:
- The Huffman algorithm finds an optimal code for n = 1.
- Suppose the Huffman algorithm finds an optimal code for codes of size n; now consider a code of size n + 1 . . .
[Figure: exchange argument. In any tree, interchanging leaf x with leaf a and leaf y with leaf b moves the two lowest-frequency letters down to sibling leaves of maximum depth without increasing the cost.]
Assume that f[a] <= f[b] and f[x] <= f[y]. Since f[x] and f[y] are the two lowest frequencies, f[x] <= f[a] and f[y] <= f[b].
Let T be the tree constructed by Huffman and X any code tree; we show C(T) <= C(X).
T' and X': trees after the greedy choice (x and y made sibling leaves of maximum depth). Then C(T) = C(T') and C(X') <= C(X).
T'' and X'': trees with the minimum-frequency leaves x and y removed (their common parent becomes a leaf of weight f[x] + f[y]). Then
C(T'') = C(T') - f[x] - f[y]
C(X'') = C(X') - f[x] - f[y]
By the induction hypothesis, C(T'') <= C(X''), so
C(T) = C(T') = C(T'') + f[x] + f[y] <= C(X'') + f[x] + f[y] = C(X') <= C(X).
Two passes over the data:
1. One pass to collect frequency counts of the letters.
2. A second pass to encode and transmit the letters, based on the static tree structure.
Problems:
- Delay (network communication, file compression applications).
- Extra disk accesses slow down the algorithm.
Algorithm FGK
The next letter of the message is encoded on the basis of a Huffman tree for the previous letters. Encoding length <= 2S + t, where S is the encoding length under a static Huffman code and t is the number of letters in the original message.
Sender and receiver start with the same initial tree and use the same algorithm to modify the tree after each letter is processed; thus they always have equivalent copies of it.
Sibling Property:
A binary tree with p leaves of nonnegative weight is a Huffman tree iff: the p leaves have nonnegative weights w1, . . . , wp, and the weight of each internal node is the sum of the weights of its children; and the nodes can be numbered in nondecreasing order by weight, so that nodes 2j - 1 and 2j are siblings, for 1 <= j <= p - 1, and their common parent node is higher in the numbering.
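The property is mechanical to verify. Below is a small, hypothetical Python checker (not from the lecture), assuming the 2p - 1 nodes are stored in arrays weight[] and parent[] indexed 1..2p-1 in numbering order, with index 0 unused and parent[root] = 0:

    def is_huffman_numbering(weight, parent):
        n = len(weight) - 1                      # number of nodes, 2p - 1
        if min(weight[1:]) < 0:                  # weights must be nonnegative
            return False
        # The numbering must be nondecreasing by weight.
        if any(weight[j] > weight[j + 1] for j in range(1, n)):
            return False
        # Nodes 2j-1 and 2j must be siblings with a higher-numbered parent.
        for j in range(1, (n + 1) // 2):
            if parent[2 * j - 1] != parent[2 * j] or parent[2 * j] <= 2 * j:
                return False
        # Each internal node's weight is the sum of its children's weights.
        kids = {}
        for j in range(1, n):                    # every node except the root
            kids.setdefault(parent[j], []).append(j)
        return all(weight[p] == sum(weight[c] for c in kids[p]) for p in kids)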
[Figure: a Huffman tree of total weight 32 satisfying the sibling property. Leaves a:2 and b:3 merge into a node of weight 5, which joins e:6 under a node of weight 11; leaves c:5 and d:5 form a node of weight 10; nodes 10 and 11 form node 21, which joins f:11 at the root. Node numbers increase with weight.]
Difficulty
Suppose that Mt = ai1, ai2, . . . , ait has already been processed, and ai(t+1) is encoded and decoded using the Huffman tree for Mt. How can this tree be modified quickly to obtain a Huffman tree for Mt+1?
[Figure: the difficulty, with ai(t+1) = b. Naively incrementing the weights of leaf b and its ancestors (b: 3 -> 4, then 6, 12, 22, root 33) yields a tree that violates the sibling property, marked X: it is no longer a Huffman tree.]
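The failure is easy to reproduce with the hypothetical checker from the sibling-property slide: after the naive increments, node 8 (weight 12) outranks node 9 (weight 11), so the numbering is no longer nondecreasing:

    # The example tree, numbered 1..11 (a, b, c, d, 5, e, 10, 11, f, 21, 32);
    # index 0 is unused and parent[root] = 0, as in the checker above.
    weight = [0, 2, 3, 5, 5, 5, 6, 10, 11, 11, 21, 32]
    parent = [0, 5, 5, 7, 7, 8, 8, 10, 10, 11, 11, 0]
    assert is_huffman_numbering(weight, parent)

    node = 2                         # naive update for ai(t+1) = b: bump
    while node != 0:                 # leaf b and all of its ancestors
        weight[node] += 1
        node = parent[node]
    assert not is_huffman_numbering(weight, parent)   # sibling property lost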
First phase
Begin with the leaf of ai(t+1) as the current node. Repeatedly interchange the contents of the current node, including the subtree rooted there, with those of the highest-numbered node of the same weight, and make the parent of the latter node the new current node. Halt when the root is reached.
[Figure: the first phase on the example tree. Starting at leaf b:3, each current node on the path toward the root is interchanged with the highest-numbered node of its weight; the tree keeps total weight 32, but the path from b to the root is rearranged so that the coming increments will preserve the sibling property.]
Second phase
We turn this tree into the desired Huffman tree for Mt+1 by incrementing the weights of ai(t+1)'s leaf and its ancestors by 1.
[Figure: after the increments the root weight becomes 33, and the resulting tree satisfies the sibling property, so it is a Huffman tree for Mt+1.]
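Putting the two phases together: a compact Python sketch on the array representation used by the checker above. It folds the second phase's increments into the first phase's climb (a common implementation shortcut), assumes ai(t+1) already has a leaf in the tree, and swap_subtrees / fgk_increment are illustrative names, not the lecture's notation:

    def swap_subtrees(u, v, parent, sym):
        # Exchange the subtrees rooted at positions u and v: re-point their
        # direct children (deeper descendants follow automatically) and swap
        # the leaf symbols. The two weights are equal, so they stay put.
        for j in range(1, len(parent)):
            if parent[j] == u:
                parent[j] = v
            elif parent[j] == v:
                parent[j] = u
        sym[u], sym[v] = sym[v], sym[u]

    def fgk_increment(leaf, weight, parent, sym):
        node = leaf
        while node != 0:                  # 0 marks "above the root"
            # Equal weights are contiguous in the numbering, so the
            # highest-numbered node of the same weight is found by scanning.
            top = node
            while top + 1 < len(weight) and weight[top + 1] == weight[node]:
                top += 1
            if top != node and top != parent[node]:
                swap_subtrees(node, top, parent, sym)
                node = top
            weight[node] += 1             # second phase, folded into the climb
            node = parent[node]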
Conclusions
- Huffman savings are between 20% and 90%.
- Dynamic Huffman coding is optimal and efficient.
- Optimal data compression achievable by a character code can always be achieved with a prefix code.
- Better compression is possible (depending on the data) using other approaches (e.g., a pattern dictionary).