Huffman Coding

Download as pdf or txt
Download as pdf or txt
You are on page 1of 32
At a glance
Powered by AI
The key takeaways are that Huffman coding is a lossless data compression algorithm that uses variable-length codewords to encode characters based on their frequency of occurrence. It assigns shorter codewords to frequent characters and longer codewords to less frequent characters to achieve optimal compression.

Huffman coding is an algorithm that uses variable-length codewords to encode characters based on their frequency of occurrence in a file. It builds a binary tree from the frequency table where the two least frequent characters are merged at each step. The codeword for each character is then derived by traversing the path from the root to the leaf node corresponding to that character.

The steps involved in constructing a Huffman code are: 1) Build a frequency table of characters, 2) Create a leaf node for each unique character and label it with its frequency, 3) Add the leaf nodes to a priority queue sorted by frequency, 4) Repeatedly remove the two nodes of lowest frequency from the queue and combine them into a new internal node, with the two nodes as children, until only one node remains, which is the root node.

Huffman Coding

Hashmat Rohian CSE 6111

Motivation
To compress or not to compress, that is the question!

reducing the space required to store files on disk or tape

reducing the time to transmit large files.

Image Source : plus.maths.org/issue23/ features/data/data.jpg

Example:
A file with 100K characters
Character Frequency (in 1000s) Fixed-length codeword a 45 000 b 13 001 c 12 010 d 16 011 e 9 100 f 5 101

Space =

(45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000

= 300K bits

Can we do better ?? YES !!


Use variable-length codes instead. Give frequent characters short codewords, and infrequent characters long codewords.
a Frequency (in 1000s) Fixed-length codeword Variable-length codeword 45 000 0 b 13 001 101 c 12 010 100 d 16 011 111 e 9 100 1101 f 5 101 1100

Space =

(45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000

= 224K bits

( Savings = 25%)

PREFIX-FREE CODE : No codeword is also prefix of some other codeword.

No Ambiguity !!

Variable-length codeword

101

100

111

1101

1100

Representation:

The Huffman algorithm is represented as: binary tree each edge represents either 0, "go to the left child" 1, "go to the right child"
0

100
0

a:45 25
0

55
1

30
0 0

c:12

b:13 f:5

14
1

d:16 e:9

each leaf corresponds a particular code.

Cost of the tree B(T) = f(c) dT(c) where c C

0ptimal Code
100
0 1 0

100
0

86
1 0

14 28 14
1 0 1

a:45 25
0

55
1

58
0 1 0

30
0 0

c:12 c:12 d:16 e:9 f:5

b:13 f:5

a:45 b:13

14
1

d:16 e:9

Always a full binary tree One leaf for each letter of the alphabet

Constructing a Huffman code Build the tree T in a bottom-up manner. Begins with a set of |C| leaves Upward |C| - 1 "merging" operations

Greedy Choice? The two smallest nodes are chosen at each step.

The steps of Huffman's algorithm


f:5 e:9 c:12 b:13 d:16 a:45 c:12 b:13 14 d:16 a:45
0 1

f:5 14
0

e:9 30 a:45
1

d:16
1 0

25
1

a:45
0

25
1 0

f:5

e:9 c:12

b:13

c:12

b:13
0

14
1

d:16 e:9

f:5 a:45
0

55
1 0

100
0

25
0 1 0

30
1

a:45 d:16 25
0

55
1

c:12

b:13
0

14
1

30
1 0 1

f:5

e:9

c:12

b:13
0

14
1

d:16 e:9

f:5

Running Time Analysis

Q is implemented as a binary min-heap. The merge operation is executed exactly |n| - 1 times. Each heap operation requires time O(log n). = O(nlog n)

Huffman code

Input
ACE

Output

E I C A H

= = = = =

01 00 10 111 110

(111)(10)(01) = 1111001

Huffman Coding Example

Decoding
1. Read compressed file & binary tree 2. Use binary tree to decode file
Follow path from root to leaf

Huffman Code Algorithm Overview

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

Huffman Decoding 1

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

Huffman Decoding 2

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

Huffman Decoding 3

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

Huffman Decoding 4

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

AC

Huffman Decoding 5

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

AC

Huffman Decoding 6

A 3 1 5 1

H 2 0 C 5 0 10 1 0 25 E 8 1 15 I 7 0

1111001

ACE

Huffman Decoding 7

Induction on the number of code words The Huffman algorithm finds an optimal code for n = 1 Suppose that the Huffman algorithm finds an optimal code for codes size n, now consider a code of size n + 1 . . .

Correctness proof

Greedy Choice Proof


T T T

x y a b y x b

a b x y

Assume that f[a] < f[b] and f[x] < f[y] Since f[x] and f[y] are the two lowest frequencies, f[x] < f[a] and f[y] < f[b].

T Tree constructed by Huffman X Any code tree Show C(T) <= C(X) T and X Trees from the greedy choice C(T) = C(T) C(X) <= C(X) T and X Trees with minimum cost leaves x and y removed

Finish the induction proof

C(X) = C(X) x y C(T) = C(T) x y C(T) <= C(X) C(T) = C(T) = C(T) + x + y <= C(X) + x + y = C(X) <= C(X)

X : Any tree, X: modified, X : Two smallest leaves removed

What is our next step?

Challenges and how to tackle them?

Two passes over the data: One pass to collect frequency counts of the letters A second pass to encode and transmit the letters, based on the static tree structure.

Problems: Delay (network communication, file compression applications) Extra disk accesses slow down the algorithm.

We need one-pass methods, in which letters are encoded on the fly.

Dynamic Huffman codes

Algorithm FGK The next letter of the message is encoded on the basis of a Huffman tree for the previous letters. Encoding length = (2S + t), where S is the encoding
length by a static Huffman code, and t is the number of letters in the original message.

Sender and receiver start with the same initial tree use the same algorithm to modify the tree after each letter is processed and thus always have equivalent copies of it.

Sibling Property:
A binary tree with p leaves of nonnegative weight is a Huffman tree iff the p leaves have nonnegative weights w1, . . . , wp, and the weight of each internal node is the sum of the weights of its children; and the nodes can be numbered in non-decreasing order by weight, so that nodes (2j - 1)and 2j are siblings, for 1 j p - 1, and their common parent node is higher in the numbering.
32 11 f
9 11

21
7

10

10 5 c

11 5 d
2

6 e

2 a

3 b

Difficulty
Suppose that MT = ai1 , ai2, . . . , ait, has already been processed. ai(t+1) is encoded and decoded using Huffman tree for MT. How to modify this tree quickly in order to get a Huffman tree for MT+1?

Eg. Assume t = 32,


32 11 f
9 11

33 21
10

11

11 f
8

10 5 c

11 5 d
2

ai(t+1) = b
6 e
6

5 c

2 a

3 b

X
11
7 3

22

10

11

5 d
2

6 e

2 a

4 b

First phase
Begin with the leaf of ai(t+1), as the current node. Repeatedly interchange the contents of the current node, including the subtree rooted there, with that of the highest numbered node of the same weight Make the parent of the latter node the new current node. Halt when the root is reached.

Eg. Assume t = 32, ai(t+1) = b


32 11 f
9 11

32 21
10

11

11 f 11

21

10

10 5 c

10 6 e 5 c

11

5 d
2

5 d
4

5 2 a

6 e 3 b
2

2 a

3 b

32 11 f
9

11

32 21
10

11

11 10
7

21
7

10

11 5 d 5
1

5 5 c
3 5 4

6 e

6 e 3 b
2

10 5 c

11 f 5 d
4

2 a

2 a

3 b

Second phase
We turn this tree into the desired Huffman tree for weights of ai(t+1)s leaf and its ancestors by 1 32 11 5
5 9 11 10

MT+1 by incrementing
33
11

the

21 10
6 7

12 11 f 6

21 10
6 7

10

6 e 3 b
2

6 e

11 f

2 a

5 c

5 d

2 a

4 b

5 c

5 d

Huffman savings are between 20% - 90% Dynamic Huffman Coding optimal and efficient Optimal data compression achievable by a character code can always be achieved with a prefix code. Better compression possible (depends on data) Using other approaches (e.g., pattern dictionary)

Conclusions

Thank you for your attention!

You might also like