17th Int'l Conf. on Computer and Information Technology, 22-23 December 2014, Daffodil International University, Dhaka, Bangladesh
Effects of Unequal Bit Costs on Classical Huffman Codes
Sohag Kabir∗, Tanzima Azad†, A S M Ashraful Alam‡ and Mohammad Kaykobad§
∗ Department of Computer Science, University of Hull, United Kingdom, Email: [email protected]
† Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh, Email: [email protected]
‡ Department of Computer Science, University of Otago, New Zealand, Email: [email protected]
§ Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh, Email: [email protected]
Abstract—Classical Huffman codes offer very good compression performance over traditional systems. Yet more efficient encoding is possible with techniques that treat the two binary bits differently with respect to storage space, energy consumption, speed of execution, and so on. Future transmission systems are likely to be more efficient in many respects, consuming fewer resources to transmit or store one of the two binary bits. Such unequal bit costs necessitate a different approach to producing an optimal encoding scheme. This work proposes an algorithm that accounts for the unequal bit-cost contribution to a message. Our experiments show that the proposed algorithm reduces overall communication cost and improves the compression ratio considerably in comparison to classical Huffman codes. The unequal-bit-cost technique produces a variant of Huffman code that reduces the total cost of the compressed message.

Keywords—Huffman Coding, Source Coding, Coding and Information Theory, Data Compression

I. INTRODUCTION

Digital data can be transmitted more efficiently using parallel transmission, which can increase transfer speed by a factor of n over serial transmission. But parallel transmission media are not suitable over long distances, so this approach is unrealistic: the requirement of n communication lines adds huge expense. Consequently, parallel transmission techniques are limited to short-distance communication such as internal buses and locally connected devices, and long-distance communication links remain serial. Ruling out parallel transmission links over long distances, we are left with the serial alternative only. Nevertheless, serial communication links can also be made efficient by applying data compression techniques to the source data. An optimal encoding rule that minimizes the number of bits needed to represent source data, widely known as Huffman code, is due to D. A. Huffman [1].

Applications of Huffman code are pervasive throughout computer science. Huffman code can be used effectively wherever a compact code is needed to represent a long series of a relatively small number of distinct bytes. The algorithm to perform Huffman encoding and decoding completely has too many implementation details to describe here, but everything required is explained in detail by Amsterdam [2]. In computer science and information theory, Huffman code is an entropy encoding algorithm used for lossless data compression. It is an efficient compression method: no other mapping of individual source symbols produces a smaller average output size when the actual symbol frequencies agree with those used to create the code. Huffman coding takes into account the probabilities at which different quantization levels are likely to occur, and results in fewer data bits on average. For any given set of levels and associated probabilities, there is an optimal encoding rule that minimizes the number of bits needed to represent the source. There are many other variants of Huffman codes that compress source data to reduce data size and/or transmission cost. Even more efficient encoding is possible by grouping sequences of levels together and applying the Huffman code to the sequences. Encoding characters in a predefined fixed-length code does not attain optimum performance, because every character consumes an equal number of bits. Huffman code tackles this by generating variable-length codes, given the probabilities (usage frequencies) of a set of symbols.

There are some practical problems in classical Huffman code. One of the most prominent is that the whole stream must be read prior to encoding, a major overhead when the file is very large. Mannan and Kaykobad proposed a solution introducing the block technique in Huffman coding [3]. In this coding scheme, if the costs of 0 and 1 are considered equal, then it is impossible to produce an encoded message with smaller cost. In reality, the transmission or storage costs of 0 and 1 need not be the same. Though present transmission systems have been implemented assuming equal cost for both bits, one way future transmission systems can be more efficient is to treat the bits unequally. In this article, we propose an algorithm that can power such a scheme. With this end in view, an alternative yet efficient representation of the Huffman tree is possible that can be applied economically. The outcomes of the proposed algorithm are:

• Efficient representation of a tree that minimises the overall cost of the encoded message.

• Improvement of performance using the unequal letter cost technique.
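To make the baseline concrete, the following is a minimal Python sketch of classical Huffman coding as described above; the function name and data structures are ours for illustration, not taken from the paper.

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a classical Huffman code: repeatedly merge the two least
    frequent subtrees. frequencies is a dict mapping symbol -> count."""
    # heap entries: (frequency, tie-breaker, {symbol: codeword-so-far})
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # least frequent subtree
        f2, _, c2 = heapq.heappop(heap)      # second least frequent
        merged = {s: '0' + w for s, w in c1.items()}        # left branch
        merged.update({s: '1' + w for s, w in c2.items()})  # right branch
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == '__main__':
    text = "this is an example of a huffman tree"
    print(huffman_code(Counter(text)))
```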
II. RATIONALE OF UNEQUAL BIT COST TECHNIQUE
The transmission capacity of a telecommunications medium is never unlimited. Therefore, an important function of a digital communications system is to represent the digitized signal with as few bits as possible. Data must be encoded to serve purposes such as unambiguous retrieval of information, efficient storage, and efficient transmission. Efficiency can be measured in terms of incurred cost, required storage space, consumed power, time spent, and the like. These considerations impose certain constraints on an encoding scheme so that the underlying hardware operates properly: preventing baseline wandering, preventing a DC component, achieving self-synchronization, error detection, error correction, immunity to noise and interference, and so on.
The ability to retrieve unambiguous information results in a lossless encoding scheme. Ambiguity is a situation in which something can be understood in more than one way, i.e., an expression or statement that has more than one meaning [16]. Ambiguity indicates an open-endedness that must be avoided. Generally, ambiguity is avoided using a predefined code like ASCII, EBCDIC, etc. In predefined codes, the number of elements in an alphabet Σ is fixed; whenever the number of elements in Σ is infinite, fixed-length predefined codes become invalid. The scope of this study remains limited to the situation of a fixed alphabet Σ only. The situation arising for an alphabet Σ consisting of a variable number of elements will be considered in future work.
Magnetic storage devices require greater read/write access time owing to their physical properties. Yet non-volatility and cheap production cost make them the choice for secondary storage. Data storage on magnetic disks requires specialized writing and reading techniques: the read head must be placed over the correct position on the disk surface and attain synchronisation, both from internal electronic circuitry timing information and from surface read information. Therefore, storage encoding techniques have to avoid several consecutive occurrences of similar bits while writing [4]. MFM and n-m recoding are raw disk encoding schemes that have been implemented to work with equal-cost bits. Yet consecutive occurrences of bits can be handled more efficiently when the storage cost of the bits is unequal. This can be ensured by the use of a servo track or a suitable encoding technique such as run-length codes. An encoding scheme that considers unequal bit cost with respect to time can certainly improve the access time required for information storage and retrieval.
Varn coding is a polynomial-time solvable approach that considers equal probabilities of the words, though not necessarily equal lengths, under the assumption that all codewords are restricted to belong to an arbitrary language [17]. Schutzenberger and Marcus considered decodability conditions imposed on the generating function of a complete set of code words; the condition dictates that the sum of the arbitrary probabilities of the words of the complete set is one [18]. In a complete codeword set, no further codeword can be added such that the set remains decodable.
Several attempts have been made to devise error detecting and/or correcting versions of Huffman codes using heuristic approaches. One heuristic approach shows that the minimum average cost per unit of transmission can be attained using no more input letters than the rank of the channel matrix; it is based on a geometric analysis of the discrete memoryless channel, extended to the case where each input letter may have a different positive cost [8]. Karp's method constructs minimum-redundancy prefix codes for the general discrete noiseless channel with some algebraic development over classical Huffman codes; it exploits Gomory's integer programming algorithm [19], [20] to construct optimum codes and demonstrates the practicability of the method [21]. All of these techniques treat the transmission costs of 0 and 1 as equal, the Morse code being the exception. Such an approach has also been considered by Golin et al. [14]. Despite the extensive literature, there is no known polynomial-time algorithm for the generalized problem, and the problem is not even known to be NP-hard [14]. However, they presented a polynomial-time approximation scheme for the problem based on a relaxation of Huffman coding with unequal letter costs. The relaxation, called the k-prefix code problem, allows codewords of length more than k to be prefixes of other codewords.
The Morse code is a system of dots and dashes. It is used to send messages with a flash lamp, a telegraph key, or other rhythmic devices like a tapping finger, by making or breaking an electric circuit so that a signal is transmitted as a series of electric pulses. In this code, each letter or number is represented by a combination of dashes and dots, where a dash is equal to three dots in duration [5]. Morse code is an example of a variable-length encoding scheme: frequently used letters like E and T have shorter codes than seldom-used letters like Q and Z, which have longer codes. Using Morse code, we can treat each dot and dash mark as the equivalent of one binary bit. However, the Morse code scheme suffers from the prefix problem [6]. Ignoring the prefix problem, Morse code yields tremendous savings in bits over the ASCII representation.
There have been almost fifty years of research on data compression methods, e.g., [7]–[13]. While most of these contributions assume equal bit costs, a few, such as Altenkamp and Mehlhorn [7], Gilbert [10], [11], and Golin et al. [14], [15], have made substantial contributions to compression techniques with unequal letter costs. The proposed algorithm is another addition to these efforts. Much of the available literature approaches the topic from a data transmission point of view; yet the technique can also be used to address issues like data storage, energy efficiency, and efficient use of time slots. It is possible to combine the advantages of Huffman code and Morse code while discarding their disadvantages. Such techniques result in variants of Huffman code that compress source data, reducing the size of the data while keeping the cost minimum.
III. SCOPE OF THE WORK
Our proposed Cost-considering code technique assumes non-uniform costs for the two binary bits. We assign unequal costs to 0 and 1: it takes a standard amount of resource to transmit a one, while it takes only one third of that resource to transmit a zero. We allot a weight of three to each unit of standard resource. In our proposed cost-considering code, it therefore takes four units of resource to send a 1 bit and a 0 bit, while it takes six units in Huffman code. Stated another way, both techniques take four units of cost for a 0 and a 1; yet, because Huffman code does not consider cost, the resource it consumes is six units in our measurement. This indicates the potential of a cost-considering code. These weights are variable, however; the proposed algorithm is not tied to one specific weight. The focus of this article is on the algorithm only; it does not deal with underlying hardware aspects. The application of the algorithm is not limited to transmission; it can easily be extended to storage, speed of execution, and the like.
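As a quick illustration of this cost model, here is a toy sketch under the stated assumption that a 0 weighs 1 unit and a 1 weighs 3:

```python
ZERO_COST, ONE_COST = 1, 3   # weights assumed in this section

def codeword_cost(word: str) -> int:
    """Weighted cost of a binary codeword under the unequal-cost model."""
    return sum(ZERO_COST if bit == '0' else ONE_COST for bit in word)

# A 1 followed by a 0 costs 3 + 1 = 4 units; charging every bit the
# standard weight of 3, as the classical view does, would give 6 units.
assert codeword_cost('10') == 4
```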
IV. PROPOSED ALGORITHM

The idea of the proposed method is to assign the most frequent symbol the minimum cost and the least frequent symbol the maximum cost. We use a min-priority queue min_Q of nodes, keyed on their cost, and a max-priority queue max_Q of symbols, keyed on their frequency.

The Huffman coding technique considers the frequencies of the symbols contributing to the message to be encoded. It aims to increase the compression ratio based on the frequency of incidence and does not consider the cost of the bits representing the symbols; i.e., the most frequent symbols are given the shortest codes and vice versa. The proposed Cost-considering code additionally takes into account the cost of the bits contributing to the code of each symbol. As a result, the total cost of the compressed data is reduced, because the most frequent symbols contribute relatively less to the cost of the compressed data compared with the classical Huffman coding technique.
Algorithm 1 Cost-considering / unequal-bit-cost coding
Require: Distinct symbols contained in the message to be encoded and their frequencies
Ensure: Non-uniform / variable letter cost, i.e., a cost-considering balanced tree
 1: for each distinct symbol i do
 2:     Enqueue(max_Q, frequency[i])
 3: end for
 4: create a root node
 5: cost[root] ← 0
 6: Enqueue(min_Q, cost[root])
 7: define costs of the left and right child of the binary tree
 8: repeat
 9:     cost_of_parent_node ← Dequeue(min_Q)
10:     create left and right child for this node
11:     cost[left_child] ← cost_of_parent_node + left_child_cost
12:     Enqueue(min_Q, cost[left_child])
13:     cost[right_child] ← cost_of_parent_node + right_child_cost
14:     Enqueue(min_Q, cost[right_child])
15:     mark parent node as explored
16: until 2(n − 1) nodes are created
17: while min_Q ≠ ∅ do
18:     leaf_node ← Dequeue(min_Q)
19:     frequency[leaf_node] ← Dequeue(max_Q)
20: end while
21: for each parent node j do
22:     frequency[j] ← frequency[left_child] + frequency[right_child]
23: end for
24: repeat
25:     if there is a conflict between nodes then
26:         resolve the conflict by swapping the conflicting nodes
27:         calculate and reassign the cost of all affected nodes
28:         calculate and reassign the frequency of all affected nodes
29:     end if
30: until all conflicts are resolved

In Algorithm 1, min_Q is a minimum priority queue and max_Q is a maximum priority queue. A conflict remains between nodes i and j if (C_i > C_j) and (f_i > f_j), where C_i is the cost of node i, C_j is the cost of node j, f_i is the frequency of node i, and f_j is the frequency of node j.

If there is a conflict between nodes, we resolve it by swapping the nodes and recalculating their costs and frequencies. This process continues until all conflicts are successfully resolved.
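The sketch below is our Python reading of Algorithm 1, under the Section III weights (a 0 costs 1 unit, a 1 costs 3). The conflict-resolution pass is simplified to a direct assignment of the leaves, sorted by cost, to the symbols, sorted by descending frequency, so the codewords it produces need not match the paper's tables exactly.

```python
import heapq

ZERO_COST, ONE_COST = 1, 3   # assumed bit weights from Section III

def cost_considering_code(frequencies):
    """Assign cheap codewords to frequent symbols: grow a code tree by
    always expanding the cheapest node, then pair the cheapest leaves
    with the most frequent symbols. frequencies: dict symbol -> count."""
    n = len(frequencies)
    min_q = [(0, '')]                       # (accumulated cost, codeword)
    for _ in range(n - 1):                  # n-1 expansions leave n leaves
        cost, word = heapq.heappop(min_q)   # cheapest expandable node
        heapq.heappush(min_q, (cost + ZERO_COST, word + '0'))
        heapq.heappush(min_q, (cost + ONE_COST, word + '1'))
    leaves = sorted(min_q)                  # cheapest codeword first
    symbols = sorted(frequencies, key=frequencies.get, reverse=True)
    return {s: word for s, (cost, word) in zip(symbols, leaves)}

if __name__ == '__main__':
    freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
    code = cost_considering_code(freq)
    total = sum(freq[s] * sum(ZERO_COST if b == '0' else ONE_COST for b in w)
                for s, w in code.items())
    print(code, total)
```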
V. EXECUTION OF THE ALGORITHM

This section presents experimental data such as the cost of the compressed data and the efficiency of the Huffman code and the proposed technique. It also focuses on the cost reduction and bit overhead of the codes produced by the two techniques.

The input sequence in Table I is treated using both the classical Huffman code technique and the proposed cost-considering technique. The outcomes are presented in Table II and Table III respectively, and the resulting tree structures are shown in Figure 1 and Figure 2 respectively.
Fig. 1. Tree generated by Huffman code technique
TABLE I. FREQUENCY OF ELEMENTS FROM INPUT Σ

Symbol     A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
Frequency  55K  32K  21K  12K  17K  23K  26K  18K  25K  9K   14K  7K   45K  47K  8K
TABLE II. OUTPUT OF HUFFMAN CODE TECHNIQUE FOR INPUT Σ

Symbol  Frequency  Codeword  Cost  Total Cost  Total Bits
A       55         110       7     385         165
B       32         000       3     96          96
C       21         0100      6     126         84
D       12         01011     11    132         60
E       17         0010      6     102         68
F       23         1010      8     184         92
G       26         1110      10    260         104
H       18         0011      8     144         72
I       25         1011      10    250         100
J       9          01010     9     81          45
K       14         11110     13    182         70
L       7          111110    16    112         42
M       45         011       7     315         135
N       47         100       5     235         141
O       8          111111    18    144         48

Total Cost: 2748        Total Bits: 1322
TABLE III. OUTPUT OF PROPOSED COST-CONSIDERING TECHNIQUE FOR INPUT Σ

Symbol  Frequency  Codeword  Cost  Total Cost  Total Bits
A       55         00000     5     275         275
B       32         11        6     192         64
C       21         00010     7     147         105
D       12         1001      8     96          48
E       17         0011      8     136         68
F       23         00001     7     161         115
G       26         00100     7     182         130
H       18         0110      8     144         72
I       25         101       7     175         75
J       9          00101     9     81          45
K       14         0101      8     112         56
L       7          0111      10    70          28
M       45         1000      6     270         180
N       47         0100      6     282         188
O       8          00011     9     72          40

Total Cost: 2395        Total Bits: 1489

Fig. 2. Tree generated by Cost-considering code technique
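The totals reported in Tables II and III can be cross-checked mechanically; the short script below (with the Section III weights assumed) reproduces them.

```python
# Codeword tables copied from Tables II and III; cost weights: 0 -> 1, 1 -> 3.
huffman = {'A': (55, '110'), 'B': (32, '000'), 'C': (21, '0100'),
           'D': (12, '01011'), 'E': (17, '0010'), 'F': (23, '1010'),
           'G': (26, '1110'), 'H': (18, '0011'), 'I': (25, '1011'),
           'J': (9, '01010'), 'K': (14, '11110'), 'L': (7, '111110'),
           'M': (45, '011'), 'N': (47, '100'), 'O': (8, '111111')}
cost_considering = {'A': (55, '00000'), 'B': (32, '11'), 'C': (21, '00010'),
                    'D': (12, '1001'), 'E': (17, '0011'), 'F': (23, '00001'),
                    'G': (26, '00100'), 'H': (18, '0110'), 'I': (25, '101'),
                    'J': (9, '00101'), 'K': (14, '0101'), 'L': (7, '0111'),
                    'M': (45, '1000'), 'N': (47, '0100'), 'O': (8, '00011')}
for name, table in [('Huffman', huffman), ('Cost-considering', cost_considering)]:
    cost = sum(f * sum(1 if b == '0' else 3 for b in w) for f, w in table.values())
    bits = sum(f * len(w) for f, w in table.values())
    print(name, cost, bits)   # Huffman: 2748 1322; Cost-considering: 2395 1489
```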
VI. PERFORMANCE EVALUATION OF THE PROPOSED TECHNIQUE

A variable-length code is considerably better than a fixed-length code, as it assigns shorter codewords to frequent symbols and longer codewords to infrequent ones. But a variable-length code that also considers the variable costs of its constituent bits performs even better: it uses the cheaper bit more than the costlier one. The proposed Cost-considering code algorithm uses a table of the frequencies of occurrence of the elements to build up an optimal way of representing each character as a binary string and, while building it up, considers the allocated weights of the bits. Let us consider a 100,000-character data file that we wish to transmit, in which the characters occur with the frequencies given in Table IV. That is, only six different characters appear, and the character 'a' occurs 45,000 times [22].

TABLE IV. COMPARATIVE EFFICIENCY OF THREE DIFFERENT SCHEMES

                  a     b     c     d     e     f      Total Bits (K)  Total Cost (K)
Frequency         45K   13K   12K   16K   9K    5K     -               -
Fixed-length      000   001   010   011   100   101    300             452
Variable-length   0     101   100   111   1101  1100   224             470
Cost-considering  00    100   11    010   011   101    243             405
The example in Table IV gives an idea of the comparative cost efficiency of the three schemes. While the variable-length Huffman coding technique considerably compresses the data in terms of bits over the fixed-length scheme and is the most effective at data compression, using considerably fewer bits, the cost-considering code incurs the minimum possible cost for the same input.

We took input from 1600 emails containing alphanumeric and punctuation characters, built a code for each of them, and processed the aggregate result. The results were treated statistically to find the effects of unequal bit costs on data compression, and the effects were in line with our theoretical assumption. The proposed cost-considering codes performed better for most of the input sets, with several input sets producing results distinctly better than Huffman code. Figure 3 shows the comparison between the two schemes.
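The same check applies to Table IV; a small script (codes copied from the table, Section III weights assumed) reproduces its totals.

```python
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}   # in thousands
schemes = {
    'Fixed-length':     {'a': '000', 'b': '001', 'c': '010',
                         'd': '011', 'e': '100', 'f': '101'},
    'Variable-length':  {'a': '0', 'b': '101', 'c': '100',
                         'd': '111', 'e': '1101', 'f': '1100'},
    'Cost-considering': {'a': '00', 'b': '100', 'c': '11',
                         'd': '010', 'e': '011', 'f': '101'},
}
for name, code in schemes.items():
    bits = sum(freq[s] * len(w) for s, w in code.items())
    cost = sum(freq[s] * sum(1 if b == '0' else 3 for b in w)
               for s, w in code.items())
    print(name, bits, cost)   # 300 452; 224 470; 243 405 (thousands)
```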
Fig. 3. Comparison of performance between Cost-considering Code and Huffman Code
Although a cost-considering variable-length code can, in certain situations, use more binary bits than Huffman code, it provides a cost-optimal solution. The cost-considering code achieves, on average, 12% better compression (in terms of cost) than Huffman code; in the best case, the compression is 39% better.
VII. CONCLUSION
The ultimate aim of compression is cost reduction, whether for storage, transmission, or otherwise. With this end in view, there may be schemes more efficient than variable-length Huffman code, which considers compression only; the cost-considering variable-length codeword is one such coding scheme. Similar to this effort, communication with elements of an alphabet Σ of such unequal cost was considered in [8], [17], [21], [23]–[29].

The equal letter cost algorithm considers only the lengths of the codewords of the elements in Σ, not the cost of the bits forming each codeword. As a result, even though it might achieve a high compression ratio, it incurs a high cost as well: it assigns the shortest code to the most frequent symbol, and since the cost of a bit is not considered, the overall cost is likely to be much higher. Our proposed method, on the other hand, considers the cost of the bits contributing to the codeword, so it assigns the least costly codeword to the most frequent symbol. Consequently, the overall cost is reduced, as the most frequent symbols contribute more to the uncompressed data.

On the other hand, it is possible to customise the cost-considering code to suit the needs of a specific alphabet Σ and produce output different from that used for this paper. We have experimented with at least one input set that resulted in using more bits yet incurred less cost. As such, there could be many types of cost-considering algorithms, and some of these may be better for a specific type of input set. This suggests a new possible application area: one variant class of cost-considering codes may be more efficient for compressing data in one localised language, while another variant class may be efficient for some other regional language. This opens a new dimension in research on the localisation of regional languages.
We hope, in future work, to develop algorithms that can generate efficient cost-considering codes on the fly, without needing to know the number of elements before constructing the code. It is expected that, as the theory surrounding cost-considering codes matures, systems built around it will be able to perform more efficiently.

REFERENCES
[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the Institute of Radio Engineers, vol. 40, no. 9, pp. 1098–1101, September 1952.
[2] J. Amsterdam, "Data compression with Huffman coding," BYTE, vol. 11, no. 5, pp. 98–108, May 1986.
[3] M. A. Mannan and M. Kaykobad, "Block Huffman coding," Computers and Mathematics with Applications, vol. 46, no. 10, pp. 1581–1587, 2003.
[4] A. Gallopoulos, C. Heegard, and K. J. Kerpez, "The power spectrum of run-length-limited codes," IEEE Transactions on Communications, vol. 37, no. 9, pp. 906–917, September 1989.
[5] W. A. Redmond, "International Morse code," Microsoft Encarta 2009 [DVD], pp. 275–278, 1964.
[6] P. D. Grunwald and P. M. B. Vitanyi, "Kolmogorov complexity and information theory," Journal of Logic, Language and Information, vol. 12, pp. 497–529, 2003.
[7] D. Altenkamp and K. Mehlhorn, "Codes: Unequal probabilities, unequal letter costs," Journal of the Association for Computing Machinery, vol. 27, no. 3, pp. 412–427, July 1980.
[8] N. M. Blachman, "Minimum-cost transmission of information," Information and Control, vol. 7, no. 4, pp. 508–511, December 1964.
[9] N. Cot, "Characterization and design of optimal prefix codes," Ph.D. dissertation, Stanford University, 1957.
[10] N. Cot, "Complexity of the variable-length encoding problem," in Graph Theory and Computing, 1975, pp. 211–244.
[11] E. N. Gilbert, "How good is Morse code," Information and Control, vol. 14, pp. 565–585, 1969.
[12] E. N. Gilbert, "Coding with digits of unequal costs," IEEE Transactions on Information Theory, vol. 41, 1995.
[13] R. M. Krause, "Channels which transmit letters of unequal duration," Information and Control, vol. 5, pp. 13–24, March 1962.
[14] M. J. Golin, C. Kenyon, and N. E. Young, "Huffman coding with unequal letter costs," in ACM Symposium on Theory of Computing, May 2002, pp. 785–791.
[15] M. J. Golin, C. Mathieu, and N. E. Young, "Huffman coding with letter costs: A linear-time approximation scheme," SIAM Journal on Computing, vol. 41, no. 3, pp. 684–713, 2012.
[16] Encyclopaedia Britannica, in Encyclopaedia Britannica Ultimate Reference Suite. Chicago: Encyclopaedia Britannica, 2011.
[17] B. Varn, "Optimal variable length codes – arbitrary symbol cost and equal code word probability," Information and Control, no. 19, pp. 289–301, 1971.
[18] M. Schutzenberger and R. Marcus, "Full decodable code-word sets," IRE Transactions on Information Theory, vol. 5, no. 1, pp. 12–15, 1959.
[19] R. E. Gomory, "Outline of an algorithm for integer solutions to linear programs," Bulletin of the American Mathematical Society, vol. 64, pp. 275–278, 1958.
[20] R. E. Gomory, "An algorithm for integer solutions to linear programs," in Recent Advances in Mathematical Programming. New York: McGraw-Hill, 1963, pp. 269–302.
[21] R. Karp, "Minimum-redundancy coding for the discrete noiseless channel," IRE Transactions on Information Theory, vol. 7, no. 1, pp. 27–38, 1961.
[22] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. The MIT Press, 2001.
[23] N. M. Blachman, "Minimum-cost encoding of information," IRE Transactions on Information Theory, vol. PGIT-3, pp. 139–149, 1954.
[24] D. M. Choy and C. K. Wong, "Bounds for optimal binary trees," BIT, vol. 17, pp. 1–15, 1997.
[25] S. Savari and A. Naheta, "Bounds on the expected cost of one-to-one codes," in IEEE International Symposium on Information Theory, Chicago, IL, June 2004, p. 92.
[26] S. Verdú, "On channel capacity per unit cost," IEEE Transactions on Information Theory, vol. 36, no. 5, pp. 1019–1030, September 1990.
[27] N. Cot, "A linear-time ordering procedure with applications to variable length encoding," in Southeast Conference on Combinatorics, Information Sciences and Systems, Princeton, NJ, 1974, pp. 460–467.
[28] L. E. Stanfel, "Tree structuring for optimal searching," Journal of the ACM, vol. 17, no. 1, pp. 508–517, 1970.
[29] Y. Perl, M. R. Garey, and S. Even, "Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters," Journal of the ACM, vol. 22, no. 2, pp. 202–214, April 1975.