
Effects of Unequal Bit Costs on Classical Huffman Codes

Sohag Kabir (Department of Computer Science, University of Hull, United Kingdom), Tanzima Azad (Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh), A S M Ashraful Alam (Department of Computer Science, University of Otago, New Zealand), and Mohammad Kaykobad (Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh)

17th Int'l Conf. on Computer and Information Technology, 22-23 December 2014, Daffodil International University, Dhaka, Bangladesh

Abstract—Classical Huffman codes achieve very good compression performance over traditional systems. Yet more efficient encoding is possible by treating the two binary bits differently with respect to their requirements for storage space, energy consumption, speed of execution and so on. Future transmission systems are likely to be more efficient in many such respects and will consume fewer resources to transmit or store one of the binary bits than the other. Such an unequal bit cost necessitates a different approach to producing an optimal encoding scheme. This work proposes an algorithm that takes the unequal cost contribution of each bit to a message into account. Our experiments show that the proposed algorithm reduces the overall communication cost and improves the compression ratio considerably in comparison to classical Huffman codes. This unequal bit cost technique produces a variant of the Huffman code that reduces the total cost of the compressed message.

Keywords—Huffman coding, source coding, coding and information theory, data compression

I. INTRODUCTION

Digital data can be transmitted more efficiently using a parallel transmission method, which can increase transfer speed by a factor of n over serial transmission. However, parallel transmission media are not suitable over long distances, and the requirement of n communication lines adds huge expense, making the approach unrealistic there. Consequently, parallel transmission techniques are limited to short-distance communication such as internal buses and locally connected devices, and long-distance communication links remain serial. Ruling out the availability of parallel transmission links over long distances, we are left with the serial alternative only. Nevertheless, serial communication links can also be made more efficient by applying data compression techniques to the source data.

An optimal encoding rule that minimizes the number of bits needed to represent source data, widely known as the Huffman code, is due to D. A. Huffman [1]. Applications of Huffman codes are pervasive throughout computer science; they can be used effectively wherever there is a need for a compact code to represent a long series of a relatively small number of distinct bytes. The algorithm to perform Huffman encoding and decoding completely has too many implementation details to describe here, but everything required is explained in detail by Amsterdam [2].

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. It is an efficient compression method: no other mapping of individual source symbols produces a smaller average output size in bits when the actual symbol frequencies agree with those used to create the code. Huffman coding takes into account the probabilities at which the different quantization levels are likely to occur and results in fewer data bits on average; for any given set of levels and associated probabilities, there is an optimal encoding rule that minimizes the number of bits needed to represent the source. There are many other variants of Huffman codes that compress source data to reduce data size and/or transmission cost, and even more efficient encoding is possible by grouping sequences of levels together and applying the Huffman code to the sequences. Encoding characters in a predefined fixed-length code does not attain optimum performance, because every character consumes an equal number of bits. Huffman coding tackles this by generating variable-length codes, given the usage frequencies (probabilities) of a set of symbols.

There are some practical problems in classical Huffman coding. One of the most prominent is that the whole stream must be read prior to encoding, which is a major overhead when the file is very large; Mannan and Kaykobad addressed this by introducing a block technique into Huffman coding [3]. Furthermore, as long as the costs of 0 and 1 are considered equal, it is quite impossible to produce an encoded message with a smaller cost. In reality, the transmission or storage costs of 0 and 1 need not be the same.
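To make the classical scheme concrete before departing from it, the following is a minimal sketch of Huffman code construction. This is an illustrative Python rendering, not code from the paper; the function name and the example string are ours.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Classical Huffman coding: repeatedly merge the two least frequent
    subtrees, then read codewords off the root-to-leaf paths."""
    freq = Counter(text)
    # Heap entries are (frequency, tie_breaker, subtree); the tie-breaker
    # keeps tuple comparison away from the (uncomparable) subtree field.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        tick += 1
        heapq.heappush(heap, (f1 + f2, tick, (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a source symbol
            codes[node] = prefix or "0"      # degenerate one-symbol input
    walk(heap[0][2], "")
    return codes

print(huffman_codes("abracadabra"))  # e.g. {'a': '0', 'b': '110', ...}
```

Note that the construction is driven by frequencies alone; both output bits are implicitly treated as equally expensive, which is exactly the assumption this paper relaxes.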
Though present transmission systems have been implemented considering equal costs for both bits, one way future transmission systems can become more efficient is to treat the two bits unequally. In this article, we propose an algorithm that can power such a scheme: an alternative yet efficient representation of the Huffman tree that can be applied economically. The outcomes of the proposed algorithm are:

• An efficient representation of a tree that minimises the overall cost of the encoded message.
• An improvement of performance using an unequal letter cost technique.

II. RATIONALE OF UNEQUAL BIT COST TECHNIQUE

The transmission capacity of a telecommunications medium is never unlimited. Therefore, an important function of a digital communications system is to represent the digitized signal with as few bits as possible. Data must be encoded to meet purposes such as unambiguous retrieval of information, efficient storage and efficient transmission. Efficiency can be measured in terms of incurred cost, required storage space, consumed power, time spent and the like. These considerations impose certain constraints on an encoding scheme so that the underlying hardware operates properly: preventing baseline wandering, preventing a DC component, achieving self-synchronization, error detection, error correction, immunity to noise and interference, etc.

The ability to retrieve unambiguous information results in a lossless encoding scheme. Ambiguity is a situation in which something can be understood in more than one way, i.e. an expression or statement that has more than one meaning [16]. Ambiguity indicates an open-endedness that must be avoided. Generally, ambiguity is avoided by using a predefined code like ASCII, EBCDIC, etc.
In predefined codes, the number of elements in the alphabet Σ is fixed; whenever the number of elements in Σ is infinite, fixed-length predefined codes become invalid. The scope of this study is limited to a fixed alphabet Σ; the case of an alphabet Σ with a variable number of elements will be considered in future work.

Magnetic storage devices require greater read/write access time owing to their physical properties, yet non-volatility and cheap production cost make them the choice for secondary storage. Data storage on magnetic disks requires specialized writing and reading techniques: the read head must be placed over the correct position on the disk surface and attain synchronisation both from internal electronic circuitry timing and from the information read off the surface. Therefore, storage encoding techniques have to avoid several consecutive occurrences of similar bits while writing [4]. Raw disk encoding schemes such as MFM and n-m recoding have been implemented to work with equal-cost bits. Yet consecutive occurrences of bits can be handled more efficiently when the storage costs of the bits are unequal, for example by use of a servo track or a suitable encoding technique like run-length codes. An encoding scheme that considers unequal bit cost with respect to time can certainly improve the access time required for information storage and retrieval.

Varn coding is a polynomial-time solvable approach that considers equal probabilities of the words, but not necessarily equal lengths, under the assumption that all codewords are restricted to belong to an arbitrary language [17]. Schutzenberger and Marcus considered decodability conditions imposed on the generating function and a complete set of codewords, dictating that the sum of the probabilities of the words of the complete set is one [18]; in a complete codeword set, no further codeword can be added, yet the set remains decodable. Several attempts have been made to devise error-detecting and/or error-correcting versions of Huffman codes using heuristic approaches. One such approach shows that the minimum average cost per unit of transmission can be attained by using no more input letters than the rank of the channel matrix, based on a geometric analysis of the discrete memoryless channel extended to the case where each input letter may have a different positive cost [8]. Karp's method constructs minimum-redundancy prefix codes for the general discrete noiseless channel with some algebraic development over classical Huffman codes; it exploits Gomory's integer programming algorithm [19], [20] to construct optimum codes and demonstrates the practicability of the method [21]. All these techniques treat the transmission costs of 0 and 1 as equal, with the exception of the Morse code.

Unequal letter costs have also been considered by Golin [14]. Despite the extensive literature, there is no known polynomial-time algorithm for the generalized problem, and the problem is not even known to be NP-hard [14]. However, Golin presented a polynomial-time approximation scheme for the problem, based on a relaxation of Huffman coding with unequal letter costs. The relaxation, called the k-prefix code problem, allows codewords of length more than k to be prefixes of other codewords.

The Morse code is a system of dots and dashes.
It is used to send messages by flash lamp, telegraph key, or other rhythmic devices such as a tapping finger, by making or breaking an electric circuit and transmitting a signal as a series of electric pulses. In this code, each letter or number is represented by a combination of dashes and dots, where a dash is equal to three dots in duration [5]. Morse code is an example of a variable-length encoding scheme: frequently used letters like E and T have shorter codes than seldom-used letters like Q and Z. Treating each dot and dash as the equivalent of one binary bit, Morse code results in a tremendous saving of bits over an ASCII representation, although it suffers from the prefix problem [6].

There have been almost fifty years of research on data compression methods, e.g. [7]–[13]. While most of these contributions assume equal bit costs, a few, such as Altenkamp and Mehlhorn [7], Gilbert [10], [11], and Golin [14], [15], have made substantial contributions to compression with unequal letter costs. The proposed algorithm is another addition to these efforts. Much of the available literature approaches the topic from a data transmission point of view; yet the technique can equally address issues like data storage, energy efficiency and efficient use of time slots. It is possible to combine the advantages of the Huffman code and the Morse code and discard the disadvantages. Such techniques result in variations of the Huffman code that compress source data, reducing the size of the data while keeping the cost minimal.

III. SCOPE OF THE WORK

Our proposed 'cost-considering code' technique assumes non-uniform costs for the two binary bits: it takes a standard amount of resource to transmit a 1, while it takes only one third of that resource to transmit a 0. We allot a weight of three to each unit of standard resource, so a 1 costs three units and a 0 costs one unit. Under our cost-considering code, transmitting one 1 and one 0 therefore takes four units of resource, whereas under the Huffman code's equal-cost assumption the same two bits consume six units in our measure. This indicates the potential of a cost-considering code. These weights are variable, however; the proposed algorithm is not tied to one specific weighting. The focus of this article is on the algorithm only; it does not deal with the underlying hardware aspects. The application of the algorithm is not limited to transmission; it can easily be extended to handle storage, speed of execution and the like.
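As a concrete reading of this accounting, the short sketch below computes codeword cost under the 3:1 weighting described above and contrasts it with the equal-cost measure; the function name is ours.

```python
# Bit weights assumed from Section III: a 0 costs one unit,
# a 1 costs three units (one "standard" resource = three units).
BIT_COST = {"0": 1, "1": 3}

def codeword_cost(codeword: str) -> int:
    """Cost of a binary codeword under the unequal bit weighting."""
    return sum(BIT_COST[b] for b in codeword)

print(codeword_cost("10"))  # 3 + 1 = 4 units under unequal costs
print(3 * len("10"))        # 6 units when every bit is charged the standard 3
```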
IV. PROPOSED ALGORITHM

The idea of the proposed method is to assign the most frequent symbol the minimum cost and the least frequent symbol the maximum cost. We use a min-priority queue min_Q of nodes, keyed on their cost, and a max-priority queue max_Q of symbols, keyed on their frequency.

Algorithm 1: Cost-considering / unequal bit cost coding
Require: the distinct symbols contained in the message to be encoded, and their frequencies
Ensure: a non-uniform (variable) letter cost, i.e. cost-considering, balanced tree
 1: for each distinct symbol i do
 2:     Enqueue(max_Q, frequency[i])
 3: end for
 4: create a root node
 5: cost[root] ← 0
 6: Enqueue(min_Q, cost[root])
 7: define the costs of the left and right child edges of the binary tree
 8: repeat
 9:     cost of parent node ← Dequeue(min_Q)
10:     create a left and a right child for this node
11:     cost[left child] ← cost of parent node + left child cost
12:     Enqueue(min_Q, cost[left child])
13:     cost[right child] ← cost of parent node + right child cost
14:     Enqueue(min_Q, cost[right child])
15:     mark the parent node as explored
16: until 2(n − 1) nodes are created
17: while min_Q ≠ ∅ do
18:     leaf node ← Dequeue(min_Q)
19:     frequency[leaf node] ← Dequeue(max_Q)
20: end while
21: for each parent node j do
22:     frequency[j] ← frequency[left child] + frequency[right child]
23: end for
24: repeat
25:     if there is a conflict between nodes then
26:         resolve the conflict by swapping the conflicting nodes
27:         recalculate and reassign the costs of all affected nodes
28:         recalculate and reassign the frequencies of all affected nodes
29:     end if
30: until all conflicts are resolved

In Algorithm 1, min_Q is a minimum priority queue and max_Q is a maximum priority queue. A conflict remains between nodes i and j if Ci > Cj and fi > fj, where Ci and Cj are the costs, and fi and fj the frequencies, of nodes i and j. If there is a conflict between nodes, we resolve it by swapping the nodes and recalculating their costs and frequencies; this process continues until all conflicts are successfully resolved.

The Huffman coding technique considers only the frequencies of the symbols contributing to the message to be encoded: it aims to increase the compression ratio based on the frequency of occurrence, giving the most frequent symbols the shortest codes, and does not consider the cost of the bits representing the symbols. The proposed cost-considering code additionally takes into account the cost of the bits contributing to each symbol's codeword. As a result, the total cost of the compressed data is reduced, since the most frequent symbols contribute relatively less to the cost of the compressed data compared with the classical Huffman coding technique.
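The pseudocode leaves several details open (the edge costs, the exact conflict-resolution order). The sketch below is one possible Python reading of Algorithm 1, not the authors' implementation: it assumes the 3:1 bit weighting of Section III and replaces the explicit swap-based conflict resolution with a greedy cheapest-leaf/most-frequent-symbol pairing, which by construction leaves no leaf both costlier and more frequent than another leaf.

```python
import heapq
from collections import Counter
from itertools import count

COST_0, COST_1 = 1, 3   # assumed edge costs for a 0-edge and a 1-edge

def cost_considering_codes(text):
    """One reading of Algorithm 1: grow the tree by always expanding the
    cheapest frontier node, then give the cheapest leaves to the most
    frequent symbols."""
    freq = Counter(text)
    n = len(freq)
    tick = count()                       # heap tie-breaker
    frontier = [(0, next(tick), "")]     # (cost, tie, codeword-so-far)
    # Expanding the cheapest node n-1 times creates 2(n-1) children
    # (cf. line 16 of Algorithm 1) and leaves exactly n unexpanded
    # nodes, which become the leaves of the code tree.
    for _ in range(n - 1):
        c, _, word = heapq.heappop(frontier)
        heapq.heappush(frontier, (c + COST_0, next(tick), word + "0"))
        heapq.heappush(frontier, (c + COST_1, next(tick), word + "1"))
    # Pair leaves (cheapest first) with symbols (most frequent first),
    # mirroring lines 17-20 and standing in for the conflict resolution.
    symbols = sorted(freq, key=freq.get, reverse=True)
    leaves = sorted(frontier)
    return {sym: word for sym, (_, _, word) in zip(symbols, leaves)}

text = "abracadabra"
codes = cost_considering_codes(text)
freq = Counter(text)
total = sum(freq[s] * sum(1 if b == "0" else 3 for b in w)
            for s, w in codes.items())
print(codes, "total cost:", total)
```

Because the tree shape is fixed before frequencies are consulted, such a code may spend more bits than Huffman coding on a skewed input while still incurring less cost, which is exactly the trade-off visible in Tables II and III below.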
V. EXECUTION OF THE ALGORITHM

This section shows the experimental data, such as the cost of the compressed data and the efficiency of the Huffman code and the proposed technique, and focuses on the cost reduction and bit overhead of the codes produced by both techniques. The input sequence in Table I is treated using both the classical Huffman code technique and the proposed cost-considering technique. The outcomes are presented in Table II and Table III respectively, and the resulting tree structures are shown in Figure 1 and Figure 2 respectively.

TABLE I. FREQUENCY OF ELEMENTS FROM INPUT Σ

Symbol:  A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
Count:   55K  32K  21K  12K  17K  23K  26K  18K  25K  9K   14K  7K   45K  47K  8K

TABLE II. OUTPUT OF THE HUFFMAN CODE TECHNIQUE FOR INPUT Σ

Symbol  Frequency  Codeword  Cost  Total Cost  Total Bits
A       55         110       7     385         165
B       32         000       3     96          96
C       21         0100      6     126         84
D       12         01011     11    132         60
E       17         0010      6     102         68
F       23         1010      8     184         92
G       26         1110      10    260         104
H       18         0011      8     144         72
I       25         1011      10    250         100
J       9          01010     9     81          45
K       14         11110     13    182         70
L       7          111110    16    112         42
M       45         011       7     315         135
N       47         100       5     235         141
O       8          111111    18    144         48
                            Total Cost: 2748   Total Bits: 1322

Fig. 1. Tree generated by the Huffman code technique.

TABLE III. OUTPUT OF THE PROPOSED COST-CONSIDERING TECHNIQUE FOR INPUT Σ

Symbol  Frequency  Codeword  Cost  Total Cost  Total Bits
A       55         00000     5     275         275
B       32         11        6     192         64
C       21         00010     7     147         105
D       12         1001      8     96          48
E       17         0011      8     136         68
F       23         00001     7     161         115
G       26         00100     7     182         130
H       18         0110      8     144         72
I       25         101       7     175         75
J       9          00101     9     81          45
K       14         0101      8     112         56
L       7          0111      10    70          28
M       45         1000      6     270         180
N       47         0100      6     282         188
O       8          00011     9     72          40
                            Total Cost: 2395   Total Bits: 1489

Fig. 2. Tree generated by the cost-considering code technique.

VI. PERFORMANCE EVALUATION OF THE PROPOSED TECHNIQUE

A variable-length code is considerably better than a fixed-length code, as it assigns shorter codewords to frequent symbols and longer codewords to infrequent ones. But a variable-length code that also considers the variable costs of its constituent bits performs even better: it uses the cheaper bit more often than the costlier one. The proposed cost-considering code algorithm uses a table of the frequencies of occurrence of the elements to build up an optimal way of representing each character as a binary string, and in building it up, it considers the allocated weights of the bits. Let us consider a 100,000-character data file that we wish to transmit, in which the characters occur with the frequencies given in Table IV; that is, only six different characters appear, and the character 'a' occurs 45,000 times [22].

TABLE IV. COMPARATIVE EFFICIENCY OF THREE DIFFERENT SCHEMES

                  a    b    c    d    e     f     Bits   Cost
Frequency         45K  13K  12K  16K  9K    5K    -      -
Fixed-length      000  001  010  011  100   101   300K   452K
Variable-length   0    101  100  111  1101  1100  224K   470K
Cost-considering  00   100  11   010  011   101   243K   405K

The example in Table IV gives an idea of the comparative cost efficiency of the three schemes. While the variable-length Huffman coding technique considerably compresses data in terms of bits over the fixed-length scheme and is the most effective at reducing the number of bits, the cost-considering code incurs the minimum possible cost for the same input.
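The Bits and Cost columns of Table IV can be recomputed directly; the snippet below (assuming the 3:1 bit weighting of Section III) reproduces them.

```python
# Recomputing the Bits and Cost columns of Table IV (all figures in
# thousands), with cost(0) = 1 and cost(1) = 3 as in Section III.
freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
schemes = {
    "Fixed-length":     dict(a="000", b="001", c="010", d="011", e="100", f="101"),
    "Variable-length":  dict(a="0", b="101", c="100", d="111", e="1101", f="1100"),
    "Cost-considering": dict(a="00", b="100", c="11", d="010", e="011", f="101"),
}
for name, code in schemes.items():
    bits = sum(freqs[s] * len(w) for s, w in code.items())
    cost = sum(freqs[s] * sum(1 if b == "0" else 3 for b in w)
               for s, w in code.items())
    print(f"{name:>16}: {bits}K bits, {cost}K cost")
# Fixed-length: 300K bits, 452K cost; Variable-length: 224K bits,
# 470K cost; Cost-considering: 243K bits, 405K cost, matching Table IV.
```

The same accounting applied to Tables II and III gives a cost reduction of (2748 − 2395)/2748 ≈ 12.8% for that input, consistent with the average improvement reported below.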
We took input from 1600 emails containing alphanumeric and punctuation characters, built a code for each of them, and processed the aggregate result. The results were treated statistically to find the effects of unequal bit costs on data compression, and the effects were in line with our theoretical assumption. The proposed cost-considering codes performed better for most of the input sets, with several input sets producing results distinctly better than the Huffman code. Figure 3 shows the comparison between the two schemes.

Fig. 3. Comparison of performance between the cost-considering code and the Huffman code.

Although a cost-considering variable-length code can, in certain situations, use more binary bits than the Huffman code, it provides a cost-optimal solution. The cost-considering code performs on average 12% better compression, in terms of cost, than the Huffman code; in the best-case scenario, the improvement is 39%.

On the other hand, it is possible to customise a cost-considering code to suit needs specific to a given alphabet Σ and produce output different from the one used for this paper. We have experimented with at least one input set that resulted in using more bits yet incurred less cost. As such, there could be many types of cost-considering algorithms, and some of these may be better for a specific type of input set. This leads to a possible new application area: one variant class of cost-considering codes may be more efficient for compressing data in a certain localised language, while another variant class may be efficient for some other regional language. This opens a new dimension of research in the localisation of regional languages.

VII. CONCLUSION

The ultimate aim of compression is cost reduction, whether in storage, transmission or elsewhere. With this end in view, there may be schemes more efficient than the variable-length Huffman code, which considers compression only; the cost-considering variable-length code is one such scheme. Similar to this effort, communication with the elements of an alphabet Σ of such unequal costs was considered in [8], [17], [21], [23]–[29]. An equal-letter-cost algorithm considers only the codeword lengths of the elements of Σ, not the cost of the bits forming each codeword; as a result, even though it may achieve a high compression ratio, it incurs a high cost as well, because it assigns the shortest code to the most frequent symbol without regard to bit costs. Our proposed method, in contrast, considers the cost of the bits contributing to the codeword and assigns the least costly codeword to the most frequent symbol. Consequently, the overall cost is reduced, as the most frequent symbols contribute the most to the uncompressed data.

We hope to put more effort in the future into algorithms that can generate efficient cost-considering codes on the fly, without needing to know the number of elements before constructing the code. It is expected that as the theory surrounding cost-considering codes matures, systems will be built around it that perform more efficiently.

REFERENCES

[1] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, September 1952.
[2] J. Amsterdam, "Data compression with Huffman coding," BYTE, vol. 11, no. 5, pp. 98–108, May 1986.
[3] M. A. Mannan and M. Kaykobad, "Block Huffman coding," Computers and Mathematics with Applications, vol. 46, no. 10, pp. 1581–1587, 2003.
[4] A. Gallopoulos, C. Heegard, and K. J. Kerpez, "The power spectrum of run-length-limited codes," IEEE Transactions on Communications, vol. 37, no. 9, pp. 906–917, September 1989.
[5] "International Morse code," Microsoft Encarta 2009 [DVD]. Redmond, WA: Microsoft Corporation, 2009.
[6] P. D. Grünwald and P. M. B. Vitányi, "Kolmogorov complexity and information theory," Journal of Logic, Language and Information, vol. 12, pp. 497–529, 2003.
[7] D. Altenkamp and K. Mehlhorn, "Codes: Unequal probabilities, unequal letter costs," Journal of the ACM, vol. 27, no. 3, pp. 412–427, July 1980.
[8] N. M. Blachman, "Minimum-cost transmission of information," Information and Control, vol. 7, no. 4, pp. 508–511, December 1964.
[9] N. Cot, "Characterization and design of optimal prefix codes," Ph.D. dissertation, Stanford University, 1977.
[10] N. Cot, "Complexity of the variable-length encoding problem," in Graph Theory and Computing, 1975, pp. 211–244.
[11] E. N. Gilbert, "How good is Morse code?" Information and Control, vol. 14, pp. 565–585, 1969.
[12] E. N. Gilbert, "Coding with digits of unequal costs," IEEE Transactions on Information Theory, vol. 41, 1995.
[13] R. M. Krause, "Channels which transmit letters of unequal duration," Information and Control, vol. 5, pp. 13–24, March 1962.
[14] M. J. Golin, C. Kenyon, and N. E. Young, "Huffman coding with unequal letter costs," in ACM Symposium on Theory of Computing, May 2002, pp. 785–791.
[15] M. J. Golin, C. Mathieu, and N. E. Young, "Huffman coding with letter costs: A linear-time approximation scheme," SIAM Journal on Computing, vol. 41, no. 3, pp. 684–713, 2012.
[16] Encyclopaedia Britannica Ultimate Reference Suite. Chicago: Encyclopaedia Britannica, 2011.
[17] B. Varn, "Optimal variable length codes – arbitrary symbol cost and equal code word probability," Information and Control, no. 19, pp. 289–301, 1971.
[18] M. Schutzenberger and R. Marcus, "Full decodable code-word sets," IRE Transactions on Information Theory, vol. 5, no. 1, pp. 12–15, 1959.
[19] R. E. Gomory, "Outline of an algorithm for integer solutions to linear programs," Bulletin of the American Mathematical Society, vol. 64, pp. 275–278, 1958.
[20] R. E. Gomory, "An algorithm for integer solutions to linear programs," in Recent Advances in Mathematical Programming. New York: McGraw-Hill, 1963, pp. 269–302.
[21] R. Karp, "Minimum-redundancy coding for the discrete noiseless channel," IRE Transactions on Information Theory, vol. 7, no. 1, pp. 27–38, 1961.
[22] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. The MIT Press, 2001.
[23] N. M. Blachman, "Minimum-cost encoding of information," IRE Transactions on Information Theory, vol. PGIT-3, pp. 139–149, 1954.
[24] D. M. Choy and C. K. Wong, "Bounds for optimal binary trees," BIT, vol. 17, pp. 1–15, 1977.
[25] S. Savari and A. Naheta, "Bounds on the expected cost of one-to-one codes," in IEEE International Symposium on Information Theory, Chicago, IL, June 2004, p. 92.
[26] S. Verdú, "On channel capacity per unit cost," IEEE Transactions on Information Theory, vol. 36, no. 5, pp. 1019–1030, September 1990.
[27] N. Cot, "A linear-time ordering procedure with applications to variable length encoding," in Southeast Conference on Combinatorics, Information Sciences and Systems, Princeton, NJ, 1974, pp. 460–467.
[28] L. E. Stanfel, "Tree structuring for optimal searching," Journal of the ACM, vol. 17, no. 1, pp. 508–517, 1970.
[29] Y. Perl, M. R. Garey, and S. Even, "Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters," Journal of the ACM, vol. 22, no. 2, pp. 202–214, April 1975.