
1546 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 39, NO. 5, SEPTEMBER 1993

Block Arithmetic Coding for Source Compression


Charles G. Boncelet, Jr., Member, IEEE

Abstract- We introduce "Block Arithmetic Coding" (BAC), a technique for entropy coding that combines many of the advantages of ordinary stream arithmetic coding with the simplicity of block codes. The code is variable length in to fixed out (V to F), unlike Huffman coding which is fixed in to variable out (F to V). We develop two versions of the coder: 1) an optimal encoder based on dynamic programming arguments, and 2) a suboptimal heuristic based on arithmetic coding. The optimal coder is optimal over all V to F complete and proper block codes. We show that the suboptimal coder achieves compression that is within a constant of a perfect entropy coder for independent and identically distributed inputs. BAC is easily implemented, even with large codebooks, because the algorithms for coding and decoding are regular. For instance, codebooks with 2^32 entries are feasible. BAC also does not suffer catastrophic failure in the presence of channel errors. Decoding errors are confined to the block in question. The encoding is in practice reasonably efficient. With i.i.d. binary inputs with P(1) = 0.95 and 16 bit codes, entropy arguments indicate at most 55.8 bits can be encoded; the BAC heuristic achieves 53.0 and the optimal BAC achieves 53.5. Finally, BAC appears to be much faster than ordinary arithmetic coding.

Index Terms- Arithmetic codes, block codes, variable to fixed codes, entropy compression.

I. INTRODUCTION

WE introduce a method of source coding called Block Arithmetic Coding (BAC). BAC is a variable in to fixed out block coder (V to F) unlike Huffman codes which are fixed in to variable out (F to V). BAC is simple to implement, efficient, and usable in a wide variety of applications. Furthermore, BAC codes do not suffer catastrophic failure in the presence of channel errors. Decoding errors are confined to the block in question.

Consider the input, X = x_1 x_2 x_3 ..., to be a sequence of independent and identically distributed input symbols taken from some input alphabet A = {a_1, a_2, ..., a_m}. The symbols obey probabilities p_j = Pr(x_1 = a_j). Assume that p_j > 0 for all j. The entropy of the source is

    H(X) = -\sum_{i=1}^{m} p_i \log p_i.    (1)

(We will measure all entropies in bits.) The compression efficiency of entropy encoders is bounded below by the entropy. For F to V encoders, each input symbol requires on average at least H(X) output bits; for V to F encoders, each output bit can on average represent at most 1/H(X) input symbols.

The most important entropy compressor is Huffman coding [9]. Huffman codes are F to V block codes. They block the input into strings of q letters and encode these strings with variable length output strings. It is well known that the number of output bits of a Huffman code is bounded above by qH(X) + 1 for all block sizes. In special cases, this bound can be tightened [5], [17]. As a result, the relative efficiency of Huffman codes can be made as high as desired by taking the block size to be large enough. In practice, however, large block sizes are difficult to implement. Codebooks are usually stored and must be searched for each input string. This storage and searching limits block sizes to small values. (However, see Jakobsson [11].)

Huffman codes can be implemented adaptively, but it is difficult to do so. The process of generating Huffman code trees is sufficiently cumbersome that adapting the tree on a symbol by symbol basis is rarely computationally efficient.

In recent years, a class of entropy coders called arithmetic coders has been studied and developed [8], [13], [18]-[21], [25]. Arithmetic coders are stream coders. They take in an arbitrarily long input and output a corresponding output stream. Arithmetic coders have many advantages. On average, the ratio of output length to input length can be very close to the source entropy. The input probabilities can be changed on a symbol by symbol basis and can be estimated adaptively. However, arithmetic codes have certain disadvantages. In some applications, the encoding and decoding steps are too complicated to be done in real time.

The entropy coder most similar to BAC is due to Tunstall [23], attributed in [12]. Tunstall's encoder is a V to F block coder that operates as follows: Given a codebook with K - m + 1 output symbols, a codebook with K output symbols is generated by "splitting" the most probable input sequence into m successors, each formed by appending one of the a_j. Unfortunately, the encoding of input sequences appears to require searching a codebook. Thus, large block sizes appear to be prohibitive, both in creating the code and in searching it. This has the practical effect of limiting block sizes to small values.

Recently, a V to F block arithmetic coder has been proposed by Teuhola and Raita [22]. While similar in spirit to this work, their coder, named FIXARI, differs in many details from BAC. FIXARI parses the input differently from BAC and encodes each string differently. The analysis of FIXARI is entirely different from the analysis of BAC presented in Section III. FIXARI does appear to be fast.

Manuscript received September 9, 1991; revised February 15, 1993. This paper was presented in part at The 1992 Conference on Information Sciences and Systems, Princeton, NJ, March 1992. This work was performed in part while the author was visiting at the University of Michigan. The author is with the Department of Electrical Engineering, University of Delaware, Newark, DE 19716. IEEE Log Number 9211435.

0018-9448/93$03.00 (c) 1993 IEEE
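The efficiency figures quoted in the abstract can be reproduced directly from (1): with P(1) = 0.95, the binary entropy is about 0.286 bit per symbol, so a 16 bit codeword can represent at most 16/H(X) input bits, about 55.9 (55.8 as rounded in the abstract). A quick check (an illustrative Python sketch, not part of the paper):

```python
import math

p = 0.95                                                # P(1) for the binary source
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))    # entropy H(X), bits/symbol
print(round(H, 4))       # -> 0.2864
print(round(16 / H, 1))  # -> 55.9, max input bits per 16 bit codeword
```

The heuristic's 53.0 and the optimal coder's 53.5 input bits per codeword should be read against this 55.9 bit ceiling.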

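The mechanics of the coder developed in Section II are easy to preview: BAC repeatedly splits an interval of codeword indices in proportion to the symbol probabilities and emits a fixed length index whenever the interval becomes too small to split again. The sketch below is a hypothetical illustration, not the paper's implementation; it assumes a binary alphabet, uses a proportional split that anticipates the Section II heuristic, and transmits the symbol count, which is the simple EOF remedy discussed in Section VI:

```python
def split(K, p0):
    """Split K codewords into two nonempty parts, the first of size ~ p0*K."""
    K1 = min(max(round(p0 * K), 1), K - 1)
    return K1, K - K1

def bac_encode(bits, Ks, p0):
    """Variable-in, fixed-out: returns a list of codeword indices in [1, Ks]."""
    codes, A, K = [], 1, Ks          # current subset is [A, A+K-1]
    for b in bits:
        K1, K2 = split(K, p0)
        if b:                        # symbol 1 takes the upper subset
            A, K = A + K1, K2
        else:                        # symbol 0 takes the lower subset
            K = K1
        if K < 2:                    # fewer than m = 2 codewords: emit, reset
            codes.append(A)
            A, K = 1, Ks
    if (A, K) != (1, Ks):            # flush a partially selected block
        codes.append(A)
    return codes

def bac_decode(codes, nbits, Ks, p0):
    """Retrace the splits; nbits is the transmitted symbol count (EOF fix)."""
    out = []
    for C in codes:
        A, K = 1, Ks
        while K >= 2 and len(out) < nbits:
            K1, K2 = split(K, p0)
            if C < A + K1:           # C falls in the lower subset
                out.append(0)
                K = K1
            else:
                out.append(1)
                A, K = A + K1, K2
    return out

bits = ([0] * 9 + [1]) * 25          # 250 symbols, 10% ones
codes = bac_encode(bits, 2 ** 16, 0.9)
assert bac_decode(codes, len(bits), 2 ** 16, 0.9) == bits
print(len(bits), "input bits ->", 16 * len(codes), "output bits")
```

Each emitted index stands for one fixed length codeword, so a channel error corrupts only the block that carries it.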
Other V to F entropy coders are runlength [7] and Ziv-Lempel [14], [24], [27]. Runlength codes count the runs of single symbols and encode the runs by transmitting the number in each run. They work well when the probabilities are highly skewed or when there is significant positive correlation between symbols. Runlength codes are often used for small alphabets, especially binary inputs. Then the runlength code is particularly simple since, by necessity, the runs alternate between 1's and 0's. Runlength codes are occasionally combined with Huffman to form a variable in to variable out (V to V) scheme. For example, consider the CCITT facsimile code [10] or the more general V to V runlength codes of Elias [6].

Ziv-Lempel codes work on different principles than BAC codes or the other codes considered so far. They find substrings of input symbols that appear frequently and encode each with a fixed length codeword. Ziv-Lempel codes work particularly well on English text, as they can discover complex dependencies between the symbols.

One advantage of most fixed output length block coders, including BAC, is that the effects of transmission errors (symbols received incorrectly) are confined to the block in question. The decoder does not lose synchronism with the incoming bit stream. While the corrupted block may be decoded incorrectly, remaining blocks will be decoded correctly. However, there may be insertions or deletions in the decoded stream. These insertions and deletions can usually be dealt with at the application level. In contrast, Huffman codes and arithmetic codes can lose synchronism and suffer catastrophic failure in the presence of channel errors. (It is often argued that Huffman codes are self-synchronizing, but this is not guaranteed. Various techniques have been proposed to combat this problem. See, e.g., [16].)

One subtle point is that fixed output length codes can still suffer catastrophic failure if implemented in an adaptive fashion. If the probability estimates depend on previously transmitted symbols, then a single error can affect other blocks. In particular, Ziv-Lempel codes are usually implemented in an adaptive fashion and are susceptible to catastrophic failure if their dictionaries get corrupted. BAC codes will be vulnerable to the same catastrophic failure if the probabilities are estimated adaptively based on prior symbols.

Finally, there is a growing recognition that V to F codes can outperform F to V codes with strong dependencies between symbols [18], [26]. The idea is that, for a fixed number of symbols, V to F codes can match long input strings. For example, a runlength coder with K codewords can match a run of up to K - 1 symbols. This run can be encoded with approximately log K bits, resulting in an exponential gain. In contrast, Huffman codes only give arithmetic gains. For K large enough, the V to F coder is more efficient.

This paper is organized as follows: Section II develops the optimal and heuristic BAC's. In Section III, we present the principal theoretical result of this paper: that BAC's are within an additive constant of the entropy for all code sizes. Section IV presents computations illustrating the efficiency of BAC's. Section V discusses extensions to time varying and Markov symbols, Section VI discusses the EOF problem and proposes two new solutions, and Section VII presents conclusions and discussion.

II. DEVELOPMENT OF BLOCK ARITHMETIC CODING

The BAC encoder parses the input into variable length input strings and encodes each with single fixed length output codewords.

The following definitions are taken from Jelinek and Schneider [12]: Let W(K) = {w_i : 1 <= i <= K} be a K element collection of words, w_i, with each word a substring of X. W(K) is proper if, for any two words, w_i and w_j with i != j, w_i is not a prefix of w_j. W(K) is complete if every infinite length input string has a prefix in W. The characterization in [15] is succinct: a code is complete and proper if every infinite length input string has one and only one prefix in W(K).

It is useful to think of complete and proper from the viewpoint of parsing trees. Proper means that the codewords are all leafs of the tree; complete means that all the nodes in the tree are either leafs or they are internal nodes with m children. The number of codewords in complete and proper sets is 1 + L(m - 1) for L = 1, 2, ... [12]. This result is also easy to derive from the viewpoint of trees. The root has m children (L = 1). Each time a leaf node is replaced by m children, a net gain of m - 1 leafs results. Below, we will allow L = 0 so that one codeword can be allowed in appropriate sets.

We assume that to each output codeword is assigned an integer index i with i = 1, 2, ..., K_S. Let S be the set of output codewords. Consider a subset of codewords with contiguous indices. Denote the first index by A and the last by B. The number of codewords in the subset is K = B - A + 1.

BAC proceeds by recursively splitting S into disjoint subsets. With each input symbol, the current subset is split into m nonempty, disjoint subsets, one for each possible input letter. The new subset corresponding to the actual letter is chosen and the process continues recursively. When the subset has fewer than m codewords, BAC stops splitting, outputs any of the codewords in the subset, reinitializes, and continues. The ideal situation is that each final subset contain only 1 codeword; otherwise, the "extra" codewords are wasted.

Denote the number of codewords in each of the m disjoint subsets by K_l(K), where K is the number of codewords in the current set. Usually, we will suppress the dependence on K and denote the number simply by K_l.

The question of how the K_l should be selected is central to the remainder of this paper. We can identify five criteria that are necessary or desirable:

1) K_l > 0.
2) \sum_{l=1}^{m} K_l <= K.
3) \sum_{l=1}^{m} K_l = K.
4) K_l = 1 + L_l(m - 1) for some L_l = 0, 1, 2, ....
5) If p_j > p_i, then K_j >= K_i.

Criteria Q1 and Q2 are necessary to ensure that the parsing is complete and proper and that the encoding can be decoded. Each subset must have at least one codeword and the sum of all the subsets cannot exceed the total available.

Criteria Q3, Q4, and Q5 are desirable in that meeting each results in a more efficient encoding. In effect, Q3 says that one ought to use all the codewords that are available. Q4 assures that the number of codewords in each subset is equal to one (a leaf) or is the number in a complete and proper set. For

example, consider m = 3 and K = 5. The subsets should be divided into some permutation of 1, 1, 3, and not 1, 2, 2. In the former case, the set with 3 elements can be divided one more time; in the latter case, none of the subsets can. Note, Q3 and Q4 can both be satisfied if the initial K satisfies K = 1 + L(m - 1) and \sum_{l=1}^{m} L_l = L - 1. Q5 merely reflects the intuitive goal that more probable letters should get more codewords. In theory, Q5 can always be met by sorting the K_l's in the same order as the p_l's. Sometimes in practice, e.g., in adaptive models, it may be computationally difficult to know the sorted order and meeting Q5 may be problematic.

The BAC encoding algorithm is given in Fig. 1. The BAC decoding algorithm is given in Fig. 2. We assume that the various functions behave as follows:

[getinput()] Returns the index of the next input symbol to be coded. If a_r is the next letter, then it returns r.
[code(A)] Returns the codeword corresponding to codeword index, A.
[doeof(A, K)] Handles input end-of-file (EOF). Discussed below in Section VI.
[getindexcode()] Returns the index of the next codeword taken from the channel.
[undoeof()] Undoes the effect of doeof().

    /* BAC Encoder */
    K = K_S;
    A = 1;
    while ((l = getinput()) != EOF)
    {
        Compute K_1, K_2, ..., K_m;
        A = A + K_1 + ... + K_{l-1};
        K = K_l;
        if (K < m)
        {
            Output code(A);
            A = 1;
            K = K_S;
        }
    }
    doeof(A, K);

Fig. 1. Basic BAC encoder.

    /* BAC Decoder */
    while ((C = getindexcode()) != EOF)
    {
        K = K_S;
        A = 1;
        while (K >= m)
        {
            Compute K_1, K_2, ..., K_m;
            Find l s.t. A + \sum_{i=1}^{l-1} K_i <= C < A + \sum_{i=1}^{l} K_i;
            output a_l;
            A = A + \sum_{i=1}^{l-1} K_i;
            K = K_l;
        }
    }
    undoeof(A, K);

Fig. 2. Basic BAC decoder.

As examples of the kinds of parsings available, consider the following for binary inputs (a_1 = 0, a_2 = 1):

Let K_1 = K - 1 and K_2 = 1. These are runlength codes for runs of 0's. Similarly, if K_1 = 1 and K_2 = K - 1, we get runlength codes for runs of 1's. The parsing trees are as much "left" or "right" as possible.

Let K_1(K_S) = K_S/2 and K_2(K_S) = K_S/2. If the first symbol is a 1, then let K_1 = K - 1 and K_2 = 1 for all other K's; else, let K_1 = 1 and K_2 = K - 1. These are symmetric runlength codes, useful when the starting value is unknown.

Let K_1 = K/2 and K_2 = K/2. These codes map the input to output in an identity-like fashion.

With the conditions above, we can state the following theorem:

Theorem 1: Block Arithmetic Codes are complete and proper. Furthermore, all complete and proper parsings can be developed as BAC's.

Proof: It is easiest to develop this equivalence with parsing trees. BAC codes are complete and proper from their construction. They are proper because no outputs are generated at nonleaf nodes; they are complete because each terminal node has m - 1 siblings. For the converse, consider any complete and proper parsing as a tree. At each node of the tree, a BAC coder can reproduce the same tree by appropriately choosing the subset sizes, K_l. For instance, at the root node, count the number of leafs in each branch. Use each of these numbers for K_l, respectively. The process can clearly continue at each node. []

In the case considered so far of independent and identically distributed input symbols, it is straightforward to derive an expression for the efficiency of a code. Let N(K) denote the expected number of input symbols encoded with K output symbols. Due to the recursive decomposition, one obtains the following recursive formula:

    N(K) = 1 + \sum_{l=1}^{m} p_l N(K_l).    (2)

The "1" is for the current symbol, x_j. The sum follows from considering all possible choices for x. With probability p_l, the input is a_l and the subset chosen has K_l codewords. The expected number of input symbols encoded from here onward is N(K_l).

Since this equation is crucial to the rest of this development, we also present an alternate derivation. N(K), as the expected number of input symbols encoded with K output symbols, can be written as

    N(K) = \sum_{i=1}^{K} n_i \Pr(w_i),

where n_i is the number of symbols in w_i. Parse w_i = a_{1,i} w_i', where a_{1,i} is the first letter in w_i. Then n_i' = n_i - 1 and, by

independence, Pr(w_i) = Pr(a_{1,i}) Pr(w_i'). Thus,

    N(K) = \sum_{i=1}^{K} (n_i' + 1) \Pr(a_{1,i}) \Pr(w_i')
         = 1 + \sum_{l=1}^{m} p_l \sum_{i : a_{1,i} = a_l} n_i' \Pr(w_i')
         = 1 + \sum_{l=1}^{m} p_l N(K_l).    (3)

(The "1" appears because \sum_{i=1}^{K} \Pr(w_i) = 1 for a complete and proper parsing; (3) follows because K_l codewords start with a_l.)

The boundary conditions are easy and follow from the simple requirement that at least m codewords are needed to encode a single symbol:

    N(K) = 0,    K = 0, 1, ..., m - 1.    (4)

Note, one application of (2) yields N(m) = 1. While hard to compute N(K) in closed form, except in special cases, (2) and (4) are easy to program.

For an example of an easy case, consider ordinary runlength coding for binary inputs. Letting N_r(K) refer to the expected number of input symbols encoded with K output codewords using runlength encoding, then the recursion simplifies to

    N_r(K) = 1 + p N_r(K - 1) + q N_r(1)
           = 1 + p N_r(K - 1),    (5)

where p is the probability of a symbol in the current run and q = 1 - p. (5) follows since N_r(1) = 0. The solution to this linear difference equation is

    N_r(K) = (1 - p^{K-1})/(1 - p).    (6)

Interestingly, this solution asymptotically approaches 1/(1 - p). We arrive at the well-known conclusion that V to F runlength codes do not work well for large block sizes.

We propose two specific rules for determining the size of each subset. The first is an optimal dynamic programming approach, letting N_o(K) be the optimal expected number of input symbols encoded using K output codewords:

    N_o(K) = 1 + \max \sum_{l=1}^{m} p_l N_o(K_l),    (7)

subject to the constraints K_l >= 1 and \sum_{l=1}^{m} K_l = K. This optimization is readily solved by dynamic programming since all the K_l obey K_l < K. However, if m is large, a full search may be computationally prohibitive. The optimal solution can be stored in tabular form and implemented with a table lookup. Note, the optimal codebook does not have to be stored, only the optimal subset sizes.

The second is a heuristic based on ordinary arithmetic coding. In arithmetic coding, an interval is subdivided into disjoint subintervals with the size of each subinterval proportional to the probability of the corresponding symbol. We propose to do the same for BAC. K_l should be chosen as close to p_l K as possible. Let K_l = [p_l K], where [s] is the quantization of s, with the proviso that each K_l >= 1.

One way to do the quantization consistent with Q1-Q4 above is described in Fig. 3. The idea is to process the input letters from least probable to most probable and, for each letter, to keep track of the number of codewords remaining and the total probability remaining. The algorithm actually computes L_l, not K_l, as the former is somewhat easier. Note, Q5 may not be satisfied due to quantization effects. However, our computations and simulations indicate that it almost always is. Furthermore, Q5 can be satisfied if desired by sorting the L_l's.

    /* Good heuristic to determine the L's (K_l = 1 + L_l(m - 1)) */
    Sort the symbols so that p_1 <= p_2 <= ... <= p_m;
    q = 1.0;
    Lrem = L - 1;
    for (l = 1; l <= m; l++)
    {
        f_l = p_l / q;
        L_l = floor(f_l * Lrem + 0.5);
        if (L_l < 0)
            L_l = 0;
        Lrem = Lrem - L_l;
        q = q - p_l;
    }

Fig. 3. "Good" probability quantizer. Computes L_l for l = 1, 2, ..., m given L. Note, all quantities not involving L can be precomputed.

One drawback of the "Good" quantizer in Fig. 3 is that the L_l's are computed one at a time from the least probable letter up to the letter input. Potentially, this loop may execute m times for each input symbol. For large m, this may be too slow. One way to circumvent this problem is the "Fast" quantizer described below. The idea behind the fast quantizer is to compute a "cumulative L function", R_l, or, equivalently, a "cumulative K function", Q_l, and form either L_l or K_l = 1 + (m - 1)L_l by taking a difference. The equations for R and L take the form:

    R_0 = 0    (8)
    R_l = [(L - 1) \sum_{j=1}^{l} p_j]    (9)
    L_l = R_l - R_{l-1}.    (10)

Those for Q and K are as follows:

    Q_0 = 0    (11)
    Q_l = l + (m - 1) R_l    (12)
    K_l = Q_l - Q_{l-1}.    (13)

The "Fast" quantizer satisfies Q1-Q4, though not necessarily Q5. For large alphabets it is slightly less efficient than the "Good" quantizer, but is much faster because there is no loop over all the symbols.

The algorithmic complexity of BAC can be summarized as follows: Using the good heuristic, the encoder can require up to m - 1 multiplications per input symbol, and a similar

number of additions; using the fast heuristic, it only requires two multiplications and a fixed number of additions and comparisons per input symbol. The decoder does the same computations of K_l (or L_l) as the encoder, but also has to search for the proper interval. This searching can be done in O(log m) operations. Thus, the encoder can be implemented in O(1) or O(m) operations per input symbol and the decoder in O(log m) or O(m) operations, depending on whether the fast or the good heuristic is used. In the special case of binary inputs, there is little difference, both in complexity and performance, between the two heuristics. After the K_l's have been computed, the optimal BAC can be implemented with no multiplications and a similar number of additions to the fast heuristic. However, the one time cost of computing the optimal K_l's may be high (a brute force search requires approximately O(m^{L_S}) operations if m is large, where K_S = 1 + L_S(m - 1)).

In some applications, it is convenient to combine the optimal and heuristic approaches in one hybrid approach: use the optimal sizes for small values of K and use the heuristic for large values. As we argue in Section IV, both encoders incur their greatest inefficiency for small values of K. For instance, an encoder can begin with the heuristic until 256-1024 codewords are left and then switch to the optimal.

Letting N_h(K) represent the expected number of input symbols encoded with K output codewords under the heuristic, the expression for N(K) becomes

    N_h(K) = 1 + \sum_{l=1}^{m} p_l N_h([p_l K]).    (14)

For the moment, if we ignore the quantizing above, relax the boundary conditions, and consider N(.) to be a function of a continuous variable, say s, we get the following equation:

    N_e(s) = 1 + \sum_{l=1}^{m} p_l N_e(p_l s),    (15)

where the subscript e refers to "entropy". The entropy rate for V to F codes,

    N_e(s) = \log s / H(X),    (16)

satisfies this equation.(1) (Indeed, substituting (16) into the right side of (15) gives 1 + \sum_{l=1}^{m} p_l \log(p_l s)/H(X) = 1 + (\log s - H(X))/H(X) = \log s/H(X).) Thus we see that, in the absence of quantization effects, BAC codes are entropy codes. As we argued above, for large K the quantization effects are small. Furthermore, log K is a slowly varying function of K. We might expect the heuristic to perform close to the optimal, entropy rate. In fact, we prove in Section III below that N_h(K) is within an additive constant (which depends on the p's and m) of the entropy rate for all K.

III. ASYMPTOTIC RESULTS

Consider the heuristic encoder discussed above. Assume the quantization is done so that, for K large enough, [p_l K] = p_l K + delta_l, where |delta_l| <= m. Assume further that \sum_{l=1}^{m} delta_l = 0. (Such quantization can always be achieved for K large enough. See, e.g., Fig. 3.) In this section, we will take all logs to be natural logs to the base e.

Theorem 2: For independent and identically distributed inputs,

    \log K - H(X) N_h(K) <= C,    (17)

for all K and some constant C. (C depends on the probabilities, the quantization, and on the size of the alphabet, but not on K.)

Proof: We will actually show a stronger intermediate result, namely that

    \log K - H(X) N_h(K) <= G(K) = C - D/K,    (18)

where D > 0. Clearly C - D/K <= C.

The proof is inductive. There are three constants to be determined: C and D, and a splitting constant, K_C. First, select any K_C such that K_C > 2m/p_1, where p_1 > 0 is the minimum of the p_l's.

The basis is as follows: For all K <= K_C, and for any D > 0, we can choose C so that

    C >= \log K - H(X) N_h(K) + D/K;    (19)

since, clearly, N_h(K) >= 0 and D/K <= D, it suffices to take C = \log K_C + D. At this point, K_C has been determined and C has been specified in terms of D, and the proposition holds for all K <= K_C.

For the inductive step, we now assume that (18) applies for all K <= K' for some K' >= K_C. Then, we find a condition on D so that (18) applies for all K. Let K = K' + 1; then

    \log K - H(X) N_h(K)
        = \log K - H(X) - \sum_{l=1}^{m} p_l H(X) N_h([p_l K])
        <= \log K - H(X) - \sum_{l=1}^{m} p_l (\log([p_l K]) - G([p_l K]))
        = \sum_{l=1}^{m} p_l (\log(p_l K) - \log([p_l K])) + \sum_{l=1}^{m} p_l G([p_l K]),    (20)

using \log K - H(X) = \sum_{l=1}^{m} p_l \log(p_l K). The inductive hypothesis is used in (20). The last term above is easily bounded since G(.) is nondecreasing:

    \sum_{l=1}^{m} p_l G([p_l K]) <= G(K - 1).    (21)

(1) We believe this entropy solution uniquely solves (15), but have been unable to prove uniqueness.

If delta_l >= 0, then

    0 <= \log(1 + delta_l/(p_l K)) <= delta_l/(p_l K).    (22)

Similarly, if delta_l < 0,

    0 >= \log(1 + delta_l/(p_l K)) >= 2 delta_l/(p_l K).    (23)

(By construction, p_l K + delta_l > 0.)

The first term of (20) can be bounded as follows:

    \sum_{l=1}^{m} p_l (\log(p_l K) - \log([p_l K]))
        = -\sum_{l=1}^{m} p_l \log(1 + delta_l/(p_l K))    (24)
        = -\sum_{l=1}^{m} p_l \sum_{j=1}^{\infty} \frac{(-1)^{j+1}}{j} \left(\frac{delta_l}{p_l K}\right)^j    (25)
        = -\sum_{j=1}^{m-1} P_j K^{-j} + r_m(K),    (26)

where the P_j are coefficients that are polynomial functions of the delta_l's, they do not otherwise depend on K, and r_m(K) is the remainder of the expansion. Since \sum_{l=1}^{m} delta_l = 0, the first-order coefficient vanishes, P_1 = 0. Now we can bound each term separately. By construction, 1 + delta_l/(p_l K) > 1/2, so the series converges and

    |r_m(K)| <= 2^{m-1} \max_j |P_j| K^{-2}.    (27)

The second term of (26) can be bounded as follows:

    \left| -\sum_{j=1}^{m-1} P_j K^{-j} \right| <= (m - 1) K^{-2} \max_j |P_j|.    (28)

Combining (21), (27), and (28) and letting D' = (m - 1) 2^m \max_j |P_j|, we get the following:

    \log K - H(X) N_h(K) <= G(K - 1) + D' K^{-2} <= G(K).    (29)

Substituting in for G(.), we need to select D so that the inequality in (29) is valid:

    C - \frac{D}{K - 1} + D' K^{-2} <= C - \frac{D}{K},    (30)

or, equivalently,

    -DK + D' \frac{K - 1}{K} <= -DK + D,    (31)

which is valid for all D > D'. This completes the proof. []

Clearly, the optimal approach is better than the heuristic and worse than entropy. So, combining with the theorem above, we have the following ordering:

    \frac{\log K}{H(X)} = N_e(K) >= N_o(K) >= N_h(K) >= \frac{\log K - C}{H(X)}.    (32)

IV. COMPUTATIONAL RESULTS

In this section, we compute BAC efficiencies for the heuristic and optimal implementations and compare these with entropy values. We also compare an implementation of BAC and the arithmetic coder of Witten, Neal, and Cleary [25], both operating on a binary input, and show that BAC is much faster. Source code for the tests presented here can be found in [4].

[Fig. 4. Calculated N(K) versus log K for binary inputs with p = 0.95: entropy, optimal, heuristic, and runlength curves.]

In Fig. 4, we present calculations of N(K) for a hypothetical entropy coder, the optimal BAC, the good heuristic BAC, and a simple V to F runlength coder versus log K = 0, 1, ..., 16. The input is binary with p = Pr(x = 1) = 0.95. Note that the optimal and heuristic curves closely follow the entropy line. The maximum difference between both curves and entropy is 4.4 and occurs at log K = 3. In contrast, for log K = 16, the optimal is within 2.3 input symbols of the entropy and the heuristic within 2.8. This behavior seems to be consistent over a wide range of values for p. The maximum difference occurs for a small value of log K. The coders improve slightly in an absolute sense as log K grows. At some point the difference seems to level off. The relative efficiency increases as the number of codewords grow. At log K = 3, it

is 58% for both and, at log K = 16, it is 96% for the optimal BAC and 95% for the heuristic BAC.

[Fig. 5. Calculated N(K) versus log K: entropy, optimal, good heuristic, and fast heuristic curves.]

In Fig. 5, we present calculations of N(K) versus log K for a 27 letter English alphabet taken from Blahut [1, p. 21]. Plotted are N(K) for the entropy, optimal, and good and fast heuristic BAC's. On this example, the curves for the optimal and the good heuristic are almost indistinguishable.

TABLE I
CALCULATED N(K) FOR BINARY INPUTS WITH 16 BIT CODEWORDS (log K = 16)

            Entropy    Optimal           Heuristic         Runlength
    p       N_e(K)     N_o(K)   C-hat    N_h(K)   C-hat    N_R(K)
    0.80     22.2       21.9    0.22      21.9    0.22       5.0
    0.85     26.2       25.8    0.24      25.7    0.30       6.7
    0.90     34.1       33.2    0.42      33.1    0.47      10.0
    0.95     55.9       53.5    0.69      53.1    0.80      20.0
    0.98    113.1      103.4    1.37     101.8    1.60      50.0
    0.99    198.0      184.9    1.13     181.2    1.36     100.0

In Table I, we present calculated N(K) for binary inputs and selected p's with log K = 16, i.e., 16 bit codewords. One can see that both the heuristic and optimal BAC's are efficient across the whole range. For instance, with p = 0.80, both exceed 98.7% of entropy; with p = 0.95, both exceed 95.0%; and with p = 0.99, both exceed 91.5%.

The computational results support an approximation. For all K large enough, H(X) N_h(K) ~ log K - C-hat. In general, it seems to be that C-hat < C, where C is taken from Theorem 2. Also in Table I are computed values of C-hat = log K - H(X) N_h(K) for K = 2^16.

As another experiment, we implemented BAC and the coder of Witten, Neal, and Cleary (referred to as WNC) [25] to assess the relative speeds. Source code for WNC appears in their paper, so it makes a good comparison. Speed comparisons are admittedly tricky. To make the best comparison, both BAC and WNC were optimized for a binary input and the encoders for both share the same input routine and the decoders the same output routine. There are two versions of WNC, a straightforward C version, and an optimized C version. We implemented both and found that, after optimizing for binary inputs, the straightforward version was actually slightly faster than the optimized version. Both were tested on the same input file, 2^20 independent and identically distributed bits with a probability of a 1 equal to 0.95. The execution times on a SPARC IPC are listed in Table II. The times are repeatable to within 0.1 seconds. Also listed are the times for input and output (IO Speed), i.e., to read in the input one bit at a time and write out the appropriate number of bytes, and to read in bytes and write out the appropriate number of bits. The IO Speed numbers do not reflect any computation, just the input and output necessary to both BAC and WNC.

TABLE II
EXECUTION TIMES FOR BAC AND WNC FOR A 1 MILLION BIT FILE

    Version                Time (secs)
    BAC Encoder                4.2
    BAC Decoder                3.7
    WNC Encoder               21.6
    WNC Decoder               29.5
    WNC Encoder (fast)        22.2
    WNC Decoder (fast)        29.8
    IO Speed (encoder)         2.0
    IO Speed (decoder)         1.7

We see that in this test, the BAC encoder is approximately 5 times faster than WNC and the BAC decoder is 8 times faster than the WNC decoder. Indeed, the BAC times are not much longer than the input/output operations alone.

V. EXTENSIONS

BAC's can be extended to work in more complicated environments than that of i.i.d. symbols considered so far. For instance, consider the symbols to be independent, but whose probabilities are time varying:

    p(l, j) = \Pr(x_j = a_l).    (33)

Then denote the number of input symbols encoded with K output symbols starting at time j by N(K, j). Then, in analogy to (2) we get the following:

    N(K(j), j) = 1 + \sum_{l=1}^{m} p(l, j) N(K_l(j), j + 1),    (34)

where K_l(j) is the number of codewords assigned to a_l at time j. The heuristic is easy: Choose K_l(j) = [p(l, j) K(j)]. We can also present an optimal dynamic programming solution:

    N_o(K(j), j) = 1 + \max \sum_{l=1}^{m} p(l, j) N_o(K_l(j), j + 1).    (35)

This problem can be solved backward in time, from a maximum j = K - m + 1 to a minimum j = 1.

As another example consider a time invariant, first order Markov source. Let p(i|l) = \Pr(x(j) = a_i | x(j-1) = a_l). Then the recursion for N(K) splits into two parts. The first is for the first input symbol; the second is for all other input symbols:

    N(K) = 1 + \sum_{i=1}^{m} p_i N(K_i | i)    (36)

    N(K | l) = 1 + \sum_{i=1}^{m} p(i|l) N(K_i | i),    (37)

where N(KIZ) is the number of input symbols encoded using inefficiency as follows:
K codewords given that the current input is al. The heuristic
again is easy: Choose Ki = Ip(ilZ)K].The optimal solution is
log K - C’- log(K - 1) + C’ M (K(1og K - Cl))-’ (38)
harder, although it can be done with dynamic programming. log K - C‘
For every 1, the optimal N,(KIZ) can be found because each For K large enough, this overhead is negligible.
Ki < K . In [3], we show a similar optimality result to
Theorem 2 for a class of first order Markov sources. WI. CONCLUSIONS AND DISCUSSION
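The two-part Markov recursion can be evaluated directly. The following Python sketch is our illustration, with an assumed transition matrix, an assumed first-symbol distribution, and an assumed tie-breaking rule for the floor-based split; the text itself specifies only K_i = ⌊p(i|l)K⌋.

```python
from functools import lru_cache

# Illustrative first-order binary Markov source (values assumed, not from the paper):
# P[l][i] = Pr(x(j) = a_i | x(j-1) = a_l)
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi0 = [0.5, 0.5]  # distribution of the first input symbol (assumption)

def split(probs, K):
    """Heuristic allocation K_i ~ floor(p_i * K), keeping every K_i >= 1 and
    summing to K (the slack rule is our tie-breaking assumption)."""
    Ks = [max(1, int(p * K)) for p in probs]
    Ks[0] += K - sum(Ks)  # give any rounding slack to the first branch
    return Ks

@lru_cache(maxsize=None)
def N_cond(K, l):
    """Analog of eq. (37): expected further input symbols encoded by K
    codewords, given that the previous symbol was a_l."""
    if K < 2:
        return 0.0
    Ks = split(P[l], K)
    return 1.0 + sum(P[l][i] * N_cond(Ks[i], i) for i in range(2))

def N(K):
    """Analog of eq. (36): expected symbols per block, first symbol from pi0."""
    Ks = split(pi0, K)
    return 1.0 + sum(pi0[i] * N_cond(Ks[i], i) for i in range(2))

print(N(2 ** 12))
```

Because each K_i < K, the memoized recursion always terminates, mirroring the argument in the text for why the optimal N_o(K | l) can be computed.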
In some applications it is desirable to adaptively estimate the probabilities. As with stream arithmetic coding, BAC encodes the input in a first-in-first-out fashion. The only requirement is that the adaptive formula depends only on previous symbols, not on present or future symbols.
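As a concrete illustration of this constraint, the minimal estimator below (add-one counting is our choice, not one prescribed by the paper) computes each probability estimate from strictly earlier symbols, so the encoder and decoder can maintain identical estimates without side information.

```python
def adaptive_estimates(symbols, m=2):
    """Return the probability estimate used for each symbol, computed only
    from the symbols that precede it (add-one smoothing, an assumption)."""
    counts = [1] * m  # add-one prior so no symbol is ever assigned probability 0
    total = m
    est = []
    for s in symbols:
        est.append(counts[s] / total)  # depends only on x(1), ..., x(j-1)
        counts[s] += 1                 # update only after the symbol is coded
        total += 1
    return est

print(adaptive_estimates([0, 0, 1, 0]))
```

The decoder runs the same loop on its decoded output, so both sides see the same estimates in the same order.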
VI. THE EOF PROBLEM

One practical problem that BAC and other arithmetic coders have is denoting the end of the input sequence. In particular, the last codeword may be only partially selected.

The simplest solution to this problem is to count the number of symbols to be encoded and to send this count before encoding any. It is an easy matter for the decoder to count the number of symbols decoded and stop when the appropriate number is reached. However, this scheme requires that the encoder process the data twice and incurs a transmission overhead to send the count.

The EOF solution proposed in the stream arithmetic coder of Witten, Neal, and Cleary [25] and used in FIXARI [22] is to create an artificial letter with minimal probability. This letter is only transmitted once, denoting the end of the input. When the decoder decodes this special letter, it stops. We computed BAC's lost efficiency for binary inputs for a variety of probabilities and codebook sizes and found that it averages about 7%. For the English text example, the loss averaged about 3.5% over a wide range of codebook sizes.

We propose two alternatives to the simple schemes above (see also [2]). The first assumes the channel can tell the decoder that no more codewords remain and that the decoder is able to "lookahead" a modest number of bits. The idea is for the encoder to append to the input sequence as many least probable symbols as necessary (possibly zero) to flush out the last codeword. After transmitting the last codeword, the encoder transmits the number of encoded symbols to the decoder. Even with 2^32 codewords, at most 31 extra input symbols are needed. This number can be encoded with 5 bits. The decoder looks ahead 5 bits until it detects that no more symbols are present. It then discards the appropriate number of appended symbols. The overhead caused by this scheme is modest: 5 bits at the end and one possibly wasteful codeword.
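A sketch of the first scheme's bookkeeping follows. The real stopping rule is the coder's own codeword-selection process; here it is abstracted as a hypothetical fixed block length, so the function shows only the padding and the fixed-width count field.

```python
from math import ceil, log2

def pad_and_count(symbols, symbols_per_block, least_probable, K=2 ** 32):
    """Pad the input with the least probable symbol so the last codeword is
    fully selected, and return (padded input, pad count as a bit field).
    `symbols_per_block` stands in for the coder's actual stopping rule."""
    pad = (-len(symbols)) % symbols_per_block
    # Width of the count field: e.g. 5 bits cover up to 31 pads when K = 2**32.
    count_bits = ceil(log2(ceil(log2(K))))
    assert pad < 2 ** count_bits
    return symbols + [least_probable] * pad, format(pad, f"0{count_bits}b")

padded, count_field = pad_and_count([1, 0, 1], symbols_per_block=4, least_probable=0)
print(padded, count_field)
```

The decoder discards `count_field` pad symbols from the end, matching the lookahead procedure described above.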

The second scheme assumes the channel cannot tell the decoder that no more codewords remain. We suggest devoting one codeword to specify the end of the codeword sequence. (Note, we are not suggesting an extra input letter, but one of the K codewords.) Then append the extra number of bits as above. The inefficiency here includes the 5 bits above and the loss due to one less codeword. Using the approximation discussed in Section IV above, one computes the relative inefficiency as follows:

    [(log K − C̄) − (log(K − 1) − C̄)] / (log K − C̄) ≈ (K(log K − C̄))^−1        (38)

For K large enough, this overhead is negligible.

VII. CONCLUSIONS AND DISCUSSION

We believe that BAC represents an interesting alternative in entropy coding. BAC is simple to implement, even for large block sizes, because the algorithm is top-down and regular. In contrast, Huffman's and Tunstall's algorithms are not top-down and generally require storing and searching a codebook. BAC is efficient, though probably slightly less efficient than ordinary arithmetic coding. The comparison with Huffman is a little more complicated. The asymptotic result for BAC is, on the surface, weaker than for Huffman. The BAC bound uses C, which must be computed on a case by case basis, while the Huffman bound is 1. Both coders can be made as relatively efficient as desired by selecting the block size large enough. However, BAC can use much larger block sizes than is practical for Huffman. The BAC asymptotic result is much stronger than that for Tunstall's coder. For BAC, the difference between entropy and the heuristic rates is bounded. Tunstall shows only that the ratio of rates of entropy and his coder approaches 1 as K → ∞.

In the proof for Theorem 2, we have shown that a constant, C, exists that bounds the difference between entropy and the heuristic BAC for all K. We have made no effort to actually evaluate the constant from the conditions given in the proof. This is because such an evaluation would be worthless in evaluating the performance of BAC. As shown in Section IV, the constant in practice is quite small. One important need for future research is to provide tight bounds for C and, perhaps, to characterize the difference between the entropy rate and BAC more accurately.

One advantage of BAC compared to Huffman and stream arithmetic coding is that BAC uses fixed length output codewords. In the presence of channel errors, BAC will not suffer catastrophic failure. The other two might.

We have also argued that BAC can accommodate more complicated situations. Certainly the heuristic can handle time varying and Markov probabilities. It can estimate the probabilities adaptively. It remains for future work to prove optimality results for these more complicated situations.

REFERENCES

[1] R. E. Blahut, Principles and Practice of Information Theory. Reading, MA: Addison-Wesley, 1987.
[2] C. G. Boncelet, Jr., "Extensions to block arithmetic coding," in Proc. 1992 Conf. Inform. Sci. and Syst., Princeton, NJ, Mar. 1992, pp. 691-695.
[3] -, "Block arithmetic coding for Markov sources," in Proc. 1993 Inform. Theory Symp., San Antonio, TX, Jan. 1993.
[4] -, "Experiments in block arithmetic coding," Univ. Delaware, Dep. Elec. Eng., Tech. Rep. 93-2-1, 1993.
[5] R. M. Capocelli and A. De Santis, "New bounds on the redundancy of Huffman codes," IEEE Trans. Inform. Theory, vol. 37, pp. 1095-1104, July 1991.
[6] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inform. Theory, vol. IT-21, pp. 194-203, Mar. 1975.
[7] S. W. Golomb, "Run-length encodings," IEEE Trans. Inform. Theory, vol. IT-12, pp. 399-401, July 1966.
[8] M. Guazzo, "A general minimum-redundancy source-coding algorithm," IEEE Trans. Inform. Theory, vol. IT-26, pp. 15-25, 1980.
[9] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, pp. 1098-1101, Sept. 1952.
[10] R. Hunter and A. H. Robinson, "International digital facsimile coding standards," Proc. IEEE, vol. 68, pp. 854-867, July 1980.
[11] M. Jakobsson, "Huffman coding in bit-vector compression," Inform. Process. Lett., vol. 7, no. 6, pp. 304-307, Oct. 1978.
[12] F. Jelinek and K. Schneider, "On variable-length-to-block coding," IEEE Trans. Inform. Theory, vol. IT-18, pp. 765-774, Nov. 1972.
[13] G. G. Langdon, "An introduction to arithmetic coding," IBM J. Res. Develop., vol. 28, pp. 135-149, Mar. 1984.
[14] A. Lempel and J. Ziv, "On the complexity of finite sequences," IEEE Trans. Inform. Theory, vol. IT-22, pp. 75-81, Jan. 1976.
[15] N. Merhav and D. L. Neuhoff, "Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes," IEEE Trans. Inform. Theory, vol. 38, pp. 135-140, Jan. 1992.
[16] B. L. Montgomery and J. Abraham, "Synchronization of binary source codes," IEEE Trans. Inform. Theory, vol. IT-32, pp. 849-854, Nov. 1986.
[17] -, "On the redundancy of optimal binary prefix-condition codes for finite and infinite sources," IEEE Trans. Inform. Theory, vol. IT-33, pp. 156-160, Jan. 1987.
[18] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, Jr., and R. B. Arps, "An overview of the basic principles of the Q-coder adaptive binary arithmetic coder," IBM J. Res. Develop., vol. 32, no. 6, pp. 717-726, Nov. 1988.
[19] J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Develop., pp. 198-203, May 1976.
[20] J. Rissanen and G. G. Langdon, "Arithmetic coding," IBM J. Res. Develop., vol. 23, no. 2, pp. 149-162, Mar. 1979.
[21] J. Rissanen and K. M. Mohiuddin, "A multiplication-free multialphabet arithmetic code," IEEE Trans. Commun., vol. 37, pp. 93-98, Feb. 1989.
[22] J. Teuhola and T. Raita, "Piecewise arithmetic coding," in Proc. Data Compression Conf. 1991, J. A. Storer and J. H. Reif, Eds., Snow Bird, UT, Apr. 1991, pp. 33-42.
[23] B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., 1967.
[24] T. A. Welch, "A technique for high-performance data compression," IEEE Comput. Mag., pp. 8-19, June 1984.
[25] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, pp. 520-540, June 1987.
[26] J. Ziv, "Variable-to-fixed length codes are better than fixed-to-variable length codes for Markov sources," IEEE Trans. Inform. Theory, vol. 36, pp. 861-863, July 1990.
[27] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, pp. 530-536, Sept. 1978.
