Journal of Discrete Algorithms: Satoshi Yoshida, Takuya Kida
Article history:
Available online 13 November 2014
Keywords:
Lossless data compression
VF code
Compressed pattern matching
Abstract
We discuss an improved encoding method for Variable-to-Fixed length codes (VF codes). A VF code is a coding scheme that splits an input text into a consecutive sequence of substrings and then assigns a fixed-length codeword to each substring. Since all the codewords have the same length, they are easily extracted, and thus VF codes are suitable for processing compressed texts directly. Almost Instantaneous VF code (AIVF code), which was proposed by Yamamoto and Yokoo in 2001, achieves a good compression ratio by using a set of parse trees. However, it requires more time and space for both encoding and decoding than a classical VF code does. In this paper, we prove that the set of parse trees of an AIVF code can be multiplexed into a compact single tree and that the original encoding and decoding procedures can be simulated on the multiplexed tree. We also give upper and lower bounds on the number of nodes in the multiplexed parse tree, as well as on the number of nodes saved relative to the original trees. The experimental results showed that the proposed method can encode natural language texts much faster than AIVF coding.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
1.1. Background
From the viewpoint of speeding up compressed pattern matching, variable-to-fixed length codes (VF codes for short) have been reevaluated [8,6,15,9,18]. A VF code is a coding scheme that parses an input text into a consecutive sequence of substrings with a dictionary tree, called a parse tree, and then assigns a fixed-length codeword to each substring. Tunstall code [12], which is a classical and typical VF code, has been proved to be an entropy code for a memoryless information source (see also [11]); its average code length per symbol comes asymptotically close to the entropy of the source as the code length goes to infinity. However, its actual compression ratio is not as good as that of other well-known codes, such as Huffman code.
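As a concrete illustration of the scheme just described, Tunstall's construction repeatedly expands the most probable leaf of the parse tree until the codeword budget is exhausted. The following is a minimal Python sketch under a memoryless source; `build_tunstall` and `encode` are illustrative names of ours, not from the original paper.

```python
import heapq

def build_tunstall(probs, num_codewords):
    """Greedily grow a parse tree for a memoryless source: repeatedly
    expand the most probable leaf into k children while the number of
    leaves stays within the codeword budget."""
    k = len(probs)
    heap = [(-p, s) for s, p in probs.items()]  # leaves as (-prob, string)
    heapq.heapify(heap)
    leaves = k
    while leaves + (k - 1) <= num_codewords:
        negp, s = heapq.heappop(heap)           # most probable leaf
        for sym, p in probs.items():            # expand it into k children
            heapq.heappush(heap, (negp * p, s + sym))
        leaves += k - 1
    return sorted(s for _, s in heap)           # the leaves form the codebook

def encode(text, codebook):
    """Parse the text greedily and emit codeword indices for the blocks."""
    index = {s: i for i, s in enumerate(codebook)}
    maxlen = max(len(s) for s in codebook)
    out, i = [], 0
    while i < len(text):
        for length in range(min(maxlen, len(text) - i), 0, -1):
            if text[i:i + length] in index:
                out.append(index[text[i:i + length]])
                i += length
                break
        else:
            break  # the tail of the text may end in the middle of the tree
    return out
```

Each emitted index would then be written with the same number of bits, which is what makes the codewords fixed-length.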
Several improvements on VF codes have been proposed thus far. Almost Instantaneous VF code (AIVF code for short¹), which was proposed by Yamamoto and Yokoo [15], is one of the most attractive codes from a practical viewpoint. Its
This is an extended version of papers published in DCC 2010 [16] and in ESKM 2012 [17].
Corresponding author.
E-mail addresses: [email protected] (S. Yoshida), [email protected] (T. Kida).
¹ Almost instantaneous (VF) code originally meant that one symbol must be read ahead for encoding. However, we use the term as a unique name in this paper.
http://dx.doi.org/10.1016/j.jda.2014.10.005
1570-8667/© 2014 Elsevier B.V. All rights reserved.
Fig. 1. Multiplexing the parse trees of AIVF code into a single tree.
compression ratio is almost comparable to that of Huffman code.² In fact, even if the codeword length is short, AIVF code achieves a better compression ratio than Tunstall code does with considerably longer codewords (see Section 5). However, its encoding and decoding speeds are slower than those of Tunstall code. The reason is that AIVF code must construct k − 1 parse trees, where k is the alphabet size of the text, which usually reaches 70–140 for natural language texts. Thus, a large amount of memory space is required to hold the trees. Therefore, a VF code with a more compact parse tree that can achieve a compression ratio as good as that of AIVF code is desirable.
1.2. Our contribution
In this paper, we propose an efficient algorithm for encoding and decoding AIVF codes, which integrates the multiple parse trees into a compact single tree and simulates the encoding and decoding procedures of the original AIVF code. Our idea originated in the observation that many nodes in the multiple parse trees of an AIVF code are common, and thus they can be multiplexed (see Fig. 1). We refer to the integrated parse tree as the Virtual Multiple AIVF parse tree (VMA tree for short). We prove that the upper and lower bounds of the number of nodes in the VMA tree are $Mk - \frac{1}{2}k^2 + \frac{k}{k-1}$ and $M\ln(k+1) - \frac{M}{2} + 1$, respectively, where M denotes the number of codewords. We also prove that the upper and lower bounds of the number of nodes reduced from the original multiple parse trees are $\frac{kM(k-\frac{1}{2}) + \frac{1}{2}(k-6)(k+1) - kM\ln(k+1)}{k-1}$ and $\frac{1}{2}k^2 + \frac{5}{2}k - 7$, respectively. We show that in fact our technique allows much faster encoding than does the original AIVF coding.
² It should be noted that AIVF code supposes a memoryless information source for its input, as Tunstall and Huffman codes do.
Fig. 2. Parse tree T 0 for AIVF code. A circle indicates an internal node and a square indicates a leaf.
suggests that the average block length can be increased by assigning these unreachable codewords to other strings. For the running example, the length of the next block can be increased by using the tree in Fig. 3 instead of the tree in Fig. 2.
In the method proposed in [15], k − 1 parse trees $T_i$ (i = 0, ..., k − 2) are utilized. For each i, the ith parse tree $T_i$ has a root whose children are labeled by $a_{i+1}, \ldots, a_k$ (recall that $\Pr(a_i) \ge \Pr(a_j)$ for i < j). A given text is encoded and decoded by switching the parse trees according to the context.
We now discuss how to construct the optimal parse tree for each $T_i$. Let M be the number of codewords. Then, our aim is to construct the optimal parse tree that maximizes the average block length $\sum_{x \in N(T)} |x| \Pr(x)$ for a memoryless information source. The algorithm is shown in Algorithm 1. It should be noted that, when the (k − 1)-th child of an incomplete node is added, the average block length can be increased by adding the kth child without increasing the number of codewords, because a codeword is not assigned to a complete internal node. Recall that it is assumed that $\Pr(a_i) \ge \Pr(a_j)$ for i < j. In this case, a greedy algorithm yields the optimal solution. The algorithm constructs $T_i$ as follows. First, it creates the root node and its k − i children as the initial tree. It should be noted that we treat the root as a complete internal node. Next, it repeats the following while the number of children of n that have not been created is at least the number of unused codewords, where n is the incomplete internal node or leaf with the maximum probability: (i) let $S_1$ be the average block length if all the children of n are created, and let $S_2$ be the one if k − d(n) − 1 leaves with the maximum probabilities are created sequentially; (ii) if $S_1 \ge S_2$, all the children of n are created; otherwise, k − d(n) − 1 leaves with the maximum probabilities are created sequentially. Finally, it creates M − m leaves with the maximum probabilities sequentially, where m is the number of codewords used.
The encoding algorithm with multiple parse trees is Algorithm 2. First, $T_0$ is selected as the current parse tree. Then, the symbols are read one by one, and the parse tree is traversed according to each symbol. If the child labeled by the symbol does not exist, the codeword assigned to the current node is output. Next, it must be determined which tree is used for the next traversal. Let n be the node that the preceding traversal finally reached, and let n have d(n) children labeled by $a_1, a_2, \ldots, a_{d(n)}$. Then, the next symbol is larger than $a_{d(n)}$. Thus, the root of the next tree should have k − d(n) children labeled by $a_{d(n)+1}, a_{d(n)+2}, \ldots, a_k$. The root of tree $T_i$ of the multiple parse trees has k − i children labeled by $a_{i+1}, a_{i+2}, \ldots, a_k$. Hence, $T_{d(n)}$ is selected as the next parse tree, and we jump to the root of $T_{d(n)}$. For example, let an input text be S = aabaabaccab, and consider encoding S with the parse trees in Figs. 2 and 3. Then, S is split into the blocks aa|baa|bac|ca|b. Therefore, the output binary sequence of length 15 bits is obtained as 001 001 011 111 101. For the decoding, the same trees must be used as for the encoding.
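The tree-switching rule above can be sketched in Python as follows, with each parse tree represented as a nested dict keyed by symbols and a sentinel key `CODE` holding the codeword of a node. The trees of Figs. 2 and 3 are not reproduced here; the toy trees and all names below are illustrative only.

```python
CODE = object()  # sentinel dict key holding the codeword of a node

def aivf_encode(text, trees):
    """Encode text by switching among parse trees: after emitting the
    codeword of a node n with d(n) children, continue in tree T_{d(n)}."""
    out = []
    i, node = 0, trees[0]           # start at the root of T_0
    for c in text:
        if c in node:
            node = node[c]          # descend along the symbol
        else:
            out.append(node[CODE])  # block ends: output the codeword of n
            d = sum(1 for key in node if key is not CODE)  # d(n)
            i = d                   # switch to T_{d(n)}
            node = trees[i][c]      # restart below the root of T_{d(n)}
    if node is not trees[i]:
        out.append(node[CODE])      # flush the final block
    return out

# Toy example over {a, b, c}: T_0 and T_1 with hand-assigned codewords.
T0 = {'a': {'a': {CODE: 0}, CODE: 1}, 'b': {CODE: 2}, 'c': {CODE: 3}}
T1 = {'b': {CODE: 0}, 'c': {CODE: 1}}
```

Encoding 'ab' with these toy trees outputs [1, 0]: the first block 'a' ends at a node with one child, so the second block is parsed in T_1.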
3. Virtual multiple AIVF tree
In this section, we explain our idea and present an efficient algorithm for AIVF code. In the parse trees of an AIVF code, it can be observed that many nodes in $T_i$ are identical to those in $T_{i+1}$. More precisely, $T_{i+1}$ completely covers the nodes
(Algorithm 1, continued)
19: Add k − d(n) children $n_j$, j = d(n)+1, ..., k, to n.
Procedure FindOptPos()
20: $n \leftarrow \arg\max_{n \in N(T_i)} \Pr(n) \cdot \Pr(a_{d(n)+1})$.
21: Add the (d(n)+1)-th child n′ of n.
in $T_i$, except for the leftmost³ subtree under the node corresponding to $a_{i+1}$. First, we explain this relationship. Let $S^{(i)}_j$ be the subtree of $T_i$ that consists of all the nodes under the direct child of the root corresponding to $a_j$. Then, we have the following theorem.

Theorem 1. Subtree $S^{(i+1)}_{i+j}$ completely covers subtree $S^{(i)}_{i+j}$ for any i (0 ≤ i ≤ k − 3) and j (2 ≤ j ≤ k − i).

Proof. We prove the theorem by contradiction. For an integer i (0 ≤ i ≤ k − 2), let $\bar{T}_i$ be the tree of infinite depth in which the root has children according to $a_{i+1}, \ldots, a_k$, and all the other internal nodes have just k children. It should be noted that the root of the multiple parse tree $T_i$ has children according to $a_j$ (j = i+1, ..., k). That is, $\bar{T}_i$ includes $S^{(i)}_j$ (j = i+1, ..., k). It should also be noted that each $T_i$ is optimal in the sense that it maximizes the average block length. That is, for any i (0 ≤ i ≤ k − 3) and j (2 ≤ j ≤ k − i), the average block length cannot be further increased by exchanging any node in $S^{(i)}_{i+j}$ for a node included in $\bar{T}_i$ but not in $T_i$. Here, we assume that $S^{(i+1)}_{i+j}$ does not cover $S^{(i)}_{i+j}$ completely. Then, there exists a node that is in $S^{(i)}_{i+j}$ but not in $S^{(i+1)}_{i+j}$. Let n be such a node. Since $T_{i+1}$ is a parse tree that maximizes the average block length, exchanging some node of $S^{(i+1)}_{i+j}$ for n cannot increase the average block length; but then the same exchange would also be non-increasing in $T_i$, contradicting the fact that n belongs to the optimal tree $T_i$ while the other node does not. Therefore, $S^{(i+1)}_{i+j}$ completely covers $S^{(i)}_{i+j}$. □
The set of trees can be multiplexed and simply integrated into a single tree according to the above theorem. The integrated tree is called a Virtual Multiple AIVF tree (VMA tree for short). To simulate the encoding and the decoding of the original AIVF codes by using the VMA tree, it is necessary to indicate which parse tree is currently being traversed during processing. Thus, each node in the VMA tree is marked to indicate to which trees the node belongs. Since a node can belong to several parse trees, the least i such that n belongs to $T_i$ is saved for each node n. Denoting this mark by Tn(n), we have
³ Each edge is arranged in descending order of the occurrence probability of its label, in a left-to-right manner.
Fig. 4. An example of a VMA tree. This is an integrated tree obtained by multiplexing the trees shown in Fig. 2 and in Fig. 3. The number written in each
node indicates the value of Tn(n).
Algorithm 3. Encoding with a VMA tree.
1: n ← the root of T.
2: i ← 0.
3: while not end of the input text do
4:   c ← the next symbol of the input text.
5:   if there exists a child of n labeled by c then
6:     n′ ← the child of n labeled by c.
7:     if Tn(n′) ≤ i then
8:       n ← n′.
9:     else
10:      Goto line 13.
11:    end if
12:  else
13:    Output $w_i(n)$.
14:    i ← d(n).
15:    n ← the child of the root labeled by c.
16:  end if
17: end while
that Tn(n) = min{i | 0 ≤ i ≤ k − 2, n belongs to $T_i$}. For example, by integrating the two parse trees shown in Figs. 2 and 3, the parse tree shown in Fig. 4 is obtained.
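The mark Tn(n) = min{i | n belongs to $T_i$} can be computed directly while merging the node sets of the trees, as in this sketch (the function name `multiplex` and the encoding of a node by its label string are illustrative choices of ours):

```python
def multiplex(tree_nodes):
    """tree_nodes[i] is the set of nodes of parse tree T_i, each node
    identified by the string of labels on its path from the root.
    Returns the map node -> Tn(node), the least i with node in T_i."""
    tn = {}
    for i, nodes in enumerate(tree_nodes):  # i ascends, so the first
        for n in nodes:                     # i seen for a node is the
            tn.setdefault(n, i)             # least one
    return tn
```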
To encode with a VMA tree, the previous encoding algorithm must be modified, because even if there is a child in the VMA tree for the next traversal, the traversal fails when there is no corresponding child in $T_i$. Therefore, Tn(n) and the number i of the currently traversed tree $T_i$ are compared. If i is less than Tn(n), there is no proper node in $T_i$, and we return to the root. The encoding algorithm is shown in Algorithm 3, where the codeword assigned to n in $T_i$ is denoted by $w_i(n)$.
The algorithm for constructing a VMA tree is shown in Algorithm 4. Let $S_i$ denote the subtree that consists of all nodes under the child of the root corresponding to $a_i$. We define $\#N(S_0) = M - k$ for convenience. We define the function $cod(a_j) = j$, and denote by first(n) the first symbol of the label sequence on the path from the root to n; that is, if first(n) = $a_j$, then cod(first(n)) = j. The number of nodes having a codeword is denoted by m in Algorithm 4. Subtrees $S^{(i)}_{i+1}$ for i = 0, ..., k − 2 and $S^{(k-2)}_k$ are in the VMA tree by Theorem 1. Therefore, this algorithm repeatedly constructs a parse tree T in the manner of Algorithm 1 and moves the leftmost subtree from T to $T_V$. It should be noted that the number of nodes with codewords to be created is the same as the number of nodes that have codewords in the subtree moved to $T_V$ for the next iteration.
4. Bound analysis of VMA tree
In this section, we discuss the upper and lower bounds of the number of nodes in a VMA tree. Let M be the number
of codewords, i.e., M = 2 for the codeword of length . We prove that the upper and lower bounds of the total number
k
of nodes in a VMA tree are respectively Mk 12 k2 + k
in Theorem 2 and M ln(k + 1) M
+ 1 in Theorem 3 and that
2
1
the lower and upper bounds of the number of nodes reduced by integrating multiple parse trees into one are respectively
1 2
k
2
+ 52 k 7 in Theorem 4 and
kM (k 12 )+ 12 (k6)(k+1)kM ln(k+1)
k1
in Theorem 5.
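These bounds are easy to evaluate numerically. The following sketch transcribes the four bound formulas (the function names are ours) so that they can be checked against the dna figures of Section 5, where k = 16:

```python
import math

def vma_upper(M, k):
    """Upper bound on the number of nodes in a VMA tree (Theorem 2)."""
    return M * k - k * k / 2 + k / (k - 1)

def vma_lower(M, k):
    """Lower bound on the number of nodes in a VMA tree (Theorem 3)."""
    return M * math.log(k + 1) - M / 2 + 1

def reduced_upper(M, k):
    """Upper bound on the nodes saved by the integration (Theorem 4)."""
    return (k * M * (k - 0.5) + 0.5 * (k - 6) * (k + 1)
            - k * M * math.log(k + 1)) / (k - 1)

def reduced_lower(k):
    """Lower bound on the nodes saved by the integration (Theorem 5)."""
    return k * k / 2 + 5 * k / 2 - 7
```

For dna (k = 16), Theorem 5 gives 161 regardless of M, and for codeword length 8 (M = 256) Theorem 4 evaluates to about 3465; both agree with the bounds reported in Table 8.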
Theorem 2. An upper bound of the total number of nodes in the integrated parse tree of an AIVF code is $Mk - \frac{1}{2}k^2 + \frac{k}{k-1}$, where M and k are the number of codewords and the alphabet size, respectively.

Proof. The VMA tree can be viewed as a joint tree that consists of several subtrees of the AIVF trees: the leftmost subtrees $S^{(i)}_{i+1}$ for i = 0, ..., k − 3 and the subtrees $S^{(k-2)}_{k-1}$ and $S^{(k-2)}_k$ of tree $T_{k-2}$. Subtree $S^{(i)}_{i+1}$ has at most $M - (k - i - 1)$ codewords, because each of the other subtrees has at least one codeword. A k-ary tree with L leaves has at most $\frac{L-1}{k-1}$ complete internal nodes. Therefore, the upper bound of the number of nodes in the VMA tree is
(Algorithm 4, continued)
22: Add k − d(n) children $n_j$, j = d(n)+1, ..., k, to n.
23: Tn($n_j$) ← cod(first($n_j$)) − 1.
Procedure FindOptPos()
24: $n \leftarrow \arg\max_{n \in N(T)} \Pr(n) \cdot \Pr(a_{d(n)+1})$.
25: Add the (d(n)+1)-th child n′ of n.
26: Tn(n′) ← cod(first(n′)) − 1.
$$\sum_{i=0}^{k-3}\left(\#C\!\left(S^{(i)}_{i+1}\right) + \#N\!\left(S^{(i)}_{i+1}\right)\right) + \#C\!\left(S^{(k-2)}_{k-1}\right) + \#C\!\left(S^{(k-2)}_{k}\right) + \#N\!\left(S^{(k-2)}_{k-1}\right) + \#N\!\left(S^{(k-2)}_{k}\right) + 1$$
$$\le \sum_{i=0}^{k-3}\left(M - (k-i-1) + \frac{M-(k-i-1)-1}{k-1}\right) + \frac{M-1}{k-1} + 1 + M = Mk - \frac{1}{2}k^2 + \frac{k}{k-1}. \qquad \Box$$
Theorem 3. A lower bound of the total number of nodes in the integrated parse tree of an AIVF code is $M\ln(k+1) - \frac{M}{2} + 1$, where M and k are the number of codewords and the alphabet size, respectively.
Proof. Since the symbols $a_1, \ldots, a_k$ are sorted in descending order of their occurrence probabilities, we have $\#N(S^{(i)}_j) \ge \#N(S^{(i)}_{j+1})$. Therefore, we have $\#N(S^{(i)}_{i+1}) \ge \frac{M}{k-i}$. Hence, the lower bound of the number of nodes in the VMA tree is
$$\sum_{i=0}^{k-3} \#N\!\left(S^{(i)}_{i+1}\right) + \#N\!\left(S^{(k-2)}_{k-1}\right) + \#N\!\left(S^{(k-2)}_{k}\right) + 1 \ge \sum_{i=0}^{k-3}\frac{M}{k-i} + M + 1 \ge M\int_{1}^{k+1}\frac{1}{i}\,di - \frac{M}{2} + 1 = M\ln(k+1) - \frac{M}{2} + 1. \qquad \Box$$

Theorem 4. An upper bound of the number of nodes reduced by the integration is $\frac{kM(k-\frac{1}{2}) + \frac{1}{2}(k-6)(k+1) - kM\ln(k+1)}{k-1}$, where M and k are the number of codewords and the alphabet size, respectively.
Table 1
Experimental text files.

Texts           Size (byte)   |Σ|          Content
english.300MB                 225   0.5655  English document
dna                           16    0.2478  DNA sequence
Proof. Subtrees $S^{(0)}_2, \ldots, S^{(0)}_k$, $S^{(1)}_3, \ldots, S^{(1)}_k$, $\ldots$, $S^{(k-3)}_{k-1}$, $S^{(k-3)}_k$ are removed by the integration. Analogously to the proof of Theorem 3, the upper bound is
$$\sum_{i=0}^{k-3}\sum_{j=i+2}^{k}\left(\#N\!\left(S^{(i)}_j\right) + \#C\!\left(S^{(i)}_j\right)\right) + k - 2 \le \frac{kM(k-\frac{1}{2}) + \frac{1}{2}(k-6)(k+1) - kM\ln(k+1)}{k-1}. \qquad \Box$$
Theorem 5. A lower bound of the number of nodes reduced by the integration is $\frac{1}{2}k^2 + \frac{5}{2}k - 7$, where k is the alphabet size.

Proof. Analogously to the proofs of Theorem 2 and Theorem 4, the lower bound is
$$\sum_{i=0}^{k-3}\sum_{j=i+2}^{k}\left(\#N\!\left(S^{(i)}_j\right) + 1\right) \ge \sum_{i=0}^{k-3}(k-i+2) = \frac{1}{2}k^2 + \frac{5}{2}k - 7. \qquad \Box$$
5. Experiments
We implemented Tunstall code, AIVF code, and our proposed method; we abbreviate these programs as Tunstall, AIVF, and VMA, respectively. All the programs we used are written in C++ and compiled with GNU g++, version 4.6. We embedded the information of the tree structures into the compressed texts by balanced parentheses [10]. We ran our experiments on a workstation equipped with an Intel Xeon(R) 3.00 GHz CPU and 12 GB RAM, running Ubuntu 12.04. We used dna and english.300MB, which is the first 300 MB of english from the Pizza&Chili Corpus. For details, please refer to Table 1.
First, we measured the compression ratios, compression times, and decompression times of Tunstall, AIVF, and VMA. We measured 100 × (compressed file size)/(original file size) as the compression ratio. Table 2 shows the compression ratios. The compression ratios of AIVF and VMA are better than that of Tunstall on english.300MB and dna. Since AIVF and VMA essentially perform the same compression, their compression ratios are almost the same; the slight differences are caused by the differences in the sizes of the parse trees. The compression times are shown in Table 3. The compression of VMA and AIVF is slower than that of Tunstall, because VMA and AIVF create more nodes than Tunstall does. In addition, the compression of VMA is faster than that of AIVF where the codeword length is long, because VMA creates fewer nodes than AIVF does. The improvement in compression time is larger on english.300MB than on dna. AIVF constructs k − 1 parse trees, where k is the alphabet size; therefore, VMA saves many nodes when a corpus with a large alphabet is compressed. Although the number of nodes in the parse tree grows exponentially as the codeword length increases, the compression time of Tunstall does not seem to slow down. The reason is considered to be that the total I/O time is reduced because the total amount of data decreases as the codeword length increases. Next, the decompression times are shown in Table 4. The decompression of VMA and AIVF is slower than that of Tunstall. Since the number of nodes in the parse trees of VMA and AIVF is larger than that of Tunstall, the reconstruction of the parse trees of VMA and AIVF is slower than that of Tunstall. Moreover, the decompression of VMA is slightly slower than that of AIVF, because VMA needs extra operations to compute the degree of a node by comparing the minimum tree numbers of its children with the current tree number while decoding. The numbers of nodes of AIVF and VMA and their upper and lower bounds in the VMA tree are
Table 2
Compression ratios of Tunstall, AIVF, and VMA in percentage.

Codeword   english.300MB                dna
length     Tunstall   AIVF    VMA      Tunstall   AIVF    VMA
8          100.00     79.52   79.52    35.22      26.29   26.29
9           94.19     63.09   63.08    34.23      26.11   26.11
10          93.18     59.98   59.96    32.74      25.91   25.91
11          83.81     59.06   59.03    32.10      25.75   25.75
12          81.24     58.99   58.95    31.09      25.70   25.71
13          78.55     58.64   58.55    30.54      25.64   25.65
14          76.76     58.56   58.38    30.09      25.57   25.59
15          74.37     58.72   58.36    29.65      25.48   25.53
16          73.82     59.20   58.49    29.26      25.52   25.61
Table 3
Compression times of Tunstall, AIVF, and VMA in seconds.

Codeword   english.300MB                   dna
length     Tunstall   AIVF      VMA       Tunstall   AIVF     VMA
8          19.54      20.26     28.90     18.54      18.45    20.31
9          20.49      20.23     28.26     17.21      18.36    20.42
10         19.65      22.35     29.34     16.62      18.84    21.54
11         19.63      26.31     32.33     17.19      19.22    21.01
12         17.29      34.44     33.17     17.34      20.06    21.37
13         18.50      63.16     39.61     17.21      22.31    23.27
14         19.03      160.71    60.06     16.83      32.48    30.41
15         18.19      642.42    143.83    16.52      96.45    81.04
16         17.69      5241.48   838.33    16.05      774.81   483.77
Table 4
Decompression times of Tunstall, AIVF, and VMA in seconds.

Codeword   english.300MB               dna
length     Tunstall   AIVF    VMA     Tunstall   AIVF    VMA
8          17.15      15.61   18.42   8.44       14.40   15.15
9          15.40      14.54   17.29   7.77       14.35   15.11
10         14.36      14.65   17.41   6.90       14.14   14.88
11         12.71      15.14   17.62   6.89       14.59   15.05
12         11.37      15.82   17.94   6.20       14.30   14.93
13         10.90      18.52   19.08   6.04       14.37   15.72
14         9.70       19.53   20.72   5.51       15.16   15.96
15         9.14       23.40   27.77   5.57       16.89   20.36
16         11.78      27.49   35.15   5.86       20.21   25.11
Table 5
The number of nodes of AIVF and VMA and its upper and lower bounds in the VMA tree for english.300MB.

Codeword length   AIVF         VMA
8                 57 568       3849
9                 114 912      8210
10                229 600      18 239
11                458 976      38 420
12                917 728      77 470
13                1 835 232    152 665
14                3 670 240    305 362
15                7 340 256    614 054
16                14 680 288   1 238 041
shown in Tables 5 and 6. VMA considerably reduces the number of nodes, especially on the English text. Finally, the numbers of nodes reduced by the integration and their upper and lower bounds are shown in Tables 7 and 8.
Table 6
The number of nodes of AIVF and VMA and its upper and lower bounds in the VMA tree for dna.

Codeword length   AIVF      VMA
8                 3855      1913
9                 7695      3949
10                15 375    7662
11                30 735    15 851
12                61 455    30 605
13                122 895   62 510
14                245 775   123 947
15                491 535   251 769
16                983 055   490 842
Table 7
The number of nodes reduced by integration and its upper and lower bounds for english.300MB.

Table 8
The number of nodes reduced by integration and its upper and lower bounds for dna.

Codeword length   Reduced   Upper bound   Lower bound
8                 1942      3465          161
9                 4746      6923          161
10                7713      13 841        161
11                14 884    27 677        161
12                30 850    55 348        161
13                60 385    110 690       161
14                121 828   221 374       161
15                239 766   442 742       161
16                492 213   885 478       161
We also compared eight compression algorithms: Tunstall, AIVF, VMA, STVF,⁶ v2vdc [3], gzip, bzip2, and LZMA. We used the default options for gzip, bzip2, and LZMA, and selected 14 as the codeword length of the VF codes. The results are shown in Table 9. It should be noted that v2vdc is a word-based compression method and is therefore not effective for DNA data; we represent this by N/A in Table 9. The compression ratios of VMA, AIVF, and Tunstall are worse than those of v2vdc, gzip, and bzip2 for the English text, because they do not assume a Markov source. On the other hand, VMA and AIVF are rather good for the DNA text. V2vdc takes a long time to compress the English text because it constructs suffix arrays. Although the compression and decompression speeds of VMA are slower than those of bzip2 for the English text, those of VMA for the DNA text are faster than those of bzip2. The reason is that our implementations of VMA, AIVF, and Tunstall take O(|Σ|) time to traverse the parse tree for a character.
Finally, we measured the performance of compressed pattern matching algorithms. We compared VMA, AIVF, Tunstall, gzip, and v2vdc. We used UNIX zgrep for pattern matching on text compressed with gzip. We chose patterns with lengths of 5–25 characters in the text, measured the pattern matching times for 5 patterns of each length, and calculated the average. Tables 10 and 11 list the matching times. We set 12 as the codeword length of the VF codes. The leftmost column indicates the pattern length. We indicate the results of v2vdc for dna as N/A because it is not applicable, that is, not effective for the data, as mentioned above. The experimental results show that pattern matching on VF codes is superior to that on gzip. Although v2vdc has the advantage for natural language texts like english.300MB, it is not directly applicable to texts without word separators, like a DNA sequence.
⁶ STVF does not work on several hundred megabytes of text in our experimental environment due to lack of memory. We therefore divided the input text into hundred-megabyte pieces and measured the compressed file size and the compression and decompression times as the summation of the compressed file sizes and the processing times of the pieces, respectively.
Table 9
Experimental results compared with variable-length encoding methods. The codeword length of VF codes is fixed to 14.

           Compression ratios (%)      Compression times (s)       Decompression times (s)
           english.300MB   dna         english.300MB   dna         english.300MB   dna
VMA        58.38           25.59       60.06           30.41       20.72           15.96
AIVF       58.56           25.57       160.71          32.48       19.53           15.16
Tunstall   76.76           30.09       19.03           16.83       9.70            5.51
STVF       56.74           24.76       1175.61         672.42      7.54            4.75
v2vdc      27.07           N/A         6266.11         N/A         4.95            N/A
gzip       37.82           28.12       23.60           49.87       2.79            3.33
bzip2      28.03           25.76       38.10           52.51       13.00           20.98
LZMA       24.98           22.73       345.05          730.48      5.72            7.66
Table 10
Pattern matching times for english.300MB in seconds. The codeword length of VF codes is 12.

Pattern length   Tunstall   AIVF   VMA    gzip   v2vdc
5                1.91       2.07   2.11   2.78   0.51
10               1.97       2.06   2.14   2.73   0.51
15               2.00       2.07   2.14   2.69   0.51
20               2.03       2.08   2.15   2.66   0.51
25               2.03       2.09   2.15   2.65   0.51
Table 11
Pattern matching times for dna in seconds. The codeword length of VF codes is 12.

Pattern length   Tunstall   AIVF   VMA    gzip   v2vdc
5                1.03       0.98   1.01   3.53   N/A
10               1.04       0.98   1.01   4.98   N/A
15               1.04       0.99   1.01   4.09   N/A
20               1.05       0.99   1.02   4.31   N/A
25               1.07       0.99   1.02   5.01   N/A
6. Conclusion
In this paper, we presented an efficient algorithm for AIVF codes, which integrates the multiple parse trees of an AIVF code into one, called a VMA tree, and simulates the encoding and decoding processes on it. We also estimated the number of nodes in the VMA tree. We conducted several experiments to evaluate the compression performance of our method. Although decompression using a VMA tree is slower than the original AIVF decoding, compression is faster than that of AIVF when using long codewords.
This method only works for memoryless sources, such as DNA, and cannot be extended to natural texts in which the probability of each character depends on the context (i.e., texts that are modeled well by higher-order Markov models); the reason is that, for higher-order Markov sources, the number of possible trees would be exponential in the alphabet size in the worst case.
Although the methods presented in [8,6] have better compression ratios than AIVF codes do, they need to construct suffix trees [5] in order to construct a parse tree, and thus they take a long time to do so. Therefore, from the viewpoint of speeding up compressed pattern matching, AIVF code with the VMA tree is also a strong candidate, because a parse tree usually has to be constructed in order to conduct compressed pattern matching on VF codes.
Acknowledgements
This work was supported by JSPS Research Fellowships for Young Scientists and JSPS KAKENHI Grant Numbers 24240021,
25240003, 24650062.
References
[1] N. Brisaboa, A. Fariña, G. Navarro, M. Esteller, (S, C)-dense coding: an optimized compression code for natural language text databases, in: Proceedings of the 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003), Lecture Notes in Computer Science, vol. 2857, Springer, 2003, pp. 122–136.
[2] N. Brisaboa, E. Iglesias, G. Navarro, J. Paramá, An efficient compression code for text databases, in: Proceedings of the 25th European Conference on IR Research (ECIR 2003), Lecture Notes in Computer Science, vol. 2633, Springer, 2003, pp. 468–481.
[3] N. Brisaboa, A. Fariña, J.-R. López, G. Navarro, E. López, A new searchable variable-to-variable compressor, in: Proceedings of the Data Compression Conference 2010 (DCC 2010), IEEE, 2010, pp. 199–208.
[4] N. Brisaboa, A. Fariña, G. Navarro, J. Paramá, Dynamic lightweight text compression, ACM Trans. Inf. Syst. 28 (3) (2010) 10:1–10:32.
[5] M. Crochemore, W. Rytter, Jewels of Stringology, World Scientific, 2002.
[6] T. Kida, Suffix tree based VF-coding for compressed pattern matching, in: Proceedings of the Data Compression Conference 2009 (DCC 2009), IEEE, 2009, p. 449.
[7] S. Klein, M. Ben-Nissan, Using Fibonacci compression codes as alternatives to dense codes, in: Proceedings of the Data Compression Conference 2008 (DCC 2008), IEEE, 2008, pp. 472–481.
[8] S. Klein, D. Shapira, Improved variable-to-fixed length codes, in: Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), Lecture Notes in Computer Science, vol. 5280, Springer, 2008, pp. 39–50.
[9] S. Maruyama, Y. Tanaka, H. Sakamoto, M. Takeda, Context-sensitive grammar transform: compression and pattern matching, in: Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), Lecture Notes in Computer Science, vol. 5280, Springer, 2008, pp. 27–38.
[10] I. Munro, V. Raman, S. Rao, Space efficient suffix trees, J. Algorithms 39 (2) (2001) 205–222.
[11] S. Savari, Variable-to-fixed length codes for predictable sources, in: Proceedings of the Data Compression Conference 1998 (DCC 98), IEEE, 1998, pp. 481–490.
[12] B. Tunstall, Synthesis of noiseless compression codes, Ph.D. thesis, Georgia Institute of Technology, Atlanta, GA, 1967.
[13] T. Uemura, S. Yoshida, T. Kida, An improvement of STVF code by almost instantaneous encoding, Tech. rep. TCS-TR-A-10-44, Hokkaido University, 2010.
[14] J. Verhoeff, A new data compression technique, Ann. Syst. Res. 6 (1978) 139–148.
[15] H. Yamamoto, H. Yokoo, Average-sense optimality and competitive optimality for almost instantaneous VF codes, IEEE Trans. Inf. Theory 47 (6) (2001) 2174–2184.
[16] S. Yoshida, T. Kida, An efficient algorithm for almost instantaneous VF code using multiplexed parse tree, in: Proceedings of the Data Compression Conference 2010 (DCC 2010), 2010, pp. 219–228.
[17] S. Yoshida, T. Kida, Analysis of multiplexed parse trees for almost instantaneous VF codes, in: Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics (IIAI-AAI 12), IEEE Computer Society, Washington, DC, USA, 2012, pp. 36–41.
[18] S. Yoshida, T. Uemura, T. Kida, T. Asai, S. Okamoto, Improving parse trees for efficient variable-to-fixed length codes, J. Inf. Process. 20 (1) (2012) 238–249.