String Similarity Search: A Hash-Based Approach: Hao Wei, Jeffrey Xu Yu, and Can Lu

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering

IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )

String Similarity Search: A Hash-Based


Approach
Hao Wei, Jeffrey Xu Yu, and Can Lu
The Chinese University of Hong Kong, Hong Kong
{hwei,yu,lucan}@se.cuhk.edu.hk

Abstract—String similarity search is a fundamental query that has been widely used for DNA sequencing, error-tolerant query
auto-completion, and data cleaning needed in database, data warehouse and data mining. In this paper, we study string similarity
search based on edit distance that is supported by many database management systems such as Oracle and PostgreSQL. Given the
edit distance, ed(s, t), between two strings, s and t, the string similarity search is to find every string t in a string database D which is
similar to a query string s such that ed(s, t) ≤ τ for a given threshold τ . In the literature, most existing work take a filter-and-verify
approach, where the filter step is introduced to reduce the high verification cost of two strings by utilizing an index built offline for D .
The two up-to-date approaches are prefix filtering and local filtering. In this paper, we study string similarity search where strings can
be either short or long. Our approach can support long strings, which are not well supported by the existing approaches due to the size
of the index built and the time to build such index. We propose two new hash-based labeling techniques, named OX label and XX label,
for string similarity search. We assign a hash-label, Hs , to a string s, and prune the dissimilar strings by comparing two hash-labels, Hs
and Ht , for two strings s and t in the filter step. The key idea behind is to take the dissimilar bit-patterns between two hash-labels. We
discuss our hash-based approaches, address their pruning power, and give the algorithms. Our hash-based approaches achieve high
efficiency, and keep its index size and index construction time one order of magnitude smaller than the existing approaches in our
experiment at the same time.

Index Terms—Similarity search, hash, edit distance

1 I NTRODUCTION The existing approaches need to build a large index when the
strings are longer in a dataset. The larger the index is, the more
Strings are widely used to represent a variety of textual data time it requires to process queries. Accordingly, the performance
including DNA sequences, messages, emails, product reviews, decreases. In addition, such performance will be affected by a
and documents. And there are a large number of string datasets large number of strings in a dataset. In this paper, we focus on
collected from various data sources in real applications. Due to new hash-based approaches to deal with large short/long string
the fact that string data from different sources may be inconsistent datasets.
caused by the typing mistakes or the differences in data formats,
as one of the most important fundamental tasks, string similarity Related Work: Most of the existing string similarity search
search has been extensively studied [1], [5], [7], [18], [26], [32], algorithms take a filter-and-verify approach. The filter step is
[33], [34], which checks whether two strings are similar enough introduced to reduce the verification cost of two strings, s and
for data cleaning purposes in databases, data warehousing, and t, which is costly when two strings are long. In order to find
data mining systems. The applications that need string similarity similar strings in a string dataset D for a given query string
search include fuzzy search [11], query auto-completion [29], and s with a threshold τ , they first prune strings, t, that cannot be
DNA sequencing [13]. In the literature, there are two categories possibly similar with s such that ed(s, t) > τ using an index
to measure string similarity. One is token-based similarity metric built offline for D in the filter step, and then verify those strings
including overlap, Jaccard, cosine and dice [12], [31], and the that are possibly similar one by one in the verification step. The
other is the character-based similarity metric including edit dis- performance of an approach is measured by the query cost and the
tance [9]. The edit distance, ed(s, t), between two strings, s and t, index cost. The query cost is the sum of the filter cost (the total
is the minimum number of operations (substitution, insertion, and running time in the filter step) and the verification cost (the total
deletion) required to transform one string to another string. In this running time in the verification step). The index cost is the index
paper, we focus on string similarity search based on edit distance, construction time and the index space needed.
which is supported in many database management systems such To efficiently process string similarity search, the existing
as Oracle, PostgreSQL and Lucene [7], and is used in detecting work attempts to prune strings in D as many as possible based on
spammers [22] and DNA sequence alignment [17]. the index built offline. Almost all the existing work needs to know
The challenge of the string similarity search is to design the edit distance threshold τ beforehand, in order to construct the
an effective index that can achieve high efficiency for query index for a string dataset D, except for BitTree [33]. Behm et al.
processing with small overhead in the index size and the indexing propose a hierarchical structure containing different filters, e.g.,
time. It becomes more challenge in the age of big data since the length and charsum filter, in Flamingo package [2]. Gravano et
the string datasets become increasingly large from two aspects, al. propose to partition a string into a set of q -grams [9] and prune
the lengths of the strings and the number of strings insides. a string pair (s, t) that have less than a certain number of common

1
1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 2

q -grams. The chunk-based approaches share the similar idea but It cannot be easily handled by the prefix filtering approaches. For
partition the string using disjoint q -grams, called chunk [23], [28]. the local filtering approach, the BitTree index will be extremely
Instead of using fixed-length q -grams, Li et al. selectively choose large to be stored and it is time consuming to construct such an
high-quality grams of variable length in index construction [15]. index.
Below, we introduce two up-to-date approaches, namely, prefix Different from the existing work in the literature, we propose
filtering and local filtering. First, the prefix filtering approaches new hash-based labeling for string similar search. Let Hs and
compute a prefix set for every string for pruning. All q -grams Ht be two hash-labels for strings, s and t. We show that s and
of the string are sorted according to an universal sorted ordering t are definitely dissimilar for a given τ using Hs and Ht . We
based on frequency and the prefix set consists of the first (qτ + 1) propose two hash-based approaches, namely OX label and XX
q -grams [6]. It is known that the intersection of two strings’ prefix label. Both are in the scheme of (~, ℵ, ℘, #). Here, ~ and ℵ are
sets cannot be empty if the edit distance between them is not larger two functions to create a hash-label Hs for a string s, and ℘ and
than τ , since one edit operation affects at most q -grams. Xiao et # are two functions to compare two hash-labels, Hs and Ht for
al. optimize the prefix filtering by taking both positional filtering two strings, s and t. The key idea behind is to take the dissimilar
and content filtering into consideration [30]. Wang et al. further bit-patterns between two hash-labels. We discuss our hash-based
extend the idea by building the prefix set of size qτ + γ + 1, approaches, address their pruning power, and give the algorithms.
and they prune the dissimilar string pairs if the intersection size New optimizations to the verification algorithm are proposed for
of two prefix sets is not larger than γ [26]. Since the prefix length efficiently verifying whether a candidate string is an answer. We
has impacts on the performance, they choose the prefix length have conducted extensive performance studies and confirm the
adaptively at the expense of storing the entire q -gram set with efficiency of our hash-based approaches in both datasets of long
high space consumption. Deng et al. select τ + 1 disjoint q -grams strings and datasets of short strings with much smaller index size.
from the original prefix set to build an additional pivotal prefix set The paper organization: The remainder of the paper is orga-
which can be used to save the filter cost [7]. Wang et al. propose nized as follows. We discuss the filter-and-verify approaches in
to index k -wise signatures which consist of the combinations of Section 2, and highlight two new hash-based approaches for string
k q -grams in the prefix set [27]. It is worth noting that the prefix similarity search in Section 3. We will discuss hash-based labels
filtering approaches need to know τ , in order to decide the size in Section 4 and design the algorithms for index construction,
of the prefix set before the index construction. In string similarity pruning, and verification, in Section 5. We report our experimental
search, users cannot select a threshold larger than the τ used in study in Section 6 and conclude our work in Section 7.
index construction.
Second, the local filtering is studied in [33]. Yang et al. propose 2 T HE F ILTER & V ERIFY A PPROACH
the concept of local distance and prove that the edit distance Given a string s on an alphabet Γ of |Γ| characters, we use s[i] to
ed(s, t) will not be larger than the sum of the local distances denote its i-th character and s[i, j] to denote the substring from its
between every disjoint chunk of the query string s and the string i-th character to its j -th character. Let the length of the string s be
t in the dataset. Based on this property, they build a BitTree to |s|. The index i of s[i] starts from 1 and is in the range of [1, |s|],
store the local distances between every string in the dataset and for a string s. Given two strings s and t, the string edit distance,
every possible chunk of size smaller than a pre-defined qmax . denoted as ed(s, t), is the minimum number of steps transforming
Users can use different τ for string similarity search but it will s to t by the 3 edit operations: insertion, deletion and substitution.
increaseP the overhead in query time. The index size of BitTree is For example, suppose s = “effcionsy” and t = “efficiency”, we
O(n·l· qi=1 max
|Γ|i ), where n denotes the number of strings in the have ed(s, t) = 3 by replacing ‘o’ in s[6] by ‘e’, replacing ‘s’ in
dataset, l is the string length and Γ is the alphabet of characters. s[8] by ‘c’, inserting ‘i’ between ‘f’ and ‘c’. We give the string
To process string similarity search efficiently, the size of BitTree similarity search problem below.
needs to be very large. And therefore the index construct time
becomes very large. String Similarity Search: Given a string dataset D of n strings, a
query string s and an edit distance threshold τ , the string similarity
Besides the filter-and-verify approaches, the trie-based ap-
search problem is to find all strings t ∈ D such that ed(s, t) ≤ τ .
proach is proposed in [8], [25] which indexes a string dataset by a
A well-known algorithm to compute the edit distance between
trie. The edit distance is computed incrementally while traversing
two strings s and t is to fill an edit distance matrix of size
the trie, and the computation is shared among the strings with
(|s| + 1) × (|t| + 1) using dynamic programming. However, it
a common prefix. The trie-based approach is mainly used for
requires O(|s| · |t|) time complexity which is costly for long
string similarity join query, and it is designed for short strings
strings. The filter-and-verify framework adopted by the existing
as discussed in [33].
work builds an index to prune the dissimilar strings of the query
Main contributions: In this paper, we study string similarity string in the dataset D in the filter step, and verifies the remaining
search, when the query string s and the average string t in D candidates to get the real result in the verification step. The filter
can be long. The up-to-date approaches cannot efficiently process step is important to reduce the cost of computing the edit distance
long string similarity search for the following main reasons. For between two strings, by pruning the strings that cannot be possibly
the prefix filtering approaches, the main idea is to use a small in the final results as many as possible using the index built offline.
number of q -grams for filtering. When strings become long, the There are some simple heuristics that can be applied in the filter
pruning power of such a small number of q -grams will reduce step. The length-filter is such an example, which prunes the string
significantly. In addition, the prefix filtering approaches need to t if ||s| − |t|| > τ . The index built will further prune strings that
know τ before the index construction. However, when the average cannot be simply pruned by such simple heuristics.
strings become long, users want to use different τ for string similar To build an index for filtering, a common technique is based
search: a small τ for short strings and a large τ for long strings. on q -gram, where a q -gram of string s is a substring of length q in

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE TRANSACTIONS ON KNOWLEDGE
IEEE Transactions AND DATA
on Knowledge ENGINEERING,
and Data VOL.XXX,
Engineering NO. XXX,
( Volume: 30,2017
Issue: 1, Jan. 1 2018 ) 3

s. In this paper, we pad every string with q − 1 delimited character 1e+6 6000

‘$’ for $ 6∈ Γ, and hence the q -gram set of a string s, denoted as 5000

candidate number
1e+5

Index Size (MB)


Qs , contains all its |s|+ q − 1 substrings of length q . For example, 1e+4
4000

the 2-gram set of string s = “efficiency” is Qs = {$e, ef, ff, fi, ic, 3000
1e+3
ci, ie, en, nc, cy, y$}. It is worth noting that a q -gram set, Qs , of a 2000
1e+2
string s is a multiset, where the same q -gram at different positions 1000

are included. For two strings s and t such that ed(s, t) = τ , the 1e+1
36 40 44 48 52 |s|
0
36 40 44 48 52 |s|

number of the common q -grams between their q -gram sets, Qs prefix size prefix size

and Qt , should be at least max{|s|, |t|} + q − 1 − qτ [9], as (a) Candidate Number (b) Index Size
shown in Eq. (1), as a pruning condition. Fig. 1. Pruning ability with different prefix size

|Qs ∩ Qt | ≥ max{|s|, |t|} + q − 1 − qτ (1) local distance to the q -gram represented by the tree node. Every
tree node has |Γ| children nodes, where each of them represents
Here, the result of Qs ∩ Qt is a multiset since both Qs and Qt
one of the next possiblePcharacter in the q -gram. Therefore, the
are multisets, and |Q| is the number of elements in the multiset qmax
index space is O(n · l · i=1 |Γ|i ) which is too large especially
Q. By Eq. (1), it computes |Qs ∩ Qt | in the filter step, in order
for datasets of long strings. The index construction is also very
to avoid computing ed(s, t) in the verification step. But, since the
time consuming due to many local distance computation.
two q -gram sets are at the same scale of the string lengths, the set
intersection between their q -gram sets, Qs ∩Qt , takes O(|s|+ |t|)
time complexity, which is not efficient for long strings. To avoid 3 N EW H ASH -BASED A PPROACHES
storing the whole q -gram sets, most of the existing q -gram based We study the string similar search problem where every string
approaches select some q -grams as the representatives and build with edit distance ≤ τ to the query string should be returned as an
an index based on the q -grams selected. But the q -grams that are answer. By hash-based, we mean that we assign a hash-label to a
not selected for the index construction can be used to enhance the string, and prune strings by their hash-labels. Here, the hash-label
pruning ability in the filter step as shown below. Prefix filtering [6] is a binary vector (or a bit sequence) of fixed length L for string
and its variants [7], [26], [27] are the up-to-date approaches. They in any length. We denote the hash-label for a string s as Hs , and
firstly compute an universal ordering of all q -grams by q -grams’ denote the i-th bit of the hash-label as Hs [i]. For simplicity, we
frequencies in string dataset and sort the q -gram set of each string may use H and H[i] in the following.
by this ordering. The prefix set for each string is the set of the first First, we show that the existing hash-based approaches cannot
q -grams in the sorted q -gram set, and is used in index construction. be directly applied to filter strings. Given a hash function that
Two similar strings should share a certain number of common q - hashes a string s into a number (or a bit-sequence), Hs . The
grams between their prefix sets. problem on a hash-based approach is whether it can prune a string
The issue here is to balance the pruning power in the filter step t, for a given query string s, based on Hs and Ht . The condition
and the cost to pay for such pruning power. The cost includes the that must be held for pruning is that the distance between Hs
indexing construction cost, the index space, and the filtering cost. and Ht reflects the distance ed(s, t) between two strings, s and
Below, we show that a larger prefix set containing more q -grams t. There are two possible candidates, Bloom-Filter [3] and LSH
can have higher pruning ability in Fig. 1(a). In Fig. 1(a), the x-axis (Locality Sensitive Hashing) [4]. Bloom-Filter is a probabilistic
is the size of every string’s prefix set in the dataset reads by setting data structure designed to check whether an element belongs to a
q = 5 and τ = 6, and the y-axis is the candidate number, which is certain set. In brief, suppose D is a set of strings, Bloom-Filter
the number of strings that cannot be pruned in the filter step. The uses a binary vector of length L to represent D by hashing every
smaller the candidate number the higher pruning power. When string in D into the binary vector in the range of [1, L]. For a
the prefix set is the whole q -gram set, marked as |s|, it has the given query string, s, Bloom-Filter can tell that the string s is
highest pruning power because all q -grams are utilized in the index either definitely not in D or possibly in D. Obviously, Bloom-
construction. Such result implies that it is best to use the whole q - Filter cannot be used for pruning strings, since it is to check
gram sets where possible if the cost can be significantly reduced. membership, and it cannot preserve the distance between strings.
But the index size of the prefix set increases significantly as the LSH, as its name implies, has been used for answering many
prefix size increases, as shown in Fig. 1(b). In practice, there is a similarity search problems, e.g., approximate nearest neighbor
trade-off between the number of q -grams selected in the prefix set problem [10] and measuring Jaccard set similarity [20]. Assuming
and the pruning ability. For the existing prefix filtering approaches, LSH (·) is a hash function to be used in a LSH. If ed(s, t) ≤ τ ,
a limited number of q -grams selected for the prefix set will make LSH ensures that LSH (s) = LSH (t) with a probability at least p.
index size small on one hand, but reduce the pruning ability of However, LSH cannot be used for pruning strings for a given query
the prefix index on the other hand. There are other attempts to use string, because the probability p given in LSH cannot be 100%.
the whole q -gram sets. Yang et al. propose the concept of local In other words, in the filter step, we are required to filter those
distance based on which the local filtering approach are designed. strings that cannot possibly be an answer string. If the probability
The local filtering approach [33] attempts to index as many q - p cannot be 100%, some strings that should be possible answers
grams as possible by storing the local distances between the string may be pruned.
and every possible q -gram of length smaller than a parameter We propose two novel hash-based approaches for pruning us-
qmax . It builds a BitTree index of qmax levels where each tree ing the whole q -gram sets without high overhead. Both approaches
node represents one possible q -gram of length smaller than qmax . are in the scheme of (~, ℵ, ℘, #). Here, ~ and ℵ are two functions
There are l bit vectors of size n in every tree node, where l is the to create a hash-label Hs for a string s, and ℘ and # are two
string length and n is the number of strings in the dataset, and functions to compare two hash-labels, Hs and Ht for two strings,
every string in the dataset takes a bit in the bit vector to store its s and t. We explain the four functions below. Consider a string s.

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE
IEEE TransactionsONonKNOWLEDGE
TRANSACTIONS KnowledgeAND
andDATA
DataENGINEERING,
Engineering VOL.XXX,
( Volume: 30,
NO. Issue:
XXX, 20171, Jan. 1 2018 ) 4
qj in Qs $e ef ff fc ci io on ns sy y$
Let its whole q -gram set be Qs = {q1 , q2 , · · · } where qi is the ~(qj ) 7 9 14 5 10 5 20 2 14 3
i-th q -gram starting from the position of s[i] in the length q , and qj in Qt $e ef ff fi ic ci ie en nc cy y$
let the hash-label Hs for a string s be initialized as all zeros, e.g., ~(qj ) 7 9 14 1 16 10 6 4 18 13 3
Hs [i] = 0 for 1 ≤ i ≤ L. TABLE 1
Hash values of q -grams in Qs and Qt
• ~(·) is a hash function, which takes a q -gram qj from a q -
gram set Qs as input, and hashes it to the i-th bit position OX(s) 0 1 1 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 1
in the hash-label Hs , if i = ~(qj ). OX(t) 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 1 0 0

• ℵ(·) is a function, which takes a bit position i in the range 1 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1
of [1, L], and updates the corresponding bit in the position (a) OX hash-labels
specified, such as H[i] ← H[i] ℵ 1.
XX(s) 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1
• ℘(·, ·) is a function which takes two hash-labels, Hs and XX(t) 1 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 0 1 0 0

Ht , and returns a hash-label, H, by applying ℘ to every pair 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1
of bits, such that H[i] ← Hs [i] ℘ Ht [i], for 1 ≤ i ≤ L.
(b) XX hash-labels
• #(·) is a function which takes a hash-label as input, and Fig. 2. Two different hash-labels for the two strings, s and t
counts the number of 1’s in the hash-label.
the number of different hashed bits in the two hash-labels, instead
Hash-label creation: Let the hash-label Hs for a string s be of counting the number of the common hashed bits.
initialized as all zeros. By both ~ and ℵ, a hash-label Hs for a We discuss the hash-label creation. Assume both OX and XX
string s is obtained by ℵ(~(qi )) for every q -gram qi in the q -gram label use the same hash function ~. Suppose the hash function ~
set Qs = {q1 , q2 , · · · } for the string s. hashes two q -grams, qk and ql , to the same i such that i = ~(qk )
Hash-labels comparison: Given two hash tables, Hs and Ht , by and i = ~(ql ). OX label uses ∨ to update the bit by H[i] ←
both ℘ and #, #(℘(Hs , Ht )) returns a number. In our approaches, H[i] ∨ 1, which is equivalent to H[i] ← 1, where the original bit
we show that a string can be pruned, if the number returned by value at H[i] does not matter. The result of any update caused by
#(℘(Hs , Ht )) satisfies some condition. either qk or ql or both is the same, H[i] = 1. On the other hand,
An acute reader may find that, following the idea given in XX label uses ⊕ to update the bit by H[i] ← H[i] ⊕ 1, where the
Eq. (1), a modified Bloom-Filter can be specified using the scheme original bit value at H[i] does matter. Assume H[i] = 0 initially,
of (~, ∨, ∧, #). By a hash function ~ and a logic-or function ∨, and we hash qk followed by ql . By hashing qk , the bit-value at
the modified Bloom-Filter hashes a q -gram, qj , into the i = ~(qj ) H[i] becomes H[i] = 1, because H[i] ⊕ 1 = 0 ⊕ 1 = 1. Then,
position in the hash-label and set the bit to be 1 such that H[i] ← by hashing ql , the bit-value at H[i] becomes H[i] = 0, because
H[i] ∨ 1. Suppose the two hash-labels for the two strings, s and H[i] ⊕ 1 = 1 ⊕ 1 = 0. The positional information of the q -grams
t, are Hs and Ht . The logic-and function ∧(Hs , Ht ) returns a is not considered in hash-label creation but the numbers of the
hash-label, H, such that H[i] = 1 if both Hs [i] = 1 and Ht [i] = q -grams’ appearances are utilized. As can be observed from the
1, otherwise H[i] = 0. Then the count function, #(H), counts examples, OX label and XX label take different approaches to deal
how many bits are 1. Note that ∧(Hs , Ht ) is like Qs ∩ Qt , and with hash collision, e.g., i = ~(qk ) = ~(ql ). We will discuss it
#(∧(Hs , Ht )) is like |Qs ∩Qt |. In other words, the two functions later in detail.
are to count how many common hashed bits in the two hash-labels. Next, we discuss the hash-label comparison. Both OX and XX
The question is whether the modified Bloom-Filter functions can label use the same two functions, ⊕ and #, for the hash-label
prune strings by using the number of common hashed bits. To comparison. Here, ⊕(·, ·) takes two hash-labels, Hs and Ht , and
be more specific, since the number of the common hashed bits returns a hash-label H by applying ⊕ to every pair of bits, such
is related to |Qs ∩ Qt | based on the analysis above, it seems it that H[i] ← Hs [i] ⊕ Ht [i], for 1 ≤ i ≤ L. And #(·) counts the
is possible to prune the strings when the number of the common number of 1’s in the input hash-label. Given two hash tables, Hs
hashed bits is smaller than max{|s|, |t|}+q −1−qτ , like Eq. (1). and Ht , by both ⊕ and #, #(⊕(Hs , Ht )) returns a number. In
We show it is difficult in general due to the existence of hash this paper, for two strings s and t, we prove that for ed(s, t) ≤ τ ,
collisions using an example below. Suppose we need to compare it must be true #(⊕(Hs , Ht )) ≤ 2qτ − ||s| − |t||. We will discuss
a data string s against a query string s, where both are the same. it in detail in Section 4.
In the extreme case, every q -gram in Qs may be hashed to the Example 3.1: Consider two strings s = “effcionsy” and t = “effi-
same bit position in the hash-label Hs , such that there is only ciency”. Assume the edit distance threshold τ used for ed(s, t) is
one bit with its value to be 1 in Hs , and other bits are 0. Then, τ = 2, and q = 2 in this example. The 2-gram set of string s is
by #(∧(Hs , Hs )), the number of common hashed bits is 1. It is Qs = {$e, ef, ff, fc, ci, io, on, ns, sy, y$} and the 2-gram set of
most likely that the data string s will be pruned by such modified string t is Qt = {$e, ef, ff, fi, ic, ci, ie, en, nc, cy y$}. Suppose the
Bloom-Filter. hash-labels used are of fixed size for L = 20, and there is a hash
The two hash-based schemas we propose are, OX = function ~, which hashes a 2-gram in the range of [1,20]. Table 1
(~, ∨, ⊕, #) and XX = (~, ⊕, ⊕, #). Here ∨ is for logic-or and shows the hash values of the q -grams for Qs and Qt . Note that
⊕ is for exclusive-or. Recall an exclusive-or (⊕) results 0 if its there are hash collisions here. Both ‘ff’ and ‘sy’ are hashed into
two bit inputs are the same, and 1 otherwise. The two schemas 14-th bit position, and both ‘fc’ and ‘io’ are hashed into 5-th bit
are different in hash-label creation, where OX label uses ~ and position in Fig. 2.
∨ and XX label uses ~ and ⊕. The two schemas are the same in Fig. 2(a) shows the OX hash-labels, Hs and Ht , for s and t, and
hash-label comparison for pruning, because both use ⊕ and #. the resulting bit-sequence H by exclusive-or (⊕) of the two hash-
The key idea behind for hash-label comparison is that we count labels. The number of different bits in H is 9. We can conclude that

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE
IEEE TransactionsON
TRANSACTIONS onKNOWLEDGE
KnowledgeAND
andDATA
DataENGINEERING,
Engineering VOL.XXX,
( Volume:NO.
30, Issue:
XXX, 20171, Jan. 1 2018 ) 5

the two strings are not similar. This is because #(⊕(Hs , Ht )) ≤ q -grams in Qs ∆Qt are possible to set bits to 1 in Hs ⊕ Ht . Based
2qτ − ||s| − |t|| does not hold, for #(⊕(Hs , Ht )) = 9 and 2qτ − on Lemma 4.1, with the consideration of possible hash collision,
||s| − |t|| = 2 · 2 · 2 − |9 − 10| = 7. we have #(⊕(Hs , Ht )) ≤ |Qs ∆Qt | ≤ 2qτ − ||s| − |t||. ✷
Fig. 2(b) shows the XX hash-labels Hs and Ht for s and t, and
the resulting bit-sequence H by exclusive-or (⊕) of the two hash- Theorem 4.2: Let s and t be two strings such that ed(s, t) ≤
labels. Here, the hash-label Hs by XX and the hash-label Hs by τ . By the scheme of XX = (~, ⊕, ⊕, #), it must be true that
OX are different at the 5-th bit and the 14-th bit. This is because #(⊕(Hs , Ht )) ≤ 2qτ − ||s| − |t||.
there are two hash collisions in Qs . Initially, both the 5-th bit and Proof: Let Qu be the multiset containing all q -grams from Qs
the 14-th bit are zero. Take the 5-bit as an example. When “fc” is and Qt such as Qu = Qs ⊎ Qt . Suppose qj is a common q -gram
hashed into this position, the bit becomes 1 by ⊕. Then, when “io” that appears in both Qs and Qt , and qj is hashed into the i-th bit
is hashed into this position, the bit becomes 0 by ⊕. The number position in the hash-label. When it is to create a hash-label for the
of different bits in H by XX label is still 9. We can conclude that q-gram set Qs , by the exclusive-or operator ⊕, Hs [i] = 1 if and
the two strings are not similar for the same reason. ✷ only if qj appears odd times in Qs . And it is the same for the
string t and Qt . When comparing two hash-labels by exclusive-
or operator, the i-th bit position ⊕(Hs [i], Ht [i]) = 1 iff an odd
4 H ASH -BASED L ABELS number of q -grams in Qu are hashed to the i-th bit position. Note
The key idea behind our hash-based approaches is to focus on the that Qu = Qs ⊎ Qt is obtained by merging two q -gram sets, Qu
differences between two q -gram sets, Qs and Qt , for two strings, and Qt . Therefore, ⊕(Hs , Ht ) can also be seen as a hash-label Hu
s and t, where Qs and Qt are two multisets. Let A and B be two created by the XX scheme for Qu , Hu = ⊕(Hs , Ht ). In addition,
multisets and an element x appears cA times in A and cB times since a common q -gram of Qs and Qt appears even number of
in B . We define that the x appears min{cA , cB } times in A ∩ B , times in Qu , the common q -gram cannot set the bit in Hu to be
cA + cB times in A ⊎ B , and max{0, cA − cB } times in A \ B . 1. In other words, only q -grams in Qs ∆Qt are possible to set bits
We use A∆B to denote the symmetric difference of two sets A to 1. Thus, we have ⊕(Hs , Ht ) = Hu = Hd , where Hd is a hash-
and B , where A∆B = (A \ B) ⊎ (B \ A). Here, ⊎ operation label created by the XX scheme for the q -gram set Qd = Qs ∆Qt .
is to merge the two sets A and B . Unlike the multiset union ∪ We have #(⊕(Hs , Ht )) = #(Hd ) ≤ 2qτ − ||s| − |t|| based on
operation, a common q -gram of A and B appears at least twice Lemma 4.1. ✷
in A ⊎ B according to the definition. The number of different Consider the XX scheme using Example 3.1. We have Qu =
q -grams between Qs and Qt is denoted as |Qs ∆Qt |. Qs ⊎Qt = {$e, ef, ff, fc, ci, io, on, ns, sy, y$, $e, ef, ff, fi, ic, ci, ie,
In this section, we discuss the pruning conditions used by en, nc, cy y$}, and Qd = Qs ∆Qt = {fc, io, on, ns, sy, fi, ic, ie,
OX/XX labels for filtering in Section 4.1. We analyze the pruning en, nc, cy}. As shown in Fig. 2(b), the hash-label by exclusive-or
ability of the OX/XX labels in Section 4.2 followed by the between Hs and Ht is the same as the hash-label Hd created by
comparison between OX and XX in Section 4.3. We discuss how the XX scheme for Qd .
to choose the value of q as the length of the q -gram in Section 4.4.
4.2 Probabilistic Analysis of the Hash-based Labels
4.1 The Pruning Conditions We discuss the pruning power of OX/XX. Recall that, if ed(s, t) ≤
Both the OX label and XX label use the same pruning condition τ holds, it must be true that #(⊕(Hs , Ht )) ≤ 2qτ − ||s| − |t||
for filtering the dissimilar strings. That is, if ed(s, t) ≤ τ holds for both OX/XX. The pruning power is based on the probability
for the string s and t, it must be true that #(⊕(Hs , Ht )) ≤ that the bit in ⊕(Hs , Ht ) is to be 1, given the hash-label is in
2qτ − ||s| − |t||. We prove it for the OX label in Theorem 4.1, and fixed length of L with possible hash collisions. The more bits
for the XX label in Theorem 4.2. Below, we firstly give a lemma in ⊕(Hs , Ht ) to be 1, the stronger pruning power. We give
based on which Theorem 4.1 and Theorem 4.2 are proved. the probabilistic analysis for OX and XX in Theorem 4.3 and
Theorem 4.4, respectively. Note that we make no assumption about
Lemma 4.1: Given two strings s and t, and let the threshold be the length L of the hash-labels in the theorems.
τ and their edit distance ed(s, t) ≤ τ . The size of the symmetric
difference of the two q -gram sets, |Qs ∆Qt |, is at most 2qτ − Theorem 4.3: Given two strings s and t. Let γ = |Qs ∩ Qt |,
||s| − |t||. ✷ α = |Qs \ Qt |, and β = |Qt \ Qs |. By the OX scheme, we hash
their q -gram sets into two hash-labels Hs and Ht of length L,
Proof: The size of q -gram sets, for the two strings, s and t, is respectively. The probability that a bit of ⊕(Hs , Ht ) remains to be
|Qs | = |s| + q − 1 and |Qt | = |t| + q − 1, respectively. Since 1 is
ed(s, t) ≤ τ , we have |Qs ∩ Qt | ≥ max{|s|, |t|} + q − 1 − qτ .
Therefore, we have |Qs ∆Qt | = |Qs | + |Qt | − 2 · |Qs ∩ Qt | POX = f (γ)(f (β)(1 − f (α)) + f (α)(1 − f (β))) (2)
≤ 2qτ − ||s| − |t||. ✷ where f (x) denotes the probability that a bit is 0 after inserting
x q -grams into a hash-label of length L.
Theorem 4.1: Let s and t be two strings such that ed(s, t) ≤
τ . By the scheme of OX = (~, ∨, ⊕, #), it must be true that 1
f (x) = (1 − )x ≈ e−x/L (3)
#(⊕(Hs , Ht )) ≤ 2qτ − ||s| − |t||. L
Proof: Suppose qj is a common q -gram that appears in both s and Eq. (2) holds for every bit of ⊕(Hs , Ht ).
t’s q-gram set, Qs and Qt . The q -gram qj will be hashed to the Proof: Without losing the generality, we analyze the i-th bit of
same i-th bit position in both Hs and Ht . So we have Hs [i] = 1 ⊕(Hs , Ht ).
and Ht [i] = 1 in the hash-labels. When comparing two hash- We explain Eq. (3). For a bit vector of length L, the probability
labels, the i-th position of Hs ⊕ Ht must be 0. Thus only the that a q -gram is not hashed to a certain bit is 1 − 1/L due to the

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE
IEEE Transactions
TRANSACTIONS ONon Knowledge
KNOWLEDGE and
AND Data
DATA EngineeringVOL.XXX,
ENGINEERING, ( Volume:
NO.30, Issue:
XXX, 2017 1, Jan. 1 2018 ) 6

uniformness of hash values. By the OX scheme, if the bit remains respectively. The probability that a bit of ⊕(Hs , Ht ) remains to be
to be 0 after inserting x different q -grams, it means none of the 1 is
x different q -grams is hashed to the bit. Given the fact that the 1 − (1 − 2/L)ω 1 − e−2ω/L
PXX = ≈ (5)
hash values for different q -grams are independent, the probability 2 2
f (x) is (1 − 1/L)x ≈ e−x/L . Note that the mathematical Eq. (5) holds for every bit of ⊕(Hs , Ht ).
approximation (1 − L1 )L ≈ e−1 holds when L is sufficiently
large. Proof: In the proof of Theorem 4.2, we show that, by the XX
We next prove the correctness of Eq. (2) by analyzing the scheme, Hs ⊕ Ht = Hd , where Hd is a hash-label for the q -grams
probability of the i-th bit of Hs ⊕ Ht to be 1. Since we apply in Qs ∆Qt . Without losing the generality, we analyze the i-th bit
ω
exclusive-or operator for every bit, there are only two possible of ⊕(Hs , Ht ). We use PXX to denote the probability of the i-th
cases that make the i-th bit to be 1, (a) Hs [i] = 1 and Ht [i] = 0, bit to be 1 in XX label after ω independent q -grams of Qs ∆Qt
and (b) Hs [i] = 0 and Ht [i] = 1. Both cases require that none are hashed. The i-th bit is set to be 1 in only two cases, (a) it
of the common q -grams in Qs ∩ Qt is hashed to the i-th bit remains to be 0 after the first w − 1 q -grams are hashed and the
and the corresponding probability is f (γ) based on Eq. (3) where w-th q -gram is hashed to the i-th bit, (b) it is set to be 1 after the
γ = |Qs ∩ Qt |. Note that, if a common q -gram in Qs ∩ Qt is first w − 1 q -grams are hashed and the w-th q -gram is hashed to
hashed to the i-th bit, it will make both Hs [i] = 1 and Ht [i] = 1 the other bit. Therefore, since hashing every q -gram in Qs ∆Qt is
simultaneously and the i-th bit of Hs ⊕ Ht will become 0. For the independent, we have
case (a), it further requires that no q -grams of Qt \ Qs is hashed ω 1 ω−1 1 ω−1
to Ht [i] and the probability is f (β) according to Eq. (3) where PXX = (1 − )P + (1 − PXX )
L XX L
β = |Qt \Qs |. Moreover, at least one q-gram of Qs \Qt is hashed 0
and PXX = 0. By mathematical deduction of the equation above,
to Hs [i] and the probability is 1 − f (α) where α = |Qs \ Qt |. So
the probability of the case (a) is f (γ)f (β)(1 − f (α)). In a similar 1 − (1 − 2/L)ω 1 − e−2ω/L
ω
way, the probability for the case (b) is f (γ)f (α)(1 − f (β)). Sum PXX = PXX = ≈ .
2 2
up the probability of the two cases can prove the result. ✷

We analyze the affect of the length L on the pruning power We discuss the length L used in the XX scheme. By Eq. (5),
of OX. For simplicity, we assume |s| = |t|, since the length-filter the expected number of bits with value 1 is L · PXX and the ratio
(||s| − |t|| > τ ) can prune strings having large difference between of the expected bits over the real size of |Qs ∆Qt | is
their string lengths. We denote ω = |Qs ∆Qt | and ω = 2α = 2β
when |s| = |t|. Based on Eq. (2) and Eq. (3), we have L · PXX 1 − e−2ω/L ω
≈ ≈1− (6)
ω 2ω/L L
−ω/2L −(2γ+ω)/(2L)
POX ≈ 2(1 − e )·e (4)
by applying Taylor series expansion of function ex when x is
Note that 2γ + ω = |s| + |t|, and the length of hash-labels is close to 0. The ratio reflects the effect of hash collisions and it
fixed as L. When L is not sufficiently large comparing to the sum is negatively linear with ω/L. Note that the higher the ratio is,
of the two strings’ lengths |s| + |t|, e−(2γ+ω)/(2L) tends to be 0, the stronger the pruning ability of XX label has. A large length
which makes POX approximately equal to 0. In other words, most L can reduce the number of hash collisions at the expense of
of the bits in ⊕(Hs , Ht ) are to be 0. As a result, the OX scheme increasing the index size. We give our approximate analysis below,
is not more suitable for a dataset of short strings where L is not although the exact value of ω is unknown before a similarity query
sufficiently large. Regarding long strings, it is more likely to have is issued. In the filter step, it is difficult to tell whether two strings
hash collisions, for a fixed length of hash-labels L. When several are dissimilar when |Qs ∆Qt | is slightly larger than the threshold
different q -grams from Qs ∆Qt are hashed into the same bit, the 2qτ −||s|−|t|| (refer to Theorem 4.1, and Theorem 4.2). Although
pruning power will be reduced. Reconsider Example 3.1, where τ is not required to be known in index construction, τ is given
the q -gram “ff” ∈ Qs ∩ Qt and “sy” ∈ Qs ∆Qt . Following the before the index construction in some cases as assumed in [7],
hash function (Table 1), we have ~(“ff”) = ~(“sy”) = 14. The [26] or the possible τ can be roughly estimated. We use 2qτ to
14-th bit in both Hs and Ht are to be 1 due to the common q -gram approximately represent the value of ω in Eq. (6) for analysis, and
“ff”, and therefore the corresponding bit of ⊕(Hs , Ht ) becomes set the index length L accordingly to reduce the number of hash
0, which makes it fail to detect the existence of “sy” in Qs ∆Qt . collisions to a small number, e.g, making 1 − 2qτ /L ≤ 0.2. For
The XX scheme is proposed to address such hash collisions. In example, assumed q = 3, given a string of length 1,000 and a
this example, after the q -gram “ff” has been hashed into the hash- query threshold τ = 2, the size of hash-label can be set to be
label, Hs [14] = 1, and then after the q -gram “sy” has been hashed L = 64, which is one 64-bit integer. Such small size is effective
into the hash-label, Hs [14] = 0. Thus, there is a chance to detect enough to index a long string of length 1,000, and make the hash
the existence of “sy” in Qs ∆Qt . collision metric 1 − 2qτ /L < 0.2. The XX scheme is space
We analyze the pruning power of the XX scheme by analyzing efficient.
the probability PXX of setting a bit to 1 in ⊕(Hs , Ht ). The theorem In [34], Zhang et al. propose a hash-based method that gen-
below shows that the performance of XX label is not affected by erates hash bucket vectors to represent strings. Suppose v is a
the string length no matter how long the string is. So it is suitable hash bucket vector where v[i] represents the number of q -grams
for datasets of long strings. mapped to the i-th hash bucket. They proved that the edit distance
between the string s and t is no smaller than
Theorem 4.4: Given two strings s and t. Let ω denote the size of
X vs [l] − vt [l] X vt [l] − vs [l]
the symmetric set difference |Qs ∆Qt |. By the XX scheme, we hash max( , ) (7)
Qs and Qt into two hash-labels Hs and Ht of the same length L q q
vs [l]>vt [l] vt [l]>vs [l]

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 7
L 40 80 160 320 640 1,280 2,560
because ⊕(Hs [5], Ht [5]) = 1, whereas XX label cannot detect it,
EOX 1.45 5.38 10.37 14.40 16.97 18.42 19.19
EXX 12.64 15.73 17.69 18.80 19.38 19.69 19.84 because ⊕(Hs [5], Ht [5]) = 0. Therefore there may be cases that
TABLE 2
OX label outperforms XX label and there may be cases that XX
EOX vs. EXX (ω = 20) label outperforms OX label. We analyze when POX > PXX . Based
on Eq. (4) and Eq. (5), for POX > PXX , we have
L 40 80 160 320 640 1,280 2,560
1 − e−2ω/L
EOX
EXX
5.5
19.9
27.8
39.0
64.0
67.7
97.8
97.3
121.0
119.7
134.7
133.7
142.1
141.5
2(1 − e−ω/2L ) · e−(2γ+ω)/(2L) > (8)
2
TABLE 3 And the following inequality holds by Taylor series expansion of
EOX vs. EXX (ω = 150)
function ex and a simple reduction,
And they prune the dissimilar strings by Eq. (7). Note that they
1 − e−2ω/L
use integers to represent counts in hash bucket vectors since a 2(1 − e−ω/2L ) > for ω/L > 0 (9)
count can be larger than 1, whereas we use bits in our hash labels. 2
Suppose we use 100 bits to build the index for a string. They Here, Eq. (8) is true when e−(2γ+ω)/(2L) ≈ 1 − (2γ + ω)/(2L)
have only 10 hash buckets if 10-bit integers are used to represent is close to 1. Recall that 2γ + ω is the sum of the two strings’
the counts, whereas we have 100 hash buckets in our labels. In lengths, |s| + |t|, and L is the size of hash-labels. By letting
other words, with the same total index length, they use fewer hash L = 10(|s| + |t|), e−(2γ+ω)/(2L) ≈ 0.95. This shows that we
buckets to represent the counts, which increases the probability can use the OX scheme when the available space is about one
of hash collision and therefore deteriorates the pruning ability. In order of magnitude larger than the total length of string pairs.
addition, as discussed in the proof of Theorem 4.4, the comparison Also, it should be true that |Qs ∆Qt | > |Qs ∩ Qt |, which means
between two XX labels, ⊕(Hs , Ht ), can avoid the hash collisions that the majority of the hash collisions is among Qs ∆Qt . Table 3
from the common q -grams in Qs ∩ Qt while the bucket vectors shows the comparison between EOX and EXX with different L,
method [34] cannot do so by Eq. (7). for two strings s and t of size 100. By letting w = 150, we have
|Qs ∆Qt | > |Qs ∩ Qt |. When L is small (≤ 80), the XX scheme
4.3 The pruning power: OX vs XX is still better than the OX scheme. The OX scheme becomes
better than XX scheme marginally, when L ≥ 320, but the XX
We compare the pruning power of OX and XX. Consider two scheme still has the comparable performance. Similar result can
strings, s and t. For the OX scheme, as given in Theorem 4.3 also be observed in Fig. 8(b) in our experiment. Therefore, the XX
and shown in POX in Eq. (4), POX tends to be zero when L is not scheme is a good choice when there is no prior knowledge about
sufficiently large in comparison with the sum of the lengths of two |Qs ∆Qt |.
strings. The XX scheme is unlike OX. As shown in Theorem 4.4,
the result of ⊕(Hs , Ht ) by XX for s and t depends only on the
size of Qs ∆Qt . And the hash collisions between the common q - 4.4 The value of q
gram in Qs ⊎ Qt and the q -gram in Qs ∆Qt make no influence to We discuss how to select q as the length of q -grams. The length q
it, no matter how large the common q -gram set is. It is important of a q -gram cannot be too large, because the q -gram set difference
to mention that the XX scheme is to use the dissimilar part of threshold is 2qτ −||s|−|t|| according to Lemma 4.1. A large q will
two strings. This fact makes it suitable for datasets of long strings make the threshold too large in the filter step. However, a small
with small query threshold, since it does not need to increase the q will make a q -gram set contain many identical q -grams which
hash-label size L to deal with long strings, in order to avoid the degrade the pruning power. For example, if q = 1, the q -gram set
hash collisions from large common q -gram set. The XX scheme is only counts the number of characters’ appearance in a string, and
more space efficient than the OX scheme. the symmetric set difference Qs ∆Qt trivially counts the number
We further compare the expected number of bits to be 1. Let of different characters between two strings. We set a reasonable
EOX and EXX denote the expected number of bits with value q such that a q -gram set has few identical q -grams. We give an
1, for OX label and XX label. We have, EOX = L · POX and approximate analysis here, assuming that the average q -gram set
EXX = L · PXX . The larger such expectation the higher pruning size is |Q| and a q -gram set contains |Q| random q -grams. |Q|
power. Suppose that the strings, s and t, are of length 100, and can be well estimated by the average string length in the dataset.
assume that the set difference ω = |Qs ∆Qt | = 20. Table 2 shows Recall that the alphabet size of characters is |Γ|, there are |Γ|q
the comparison between EOX and EXX for given hash-label size L possible q -grams in total. To make a small probability that two
(the number of bits). The XX scheme outperforms the OX scheme identical random q -grams co-exist in a q -gram set of size |Q|, |Γ|q
when L is small. should be about one order of magnitude larger than |Q|, such that
But, it does not necessarily mean that the XX label is always |Γ|q ≈ 10|Q|. Hence, we set q = ⌈(log 10 + log |Q|)/ log |Γ|⌉
better than the OX label. We discuss hash collisions that possibly in our experiments. For example, we set q = 3 for datasets of
exist among the dissimilar q -grams in Qs ∆Qt . Two different q - strings that mainly consist of 26 English characters, if the average
grams qx and qy in Qt \ Qs are possible to be hashed to the same string length is smaller than 1,000, and we set q = 5 for datasets
bit position such that i = ~(qx ) = ~(qy ). Consider Example 3.1. of DNA sequences that consist of {A, G, C, T}, if the average
Let qx = “fc”, and qy = “io”. Following the hash function given sequence length is about 100.
in Table 1, ~(qx ) = ~(qy ) = 5. By the OX scheme, the 5-th bit
of ⊕(Hs [i], Ht [i]) is set to 1 after hashing qx and qy (Fig. 2(a)).
By the XX scheme, the 5-th bit of ⊕(Hs [i], Ht [i]) is set to 0 after 5 T HE A LGORITHMS
hashing qx and qy (Fig. 2(b)). In this case, when comparing two We give algorithms for hash-label creation, hash-label based
hash-labels, Hs and Ht , OX label can detect one of qx and qy , filtering and verification in this section.

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 8

Algorithm 1 MCount (s, q ) Algorithm 2 Hlabel (s, q , L, λ)


Input: a string s, and the value of q for q-gram Input: a string s, the value q for q-gram, the hash-label size L,
Output: a count array C for s and the scheme λ = (~, ℵ, ℘, #)
Output: a hash-label Hs
1: let T be an empty hash table;
2: let C be an array of size |s| + q − 1, where every C[i] ← 1 initially; 1: Let Hs be the hash-label of size L, where Hs [i] ← 0 for i ∈ [1, L];
3: for i from q to |s| do 2: Cs ← MCount(s, q);
4: let qi be the q-gram for s[i − q + 1, i]; 3: bi ← ℜq (h($)) ⊕ · · · ⊕ ℜ1 (h($));
5: if qi 6∈ T then 4: Let i be the final hash value for a q-gram;
6: T [qi ] ← 1; 5: for j from 1 to |s| + q − 1 do
7: else 6: if j ≤ q then
8: T [qi ] ← T [qi ] + 1; 7: b
i ← ℜ1 (b i) ⊕ ℜq+1 (h($));
9: C[i] ← T [qi ]; 8: else
10: return C ; 9: b b
i ← ℜ1 (i) ⊕ ℜq+1 (h(s[j − q]));
10: if j ≤ |s| then
11: b
i ←bi ⊕ ℜ1 (h(s[j]));
12: else
13: b b
i ← i ⊕ ℜ1 (h($));
5.1 Hash-Label Creation 14: i ← ((bi ⊕ h(C[j])) ≫q )%L + 1;
15: Hs [i] ← ℵ(Hs [i], 1);
The q -gram set of a string is a multiset and the symmetric set 16: return Hs ;
difference operator ∆ used is to handle two multisets. For a q -
gram qx , if there are ns copies of qx in q -gram set Qs and nt
copies of qx in Qt , then there are |ns − nt | copies of qx in hash function b
~ is implemented as follows.
Qs ∆Qt . However, the hash function will hash the same q -gram to
the identical bit position, which makes the hash-based scheme fail b
~(s[i + 1, i + q]) = ℜ1 (b
~(s[i, i + q − 1])) ⊕ (11)
to reflect the multiple appearance of the same q -gram in Qs ∆Qt . ℜq+1 (h(s[i])) ⊕ ℜ1 (h(s[i + q]))
To deal with multisets, we use an array C for appearance count,
to maintain the number of times a certain q -gram appears in a which is equivalent to
string and treat the number of such appearance as the suffix for
b
~(s[i+1, i+q]) = ℜq (h(s[i+1]))⊕· · ·⊕ℜ1 (h(s[i+q])) (12)
every q -gram in the q -gram set. Here, the original q -gram can be
seen as (q + 1)-gram by treating the appearance count as a special By such a rolling hash function, the hash values of a q -gram set
character appended in the suffix. For example, let s = “banana” can be computed in a sliding window manner. We can also use it
and t = “planar”. Then, the 2-gram sets for s and t are Qs = in Algorithm 1 to save the cost of looking-up in the hash table by
{$b, ba, an, na, an, na, a$} and Qt = {$p, pl, la, an, na, ar, r$}, applying such a hash function for q -grams.
respectively. With the appearance count, the 3-gram sets become The hash function used in OX/XX label needs to generate hash
Qs = {$b1, ba1, an1, na1, an2, na2, a$1}, and Qt = {$p1, pl1, values at random. But it is difficult to find such perfectly random
la1, an1, na1, ar1, r$1}. For s, the 2-gram “na” appears twice, hash function. As an alternative, we use a pairwise independent
and we have “na1” and “na2” in Qs . In general, suppose Qs has hash function which is comparable to the truly random hash
ns copies of a q -gram qx and Qt has nt copies of the same q - function in practice [21]. The rolling hash function b ~ (Eq. (11))
gram qx . Assume ns ≥ nt , there are ns − nt copies of qx in itself is not pairwise independent. We remove the last q bits of the
Qs ∆Qt . Hence, qx (nt + 1), · · · , qx ns ∈ Qs ∆Qt . The MCount generated hash value to make the output pairwise independent as
algorithm is given in Algorithm 1 to compute the array C . It uses suggested in [14]. Since the length of the bit vector for a string is
a hash table T to incrementally update the appearance information L, the q -gram hash function ~ used in OX/XX label is
for every q -gram of the string in lines 8-9. We only consider the
q -grams with the end position from q to |s| in line 3 because other ~(s[i+1, i+q]) = ((b ~(s[i+1, i+q]) ⊕ (h(C[j])) ≫q )%L + 1
q -grams are padded with different number of delimited character (13)
‘$’ in the prefix or suffix and no identical q -gram will exist in the where h(C[j]) can be seen as ℜ0 (h(C[j])) and ≫k is the logical
same string. The time complexity of Algorithm 1 is O(|s|). As an right-shift operator to right-shift k bits that need to be removed.
example, for s = “banana”, the 2-gram set is treated as a list {$b, The Hlabel algorithm is given in Algorithm 2 to create a hash-
ba, an, na, an, na, a$}, and the appearance count C returned by label Hs for a given string s for both OX and XX label. The
MCount(s, 2) is C = [1, 1, 1, 1, 2, 2, 1]. hash-label is represented as Hs and the hash value by the rolling
The main function of creating a hash-label (OX or XX) is the hash function b ~ Eq. (11) is maintained in the variable bi and the
hash function to hash an input q -gram to a certain bit position in final hash value by Eq. (13) is maintained in the variable i, which
the hash-label. We build a hash table for q -grams in Algorithm is also the bit position of Hs to which the q -gram is hashed. We
1. Observed that two consecutive q -grams of a string share q − 1 compute the appearance for every q -gram by Algorithm 1. Note
common characters. we use a rolling hash to avoid the repeated that string s should act like the one padded with q − 1 delimited
computation in the common characters because a rolling hash character ‘$’ in the prefix and suffix. Therefore, bi is initialized
function b~ satisfies the following property for a string s. as the hash value of a q -gram consists of q characters ‘$’ by the
rolling hash function Eq. (12). We compute the rolling hash values
b
~(s[i + 1, i + q]) = F (b
~(s[i, i + q − 1]), s[i], s[i + q]) (10) for q -grams in a consecutive way in line 5 and it iteratively updates
bi and Hs based on Eq. (11) and Eq. (13) in lines 5-15. We set the
where F is a function that takes the hash value of the last q -gram bit position Hs [i] in a way depending on whether it is to compute
b
~(s[i, i + q − 1]), the character s[i] removed, and the character OX label or XX label, at the end of the iteration. Like Algorithm
s[i + q] added as the inputs. We use ℜk (·) to denote the operation 1, the time complexity of Algorithm 2 is O(|s|) and it mainly
of rotating a bit vector to the left by k positions, and use h to consists of bit-wise operators, exclusive-or ⊕, right-shift ≫, etc,
denote a hash function that takes a character as input. The rolling which can be done efficiently.

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEEIEEE Transactions
TRANSACTIONS on Knowledge
ON KNOWLEDGE ANDand Data
DATA Engineering
ENGINEERING, ( Volume:
VOL.XXX, 30, Issue:
NO. XXX, 2017 1, Jan. 1 2018 ) 9

h(‘f’) = 35 0 0 1 0 0 0 1 1 b
~(“fi”) 1 1 0 0 1 1 0 1
b
~(“ic”) 1 0 1 0 0 0 0 1
h(‘i’) = 160 1 0 1 0 0 0 0 0 ℜ1 (b
~(“fi”)) 1 0 0 1 1 0 1 1 h(‘1’) 1 0 0 1 1 1 0 1
ℜ3 (h(‘f’)) 0 0 0 1 1 0 0 1 ⊕
h(‘c’) = 145 1 0 0 1 0 0 0 1 ℜ1 (h(‘c’)) 0 0 1 0 0 0 1 1 0 0 1 1 1 1 0 0
⊕ ≫2
h(‘1’) = 157 1 0 0 1 1 1 0 1 b
~(“ic”) 1 0 1 0 0 0 0 1 ~(“ic”) = 15 0 0 0 0 1 1 1 1
(a) Some hash values (b) Roll from “fi” to “ic” (c) The final hash value
Fig. 3. q -gram hash function

Algorithm 3 Hsearch (s, τ ) since the hash function used in OX label and XX label are pair-
Input: a query string s, the threshold τ wise independent, the bit positions with value 1 appear randomly
Output: All strings, t, that ed(s, t) ≤ τ
in bit vector ⊕(Hs , Ht ) and one 64-bit integer contains about
1: let R ← ∅; #(⊕(Hs , Ht ))/K bit positions with value 1. Assume that few
2: Hs ← Hlabel(s, q, L, λ);
3: for each hash-label Ht of a string t in D that ||s| − |t|| ≤ τ do hash collisions exist such that #(⊕(Hs , Ht )) ≈ |Qs ∆Qt |.
4: c ← 0; Accumulating the number of 1 for the first ⌈2qτ K/|Qs ∆Qt |⌉
5: for j from 1 to L do
6: c ← c + ⊕(Hs [j], Ht [j]); 64-bit integers of the hash-label is expected to make the variable
7: if c > 2qτ − ||s| − |t|| then c larger than the threshold 2qτ . In other words, the larger the q -
8: continue to test the next;
9: if Verify(s, t, τ ) = true then
gram set difference |Qs ∆Qt | is, the less cost it needs in the filter
10: Insert t into R; step.
11: return R;

5.3 Verification
Example 5.2: Fig. 3 shows the process of computing the hash- We need to efficiently verify the candidate strings that cannot
label for the 2-gram “ic” in string t = “efficiency” by Algorithm be pruned in the filter step. The exact edit distance between two
2. Fig. 3(a) lists some hash values by the character hash function strings, s and t, can be computed by filling a (|s| + 1) × (|t| + 1)
h(·). The rolling hash value of the previous q -gram “fi”, b ~(“fi”) matrix M using dynamic programming by Eq. (14),
has already been computed and is given at the top of Fig. 3(b). 
Note that q = 2. By Eq. (11), we compute the rolling hash value M [i − 1, j] + 1

b
~(“ic”) in Fig. 3(b). Then, the appearance, C[5] = 1, is added and M [i, j] = min M [i, j − 1] + 1 (14)


the right-shift operator is applied in Fig. 3(c). Finally, suppose the M [i − 1, j − 1] + δ(s[i] = t[j])
bit vector length is 20, the final hash value of “ic” is ~(“ic”) =
15%20 + 1 = 16. ✷ where M [i, 0] = i, M [0, j] = j for all i, j , and δ(s[i] = t[j]) is
a function which returns 0 if s[i] = t[j], otherwise 1. M [|s|, |t|]
We compute the hash-label for every string of the dataset D
gives the minimum edit distance between s and t. The edit
by Algorithm 2 as the index. We store the index of every string
operations can be read by backtracing the operations used by
by a sequence of K 64-bit integers to represent the bit vector of
dynamic programming. We use edit-path to denote the sequence
OX and XX hash-label. The length L of the bit vector used is
of these edit operations. The time complexity of the dynamic
L = 64K . Note that the index for every string, including the
programming algorithm is O(|s| · |t|), but the verification step
query string, is of the same size. In general, we can set the index
only requires to check whether the edit distance of two strings is
length L according to the length of strings in the dataset and the
larger than τ , rather than the exact edit distance.
threshold τ . The parameter L can be also tuned to find a good
Li et al. in [16] propose an optimization techniques. Assume
tradeoff between the pruning ability and the overhead in filter
that |s| ≤ |t|, and let φ denote the length difference such as
step, as shown in Fig. 8.
φ = |t| − |s|. They observe that only the element M [i, j] needs to
be computed for −⌊(τ − φ)/2⌋ ≤ j − i ≤ ⌊(τ + φ)/2⌋. Hence,
5.2 Filtering By Hash-Labels in every row of the matrix, it considers at most τ + 1 columns and
hence the time complexity becomes O(τ · min{|s|, |t|}).
We give our hash-label based filtering algorithm, named Hsearch,
We further propose a new optimization technique such that
in Algorithm 3, which can be used both for the built OX index and
only a certain range of elements need to be checked. We observe
the XX index. In Algorithm 3, it first creates a hash-label Hs for
that M [i, j] is the edit distance between the substring s[1, i] and
the query string s in line 2. In a loop (lines 3-10), it checks whether
the substring t[1, j]. Therefore, we find a property of matrix M
the condition #(⊕(Hs , Ht )) ≤ 2qτ − ||s| − |t|| holds for every
that the values of adjacent entries in M differ at most 1. Take
possible string t that ||s| − |t|| ≤ τ in D following Theorem 4.1
M [i, j] and M [i, j + 1] as an example. M [i, j] ≤ M [i, j + 1] + 1
and Theorem 4.2. Such checking is done using a counter c. The
is true, because we can first use M [i, j+1] operations to transform
counter c increases by 1 if there is one more different bit. When
a string s[1, i] into a string t[1, j + 1] and then delete the last
c > 2qτ − ||s| − |t||, it will continue to check the next possible
character t[j + 1]. M [i, j] ≥ M [i, j + 1] − 1 is true by Eq. (14).
string, since s and t cannot be similar (lines 4-8). It is important to
By similar analysis, for a given k (both for positive and negative
note that we can utilize the SIMD instruction (SSE4) to efficiently
integers), we can derive that
calculate the number of 1 in a 64-bit integer. The algorithm Verify
is used to verify whether s and t are similar if needed (line 9). M [i + k, j] − |k| ≤ M [i, j] ≤ M [i + k, j] + |k| (15)
The overall worst case time complexity in the filter step of
Algorithm 3 (lines 4-8) is O(nK) where n = |D|. However, M [i, j + k] − |k| ≤ M [i, j] ≤ M [i, j + k] + |k| (16)

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEEIEEE Transactions
TRANSACTIONS ON on Knowledge
KNOWLEDGE and
AND Data
DATA Engineering
ENGINEERING, ( Volume:
VOL.XXX, NO. 30,
XXX,Issue:
2017 1, Jan. 1 2018 ) 10

e f f i c i e n c y Algorithm 4 Verify (s, t, τ )


0 1 2 3 4 5 6 7 8 9 10
e 1 0 1
1: if |s| > |t| then
f 2 0 1
2: x ← s, s ← t, t ← x;
f 3 0 1 3: φ ← |t| − |s|, low ← ⌊(τ − φ)/2⌋, up ← ⌊(τ + φ)/2⌋;
c 4 1 1
4: mid ← low + φ, lft ← 0, rgt ← low + up;
i 5 2 1
5: Let A be an array of length (low + up + 1), where every element is initialized 0;
o 6 2 6: for j from 0 to low − 1 do
n 7 2
7: A[j] ← low − j ;
s 8 3
8: for i from 1 to |s| do
y 9 9: for j from max{lft, low-i+1} to min{rgt, |t|+low-i} do
10: A[j] ← min{A[j − 1] + 1, A[j + 1] + 1,
A[j] + δ(s[i] = t[j − (low − i)])};
Fig. 4. Verification on matrix M 11: if A[mid] > τ then
12: return false;
Based on this property, we give two theorems below to help 13: if A[lft] + mid − lft > τ then
14: A[lft] ← ∞, lft ← lft + 1;
early stop the unnecessary computation in the matrix for fast 15: if A[rgt] + rgt − mid > τ then
verification. 16: A[rgt] ← ∞, rgt ← rgt − 1;
17: return true;
Theorem 5.1: The edit distance between the strings, s and t, is
larger than τ if there exists entry M [i, j] such that j −i = |t|−|s|
and M [i, j] > τ . The Verify algorithm is given in Algorithm 4. low and up
Proof: First, we prove that M [i − 1, j − 1] ≤ M [i, j]. Suppose denote the lower and upper bound of j − i and an array A of
there exists a certain M [i, j] such that M [i − 1, j − 1] > M [i, j]. size (low + up + 1) is used for computing the values of matrix
Thus we have M [i − 1, j] + 1 ≥ M [i − 1, j − 1] > M [i, j] by entries in the sliding window. Note that in the i-th row of the
Eq. (16) and M [i, j − 1] + 1 ≥ M [i − 1, j − 1] > M [i, j] by matrix, A[j] actually represents the matrix entry M [i][i+j −low].
Eq. (15). But, based on Eq. (14), we have M [i, j] ≥ min{M [i − A[mid] represents the matrix entries that is in the same diagonal
1, j − 1], M [i − 1, j] + 1, M [i, j − 1] + 1} > M [i, j] which with M [|s|, |t|]. That is, A[mid] represents M [i][i + φ]. lft and
leads to a contradiction. Second, since j − i = |t| − |s| which rgt denote the left and right limits of the array A and they can
means M [i, j] is in the same diagonal with M [|s|, |t|], we have be updated iteratively to shrink the size of the sliding window by
|s| − i = |t| − j . Therefore, M [|s|, |t|] ≥ M [i, j] > τ . ✷ Theorem 5.2. i+j −low should be between [1, |t|] and we can get
low−i+1 ≤ j ≤ |t|+low−i in line 9. When i ≤ low, j does not
Theorem 5.2: Let φ be |t| − |s|, and assume that |t| ≥ |s|. If start from 0. So we initialize the values from 0 to low − i of array
M [i, j] > τ − |i − j + φ|, no edit-path with the edit distance A in lines 6-7 such that when the sliding window moves, it can
ed(s, t) ≤ τ will pass the matrix entry M [i, j] and the entries correctly get the values from the leftmost column of the matrix. In
M [i′ , j ′ ] in the same diagonal with M [i, j] that i′ ≥ i. line 10, we compute the values in the i-th row by Eq. (14) with
Proof: If an edit-path with the minimum edit distance pass modifications considering the movement of sliding window. Then
M [i, j], it first transforms the substring s[1, i] to the substring it checks the early stop condition by Theorem 5.1 in line 11 and
t[1, j] by M [i, j] operations and then transforms the rest substring shrinks the window size by Theorem 5.2 in lines 13-16. The worst
s[i + 1, |s|] to the substring t[j + 1, |t|] by M [|s|, |t|] − M [i, j] case complexity of Algorithm 4 is O(τ · min{|s|, |t|}). The early
operations. The length difference between the substring s[i+1, |s|] stop condition and the window shrinking condition can effectively
and the substring t[j + 1, |t|] is |i − j + φ|. So M [|s|, |t|] − save the computation cost in practice as shown in Example 5.3
M [i, j] ≥ |i − j + φ| and if M [i, j] > τ − |i − j + φ|, we have and Fig. 4.
ed(s, t) = M [|s|, |t|] > τ . SSimilar analysis can be done for
the entries M [i′ , j ′ ] in the same diagonal with M [i, j] such that 6 E XPERIMENT
i′ ≥ i, because the length difference between the two substrings
We conducted extensive experiments to evaluate the OX/XX label
of s and t remains the same and M [i′ , j ′ ] ≥ M [i, j]. ✷
performance. We compare OX/XX label with 4 state-of-the-art
The following example shows how the theorems above accel- approaches, Flamingo [2], AdaptiveSearch [26], PivotalSearch [7],
erate the verification process. and LFSearch [33]. AdaptiveSearch and PivotalSearch are two
Example 5.3: We set τ = 2 and verify the string pair s = representatives of prefix filtering approaches. LFSearch is the
“effcionsy”, t = “efficiency” with φ = 1. So we only need to local filtering approach using BitTree where we use qmax = 2
compute the entries in the two diagonals that 0 ≤ j −i ≤ 1. Fig. 4 as suggested by the authors. We use the codes from the authors.
shows the matrix and the values of the two diagonals computed by All algorithms to be tested are implemented in C++ and compiled
dynamic programming. We find that M [5, 5] = 2, which means by g++ 5.4.0 with -O3 flag. And all the experiments are conducted
M [5, 5] > τ − |i − j + φ| = 1. So we can stop computing in a PC running Linux with Intel Xeon [email protected] CPU
the values in that diagonal according to Theorem 5.2. Finally, and 32GB memory. Programs that run over 24 hours or exceed the
M [8, 9] = 3 > τ is found and based on Theorem 5.1, we can memory limit will be terminated.
conclude that ed(s, t) > τ . ✷ Datasets: We test six real string datasets. DNA is extracted from
It is known that constructing the full matrix M is unnecessary Sequence Read Archive [33] 1 . reads is a large DNA dataset used
and the space requirement can be reduced to O(min{|s|, |t|}) in EDBT 2013 competition [24]. trec is a set of documents for
observing only the entries in the (i-1)-th row are involved when benchmark in information retrieval [12] 2 . enron is an email
computing the i-th row. Here, at most τ + 1 consecutive entries of dataset with titles and bodies [12], [26] 3 . FoodReview and
a row need to be computed, −⌊(τ −φ)/2⌋ ≤ j−i ≤ ⌊(τ +φ)/2⌋, 1. ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/sequenceread/
and it can be implemented as a sliding window of the size τ + 1 2. trec.nist.gov/data/t9 filtering.html
as shown in Fig. 4. 3. www.cs.cmu.edu/∼enron/

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 11
Dataset String Num Avg Len Min Len Max Len
We show the index size and index construction time in Fig. 5.
DNA 1,048,576 100 100 100
reads 15,000,000 100 94 103 Here, only OX/XX and PivotalSearch can index MovieReview.
trec 347,949 844 5 3,947 All testing algorithms except PivotalSearch have the similar index
enron 245,567 884 5 59,419
FoodReview 568,454 436 12 21,409
size and index construction time while τ changes. The index size
MovieReview 7,911,684 956 2 34,715 and index construction time of PivotalSearch increases when τ
TABLE 4 becomes larger. In Fig. 5, we show the best PivotalSearch results
Real datasets using τ = 2 for both DNA and reads, and τ = 12 for others. As
Flamingo LFSearch OX (L=1920)
observed, both OX and XX label have much smaller index size and
AdaptiveSearch XX (L=640)
PivotalSearch XX (L=1920) less index construction time. Compared with the index size got by
>104
the existing approaches, the index size of XX with L = 640 is at
Index Size (MB)

103
102
most 32% of theirs in short string datasets DNA and reads and is
101 at most 8% of theirs in other long string datasets (Fig. 5(a)). The
1 superiority of XX label in index construction time is also shown
DNA reads trec enron FdRev MvRev
in Fig. 5(b). We can also observe that OX label and XX label
(a) Index Size have the same index construction time irrelevant to L in Fig. 5(b),
>103 because the time complexity of hash-label creation (Algorithm 2)
Index Time (s)

102 is independent of the index length as discussed in Section 5.1.


We compare the query performance of different approaches
101
in the real datasets. As in the index construction, we also set τ
1 as 2, 4, 6, 8, 10, 12 in DNA and reads datasets of short longs,
DNA reads trec enron FdRev MvRev
and set τ as 12, 14, 16, 18, 20, 22, 24 in other four datasets
(b) Index Construction Time of long strings. Fig. 6 shows the total running time of 10,000
Fig. 5. Index Time and Size queries. XX with L = 640 shows its significant superiority over
the existing approaches even though its index size is about one
MovieReview consist of reviews from Amazon [19] 4 . Table order of magnitude smaller than theirs.
4 shows the statistics of the datasets, including the number of In both short string datasets (DNA and reads) and long
strings, average string length, and minimum (maximum) string string datasets (trec, enron, and FoodReview), XX with L =
length. Compared with DNA and reads, the other four datasets 640 has significant speedup over the best results by the existing
have longer string length. We will set small τ for short string approaches. The prefix filtering approaches, AdaptiveSearch and
datasets and large τ for long string datasets in our experiments. PivotalSearch, perform better than our proposed approaches only
Flamingo can only deal with strings with length ≤ 300 and hence when τ = 2 in DNA and reads datasets. For a small τ , XX can
we only get its results in DNA and reads. The ratios of the perform better with a smaller index length because it can reduce
strings with length larger than 1,000 in trec, enron, FoodReview, the pruning overhead and keep the strong pruning ability as well.
MovieReview datasets are 44%, 25%, 7%, 30%, respectively. Ex- Only OX/XX and PivotalSearch can index MovieReview and XX
cept PivotalSearch and our OX/XX label, the other three methods with L = 640 saves about 30% query time of PivotalSearch even
cannot build their index successfully in MovieReview due to their though PivotalSearch builds 7.5–15GB index for pruning whereas
excessive memory usage (≥ 32GB). As analyzed in Section 4, we XX label builds 0.6GB index for pruning. The query time can
set q = 5 in dataset DNA and reads due to their small alphabet be reduced by increasing the XX index size in MovieReview.
set and q = 3 in the other four datasets. As shown in Fig. 6(f), XX with L = 1, 920 tends to be faster
Following the similar approaches in [7], [33], we randomly than XX with L = 640 when τ ≥ 24. It is recommended to
select strings from the datasets as the queries. For every string increase the index length L if τ is known to be large. In other
selected, we apply at most τ modifications randomly. Thus, every cases, OX/XX with L = 1, 920 performs second to XX with
query string has at least one similar string in the dataset. By L = 640, because the longer index length requires more time in
default, the query time is measured by the total running time the filter step (Algorithm 3), even though they possibly reduce the
of 10,000 generated query strings including both filtering time number of candidate strings for verification as shown in Fig. 7.
and verification time. Note that the y-axes of most figures in the Fig. 7 shows that OX/XX with L = 1, 920 and LFSearch have
following are in logarithmic. the best performance with marginal differences between them in
Performance Study in Real Datasets: We report (a) the index candidate number. When τ is large and 2qτ is comparable to L
size and index construction time, and (b) query time and average (e.g., τ = 12 and q = 5 for DNA and reads datasets, and τ = 24
candidate number. and q = 3 for other datasets), the average candidate number
The index length of our OX/XX label is determined by a of XX with L = 640 increases more significant than XX with
parameter L. For OX label, since it is effective when the in- L = 1, 920 due to the small index length. Therefore, increasing
dex length is sufficiently large, we set L = 1, 920 (30 64-bit the index length for XX can reduce the number of candidate strings
integers). For XX label, we use L = 640 (10 64-bit integers) effectively. However, it is not necessary true that the total query
and L = 1, 920 to study the performance of XX label with time can be reduced as well since the query time in the filter step
different lengths. The prefix filtering approaches (AdaptiveSearch, may become dominant when answering the queries, as discussed
PivotalSearch) need τ in index construction. We set τ as 2, 4, below. XX with L = 640 shows a good trade-off between the
6, 8, 10, 12 in DNA and reads datasets of short longs, and set filter cost and the verification cost.
τ as 12, 14, 16, 18, 20, 22, 24 in other datasets of long strings. Since the results of the query time comparison among all
approaches are similar with different τ , we set τ = 10 for DNA
4. jmcauley.ucsd.edu/data/amazon/ and reads datasets of short strings and τ = 20 for other datasets

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEEIEEE Transactions
TRANSACTIONS on Knowledge
ON KNOWLEDGE ANDand Data
DATA Engineering
ENGINEERING, ( Volume:
VOL.XXX, 30, Issue:
NO. XXX, 2017 1, Jan. 1 2018 ) 12

Flamingo PivotalSearch XX (L=640) OX (L=1920)


AdaptiveSearch LFSearch XX (L=1920)
4 5 5 3
>10 >10 10

Query Time (s)

Query Time (s)

Query Time (s)


3 2
10 10
104
2
10 10
3
10
10 500 1
2 4 6 8 10 12 2 4 6 8 10 12 12 14 16 18 20 22 24
Threshold τ Threshold τ Threshold τ

(a) DNA (b) reads (c) trec


2 3 2
10 10 10
Query Time (s)

Query Time (s)

Query Time (s)


2
10
10
10
1
0.5 1 10
12 14 16 18 20 22 24 12 14 16 18 20 22 24 12 14 16 18 20 22 24
Threshold τ Threshold τ Threshold τ

(d) enron (e) FoodReview (f) MovieReview


Fig. 6. Query Time

Flamingo PivotalSearch XX (L=640) OX (L=1920)


AdaptiveSearch LFSearch XX (L=1920)
5
105 >106 105
candidate number

candidate number

candidate number
104 105
104 104
103
103
102 2
10 103
10 10
1 1 102
2 4 6 8 10 12 2 4 6 8 10 12 12 14 16 18 20 22 24
Threshold τ Threshold τ Threshold τ

(a) DNA (b) reads (c) trec


103 104 >104
candidate number

candidate number

candidate number
103
103
102 102
102
10

10 1 10
12 14 16 18 20 22 24 12 14 16 18 20 22 24 12 14 16 18 20 22 24
Threshold τ Threshold τ Threshold τ

(d) enron (e) FoodReview (f) MovieReview


Fig. 7. Average Candidate Number

of long strings in the following experiments where τ does not >100


XX-Verify XX-Filter OX-Verify OX-Filter
Query Time (s)

changes. 80
60
40
Impact of Index Length L: The index size of both OX and 20
XX label is determined by the index length L, and the total 0
320 640 960 1280 1600 1920 2240 2560
index size is linear with L. For query performance, we find that Index Length L

the impact of index length L is different in different datasets. (a) Query Time (the left is XX and the right is OX)
We analyze it using MovieReview with τ = 20 in Fig. 8 and
Avg Candidate Number

105
XX
OX
show the common features we have found. In Fig. 8(a), in every 104

pair of bars, the left is for XX label and the right is OX label. 103
Each bar shows the total query processing time. For XX (OX)
102
label, its filtering time is shown by XX-Filter (OX-Filter) and the 320 640 960 1280 1600 1920 2240 2560
Index Length L
verification time is shown by XX-Verify (OX-Verify). As index
(b) Pruning Ability
length L increases, the number of candidate strings decreases
significantly for both OX/XX label (Fig. 8(b)), and therefore the Fig. 8. Parameter Studies
verification time for OX/XX label decreases (Fig. 8(a)). But the
larger index length L will make the filter time to increase linearly, Verification Optimization: We test the performance of the pro-
which is consistent with the query algorithm complexity O(nL) as posed verification algorithm (Algorithm 4) by comparing it with
analyzed in Section 5.2. When the index length is sufficiently large the existing verification algorithm, denoted as VPJ (Verify in
(≥ 960 in MovieReview), the filter time becomes the dominant PassJoin), in [16]. We test it using XX with L = 640. We
of the overall query time. As a result, reducing the number of set τ accordingly as mentioned above. Fig. 9 shows the results,
candidate strings cannot always improve the query time. A large where the y-axis is the total query time of 10,000 queries. In
filter time by heavy index will deteriorate the query performance. Fig. 9, XX-VPJ is to use XX for filtering followed by VPJ for
Also, as shown in Fig. 8(b), we find that OX shows better pruning verification. And XX is to use XX for filtering followed by the
ability than XX label when the index length L is sufficiently large proposed Algorithm 4 for verification. The dash lines shown in
(≥ 1600). the figures are the best of all the existing algorithms (Flamingo,

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
Transactions on Knowledge and Data Engineering
IEEE Transactions on Knowledge and Data Engineering ( Volume: 30, Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 13

563 9138
have larger index size which affects their scalability in long string
8000
400
6000 datasets. Both OX/XX label have much smaller index size than the
4000
200
2000 existing approaches. Similar conclusions made from DNA dataset
0
XX-VPJ XX
0
XX-VPJ XX can be drawn for reads dataset with longer string length. Note that
(a) DNA (b) reads Flamingo and AdaptiveSearch fail to build their index in reads
92 15
datasets of longer strings.
80 12
60
40
8 Performance in Short Strings: We test the performance of the
20
0
4
0
hash labels proposed for short strings. To make a short string
XX-VPJ XX XX-VPJ XX
dataset, we shorten the strings to a certain length for every string
(c) trec (d) enron in the original dataset. Due to the limited space, we show the
10 60
8.8
8 50 results in the DNA dataset where the string lengths are about 100.
40
6
4
We shorten the string length to 20, 40, 60, and 80, respectively.
20
2
0 0
Since the strings are known to be short, we set the index length of
XX-VPJ XX XX-VPJ XX
the hash labels to be small. In the experiments, we set L = 192,
(e) FoodReview (f) MovieReview three 64-bit integers, for XX/OX labels. And we use a small query
Fig. 9. Verification Optimization (y-axis’s unit: second)
threshold, τ = 2, for short strings. Fig. 11 shows the results.
The XX label and the OX label are the fastest ones when the
AdaptiveSearch, PivotalSearch, LFSearch) in the datasets tested. string length is ≥ 60. When the string length is even smaller, the
It shows that XX label is more efficient than the state-of-the- prefix filtering approaches perform better than the hash labels but
art string similarity search approaches without the verification the hash labels have the comparable performance. Note that short
algorithm proposed, as indicated by XX-VPJ. With the new strings have only a few q -grams and thus it is not costly to compare
verification algorithm, we can make it even more efficient with the q -grams directly. In Fig. 11(b), we show that the XX/OX labels
a speedup of 2x–6x compared with the existing approaches. When and LFSearch have the least average candidate number, which
the threshold τ becomes larger, more candidate strings will be is similar with the result in Fig. 7. The edit distance between
generated for verification and thus the performance improvement two strings tends to increase when the string length increases.
by the proposed verification algorithm will be more significant. Therefore, with the same threshold, the candidate number of all
Due to the limited space, we report the gap between XX-VPJ and approaches, except Flamingo, tend to decrease in Fig. 11(b). It is
XX using the DNA dataset with different τ values. By setting important to note that the index space of XX/OX label is at most
τ = 10, 12, 14, the speedup of XX over XX-VPJ is 1.07, 1.54, a half size of the index space used in the existing approaches and
3.67 respectively because the average candidate number is greatly the index construction by the XX/OX scheme is at least two times
increased, 135, 6,085, 350,543 respectively. When τ = 14, the faster than the index construction time by the existing approaches.
verification cost will take the majority of the total running time
and hence the improvement is significant. 7 C ONCLUSION
Scalability Study: We conduct scalability study using DNA and In this paper, we study two new hash-based approaches, OX label
reads datasets, and show the results in Fig. 10. We set τ = 10 in and XX label, for string similarity search based on edit distance,
this testing. First, we test the scalability in terms of the number where OX = (~, ∨, ⊕, #) and XX = (~, ⊕, ⊕, #). Both OX
of strings. We randomly sample 20%, 40%, 60%, 80% number and XX label use the same last two functions, ⊕ and #, to
of strings from the datasets. Fig. 10(a) and Fig. 10(b) show the compare two hash-labels for pruning. But they take a different
query performance. Each approach requires more query time when way to create the hash-labels. Here, OX label uses two functions,
the dataset becomes larger. Among them, XX with L = 640 is ~ and ∨, to create a hash-label for a string, whereas XX label
the most efficient approach and is at least 5 times faster than uses two functions, ~ and ⊕, to create a hash-label for a sting. We
the existing approaches. The index sizes of all approaches also prove that both OX and XX label can be used to prune dissimilar
increase linearly as the string number increases linearly. We omit strings, s and t, when ed(s, t) > τ . The index size for OX label
the results due to the limited space. Second, we test the scalability and XX label is determined by L, and the hash-label for string of
in terms of the string lengths. To generate a dataset with longer any length has the same L (the number of bits). We analyze the
string length, we randomly select k strings from the dataset, and pruning power by OX label and XX label. We show that OX label
concatenate the k strings to a new long string. By adjusting k , is effective when L is sufficiently large comparing to the sum of
we generate datasets consisting of strings with length equal to the lengths of two strings, s and t. We also show that the pruning
200%, 300%, 400%, 500% of the original string length. Flamingo power of XX label only depends on the number of different q -
cannot process datasets with 400%, 500% string length due to its grams between the q -gram set Qs and the q -gram set Qt for s
constraint on string length (≤ 300). Fig. 10(c) shows the query and t, and can be effectively used for both short and long string
time in DNA dataset by varying the string lengths. XX label has similarity pruning. We conducted extensive performance studies
the best query performance and it is at least 4.5 times faster than using 6 real string datasets. We show that the index size and
the existing approaches. Its query time decreases as the strings index construction time for OX label and XX label can be at least
become longer because when the string becomes longer, with the one order of magnitude smaller than the up-to-date approaches,
same threshold τ , fewer candidate strings will be generated due and the hash-based approaches can significantly reduce query
to the small ratio of τ over the string length. The query time of time. We also analyze the impact of index length to the query
PivotalSearch and LFSearch increase marginally for the same rea- performance and the improvement of the proposed verification
son. Fig. 10(d) shows the index size in datasets of different string optimizations in our experiments. Scalability study is conducted to
length. Except OX/XX label and LFSearch, all other approaches confirm the efficiency of our approaches by increasing the number

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2756932, IEEE
IEEE Transactions on Knowledge and Data Transactions
Engineeringon Knowledge and Data30,
( Volume: Engineering
Issue: 1, Jan. 1 2018 )
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.XXX, NO. XXX, 2017 14

Flamingo PivotalSearch XX (L=640) OX (L=1920)


AdaptiveSearch LFSearch XX (L=1920)
4 5 5 >10
4
2000
>10 >10
Query Time (s)

Query Time (s)

Index Size (MB)


1500

Query Time (s)


3 4 3
10 10 10
1000
2 3 2
10 10 10
500

2
10 10 10 0
20% 40% 60% 80% 100% 20% 40% 60% 80% 100% 100% 200% 300% 400% 500% 100% 200% 300% 400% 500%
string number string number string length string length

(a) DNA (string number) (b) reads (string number) (c) DNA (string length) (d) DNA (string length)

Fig. 10. Scalability Study

Flamingo PivotalSearch XX (L=192) [20] M. Mitzenmacher, R. Pagh, and N. Pham. Efficient estimation for high
AdaptiveSearch LFSearch OX (L=192)
similarities using odd sketches. In Proc. of WWW’14, 2014.
>100 4 10
4
[21] M. Mitzenmacher and E. Upfal. Probability and computing: Randomized
algorithms and probabilistic analysis. Cambridge University Press, 2005.
candidate number

80 3
Query Time (s)

10
60
10
2
[22] V. M. Prieto, M. Álvarez, and F. Cacheda. Detecting linkedin spammers
40 and its spam nets. IJACSA, 4(9), 2013.
20
10 [23] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit
0 1 similarity query processing with the asymmetric signature scheme. In
20 40 60 80 100 20 40 60 80 100
string length string length
Proc. of SIGMOD’11, 2011.
[24] S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil,
(a) Query Time (b) Average Candidate Number E. Siragusa, A. Tiskin, W. Wang, J. Wang, and U. Leser. State-of-the-art
in string similarity search and join. SIGMOD Rec., 2014.
Fig. 11. short DNA string (τ =2) [25] J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string
similarity joins with edit-distance constraints. PVLDB, 3(1), 2010.
of strings in the dataset and by increasing the string length. XX [26] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive
label with L = 640 (bits) per hash-label is capable of handling framework for similarity join and search. In Proc. of SIGMOD’12, 2012.
any short/long string datasets in our testing. [27] P. Wang, C. Xiao, J. Qin, W. Wang, X. Zhang, and Y. Ishikawa. Local
similarity search for unstructured text. In Proc. of SIGMOD’16, 2016.
[28] W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An
efficient algorithm for edit similarity joins. TKDE, 25(8), 2013.
R EFERENCES [29] C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane.
Efficient error-tolerant query autocompletion. PVLDB, 2013.
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. [30] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for
In Proc. of VLDB, 2006. similarity joins with edit distance constraints. PVLDB, 1(1), 2008.
[2] A. Behm, R. Vernica, S. Alsubaiee, S. Ji, J. Lu, L. Jin, Y. Lu, and C. Li. [31] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity
UCI Flamingo Package 4.1, 2010. joins for near-duplicate detection. TODS, 36(3), 2011.
[3] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. [32] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection
Commun. ACM, 13(7), 1970. for string collections to support approximate queries efficiently. In Proc.
[4] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min- of SIGMOD’08, 2008.
wise independent permutations. In Proc. of STOC’98, 1998. [33] X. Yang, Y. Wang, B. Wang, and W. Wang. Local filtering: Improving
[5] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient the performance of approximate queries on string collections. In Proc. of
fuzzy match for online data cleaning. In Proc. of SIGMOD’03, 2003. SIGMOD’15, 2015.
[6] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for [34] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree:
similarity joins in data cleaning. In Proc of ICDE’06, 2006. an all-purpose index structure for string similarity search based on edit
[7] D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm distance. In Proc. of SIGMOD’10, 2010.
for string similarity search. In Proc. of SIGMOD’14, 2014.
[8] J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient
string similarity joins. VLDB J., 21(4):437–461, 2012. Hao Wei received his PhD degree in Department
[9] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, of SEEM, The Chinese University of Hong Kong,
and D. Srivastava. Approximate string joins in a database (almost) for Hong Kong, 2017. His research interests include
free. In Proc. of VLDB’01, 2001. graph data management, efficient algorithm de-
[10] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards sign.
removing the curse of dimensionality. In Proc. of STOC’98, 1998.
[11] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search.
In Proc. of WWW’09, 2009.
[12] Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An Jeffery Xu Yu has held faculty positions at Uni-
experimental evaluation. PVLDB, 7(8), 2014. versity of Tsukuba and Australian National Uni-
[13] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast versity. Currently, he is a Professor in the De-
and memory-efficient alignment of short dna sequences to the human partment of SEEM, The Chinese University of
genome. Genome biology, 10(3), 2009. Hong Kong, Hong Kong. His current research
[14] D. Lemire and O. Kaser. Recursive n-gram hashing is pairwise indepen- interests include graph processing and social
dent, at best. Computer Speech & Language, 24(4), 2010. network analysis.
[15] C. Li, B. Wang, and X. Yang. VGRAM: improving performance of
approximate queries on string collections using variable-length grams. Can Lu received his PhD degree in Department
In Proc. of VLDB’07, 2007. of SEEM, The Chinese University of Hong Kong,
[16] G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A partition-based Hong Kong, 2017. His research interests include
method for similarity joins. PVLDB, 5(3), 2011. social network analysis, complex network theory
[17] Y. Li, A. Terrell, and J. M. Patel. WHAM: a high-throughput sequence and graph algorithms.
alignment method. In Proc. of SIGMOD’11, 2011.
[18] W. Lu, X. Du, M. Hadjieleftheriou, and B. C. Ooi. Efficiently supporting
edit distance based string similarity search using B+-trees. TKDE,
26(12), 2014.
[19] J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substi-
tutable and complementary products. In Proceedings of SIGKDD, 2015.

1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

You might also like