Similarity Joins for Uncertain Strings∗

Manish Patil
Louisiana State University, USA
[email protected]

Rahul Shah
Louisiana State University, USA
[email protected]

∗ This work is supported in part by National Science Foundation (NSF) Grants CCF-1017623 and CCF-1218904.
ABSTRACT
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, however, applications have to deal with imprecise strings, or strings carrying fuzzy information. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using the edit distance as the measure of similarity. Given two collections of uncertain strings R, S, and input (k, τ), our task is to find the string pairs (R, S) between the collections such that Pr(ed(R, S) ≤ k) > τ, i.e., the probability that the edit distance between R and S is at most k exceeds the probability threshold τ. The join problem could be addressed by retrieving, for each string R in R, all strings in S that are similar to it. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R, S) ≤ k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we alleviate the verification cost of a string pair that survives pruning by using a trie structure, which allows us to overlap the verification cost of exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach by thorough practical experimentation.
Categories and Subject Descriptors
H.2 [DATABASE MANAGEMENT]: Systems—Query processing

Keywords
Uncertain strings; string joins; edit distance
1. INTRODUCTION
Strings form a fundamental data type in computer systems
and string searching has been extensively studied since the
inception of computer science. String similarity search takes
a set of strings and a query string as input, and outputs all
the strings in the set that are similar to the query string. A join extends the notion of similarity search further and requires all similar string pairs between two input string sets to be reported. Both similarity search and similarity join are
central to many applications such as data integration and
cleaning. Edit distance is the most commonly used similarity measure for strings. The edit distance between two
strings r and s, denoted by ed(r, s), is the minimum number of single-character edit operations (insertion, deletion,
and substitution) needed to transform r to s. Edit distance
based string similarity search and join has been extensively
studied in the literature for deterministic strings [7, 3, 2, 13,
18, 5]. However, due to the large number of applications
where uncertainty or imprecision in values is either inherent
or desirable, recent years have witnessed increasing attention
devoted to managing uncertain data. Several probabilistic
database management systems (PDBMS), which can represent and manage data with explicit probabilistic models of
uncertainty, have been proposed to date [17, 16]. Imprecision in data introduces many challenges for similarity search
and join in databases with probabilistic string attributes,
which is the focus of this paper.
Uncertainty model: Analogous to the models of uncertain databases, two models, string-level and character-level, have been proposed by Jestes et al. [10] for uncertain strings. In the string-level uncertainty model, all possible instances of an uncertain string are explicitly listed to form a probability distribution function (pdf). In contrast, the character-level model describes a distribution over the alphabet for each uncertain position in the string. We focus on the character-level model, as it is realistic and concise in representing string uncertainty. Let Σ = {c1, c2, ..., cσ} be the alphabet. A character-level uncertain string is S = S[1]S[2]...S[l], where each S[i] (1 ≤ i ≤ l) is a random variable with a discrete distribution over Σ, i.e., S[i] is a set of pairs (cj, pi(cj)), where cj ∈ Σ and pi(cj) is the probability of having symbol cj at position i. Formally, S[i] = {(cj, pi(cj)) | cj ≠ cm for j ≠ m, and Σj pi(cj) = 1}. When the context of a string is unclear we represent pi(cj)
for string S by Pr(S[i] = cj). Throughout, we use a lowercase character to denote a deterministic string (s), as against an uncertain string, which is denoted by an uppercase character (S). Let |S| (|s|) be the length of string S (s). The possible worlds of S are the set of all possible instances s of S, each with probability p(s), where Σs p(s) = 1. S being a character-level uncertain string, |S| = |s| for every possible instance s.
Query semantics: In addition to capturing uncertainty in the data, one must define the semantics of queries over the data. In this regard, the powerful model of possible-world semantics has been the backbone of analyzing the correctness of database operations on uncertain data. For uncertain string attributes, Jestes et al. [10] made the first attempt to extend the notion of similarity. They used the expected edit distance (eed) over all possible worlds of two uncertain strings. Given strings R and S, eed(R, S) = Σ_{ri,sj} p(ri) p(sj) ed(ri, sj), where sj (ri) is an instance of S (R) with probability p(sj) (p(ri)). Though eed seems like a natural extension of edit distance as a measure of similarity, it has been shown that it does not implement the possible-world semantics completely at the query level [6]. Consider a similarity search query on a collection of deterministic strings with input string r. Then, string s is an output only if ed(r, s) ≤ k. For such a query R over an uncertain string collection, possible-world semantics dictate that we apply the same predicate ed(r, s) ≤ k for each possible instance r of R and s of S, and aggregate this over all worlds. Thus, a possible world with instances r, s can contribute in deciding whether S is similar to R only if s is within the desired edit distance of r. Under the eed measure, however, all possible worlds contribute (weighted by their edit distance) to the overall score that determines the similarity of S with R, irrespective of whether they satisfy the edit-distance predicate. To overcome this problem, the authors of [6] proposed the (k, τ)-matching semantics. Under this semantics, given an edit distance threshold k and a probability threshold τ, R is similar to S if Pr(ed(R, S) ≤ k) > τ. We use this similarity definition for answering join queries in this paper.

Problem definition: Given two sets of uncertain strings R and S, an edit-distance threshold k and a probability threshold τ, a similarity join finds all similar string pairs (R, S) ∈ R × S such that Pr(ed(R, S) ≤ k) > τ. Without loss of generality, we focus on self join in this paper, i.e., R = S.

Related work: Uncertain/probabilistic strings have been the subject of study for the past several years. Efficient algorithms and data structures are known for the problem of string searching in uncertain text [8, 1, 9, 19]. The authors of [6] studied the approximate substring matching problem, where the goal is to report the positions of all substrings of an uncertain text that are similar to the query string. Recently, the problem of similarity search on a collection of uncertain strings has been addressed in [4]. However, most of these works support only deterministic strings as query input. Utilizing these techniques for an uncertain string as input would invariably require all of its possible worlds to be enumerated, which may not be feasible given the resultant exponential blowup in query cost. Though the problem of similarity join on uncertain strings has been studied in [10], it uses expected edit distance as the measure of similarity. In this paper, we make an attempt to address some of the challenges involved in uncertain string processing by investigating similarity joins.

Our Contributions: In this paper, we present a comprehensive investigation of the problem of similarity joins for uncertain strings using (k, τ)-matching [6] as the similarity definition and make the following contributions:
• We propose a filtering scheme that integrates q-gram filtering with probabilistic pruning, and we present an indexing scheme to facilitate such filtering.
• We extend the frequency distance filtering introduced in [4] to an uncertain string pair and improve its performance while maintaining the same space requirement.
• We propose the use of a trie data structure to efficiently compute the exact similarity probability (as given by (k, τ)-matching) for a candidate pair (R, S) without explicitly comparing all possible string pairs.
• We conduct comprehensive experiments which demonstrate the effectiveness of all proposed techniques in answering similarity join queries.
2. PRELIMINARIES
In this section we briefly review filtering techniques for deterministic strings available in the literature; we extend them to uncertain strings later in the article. Let r and s be two deterministic strings and k be the edit distance threshold.
2.1 q-gram Filtering
We partition s into k + 1 disjoint segments s^1, s^2, ..., s^{k+1}. For simplicity, let each segment be of length q ≥ 1, i.e., s^x = s[((x − 1)q + 1)..xq]. Further, let pos(s^x) denote the starting position of segment s^x in string s, i.e., pos(s^x) = (x − 1)q + 1. Then, by the pigeonhole principle, if r is similar to s, it must contain a substring that matches a segment of s. A naive method to exploit this is to obtain a set q(r) enumerating all substrings of r of length q and, for each substring, check whether it matches s^x for x = 1, 2, ..., k + 1. However, the authors of [14] have shown that we can obtain a set q(r, x) ⊆ q(r) for each segment of s such that it is sufficient to test each substring w ∈ q(r, x) for a match with s^x. Table 1 (Section 3) shows the sets q(r, x) populated for a sample string r. Using the proposed "position aware" substring selection, the set q(r, x) includes the substrings of r of length q whose start positions lie in the range [pos(s^x) − ⌊(k − Δ)/2⌋, pos(s^x) + ⌊(k + Δ)/2⌋], where Δ denotes the length difference |r| − |s|. The number of substrings in the set q(r, x) is thus bounded by k + 1. The authors of [14] prove that this substring selection satisfies "completeness", ensuring that any similar pair (r, s) will be found as a candidate pair. We use a generalization of this filtering technique that partitions s into m > k segments [14, 15], as summarized in the following lemma.
Lemma 1. Given strings r and s, with s partitioned into m > k disjoint segments, if r is similar to s within edit threshold k, then r must contain substrings that match at least (m − k) segments of s.

Once again assuming each segment of s to be of length q ≥ 1, we can compute the set q(r, x) and attempt to match each w ∈ q(r, x) with s^x as before to apply the above lemma.
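To make the partition-and-match step concrete, the following sketch (a minimal illustration under our own simplifying assumptions: 0-indexed positions, and Δ taken as the length difference |r| − |s|) counts the segments of s matched by position-aware substrings of r:

def partition(s, m):
    # Even-partition scheme: split s into m disjoint segments; with
    # q = len(s) // m, the trailing segments get length q + 1.
    q, extra = divmod(len(s), m)
    segs, start = [], 0
    for x in range(m):
        length = q + 1 if x >= m - extra else q
        segs.append((start, s[start:start + length]))
        start += length
    return segs

def matched_segments(r, s, m, k):
    # Count segments of s matched by a substring of r whose start
    # position lies in the position-aware window described above.
    delta = len(r) - len(s)
    count = 0
    for start, seg in partition(s, m):
        lo = max(0, start - (k - delta) // 2)
        hi = min(len(r) - len(seg), start + (k + delta) // 2)
        if any(r[p:p + len(seg)] == seg for p in range(lo, hi + 1)):
            count += 1
    return count

# Lemma 1: prune (r, s) whenever fewer than m - k segments are matched.
r, s, m, k = "GGATCC", "GGCTCC", 3, 1
print(matched_segments(r, s, m, k) >= m - k)  # True; indeed ed(r, s) = 1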
2.2 Frequency Distance Filtering
Given a string s over the alphabet Σ, its frequency vector f(s) is defined as f(s) = [f(s)1, f(s)2, ..., f(s)σ], where f(s)i is the number of occurrences in s of the i-th symbol ci of Σ. Let f(r) and f(s) be the frequency vectors of r and s respectively. Then the frequency distance of r and s is defined as fd(r, s) = max{pD, nD}, where

pD = Σ_{f(r)i > f(s)i} (f(r)i − f(s)i),    nD = Σ_{f(r)i < f(s)i} (f(s)i − f(r)i)
Frequency distance provides a lower bound on the edit distance between r and s, i.e., fd(r, s) ≤ ed(r, s), and can be computed efficiently [12]. Thus, we can safely decide that s is not similar to r if fd(r, s) > k.
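This definition translates directly into a few lines of Python (a small self-contained sketch):

from collections import Counter

def freq_distance(r, s):
    # fd(r, s) = max(pD, nD): a lower bound on ed(r, s).
    fr, fs = Counter(r), Counter(s)
    pD = sum(fr[c] - fs[c] for c in fr if fr[c] > fs[c])
    nD = sum(fs[c] - fr[c] for c in fs if fs[c] > fr[c])
    return max(pD, nD)

# Prune: s cannot be within edit distance k of r if freq_distance(r, s) > k.
print(freq_distance("GGATCC", "AAAA"))  # 5, so the pair is pruned for any k <= 4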
3. q-GRAM FILTERING
In this section we adopt and extend the ideas introduced for deterministic strings in Section 2.1 to uncertain strings. We begin with the simpler case where one of the two uncertain strings R and S is deterministic. Let R be that string, with r being its only possible instance. We try to achieve an upper bound on the probability of r and S being similar, i.e., Pr(ed(r, S) ≤ k). We then build upon this result for the case when both strings are uncertain and obtain an upper bound on the probability of R and S being similar, i.e., Pr(ed(R, S) ≤ k).
Before proceeding, we introduce some notation and definitions. A string w of length l matches a substring of T starting at position i with probability Pr(w = T[i..i + l − 1]) = ∏_{ps=1}^{l} p_{i+ps−1}(w[ps]). A string w matches T with probability Pr(w = T) = ∏_{ps=1}^{l} p_{ps}(w[ps]) if |w| = |T| = l; otherwise the probability is 0. We simply say w matches T (or vice versa) if Pr(w = T) > 0. The probability of an uncertain string W matching T is given by Pr(W = T) = ∏_{ps=1}^{l} Σ_{cj∈Σ} Pr(W[ps] = cj) × Pr(T[ps] = cj). Once again, we say W matches T if Pr(W = T) > 0 for simplicity.
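As an illustration of these definitions, the match probability can be computed position by position (a sketch; here an uncertain string is encoded as a list whose entries are either a plain character or a {symbol: probability} dict, an assumed representation):

def char_dist(c):
    # A deterministic character is a distribution with a single symbol.
    return c if isinstance(c, dict) else {c: 1.0}

def match_prob(W, T):
    # Pr(W = T) = prod_ps sum_c Pr(W[ps] = c) * Pr(T[ps] = c); 0 if lengths differ.
    if len(W) != len(T):
        return 0.0
    prob = 1.0
    for wc, tc in zip(W, T):
        dw, dt = char_dist(wc), char_dist(tc)
        prob *= sum(p * dt.get(c, 0.0) for c, p in dw.items())
    return prob

# Segment S3^1 = G{(A,0.8),(G,0.2)} of Table 1 against w = GG:
print(match_prob("GG", ["G", {"A": 0.8, "G": 0.2}]))  # 0.2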
Table 1: Application of q-gram Filtering (m = 3, q = 2, k = 1, τ = 0.25)

r = GGATCC;   q(r, 1) = {GG, GA},   q(r, 2) = {GA, AT, TC},   q(r, 3) = {TC, CC}

S1 = A{(C,0.5),(G,0.5)}A{(C,0.5),(G,0.5)}AC
  S1^1: (AC,0.5), (AG,0.5)    S1^2: (AC,0.5), (AG,0.5)    S1^3: (AC,1)
S2 = AA{(G,0.9),(T,0.1)}G{(C,0.3),(G,0.2),(T,0.5)}C
  S2^1: (AA,1)    S2^2: (GG,0.9), (TG,0.1)    S2^3: (CC,0.3)✓, (GC,0.2), (TC,0.5)✓
S3 = G{(A,0.8),(G,0.2)}CT{(A,0.8),(C,0.1),(T,0.1)}C
  S3^1: (GA,0.8)✓, (GG,0.2)✓    S3^2: (CT,1)    S3^3: (AC,0.8), (CC,0.1)✓, (TC,0.1)✓
S4 = {(G,0.8),(T,0.2)}GA{(C,0.3),(G,0.2),(T,0.5)}CT
  S4^1: (GG,0.8)✓, (TG,0.2)    S4^2: (AC,0.3), (AG,0.2), (AT,0.5)✓    S4^3: (CT,1)

(✓ marks segment instances that match a substring in the corresponding set q(r, x).)
3.1 Bounding Pr(ed(r, S) ≤ k)
The possible worlds Ω of S are the set of all possible instances of S. A possible world pwj ∈ Ω is a pair (sj, p(sj)), where sj is an instance of S with probability p(sj). Let p(pwj) = p(sj) denote the probability of existence of possible world pwj. Note that sj is a deterministic string, and Σ_{pwj∈Ω} p(pwj) = 1. Then, by definition, Pr(ed(r, S) ≤ k) = Σ_{ed(r,sj)≤k} p(pwj). We first establish a necessary condition for r to be similar to S within edit threshold k, i.e., for Pr(ed(r, S) ≤ k) > 0, and then provide an upper bound for this probability.

Necessary condition for Pr(ed(r, S) ≤ k) > 0:
We partition the string S into m > k disjoint substrings. For simplicity, let q be the length of each partition. Note that each partition S^1, S^2, ..., S^m is an uncertain string. Let r contain substrings matching m′ ≤ m segments of S, i.e., the number of segments of S with Pr(w = S^x) > 0 for some substring w of r is m′. Then it can be seen that, for any pwj ∈ Ω, r contains substrings that match at most m′ segments among sj^1, sj^2, ..., sj^m that partition sj. Based on this observation, the following lemma establishes the necessary condition for Pr(ed(r, S) ≤ k) > 0.

Lemma 2. Given a string r and a string S partitioned into m > k disjoint segments, for r to be similar to S, i.e., Pr(ed(r, S) ≤ k) > 0, r must contain substrings that match at least (m − k) segments of S.

To apply the above lemma, we can obtain a set q(r, x) using position-aware selection as described earlier and use it to match against segment S^x. Table 1 shows the above lemma applied to a collection of uncertain strings. None of the segments of S1 match any substring of r, so (r, S1) cannot form a candidate pair. For S2, even though the second segment matches some substring of r, we do not use this match: position-aware substring selection tells us that such an alignment cannot lead to an instance of S that is similar to r. We can reject S2 as well, since it has only one matched segment. Strings S3 and S4 survive this pruning step and are taken forward.
Computing an upper bound for Pr(ed(r, S) ≤ k):
So far we were interested in knowing whether there exists a substring w ∈ q(r, x) that matches segment S^x. We now compute the probability that one or more substrings in q(r, x) match S^x. Let Ex denote such an event, with probability αx. Then αx = Pr(Ex) = Σ_{w∈q(r,x)} Pr(w = S^x). The correctness of αx relies on the following observations:
• q(r, x), being a set, contains only distinct substrings.
• The event of substring wi ∈ q(r, x) matching S^x is independent of the event of substring wj ∈ q(r, x) matching S^x, for wi ≠ wj.
Next, our idea is to prune out the possible worlds of S which cannot satisfy the edit-distance threshold k with r, obtaining a set C ⊆ Ω of candidate worlds. We can then use Pr(C) = Σ_{pwj∈C} p(pwj) as the upper bound on Pr(ed(r, S) ≤ k). Consider a possible world pwj in which sj is the possible instance of S. Since sj is a deterministic string, we can apply the q-gram filtering process of Section 2.1 to quickly assess whether sj can be within edit distance k of r. If yes, pwj is a candidate world and we include it in C. This naive method requires all possible worlds of S to be instantiated and hence is too expensive to use. Below we show how to achieve the desired upper bound, i.e., Pr(C), without explicitly listing the sets Ω or C.
For ease of explanation, let m = k + 1. We partition the possible worlds in Ω into sets Ω0, Ω1, ..., Ωm such that:
• Ωy includes any possible world pwj where r contains substrings matching exactly y segments among sj^1, ..., sj^m that partition sj, i.e., y = |{sj^x | sj^x ∈ q(r, x) for x = 1, 2, ..., m}|.
• Ω = Ω0 ∪ Ω1 ∪ Ω2 ∪ ... ∪ Ωm
• Ωy ∩ Ωz = ∅ for y ≠ z
With this partitioning of Ω, we have the following:

Pr(C) = Pr(Ω1 ∪ Ω2 ∪ ... ∪ Ωm) = Pr(Ω \ Ω0) = Pr(Ω) − Pr(Ω0) = 1 − ∏_{x=1}^{m} (1 − αx)

In the above equation, Ω0 denotes the event that none of the segments of S match substrings of r. By a slight abuse of notation, we say S^x matches r (using position-aware substring selection) if αx > 0. The following lemma summarizes our result on the upper bound.

Lemma 3. Let r and S be the given strings with edit threshold k. If S is partitioned into m = k + 1 disjoint segments, then Pr(ed(r, S) ≤ k) is upper bounded by 1 − ∏_{x=1}^{m} (1 − αx), where αx is the probability that segment S^x matches r.

Generalizing the upper bound for m > k:
Finally, we turn our attention to computing Pr(C) for the scenario where S is partitioned into m > k segments. Once again using the partitioning of Ω introduced above, Pr(C) = Pr(∪_{y=m−k}^{m} Ωy) = Σ_{y=m−k}^{m} Pr(Ωy). We then observe that computing Pr(Ωy) in this equation boils down to the following problem: there are m events Ex (x = 1, 2, ..., m), and we are given Pr(Ex) = αx; what is the probability that exactly y of these m events happen? Our solution is as follows. Let Pr(i, j) denote the probability that, among the first i events, exactly j happen. We then have the recurrence Pr(i, j) = Pr(Ei) · Pr(i − 1, j − 1) + (1 − Pr(Ei)) · Pr(i − 1, j). By populating an m × m matrix using a dynamic programming algorithm based on this recurrence, we can look up the last column to find Pr(Ωy) for y = m − k, ..., m. This gives us an efficient (O(m²)) way to compute Pr(C). We note that it is possible to improve the running time to O(m(m − k)), but we leave out the details for simplicity.
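The recurrence is easy to implement; the following sketch computes Pr(C) = Σ_{y=m−k}^{m} Pr(Ωy) from the αx values (an illustration only):

def prob_at_least(alphas, t):
    # P[i][j] = probability that exactly j of the first i events occur.
    m = len(alphas)
    P = [[0.0] * (m + 1) for _ in range(m + 1)]
    P[0][0] = 1.0
    for i in range(1, m + 1):
        a = alphas[i - 1]
        for j in range(i + 1):
            P[i][j] = (1 - a) * P[i - 1][j]
            if j > 0:
                P[i][j] += a * P[i - 1][j - 1]
    return sum(P[m][t:])  # Pr(at least t of the m events happen)

# S4 of Table 1: alphas = (0.8, 0.5, 0) with m = 3, k = 1, so t = m - k = 2:
print(prob_at_least([0.8, 0.5, 0.0], 2))  # 0.4, the upper bound used in the text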
Theorem 1. Let r and S be the given strings with edit threshold k. Assume S is partitioned into m > k disjoint segments, and let αx denote the probability that segment S^x matches r. Then Pr(ed(r, S) ≤ k) is upper bounded by the probability that at least (m − k) segments of S match r; in other words, the probability that r contains substrings matching at least (m − k) segments of S.

Continuing the example in Table 1, we now apply the above theorem to strings S3 and S4. For S3 we have α1 = 1, α2 = 0, and α3 = 0.2. Therefore the upper bound on S3's similarity with r is 0.2 < τ, and S3 can be rejected. Even though four out of six possible worlds of S3 contribute to C, the probability of each of them is small and their collective contribution falls short of τ. Similarly, the upper bound for S4 can be computed as 0.4, and the pair (r, S4) qualifies as a candidate pair. Thus Theorem 1, with the implicit requirement that Lemma 2 be satisfied, integrates q-gram filtering and probabilistic pruning.

Let string S be preprocessed such that each segment S^x is maintained as a list of pairs (sj^x, p(sj^x)), where sj^x is an instance of S^x with probability p(sj^x). Also assume r is preprocessed and the sets q(r, x) are available to us for x = 1, 2, ..., m (|q(r, x)| = k + 1, Σ_{x=1}^{m} |q(r, x)| = (k + 1)m). Then the desired upper bound can be computed efficiently by applying the above theorem, as it only adds the following computational overhead compared to its counterpart for deterministic strings: (1) the cost of computing αx, which is bounded by k per segment and mk overall, and (2) the cost of computing Pr(C) using dynamic programming, which is bounded by m(m − k).

3.2 Bounding Pr(ed(R, S) ≤ k)
In this subsection, we follow the analysis of the previous subsection, taking into account the uncertainty introduced in string R. The possible worlds Ω of R and S are the set of all possible instances of R × S. A possible world pwi,j ∈ Ω is a pair ((ri, sj), p(ri) · p(sj)), where sj (ri) is an instance of S (R) with probability p(sj) (p(ri)). Here p(ri) · p(sj) denotes the probability of existence of possible world pwi,j, and Σ_{i,j} p(pwi,j) = 1. Then, by definition, Pr(ed(R, S) ≤ k) = Σ_{ed(ri,sj)≤k} p(pwi,j).

Necessary condition for Pr(ed(R, S) ≤ k) > 0:
We begin by partitioning the string S into m > k disjoint substrings as before, and assume q to be the length of each partition. The following lemma establishes the necessary condition for R to be similar to S within the edit threshold.

Lemma 4. Given a string R and a string S partitioned into m > k disjoint segments, for R to be similar to S, i.e., Pr(ed(R, S) ≤ k) > 0, R must contain substrings that match at least (m − k) segments of S.

The correctness of the above lemma can be verified by extending the earlier observation as follows. Let R contain substrings matching m′ ≤ m segments of S, i.e., the number of segments of S with Pr(W = S^x) > 0 for some (uncertain) substring W of R is m′. Then, for any pwi,j ∈ Ω, ri contains substrings that match at most m′ segments among sj^1, sj^2, ..., sj^m that partition sj. Next, we obtain a set q(R, x) for each segment S^x of S using position-aware substring selection. This allows us to test only the substrings W ∈ q(R, x) for a match against S^x. We highlight that the substring selection mechanism relies only on the lengths of the two strings R and S and on the start positions of a substring W of R and of S^x. Therefore, following the same arguments as in [14], we can prove that any similar pair (R, S) will be reported as a candidate.
Computing αx:
Let Ex denote the event that one or more substrings in the set q(R, x) match segment S^x, and let αx be its probability. Using a trivial extension of the earlier result in Section 3.1, one might compute αx = Pr(Ex) = Σ_{W∈q(R,x)} Pr(W = S^x). However, we show that this leads to an incorrect computation of αx and requires careful investigation. Let R = A{(A, 0.8), (C, 0.2)}AATT, S = A{(A, 0.8), (C, 0.2)}AGCT, k = 1, and q = 3. Then we have S^1 = A{(A, 0.8), (C, 0.2)}A and q(R, 1) = {A{(A, 0.8), (C, 0.2)}A, {(A, 0.8), (C, 0.2)}AA}. Using the above formula, Pr(E1) = 0.64 + 0.04 + 0.64 = 1.32, which is clearly incorrect. To understand the scenario better, let us replace each substring W ∈ q(R, x) by a list of pairs (wj, p(wj)), where wj is an instance of W with probability p(wj). Note that this is only a different way of representing the set q(R, x); both representations are equivalent. Now q(R, 1) = {(AAA, 0.8), (ACA, 0.2), (AAA, 0.8), (CAA, 0.2)} and Pr(E1) = Σ_{w∈q(R,x)} p(w) × Pr(w = S^x) = 1.32 as before. However, this representation reveals that we have violated the second observation, which requires the matching of two substrings wi, wj ∈ q(R, x) with S^x to be independent events. In the current example, both occurrences of the substring AAA in q(R, 1) belong to the same possible world, and effectively its probability contributes twice to Pr(E1).
We overcome this issue by obtaining an equivalent set q(r, x) of q(R, x) that satisfies the substring-uniqueness requirement, i.e., wi ≠ wj for all wi, wj ∈ q(r, x) with i ≠ j, which implicitly makes the matching of two of its substrings with S^x independent events. To achieve this, we pick all distinct (deterministic) substrings w ∈ q(R, x) (think of the representation of q(R, x) consisting of (wj, p(wj)) pairs) to be part of q(r, x). To distinguish between the two sets, let pR(wj) represent the probability associated with substring wj in q(R, x) and pr(wj) the same for q(r, x). We then maintain the equivalence of the sets by following the two-step process described below for each w ∈ q(r, x), obtaining the probability pr(w) to be associated with it.

Step 1: Sort all occurrences of w in q(R, x) by their start positions in R. Group together all occurrences that overlap each other in R to obtain groups g1, g2, .... Then no two occurrences across groups overlap each other. Such a grouping is required only when there is a suffix-prefix match for w (i.e., some suffix of w represents the same string as its prefix); otherwise all of its overlapping occurrences represent different possible worlds of R and hence each forms a group by itself. We assign a probability p(gi) to each group gi as follows. Let psj represent the start position of occurrence wj in R for j = 1, 2, ..., |gi|. The region of overlap between an occurrence wj of w and its previous occurrence in R is given by the range [y, z] = [psj, ps_{j−1} + q − 1]. We define

βj = β_{j−1} + pR(wj) − Pr(wj[1..(z − y + 1)] = R[y..z])

with the initial conditions β0 = 1, ps0 = −1. Then p(gi) = β_{|gi|}. In essence, we keep adding the probability of every occurrence while taking out the probability of its overlap.

Step 2: Assign pr(w) = 1 − ∏_i (1 − p(gi)).

The first step combines all overlapping occurrences into a single event, and the second step computes the probability that at least one of these events takes place. Now we can correctly compute the probability of the event Ex that segment S^x matches substrings in q(R, x) by using its equivalent set q(r, x): αx = Pr(Ex) = Σ_{w∈q(r,x)} pr(w) × Pr(w = S^x). For the example under consideration, for the substring AAA we obtain a single group with associated probability 0.8 using the process described above. Then q(r, 1) = {(AAA, 0.8), (ACA, 0.2), (CAA, 0.2)} and Pr(E1) = 0.68 is correctly computed.
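For intuition, the quantity pr(w) produced by Steps 1 and 2 is exactly the probability that w occurs at some allowed start position of R. The brute-force check below computes the same value by enumerating the instances of the relevant window of R (a reference sketch only; the two-step process above obtains it without enumeration):

from itertools import product

def pr_w_in_window(R, w, lo, hi):
    # Pr that w matches R at some start position in [lo, hi] (0-indexed).
    window = R[lo:hi + len(w)]
    total = 0.0
    for inst in product(*[d.items() for d in window]):
        chars = "".join(c for c, _ in inst)
        p = 1.0
        for _, pc in inst:
            p *= pc
        if any(chars[s:s + len(w)] == w for s in range(hi - lo + 1)):
            total += p
    return total

# Example from the text: R = A{(A,0.8),(C,0.2)}AATT, w = AAA, starts {0, 1}:
R = [{"A": 1.0}, {"A": 0.8, "C": 0.2}, {"A": 1.0}, {"A": 1.0}, {"T": 1.0}, {"T": 1.0}]
print(pr_w_in_window(R, "AAA", 0, 1))  # 0.8, the group probability p(g1)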
Computing an upper bound for Pr(ed(R, S) ≤ k):
Finally, to obtain the upper bound on Pr(ed(R, S) ≤ k), we obtain the set C ⊆ Ω by pruning out those possible worlds which cannot satisfy the edit-distance threshold k. Consider a possible world pwi,j in which sj (ri) is a possible instance of S (R). Both ri and sj being deterministic strings, we can quickly assess whether ri and sj can be within edit distance k by applying the q-gram filtering process of Section 2.1. If affirmative, pwi,j is a candidate world and we include it in C. However, our goal is to compute Pr(C) without enumerating all possible worlds of R × S. As before, we partition the possible worlds in Ω into sets Ω0, Ω1, ..., Ωm such that Ω = ∪_{y=0}^{m} Ωy and Ωy ∩ Ωz = ∅ for y ≠ z. Moreover, Ωy includes any possible world pwi,j where ri contains substrings matching exactly y segments among sj^1, ..., sj^m that partition sj, i.e., y = |{sj^x | sj^x ∈ q(ri, x) for x = 1, 2, ..., m}|. Then Pr(C) = Pr(∪_{y=m−k}^{m} Ωy) = Σ_{y=m−k}^{m} Pr(Ωy), which can be computed by the same dynamic programming approach described earlier. Therefore, the key difference of the current scenario (both R and S uncertain) from the previous subsection is the computation of αx. After computing all αx, we can directly apply Lemma 2 and Theorem 1, which are rewritten below. By a slight abuse of notation as before, we say S^x matches R if αx > 0.

Lemma 5. Let R, S be the given strings with edit threshold k. If S is partitioned into m = k + 1 disjoint segments, then Pr(ed(R, S) ≤ k) is upper bounded by 1 − ∏_{x=1}^{m} (1 − αx), where αx is the probability that segment S^x matches R.

Theorem 2. Let R, S be the given strings with edit threshold k. Assume S is partitioned into m > k disjoint segments, and let αx denote the probability that segment S^x matches R. Then Pr(ed(R, S) ≤ k) is upper bounded by the probability that at least (m − k) segments of S match R; in other words, the probability that R contains substrings matching at least (m − k) segments of S.

It is evident that the cost of computing the upper bound in the above theorem is dominated by the computation of the sets q(r, x). If this is assumed to be part of preprocessing, then the overhead involved is exactly the same as in the previous subsection. Let the fraction of uncertain characters in the strings be θ, and let the average number of alternatives of an uncertain character be γ. For the analysis of q-gram filtering, we assume from now on that the uncertain character positions are uniformly distributed. Then |q(r, x)| = (k + 1)γ^{θ·q}, and computing the set q(r, x) for each segment takes qγ^{θ·q} times longer than when string R is deterministic (previous subsection). Note that the multiplicative factor q appears only when a substring w has a suffix-prefix match and its occurrences in the set q(R, x) overlap. Assuming typical values θ = 20%, γ = 5, and q = 3, it takes only about two and a half times longer to compute αx using q(r, x) when R is uncertain.
4. INDEXING
Using Theorem 2, we observe that if a string R does not have substrings that match a sufficient number of segments of S, we can prune the pair (R, S). We use an indexing technique that facilitates this pruning of large numbers of dissimilar pairs. So far we assumed each string S is partitioned into m segments, each of length q. In practice, we fix q as a system parameter and then divide S into as many disjoint segments as necessary, i.e., m = max(k + 1, ⌊|S|/q⌋). Without loss of generality, let m = ⌊|S|/q⌋. We use an even-partition scheme [14, 15] so that each segment has length q or q + 1: we partition S such that the last |S| − ⌊|S|/q⌋ · q segments have length q + 1 and the rest have length q.
Let S_l denote the set of strings of length l, and let S_l^x denote the set of x-th segments of the strings in S_l. We build an inverted index for each S_l^x, denoted L_l^x, as follows. Consider a string Si ∈ S_l. We instantiate all possibilities of its segment Si^x and add them to L_l^x along with their probabilities. Thus L_l^x is a list of deterministic strings, and for each such string w, its inverted list L_l^x(w) is the set of uncertain strings whose x-th segment matches w, tagged with the probability of such a match. To be precise, L_l^x(w) is an enumeration of pairs (i, Pr(w = Si^x)), where i is the string-id. By design, each inverted list L_l^x(w) is sorted by string-id, as described later. We emphasize that a string-id i appears at most once in any L_l^x(w), and in as many lists L_l^x(w) as the number of possible instances of Si^x. We use these inverted indices to answer the similarity join query as follows.
We sort the strings by length in ascending order and visit them in that order. Consider the current
string R = Si. We find strings similar to R among only the already-visited strings, using the inverted indices. This implies we maintain indices only for visited strings, which avoids enumerating a string pair twice. Clearly, we need to look for similar strings in S_l, by querying its associated index, only if |R| − k ≤ l ≤ |R|. To find strings similar to R, we first obtain candidate strings using the proposed indexing, as described in the next paragraph. We then subject these candidate pairs to frequency distance filtering (Section 5). Candidate pairs that survive both these steps are evaluated with CDF bounds (Section 6.1), with the final verification step (Section 6.2) outputting only the strings that are similar to R. After finding the strings similar to R = Si, we partition Si into m > k (as dictated by q) segments and insert the segments into the appropriate inverted index. Then we move on to the next string R = S_{i+1} and iteratively find all similar pairs.
Finally, given a string R, we show how to query the index associated with S_l to find candidate pairs (R, S) such that S ∈ S_l and Pr(ed(R, S) ≤ k) > τ. We preprocess R to obtain the sets q(r, x), which are used to query each inverted index L_l^x. For each w ∈ q(r, x) we obtain the inverted list L_l^x(w). Since all lists are sorted by string-id, we can scan them in parallel to produce a merged (union) list of all string-ids i, along with the αx computed for each of them. We maintain a top pointer in each L_l^x(w) list, initially pointing to its first element. At each step, we find the minimum string-id i among the elements currently at the tops of the lists and compute αx for the pair (R, Si) using the probabilities associated with string-id i in all the L_l^x(w) lists in which it is present. After outputting the string-id and its αx as a pair in the merged list, we advance the top pointers of those L_l^x(w) lists whose top currently points to the element with string-id i. Let the merged list be Lαx. Once again, all lists Lαx for x = 1, 2, ..., m are sorted by string-id. Therefore, by employing top pointers and scanning the lists Lαx in parallel, we can count the number of segments of Si matched by their respective q(r, x) by counting the number of Lαx lists that contain string-id i. If the count is less than m − k, we can safely prune the candidate pair (R, Si) using Lemma 5. Otherwise, we compute the upper bound on Pr(ed(R, Si) ≤ k) by supplying the already-computed αx values to the dynamic programming algorithm. If the upper bound does not meet our probability threshold requirement, we can discard string Si, as it cannot be similar to R by Theorem 2; otherwise (R, Si) is a candidate pair.
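A simplified sketch of this candidate-generation step is given below (for brevity it uses hash maps instead of the sorted-list merge with top pointers described above; with an uncertain R, the probability p stored in a list would be pr(w) · Pr(w = Si^x)):

from collections import defaultdict

def candidate_alphas(q_r, L):
    # q_r[x]: substrings queried against segment x; L[x][w]: inverted list
    # of (string_id, Pr(w = Si^x)) pairs. Returns {string_id: {x: alpha_x}}.
    alphas = defaultdict(dict)
    for x, words in enumerate(q_r):
        for w in words:
            for sid, p in L[x].get(w, []):
                # Matches of distinct substrings are disjoint events, so
                # their probabilities accumulate into alpha_x.
                alphas[sid][x] = alphas[sid].get(x, 0.0) + p
    return alphas

def surviving_candidates(q_r, L, m, k):
    # Keep only string-ids whose matched-segment count reaches m - k;
    # their alphas then feed the dynamic program for the upper bound.
    return {sid: a for sid, a in candidate_alphas(q_r, L).items()
            if len(a) >= m - k}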
Given a string R, the proposed indexing scheme allows us to obtain all strings S ∈ S that are likely to be similar to R without explicitly comparing R to each and every string in S, as has been done for related problems in the area of uncertain strings [10, 6, 4]. For a string r in a deterministic string collection, we need to consider m(k + 1) of its substrings while answering the join query using the procedure just described. In comparison, in the probabilistic setting we need to consider m(k + 1)γ^{θ·q} deterministic substrings of R. Moreover, a string-id can belong to at most γ^{θ·q} inverted lists of L_l^x in the probabilistic setting, whereas the inverted lists are disjoint for a deterministic string collection. Thus, the proposed indexing achieves competitive performance against its counterpart for answering a join query over deterministic strings. Further, the indexing scheme uses disjoint q-grams of strings instead of overlapping ones as in [6, 4]. This allows us to use a slightly larger q with the same storage requirements.
5. FREQUENCY DISTANCE FILTERING
As noted in [4], frequency distance displays great variation as the number of uncertain positions in a string increases, and it can be effective in pruning dissimilar string pairs. We first obtain a simple lower bound on fd(R, S) and then show how to quickly compute an upper bound on the corresponding probability. For each character ci ∈ Σ, let f(S)i^c and f(S)i^t denote the minimum and maximum possible number of its occurrences in S, respectively. For brevity, we drop the function notation and denote these counts by fSi^c and fSi^t. Note that fSi^c also represents the number of occurrences of ci in S with probability 1, and fSi^t represents the number of certain and uncertain positions of ci. Thus fSi^u = fSi^t − fSi^c gives the number of uncertain positions of ci in S. fRi^c, fRi^u and fRi^t are defined similarly. We observe that, if fRi^t < fSi^c, any possible world pw of R × S will have a frequency distance of at least (fSi^c − fRi^t). By generalizing this observation, we obtain a lower bound on fd(R, S), summarized below.
Lemma 6. Let R and S be two strings over the same alphabet Σ. Then we have fd(R, S) ≥ max{pD, nD}, where

pD = Σ_{fSi^t < fRi^c} (fRi^c − fSi^t),    nD = Σ_{fRi^t < fSi^c} (fSi^c − fRi^t)
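In code form, this lower bound needs only the per-character minimum and maximum counts (a small sketch using dicts keyed by character, an assumed representation):

def fd_lower_bound(fRc, fRt, fSc, fSt):
    # fXc[c] / fXt[c]: minimum and maximum occurrence counts of character c.
    pD = sum(fRc[c] - fSt.get(c, 0) for c in fRc if fSt.get(c, 0) < fRc[c])
    nD = sum(fSc[c] - fRt.get(c, 0) for c in fSc if fRt.get(c, 0) < fSc[c])
    return max(pD, nD)  # prune (R, S) whenever this exceeds k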
Since the edit distance of a string pair is lower bounded by its frequency distance, we can prune (R, S) if the minimum frequency distance obtained by the above lemma exceeds the desired edit threshold k. To obtain an upper bound on the probability of fd(R, S) being at most k, we use the technique introduced in [4], which relies on the expected values of the possible frequency distances. Using the expectations of the positive and negative frequency distances (E[pD], E[nD]) and the one-sided Chebyshev inequality, and following the same analysis as in [4], we obtain the following theorem.
Theorem 3. Let R and S be two strings over the same alphabet Σ. Then we have

Pr(ed(R, S) ≤ k) ≤ Pr(fd(R, S) ≤ k) ≤ B² / (B² + (A − k)²)

where

A = ||R| − |S||/2 + (E[pD] + E[nD])/2
B² = (|R| − |S|)²/2 + ||R| − |S|| · (E[pD] + E[nD])/2 + min(|R| · E[nD], |S| · E[pD]) − A²
The main obstacle in using the above theorem is the efficient computation of E[pD] = Σ_{ci} E(fRi − fSi) and E[nD] = Σ_{ci} E(fSi − fRi). We focus on computing E[nD] below, as E[pD] can be computed in a similar fashion. With the frequency of ci in S, i.e., fSi, varying between fSi^c and fSi^t, let Pr(fSi = x) represent the probability that ci appears exactly x times. Put another way, Pr(fSi = x) represents the probability that ci appears at exactly (x − fSi^c) of its fSi^u uncertain positions overall. This leads to a natural dynamic programming algorithm that can compute Pr(fSi = x) for all x = fSi^c, ..., fSi^t in O((fSi^u)²) time. Please refer to [4] for more details. With the goal of computing E[nD] efficiently, the authors preprocess S and maintain these values in O(fSi^u) space. Without loss of generality, let fRi^c < fSi^c ≤ fRi^t < fSi^t. Then, by definition, E[nD] = Σ_{ci} E[nDi], where
E[nDi] = Σ_{x=fRi^c}^{fRi^t} Σ_{y=max(x+1, fSi^c)}^{fSi^t} Pr(fRi = x) Pr(fSi = y) (y − x)
In the above equation, Pr(fRi = x) and Pr(fSi = y) can be computed in constant time using precomputed answers. Therefore, a naive way of computing E[nDi] takes O(fSi^u · fRi^u) time. Below we speed up this computation to min(fSi^u, fRi^u) time.
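Before describing the speed-up, the two building blocks can be made concrete as follows (a sketch only): the per-character frequency distribution, i.e., the O((fSi^u)²) dynamic program mentioned above, and the naive evaluation of E[nDi] that the precomputed arrays below accelerate.

def freq_distribution(probs):
    # probs: occurrence probabilities of character ci at its uncertain
    # positions in S. Returns dist[x] = Pr(ci appears at exactly x of them).
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for x, d in enumerate(dist):
            nxt[x] += d * (1 - p)   # position does not realize ci
            nxt[x + 1] += d * p     # position realizes ci
        dist = nxt
    return dist

def expected_nDi(distR, distS, fRc, fSc):
    # Naive O(fRi^u * fSi^u) evaluation of E[nDi]; the R1..R4 / S1..S4
    # arrays below evaluate the same double sum in min(fRi^u, fSi^u) time.
    total = 0.0
    for i, pr in enumerate(distR):
        for j, ps in enumerate(distS):
            x, y = fRc + i, fSc + j
            if y > x:
                total += pr * ps * (y - x)
    return total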
We maintain the following probability distributions for each ci of S, for 0 ≤ x ≤ fSi^u:

S1i[x] = Pr(fSi = fSi^c + x)
S2i[x] = Σ_{y=x}^{fSi^u} Pr(fSi = fSi^c + y)
S3i[x] = Σ_{y=x}^{fSi^u} (y − x + 1) · Pr(fSi = fSi^c + y)
S4i[x] = Σ_{y=0}^{x} (x − y) · Pr(fSi = fSi^c + y)
S1i is simply the probability distribution of ci appearing at uncertain positions, over the range [0, fSi^u] (precomputed using dynamic programming). S2i maintains the probability that ci appears at at least x uncertain positions, i.e., S2i[x] = Pr(fSi ≥ fSi^c + x). S3i maintains the same summation with the elements of the summation series scaled by 1, 2, .... Finally, S4i takes the summation series for Pr(fSi ≤ fSi^c + x), scales it by 0, 1, ... in the reverse direction, and maintains the output at index x. The intuition behind maintaining the scaled summations is that, given a particular frequency z of fRi, the expectation of its frequency distance to fSi ∈ [fSi^c, fSi^t] resembles the summation series of S3i[x] or S4i[x]. All the above distributions can be computed in O(fSi^u) time and occupy the same O(fSi^u) storage. Similar probability distributions are maintained for R. We thus achieve the speed-up without hurting the preprocessing time and at no additional storage cost. E[nDi] can now be computed as follows:
E[nDi] = Σ_{x=fRi^c}^{fSi^c−1} Σ_{y=fSi^c}^{fSi^t} (...) + Σ_{x=fSi^c}^{fRi^t} Σ_{y=x+1}^{fSi^t} (...) = nDi1 + nDi2

nDi1 = Σ_{x=fRi^c}^{fSi^c−1} Pr(fRi = x) ( Σ_{y=fSi^c}^{fSi^t} Pr(fSi = y)(y − x) )
     = Σ_{x=fRi^c}^{fSi^c−1} Pr(fRi = x)(fSi^c − x − 1) Σ_{y=fSi^c}^{fSi^t} Pr(fSi = y)
       + Σ_{x=fRi^c}^{fSi^c−1} Pr(fRi = x) Σ_{y=fSi^c}^{fSi^t} Pr(fSi = y)(y − fSi^c + 1)
     = R4i[fSi^c − fRi^c − 1] × S2i[0] + (R2i[0] − R2i[fSi^c − fRi^c]) × S3i[0]

nDi2 = Σ_{x=fSi^c}^{fRi^t} Pr(fRi = x) Σ_{y=x+1}^{fSi^t} Pr(fSi = y)(y − x)
     = Σ_{x=fSi^c}^{fRi^t} R1i[x − fRi^c] × S3i[x − fSi^c + 1]

If the fraction of uncertain characters in the strings is θ, the frequency filtering summarized in Theorem 3 can be applied in O(σθ(|R| + |S|)) time. The alphabet size typically being a constant, the efficiency of applying frequency filtering depends on the degree of uncertainty and the string lengths. Therefore, as the lengths of the input strings grow, the improvement from |R| × |S| to |R| + |S| provides a substantial reduction in the filtering time. While answering the similarity join query, we preprocess R = Si ∈ S_l to compute the arrays for each character of the alphabet Σ and maintain them as part of our index. All candidate pairs passing the q-gram filtering are then subjected to frequency distance filtering for further refinement before moving on to the next string R = S_{i+1} ∈ S_l.
6. VERIFICATION
The goal of verification is to decide whether the strings in a candidate pair (R, S) that has survived the above filters are indeed similar, i.e., whether Pr(ed(R, S) ≤ k) > τ. A straightforward solution is to instantiate each possible world of R × S and add up the probabilities of the possible worlds whose instances of R and S are within edit threshold k. Before resorting to such expensive verification, we make a last attempt to prune the candidate pair by extending the CDF bounds of [6]. If unsuccessful, we use trie-based verification, which exploits the common prefixes shared by instances of an uncertain string.

6.1 Bound based on CDF
We briefly review the process in [6] and highlight the changes needed to compute the mentioned bounds correctly when both input strings are uncertain. We populate an |R| × |S| matrix using dynamic programming. In each cell D = (x, y), we compute (at most) k + 1 pairs of values, i.e., {(L[j], U[j]) | 0 ≤ j ≤ k}, where L[j] and U[j] are lower and upper bounds on Pr(ed(R[1..x], S[1..y]) ≤ j), respectively. Then, by checking the bounds in cell (|R|, |S|), we can accept or reject the candidate string pair (R, S), if possible. To fill in the DP table, consider the basic step of computing the bounds of a cell D = (x, y) from its neighboring cells: upper-left D1 = (x − 1, y − 1), upper D2 = (x, y − 1), and left D3 = (x − 1, y). As noted in [6], when R[x] matches S[y] (with probability p1 = Σ_{ci} Pr(R[x] = ci) Pr(S[y] = ci)), it is always optimal to take the distribution from the diagonal upper-left neighbor. When R[x] does not match S[y], with probability p2 = 1 − p1, we use the relaxations suggested in [6]. Let (argmin Di) return the index i (1 ≤ i ≤ 3) such that L_{Di}[0] is greatest, with ties broken by selecting the greatest L_{Di}[1], and so on.

Theorem 4. At each cell D = (x, y) of the DP table, L[j] ≤ Pr(ed(R[1..x], S[1..y]) ≤ j) ≤ U[j], where

L[j] = max(p1 · L_{D1}[j], p2 · L_{(argmin Di)}[j − 1])
U[j] = min(1, p1 · U_{D1}[j] + p2 · U_{D1}[j − 1] + Σ_{i=2}^{3} U_{Di}[j − 1])

Proof. We follow the analysis in [6]. Consider a possible world pwi,j in which ri[x] = sj[y]. Let the distance values at cells D and Di (1 ≤ i ≤ 3) be v and vi, respectively. Then we have v = v1: this is because v2, v3 ≥ v1 − 1, and thus v = min(v1, v2 + 1, v3 + 1) = v1. Next, consider a possible world pwi,j in which ri[x] ≠ sj[y]. Then v = min_i(vi) + 1. By using (argmin Di), we pick one fixed
neighbor cell (i.e., the one that has a small distance value with the highest probability) instead of accounting for all possible worlds in which ri[x] ≠ sj[y]; hence the true v value could be smaller than this one in some possible worlds. However, we observe that, out of all possible worlds with distance v in cell D, the worlds with edit distance v in D1 are not disjoint from the worlds with distance v − 1 in D2; the same argument applies to the worlds with distance v − 1 in D3. Therefore, we choose the maximum of the two scenarios as our lower bound. For the upper bound, the case where ri[x] matches sj[y] remains the same. A possible world pwi,j with distance v − 1 in D2 can be extended by reading an additional character of R, giving distance v in cell D for all such worlds. Similarly, moving from distance v − 1 in D3 to distance v in D can be thought of as inserting a character of S. Hence, we do not need to scale down the probabilities U_{D2}[v − 1] and U_{D3}[v − 1] to obtain the upper bound for cell D.
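For concreteness, the following sketch fills the DP table of Theorem 4 (without the |x − y| ≤ k band optimization noted below; uncertain strings are again lists of {symbol: probability} dicts, an assumed encoding):

def cdf_bounds(R, S, k):
    # Returns (L, U) at cell (|R|, |S|): bounds on Pr(ed(R, S) <= j), j = 0..k.
    n, m = len(R), len(S)
    def exact(d):  # a prefix pair at guaranteed edit distance d
        return [1.0 if j >= d else 0.0 for j in range(k + 1)]
    cell = {}
    for x in range(n + 1):
        cell[(x, 0)] = (exact(x), exact(x))
    for y in range(m + 1):
        cell[(0, y)] = (exact(y), exact(y))
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            p1 = sum(p * S[y - 1].get(c, 0.0) for c, p in R[x - 1].items())
            p2 = 1.0 - p1
            D = [cell[(x - 1, y - 1)], cell[(x, y - 1)], cell[(x - 1, y)]]
            # (argmin Di): neighbor with the lexicographically greatest L vector.
            best = max(D, key=lambda d: d[0])[0]
            L = [max(p1 * D[0][0][j],
                     p2 * (best[j - 1] if j > 0 else 0.0)) for j in range(k + 1)]
            U = [min(1.0, p1 * D[0][1][j]
                     + (p2 * D[0][1][j - 1] + D[1][1][j - 1] + D[2][1][j - 1]
                        if j > 0 else 0.0)) for j in range(k + 1)]
            cell[(x, y)] = (L, U)
    return cell[(n, m)]

# Accept the pair if L[k] > tau, reject it if U[k] <= tau; verify otherwise.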
We note that the bounds summarized in the above theorem differ from the ones presented in [6], as those cannot be used directly in the current scenario¹. Finally, the simple DP algorithm can be improved by computing (L[j], U[j]) only for those cells D = (x, y) with |x − y| ≤ k, since L[k] = U[k] = 0 otherwise. Thus, we can apply CDF-bounds-based filtering to a candidate pair (R, S) in O(min(|R|, |S|)(k + 1) max(k, γ)) time, where γ is the average number of alternatives of an uncertain character.

¹ a) Lower bound violation: r = ACC, S = A{(C, 0.7), (G, 0.1), (T, 0.1)} with k = 1. b) Upper bound violation: r = DISC, S = DI{(C, 0.4), (S, 0.5), (R, 0.1)} with k = 1.
6.2 Trie-based Verification
Prefix-pruning is a popular technique to expedite the verification of a deterministic string pair (r, s) against an edit threshold k. A naive approach for this verification is to compute the dynamic programming matrix (DP) of size |r| × |s| such that cell (x, y) gives the edit distance between r[1..x] and s[1..y]. Prefix-pruning observes that if no cell of row x, i.e., (x, ∗), meets the threshold k, then the following rows cannot contain cells with edit distance k or less, i.e., DP[i > x, ∗] > k. Even with such an early-termination condition, verifying all pairs (all possible instances of R × S) for a candidate pair (R, S) can be expensive. With the goal of avoiding this naive all-pairs comparison, we propose trie-based verification. Let TS be the trie of all possible instances of S, and TR the same for string R. Let node u in TS represent the string u (obtained by concatenating the edge labels from the root to node u); then all possible instances of S with u as a prefix are leaves in the subtree rooted at u. We say a node u ∈ TS is similar to a node v ∈ TR if ed(u, v) ≤ k. Using prefix-pruning, we then have the following observation [5]:
• Given u ∈ TS and v ∈ TR: if u is not similar to any ancestor of v, and v is not similar to any ancestor of u, then no possible instance s of S with prefix u can be similar to a possible instance r of R with prefix v.
Using the technique in [11, 5], we can compute the set of similar nodes in TR for each node u ∈ TS. Then, if u = sj is a leaf node, each node v = ri in its similar set that is also a leaf node gives us a possible world pwi,j whose probability contributes to Pr(ed(R, S) ≤ k). However, the techniques in [5] implicitly assume that both trie structures are available. Here we propose on-demand construction of the trie, which avoids enumerating all possible instances of S. Note that we still need to build the trie TR completely; however, its construction cost can be amortized, as we build TR once and use it for all candidate pairs (R, ∗). As noted in [11], the nodes of TR similar to a node u ∈ TS can be computed efficiently using only the similarity set already computed for u's parent. This allows us to perform a (logical) depth-first search on TS and materialize the children of u ∈ TS only if its similarity set is non-empty. Figure 1 illustrates this approach and reveals that on-demand trie construction can reduce the verification cost by avoiding the instantiation of, and consequently the comparison with, a large fraction of the possible worlds of S. In the figure, only the nodes linked by solid lines are explored and instantiated by the verification algorithm; for simplicity, the similar-node sets and the probabilities associated with trie nodes are not displayed.

[Figure 1: Trie-based Verification Example (tries of strings R and S; edges marked explored vs. unexplored).]
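The essence of the on-demand walk can be seen in the following simplified sketch, which handles the special case of a deterministic r against an uncertain S: it enumerates the trie of S's instances depth-first, carries one edit-distance DP row per node, and abandons a whole subtrie as soon as the row exceeds k everywhere (the paper's method generalizes this to two tries via the similar-node sets):

def prob_similar(r, S, k):
    # S: character-level uncertain string as a list of {char: prob} dicts.
    # Returns Pr(ed(r, S) <= k) without materializing pruned subtries.
    first_row = list(range(len(r) + 1))

    def dfs(depth, row, prob):
        if min(row) > k:            # prefix-pruning: no extension can recover
            return 0.0
        if depth == len(S):
            return prob if row[-1] <= k else 0.0
        total = 0.0
        for c, p in S[depth].items():
            new = [row[0] + 1]      # extend the DP row by one character of s
            for x in range(1, len(r) + 1):
                cost = 0 if r[x - 1] == c else 1
                new.append(min(row[x] + 1, new[x - 1] + 1, row[x - 1] + cost))
            total += dfs(depth + 1, new, prob * p)
        return total

    return dfs(0, first_row, 1.0)

# S3 of Table 1 against r = GGATCC with k = 1:
S3 = [{"G": 1.0}, {"A": 0.8, "G": 0.2}, {"C": 1.0}, {"T": 1.0},
      {"A": 0.8, "C": 0.1, "T": 0.1}, {"C": 1.0}]
print(prob_similar("GGATCC", S3, 1))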
7. EXPERIMENTS
We have implemented the proposed indexing scheme and filtering techniques in C++. The experiments are performed on a 64-bit machine with an Intel Core i5 3.33GHz processor and 8GB RAM, running Ubuntu. We consider for comparison the following algorithms, which use only a subset of the filtering mechanisms: algorithm QFCT makes use of all the filtering schemes described in this article, whereas QCT, QFT, and FCT bypass frequency-distance filtering, filtering based on CDF bounds, and q-gram filtering, respectively.
Datasets: We use two synthetic datasets obtained from their real counterparts by employing the technique used in [10, 4]. The first data source is the author names in dblp (|Σ| = 27). For each string s in the dblp dataset we first obtain a set A(s) of strings that are within edit distance 4 of s. Then a character-level probabilistic string S for string s is generated such that, for a position i, the pdf of S[i] is based on the normalized frequencies of the letters in the i-th position of all the strings in A(s). The fraction of uncertain positions in a character-level probabilistic string, i.e., θ, is varied between 0.1 and 0.4 to generate strings with different degrees of uncertainty. The string lengths in this dataset approximately follow a normal distribution in the range [10, 35]. For the second dataset we use a concatenated protein sequence of mouse and human (|Σ| = 22) and break it arbitrarily into shorter strings. Uncertain strings are then obtained by following the same procedure as for the dblp data source. However, for this dataset we use slightly larger string lengths with less uncertainty: string lengths roughly follow a uniform distribution in the range [20, 45], and θ ranges between 0.05 and 0.2. In both datasets, the average number of choices (γ) that each probabilistic character S[i] may have is set to 5. The default values used for the dblp dataset are: the number of strings in the collection |S| = 100K,
average string length ≈ 19, θ = 0.2, k = 2, τ = 0.1, and q = 3. Similarly, for the protein dataset we use the default setting |S| = 100K, average string length = 32, θ = 0.1, k = 4, τ = 0.01, and q = 3.
7.1 Effectiveness vs. Efficiency of Pruning
[Figure 2: Effectiveness vs. Efficiency]

In this set of experiments, we compare the pruning ability of the filtering techniques and the overhead of applying them, on both datasets, with θ = 0.2, k = 2 and τ = 0.1. Figure 2 shows the number of candidates remaining after applying each filtering scheme and reveals that the CDF bounds provide the tightest filtering among the three. The effectiveness of the CDF follows from the fact that it uses upper as well as lower bounds to prune strings. The upper bound obtained by q-gram filtering tends to be looser than the CDF bound, as it depends on the partitioning based on q, whereas the frequency-distance-based upper bound is sensitive to the length difference between the two strings. However, the effectiveness of the CDF comes at the cost of time. On the other hand, q-gram filtering is extremely fast and can still prune a significant number of candidate pairs by taking advantage of the indexing scheme. For the protein dataset, q-gram filtering is close to the CDF bounds in terms of effectiveness and is an order of magnitude faster than computing the CDF bounds. Frequency distance filtering, being dependent only on the alphabet size and the uncertain positions in the strings (against the CDF's dependence on string length), can help improve query performance by reducing the number of candidate pairs passed on to the CDF for evaluation. Therefore, in the following experiments, the algorithm variants apply these filtering techniques in increasing order of their overhead, as suggested by their acronyms.
Figure 2 also reveals that applying q-gram filtering and CDF-bounds filtering takes longer for the protein dataset than for the dblp data. Due to the larger string lengths and the fixed q, q-gram filtering partitions protein strings into a larger number of segments (i.e., larger m). Thus there are more αx probabilities to compute, and it takes longer to compute the desired upper bound of Theorem 2. Similarly, computing the CDF bounds requires populating a dynamic programming matrix whose size depends on the string lengths. Frequency distance filtering, however, benefits from the smaller alphabet and the lower degree of uncertainty of the protein sequences, and shows better performance on the protein data.
7.2 Effects of Data Size |S|

[Figure 3: Effect of Dataset Size |S|]

Figure 3 shows the scalability of the various algorithms on the dblp dataset, where we vary |S| from 50K to 500K. With computationally inexpensive q-gram filtering as the first step, algorithms QFCT, QFT and QCT achieve efficient filtering even for the larger datasets. For the exceptional case of algorithm FCT, the filtering overhead increases almost quadratically with the input size, as both of its filtering techniques (frequency distance and CDF bounds) must explicitly compare the query string R with all possible strings S ∈ S (|S| ≥ |R| − k). Also, the filtering time required for QFT and QCT closely follows that of QFCT. This confirms the ability of q-gram filtering to significantly reduce the filtering overhead, and highlights the advantages offered by the proposed indexing scheme incorporating it.
Figure 3 also shows the time required to answer the join query for these algorithms. FCT, lacking efficient (though effective) filtering, takes the longest to output its answers. The query time of QFT, despite its efficient q-gram filtering, also shows a rapid increase. In contrast, the good scalability of QFCT and QCT emphasizes the need for tight filtering conditions based on the lower and upper bounds of the CDF. In their absence, exponentially more candidates need trie-based verification, which quickly deteriorates query performance. Thus, the combination of q-gram filtering with CDF bounds in QFCT achieves the best of both worlds, allowing us to restrict the increase in both the filtering time and the number of trie-based verifications. Though the number of outputs increases quadratically with the data size, the increase in the number of false positives in the verification step of QFCT (i.e., candidate pairs that are not output after verification) was found to be linear in the output size. The order-of-magnitude performance gain of QFCT over the others seen in Figure 3 will extend further for larger input collections. With algorithm QFT requiring a higher number of expensive verifications, and QCT showing trends similar to QFCT, we use only the remaining two algorithms, QFCT and FCT, in the experiments that follow. We also append a character 'D' or 'P' to an algorithm's acronym to distinguish between its query times on the dblp and protein datasets.

7.3 Effects of θ

[Figure 4: Effect of θ]
An increase in the number of uncertain positions in the strings has a detrimental effect on both algorithms QFCT and FCT, as shown in Figure 4. This is due to the direct impact of θ on every step of answering the join query. Starting with q-gram filtering, more uncertain positions in the query string R imply more time for populating the sets q(r, x) as a preprocessing step, as well as for adding the string R to the inverted indices after answering the query. The larger size of the sets q(r, x) due to the increase in θ also increases the lookup time in the inverted indices and consequently
increases the time required for computing αx. Though the size of a set q(r, x) can increase exponentially with θ, its impact is limited due to the small fixed value of q. There is another, subtler impact of θ on q-gram filtering. With more uncertain positions in query string R, more strings in the collection can be matched with substrings of R. We found this increase to be roughly linear, with ≈ 1.5% of all join pairs evaluated by q-gram filtering at θ = 0.1 rising to only ≈ 4% at θ = 0.4 on the dblp dataset. Thus, the proposed q-gram filtering serves the purpose of efficient pruning even with the increased uncertainty.
The impact of θ on the computation of frequency distance and CDF bounds is more direct. Computing the expected frequency distance of a character directly depends on the number of positions in the input strings (R, S) where it appears probabilistically. Due to the computation of the probability of two positions in R and S matching (R[x] = S[y]), it takes longer to populate the dynamic programming matrix for the CDF. Thus, the increase in the filtering time of the query algorithms is almost linear in θ. Finally, in the trie-based verification, more possible worlds need to be evaluated, increasing verification cost exponentially. In conclusion, the verification step is the worst affected by large θ and is the primary contributor to the increased time for answering join queries. We note that in most scenarios, algorithm QFCT takes longer to answer join queries for the protein data than for the dblp data because of the higher overhead of q-gram and CDF filtering, which we pointed out in Section 7.1. On the other hand, algorithm FCT performs better for the protein data by virtue of faster frequency distance filtering, as seen earlier. This comparative behavior of QFCT and FCT is also evident in Figure 4.
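To make the dependence on θ concrete, the following self-contained sketch computes the character-match probability Pr(R[x] = S[y]) needed by each cell of the CDF dynamic programming matrix; encoding a probabilistic character as a symbol-to-probability dictionary is an illustrative assumption, not the representation used in our implementation.

```python
# Illustrative sketch: probability that two independent probabilistic
# characters match. A probabilistic character is modeled as a dict mapping
# each possible symbol to its probability (an assumption for this sketch).

def match_prob(rx, sy):
    """Pr(R[x] = S[y]) for two independent probabilistic characters."""
    return sum(p * sy.get(c, 0.0) for c, p in rx.items())

# The cost of a DP cell grows with the number of alternatives per position,
# which is why filtering time grows almost linearly with theta.
rx = {'a': 0.6, 'b': 0.4}   # hypothetical uncertain position of R
sy = {'a': 0.5, 'c': 0.5}   # hypothetical uncertain position of S
print(match_prob(rx, sy))   # 0.3
```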
Figure 4: Effect of θ

7.4 Effects of τ
Figure 5 shows the results on the dblp and protein datasets for different values of τ from 0.001 to 0.4. Though the query times remain insensitive to τ over a large range, a gradual increase or decrease in the probability threshold has a twofold effect on the query algorithms. We analyze the scenario by looking at the number of candidate pairs pruned by the CDF bounds, either accepted based on the lower bound or rejected based on the upper bound. As τ increases, the upper bound filter becomes more and more selective, as it can reject a larger number of candidate pairs. On the contrary, filtering based on the lower bound loses its effectiveness with increased τ, as it cannot accept as many strings as it can for smaller values of τ. Thus, the relative increase and decrease in the number of candidate pairs pruned by the CDF upper and lower bounds, respectively, determines the overall effect of varying τ. When the upper bound filter cannot compensate for the loss in effectiveness of the lower bound, more candidate pairs require trie-based verification, resulting in higher query time. Such a scenario is evident in Figure 5 for the protein data for τ ranging from 0.001 to 0.1.
τ also has an interesting effect on q-gram filtering. Figure 5 shows the number of candidate pairs rejected by q-gram filtering in QFCT, as well as the count of candidates accepted by the CDF lower bound and rejected by the CDF upper bound in QFCT. Note that q-gram filtering only uses the upper bound, and Figure 5 shows the reduced effectiveness of CDF lower bound filtering. As τ increases, probabilistic pruning (Theorem 2) becomes more effective and prunes out a significant number of candidate pairs that satisfy the necessary condition for two strings to be similar described in Lemma 5 (shown in Figure 5). In effect, q-gram filtering reduces the overhead of applying the CDF bounds and to some extent compensates for the increased verification cost, if any. This effect can be seen in the gradual decrease in the number of candidates rejected by the CDF, even though an increased number of candidates are pruned using the upper bound overall. Finally, for large τ, the q-gram filtering advantage, coupled with the reduced output size due to a more selective τ, results in improved query time.
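The interaction of the two CDF bounds with τ reduces to the following decision rule; this is a minimal self-contained sketch with placeholder bound values, not output of our algorithms.

```python
# Illustrative sketch of how bounds on Pr(ed(R, S) <= k) decide a pair.

def classify(lower, upper, tau):
    """Decide a candidate pair from lower/upper bounds on Pr(ed(R,S) <= k)."""
    if upper <= tau:
        return "reject"   # upper-bound filter prunes the pair
    if lower > tau:
        return "accept"   # lower-bound filter admits it without verification
    return "verify"       # fall through to trie-based verification

# As tau grows, rejecting gets easier and accepting gets harder:
for tau in (0.001, 0.1, 0.4):
    print(tau, classify(lower=0.05, upper=0.3, tau=tau))
# prints: accept, verify, reject
```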
Figure 5: Effect of τ

7.5 Effects of k
Figure 6 shows the time required for answering a join query on the dblp dataset when k changes from 1 to 4, and for the protein dataset with k = 2, 4, 6, 8. With increased k we can expect more string pairs to satisfy the edit threshold and hence an increase in query time. As we loosen the edit threshold requirement, the effectiveness of q-gram filtering begins to deteriorate, since the requirement of Lemma 5 can be met with string S having fewer of its partitions matched with substrings of R. Therefore, even with probabilistic pruning, many false candidate pairs are passed on to the frequency distance and CDF filtering routines. Also, the number of candidates removed by the upper bounds of frequency distance and CDF decreases with an increase in k. Though lower bound filtering in CDF can accept more candidates with an increase in k, this benefit is easily offset by loose upper bounds, resulting in a net increase in verification cost. With increased k, the time required by algorithm QFCT approaches that of FCT but still manages to save up to 35% of FCT's query cost.

Figure 6: Effect of k
7.6 Effects of q
In this set of experiments, we investigate the effect of the q-gram length on the efficiency and effectiveness of q-gram filtering, using input collections with 100K strings. As pointed out earlier, q-gram filtering incurs more filtering overhead for higher string lengths with fixed q. We can hope
to reduce this overhead by increasing q; however, such an increase has side effects on the space-time trade-off of q-gram filtering. Even though we will have fewer partitions for each string due to the increased q, each segment now has more possible instances to be added to the inverted indices, increasing the storage requirement as shown in Figure 7. The rate of increase is faster for the dblp dataset because of its higher θ (i.e., more uncertain positions) and larger alphabet. We note that we use peak memory usage as a measure that accounts for the indices maintained at any point during query answering, based on the length of the string currently under consideration. Further, this also implies that the query preprocessing that populates the sets q(r, x) needs more time, offsetting the benefits of higher q to some extent. Figure 7 shows the improvement in the filtering time for q varying from 2 to 6. With the size of q(r, x) increasing exponentially with q, the improvement in filtering time achieved due to fewer segments also diminishes exponentially.
For deterministic strings, increasing q makes it harder for a segment of string s to match a substring of query string r, and hence implies a potential improvement in the pruning ability of q-gram filtering. For uncertain strings though, with higher q a segment may contain a larger number of uncertain positions. Hence there are more possible instances, with increased chances for a segment to find a match in the substrings of a query string. As a result, the effectiveness of q-gram filtering diminishes gradually for higher q, as seen in Figure 7. We note that, though the filtering time improves with q, the time required for answering a join query shows uni-valley behavior, as less effective filtering causes increased query time for higher q even with less filtering overhead. We found that q = 3 or q = 4 offers the best combination of fast, effective pruning and acceptable storage requirements. With the peak memory usage of the inverted indices less than the input data size itself for both q = 3 and q = 4, the space required for storing all indices needed for answering similarity search queries was found to be only ≈ 1.5 and ≈ 2 times the data size, respectively.
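The exponential growth of q(r, x) with q can be seen in the following self-contained sketch, which enumerates the possible instances of a single segment; encoding each position as a list of (symbol, probability) choices is an assumption made for illustration.

```python
# Illustrative sketch: a segment covering u uncertain positions with c
# choices each yields c**u possible instances to index.
from itertools import product

def segment_instances(segment):
    """All (string, probability) instances of one q-gram segment."""
    out = []
    for combo in product(*segment):
        chars, probs = zip(*combo)
        p = 1.0
        for pr in probs:
            p *= pr
        out.append((''.join(chars), p))
    return out

# One certain and two uncertain positions -> 1 * 2 * 2 = 4 instances.
seg = [[('a', 1.0)], [('b', 0.7), ('c', 0.3)], [('d', 0.5), ('e', 0.5)]]
print(segment_instances(seg))
```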
Figure 7: Effect of q

7.7 Evaluating Trie-based Verification
We now analyze the performance benefits offered by trie-based verification over the naive way of doing the same. Figure 8 shows the verification time required for answering join queries on the dblp and protein datasets with varying degree of uncertainty, i.e., parameter θ. With an increase in the number of uncertain positions in a string, the number of possible worlds increases exponentially. This results in increased verification cost for both trie-based and naive verification. In naive verification, we need to enumerate the possible worlds of each string in the dataset, and also enumerate the possible worlds of each candidate string that may form a similar pair with it. In effect, we may enumerate all possible worlds of each string more than once. Additionally, given a candidate pair (R, S), naive verification compares every possible instance of R with every possible instance of S. In contrast, trie-based verification enumerates the possible worlds of each string S only once, and when S is selected as a candidate for some other string R in the database, it enumerates and compares only those possible worlds that are highly likely to be similar to some instance of R. Thus the performance gains of trie-based verification increase with increasing θ, as seen in Figure 8. We note that the cost of verification using the trie-based approach also increases exponentially, due to the requirement of having a complete trie in place for the query string R. Moreover, trie-based verification can be more expensive than the naive method in scenarios where the majority of instances in R × S satisfy the edit threshold, owing to the overhead of building a trie and computing a set of similar nodes for each node in the trie. Though we obtained performance gains using trie-based verification on the protein data as well, they were less significant than for the dblp data due to higher string lengths, a lower degree of uncertainty (θ), and a smaller alphabet.
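For reference, the naive baseline of Figure 8 can be sketched as below; this is an illustrative possible-worlds enumeration under the same (symbol, probability) encoding as above, not the optimized code used in the experiments.

```python
# Illustrative sketch of naive verification: enumerate every possible world
# of both strings and accumulate Pr(ed(R, S) <= k). Note that worlds(S) is
# re-enumerated for every world of R -- the redundancy the trie avoids.
from itertools import product

def worlds(s):
    """Yield (instance, probability) for a list-of-choices string."""
    for combo in product(*s):
        chars, probs = zip(*combo)
        p = 1.0
        for pr in probs:
            p *= pr
        yield ''.join(chars), p

def ed(a, b):
    """Textbook single-row edit-distance DP."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (ca != cb))
    return d[len(b)]

def naive_similarity_prob(R, S, k):
    """Pr(ed(R, S) <= k) by exhaustive possible-world enumeration."""
    return sum(pr * ps for r, pr in worlds(R)
               for s, ps in worlds(S) if ed(r, s) <= k)
```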
Figure 8: Trie-based Verification

7.8 Effects of String Length
In this final set of experiments, we test algorithms QFCT and FCT by varying the length of the probabilistic strings. For studying this effect, we use the 100K versions of the dblp and protein datasets and append each probabilistic string to itself 0, 1, 2, or 3 times. To ensure that the verification step does not get excessively expensive, we limit the number of probabilistic characters in a probabilistic string to at most 8. Clearly, the costs of both algorithms increase with longer strings, as seen in Figure 9. In terms of filtering time, the computation of q-gram filtering and CDF bounds takes longer as string lengths increase, as described earlier in Section 7.1. However, frequency distance filtering, being
dependent only on the number of uncertain character positions, remains unaffected. This allows algorithm FCT to close the performance gap with QFCT for higher string lengths by virtue of efficient frequency distance filtering. Additionally, verification cost begins to dominate the query time with the increase in string lengths. We note that even trie-based verification needs to instantiate all possible worlds for each probabilistic string once while answering a join query. With each possible-world enumeration taking more time, higher string length adversely affects the verification step. For fixed k, τ, and uncertain character positions, the number of output pairs decreases with an increase in string length. Despite this, the query time increases for the aforementioned reasons. We emphasize that the proposed filtering techniques maintain their effectiveness across varying lengths, as the fraction of candidate pairs that undergo verification and are accepted as output remains almost constant.
Figure 9: Effects of string length

7.9 Comparison with EED
In this subsection, we qualitatively compare the join query algorithm in [10] against the algorithms presented in this work:
1. We partition each string in the collection based on q, whereas q-gram filtering in [10] makes use of overlapping q-grams. This allows us to significantly reduce the space required for storing all q-grams (≈ 5 × the data size as reported in [10], against our index of twice the data size).
2. q-gram filtering as presented in [10] requires each probabilistic string pair to be evaluated during query execution tasks such as the computation of frequency distance and CDF bounds. Algorithm QFCT employs indexing that incorporates q-gram filtering before applying the expensive filters. Therefore, we can expect QFCT to offer benefits over the query algorithm in [10] similar to its advantages over algorithm FCT seen in Figure 3.
3. Computing the exact eed between two probabilistic strings requires all possible worlds of the two strings to be instantiated, in the same way as the naive verification method discussed in Section 7.7. On the other hand, trie-based verification allows us to determine the similarity of a string pair efficiently (refer to Figure 8).

8. CONCLUSIONS
In this paper, we study the largely unexplored problem of answering similarity join queries on uncertain strings. We propose a novel q-gram filtering technique that integrates probabilistic pruning, and we extend frequency distance and CDF based filtering techniques. In future work, we plan to investigate tighter filtering conditions and improvements to the trie-based verification algorithm.

9. REFERENCES
[1] A. Amir, E. Chencinski, C. S. Iliopoulos, T. Kopelowitz, and H. Zhang. Property matching and weighted matching. In CPM, pages 188–199, 2006.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
set-similarity joins. In VLDB, pages 918–929, 2006.
[3] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
operator for similarity joins in data cleaning. In ICDE,
page 5, 2006.
[4] D. Dai, J. Xie, H. Zhang, and J. Dong. Efficient range
queries over uncertain strings. In SSDBM, pages
75–95, 2012.
[5] J. Feng, J. Wang, and G. Li. Trie-join: a trie-based
method for efficient string similarity joins. VLDB J.,
21(4):437–461, 2012.
[6] T. Ge and Z. Li. Approximate substring matching
over uncertain strings. PVLDB, 4(11):772–782, 2011.
[7] L. Gravano, P. G. Ipeirotis, H. V. Jagadish,
N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for
free. In VLDB, 2001.
[8] C. S. Iliopoulos, C. Makris, Y. Panagis, K. Perdikuri,
E. Theodoridis, and A. Tsakalidis. The weighted suffix
tree: An efficient data structure for handling
molecular weighted sequences and its applications.
Fundam. Inf., 71:259–277, 2006.
[9] C. S. Iliopoulos, K. Perdikuri, E. Theodoridis, A. K.
Tsakalidis, and K. Tsichlas. Algorithms for extracting
motifs from biological weighted sequences. J. Discrete
Algorithms, 5(2):229–242, 2007.
[10] J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string
similarity joins. In SIGMOD Conference, pages
327–338, 2010.
[11] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive
fuzzy keyword search. In WWW, pages 371–380, 2009.
[12] T. Kahveci and A. K. Singh. Efficient index structures
for string databases. In VLDB, pages 351–360, 2001.
[13] N. Koudas, S. Sarawagi, and D. Srivastava. Record
linkage: similarity measures and algorithms. In
SIGMOD Conference, pages 802–803, 2006.
[14] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A
partition-based method for similarity joins. PVLDB,
5(3):253–264, 2011.
[15] M. Patil, X. Cai, S. V. Thankachan, R. Shah, S.-J.
Park, and D. Foltz. Approximate string matching by
position restricted alignment. In EDBT/ICDT
Workshops, pages 384–391, 2013.
[16] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E.
Hambrusch, and R. Shah. Orion 2.0: native support
for uncertain data. In SIGMOD, pages 1239–1242,
2008.
[17] J. Widom. Trio: A system for integrated management
of data, accuracy, and lineage. In CIDR, pages
262–276, 2005.
[18] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient
algorithm for similarity joins with edit distance
constraints. PVLDB, 1(1):933–944, 2008.
[19] H. Zhang, Q. Guo, and C. S. Iliopoulos. An
algorithmic framework for motif discovery problems in
weighted sequences. In CIAC, pages 335–346, 2010.