
Similarity joins for uncertain strings

2014, Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14


Similarity Joins for Uncertain Strings*

Manish Patil (Louisiana State University, USA) and Rahul Shah (Louisiana State University, USA)

ABSTRACT

A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, many applications have to deal with imprecise strings or strings with fuzzy information in them. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using the edit distance as the measure of similarity. Given two collections of uncertain strings R, S, and input (k, τ), our task is to find string pairs (R, S) between the collections such that Pr(ed(R, S) ≤ k) > τ, i.e., the probability that the edit distance between R and S is at most k exceeds the probability threshold τ. We could address the join problem by obtaining all strings in S that are similar to each string R in R. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and (or) lower bounds on Pr(ed(R, S) ≤ k) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme that significantly reduces the filtering overhead. Further, we alleviate the verification cost of a string pair that survives pruning by using a trie structure, which allows us to overlap the verification cost of exponentially many possible instances of the candidate string pair.
Finally, we evaluate the effectiveness of the proposed approach by thorough practical experimentation.

Keywords: Uncertain strings; string joins; edit distance

1. INTRODUCTION

Strings form a fundamental data type in computer systems, and string searching has been extensively studied since the inception of computer science. String similarity search takes a set of strings and a query string as input, and outputs all the strings in the set that are similar to the query string. A join extends the notion of similarity search further and requires all similar string pairs between two input string sets to be reported. Both similarity search and similarity join are central to many applications such as data integration and cleaning. Edit distance is the most commonly used similarity measure for strings. The edit distance between two strings r and s, denoted by ed(r, s), is the minimum number of single-character edit operations (insertion, deletion, and substitution) needed to transform r into s. Edit-distance-based string similarity search and join have been extensively studied in the literature for deterministic strings [7, 3, 2, 13, 18, 5]. However, due to the large number of applications where uncertainty or imprecision in values is either inherent or desirable, recent years have witnessed increasing attention devoted to managing uncertain data. Several probabilistic database management systems (PDBMS), which can represent and manage data with explicit probabilistic models of uncertainty, have been proposed to date [17, 16]. Imprecision in data introduces many challenges for similarity search and join in databases with probabilistic string attributes, which is the focus of this paper.

Uncertainty model: Analogous to the models of uncertain databases, two models, string-level and character-level, have been proposed recently by Jestes et al. [10] for uncertain strings.
In the string-level uncertainty model, all possible instances of the uncertain string are explicitly listed to form a probability distribution function (pdf). In contrast, the character-level model describes distributions over all characters in the alphabet for each uncertain position in the string. We focus on the character-level model as it is realistic and concise in representing string uncertainty. Let Σ = {c_1, c_2, ..., c_σ} be the alphabet. A character-level uncertain string is S = S[1]S[2]...S[l], where S[i] (1 ≤ i ≤ l) is a random variable with a discrete distribution over Σ, i.e., S[i] is a set of pairs (c_j, p_i(c_j)), where c_j ∈ Σ and p_i(c_j) is the probability of having symbol c_j at position i. Formally, S[i] = {(c_j, p_i(c_j)) | c_j ≠ c_m for j ≠ m, and Σ_j p_i(c_j) = 1}. When the context of a string is unclear, we represent p_i(c_j) for string S by Pr(S[i] = c_j). Throughout, we use a lower-case character to denote a deterministic string (s) as against an uncertain string, denoted by an upper-case character (S). Let |S| (|s|) be the length of string S (s). Then the possible worlds of S is the set of all possible instances s of S with probability p(s), where Σ_s p(s) = 1. S being a character-level uncertain string, |S| = |s| for any of its possible instances.

Categories and Subject Descriptors: H.2 [DATABASE MANAGEMENT]: Systems—Query processing

* This work is supported in part by National Science Foundation (NSF) Grants CCF-1017623 and CCF-1218904. SIGMOD/PODS'14, June 22-27, 2014, Snowbird, UT, USA. Copyright 2014 ACM 978-1-4503-2376-5/14/06.

Our Contributions: In this paper, we present a comprehensive investigation of the problem of similarity joins for uncertain strings using (k, τ)-matching [6] as the similarity definition and make the following contributions:
• We propose a filtering scheme that integrates q-gram filtering with probabilistic pruning, and we present an indexing scheme to facilitate such filtering.
• We extend the frequency distance filtering introduced in [4] to an uncertain string pair and improve its performance while maintaining the same space requirement.
• We propose the use of a trie data structure to efficiently compute the exact similarity probability (as given by (k, τ)-matching) for a candidate pair (R, S) without explicitly comparing all possible string pairs.
• We conduct comprehensive experiments which demonstrate the effectiveness of all proposed techniques in answering similarity join queries.

Query semantics: In addition to capturing uncertainty in the data, one must define the semantics of queries over the data. In this regard, the powerful model of possible-world semantics has been the backbone of analyzing the correctness of database operations on uncertain data. For uncertain string attributes, Jestes et al. [10] made the first attempt to extend the notion of similarity. They used the expected edit distance (eed) over all possible worlds of two uncertain strings. Given strings R and S, eed(R, S) = Σ_{r_i, s_j} p(r_i) p(s_j) ed(r_i, s_j), where s_j (r_i) is an instance of S (R) with probability p(s_j) (p(r_i)).
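As a concrete illustration of the character-level model and the eed measure, the following is a minimal Python sketch (not from the paper; representing an uncertain string as one {symbol: probability} dict per position is our own assumption):

```python
from itertools import product

def worlds(S):
    """Enumerate (s, p(s)) over all possible worlds of a character-level
    uncertain string S, given as one {symbol: probability} dict per position."""
    for combo in product(*[d.items() for d in S]):
        p = 1.0
        for _, pr in combo:
            p *= pr
        yield "".join(c for c, _ in combo), p

def ed(r, s):
    """Standard dynamic-programming edit distance (one rolling row)."""
    d = list(range(len(s) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(s) + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (r[i - 1] != s[j - 1]))
    return d[-1]

def eed(R, S):
    """eed(R, S) = sum over world pairs of p(r_i) * p(s_j) * ed(r_i, s_j)."""
    return sum(pr * ps * ed(r, s) for r, pr in worlds(R) for s, ps in worlds(S))
```

Note that `eed` enumerates all world pairs and is therefore exponential in the number of uncertain positions; it is shown only to make the definition concrete.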
Though eed seems like a natural extension of edit distance as a measure of similarity, it has been shown that it does not implement the possible-world semantics completely at the query level [6]. Consider a similarity search query on a collection of deterministic strings with input string r. Then, a string s is an output only if ed(r, s) ≤ k. For such a query R over an uncertain string collection, possible-world semantics dictate that we apply the same predicate ed(r, s) ≤ k for each possible instance r of R and s of S, and aggregate this over all worlds. Thus, a possible world with instances r, s can contribute in deciding whether S is similar to R only if s is within the desired edit distance of r. For the eed measure, however, all possible worlds (irrespective of edit distance, though weighted by it) contribute towards the overall score that determines the similarity of S with R. To overcome this problem, the authors of [6] proposed the (k, τ)-matching semantics. Under this semantics, given an edit distance threshold k and a probability threshold τ, R is similar to S if Pr(ed(R, S) ≤ k) > τ. We use this similarity definition in this paper for answering join queries.

2. PRELIMINARIES

In this section we briefly review filtering techniques for deterministic strings available in the literature; we extend them to uncertain strings later in the article. Let r, s be two deterministic strings and k be the edit distance threshold.

2.1 q-gram Filtering

We partition s into k + 1 disjoint segments s^1, s^2, ..., s^(k+1). For simplicity, let each segment be of length q ≥ 1, i.e., s^x = s[((x − 1)q + 1)..xq]. Further, let pos(s^x) represent the starting position of segment s^x in string s, i.e., pos(s^x) = (x − 1)q + 1. Then, by the pigeonhole principle, if r is similar to s, it must contain a substring that matches a segment of s.
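The pigeonhole test for deterministic strings can be sketched as follows (a simplified illustration of ours, assuming |s| = (k + 1)q; function names are hypothetical):

```python
def segments(s, q, k):
    """Split s into k + 1 disjoint segments of length q each, i.e.,
    segment x is s[(x-1)q .. xq] (0-based slicing), assuming |s| = (k+1)q."""
    return [s[x * q:(x + 1) * q] for x in range(k + 1)]

def may_be_similar(r, s, q, k):
    """Pigeonhole filter: if ed(r, s) <= k, some segment of s must occur in r.
    False safely prunes the pair; True only means (r, s) is a candidate."""
    return any(seg in r for seg in segments(s, q, k))
```

Here `GGATCC` and `GGTTCC` (edit distance 1) pass the filter, while a pair sharing no segment is pruned without computing the edit distance at all.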
A naive method to achieve this is to obtain a set q(r) enumerating all substrings of r of length q and, for each substring, check whether it matches s^x for x = 1, 2, ..., k + 1. However, the authors of [14] have shown that we can obtain a set q(r, x) ⊆ q(r) for each segment of s such that it is sufficient to test each substring w ∈ q(r, x) for a match with s^x. Table 1 shows the sets q(r, x) populated for a sample string r. Using the proposed "position aware" substring selection, the set q(r, x) includes the substrings of r of length q with start positions in the range [pos(s^x) − ⌊(k − Δ)/2⌋, pos(s^x) + ⌊(k + Δ)/2⌋], where Δ denotes the length difference of the two strings. The number of substrings in q(r, x) is thus bounded by k + 1. The authors of [14] prove that this substring selection satisfies "completeness", ensuring that any similar pair (r, s) will be found as a candidate pair. We use a generalization of this filtering technique obtained by partitioning s into m > k partitions [14, 15], as summarized in the following lemma.

Lemma 1. Given strings r and s, with s partitioned into m > k disjoint segments, if r is similar to s within an edit threshold k, then r must contain substrings that match at least (m − k) segments of s.

Once again assuming each segment of s to be of length q ≥ 1, we can compute the set q(r, x) and attempt to match each w ∈ q(r, x) with s^x as before to apply the above lemma.

Problem definition: Given two sets of uncertain strings R and S, an edit-distance threshold k and a probability threshold τ, a similarity join finds all similar string pairs (R, S) ∈ R × S such that Pr(ed(R, S) ≤ k) > τ. Without loss of generality, we focus on self join in this paper, i.e., R = S.

Related work: Uncertain/probabilistic strings have been the subject of study for the past several years. Efficient algorithms and data structures are known for the problem of string searching in uncertain text [8, 1, 9, 19]. The authors of [6] studied the approximate substring matching problem, where the goal is to report the positions of all substrings of an uncertain text that are similar to the query string. Recently, the problem of similarity search on a collection of uncertain strings has been addressed in [4]. However, most of these works support only deterministic strings as query input. Utilizing these techniques for an uncertain string as input would invariably require all its possible worlds to be enumerated, which may not be feasible given the resultant exponential blowup in query cost. Though the problem of similarity join on uncertain strings has been studied in [10], it uses the expected edit distance as the measure of similarity. In this paper, we make an attempt to address some of the challenges involved in uncertain string processing by investigating similarity joins.

2.2 Frequency Distance Filtering

Given a string s over the alphabet Σ, a frequency vector f(s) is defined as f(s) = [f(s)_1, f(s)_2, ..., f(s)_σ], where f(s)_i is the count of the i-th symbol of Σ, i.e., c_i. Let f(r) and f(s) be the frequency vectors of r and s respectively. Then the frequency distance of r and s is defined as fd(r, s) = max{pD, nD}, where

pD = Σ_{i: f(r)_i > f(s)_i} (f(r)_i − f(s)_i),  nD = Σ_{i: f(r)_i < f(s)_i} (f(s)_i − f(r)_i).

The frequency distance provides a lower bound for the edit distance between r and s, i.e., fd(r, s) ≤ ed(r, s), and can be computed efficiently [12]. Thus, we can safely decide that s is not similar to r if fd(r, s) > k.

3. q-GRAM FILTERING

In this section we adopt and extend the ideas introduced for deterministic strings in Section 2.1 to uncertain strings. We begin with the simpler case where one of the two uncertain strings R and S is deterministic. Let R be that string, with r being its only possible instance.
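For this simpler case, the quantity to be bounded, Pr(ed(r, S) ≤ k), can be evaluated by brute force over the possible worlds of S. The sketch below (our own, and exponential in the number of uncertain positions) is the expensive baseline that the filters in this section avoid:

```python
from itertools import product

def ed(r, s):
    """Standard dynamic-programming edit distance (one rolling row)."""
    d = list(range(len(s) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(s) + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (r[i - 1] != s[j - 1]))
    return d[-1]

def match_prob(r, S, k):
    """Pr(ed(r, S) <= k) for deterministic r and uncertain S, by enumerating
    every possible world s_j of S and summing p(s_j) where ed(r, s_j) <= k."""
    total = 0.0
    for combo in product(*[d.items() for d in S]):
        p = 1.0
        for _, pr in combo:
            p *= pr
        if ed(r, "".join(c for c, _ in combo)) <= k:
            total += p
    return total
```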
We try to achieve an upper bound on the probability of r and S being similar, i.e., Pr(ed(r, S) ≤ k). We then build upon this result for the case when both strings are uncertain and obtain an upper bound on the probability of R and S being similar, i.e., Pr(ed(R, S) ≤ k). Before proceeding, we introduce some notation and definitions. A string w of length l matches a substring of T starting at position i with probability Pr(w = T[i..i + l − 1]) = ∏_{ps=1}^{l} p_{i+ps−1}(w[ps]). A string w matches T with probability Pr(w = T) = ∏_{ps=1}^{l} p_{ps}(w[ps]) if |w| = |T| = l; otherwise it is 0. We simply say w matches T (or vice versa) if Pr(w = T) > 0. The probability of string W matching T is given by Pr(W = T) = ∏_{ps=1}^{l} Σ_{c_j∈Σ} Pr(W[ps] = c_j) × Pr(T[ps] = c_j). Once again, we say W matches T if Pr(W = T) > 0 for simplicity.

Table 1: Application of q-gram filtering (m = 3, q = 2, k = 1, τ = 0.25)
r = GGATCC; q(r, 1) = {GG, GA}, q(r, 2) = {GA, AT, TC}, q(r, 3) = {TC, CC}
S1 = A{(C,0.5),(G,0.5)}A{(C,0.5),(G,0.5)}AC: segments (AC,0.5),(AG,0.5) | (AC,0.5),(AG,0.5) | (AC,1)
S2 = AA{(G,0.9),(T,0.1)}G{(C,0.3),(G,0.2),(T,0.5)}C: segments (AA,1) | (GG,0.9),(TG,0.1) | (CC,0.3),(GC,0.2),(TC,0.5)
S3 = G{(A,0.8),(G,0.2)}CT{(A,0.8),(C,0.1),(T,0.1)}C: segments (GA,0.8),(GG,0.2) | (CT,1) | (AC,0.8),(CC,0.1),(TC,0.1)
S4 = {(G,0.8),(T,0.2)}GA{(C,0.3),(G,0.2),(T,0.5)}CT: segments (GG,0.8),(TG,0.2) | (AC,0.3),(AG,0.2),(AT,0.5) | (CT,1)

3.1 Bounding Pr(ed(r, S) ≤ k)

The possible worlds Ω of S is the set of all possible instances of S. A possible world pw_j ∈ Ω is a pair (s_j, p(s_j)), where s_j is an instance of S with probability p(s_j). Let p(pw_j) = p(s_j) denote the probability of existence of the possible world pw_j. Note that s_j is a deterministic string and Σ_j p(pw_j) = 1. Then, by definition, Pr(ed(r, S) ≤ k) = Σ_{ed(r,s_j)≤k} p(pw_j). We first establish the necessary condition for r to be similar to S within an edit threshold k, i.e., Pr(ed(r, S) ≤ k) > 0, and then provide an upper bound for the same.

Necessary condition for Pr(ed(r, S) ≤ k) > 0: We partition the string S into m > k disjoint substrings. For simplicity, let q be the length of each partition. Note that each partition S^1, S^2, ..., S^m is an uncertain string. Let r contain substrings matching m′ ≤ m segments of S, i.e., the number of segments of S with Pr(w = S^x) > 0 for any substring w of r is m′. Then it can be seen that for any pw_j ∈ Ω, r contains substrings that match at most m′ segments among s_j^1, s_j^2, ..., s_j^m that partition s_j. Based on this observation, the following lemma establishes the necessary condition for Pr(ed(r, S) ≤ k) > 0.

Lemma 2. Given a string r and a string S partitioned into m > k disjoint segments, for r to be similar to S, i.e., Pr(ed(r, S) ≤ k) > 0, r must contain substrings that match at least (m − k) segments of S.

To apply the above lemma, we can obtain a set q(r, x) using position-aware selection as described earlier and use it to match against segment S^x. Table 1 shows the above lemma applied to a collection of uncertain strings. None of the segments of S1 match any substring of r, and hence the two cannot form a candidate pair. For S2, even though the second segment matches some substring of r, we do not use it, as we know by position-aware substring selection that such an alignment cannot lead to an instance of S that is similar to r. We can reject S2 as well, since it has only one matched segment. Strings S3 and S4 survive this pruning step and are taken forward.

Computing upper bound for Pr(ed(r, S) ≤ k): So far we were interested in knowing whether there exists a substring w ∈ q(r, x) that matches segment S^x. We now compute the probability that one or more substrings in q(r, x) match S^x. Let E_x denote such an event, with probability α_x. Then α_x = Pr(E_x) = Σ_{w∈q(r,x)} Pr(w = S^x). The correctness of α_x relies on the following observations:
• q(r, x), being a set, contains all distinct substrings.
• The event of substring w_i ∈ q(r, x) matching S^x is independent of substring w_j ∈ q(r, x) matching S^x for w_i ≠ w_j.
Next, our idea is to prune out the possible worlds of S which cannot satisfy the edit-distance threshold k with r and obtain a set C ⊆ Ω of candidate worlds. We can then use Pr(C) = Σ_{pw_j∈C} p(pw_j) as the upper bound on Pr(ed(r, S) ≤ k). Consider a possible world pw_j in which s_j is the possible instance of S. s_j being a deterministic string, we can apply the process of q-gram filtering described in Section 2.1 to quickly assess whether s_j can give an edit distance within threshold k. If yes, pw_j is a candidate world and we include it in C. This naive method requires all possible worlds of S to be instantiated and hence is too expensive to be used. Below we show how to achieve the desired upper bound, i.e., Pr(C), without explicitly listing the set Ω or C. For ease of explanation, let m = k + 1. We partition the possible worlds in Ω into sets Ω_0, Ω_1, ..., Ω_m such that:
• Ω_y includes any possible world pw_j where r contains substrings matching exactly y segments among s_j^1, ..., s_j^m that partition s_j, i.e., y = |{s_j^x | s_j^x ∈ q(r, x) for x = 1, 2, ..., m}|.
• Ω = Ω_0 ∪ Ω_1 ∪ Ω_2 ∪ ... ∪ Ω_m
• Ω_y ∩ Ω_z = ∅ for y ≠ z
With this partitioning of Ω, we have the following:

Pr(C) = Pr(Ω_1 ∪ Ω_2 ∪ ... ∪ Ω_m) = Pr(Ω \ Ω_0) = Pr(Ω) − Pr(Ω_0) = 1 − ∏_{x=1}^{m} (1 − α_x)

In the above equation, Ω_0 denotes the event that none of the segments of S match substrings of r. By slight abuse of notation, we say S^x matches r (using position-aware substring selection) if α_x > 0. The following lemma then summarizes our result on the upper bound.

Lemma 3. Let r and S be the given strings with edit threshold k. If S is partitioned into m = k + 1 disjoint segments, Pr(ed(r, S) ≤ k) is upper bounded by (1 − ∏_{x=1}^{m} (1 − α_x)), where α_x gives the probability that segment S^x matches r.
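Lemma 3 and the generalization that follows can be sketched in Python as below (our own code; segment distributions are given per position, and the α values of the Table 1 example serve as a check):

```python
from math import prod

def seg_match_prob(w, seg):
    """Pr(w = S^x): product of per-position probabilities (0 on length mismatch)."""
    if len(w) != len(seg):
        return 0.0
    return prod(pos.get(c, 0.0) for c, pos in zip(w, seg))

def alpha(qr_x, seg):
    """alpha_x = sum over distinct w in q(r, x) of Pr(w = S^x)."""
    return sum(seg_match_prob(w, seg) for w in qr_x)

def exactly_counts(alphas):
    """Pr(exactly j of the m independent segment-match events occur), j = 0..m,
    via the recursion Pr(i, j) = a_i Pr(i-1, j-1) + (1 - a_i) Pr(i-1, j)."""
    P = [1.0]  # Pr(0, 0) = 1
    for a in alphas:
        nxt = [0.0] * (len(P) + 1)
        for j, p in enumerate(P):
            nxt[j] += (1 - a) * p      # event i does not happen
            nxt[j + 1] += a * p        # event i happens
        P = nxt
    return P

def upper_bound(alphas, k):
    """Pr(at least m - k segments match): the Theorem 1 style upper bound.
    For m = k + 1 this reduces to 1 - prod(1 - alpha_x) (Lemma 3)."""
    m = len(alphas)
    return sum(exactly_counts(alphas)[max(m - k, 0):])
```

With the Table 1 values, S3 yields α = (1, 0, 0.2) and an upper bound of 0.2 < τ (rejected), while S4 yields a bound of 0.4 (kept as a candidate).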
Generalizing the upper bound for m > k: Finally, we turn our attention to computing Pr(C) for the scenario where S is partitioned into m > k segments. Once again considering the partitioning of Ω introduced above, Pr(C) = Pr(∪_{y=m−k}^{m} Ω_y) = Σ_{y=m−k}^{m} Pr(Ω_y). We observe that computing Pr(Ω_y) in this equation boils down to the following problem: there are m events E_x (x = 1, 2, ..., m) and we are given Pr(E_x) = α_x; what is the probability that exactly y events (among those m events) happen? Our solution is as follows. Let Pr(i, j) denote the probability that, among the first i events, exactly j of them happen. We then have the recurrence Pr(i, j) = Pr(E_i) Pr(i − 1, j − 1) + (1 − Pr(E_i)) Pr(i − 1, j). By populating the m × m matrix using a dynamic programming algorithm based on this recurrence, we can look up the last column to find Pr(Ω_y) for y = m − k, ..., m. This gives us an efficient (O(m^2)) way to compute Pr(C). We note that it is possible to improve the running time to O(m(m − k)), but we leave out the details for simplicity.

3.2 Bounding Pr(ed(R, S) ≤ k)

In this subsection, we follow the analysis of the previous subsection, taking into account the uncertainty introduced for string R. The possible worlds Ω of R and S is the set of all possible instances in R × S. A possible world pw_{i,j} ∈ Ω is a pair ((r_i, s_j), p(r_i) · p(s_j)), where s_j (r_i) is an instance of S (R) with probability p(s_j) (p(r_i)). Also, p(r_i) · p(s_j) denotes the probability of existence of the possible world pw_{i,j}, and Σ_{i,j} p(pw_{i,j}) = 1. Then, by definition, Pr(ed(R, S) ≤ k) = Σ_{ed(r_i,s_j)≤k} p(pw_{i,j}).

Necessary condition for Pr(ed(R, S) ≤ k) > 0: We begin by partitioning the string S into m > k disjoint substrings as before and assume q to be the length of each partition. Then the following lemma establishes the necessary condition for R to be similar to S within the edit threshold.

Lemma 4.
Given a string R and a string S partitioned into m > k disjoint segments, for R to be similar to S, i.e., Pr(ed(R, S) ≤ k) > 0, R must contain substrings that match at least (m − k) segments of S.

The correctness of the above lemma can be verified by extending the earlier observation as follows. Let R contain substrings matching m′ ≤ m segments of S, i.e., the number of segments of S with Pr(W = S^x) > 0 for any (uncertain) substring W of R is m′. Then for any pw_{i,j} ∈ Ω, r_i contains substrings that match at most m′ segments among s_j^1, s_j^2, ..., s_j^m that partition s_j. Next, we obtain a set q(R, x) for each segment S^x of S using the position-aware substring selection. This allows us to test only substrings W ∈ q(R, x) for a match against S^x. We highlight that the substring selection mechanism relies only on the lengths of the two strings R and S, the start position of a substring W of R, and that of S^x. Therefore, following the same arguments as in [14], we can prove that any similar pair (R, S) will be reported as a candidate.

Theorem 1. Let r and S be the given strings with edit threshold k. Also assume S is partitioned into m > k disjoint segments and α_x represents the probability that segment S^x matches r. Then Pr(ed(r, S) ≤ k) is upper bounded by the probability that at least (m − k) segments of S match r or, in other words, the probability that r contains substrings matching at least (m − k) segments of S.

Computing α_x: Let E_x denote the event that one or more substrings in the set q(R, x) match segment S^x, and let α_x be its probability. Using a trivial extension of the earlier result in Section 3.1, we could perhaps compute α_x = Pr(E_x) = Σ_{W∈q(R,x)} Pr(W = S^x). However, we show that this leads to an incorrect computation of α_x and requires careful investigation. Let R = A{(A, 0.8), (C, 0.2)}AATT, S = A{(A, 0.8), (C, 0.2)}AGCT, k = 1 and q = 3. Then we have S^1 = A{(A, 0.8), (C, 0.2)}A and q(R, 1) = {A{(A, 0.8), (C, 0.2)}A, {(A, 0.8), (C, 0.2)}AA}.
Using the above formula, Pr(E_1) = 0.64 + 0.04 + 0.64 = 1.32, which is definitely incorrect. To understand the scenario better, let us replace each substring W ∈ q(R, x) by a list of pairs (w_j, p(w_j)), where w_j is an instance of W with probability p(w_j). Note that this is only a different way of representing the set q(R, x); the two representations are equivalent. Now q(R, 1) = {(AAA, 0.8), (ACA, 0.2), (AAA, 0.8), (CAA, 0.2)} and Pr(E_1) = Σ_{w∈q(R,x)} p(w) × Pr(w = S^x) = 1.32 as before. However, this representation reveals that we have violated the second observation, which requires the matching of two substrings w_i, w_j ∈ q(R, x) with S^x to be independent events. In the current example, both occurrences of the substring AAA in q(R, 1) belong to the same possible world, so its probability effectively contributes twice to Pr(E_1).

Continuing the example in Table 1, we now try to apply the above theorem to strings S3 and S4. For S3 we have α_1 = 1, α_2 = 0, and α_3 = 0.2. Therefore the upper bound on S3's similarity with r is 0.2 < τ, and S3 can be rejected. Even though four out of six possible worlds of S3 contribute to C, the probability of each of them being small, their collective contribution falls short of τ. Similarly, the upper bound for S4 can be computed as 0.4, and the pair (r, S4) qualifies as a candidate pair. Thus Theorem 1, with the implicit requirement that Lemma 2 be satisfied, integrates q-gram filtering and probabilistic pruning.

Let string S be preprocessed such that each segment S^x is maintained as a list of pairs (s_j^x, p(s_j^x)), where s_j^x is an instance of S^x with probability p(s_j^x). Also assume r is preprocessed and the sets q(r, x) are available to us for x = 1, 2, ..., m (|q(r, x)| ≤ k + 1, so Σ_{x=1}^{m} |q(r, x)| ≤ (k + 1)m).
Then the desired upper bound can be computed efficiently by applying the above theorem, as it adds only the following computational overhead in comparison to its counterpart for deterministic strings: (1) the cost of computing α_x is bounded by k for each segment and by mk overall, and (2) the cost of computing Pr(C) using dynamic programming is bounded by m(m − k). By slight abuse of notation as before, we say S^x matches R if α_x > 0; Lemma 3 and Theorem 1 are then rewritten below as Lemma 5 and Theorem 2.

We overcome the double-counting issue described above by obtaining an equivalent set q(r, x) of q(R, x) that satisfies the substring uniqueness requirement, i.e., w_i ≠ w_j for all w_i, w_j ∈ q(r, x) with i ≠ j, which implicitly makes the matching of two of its substrings with S^x independent events. To achieve this, we pick all distinct (deterministic) substrings w ∈ q(R, x) (think of the representation of the set q(R, x) consisting of (w_j, p(w_j)) pairs) to be part of q(r, x). To distinguish between the two sets, let p_R(w_j) represent the probability associated with substring w_j in q(R, x), and p_r(w_j) the same for q(r, x). We maintain the equivalence of the sets by following the two-step process described below for each w ∈ q(r, x), obtaining the probability p_r(w) to be associated with it.

Step 1: Sort all occurrences of w in q(R, x) by their start positions in R. Group together all occurrences that overlap each other in R to obtain groups g_1, g_2, .... Then no two occurrences across groups overlap each other. Such a grouping is required only when there is a suffix-prefix match for w (i.e., some suffix of w represents the same string as its prefix); otherwise all its overlapping occurrences represent different possible worlds of R and hence are each in a group by themselves. We assign a probability p(g_i) to each group g_i as described below. Let ps_j represent the start position of occurrence w_j in R for j = 1, 2, ..., |g_i|. The region of overlap between an occurrence w_j of w and its previous occurrences in R is given by the range [y, z] = [ps_j, ps_{j−1} + q − 1].
We define β_j = β_{j−1} + p_R(w_j) − Pr(w_j[1..(z − y + 1)] = R[y..z]), with the initial conditions β_0 = 1 and ps_0 = −1 (for an empty overlap range, the matching probability is taken to be 1). Then p(g_i) = β_{|g_i|}. In essence, we keep adding the probability of every occurrence while taking out the probability of its overlap.

Step 2: Assign p_r(w) = 1 − ∏_i (1 − p(g_i)).

The first step combines all overlapping occurrences into a single event, and in the second step we find the probability that at least one of these events takes place. Now we can correctly compute the probability of the event of S^x matching substrings in q(R, x) by using its equivalent set q(r, x) as α_x = Pr(E_x) = Σ_{w∈q(r,x)} p_r(w) × Pr(w = S^x). For the example under consideration, the substring "AAA" yields a single group with associated probability 0.8 using the process described above. Then q(r, 1) = {(AAA, 0.8), (ACA, 0.2), (CAA, 0.2)} and Pr(E_1) = 0.68 is correctly computed.

Lemma 5. Let R, S be the given strings with edit threshold k. If S is partitioned into m = k + 1 disjoint segments, Pr(ed(R, S) ≤ k) is upper bounded by (1 − ∏_{x=1}^{m} (1 − α_x)), where α_x gives the probability that segment S^x matches R.

Theorem 2. Let R, S be the given strings with edit threshold k. Also assume S is partitioned into m > k disjoint segments and α_x represents the probability that segment S^x matches R. Then Pr(ed(R, S) ≤ k) is upper bounded by the probability that at least (m − k) segments of S match R or, in other words, the probability that R contains substrings matching at least (m − k) segments of S.

It is evident that the cost of computing the upper bound in the above theorem is dominated by the computation of the sets q(r, x). If this is assumed to be part of preprocessing, then the overhead involved is exactly the same as in the previous subsection. Let the fraction of uncertain characters in the strings be θ, and the average number of alternatives of an uncertain character be γ.
For analysis of q-gram filtering, we assume uncertain character positions to be uniformly distributed from now onwards. Then |q(r, x)| = (k + 1)γ θ·q , and computing set q(r, x) for each segment takes qγ θ·q times when string R is deterministic (previous subsection). Note that the multiplicative q appears only when substring w has a suffix-prefix match and its occurrences in set q(R, x) overlap. Assuming typical values θ = 20%, γ = 5 and q = 3, it takes only two and half times longer to compute αx when R is uncertain using q(r, x). 4. INDEXING Using Theorem 2 we observe that if a string R does not have substrings that match a sufficient number of segments of S, we can prune the pair (R, S). We use an indexing technique that facilitates the implementation of this feature to prune large numbers of dissimilar pairs. So far we assumed each string S is partitioned into m segments, each of which is of length q. In practice, we fix q as a system parameter and then divide S into as many disjoint segments as necessary i.e. m = max(k + 1, ⌊|S|/q⌋). Without loss of generality let m = ⌊|S|/q⌋. We use an even-partition scheme [14, 15] so that each segment has a length of q or q + 1. Thus we partition S such that the last |S| − ⌊ |S|/q⌋ ∗ q segments have length q + 1 and length is q for the rest of them. Let Sl denote the set of strings with length l and Slx denote the set of the x-th segments of strings in Sl . We build an inverted index for each Slx denoted by Lxl as follows. Consider a string Si ∈ Sl . We instantiate all possibilities of its segment Six and add them to Lxl along with their probabilities. Thus Lxl is a list of deterministic strings and for each string w, its inverted list Lxl (w) is the set of uncertain strings whose x-th segment matches w tagged with probability of such a match. To be precise, Lxl (w) is enumeration of pairs (i, P r(w = Six )) where i is the string-id. By design, each such inverted list Lxl (w) is sorted by string-ids as described later. 
We emphasize that a string-id i appears at most once in any Lxl (w) and in as many lists Lxl (w) as the number of possible instances of Six . We use these inverted indices to answer the similarity join query as follows. We sort strings based on their lengths in ascending order and visit them in the same order. Consider the current Computing upper bound for P r(ed(R, S) ≤ k): Finally, to obtain the upper bound on P r(ed(R, S) ≤ k) we obtain set C ⊆ Ω by pruning out those possible worlds which can not satisfy the edit-distance threshold k. Consider a possible world pwi,j in which sj (ri ) is a possible instance of S (R). Both ri and sj being deterministic strings, we can quickly assess if ri and sj can be within edit distance k by applying the process of q-gram filtering described in Section 2.1. If affirmative, pwi,j is a candidate world and we include it in C. However, our goal is to compute P r(C) without enumerating all possible worlds of R ×S. As before, we partition the possible worlds in Ω into setsΩ 0 , Ω1 , ..., Ωm such that Ω= ∪m y=0 Ωy andΩ y ∩Ωz = ∅ for y ̸= z. Moreover, Ωy includes any possible world pwi,j where ri contains substrings matching exactly y segments from s1j , ..., sm j that partition sj i.e, y = |{sxj |sxj ∈ q(ri ! , x) for x = 1, 2, .., m}|. Then m P r(C) = P r(∪m y=(m−k) Ωy ) = y=m−k P r(Ωy ) and can be computed by following the same dynamic programming approach described earlier. Therefore the key difference in the current scenario (both R and S are uncertain) from the one in the previous subsection is the computation of αx . After computing all αx we can directly apply Lemma 2 and 1475 5. string R = Si . We find strings similar to R among the visited strings only using the inverted indices. This implies we maintain indices only for visited strings to avoid enumerating a string pair twice. It is clear that we need to look for similar strings in Sl by querying its associated index only if |R| − k ≤ l ≤ |R|. 
To find strings similar to R, we first obtain candidate strings using the proposed indexing, as described in the next paragraph. We then subject these candidate pairs to frequency distance filtering (Section 5). Candidate pairs that survive both these steps are evaluated with CDF bounds (Section 6.1), with the final verification step (Section 6.2) outputting only the strings that are similar to R. After finding similar strings for R = S_i, we partition S_i into m > k segments (as dictated by q) and insert the segments into the appropriate inverted indices. Then we move on to the next string R = S_{i+1} and iteratively find all similar pairs.

Finally, given a string R, we show how to query the index associated with S_l to find candidate pairs (R, S) such that S ∈ S_l and Pr(ed(R, S) ≤ k) > τ. We preprocess R to obtain q(r, x), which can be used to query each inverted index L_l^x. For each w ∈ q(r, x) we obtain an inverted list L_l^x(w). Since all lists are sorted by string-id, we can scan them in parallel to produce a merged (union) list of all string-ids i along with the αx computed for each of them. We maintain a top pointer in each L_l^x(w) list, initially pointing to its first element. At each step, we find the minimum string-id i among the elements currently at the tops of the lists and compute αx for the pair (R, S_i) using the probabilities associated with string-id i in all L_l^x(w) lists (where present). After outputting the string-id and its αx as a pair in the merged list, we increment the top pointers of those L_l^x(w) lists whose top currently points to the element with string-id i. Let the merged list be L^{αx}. Once again, all L^{αx} lists for x = 1, 2, .., m are sorted by string-ids. Therefore, by employing top pointers and scanning the lists L^{αx} in parallel, we can count the number of segments of S_i that matched their respective q(r, x) by counting the number of L^{αx} lists that contain string-id i.
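The top-pointer merge just described can be sketched as follows. We assume here (our reading, not stated code) that αx for a string-id is the sum of the probabilities attached to it across the matching inverted lists, since distinct instances of a segment are mutually exclusive:

```python
def merge_alpha(postings_lists):
    """postings_lists: one list per w in q(r, x), each a list of
    (string_id, prob) pairs sorted by string_id.
    Returns the merged (union) list of (string_id, alpha_x), sorted by string_id."""
    tops = [0] * len(postings_lists)   # one top pointer per inverted list
    merged = []
    while True:
        heads = [lst[t][0] for lst, t in zip(postings_lists, tops) if t < len(lst)]
        if not heads:
            break
        sid = min(heads)               # smallest string-id among the tops
        alpha = 0.0
        for idx, lst in enumerate(postings_lists):
            t = tops[idx]
            if t < len(lst) and lst[t][0] == sid:
                alpha += lst[t][1]     # mutually exclusive instances: probabilities add
                tops[idx] += 1         # advance only the lists that matched sid
        merged.append((sid, alpha))
    return merged
```

Each posting is consumed exactly once, so the merge is linear in the total size of the inverted lists.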
If the count is less than m − k, we can safely prune the candidate pair (R, S_i) using Lemma 5. Otherwise, we can compute the upper bound on Pr(ed(R, S_i) ≤ k) by supplying the already computed αx values to the dynamic programming algorithm. If the upper bound does not meet our probability threshold requirement, we can discard string S_i, as it cannot be similar to R by Theorem 2; otherwise (R, S_i) is a candidate pair.

Given a string R, the proposed indexing scheme allows us to obtain all strings S ∈ S that are likely to be similar to R without explicitly comparing R to each and every string in S, as has been done for related problems in the area of uncertain strings [10, 6, 4]. For a string r in a deterministic string collection, we need to consider m(k + 1) of its substrings while answering the join query using the procedure just described. In comparison, in the probabilistic setting we need to consider m(k + 1)γ^(θ·q) deterministic substrings of R. Moreover, a string-id can belong to at most γ^(θ·q) inverted lists in L_l^x in the probabilistic setting, whereas inverted lists are disjoint for a deterministic string collection. Thus, the proposed indexing achieves competitive performance against its counterpart for answering a join query over deterministic strings. Further, the indexing scheme uses disjoint q-grams of strings instead of overlapping ones as in [6, 4]. This allows us to use a slightly larger q with the same storage requirements.

5. FREQUENCY DISTANCE FILTERING

As noted in [4], frequency distance displays great variation with an increase in the number of uncertain positions in a string, and can be effective in pruning dissimilar string pairs. We first obtain a simple lower bound on fd(R, S) and then show how to quickly compute an upper bound on the probability of fd(R, S) being within threshold. For each character c_i ∈ Σ, let f(S)_i^c and f(S)_i^t denote the minimum and maximum possible number of its occurrences in S, respectively.
For brevity, we drop the function notation and denote these occurrence counts as fS_i^c and fS_i^t. Note that fS_i^c also represents the number of occurrences of c_i in S with probability 1, and fS_i^t represents the number of certain and uncertain positions of c_i. Thus fS_i^u = fS_i^t − fS_i^c gives the number of uncertain positions of c_i in S. fR_i^c, fR_i^u and fR_i^t are defined similarly. We observe that, if fR_i^t < fS_i^c, any possible world pw of R × S will have a frequency distance of at least (fS_i^c − fR_i^t). Generalizing this observation, we obtain a lower bound on fd(R, S), summarized below.

Lemma 6. Let R and S be two strings over the same alphabet Σ. Then we have fd(R, S) ≥ max{pD, nD}, where

pD = Σ_{c_i : fS_i^t < fR_i^c} (fR_i^c − fS_i^t),    nD = Σ_{c_i : fR_i^t < fS_i^c} (fS_i^c − fR_i^t).

Since the edit distance of a string pair is lower bounded by its frequency distance, we can prune out (R, S) if the minimum frequency distance obtained by the above lemma is more than the desired edit threshold k. To obtain an upper bound on the probability of fd(R, S) being at most k, we use the technique introduced in [4], which relies on the expected value of all possible frequency distances. Using such expectations for the positive and negative frequency distances (E[pD], E[nD]), the One-Sided Chebyshev Inequality, and the same analysis as in [4], we obtain the following theorem.

Theorem 3. Let R and S be two strings over the same alphabet Σ. Then we have

Pr(ed(R, S) ≤ k) ≤ Pr(fd(R, S) ≤ k) ≤ B² / (B² + (A − k)²),

where

A = ||R| − |S||/2 + (E[pD] + E[nD])/2,
B² = (|R| − |S|)²/2 + ||R| − |S|| · (E[pD] + E[nD])/2 + min(|R| · E[nD], |S| · E[pD]) − A².

The main obstacle in using the above theorem is the efficient computation of E[pD] = Σ_{c_i} E[(fR_i − fS_i)⁺] and E[nD] = Σ_{c_i} E[(fS_i − fR_i)⁺], where (·)⁺ denotes the positive part. We focus on computing E[nD] below, as E[pD] can be computed in a similar fashion. With the frequency of c_i in S, i.e., fS_i, varying between fS_i^c and fS_i^t, let Pr(fS_i = x) represent the probability that c_i appears exactly x times.
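The Lemma 6 lower bound needs only the per-character minimum and maximum counts, so it can be sketched directly. The dictionary-based encoding of fR_i^c, fR_i^t, fS_i^c, fS_i^t below is our own, hypothetical representation:

```python
def fd_lower_bound(r_min, r_max, s_min, s_max):
    """Lower bound max(pD, nD) on the frequency distance fd(R, S), per Lemma 6.

    r_min[c] / r_max[c]: minimum / maximum possible occurrences of character c
    in R (fR^c and fR^t); s_min / s_max likewise for S. Missing keys mean 0.
    """
    chars = set(r_min) | set(r_max) | set(s_min) | set(s_max)
    # pD: characters that R must contain more often than S possibly can
    pD = sum(r_min.get(c, 0) - s_max.get(c, 0)
             for c in chars if s_max.get(c, 0) < r_min.get(c, 0))
    # nD: characters that S must contain more often than R possibly can
    nD = sum(s_min.get(c, 0) - r_max.get(c, 0)
             for c in chars if r_max.get(c, 0) < s_min.get(c, 0))
    return max(pD, nD)
```

If the returned bound exceeds k, the pair (R, S) can be pruned without any probability computation at all.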
Put another way, Pr(fS_i = x) is the probability that c_i appears at exactly (x − fS_i^c) of its fS_i^u uncertain positions. This leads to a natural dynamic programming algorithm that computes Pr(fS_i = x) for all x = fS_i^c, ..., fS_i^t in O((fS_i^u)²) time; please refer to [4] for more details. For efficiency in computing E[nD], we preprocess S and maintain these values in O(fS_i^u) space, following [4].

If the fraction of uncertain characters in the strings is θ, the frequency filtering summarized in Theorem 3 can be applied in O(σθ(|R| + |S|)) time. The alphabet size being typically constant, the efficiency of applying frequency filtering depends on the degree of uncertainty and the string lengths. Therefore, with increasing length of the input strings, the improvement from |R| × |S| to |R| + |S| provides a substantial reduction in filtering time. While answering the similarity join query, we preprocess R = S_i ∈ S_l to compute the arrays for each character in alphabet Σ and maintain them as part of our index. All candidate pairs passing the q-gram filtering are then subjected to frequency distance filtering for further refinement before we move on to the next string R = S_{i+1} ∈ S_l.

Without loss of generality, let fR_i^c < fS_i^c ≤ fR_i^t < fS_i^t. Then, by definition, E[nD] = Σ_{c_i} E[nD_i] where

E[nD_i] = Σ_{x=fR_i^c}^{fR_i^t} Σ_{y=max(x+1, fS_i^c)}^{fS_i^t} Pr(fR_i = x) Pr(fS_i = y) (y − x).

In the above equation, Pr(fR_i = x) and Pr(fS_i = y) can be computed in constant time using the precomputed answers. Therefore, a naive way of computing E[nD_i] takes O(fS_i^u · fR_i^u) time. Below we speed this computation up to O(min(fS_i^u, fR_i^u)) time. We maintain the following probability distributions for each c_i of S. For 0 ≤ x ≤ fS_i^u:

S1_i[x] = Pr(fS_i = fS_i^c + x),
S2_i[x] = Σ_{y=x}^{fS_i^u} Pr(fS_i = fS_i^c + y).
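S1_i is exactly the distribution Pr(fS_i = ·) produced by the quadratic dynamic program mentioned above. A minimal sketch, assuming each uncertain position contributes character c_i independently with its own probability (function and variable names are ours):

```python
def freq_distribution(uncertain_probs):
    """Distribution of the number of uncertain positions at which c_i occurs.

    uncertain_probs: occurrence probability of c_i at each of its fS^u
    uncertain positions in S. Returns dist with dist[x] = Pr(fS = fS^c + x).
    The double loop is the O((fS^u)^2) dynamic program from [4].
    """
    dist = [1.0]                       # before any uncertain position: count 0
    for p in uncertain_probs:
        nxt = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            nxt[x] += mass * (1 - p)   # c_i absent at this position
            nxt[x + 1] += mass * p     # c_i present at this position
        dist = nxt
    return dist
```

The suffix sums of this array then give S2_i in a single extra pass.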
The remaining two distributions are, for 0 ≤ x ≤ fS_i^u:

S3_i[x] = Σ_{y=x}^{fS_i^u} (y − x + 1) Pr(fS_i = fS_i^c + y),
S4_i[x] = Σ_{y=0}^{x} (x − y) Pr(fS_i = fS_i^c + y).

S1_i is simply the probability distribution of c_i appearing at uncertain positions in the range [0, fS_i^u] (precomputed using dynamic programming). S2_i maintains the probability that c_i appears at at least x uncertain positions, i.e., S2_i[x] = Pr(fS_i ≥ fS_i^c + x). S3_i maintains the same summation with the elements of the summation series scaled by 1, 2, .... Finally, S4_i takes the summation series for Pr(fS_i ≤ fS_i^c + x), scales it by 0, 1, ... in the reverse direction, and maintains the output at index x. The intuition behind maintaining the scaled summations is that, given a particular frequency z of fR_i, the expectation of its frequency distance with fS_i ∈ [fS_i^c, fS_i^t] resembles the summation series of S3_i[x] or S4_i[x]. All the above distributions can be computed in O(fS_i^u) time and occupy the same O(fS_i^u) storage. Similar probability distributions are also maintained for R. We thus achieve the speedup without hurting preprocessing time and at no additional storage cost. E[nD_i] can now be computed as follows:

E[nD_i] = Σ_{x=fR_i^c}^{fS_i^c − 1} Σ_{y=fS_i^c}^{fS_i^t} Pr(fR_i = x) Pr(fS_i = y) (y − x) + Σ_{x=fS_i^c}^{fR_i^t} Σ_{y=x+1}^{fS_i^t} Pr(fR_i = x) Pr(fS_i = y) (y − x)
        = nD_i1 + nD_i2,

nD_i1 = Σ_{x=fR_i^c}^{fS_i^c − 1} Pr(fR_i = x) ( Σ_{y=fS_i^c}^{fS_i^t} Pr(fS_i = y) (y − fS_i^c + 1) + (fS_i^c − x − 1) Σ_{y=fS_i^c}^{fS_i^t} Pr(fS_i = y) )
      = R4_i[fS_i^c − fR_i^c − 1] × S2_i[0] + (R2_i[0] − R2_i[fS_i^c − fR_i^c]) × S3_i[0],

nD_i2 = Σ_{x=fS_i^c}^{fR_i^t} Pr(fR_i = x) Σ_{y=x+1}^{fS_i^t} Pr(fS_i = y) (y − x)
      = Σ_{x=fS_i^c}^{fR_i^t} R1_i[x − fR_i^c] × S3_i[x − fS_i^c + 1].

6. VERIFICATION

The goal of verification is to conclude whether the strings in a candidate pair (R, S) that has survived the above filters are indeed similar, i.e., Pr(ed(R, S) ≤ k) > τ. A straightforward solution is to instantiate each possible world of R × S and add up the probabilities of the possible worlds where the possible instances of R and S are within edit threshold k. Before resorting to such expensive verification, we make a last attempt to prune out a candidate pair by extending the CDF bounds in [6]. If unsuccessful, we use a trie-based verification that exploits the common prefixes shared by the instances of an uncertain string.

6.1 Bound based on CDF

We briefly review the process in [6] and highlight the changes needed to compute the mentioned bounds correctly when both input strings are uncertain. We populate a |R| × |S| matrix using dynamic programming. In each cell D = (x, y), we compute (at most) k + 1 pairs of values, i.e., {(L[j], U[j]) | 0 ≤ j ≤ k}, where L[j] and U[j] are lower and upper bounds on Pr(ed(R[1..x], S[1..y]) ≤ j), respectively. Then, by checking the bounds in cell (|R|, |S|), we can accept or reject the candidate string pair (R, S), if possible. To fill in the DP table, consider the basic step of computing the bounds of a cell D = (x, y) from its neighboring cells — upper left: D1 = (x − 1, y − 1), upper: D2 = (x, y − 1), and left: D3 = (x − 1, y). As noted in [6], when R[x] matches S[y] (with probability p1 = Σ_{c_i} Pr(R[x] = c_i) Pr(S[y] = c_i)), it is always optimal to take the distribution from the diagonal upper-left neighbor. When R[x] does not match S[y], which happens with probability p2 = 1 − p1, we use the relaxations suggested in [6]. Let (argmin D_i) return the index i (1 ≤ i ≤ 3) such that L_{D_i}[0] is greatest; a tie is broken by selecting the greatest L_{D_i}[1], and so on.

Theorem 4. At each cell D = (x, y) of the DP table, L[j] ≤ Pr(ed(R[1..x], S[1..y]) ≤ j) ≤ U[j], where

L[j] = max(p1 · L_{D1}[j], p2 · L_{(argmin D_i)}[j − 1]),
U[j] = min(1, p1 · U_{D1}[j] + p2 · U_{D1}[j − 1] + Σ_{i=2}^{3} U_{D_i}[j − 1]).

Proof. We follow the analysis in [6]. Consider a possible world pw_{i,j} in which r_i[x] = s_j[y]. Let the distance values at cells D and D_i (1 ≤ i ≤ 3) be v and v_i, respectively. Then we have v = v1. This is because v2, v3 ≥ v1 − 1; thus, v = min(v1, v2 + 1, v3 + 1) = v1. Next, consider a possible world pw_{i,j} in which r_i[x] ≠ s_j[y]. Then v = min(v_i) + 1. By using (argmin D_i), we pick one fixed neighbor cell (i.e., the one that has a small distance value with the highest probability) instead of accounting for all possible worlds in which r_i[x] ≠ s_j[y]; hence the true v value could be smaller than this one in some possible worlds. However, we observe that, out of all possible worlds with distance v in cell D, the worlds with edit distance v in D1 are not disjoint from the worlds with distance v − 1 in D2. The same argument applies to the worlds with distance v − 1 in D3 as well. Therefore, we choose the maximum of the two scenarios as our lower bound. For the upper bound, the case where r_i[x] matches s_j[y] remains the same. A possible world pw_{i,j} with distance v − 1 in D2 can be extended by reading an additional character of R, and we get distance v in cell D for all such worlds. Similarly, moving from distance v − 1 in D3 to distance v in D can be thought of as inserting a character of S. Hence, we do not need to scale down the probabilities U_{D2}[v − 1] and U_{D3}[v − 1] to obtain the upper bound for cell D.

We note that the bounds summarized in the above theorem differ from the ones presented in [6], as those cannot be used directly in the current scenario¹. Finally, the simple DP algorithm can be improved by computing (L[j], U[j]) only for those cells D = (x, y) with |x − y| ≤ k, since L[k] = U[k] = 0 otherwise. Thus, we can apply the CDF-bounds-based filtering to a candidate pair (R, S) in O(min(|R|, |S|)(k + 1)max(k, γ)) time, where γ is the average number of alternatives of an uncertain character.

¹ a) Lower bound violation: r = ACC, S = A{(C, 0.7), (G, 0.1), (T, 0.1)} with k = 1. b) Upper bound violation: r = DISC, S = DI{(C, 0.4), (S, 0.5), (R, 0.1)} with k = 1.

6.2 Trie-based Verification

Prefix pruning has been a popular technique to expedite the verification of a deterministic string pair (r, s) for edit threshold k. A naive approach to this verification computes the dynamic programming (DP) matrix of size |r| × |s| such that cell (x, y) gives the edit distance between r[1..x] and s[1..y]. Prefix pruning observes that if no cell in row x, i.e., (x, ∗), meets threshold k, then the following rows cannot contain cells with edit distance k or less, i.e., DP[i > x, ∗] > k. Even with such an early-termination condition, verifying all pairs (all possible instances of R × S) for a candidate pair (R, S) can be expensive. With the goal of avoiding the naive all-pairs comparison, we propose trie-based verification. Let T_S be the trie of all possible instances of S, and T_R the same for string R. Let node u in T_S represent the string u (obtained by concatenating the edge labels from the root to node u); then all possible instances of S with u as a prefix are leaves in the subtree rooted at u. We say a node u ∈ T_S is similar to a node v ∈ T_R if ed(u, v) ≤ k. Using prefix pruning, we then have the following observation [5]:

• Given u ∈ T_S, v ∈ T_R: if u is not similar to any ancestor of v, and v is not similar to any ancestor of u, then no possible instance s of S with prefix u can be similar to a possible instance r of R with v as its prefix.

Using the technique in [11, 5], we can compute a set of similar nodes in T_R for each node u ∈ T_S. Then, if u = s_j is a leaf node, each node v = r_i in its similar set that is also a leaf node gives us a possible world pw_{i,j} whose probability contributes to Pr(ed(R, S) ≤ k). However, the techniques in [5] implicitly assume that both trie structures are available. Here we propose on-demand construction of the trie, which avoids enumerating all possible instances of S. Note that we still need to build the trie T_R completely; however, its construction cost can be amortized, as we build T_R once and use it for all candidate pairs (R, ∗). As noted in [11], the nodes in T_R that are similar to a node u ∈ T_S can be computed efficiently using only the similarity set already computed for u's parent. This allows us to perform a (logical) depth-first search on T_S and materialize the children of u ∈ T_S only if its similarity set is not empty. Figure 1 illustrates this approach and reveals that on-demand trie construction can reduce the verification cost by avoiding the instantiation of, and consequently the comparison with, a large fraction of the possible worlds of S. In the figure, only the nodes linked with solid lines are explored and instantiated by the verification algorithm. For simplicity, we do not display the similar node sets and the probabilities associated with the trie nodes.

[Figure 1: Trie-based Verification Example — the tries of strings R and S; solid edges are explored, dashed edges unexplored.]

7. EXPERIMENTS

We have implemented the proposed indexing scheme and filtering techniques in C++. The experiments are performed on a 64-bit machine with an Intel Core i5 3.33GHz CPU and 8GB RAM, running Ubuntu. We consider the following algorithms for comparison, which use only subsets of the filtering mechanisms: algorithm QFCT makes use of all the filtering schemes listed in this article, whereas QCT, QFT and FCT bypass frequency distance filtering, filtering based on CDF bounds, and q-gram filtering, respectively.

Datasets: We use two synthetic datasets obtained from their real counterparts using the technique of [10, 4]. The first data source is the author names in dblp (|Σ| = 27). For each string s in the dblp dataset we first obtain a set A(s) of strings that are within edit distance 4 of s. A character-level probabilistic string S for string s is then generated such that, for a position i, the pdf of S[i] is based on the normalized frequencies of the letters at the i-th position of all the strings in A(s). The fraction of uncertain positions in a character-level probabilistic string, i.e., θ, is varied between 0.1 and 0.4 to generate strings with different degrees of uncertainty. The string lengths in this dataset approximately follow a normal distribution over the range [10, 35]. For the second dataset we use a concatenated protein sequence of mouse and human (|Σ| = 22) and break it arbitrarily into shorter strings. Uncertain strings are then obtained by following the same procedure as for the dblp data source. However, for this dataset we use slightly larger string lengths with less uncertainty: string lengths roughly follow a uniform distribution over the range [20, 45], and θ ranges between 0.05 and 0.2. In both datasets, the average number of choices (γ) that each probabilistic character S[i] may have is set to 5. The default values used for the dblp dataset are: number of strings in the collection |S| = 100K, average string length ≈ 19, θ = 0.2, k = 2, τ = 0.1, and q = 3. Similarly, for the protein dataset we use the default setting |S| = 100K, average string length = 32, θ = 0.1, k = 4, τ = 0.01, and q = 3.

[Figure 2: Effectiveness vs. Efficiency. Figure 3: Effect of Dataset Size |S|.]

Even for the larger datasets, the algorithms that employ q-gram filtering as their computationally inexpensive first step (QFCT, QFT and QCT) filter efficiently. For the exceptional case of algorithm FCT, the filtering overhead increases almost quadratically with the input size, as both of its filtering techniques (frequency distance and CDF bounds) need to explicitly compare the query string R with all possible strings S ∈ S (|S| ≥ |R| − k). Also, the filtering time required for QFT and QCT closely follows that for QFCT.
7.1 Effectiveness vs. Efficiency of Pruning

In this set of experiments, we compare the pruning ability of the filtering techniques and the overhead of applying them, on both datasets with θ = 0.2, k = 2 and τ = 0.1. Figure 2 shows the number of candidates remaining after applying each filtering scheme and reveals that the CDF bounds provide the tightest filtering among the three. The effectiveness of the CDF follows from the fact that it uses both upper and lower bounds to prune the strings. The upper bound obtained by q-gram filtering tends to be looser than the CDF's, as it depends on the partitioning based on q, whereas the frequency distance based upper bound is sensitive to the length difference between the two strings. However, the effectiveness of the CDF comes at the cost of time. On the other hand, q-gram filtering is extremely fast and can still prune out a significant number of candidate pairs by taking advantage of the indexing scheme. For the protein dataset, q-gram filtering is close to the CDF bounds in terms of effectiveness and is an order of magnitude faster to apply. Frequency distance filtering, being dependent only on the alphabet size and the uncertain positions in the strings (against the CDF's dependence on string length), can help improve query performance by reducing the number of candidate pairs passed on to the CDF for evaluation. Therefore, in the following experiments, the algorithm variants apply these filtering techniques in increasing order of their overhead, as suggested by their acronyms.

Figure 2 also reveals that applying q-gram filtering and CDF bounds filtering takes longer for the protein dataset than for the dblp data. Due to the larger string lengths and fixed q, q-gram filtering needs to partition the protein strings into a larger number of segments (i.e., a larger m). Thus, there are more αx probabilities to be computed, and it takes longer to compute the desired upper bound of Theorem 2. Similarly, computing the CDF bounds requires populating a dynamic programming matrix whose size depends on the string lengths. However, frequency distance filtering benefits from the smaller alphabet and lower degree of uncertainty of the protein sequences, and shows better performance on the protein data.

7.2 Effects of Data Size |S|

Figure 3 shows the scalability of the various algorithms on the dblp dataset, where we vary |S| from 50K to 500K. With computationally inexpensive q-gram filtering as the first step, algorithms QFCT, QFT and QCT achieve efficient filtering. This confirms the ability of q-gram filtering to significantly reduce the filtering overhead, and highlights the advantages offered by the proposed indexing scheme incorporating it. Figure 3 also shows the time required by these algorithms for answering the join query. FCT, lacking efficient (though effective) filtering, takes the longest to output its answers. However, the query time of QFT, despite its use of efficient q-gram filtering, shows a rapid increase. In contrast, the good scalability of QFCT and QCT emphasizes the need for tight filtering conditions based on the lower and upper bounds of the CDF. In their absence, exponentially more candidates require trie-based verification, which quickly deteriorates query performance. Thus, the combination of q-gram filtering with CDF bounds in QFCT achieves the best of both worlds, allowing us to restrict the increase in both the filtering time and the number of trie-based verifications. Though the number of outputs increases quadratically with data size, the increase in the number of false positives in the verification step of QFCT (i.e., the scenario where a candidate pair is not an output after verification) was found to be linear in the output size. The order-of-magnitude performance gain of QFCT over the others seen in Figure 3 will extend further for larger input collections. With algorithm QFT requiring a higher number of expensive verifications and QCT showing trends similar to those of QFCT, we use only the remaining two algorithms, i.e., QFCT and FCT, in the experiments to follow. We also append the character 'D' or 'P' to an algorithm's acronym to distinguish between its query times on the dblp and protein datasets.

7.3 Effects of θ

[Figure 4: Effect of θ.]

An increase in the number of uncertain positions in the strings has a detrimental effect on both algorithms QFCT and FCT, as shown in Figure 4. This is due to the direct impact of θ on every step of answering join queries. Starting with q-gram filtering, more uncertain positions in the query string R imply more time for populating the sets q(r, x) as a preprocessing step, as well as for adding the string R to the inverted indices after answering the query. The larger size of the sets q(r, x) due to the increase in θ also increases the lookup time in the inverted indices and consequently increases the time required for computing αx. Though the size of a set q(r, x) can increase exponentially with θ, its impact is limited due to the small fixed value of q. There is another subtle impact of θ on q-gram filtering. With more uncertain positions in the query string R, more strings in the collection can be matched with substrings of R. We found this increase to be linear, growing from ≈ 1.5% of all join pairs evaluated by q-gram filtering at θ = 0.1 to only ≈ 4% at θ = 0.4 on the dblp dataset. Thus, the proposed q-gram filtering serves the purpose of efficient pruning even with increased uncertainty. The impact of θ on the computation of the frequency distance and CDF bounds is more obvious. Computing the expected frequency distance of a character depends directly on the number of positions in the input strings (R, S) where it appears probabilistically. Due to the probability computation of two positions matching in R and S (R[x] = S[y]), it takes longer to populate the dynamic programming matrix for the CDF. Thus, the increase in the filtering time of the query algorithms is almost linear in θ. Finally, in the trie-based verification, more possible worlds need to be evaluated, increasing the verification cost exponentially. In conclusion, the verification step is affected worst of all by large θ and is the primary contributor to the increased time for answering join queries. We note that in most scenarios algorithm QFCT takes longer to answer join queries on the protein data than on the dblp data, because of the higher overhead of q-gram and CDF filtering pointed out in Section 7.1. On the other hand, algorithm FCT performs better on the protein data by virtue of its faster frequency filtering, as seen earlier. This comparative behavior of QFCT and FCT is also evident in Figure 4.

7.4 Effects of τ

[Figure 5: Effect of τ.]

Figure 5 shows the results on the dblp and protein datasets for different values of τ from 0.001 to 0.4. Though the query times remain insensitive to τ over a large range, a gradual increase or decrease in the probability threshold has a two-fold effect on the query algorithms. We analyze the scenario by looking at the number of candidate pairs pruned by the CDF bounds, either accepted based on the lower bound or rejected based on the upper bound. As τ increases, the upper bound filter becomes more and more selective, as it can reject more candidate pairs. On the contrary, filtering based on the lower bound loses its effectiveness with increased τ, as it cannot accept as many strings as it can for smaller values of τ. Thus, the relative increase and decrease in the number of candidate pairs pruned by the CDF upper and lower bounds, respectively, determines the overall effect of varying τ. When the upper bound filter cannot compensate for the loss in effectiveness of the lower bound, more candidate pairs require trie-based verification, resulting in higher query time. Such a scenario is evident in Figure 5 for the protein data for τ ranging from 0.001 to 0.1. τ also has an interesting effect on q-gram filtering. Figure 5 shows the number of candidate pairs rejected by q-gram filtering in QFCT, as well as the count of candidates accepted by the CDF lower bound and rejected by the CDF upper bound in QFCT. Note that q-gram filtering only uses the upper bound, and Figure 5 shows the reduced effectiveness of CDF lower bound filtering. As τ increases, probabilistic pruning (Theorem 2) becomes more effective and prunes out a significant number of candidate pairs that satisfy the necessary condition for two strings to be similar described in Lemma 5 (shown in Figure 5). In effect, q-gram filtering reduces the overhead of applying the CDF bounds and to some extent compensates for the increased verification cost, if any. This effect can be seen in the gradual decrease in the number of candidates rejected by the CDF, even though an increased number of candidates is pruned using the upper bounds overall. Finally, for large τ, the q-gram filtering advantage, coupled with the reduced output size due to a more selective τ, results in improved query time.

7.5 Effects of k

[Figure 6: Effect of k.]

Figure 6 shows the time required for answering a join query on the dblp dataset when k changes from 1 to 4, and on the protein dataset for k = 2, 4, 6, 8. With increased k we can expect more string pairs to satisfy the edit threshold, and hence an increase in query time. As we loosen the edit threshold requirement, the effectiveness of q-gram filtering begins to deteriorate, since the requirement of Lemma 5 can be met by a string S with fewer of its partitions matching substrings of R. Therefore, even with probabilistic pruning, many false candidate pairs are passed on to the frequency distance and CDF filtering routines. Also, the number of candidates removed by the upper bounds of the frequency distance and CDF decreases with an increase in k. Though lower bound filtering in the CDF can accept more candidates as k increases, this benefit is easily offset by the loose upper bounds, resulting in a net increase in verification cost. With increased k, the time required by algorithm QFCT approaches that of FCT, but it still manages to save up to 35% of FCT's query cost.

7.6 Effects of q

[Figure 7: Effect of q.]

In this set of experiments we investigate the effect of the q-gram length on the efficiency and effectiveness of q-gram filtering, using input collections with 100K strings. As pointed out earlier, q-gram filtering incurs more filtering overhead for higher string lengths with fixed q. We can hope to reduce this overhead by increasing q; however, such an increase has side effects on the space-time tradeoff of q-gram filtering. Even though we will have fewer partitions per string due to the increased q, each segment now has more possible instances to be added to the inverted indices, increasing the storage requirement, as shown in Figure 7. The rate of increase is faster for the dblp dataset because of its higher θ, i.e., more uncertain positions, and its larger alphabet. We note that we use peak memory usage as the measure, accounting for the indices maintained at any point during query answering based on the length of the string currently under consideration. Further, this also implies that the query preprocessing that populates the sets q(r, x) needs more time, offsetting the benefits of higher q to some extent. Figure 7 shows the improvement in filtering time for q varying from 2 to 6. With the size of q(r, x) increasing exponentially with q, the improvement in filtering time achieved due to fewer segments also decreases exponentially. For deterministic strings, increasing q makes it more difficult for a segment of a string s to match substrings of the query string r, implying a potential improvement in the pruning ability of q-gram filtering. For uncertain strings, though, a segment may contain a larger number of uncertain positions at higher q. Hence there are more possible instances, with increased chances for a segment to find a match among the substrings of a query string. As a result, the effectiveness of q-gram filtering diminishes gradually for higher q, as seen in Figure 7. We note that, though the filtering time improves with q, the time required for answering a join query shows a uni-valley behavior, as less effective filtering causes increased query time for higher q even with less filtering overhead. We found that q = 3 or q = 4 offers the best combination of fast, effective pruning and acceptable storage requirements. With the peak memory usage of the inverted indices less than the input data size itself for both q = 3 and q = 4, the space required for storing all the indices needed for answering similarity search queries was found to be only ≈ 1.5 and ≈ 2 times the data size, respectively.

7.7 Evaluating Trie-based Verification

[Figure 8: Trie-based Verification.]

We now analyze the performance benefits offered by trie-based verification over the naive way of doing the same. Figure 8 shows the verification time required for answering join queries on the dblp and protein datasets with varying degree of uncertainty, i.e., parameter θ. With an increase in the number of uncertain positions in the strings, the number of possible worlds increases exponentially. This results in increased verification cost for both trie-based and naive verification. In naive verification, we need to enumerate the possible worlds of each string in the dataset, and also enumerate the possible worlds of each of the candidate strings that may form a similar pair with it. In effect, we may enumerate all possible worlds of each string more than once. Additionally, given a candidate pair (R, S), it needs to compare every possible instance of R with every instance of S. In contrast, trie-based verification enumerates all possible worlds of each string S only once and, when S is selected as a candidate for some other string R in the database, enumerates and compares only those possible worlds which are highly likely to be similar to some instance of R. Thus the performance gains of trie-based verification increase with increasing θ, as seen in Figure 8. We note that the cost of verification using the trie-based approach also increases exponentially, due to the requirement of having the complete trie in place for the query string R. Moreover, trie-based verification can be more expensive than the naive method in scenarios where the majority of the instances of R × S satisfy the edit threshold, due to the overhead of building a trie and computing a set of similar nodes for each trie node. Though we obtained performance gains using trie-based verification on the protein data as well, they were less significant than for the dblp data, due to the higher string lengths, lower degree of uncertainty (θ) and smaller alphabet.

7.8 Effects of String Length

In this final set of experiments, we test algorithms QFCT and FCT by varying the length of the probabilistic strings. For studying this effect, we use the 100K versions of the dblp and protein datasets and append each probabilistic string to itself 0, 1, 2 or 3 times. To ensure that the verification step does not get excessively expensive, we limit the number of probabilistic characters in a probabilistic string to at most 8. Clearly, the costs of both algorithms increase with longer strings, as seen in Figure 9. In terms of filtering time, the computation of q-gram filtering and CDF bounds takes longer as string lengths increase, as described earlier in Section 7.1. However, frequency distance filtering, being dependent only on the number of uncertain character positions, remains unaffected. This allows algorithm FCT to close the performance gap with QFCT for higher string lengths, by virtue of its efficient frequency distance filtering. Additionally, the verification cost begins to dominate the query time as the string lengths increase. We note that even trie-based verification needs to instantiate all possible worlds of each probabilistic string once while answering a join query. With each possible-world enumeration taking more time, higher string lengths adversely affect the verification step. For fixed k, τ, and uncertain character positions, the number of output pairs decreases with increasing string length. Despite this, the query time increases for the aforementioned reasons. We emphasize that the proposed filtering techniques maintain their effectiveness with varying lengths, as the fraction of the candidate pairs that undergo verification and are accepted as output remains almost constant.

9. REFERENCES

[1] A. Amir, E. Chencinski, C. S. Iliopoulos, T. Kopelowitz, and H. Zhang. Property matching and weighted matching. In CPM, pages 188–199, 2006.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006.
[3] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[4] D. Dai, J. Xie, H. Zhang, and J. Dong. Efficient range queries over uncertain strings. In SSDBM, pages 75–95, 2012.
[5] J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437–461, 2012.
[6] T. Ge and Z. Li. Approximate substring matching over uncertain strings. PVLDB, 4(11):772–782, 2011.
[7] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for free. In VLDB, 2001. [8] C. S. Iliopoulos, C. Makris, Y. Panagis, K. Perdikuri, E. Theodoridis, and A. Tsakalidis. The weighted suffix tree: An efficient data structure for handling molecular weighted sequences and its applications. Fundam. Inf., 71:259–277, February 2006. [9] C. S. Iliopoulos, K. Perdikuri, E. Theodoridis, A. K. Tsakalidis, and K. Tsichlas. Algorithms for extracting motifs from biological weighted sequences. J. Discrete Algorithms, 5(2):229–242, 2007. [10] J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string similarity joins. In SIGMOD Conference, pages 327–338, 2010. [11] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search. In WWW, pages 371–380, 2009. [12] T. Kahveci and A. K. Singh. Efficient index structures for string databases. In VLDB, pages 351–360, 2001. [13] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD Conference, pages 802–803, 2006. [14] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253–264, 2011. [15] M. Patil, X. Cai, S. V. Thankachan, R. Shah, S.-J. Park, and D. Foltz. Approximate string matching by position restricted alignment. In EDBT/ICDT Workshops, pages 384–391, 2013. [16] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E. Hambrusch, and R. Shah. Orion 2.0: native support for uncertain data. In SIGMOD, pages 1239–1242, 2008. [17] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262–276, 2005. [18] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008. [19] H. Zhang, Q. Guo, and C. S. Iliopoulos. An algorithmic framework for motif discovery problems in weighted sequences. In CIAC, pages 335–346, 2010. 
Figure 9: Effects of string length 7.9 Comparison with EED In this subsection, we qualitatively compare the join query algorithm in [10] against algorithms presented in this work: 1. We partition each string in the collection based on q whereas q-gram filtering in [10] makes use of overlapping q-grams. This allows us to significantly reduce the space required for storing all q-grams (≈ 5 × datasize as reported in [10] against our index of twice the data size). 2. q-gram filtering presented in [10] requires each probabilistic string pair to be evaluated during query execution tasks like computation of frequency distance, CDF bounds computation. Algorithm QFCT employes indexing that incorporates q-gram filtering before applying expensive filters. Therefore, we can expect QFCT to offer benefits over the query algorithm in [10] similar to its advantages over algorithm FCT seen in Figure 3. 3. Computing the exact eed between two probabilistic strings requires all possible worlds for two strings to be instantiated in the same way as a naive verification method discussed in Section 7.7. On the other hand trie-based verification allows us to determine the similarity of a string pair efficiently (refer to Figure 8). 8. REFERENCES CONCLUSIONS In this paper, we study the largely unexplored problem of answering similarity join queries on uncertain strings. We propose a novel q-gram filtering technique that integrates probabilistic pruning and extends frequency distance and CDF based filtering techniques. In future work, we plan to investigate tighter filtering conditions and improvements to the trie-based verification algorithm. 1482
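To make the join predicate Pr(ed(R, S) ≤ k) > τ concrete, the following is a minimal sketch of the naive possible-world verification discussed in Section 7.7: every possible world of both uncertain strings is instantiated, and the probability mass of world pairs within edit distance k is summed. This is an illustrative reconstruction, not the paper's implementation; all names, and the character-level model of uncertainty (one independent distribution per position), are our assumptions.

```python
# Illustrative sketch only: naive possible-world verification of
# Pr(ed(R, S) <= k) > tau. Character-level model assumed, with one
# independent {char: prob} distribution per string position.
from itertools import product

def edit_distance(a, b):
    # Standard dynamic-programming (Wagner-Fischer) edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def possible_worlds(s):
    # s is a list of per-position distributions [{char: prob, ...}, ...];
    # deterministic positions are singleton distributions.
    # Yields each possible world with its probability.
    for choice in product(*(d.items() for d in s)):
        chars, probs = zip(*choice)
        p = 1.0
        for pr in probs:
            p *= pr
        yield "".join(chars), p

def similar_pair(R, S, k, tau):
    # Naive verification: returns True iff Pr(ed(R, S) <= k) > tau.
    total = 0.0
    for r, pr in possible_worlds(R):
        for s, ps in possible_worlds(S):
            if edit_distance(r, s) <= k:
                total += pr * ps
    return total > tau
```

The nested loops over possible worlds make the exponential blow-up described in Section 7.7 explicit: with u uncertain positions per string, the pair requires comparing all instances of R against all instances of S, which is exactly the cost that trie-based verification avoids by enumerating each string's worlds once and pruning dissimilar branches.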