Accuracy of Approximate String Joins Using Grams

Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller
University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada
[email protected] [email protected] [email protected]

ABSTRACT

Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures depends heavily on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) extracted from the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams and thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.

1. INTRODUCTION

Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data for a multitude of reasons, including data entry errors, lack of common standards, and missing integrity constraints. String data is by nature especially prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for matching string records, each with characteristics that make it suitable for capturing certain types of errors. Given a string similarity function sim(), the approximate join algorithm considers all pairs of records with similarity score above a threshold θ to approximately match and returns them as its output.

Performing an approximate join on a large relation is a notoriously time-consuming task. Recently, there has been increasing interest in approximate join techniques based on q-grams (substrings of length q) extracted from the strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are built around a specific similarity measure, along with a fixed threshold value, and return the pairs of records whose similarity exceeds that threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used.
However, there has been little work studying the accuracy of the join operation. Accuracy is known to be dataset-dependent, and there is no common framework for evaluating and comparing the accuracy of different similarity measures and techniques, which makes comparing them difficult. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notation and abbreviation. In this paper, we present an overview of several similarity measures for approximate string joins using q-grams and thoroughly evaluate their accuracy for different threshold values and on datasets with different amounts and types of errors. Our results include the following:

• We show that for all similarity measures, the value of the threshold that results in the most accurate join depends heavily on the type and amount of errors in the data.

• We compare different similarity measures by the maximum accuracy they can achieve on different datasets across different thresholds. Although choosing a proper threshold without prior knowledge of the data characteristics is known to be difficult, our results show which measures can potentially be more accurate, assuming there is a way to determine the best threshold. An interesting direction for future work is therefore an algorithm for determining the threshold value for the most accurate measures.

• We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable to large datasets are not effective in many scenarios, since their performance depends on choosing a high threshold value, which can result in very low accuracy. This highlights the value of algorithms that are less sensitive to the threshold value, and opens another interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold.

The paper is organized as follows. In Section 2, we review related work on approximate joins. In Section 3, we present our framework for approximate joins and describe the similarity measures used. Section 4 presents a thorough evaluation of these measures, and Section 5 concludes the paper and discusses future directions.

2. RELATED WORK

Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5], and a recent survey [9] presents an excellent overview of the different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams extracted from strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as the tokens of those documents. This makes it possible to take advantage of several indexing techniques, as well as various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].
Various recent works address the efficiency and scalability of similarity join operations for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity joins, which can be used with q-grams for (string) similarity joins. Most of these techniques are based on creating signatures for the sets (strings) in order to reduce the search space. Some signature generation schemes are derived from dimensionality reduction techniques for similarity search in high-dimensional spaces. One efficient approach uses Locality Sensitive Hashing (LSH) [13] to hash similar sets to the same values with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and that outperform previous approximate methods in their framework, although the parameters of these algorithms require extensive tuning. Another class of work is based on indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and that outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on variable-length grams instead of fixed-length q-grams. At a high level, VGRAM can be viewed as an efficient index structure over a collection of strings, and it can be combined with previously proposed signature-based algorithms to significantly improve their efficiency.

Most of the techniques described above address the scalability of the join operation rather than its accuracy, and the choice of similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy by introducing a novel similarity measure, fuzzy match similarity, and creating signatures for it; however, the accuracy of this measure is not compared with that of other measures. In [5], several similarity measures are benchmarked for approximate selection, a special case of similarity join: given a relation R, an approximate selection with similarity predicate sim() reports all tuples t ∈ R such that sim(tq, t) ≥ θ, where θ is a specified numerical similarity threshold and tq is a query string. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered there, nor is the effect of the threshold value on the accuracy of approximate joins.

3. FRAMEWORK

In this section, we explain our framework for similarity joins. The similarity join of two relations R = {r_i : 1 ≤ i ≤ N_1} and S = {s_j : 1 ≤ j ≤ N_2} outputs the set of pairs (r_i, s_j) ∈ R × S such that r_i and s_j are similar. Two records are considered similar when their similarity score, based on a similarity function sim(), is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R; the output is therefore the set of pairs (r_i, r_j) ∈ R × R with sim(r_i, r_j) ≥ θ for some similarity function sim() and threshold θ. This is a common operation in many applications, such as entity resolution and clustering. In keeping with many approximate join methods, we model records as strings.

We overload notation and write r both for a record and for its set of q-grams (sequences of q consecutive characters of the string). For example, for t = 'db lab', tokenization using 3-grams gives t = {'db ', 'b l', ' la', 'lab'}. In certain cases, a weight may be associated with each token.
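To make the framework concrete, the following minimal Python sketch (ours, not from the paper) shows q-gram tokenization and the naive quadratic self-join; the toy records, the plain-Jaccard placeholder for sim(), and the threshold value are illustrative assumptions. The algorithms surveyed in Section 2 exist precisely to avoid this all-pairs loop.

```python
from itertools import combinations

def qgrams(s, q=3):
    # Substrings of length q: qgrams('db lab') -> ['db ', 'b l', ' la', 'lab'].
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def jaccard(r1, r2, q=3):
    # Unweighted Jaccard over q-gram sets (Section 3.2), as a placeholder sim().
    g1, g2 = set(qgrams(r1, q)), set(qgrams(r2, q))
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

def similarity_self_join(records, sim, theta):
    # Naive O(N^2) self-join: all pairs (r_i, r_j) with sim(r_i, r_j) >= theta.
    return [(r1, r2) for r1, r2 in combinations(records, 2)
            if sim(r1, r2) >= theta]

# Two near-duplicate company names and one unrelated record.
print(similarity_self_join(
    ["db lab inc", "db lab incorporated", "data group"],
    jaccard, theta=0.4))
# -> [('db lab inc', 'db lab incorporated')]
```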
The similarity measures discussed here are based on q-grams created from strings and have been shown to be effective in previous work [5]. These measures share one or both of the following properties:

• High scalability: various techniques have been proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams with these measures.

• High accuracy: previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. In particular, they have shown good accuracy in name-matching tasks [8] and in approximate selection [5].

3.1 Edit Similarity

Edit distance is widely used as the measure of choice in many similarity join techniques. Previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework, and recent work on enhancing the performance of similarity joins has proposed techniques for scalable implementations of it [2, 15]. The edit distance between two string records r1 and r2 is the transformation cost tc(r1, r2), i.e., the minimum cost of a sequence of edit operations that transforms r1 into r2. Edit operations include character copy, insert, delete, and substitute [11]. The edit similarity is defined as:

\[
sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}} \tag{1}
\]

There is a cost associated with each edit operation, and several cost models have been proposed. The most commonly used model, Levenshtein edit distance, which we refer to simply as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.

3.2 Jaccard and Weighted Jaccard

Jaccard similarity is the fraction of the tokens of r1 and r2 that are present in both. Weighted Jaccard similarity is its weighted version, i.e.,

\[
sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in r_1 \cap r_2} w_R(t)}{\sum_{t \in r_1 \cup r_2} w_R(t)} \tag{2}
\]

where w_R(t) is a weight function that reflects the commonality of token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which has been shown to be more effective than the commonly used Inverse Document Frequency (IDF) weight [5]:

\[
w_R(t) = \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right) \tag{3}
\]

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
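Below is a minimal sketch of Weighted Jaccard with RSJ weights (Eqs. 2 and 3); the qgrams() helper repeats the tokenizer sketched above, and the document-frequency table df is assumed to be precomputed over the relation. Note that the RSJ weight is negative for tokens that appear in more than half of the records; the paper does not discuss this case, so the sketch leaves the formula exactly as defined.

```python
import math
from collections import Counter

def qgrams(s, q=3):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def rsj_weight(t, df, N):
    # Eq. 3: w_R(t) = log((N - n_t + 0.5) / (n_t + 0.5)), where N is the number
    # of records and n_t the number of records containing token t.
    n_t = df.get(t, 0)
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

def weighted_jaccard(r1, r2, df, N, q=3):
    # Eq. 2: total weight of the shared q-grams over total weight of the union.
    g1, g2 = set(qgrams(r1, q)), set(qgrams(r2, q))
    union = sum(rsj_weight(t, df, N) for t in g1 | g2)
    inter = sum(rsj_weight(t, df, N) for t in g1 & g2)
    return inter / union if union else 0.0

# Document frequencies over a toy relation (RSJ weights are only
# meaningful when N is reasonably large).
records = ["db lab inc", "db lab incorporated", "data group"]
N = len(records)
df = Counter(t for r in records for t in set(qgrams(r)))
```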
3.3 Measures from IR

A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures of this section, records are treated as documents and q-grams as the words (tokens) of those documents, so the same techniques used to find documents relevant to a query can be used to return records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].

3.3.1 Cosine w/tf-idf

The tf-idf cosine similarity is a well-established measure in the IR community that leverages the vector space model. It determines the closeness of the input strings r1 and r2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:

\[
sim_{Cosine}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} w_{r_1}(t) \cdot w_{r_2}(t) \tag{4}
\]

where w_{r_1}(t) and w_{r_2}(t) are the normalized tf-idf weights of each common token in r1 and r2, respectively. The normalized tf-idf weight of token t in a string record r is defined as:

\[
w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in r} w'_r(t')^2}}, \qquad w'_r(t) = tf_r(t) \cdot idf(t)
\]

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.

3.3.2 BM25

The BM25 similarity score for a query r1 and a string record r2 is defined as:

\[
sim_{BM25}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t) \tag{5}
\]

where

\[
\hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}, \qquad
w_{r_2}(t) = w_R^{(1)}(t) \cdot \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)}
\]

and w_R^{(1)} is the RSJ weight:

\[
w_R^{(1)}(t) = \log\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right), \qquad
K(r) = k_1\left((1 - b) + b\,\frac{|r|}{avgrl}\right)
\]

where tf_r(t) is the frequency of token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing token t, and k_1, k_3, and b are independent parameters. We set these parameters based on the TREC-4 experiments [17]: k_1 ∈ [1, 2], k_3 = 8, and b ∈ [0.6, 0.75].

3.3.3 Hidden Markov Model

Approximate string matching can also be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy with good running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular "String", and the second models the tokens of "General English", i.e., tokens that are common across many records. We refer the reader to [5] and [16] for a complete description of the model and its possible extensions. The HMM similarity function takes two string records r1 and r2 and returns the probability of generating r1 given that r2 is a similar record:

\[
sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \big(a_0 P(t \mid GE) + a_1 P(t \mid r_2)\big) \tag{6}
\]

where a_0 and a_1 = 1 - a_0 are the state transition probabilities of the Markov model, and P(t | r_2) and P(t | GE) are given by:

\[
P(t \mid r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}, \qquad
P(t \mid GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|}
\]
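A minimal sketch of the two-state HMM score (Eq. 6) follows. The transition probability a1 = 0.8 is an illustrative guess rather than a value taken from the paper or from [16]; the "General English" model is simply the q-gram frequencies pooled over the whole relation, and qgrams() repeats the tokenizer sketched earlier.

```python
from collections import Counter

def qgrams(s, q=3):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def hmm_similarity(r1, r2, ge, ge_total, a1=0.8, q=3):
    # Eq. 6: product over the query's grams of a0*P(t|GE) + a1*P(t|r2).
    a0 = 1.0 - a1
    toks2 = Counter(qgrams(r2, q))
    n2 = sum(toks2.values())
    score = 1.0
    for t in qgrams(r1, q):
        p_ge = ge[t] / ge_total if ge_total else 0.0   # P(t | GE)
        p_r2 = toks2[t] / n2 if n2 else 0.0            # P(t | r2)
        score *= a0 * p_ge + a1 * p_r2
    return score

# "General English" model: gram frequencies pooled over the relation.
records = ["db lab inc", "db lab incorporated", "data group"]
ge = Counter(t for r in records for t in qgrams(r))
ge_total = sum(ge.values())
print(hmm_similarity(records[0], records[1], ge, ge_total))
```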
3.4 Hybrid Measures

The hybrid measures involve two similarity functions: one that compares the strings by comparing their word tokens, and a second one, better suited to short strings, that is used to compare the word tokens themselves.

3.4.1 GES

The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity of [6], takes two strings r1 and r2, tokenizes them into sets of words, and assigns a weight w(t) to each token. GES defines the similarity between the two strings in terms of the minimum transformation cost required to convert r1 into r2:

\[
sim_{GES}(r_1, r_2) = 1 - \min\left(\frac{tc(r_1, r_2)}{wt(r_1)},\, 1.0\right) \tag{7}
\]

where wt(r1) is the sum of the weights of all tokens in r1 and tc(r1, r2) is the cost of a sequence of the following transformation operations:

• token insertion: inserting a token t into r1 with cost w(t) · c_ins, where the insertion factor c_ins is a constant between 0 and 1. In our experiments, c_ins = 1.

• token deletion: deleting a token t from r1 with cost w(t).

• token replacement: replacing a token t1 by t2 in r1 with cost (1 − sim_edit(t1, t2)) · w(t1), where sim_edit is the edit similarity of t1 and t2.

3.4.2 SoftTFIDF

SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function for comparing word tokens. Its similarity score is defined as:

\[
sim_{SoftTFIDF}(r_1, r_2) = \sum_{t_1 \in C(\theta, r_1, r_2)} w(t_1, r_1) \cdot w\Big(\arg\max_{t_2 \in r_2} sim(t_1, t_2),\, r_2\Big) \cdot \max_{t_2 \in r_2}\, sim(t_1, t_2) \tag{8}
\]

where w(t, r) is the normalized tf-idf weight of word token t in record r and C(θ, r1, r2) is the set of tokens t1 ∈ r1 such that sim(t1, t2) > θ for some t2 ∈ r2, where sim() is a similarity function suitable for comparing word strings. In our experiments, sim(t1, t2) is the Jaro-Winkler similarity, as suggested in [8].
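The sketch below implements Eq. 8 under two simplifications that are our substitutions, not the paper's choices: difflib's SequenceMatcher ratio stands in for the Jaro-Winkler similarity of [8], and the token-weight function w(t, r) is passed in (here a toy uniform weight rather than true normalized tf-idf).

```python
import difflib

def soft_tfidf(r1, r2, w, theta=0.9):
    # Eq. 8: for each word token t1 of r1 whose best match in r2 exceeds
    # theta, accumulate w(t1, r1) * w(best_match, r2) * best_similarity.
    sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
    toks2 = r2.split()
    if not toks2:
        return 0.0
    score = 0.0
    for t1 in r1.split():
        best = max(toks2, key=lambda t2: sim(t1, t2))
        s = sim(t1, best)
        if s > theta:                  # t1 is in C(theta, r1, r2)
            score += w(t1, r1) * w(best, r2) * s
    return score

# Toy weight function standing in for the normalized tf-idf weight w(t, r).
uniform_w = lambda t, r: 1.0 / max(len(r.split()), 1) ** 0.5
print(soft_tfidf("db lab incorporated", "db lab inc", uniform_w))  # ~0.667
```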
4. EVALUATION

4.1 Datasets

To evaluate the effectiveness of the similarity measures described in the previous section, we use the same datasets as [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used to evaluate data cleaning and record linkage techniques [12, 1]. The generator can inject several types of errors into a clean database of string attributes: commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement, and swap), token swaps, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa). It has several parameters controlling the injected errors, such as the size of the dataset to be generated, the distribution of duplicates (uniform, Zipfian, or Poisson), the percentage of erroneous duplicates, the extent of error injected in each string, and the percentage of each type of error. The generator keeps track of the duplicate records by assigning a cluster ID to each clean record and to all the duplicates generated from it.

For the results presented in this paper, the datasets are generated from a clean dataset of 2139 company names with an average record length of 21.03 characters and an average of 2.9 words per record. The errors in the datasets have a uniform distribution. For each dataset, on average 5000 dirty records are created out of 500 clean records. We have also run experiments on datasets generated with different parameters, such as a Zipfian distribution of duplicates, data from another clean source (DBLP titles) as in [5], and larger datasets; for these other datasets, the accuracy trends remain the same. Table 1 describes all the datasets used for the results in this paper: 8 datasets with mixed types of errors (edit errors, token swaps, and abbreviation replacements), and 5 datasets with only a single type of error (edit errors, token swaps, or abbreviation replacements) to measure the effect of each type of error individually.

Group         Name   Erroneous        Errors in        Token     Abbr.
                     duplicates (%)   duplicates (%)   swap (%)  error (%)
Dirty         D1     90               30               20        50
              D2     50               30               20        50
Medium        M1     30               30               20        50
              M2     10               30               20        50
              M3     90               10               20        50
              M4     50               10               20        50
Low           L1     30               10               20        50
              L2     10               10               20        50
Single Error  AB     50               0                0         50
              TS     50               0                20        0
              EDL    50               10               0         0
              EDM    50               20               0         0
              EDH    50               30               0         0

Table 1: Datasets used in the experiments

Following [5], we believe the errors in these datasets are highly representative of the common types of errors in databases with string attributes.

4.2 Measures

We use well-known measures from IR, namely precision, recall, and F1, computed at different similarity thresholds, to evaluate the accuracy of the similarity join operation. We perform a self-join on the input table using a similarity measure with a fixed threshold θ. Precision (Pr) is the percentage of truly similar pairs among the pairs of records with similarity score above θ; in our datasets, truly similar records are those marked with the same cluster ID, as described above. Recall (Re) is the ratio of the number of truly similar pairs with similarity score above θ to the total number of truly similar pairs. A join that returns all pairs of records of the input table has low (near-zero) precision and recall 1; a join that returns an empty answer has precision 1 and recall 0. The F1 measure is the harmonic mean of precision and recall:

\[
F_1 = \frac{2 \times Pr \times Re}{Pr + Re} \tag{9}
\]

We measure precision, recall, and F1 at different values of the similarity threshold θ. To compare different similarity measures, we use the maximum F1 score across thresholds.
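A minimal sketch of this evaluation loop, under our assumptions: join_output is the list of index pairs returned by the self-join at some threshold θ (e.g., the similarity_self_join sketched in Section 3), and cluster_of[i] is the generator-assigned cluster ID of record i, so a pair is a true match iff the IDs agree. Sweeping θ and keeping the best F1 yields the maximum F1 score used in the comparisons below.

```python
from collections import Counter

def precision_recall_f1(join_output, cluster_of):
    # join_output: pairs (i, j) of record indices reported by the join.
    # cluster_of[i]: ground-truth cluster ID of record i.
    tp = sum(1 for i, j in join_output if cluster_of[i] == cluster_of[j])
    # Total number of truly matching pairs: n*(n-1)/2 per cluster of size n.
    total_true = sum(n * (n - 1) // 2 for n in Counter(cluster_of).values())
    pr = tp / len(join_output) if join_output else 1.0  # empty answer: Pr = 1
    re = tp / total_true if total_true else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0    # Eq. 9
    return pr, re, f1

# Toy check: records 0 and 1 are duplicates, record 2 is a singleton.
print(precision_recall_f1([(0, 1), (0, 2)], [7, 7, 9]))  # (0.5, 1.0, ~0.667)
```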
4.3 Results

Figures 1 and 2 show the precision, recall, and F1 values for all the measures described in Section 3 over the datasets with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graphs is the value of the threshold; for HMM and BM25, which do not return a score between 0 and 1, it is the percentage of the maximum threshold value.

Effect of the amount of errors. As the precision/recall curves in Figures 1 and 2 show, the "dirtiness" of the input data greatly affects the threshold value that yields the most accurate join. For all the measures, a lower threshold is needed as the degree of error in the data increases. For example, Weighted Jaccard achieves its best F1 score on the dirtiest datasets at threshold 0.3, but on the cleanest datasets at threshold 0.55. BM25 and HMM are less sensitive and work well on both the dirty and the clean groups of datasets with the same threshold. We discuss below how the degree of error in the data affects the choice of the most accurate measure.

Effect of the types of errors. Figure 3 shows the maximum F1 score across thresholds for the different measures on the datasets containing only edit errors (EDL, EDM, and EDH). On the low-error dataset EDL, Weighted Jaccard and Cosine have the highest accuracy, followed by Jaccard and edit similarity. As the amount of edit error in each record increases, HMM performs as well as Weighted Jaccard, while Jaccard, edit similarity, and GES perform much worse at high edit error rates. Considering that edit similarity is designed specifically to capture edit errors, this shows the effectiveness of Weighted Jaccard and its robustness to varying amounts of edit error.

[Figure 3: Maximum F1 score for different measures on datasets with only edit errors]

Figure 4 shows the effect of token swap and abbreviation errors on the accuracy of the different measures. This experiment indicates that edit similarity cannot model these types of errors. HMM, BM25, and Jaccard likewise fail to model abbreviation errors properly.

[Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors]

Comparison of measures. Figure 5 shows the maximum F1 score across thresholds for the different measures on the dirty, medium, and clean groups of datasets, aggregating the results for all the dirty datasets together (and likewise for the medium and the clean datasets). The results show the effectiveness and robustness of Weighted Jaccard and Cosine compared with the other measures. Again, HMM is among the most accurate measures when the data is extremely dirty, but has relatively low accuracy when the percentage of error in the data is low.

[Figure 5: Maximum F1 score for different measures on clean, medium and dirty groups of datasets]

Remark. As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation depends heavily on the similarity threshold used. Here we report the accuracy on our datasets at the threshold values that make these algorithms effective. We specifically address the results of [2], although similar observations can be made for other work in this area. Table 2 shows the F1 values at the thresholds that yield the best accuracy on our datasets and at the thresholds that yield the best performance in the experiments of [2]. The PartEnum and WtEnum algorithms of [2] significantly outperform previous algorithms at threshold 0.9, but perform roughly the same as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show a large gap between the threshold that yields the most accurate join on our datasets and the threshold at which PartEnum and WtEnum are effective in the studies of [2].

        Jaccard Join                       Weighted Jaccard Join
Group   Threshold               F1         Threshold               F1
Dirty   0.5  (Best Acc.)        0.293      0.3  (Best Acc.)        0.528
        0.8                     0.249      0.8                     0.249
        0.85                    0.248      0.85                    0.246
        0.9  (Best Performance) 0.247      0.9  (Best Performance) 0.244
Medium  0.65 (Best Acc.)        0.719      0.55 (Best Acc.)        0.776
        0.8                     0.611      0.8                     0.581
        0.85                    0.571      0.85                    0.581
        0.9  (Best Performance) 0.548      0.9  (Best Performance) 0.560
Clean   0.7  (Best Acc.)        0.887      0.55 (Best Acc.)        0.929
        0.8                     0.854      0.8                     0.831
        0.85                    0.831      0.85                    0.819
        0.9  (Best Performance) 0.812      0.9  (Best Performance) 0.807

Table 2: F1 scores at the thresholds that result in the best running time in previous performance studies and the highest accuracy on our datasets, for two selected similarity measures

5. CONCLUSION

We have presented an overview of several similarity measures for efficient approximate string joins and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Our results show how the amount and type of errors in the data, together with the similarity threshold used, affect the accuracy of the join operation. Since the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results highlight the value of algorithms that are less sensitive to the threshold, and open an interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold.
Another interesting subject for future work is an algorithm that determines the best value of the threshold, regardless of the type and amount of errors, for the similarity measures that showed the highest accuracy in our study.

6. REFERENCES

[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE'06.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB'06.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW'07.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5), 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In SIGMOD'07.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD'03.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE'06.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb'03.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1), 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB'01.
[11] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In STOC'97.
[14] N. Koudas and D. Srivastava. Approximate joins: Concepts and techniques. VLDB'05 tutorial.
[15] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB'07.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR'99.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC'95.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD'04.

[Figure 1: Accuracy of Edit-Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold on different datasets; panels show (a) low-error, (b) medium-error, and (c) dirty datasets for each measure.]

[Figure 2: Accuracy of measures from IR and hybrid measures (Cosine w/tf-idf, BM25, HMM, SoftTFIDF, GES) relative to the value of the threshold on different datasets; panels show (a) low-error, (b) medium-error, and (c) dirty datasets for each measure.]