Accuracy of Approximate String Joins Using Grams
Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller
University of Toronto
10 King’s College Rd., Toronto, ON M5S 3G4, Canada
[email protected], [email protected], [email protected]
ABSTRACT
Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of these similarity measures highly depends on the characteristics of the data, such as the amount and type of errors and the length of the strings. Recently, there has been increasing interest in methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams and thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare the measures based on the highest accuracy they can achieve on different datasets.
1. INTRODUCTION
Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data due to a multitude of reasons, including data entry errors, lack of common standards, and missing integrity constraints. String data is by nature more prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well-studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for string data in order to match records. Each measure has certain characteristics that make it suitable for capturing certain types of errors. Using a string similarity function sim(), the approximate join algorithm considers all pairs of records with similarity score above a threshold θ to approximately match and returns them as the output.
Performing approximate join on a large relation is a notoriously time-consuming task. Recently, there has been increasing interest in approximate join techniques based
on q-grams (substrings of length q) made out of strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are based on a specific similarity measure, along with a fixed threshold value, and return pairs of records whose similarity is greater than the threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used. However, there has been little work studying the accuracy of the join operation. The accuracy is known to be dataset-dependent, and there is no common framework for evaluating and comparing the accuracy of different similarity measures and techniques. This makes comparing their accuracy a difficult task. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notation and abbreviation.
In this paper, we present an overview of several similarity
measures for approximate string joins using q-grams and
thoroughly evaluate their accuracy for different threshold values and on datasets with different amounts and types of errors. Our results include:
• We show that for all similarity measures, the value of
the threshold that results in the most accurate join
highly depends on the type and amount of errors in
the data.
• We compare different similarity measures by comparing the maximum accuracy they can achieve on different datasets using different thresholds. Although choosing a proper threshold for a similarity measure without prior knowledge of the data characteristics is known to be a difficult task, our results show which measures can potentially be more accurate, assuming that there is a way to determine the best threshold.
Therefore, an interesting direction for future work is
to find an algorithm for determining the value of the
threshold for the most accurate measures.
• We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable to large datasets are not effective in many scenarios, since their performance highly depends on choosing a high threshold value, which can result in very low accuracy. This underlines the value of algorithms that are less sensitive to the threshold value and opens another interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold.
The paper is organized as follows. In Section 2, we review related work on approximate joins. We present our framework for approximate joins and a description of the similarity measures used in Section 3. Section 4 presents a thorough evaluation of these measures and, finally, Section 5 concludes the paper and outlines future directions.
2. RELATED WORK
Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5], and a recent survey [9] presents an excellent overview of different types of string similarity measures. Recently, there has been increasing interest in using measures from the Information Retrieval (IR) field along with q-grams made out of strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as tokens in the documents. This makes it possible to take advantage of several indexing techniques, as well as various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].
Various recent works address the efficiency and scalability of similarity join operations for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity join, which can be used along with q-grams for the purpose of (string) similarity joins. Most of these techniques are based on the idea of creating signatures for sets (strings) to reduce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional spaces. One efficient approach uses the idea of Locality Sensitive Hashing (LSH) [13] to hash similar sets to the same values with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and outperform previous approximate methods in their framework, although the parameters of the algorithms require extensive tuning. Another class of work is based on indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on the idea of using variable-length grams instead of q-grams. At a high level, it can be viewed as an efficient index structure over the collection of strings. VGRAM can be used along with previously proposed signature-based algorithms to significantly improve their efficiency.
Most of the techniques described above mainly address the scalability of the join operation, not its accuracy, and the choice of the similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy by introducing a novel similarity measure called fuzzy match similarity and creating signatures for this measure. However, the accuracy of this measure is not compared with other measures. In [5], several similarity measures are benchmarked for approximate selection, which is a special case of similarity join: given a relation R, the approximate selection operation using similarity predicate sim() reports all tuples t ∈ R such that sim(t_q, t) ≥ θ, where θ is a specified numerical similarity threshold and t_q is a query string. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered, nor is the effect of threshold values on the accuracy of approximate joins.
3. FRAMEWORK
In this section, we explain our framework for similarity join. The similarity join of two relations R = {r_i : 1 ≤ i ≤ N_1} and S = {s_j : 1 ≤ j ≤ N_2} outputs a set of pairs (r_i, s_j) ∈ R × S where r_i and s_j are similar. Two records are considered similar when their similarity score based on a similarity function sim() is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R. Therefore, the output is a set of pairs (r_i, r_j) ∈ R × R where sim(r_i, r_j) ≥ θ for some similarity function sim() and threshold θ. This is a common operation in many applications such as entity resolution and clustering. In keeping with many approximate join methods, we model records as strings. We denote by r the set of q-grams (sequences of q consecutive characters of a string) in r. For example, for t = ‘db lab’, tokenization using 3-grams yields t = {‘db ’, ‘b l’, ‘ la’, ‘lab’}. In certain cases, a weight may be associated with each token.
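To make the tokenization concrete, the following minimal Python sketch (our own illustrative helper, not part of any system discussed here) produces the q-gram set of a string; some implementations additionally pad the string with q−1 special characters at each end, which this sketch omits:

```python
def qgrams(s: str, q: int = 3) -> set[str]:
    """Set of q-grams: all substrings of s of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

# Reproduces the example above:
assert qgrams('db lab') == {'db ', 'b l', ' la', 'lab'}
```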
The similarity measures discussed here are based on q-grams created from the strings, combined with set- or vector-based similarity functions that have been shown to be effective in previous work [5]. These measures share one or both of the following properties:
• High scalability: various techniques have been proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams along with these measures.
• High accuracy: previous work has shown that in most scenarios these measures perform as well as or better than other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [8] and in approximate selection [5].
3.1 Edit Similarity
Edit distance is widely used as the measure of choice in many similarity join techniques. In particular, previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework. Recent work on enhancing the performance of similarity joins has also proposed techniques for scalable implementation of this measure [2, 15].
Edit distance between two string records r1 and r2 is defined as the transformation cost of r1 to r2 , tc(r1 , r2 ), which
is equal to the minimum cost of edit operations applied to
r1 to transform it to r2 . Edit operations include character
copy, insert, delete and substitute [11]. The edit similarity is
defined as:
$$ sim_{edit}(r_1, r_2) = 1 - \frac{tc(r_1, r_2)}{\max\{|r_1|, |r_2|\}} \qquad (1) $$
There is a cost associated with each edit operation, and several cost models have been proposed for this measure. The most commonly used model, Levenshtein edit distance, which we refer to simply as edit distance in this paper, uses unit cost for all operations except copy, which has cost zero.
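A minimal sketch of Equation (1) under this unit-cost Levenshtein model might look as follows (function names are ours):

```python
def edit_distance(r1: str, r2: str) -> int:
    """Levenshtein distance: unit cost for insert/delete/substitute, zero for copy."""
    prev = list(range(len(r2) + 1))
    for i, c1 in enumerate(r1, 1):
        curr = [i]
        for j, c2 in enumerate(r2, 1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # copy (0) or substitute (1)
        prev = curr
    return prev[-1]

def sim_edit(r1: str, r2: str) -> float:
    """Edit similarity as in Equation (1)."""
    m = max(len(r1), len(r2))
    return 1.0 if m == 0 else 1.0 - edit_distance(r1, r2) / m
```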
3.2 Jaccard and Weighted Jaccard
Jaccard similarity is the fraction of tokens in r_1 and r_2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

$$ sim_{WJaccard}(r_1, r_2) = \frac{\sum_{t \in r_1 \cap r_2} w_R(t)}{\sum_{t \in r_1 \cup r_2} w_R(t)} \qquad (2) $$

where w_R(t) is a weight function that reflects the commonality of the token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weights [5]:

$$ w_R(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5} \qquad (3) $$

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
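As an illustrative sketch, Equations (2) and (3) can be computed as follows (treating each record as a set of q-grams; names are ours):

```python
import math

def rsj_weights(relation: list[set[str]]) -> dict[str, float]:
    """RSJ weight (Equation 3) for every token in the base relation.
    Note that the weight can be negative for very frequent tokens."""
    N = len(relation)
    df: dict[str, int] = {}
    for r in relation:
        for t in r:
            df[t] = df.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

def sim_wjaccard(r1: set[str], r2: set[str], w: dict[str, float]) -> float:
    """Weighted Jaccard similarity (Equation 2)."""
    num = sum(w.get(t, 0.0) for t in r1 & r2)
    den = sum(w.get(t, 0.0) for t in r1 | r2)
    return num / den if den else 0.0
```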
3.3 Measures from IR
A well-studied problem in information retrieval is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this part, records are treated as documents and q-grams are seen as the words (tokens) of the documents. Therefore, the same techniques for finding documents relevant to a query can be used to return records similar to a query string. In the rest of this section, we present three measures that previous work has shown to perform well for the approximate selection problem [5].
3.3.1 Cosine w/tf-idf
The tf-idf cosine similarity is a well-established measure in the IR community that leverages the vector space model. This measure determines the closeness of the input strings r_1 and r_2 by first transforming the strings into unit vectors and then measuring the angle between the corresponding vectors. The cosine similarity with tf-idf weights is given by:
$$ sim_{Cosine}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} w_{r_1}(t) \cdot w_{r_2}(t) \qquad (4) $$

where w_{r_1}(t) and w_{r_2}(t) are the normalized tf-idf weights of each common token in r_1 and r_2, respectively. The normalized tf-idf weight of token t in a given string record r is defined as follows:

$$ w_r(t) = \frac{w'_r(t)}{\sqrt{\sum_{t' \in r} w'_r(t')^2}}, \qquad w'_r(t) = tf_r(t) \cdot idf(t) $$

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.
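The following sketch computes Equation (4) as a sparse dot product; the idf formula used here, idf(t) = log(N / n_t), is one common variant that we assume for illustration:

```python
import math
from collections import Counter

def idf_weights(relation: list[list[str]]) -> dict[str, float]:
    """idf(t) = log(N / n_t) over the base relation (an assumed common variant)."""
    N = len(relation)
    df = Counter(t for r in relation for t in set(r))
    return {t: math.log(N / n) for t, n in df.items()}

def tfidf_unit_vector(r: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Normalized tf-idf weights w_r(t): the record as a unit-length vector."""
    tf = Counter(r)
    raw = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}

def sim_cosine(r1: list[str], r2: list[str], idf: dict[str, float]) -> float:
    """Cosine similarity (Equation 4): dot product of two unit vectors."""
    v1, v2 = tfidf_unit_vector(r1, idf), tfidf_unit_vector(r2, idf)
    return sum(w * v2[t] for t, w in v1.items() if t in v2)
```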
3.3.2 BM25
The BM25 similarity score for a query r_1 and a string record r_2 is defined as follows:

$$ sim_{BM25}(r_1, r_2) = \sum_{t \in r_1 \cap r_2} \hat{w}_{r_1}(t) \cdot w_{r_2}(t) \qquad (5) $$

where

$$ \hat{w}_{r_1}(t) = \frac{(k_3 + 1) \cdot tf_{r_1}(t)}{k_3 + tf_{r_1}(t)}, \qquad w_{r_2}(t) = w_R^{(1)}(t) \cdot \frac{(k_1 + 1) \cdot tf_{r_2}(t)}{K(r_2) + tf_{r_2}(t)} $$

and w_R^{(1)} is the RSJ weight:

$$ w_R^{(1)}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}, \qquad K(r) = k_1 \left( (1 - b) + b \, \frac{|r|}{avgrl} \right) $$

where tf_r(t) is the frequency of the token t in string record r, |r| is the number of tokens in r, avgrl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing the token t, and k_1, k_3 and b are independent parameters. We set these parameters based on the TREC-4 experiments [17], where k_1 ∈ [1, 2], k_3 = 8 and b ∈ [0.6, 0.75].
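A compact sketch of Equation (5) follows; the CorpusStats container and the choices k_1 = 1.2 and b = 0.7 (picked from the stated TREC-4 ranges) are our own illustrative assumptions:

```python
import math
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CorpusStats:
    N: int                                   # number of records in R
    avgrl: float                             # average number of tokens per record
    df: dict = field(default_factory=dict)   # token -> number of records containing it

def sim_bm25(r1: list[str], r2: list[str], s: CorpusStats,
             k1: float = 1.2, k3: float = 8.0, b: float = 0.7) -> float:
    """BM25 score of query r1 against record r2 (Equation 5)."""
    tf1, tf2 = Counter(r1), Counter(r2)
    K = k1 * ((1 - b) + b * len(r2) / s.avgrl)
    score = 0.0
    for t in tf1.keys() & tf2.keys():
        n_t = s.df.get(t, 0)
        w_rsj = math.log((s.N - n_t + 0.5) / (n_t + 0.5))   # RSJ weight
        w_hat = (k3 + 1) * tf1[t] / (k3 + tf1[t])           # query-side weight
        w_r2 = w_rsj * (k1 + 1) * tf2[t] / (K + tf2[t])     # record-side weight
        score += w_hat * w_r2
    return score
```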
3.3.3 Hidden Markov Model
Approximate string matching can be modeled by a discrete Hidden Markov process, which has shown better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy and good running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular “String”, and the second state models the tokens of “General English”, i.e., tokens that are common in many records. Refer to [5] and [16] for a complete description of the model and possible extensions.
The HMM similarity function accepts two string records
r1 and r2 and returns the probability of generating r1 given
r2 is a similar record:
$$ sim_{HMM}(r_1, r_2) = \prod_{t \in r_1} \left( a_0 P(t|GE) + a_1 P(t|r_2) \right) \qquad (6) $$

where a_0 and a_1 = 1 − a_0 are the state transition probabilities of the Markov model, and P(t|GE) and P(t|r_2) are given by:

$$ P(t|r_2) = \frac{\text{number of times } t \text{ appears in } r_2}{|r_2|}, \qquad P(t|GE) = \frac{\sum_{r \in R} \text{number of times } t \text{ appears in } r}{\sum_{r \in R} |r|} $$
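A minimal sketch of Equation (6); the precomputed General English counts and the choice a_1 = 0.8 are our own illustrative assumptions (a_1 is a tuning parameter of the model):

```python
from collections import Counter

def sim_hmm(r1: list[str], r2: list[str],
            ge_counts: Counter, ge_total: int, a1: float = 0.8) -> float:
    """HMM similarity (Equation 6): probability of generating query r1 from the
    two-state model of record r2. ge_counts/ge_total are token counts over the
    whole relation ('General English'); a1 is a tuning parameter."""
    a0 = 1.0 - a1
    tf2 = Counter(r2)
    prob = 1.0
    for t in r1:
        p_ge = ge_counts.get(t, 0) / ge_total   # P(t | GE)
        p_r2 = tf2.get(t, 0) / len(r2)          # P(t | r2)
        prob *= a0 * p_ge + a1 * p_r2
    return prob
```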
3.4 Hybrid Measures
Hybrid measures involve two similarity functions: one that compares the strings by comparing their word tokens, and a second function, more suitable for short strings, that is used to compare the word tokens themselves.
3.4.1 GES
The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r_1 and r_2, tokenizes the strings into sets of words, and assigns a weight w(t) to each token. GES defines the similarity between the two given strings based on the minimum transformation cost required to convert string r_1 to r_2 and is given by:

$$ sim_{GES}(r_1, r_2) = 1 - \min\left\{ \frac{tc(r_1, r_2)}{wt(r_1)}, \; 1.0 \right\} \qquad (7) $$
where wt(r_1) is the sum of the weights of all tokens in r_1 and tc(r_1, r_2) is the minimum cost of a sequence of the following transformation operations:
• token insertion: inserting a token t into r_1 with cost w(t) · c_ins, where c_ins is a constant insertion factor between 0 and 1 (c_ins = 1 in our experiments).
• token deletion: deleting a token t from r_1 with cost w(t).
• token replacement: replacing a token t_1 by t_2 in r_1 with cost (1 − sim_edit(t_1, t_2)) · w(t_1), where sim_edit is the edit similarity of Equation (1).
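A greatly simplified dynamic-programming sketch of this cost computation follows; the exact algorithm in [7] may differ, and we reuse the sim_edit sketch from Section 3.1:

```python
def sim_ges(r1_tokens: list[str], r2_tokens: list[str],
            w: dict[str, float], c_ins: float = 1.0) -> float:
    """Simplified GES (Equation 7) via token-level edit distance.
    Assumes sim_edit(...) from the edit-similarity sketch above is in scope."""
    n, m = len(r1_tokens), len(r2_tokens)
    # dp[i][j] = min cost to transform the first i tokens of r1 into the first j of r2
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + w[r1_tokens[i - 1]]            # delete
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + w[r2_tokens[j - 1]] * c_ins    # insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t1, t2 = r1_tokens[i - 1], r2_tokens[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + w[t1],                                  # delete t1
                dp[i][j - 1] + w[t2] * c_ins,                          # insert t2
                dp[i - 1][j - 1] + (1 - sim_edit(t1, t2)) * w[t1],     # replace
            )
    wt_r1 = sum(w[t] for t in r1_tokens)
    return 1.0 - min(dp[n][m] / wt_r1, 1.0)
```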
3.4.2 SoftTFIDF
SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with any similarity function for comparing word tokens. The similarity score sim_SoftTFIDF is defined as follows:

$$ sim_{SoftTFIDF}(r_1, r_2) = \sum_{t_1 \in C(\theta, r_1, r_2)} w(t_1, r_1) \cdot w\!\left(\arg\max_{t_2 \in r_2} sim(t_1, t_2), \, r_2\right) \cdot \max_{t_2 \in r_2} sim(t_1, t_2) \qquad (8) $$

where w(t, r) is the normalized tf-idf weight of word token t in record r, and C(θ, r_1, r_2) returns the set of tokens t_1 ∈ r_1 such that sim(t_1, t_2) > θ for some t_2 ∈ r_2, for a similarity function sim() suitable for comparing word strings. In our experiments, sim(t_1, t_2) is the Jaro-Winkler similarity, as suggested in [8].
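A sketch of Equation (8); token_sim stands in for the word-level similarity (the paper uses Jaro-Winkler, but any [0, 1]-valued function such as the sim_edit sketch above can be plugged in), and the parameter names are ours:

```python
def sim_softtfidf(r1_tokens: list[str], r2_tokens: list[str],
                  w1: dict[str, float], w2: dict[str, float],
                  token_sim, theta: float = 0.9) -> float:
    """SoftTFIDF (Equation 8). w1/w2 map word tokens to their normalized
    tf-idf weights in r1/r2; theta is the word-level threshold of C(θ, r1, r2)."""
    score = 0.0
    for t1 in set(r1_tokens):
        best_t2, best_sim = None, 0.0
        for t2 in set(r2_tokens):
            s = token_sim(t1, t2)
            if s > best_sim:
                best_t2, best_sim = t2, s
        if best_sim > theta:                 # t1 is in C(theta, r1, r2)
            score += w1[t1] * w2[best_t2] * best_sim
    return score
```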
4. EVALUATION
4.1 Datasets
In order to evaluate the effectiveness of the different similarity measures described in the previous section, we use the same datasets used in [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator injects several types of errors into a clean database of string attributes: commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement and swap), token swaps, and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa).
The data generator has several parameters to control the
injected error in the data such as the size of the dataset to
be generated, the distribution of duplicates (uniform, Zipfian or Poisson), the percentage of erroneous duplicates, the
extent of error injected in each string, and the percentage
of different types of errors. The data generator keeps track
of the duplicate records by assigning a cluster ID to each clean record and to all duplicates generated from that clean record.

Group         Name   Erroneous        Errors in        Token      Abbr.
                     Duplicates (%)   Duplicates (%)   Swap (%)   Error (%)
Dirty         D1     90               30               20         50
              D2     50               30               20         50
Medium        M1     30               30               20         50
              M2     10               30               20         50
              M3     90               10               20         50
              M4     50               10               20         50
Low           L1     30               10               20         50
              L2     10               10               20         50
Single Error  AB     50               0                0          50
              TS     50               0                20         0
              EDL    50               10               0          0
              EDM    50               20               0          0
              EDH    50               30               0          0

Table 1: Datasets Used in the Experiments
For the results presented in this paper, the datasets are
generated by the data generator out of a clean dataset of
2139 company names with average record length of 21.03
and an average of 2.9 words per record. The errors in
the datasets have a uniform distribution. For each dataset,
on average 5000 dirty records are created out of 500 clean
records. We have also run experiments on datasets generated using different parameters. For example, we generated
data using a Zipfian distribution, and we also used data from
another clean source (DBLP titles) as in [5]. We also created larger datasets. For these other datasets, the accuracy
trends remain the same. Table 1 shows the description of
all the datasets used for the results in this paper. We used
8 different datasets with mixed types of errors (edit errors,
token swap and abbreviation replacement). Moreover, we
use 5 datasets with only a single type of error (edit errors,
token swap or abbreviation replacement errors) to measure
the effect of each type of error individually. Following [5],
we believe the errors in these datasets are highly representative of common types of errors in databases with string
attributes.
4.2 Measures
We use well-known measures from IR, namely precision,
recall, and F1, for different values of threshold to evaluate
the accuracy of the similarity join operation. We perform a
self-join on the input table using a similarity measure with a
fixed threshold θ. Precision (Pr) is defined as the percentage
of similar records among the records that have similarity
score above threshold θ. In our datasets, similar records are
marked with the same cluster ID as described above. Recall
(Re) is the ratio of the number of similar records that have
similarity score above threshold θ to the total number of
similar records. A join that returns all the pairs of records in
the two input tables as output has low (near zero) precision
and recall of 1. A join that returns an empty answer has
precision 1 and zero recall. The F1 measure is the harmonic
mean of precision and recall, i.e.,
$$ F1 = \frac{2 \times Pr \times Re}{Pr + Re} \qquad (9) $$
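To make the evaluation protocol concrete, the following brute-force sketch (quadratic in the number of records, for illustration only; names are ours) scores a self-join at a given threshold against the generator's ground-truth cluster IDs:

```python
from itertools import combinations

def precision_recall_f1(records: list[str], clusters: list[int],
                        sim, theta: float) -> tuple[float, float, float]:
    """Evaluate a self-join at threshold theta; clusters[i] is the
    ground-truth cluster ID of records[i]."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(records)), 2):
        predicted = sim(records[i], records[j]) >= theta
        actual = clusters[i] == clusters[j]
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    pr = tp / (tp + fp) if tp + fp else 1.0   # empty answer: precision 1
    re = tp / (tp + fn) if tp + fn else 0.0   # zero recall if nothing found
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```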
We measure precision, recall, and F1 for different values of the similarity threshold θ. For comparison of different similarity measures, we use the maximum F1 score across different thresholds.

[Figure 3: Maximum F1 score for different measures on datasets with only edit errors]

[Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors]
4.3 Results
Figures 1 and 2 show the precision, recall, and F1 values for all measures described in Section 3, over the datasets we have defined with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do not return a score between 0 and 1.
Effect of amount of errors. As shown in the precision/recall curves in Figures 1 and 2, the “dirtiness” of the input data greatly affects the value of the threshold that results in the most accurate join. For all the measures, a lower value of the threshold is needed as the degree of error in the data increases. For example, Weighted Jaccard achieves the best F1 score over the dirtiest datasets with threshold 0.3, while it achieves the best F1 for the cleanest datasets at threshold 0.55. BM25 and HMM are less sensitive and work well on both the dirty and clean groups of datasets with the same value of the threshold. We discuss later how the degree of error in the data affects the choice of the most accurate measure.
Effect of types of errors. Figure 3 shows the maximum F1 score across different values of the threshold for different measures on datasets containing only edit errors (the EDL, EDM and EDH datasets). The figure shows that Weighted Jaccard and Cosine have the highest accuracy, followed by Jaccard and edit similarity, on the low-error dataset EDL. As the amount of edit error in each record increases, HMM performs as well as Weighted Jaccard, although Jaccard, edit similarity, and GES perform much worse at high edit error rates. Considering that edit similarity is mainly proposed for capturing edit errors, this shows the effectiveness of Weighted Jaccard and its robustness to varying amounts of edit errors. Figure 4 shows the effect of token swap and abbreviation errors on the accuracy of the different measures. This experiment indicates that edit similarity is not capable of modeling these types of errors. HMM, BM25 and Jaccard are also not capable of properly modeling abbreviation errors.
Comparison of measures. Figure 5 shows the maximum F1 score across different values of the threshold for different measures on the dirty, medium and clean groups of datasets. (Here we have aggregated the results for all the dirty datasets together, and likewise for the moderately dirty (medium) datasets and the clean datasets.) The results show the effectiveness and robustness of Weighted Jaccard and Cosine in comparison with the other measures. Again, HMM is among the most accurate measures when the data is extremely dirty, and has relatively low accuracy when the percentage of error in the data is low.

[Figure 5: Maximum F1 score for different measures on clean, medium and dirty groups of datasets]
Remark. As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation highly depends on the value of the similarity threshold used for the join. Here we show the accuracy numbers on our datasets using the threshold values that make these algorithms effective. Specifically, we address the results in [2], although similar observations can be made for the results of other similar works in this area. Table 2 shows the F1 value for the thresholds that result in the best accuracy on our datasets and the best performance in the experimental results of [2]. The PartEnum and WtEnum algorithms presented in [2] significantly outperform previous algorithms at threshold 0.9, but have roughly the same performance as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show that there is a big gap between the value of the threshold that results in the most accurate join on our datasets and the threshold at which PartEnum and WtEnum are effective in the studies in [2].
Group    Jaccard Join                        Weighted Jaccard Join
         Threshold                  F1       Threshold                  F1
Dirty    0.5 (Best Acc.)            0.293    0.3 (Best Acc.)            0.528
         0.8                        0.249    0.8                        0.249
         0.85                       0.248    0.85                       0.246
         0.9 (Best Performance)     0.247    0.9 (Best Performance)     0.244
Medium   0.65 (Best Acc.)           0.719    0.55 (Best Acc.)           0.776
         0.8                        0.611    0.8                        0.581
         0.85                       0.571    0.85                       0.581
         0.9 (Best Performance)     0.548    0.9 (Best Performance)     0.560
Clean    0.7 (Best Acc.)            0.887    0.55 (Best Acc.)           0.929
         0.8                        0.854    0.8                        0.831
         0.85                       0.831    0.85                       0.819
         0.9 (Best Performance)     0.812    0.9 (Best Performance)     0.807

Table 2: F1 score for thresholds that result in the best running time in previous performance studies and the highest accuracy on our datasets, for two selected similarity measures

5. CONCLUSION
We have presented an overview of several similarity measures for efficient approximate string joins and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Our results show the effect of the amount and type of errors in the datasets, and of the similarity threshold used, on the accuracy of the join operation. Since the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results highlight the value of algorithms that are less sensitive to the threshold and open an interesting direction for future work: finding algorithms that are both efficient and accurate at the same threshold. Finding an algorithm that determines the best value of the threshold, regardless of the type and amount of errors, for the similarity measures that showed higher accuracy in our work is another interesting subject for future work.
6. REFERENCES
[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean
answers over dirty databases: A probabilistic
approach. In ICDE’06.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
set-similarity joins. In VLDB’06.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
pairs similarity search. In WWW’07.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and
S. Fienberg. Adaptive name matching in information
integration. IEEE Intelligent Systems, 18(5), 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi,
and D. Srivastava. Benchmarking declarative
approximate selection predicates. In SIGMOD’07.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani.
Robust and efficient fuzzy match for online data
cleaning. In SIGMOD’03.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
operator for similarity joins in data cleaning. In ICDE
’06.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A
comparison of string distance metrics for
name-matching tasks. In IIWeb’03.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios.
Duplicate record detection: A survey. IEEE TKDE,
19(1), 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish,
N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for
free. In VLDB’01.
[11] D. Gusfield. Algorithms on strings, trees, and
sequences: computer science and computational
biology. Cambridge University Press, New York, NY,
USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is
dirty: Data cleansing and the merge/purge problem.
Data Mining and Knowledge Discovery, 2(1):9–37,
1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala.
Locality-preserving hashing in multidimensional
spaces. In STOC’97.
[14] N. Koudas and D. Srivastava. Approximate joins:
Concepts and techniques. In VLDB’05 Tutorial.
[15] C. Li, B. Wang, and X. Yang. Vgram: Improving
performance of approximate queries on string
collections using variable-length grams. In VLDB’07.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A
hidden markov model information retrieval system. In
SIGIR’99.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu,
M. Gatford, and A. Payne. Okapi at trec-4. In
TREC’95.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on
similarity predicates. In SIGMOD’04.
[Figure 1: Accuracy of Edit-Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold on different datasets. For each measure, panels show (a) Low Error, (b) Medium Error, and (c) Dirty Data Sets.]

[Figure 2: Accuracy of measures from IR (Cosine w/tf-idf, BM25, HMM) and hybrid measures (SoftTFIDF, GES) relative to the value of the threshold on different datasets. For each measure, panels show (a) Low Error, (b) Medium Error, and (c) Dirty Data Sets.]