TR-CS-09-01
Similarity-Aware Indexing for
Real-Time Entity Resolution
Peter Christen, Ross Gayler, David Hawking
August 2009
ANU Computer Science Technical Report Series
This technical report series is published by the School of Computer Science,
College of Engineering and Computer Science, The Australian National
University. Prior to 2009 this series was published as Joint Computer
Science Technical Report Series jointly by the Department of Computer
Science, Faculty of Engineering and Information Technology, and the
Computer Sciences Laboratory, Research School of Information Sciences
and Engineering, The Australian National University.
Similarity-Aware Indexing for Real-Time Entity Resolution
Peter Christen
School of Computer Science, Australian National University
Canberra ACT 0200, Australia
[email protected]

Ross Gayler
Scoring Solutions, Veda Advantage
Melbourne VIC 3000, Australia
[email protected]

David Hawking
Funnelback Pty Ltd
Dickson ACT 2601, Australia
[email protected]
ABSTRACT
Entity resolution, also known as data matching or record linkage, is the task of identifying records from several databases that refer to the same entities. Traditionally, entity resolution has been applied on static databases, for example to find records that relate to the same patient in different health databases. Most research in entity resolution has concentrated on either improving the matching quality, making entity resolution scalable to very large databases, or reducing the manual efforts required throughout the resolution process. Increasingly, however, many organisations are faced with the challenge of having large databases that contain entities, and a stream of query records that have to be matched with these databases in real-time, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, and online retail stores.
In this paper, a novel inverted index based approach for real-time entity resolution is presented. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other recently developed approaches to approximate querying, in that it allows any similarity comparison function, and any 'blocking' function, both possibly domain specific, to be incorporated.
Experimental results on a large real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval; H.3.1 [Information Systems]: Information Storage and Retrieval - Content Analysis and Indexing.
General Terms
Algorithms, Experimentation, Performance.
Accepted as poster at CIKM’09, November 2–6, 2009, Hong Kong.
Keywords
Data matching, record linkage, scalability, similarity query, approximate string matching, inverted indexing.
1. INTRODUCTION
Increasingly, many applications that deal with data management and analysis require that data from different sources is matched and aggregated before it can be used for further processing. The aim of data matching is to identify and match all records that refer to the same real world entities. These entities can, for example, be customers, patients, tax payers, travellers, students, businesses, consumer products, or bibliographic citations. While statisticians and health researchers commonly name the task of matching records as data or record linkage, computer scientists and the database and business oriented IT communities speak of entity resolution, data or field matching, data cleansing, data integration, duplicate detection, data scrubbing, list washing, object identification, or merge/purge processing.
Traditionally, techniques for matching records that correspond to the same entities have been applied in the health sector and within the census [14, 21, 28]. Increasingly, however, entity resolution is now being used within and between many organisations in both the public and private sectors in a large variety of application domains. Examples include finding duplicates in business mailing lists, bibliographic databases (digital libraries) and online stores; crime and fraud detection within finance and insurance companies as well as government agencies; compilation of longitudinal data for social research; or the assembly of terrorism watch lists for improved national security.
Because real-world data rarely contains unique entity identifiers across all the databases to be matched, most entity resolution approaches compare records using the information available in the databases that partially identify entities, such as their names, address details, or dates of birth. For each of the partially identifying attributes compared between two records, a similarity is calculated. These similarities are then used collectively to classify each compared record pair as a match, non-match, or possible match [14, 16, 28]. The matching process is often challenged because real world data is dirty, i.e. contains missing or out-of-date attribute values, variations and errors, values that are swapped between attributes, or data that is coded differently [26].
Traditional entity resolution approaches assume that two or more static databases are to be matched in batch mode, in order to produce a new matched data set. Increasingly, however, entity resolution is required in an online, real-time environment, where query records have to be matched with one or several large databases, and the most similar records are to be retrieved. One example application of current interest is health surveillance and emergency response systems, where the aim is to find all records that relate to a certain individual, for example a patient showing symptoms of an infectious disease, from a variety of databases. In order to find other individuals that might have been in contact with that patient, a search needs to be conducted in an airline database, to find the details of other people who have travelled with the patient; the database of the patient's employer, to find potentially infected co-workers; the school database of the patient's children, and so on. In many cases, the search for matching records will rely upon personal details, like the name, address and date of birth of the patient, and thus be subject to errors and variations, as well as out-of-date information. Accurate and real-time approximate matching techniques are required for such situations.
The application domain of specific interest to the authors is consumer financial services. Entity resolution is increasingly important in this domain as such services are being delivered remotely. Once a consumer has established an account with a financial institution, she or he is normally required to use an unambiguous identity token, like an account number. However, the initial establishment of a consumer's identity is difficult. The normal approach taken is entity resolution of identifying information, as provided by the consumer, against one or more databases of related identifying information. The information provided is often subject to variability and error, requiring an approximate matching process. As this process will be driven by automated systems that require sub-second responses, automated and accurate matching, scalability, and real-time entity resolution are major technical challenges for such systems.
This paper presents work that is aimed towards the development of such systems. The basic idea of achieving real-time entity resolution is to combine similarity calculations used for approximate matching with inverted index techniques that are commonly used in the field of information retrieval, for example for large-scale Web search engines [3, 29, 32]. In the past decade, with the popularity and commercial success of such search engines, a large amount of research and development on optimisation techniques has been conducted in this field [3, 32]. Some of these optimisation techniques are used in the work presented in this paper to facilitate real-time entity resolution of large databases.
The contribution of this paper is a novel index approach suitable for real-time entity resolution. This approach significantly improves the matching speed over a similar approach recently presented by two of the authors [9]. Compared to this earlier approach, which was between two and one hundred times faster than a traditional index approach for entity resolution, the novel technique presented here is consistently over two orders of magnitude faster than the traditional index approach. An important aspect of the novel approach presented here is that it allows any similarity comparison function, and any encoding function for 'blocking' [2], both possibly domain specific, to be incorporated. Most other approximate matching approaches developed in recent times are limited to specific similarity functions (such as edit distance, or Jaccard or cosine similarity), and therefore may not be suitable for entity resolution in applications that require specific encoding and comparison functions.
The remainder of this paper is structured as follows. Next, in Section 2, an overview of related research is provided. The proposed novel index approach for real-time entity resolution is then presented in Section 3, and experimentally evaluated in Section 4 using a large real-world database. The results of these experiments are discussed in Section 5, and the paper is concluded with an outlook to future work in Section 6.
2. RELATED WORK
Research into entity resolution is being conducted in various domains, including data mining, machine learning, information retrieval, artificial intelligence, digital libraries, information systems, statistics, and the database community. Several recent overview articles are available [13, 28]. Entity resolution techniques can broadly be classified into learning approaches [5, 8, 10, 11, 12], or database and graph-based methods [20, 27, 31]. So far, most research in this area has focused on the quality of the matching process, i.e. the accuracy of classifying the compared pairs or groups of records into matches and non-matches. The challenges of scalability to very large databases and real-time matching have so far only received limited attention.
For the traditional matching of (large) static databases, indexing is important, because potentially every record from one database needs to be compared with all records from the other database, resulting in a process that is of quadratic complexity in the sizes of the databases to be matched. Indexing techniques, also known as 'blocking', are therefore commonly applied to reduce the number of comparisons to be conducted. In the standard blocking approach [2], which will be presented in detail in Section 3.1 below, the databases are split into blocks according to some criteria, and only records within the corresponding block are compared with each other. A blocking criterion, also called a blocking key, might be based on a single record attribute (that should contain values of high quality), or based on the concatenation of values from several attributes. In order to overcome the problem of variations and errors in real-world data, one aim is to group similar sounding values into the same block. This can be accomplished by using phonetic encoding functions, such as Soundex, NYSIIS or Double-Metaphone [7]. These functions, which are often language or domain specific, are applied when generating the blocking keys. Examples of such phonetic encodings are shown in Figure 1.
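To make the role of phonetic encoding in blocking concrete, the following Python sketch implements the commonly described Soundex rules and groups the surnames of Figure 1 by their codes. The function name `soundex`, the `records` dictionary and the lower-case output format are illustrative assumptions rather than part of the approach described in [7]; a production system would normally rely on a tested library implementation.

```python
from collections import defaultdict


def soundex(name: str) -> str:
    """Four-character Soundex code, written in lower case as in Figure 1."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0]                 # the first letter is kept as-is
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:  # skip vowels and repeated codes
            result += digit
        if ch not in "hw":           # 'h' and 'w' do not separate equal codes
            prev = digit
    return (result + "000")[:4]      # pad or truncate to four characters


# Surnames of the example records in Figure 1.
records = {"r1": "smith", "r2": "miller", "r3": "peter", "r4": "myler",
           "r5": "smyth", "r6": "millar", "r7": "smith", "r8": "miller"}

# Each blocking key (Soundex code) maps to the records in that block.
blocks = defaultdict(list)
for rid, surname in records.items():
    blocks[soundex(surname)].append(rid)
```

With this encoding, 'smith' and 'smyth' share one blocking key, and 'millar', 'miller' and 'myler' share another.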
The standard blocking approach has two major drawbacks. First, the size of the generated blocks depends upon the frequency distribution of the attribute values used in the blocking key. For example, using surname values in a blocking key will likely generate a very large block containing the common surname 'Smith', resulting in a very large number of comparisons that need to be conducted for this block. Second, if a value in an attribute used as a blocking key contains errors or variations that result in a different encoding, then the corresponding record will be inserted into a different block, and potentially true matches will be missed. This problem is normally overcome, at increased computational costs, by having two or more different blocking keys based on different record attributes.
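The first drawback can be illustrated with a small calculation. The sketch below counts record pair comparisons when a single database is compared with itself, which is a simplifying assumption (matching two databases would multiply the block sizes from each side instead). The function name `comparisons_within_blocks` is hypothetical, and the blocking keys are the Soundex codes of the eight example records in Figure 1.

```python
from collections import Counter


def comparisons_within_blocks(blocking_keys):
    """Number of record pairs compared when only records that share a
    blocking key are compared with each other."""
    sizes = Counter(blocking_keys)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Blocking keys of records r1 to r8 in Figure 1.
keys = ["s530", "m460", "p360", "m460", "s530", "m460", "s530", "m460"]

n = len(keys)
all_pairs = n * (n - 1) // 2               # no blocking: every pair
blocked = comparisons_within_blocks(keys)  # with standard blocking

# One dominant block (for example a very common surname) drives the cost:
# 1000 records in one block plus a single record in another block.
skewed = comparisons_within_blocks(["s530"] * 1000 + ["p360"])
```

Because the cost of a block grows quadratically with its size, a single very frequent blocking key value can account for almost all comparisons.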
Various alternative blocking approaches have been developed in recent times, aimed at improving the scalability of the matching process and increasing matching accuracy. With the sorted neighbourhood approach [18], the databases are sorted according to the values in the blocking key, and a fixed-size window is moved over the databases. All records within the current window will then be compared with each other. An approach related to this is to insert the blocking key values and their suffixes into a suffix array based inverted index [1], and to then generate blocks from all records that have the same suffix value. With this approach, each record will be inserted into several blocks, depending upon the length of its suffix values. Another approach is to allow for 'fuzzy' blocking by converting blocking key values into q-gram lists and, using sub-lists of these q-gram lists, to insert each record into several blocks according to a Jaccard-based similarity threshold [2]. While this approach can improve the accuracy of the resulting matching, its computational complexity (a large number of q-gram sub-lists need to be generated) makes it unsuitable for large databases. Another idea for indexing is to apply clustering by using a computationally efficient similarity measure to generate high-dimensional overlapping clusters (called 'canopies'), and to then extract blocks of records from these clusters [11]. Each record will be inserted into several clusters and thus several blocks, resulting in higher matching accuracy but at higher computational costs. Another recent approach is to map blocking key values into a high-dimensional Euclidean space such that the distances between all pairs of strings are preserved [19]. The records in a block then correspond to all objects in this space that are similar to each other.
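As an illustration of the first of these alternatives, the following sketch generates candidate record pairs with a sliding window over records sorted by their blocking key. It is a minimal reading of the sorted neighbourhood idea under stated assumptions: a single database is compared with itself, the raw surname is used as the sorting key, and the names `sorted_neighbourhood_pairs`, `records` and `window` are hypothetical.

```python
from itertools import combinations


def sorted_neighbourhood_pairs(records, key, window=3):
    """Return the set of candidate pairs of record identifiers that fall
    into a common window of the key-sorted record list."""
    # Sort identifiers by blocking key value (identifier breaks ties).
    ordered = sorted(records, key=lambda rid: (key(records[rid]), rid))
    pairs = set()
    for start in range(max(1, len(ordered) - window + 1)):
        for a, b in combinations(ordered[start:start + window], 2):
            pairs.add((a, b) if a < b else (b, a))
    return pairs


# Surnames of the example records in Figure 1.
records = {"r1": "smith", "r2": "miller", "r3": "peter", "r4": "myler",
           "r5": "smyth", "r6": "millar", "r7": "smith", "r8": "miller"}

pairs = sorted_neighbourhood_pairs(records, key=lambda surname: surname)
```

With a window of three records, 'millar' (r6) and 'myler' (r4) never share a window because both 'miller' records sort between them, which shows how the window size trades the number of comparisons against missed candidate pairs.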
Conducting entity resolution not on static databases but at query time has so far received very limited attention, with only two recent publications presenting approaches specific to such situations. The authors have earlier shown that using an inverted index approach can significantly speed up the query matching process [9]. A second approach is based on unsupervised relational clustering, which assumes that the data to be matched contains relational information that explicitly links different types of entities [5]. The idea of this approach is to utilise the relational links between records to improve the entity resolution process. At query time, matching is conducted in an iterative fashion on a database that contains unresolved entities. While this approach can achieve much better matching accuracy compared to traditional entity resolution approaches (that only consider attribute similarities), it has much higher computational costs. Matching times of around 30 seconds for one query record on a database containing around 800,000 records have been reported [5]. This approach is therefore impractical for real-time entity resolution on very large databases.
A large body of work has been conducted in the database community on similarity queries and their scalable and efficient implementations [4, 6, 15, 17, 23, 24, 25, 30]. Many of the presented approaches optimise indexing and filtering techniques for specific types of similarity measures, such as edit distance, or q-gram or cosine based similarities. They also mainly deal with the situation of either finding similar tuples between a set of query records and a database table, or two large tables. One real-time similarity join approach based on a modified trie hash-join has been presented very recently [22]. It calculates q-gram similarities, and then applies several filtering steps to achieve fast query times.
Thus far, scalability to very large databases has not been addressed by most recent research in the area of entity resolution, and most publications in this area have presented experimental results based on only small to medium sized data sets containing up to one million records [5, 20, 27, 31]. Most of the recently developed advanced entity resolution techniques have a computational complexity that makes them impractical for matching very large databases that contain many million records. Additionally, most approaches published so far, with the exception of two very recent techniques [5, 9], are assuming the situation of matching static databases in batch mode.
3. INDEXING FOR REAL-TIME ENTITY RESOLUTION
Indexing, as presented in the previous section, is required for real-time entity resolution systems to speed up the matching process by reducing the number of candidate records that need to be matched with a query record.
The objective of real-time entity resolution is to match a stream of query records as quickly as possible to one or several (large) databases that contain records about existing entities, and potentially to a range of external data sources that contain additional information that can be used to verify the matched entities. The response time for matching a single query record has to be as short as possible, ideally sub-second. The matching approach must facilitate approximate matching and efficiently scale up to very large databases that contain many millions of records. In addition, the matching should generate a match score that indicates the likelihood that a matched record in the database refers to the same entity as the query record.
Real-time entity resolution has much in common with the functionality of large-scale Web search engines. However, the databases upon which entity resolution is commonly applied do not contain Web or text documents that include a large number of terms and thus provide a rich variety of features. Rather, these databases are made of structured records with well defined attributes that often only contain short strings or numbers, such as the personal details of people (for example name, address, or date of birth values).
In this section, the traditional standard blocking approach to indexing for entity resolution is presented first to illustrate the basic ideas of using an inverted index for entity resolution. Based on this approach, a similarity-aware inverted index approach that is suitable for real-time entity resolution is then discussed in detail in Section 3.2. Both index approaches are illustrated in Figures 2 and 3.
Both index approaches presented here are based on a standard inverted index [32], where the keys of the index are (possibly encoded) attribute values, and the corresponding lists contain the record identifiers of all records that have this (encoded) value. Two types of function are required by both index approaches. First, (phonetic) encoding functions are needed that group similar attribute values together. For string attributes, such as personal names or street and suburb names, phonetic encodings like Soundex, NYSIIS or Double-Metaphone are commonly used [7]. Figure 1, for example, shows the Soundex encodings of eight surname values. As can be seen, this encoding function groups the values 'smith' and 'smyth' into one block, and 'millar', 'miller' and 'myler' into another.

Record ID   Surname   Soundex encoding
r1          smith     s530
r2          miller    m460
r3          peter     p360
r4          myler     m460
r5          smyth     s530
r6          millar    m460
r7          smith     s530
r8          miller    m460

Figure 1: Example records with surname values and their Soundex encodings, used to illustrate the two index approaches in Figures 2 and 3.

Blocking key   Record identifiers
m460           r2, r4, r6, r8
p360           r3
s530           r1, r5, r7

Figure 2: Standard blocking index resulting from the example records given in Figure 1. The blocking keys correspond to Soundex encodings.

The second type of functions required are similarity comparisons that calculate a normalised similarity between two attribute values, such that 1 corresponds to an exact similarity and 0 to total dissimilarity [7]. Note that for different attribute types (strings, dates, numbers, etc.) different such comparison functions can be used. Additionally, domain specific comparison functions are often applied to improve the matching quality. One example of such a function would be a date of birth comparison, where a mismatch in the month or day of birth is less severe than a mismatched year of birth.
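The two kinds of comparison function mentioned above can be sketched as follows. Both functions are illustrative assumptions and not the comparison functions used in the experiments: `string_similarity` normalises a plain edit distance by the longer string length, and `dob_similarity` uses hypothetical weights (0.6 for the year, 0.2 each for month and day) chosen only so that a year mismatch is penalised more heavily. The edit-distance based function does not reproduce the similarity values shown in Figure 3.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Normalised similarity: 1.0 for equal strings, 0.0 if nothing matches."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def dob_similarity(a, b, year_weight=0.6, month_weight=0.2, day_weight=0.2):
    """Compare two (year, month, day) tuples; the weights are assumptions."""
    sim = 0.0
    if a[0] == b[0]:
        sim += year_weight
    if a[1] == b[1]:
        sim += month_weight
    if a[2] == b[2]:
        sim += day_weight
    return sim


month_off = dob_similarity((1970, 5, 12), (1970, 6, 12))  # month differs
year_off = dob_similarity((1970, 5, 12), (1971, 5, 12))   # year differs
```

Any function with the same signature and a result between 0 and 1 could be substituted, which is the flexibility the index approaches in this section rely on.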
The real-time entity resolution process as discussed in this paper consists of two phases. First, in the build phase, an index is generated using a static database that contains a possibly large number of cleaned records that are assumed to refer to resolved entities, i.e. one single record per real-world entity only. Once built, the index is queried in the second phase with a stream of query records. These records can either refer to an entity stored in the index, or to a new and unknown entity. It is assumed, however, that the query records can contain variations and typographical errors, or wrong, out-of-date or missing values. Missing values can be handled by replacing them with a special character (that is outside of the character set used for an attribute) in both database and query records. For each query record, the matching process returns a ranked list of potential matches and their similarities with the query record. A match is successful if one of the top ranked records refers to the same entity as the query record.
In the following two sections, the two index approaches are described in detail, and in Section 3.3 two optimisations for reducing the query matching time are discussed. Experiments on these two index approaches are then presented in Section 4. In Algorithms 1 to 4, the attributes of a record r are denoted by r.i, with r.0 assumed to be an identifier attribute that allows unique identification of each record. A list with key k in an inverted index X is denoted with X[k]. An empty list is denoted by () and an empty index by {}.
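Similarity index (SI):
  millar: (miller, 0.9), (myler, 0.7)
  miller: (millar, 0.9), (myler, 0.8)
  myler:  (millar, 0.7), (miller, 0.8)
  peter:  (empty list)
  smith:  (smyth, 0.9)
  smyth:  (smith, 0.9)

Block index (BI):
  m460: millar, miller, myler
  p360: peter
  s530: smith, smyth

Record identifier index (RI):
  millar: r6
  miller: r2, r8
  myler:  r4
  peter:  r3
  smith:  r1, r7
  smyth:  r5

Figure 3: Similarity-aware index resulting from the example records from Figure 1. The similarity index is shown in the top left, the block index in the middle right, and the record identifier index at the bottom.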
3.1 Standard Blocking
The basic idea of standard blocking is that each record in a database is inserted into a block according to the value of its blocking key [2], as illustrated in Figure 2. Encoding functions [7] are used to group similar attribute values into the same block. Each block corresponds to an inverted index list, with the key being the (encoded) blocking key value, while the values in the corresponding list are the record identifiers of all records in this block.
Standard blocking is used in this paper because it is a baseline approach upon which many other recently developed index approaches for entity resolution can be built. For example, canopy clustering [11], suffix-array blocking [1], and the sorted neighbourhood [18] approach, as discussed in Section 2, can all be implemented as extensions to a basic inverted index. Therefore, standard blocking can be seen as the basic approach for traditional batch-oriented entity resolution of static databases. Other index approaches based on it have higher computational requirements.
The build phase of standard blocking is shown in Algorithm 1. The input to this algorithm is a data set D containing n attributes that will be used in the entity resolution process, and n corresponding encoding functions, E_i. A basic inverted index I, as illustrated in Figure 2, is generated in the build phase, where record identifier values are inserted into inverted index lists according to their corresponding encoded attribute values. This encoding of attribute values might be computationally expensive. Therefore, in order to prevent repeated calculation of encodings of attribute values, once an encoding has been computed, it is stored in the encodings cache C (line 9) so it can be retrieved quickly for subsequent occurrences of the same attribute value r.i (line 6). The algorithm returns the inverted index data structure I containing record identifiers, r.0, in the inverted index lists, and the cache C containing the computed encodings. No similarity calculations are performed during the build phase of this index approach. Note that for simplicity the index and cache are shared between attributes. Depending upon the characteristics of the data to be processed and matched, however, it might be favourable to have separate index and cache data structures per attribute.
Algorithm 1: Standard blocking - Build
Input:
- Data set: D
- Number of attributes of D used: n
- Encoding functions: E_i, i = 1 ... n
Output:
- Standard blocking index: I
- Encodings cache: C
1:  Initialise I = {}
2:  Initialise C = ∅
3:  for r ∈ D:
4:    for i = 1 ... n:
5:      if r.i ∈ C then:
6:        c = C[r.i]
7:      else:
8:        c = E_i(r.i)
9:        C[r.i] = c
10:     Append r.0 to I[c]
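The build phase of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: records are tuples whose element 0 is the identifier, the function name `build_standard_blocking` is hypothetical, and the encoding function is a lookup table holding the Soundex codes printed in Figure 1 rather than a real Soundex implementation.

```python
from collections import defaultdict


def build_standard_blocking(dataset, encoders):
    """Build the inverted index I and the encodings cache C.

    dataset:  iterable of records, r[0] is the identifier, r[1..n] attributes
    encoders: list of n encoding functions, one per attribute
    """
    index = defaultdict(list)   # I: encoded value -> record identifiers
    cache = {}                  # C: attribute value -> encoding
    for r in dataset:
        for i, encode in enumerate(encoders, start=1):
            value = r[i]
            if value in cache:          # encoding already computed
                code = cache[value]
            else:                       # compute once and remember it
                code = encode(value)
                cache[value] = code
            index[code].append(r[0])
    return index, cache


# Stand-in encoder: the Soundex codes listed in Figure 1 (assumption).
FIGURE1_SOUNDEX = {"smith": "s530", "smyth": "s530", "peter": "p360",
                   "miller": "m460", "myler": "m460", "millar": "m460"}

records = [("r1", "smith"), ("r2", "miller"), ("r3", "peter"),
           ("r4", "myler"), ("r5", "smyth"), ("r6", "millar"),
           ("r7", "smith"), ("r8", "miller")]

index, cache = build_standard_blocking(records, [FIGURE1_SOUNDEX.__getitem__])
```

For the example records this yields the three blocks shown in Figure 2, and the cache means that the encoder is called only once for each of the six distinct surnames.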
Algorithm 2 describes the query phase. As input, it requires a query record q, the inverted index I, the data set D, the encodings cache C, the number of attributes to be used for entity resolution n, the encoding functions E_i, and the comparison functions used to calculate the similarities between attribute values, S_i. The query phase consists of two steps. First, in lines 1 to 8, the encoded attribute values (possibly available in the encodings cache C) of the query record q are used to retrieve the record identifier lists from the corresponding blocks in the inverted index I. The union of these lists, b, contains all identifiers of the candidate records that will be compared with the query record in the second step of the algorithm (lines 9 to 14). The required n attribute values for each candidate record r need to be retrieved from data set D (line 10). An efficient index on D is therefore required that allows fast access to a random record r using its identifier r.0. For each candidate record that is compared with the query record, a similarity, s, is calculated over all compared attributes (line 13) and inserted into the list of matches M (line 14). For simplicity a simple summing of s is assumed; in reality, however, other aggregation functions, like weighted sums, can be applied. Finally, in line 15, the list of matches M is sorted such that the largest similarity values are at the beginning.
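The query phase just described can be sketched in the same style. The sketch makes several assumptions that are not part of the paper: the data set is a dictionary keyed by record identifier (standing in for the efficient index on D), the encoder `first_letter_key` is a deliberately crude stand-in for Soundex that happens to produce the same three blocks for the Figure 1 surnames, and the comparison function is the ratio of Python's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def first_letter_key(value: str) -> str:
    """Crude stand-in for a phonetic encoding (assumption)."""
    return value[:1]


def ratio(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def query_standard_blocking(q, dataset, index, cache, encoders, comparators):
    # Step 1: collect candidate identifiers from the query's blocks.
    candidates = set()
    for i, encode in enumerate(encoders, start=1):
        value = q[i]
        code = cache[value] if value in cache else encode(value)
        candidates.update(index.get(code, []))
    # Step 2: compare every candidate record with the query record.
    matches = []
    for rid in candidates:
        r = dataset[rid]                    # fast access by identifier
        s = 0.0
        for i, compare in enumerate(comparators, start=1):
            s += compare(r[i], q[i])        # simple sum over attributes
        matches.append((rid, s))
    matches.sort(key=lambda m: (-m[1], m[0]))  # largest similarity first
    return matches


dataset = {"r1": ("r1", "smith"), "r2": ("r2", "miller"),
           "r3": ("r3", "peter"), "r4": ("r4", "myler"),
           "r5": ("r5", "smyth"), "r6": ("r6", "millar"),
           "r7": ("r7", "smith"), "r8": ("r8", "miller")}
index = {"s": ["r1", "r5", "r7"], "m": ["r2", "r4", "r6", "r8"], "p": ["r3"]}
cache = {r[1]: first_letter_key(r[1]) for r in dataset.values()}

matches = query_standard_blocking(("q1", "smyth"), dataset, index, cache,
                                  [first_letter_key], [ratio])
```

Note that every candidate in the block is compared with the query record at query time; this repeated similarity calculation is the cost that pre-calculated similarities are intended to avoid.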
Algorithm 2: Standard blocking - Query
Input:
- Query record: q
- Data set: D
- Number of attributes of D used: n
- Standard blocking index: I
- Encodings cache: C
- Encoding functions: E_i, i = 1 ... n
- Similarity comparison functions: S_i, i = 1 ... n
Output:
- Ranked list of matches: M
1:  Initialise M = ()
2:  Initialise candidate record identifier list b = ()
3:  for i = 1 ... n:
4:    if q.i ∈ C then:
5:      c = C[q.i]
6:    else:
7:      c = E_i(q.i)
8:    b = b ∪ I[c]
9:  for r.0 ∈ b:
10:   Retrieve r from D using identifier r.0
11:   s = 0
12:   for i = 1 ... n:
13:     s = s + S_i(r.i, q.i)
14:   Append (r.0, s) to M
15: Sort M according to similarities (largest first)

3.2 Similarity-Aware Index
This index is based on the idea of pre-calculating the similarities between all unique attribute value combinations within each block once during the build phase, so that the similarities do not need to be re-calculated for every query record, thereby significantly reducing the matching time required in the query phase.
As illustrated in Figure 3, this approach contains three inverted index data structures. The record identifier index, RI, is similar to the inverted index I used in standard blocking, but the keys of this index are the actual attribute values and not their encodings. The block index, BI, is the data structure that represents the blocks by having encoded attribute values as keys and the actual attribute values that have the same encoding in the corresponding inverted index lists. Each list in this index therefore contains all attribute values that are in the same block. The similarity index, SI, stores the similarities of pairs of attribute values that are in the same block. Specifically, for each attribute value, it contains a list of other attribute values (in the same block) and the similarities between these two values.
Algorithm 3 describes how a similarity-aware index is built. The algorithm requires the same input as the build algorithm for standard blocking. Additionally, the similarity comparison functions, S_i, are also required, because similarities between attribute values are calculated during the build phase rather than the query phase. For each record r in data set D, its identifier r.0 is added to the inverted index list in RI that corresponds to attribute value r.i (line 6). It is important to note that all the following steps (lines 8 to 19) only need to be done if the attribute value r.i has not been processed before (line 7). This will significantly reduce the computational effort if attribute values appear in a data set several times, which is the case for attributes that have a Zipf-like or exponential distribution of values, as is common for example for attributes that contain names [9].
For a new attribute value r.i that has so far not been indexed, the first step (lines 8 to 11) is to calculate its encoding and to retrieve all other values in its block. The new value is then added into the inverted index list b of this block, and the updated list is stored back into the block index BI. The similarities between the new attribute value r.i and all attribute values v already in this block are calculated next (line 13), and inserted into both the new value's similarity list si (line 15) and the other value's list oi (line 17). Finally, the similarity list si of the new value r.i is added to the similarity index SI in line 19.
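A minimal Python sketch of this build phase is given below. It is not the authors' implementation: the encoder `first_letter_key` is a crude stand-in for Soundex, the comparison function is the ratio of `difflib.SequenceMatcher`, and the function name `build_similarity_aware_index` is hypothetical. One deliberate deviation is also assumed: the new value is appended to its block after the similarity loop, so that it is only compared with the values already in the block and never with itself.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def first_letter_key(value: str) -> str:
    """Crude stand-in for a phonetic encoding (assumption)."""
    return value[:1]


def ratio(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def build_similarity_aware_index(dataset, encoders, comparators):
    RI = defaultdict(list)   # attribute value -> record identifiers
    SI = {}                  # attribute value -> [(other value, similarity)]
    BI = defaultdict(list)   # encoding -> attribute values in that block
    for r in dataset:
        for i, (encode, compare) in enumerate(zip(encoders, comparators), 1):
            value = r[i]
            RI[value].append(r[0])
            if value not in SI:              # only for values not yet indexed
                block = BI[encode(value)]
                own = []
                for other in block:          # values already in the block
                    s = compare(value, other)
                    own.append((other, s))
                    SI[other].append((value, s))
                block.append(value)
                SI[value] = own
    return RI, SI, BI


records = [("r1", "smith"), ("r2", "miller"), ("r3", "peter"),
           ("r4", "myler"), ("r5", "smyth"), ("r6", "millar"),
           ("r7", "smith"), ("r8", "miller")]

RI, SI, BI = build_similarity_aware_index(records, [first_letter_key], [ratio])
```

For the eight example records only six distinct surnames are encoded and compared, which reflects the saving for repeated attribute values noted above; the similarity values differ from those in Figure 3 because a different comparison function is assumed.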

Algorithm 3: Similarity-Aware Index - Build
Input:
- Data set: D
- Number of attributes of D used: n
- Encoding functions: E_i, i = 1 ... n
- Similarity comparison functions: S_i, i = 1 ... n
Output:
- Record identifier index: RI
- Similarity index: SI
- Block index: BI
 1: Initialise RI = {}
 2: Initialise SI = {}
 3: Initialise BI = {}
 4: for r ∈ D:
 5:   for i = 1 ... n:
 6:     Append r.0 to RI[r.i]
 7:     if r.i ∉ SI:
 8:       c = E_i(r.i)
 9:       b = BI[c]
10:       Append r.i to b
11:       BI[c] = b
12:       Initialise inverted index list si = ()
13:       for v ∈ b:
14:         s = S_i(r.i, v)
15:         Append (v, s) to si
16:         oi = SI[v]
17:         Append (r.i, s) to oi
18:         SI[v] = oi
19:       SI[r.i] = si

The query phase using the similarity-aware index is described in
Algorithm 4. During the query process an accumulator M, a data structure
that contains record identifiers and their (partial) similarities with
the query record,
is generated [29, 32]. Two possible cases can occur for each attribute
of the query record q. The first case occurs when an attribute value is
available in the index, and its similarities with other attribute values
have been calculated in the build phase. In this case, in lines 4 to 6,
the identifiers r.0 of all other records that have the same attribute
value are retrieved and their similarities (exactly 1, as they have the
same attribute value) are added into the accumulator M. A new element
for record identifier r.0 will be added to the accumulator if it does
not exist. Next, all other attribute values in the same block and their
similarities with the query attribute value are retrieved from the
similarity index SI (line 7). For each of these values, their record
identifiers are retrieved from the record identifier index RI, and their
similarities are added into the accumulator in line 11.

The second case occurs when an attribute value in the query record q is
not available in the index, and thus the similarities between this value
and other attribute values need to be calculated (lines 13 to 19). This
is similar to the query phase of the standard blocking index. First, in
lines 13 and 14, the encoding for this unknown attribute value is
calculated, and then all attribute values in its corresponding block are
retrieved from the block index BI. In lines 16 and 17, the similarities
between the attribute value from the query record and each of the other
values in the block are calculated, and the record identifiers of all
corresponding records are retrieved from the record identifier index RI.
The accumulator M is then updated in line 19 for each of these records.
Finally, in line 20, the accumulator is sorted such that the largest
similarities are at the beginning.

Algorithm 4: Similarity-Aware Index - Query
Input:
- Query record: q
- Number of attributes of D used: n
- Record identifier index: RI
- Similarity index: SI
- Block index: BI
- Encoding functions: E_i, i = 1 ... n
- Similarity comparison functions: S_i, i = 1 ... n
Output:
- Ranked list of matches: M
 1: Initialise M = ()
 2: for i = 1 ... n:
 3:   if q.i ∈ RI:
        // Case 1
 4:     ri = RI[q.i]
 5:     for r.0 ∈ ri:
 6:       M[r.0] = M[r.0] + 1.0
 7:     si = SI[q.i]
 8:     for (r.i, s) ∈ si:
 9:       ri = RI[r.i]
10:       for r.0 ∈ ri:
11:         M[r.0] = M[r.0] + s
12:   else:
        // Case 2
13:     c = E_i(q.i)
14:     b = BI[c]
15:     for v ∈ b:
16:       s = S_i(q.i, v)
17:       ri = RI[v]
18:       for r.0 ∈ ri:
19:         M[r.0] = M[r.0] + s
20: Sort M according to similarities (largest first)
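
The two query cases can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' code: the function name, the `(i, value)` index keys, the hand-written toy indexes and the toy encoding and comparison functions are assumptions made for the example.

```python
from collections import defaultdict


def query_index(query, RI, SI, BI, encoders, comparators):
    """Return (identifier, summed similarity) pairs, largest first.

    query: tuple (value_1, ..., value_n) of the query record.
    """
    M = defaultdict(float)  # accumulator: identifier -> summed similarity
    for i, (encode, compare) in enumerate(zip(encoders, comparators), start=1):
        value = query[i - 1]
        if (i, value) in RI:
            # Case 1: value is indexed, similarities were pre-calculated.
            for rec_id in RI[(i, value)]:
                M[rec_id] += 1.0
            for other, s in SI.get((i, value), []):
                for rec_id in RI[(i, other)]:
                    M[rec_id] += s
        else:
            # Case 2: unseen value, compare it with the values in its block.
            for other in BI.get((i, encode(value)), []):
                s = compare(value, other)
                for rec_id in RI[(i, other)]:
                    M[rec_id] += s
    return sorted(M.items(), key=lambda item: item[1], reverse=True)


# Toy functions and a tiny hand-built index (all assumptions).
def first_letter(value):
    return value[0]


def positional_similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


RI = {(1, "smith"): ["r1", "r3"], (1, "smyth"): ["r2"], (1, "jones"): ["r4"]}
SI = {(1, "smith"): [("smyth", 0.8)], (1, "smyth"): [("smith", 0.8)],
      (1, "jones"): []}
BI = {(1, "s"): ["smith", "smyth"], (1, "j"): ["jones"]}

known = query_index(("smith",), RI, SI, BI,
                    [first_letter], [positional_similarity])
unseen = query_index(("smidt",), RI, SI, BI,
                     [first_letter], [positional_similarity])
print(known)   # case 1: no similarity calculation at query time
print(unseen)  # case 2: similarities calculated against the block
```

Only the second query calls the comparison function, which is where the query-time saving of the similarity-aware index comes from.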
The overall efficiency of the similarity-aware index depends upon how
many attribute values of the query record are already stored in the
index (in which case no similarity calculations need to be performed)
compared to how many are new. With increased size of the data set D, and
especially as D covers larger portions of a population, one would assume
that a larger portion of values would be available in the index, thereby
improving the efficiency of this index approach. In Section 4 this
assumption will be evaluated experimentally.
3.3 Optimisations
A variety of optimisation approaches have been developed for inverted
index techniques [3, 29, 32]. These approaches apply compression to
reduce the amount of memory required by the index data structure,
sorting of the inverted index lists, and filtering of candidate records
that are guaranteed not to be in the top ranked matches.

                               Given name         Surname            Suburb name              Postcode
Number of unique values        78,386             404,642            13,109                   2,632
Number of values with count 1  7,116              193,437            931                      18
Six most frequent values       John (149,817)     Smith (65,243)     Toowoomba (29,127)       4350 (35,129)
(and their counts)             Peter (116,985)    Jones (32,234)     Frankston (18,856)       4670 (24,701)
                               David (101,859)    Williams (31,647)  Croydon (15,556)         4740 (23,981)
                               Robert (89,564)    Brown (31,024)     Port Macquarie (15,499)  2250 (23,454)
                               Michael (89,222)   Wilson (26,940)    Reservoir (14,784)       2170 (22,726)
                               Margaret (69,165)  Taylor (26,044)    Glen Waverley (14,756)   4870 (21,639)

Table 1: Characteristics of the data set used for experiments.

Currently, two such optimisation techniques are implemented in the two
index approaches presented in this paper. The first is a minimum
similarity threshold, $t_{min}$ (with $0 < t_{min} < 1$). Within the
query phase of standard blocking, this threshold is used together with
the overall minimum threshold (discussed below) to reduce the number of
matches to be stored in the ranked match list M. Similarities, as
calculated in line 13 of Algorithm 2, are not added to the overall
similarity s of two records if they are below $t_{min}$. Within the
similarity-aware index, the minimum threshold $t_{min}$ is used in the
build phase to only store similarities between attribute values that are
above $t_{min}$. Specifically, lines 15 to 18 in Algorithm 3 are only
executed if the similarity s, as calculated in line 14, is larger than
$t_{min}$. Not storing lower similarities will reduce the memory
requirements of the similarity-aware index, and also speed up the
matching time during the query phase, because the inverted index lists
in the similarity index SI will be shorter.
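
As a small illustration of the minimum similarity threshold in the build phase, the following sketch keeps only the similarities above $t_{min}$ for a new value. The helper name and the toy comparison function are assumptions for the example; the threshold 0.55 is the value used later in the experiments.

```python
def similarities_above(value, block, compare, t_min):
    """Return (other value, similarity) pairs with similarity above t_min."""
    kept = []
    for other in block:
        s = compare(value, other)
        if s > t_min:  # lower similarities are simply not stored
            kept.append((other, s))
    return kept


def positional_similarity(a, b):
    # Toy comparator (an assumption): share of positions with equal characters.
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


block = ["smith", "smyth", "stone"]
stored = similarities_above("smits", block, positional_similarity, t_min=0.55)
print(stored)  # 'stone' (similarity 0.2) is not stored
```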
The second optimisation is an overall minimum threshold, $T_{min}$,
with $0 < T_{min} < n$, where n is the number of record attributes that
are used in the entity resolution process. Within the standard blocking
query phase, this threshold can be used in line 14 of Algorithm 2 to
only append record identifiers to M that have a summed similarity
$s \geq T_{min}$. This will reduce the size of the list of matches M and
thus reduce the time needed to sort M.

Within the query phase of the similarity-aware index, $T_{min}$ is used
to reduce the growth of the accumulator M in lines 6, 11 and 19 of
Algorithm 4. Assume n attributes are being compared, resulting in a
summed similarity $(n \times t_{min}) \leq s \leq n$ for each compared
record pair, with only similarities between individual attribute values
above $t_{min}$ being stored in the index. When calculating the total
similarity between a query record q and the records stored in the index,
line 2 of Algorithm 4 loops over the n attributes used for the matching.
With an overall threshold $T_{min} < n$, a phase threshold, p, can be
calculated as $p = \lceil n - T_{min} \rceil$. As long as the loop
counter $i \leq p$, all attribute similarities need to be added into M,
because potentially any new partial match can reach $T_{min}$. However,
once the loop counter $i > p$, no new record identifiers (and their
similarities) need to be added to the accumulator, because the total
similarity for these records cannot reach $T_{min}$. For example, assume
there are four attributes to be used in the entity resolution process
($n = 4$) and $T_{min} = 2.5$, so $p = \lceil 4 - 2.5 \rceil = 2$. For
the first two attributes ($i = 1, 2$), new record identifiers are added
into the accumulator. However, for the third and fourth attributes
($i = 3, 4$), no new record identifiers will be added to the accumulator
because even if such a new record has an exact match with the query
record in both attributes three and four, the maximum total similarity
of this record will be $s = 2.0$, which is below $T_{min}$. This
optimisation can significantly reduce the final length of the
accumulator.
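
The phase threshold can be expressed in a few lines of Python. This is only a sketch of the rule as stated above; the function names and the dictionary used as accumulator are assumptions for illustration.

```python
import math


def phase_threshold(n, T_min):
    """Phase threshold p = ceil(n - T_min)."""
    return math.ceil(n - T_min)


def update_accumulator(M, rec_id, s, i, p):
    """Add similarity s for rec_id while processing attribute i.

    Records already in the accumulator are always updated; new records
    are only admitted while the loop counter i is at most p.
    """
    if rec_id in M:
        M[rec_id] += s
    elif i <= p:
        M[rec_id] = s


p = phase_threshold(4, 2.5)           # the example from the text
M = {}
update_accumulator(M, "r1", 1.0, 1, p)  # i = 1 <= p: new record admitted
update_accumulator(M, "r2", 1.0, 3, p)  # i = 3 > p: new record ignored
update_accumulator(M, "r1", 0.5, 3, p)  # known record is still updated
print(p, M)
```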
For the standard blocking approach, a further optimisation in the
query phase can be implemented if only the top matching record is (or
records are) of interest. Rather than storing all matches (and their
similarities) in the match list M (line 14 in Algorithm 2), and having
to sort them before returning the ranked list (line 15), only the
match(es) with the highest similarity need to be stored in M, and no
sorting will be required.
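
A minimal sketch of this top-match optimisation, assuming the compared records arrive as (identifier, summed similarity) pairs; the function name and the sample values are hypothetical.

```python
def best_matches(scored_records):
    """Return (best similarity, identifiers with that similarity).

    Only the current best match(es) are kept, so no sorting is needed.
    """
    best_s, best_ids = 0.0, []
    for rec_id, s in scored_records:
        if s > best_s:
            best_s, best_ids = s, [rec_id]   # new best match found
        elif s == best_s and best_ids:
            best_ids.append(rec_id)          # tie with the current best
    return best_s, best_ids


scored = [("r1", 2.1), ("r2", 3.4), ("r3", 3.4), ("r4", 0.9)]
print(best_matches(scored))
```

The single pass keeps memory proportional to the number of tied best matches rather than to the number of compared records.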
4. EXPERIMENTAL EVALUATION
The proposed similarity-aware index approach was experimentally
evaluated and compared with the standard blocking approach. The issues
of interest were the time used to build the index and query it with
records of varying quality, and the accuracy of the retrieved matches.
The experiments were conducted on a Linux server containing two Intel
Xeon quad-core 64-bit CPUs with 2.33 GHz clock frequency, 8 Gigabytes of
main memory, and two SAS drives (446 Gigabytes in total). No other users
were logged onto this machine, and no other jobs were run during the
experiments.

A large real-world data set containing 6,917,514 records was used for
the presented experiments. It contained surnames, postcodes and suburb
(town) names sourced from an Australian telephone directory from 2002
(Australia On Disc, see footnote 1). This data corresponds to all
entries in Australian telephone books in late 2002, and thus has
characteristics similar to many other real-world data collections used
by Australian organisations. Additionally, a list containing about
80,000 different given names and their frequencies of occurrence,
supplied to the authors by a major Australian government agency, was
used to generate and add a given name attribute. For each record in the
data set, a given name was randomly selected (with replacement) from the
given name list according to its frequency, and appended to the record.
As such, this is a typical example data set that contains a large number
of unique and cleaned entities, with similar data being collected by
many other private and public sector organisations in many countries.
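
The frequency-weighted selection of given names can be sketched with the standard library. The function name, the sample names and the frequencies below are made up for illustration; the actual name list is not reproduced here.

```python
import random


def add_given_names(records, names, frequencies, seed=None):
    """Append a given name to each record tuple.

    Names are drawn with replacement, weighted by their frequencies.
    """
    rng = random.Random(seed)
    chosen = rng.choices(names, weights=frequencies, k=len(records))
    return [rec + (name,) for rec, name in zip(records, chosen)]


# Hypothetical records (identifier, surname, postcode, suburb) and names.
records = [("r1", "smith", "2607", "dickson"),
           ("r2", "jones", "4350", "toowoomba")]
names = ["john", "peter", "margaret"]
frequencies = [5, 3, 2]
extended = add_given_names(records, names, frequencies, seed=42)
print(extended)
```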
Table 1 provides an overview of the resulting data set used in the
presented experiments. As expected, all the name attributes exhibit a
strongly skewed distribution of values, with a small number of very
common values and a large number of very rare values. For example, 40%
of all surnames only appear once in the data set, while the top five
most frequent surnames account for nearly 7% of the population. Only
postcodes are more uniformly distributed, which is due to the process by
Australia Post to split populated regions into similar sized postcode
areas.
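
The characteristics reported in Table 1 (number of unique values, number of values that occur once, and the most frequent values) can be computed for any attribute with a counter. The sketch below uses a hypothetical function name and a tiny made-up list of surnames.

```python
from collections import Counter


def value_statistics(values, top=6):
    """Summarise the value distribution of one attribute."""
    counts = Counter(values)
    return {
        "unique": len(counts),
        "singletons": sum(1 for c in counts.values() if c == 1),
        "most_frequent": counts.most_common(top),
    }


surnames = ["smith", "smith", "jones", "smith", "taylor", "jones", "wilson"]
stats = value_statistics(surnames, top=2)
print(stats)
```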
Footnote 1: http://www.australiaondisc.com

[Figure 4 plots omitted: three panels, each showing standard blocking
and the similarity-aware index against the number of records in the data
set (691,751; 2,767,006; 4,842,260; 6,917,514): build time in seconds,
memory usage in MBytes, and average query time in seconds.]
Figure 4: Summary experimental results: Build time (left); memory usage
(middle); and average query time per record (right). Note that all three
graphs are shown with a logarithmic y-axis scale.

Both index approaches were implemented in Python, with version 2.5.2
used for the experiments. For the encoding functions $E_i$, used to
block the test data sets, the Double-Metaphone [7] phonetic encoding was
applied on the three name attributes, while for the postcode attribute
the blocking was based on selecting the last three digits (i.e. all
records where the postcode value has the same last three digits were
inserted into the same block). For the comparison functions, $S_i$, the
Winkler [7] approximate string comparison was used for the three name
attributes, while for postcodes the similarity was calculated by
counting the number of matching digits divided by four. For example, the
similarity of the two postcode values '2346' and '2356' is 0.75.
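
The postcode blocking and comparison functions are simple enough to sketch directly. The function names are hypothetical, and the sketch assumes four-digit postcode strings that are compared position by position, which is consistent with the example given above.

```python
def postcode_block(postcode):
    """Blocking key: the last three digits of the postcode."""
    return postcode[-3:]


def postcode_similarity(a, b):
    """Number of matching digit positions divided by four."""
    matching = sum(x == y for x, y in zip(a, b))
    return matching / 4


print(postcode_block("2607"))               # '607'
print(postcode_similarity("2346", "2356"))  # 0.75
```

Under this blocking key, '2607' and '2601' fall into different blocks, so such a modified query postcode would not be compared with the original value.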
In order to evaluate the scalability of the similarity-aware index,
test data sets of four different sizes were built containing 10%
(691,751), 40% (2,767,006), 70% (4,842,260) and 100% of the records in
the original data set. The full data set was split into ten data sets of
equal size. Next, from each of these ten data sets, ten query records
were randomly selected (giving one hundred base query records in total).
To assess the matching quality, five query sets of one hundred records
each were created by transforming the hundred base records in different
ways. The first set of hundred records was made by exactly copying the
base query records. In the second set one modification was inserted into
one of the four attributes in each record (a different attribute in each
record in a round robin fashion); in the third set two modifications
were inserted; in the fourth set three modifications; and in the fifth
query set all four attribute values were modified in each record. The
modifications, while done manually, were based on the authors'
experience with real-world name data. They mostly corresponded to common
phonetic and typographical variations, for example changes such as
'Dickson' to 'Dixon', nickname substitutions like 'Robert' to 'Bob', or
simple character inserts, deletes, substitutions or transpositions. For
postcodes, only substitutions and transpositions of digits were applied,
such as '2607' changed into '2601'.

Scalability was evaluated by building an index for each test data set
(each containing 10%, 40%, 70%, or 100% of the records in the full data
set) and then querying it with each of the five query sets. The time
used to build each index was recorded, as well as the total amount of
memory used by that index. During the query phase, the time for querying
each record was measured, as well as whether the top ranked returned
record was a true match (i.e. if the record identifier of the best
returned match was the same as the record identifier of the query
record). For the similarity-aware index, the number of case 1 and case 2
matches (as discussed in Section 3.2 and shown in Algorithm 4) was also
recorded. While test runs were conducted with both optimisations turned
off and on, due to space limitations of this paper only results with
activated optimisations are reported. The minimum threshold $t_{min}$
was set to 0.55 and the overall minimum threshold $T_{min}$ to 2.0.
These values were selected such that the experiment on the full database
for the similarity-aware index still fitted into the 8 Gigabytes of main
memory available on the experimental platform.

For the experiments with the smaller test data sets (less than 100% of
the full data set), each experiment was conducted ten times (with
component 10% data sets selected in a round robin fashion) and all
results averaged, while for the full database an experiment was only run
once.
5. RESULTS AND DISCUSSION
A summary of the experimental results is shown in Figure 4. As
expected, building a standard blocking index is significantly faster
than building a similarity-aware index, by a factor ranging from 16
times for the smallest test data set to 20 times when building the index
for the largest test data set. The main reason for this is that during
the build phase of the standard blocking index no similarity
calculations between attribute values are performed. The build time for
both index approaches however does grow sub-linearly with the size of
the data set. For standard blocking, this is because the encodings of
attribute values are cached (line 6 in Algorithm 1), so the more records
are loaded and inserted into the index, the more often cached encoding
values can be retrieved and fewer need to be calculated. For the
similarity-aware index, the calculation of similarities between
attribute values and inserting them into the similarity and blocking
indices SI and BI again only needs to be done the first time a new,
previously unseen attribute value occurs.

Similarly, the amount of memory required by both index approaches
(shown in the middle of Figure 4) grows sub-linearly with the size of
the test data set, because as the data set grows fewer new attribute
values, which need to be processed and stored, will occur. For the test
data sets used in the experiments, the similarity-aware index required
around 1.8 times as much memory on average as the standard blocking
index. The rate of growth for both build time and memory requirements
depends upon the distribution of attribute values in the data set to be
indexed. Given that many real-world databases contain attributes that
follow a Zipf-like or exponential distribution, such as names [9], a
sub-linear growth can be expected in practice. A theoretical analysis of
the growth factor is one avenue of future work planned by the authors.
One of the most important aspects of the novel index approach presented
in this paper is its fast query matching time. As can be seen in the
right graph in Figure 4, the novel approach achieves average query times
below 0.1 seconds even for the index that is based on the full test data
set containing nearly 7 million records. Over the different test data
sets (10%, 40%, 70% and 100% of the full data set size), the
similarity-aware index answers queries between 140 and 150 times faster
than standard blocking. However, for both index approaches, the query
time currently increases linearly with the size of the indexed data
sets. Improving upon this is a current effort by the authors.

[Figure 5 plot omitted: accuracy (y-axis) against the number of
modifications per record (x-axis, 0 to 4) for standard blocking and the
similarity-aware index; panel title "Accuracy for data set with
6,917,514 records".]
Figure 5: Query matching accuracy for the full test data set for varying
number of modifications per record. Similar accuracy results were
achieved for the smaller test data sets.

Given name      Surname          Suburb name
Gail (g400)     Billman (b455)   Boystown (b235)
Gayle (g400)    Pillman (p455)   Boydtown (b350)
0.827           0.905            0.942

Figure 6: An example record pair that will be missed by the
similarity-aware index approach because of different encoding values,
but will be compared by standard blocking. The values in brackets are
the corresponding Soundex encodings, and the similarities (bottom row)
were calculated using the Winkler approximate string comparison
function [7].
The query matching accuracy results are shown in Figure 5 for the
largest test data set with varying number of modifications per record.
As can be seen for both index approaches, matching accuracy gets lower
with an increased number of modifications. This is what one would
expect, as with more modifications per record the likelihood that
another record (with similar attribute values) becomes the best matching
record is increased.

The accuracy for the similarity-aware index is higher compared to
standard blocking for the query sets with one and two modifications, but
then drops more rapidly for the query sets with three and four
modifications. This is due to the requirement of the similarity-aware
index that the values of all attributes for a record pair need to be in
the same block in order to have their similarity added to the
accumulator. If two attribute values are in different blocks, then the
corresponding similarity, which can be high, will not be considered. For
standard blocking, on the other hand, only one pair of attribute values
needs to be in the same block in order that two records are being
compared.
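
[Figure 7 plot omitted: percentage of query case 1 (y-axis) against the
number of modifications per record (x-axis, 1 to 4), for the test data
sets with 10% (691K), 40% (2767K), 70% (4842K) and 100% (6918K) of the
records.]
Figure 7: Proportion of case 1 (query attribute value is available in
similarity-aware index) to case 1 plus case 2 (new unknown attribute
value) for varying number of modifications per record.
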
This is illustrated in Figure 6 with two example records that have an
overall similarity of 2.674 out of a maximum of 3.0. These records would
be compared by standard blocking because at least one attribute (given
name) contains values that are in the same block; whereas they would not
be compared by the similarity-aware index, because two of the three
attributes (surname and suburb name) have different blocking key values
and thus the corresponding similarities would not be added into the
accumulator.
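
The blocking behaviour in Figure 6 can be reproduced with a small Soundex sketch. This is a simplified version of the common Soundex rules written for this illustration (it assumes non-empty, purely alphabetic names), not the encoding code used in the experiments; the similarity values are copied from Figure 6 rather than recalculated.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Four-character Soundex code in lower case, e.g. 'g400'."""
    name = name.lower()
    encoded = name[0]
    previous = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        previous = digit  # vowels reset the previous code
    return (encoded + "000")[:4]


# The record pair and Winkler similarities from Figure 6.
pairs = {"given name": ("Gail", "Gayle", 0.827),
         "surname": ("Billman", "Pillman", 0.905),
         "suburb name": ("Boystown", "Boydtown", 0.942)}

same_block = {attr: soundex(a) == soundex(b) for attr, (a, b, _) in pairs.items()}
total = sum(s for _, _, s in pairs.values())
print(same_block)        # only the given names share a block
print(round(total, 3))   # overall similarity of the pair
```

Only the given name similarity would reach the accumulator of the similarity-aware index for this pair, even though the overall similarity of the two records is high.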
Although this effect may lead to standard blocking having higher
accuracy on more heavily modified query records, it can also lead to
lower accuracy for standard blocking compared to the similarity-aware
index when query records are of relatively good quality, as can be seen
in Figure 5 for the query sets with one or two modifications only.
Improving the similarity-aware index and achieving equal or even better
matching accuracy than standard blocking in all cases is one of the
current research efforts by the authors.

Finally, Figure 7 shows the proportion of query attribute values that
were available in the similarity-aware index (case 1), for which no
similarities had to be calculated at query time. As can be seen, the
more modifications a query record had, the more likely the modified
attribute values were not in the index and thus their similarities had
to be calculated. However, even with modifications in all four query
record attributes, more than 40% of all attribute values were available
in the index and thus their similarities were pre-calculated. This
result shows the efficiency of the similarity-aware index in speeding up
query matching by pre-computing similarities between attribute values
while the index is built.
6. CONCLUSIONS AND FUTURE WORK
In this paper, a novel index approach for real-time entity resolution
has been presented and evaluated experimentally on a large real-world
data set. The experiments showed that this approach can match query
records more than two orders of magnitude faster than a basic standard
index approach that is traditionally used for entity resolution. The
novel approach requires less than double the amount of memory of the
standard index, but building the index can take up to twenty times
longer.

For query records that do not contain too many variations and errors,
the accuracy of the novel index approach can be better than the standard
blocking approach. However, when most or all attribute values in a query
record contain variations and errors, then matching accuracy can drop
significantly. Improving upon this drawback is one of the major avenues
for additional work on this novel index approach. Other areas of future
research include a theoretical analysis of the complexity and
scalability of this index approach, improving the query matching time,
and conducting experiments on a variety of other real-world databases.

To the best of the authors' knowledge, the similarity-aware inverted
index presented in this paper is the first approach aimed at developing
real-time entity resolution on large databases that combines approaches
from information retrieval with traditional entity resolution
techniques.
7. REFERENCES
[1] A. Aizawa and K. Oyama. A fast linkage detection scheme for
multi-source information integration. In WIRI'05, Tokyo, 2005.
[2] R. Baxter, P. Christen, and T. Churches. A comparison of fast
blocking methods for record linkage. In ACM SIGKDD'03 Workshop on Data
Cleaning, Record Linkage and Object Consolidation, Washington DC, 2003.
[3] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity
search. In WWW'07, Banff, Canada, 2007.
[4] A. Behm, S. Ji, C. Li, and J. Lu. Space-constrained gram-based
indexing for efficient approximate string search. In IEEE ICDE'09, pages
604-615, Shanghai, China, 2009.
[5] I. Bhattacharya and L. Getoor. Query-time entity resolution. Journal
of Artificial Intelligence Research, 30:621-657, 2007.
[6] M. Celikik and H. Bast. Fast error-tolerant search on very large
texts. In ACM Symposium on Applied Computing, pages 1724-1731, Honolulu,
Hawaii, 2009.
[7] P. Christen. A comparison of personal name matching: Techniques and
practical issues. In Workshop on Mining Complex Data, held at IEEE
ICDM'06, Hong Kong, 2006.
[8] P. Christen. Automatic record linkage using seeded nearest neighbour
and support vector machine classification. In ACM SIGKDD'08, pages
151-159, Las Vegas, 2008.
[9] P. Christen and R. Gayler. Towards scalable real-time entity
resolution using a similarity-aware inverted index approach. In
AusDM'08, CRPIT vol. 87, Glenelg, Australia, 2008.
[10] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string
distance metrics for name-matching tasks. In IJCAI'03 Workshop on
Information Integration on the Web (IIWeb), pages 73-78, Acapulco, 2003.
[11] W. Cohen and J. Richman. Learning to match and cluster large
high-dimensional data sets for data integration. In ACM SIGKDD'02, pages
475-480, Edmonton, Canada, 2002.
[12] M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage
toolbox. In IEEE ICDE'02, pages 17-28, San Jose, 2002.
[13] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record
detection: A survey. IEEE Transactions on Knowledge and Data
Engineering, 19(1):1-16, 2007.
[14] I. Fellegi and A. Sunter. A theory for record linkage. Journal of
the American Statistical Society, 64(328):1183-1210, 1969.
[15] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas,
S. Muthukrishnan, and D. Srivastava. Approximate string joins in a
database (almost) for free. In VLDB'01, pages 491-500, Roma, Italy,
2001.
[16] L. Gu and R. Baxter. Decision models for record linkage. In
Selected Papers from AusDM, Springer LNCS 3755, pages 146-160, 2006.
[17] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast
indexes and algorithms for set similarity selection queries. In IEEE
ICDE'08, pages 267-276, Cancun, Mexico, 2008.
[18] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large
databases. In ACM SIGMOD'95, San Jose, 1995.
[19] L. Jin, C. Li, and S. Mehrotra. Efficient record linkage in large
data sets. In DASFAA'03, pages 137-146, Tokyo, 2003.
[20] D. Kalashnikov and S. Mehrotra. Domain-independent data cleaning
via analysis of entity-relationship graph. ACM Transactions on Database
Systems, 31(2):716-767, 2006.
[21] C. Kelman, J. Bass, and D. Holman. Research use of linked health
data - A best practice protocol. Aust NZ Journal of Public Health,
26:251-255, 2002.
[22] M. Kumar, S. Moriah, and S. Krishnamoorthy. Performance evaluation
of similarity join for real time information integration. In Bangalore
Annual Compute Conference, Bangalore, India, 2009.
[23] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms
for approximate string searches. In IEEE ICDE'08, pages 257-266, Cancun,
Mexico, 2008.
[24] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of
approximate queries on string collections using variable-length grams.
In VLDB'07, pages 303-314, Vienna, Austria, 2007.
[25] X. Liu, G. Li, J. Feng, and L. Zhou. Effective indices for
efficient approximate string search and similarity join. In IEEE
WAIM'08, pages 127-134, 2008.
[26] E. Rahm and H. H. Do. Data cleaning: Problems and current
approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
[27] M. Weis and F. Naumann. Space and time scalability of duplicate
detection in graph data. Technical Report 25, Hasso-Plattner-Institut,
University of Potsdam, Germany, 2007.
[28] W. E. Winkler. Overview of record linkage and current research
directions. Technical Report RR2006/02, US Bureau of the Census, 2006.
[29] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing
and indexing documents and images. Morgan Kaufmann, 2nd edition, 1999.
[30] C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient similarity joins for
near duplicate detection. In WWW'08, pages 131-140, Beijing, 2008.
[31] X. Yin, J. Han, and P. Yu. LinkClus: Efficient clustering via
heterogeneous semantic links. In VLDB'06, pages 427-438, Seoul, Korea,
2006.
[32] J. Zobel and A. Moffat. Inverted files for text search engines. ACM
Computing Surveys, 38(2), 2006.