
Similarity-aware indexing for real-time entity resolution

2009, … of the 18th ACM conference on …

Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].

TR-CS-09-01. Similarity-Aware Indexing for Real-Time Entity Resolution. Peter Christen, Ross Gayler, David Hawking. August 2009. ANU Computer Science Technical Report Series, published by the School of Computer Science, College of Engineering and Computer Science, The Australian National University.
Similarity-Aware Indexing for Real-Time Entity Resolution

Peter Christen, School of Computer Science, Australian National University, Canberra ACT 0200, Australia
Ross Gayler, Scoring Solutions, Veda Advantage, Melbourne VIC 3000, Australia
David Hawking, Funnelback Pty Ltd, Dickson ACT 2601, Australia

ABSTRACT

Entity resolution, also known as data matching or record linkage, is the task of identifying records from several databases that refer to the same entities. Traditionally, entity resolution has been applied on static databases, for example to find records that relate to the same patient in different health databases. Most research in entity resolution has concentrated on either improving the matching quality, making entity resolution scalable to very large databases, or reducing the manual efforts required throughout the resolution process. Increasingly, however, many organisations are faced with the challenge of having large databases that contain entities, and a stream of query records that have to be matched with these databases in real-time, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, and online retail stores. In this paper, a novel inverted index based approach for real-time entity resolution is presented. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other recently developed approaches to approximate querying, in that it allows any similarity comparison function, and any 'blocking' function, both possibly domain specific, to be incorporated.
Experimental results on a large real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.

General Terms: Algorithms, Experimentation, Performance.

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Storage and Retrieval - Information Search and Retrieval; H.3.1 [Information Systems]: Information Storage and Retrieval - Content Analysis and Indexing.

Accepted as poster at CIKM'09, November 2–6, 2009, Hong Kong.

Keywords: Data matching, record linkage, scalability, similarity query, approximate string matching, inverted indexing.

1. INTRODUCTION

Increasingly, many applications that deal with data management and analysis require that data from different sources is matched and aggregated before it can be used for further processing. The aim of data matching is to identify and match all records that refer to the same real world entities. These entities can, for example, be customers, patients, tax payers, travellers, students, businesses, consumer products, or bibliographic citations. While statisticians and health researchers commonly name the task of matching records as data or record linkage, computer scientists and the database and business oriented IT communities speak of entity resolution, data or field matching, data cleansing, data integration, duplicate detection, data scrubbing, list washing, object identification, or merge/purge processing.

Traditionally, techniques for matching records that correspond to the same entities have been applied in the health sector and within the census [14, 21, 28]. Increasingly, however, entity resolution is now being used within and between many organisations in both the public and private sectors in a large variety of application domains.
Examples include finding duplicates in business mailing lists, bibliographic databases (digital libraries) and online stores; crime and fraud detection within finance and insurance companies as well as government agencies; compilation of longitudinal data for social research; or the assembly of terrorism watch lists for improved national security.

Because real-world data rarely contains unique entity identifiers across all the databases to be matched, most entity resolution approaches compare records using the information available in the databases that partially identify entities, such as their names, address details, or dates of birth. For each of the partially identifying attributes compared between two records, a similarity is calculated. These similarities are then used collectively to classify each compared record pair as a match, non-match, or possible match [14, 16, 28]. The matching process is often challenged because real world data is dirty, i.e. contains missing or out-of-date attribute values, variations and errors, values that are swapped between attributes, or data that is coded differently [26].

Traditional entity resolution approaches assume that two or more static databases are to be matched in batch mode, in order to produce a new matched data set. Increasingly, however, entity resolution is required in an online, real-time environment, where query records have to be matched with one or several large databases, and the most similar records are to be retrieved. One example application of current interest is health surveillance and emergency response systems, where the aim is to find all records that relate to a certain individual, for example a patient showing symptoms of an infectious disease, from a variety of databases.
In order to find other individuals that might have been in contact with that patient, a search needs to be conducted in an airline database, to find the details of other people who have travelled with the patient; the database of the patient's employer, to find potentially infected co-workers; the school database of the patient's children, and so on. In many cases, the search for matching records will rely upon personal details, like the name, address and date of birth of the patient, and thus be subject to errors and variations, as well as out-of-date information. Accurate and real-time approximate matching techniques are required for such situations.

The application domain of specific interest to the authors is consumer financial services. Entity resolution is increasingly important in this domain as such services are being delivered remotely. Once a consumer has established an account with a financial institution, she or he is normally required to use an unambiguous identity token, like an account number. However, the initial establishment of a consumer's identity is difficult. The normal approach taken is entity resolution of identifying information, as provided by the consumer, against one or more databases of related identifying information. The information provided is often subject to variability and error, requiring an approximate matching process. As this process will be driven by automated systems that require sub-second responses, automated and accurate matching, scalability, and real-time entity resolution are major technical challenges for such systems.

This paper presents work that is aimed towards the development of such systems. The basic idea of achieving real-time entity resolution is to combine similarity calculations used for approximate matching with inverted index techniques that are commonly used in the field of information retrieval, for example for large-scale Web search engines [3, 29, 32].
In the past decade, with the popularity and commercial success of such search engines, a large amount of research and development on optimisation techniques has been conducted in this field [3, 32]. Some of these optimisation techniques are used in the work presented in this paper to facilitate real-time entity resolution of large databases.

The contribution of this paper is a novel index approach suitable for real-time entity resolution. This approach significantly improves the matching speed over a similar approach recently presented by two of the authors [9]. Compared to this earlier approach, which was between two and one hundred times faster than a traditional index approach for entity resolution, the novel technique presented here is consistently over two orders of magnitude faster than the traditional index approach. An important aspect of the novel approach presented here is that it allows any similarity comparison function, and any encoding function for 'blocking' [2], both possibly domain specific, to be incorporated. Most other approximate matching approaches developed in recent times are limited to specific similarity functions (such as edit distance, or Jaccard or cosine similarity), and therefore may not be suitable for entity resolution in applications that require specific encoding and comparison functions.

The remainder of this paper is structured as follows. Next, in Section 2, an overview of related research is provided. The proposed novel index approach for real-time entity resolution is then presented in Section 3, and experimentally evaluated in Section 4 using a large real-world database. The results of these experiments are discussed in Section 5, and the paper is concluded with an outlook to future work in Section 6.

2. RELATED WORK

Research into entity resolution is being conducted in various domains, including data mining, machine learning, information retrieval, artificial intelligence, digital libraries, information systems, statistics, and the database community. Several recent overview articles are available [13, 28]. Entity resolution techniques can broadly be classified into learning approaches [5, 8, 10, 11, 12], or database and graph-based methods [20, 27, 31]. So far, most research in this area has focused on the quality of the matching process, i.e. the accuracy of classifying the compared pairs or groups of records into matches and non-matches. The challenges of scalability to very large databases and real-time matching have so far only received limited attention.

For the traditional matching of (large) static databases, indexing is important, because potentially every record from one database needs to be compared with all records from the other database, resulting in a process that is of quadratic complexity in the sizes of the databases to be matched. Indexing techniques, also known as 'blocking', are therefore commonly applied to reduce the number of comparisons to be conducted.

In the standard blocking approach [2], which will be presented in detail in Section 3.1 below, the databases are split into blocks according to some criteria, and only records within the corresponding block are compared with each other. A blocking criterion, also called a blocking key, might be based on a single record attribute (that should contain values of high quality), or based on the concatenation of values from several attributes. In order to overcome the problem of variations and errors in real-world data, one aim is to group similar sounding values into the same block. This can be accomplished by using phonetic encoding functions, such as Soundex, NYSIIS or Double-Metaphone [7]. These functions, which are often language or domain specific, are applied when generating the blocking keys.
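As a minimal illustration of how a phonetic encoding function groups similar sounding values under one blocking key, the following Python sketch implements the commonly used American Soundex rules. It is not taken from the paper: the function name `soundex` and the lower-case output format (chosen so that the codes can be compared with those printed in Figure 1) are assumptions made for this example.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'smith' -> 's530'."""
    # Map each consonant group to its Soundex digit; vowels, h, w and y have no code.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""

    encoded = name[0]
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        if ch not in "hw":
            # 'h' and 'w' do not separate two letters with the same code;
            # vowels do, because they reset `previous` to the empty string.
            previous = digit
    return (encoded + "000")[:4]


for surname in ("smith", "smyth", "millar", "miller", "myler", "peter"):
    print(surname, soundex(surname))
```

Values that receive the same code would be placed into the same block, so a single typographical variation such as 'smith' versus 'smyth' does not separate two records.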
Examples of such phonetic encodings are shown in Figure 1.

The standard blocking approach has two major drawbacks. First, the size of the generated blocks depends upon the frequency distribution of the attribute values used in the blocking key. For example, using surname values in a blocking key will likely generate a very large block containing the common surname 'Smith', resulting in a very large number of comparisons that need to be conducted for this block. Second, if a value in an attribute used as a blocking key contains errors or variations that result in a different encoding, then the corresponding record will be inserted into a different block, and potentially true matches will be missed. This problem is normally overcome, at increased computational costs, by having two or more different blocking keys based on different record attributes.

Various alternative blocking approaches have been developed in recent times, aimed at improving the scalability of the matching process and increasing matching accuracy. With the sorted neighbourhood approach [18], the databases are sorted according to the values in the blocking key, and a fixed-size window is moved over the databases. All records within the current window will then be compared with each other. An approach related to this is to insert the blocking key values and their suffixes into a suffix array based inverted index [1], and to then generate blocks from all records that have the same suffix value. With this approach, each record will be inserted into several blocks, depending upon the length of its suffix values.

Another approach is to allow for 'fuzzy' blocking by converting blocking key values into q-gram lists and, using sub-lists of these q-gram lists, to insert each record into several blocks according to a Jaccard-based similarity threshold [2].
While this approach can improve the accuracy of the resulting matching, its computational complexity (a large number of q-gram sub-lists need to be generated) makes it unsuitable for large databases. Another idea for indexing is to apply clustering by using a computationally efficient similarity measure to generate high-dimensional overlapping clusters (called 'canopies'), and to then extract blocks of records from these clusters [11]. Each record will be inserted into several clusters and thus several blocks, resulting in higher matching accuracy but at higher computational costs. Another recent approach is to map blocking key values into a high-dimensional Euclidean space such that the distances between all pairs of strings are preserved [19]. The records in a block then correspond to all objects in this space that are similar to each other.

Conducting entity resolution not on static databases but at query time has so far received very limited attention, with only two recent publications presenting approaches specific to such situations. The authors have earlier shown that using an inverted index approach can significantly speed up the query matching process [9]. A second approach is based on unsupervised relational clustering, which assumes that the data to be matched contains relational information that explicitly links different types of entities [5]. The idea of this approach is to utilise the relational links between records to improve the entity resolution process. At query time, matching is conducted in an iterative fashion on a database that contains unresolved entities. While this approach can achieve much better matching accuracy compared to traditional entity resolution approaches (that only consider attribute similarities), it has much higher computational costs. Matching times of around 30 seconds for one query record on a database containing around 800,000 records have been reported [5]. This approach is therefore impractical for real-time entity resolution on very large databases.
A large body of work has been conducted in the database community on similarity queries and their scalable and efficient implementations [4, 6, 15, 17, 23, 24, 25, 30]. Many of the presented approaches optimise indexing and filtering techniques for specific types of similarity measures, such as edit distance, or q-gram or cosine based similarities. They also mainly deal with the situation of either finding similar tuples between a set of query records and a database table, or two large tables. One real-time similarity join approach based on a modified trie hash-join has been presented very recently [22]. It calculates q-gram similarities, and then applies several filtering steps to achieve fast query times.

Thus far, scalability to very large databases has not been addressed by most recent research in the area of entity resolution, and most publications in this area have presented experimental results based on only small to medium sized data sets containing up to one million records [5, 20, 27, 31]. Most of the recently developed advanced entity resolution techniques have a computational complexity that makes them impractical for matching very large databases that contain many million records. Additionally, most approaches published so far, with the exception of two very recent techniques [5, 9], assume the situation of matching static databases in batch mode.

3. INDEXING FOR REAL-TIME ENTITY RESOLUTION

Indexing, as presented in the previous section, is required for real-time entity resolution systems to speed up the matching process by reducing the number of candidate records that need to be matched with a query record. The objective of real-time entity resolution is to match a stream of query records as quickly as possible to one or several (large) databases that contain records about existing entities, and potentially to a range of external data sources that contain additional information that can be used to verify the matched entities.
The response time for matching a single query record has to be as short as possible, ideally sub-second. The matching approach must facilitate approximate matching and efficiently scale up to very large databases that contain many millions of records. In addition, the matching should generate a match score that indicates the likelihood that a matched record in the database refers to the same entity as the query record.

Real-time entity resolution has much in common with the functionality of large-scale Web search engines. However, the databases upon which entity resolution is commonly applied do not contain Web or text documents that include a large number of terms and thus provide a rich variety of features. Rather, these databases are made of structured records with well defined attributes that often only contain short strings or numbers, such as the personal details of people (for example name, address, or date of birth values).

In this section, the traditional standard blocking approach to indexing for entity resolution is presented first to illustrate the basic ideas of using an inverted index for entity resolution. Based on this approach, a similarity-aware inverted index approach that is suitable for real-time entity resolution is then discussed in detail in Section 3.2. Both index approaches are illustrated in Figures 2 and 3.

Both index approaches presented here are based on a standard inverted index [32], where the keys of the index are (possibly encoded) attribute values, and the corresponding lists contain the record identifiers of all records that have this (encoded) value. Two types of function are required by both index approaches. First, (phonetic) encoding functions are needed that group similar attribute values together. For string attributes, such as personal names or street and suburb names, phonetic encodings like Soundex, NYSIIS or Double-Metaphone are commonly used [7]. Figure 1, for example, shows the Soundex encodings of eight surname values.
As can be seen, this encoding function groups the values 'smith' and 'smyth' into one block, and 'millar', 'miller' and 'myler' into another.

    Record ID   Surname   Soundex encoding
    r1          smith     s530
    r2          miller    m460
    r3          peter     p360
    r4          myler     m460
    r5          smyth     s530
    r6          millar    m460
    r7          smith     s530
    r8          miller    m460

Figure 1: Example records with surname values and their Soundex encodings, used to illustrate the two index approaches in Figures 2 and 3.

    m460: r2, r4, r6, r8
    p360: r3
    s530: r1, r5, r7

Figure 2: Standard blocking index resulting from the example records given in Figure 1. The blocking keys correspond to Soundex encodings.

The second type of functions required are similarity comparisons that calculate a normalised similarity between two attribute values, such that 1 corresponds to an exact similarity and 0 to total dissimilarity [7]. Note that for different attribute types (strings, dates, numbers, etc.) different such comparison functions can be used. Additionally, domain specific comparison functions are often applied to improve the matching quality. One example of such a function would be a date of birth comparison, where a mismatch in the month or day of birth is less severe than a mismatched year of birth.

The real-time entity resolution process as discussed in this paper consists of two phases. First, in the build phase, an index is generated using a static database that contains a possibly large number of cleaned records that are assumed to refer to resolved entities, i.e. one single record per real-world entity only. Once built, the index is queried in the second phase with a stream of query records. These records can either refer to an entity stored in the index, or to a new and unknown entity. It is assumed, however, that the query records can contain variations and typographical errors, or wrong, out-of-date or missing values.
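The date of birth comparison mentioned above, where a mismatched year counts more heavily than a mismatched month or day, can be sketched as a normalised comparison function in Python. This is an illustration only: the function name `dob_similarity` and the weights (3 for the year, 1 each for month and day) are hypothetical choices for this example and are not values given in the paper.

```python
from datetime import date


def dob_similarity(a: date, b: date,
                   year_weight: float = 3.0,
                   month_weight: float = 1.0,
                   day_weight: float = 1.0) -> float:
    """Return a similarity in [0, 1]: 1 for identical dates, 0 if no component agrees.

    The weights are illustrative assumptions; a higher year weight makes a
    mismatched year of birth more severe than a mismatched month or day.
    """
    score = 0.0
    if a.year == b.year:
        score += year_weight
    if a.month == b.month:
        score += month_weight
    if a.day == b.day:
        score += day_weight
    return score / (year_weight + month_weight + day_weight)


same = dob_similarity(date(1970, 5, 17), date(1970, 5, 17))
month_off = dob_similarity(date(1970, 5, 17), date(1970, 6, 17))
year_off = dob_similarity(date(1970, 5, 17), date(1971, 5, 17))
print(same, month_off, year_off)
```

Any function with this shape (two attribute values in, a number between 0 and 1 out) could be supplied as one of the comparison functions used by the two index approaches.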
Missing values can be handled by replacing them with a special character (that is outside of the character set used for an attribute) in both database and query records. For each query record, the matching process returns a ranked list of potential matches and their similarities with the query record. A match is successful if one of the top ranked records refers to the same entity as the query record.

In the following two sections, the two index approaches are described in detail, and in Section 3.3 two optimisations for reducing the query matching time are discussed. Experiments on these two index approaches are then presented in Section 4. In Algorithms 1 to 4, the attributes of a record r are denoted by r.i, with r.0 assumed to be an identifier attribute that allows unique identification of each record. A list with key k in an inverted index X is denoted with X[k]. An empty list is denoted by () and an empty index by {}.

    SI (similarity index)
      millar: (miller, 0.9), (myler, 0.7)
      miller: (millar, 0.9), (myler, 0.8)
      myler:  (millar, 0.7), (miller, 0.8)
      peter:  ()
      smith:  (smyth, 0.9)
      smyth:  (smith, 0.9)

    BI (block index)
      m460: millar, miller, myler
      p360: peter
      s530: smith, smyth

    RI (record identifier index)
      millar: r6
      miller: r2, r8
      myler:  r4
      peter:  r3
      smith:  r1, r7
      smyth:  r5

Figure 3: Similarity-aware index resulting from the example records from Figure 1. The similarity index is shown in the top left, the block index in the middle right, and the record identifier index at the bottom.

3.1 Standard Blocking

The basic idea of standard blocking is that each record in a database is inserted into a block according to the value of its blocking key [2], as illustrated in Figure 2. Encoding functions [7] are used to group similar attribute values into the same block. Each block corresponds to an inverted index list, with the key being the (encoded) blocking key value, while the values in the corresponding list are the record identifiers of all records in this block.

Standard blocking is used in this paper because it is a baseline approach upon which many other recently developed index approaches for entity resolution can be built.
For example, canopy clustering [11], suffix-array blocking [1], and the sorted neighbourhood [18] approach, as discussed in Section 2, can all be implemented as extensions to a basic inverted index. Therefore, standard blocking can be seen as the basic approach for traditional batch-oriented entity resolution of static databases. Other index approaches based on it have higher computational requirements.

The build phase of standard blocking is shown in Algorithm 1. The input to this algorithm is a data set D containing n attributes that will be used in the entity resolution process, and n corresponding encoding functions, E_i. A basic inverted index I, as illustrated in Figure 2, is generated in the build phase, where record identifier values are inserted into inverted index lists according to their corresponding encoded attribute values. This encoding of attribute values might be computationally expensive. Therefore, in order to prevent repeated calculation of encodings of attribute values, once an encoding has been computed, it is stored in the encodings cache C (line 9) so it can be retrieved quickly for subsequent occurrences of the same attribute value r.i (line 6). The algorithm returns the inverted index data structure I containing record identifiers, r.0, in the inverted index lists, and the cache C containing the computed encodings. No similarity calculations are performed during the build phase of this index approach.

Note that for simplicity the index and cache are shared between attributes. Depending upon the characteristics of the data to be processed and matched, however, it might be favourable to have separate index and cache data structures per attribute.
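The build phase described above can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 and not the authors' implementation: the record layout (a tuple whose first element is the identifier r.0), the function names, and in particular the `first_letter` encoder are assumptions. The encoder is a hypothetical stand-in for Soundex that happens to produce the same three blocks on the Figure 1 surnames, only with shorter keys.

```python
from collections import defaultdict


def build_standard_blocking(dataset, encoders):
    """Build the inverted index I and the encodings cache C.

    dataset:  iterable of records (r.0, r.1, ..., r.n), identifier first
    encoders: list of n encoding functions E_1 ... E_n
    """
    index = defaultdict(list)   # I: encoded value -> list of record identifiers
    cache = {}                  # C: attribute value -> its encoding
    for record in dataset:
        rec_id = record[0]
        for i, encode in enumerate(encoders, start=1):
            value = record[i]
            if value in cache:              # encoding already computed earlier
                code = cache[value]
            else:
                code = encode(value)
                cache[value] = code
            index[code].append(rec_id)
    return index, cache


def first_letter(value):
    """Hypothetical stand-in for a phonetic encoding such as Soundex."""
    return value[:1]


# Example records from Figure 1: (record identifier, surname).
FIGURE_1 = [("r1", "smith"), ("r2", "miller"), ("r3", "peter"), ("r4", "myler"),
            ("r5", "smyth"), ("r6", "millar"), ("r7", "smith"), ("r8", "miller")]

index, cache = build_standard_blocking(FIGURE_1, [first_letter])
print(dict(index))
```

As in the text, a single index and cache are shared between attributes here; keeping one pair per attribute would only require indexing both structures by the attribute number as well.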
Algorithm 1: Standard blocking - Build

    Input:
      - Data set: D
      - Number of attributes of D used: n
      - Encoding functions: E_i, i = 1 ... n
    Output:
      - Standard blocking index: I
      - Encodings cache: C

     1: Initialise I = {}
     2: Initialise C = {}
     3: for r ∈ D:
     4:   for i = 1 ... n:
     5:     if r.i ∈ C then:
     6:       c = C[r.i]
     7:     else:
     8:       c = E_i(r.i)
     9:       C[r.i] = c
    10:     Append r.0 to I[c]

Algorithm 2 describes the query phase. As input, it requires a query record q, the inverted index I, the data set D, the encodings cache C, the number of attributes to be used for entity resolution n, the encoding functions E_i, and the comparison functions used to calculate the similarities between attribute values, S_i.

The query phase consists of two steps. First, in lines 1 to 8, the encoded attribute values (possibly available in the encodings cache C) of the query record q are used to retrieve the record identifier lists from the corresponding blocks in the inverted index I. The union of these lists, b, contains all identifiers of the candidate records that will be compared with the query record in the second step of the algorithm (lines 9 to 14). The required n attribute values for each candidate record r need to be retrieved from data set D (line 10). An efficient index on D is therefore required that allows fast access to a random record r using its identifier r.0. For each candidate record that is compared with the query record, a similarity, s, is calculated over all compared attributes (line 13) and inserted into the list of matches M (line 14). For simplicity a simple summing of s is assumed; in reality, however, other aggregation functions, like weighted sums, can be applied. Finally, in line 15, the list of matches M is sorted such that the largest similarity values are at the beginning.
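The two query steps described above can be sketched in Python as follows. The sketch is an illustration under several assumptions that are not part of the paper: the data set is held as a dictionary from record identifier to attribute values (standing in for the efficient index on D), the blocks are keyed by a hypothetical first-letter encoding instead of Soundex, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in comparison function S_i.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in comparison function: 1.0 for identical strings, lower otherwise."""
    return SequenceMatcher(None, a, b).ratio()


def first_letter(value):
    """Hypothetical stand-in for a phonetic encoding such as Soundex."""
    return value[:1]


def query_standard_blocking(query, dataset, index, cache, encoders, comparators):
    """query: tuple (q.1, ..., q.n); dataset: record identifier -> (r.1, ..., r.n)."""
    # Step 1: collect the identifiers of all candidate records (the union b).
    candidates = set()
    for value, encode in zip(query, encoders):
        code = cache[value] if value in cache else encode(value)
        candidates.update(index.get(code, ()))
    # Step 2: compare every candidate with the query record, summing similarities.
    matches = []
    for rec_id in candidates:
        record = dataset[rec_id]            # random access by identifier r.0
        score = sum(compare(r_val, q_val)
                    for compare, r_val, q_val in zip(comparators, record, query))
        matches.append((rec_id, score))
    # Largest similarity first; ties are ordered by identifier for reproducibility.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches


# Figure 1 records and the blocks of Figure 2, keyed by first letter here.
DATASET = {"r1": ("smith",), "r2": ("miller",), "r3": ("peter",), "r4": ("myler",),
           "r5": ("smyth",), "r6": ("millar",), "r7": ("smith",), "r8": ("miller",)}
INDEX = {"m": ["r2", "r4", "r6", "r8"], "p": ["r3"], "s": ["r1", "r5", "r7"]}
CACHE = {surname: first_letter(surname) for (surname,) in DATASET.values()}

ranked = query_standard_blocking(("smyth",), DATASET, INDEX, CACHE,
                                 [first_letter], [string_similarity])
print(ranked)
```

Note that every candidate in the block is compared with the query record at query time; this per-query cost is what the similarity-aware index of the next section aims to avoid.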
Algorithm 2: Standard blocking - Query

    Input:
      - Query record: q
      - Data set: D
      - Number of attributes of D used: n
      - Standard blocking index: I
      - Encodings cache: C
      - Encoding functions: E_i, i = 1 ... n
      - Similarity comparison functions: S_i, i = 1 ... n
    Output:
      - Ranked list of matches: M

     1: Initialise M = ()
     2: Initialise candidate record identifier list b = ()
     3: for i = 1 ... n:
     4:   if q.i ∈ C then:
     5:     c = C[q.i]
     6:   else:
     7:     c = E_i(q.i)
     8:   b = b ∪ I[c]
     9: for r.0 ∈ b:
    10:   Retrieve r from D using identifier r.0
    11:   s = 0
    12:   for i = 1 ... n:
    13:     s = s + S_i(r.i, q.i)
    14:   Append (r.0, s) to M
    15: Sort M according to similarities (largest first)

3.2 Similarity-Aware Index

This index is based on the idea of pre-calculating the similarities between all unique attribute value combinations within each block once during the build phase, so that the similarities do not need to be re-calculated for every query record, thereby significantly reducing the matching time required in the query phase.

As illustrated in Figure 3, this approach contains three inverted index data structures. The record identifier index, RI, is similar to the inverted index I used in standard blocking, but the keys of this index are the actual attribute values and not their encodings. The block index, BI, is the data structure that represents the blocks by having encoded attribute values as keys and the actual attribute values that have the same encoding in the corresponding inverted index lists. Each list in this index therefore contains all attribute values that are in the same block. The similarity index, SI, stores the similarities of pairs of attribute values that are in the same block. Specifically, for each attribute value, it contains a list of other attribute values (in the same block) and the similarities between these two values.

Algorithm 3 describes how a similarity-aware index is built. The algorithm requires the same input as the build algorithm for standard blocking.
Additionally, the similarity comparison functions, S_i, are also required, because similarities between attribute values are calculated during the build phase rather than the query phase. For each record r in data set D, its identifier r.0 is added to the inverted index list in RI that corresponds to attribute value r.i (line 6). It is important to note that all the following steps (lines 8 to 19) only need to be done if the attribute value r.i has not been processed before (line 7). This will significantly reduce the computational effort if attribute values appear in a data set several times, which is the case for attributes that have a Zipf-like or exponential distribution of values, as is common for example for attributes that contain names [9].

For a new attribute value r.i that has so far not been indexed, the first step (lines 8 to 11) is to calculate its encoding and to retrieve all other values in its block. The new value is then added into the inverted index list b of this block, and the updated list is stored back into the block index BI. The similarities between the new attribute value r.i and all attribute values v already in this block are calculated next (lines 13 and 14), and inserted into both the new value's similarity list si (line 15) and the other value's list oi (line 17). Finally, the similarity list si of the new value r.i is added to the similarity index SI in line 19.

The query phase using the similarity-aware index is described in Algorithm 4.
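The build steps of the similarity-aware index can be sketched in Python as follows. This is an illustration and not the authors' code. Three assumptions are made and should be read as such: the first-letter encoder is a hypothetical stand-in for Soundex, `difflib.SequenceMatcher` is a stand-in comparison function, and the new value is appended to its block only after the similarity loop, so that a value is never compared with itself (Figure 3 shows no self-similarity entries).

```python
from collections import defaultdict
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in comparison function: 1.0 for identical strings, lower otherwise."""
    return SequenceMatcher(None, a, b).ratio()


def first_letter(value):
    """Hypothetical stand-in for a phonetic encoding such as Soundex."""
    return value[:1]


def build_similarity_aware_index(dataset, encoders, comparators):
    """Return (RI, SI, BI) for records given as (r.0, r.1, ..., r.n)."""
    record_index = defaultdict(list)   # RI: attribute value -> record identifiers
    similarity_index = {}              # SI: value -> [(other value, similarity)]
    block_index = defaultdict(list)    # BI: encoding -> attribute values in block
    for record in dataset:
        rec_id = record[0]
        for i, (encode, compare) in enumerate(zip(encoders, comparators), start=1):
            value = record[i]
            record_index[value].append(rec_id)
            if value in similarity_index:
                continue               # value seen before: nothing to compute
            block = block_index[encode(value)]
            own_list = []
            for other in block:        # values already stored in this block
                sim = compare(value, other)
                own_list.append((other, sim))
                similarity_index[other].append((value, sim))
            block.append(value)        # assumption: added after the loop
            similarity_index[value] = own_list
    return record_index, similarity_index, block_index


FIGURE_1 = [("r1", "smith"), ("r2", "miller"), ("r3", "peter"), ("r4", "myler"),
            ("r5", "smyth"), ("r6", "millar"), ("r7", "smith"), ("r8", "miller")]

RI, SI, BI = build_similarity_aware_index(FIGURE_1, [first_letter], [string_similarity])
print(dict(RI))
print(dict(BI))
print(SI)
```

On the Figure 1 records only six distinct surnames are processed, so the second occurrences of 'smith' and 'miller' cost a single list append each, which is the saving the text attributes to skewed value distributions.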
Algorithm 3: Similarity-Aware Index - Build
Input:
  - Data set: D
  - Number of attributes of D used: n
  - Encoding functions: Ei, i = 1 ... n
  - Similarity comparison functions: Si, i = 1 ... n
Output:
  - Record identifier index: RI
  - Similarity index: SI
  - Block index: BI

   1: Initialise RI = {}
   2: Initialise SI = {}
   3: Initialise BI = {}
   4: for r ∈ D:
   5:   for i = 1 ... n:
   6:     Append r.0 to RI[r.i]
   7:     if r.i ∉ SI:
   8:       c = Ei(r.i)
   9:       b = BI[c]
  10:       Append r.i to b
  11:       BI[c] = b
  12:       Initialise inverted index list si = ()
  13:       for v ∈ b:
  14:         s = Si(r.i, v)
  15:         Append (v, s) to si
  16:         oi = SI[v]
  17:         Append (r.i, s) to oi
  18:         SI[v] = oi
  19:       SI[r.i] = si

During the query process an accumulator M, a data structure that contains record identifiers and their (partial) similarities with the query record, is generated [29, 32]. Two possible cases can occur for each attribute of the query record q.

The first case occurs when an attribute value is available in the index, and its similarities with other attribute values have been calculated in the build phase. In this case, in lines 4 to 6, the identifiers r.0 of all other records that have the same attribute value are retrieved and their similarities (exactly 1, as they have the same attribute value) are added into the accumulator M. A new element for record identifier r.0 will be added to the accumulator if it doesn't exist. Next, all other attribute values in the same block and their similarities with the query attribute value are retrieved from the similarity index SI (line 7). For each of these values, their record identifiers are retrieved from the record identifier index RI, and their similarities are added into the accumulator in line 11.

The second case occurs when an attribute value in the query record q is not available in the index, and thus the similarities between this value and other attribute values need to be calculated (lines 13 to 19). This is similar to the query phase of the standard blocking index.
First, in lines 13 and 14, the encoding for this unknown attribute value is calculated, and then all attribute values in its corresponding block are retrieved from the block index BI. In lines 16 and 17, the similarities between the attribute value from the query record and each of the attribute values in the block are calculated, and the record identifiers of all corresponding records are retrieved from the record identifier index RI. The accumulator M is then updated in line 19 for each of these records. Finally, in line 20, the accumulator is sorted such that the largest similarities are at the beginning.

Algorithm 4: Similarity-Aware Index - Query
Input:
  - Query record: q
  - Number of attributes of D used: n
  - Record identifier index: RI
  - Similarity index: SI
  - Block index: BI
  - Encoding functions: Ei, i = 1 ... n
  - Similarity comparison functions: Si, i = 1 ... n
Output:
  - Ranked list of matches: M

   1: Initialise M = ()
   2: for i = 1 ... n:
   3:   if q.i ∈ RI:                      // Case 1
   4:     ri = RI[q.i]
   5:     for r.0 ∈ ri:
   6:       M[r.0] = M[r.0] + 1.0
   7:     si = SI[q.i]
   8:     for (r.i, s) ∈ si:
   9:       ri = RI[r.i]
  10:       for r.0 ∈ ri:
  11:         M[r.0] = M[r.0] + s
  12:   else:                             // Case 2
  13:     c = Ei(q.i)
  14:     b = BI[c]
  15:     for v ∈ b:
  16:       s = Si(q.i, v)
  17:       ri = RI[v]
  18:       for r.0 ∈ ri:
  19:         M[r.0] = M[r.0] + s
  20: Sort M according to similarities (largest first)

The overall efficiency of the similarity-aware index depends upon how many attribute values of the query record are already stored in the index (in which case no similarity calculations need to be performed) compared to how many are new. With increased size of the data set D, and especially as D is covering larger portions of a population, one would assume that a larger portion of values would be available in the index, thereby improving the efficiency of this index approach. In Section 4 this assumption will be evaluated experimentally.

3.3 Optimisations

A variety of optimisation approaches have been developed for inverted index techniques [3, 29, 32].
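
As a concrete companion to Algorithm 4 in Section 3.2, the two query cases can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name query_index, the hand-written toy index, the first-letter encoding and the difflib-based similarity are all hypothetical stand-ins, and no thresholds are applied.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in similarity function (assumption, not the Winkler comparator)."""
    return SequenceMatcher(None, a, b).ratio()


def query_index(query, RI, BI, SI, encoders, comparators):
    """Return (record identifier, summed similarity) pairs, best match first."""
    M = defaultdict(float)  # the accumulator
    for i, value in enumerate(query):
        if value in RI[i]:
            # Case 1: the value is indexed, so stored similarities are reused.
            for rec_id in RI[i][value]:
                M[rec_id] += 1.0
            for other, s in SI[i][value]:
                for rec_id in RI[i][other]:
                    M[rec_id] += s
        else:
            # Case 2: unseen value, compared with every value in its block.
            for other in BI[i].get(encoders[i](value), []):
                s = comparators[i](value, other)
                for rec_id in RI[i][other]:
                    M[rec_id] += s
    return sorted(M.items(), key=lambda item: item[1], reverse=True)


# Hand-written toy index for one attribute (surname), blocked on first letter.
RI = [{"smith": ["r1", "r3"], "smyth": ["r2"], "jones": ["r4"]}]
BI = [{"s": ["smith", "smyth"], "j": ["jones"]}]
SI = [{"smith": [("smyth", 0.8)], "smyth": [("smith", 0.8)], "jones": []}]
encoders = [lambda v: v[:1]]
comparators = [ratio]

known = query_index(("smith",), RI, BI, SI, encoders, comparators)    # case 1
unseen = query_index(("smithe",), RI, BI, SI, encoders, comparators)  # case 2
```

The first call performs no similarity calculations at all, while the second has to compare 'smithe' with both values in block 's', which mirrors the efficiency argument made for case 1 versus case 2.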
These optimisation approaches apply compression to reduce the amount of memory required by the index data structure, sorting of the inverted index lists, and filtering of candidate records that are guaranteed not to be in the top ranked matches. Currently, two such optimisation techniques are implemented in the two index approaches presented in this paper.

The first is a minimum similarity threshold, tmin (with 0 < tmin < 1). Within the query phase of standard blocking, this threshold is used together with the overall minimum threshold (discussed below) to reduce the number of matches to be stored in the ranked match list M. Similarities, as calculated in line 13 of Algorithm 2, are not added to the overall similarity s of two records if they are below tmin. Within the similarity-aware index, the minimum threshold tmin is used in the build phase to only store similarities between attribute values that are above tmin. Specifically, lines 15 to 18 in Algorithm 3 are only executed if the similarity s, as calculated in line 14, is larger than tmin. Not storing lower similarities will reduce the memory requirements of the similarity-aware index, and also speed up the matching time during the query phase, because the inverted index lists in the similarity index SI will be shorter.

Table 1: Characteristics of the data set used for experiments.

                                 Given name          Surname             Suburb name                Postcode
  Number of unique values        78,386              404,642             13,109                     2,632
  Number of values with count 1  7,116               193,437             931                        18
  Six most frequent values       John (149,817)      Smith (65,243)      Toowoomba (29,127)         4350 (35,129)
  (and their counts)             Peter (116,985)     Jones (32,234)      Frankston (18,856)         4670 (24,701)
                                 David (101,859)     Williams (31,647)   Croydon (15,556)           4740 (23,981)
                                 Robert (89,564)     Brown (31,024)      Port Macquarie (15,499)    2250 (23,454)
                                 Michael (89,222)    Wilson (26,940)     Reservoir (14,784)         2170 (22,726)
                                 Margaret (69,165)   Taylor (26,044)     Glen Waverley (14,756)     4870 (21,639)
The second optimisation is an overall minimum threshold, Tmin, with 0 < Tmin < n, and n being the number of record attributes that are used in the entity resolution process. Within the standard blocking query phase, this threshold can be used in line 14 of Algorithm 2 to only append record identifiers to M that have a summed similarity s ≥ Tmin. This will reduce the size of the list of matches M and thus reduce the time needed to sort M.

Within the query phase of the similarity-aware index, Tmin is used to reduce the growth of the accumulator M in lines 6, 11 and 19 of Algorithm 4. Assume n attributes are being compared, resulting in a summed similarity (n × tmin) ≤ s ≤ n for each compared record pair, with only similarities between individual attribute values above tmin being stored in the index. When calculating the total similarity between a query record q and the records stored in the index, line 2 of Algorithm 4 loops over the n attributes used for the matching. With an overall threshold Tmin < n, a phase threshold, p, can be calculated as p = ⌈n − Tmin⌉. As long as the loop counter i ≤ p, all attribute similarities need to be added into M, because potentially any new partial match can reach Tmin. However, once loop counter i > p, no new record identifiers (and their similarities) need to be added to the accumulator, because the total similarity for these records cannot reach Tmin.

For example, assume there are four attributes to be used in the entity resolution process (n = 4) and Tmin = 2.5, so p = ⌈4 − 2.5⌉ = 2. For the first two attributes (i = 1, 2), new record identifiers are added into the accumulator. However, for the third and fourth attributes (i = 3, 4), no new record identifiers will be added to the accumulator because even if such a new record has an exact match with the query record in both attributes three and four, the maximum total similarity of this record will be s = 2.0, which is below Tmin.
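
The phase threshold can be illustrated with a small sketch. The function names and the input format (one dictionary of partial similarities per attribute) are assumptions made for this example only; the gating rule itself follows the description above.

```python
import math


def phase_threshold(n, t_overall):
    """Phase threshold p = ceil(n - Tmin), as defined in the text."""
    return math.ceil(n - t_overall)


def accumulate(partial_sims, t_overall):
    """Sum similarities attribute by attribute, gating new identifiers by p.

    partial_sims[i - 1] maps record identifier -> similarity with the query
    for attribute i (a hypothetical input format chosen for this sketch).
    """
    n = len(partial_sims)
    p = phase_threshold(n, t_overall)
    M = {}
    for i, sims in enumerate(partial_sims, start=1):
        for rec_id, s in sims.items():
            if rec_id in M:
                M[rec_id] += s  # known identifier: keep accumulating
            elif i <= p:
                M[rec_id] = s   # new identifier: only admitted while i <= p
    return M


# Worked example from the text: n = 4 attributes and Tmin = 2.5, so p = 2.
partial_sims = [
    {"a": 1.0},
    {"a": 0.9, "b": 1.0},
    {"a": 0.8, "b": 1.0, "c": 1.0},
    {"b": 0.7, "c": 1.0},
]
M = accumulate(partial_sims, t_overall=2.5)
```

Record 'c' first appears at attribute three and is never admitted, since it could reach at most 2.0. One detail worth checking in any implementation: when n − Tmin is a whole number (for example n = 4 and Tmin = 2.0, giving p = 2), a record first seen at attribute p + 1 could still reach exactly Tmin, so whether it should be admitted depends on whether the cut-off is treated as inclusive.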
This optimisation can significantly reduce the final length of the accumulator.

For the standard blocking approach, a further optimisation in the query phase can be implemented if only the top matching record is (or records are) of interest. Rather than storing all matches (and their similarities) in the match list M (line 14 in Algorithm 2), and having to sort them before returning the ranked list (line 15), only the match(es) with the highest similarity need to be stored in M, and no sorting will be required.

4. EXPERIMENTAL EVALUATION

The proposed similarity-aware index approach is experimentally evaluated and compared with the standard blocking approach. The issues of interest were the time used to build the index and query it with records of varying quality, and the accuracy of the retrieved matches. The experiments were conducted on a Linux server containing two Intel Xeon quad-core 64-bit CPUs with 2.33 GHz clock frequency, 8 Gigabytes of main memory, and two SAS drives (446 Gigabytes in total). No other users were logged onto this machine, and no other jobs were run during the experiments.

A large real-world data set containing 6,917,514 records was used for the presented experiments. It contained surnames, postcodes and suburb (town) names sourced from an Australian telephone directory from 2002 (Australia On Disc; see footnote 1). This data corresponds to all entries in Australian telephone books in late 2002, and thus has characteristics similar to many other real-world data collections used by Australian organisations. Additionally, a list containing about 80,000 different given names and their frequencies of occurrence, supplied to the authors by a major Australian government agency, was used to generate and add a given name attribute. For each record in the data set, a given name was randomly selected (with replacement) from the given name list according to its frequency, and appended to the record.
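
The frequency-weighted selection of given names can be sketched with the Python standard library. The function name, the toy frequency list and the example records below are hypothetical and serve only to illustrate sampling with replacement in proportion to frequency; they are not the data used in the experiments.

```python
import random


def add_given_names(records, name_counts, seed=0):
    """Append a frequency-weighted random given name to every record."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    names = list(name_counts)
    weights = [name_counts[name] for name in names]
    # choices() samples with replacement, proportionally to the weights.
    chosen = rng.choices(names, weights=weights, k=len(records))
    return [tuple(rec) + (name,) for rec, name in zip(records, chosen)]


# Hypothetical frequency list and records, for illustration only.
name_counts = {"john": 150, "peter": 117, "david": 102, "gayle": 1}
records = [("smith", "4350", "toowoomba"), ("jones", "2250", "gosford")]
extended = add_given_names(records, name_counts)
```

Because the selection is with replacement, frequent names recur in proportion to their counts, so the generated attribute inherits the skew of the supplied frequency list.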
As such, this is a typical example data set that contains a large number of unique and cleaned entities, with similar data being collected by many other private and public sector organisations in many countries. Table 1 provides an overview of the resulting data set used in the presented experiments. As expected, all the name attributes exhibit a strongly skewed distribution of values, with a small number of very common values and a large number of very rare values. For example, 40% of all surnames only appear once in the data set, while the top five most frequent surnames account for nearly 7% of the population. Only postcodes are more uniformly distributed, which is due to the process by Australia Post to split populated regions into similar sized postcode areas.

Both index approaches were implemented in Python, with version 2.5.2 used for the experiments. For the encoding functions Ei, used to block the test data sets, the Double-Metaphone [7] phonetic encoding was applied on the three name attributes, while for the postcode attribute the blocking was based on selecting the last three digits (i.e. all records where the postcode value has the same last three digits were inserted into the same block). For the comparison functions, Si, the Winkler [7] approximate string comparison was used for the three name attributes, while for postcodes the similarity was calculated by counting the number of matching digits divided by four. For example, the similarity of the two postcode values '2346' and '2356' is 0.75.

Footnote 1: http://www.australiaondisc.com

Figure 4: Summary experimental results: Build time (left); memory usage (middle); and average query time per record (right). Note that all three graphs are shown with a logarithmic y-axis scale. [Each panel plots Standard Blocking and Sim-Aware Index against the number of records in the data set (691,751; 2,767,006; 4,842,260; 6,917,514); the y-axes are in seconds, MBytes and seconds respectively.]

In order to evaluate the scalability of the similarity-aware index, test data sets of four different sizes were built containing 10% (691,751), 40% (2,767,006), 70% (4,842,260) and 100% of the records in the original data set. The full data set was split into ten data sets of equal size. Next, from each of these ten data sets, ten query records were randomly selected (giving one hundred base query records in total). To assess the matching quality, five query sets of one hundred records each were created by transforming the hundred base records in different ways. The first set of hundred records was made by exactly copying the base query records. In the second set one modification was inserted into one of the four attributes in each record (a different attribute in each record in a round robin fashion); in the third set two modifications were inserted; in the fourth set three modifications; and in the fifth query set all four attribute values were modified in each record.

The modifications, while done manually, were based on the authors' experience with real-world name data. They mostly corresponded to common phonetic and typographical variations, for example changes such as 'Dickson' to 'Dixon', nickname substitutions like 'Robert' to 'Bob', or simple character inserts, deletes, substitutions or transpositions. For postcodes, only substitutions and transpositions of digits were applied, such as '2607' changed into '2601'.

Scalability was evaluated by building an index for each test data set (each containing 10%, 40%, 70%, or 100% of the records in the full data set) and then querying it with each of the five query sets. The time used to build each index was recorded, as well as the total amount of memory used by that index.
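
The postcode encoding and comparison functions described earlier in this section are simple enough to write out in full. The sketch below assumes four-digit postcodes held as strings and reads "matching digits" as digits that agree position by position; both the function names and that reading are assumptions of this example.

```python
def postcode_block_key(postcode):
    """Encoding function for postcodes: the last three digits."""
    return postcode[-3:]


def postcode_similarity(a, b):
    """Number of positions with the same digit, divided by four.

    Assumes both arguments are four-character digit strings.
    """
    matching = sum(1 for x, y in zip(a, b) if x == y)
    return matching / 4


sim_example = postcode_similarity("2346", "2356")  # the example from the text
same_block = postcode_block_key("2607") == postcode_block_key("2601")
```

Under this reading, the modified pair '2607' and '2601' has a similarity of 0.75 but falls into different blocks ('607' and '601'), whereas a change to the first digit would keep both values in the same block.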
During the query phase, the time for querying each record was measured, as well as whether the top ranked returned record was a true match (i.e. if the record identifier of the best returned match was the same as the record identifier of the query record). For the similarity-aware index, the number of case 1 and case 2 matches (as discussed in Section 3.2 and shown in Algorithm 4) was also recorded.

While test runs were conducted with both optimisations turned off and on, due to space limitations of this paper only results with activated optimisations are reported. The minimum threshold tmin was set to 0.55 and the overall minimum threshold Tmin to 2.0. These values were selected such that the experiment on the full database for the similarity-aware index still fitted into the 8 Gigabytes main memory available on the experimental platform. For the experiments with the smaller test data sets (less than 100% of the full data set), each experiment was conducted ten times (with component 10% data sets selected in a round robin fashion) and all results averaged, while for the full database an experiment was only run once.

5. RESULTS AND DISCUSSION

A summary of the experimental results is shown in Figure 4. As expected, building a standard blocking index is significantly faster than building a similarity-aware index, by a factor ranging from 16 times for the smallest test data set to 20 times when building the index for the largest test data set. The main reason for this is that during the build phase of the standard blocking index no similarity calculations between attribute values are performed. The build time for both index approaches however does grow sub-linearly with the size of the data set. For standard blocking, this is because the encodings of attribute values are cached (line 6 in Algorithm 1), so the more records are loaded and inserted into the index, the more often cached encoding values can be retrieved and fewer need to be calculated.
For the similarity-aware index, the calculation of similarities between attribute values and inserting them into the similarity and block indices SI and BI again only needs to be done the first time a new, previously unseen attribute value occurs.

Similarly, the amount of memory required by both index approaches (shown in the middle of Figure 4) grows sub-linearly with the size of the test data set, because as the data set grows fewer new attribute values, which need to be processed and stored, will occur. For the test data sets used in the experiments, the similarity-aware index required around 1.8 times as much memory on average as the standard blocking index. The rate of growth for both build time and memory requirements depends upon the distribution of attribute values in the data set to be indexed. Given that many real-world databases contain attributes that follow a Zipf-like or exponential distribution, such as names [9], a sub-linear growth can be expected in practice. A theoretical analysis of the growth factor is one avenue of future work planned by the authors.

One of the most important aspects of the novel index approach presented in this paper is its fast query matching time. As can be seen in the right graph in Figure 4, the novel approach achieves average query times below 0.1 seconds even for the index that is based on the full test data set containing nearly 7 million records. Over the different test data sets (10%, 40%, 70% and 100% of the full data set size), the query time for the similarity-aware index is between 140 and 150 times faster than standard blocking. However, for both index approaches, the query time currently increases linearly with the size of the indexed data sets. Improving upon this is a current effort by the authors.

Figure 5: Query matching accuracy for the full test data set for varying number of modifications per record. Similar accuracy results were achieved for the smaller test data sets. [The plot, titled "Accuracy for data set with 6,917,514 records", shows accuracy for Standard blocking and Sim-Aware Index against the number of modifications per record (0 to 4).]

The query matching accuracy results are shown in Figure 5 for the largest test data set with varying number of modifications per record. As can be seen for both index approaches, matching accuracy gets lower with an increased number of modifications. This is what one would expect, as with more modifications per record the likelihood that another record (with similar attribute values) becomes the best matching record is increased. The accuracy for the similarity-aware index is higher compared to standard blocking for the query sets with one and two modifications, but then drops more rapidly for the query sets with three and four modifications. This is because the similarity-aware index only adds the similarity of an attribute to the accumulator if the two attribute values of a record pair are in the same block. If two attribute values are in different blocks, then the corresponding similarity, which can be high, will not be considered. For standard blocking, on the other hand, only one pair of attribute values needs to be in the same block in order that two records are being compared.

  Given name      Surname           Suburb name
  Gail (g400)     Billman (b455)    Boystown (b235)
  Gayle (g400)    Pillman (p455)    Boydtown (b350)
  0.827           0.905             0.942

Figure 6: An example record pair that will be missed by the similarity-aware index approach because of different encoding values, but will be compared by standard blocking. The values in brackets are the corresponding Soundex encodings, and the similarities (bottom row) were calculated using the Winkler approximate string comparison function [7].
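
The encodings quoted in Figure 6 can be reproduced with a basic Soundex routine. The implementation below is a simplified sketch written for this example (the function name and the lower-case output format are choices made here to match the figure), not the encoding code used in the experiments.

```python
# Soundex digit for each consonant group; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Basic Soundex: first letter followed by three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = code
    return (result + "000")[:4]


pairs = [("Gail", "Gayle"), ("Billman", "Pillman"), ("Boystown", "Boydtown")]
same_block = [soundex(a) == soundex(b) for a, b in pairs]
```

Only the given name pair shares a blocking key here, so in the similarity-aware index the surname and suburb name similarities of this pair (0.905 and 0.942) would never be looked up, even though they are high.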
The difference between the two approaches is illustrated in Figure 6 with two example records that have an overall similarity of 2.674 out of a maximum of 3.0. These records would be compared by standard blocking because at least one attribute (given name) contains values that are in the same block; whereas they would not be compared by the similarity-aware index, because two of the three attributes (surname and suburb name) have different blocking key values and thus the corresponding similarities would not be added into the accumulator. Although this effect may lead to standard blocking having higher accuracy on more heavily modified query records, it can also lead to lower accuracy for standard blocking compared to the similarity-aware index when query records are of relatively good quality, as can be seen in Figure 5 for the query sets with one or two modifications only. Improving the similarity-aware index and achieving equal or even better matching accuracy than standard blocking in all cases is one of the current research efforts by the authors.

Figure 7: Proportion of case 1 (query attribute value is available in similarity-aware index) to case 1 plus case 2 (new unknown attribute value) for varying number of modifications per record. [The plot shows one series per test data set size (10%, 40%, 70% and 100%) against the number of modifications per record.]

Finally, Figure 7 shows the proportion of query attribute values that were available in the similarity-aware index (case 1) and thus no similarities had to be calculated at query time. As can be seen, the more modifications a query record had, the more likely the modified attribute values were not in the index and thus their similarities had to be calculated. However, even with modifications in all four query record attributes, more than 40% of all attribute values were available in the index and thus their similarities were pre-calculated. This result shows the efficiency of the similarity-aware index in speeding up query matching by pre-computing similarities between attribute values while the index is built.

6. CONCLUSIONS AND FUTURE WORK

In this paper, a novel index approach for real-time entity resolution has been presented and evaluated experimentally on a large real-world data set. The experiments showed that this approach can match query records more than two orders of magnitude faster than a basic standard index approach that is traditionally used for entity resolution. The novel approach requires less than double the amount of memory of the standard index, but building the index can take up to twenty times longer. For query records that do not contain too many variations and errors, the accuracy of the novel index approach can be better than the standard blocking approach. However, when most or all attribute values in a query record contain variations and errors, then matching accuracy can drop significantly. Improving upon this drawback is one of the major avenues for additional work on this novel index approach. Other areas of future research include a theoretical analysis of the complexity and scalability of this index approach, improving the query matching time, and conducting experiments on a variety of other real-world databases. To the best of the authors' knowledge, the similarity-aware inverted index presented in this paper is the first approach aimed at developing real-time entity resolution on large databases that combines approaches from information retrieval with traditional entity resolution techniques.

7. REFERENCES

[1] A. Aizawa and K. Oyama. A fast linkage detection scheme for multi-source information integration. In WIRI'05, Tokyo, 2005.
[2] R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington DC, 2003.
[3] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW'07, Banff, Canada, 2007.
[4] A. Behm, S. Ji, C. Li, and J. Lu. Space-constrained gram-based indexing for efficient approximate string search.
In IEEE ICDE'09, pages 604-615, Shanghai, China, 2009.
[5] I. Bhattacharya and L. Getoor. Query-time entity resolution. Journal of Artificial Intelligence Research, 30:621-657, 2007.
[6] M. Celikik and H. Bast. Fast error-tolerant search on very large texts. In ACM Symposium on Applied Computing, pages 1724-1731, Honolulu, Hawaii, 2009.
[7] P. Christen. A comparison of personal name matching: Techniques and practical issues. In Workshop on Mining Complex Data, held at IEEE ICDM'06, Hong Kong, 2006.
[8] P. Christen. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In ACM SIGKDD'08, pages 151-159, Las Vegas, 2008.
[9] P. Christen and R. Gayler. Towards scalable real-time entity resolution using a similarity-aware inverted index approach. In AusDM'08, CRPIT vol. 87, Glenelg, Australia, 2008.
[10] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string distance metrics for name-matching tasks. In IJCAI'03 Workshop on Information Integration on the Web (IIWeb), pages 73-78, Acapulco, 2003.
[11] W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In ACM SIGKDD'02, pages 475-480, Edmonton, Canada, 2002.
[12] M. Elfeky, V. Verykios, and A. Elmagarmid. TAILOR: A record linkage toolbox. In IEEE ICDE'02, pages 17-28, San Jose, 2002.
[13] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.
[14] I. Fellegi and A. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[15] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB'01, pages 491-500, Roma, Italy, 2001.
[16] L. Gu and R. Baxter. Decision models for record linkage. In Selected Papers from AusDM, Springer LNCS 3755, pages 146-160, 2006.
[17] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In IEEE ICDE'08, pages 267-276, Cancun, Mexico, 2008.
[18] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In ACM SIGMOD'95, San Jose, 1995.
[19] L. Jin, C. Li, and S. Mehrotra. Efficient record linkage in large data sets. In DASFAA'03, pages 137-146, Tokyo, 2003.
[20] D. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716-767, 2006.
[21] C. Kelman, J. Bass, and D. Holman. Research use of linked health data - A best practice protocol. Aust NZ Journal of Public Health, 26:251-255, 2002.
[22] M. Kumar, S. Moriah, and S. Krishnamoorthy. Performance evaluation of similarity join for real time information integration. In Bangalore Annual Compute Conference, Bangalore, India, 2009.
[23] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In IEEE ICDE'08, pages 257-266, Cancun, Mexico, 2008.
[24] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB'07, pages 303-314, Vienna, Austria, 2007.
[25] X. Liu, G. Li, J. Feng, and L. Zhou. Effective indices for efficient approximate string search and similarity join. In IEEE WAIM'08, pages 127-134, 2008.
[26] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
[27] M. Weis and F. Naumann. Space and time scalability of duplicate detection in graph data. Technical Report 25, Hasso-Plattner-Institut, University of Potsdam, Germany, 2007.
[28] W. E. Winkler. Overview of record linkage and current research directions. Technical Report RR2006/02, US Bureau of the Census, 2006.
[29] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and indexing documents and images. Morgan Kaufmann, 2nd edition, 1999.
[30] C. Xiao, W. Wang, X. Lin, and J. Yu. Efficient similarity joins for near duplicate detection. In WWW'08, pages 131-140, Beijing, 2008.
[31] X. Yin, J. Han, and P. Yu. LinkClus: Efficient clustering via heterogeneous semantic links. In VLDB'06, pages 427-438, Seoul, Korea, 2006.
[32] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006.