Er VLDB2012 PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 179

Entity Resolution: Tutorial

EntityResolution:Tutorial

Li Getoor
Lise G t Ashwin Machanavajjhala
UniversityofMaryland DukeUniversity
College Park MD
CollegePark,MD Durham,NC
,

http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
http://goo.gl/f5eym
WhatisEntityResolution?
Problemofidentifyingandlinking/groupingdifferent
manifestationsofthesamerealworldobject.
f f j

Examplesofmanifestationsandobjects:
p j
Differentwaysofaddressing(names,emailaddresses,FaceBook
accounts)thesamepersonintext.
Webpageswithdifferingdescriptionsofthesamebusiness.
Differentphotosofthesameobject.

Ironically,EntityResolutionhasmanyduplicatenames
Recordlinkage
Duplicatedetection
Coreference resolution
Referencereconciliation

Fuzzymatch
y Object consolidation
Objectconsolidation
Objectidentification
Deduplication
Entity clustering
Entityclustering
Approximatematch
Identityuncertainty
Merge/purge Household matching
Householdmatching

Hardeningsoftdatabases
Householdingg Referencematchingg

Doubles
ERMotivatingExamples
LinkingCensusRecords
Public Health
PublicHealth
Websearch
Comparison shopping
Comparisonshopping
Counterterrorism
Spam detection
Spamdetection
MachineReading

ERandNetworkAnalysis

before after
Motivation:NetworkScience
Measuringthetopologyoftheinternetusing
traceroute
IPAliasingProblem[Willinger etal.2009]
IPAliasingProblem[Willinger etal.2009]
TraditionalChallengesinER
Name/Attributeambiguity

Thomas Cruise
ThomasCruise

MichaelJordan
TraditionalChallengesinER
Name/Attributeambiguity
Errorsduetodataentry
Errors due to data entry
TraditionalChallengesinER
Name/Attributeambiguity
Errorsduetodataentry
Errors due to data entry
MissingValues

[Gilletal;Univ ofOxford2003]
TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
MissingValues
Changing Attributes
ChangingAttributes

Dataformatting

Abbreviations/DataTruncation
/
BigDataERChallenges
BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques
MoreHeterogeneity
M H t it
Unstructured,UncleanandIncompletedata.Diversedatatypes.
Nolongerjustmatchingnameswithnames,butAmazonprofileswith
No longer just matching names with names, but Amazon profiles with
browsinghistoryonGoogleandfriendsnetworkinFacebook.
BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques
MoreHeterogeneity
M H t it
Unstructured,UncleanandIncompletedata.Diversedatatypes.
Morelinked
More linked
Needtoinferrelationshipsinadditiontoequality
MultiRelational
Dealwithstructureofentities(AreWalmart andWalmart Pharmacy
thesame?)
Multidomain
l d
Customizablemethodsthatspanacrossdomains
Multipleapplications
Multiple applications (websearchversuscomparisonshopping)
(web search versus comparison shopping)
Servediverseapplicationwithdifferentaccuracyrequirements
Outline
1. AbstractProblemStatement
2
2. Algorithmic Foundations of ER
AlgorithmicFoundationsofER
3. ScalingERtoBigData
4
4. Challenges & Future Directions
Challenges&FutureDirections
Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2. Algorithmic Foundations of ER
a) DataPreparationandMatchFeatures
b) Pairwise ER
c) ConstraintsinER
5minutebreak
d) Algorithms
RecordLinkage
Record Linkage
Deduplication
CollectiveER

3. ScalingERtoBigData
4. Challenges&FutureDirections
Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2. Algorithmic Foundations of ER
3. ScalingERtoBigData
a) Blocking/CanopyGeneration
Blocking/Canopy Generation
b) DistributedER
4. Challenges&FutureDirections
Challenges & Future Directions
Outline
1. AbstractProblemStatement
2
2. Algorithmic Foundations of ER
AlgorithmicFoundationsofER
3. ScalingERtoBigData
4
4. Challenges & Future Directions
Challenges&FutureDirections
ScopeoftheTutorial
Whatwecover:
FundamentalalgorithmicconceptsinER
ScalingERtobigdatasets
TaxonomyofcurrentERalgorithms

Whatwedonotcover:
Schema/ontologyresolution
Schema/ontology resolution
Datafusion/integration/exchange/cleaning
Entity/InformationExtraction
PrivacyaspectsofEntityResolution
Detailsonsimilaritymeasures
Technical details and proofs
Technicaldetailsandproofs
ERReferences
Book/SurveyArticles
DataQualityandRecordLinkageTechniques
[T Herzog F Scheuren W Winkler Springer 07]
[T.Herzog,F.Scheuren,W.Winkler,Springer, 07]
DuplicateRecordDetection[A.Elmagrid,P.Ipeirotis,V.Verykios,TKDE07]
AnIntroductiontoDuplicateDetection[F.Naumann,M.Herschel,M&P
y
synthesislectures2010]]
EvaluationofEntityResolutionApproachedonRealworldMatchProblems
[H.Kopke,A.Thor,E.Rahm,PVLDB2010]
DataMatching[P.Christen,Springer2012]

Tutorials
RecordLinkage:SimilaritymeasuresandAlgorithms
[N.Koudas,S.Sarawagi,D.Srivatsava SIGMOD06]
DatafusionResolvingdataconflictsforintegration
[X.Dong,F.Naumann VLDB09]
EntityResolution:Theory,PracticeandOpenChallenges
Entity Resolution: Theory Practice and Open Challenges
http://goo.gl/Ui38o [L.Getoor,A.Machanavajjhala AAAI12]
PART 1
PART1
ABSTRACT PROBLEM STATEMENT
ABSTRACTPROBLEMSTATEMENT
AbstractProblemStatement
RealWorld DigitalWorld
Records/
/
Mentions
Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y
Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y
Intensional Variant:Computeclusterrepresentative
RecordLinkageProblemStatement
Linkrecordsthatmatchacrossdatabases

B
A
ReferenceMatchingProblem
Matchnoisyrecordstocleanrecordsinareferencetable

Reference
Table
bl
AbstractProblemStatement
RealWorld DigitalWorld
Deduplication ProblemStatement
Deduplication withCanonicalization
GraphAlignment(&motifsearch)
Graph1 Graph2
Relationshipsarecrucial
Relationshipsarecrucial
Notation
R:setofrecords/mentions(typed)
H: set of relations / hyperedges (typed)
H:setofrelations/hyperedges
M:setofmatches(recordpairsthatcorrespondtosameentity)
N: set of non matches (recordpairscorrespondingtodifferententities)
N:setofnonmatches ( d i di t diff t titi )

E:setofentities
L set of links
L:setoflinks

TTrue(M
(Mtrue,N
Ntrue,EEtrue,LLtrue):accordingtorealworld
) di l ld
vs Predicted(Mpred,Npred,Epred,Lpred ):byalgorithm
RelationshipbetweenMtrue andMpred
Mtrue (SameAs ,Equivalence)
Mpred (Similarrepresentationsandsimilarattributes)
(Similar representations and similar attributes)

Mtrue RxR Mpred


Metrics
Pairwise metrics
Precision/Recall,F1
#ofpredictedmatchingpairs

Clusterlevelmetrics
purity,completeness,complexity
Precision/Recall/F1:Clusterlevel,closestcluster,MUC,B
Precision/Recall/F1: Cluster level closest cluster MUC B3 ,
RandIndex
Generalizedmergedistance[Menestrina etal,PVLDB10]

Littleworkthatevaluationscorrectpredictionoflinks
TypicalAssumptionsMade
Eachrecord/mentionisassociatedwithasinglereal
worldentity.
y

Inrecordlinkage,noduplicatesinthesamesource
Iftworecords/mentionsareidentical,thentheyaretrue
If two records/mentions are identical then they are true
matches

(
(,)
) Mtrue
ERversusClassification
Findingmatchesvs nonmatchesisaclassificationproblem

Imbalanced:typicallyO(R)matches,O(R^2)nonmatches

Instancesarepairsofrecords.PairsarenotIID

(,) Mtrue
AND (
(,)
) Mtrue
( , ) Mtrue
(,)
ERvs (Multirelational)Clustering
Computingentitiesfromrecordsisaclusteringproblem

Intypicalclusteringalgorithms(kmeans,LDA,etc.)
number of clusters is a constant or sub linear in R.
numberofclustersisaconstantorsublinearinR.

In
InER:numberofclustersislinearinR,andaverage
ER: number of clusters is linear in R and average
clustersizeisaconstant.Significantfractionofclusters
aresingletons.
g
PART 2
PART2
ALGORITHMIC FOUNDATIONS OF ER
ALGORITHMICFOUNDATIONSOFER
OutlineofPart2
a) DataPreparationandMatchFeatures

b) Pairwise ER
Determiningwhetherornotapairofrecordsmatch

c) ConstraintsinER
5minutebreak
d) Algorithms
Recordlinkage(Propagationthroughexclusitivity negativeconstraint),
Deduplication (Propagationthroughtransitivitypositiveconstraint),
Collective(PropagationthroughGeneralConstraints)
MOTIVATINGEXAMPLE:
MOTIVATING EXAMPLE:
BIBLIOGRAPHICDOMAIN
Entities&RelationsinBibliographicDomain

Wrote Author Author Mention


Name
Research Area NameString

Paper
p WorksAt
Title I tit ti
Institution Institute Mention
Name NameString
# of Authors
Topic
Word1
Word 2 Paper Mention
TitleString
WordN

Cites Venue Venue Mention


AppearsIn
pp Name NameString
g
:entityrelationships
:cooccurrencerelationships
:resolutionrelationships
PART 2
PART2a
DATAPREPARATION&
DATA PREPARATION &
MATCHFEATURES
Normalization
Schemanormalization
SchemaMatching e.g.,contactnumberandphonenumber
Compoundattributes
C d tt ib t f ll dd
fulladdressvs str,city,state,zip
t it t t i
Nestedattributes
Listoffeaturesinonedataset(airconditioning,parking)vs eachfeaturea
boolean attribute
Setvaluedattributes
Setofphonesvs primary/secondaryphone
Recordsegmentationfromtext
Record segmentation from text
Datanormalization
Oftenconverttoalllower/allupper;removewhitespace
detectingandcorrectingvaluesthatcontainknowntypographicalerrorsor
d i d i l h i k hi l
variations,
expandingabbreviationsandreplacingthemwithstandardforms;replacing
nicknames with their proper name forms
nicknameswiththeirpropernameforms
Usuallydonebasedondictionaries(e.g.,commercialdictionaries,postaladdresses,
etc.)
MatchingFeatures
Fortworeferencesxandy,computeacomparisonvectorof
similarityscoresofcomponentattribute.
[1stauthormatchscore,
papermatchscore,
venuematchscore,
yearmatchscore,.]

Similarityscores
y
Boolean(matchornotmatch)
Realvaluesbasedondistancefunctions
SummaryofMatchingFeatures
Handle
H dl
Typographicalerrors GoodforNames
Equalityonaboolean predicate AlignmentbasedorTwotiered
Editdistance JaroWinkler,SoftTFIDF,MongeElkan
Levenstein,SmithWaterman,Affine PhoneticSimilarity
Setsimilarity Soundex
Jaccard,Dice Translationbased
VectorBased Numericdistancebetweenvalues
Cosine
Cosinesimilarity,TFIDF
similarity, TFIDF Domainspecific
p
GoodforTextlike
Usefulfor
reviews/tweets
abbreviations,
Useful packages
Usefulpackages: alternate names.
alternatenames.
SecondString,http://secondstring.sourceforge.net/
Simmetrics:http://sourceforge.net/projects/simmetrics/
LingPipe,http://aliasi.com/lingpipe/index.html
Li Pi htt // li i /li i /i d ht l
RelationalMatchingFeatures
Relationalfeaturesareoftensetbased
Setofcoauthorsforapaper
Setofcitiesinacountry
f ii i
Setofproductsmanufacturedbymanufacturer

Canusesetsimilarityfunctionsmentionedearlier
CommonNeighbors:Intersectionsize
Jaccard
Jaccardss Coefficient:Normalizebyunionsize
Coefficient: Normalize by union size
AdarCoefficient:Weightedsetsimilarity

Canreasonaboutsimilarityinsetsofvalues
C b t i il it i t f l
AverageorMax
Otheraggregates
PART 2 b
PART2b
PAIRWISE MATCHING
Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).

Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.
0.5*1stauthormatchscore +0.2*venuematchscore +0.3*ppapermatchscore
p .
Hardtopickweights.
Matchonlastnamematchmorepredictivethanloginname.
Match on Smith
Matchon Smith lesspredictivethanmatchon
less predictive than match on Getoor
Getoor or
or Machanavajjhala.
Hardtotuneathreshold.
Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).

Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.
2 Formulaterulesaboutwhatconstitutesamatch.
2. Formulate rules about what constitutes a match
(1stauthormatchscore>0.7ANDvenuematchscore>0.8)
OR(papermatchscore>0.9ANDvenuematchscore>0.9)
M
Manuallyformulatingtherightsetofrulesishard.
ll f l ti th i ht t f l i h d
BasicMLApproach
r =(x,y)isrecordpair, iscomparisonvector,M matches,U non
matches

P( | r M )
Decisionrule R
P( | r U )

R t r Match
R t r Non - Match
Fellegi &Sunter Model[FS,Science69]
r =(x,y)isrecordpair, iscomparisonvector,M matches,U non
matches

P( | r M )
Decisionrule R
P( | r U )

R tl r Match
tl R tu r Potential Match
R tu r Non - Match

NaveBayes Assumption: P( | r M ) i P( i | r M )
MLPairwise Approaches
Supervisedmachinelearningalgorithms
Decisiontrees
[Cochinwala
[Co hin ala etal,IS01]
et al IS01]
Supportvectormachines
[Bilenko &Mooney,KDD03];[Christen,KDD08]
Ensemblesofclassifiers
Ensembles of classifiers
[Chenetal.,SIGMOD09]
ConditionalRandomFields(CRF)
[Gupta&Sarawagi,VLDB09]
[Gupta & Sarawagi VLDB09]

Issues:
Training
Trainingsetgeneration
set generation
Imbalancedclasses manymorenegativesthanpositives(evenafter
eliminatingobviousnonmatchesusingBlocking)
Misclassificationcost
Misclassification cost
CreatingaTrainingSetisakeyissue
Constructingatrainingsetishard sincemostpairsof
recordsareeasynonmatches.
y
100recordsfrom100cities.
Only106 pairsoutoftotal108 (1%)comefromthesamecity

Somepairsarehardtojudgeevenbyhumans
Inherentlyambiguous
E.g.,ParisHilton(personorbusiness)
Missingattributes
Starbucks,Torontovs Starbucks,QueenStreet,Toronto
AvoidingTrainingSetGeneration
Unsupervised/SemisupervisedTechniques
EMbasedtechniquestolearnparameters
[Winkler06,Herzogetal07]
GenerativeModels
[Ravikumar &Cohen,UAI04]
& Cohen UAI04]

ActiveLearning
CommitteeofClassifiers
[Sarawagi etalKDD00,Tajeda etalIS01]
Provablyoptimizingprecision/recall
Provably optimizing precision/recall
[Arasu etalSIGMOD10,Bellare etalKDD12]
Crowdsourcing
[WangetalVLDB12,MarcusetalVLDB12,]
[W t l VLDB 12 M t l VLDB 12 ]
CommitteeofClassifiers [Tejada etal,IS01]
ActiveLearningwithProvableGuarantees
Mostactivelearningtechniquesminimize01loss
[Beygelzimer etalNIPS2010].

However,ERisveryimbalanced:
Numberofnonmatches>100*numberofmatches.
Classifyingallpairsasnonmatcheshaslow01loss(<1%).
y g p ( )

Hence,needactivelearningtechniquesthatminimize
precision/recall.
precision/recall
ActiveLearningwithProvableGuarantees
Monotonicity ofPrecision[Arasu etalSIGMOD10]

Thereisalargerfractionof
matchesinC1thaninC2.
Algorithmsearchesforthe
p g y
optimalclassifierusingbinary
searchoneachdimension
ActiveLearningwithProvableGuarantees
[Bellare etalKDD12]
O (log2 n)callstoablackbox
O(log n) calls to a blackbox 01lossactivelearningalgorithm.
0 1 loss active learning algorithm

Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10]
(in the worst case).
(intheworstcase).

1. PrecisionConstrained Weighted01LossProblem
(using a Lagrange Multiplier )
(usingaLagrangeMultiplier).
2. Givenafixedvaluefor,weighted01Losscanbeoptimizedby(onecallto)a
blackbox activelearningclassifier.
3
3. Ri ht l
Rightvalueof
f iscomputedbysearchingoveralloptimalclassifiers.
i t db hi ll ti l l ifi
Classifiersareembeddedina2dplane(precision/recall)
Searchisalongtheconvexhulloftheembeddedclassifiers
Crowdsourcing
Growinginterestinintegratinghumancomputationindeclarative
workflowengines.
ERisanimportantproblem(e.g.,forevaluatingfuzzyjoins)
[WangetalVLDB12,MarcusetalVLDB12,]

Opportunity:utilizecrowdsourcing forcreatingtrainingsets,
orforactivelearning.

Keyopenissue:Handlingerrorsinhumanjudgments
InanexperimentonAmazonMechanicalTurk:
Pairwise matchingjudgment,eachgivento5differentpeople
Majorityofworkersagreedontruthononly90%ofpairwise judgments.
SummaryofSingleEntityERAlgorithms
Manyalgorithmsforindependentclassificationofpairsofrecords
asmatch/nonmatch

MLbasedclassification&FellegiSunter
Pro:Advancedstateoftheart
P Ad d f h
Con:Buildinghighfidelitytrainingsetsisahardproblem

ActiveLearning&Crowdsourcing forERareactiveareasof
research.
PART 2
PART2c
CONSTRAINTS
Constraints
Importantformsofconstraints:
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2match,thenM3andM4must
Functional Dependency: If M1 and M2 match then M3 and M4 must
match

Transitivityiskeytodeduplication
Exclusivityiskeytorecordlinkage
Functionaldependenciesfordatacleaning,e.g.,
[Ananthakrishna etal.,VLDB02][Fan,PODS08][Bohannonet
al ICDE07]
al,ICDE07]
Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match
Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match
Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
ConstraintTypes
HardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4must IfM1,M2matchthenM3,
match M4morelikelytomatch
If t
Iftwopapersmatch,theirvenues
t h th i If t
Iftwovenuesmatch,then
t h th
match theirpapersaremorelikely
tomatch

NegativeEvidence MentionM1andM2mustreferto IfM1,M2dontmatchthen


distinctentities(Uniqueness) M3,M4lesslikelytomatch
Coauthors are distinct
Coauthorsaredistinct If institutions donttmatch,
Ifinstitutionsdon match
thenauthorslesslikelyto
IfM1,M2dontmatchthenM3,M4 match
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4must IfM1,M2matchthenM3,
match M4morelikelytomatch
If t
Iftwopapersmatch,theirvenues
t h th i If t
Iftwovenuesmatch,then
t h th
match theirpapersaremorelikely
tomatch

NegativeEvidence MentionM1andM2mustreferto IfM1,M2dontmatchthen


distinctentities(Uniqueness) M3,M4lesslikelytomatch
Coauthors are distinct
Coauthorsaredistinct If institutions donttmatch,
Ifinstitutionsdon match
thenauthorslesslikelyto
IfM1,M2dontmatchthenM3,M4 match
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4must IfM1,M2matchthenM3,
match M4morelikelytomatch
If t
Iftwopapersmatch,theirvenues
t h th i If t
Iftwovenuesmatch,then
t h th
match theirpapersaremorelikely
tomatch

NegativeEvidence MentionM1andM2mustreferto IfM1,M2dontmatchthen


distinctentities(Uniqueness) M3,M4lesslikelytomatch
Coauthors are distinct
Coauthorsaredistinct If institutions donttmatch,
Ifinstitutionsdon match
thenauthorslesslikelyto
IfM1,M2dontmatchthenM3,M4 match
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint SoftConstraint
PositiveEvidence IfM1,M2matchthenM3,M4must IfM1,M2matchthenM3,
match M4morelikelytomatch
If t
Iftwopapersmatch,theirvenues
t h th i If t
Iftwovenuesmatch,then
t h th
match theirpapersaremorelikely
tomatch

NegativeEvidence MentionM1andM2mustreferto IfM1,M2dontmatchthen


distinctentities(Uniqueness) M3,M4lesslikelytomatch
Coauthors are distinct
Coauthorsaredistinct If institutions donttmatch,
Ifinstitutionsdon match
thenauthorslesslikelyto
IfM1,M2dontmatchthenM3,M4 match
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint SoftConstraint
PositiveEvidence Notethatsomeofthe
IfM1,M2matchthenM3,M4must IfM1,M2matchthenM3,
match M4morelikelytomatch
constraintsmayberelational
y
If t
Iftwopapersmatch,theirvenues
t h th i If t
Iftwovenuesmatch,then
t h th
match andrequirejoins
theirpapersaremorelikely
tomatch
Maybedirectional
May be directional
orbidirectional
NegativeEvidence MentionM1andM2mustreferto IfM1,M2dontmatchthen
distinctentities(Uniqueness) M3,M4lesslikelytomatch
Coauthors are distinct Constraintscanberecursive,
Coauthorsaredistinct Constraints can be recursive
If institutions
Ifinstitutionsdon
donttmatch,
match
thenauthorslesslikelyto
e.g.,iftwoauthorshave
IfM1,M2dontmatchthenM3,M4 match
cannotmatch matchingcoauthors,then
theymatch
h
Iftwovenuesdontmatch,thentheir h
papersdontmatch
AdditionalConstraints
AggregateConstraints[Chaudhuri etal.SIGMOD07]
countconstraints
count constraints
EntityAcanlinktoatmostNBs
Authorshaveatmost5papersatanyconference
Otheraggregateslikesum,averagemorecomplex
Oth t lik l

Again,
Again,thesecanbeeitherhardorsoftconstraints,
these can be either hard or soft constraints,
providepositiveornegativeevidence
MatchDependencies

Whenmatchingdecisionsdependonother
matching decisions (in other words, matching
matchingdecisions(inotherwords,matching
decisionsarenotmadeindependently),we
refer to the approach as collective
refertotheapproachascollective
MatchExtent
Global: Iftwopapersmatch,thentheirvenuesmatch
This
Thisconstraintcanbeappliedtoallinstancesofvenue
constraint can be applied to all instances of venue
mentions
AlloccurrencesofSIGMODcanbematchedtoInternational
Conference on Management of Data
ConferenceonManagementofData
Local:Iftwopapersmatch,thentheirauthorsmatch
Thisconstraintcanonlybeappliedlocally
This constraint can only be applied locally
DontwanttomatchalloccurrencesofJ.SmithwithJeffSmith,onlyin
thecontextofthecurrentpaper
Ex.SemanticIntegrityConstraints
Type Example
Aggregate C1=Noresearcher haspublishedmorethanfiveAAAIpapersinayear
Subsumption C2=IfacitationXfromDBLP matchesacitationYinahomepage,then
eachauthormentionedinYmatchessomeauthormentionedinX
Neighborhood C3=IfauthorsXandYsharesimilarnamesandsomecoauthors,they
arelikelytomatch
lik l t t h
Incompatible C4 =NoresearcherexistswhohaspublishedinbothHCIandnumerical
analysis
Layout C5=Iftwomentionsinthesamedocumentsharesimilarnames,they
f i i h d h i il h
arelikelytomatch
Key/Uniqueness C6=MentionsinthePClistingofaconferenceisto different
researchers
Ordering C7=Iftwo citationsmatch,thentheirauthorswillbematchedinorder
Individual C8=TheresearcherwiththenameMayssam Sariahasfewerthan
fi
fivementionsinDBLP(newgraduatestudent)
ti i DBLP ( d t t d t)
[Shen,Li&Doan,AAAI05]
AlgorithmsforHandlingConstraints
Recordlinkage propagationthroughexclusivity
Weightedkpartitematching

Deduplication propagationthroughtransitivity
C
Correlationclustering
l i l i

Collective propagationthroughgeneralconstraints
Collective propagation through general constraints
Similaritypropagation
Dependencygraphs,CollectiveRelationalClustering
P b bili ti
Probabilisticapproaches
h
LDA,CRFs,MarkovLogicNetworks,ProbabilisticRelationalModels,
Hybridapproaches
Dedupalog
PART 2 d
PART2d
ALGORITHMS
RECORD LINKAGE
RECORDLINKAGE
11assumption
Matchingbetween(almost)deduplicated databases.
Eachrecordinonedatabasematchesatmostonerecord
Each record in one database matches at most one record
inanotherdatabase.

Pairwise ERmaymatcharecordinonedatabasewith
morethanonerecordinseconddatabase
WeightedKPartiteMatching

Weighted Weighted
Edges Edges

Edgesbetweenpairsofrecordsfromdifferentdatabases
Edgeweights
o Pairwise matchscore
o Logoddsofmatching
L dd f hi
WeightedKPartiteMatching

Findamatching(eachrecordmatchesatmostoneotherrecord
fromotherdatabase)thatmaximizethesumofweights.
)
GeneralproblemisNPhard(3Dmatching)
Successivebipartitematchingistypicallyused.
Successive bipartite matching is typically used [Gupta&Sarawagi,VLDB
[G pta & Sara agi VLDB
09]
DEDUPLICATION
Deduplication =>Transitivity
Oftenpairwise ERalgorithmoutputinconsistentresults
(x,y) Mpred ,(y,z) Mpred ,but(x,z) Mpred

Idea:Correctthisbyaddingadditionalmatchesusingtransitive
closure
l

Incertaincases,thisisabadidea.
In certain cases this is a bad idea
Graphsresultingfrompairwise ERhave
diameter>20 Addedby
[Rastogi etalCorr
et al Corr 12]
12] Transitive
Transitive
Closure

Needclusteringsolutionsthatdealwiththisproblemdirectlyby
reasoningaboutrecordsjointly.
ClusteringbasedER
Resolutiondecisionsarenotmadeindependentlyfor
eachpairofrecords

Basedonvarietyofclusteringalgorithms,but
Numberofclustersunknownaprioiri
Many,manysmall(possiblysingleton)clusters

Oftentakeapairwisesimilaritygraphasinput

Mayrequiretheconstructionofaclusterrepresentative
orcanonicalentity
ClusteringMethodsforER
HierarchicalClustering
[[Bilenko etal,ICDM05]
, ]
NearestNeighborbasedmethods
[Chaudhuri etal,ICDE05]
CorrelationClustering
[SoonetalCL01,Bansal etalML04,NgetalACL02,
Ailon etalJACM08,Elsner etalACL08,Elsner etalILPNLP09]
IntegerLinearProgrammingviewofER
rxy {0,1},rxy =1ifrecordsx andyareinthesamecluster.
w+xy [0,1],costofclusteringxandytogether
[ ] g y g
w xy [0,1],costofplacingxandyindifferentclusters

Transitive
closure
CorrelationClustering

4
Clustermentionssuchthat 2
totalcostisminimized
total cost is minimi ed 1
Solidedgescontributew+ totheobjective
xy 3
Dashededgescontributew xyy totheobjective
5

Costbasedonpairwise similarities

Additive:w+xy =pxy andw xy =(1pxy)


Logarithmic:w
oga t c +xy =log(p a d xy =log(1p
og(pxy))andw og( pxy)
CorrelationClustering
SolvingtheILPisNPhard[Ailon etal2008JACM]

Anumberofheuristics[Elsner etal2009ILPNLP]
GreedyBEST/FIRST/VOTEalgorithms
Greedy BEST/FIRST/VOTE algorithms
GreedyPIVOTalgorithm(5approximation)
LocalSearch
GreedyAlgorithms
SStep1:Permutethenodesaccordingarandom
1 P h d di d
Step2:AssignrecordxtotheclusterthatmaximizesQuality
Start a new cluster if Quality < 0
StartanewclusterifQuality<0
Quality:
BEST:Clustercontainingtheclosestmatch
[Ngetal2002ACL]
FIRST:Clustercontainsthemostrecentvertexywithw+xy >0
[Soonetal2001CL]
[Soon et al 2001 CL]
VOTE:Assigntoclusterthatminimizesobjectivefunction.
[Elsner etal08ACL]

PracticalNote:
Runthealgorithmformanyrandompermutations,andpicktheclusteringwith
bestobjectivevalue(betterthanaveragerun)
Greedywithapproximationguarantees
PIVOTAlgorithm [Ailon etal2008JACM]
Pickarandom(pivot)recordp.
(p ) p
Newcluster=
2
1
={1,2,3,4}C={{1,2,3,4}}
={2,4,1,3}C={{1,2},{4},{3}} 3
{ , , , } {{ , }, { }, { }}
={3,2,4,1}C={{1,3},{2},{4}} 4

Whenweightsare0/1, E(cost(greedy))<3 OPT


Forw+xy +wxy =1, E(cost(greedy))<5 OPT

[Elsner etal,ILPNLP09]:Comparisonofvariouscorrelationclusteringalgorithms
PART2d
CANONICALIZATION
Canonicalization
Mergeinformationfromduplicatementionstoconstruct
aclusterrepresentativewithmaximalinformation
p
Starbucks,
3457HillsboroughRoad
Starbucks
Durham,NC
Ph:null 3457HillsboroughRoad,Durham,NC
Starbacks, Ph:(919)3334444
HillsboroughRd,Durham
Ph:(919)3334444 CriticallyimportantinWebportalswhere
usersmust beshownaconsolidatedview
Eachmentiononlycontainsasubsetofthe
attributes
Mentionscontainvariations(ofnames,
addresses)
Someofthementionshaveincorrectvalues
CanonicalizationAlgorithms
Rulebased:
Fornames:typicallylongestnamesareused.
Forsetvaluesattributes:UNIONisused.

FForstrings,[Culotta
ti [C l tt etalKDD07]learnaneditdistanceforfinding
t l KDD07] l dit di t f fi di
themostrepresentativecentroid.

Canusemajorityruletofixerrors
(if4outof5sayabusinessisclosed,thenbusinessisclosed).
Thismaynotalwaysworkduetocopying[DongetalVLDB09],orwhen
underlyingdatachanges[PaletalWWW11]
CanonicalizationforEfficiency
StanfordEntityResolutionFramework[Benjelloun VLDBJ09]
Considerablackbox matchandmergefunction
Matchisapairwise boolean operator
Merge:constructcanonicalversionofamatchingpair

Canminimizetimetocomputematchesbyinterleavingmatching
andmerging
esp.,whenmatchandmergefunctions
satisfymonotonicity properties. r345

r12 r45

r1 r2 r3 r4 r5
COLLECTIVE ENTITY RESOLUTION
COLLECTIVEENTITYRESOLUTION
CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Nonprobabilisticapproaches
p pp
SimilarityPropagation
ProbabilisticModels
GenerativeModels
G i M d l
UndirectedModels
HybridApproaches
y pp
SIMILARITY PROPAGATION
SIMILARITYPROPAGATION
SimilarityPropagationApproaches
Similaritypropagationalgorithmsdefineagraphwhichencodes
thesimilaritybetweenentitymentionsandmatchingdecisions,
andcomputematchingdecisionsbypropagatingsimilarityvalues.
Detailsofconstructedgraphandhowthesimilarityiscomputedvaries
Algorithmsareusuallydefinedprocedurally
Algorithms are usually defined procedurally
Whileprobabilitiesmaybeencodedinvariouswaysinthealgorithms,there
isnoglobalprobabilisticmodeldefined

Approachesoftenmorescalablethanglobalprobabilisticmodels
DependencyGraph
[Dong et al SIGMOD05 ]
[Dongetal.,SIGMOD05]
Constructagraphwherenodesrepresentsimilaritycomparisons
between attribute values (realvalued)
betweenattributevalues(real valued)andmatchdecisionsbased
and match decisions based
onmatchingdecisionsofassociatednodes(booleanvalued)

Asmentionsareresolved,enrichedtocontainassociatednodesof
allmatchedmentions

Similaritypropagateduntilfixedpointisreached

Negativeconstraints(notmatchnodes)arecheckedaftersimilarity
propagationisperformed,andinconsistenciesarefixed
ExploittheDependencyGraph
Slid f
Slidesfrom[Dongetal,SIGMOD05]
[D t l SIGMOD05]

(a1,a4) (Distributed,Distributed )

(RobertS.Epstein,Epstein,R.S.)
(169180,169180)

(a2,a5)

(p1,p2)
(MichaelStonebraker,Stonebraker,M.)

(v1,v2)
(a3,a6)

(EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)

Referencesimilarity Attributesimilarity
ExploittheDependencyGraph

(a1,a4) (Distributed,Distributed )

(RobertS.Epstein,Epstein,R.S.)
(169180,169180)

(a2,a5)

(p1,p2)
(MichaelStonebraker,Stonebraker,M.)

(v1,v2)
(a3,a6)

(EugeneWong,Wong,E.) (ACM,ACMSIGMOD) (1978,1978)

Reconciled Similar
CollectiveRelationalClustering
[Bh
[Bhattacharya&Getoor,TKDD07]
h &G TKDD07]
Constructagraphwhereleafnodesareindividual
mentions
i
Performhierarchicalagglomerativeclusteringtomerge
clustersofmentions
l t f ti
Similaritycomputedbasedonacombinationofattribute
and relational similarity
andrelationalsimilarity
Whenclustersaremerged,updatethesimilaritiesofany
related clusters (clusters corresponding to mentions
relatedclusters(clusterscorrespondingtomentions
whichcooccurwithmergedmentions)
ObjectiveFunction
Mi i i
Minimize:

w sim
i j
A A (ci ,c j ) wR simR (ci , c j )

weightfor similarityof weightfor Similaritybasedonrelationaledges


attributes
b attributes
b relations
l b
betweenc i andc
d j

Greedy clustering algorithm: merge cluster pair with max


reduction in objective function

where for example


simA (ci , c j ) sim(c , c )
aAttributes
*
i
*
j
for cluster representative c*

and
simR (ci , c j ) sim jaccard ( N (ci ),
) N (c j ))
where N(c) are the relational neighbors of c
RelationalClusteringAlgorithm
1. Findsimilarreferencesusingblocking
2. Bootstrapclustersusingattributesandrelations
3. Computesimilaritiesforclusterpairsandinsertintopriorityqueue

4. Repeatuntilpriorityqueueisempty
p p yq py
5. Findclosestclusterpair
6. Stopifsimilaritybelowthreshold
7
7. If no negative constraints violated
Ifnonegativeconstraintsviolated
8. Mergetocreatenewcluster
9. Constructcanonicalclusterrepresentative
10. Updatesimilarityforrelatedclusters

O(nklogn)algorithmw/efficientimplementation
SimilaritypropagationApproaches
Method Notes Constraints Evaluation
RelDC Reference Modelchoice Context Accuracyand
[Kalashnikovet disambiguation nodesidentified attraction runtime forAuthor
al,TODS06]
l TODS06] usingusing
i i usingfeature
i f measurestheh resolutionand
l i d
Relationship basedsimilarity relational directorresolution
baseddata similarity inMovie database
cleaning(RelDC)
g( )
Reference Dependency Reference Both positive Precision/Recall,
Reconciliation Graphfor enrichment andnegative F1onpersonal
[Dongetal, propagating Explicitly handle constraints information
SIGMOD05] similarities+ missingvalues managementdata
enforcenon Parametersset (PIM),Coradataset
match byhand
constraints
Collective Modified Constructs Focuson Precision/Recall,
Relational hierarchical canonical entity coauthor F1onthree
Clustering
g agglomerative
gg asmergesare
g resolution and bibliographic
g p
[Bhattacharya& clustering made propagation datasets:CiteSeer,
Getoor,TKDD07] approach ArXiv,andBioBase,
andsyntheticdata
PROBABILISTICMODELS:
PROBABILISTIC MODELS:
GENERATIVEAPPROACHES
GenerativeProbabilisticApproaches
ProbabilisticsemanticsbasedonDirectedModels
Model
Modeldependenciesbetweenmatchdecisionsinagenerative
dependencies between match decisions in a generative
manner
Disadvantage:acyclicity requirement
Varietyofapproaches
BasedonLatentDirichlet Allocation,BayesianNetworks
Examples
LatentDirichlet Allocation[Bhattacharya&Getoor,SDM07]
ProbabilisticRelationalModels[Pasula etal,NIPS02]
LDAforEntityResolution:Discovering
Groups from CoOccurrence
GroupsfromCo OccurrenceRelations
Relations
Stephen P Johnson Stephen C Johnson

Chris Walshaw Kevin McManus Alfred V Aho Ravi Sethi

M k Cross
Mark C ss M tin E
Martin Everett
tt J ff
Jeffrey D Ullman
Ull

Parallel Processing Research Group Bell Labs Group

P1: C. Walshaw, M. Cross, M. G. Everett, P4: Alfred V. Aho, Stephen C. Johnson,


S. Johnson J ff
Jefferey D
D. Ull
Ullman

P2: C. Walshaw, M. Cross, M. G. Everett, P5: A. Aho, S. Johnson, J. Ullman


S. Johnson, K. McManus

P3: C. Walshaw, M. Cross, M. G. Everett P6: A. Aho, R. Sethi, J. Ullman


LDAERModel
Entity label a and group label z for
each reference r
: mixture
mixture of groups for each co-
occurrence
z: multinomial for choosing entity a
for each group z
Va: multinomial for choosing
z reference r from entity a
Dirichlet priors with and

a
T

InferenceusingblockedGibbssampling
r V
forefficiency(andimprovedaccuracy)
A
R
P
GenerativeApproaches
Method Learning/Inference Evaluation
Method
[Li,Morie,&
[Li Morie & Generative
Generative Truncated EMtolearn
EM to learn F1on
F1 on person
person
Roth,AAAI04] modelfor parametersandMAP names,
mentionsin inferenceforentities locationsand
documents (unsupervised) organizationsin
TRECdataset
Probabilistic Probabilistic Parameterslearned %ofcorrectly
Relational Relational onseparatedcorpora, identified
M d l [P l Models
Models[Pasula M d l i f
inferencedoneusing
d i clusters
l on
etal.,NIPS03] MCMC subsetsof
CiteSeer data
LatentDirichlet
Latent Dirichlet Latent
LatentDirichlet
Dirichlet BlockedGibbs
Blocked Gibbs Precision/Recall
Allocation Allocation Sampling /F1onCiteSeer
[Bhattacharya Model Unsupervised andHEPdata
&Getoor, approach
SDM06]
PROBABILISTICMODELS:
PROBABILISTIC MODELS:
UNDIRECTEDAPPROACHES
UndirectedProbabilisticApproaches
ProbabilisticsemanticsbasedonMarkovNetworks
Advantage:noacyclicity
Advantage: no acyclicity requirements
Insomecases,syntaxbasedonfirstorderlogic
Advantage:declarative
g

Examples
ConditionalRandomFields(CRFs)[McCallum&Wellner,
NIPS04]
MarkovLogicNetworks(MLNs)[Singla &Domingos,ICDM06]
ProbabilisticSimilarityLogic[Broecheler &Getoor,UAI10]
MarkovLogic
AlogicalKBisasetofhardconstraints onthesetof
possibleworlds
Makethemsoftconstraints;whenaworldviolatesa
formula,itbecomeslessprobablebutnotimpossible
Giveeachformulaaweight
Higher weight Strongerconstraint
Higherweight Stronger constraint

P(world) exp weights o f formula s it sat isfies

[Richardson&Domingos,06]
MarkovLogic
AMarkovLogicNetwork(MLN) isasetofpairs(F,w)
where
F isaformulainfirstorderlogic
w isarealnumber

# true groundings
of ith clause

1
P( X ) exp wi ni ( x)
Z iF
Normalization Constant Iterate over all first-order MLN formulas

[Richardson&Domingos,06]
ERProblemFormulationinMLNs
Given
A
ADBofrecordsrepresentingmentionsofentitiesinthereal
DB of records representing mentions of entities in the real
world,e.g.papermentions
Asetoffieldse.g.author,title,venue
A set of fields e g author title venue
Eachrecordrepresentedasasetoftypedpredicatese.g.
HasAuthor(paper,author),HasVenue(paper,venue)
(p p , ), (p p , )

Goal
Todeterminewhichoftherecords/fieldsrefertothesame
d h h f h d /f ld f h
underlyingentity

Slidesfrom[Singla &Domingos,ICDM06]
HandlingEquality
IntroduceEquals(x,y) orx=y
Introducetheaxiomsofequality
I t d th i f lit
Reflexivity: x=x
Symmetry: x=y y=x
Transitivity: x=y y=z z=x
PredicateEquivalence:
x11 =xx2 y11 y22 (R(x1,y ) R(x2,y2))
, y1)
Positive,SoftEvidence
Introducereversepredicateequivalence
SSamerelationwiththesameentitygivesevidenceabout
l ti ith th tit i id b t
twoentitiesbeingsame
R(x1,y1) R(x2,y2) x1 =x2 y2 =y2

Nottruelogically,butgivesusefulinformation
Example
J Cox) HasAuthor(C2,CoxJ.)
HasAuthor(C1,J.Cox)
HasAuthor(C1 HasAuthor(C2 Cox J ) C1 = C2
C1=C2
(J.Cox=CoxJ.)
FieldComparison
Eachfieldisastringcomposedoftokens
IntroduceHasWord(field,word)
Introduce HasWord(field word)
Usereversepredicateequivalence
HasWord(f1,w1) HasWord(f2,w2) w1= w2 f1=f2
Example
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) (Cox=Cox)
(J.Cox=CoxJ.)
Canhavedifferentweightforeachword
h d ff h f h d
TwolevelSimilarity
Individualwordsasunits:Cantdealwithspelling
mistakes
Breakeachwordintongrams:Introduce
HasNgram(word,ngram)
Usereversepredicateequivalenceforwordcomparisons
RecordMatching
SimplestVersion:Fieldsimilaritiesmeasuredby
presence/absenceofwordsincommon
HasWord(f1,w1) HasWord(f2,w2) HasField(r1,f1)
HasField(r2,f2) w1= w2 r1=r2
Example
E l
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) HasAuthor(P1,
) HasAuthor(P2,CoxJ.)
J.Cox) ( , ) ((Cox=Cox)) ((P1=P2))
Transitivity
) (f2 =ff3)
(f11 =ff2) ) ( f3 =ff1)
AdditionalConstraints
HasAuthor(c,a
HasAuthor(c ) HasAuthor(c,a
a1) ) Coauthor(a1,a
HasAuthor(c a2) a2)
Coauthor(a1,a2) Coauthor(a3,a4) a1=a3 a2=a4
Inference
Usecheapheuristics(e.g.TFIDFbasedsimilarity)to
identifyplausiblepairs
yp p
Inference/learningoverplausiblepairs
Inferencemethod:lazygrounding+MaxWalkSAT
Learning:supervisedandtransfer(learn/handsetonone
g p ( /
domainandtransferred)
ProbabilisticSoftLogic
[Broecheler &Getoor,UAI10]
& Getoor UAI10]
Declarativelanguagefordefiningconstrainedcontinuous
Markov random field (CCMRF) using first order logic
Markovrandomfield(CCMRF)usingfirstorderlogic
(FOL)
Softlogic:truthvaluesin[0,1]
Soft logic: truth values in [0 1]
LogicaloperatorsrelaxedusingLukasiewicz tnorms
Mechanismsforincorporatingsimilarityfunctions,and
Mechanisms for incorporating similarity functions and
reasoningaboutsets
MAPinferenceisaconvexoptimization
MAP inference is a convex optimization
Efficientsamplingmethodformarginalinference
FOLtoCCMRF
PSLconvertsaweightedruleintopotentialfunctionsby
penalizing its distance to satisfaction
penalizingitsdistancetosatisfaction,
isthetruthvalueofgroundruleunder
interpretationx
Thedistributionovertruthvaluesis

::weightofruler
weight of rule r
:allgroundingsofruler
p g
:PSLprogram
UndirectedApproaches
Method Learning/Inference Evaluation
Method
[McCallum&
[McCallum & Conditional
Conditional Graphpartitioning
Graph partitioning F1on
F1 on DARPA
DARPA
Wellner, RandomFields (Boykov etal.1999), MUC&ACE
NIPS04] (CRFs) performedvia datasets
capturing correlationclustering
transitivity
constraints
[Singla & MarkovLogic Supervisedlearning ConditionalLog
D i
Domingos, N
Networks
k andinferenceusing
di f i lik lih d and
likelihood d
ICDM06] (MLNs) MaxWalkSAT &MCMC AUConCora
andBibServ
data

[Broecheler & Probabilistic Supervisedlearning Precision/Recall


Getoor,UAI10] SimilarityLogic andinferenceusing /F1Ontology
(PSL) continuous Alignment
optimization
HYBRID APPROACHES
HYBRIDAPPROACHES
HybridApproaches
Constraintbasedapproachesexplicitlyencoderelational
constraints
Theycanbeformulatedashybridofconstraintsand
probabilisticmodels
Orasconstraintoptimizationproblem
Examples
ConstraintbasedEntityMatching[Shen,Li&Doan,AAAI05]
Dedupalog [Arasu,Re,Suciu,ICDE09]
Dedupalog [Arasu etal.,ICDE09]

PaperRef(id, title, conference, publisher, year) Datatobe


Wrote(id authorName,
Wrote(id, authorName Position) deduplicated

TitleSimilar(title1,title2) (Thresholded)Fuzzy
A th Si il ( th 1 th 2)
AuthorSimilar(author1,author2) J i O t t
JoinOutput

Step(0)Createinitialapproximatematches;thisisinputtoDedupalog.

Step(1)Declaretheentities ClusterPapers,Publishers,&Authors

Paper!(id) :- PaperRef(id,-,-,-) Dedupalog isflexible:


Publisher!(p) :- PaperRef(-,-,-,p,-) UniqueNamesAssumption(UNA)
Author!(a)
( ) :- Wrote(-,a,-)
(, ,) P bli h (UNA) d P
Publishers(UNA)andPapers(NOTUNA)
(NOT UNA)

Slidesbasedon[Arasu,Re,Suciu,ICDE09]
Step(2)DeclareClusters
InputintheDB

PaperRef(id, title, conference, publisher, year) Clusterpapers,


Wrote(id authorName,
Wrote(id, authorName Position) publishers and authors
publishers,andauthors

TitleSimilar(title1,title2) Paper!(id) :- PaperRef(id,-,-,-)


A th Si il ( th 1 th 2)
AuthorSimilar(author1,author2) Publisher!(p) :- PaperRef(-,-,-,p,-)
PaperRef(- - - p -)
Author!(a) :- Wrote(-,a,-)

Clustersaredeclared using*(likeIDBsorViews):Theseareoutput

Author*(a1,a2) <-> AuthorSimilar(a1,a2) Author1 Author2


AA Arvind Arasu
Clusterauthorswithsimilarnames
Arvind A Arvind Arasu

*IDBsareequivalencerelations:
q
Symmetric,Reflexive,&Transitively ADedupalog
A D d l program isa
i
ClosedRelations:i.e.,Clusters setofdataloglikerules
152
SimpleConstraints

Paperswithsimilartitlesshouldlikelybeclusteredtogether
Paper*(id1,id2) <-> PaperRef(id1,t1,-), PaperRef(id2,t2,-),TitleSimilar(t1,t2)

Author*(a1,a2) <-> AuthorSimilar(a1,a2) ((<>)Softconstraints:


)
Payacostifviolated.

(id1,id
Paper*(id
Paper id2) <= PaperEq(id1,id
id2 )
(<=)Hardconstraints:Any
Paper*(id1,id2) <= PaperNeq(id1,id2) clusteringmustsatisfythese
PapersinPaperEQ must beclusteredtogether,
thoseinPaperNEQ mustnot beclusteredtogether

1. PaperEQ,PaperNEQ arerelations(EDBS)
2. denotesNegationhere.
Additional Constraints
AdditionalConstraints
Clusteringtwopapers,thenmustclustertheirfirstauthors
Author*(a1,a2) <= Paper*(id1,id2), Wrote(id1,a1,1), Wrote(id2,a2,1)

Cl
Clusteringtwopapersmakesitlikelyweshouldclustertheirpublisher
i k i lik l h ld l h i bli h
Publisher*(x,y) <- Publishes(x,p1), Publishes(x,p2),Paper*(p1,p2)

if
iftwoauthorsdonotsharecoauthors,thendonotclusterthem
two authors do not share coauthors, then do not cluster them
Author (x, y) <- (Wrote(x, p1,), Wrote(y, p2,), Wrote(z, p1,),
Wrote(z, p2,), Author(x, y))
Dedupalog viaCC

Semantics:TranslateaDedupalog Programtoasetofgraphs
Nodesarereferences(inthe!Relation) EntityReferences:Conference!(c)

VLDBJ
Conference*(c1,c2) <-> ConfSim(c1,c2)
VLDB
Positiveedges VLDBconf
[ ] Negativeedgesareimplicit
[] Negative edges are implicit ICDE
ICDT
InternationalConf.DE

Forasinglegraphw.o.hardconstraints
we can reuse prior work for O(1) apx
wecanreusepriorworkforO(1)apx.

156
CorrelationClustering

Soft Hard
(c1,cc2) <-
Conference*(c
Conference < ConfSim(c1,cc2) Positive Equal
[]Negative NotEqual
Conference*(c1,c2) <= ConfEQ(c1,c2)
VLDBJ
Conference*(c
C f *( 1,c2) <=
< ConfNEQ(c
C fNEQ( 1,c2)
VLDB
VLDBconf

ICDE

ICDT InternationalConf.DE
1. Pick a random order of edges
2. While there is a soft edge do
1. Pick first soft edge in order
2. If turn into
3. Else is [
[-]
] turn into
4. Deduce labels
3. Return Transitively closed subsets
Voting
Extendalgorithmtowhole languageviavotingtechnique.
Support different entity types recursive programs etc
Supportdifferententitytypes,recursiveprograms,etc.

Manydedupalog
y p gp programs
g Thm: Arecursivehard
haveanO(1)apx constraintsnoO(1)apx
Thm:AllsoftprogramsO(1) Expert:multiwaycuthard

Systemproperties:
(1)Streamingalgorithm
(2)linearin#ofmatches(notn2)
(3)Userinteraction

Features:Supportforweights,referencetables
(partially),andcorrespondinghardnessresults.
HybridApproaches
Method Evaluation
Constraint Twolayermodel: Researchers
basedEntity Layer1:Generativemodelfordatasetsthatsatisfy andIMDBwith
Matching constraints; noiseadded
[Shen,Li& Layer2:EMalgorithmandtherelaxationlabeling
Doan, algorithmtoperformmatching.Ineachiteration,use
AAAI05]; EM to estimate parameters of the generative model
EMtoestimateparametersofthegenerativemodel
buildson(Li, andamatchingassignment,thenemploysrelaxation
Morie,&Roth, labelingtoexploittheconstraints
AIMag 2004)
Dedupalog Declarativespecificationforrichcollectionof Precision/Recall
[Arasu,Re, constraintswithnicesyntactic sugaraddedtodatalog onCora,subset
Suciu,ICDE09] forER.Inference:Correlationclustering+voting ofACMdataset
Summary:CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Similaritypropagationapproaches
yp p g pp
ProbabilisticModels
GenerativeModels
UndirectedModels
U di dM d l
HybridApproaches

Nonprobabilisticapproachesoftenscalebetterthangenerative
probabilisticapproaches
Undirected/constraintbasedmodelsareofteneasiertospecify
Scalingundirectedmodelsactiveareaofresearch
PART 3
PART3
SCALING ER TO BIGDATA
SCALINGERTOBIG DATA
ScalingERtoBigData
Blocking/CanopyGeneration
DistributedER
Distributed ER
PART 3
PART3a
BLOCKING/CANOPY GENERATION
BLOCKING/CANOPYGENERATION
Blocking:Motivation
Navepairwise:|R|2 pairwise comparisons
1000
1000businesslistingseachfrom1,000differentcitiesacross
business listings each from 1,000 different cities across
theworld
1trillioncomparisons
11.6days (ifeachcomparisonis1s)

Mentionsfromdifferentcitiesareunlikelytobematches
BlockingCriterion:City
1billioncomparisons
16minutes(ifeachcomparisonis1s)
Blocking:Motivation
Mentionsfromdifferentcitiesareunlikelytobematches
Maymisspotentialmatches
May miss potential matches
Blocking:Motivation
PairsofRecords
MatchingPairs
satisfying
ofRecords
Blockingcriterion

SetofallPairs
ofRecords
BlockingAlgorithms1
Hashbasedblocking
EachblockCi isassociatedwithahashkeyhi.
Mentionx ishashedtoCi ifhash(x)=hi.
Withinablock,allpairsarecompared.
Each hash function results in disjoint blocks.
Eachhashfunctionresultsindisjointblocks.

Whathashfunction?
Deterministicfunctionofattributevalues
BooleanFunctionsoverattributevalues
[[Bilenko etalICDM06,MichelsonetalAAAI06,
, ,
DasSarma etalCIKM12]
minHash (minwiseindependentpermutations)
[Broder etalSTOC98]
BlockingAlgorithms2
Pairwise Similarity/Neighborhoodbasedblocking
Nearby
Nearbynodesaccordingtoasimilaritymetricareclustered
nodes according to a similarity metric are clustered
together
Resultsinnondisjointcanopies.

Techniques
SortedNeighborhoodApproach[HernandezetalSIGMOD95]
CanopyClustering[McCallumetalKDD00]
SimpleBlocking:InvertedIndexonaKey
Examplesofblockingkeys:
Firstthreecharactersoflastname
First three characters of last name
City+State+Zip
CharacterorTokenngrams
Minimuminfrequentngrams
LearningOptimalBlockingFunctions
Usingoneormoreblockingkeysmaybeinsufficient
2,376,206American
2,376,206 AmericansssharedthesurnameSmithinthe2000US
shared the surname Smith in the 2000 US
NULLvaluesmaycreatelargeblocks.

Solution:Constructblockingfunctionsbycombining
simplefunctions
ComplexBlockingFunctions
Conjunctionoffunctions[MichelsonetalAAAI06,Bilenko etalICDM06]
{City}AND{lastfourdigitsofphone}

Chaintrees[DasSarma etalCIKM12]
If
If({City}=NULLorLA)then
({Ci } NULL LA) h {lastfourdigitsofphone}AND{areacode}
{l f di i f h } AND { d }
else {lastfourdigitsofphone}AND{City}

BlkTrees [DasSarma etalCIKM12]


LearninganOptimalfunction[Bilenko etalICDM06]
Findkblockingfunctionsthateliminatethemostnon
matches,whileretainingalmostallmatches.
, g
Needatrainingsetofpositiveandnegativepairs

AlgorithmIdea:RedBlueSetCover

PositiveExamples PickkBlockingkeyssuchthat
(a)Atmost bluenodesare
Blocking Keys
BlockingKeys notcovered
(b)Numberofrednodes
coveredisminimized
NegativeExamples
LearninganOptimalfunction[Bilenko etalICDM06]
AlgorithmIdea:RedBlueSetCover
PositiveExamples
p Pick k Blocking keys such that
PickkBlockingkeyssuchthat
(a)Atmost bluenodesare
BlockingKeys notcovered
(b)Numberofrednodes
coveredisminimized
NegativeExamples

GreedyAlgorithm:
Greedy Algorithm
Constructgoodconjunctionsofblockingkeys{p1,p2,}.
Pick k conjunctions {pi1,p
Pickkconjunctions{p pi2,,p
pik}},suchthatthefollowingis
such that the following is
minimized
minHash (Minwise IndependentPermutations)
LetFx beasetoffeaturesformentionx
(functionsof)attributevalues
(functions of) attribute values
characterngrams
optimalblockingfunctions
Let bearandompermutationoffeaturesinFx
E.g.,orderimposedbyarandomhashfunction

minHash(x)=minimumelementinFx accordingto
WhyminHash works?
Surprisingproperty:Forarandompermutation,

Howtobuildablockingschemesuchthatonlypairswith
How to build a blocking scheme such that only pairs with
Jacquardsimilarity>sfallinthesameblock(withhighprob)?

Probabilitythat
(x,y)mentionsare `
bl k d t th
blockedtogether

Similarity(x,y)
BlockingusingminHashes
ComputeminHashes usingr*kpermutations(hash
functions)
)

Band of r minHashes
Bandofr

k blocks

Signatures
Signature sthatmatchon1outofk
that match on 1 out of k bands,gotothe
bands go to the
sameblock.
minHash Analysis
FalseNegatives:(missingmatches)
Sim(s) P(notsame
P(pair x y notinthesameblock
P(pairx,y not in the same block block)

withJacquardsim =s) 0.9 108

0.8 0.00035
shouldbeverylowforhighsimilaritypairs
should be very low for high similarity pairs
0.7 0.025

0.6 0.2
False Positives: (blocking nonmatches)
FalsePositives:(blockingnonmatches)
0.5 0.52
P(pairx,y inthesameblock
0.4 0.81
with Jacquard sim =s)
withJacquardsim s)
0.3 0.95

0.2 0.994

0.1 0.9998
CanopyClustering[McCallumetalKDD00]
Input:MentionsM,
d(x,y),adistancemetric, Inmultiple
canopies
thresholdsT1 >T2
Algorithm:
1 PickarandomelementxfromM
1. Pi k d l t f M
2. CreatenewcanopyCx using
mentionsys.t.d(x,y)<T
y ( ,y) 1
3. Deleteallmentionsy fromM
s.t.d(x,y)<T2(fromconsiderationinthisalgorithm) Eachelement
4. ReturntoStep1ifM isnotempty. h
hasasingle
i l
primarycanopy
PART 3 b
PART3b
DISTRIBUTED ER
DISTRIBUTEDER
DistributedER
Mapreduceisverypopularforlargetasks
M d i l f l k
Simpleprogrammingmodelformassivelydistributeddata

Hadoop providesfaulttoleranceandisopensource

MapPhase ReducePhase
(perrecordcomputation) (globalcomputation)

Shuffle
ERwithDisjointBlocking
ComputeBlocksinMap RemainingERinReduce
Map Phase
MapPhase Reduce Phase
ReducePhase
(perrecordcomputation) (globalcomputation)

Shuffle

BlockID Noneedtocompare
recordsacross
reducers
NondisjointBlocking
Howtoblock?
Hashbased:
Hash based:needanefficienttechniquetogrouprecordsif
need an efficient technique to group records if
theymatchonnoutofkblockingkeys[Vernica etalSIGMOD10]
Distancebased:canopyclusteringonmapreduce[Mahout]
IterativeBlocking[Whang etalSIGMOD09]
Problem:Informationneededforarecordisin
p
multiplereducers.
Informationneededforarecordisinmultiplereducers.
Example1:
Example 1:
Reducer1:amatcheswithb
Reducer2:amatcheswithc
Needtocommunicateinordertocorrectlyresolvea,b,c
N dt i t i d t tl l b
Problem:Informationneededforarecordisin
p
multiplereducers.

Example2:Dedup papersandauthors
Id Author1 Author2 Paper
A1 JohnSmith RichardJohnson IndicesandViews
A2 JSmith RJohnson SQLQueries
A3 Dr Smyth
Dr.Smyth R Johnson
RJohnson Indices and Views
IndicesandViews

Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.

Canopyclusteringresultsinnondisjointclusters
[McCallumetalKDD00]

J.Smith JohnS.
Richard John
Johnson Richard Smith JohnJacob
J h J b
Smith
RichardM. R.Smith
Johnson Canopy
Canopy
py
C
Canopy f
for
for
forSmith John
Richard

Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.

CoAuthor(A1,B1) CoAuthor(A2,B2) match(B1,B2) match(A1,A2)

CoAuthor rulegroundstothecorrelation
match(RichardJohnson,RJohnson)=>match(J.Smith,JohnSmith)

J.Smith JohnS.
Richard John
Steve Smith JohnJacob
Johnson
Johnson
R R.Smith
Johnson Canopy
Canopy
Canopy
Canopy for
for
forSmith John
Johnson
Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]
+CorrelationClustering/CollectiveERineachcomponent
C l i Cl i / C ll i ER i h

Solution2:CorrelationClustering/CollectiveERineachcanopy
g/ py
+MessagePassing[Rastogi etalVLDB11]
Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]
+CorrelationClustering/CollectiveERineachcomponent
C l i Cl i / C ll i ER i h

Connectedcomponentscanbelargeinrelational/multientityER.
p g / y

Solution2:CorrelationClustering/CollectiveERineachcanopy
+MessagePassing[Rastogi etalVLDB11]
MessagePassing

Simple Message Passing (SMP)


SimpleMessagePassing(SMP)
1. RunentitymatcherM locallyineachcanopy
2. IfM findsamatch(r1,r2)insomecanopy,passitas
evidencetoallcanopies
3. RerunM withineachcanopyusingnewevidence
4. Repeatuntilnonewmatchesfoundineachcanopy
il h f di h

Runtime: O(k2 f(k)c)


Runtime:O(k f(k) c)
k:maximumsizeofacanopy
f(k):TimetakenbyERoncanopyofsizek
c:numberofcanopies

Slideadaptedfrom[Rastogi etalVLDB11]talk
FormalProperties
forawellbehavedERmethod

Convergence:No.ofstepsno.ofmatches

Consistency:Outputindependentofthecanopyorder
y p p py

Soundness:Eachoutputmatchisactuallyatruematch

Completeness:Eachtruematchisalsoaoutputmatch

J.Smith JohnS.
Richard John
Johnson Richard Smith JohnJacob
Smith
RichardM. R.Smith
Johnson
Slideadaptedfrom[Rastogi etalVLDB11]talk
Completeness
Papers2and3matchonlyifacanopy
knows that
knowsthat
match(a1,a2)
match(b2,b3)
match(c2,c3)
t h( 2 3)

Simplemessagepassingwillnotfindanymatches
thus,nomessagesarepassed,noprogress

Solution:Maximalmessagepassing
Sendamessageifthereisapotentialformatch

Slideadaptedfrom[Rastogi etalVLDB11]talk
SummaryofScalability
O(|R|2)pairwise computationscanbeprohibitive.
Blockingeliminatescomparisonsonalargefractionofnonmatches.
EqualitybasedBlocking:
Construct(oneormore)blockingkeysfromfeatures
Recordsnotmatchingonanykeyarenotcompared.
Records not matching on any key are not compared
Neighbohood basedBlocking:
Formoverlappingcanopiesofrecordsbasedonsimilarity.
Onlycomparerecordswithinacluster.
Computingconnectedcomponents/MessagePassinginadditionto
blocking can help distribute ER
blockingcanhelpdistributeER.
P t4
Part4
CHALLENGESANDFUTURE
CHALLENGES AND FUTURE
DIRECTIONS
Challenges
Sofar,wehaveviewedERasaonetimeprocessappliedtoentire
database;noneoftheseholdinrealworld.
TemporalER
Temporal ER
ERalgorithmsneedtoaccountforchangeinrealworld
Reasoningaboutmultiplesources[Pal&M etal.WWW12]
Modeltransitions[LietalVLDB11]
Reasoningaboutsourcequality
Sourcesarenotindependent
CopyingProblem[DongetalVLDB09]
QueryTimeER
Howdoweselectivelydeterminethesmallestnumberofrecordstoresolve,so
wegetaccurateresultsforaparticularquery?
Collectiveresolutionforqueries
Collective resolution for queries [Bhattacharya&GetoorJAIR07]
[Bhattacharya & Getoor JAIR07]
ER&Usergenerateddata
Deduplicated entitiesinteractwithusersintherealworld
Userstag/associatephotos/reviewswithbusinessesonGoogle/Yahoo
Whatshouldbedonetosupportinteractions?
OpenIssues
ERisoftenpartofbiggerinferenceproblem
Pipelinedapproachesandjointapproachestoinformationextraction
and graph identification
andgraphidentification
HowcanwecharacterizehowERerrorsaffectoverallqualityof
results?
ERTheory
ER Theory
Needbettersupportfortheorywhichcangiverelationallearning
bounds
ER&Privacy
ER & Privacy
ERenablesrecordreidentification
HowdowedevelopatheoryofprivacypreservingER?
ERBenchmarks
ER B h k
NeedforlargescalerealworldERdatasetswithgroundtruth
Syntheticdatausefulforscalingbuthardtocapturerichcomplexities
ofrealworld
f l ld
Summary
Growingomnipresenceofmassivelinkeddata,andtheneed
forcreatingknowledgebasesfromtextandunstructureddata
motivate a number of challenges in ER
motivateanumberofchallengesinER

EspeciallyinterestingchallengesandopportunitiesforERand
socialmedia/usergenerateddata
/

As
Asdata,noise,andknowledgegrows,greaterneeds&
data noise and knowledge grows greater needs &
opportunitiesforintelligentreasoningaboutentityresolution

Manyotherchallenges
M th h ll
Largescaleidentitymanagement
Understandingtheoreticalpotentials&limitsofER
THANK YOU!
THANKYOU!
References Intro
W.Willinger etal,MathematicsandtheInternet:ASourceofEnormousConfusionand
GreatPotential,NoticesoftheAMS56(5),2009
L Gill and M Goldcare English
L.GillandM.Goldcare, EnglishNationalRecordLinkageofHospitalEpisodeStatisticsand
National Record Linkage of Hospital Episode Statistics and
DeathRegistrationRecords,ReporttotheDepartmentofHealth,2003
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer2007
A. Elmagrid etal,
A.Elmagrid et al, Duplicate
DuplicateRecordDetection
Record Detection,,TKDE2007
TKDE 2007
P.Christen,DataMatching,Springer2012
N.Koudas etal,RecordLinkage:SimilaritymeasuresandAlgorithms,SIGMOD2006
X. Dong & F. Naumann, Data
X.Dong&F.Naumann, Datafusion
fusionResolving
Resolvingdataconflictsforintegration
data conflicts for integration,,VLDB2009
VLDB 2009
L.Getoor &A.Machanavajjhala,EntityResolution:Theory,PracticeandOpenChallenges,
AAAI2012
References SingleEntityER
D.Menestrina etal,EvaluationEntityResolutionResults,PVLDB3(12),2010
M.Cochinwala etal,Efficientdatareconciliation,InformationSciences137(14),2001
M Bilenko &R.Mooney,
M.Bilenko & R Mooney Adaptive
AdaptiveDuplicateDetectionUsingLearnableStringSimilarity
Duplicate Detection Using Learnable String Similarity
Measures,KDD2003
P.Christen,Automaticrecordlinkageusingseedednearestneighbour andsupportvector
machineclassification.,KDD2008
Z.Chenetal,Exploitingcontextanalysisforcombiningmultipleentityresolutionsystems,
SIGMOD2009
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
f
H.Newcombe etal,Automaticlinkageofvitalrecords,Science1959
I.Fellegi &A.Sunter,ATheoryforRecordLinkage,JASA1969
W.Winkler,OverviewofRecordLinkageandCurrentResearchDirections,ResearchReport
Series,USCensus,2006
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer,2007
P R ik
P.Ravikumar &W C h
&W.Cohen,AHierarchicalGraphicalModelforRecordLinkage,UAI2004
A Hi hi l G hi l M d l f R d Li k UAI 2004
References SingleEntityER(contd.)
S.Sarawagi etal,InteractiveDeduplication usingActiveLearning,KDD2000
S.Tejada etal,LearningObjectIdentificationRulesforInformationIntegration,IS2001
A Arasu etal,
A.Arasu et al On
Onactivelearningofrecordmatchingpackages
active learning of record matching packages,SIGMOD2010
SIGMOD 2010
K.Bellare etal,Activesamplingforentitymatching,KDD2012
A.Beygelzimer etal,AgnosticActiveLearningwithoutConstraints,NIPS2010
J Wang et al CrowdER:
J.Wangetal, CrowdER:Crowdsourcing
Crowdsourcing EntityResolution
Entity Resolution,PVLDB5(11),2012
PVLDB 5(11) 2012
A.Marcusetal,HumanpoweredSortsandJoins,PVLDB5(1),2011
References SingleEntityER(contd.)
R.Gupta&S.Sarawagi,
R. Gupta & S. Sarawagi, Answering
AnsweringTableAugmentationQueriesfromUnstructuredListsontheWeb
Table Augmentation Queries from Unstructured Lists on the Web,,
PVLDB2(1),2009
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,CIKM
2012
M Bilenko etal,
M.Bilenko et al Adaptive
AdaptiveProductNormalization:UsingOnlineLearningforRecordLinkagein
Product Normalization: Using Online Learning for Record Linkage in
ComparisonShopping,ICDM2005
S.Chaudhuri etal,RobustIdentificationofFuzzyDuplicates,ICDE2005
W.Soonetal,Amachinelearningapproachtocoreference resolutionofnounphrases,
C
ComputationalLinguistics27(4)2001
t ti l Li i ti 27(4) 2001
N.Bansal etal,CorrelationClustering,MachineLearning56(13),2004
V.Ng&C.Cardie,Improvingmachinelearningapproachestocoreference resolution,ACL2002
, g p g
M.Elsner &E.Charnaik,Youtalkingtome?acorpusandalgorithmforconversation
disentanglement,ACLHLT2008
M.Elsner &W.Schudy,BoundingandComparingMethodsforCorrelationClusteringBeyondILP,
ILPNLP2009
N Ailon etal,
N.Ailon et al Aggregating
Aggregatinginconsistentinformation:Rankingandclustering
inconsistent information: Ranking and clustering,JACM55(5),2008
JACM 55(5) 2008
X.Dongetal,IntegratingConflictingData:TheRoleofSourceDependence,PVLDB2(1),2009
A.Paletal,InformationIntegrationoverTimeinUnreliableandUncertainEnvironments,WWW
2012
A.Culotta etal,CanonicalizationofDatabaseRecordsusingAdaptiveSimilarityMeasures,KDD2007
O.Benjelloun etal,Swoosh:AgenericapproachtoEntityResolution,VLDBJ18(1),2009
References Constraints&MultiRelationalER
R.Ananthakrishna
A h k i h et.al,Eliminatingfuzzyduplicatesindatawarehouses,VLDB2002
l li i i f d li i d h VL 2002
A.Arasu etal,LargeScaleDeduplication withConstraintsusingDedupalog,ICDE2009
S.Chaudhuri etal.,Leveragingaggregateconstraintsfordeduplication,SIGMOD07
X.Dongetal,ReferenceRecounciliation inComplexInformationSpaces,SIGMOD2005
I.Bhattacharya&L.Getoor,CollectiveEntityResolutioninRelationalData,TKDD2007
I.Bhattacharya&L.Getoor,ALatentDirichlet ModelforUnsupervisedEntityResolution,SDM2007
P.Bohannonetal.,ConditionalFunctionalDependenciesforDataCleaning,ICDE 2007
M.Broecheler &L.Getoor,ProbabilisticSimilarityLogic,UAI2010
, y g ,
W.Fan,Dependenciesrevisitedforimprovingdataquality,PODS2008
H.Pasula etal,IdentityUncertaintyandCitationMatching,NIPS2002
D.Kalashnikovetal,DomainIndependentDataCleaningviaAnalysisofEntityRelationshipGraph,
TODS06
J.Laffertyetal,ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabeling
SequenceData.,ICML2001
X.Lietal,IdentificationandTracingofAmbiguousNames:DiscriminativeandGenerative
Approaches,AAAI2004
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
M.Richardson&P.Domingos,MarkovLogic,MachineLearning62,2006
W.Shen etal.,ConstraintbasedEntityMatching,AAAI2005
P.Singla &P.Domingos,EntityResolutionwithMarkovLogic,ICDM2006
Whang etal.,GenericEntityResolutionwithNegativeRules,VLDBJ2009
Whang elal.,JointEntityResolution,ICDE2012
References Blocking
M.Bilenko etal,AdaptiveBlocking:LearningtoScaleUpRecordLinkageandClustering,ICDM
2006
M.Michelson&C.Knoblock,LearningBlockingSchemesforRecordLinkage,AAAI2006
g g g
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,
CIKM2012
A.Broder etal,MinWiseIndependentPermutations,STOC1998
G P di etal,Beyond100millionentities:largescaleblockingbasedresolutionfor
G.Papadias l B d 100 illi ii l l bl ki b d l i f
heterogenous data,WSDM2012
M.Hernandez&S.Stolfo,Themerge/purgeproblemforlargedatabases,SIGMOD1995
A. McCallum et al, Efficient
A.McCallumetal, Efficientclusteringofhigh
clustering of highdimensional
dimensionaldatasetswithapplicationto
data sets with application to
referencematching,KDD2000
L.Kolbetal,Dedoop:Efficientdeduplication withHadoop,(demo)PVLDB5(12),2012
R.Vernica etal,EfficientParallelSetSimilarityJoinsUsingMapReduce,SIGMOD2010
ApacheMahout:ScalableMachineLearningandDataMining,http://mahout.apache.org/
S.Whang etal,EntityResolutionwithIterativeBlocking,SIGMOD2009
U.Kangetal,PEGASUS:APetaScaleGraphMiningSystem Implementationand
Observations,ICDM2009
Observations ICDM 2009
V.Rastogi etal,FindingConnectedComponentsonMapreduceinPolyLogRounds,Corr 2012
V.Rastogi etal,LargeScaleCollectiveEntityMatching,PVLDB4(4),2011
References Challenges&FutureDirections
I.BhattacharyaandL.Getoor,"QuerytimeEntityResolution",JAIR2007
X.Dong,L.BertiEquille,D.Srivastava,Truthdiscoveryandcopyingdetectioninadynamic
world,VLDB2009
world VLDB 2009
P.Li,X.Dong,A.Maurino,D.Srivastava,LinkingTemporalRecords,VLDB2011
A. Pal,V.Rastogi,A.Machanavajjhala,P.Bohannon,Informationintegrationovertimein
unreliable and uncertain environments,,WWW2012
unreliableanduncertainenvironments WWW 2012

You might also like