Scalable Entity Resolution
PLAGIARISM DECLARATION
1. Plagiarism is the use of ideas, material and other intellectual property from another’s work and
presenting it as my own.
3. Accordingly, all quotations and contributions from any source whatsoever (including the
internet) have been cited fully. I understand that the reproduction of text without quotation
marks (even when the source is cited) is plagiarism.
5. I declare that the work contained in this thesis, except where otherwise stated, is my original work
and that I have not previously (in its entirety or in part) submitted it for grading in this
thesis or another thesis.
3055863
Student number Signature
Acknowledgements
With profound gratitude and immense pleasure, I would like to thank my supervisor Dr. Hajira
Jabeen for her utmost support, guidance, understanding and erudite supervision throughout my
thesis. I am wholeheartedly grateful to Prof. Dr. Jens Lehmann for letting me pursue my master
thesis at the Smart Data Analytics (SDA) group at the Rheinische Friedrich-Wilhelms-Universität
Bonn (at the Department of Computer Science). I am highly obliged to Mr. Gezim Sejdiu for
sharing his knowledge and experience in Spark and Scala and helping me resolve technical blockers
during the evaluation of the entity resolution code on the Spark clusters and its integration with the
SANSA project. Lastly, I would like to extend a special vote of thanks to my parents for
their endless love and support.
Abstract
Entity resolution is the crucial task of identifying and linking entities that refer to the same real
world object across various information spaces. It finds application in numerous tasks, particularly
in public sectors, such as de-duplicating entities in federal datasets related to medicine, finance,
transportation, business and law enforcement. With the growth of structured and semi-structured
data on the web in terms of volume and velocity, the task of linking records in heterogeneous data
collections becomes more complicated. It is difficult to find inferences and semantic relations between
entities across different datasets containing noisy data with typos, inconsistent and missing values
and loose schema bindings. As the pairwise comparison of entities over large datasets is
computationally expensive, exhibiting quadratic complexity, the need to identify a more generic
approach becomes more significant. The existing approaches reduce this complexity by aggregating
similar entities into blocks and then processing the blocks through a brute-force approach. In this
master thesis, we implement a more generic method for entity resolution in RDF data using the
Spark framework and Scala as the programming language. Our approach does not apply any prior
blocking, but still eliminates the quadratic comparison. We compare our approaches with existing
state-of-the-art methodologies and, through evaluation on five real world datasets, show that our
suggested approaches are effective, efficient and scalable over large heterogeneous information
spaces.
Key words:
Entity Resolution, Locality Sensitive Hashing (LSH), MinHash algorithm
Table of contents
PLAGIARISM DECLARATION ii
Acknowledgements iii
Abstract iv
List of figures ix
List of tables 1
1 INTRODUCTION 2
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 TECHNICAL DETAILS 5
2.1 Semantic web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 RDF data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Entity Resolution in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The world of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Hadoop Distributed File System(HDFS) . . . . . . . . . . . . . . . . . . . . . 17
2.4 Scala Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 RELATED WORK 20
3.1 Block Building Techniques in Entity Resolution . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Schema-based Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Schema-agnostic Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . 23
3.2 The Role of Meta-Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Block Processing Techniques in Entity Resolution . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Controlling Block Sizes for effective comparison . . . . . . . . . . . . . . . . . 27
4 APPROACH 29
4.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Locality Sensitive Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.1 HashingTF vs CountVectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.2 MinHash for Jaccard Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Approx Similarity Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Approaches for Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Approach 1 - minHash LSH method (considering all attributes) . . . . . . . 33
4.3.2 Approach 2 - minHash LSH method (1 or 2 attribute) . . . . . . . . . . . . . 34
4.3.3 Approach 3 - minHash LSH subjects and jaccard similarity attributes . . . . 36
4.4 Our Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.1 Storage Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.2 Input and Output for Approach-2 minHash LSH method (1 or 2 attribute) . 42
4.4.3 Storage and computation for Approach-3 minHash LSH subjects and Jaccard
similarity attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 EVALUATION 45
5.1 Cluster Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Configuration Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Evaluation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.1 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.2 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4.3 F-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5.1 Discussion of results for Approach-2 . . . . . . . . . . . . . . . . . . . . . . . 49
5.5.2 Execution Times for Approach-2 datasets with HashingTF and CountVector-
izer models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5.3 Discussion of results for Approach-3 . . . . . . . . . . . . . . . . . . . . . . . 53
5.5.4 Execution Times for Approach-3 datasets with HashingTF and CountVector-
izer models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 CONCLUSION 55
REFERENCES 57
List of Figures
4.14 Matched entities with intersecting attribute names filtered with Jaccard similarity
on attribute names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.15 Attribute values comparison for intersecting predicates . . . . . . . . . . . . . . . . . 42
4.16 Similar Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
List of Tables
CHAPTER 1
INTRODUCTION
1.1. MOTIVATION
In recent years, we have witnessed a rapid growth of data on the web owing to the success
of Social and Semantic Web applications. This ever-growing data originates from a variety of
sources including information extractors, web crawled data and various kinds of user generated content,
involving unprecedented schema heterogeneity and high levels of data redundancy. As per the
latest recorded statistics, the DBpedia 2016-10 release [1] consists of 13 billion (2016-04: 11.5 billion)
pieces of information (RDF triples), out of which 1.7 billion (2016-04: 1.6 billion) were extracted
from the English edition of Wikipedia, 6.6 billion (2016-04: 6 billion) were extracted from other
language editions and 4.8 billion (2016-04: 4 billion) from Wikipedia Commons and Wikidata. In
addition, adding the large NIF datasets for each language edition increased the number of triples
further by over 9 billion, bringing the overall count up to 23 billion triples. Another semantic
knowledge base, YAGO [2], derived from Wikipedia, WordNet, WikiData, GeoNames and other data
sources, currently contains more than 10 million entities (like persons, organizations, cities, etc.)
with more than 120 million facts about these entities.
The problem is further accentuated as the data generated from heterogeneous sources suffers from severe
quality issues and is loaded with erroneous information containing false or missing values. Thus,
identifying different entity profiles, usually with different schemata, that map to the same real world
object becomes a prerequisite task for effective data integration and reuse in the world of the Semantic
Web. Inherently, the task of Entity Resolution is quadratic, O(N^2), in nature; given two large entity
collections E1 and E2, naively comparing every entity profile in E1 with every entity in E2
to link records would take a lifetime to process. Several approaches have been introduced to date
that aim to reduce these extensive comparisons in large data collections, typically by employing a
blocking stage. This entails placing similar entities in blocks and performing comparisons only
within a block and not outside it. However, we still face redundant and superfluous entity profile
comparisons while processing these blocks. The worst case complexity in this case is O(m^2 * |B|),
[1] https://wiki.dbpedia.org/develop/datasets/dbpedia-version-2016-10
[2] https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
with m being the maximal size of blocks and |B| being the total number of blocks.
Much of the existing research in the field of ER aims at generating blocks efficiently, reducing the time
and space complexity. Some works particularly focus on controlling the sizes of blocks, while others on
efficiently processing these blocks by pruning or purging the information in them. Our
work focuses on employing a generic approach that deals with the heterogeneity of data and maps
duplicate profiles in two entity collections, without investing time and effort in blocking the data. We
eliminate the blocking stage and focus on directly processing the entity collections through a minHash LSH
based approach to identify entity profiles that are described with different schemata in structured
or semi-structured format, but are actually the same in the real world. Our approach drastically increases
the performance and scalability of ER tasks over large datasets. Performing computations over
GBs of data on a standard machine with limited memory and processing speed is challenging. So,
we utilise Spark, a fast and open source distributed framework, with Scala as our chosen
programming language, making it well suited for our research work.
1.2. CHALLENGES
With the enormous, dynamically growing real world data, the task of Entity Resolution is of prime
importance for effective reuse of (semi-)structured data and for leveraging data integration. It however
poses various challenges due to the inherent characteristics of the available data:-
1. Dealing with the heterogeneity of data is difficult- As the data stems from multiple sources, there
is no specific standard schema defined. With a wide variety of attributes, loose schema bindings
and data being in various formats ranging from JSON objects to pure tag-style annotations, forming
different profiles for the same real world object, comparing such profiles becomes an extremely
challenging task.
2. Velocity and volume of data- The data is growing dynamically and in huge amounts. The need
has arisen to link entities across cross-domain distributed data sources effectively at large scale.
Thus, neither the naive quadratic approach of pairwise comparisons, nor the blocking strategy is
efficient for larger datasets.
3. Inconsistent and noisy values in data- Several typos and missing values exist in extracted data.
Also, there could be extraction errors or alternative descriptions present for the same entity at
different sources. Such deficiencies or false information increase the difficulty of our task.
4. Need for labelled data- When supervised-training based block processing strategies are involved
in the Entity Resolution task, additional time and effort is required to form training data for the
model to learn.
In these settings, it is difficult to perform the entity resolution task efficiently and effectively on GBs of
data. Thus, we propose a model that focuses on eliminating the discussed challenges while simul-
taneously introducing scalability in the ER task and reducing the traditional quadratic complexity.
1.3. OUTLINE
This thesis comprises six chapters. Chapter 2 presents a general overview of concepts like the
Semantic web, RDF data and Spark. Chapter 3 reviews the previous work done in
the area of Entity Resolution. Chapter 4 focuses on the various approaches/methodologies we suggest
to efficiently resolve the problem of Entity Resolution in Big data. Chapter 5 demonstrates
the validity of our approach through extensive evaluation results, compared and explained against the
existing state-of-the-art methods. Chapter 6 concludes the research work.
CHAPTER 2
TECHNICAL DETAILS
In this chapter, we give a brief overview of the topics that are closely connected with our thesis
work. It includes knowledge of the Semantic web, Linked Data, the Resource Description Framework (RDF)
and how Entity Resolution is a crucial task in the world of Big data. We also discuss the
Spark framework and the Scala language that we have used for our thesis implementation and
evaluation.
2.1. SEMANTIC WEB
The word Semantic indicates ”meaning”. Thus, the Semantic Web is the web of data with its associated
meaning. Over the years, there has been a rapid increase in data over the web. The web
penetrates society in several forms such as Social contacts through blogging or social networking
platforms, E-commerce (like buying, selling, advertising of commodities), Education, Recreation
and Work-life balance etc. With this tremendous growth, the web accompanies a transition from
industrial to an information society, providing infrastructure for new quality of information han-
dling with acquisition and provisioning. The current web is immensely successful, containing huge
amounts of information that is machine-processable and human-readable but since the web data is
heterogeneous in terms of content, structure and character-encoding, it needs automated reasoning
techniques for intelligent information integration. Humans can derive information from a given piece
of data, but the web can only deal with syntax. It needs the ability to integrate and fuse data from
a variety of sources in order to satisfy the human quest for information. The lack of comprehensive
background knowledge to interpret information found on the web can be dealt with when syntax
is combined with semantics for processing information intelligently. Thus, the Semantic Web deals
with the understanding of the information available on the internet.
The term ”Semantic Web” was coined by Tim Berners-Lee for a web of data where pages are
structured and tagged in such a way that they can be read directly by computers. As per [1], it is an
extension of the World Wide Web (WWW) through standards by the World Wide Web Consortium
(W3C). The standards promote common data formats and exchange protocols on the Web, most
[1] https://en.wikipedia.org/wiki/Semantic_Web
fundamentally the Resource Description Framework (RDF). According to the W3C, ”The Semantic
Web provides a common framework that allows data to be shared and reused across applications,
enterprises, and community boundaries”. The Semantic Web is therefore regarded as an integrator
across different content, information applications and systems. Figure 2.1 shows the Semantic Web
Stack, illustrating the architecture of the Semantic Web. It is also known as the ”Semantic Web Layer cake”
or ”Semantic Web Cake”, where each layer has a specific function in making web data machine
understandable. RDF [2] is a simple language for expressing a data model; it encodes structured
information and exchanges it on the web. The RDF Schema (RDFS) extends RDF and
is a vocabulary for semantically describing properties and classes of RDF-based resources. OWL is the
acronym for the Web Ontology Language. An ontology provides a frame of reference for the disambiguation
and the global interconnection of knowledge. OWL has its own explicit formal semantics, for example
OWL 2.0 with Manchester syntax. It adds more vocabulary for describing properties and classes
and the relationships between them (e.g. disjointness, cardinality and equality). SPARQL is the
general query language used for querying RDF databases.
[2] https://www.w3.org/TR/rdf-schema/
2.1.1. Linked Data
Linked Data refers to structured data that is connected to other data to make it more useful.
The Semantic web is all about linking data throughout the web and making it easily understandable
to both humans and machines, and Linked Data is the best practice to achieve this goal. Reaching
other datasets is possible through semantic queries. Linked Data [3] is a set of design principles for
sharing machine-readable interlinked data on the Web. Data is evolving to be more interlinked
and connected. Figure 2.2 shows the increasing connectedness in data: hypertext has links, blogs
have pingbacks, and tagging groups all related data. As the connectivity between data increases, the
web becomes more powerful and the fusion or integration of data becomes easier.
Linked Open Data is a powerful blend of Linked Data and Open Data: data that is both linked and
openly available, for example DBpedia. Figure 2.3 shows the Linked Open Data cloud as of October 2019. In
early 2006, Tim Berners-Lee laid down four basic principles to achieve the vision of the Semantic Web
and link data from a variety of sources. They are as follows:-
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up these names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
[3] https://ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/
4. Include links to other URIs so that they can discover more things.
2.1.2. RDF data
RDF stands for Resource Description Framework. It originates from the family of World Wide Web
Consortium (W3C) specifications. It is a common data model for modelling conceptual descriptions
of information available in web sources. Various RDF datasets are openly available, such as
DBpedia, YAGO, Wikidata and Freebase. Originally, the design of the RDF model was inspired by the
graphical representation of information, such that nodes or vertices denote the entities and
entities are connected by edges or lines labelled with their relationships. Information is represented
in RDF in the form of triples. These triples state facts or statements with a subject,
a predicate and an object, assigned as follows:-
1. A Subject is marked as a node containing a URI, or it is simply a blank node.
2. A Predicate or property should always be a URI.
3. An Object can be a URI, a blank node or a literal.
Figure 2.4 represents an example triple (subject, predicate and object).
A blank node is a node in an RDF graph representing a resource for which a URI or literal is not
given; it can only be used as the subject or the object of a triple.
The data in the RDF data model can be represented through a variety of syntax notations and serialization
formats. These are described below:-
1. Turtle:- A text format with emphasis on human readability.
2. N-Triples:- A text format with emphasis on simple parsing.
3. RDF/XML:- The official XML serialization of RDF.
4. JSON-LD:- The W3C recommendation (originally from 2014) for expressing RDF data in JSON
format.
5. RDFa:- A mechanism for embedding RDF in (X)HTML.
2.2. ENTITY RESOLUTION IN KNOWLEDGE GRAPHS
Entity Resolution is the task of identifying and de-duplicating entities across datasets, i.e. given
two different datasets, the task of finding the entities that point to the same real world object.
The primary step in Entity Resolution is determining what an entity is. An entity is a unique
object (a person, an organisation, a location, a building or an animal) with a set of attributes (e.g.
name, shape, size, habitat) describing its features. Each feature has a value associated with it.
A single entity can have multiple entries across different datasets. For example, Figure 2.5 represents
the city New Delhi in Dataset1 with its features (area code, previous name, country, type
and population density), while Figure 2.6 shows the exemplary representation of the city Delhi with
its features (area code, latitude, longitude, country, type, population density and its leaders) in
DBpedia. These are two representations of the same city with varied features, as they come from
different sources. It can also happen that the same information is presented through different attribute
names, like dbo:country and ex:locatedIn. The problem is further accentuated by ambiguities in
knowledge graphs: the data often contains typos or erroneous and missing values. Thus, the task of
disambiguating entities among different datasets to achieve a single consistent annotation becomes
a challenging prerequisite.
2.3. THE WORLD OF BIG DATA
Since the onset of the digital age, the amount of information that we generate and transfer has
increased dramatically. Devices like laptops, mobile phones, PCs, cameras
and automobiles are the prime sources flooding us with data. Ideas like Smart Cities, Homes and
Kitchens simplify our life with various sensing devices incorporated into our environment that
are creating pools and oceans of information each day. The data generated and aggregated from
different sources has varied attributes/fields and can be structured, unstructured or semi-structured.
Merging this data effectively and extracting knowledge from it is useful, but it is a challenge for traditional
machines with limited capacities to store, process and manage this voluminous data. So, with this
increasing connectivity between different devices and the possible interlinking of information generated
from heterogeneous sources, several agencies are developing new and powerful tools to analyze this
data.
Big data refers to large and complex datasets that cannot be processed by traditional data-
processing application software. Most organisations store and analyze this data to understand
growing customer needs and predict market trends for their benefit. Working on these datasets
effectively at acceptable time and cost is a challenge in itself. The various Big data technologies
currently on the market include Hadoop, Cassandra, Pig, Hive, Spark and Storm.
There are four main characteristics associated with the term ”Big data”, as shown in Figure
2.8:-
1. Volume:- From software logs to fast growing sensor event data recorded continuously, the
volume of data is increasing minute by minute. Currently, we have data in petabytes [7]. In future,
it will be in Exabytes [8] or even Yottabytes [9].
2. Variety:- As the data comes from heterogeneous sources, it can range from text, audio, video and
tweets to pure tag-style annotations. Thus, it is difficult to determine its structure.
3. Velocity:- The high speed of information generation and the constant transmission of the generated data
in real time determine the velocity.
4. Veracity:- The data is noisy. It suffers from quality issues like extraction errors, missing or erroneous
values and other inconsistencies that could affect the accuracy of our results.
2.3.1. Spark
Apache Spark [10][11] is a fast and general-purpose cluster computing system with support available
in Java, Scala, Python and R. It shines for big data as it is faster than other platforms like Hadoop
in terms of computation and resource utilisation, with better in-memory capabilities. Built on top of
Scala and running on a JVM, it can be used effectively for batch or real time data processing. It has a
rich set of higher-level tools that are independent elements stacked under a common hood, making it
a unified analytics engine for large scale data processing. Each element performs its specific task.
[7] https://en.wikipedia.org/wiki/Petabyte
[8] https://en.wikipedia.org/wiki/Exabyte
[9] https://en.wikipedia.org/wiki/Yottabyte
[10] https://spark.apache.org/docs/latest/
[11] https://en.wikipedia.org/wiki/Apache_Spark
Figure 2.9 shows the Apache Spark Stack. Its elements are described as follows:-
1. Spark Core:- It forms the foundation of the overall project and provides distributed task dis-
patching, scheduling, and basic I/O functionalities, exposed through an application programming
interface centered on the RDD abstraction.
2. Spark SQL:- It introduces the concept of data abstraction through dataframes providing sup-
port for structured and semi-structured data processing. It also provides SQL language support,
with command-line interfaces and ODBC/JDBC server.
3. MLlib:- The distributed machine learning framework on top of Spark Core that makes statistical
and predictive analysis easy for Spark users, providing support for regression, clustering and
dimensionality reduction techniques.
4. GraphX and Graphframes:- Distributed graph processing frameworks. The abstraction supported
by GraphX is the RDD, while Graphframes supports Dataframes.
5. Spark Streaming:- It uses fast scheduling capabilities of Spark Core to perform streaming
analytics. It ingests data in mini-batches and performs RDD transformations on them.
RDDs support two types of operations:-
1. Transformations:- These are operations applied on an RDD, such as filter or map, that create a new RDD.
Transformations are lazily evaluated, i.e. the data inside an RDD is not available or transformed
until an action is executed that triggers the execution.
2. Actions:- These are operations that trigger computation and return a value.
Table 2.1 shows a list of important RDD methods with their explanations:
flatmap: Return a new RDD by first applying a function to all elements of the RDD, and then flattening the results.
map: Return a new RDD by applying a function to all elements of the RDD.
foreach: Run a function on each element of the dataset.
filter: Return a new RDD containing only the elements satisfying the given condition(s).
union: Return the union of two RDDs (combine RDDs).
intersection: Return the intersection of two RDDs (common elements).
persist: Persist the RDD with the default storage level (memory only).
count: Return the total number of elements.
unpersist: Mark the RDD as non-persistent and remove it from memory and disk.
repartition: Shuffle the data across the network.
sparkContext: Used to create the RDD.
saveAsTextFile: Save the RDD as a text file.
reduceByKey: Run on input data in (key, value) form and reduce it by key.
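The following minimal Scala sketch (assuming a SparkContext named sc is already available, e.g. from spark-shell) illustrates the distinction between transformations and actions: filter and map are lazily evaluated, and only the action count triggers the actual computation.

  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
  val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing is computed yet
  val doubled = evens.map(_ * 2)             // transformation: still lazy
  val total   = doubled.count()              // action: triggers execution and returns 2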
1. Driver Program:- When we start Apache Spark in the Spark-shell, the Spark-shell becomes the
driver program or the entry point (e.g. the main function in C), creating the Spark context.
2. Spark Context:- The driver initiates the Spark Context. The Spark Context allows communication
with the other nodes in a cluster. These nodes are worker nodes.
3. Worker nodes:- When we run an application, it creates a separate process within a worker
node, so every Spark application runs in its own executors. A worker node can have many executors.
Within an executor, we have DAGs (i.e. the constructed graph broken down into stages and
tasks). These individual tasks run within the executor.
4. Cluster manager:- Cluster managers are required for performing parallel operations on various nodes.
Spark provides support for different cluster managers like Hadoop YARN, Apache Mesos etc.
2.3.2. Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) [15] is a distributed file system designed to run on
commodity hardware, employing a Namenode and Datanodes. It is used as the primary
storage system by Hadoop [16] applications. It is highly fault-tolerant [17] and can be easily deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
handling large data sets. It is written in Java and is supported on all major platforms. Figure 2.11
shows the HDFS architecture.
[15] http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
[16] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[17] https://searchdatamanagement.techtarget.com/definition/Hadoop-Distributed-File-System-HDFS
1. Hardware Failure:- HDFS operates as a distributed file system over a cluster of nodes, where the
failure of a node at any point in time is the norm rather than the exception. The detection of such faults and
the fast, automated recovery of the data on the affected nodes is a crucial task of HDFS.
2. Streaming Data Access:- HDFS is designed more for batch processing rather than interactive
real-time data processing. More emphasis is laid on high throughput of data access rather than
low latency.
3. Large Data Sets:- HDFS is used for storing gigabytes to terabytes of data. It should provide
high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support
tens of millions of files in a single instance.
2.4. SCALA LANGUAGE
Scala is a high-level programming language with the following key characteristics:
1. It supports object-oriented and functional programming paradigms in combination.
2. Highly influenced by Java, Scala programs compile to bytecode and run on the Java
Virtual Machine (JVM).
3. It also provides language interoperability with Java.
4. Besides this, Scala helps developers write concise code.
5. It automatically infers the type of a variable, so we do not need to mention it explicitly.
Just declare var/val and Scala is smart enough to deduce the data type on its own (see the short
example below).
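A small illustration of the type inference and val/var declarations mentioned in point 5 (a minimal sketch, not taken from the thesis code):

  val greeting = "Hello, SANSA"   // inferred as String; immutable
  var counter  = 0                // inferred as Int; mutable
  counter += 1
  println(s"$greeting, counter = $counter")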
CHAPTER 3
RELATED WORK
This chapter provides a comprehensive overview of the previous work done in the field of Entity
Resolution. Inherently, the task of entity resolution is quadratic in nature. To reduce the number of
pairwise comparisons while retaining the maximum number of true matches, the existing state of the art approaches
apply a two-step process: grouping similar entities into the same block and non-similar entities into
different blocks (the block building stage), and later comparing only the entities within a
block (block processing). Section 3.1 deals with the block building techniques applied in ER,
while Sections 3.2 and 3.3 provide an overview of the techniques employed to process those blocks.
3.1. BLOCK BUILDING TECHNIQUES IN ENTITY RESOLUTION
Consider datasets D1 and D2, each with 10,000 entries: an absolute pairwise comparison considering
all entities would require 100 million comparisons. If each comparison took exactly 1 second,
the total comparisons would take 100 million seconds, which is equivalent to roughly 1157 days, i.e. about 3.17
years. To reduce this extensive computation, the existing methodologies typically suggest applying
suitable blocking techniques.
The various techniques employed for block building in ER either use the available schema knowledge
or do not rely on it. Thus, they are categorised as schema-based or schema-agnostic block
building techniques. During the blocking stage, the blocks formed can be disjoint, such that an entity
appearing in one block does not appear in any other block, or non-disjoint, depending on the
datasets. Usually, with semantic data, the blocks formed contain overlapping entities, which leads
to additional comparisons during block processing.
3.1.1. Schema-based Blocking Techniques
As the name suggests, the available schema knowledge is used to build the blocks. This is generally
applied when we try to identify similar entities in homogeneous data sources like relational
data (e.g. identifying similar persons in two databases based on their first name, surname,
birth date, profession and address fields). Given two entity collections E1 and E2, each
input entity in these two collections is internally transformed into a compact representation comprising one
or more blocking keys. These blocking keys are simply selected attribute information. With an
a-priori known schema, various techniques are employed to form blocks based on the attribute name
and value information of an entity. These are described below:-
1. Standard Blocking:- The most prominently applied blocking technique based on schema
knowledge is the Standard Blocking technique. Consider two relational entity collections E1 and
E2 with n1 and n2 entity profiles respectively. Since the structure of these entity collections is
homogeneous, i.e. each entity profile has the same attribute names, we select the most appropriate
attribute name on the basis of noise and distinctiveness, such that the blocking is done based on the
similarity of attribute values for that particular attribute name. Example:- For a Person based dataset
(with attributes id, first name, last name, address and zipcode), all entity profiles from collections
E1 and E2 having the same zipcode go into the same block. Table 3.1 represents the example
database and Table 3.2 represents the corresponding blocks formed based on the zipcode attribute. Based
on this technique, the blocking is performed in the (1) publication. This technique is particularly
effective for disjoint blocks, but its effectiveness diminishes when it encounters
noisy or missing values in the data. A small sketch of this technique follows below.
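The following minimal Scala sketch illustrates the standard blocking idea on a small in-memory collection (the case class, field names and values are illustrative, not taken from Table 3.1):

  case class Person(id: String, firstName: String, lastName: String, zipcode: String)

  val persons = Seq(
    Person("r1", "John", "Doe", "54321"),
    Person("r2", "Jon", "Doe", "21435"),
    Person("r3", "Jane", "Roe", "54321"))

  // Standard blocking: group entity profiles by the blocking key (zipcode);
  // pairwise comparisons are later performed only within each block.
  val blocks: Map[String, Seq[Person]] = persons.groupBy(_.zipcode)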
2. Sorted Neighbourhood:- This technique is largely used because of its robust nature. An
appropriate blocking key and a window size (ws) are selected. The entities are sorted by the
selected blocking key in alphabetical order and a window of fixed size ws is slid over the sorted
list. At each iteration, the entities co-occurring in the window are compared. Consider the Person
database with zipcode as the selected attribute and let the window size (ws) be 2. The sorted list
is as shown in Table 3.3. It would thus undergo four iterations, comparing {(r2, r5), (r5, r4), (r4,
r1), (r1, r3)}. This technique is typically employed in the (2) publication. A small sketch of the
sliding window follows below.
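A minimal self-contained Scala sketch of the sliding-window idea (the record ids and zipcode values are hypothetical):

  // Hypothetical records as (id, zipcode) pairs.
  val records = Seq(("r1", "54321"), ("r2", "21435"), ("r3", "67890"), ("r4", "42867"), ("r5", "21436"))
  val ws = 2

  // Sort by the blocking key, then compare only records co-occurring in a window of size ws.
  val sorted = records.sortBy(_._2)
  val candidatePairs = sorted.sliding(ws).flatMap { w =>
    for (i <- w.indices; j <- i + 1 until w.size) yield (w(i)._1, w(j)._1)
  }.toList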
3. Canopy Clustering:- Another blocking scheme used in (2) when little knowledge of the schema is
available. It requires configuring parameters for string matching and two thresholds, t1 and
t2. The methodology employs redundancy for robustness at the cost of lower efficiency, but
deals with noisy data effectively. The method aims at creating overlapping clusters, called canopies.
Select a random record r* from the dataset M as the current cluster center and choose thresholds t1
and t2 (with t1 greater than t2); the distances of all other records from r* are then determined.
The records with distance from r* less than t2 are members of the cluster but cannot be treated
as centers. The records that are outside t2 but inside t1 are also cluster members and can additionally
be considered as cluster centers. The records outside t1 are not members of the cluster. The selection
of another random record from M is repeated while M is not empty. The technique is typically
employed in the (2) and (3) publications. This method is mostly used when we aim to process blocks
by supervised training of a model.
4. Q-grams based blocking:- In this technique, an appropriate blocking key is selected and
substring q-grams of length q are created for the values of that blocking key (the BKVs). For each of the
created BKVs, the matching entities are placed into its block based on the similarity of their values.
Consider the Person dataset in Table 3.1. Let zipcode be the selected blocking key and let q=2 for the
q-grams. The BKVs created are {54, 43, 32, 21, 35, 42, 28, 86, 67} and the blocks formed are as
shown in Table 3.4. The technique is robust to noisy attribute values, even for the BKVs, but can cause higher
computational cost for large and overlapping blocks. The technique is considered in the (3) publication.
[Table 3.4: one block per 2-gram key (54, 43, 32, 21, 35, 42, 28, 86, 67), each listing the records r1-r5 whose zipcode contains that 2-gram.]
3.1.2. Schema-agnostic Blocking Techniques
With the tremendously exploding data on the internet from a wide variety of sources, the schema knowledge
is often not readily available or not properly defined. The aggregated data varies from web crawled data
to pure tag-style annotations. Thus, to deal with this attribute heterogeneity, noisy or missing
values and loose schema bindings, there was a need for schema-agnostic blocking methods.
Attribute-agnostic blocking methods completely ignore all attribute names, but utilise all
attribute values. Several approaches developed in this attribute-agnostic blocking methodology are
discussed below:-
1. Token Blocking:- As the name suggests, token blocking is based on the idea of tokenising
attribute values. Given an entity profile, all the tokens are extracted from its attribute values, and a set
of tokens is created from all the entity profiles in the two entity collections E1 and E2. For each token, a
block is created containing all the entities that contain that token; each block should contain at
least two entities. Consider the Person dataset in Table 3.1. The resulting blocks obtained from
token blocking are as shown in Table 3.5. The blocks show redundancy, containing overlapping
profiles. The technique is typically employed in the (3) and (4) publications. A small sketch follows
below.
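A minimal Scala sketch of token blocking over hypothetical entity profiles, where each profile is reduced to the set of tokens in its attribute values and one block is created per token (keeping only blocks with at least two entities):

  // Hypothetical profiles: entity id -> attribute values.
  val profiles = Map(
    "e1" -> Seq("John Doe", "Berlin"),
    "e2" -> Seq("J. Doe", "Berlin Mitte"),
    "e3" -> Seq("Jane Roe", "Bonn"))

  // One block per token; an entity enters every block of a token it contains.
  val blocks = profiles.toSeq
    .flatMap { case (id, values) =>
      values.flatMap(_.toLowerCase.split("\\W+")).distinct.map(token => (token, id))
    }
    .groupBy(_._1)
    .mapValues(_.map(_._2))
    .filter(_._2.size >= 2)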
2. Attribute Clustering Blocking:- As the name suggests, the idea is to group similar
attribute names into clusters and apply token blocking inside those clusters. The focus is to
improve efficiency by reducing the size of the blocks. For calculating the similarity of attribute names
to form the clusters, Jaccard similarity or TF-IDF measures can be applied. Consider the entity
profiles in Figure 3.2: on applying Attribute Clustering blocking, the blocks formed are as shown
in Figure 3.3. The technique is typically employed in the (3) publication.
3. TYPiMatch:- In this technique, clustering is done on the basis of domain or type. We
cluster entities into overlapping types and then apply token blocking on the attribute values.
Consider the entity profiles in Figure 3.2: on applying the TYPiMatch technique, the two clusters formed
are Organisation (Entity 1, Entity 2) and Person (Entity 3, Entity 4) respectively. The blocks formed
by tokenising the attributes in the clusters are shown in Table 3.6. The technique is also referred to
in the (3) publication.
4. URI Semantics Blocking:- As the name suggests, blocking is done based on the similarity of
URI semantics. Used in the (5) publication; an example with the dataset and the blocks formed is shown
in Figure 3.4. It can be done in the following three ways:-
• Infix Blocking - all entities whose URIs share a specific Infix are placed in a block.
• Infix Profile Blocking - every block corresponds to a specific Infix (of an attribute value) and
contains all entities having it in their Infix Profile.
• Literal Profile Blocking - every block corresponds to a specific token and contains all entities
having it in their Literal Profile.
3.2. THE ROLE OF META-BLOCKING
Some authors suggest applying a Meta-Blocking stage after block building to improve the effectiveness
of the block processing stage. The Meta-Blocking technique aims at reducing the repeated
comparisons of entities caused by overlapping blocks. Thus, meta-blocking reforms our block collection
into a new collection which is equally effective but contains fewer comparisons, i.e. it
reduces the number of redundant and superfluous comparisons in the block collection. It builds on
the intuition that the more blocks two entities share, the more similar they are and thus the higher the
likelihood that they are a matching record. It typically follows a process of building a graph from the block
collection, weighting the edges of the graph with an appropriate similarity measure and then pruning the
resulting graph to reform the block collection. This technique is applied in the (4), (6) and (7) publications.
3.3. BLOCK PROCESSING TECHNIQUES IN ENTITY RESOLUTION
The blocks formed by the above techniques and refined by the meta-blocking technique now have
to be processed, such that the maximum number of true matches is retained in fewer comparisons.
Various techniques have been employed in previous work to accomplish this task efficiently.
3.3.1. Controlling Block Sizes for effective comparison
The work in (8) focuses on controlling the sizes of block collections using clustering approaches. It builds
on the fact that effective block construction, done based on attribute name similarity or by
limiting the size of the blocks, leads to a decreased complexity while processing the blocks.
The work in (9) suggests a robust way to deal with the high redundancy challenges of overlapping
blocks. Since the same entity profile appears in multiple blocks, this inevitably leads to an increased
number of comparisons required for de-duplication. The blocks to be processed are therefore scheduled
through a block utility measure based on probabilistic analysis. Once the blocks are scheduled, purging
is performed, which removes the entity blocks that were built from a too common attribute value, i.e.
token. This cleans the over-sized blocks and improves efficiency. The sets of blocks are then
compared, and a hash table is maintained to keep track of the entity matches; this information
is propagated further, since identified matches are not reconsidered in the subsequently processed
blocks. Lastly, block purging is applied to remove the blocks with low utility values and to preempt
the resolution if the cost of further comparisons becomes too expensive. The comparison of exact
entity matches is performed using the generic R-Swoosh algorithm suggested in (10).
The works in (4) and (6) employ various block filtering techniques during the initial phase of block building and
meta-blocking as preprocessing strategies to achieve effectiveness. This includes various techniques
to prune the graphs constructed in the meta-blocking phase, like Weighted Edge Pruning (WEP), where the
threshold is set to the average weight across all the edges, and Cardinality Edge Pruning (CEP),
which sets the threshold by cardinality instead of weight over the edges. Other pruning
methodologies are applied on nodes instead of edges. These include Weighted Node Pruning (WNP),
where a threshold is set for each node as the average weight of its adjacent edges, and Cardinality
Node Pruning (CNP), which sets the threshold based on the cardinality of the nodes instead of the
weight. On applying these techniques, effective load balancing is done to assure maximum potential.
The maximum number of effective true matches is retrieved while simultaneously eliminating repeated comparisons,
discarding superfluous comparisons and avoiding non-matching comparisons in the collected
blocks. To achieve this, (5) suggests composite blocking techniques, while (3) introduces a four
step approach, enhancing the block building stage with partitioning over categories and redundancy
bearing methods. Once the blocks are built, both publications suggest initiating block
purging, i.e. removing the blocks with low utility values and propagating the information about
detected matches to the blocks that are yet to be processed. The work in (3) suggests scheduling the
blocks by a merge-purge approach and later pruning the blocks to avoid non-match comparisons.
The approaches discussed so far are evaluated in terms of Pair Completeness (PC) and Reduction
Ratio (RR), which correspond to Recall and Precision respectively.
CHAPTER 4
APPROACH
4.1. PROBLEM DESCRIPTION
The largely booming heterogeneous data on the web contains duplicate entities that need to be identified
for data integration and de-duplication. Given two different knowledge bases, the task of identifying
entities that point to the same real world object with the typical brute-force approach, comparing all
entities, implies O(N^2) complexity. The existing strategies discussed above are also time consuming
because they undergo the two step process of block building and then processing those blocks
through learning or non-learning based approaches. These approaches become volume intensive
and hardware reliant when it comes to processing very large databases. With our proposed
approaches, we focus on removing the block building stage and the effort incurred in labelling
training data for learning based approaches in the task of entity resolution. In this chapter, we
suggest Locality Sensitive Hashing based approaches for entity resolution that do not need multiple
iterations of a training phase (as required for the learning based approaches) and can handle large
heterogeneous knowledge bases.
4.2. LOCALITY SENSITIVE HASHING
Locality Sensitive Hashing (LSH) [1] is a modern technique that finds applications in near-duplicate
detection, genome-wide association studies and audio/video fingerprinting for large databases. It
refers to a family of functions (known as LSH families) that hash data points into buckets. It works
on the intuition that data points close to each other are hashed into the same bucket with high
probability, while data points located far apart are placed into different buckets. This helps
in disambiguating manifestations of the same real world entity in different records or mentions. An
LSH family (11) is formally defined as follows:-
In a metric space (M, d), where M is a set and d is a distance function on M, an LSH family is a
family of functions h that satisfy the following properties:-
[1] https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing
∀ p, q ∈ M:
d(p, q) ≤ r1 ⇒ Pr[h(p) = h(q)] ≥ p1
d(p, q) ≥ r2 ⇒ Pr[h(p) = h(q)] ≤ p2
4.2.1. HashingTF vs CountVectorizer
Both HashingTF and CountVectorizer [2] are methods to vectorise tokenised text (i.e. generate
term frequency vectors). HashingTF takes sets of terms or tokenised text as input and converts
them into fixed-length feature vectors. It utilises the hashing trick, typically using MurmurHash3
as its hashing function, and transforms a raw feature by mapping it to an index (term). The
term frequencies are then calculated based on the mapped indices. Hashing suffers from potential
collisions, where different terms are hashed into the same bucket. This can be reduced by increasing
the number of buckets in the hash table, which in our case is the target feature dimension.
CountVectorizer is a similar transformer, but unlike HashingTF, which parses the data once and then
forms the feature vectors directly, CountVectorizer scans the data twice: initially to build
the CountVectorizer model and then to transform the tokenised text into feature vectors. This incurs the
cost of additional storage space for the unique features and also more time consumption, but results in
a reversible process, since we can convert the feature vectors back to text with the CountVectorizer
model, which is not possible with the typical HashingTF model. A short sketch contrasting the two
is given below.
[2] https://spark.apache.org/docs/2.2.0/ml-features.html#feature-extractors
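The following minimal Spark ML sketch in Scala contrasts the two vectorisers (the SparkSession setup, column names and example tokens are illustrative):

  import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("VectoriserSketch").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq((0, Seq("new", "delhi", "india")), (1, Seq("delhi", "india"))).toDF("id", "tokens")

  // HashingTF: a single pass, fixed dimensionality, not reversible (hash collisions are possible).
  val hashed = new HashingTF().setInputCol("tokens").setOutputCol("features")
    .setNumFeatures(1 << 10).transform(df)

  // CountVectorizer: two passes (fit builds the vocabulary, transform vectorises), reversible via the vocabulary.
  val cvModel = new CountVectorizer().setInputCol("tokens").setOutputCol("features").fit(df)
  val counted = cvModel.transform(df)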
4.2.2. MinHash for Jaccard Distance
MinHash is an LSH family for the Jaccard distance, where the input features are sets of natural
numbers. The Jaccard distance of two sets is defined through the cardinality of their intersection and union.
Given two sets A and B, their Jaccard distance is:
d(A, B) = 1 − |A ∩ B| / |A ∪ B|
For example, for A = {1, 2, 3} and B = {2, 3, 4}, d(A, B) = 1 − 2/4 = 0.5.
MinHash [3] applies a random hash function g to each element in the set and takes the minimum of
all computed hash values:
h(A) = min{ g(a) : a ∈ A }
The input sets for MinHash are represented as binary vectors, where the vector indices represent
the elements themselves and the non-zero values in the vector represent the presence of those elements
in the set. While both dense and sparse vectors are supported, sparse vectors are typically
recommended for efficiency. Since empty sets cannot be transformed by MinHash, any
input vector must have at least one non-zero entry. The Spark MinHashLSH function takes
setNumHashTables as a parameter, determining the number of hash tables to be used on the input features
for computing the hash values. This parameter value is adjusted by the user as per requirement: a
higher value leads to better accuracy when performing the approximate similarity join between the two sets,
while a lower value leads to time-saving solutions. This is a trade-off between the effectiveness and the
efficiency of the approach.
[3] https://spark.apache.org/docs/2.2.0/ml-features.html#minhash-for-jaccard-distance
4.2.3. Approx Similarity Join
Approx Similarity Join [4] takes two datasets as input, computes the distances between the entities in the
two datasets and returns the pairs of rows whose distance is smaller than a user defined
threshold (t). It supports self-joins as well as joins between two different datasets. The returned
output also contains a distance column indicating the distance between each pair of entities
returned as a row. Another LSH operation, approximate nearest neighbor search, is also available in
Spark, but we do not use it in our approach. A minimal usage sketch is given below.
[4] https://spark.apache.org/docs/2.2.0/ml-features.html#approximate-similarity-join
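A minimal Spark ML sketch of approxSimilarityJoin on two tiny DataFrames of sparse binary feature vectors (the ids, vectors, column names and the 0.6 threshold are illustrative):

  import org.apache.spark.ml.feature.MinHashLSH
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("ApproxJoinSketch").master("local[*]").getOrCreate()
  import spark.implicits._

  val dfA = spark.createDataFrame(Seq(
    (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
    (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))))).toDF("id", "features")
  val dfB = spark.createDataFrame(Seq(
    (2, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (3, 1.0)))),
    (3, Vectors.sparse(6, Seq((4, 1.0), (5, 1.0)))))).toDF("id", "features")

  val model = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("hashes").fit(dfA)

  // Returns only the row pairs whose Jaccard distance is below the threshold, plus a distance column.
  val joined = model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  joined.select($"datasetA.id", $"datasetB.id", $"JaccardDistance").show()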
4.3. APPROACHES FOR ENTITY RESOLUTION
As discussed above, our approaches focus on reducing the effort and time consumption induced
by the block building stage (in the previously discussed approaches) for entity comparison, i.e. we
compare the entities without categorising them into blocks and without the multiple training iterations
involved in various learning based approaches. Our approaches also eliminate
the need for labelled data to train a model.
4.3.1. Approach 1 - minHash LSH method (considering all attributes)
In this approach, we utilise all the information available about the entity mentions in the
two databases, without any filtering.
Definition:- Let the entity collections in the two databases be E1 and E2. An entity mention e1 in a
collection is recorded as a tuple ⟨s, Ap⟩, where Ap is the set of attributes, namely the
subject, attribute names and attribute values of the specific entity mention e1, and s is the subject
of the profile. The entity collections E1 and E2 contain several entity mentions such that there are
no duplicates within an entity collection, i.e. no two entity mentions within E1 point
to the same real world object.
For each entity collection and every single entity mention, we vectorise its attributes using
the HashingTF or CountVectorizer technique and then create hashes of the respective feature
vectors using the Spark MinHashLSH model with 3 different random hash functions, i.e. keeping the
setNumHashTables parameter at 3. We then find the similarity between the respective transformed hashes
of the entity mentions of the two entity collections with the Spark approxSimilarityJoin
method at an appropriate threshold (t). This gives us the entities that are marked similar between
the two entity collections E1 and E2 respectively. Two entities are marked similar only if the
distance between them is less than the threshold (t). A condensed sketch of this pipeline is given below.
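The following Scala sketch condenses this pipeline; the entity data, column names and the threshold t = 0.5 are illustrative and not the thesis configuration:

  import org.apache.spark.ml.feature.{HashingTF, MinHashLSH}
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("Approach1Sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Each entity mention: (subject, tokens of all its attribute names and values).
  val e1 = Seq(("ex:New_Delhi", Seq("areacode", "011", "country", "india", "capital")),
               ("ex:Mumbai",    Seq("areacode", "022", "country", "india"))).toDF("subject", "attributes")
  val e2 = Seq(("dbr:Delhi",    Seq("areacode", "011", "country", "india", "capital")),
               ("dbr:Chennai",  Seq("areacode", "044", "country", "india"))).toDF("subject", "attributes")

  val tf  = new HashingTF().setInputCol("attributes").setOutputCol("features")
  val lsh = new MinHashLSH().setNumHashTables(3).setInputCol("features").setOutputCol("hashes")

  val fA = tf.transform(e1)
  val fB = tf.transform(e2)
  val model = lsh.fit(fA)

  // Entities are marked similar only if their Jaccard distance is below the threshold t.
  val t = 0.5
  val matches = model.approxSimilarityJoin(fA, fB, t, "distance")
  matches.select($"datasetA.subject", $"datasetB.subject", $"distance").show(false)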
Knowledge graphs are typically incomplete, and there are many inconsistencies between the two
entity collections to be compared. This missing and noisy attribute information becomes a hindrance:
it results in a large distance between two entity mentions, beyond the threshold used for
comparison, when we look for similarities between the entity mentions of the two entity collections.
Thus, two entity mentions may be marked as non-similar even though they are otherwise similar in the
real world.
4.3.2. Approach 2 - minHash LSH method (1 or 2 attributes)
In this approach, we focus on using one or two main attributes of the entity mentions, instead of
utilising all the available knowledge. The inspiration for selecting the one or two attributes is taken
from (2).
Definition:- Let the entity collections in the two databases be E1 and E2. An entity mention e1 in a
collection is recorded as a tuple ⟨s, Ap⟩, where Ap is the set of attributes, namely the
subject, only the selected one or two attribute names and their respective attribute values for the specific
entity mention e1, and s is the subject of the profile. The entity collections E1 and E2 contain
several entity mentions such that there are no duplicates within an entity collection.
For each entity collection and every single entity mention, we vectorise its attributes using
the HashingTF or CountVectorizer technique and then create hashes of the respective feature
vectors using the Spark MinHashLSH model, keeping the setNumHashTables parameter at 3. We then find the
similarity between the respective transformed hashes of the entity mentions of the two
entity collections with the Spark approxSimilarityJoin method at an appropriate threshold (t). The
entities marked similar between the two entity collections are thus computed.
In this approach, selecting specific attribute information (1 or 2 attributes) amongst the heterogeneous,
loosely schema-bound information available from the two sources forms an important
aspect of the comparison. Although this approach uses only selected attribute information about the
entity mentions, it marks most of the similar entities correctly. The only hindrance is the low utilisation
of the available knowledge in the knowledge graphs. Thus, maximum utilisation of the available
knowledge from the two sources becomes the inspiration for our next approach.
4.3.3. Approach 3 - minHash LSH subjects and Jaccard similarity attributes
For a broader comparison of the entity mentions in two knowledge graphs that takes the information
of various attributes into account, we devise a 3-step process for entity resolution. The idea behind
this approach is to select the common attributes of two entity mentions that are paired based on the
similarity of their subjects.
4.3.3.1. Step 1 - minHash LSH on subjects
Definition:- Let the entity collections in the two databases be E1 and E2. An entity mention e1 in
a collection is recorded as a tuple ⟨s, Ts⟩, where Ts is the set formed by tokenising
the subject s of the specific entity mention e1, and s is the subject of the profile. The entity
collections E1 and E2 contain several entity mentions such that there are no duplicates within an
entity collection.
For each entity collection and every single entity mention, we vectorise its subject tokens
using the HashingTF or CountVectorizer technique and then create hashes of the respective feature
vectors using the Spark MinHashLSH model, keeping the setNumHashTables parameter at 3. We then find the
similarity between the respective transformed hashes of the entity mentions of the two
entity collections with the Spark approxSimilarityJoin method at an appropriate threshold (t). The
entities marked similar between the two entity collections are thus computed.
4.3.3.2. Step 2 - Compare the attribute names of the matched entities
Since our focus is on maximum utilisation of the data, we compare the attribute names of only those
two entities that are marked similar based on the similarity of their subjects. We extract the
attribute names (i.e. predicates in RDF data) of the matched entities and compare their
respective predicate sets against a Jaccard similarity threshold (Jp), Jp being the Jaccard similarity
threshold for attribute names.
4.3.3.3. Step 3 - Compare the attribute values for intersecting attribute names in
matched entities
Figure 4.9: Compare the attribute values for intersecting attribute names
In the final step, we focus on reducing the false positives. We extract the attribute value information (i.e.
objects in RDF data) only for the intersecting attribute names of the two entity mentions marked
similar and fitted on Jp. We compare the objects of only the intersecting predicates against a Jaccard
similarity threshold on objects, namely Jo. This matches the similar entities in the two knowledge sources.
Figure 4.10: Compare the attribute values for intersecting attribute names code snippet
Approach-3 identifies the entities in these two dataframes that are the same in the real world.
Utilising the attribute level knowledge of each paired entity is a key task for identifying
the exact matches and justifying the validity of our pairs (which are paired based on the similarity
of their subjects). The predicate level knowledge of each paired entity is held in the
Entity1 predicates and Entity2 predicates columns respectively, as shown in Figure 4.13. We now count a
paired match as valid only if the Jaccard similarity of the predicates of the paired entities surpasses
the user's threshold (Jp, discussed above). A small sketch of this Jaccard filtering over predicates and
objects is given below.
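A small Scala sketch of the Jaccard filtering performed in steps 2 and 3 for one hypothetical candidate pair already matched on its subjects (the predicate/object sets and the thresholds Jp and Jo are illustrative):

  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0 else (a intersect b).size.toDouble / (a union b).size

  // Predicate (attribute name) sets of the two candidate entities.
  val predsA = Set("dbo:areaCode", "dbo:country", "dbo:populationDensity", "rdfs:label")
  val predsB = Set("dbo:areaCode", "dbo:country", "rdfs:label", "dbo:leaderName")

  val Jp = 0.5   // threshold on attribute-name similarity
  val Jo = 0.5   // threshold on attribute-value similarity

  val predicatePass = jaccard(predsA, predsB) >= Jp
  val commonPredicates = predsA intersect predsB

  // Objects (attribute values) restricted to the intersecting predicates.
  val objsA = Set("011", "India", "New Delhi")
  val objsB = Set("011", "India", "Delhi")

  val isMatch = predicatePass && jaccard(objsA, objsB) >= Jo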
Figure 4.14: Matched entities with intersecting attribute names filtered with Jaccard similarity on
attribute names
Since Approach-1 suffers from issues with the inconsistencies between the two knowledge bases (as discussed
in section 4.3.1.1), we perform our detailed analysis with Approach-2 and Approach-3.
4.4. OUR DESIGN
4.4.1. Storage Unit
Storage forms an important part of the design and implementation of every system. We use the
Hadoop file system to store our input data and the computed output files. Our intermediate
results are computed and stored in Spark RDDs and dataframes, owing to their distributed,
fast, in-memory computation.
4.4.2. Input and Output for Approach-2 minHash LSH method (1 or 2 attribute)
The input for Approach-2 is in CSV format. Three dataset pairs are used, namely DBLP-ACM, DBLP-Scholar and Abt-Buy; they are available at (12). Each dataset pair comes with its true matches stored in a CSV ground-truth file, which we use to evaluate the accuracy of our methodology. The detailed description of each dataset is given in section 5.2. The CSV data for each comparison is loaded into Spark DataFrames through the CSV data source support 5 for Apache Spark. The entity profiles to be compared are formed from the respective DataFrames in the format discussed in section 4.3.2. On applying the Approach-2 methodology, the true matches between the two DataFrames are identified with the approximate similarity join and stored in a Spark RDD together with their respective distances. Our predicted results are verified against the available ground truth to evaluate and ensure the validity of our approach, and are finally written to the Hadoop file system.
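As a hedged illustration of this input/output flow (the file paths are placeholders and predictedMatches stands in for the result of the Approach-2 pipeline), the loading and writing might look as follows:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("EntityResolution-Approach2")
  .getOrCreate()

// Load the two entity collections of a dataset pair (e.g. DBLP-ACM) from CSV with headers.
val dfDblp = spark.read.option("header", "true").csv("hdfs:///data/DBLP2.csv")
val dfAcm  = spark.read.option("header", "true").csv("hdfs:///data/ACM.csv")

// ... build the entity profiles and run the minHash LSH approximate similarity join ...

// Write the predicted matches back to HDFS for comparison with the ground-truth CSV.
// predictedMatches.rdd.saveAsTextFile("hdfs:///output/predicted_matches")
```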
4.4.3. Storage and computation for Approach-3 minHash LSH subjects and Jaccard
similarity attributes
The input data for Approach-3 is in RDF format. The N-Triples are loaded into Spark RDDs and only distinct triples are taken into consideration. We work with the DBpedia 3.0rc and 3.4 datasets. Triples with the predicates owl:sameAs, wikiPageID, wikiPageRevisionID, wikiPageRevisionLink, wikiPageUsesTemplate, wikiPageHistoryLink, wikiPageExternalLink, wikiPageEditLink, wikiPageExtracted, wikiPageLength, wikiPageModified, wikiPageOutDegree and wikiPageRedirects are not used. For rdf:label triples, we only use labelled text whose literal language is "en". The creation of the ground truth for the compared datasets is discussed in section 5.2, where we elaborate the details of each dataset. The entity profiles are created from the loaded triples as discussed in section 4.3.3.1 and stored in Spark RDDs. The Spark framework partitions the data so that it can be processed in a distributed manner.
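A simplified sketch of this loading and filtering step is shown below. The naive whitespace split of each N-Triples line and the input path are assumptions; in practice a dedicated RDF loader (for instance the SANSA RDF layer) would be used instead of this string handling.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EntityResolution-Approach3").getOrCreate()

// Predicates that carry no descriptive information and are therefore dropped.
val excludedPredicates = Set("wikiPageID", "wikiPageRevisionID", "wikiPageRedirects",
  "wikiPageExternalLink", "wikiPageUsesTemplate", "wikiPageOutDegree", "sameAs")

val triples = spark.sparkContext.textFile("hdfs:///data/infobox_3.0rc.nt")
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#"))
  .map { line =>
    val parts = line.split("\\s+", 3)          // naive <s> <p> <o> split
    (parts(0), parts(1), parts(2).stripSuffix(" ."))
  }
  .filter { case (_, p, _) => !excludedPredicates.exists(p.contains) }
  .distinct()
```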
The entity subject data is tokenised and stored in the Entity sub tokenised column of DataFrames x and y respectively, as shown in section 4.3.3.4. The tokenised sets in this column are vectorised through the HashingTF or CountVectorizer model. Hash values are computed on the generated feature vectors using the minHash LSH method, and the approximate similarity join is applied to identify matches. The paired entities are then deduced and joined with their attribute-level information using the initial RDD. The attribute names of each paired match are compared, and intersecting predicates are extracted only for the paired matches that surpass the user threshold Jp for the Jaccard similarity of their attribute names. The paired matches filtered in this way have their intersecting predicates and the respective attribute values identified. We compare the Jaccard similarity of the attribute values of these paired matches against the user threshold Jo to identify the final matches as our marked similar entities. The marked similar entities are further evaluated against our created ground truth and stored in the Hadoop file system. The source code is available on the GitHub page 6.
5 https://github.com/databricks/spark-csv
6 https://github.com/amrit1210/LSH-based-Entity-Resolution
CHAPTER 5
EVALUATION
In this chapter, we discuss the evaluation of the two successful approaches, namely Approach-2 (minHash LSH method (1 or 2 attribute)) and Approach-3 (minHash LSH subjects and Jaccard similarity attributes). We discuss:
1. The various datasets used in evaluating each of the two approaches.
2. The parameters used for configuration.
3. The comparison of our results with existing state-of-the-art results, based on the recorded precision, recall and F-measure values.
4. The execution time involved in our approaches with the HashingTF and CountVectorizer models.
To test the performance of our scalable entity resolution model, based on Spark and Scala, we deploy it on a Spark cluster. We use Spark standalone mode with Spark version 2.2.1 and Scala version 2.11.11. A small private cluster of the Smart Data Analytics (SDA) group consisting of 4 servers is used. These servers have a total of 256 cores; each server has Intel Xeon CPUs at 2.3 GHz, 256 GB of RAM and 400 GB of disk space, runs Ubuntu 16.04.3 LTS (Xenial) and is connected via a Gigabit Ethernet network. Each Spark executor is assigned 250 GB of memory.
5.2. DATASETS
For Approach-2, we consider three different dataset pairs, namely DBLP-ACM, DBLP-Scholar and Abt-Buy. These datasets are used in the publication (2), and the link to them is given in footnote 1. We compare our results with their results for the 1- and 2-attribute approaches. DBLP-ACM and DBLP-Scholar are bibliographic dataset pairs, while Abt-Buy is from the e-commerce domain. The number of entities and attributes involved in each dataset is described in table 5.1. The comparisons in these three dataset pairs pose different levels of difficulty, depending on their characteristics.
1 https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
1. The DBLP-ACM entity matching task is expected to be of low difficulty because it deals with publication entities from two well-structured bibliographic data sources (the Digital Bibliography & Library Project [DBLP] and the ACM Digital Library) that are at least partially manually curated. The selected DBLP and ACM entities cover the same computer science conferences and journals.
2. The second dataset pair, DBLP-Scholar, matches DBLP publications with publications from the entity search engine Google Scholar (Scholar). Scholar automatically extracts its publication entities from full-text documents crawled from the Web. This data has many quality problems, particularly duplicate publications, heterogeneous representations of author lists or venue names, misspellings, and extraction errors.
3. The e-commerce task (Abt-Buy) deals with sets of related product entities from the online retailers Abt.com and Buy.com.
For Approach-3, we use DBpedia datasets. Other datasets such as YAGO, Freebase or Wikidata could also be used, but currently we consider only DBpedia. Medium and large sized datasets are taken to evaluate the scalability of our code. We compare different versions of DBpedia: for the medium sized comparison, we compare the two infobox versions Infobox 3.0rc (containing 9,91,933 entities) and Infobox 3.4 (containing 19,40,631 entities). Infobox 3.0rc is 2.7 GB in size, while Infobox 3.4 is 5.8 GB. The ground truth for calculating precision, recall and F-measure is constructed by considering as matches the entities with exactly the same URL. The inspiration for the dataset and ground-truth construction is taken from (9). For the large sized comparison, we compare the full DBpedia 3.0rc dataset (containing 47,81,249 entities) with Infobox 3.4 (containing 19,40,631 entities). DBpedia 3.0rc is 19 GB in size, while Infobox 3.4 is 5.8 GB as aforementioned. The ground truth constructed for the medium sized comparison is also used here for evaluation.
The adjustment of configuration parameters is of utmost importance for achieving the desired results from any algorithm. In our setting, the following three configuration parameters are used (a small sketch collecting them is given after the list).
1. Similarity threshold for ApproxSimilarityJoin
As discussed in section 4, the ApproxSimilarityJoin returns two entities with the calculated distance between them if this distance is less than the similarity threshold set by the user. The threshold therefore plays an important role in determining similar entities. Both the threshold and the calculated distance between two entities are real numbers between zero and one. The smaller the threshold set by the user, the more closely related the entities returned by the ApproxSimilarityJoin. In our task, we set different values of the similarity threshold for different datasets.
2. Jaccard Similarity threshold for Predicates (Jp)
This parameter is set only for Approach-3. As discussed in section 4, for the entity pairs retrieved via minHash LSH subject similarity, we check the Jaccard similarity of their predicates and retain only those pairs whose similarity exceeds the user-determined parameter Jp. This parameter is also a real number between zero and one. The higher the Jaccard similarity of the predicates (Jp) in our model, the closer the matched pair of entities. In our task, we set Jp to 0.15 for both the medium and large sized DBpedia datasets. Although we attempt to resolve two different versions of the same dataset (DBpedia), this is not an easy task: for the medium sized comparison, 9,48,698 entities were newly added between Infobox 3.0rc and Infobox 3.4, and only about 5 percent of the attribute-value pairs are common between the two versions.
3. Jaccard Similarity threshold for Objects (Jo)
This parameter is also set only for Approach-3. As discussed in section 4, for the entity pairs that passed the Jaccard similarity of predicates (Jp) with their intersecting predicates, we calculate the Jaccard similarity of their attribute values, i.e. the objects, and retain as similar those entities whose calculated similarity exceeds the user-defined Jaccard similarity of objects (Jo). This parameter and the calculated Jaccard similarities of the intersecting predicates are both real numbers between zero and one. In our task, we set this parameter to 0.25 for both the medium and large sized datasets.
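To summarise, the three thresholds can be collected in a small configuration object. This is only an illustrative sketch with assumed names; the values shown are the ones used for the DBpedia comparisons in this thesis, while other datasets use different ApproxSimilarityJoin thresholds.

```scala
// Illustrative configuration holder for the three thresholds (names are assumptions).
case class ERConfig(
  joinThreshold: Double,      // distance threshold for approxSimilarityJoin (step 1)
  predicateThreshold: Double, // Jaccard similarity threshold Jp on predicate names (step 2)
  objectThreshold: Double     // Jaccard similarity threshold Jo on attribute values (step 3)
)

// Values used for the medium and large sized DBpedia comparisons.
val dbpediaConfig = ERConfig(joinThreshold = 0.10, predicateThreshold = 0.15, objectThreshold = 0.25)
```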
Figure 5.1 represents the confusion matrix, which helps in understanding the precision and recall equations described in the following sections.
5.4.1. Precision
Precision is a measure of the percentage of our returned results that are relevant. 2 It is defined as the number of true positives divided by the number of true positives plus the number of false positives. True positives are data points classified as positive by the model that actually are positive (i.e. they are correct), and false positives are cases the model incorrectly labels as positive but that are actually negative.
Precision = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
2 https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c
5.4.2. Recall
Recall is a measure of the percentage of the total relevant results correctly identified by our algorithm. It is defined as the number of true positives divided by the number of true positives plus the number of false negatives. False negatives are data points the model identifies as negative that actually are positive (i.e. they are incorrect). Recall can be thought of as a model's ability to find all the data points of interest in a dataset.
Recall = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
5.4.3. F-measure
The F-measure is the harmonic mean of precision and recall, taking both metrics into account in the following equation. The harmonic mean is used instead of a simple average because it punishes extreme values.
F\text{-measure} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
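As a small hypothetical worked example (the counts below are illustrative and not taken from our experiments): with 90 true positives, 10 false positives and 30 false negatives, precision = 90/100 = 0.90, recall = 90/120 = 0.75 and F-measure = 2 · 0.90 · 0.75 / (0.90 + 0.75) ≈ 0.82.

```scala
// Hypothetical confusion-matrix counts, used only to illustrate the formulas above.
val (tp, fp, fn) = (90.0, 10.0, 30.0)
val precision = tp / (tp + fp)                                // 0.90
val recall    = tp / (tp + fn)                                // 0.75
val fMeasure  = 2 * precision * recall / (precision + recall) // ≈ 0.818
```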
The prime focus of this section is on examining the implementation results of Approach-2 and Approach-3. We compare and validate the accuracy of our approaches against the state-of-the-art methods. Both approaches are implemented with the HashingTF and CountVectorizer models, so we also compare the execution time of each strategy on the different datasets. All experiments were run in Spark client mode with the cluster configuration described in table 5.3.
The DBLP-ACM dataset pair contains 2,616 and 2,294 entities respectively, with the attributes title, authors, venue and year. For the 1-attribute approach we take the attribute title, and for the 2-attribute approach we consider the attributes title and authors, as in the state-of-the-art approaches. A similarity threshold of 0.25 is used for the 1-attribute comparison and 0.28 for the 2-attribute comparison during ApproxSimilarityJoin. The precision, recall and F-measure are shown in table 5.4 for the 1-attribute and 2-attribute comparisons respectively. Although the state-of-the-art methods attain slightly higher precision, recall and F-measure values for the 1-attribute comparison, our methodology still achieves comparable results in less time, with no effort spent on labelling data for training a model and no need for multiple iterations. Our methodology directly processes the input DataFrames and returns similar entities efficiently.
The DBLP-Scholar dataset pair contains 2,616 and 64,263 entities respectively, with the attributes title, authors, venue and year. Since it is again a bibliographic dataset pair, for the 1-attribute approach we take the attribute title, and for the 2-attribute approach we consider the attributes title and authors, as in the state-of-the-art approaches. A similarity threshold of 0.15 is used for the 1-attribute comparison and 0.42 for the 2-attribute comparison during ApproxSimilarityJoin. The precision, recall and F-measure are shown in table 5.5 for the 1-attribute and 2-attribute comparisons respectively. Here, we attain precision, recall and F-measure values similar to the state-of-the-art methods. This attests to the validity and accuracy of our approach and supports abandoning the blocking stage and directly processing the entities for comparison.
The Abt-Buy dataset pair is crawled from e-commerce websites and contains 1,081 and 1,092 entities respectively, with the attributes name, description, manufacturer and price. For the 1-attribute approach we take the attribute name, and for the 2-attribute approach we consider the attributes name and description, as in the state-of-the-art approaches. This dataset contains many inconsistencies, noisy values and missing values. A similarity threshold of 0.5 is used for the 1-attribute comparison and 0.685 for the 2-attribute comparison during ApproxSimilarityJoin. The precision, recall and F-measure are shown in table 5.6 for the 1-attribute and 2-attribute comparisons respectively. Although the state-of-the-art methods attain slightly better F-measures for both the 1-attribute and 2-attribute approaches, we argue that our model is preferable because it does not suffer from large variations between precision and recall. This shows that the LSH model is an effective and efficient way to remove the block-building and training overheads, with small compromises in accuracy depending on the dataset.
5.5.2. Execution Times for Approach-2 datasets with HashingTF and CountVector-
izer models
Execution time is a crucial parameter for analysing the effectiveness and efficiency of our algorithm. We have implemented and evaluated our approach with both the HashingTF and CountVectorizer models.
As the pros and cons of both models were discussed in section 4.2, here we can see their effect on the execution times for our different datasets. For Approach-2, we consider the three small sized datasets discussed above. Figure 5.1 shows the execution time (in seconds) for the 1- and 2-attribute approaches with both models. The DBLP-Scholar pair takes more time than the other two because the Scholar dataset has roughly 24 times more entities to process than the DBLP dataset. As expected, 2-attribute processing takes slightly more time than 1-attribute processing, owing to the small increase in the amount of data to be processed; this is most evident in the DBLP-Scholar execution-time difference between the 1- and 2-attribute runs. Thus, as the number of entities and the amount of data to be processed increase, the execution time also increases. The HashingTF model executes much faster than the CountVectorizer model because it directly transforms the tokenised text into feature vectors and does not need to scan the data twice, unlike CountVectorizer, which scans once to build the model and again to transform the data into feature vectors. Provided that collisions into the same bucket remain rare, HashingTF is both fast and accurate.
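The trade-off between the two vectorisation models can be seen in the following sketch (the column names are assumptions carried over from the earlier examples): HashingTF needs a single pass over the data, while CountVectorizer must first be fitted and then applied.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}
import org.apache.spark.sql.DataFrame

// Minimal sketch contrasting the two vectorisation options (assumed column names).
def vectorise(df: DataFrame, useHashingTF: Boolean): DataFrame =
  if (useHashingTF) {
    // Single pass: tokens are hashed straight into a fixed-size feature vector.
    new HashingTF()
      .setInputCol("entity_sub_tokenised")
      .setOutputCol("features")
      .transform(df)
  } else {
    // Two passes: one scan to build the vocabulary (fit), a second to transform.
    new CountVectorizer()
      .setInputCol("entity_sub_tokenised")
      .setOutputCol("features")
      .fit(df)
      .transform(df)
  }
```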
As discussed in detail in section 5.2, for the medium sized comparison we compare the Infobox 3.0rc dataset (containing approx 9 million entities) against the Infobox 3.4 dataset (containing approx 19 million entities), and for the large sized comparison we consider the entire DBpedia 3.0rc dataset (containing approx 47 million entities) and the Infobox 3.4 dataset. The same ground truth is used for evaluating both. The recorded precision, recall and F-measure values are shown in table 5.7. To attain these results, the configuration parameters are set as follows:
1. A similarity threshold of 0.10 for ApproxSimilarityJoin is used in the initial phase to determine the subject similarity between the entities in the two DataFrames. The lower the threshold set by the user, the more similar the subjects of the returned pairs.
2. A Jaccard Similarity threshold (Jp) of 0.15 is used on the predicates of the entity pairs formed on the basis of subject similarity. Although we attain good precision, recall and F-measure scores at this level, we go one step further with the goal of reducing the false positives in our results. This would vary with different datasets.
3. A Jaccard Similarity threshold (Jo) of 0.25 is used for the similarities among the objects of the intersecting predicates of the entities matched in phase 2 of our algorithm. We do not observe large variations in precision, recall and F-measure between levels 2 and 3 of our algorithm (we do not have many false positives), but this depends entirely on the dataset.
5.5.4. Execution Times for Approach-3 datasets with HashingTF and CountVector-
izer models
We have implemented and evaluated our approach for both the medium and large sized DBpedia comparisons with the HashingTF and CountVectorizer models respectively. In our algorithm, most of the time is consumed by the approxSimilarityJoin in step 1. Figure 5.2 shows the execution time (in minutes) for both comparisons at levels 2 and 3 of our approach, i.e. with the Jaccard Similarity threshold (Jp) applied to the predicates of the subject-similarity-based pairs and with the Jaccard Similarity threshold (Jo) applied to the objects for the final true matches. Clearly, the execution time varies with the size of the dataset and with the model used for forming the feature vectors from the text. The HashingTF model on the large sized comparison executes in comparably less time than the CountVectorizer model on the medium sized comparison, and the CountVectorizer model takes twice as much time as the HashingTF model on the large sized comparison. Thus, the HashingTF-based model is the fastest for our algorithm. For the larger comparison, evaluating step 3 of our algorithm, i.e. filtering with the Jaccard similarity of objects, takes considerably more time. It therefore depends on the user's choice and on the dataset whether to proceed to step 3 after step 2. With prior knowledge of the dataset and of the precision, recall and F-measure values after step 2, the user can skip step 3, accepting a small number of false positives while avoiding the additional time consumption.
CHAPTER 6
CONCLUSION
In our work, we have developed an effective, efficient and scalable approach for entity resolution. We have implemented and experimented with two different approaches, namely Approach-2 and Approach-3 discussed in section 4, on different datasets, and conclude that the HashingTF model performs faster than the CountVectorizer model in all cases and thus works best for our approaches. Our approaches clearly reach the goals stated earlier, dealing with the heterogeneity of the data and handling structured as well as unstructured data. By finding intersecting predicates in Approach-3 and comparing the similarities of their attribute values, we address the problem of missing and noisy information to a great extent.
With little (i.e. 5-7%, depending on the dataset) or no compromise in accuracy, our approach performs entity resolution in linear time, competing with the existing approaches. Our methodology removes the block-building stage of entity resolution entirely, eliminating the time incurred in this step. We also avoid the effort of labelling data for supervised training methods and the time involved in multiple training iterations. Hence, the task of entity resolution in RDF data can be simplified and leveraged with our suggested approach.
Our thesis work will be added to the Semantic Analytics Stack (SANSA), which combines distributed computing frameworks (specifically Spark and Flink) with the semantic technology stack 1, as shown in figure 6.1. A layer named "Scalable Entity Resolution" will be added to the SANSA stack. This will help identify duplicate entities in RDF data, simplifying the task of data integration.
The current thesis work can be extended further. Datasets with IDs as subjects instead of entity names in the URL are difficult to compare with our approach; in future work, this can be addressed by taking the rdf:labels of such datasets into consideration. We currently perform our work on DBpedia datasets, and it can be extended to other RDF datasets such as Freebase, YAGO and Wikidata. A major challenge in evaluating the scalability of our approaches is that there are no ground truths available for large datasets; the creation of ground truths for large sized datasets is therefore another major task for enhancing the current work. We can also improve the evaluation of our approach by analysing entity resolution across different DBpedia language datasets. It would also be interesting to run the current work on other distributed frameworks such as Flink, which automatically manages operations like data partitioning and caching that must otherwise be set manually in Spark. A comparative analysis of our work on the Spark and Flink frameworks would provide better insights into our algorithm.
REFERENCES
[1] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution,
and Duplicate Detection. Springer, 2012.
[2] H. Köpcke, A. Thor, and E. Rahm, Learning-Based Approaches for Matching Web Data En-
tities. IEEE Internet Computing, July/August 2010.
[5] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederée, and W. Nejdl, Beyond 100 Million
Entities: Large-scale Blocking-based Resolution for Heterogeneous Data. WSDM, 2012.
[7] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl, Meta-Blocking: Taking Entity Reso-
lution to the Next Level. IEEE TKDE, 2014.
[8] J. Fisher, P. Christen, Q. Wang, and E. Rahm, A clustering-based framework to control block sizes for Entity Resolution. Proceedings of the 21st ACM SIGKDD Conference (KDD), 2015.
[9] G. Papadakis, E. Ioannou, C. Niederée, and P. Fankhauser, Efficient entity resolution for large
heterogeneous information spaces. WSDM, 2011.
[10] O. Benjelloun et al., Swoosh: A generic approach to Entity Resolution. VLDB Journal 18(1), 2009.
[11] Uber Engineering, Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering. https://eng.uber.com/lsh/