Schema-agnostic Blocking for Streaming Data
Tiago Brasileiro Araújo, Federal University of Campina Grande, Campina Grande, Brazil, [email protected]
Kostas Stefanidis, Tampere University, Tampere, Finland, [email protected]
Carlos Eduardo Santos Pires, Federal University of Campina Grande, Campina Grande, Brazil, [email protected]
Jyrki Nummenmaa, Tampere University, Tampere, Finland, [email protected]
Thiago Pereira da Nóbrega, State University of Paraíba, Campina Grande, Brazil, [email protected]
ABSTRACT
Currently, a large number of information systems continuously produce data. Since these sources may have overlapping knowledge, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Considering the quadratic cost of the ER task, blocking techniques are often used to improve efficiency. Such techniques face two main challenges related to data volume (i.e., large data sources) and variety (i.e., heterogeneous data). Beyond these, blocking techniques also face two further challenges: streaming data and incremental processing. To address these four challenges simultaneously, we propose PI-Block, a novel incremental schema-agnostic blocking technique that utilizes parallelism (through a distributed computational infrastructure) to enhance blocking efficiency. In our experimental evaluation, we use four real-world data source pairs and show that PI-Block achieves better results regarding efficiency and effectiveness compared to the state-of-the-art technique.
CCS CONCEPTS
• Information systems → Entity resolution; Semi-structured data;
KEYWORDS
Entity resolution, streaming data, heterogeneous data, metablocking, incremental processing.
ACM Reference Format:
Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires,
Jyrki Nummenmaa, and Thiago Pereira da Nóbrega. 2020. Schema-agnostic
Blocking for Streaming Data. In The 35th ACM/SIGAPP Symposium on Applied Computing (SAC ’20), March 30-April 3, 2020, Brno, Czech Republic.
ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3341105.3375776
1 INTRODUCTION
With the growing use of information systems, large amounts of data are produced continuously. The data provided by these different applications may have overlapping knowledge. For instance, a sensor network may continuously generate large volumes of similar data. In this context, Entity Resolution (ER) emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities [4]. ER is a very common task in the data processing and data integration areas, where different entity profiles, usually described under different schemas, are mapped to the same real-world object [6]. Formally, ER identifies records (the entity profiles) from several data sources (the entity collections) that refer to the same real-world entity. In the context of heterogeneous data, ER faces two well-known data quality challenges: volume, as it handles a growing number of entities; and variety, since different formats and schemas are used to represent the entity profiles [4]. To deal with volume, blocking and parallel computing are applied [2, 7]. Blocking groups similar entities into blocks and performs comparisons within each block. Thus, blocking avoids the huge number of comparisons that would be performed if the comparisons were guided by the Cartesian product. Parallel ER aims to minimize the overall execution time of the task by distributing the computational cost (i.e., the comparisons between entities) among the resources of a computational infrastructure [1].
The challenge of variety is related to the difficulty of performing the blocking task: heterogeneous data hardly share the same schema, which compromises block generation. Therefore, traditional (i.e., schema-based) blocking techniques (e.g., sorted neighborhood and adaptive window) do not present satisfactory effectiveness [13]. In turn, the variety challenge is addressed by schema-agnostic blocking techniques [4]. Among them, Metablocking emerges as the most promising approach [13]: blocks form a weighted graph and pruning criteria are applied to remove edges with weight below a threshold, aiming to discard comparisons between entities with few chances of being considered a match.
Furthermore, two additional challenges faced by the ER task stand out: streaming data and incremental processing [3, 5, 10, 14]. Streaming data is commonly related to dynamic data sources (e.g., Web systems, social media, sensors), whose data is sent continuously. Therefore, we assume that not all data, from all data sources, are available at once. For this reason, we have to process (i.e., match) the entities as they appear, also considering the incremental behavior (in other words, entities already matched previously). Regarding incremental ER, new challenges need to be considered, such as i) how to manage the entities already processed, since the stream can be infinite, and ii) how to efficiently execute the ER task considering the whole stream of entities [3, 8].
Notice that these challenges are strengthened when the ER task involves heterogeneous data, streaming data, and incremental processing simultaneously. To this end, we propose the Parallel-based Incremental Blocking (PI-Block) technique, a promising schema-agnostic blocking technique capable of incrementally processing entity profiles. To our knowledge, there is a lack of blocking techniques addressing all the challenges that emerge in this scenario. In this sense, we propose a time window strategy that discards old entities based on a time threshold. Thus, the PI-Block technique aims to deal with streaming and incremental data efficiently using a distributed computational infrastructure. Overall, the contributions of our work are the following: i) we propose a novel incremental and parallel-based schema-agnostic blocking technique, called PI-Block, that deals with streaming and incremental data; ii) we introduce a parallel-based workflow to perform entity blocking: PI-Block reduces the number of parallel steps of Metablocking, achieving efficiency gains without decreasing effectiveness; iii) we present strategies to avoid unnecessary entity comparisons during the blocking step; and iv) we propose a time window strategy to discard entities already processed based on the time they were sent, avoiding problems related to excessive memory consumption. PI-Block is evaluated against the state-of-the-art technique, regarding efficiency and effectiveness, using four pairs of real data sources.
2 PROBLEM DESCRIPTION
PI-Block receives as input data provided by two data sources D1 and D2. Since data is sent incrementally, the ER task processing is divided into a set of increments I = {i1, i2, i3, ..., i|I|}, such that |I| is the number of increments. Moreover, consider that each increment is associated with a predetermined time interval τ. Thus, the time between consecutive increments should be T(i|I|) − T(i|I|−1) = τ. For each increment i ∈ I, each data source D sends an entity collection Ei = {e1, e2, e3, ..., e|Ei|}, such that |Ei| is the number of entities. Since the entities can follow different schemas, each entity e ∈ Ei has a specific attribute set and a value associated with each attribute, denoted by Ae = {⟨a1, v1⟩, ⟨a2, v2⟩, ⟨a3, v3⟩, ..., ⟨a|Ae|, v|Ae|⟩}, such that |Ae| is the number of attributes associated with e. Moreover, in order to generate the entity blocks, tokens are extracted from the attribute values. Namely, all tokens associated with an entity e are grouped into a set Λe, i.e., Λe = ∪ {Γ(v) | ⟨a, v⟩ ∈ Ae}, such that Γ(v) is a function that extracts the tokens from the attribute value v.

To generate the entity blocks, a similarity graph G(X, L) is created, in which each e ∈ Ei is mapped to a vertex x ∈ X and a non-directional edge l ∈ L is added. Each l is represented by a triple ⟨x1, x2, ρ⟩, such that x1 and x2 are vertices of G and ρ is the similarity value between the vertices. Thus, the similarity value between two vertices (i.e., entities) x1 and x2 is denoted by ρ = Φ(x1, x2), and is given by the overlap of common tokens between x1 and x2: Φ(x1, x2) = |Λx1 ∩ Λx2| / max(|Λx1|, |Λx2|).
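For concreteness, the similarity function Φ can be written in a few lines of Python. This is an illustrative sketch of the formula above, not the authors' implementation:

```python
def phi(lam_x1: set, lam_x2: set) -> float:
    """Φ(x1, x2): shared tokens normalized by the larger token set."""
    if not lam_x1 or not lam_x2:
        return 0.0
    return len(lam_x1 & lam_x2) / max(len(lam_x1), len(lam_x2))

# Two entities sharing two tokens, the larger set having three:
print(phi({"steve", "jobs", "american"}, {"steve", "jobs"}))  # 0.666...
```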
In ER, a blocking technique aims to group the vertices of G into a set of blocks denoted by BG = {b1, b2, ..., b|BG|}. However, a pruning criterion Θ(G) is applied to remove redundant comparisons, resulting in a pruned graph G′. The vertices of G′ are grouped into blocks BG′ = {b′1, b′2, ..., b′|BG′|}, s.t. ∀b′ ∈ BG′ (b′ = {x1, x2, ..., x|b′|}), ∀⟨x1, x2⟩ ∈ b′ : ∃⟨x1, x2, ρ⟩ ∈ L and ρ ≥ θ, where θ is a threshold defined by the pruning criterion Θ(G). Intuitively, each data increment, denoted by ΔEi, also affects G. Thus, we denote the increments over G by ΔGi. Let {ΔG1, ΔG2, ..., ΔG|I|} be a set of |I| data increments on G. Each ΔGi is directly associated with an entity collection Ei, which represents the entities in increment i ∈ I. The computation of BG, for each ΔGi, is performed on a parallel distributed computing infrastructure, composed of multiple nodes (e.g., computers or virtual machines). In this context, N = {n1, n2, ..., n|N|} is the set of nodes used to compute BG. The execution time using a single node n ∈ N is denoted by T^n_ΔGi(BG), while the time using the whole computing infrastructure N is denoted by T^N_ΔGi(BG).
Since blocking is performed in parallel over the infrastructure N, the whole execution time is given by the execution time of the node that demands the most time to execute the task for a specific increment ΔGi: T^N_ΔGi(BG) = max(T^n_ΔGi(BG)), n ∈ N. Assuming now the streaming behavior, where an increment arrives at each τ time interval, it is necessary to impose a restriction on the execution time to process each increment, given by T^N_ΔGi(BG) ≤ τ. This restriction aims to prevent the blocking execution time from exceeding the time interval between data increments. To satisfy this restriction, blocking must be performed as quickly as possible. As stated previously, one possible solution to minimize the execution time of the blocking step is to execute it in parallel over a distributed infrastructure.
3 STREAMING METABLOCKING
The state-of-the-art blocking techniques do not work properly in scenarios involving incremental and streaming data since they were not conceived to deal with these situations. In this sense, we developed three blocking techniques capable of dealing with streaming and incremental data: Streaming Metablocking, PI-Block, and PI-Block-windowed. The Streaming Metablocking technique is based on the same state-of-the-art parallel workflow proposed in [6]. However, Streaming Metablocking was adapted to take into account the challenges involving streaming and incremental data. Since the technique considers the incremental behavior, we need to update the blocks in order to consider the new entities that arrive. Using a brute-force strategy, after the arrival of a new increment, Streaming Metablocking would need to rearrange all blocks, including blocks that did not suffer any update. Clearly, this strategy is costly in terms of efficiency since it repeats a huge number of comparisons that have already been performed. To avoid such unnecessary comparisons, Streaming Metablocking applies a store structure provided by Spark Streaming that considers only the data being updated in the current increment. Therefore, Streaming Metablocking only considers the blocks that suffered at least one update in the current increment. The reduction in the number of comparisons helps to minimize the computational cost of generating the blocking graph and performing the pruning step since the technique evaluates a smaller number of entity pairs.
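To illustrate the kind of update-oriented store this relies on, the following is a minimal sketch using the DStream API of Spark 2.x (PySpark). It keeps per-key state with updateStateByKey and flags the blocks touched by the current micro-batch; the record format, the socket source (instead of Kafka), and all names are our own assumptions for a self-contained example, not the paper's code:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingMetablockingSketch")
ssc = StreamingContext(sc, batchDuration=10)  # one micro-batch per increment
ssc.checkpoint("/tmp/blocking-checkpoint")    # required for stateful ops

def update_block(new_entities, state):
    """Merge newly arrived entities into the stored block for this key."""
    entities = (state[0] if state else []) + list(new_entities)
    return (entities, bool(new_entities))     # flag: touched in this batch

# Records of the form "token<TAB>entity_id" (hypothetical format).
records = ssc.socketTextStream("localhost", 9999)
keyed = records.map(lambda rec: tuple(rec.split("\t", 1)))  # (token, entity)

blocks = keyed.updateStateByKey(update_block)
updated = blocks.filter(lambda kv: kv[1][1])  # only blocks updated now
updated.pprint()                              # candidates for re-comparison

ssc.start()
ssc.awaitTermination()
```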
4 PI-BLOCK TECHNIQUE
Metablocking was not originally conceived to deal with incremental and streaming scenarios. This fact motivated us to propose the PI-Block technique, a schema-agnostic blocking technique able to deal with the challenges related to these scenarios efficiently. Unlike the previous parallel-based Metablocking approaches [6], PI-Block uses a different workflow in order to reduce the number of MapReduce jobs and, consequently, minimize the execution time of the task as a whole. The PI-Block workflow is composed of two MapReduce jobs (see Figure 1), two fewer jobs than the Parallel Metablocking proposed in [6]. Moreover, it is important to highlight that we improved the workflow proposed in [6] so that the reduction in the number of MapReduce jobs does not negatively impact the effectiveness results. In other words, the novel workflow does not modify the generated blocks. Figure 1 will be used throughout this section to illustrate the PI-Block workflow, which is divided into three steps: token extraction, blocking generation, and pruning.
Token Extraction Step. In this step, tokens are extracted from the data. Each token will be used as a blocking key. Initially, for each increment, blocking receives a pair of entity collections ED1 and ED2 provided by D1 and D2. The tokens are extracted from the attribute values of each entity. For each entity e from ED1 and ED2, all tokens Λe associated with e are extracted and stored. This set of tokens Λe will be used in the following step to determine the similarity between the entities. Each token in Λe will be used as a blocking key and included in the map of blocks B following the format ⟨t, ⟨e, Λe, D⟩⟩, such that t is the blocking key, e represents the entity, Λe the set of tokens (i.e., blocking keys) and D the data source that provided e.
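The token extraction step can be sketched as follows. This is a simplified, self-contained Python illustration of the ⟨t, ⟨e, Λe, D⟩⟩ format described above; the regex tokenizer and the input layout are our assumptions, not the paper's exact code:

```python
import re
from collections import defaultdict

def tokens(value):
    """Γ(v): extract lowercase alphanumeric tokens from an attribute value."""
    return set(re.findall(r"\w+", value.lower()))

def extract(entities, source):
    """Map each token (blocking key) to the <e, Λe, D> triples sharing it."""
    block_map = defaultdict(list)
    for eid, attributes in entities.items():
        lam_e = set()                     # Λe: all tokens of entity e
        for _attr, value in attributes:
            lam_e |= tokens(value)
        for t in lam_e:                   # each token is a blocking key
            block_map[t].append((eid, lam_e, source))
    return block_map

d1 = {"e1": [("name", "Steve Jobs"), ("nationality", "American")]}
print(sorted(extract(d1, "D1")))          # ['american', 'jobs', 'steve']
```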
In Figure 1, there are two sets of data that represent the entities provided by two distinct increments. For the first increment (top of the figure), D1 provides entities e1 and e2, while D2 provides entities e3 and e4. In the token extraction step, the tokens A, B and C are extracted from entity e1. For example, in a real-world scenario, entity e1 can be represented by e1 = {⟨name, Steve Jobs⟩, ⟨nationality, American⟩}. Thus, the tokens A, B and C represent the attribute values "Steve", "Jobs" and "American", respectively. From the extracted tokens, all entities sharing the same token are grouped in the same block. Thus, each token is used as a blocking key. For instance, block b1 is related to token A and contains entities e1 and e4 since both entities share token A. Moreover, in this step, the entities are arranged in the format ⟨e, B⟩ such that e represents a specific entity and B denotes the set of blocks that contain entity e.
Blocking Generation Step. In this step, the weighted graph is generated to define the level of similarity between entities. Initially, the blocks B generated in the previous step are received as input. For each blocking key k in B, the entities stored in the same block are compared. Thus, the entities provided by different data sources are compared to define the similarity ρ between them. The similarity is defined based on the number of co-occurring blocks (i.e., shared blocking keys) between the entities. After defining the similarity between the entities, the entity pairs are inserted into the graph G, such that the similarity ρ represents the weight of the edge that links the entity pair. The blocks generated in this step are stored in memory to keep them available for the next increments. In this sense, new entity blocks will be included or merged with the entity blocks previously stored. The blocking generation step is the most costly (in terms of computational cost) in the workflow since the comparisons between the entities are performed in this step.
For instance, in Figure 1, block b1 contains entities e1 and e4. Therefore, these entities must be compared to determine the similarity between them. Their similarity is 1 since they co-occur in all blocks in which each one is contained. On the other hand, in the second increment (bottom of the figure), block b1 receives entities e5 and e7. Thus, in the second increment, block b1 contains entities e1, e4, e5 and e7 since all of them share token A. For this reason, entities e1, e4, e5 and e7 must be compared¹ with each other to determine the similarity between them. However, it is important to note that entities e1 and e4 were already compared in the first increment and, consequently, they must not be compared twice. This would be considered an unnecessary comparison.

¹ Following the restriction that only entities from different data sources should be compared.
There are three types of unnecessary comparisons at the blocking generation step: i) since the blocks overlap, an entity pair can be compared in several blocks (i.e., more than once) unnecessarily; ii) during the incremental process, some blocks may not be updated; entities contained in blocks that did not suffer any update must not be compared again since this would unnecessarily demand time and memory; iii) updated blocks also contain entity pairs that have already been compared in previous increments, and these entity pairs should not be compared again in future increments.
To avoid unnecessary comparisons, three strategies are applied. For type i), the Marked Common Block Index (MaCoBI) [1, 6] condition is used. For each entity pair ⟨ei, ej⟩, the blocks (i.e., blocking keys) associated with the entities that have an identifier lower than the identifier of the block in question are added to a set of marked blocks (MB). The MaCoBI condition is satisfied when there are no block identifiers in common between the entities ei and ej, i.e., MBi ∩ MBj = ∅. For type ii), only new blocks and blocks that suffered an update are considered. To this end, an update-oriented structure is applied to store the blocks previously generated, ensuring that only blocks that have been updated are taken into account. For type iii), entities previously compared (in previous increments) are marked. Thus, an entity pair must only be compared if at least one of the entities is not marked as already compared. This strategy prevents entity pairs already compared in previous increments from being compared again.
Figure 1: PI-Block workflow.

To better understand how PI-Block avoids the three types of unnecessary comparisons, consider the second increment (bottom of Figure 1). For type i), the pair ⟨e1, e4⟩ would be compared in blocks b1, b2 and b3. To avoid these unnecessary comparisons, the pair ⟨e1, e4⟩ is not compared in blocks b2 and b3 since it does not satisfy the MaCoBI condition. For type ii), only the updated block b1 and the new blocks b6, b7 and b8 are considered. For type iii), in block b1, e1 and e4 are marked (with the symbol *) since they were already compared in the first increment. Thus, during the comparisons of the entities contained in block b1, the pair ⟨e1, e4⟩ will not be compared again.
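The following sketch shows how the three filters can be combined when enumerating the pairs of a single block. The "compare a pair only in its lowest-id common block" rule is a simplified stand-in for the MaCoBI condition of [1, 6], and all names are illustrative assumptions:

```python
def pairs_to_compare(block_id, block, entity_blocks):
    """Yield the pairs of one block that still require comparison.

    block:         list of (entity_id, source, marked) triples
    entity_blocks: entity_id -> set of ids of the blocks containing it
    """
    for i, (e1, src1, marked1) in enumerate(block):
        for e2, src2, marked2 in block[i + 1:]:
            if src1 == src2:
                continue  # only entities from different sources are compared
            if marked1 and marked2:
                continue  # type iii: pair handled in a previous increment
            if min(entity_blocks[e1] & entity_blocks[e2]) != block_id:
                continue  # type i: pair belongs to its lowest common block
            yield e1, e2

# Type ii is handled upstream: only new or updated blocks reach this function.
entity_blocks = {"e1": {1, 2, 3}, "e4": {1, 2, 3}, "e5": {1}}
b1 = [("e1", "D1", True), ("e4", "D2", True), ("e5", "D2", False)]
print(list(pairs_to_compare(1, b1, entity_blocks)))  # [('e1', 'e5')]
```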
Pruning Step. After the comparisons between entities, a pruning criterion is applied to discard entity pairs with low similarity values, generating the set of high-quality blocks B′. Regarding the pruning criterion, [6, 13] propose different pruning algorithms that can be applied in this step. Particularly, in this work, we apply the WNP-based pruning algorithm [13] since it has achieved better results than its competitors [6, 13]. WNP is a vertex-centric pruning algorithm with a local weight threshold given by the average edge weight of each neighborhood. Thus, for each vertex in G, the WNP algorithm calculates the average of the edge weights of its neighborhood and applies it as the local pruning threshold. Therefore, the neighborhood entities whose edge weight is greater than the local threshold are inserted in B′. The other entities (i.e., those whose edge weight is lower than the local threshold) are discarded.
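A compact sketch of this vertex-centric pruning follows. It is a simplified reading of WNP as described above (an edge survives if it reaches the local threshold of either endpoint), not a faithful reimplementation of [13]:

```python
from collections import defaultdict

def wnp(edges):
    """edges: list of (x1, x2, weight). Returns the retained entity pairs."""
    neighborhood = defaultdict(list)
    for x1, x2, w in edges:
        neighborhood[x1].append(w)
        neighborhood[x2].append(w)
    # local threshold: average edge weight of each vertex's neighborhood
    threshold = {x: sum(ws) / len(ws) for x, ws in neighborhood.items()}
    return {(x1, x2) for x1, x2, w in edges
            if w >= threshold[x1] or w >= threshold[x2]}

# e1-e3 (weight 0.2) falls below both local averages and is pruned:
print(wnp([("e1", "e4", 1.0), ("e2", "e3", 0.8), ("e1", "e3", 0.2)]))
```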
Furthermore, we highlight that only the entity pairs compared in the previous step are considered, implying a significant saving in the computational processing of the pruning step. For example, in the first increment (top of Figure 1), the pairs ⟨e1, e4⟩ and ⟨e2, e3⟩ should be compared in the ER task since they are considered promising pairs (i.e., with high chances of being considered similar). In the second increment (bottom of Figure 1), only the pair ⟨e5, e7⟩ is considered a promising pair (i.e., should be compared in the ER task).
5 PI-BLOCK-WINDOWED TECHNIQUE
Considering the incremental challenges, the PI-Block technique faces limitations related to resource consumption (e.g., memory). Since PI-Block stores the previously generated blocks in order to block the entities incrementally, memory consumption may grow without bound as the increments are processed. This behavior directly results in memory-intensive consumption or memory overflow problems. The time window strategy is applied during the blocking generation step since, in this step, the generated blocks are stored in a data structure to be used during the next increments. The proposed strategy applies a time window to keep the entities in the data structure for a certain time interval, preventing excessive memory consumption. However, it is worth mentioning that the application of a time window may negatively affect the effectiveness results since this strategy discards entities that exceed the window time interval. Thus, similar entities cannot be compared if they are not sent within the same time interval.
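A minimal sketch of such a windowed block store is shown below: entities carry the index of the increment that delivered them, and anything older than α increments is evicted. The structure and names are our own illustrative assumptions:

```python
from collections import defaultdict

class WindowedBlockStore:
    def __init__(self, alpha):
        self.alpha = alpha               # window size, in increments (α)
        self.blocks = defaultdict(dict)  # token -> {entity: increment index}

    def add(self, token, entity, increment):
        self.blocks[token][entity] = increment

    def evict(self, current):
        """Drop entities outside the window; drop blocks that become empty."""
        for token in list(self.blocks):
            kept = {e: i for e, i in self.blocks[token].items()
                    if current - i < self.alpha}
            if kept:
                self.blocks[token] = kept
            else:
                del self.blocks[token]   # e.g., block b5 in Figure 2

store = WindowedBlockStore(alpha=2)      # window of two increments
store.add("A", "e1", 1)                  # arrives at T1
store.add("A", "e4", 2)                  # arrives at T2
store.evict(3)                           # at T3, e1 exceeds the window
print(dict(store.blocks))                # {'A': {'e4': 2}}
```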
Figure 2: The time window strategy applied to PI-Block.

Considering some entities used in the example depicted in Figure 1, we describe the PI-Block-windowed technique by means of Figure 2. In this example, three increments are sent at three different times (i.e., T1, T2 and T3). Moreover, the size of the time window (i.e., the time threshold) is given by the time interval of two increments. In the first increment (i.e., T1), PI-Block receives entities e1 and e3. After the blocking generation, blocks b1, b2, b3, b4 and b5 are generated. For the second increment (i.e., T2), PI-Block receives entities e2 and e4. Considering the blocks already stored, entity e2 is added to blocks b3 and b4 and entity e4 is added to b1, b2 and b3. Since the time threshold is two increments, the entities provided by the first increment should be discarded. Therefore, for the third increment (i.e., T3), the entities e1 and e3 are removed from
the blocks. Furthermore, it is important to note that block b5 is discarded since all entities contained in this block were removed. Finally, entities e5 and e7 are inserted into block b1 and a new block b6 is generated with the entities e5 and e7.

Table 1: Data sources characteristics.

Pairs of Datasets   |D1|     |D2|     |M|      |A1|   |A2|
Abt-Buy             1,076    1,076    1,076    3      3
Amazon-GP           1,354    3,039    1,104    4      4
DBLP-ACM            2,616    2,294    2,224    4      4
IMDB-DBpedia        27,615   23,182   22,863   4      7
6 EXPERIMENTS
In this section, we evaluate PI-Block² and Streaming Metablocking in terms of effectiveness and efficiency. We run our experiments on a cluster infrastructure with 13 nodes (one master and 12 slaves), each one with one core. Each node has an Intel(R) Xeon(R) 1.0GHz CPU and 6GB of memory, and runs the 64-bit Debian GNU/Linux OS with a 64-bit JVM and Apache Spark 2.0.1³. In our evaluation, four real-world pairs of data sources⁴ were used. Table 1 shows the number of entities (D) and attributes (A) contained in each dataset, and the number of duplicates (i.e., matches, M) present in each pair of data sources.

² https://github.com/brasileiroaraujo/Streaming/
³ https://spark.apache.org/
⁴ Available in the project's repository.
To simulate the streaming behavior on the data, a data streaming sender was implemented. This data streaming sender reads the entities from the data sources and sends them to the Kafka producer. The Kafka producer is responsible for providing the data, in a continuous way, to be consumed by the PI-Block technique at each τ time interval (i.e., increment).
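Such a sender can be sketched as follows with the kafka-python client. The paper does not specify its client library, topic names, or serialization, so everything below is an assumption for illustration:

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def send_increment(topic, entities, tau_seconds):
    """Send one increment's entities, then wait out the τ interval."""
    for entity in entities:
        producer.send(topic, entity)
    producer.flush()
    time.sleep(tau_seconds)

# e.g., send one increment of D1 (a list of attribute-value dicts)
# every τ = 60 seconds:
# send_increment("d1-entities", increment_of_d1, 60)
```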
To measure the effectiveness of blocking, three quality metrics have been applied: i) Pair Completeness (PC), similar to recall, estimates the portion of matches that were identified, denoted by PC = |M(B′)| / |M(D1, D2)|, where |M(B′)| is the number of duplicate entities in the set of pruned blocks B′ and |M(D1, D2)| is the number of duplicate entities between the data sources D1 and D2; ii) Pair Quality (PQ), similar to precision, estimates the portion of executed comparisons that result in matches, denoted by PQ = |M(B′)| / ||B′||, where ||B′|| is the number of comparisons to be performed in the pruned blocks; iii) Reduction Ratio (RR) estimates the portion of comparisons that are avoided in B′ (i.e., ||B′||) with respect to the comparisons guided by the Cartesian product (i.e., |D1| · |D2|), and is defined by RR = 1 − ||B′|| / (|D1| · |D2|). PC, PQ and RR take values in the interval [0, 1], with higher values indicating a better result. However, PQ commonly presents low values since it considers all comparisons to be performed (in all blocks) [4, 13]. In terms of efficiency, we measure the whole execution time (i.e., including all steps) of PI-Block considering the execution of all increments. In addition, we evaluate the memory consumption of the distributed infrastructure. Thus, we calculate the average memory consumption (in percentage) over the nodes that compose the cluster.
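The three metrics translate directly into code; the following sketch (our own, taking pre-computed counts as arguments) restates the formulas above:

```python
def pc(matches_in_blocks, total_matches):
    """Pair Completeness: |M(B')| / |M(D1, D2)|."""
    return matches_in_blocks / total_matches

def pq(matches_in_blocks, comparisons_in_blocks):
    """Pair Quality: |M(B')| / ||B'||."""
    return matches_in_blocks / comparisons_in_blocks

def rr(comparisons_in_blocks, d1_size, d2_size):
    """Reduction Ratio: 1 - ||B'|| / (|D1| * |D2|)."""
    return 1 - comparisons_in_blocks / (d1_size * d2_size)
```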
To compare PI-Block against Streaming Metablocking, we designed two incremental input scenarios. First, we evaluate both techniques in a scenario where the increment size is the same for all increments. To this end, we set the number of entities per increment to 10% of the whole data source. Thus, for each data source, there are 10 increments, each containing 10% of the entities of the data source. In the second scenario, the increment size varies over the timeline, similar to many real-world streaming data sources. Here, we randomly define the percentage of entities from each data source to be sent in each of the 6 increments. Namely, these percentage values for D1 are 23%, 8%, 19%, 15%, 22%, 13%, and for D2 they are 12%, 26%, 11%, 20%, 7%, 24%.
For PI-Block-windowed, we vary the size of the time window in
order to evaluate the impact of the window size on the PI-Block
technique. Thus, we apply the notation α · τ to determine the
window size, such that α determines the number of time intervals
τ (i.e., increments) covered by the window.
Efficiency. In this experiment, we evaluate the efficiency of the PI-Block, PI-Block-windowed and Streaming Metablocking techniques. The execution times are given by the average of five executions of each blocking technique. Figures 3 and 4 illustrate the results of the comparative analysis between the PI-Block, PI-Block-windowed and Metablocking techniques for the fixed and varying incremental size scenarios, respectively.
Figure 3: Fixed incremental size scenario: execution time of PI-Block, PI-Block-windowed and Metablocking techniques.
Figure 4: Varying incremental size scenario: execution time of PI-Block, PI-Block-windowed and Metablocking techniques.
We evaluate the execution time (in seconds) varying the number of nodes (one up to 12) in the distributed infrastructure. For the four data source pairs, depicted in Figure 3 (a-d) and Figure 4 (a-d), both PI-Block techniques (with and without the time window strategy) outperformed Metablocking in all scenarios. This result is directly related to the novel workflow proposed in this work, which requires two fewer MapReduce jobs than the Metablocking workflow. The new workflow reduces the execution time by 46% and 45%, on average, for the fixed and varying incremental size scenarios, respectively. By comparing PI-Block-windowed against PI-Block, it is possible to notice a small decrease in the execution time, since the former applies the time window strategy (described in Section 5) and, therefore, takes into account a smaller number of entities to be compared.
Regarding IMDB-DBpedia (illustrated in Figures 3 (d) and 4 (d)), PI-Block (without time window) and Metablocking cannot be executed when fewer than 12 nodes are used in the distributed infrastructure. This occurs because these techniques consider all the entities sent in all increments; therefore, the data structure that stores the previously generated blocks requires a large amount of memory from the distributed infrastructure. For this reason, PI-Block and Metablocking have enough memory to be executed only when 12 nodes are used. This limitation led us to propose the PI-Block-windowed technique, which handles a larger number of entities efficiently. For this data source pair, the maximum window size applied was 4 · τ since larger window sizes exceed the memory available per node.
We also evaluate the memory consumption for the fixed incremental size scenario, varying the number of nodes used in the distributed infrastructure, as depicted in Figure 5. For DBLP-ACM (Figure 5 (a)), PI-Block-windowed consumes less memory than PI-Block and Metablocking for all numbers of nodes. PI-Block-windowed consumes less memory due to the application of the time window strategy, which keeps fewer entities in memory. For the pair IMDB-DBpedia (Figure 5 (b)), the memory consumption of PI-Block and Metablocking reaches (on average) around 95% of all available memory in the nodes when 12 nodes are applied. On the other hand, PI-Block-windowed (with a window size of 4 · τ) consumes around 47% of all available memory in the nodes. However, as the number of nodes decreases (and, consequently, the total available memory decreases), the average memory consumption increases.
Effectiveness. Regarding effectiveness, Tables 2 and 3 present the PC, PQ and RR metrics for the PI-Block, PI-Block-windowed and Streaming Metablocking techniques. For the PI-Block-windowed technique, the effectiveness results are shown for different window sizes, 2 · τ to 8 · τ for the fixed incremental size scenario and 2 · τ to 4 · τ for the varying incremental size scenario. Since PI-Block and Metablocking receive the same input and apply the same pruning criterion, they generate the same output; for this reason, the effectiveness results are the same for both techniques. Regarding the incremental size scenarios (i.e., fixed and varying), PI-Block and Metablocking present the same effectiveness results for both scenarios since the last pruned graph (i.e., after processing all increments) produces the same blocks for the fixed and varying scenarios.

For PC, PI-Block and Metablocking present promising results in both incremental size scenarios, achieving more than 96% for all data source pairs. However, since PI-Block-windowed considers only the entities sent within a time interval determined by the window size, PC is directly affected. Intuitively, the larger the window size, the better the PC value. This behavior was confirmed for both incremental size scenarios, as illustrated in Tables 2 and 3. For all data source pairs, PC tends to be low for small window sizes and higher for large window sizes. For instance, in the fixed incremental size scenario, PI-Block-windowed achieves PC values above 0.70 with a window size of 6 · τ for all data source pairs. In the varying incremental size scenario, PI-Block-windowed achieves PC values over 0.75 with a window size of 4 · τ for all data source pairs.
Figure 5: Memory consumption of PI-Block, PI-Block-windowed and Metablocking techniques for (a) DBLP-ACM and (b) IMDB-DBpedia data sources.
In terms of PQ, the blocking techniques achieve different values for each data source pair. PQ is directly related to the nature of the data sources (e.g., content, number of entities, number of attributes, and entropy of attribute values), which can interfere with the accuracy of the generated blocks [4]. However, PQ differs from the precision measure commonly used to evaluate the results of ER: PQ evaluates the accuracy of the generated blocks. Thus, the PQ values achieved by the PI-Block and Metablocking techniques, as depicted in Tables 2 and 3, are satisfactory results for blocking [1, 2, 6, 13].
RR estimates the relative decrease in the number of comparisons achieved by blocking techniques. Tables 2 and 3 show the RR values for the four data source pairs. RR is fundamental for measuring the efficiency gains of ER since it directly estimates the percentage of comparisons that are avoided after blocking. PI-Block presents promising results in terms of RR, achieving values higher than 0.75 for all data source pairs. For IMDB-DBpedia, PI-Block presents an RR value around 0.90. Therefore, PI-Block is able to enhance the efficiency of ER since it reduces the number of comparisons to be executed in the ER task by up to 90%.
Table 2: Fixed incremental size scenario: effectiveness results of PI-Block, PI-Block-windowed and Metablocking techniques.

                              PI-Block-windowed                                                          PI-Block/Metablocking
               2·τ                   4·τ                   6·τ                   8·τ
Data Sources   PC    PQ       RR     PC    PQ       RR     PC    PQ       RR     PC    PQ       RR     PC    PQ       RR
Abt-Buy        0.72  0.0190   0.96   0.87  0.0125   0.93   0.92  0.0098   0.92   0.96  0.0092   0.91   0.99  0.0091   0.90
Amazon-GP      0.17  4·10⁻⁴   0.91   0.34  5·10⁻⁴   0.84   0.71  8·10⁻⁴   0.82   0.92  9·10⁻⁴   0.80   0.98  0.0010   0.79
DBLP-ACM       0.37  0.0015   0.92   0.60  0.0015   0.86   0.81  0.0016   0.81   0.94  0.0016   0.78   0.97  0.0017   0.76
IMDB-DBpedia   0.27  3·10⁻⁴   0.96   0.58  3·10⁻⁴   0.93   0.85  3·10⁻⁴   0.92   0.95  3·10⁻⁴   0.91   0.98  3·10⁻⁴   0.89

Table 3: Varying incremental size scenario: effectiveness results of PI-Block, PI-Block-windowed and Metablocking techniques.

                                 PI-Block-windowed                            PI-Block/Metablocking
                   2·τ                        4·τ
Data Source Pair   PC    PQ          RR       PC    PQ          RR       PC    PQ          RR
Abt-Buy            0.84  0.0135      0.94     0.94  0.0095      0.91     0.99  0.0091      0.90
Amazon-GP          0.21  4.36·10⁻⁴   0.87     0.79  9.06·10⁻⁴   0.81     0.98  0.0010      0.79
DBLP-ACM           0.30  0.00161     0.93     0.76  0.00168     0.84     0.97  0.0017      0.76
IMDB-DBpedia       0.44  3.19·10⁻⁴   0.95     0.86  3.15·10⁻⁴   0.90     0.98  3.16·10⁻⁴   0.89

7 RELATED WORK
Several blocking techniques in stand-alone [2, 13] or parallel mode [1, 6] have been proposed to deal with heterogeneous data. In terms of incremental blocking techniques for relational data sources, [5, 8, 11] propose approaches capable of blocking entities in an incremental way. These works propose an incremental workflow for the blocking task, considering the evolutionary behavior of data
sources to perform the blocking. However, these works do not deal with heterogeneous and streaming data.

Regarding other incremental tasks that present useful strategies for ER, we can highlight [9, 15], which propose incremental models for the tasks of Name Disambiguation and Dynamic Graph Processing, respectively. More specifically in the ER context, [12] proposes an incremental approach to perform ER on social media data sources. Although such sources commonly provide heterogeneous data, that work ignores the challenges related to this kind of data: the proposed workflow generates an intermediate schema, and the data extracted from the data sources must follow this schema. Thus, [12] differs from our technique since it does not consider the heterogeneous data challenges and does not apply or propose blocking techniques to support ER. Some ER approaches that deal with streaming data, e.g., [10, 14], do not consider incremental processing and therefore discard the previously processed data. Thus, none of them deal simultaneously with the three challenges (i.e., heterogeneous data, streaming data and incremental processing) addressed by our work. However, these works (i.e., [10, 14]) apply the Spark Streaming and Kafka platforms, supporting our decision to apply this kind of platform for streaming data.
8 SUMMARY
Blocking techniques are widely applied in ER approaches as a pre-processing step in order to avoid the quadratic cost of the ER task. In this context, heterogeneous data, streaming data and incremental processing emerge as the major challenges faced by blocking techniques, resulting in a lack of techniques that address all these challenges [2, 4, 5]. In this sense, we propose PI-Block, a novel incremental schema-agnostic blocking technique that utilizes parallelism to enhance blocking efficiency. PI-Block is able to deal with streaming and incremental scenarios and to mitigate the challenges related to both. Based on the experimental results, we highlight that PI-Block presents better results regarding efficiency than the state-of-the-art technique (i.e., Streaming Metablocking) without negative impacts on effectiveness.
REFERENCES
[1] Tiago Brasileiro Araújo, Carlos Eduardo Santos Pires, and Thiago Pereira da Nóbrega. 2017. Spark-based Streamlined Metablocking. In ISCC.
[2] Tiago Brasileiro Araújo, Carlos Eduardo Santos Pires, Demetrio Gomes Mestre, Thiago Pereira da Nóbrega, Dimas Cassimiro do Nascimento, and Kostas Stefanidis. 2019. A noise tolerant and schema-agnostic blocking technique for entity resolution. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. ACM, 422–430.
[3] Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, and Thiago Pereira da Nóbrega. 2019. Incremental Blocking for Entity Resolution over Web Streaming Data. In IEEE/WIC/ACM International Conference on Web Intelligence. ACM, 332–336.
[4] Vassilis Christophides, Vasilis Efthymiou, and Kostas Stefanidis. 2015. Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web (2015).
[5] Dimas Cassimiro do Nascimento, Carlos Eduardo Santos Pires, and Demetrio Gomes Mestre. 2018. Heuristic-based approaches for speeding up incremental record linkage. Journal of Systems and Software 137 (2018), 335–354.
[6] Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, and Themis Palpanas. 2017. Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Information Systems 65 (2017), 137–157.
[7] V. Efthymiou, G. Papadakis, K. Stefanidis, and V. Christophides. 2019. MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. In EDBT.
[8] Anja Gruenheid, Xin Luna Dong, and Divesh Srivastava. 2014. Incremental record linkage. Proceedings of the VLDB Endowment 7, 9 (2014), 697–708.
[9] Wuyang Ju, Jianxin Li, Weiren Yu, and Richong Zhang. 2016. iGraph: an incremental data processing system for dynamic graph. Frontiers of Computer Science 10, 3 (2016), 462–476.
[10] Kun Ma and Bo Yang. 2017. Stream-based live entity resolution approach with adaptive duplicate count strategy. International Journal of Web and Grid Services 13, 3 (2017), 351–373.
[11] Markus Nentwig and Erhard Rahm. 2018. Incremental Clustering on Linked Data. In IEEE International Conference on Data Mining Workshops (ICDMW). IEEE. (to appear).
[12] Bernd Opitz, Timo Sztyler, Michael Jess, Florian Knip, Christian Bikar, Bernd Pfister, and Ansgar Scherp. 2014. An Approach for Incremental Entity Resolution at the Example of Social Media Data. In AIMashup@ESWC.
[13] George Papadakis, George Papastefanatos, Themis Palpanas, and Manolis Koubarakis. 2016. Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In EDBT.
[14] Xiangnan Ren and Olivier Curé. 2017. Strider: A hybrid adaptive distributed RDF stream processing engine. In International Semantic Web Conference.
[15] Alan Filipe Santana, Marcos André Gonçalves, Alberto H. F. Laender, and Anderson A. Ferreira. 2017. Incremental author name disambiguation by exploiting domain-specific heuristics. Journal of the Association for Information Science and Technology 68, 4 (2017), 931–945.