Journal of Parallel and Distributed Computing 168 (2022) 57–69

Article history: Received 15 November 2021; Received in revised form 27 April 2022; Accepted 8 June 2022; Available online 10 June 2022

Keywords: Dbscan; Density-based clustering; Parallel algorithm; Distributed-memory

Abstract

Dbscan is a density-based clustering algorithm well known for its ability to discover clusters of arbitrary shape as well as to distinguish noise. As it is computationally expensive for large datasets, research on the parallelization of Dbscan has received a considerable amount of attention. In this paper we present an exact, efficient, and scalable parallel Dbscan algorithm which we call Hy-Dbscan. It employs three major techniques to enable scalable data clustering on distributed-memory computers: i) a modified kd-tree for domain decomposition, ii) a spatial indexing approach based on grids and inference, and iii) a cluster merging scheme based on a distributed Rem's Union-Find algorithm. Moreover, Hy-Dbscan exploits both process-level and thread-level parallelization. In experiments, we demonstrate performance and scalability using two scientific datasets on up to 2048 cores of a distributed-memory computer. Through extensive evaluation, we show that Hy-Dbscan significantly outperforms previous state-of-the-art Dbscan implementations.

https://doi.org/10.1016/j.jpdc.2022.06.005
© 2022 Elsevier Inc. All rights reserved.
reported that accelerate the neighborhood querying while preserving the clustering quality. (3) Expensive merging. To merge subclusters across blocks, existing distributed methods use global synchronization with three different communication patterns, i.e., master/slave, parallel merging, and neighbor exchange. In this step, massive merging requests may cause communication congestion. At the same time, communication deadlocks can also occur when multiple blocks keep sending messages to each other even though the correct global clusters have already been built.

In this paper, we propose Hy-Dbscan, an exact, scalable Dbscan algorithm which addresses the described issues proactively. 1) To improve load balancing, we use a modified kd-tree to distribute point data across blocks. By design, the splitting planes in the kd-tree decomposition are constrained by ghost zones, which ensures that the number of points in each block is as even as possible after gathering ghost points. 2) In the local clustering process, we use grid-based spatial indexing to serve region queries. We also attempt to find core grid cells and infer their cluster membership without evaluating the points inside these core cells, which saves a substantial fraction of ε-neighborhood queries. 3) For cluster merging, we propose a distributed Rem's Union-Find algorithm which avoids communication deadlock and converges to the correct result within a finite number of iterations.

Hy-Dbscan is implemented with MPI and OpenMP; it exploits two levels of parallelization by mapping processes to multi-core processors and threads to cores, and is able to deliver high scalability on distributed-memory computers. In our experimental evaluation, we evaluate both the scalability of the algorithm and the quality of the results, i.e., how close these results are to those of a sequential Dbscan. We compare our approach with competing Dbscan implementations on real-world datasets. Experiments show that while achieving excellent performance and scalability, our approach produces clustering results of comparable quality to those of the classical Dbscan algorithm. Furthermore, Hy-Dbscan often outperforms other distributed Dbscan implementations, e.g., Hpdbscan [11] and Pdsdbscan [28], by up to 285 times.

To summarize, the main technical contributions are as follows:

• We propose an exact scalable Dbscan algorithm, Hy-Dbscan, which outperforms other distributed Dbscan implementations;
• We present optimizations for data partitioning, spatial indexing, and cluster merging that together provide efficiency and scalability;
• We exploit hybrid MPI+OpenMP parallelization to take advantage of the resources of modern HPC architectures;
• We successfully demonstrate the accuracy, performance, and scalability of our algorithm.

The remainder of this paper is organized as follows. In Section 2, we survey the related work. In Section 3, we review the classical Dbscan algorithm. In Section 4, we describe Hy-Dbscan in detail. In Section 5, we present the hybrid parallelization implementation. In Section 6, we present the performance evaluation. Finally, we discuss and conclude the work in Sections 7 and 8, respectively.

cess sequentiality of classical Dbscan. They implemented parallel versions for shared-memory and distributed-memory architectures, respectively. The same authors developed an end-to-end clustering solution, Bd-cats [30], and demonstrated its application to plasma physics and cosmology simulations at trillion-particle scale. Götz et al. [11] presented a highly parallel approach called Hpdbscan. It employs a data-index preprocessing step, a rule-based cluster merging scheme, and a cost heuristic that allows balancing the computation workload. Sarma et al. [31] proposed a scalable distributed algorithm, μDbscan, which optimizes neighborhood queries by using a two-level R-tree.

Besides parallelization with MPI, MapReduce-based [7,18,35] and Spark-based [9,16,33] approaches have been proposed as well. MapReduce and Spark provide users with readily usable programming interfaces while hiding the messy details of parallelism, whereas MPI demands that programmers take care of the implementation in detail. Neukirchen [25] investigated the runtime performance and scalability of several publicly available parallel Dbscan implementations running either on HPC platforms or on big data platforms. He found that the big data implementations are not yet mature and that Hpdbscan [11] outperforms all other considered implementations.

Graphics processing units (GPUs) have also been used for clustering analysis. Mr.Scan [37] is the first distributed, end-to-end Dbscan system that incorporates the use of GPUs. The authors only implemented a 2D version, but claim it is feasible to extend the approach to any d-dimensional Euclidean space. Gowanlock et al. [12,13] proposed a hybrid CPU/GPU algorithm; however, their approach is developed only for shared-memory systems. Mustafa et al. [24] reviewed the existing GPU algorithms and conducted an experimental study comparing them to identify the best-performing one.

As the computation time needed to calculate clusters precisely on large datasets is intolerable, several approximate Dbscan algorithms have been developed to replace exact Dbscan, for example, ρ-Dbscan [8], Pardicle [29], and Ng-Dbscan [21]. If slight inaccuracy in the clustering results is permitted, the running time can be brought down to O(n). However, for all approximate computing techniques, there exists a speed-vs-accuracy tradeoff, and tuning the degree of approximation is a nontrivial task.

3. The DBSCAN algorithm

Dbscan is a density-based clustering algorithm which defines density via two parameters, ε and minpts. Dbscan operates by finding dense regions and expanding them in order to form clusters. In the following, based on [6,28], we first review some definitions of concepts used in Dbscan and then present a brief description of sequential Dbscan.

Let X be the set of data points to be clustered. The neighborhood of a point x ∈ X within a given radius ε is called the ε-neighborhood of x, denoted by N_ε(x). More formally, N_ε(x) = {y | dist(x, y) ≤ ε, y ≠ x}, where dist(x, y) is the distance measure between x and y.
Fig. 1. (a) Illustration of kd-tree geometric partitioning from a molecular dynamics simulation. The domain is divided into 8 irregularly sized blocks. (b) 2D example showing kd-tree partitioning with ghost zones. The width of the ghost zone is ε. The splitting line is selected to ensure that the number of points that fall inside each partition (including its ghost zone) is approximately equal.
Definition 3 (Density-reachable). A point y is density-reachable from a point x if there exists a chain of points x_1, ..., x_n with x_1 = x, x_n = y such that x_{i+1} is directly density-reachable from x_i for all 1 ≤ i < n.

Definition 4 (Density-connected). A pair of points (x, y) ∈ X is density-connected if there exists a point o ∈ X such that both x and y are density-reachable from o.

Definition 5 (Border point). A point x is a border point if it is not a core point but is density-reachable from another core point.

Definition 6 (Noise point). A point x is a noise point if it is not directly density-reachable from any core point.

Definition 7 (Cluster). A cluster C is a maximal set of density-connected points satisfying the following two conditions:

• Maximality: if a core point x ∈ C, then all the points density-reachable from x also belong to C.
• Connectivity: for any pair (x, y) ∈ C, x is density-connected to y.

The pseudocode for the sequential Dbscan algorithm is given in Algorithm 1. Initially, all points are marked unvisited. The algorithm starts with an arbitrary point. Once an unvisited core point is found, a cluster is formed by recursively retrieving directly density-reachable or density-reachable points. The clustering process continues until all points have been assigned to a cluster or classified as noise.

Algorithm 1 Pseudocode of the sequential Dbscan algorithm [28]. Input: point dataset X, radius ε, and density threshold minpts. Output: a set of clusters C.
1: procedure Dbscan(X, ε, minpts)
2:   for each unvisited point x ∈ X do
3:     mark x as visited;
4:     if |N_ε(x)| < minpts then
5:       mark x as noise;
6:     else
7:       C ← {x};
8:       for each point x' ∈ N_ε(x) do
9:         if x' is not visited then
10:          mark x' as visited;
11:          if |N_ε(x')| ≥ minpts then
12:            N_ε(x) ← N_ε(x) ∪ N_ε(x');
13:        if x' is not yet a member of any cluster then
14:          C ← C ∪ {x'};

The time complexity of Algorithm 1 is O(n · Q), where n is the number of points and Q is the complexity of an ε-neighborhood query [32]. Dbscan without a spatial index is O(n²) in time complexity, because all points in the dataset must be compared with each other. To overcome this high time complexity, previous studies often employ spatial indexing methods (e.g., trees or grids) to speed up the Dbscan algorithm.
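To make the control flow of Algorithm 1 concrete, the following minimal sequential sketch in C++ folds the visited flag and the noise mark into a single label array. It is our own illustrative translation, not code from any of the cited implementations, and the brute-force regionQuery is exactly the O(n) scan that the spatial indexes mentioned above replace.

```cpp
#include <cstddef>
#include <vector>

struct Point { std::vector<double> coord; };

constexpr int UNVISITED = -2;
constexpr int NOISE = -1;

// Brute-force ε-neighborhood query (the O(n) scan that a spatial
// index replaces): all points within distance eps of X[i], except i.
static std::vector<std::size_t> regionQuery(const std::vector<Point>& X,
                                            std::size_t i, double eps) {
    std::vector<std::size_t> nbrs;
    for (std::size_t j = 0; j < X.size(); ++j) {
        if (j == i) continue;
        double d2 = 0.0;
        for (std::size_t k = 0; k < X[i].coord.size(); ++k) {
            const double diff = X[i].coord[k] - X[j].coord[k];
            d2 += diff * diff;
        }
        if (d2 <= eps * eps) nbrs.push_back(j);
    }
    return nbrs;
}

// Sequential Dbscan following Algorithm 1; returns one label per
// point: a cluster id >= 0, or NOISE.
std::vector<int> dbscan(const std::vector<Point>& X, double eps,
                        std::size_t minpts) {
    std::vector<int> label(X.size(), UNVISITED);
    int cluster = 0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (label[i] != UNVISITED) continue;             // already visited
        std::vector<std::size_t> seeds = regionQuery(X, i, eps);
        if (seeds.size() < minpts) { label[i] = NOISE; continue; }
        label[i] = cluster;                              // new core point
        for (std::size_t s = 0; s < seeds.size(); ++s) { // expand cluster
            const std::size_t j = seeds[s];
            if (label[j] == NOISE) label[j] = cluster;   // border point
            if (label[j] != UNVISITED) continue;
            label[j] = cluster;
            std::vector<std::size_t> nbrs = regionQuery(X, j, eps);
            if (nbrs.size() >= minpts)                   // j is also core
                seeds.insert(seeds.end(), nbrs.begin(), nbrs.end());
        }
        ++cluster;
    }
    return label;
}
```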
4. The proposed algorithm: HY-DBSCAN

4.1. Overview

Hy-Dbscan is an exact, scalable algorithm for density-based clustering, consisting of a number of steps. The algorithm starts by decomposing the domain into a kd-tree and sorting points into blocks. Each block then contains a defined region of space and all the points inside that space. Before each block starts local clustering based on its subset of points, points in ghost zones are exchanged between neighbor blocks. During local Dbscan, the disjoint-set data structure is used to efficiently merge points into clusters, and also to store clustering information. Next, representative points are sent to intermediate blocks to start the merge step, where clusters are progressively merged in rounds. Finally, we traverse clusters globally and compute their IDs. To the best of our knowledge, other existing parallel Dbscan algorithms also follow this same framework, differing only in the redesign of some individual steps [11,28,30,31,37].

In the following, we describe Hy-Dbscan in detail, including geometric partitioning, point exchange with ghost points, local Dbscan, merging clusters, and computing cluster IDs.

4.2. Geometric partitioning

After file reads, the following step is to partition the input data points into blocks (subdomains) such that the computation in each block is based only on its own data. One of the most naive approaches is to decompose the domain into a regular grid of blocks with uniform size. Although regular decomposition is easy to implement, points are not distributed evenly throughout the domain. To cope with this problem, the kd-tree has been previously studied in the context of clustering algorithms. The motivation of kd-tree partitioning is to decompose the domain into irregularly sized blocks that have approximately the same number of points (see Fig. 1(a)).
Different from other state-of-the-art algorithms [30,31], we use a histogram-based median computation technique to distribute points across blocks. We start with the entire bounds of the point data as input to create a kd-tree with n_p leaves, where n_p is the number of MPI processes. Initially, the root node of the kd-tree contains all points. In the first iteration, each process computes a local histogram of its points along the current dimension and sends it to a root process for gathering. The root process determines the median and broadcasts it to the other processes within the group. Once the median is received, processes exchange data with their partners. For this, the processes are divided into two subgroups and each process from one subgroup chooses a partner from the other subgroup to exchange points. Half the processes send points above the median to their partners, and the partners send the points below the median back to the former processes. In subsequent iterations, the communication happens only within each newly created subgroup. Given n_p processes, the algorithm cycles through the dimensions to split the data domain until log₂ n_p iterations have been executed. By construction, the result is a kd-tree. One bisection step is sketched in code at the end of this subsection.

During the actual process of clustering, we usually need to gather ghost points between neighbor blocks, producing an uneven redistribution of points again. When a point set is highly clustered, this step can lead to severe load imbalance. We overcome this problem with a novel redesign of the median selection. Fig. 1(b) shows the basic idea. In this design, we use a sliding window to compute a median. The sliding window is an overlapped region between two partitions with a width of 2ε. Having computed the histogram, we slide the window, attempting to find the split location that ensures the number of points in each partition (including its ghost zone) is approximately equal. This scheme results in much better load balance in the following steps of parallel Dbscan.

Each of the resulting blocks is exclusively mapped to an MPI process that is responsible for data clustering. Throughout this paper, we ignore the distinction between block and process, and use these terms interchangeably.
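For illustration, a minimal sketch of one bisection step of the histogram-based median computation described above, using MPI_Reduce and MPI_Bcast. The bin count, the names, and the choice to return the right edge of the median bin are our assumptions rather than details of the Hy-Dbscan source, and the ghost-zone-aware sliding-window refinement of Fig. 1(b) is omitted.

```cpp
#include <mpi.h>
#include <vector>

// One bisection step: histogram the local coordinates along the current
// dimension, sum the histograms on rank 0 of this (sub)group, let rank 0
// pick the bin containing the global median, and broadcast the split.
double medianSplit(const std::vector<double>& coords,  // local values
                   double lo, double hi,               // global bounds
                   MPI_Comm comm, int bins = 1024) {
    std::vector<long long> local(bins, 0), global(bins, 0);
    const double width = (hi - lo) / bins;
    for (double c : coords) {
        int b = static_cast<int>((c - lo) / width);
        if (b < 0) b = 0;
        if (b >= bins) b = bins - 1;
        ++local[b];
    }
    MPI_Reduce(local.data(), global.data(), bins, MPI_LONG_LONG,
               MPI_SUM, 0, comm);

    int rank = 0;
    MPI_Comm_rank(comm, &rank);
    double split = 0.0;
    if (rank == 0) {
        long long total = 0;
        for (long long g : global) total += g;
        long long seen = 0;
        int b = 0;
        while (b < bins - 1 && seen + global[b] < total / 2)
            seen += global[b++];
        split = lo + (b + 1) * width;  // right edge of the median bin
    }
    MPI_Bcast(&split, 1, MPI_DOUBLE, 0, comm);
    return split;
}
```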
4.3. Point exchange with ghost points

After geometric partitioning, the bounding boxes of the data stored at each block do not intersect. In order to be able to answer all ε-neighborhood queries while avoiding costly communication, we gather all points belonging to other blocks that lie in an ε-extended strip (also known as the ghost zone) of the block boundary. Ghost points do not change the actual extent of a block and can be removed after the local clustering step.

As blocks in a kd-tree decomposition are not evenly spaced, a block may be very close to another block that it does not intersect [23]. To gather all ghost points, we adaptively grow the blocks' neighborhoods. We define the k-neighborhood of a block to be the set of blocks at Manhattan distance k. The point exchange algorithm proceeds in rounds. In each round, as in Algorithm 2, each block b iterates over all of its k-neighborhood. If a point of its k-neighborhood lies in the ghost zone of block b, we enqueue it to block b. The iteration can safely stop once no (k+1)-neighborhood enqueues any points to block b.

Algorithm 2 Point exchange with ghost points. Input: point dataset X and radius ε. Output: blocks with ghost points.
1: procedure PointExchange(X, ε)
2:   decompose domain using a kd-tree
3:   for each block b do in parallel
4:     k ← 1
5:     nbrs ← k-neighborhood
6:     while not done do
7:       for each neighbor block b' ∈ nbrs do
8:         for each point p ∈ b' do
9:           if dist(p, b) ≤ ε then
10:            enqueue p to b
11:      all-queued ← AllAdd(number of points enqueued)
12:      if all-queued = 0 then
13:        break
14:      exchange points
15:      nbrs ← (k+1)-neighborhood
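The AllAdd in Algorithm 2 is a global sum over the per-block counts of enqueued points; a minimal sketch of the corresponding termination test, assuming one block per MPI rank:

```cpp
#include <mpi.h>

// AllAdd from Algorithm 2: a global sum of the points enqueued in this
// round. The exchange loop may stop once the sum is zero, because then
// no (k+1)-neighborhood can contribute any further ghost points.
bool exchangeDone(long long locallyEnqueued, MPI_Comm comm) {
    long long allQueued = 0;
    MPI_Allreduce(&locallyEnqueued, &allQueued, 1, MPI_LONG_LONG,
                  MPI_SUM, comm);
    return allQueued == 0;
}
```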
4.4. Local DBSCAN

Having exchanged ghost points between blocks, the local Dbscan computation follows. The original Dbscan paper proposes the use of spatial indexing to reduce time complexity [6]. Šidlauskas et al. [34] show that grids are superior to trees (e.g., kd-tree, R-tree) for index creation and queries. In this paper, we use a far more efficient approach to query and merge points.

4.4.1. Merging points using disjoint sets

There can be a large number of merges in the local computation, which would be very expensive. To avoid excessive updates to cluster assignments, we use the disjoint-set data structure [28], which stores label-equivalence information of points. It has two main operations: Find and Union. Initially, each point has a pointer pointing to itself. During the cluster expansion phase, when a point is to be added to a cluster, only pointers need to be updated to represent a point and the cluster it belongs to. There exist several ways of implementing the Union-Find algorithm; the two well-known techniques are Union-by-rank and Union-by-size. In this paper, we have used Rem's Union-Find algorithm (a higher-indexed root points to a lower-indexed root) [5]. The basic outline of Rem's algorithm for merging two points is shown in Algorithm 3. Two pointers move upwards along their respective Find-paths, and the Union operation is performed as soon as one pointer reaches its root with the higher number. The algorithm also uses a compression technique known as splicing to compress the tree. An experimental study shows that Rem's algorithm is faster than other implementations in practice [26].

Algorithm 3 Rem's algorithm.
1: procedure Rem(x, y)
2:   r_x ← x, r_y ← y
3:   while p(r_x) ≠ p(r_y) do
4:     if p(r_x) > p(r_y) then
5:       if r_x = p(r_x) then
6:         p(r_x) ← p(r_y), break
7:       z ← p(r_x), p(r_x) ← p(r_y), r_x ← z
8:     else
9:       if r_y = p(r_y) then
10:        p(r_y) ← p(r_x), break
11:      z ← p(r_y), p(r_y) ← p(r_x), r_y ← z
efficient approach to query and merge points. of every pair of points within the same cell is at most . All grid
All grid cells that contain at least minpts points are marked as core cells. Points within core cells are marked as core points. In return, there is a reduction in the complexity of the local computation. While we use this grid-based spatial indexing, we make a minor modification, which is to index only non-empty grid cells. Thus, the size of the index is a function of the space occupied by the data, and not of the entire bounding volume.
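One way to realize this non-empty-cell indexing is a hash map from packed integer cell coordinates to the point indices they contain, so that memory scales with the occupied cells rather than with the bounding volume; the 64-bit key packing, the fixed d = 3, and all names below are our illustrative choices, not the paper's data structure:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Pack three non-negative 21-bit cell coordinates into one 64-bit key
// (assumes fewer than 2^21 cells per axis; d = 3 is fixed for brevity).
static std::uint64_t cellKey(const std::array<int, 3>& c) {
    return (std::uint64_t(c[0]) << 42) | (std::uint64_t(c[1]) << 21) |
           std::uint64_t(c[2]);
}

// Sort point indices into cells of side eps/sqrt(d). Only occupied
// cells get a map entry, so the index grows with the space the data
// occupies rather than with the block's bounding volume.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
buildGrid(const std::vector<std::array<double, 3>>& pts,
          const std::array<double, 3>& lo, double eps) {
    const double side = eps / std::sqrt(3.0);
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        std::array<int, 3> c;
        for (int k = 0; k < 3; ++k)
            c[k] = static_cast<int>((pts[i][k] - lo[k]) / side);
        grid[cellKey(c)].push_back(i);
    }
    return grid;
}
```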
In the local Dbscan step, points are first merged using the Union-Find algorithm [26] within each core cell. The algorithm then proceeds by attempting to merge core cells that are adjacent to each other. For this, a Bichromatic Closest Pair (BCP) [1] problem needs to be solved. The goal of the BCP problem is to find the pair of points (x, y) ∈ c_i × c_j with the smallest distance, where c_i and c_j are core cells and x ∈ c_i, y ∈ c_j. If dist(x, y) ≤ ε, then x is density-reachable from y, and c_i and c_j should be merged. As an optimization, when performing the BCP operations, once a point pair is found with a distance less than ε, the rest of the comparisons for those two cells are skipped (see the sketch after the following list).

Moreover, we do not need to solve the BCP problem for all pairs of neighboring core cells. We find that the merging of core cells has redundancies, which are described as follows [3]:

• Symmetry: a core cell c_i needs to perform a merging operation with a core cell c_j and vice versa. The two cases are equivalent, and either one can be skipped.
• Transitivity: if the core cells c_i and c_j are in the same cluster, and so are c_j and c_k, then c_i and c_k must be in the same cluster, and the merging operation between c_i and c_k can be skipped.
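A sketch of the early-exit BCP test under the hash-grid layout sketched in Section 4.4.2: because merging only needs the existence of some pair within ε, the function may return on the first hit instead of computing the true closest pair (the names and the fixed dimension are again our assumptions):

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Early-exit BCP test between two core cells: merging only needs the
// existence of a pair within eps, so we return on the first hit and
// skip the remaining comparisons.
bool bcpWithinEps(const std::vector<std::array<double, 3>>& pts,
                  const std::vector<std::size_t>& cellA,
                  const std::vector<std::size_t>& cellB, double eps) {
    const double eps2 = eps * eps;
    for (std::size_t i : cellA) {
        for (std::size_t j : cellB) {
            double d2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                const double diff = pts[i][k] - pts[j][k];
                d2 += diff * diff;
            }
            if (d2 <= eps2) return true;  // pair found: cells merge
        }
    }
    return false;
}
```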
Algorithm 4 Local Dbscan. Input: point dataset X, radius ε, and density threshold minpts.
1: procedure LocalDbscan(X, ε, minpts)
2:   for each block b do in parallel
3:     construct grid and sort points into grid cells
4:     mark core cells and core points
5:     merge core points within each cell
6:     for each core cell c do                      ▷ process core cells
7:       G ← NeighbourGridQuery(c)
8:       for each core cell c' ∈ G do
9:         select two points x ∈ c, x' ∈ c'
10:        if Find(x) = Find(x') then
11:          c and c' are in the same cluster; skip
12:        else if Bcp(c, c') is true then
13:          Union(x, x')
14:    for each non-core cell c do                  ▷ process non-core cells
15:      for each point x ∈ c do
16:        if |N_ε(x)| ≥ minpts then
17:          mark x as core point
18:          for each point y ∈ N_ε(x) do
19:            if y is a core point then
20:              Union(x, y)
21:            else if y is not yet a member of any cluster then
22:              Union(x, y)
23:        else if x is not yet a member of any cluster then
24:          for each point y ∈ N_ε(x) do
25:            if y is a core point then
26:              Union(x, y) and skip

4.5. Merging clusters

After local clustering, every point in a block is either a local tree root or a direct child of the local root; all points within the same tree have a common parent. In the cluster merging step, only local tree roots communicate with other blocks; non-root points are not involved in any communication.
The merge loop iterates until all Union query messages have been consumed. After merging, distributed disjoint-set trees are formed.

Algorithm 5 Distributed Rem's Union-Find algorithm. Input: a list of local disjoint-set trees T and a queue U containing Union query messages. Output: a list of global disjoint-set trees.
1: procedure MergeClusters(T, U)
2:   while any block has any Union query messages do
3:     for each block b do in parallel
4:       for each Union query (b_x, x; b_y, y) ∈ U do
5:         r ← p(y)
6:         if x = r and b_x = b_r then
7:           already linked; skip
8:         if b_x < b_r then
9:           if p(r) = r and b_r = b then            ▷ r is a global root
10:            p(r) ← x, b_p(r) ← b_x, and skip
11:          else
12:            enqueue Union query (b_x, x; b_p(r), p(r))
13:            p(r) ← x, b_p(r) ← b_x                ▷ path compression
14:        else
15:          enqueue Union query (b_p(r), p(r); b_x, x)
16:      exchange Union query messages
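The per-round message exchange of Algorithm 5 can be realized with a personalized all-to-all; the sketch below, with our own message layout and buffer handling (the paper does not specify a wire format), sends each rank's outgoing Union queries and returns the incoming ones:

```cpp
#include <mpi.h>
#include <vector>

// A Union query names a point by (block, local index) on each side,
// mirroring the (b_x, x; b_y, y) tuples of Algorithm 5.
struct UnionQuery {
    int srcBlock, srcIndex;  // b_x, x
    int dstBlock, dstIndex;  // b_y, y
};

// One round: send the queries enqueued for every rank, receive ours.
std::vector<UnionQuery>
exchangeQueries(const std::vector<std::vector<UnionQuery>>& outbox,
                MPI_Comm comm) {
    int np = 0;
    MPI_Comm_size(comm, &np);
    std::vector<int> sendCnt(np), recvCnt(np), sdisp(np), rdisp(np);
    std::vector<UnionQuery> sendBuf, recvBuf;
    for (int r = 0; r < np; ++r) {
        sendCnt[r] = static_cast<int>(outbox[r].size());
        sdisp[r] = static_cast<int>(sendBuf.size());
        sendBuf.insert(sendBuf.end(), outbox[r].begin(), outbox[r].end());
    }
    // First exchange the counts, then the payload itself.
    MPI_Alltoall(sendCnt.data(), 1, MPI_INT, recvCnt.data(), 1, MPI_INT,
                 comm);
    int total = 0;
    for (int r = 0; r < np; ++r) { rdisp[r] = total; total += recvCnt[r]; }
    recvBuf.resize(total);

    MPI_Datatype qtype;  // four contiguous ints per query
    MPI_Type_contiguous(4, MPI_INT, &qtype);
    MPI_Type_commit(&qtype);
    MPI_Alltoallv(sendBuf.data(), sendCnt.data(), sdisp.data(), qtype,
                  recvBuf.data(), recvCnt.data(), rdisp.data(), qtype,
                  comm);
    MPI_Type_free(&qtype);
    return recvBuf;
}
```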
4.6. Computing cluster IDs

Algorithm 6 Computing cluster IDs. Input: a list of global disjoint-set trees T. Output: the cluster ID of each point.
1: procedure ComputeClusterIds(T)
2:   for each block b do in parallel
3:     gid ← global ID of the block
4:     n_b ← number of global roots in b
5:     label the global roots in b with IDs ranging from 0 to n_b
6:     Vec ← AllReduce(n_b)
7:   for each block b do in parallel
8:     offset ← Σ_{i=0}^{gid−1} Vec[i]
9:     add offset to the IDs of global roots in b
10:    for each point x ∈ b do
11:      r ← p(x)
12:      if r is a global root then
13:        clusterID(x) ← clusterID(r)
14:      else
15:        enqueue Grandparent query (b, x; b_p(r), p(r))
16:  while any block has any messages do             ▷ exchange messages
17:    for each block b do in parallel
18:      if the current message (b_x, x; b_y, y) is a Grandparent query then
19:        if p(y) is a global root then
20:          enqueue Grandparent reply (b_p(y), p(y); b_x, x)
21:        else
22:          enqueue Grandparent query (b_x, x; b_p(y), p(y))
23:      else if the current message (b_x, x; b_y, y) is a Grandparent reply then
24:        clusterID(y) ← clusterID(x)
25:    exchange messages
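The offset in line 8 of Algorithm 6 is an exclusive prefix sum over the per-block root counts. Algorithm 6 obtains it from an AllReduce of the full vector; an equivalent alternative sketch uses MPI_Exscan, so that no rank needs to materialize Vec:

```cpp
#include <mpi.h>

// offset = Vec[0] + ... + Vec[gid-1] (line 8 of Algorithm 6) as an
// exclusive prefix sum over the per-block global-root counts, without
// materializing the whole vector on every rank.
long long rootIdOffset(long long nb, MPI_Comm comm) {
    long long offset = 0;
    MPI_Exscan(&nb, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    int rank = 0;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0;  // MPI_Exscan leaves rank 0 undefined
    return offset;
}
```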
Table 1
Attributes of benchmark datasets.
Fig. 3. Clusters identified in the datasets with the default parameters listed in Table 1. Clusters are colored by cluster ID; each set of points in the same color represents a single cluster.
Fig. 4. (a)-(b) Impact of the ε parameter on runtime. (c)-(d) Impact of the minpts parameter on runtime.
For the MD dataset, local clustering takes from about 99% of the runtime at 1 core to 63% at 2048 cores. For the LIDAR dataset, local Dbscan takes from about 99% at 1 core to 80% at 2048 cores. Point exchange is the next largest component, taking around 28% and 16% of the total runtime at 2048 cores for the MD and LIDAR datasets, respectively. As the number of cores increases, the share of point exchange increases. The reason is that point exchange occurs at block boundaries, and more points are exchanged as the number of blocks increases. We continue to see that the cluster merging and cluster-ID computation steps take the least time on both datasets (< 1% of total time).
Fig. 5. (a)-(b) Execution time and (c)-(d) speed-up curves of the Hy-Dbscan application analyzing the MD and LIDAR datasets.
Fig. 6. Time taken by different steps of Hy-Dbscan on the MD and LIDAR datasets using 1–64 nodes.
6.7. Efficiency details

We contend that the superior efficiency of Hy-Dbscan is mainly attributable to the optimizations of its individual steps. In this section, we test the performance of each of these optimizations.

Load balance. Fig. 7 illustrates the load balance as the number of blocks varies. The load balance is defined as the ratio of the maximum to the average number of points in any block. The ratio remains nearly constant for kd-tree partitioning. After gathering ghost points, the degree of load imbalance tends to increase.
Fig. 7. Point balance for kd-tree domain decomposition. Each block receives roughly the same amount of data after the kd-tree decomposition (black). Neighbor point exchange results in imbalance (red). With the modified kd-tree proposed in Section 4.2, the load balance is improved (blue).
Fig. 8. Runtime comparison of different spatial indexing methods. In the framework of Hy-Dbscan, we use indexes such as kd-tree, R-tree, and ε-grid as substitutes for our approach proposed in Section 4.4.2.
For the MD dataset, the ratio varies from 1.0 to 1.14 at 64 blocks. In contrast, the ratio varies from 1.03 to 1.78 at 64 blocks for the LIDAR dataset. Therefore, the load balance of the MD dataset is much better than that of the LIDAR dataset. With our proposed method for selecting median planes, the load balance is improved for the skewed point data.
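A sketch of how this load-balance metric can be measured at runtime with two reductions, assuming each rank owns exactly one block:

```cpp
#include <mpi.h>

// Load balance = maximum / average number of points per block,
// computed with one MAX and one SUM reduction.
double loadBalance(long long localPoints, MPI_Comm comm) {
    long long maxPts = 0, sumPts = 0;
    int np = 0;
    MPI_Comm_size(comm, &np);
    MPI_Allreduce(&localPoints, &maxPts, 1, MPI_LONG_LONG, MPI_MAX, comm);
    MPI_Allreduce(&localPoints, &sumPts, 1, MPI_LONG_LONG, MPI_SUM, comm);
    return static_cast<double>(maxPts) /
           (static_cast<double>(sumPts) / np);
}
```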
Spatial indexing. Since the spatial indexing method is crucial to Dbscan, we also compare our results with those obtained using other index-based implementations. For this, we integrate kd-tree [20], R-tree [15], and grid [34] (a grid with ε-length cells, in the following just “ε-grid”) approaches into Hy-Dbscan for performing neighborhood queries. A runtime comparison of the proposed method with the considered implementations is shown in Fig. 8. We observe that our approach remarkably outperforms the other index-based implementations. This is because the merging of core cells avoids expensive region queries and greatly reduces the computational complexity of local Dbscan.

Cluster merging. Fig. 9 shows the performance of cluster merging based on the distributed Union-Find algorithm. We can see that the algorithm executes more rounds as the number of blocks increases. For the MD and LIDAR datasets, the loop terminates within 4 and 10 rounds, respectively. As only local roots are involved in communication, the overall time cost of cluster merging is negligible.

6.8. Comparison to Hpdbscan and Pdsdbscan

As discussed in the related work, there are a number of other parallel implementations of exact Dbscan. However, almost all authors do not provide the source code of their implementations, which prevents us from performing a direct comparison on the same hardware platform and benchmark datasets. To our knowledge, Pdsdbscan [28] and Hpdbscan [11] are open-source. However, Pdsdbscan throws runtime errors and cannot produce correct results. Therefore, we have implemented a new version of Pdsdbscan using MPI. In the following, we compare Hy-Dbscan with Pdsdbscan and Hpdbscan.

In this test, we use the default parameters listed in Table 1 and scale the core count from 1 to 2048. Hy-Dbscan and Hpdbscan run in hybrid parallel mode, and Pdsdbscan runs in MPI-only mode. Fig. 10 presents the obtained results. Hy-Dbscan performs consistently better than Hpdbscan and Pdsdbscan on both datasets considered in this paper. On the MD dataset, Hy-Dbscan is 1.8× ∼ 3.9× faster than Hpdbscan and 3.8× ∼ 6.2× faster than Pdsdbscan.
Fig. 9. Performance of cluster merging based on the distributed Union-Find algorithm. MPI processes exchange messages for cluster merging in rounds. When all processes finish their local work and no messages are being exchanged, the iteration terminates. The execution time is negligible compared with the other steps.
Fig. 10. Runtime comparison of Hy-Dbscan with Hpdbscan and Pdsdbscan using the default parameters listed in Table 1. Given the same number of cores, Hy-Dbscan and
Hpdbscan run in a hybrid parallel mode, and Pdsdbscan runs in an MPI-only mode.
On the LIDAR dataset, Hy-Dbscan is 12.2× ∼ 24.1× faster than Hpdbscan and 45× ∼ 285× faster than Pdsdbscan. As local Dbscan is the most dominant component of efficiency (recall Section 6.6), we conclude that our spatial indexing scheme can eliminate searching the ε-neighborhoods of a large fraction of points and thus provides a significant improvement in performance efficiency.

7. Discussion

In this section, we first discuss the advantages of our algorithm compared with other exact parallel algorithms running on HPC platforms. We then discuss the limitations of our algorithm.

7.1. Comparison with other methods

We compare our algorithm with other parallel algorithms in different aspects, which are detailed in Table 2.

Geometric partitioning. In previous algorithms, such as [28,31], the input point data is partitioned using sampling-based kd-tree decomposition. In these studies, as more points are sampled, the computed median more precisely matches the real median. However, for large datasets, it is difficult to determine an appropriate count of samples. In our work, a histogram-based approach is used to find the median when creating the kd-tree. Moreover, we choose the median considering the influence of ghost zones.

Spatial indexing. Welton et al. [37] used a kd-tree to detect dense data regions and a grid with ε/(2√d)-length cells to detect dense boxes. Gowanlock [12] employed two grids, one with ε-length cells and another with ε/(2√d)-length cells, for the dense-box calculation. In their algorithms, using cells of length ε/(2√d) guarantees that adjacent dense boxes contain points that are within ε of each other. In contrast, our algorithm uses a grid with cells of length ε/√d. Grid cells that have at least minpts points are marked as core cells, and points within them are marked as core points. Compared with [12,37], our grid cells are coarse-grained and the condition is less strict. Moreover, we employ an inference scheme to alleviate the redundancies of core cell merging. Consequently, local Dbscan can avoid more distance calculations.

Cluster merging. Götz et al. [11] presented a rule-based cluster merging scheme which employs collective synchronous communications to relabel clusters. Patwary et al. [28] presented a parallel technique to merge disjoint-set trees in the distributed-memory architecture. In comparison, we propose a distributed Rem's Union-Find algorithm to merge clusters. Since our algorithm
Table 2
Comparison with other exact parallel Dbscan algorithms running on HPC platforms.
[23] D. Morozov, T. Peterka, Efficient Delaunay tessellation through K-D tree decomposition, in: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society Press, Salt Lake City, UT, USA, 2016, pp. 728–738.
[24] H. Mustafa, E. Leal, L. Gruenwald, An experimental comparison of GPU techniques for DBSCAN clustering, in: IEEE International Conference on Big Data, 2019.
[25] H. Neukirchen, Elephant against goliath: performance of big data versus high-performance computing DBSCAN clustering implementations, in: First International Workshop, SimScience, Göttingen, Germany, 2017, pp. 251–271.
[26] M. Patwary, J. Blair, F. Manne, Experiments on union-find algorithms for the disjoint-set data structure, in: Proceedings of the 9th International Symposium on Experimental Algorithms, in: LNCS, vol. 6049, Springer, 2010, pp. 411–423.
[27] M.A. Patwary, P. Refsnes, F. Manne, Multi-core spanning forest algorithms using the disjoint-set data structure, in: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium, IPDPS 2012, 2012, pp. 827–835.
[28] M.M.A. Patwary, D. Palsetia, A. Agrawal, et al., A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in: SC12: International Conference for High Performance Computing, Networking, Storage and Analysis, Los Alamitos, CA, USA, 2012, pp. 1–11.
[29] M.M.A. Patwary, N. Satish, N. Sundaram, et al., PARDICLE: parallel approximate density-based clustering, in: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 560–571.
[30] M.M.A. Patwary, S. Byna, N. Satish, et al., BD-CATS: big data clustering at trillion particle scale, in: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, vol. 6, 2015, pp. 1–12.
[31] A. Sarma, P. Goyal, S. Kumari, A. Wani, J.S. Challa, S. Islam, N. Goyal, μDBSCAN: an exact scalable DBSCAN algorithm for big data exploiting spatial locality, in: IEEE International Conference on Cluster Computing, Albuquerque, NM, USA, 2019, pp. 1–11.
[32] E. Schubert, J. Sander, M. Ester, H.-P. Kriegel, X. Xu, DBSCAN revisited, revisited: why and how you should (still) use DBSCAN, ACM Trans. Database Syst. 42 (3) (2017) 19:1–19:21.
[33] T.P. Shibla, K.B.S. Kumar, Improving efficiency of DBSCAN by parallelizing kd-tree using Spark, in: International Conference on Intelligent Computing and Control Systems, Madurai, India, 2018, pp. 1197–1203.
[34] D. Šidlauskas, S. Šaltenis, C.W. Christiansen, J.M. Johansen, D. Šaulys, Trees or grids? Indexing moving objects in main memory, in: Proceedings of the 17th ACM SIGSPATIAL, 2009, pp. 236–245.
[35] H. Song, J.-G. Lee, RP-DBSCAN: a superfast parallel DBSCAN algorithm based on random partitioning, in: SIGMOD, ACM, Houston, TX, USA, 2018, pp. 1173–1187.
[36] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st edition, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[37] B. Welton, E. Samanas, B.P. Miller, Mr. Scan: extreme scale density-based clustering using a tree-based network of GPGPU nodes, in: SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, USA, 2013, pp. 84:1–84:11.
[38] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, Data Min. Knowl. Discov. 3 (1999) 263–290.

Guoqing Wu is an associate professor at the Institute of Applied Physics and Computational Mathematics, Beijing. He received his Ph.D. in computer science from the China Academy of Engineering Physics in 2009. His research interests include data-parallel processing, scientific data visualization, and software development. In particular, he focuses on developing parallel algorithmic approaches for large-scale simulation data processing.

Liqiang Cao is an associate professor at the Institute of Applied Physics and Computational Mathematics, Beijing. He received his Ph.D. in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, in 2005. His research interests include performance optimization of large-scale parallel programs and scientific data management in HPC.

Hongyun Tian is a Ph.D. candidate at the Graduate School of the China Academy of Engineering Physics. He holds a B.Sc. degree in electronic engineering from Zhejiang University and an M.Sc. degree in computer architecture from Beihang University of China. His research interests include computer system optimization and assessment.

Wei Wang is an assistant professor at the High Performance Computing Center at the Institute of Applied Physics and Computational Mathematics, Beijing. He received a master's degree in computer science from Xi'an JiaoTong University, China, in 2009. His research interests include performance evaluation and optimization of large-scale parallel applications.