Journal of Parallel and Distributed Computing 168 (2022) 57–69

HY-DBSCAN: A hybrid parallel DBSCAN clustering algorithm scalable on distributed-memory computers

Guoqing Wu*, Liqiang Cao, Hongyun Tian, Wei Wang
Institute of Applied Physics and Computational Mathematics, Fenghao Dong Road No. 2, Haidian District, 100094, Beijing, China
* Corresponding author. E-mail address: [email protected] (G.Q. Wu).

Article history: Received 15 November 2021; Received in revised form 27 April 2022; Accepted 8 June 2022; Available online 10 June 2022.

Keywords: Dbscan; Density-based clustering; Parallel algorithm; Distributed-memory

Abstract: Dbscan is a density-based clustering algorithm which is well known for its ability to discover clusters of arbitrary shape as well as to distinguish noise. As it is computationally expensive for large datasets, research on the parallelization of Dbscan has received a considerable amount of attention. In this paper we present an exact, efficient and scalable parallel Dbscan algorithm which we call Hy-Dbscan. It employs three major techniques to enable scalable data clustering on distributed-memory computers: i) a modified kd-tree for domain decomposition, ii) a spatial indexing approach based on grids and inference, and iii) a cluster merging scheme based on a distributed Rem's Union-Find algorithm. Moreover, Hy-Dbscan exploits both process-level and thread-level parallelization. In experiments, we demonstrate performance and scalability using two scientific datasets on up to 2048 cores of a distributed-memory computer. Through extensive evaluation, we show that Hy-Dbscan significantly outperforms previous state-of-the-art Dbscan implementations.

© 2022 Elsevier Inc. All rights reserved.

1. Introduction

Clustering is an unsupervised machine learning approach that divides a given set of points (or particles) into meaningful subgroups (called clusters) such that it minimizes a similarity measure between individual clusters and maximizes it within one cluster. Clustering algorithms are now prevalent in many scientific applications, for example, molecular dynamics (MD), N-body cosmology, and particle-in-cell plasma physics, and in sensor measurements such as those from laser radar point clouds and geographic information systems. Dbscan (Density Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm [6]. It is a well-known algorithm proven to work in practice, and it received the SIGKDD test-of-time award in 2014. The basic idea of Dbscan is that points belonging to a cluster typically have at least minpts neighbors within a fixed radius ε, i.e., the density of the neighborhood has to exceed some threshold [28]. The main advantage of Dbscan (over algorithms such as K-means [36]) is its capability of discovering irregularly shaped clusters without prior knowledge of the number of clusters (while K-means typically returns ball-like clusters).

The computational complexity of Dbscan is O(n²), where n is the number of points. As a partial remedy, if spatial indexing is used for neighbor queries, the complexity reduces to O(n log n) in 2D space [14]. Unfortunately, for dimensionality d ≥ 3, the Dbscan problem requires Ω(n^{4/3}) time to solve [8]. As data size grows, this complexity essentially means that sequential Dbscan is computationally intractable in practice; for this very reason, Dbscan needs to be efficiently parallelized and be able to execute on distributed-memory computers.

In most distributed-memory Dbscan algorithms, data parallelism is the common paradigm: input points are partitioned into data blocks, which are distributed among MPI processes. With data parallelism, local clusters are produced from individual data blocks before they are merged into the final result. This family of algorithms still suffers from the following three problems. (1) Load imbalance. Load balancing in general is a concern of any distributed algorithm. Since most real-world applications involve large, nonuniform datasets, being able to distribute the data evenly is very important for the performance and scalability of parallel Dbscan. We note that the kd-tree has often been applied in the spatial data partitioning of previous algorithms [28,30,31,33]. However, in these algorithms, each data block must gather ghost points from neighbor blocks for subcluster merging. This step may lead to an imbalanced data distribution again, neutralizing the benefits of kd-tree decomposition. (2) Expensive ε-neighborhood queries. In the classical Dbscan, neighborhood queries are executed for all points. Even if spatial data indexing, such as a kd-tree [20] or R-tree [15], is utilized to perform ε-neighborhood queries, the complexity is still O(n²) in the worst case. Few parallel algorithms have been reported that accelerate neighborhood querying while preserving the clustering quality. (3) Expensive merging. To merge subclusters across blocks, existing distributed methods use global synchronization with three different communication patterns, i.e., master/slave, parallel merging, and neighbor exchange. In this step, massive merging requests may cause communication congestion. At the same time, communication deadlocks can also happen when multiple blocks keep sending messages to each other even though the correct global clusters have already been built.
In this paper, we propose Hy-Dbscan, an exact and scalable Dbscan algorithm which addresses the described issues in a proactive way. 1) To improve load balancing, we use a modified kd-tree to distribute point data across blocks. By design, the splitting planes in the kd-tree decomposition are constrained by the ghost zones, which ensures that the number of points in each block is as even as possible after gathering ghost points. 2) In the process of local clustering, we use grid-based spatial indexing to serve region queries. We also attempt to find core grid cells and infer their membership in a cluster without evaluating the points inside these core cells. This saves a substantial fraction of the ε-neighborhood queries. 3) For cluster merging, we propose a distributed Rem's Union-Find algorithm which avoids communication deadlock and converges to the correct result within a finite number of iterations.

Hy-Dbscan is implemented with MPI and OpenMP; it exploits two levels of parallelization by mapping processes to multi-core processors and threads to cores, and is able to deliver high scalability on distributed-memory computers. In our experimental evaluation, we evaluate both the scalability of the algorithm and the quality of the results, i.e., how close these results are to those of a sequential Dbscan. We compare our approach with competing Dbscan implementations on real-world datasets. Experiments show that while achieving excellent performance and scalability, our approach produces clustering results of comparable quality to the classical Dbscan algorithm. Furthermore, Hy-Dbscan often outperforms other distributed Dbscan implementations, e.g., Hpdbscan [11] and Pdsdbscan [28], by up to 285 times.

To summarize, the main technical contributions are as follows:

• We propose an exact scalable Dbscan algorithm, Hy-Dbscan, which outperforms other distributed Dbscan implementations;
• We present optimizations for data partitioning, spatial indexing, and cluster merging that together provide efficiency and scalability;
• We exploit hybrid MPI+OpenMP parallelization to take advantage of the resources of modern HPC architectures;
• We successfully demonstrate the accuracy, performance and scalability of our algorithm.

The remainder of this paper is organized as follows. In Section 2, we survey the related work. In Section 3, we review the classical Dbscan algorithm. In Section 4, we describe Hy-Dbscan in detail. In Section 5, we present the hybrid parallelization implementation. In Section 6, we present the performance evaluation. Finally, we discuss and conclude the work in Sections 7 and 8, respectively.

2. Related work

In the last decade, a number of parallel Dbscan algorithms have been implemented with MPI. Many parallelizations adopt the master-slave model [2,4,38]. This strategy may incur high communication overhead between the master and the slaves, and a low parallel efficiency. A few variants have been proposed to improve the runtime performance and scalability. Patwary et al. [28] proposed an efficient parallel algorithm, Pdsdbscan, which uses a disjoint-set data structure (the Union-Find algorithm) to break the access sequentiality of classical Dbscan. They implemented parallel versions for shared-memory and distributed-memory architectures, respectively. The same authors developed an end-to-end clustering solution, Bd-cats [30], and demonstrated its application to plasma physics and cosmology simulations at trillion-particle scale. Götz et al. [11] presented a highly parallel approach called Hpdbscan. It employs a data index preprocessing step, a rule-based cluster merging scheme, and a cost heuristic that balances the computation workload. Sama et al. [31] proposed a scalable distributed algorithm, μDbscan, which optimizes the neighborhood query by using a two-level R-tree.

Besides parallelization with MPI, MapReduce-based [7,18,35] and Spark-based [9,16,33] approaches have been proposed as well. MapReduce and Spark provide users with readily usable programming interfaces while hiding the messy details of parallelism, whereas MPI demands that programmers take care of the implementation in detail. Neukirchen [25] investigated the runtime performance and scalability of several publicly available parallel Dbscan implementations running either on HPC platforms or on big data platforms. It was found that the big data implementations are not yet mature and that Hpdbscan [11] outperforms all other considered implementations.

Graphics processing units (GPUs) have also been used for clustering analysis. Mr.Scan [37] is the first distributed, end-to-end Dbscan system that incorporates the use of GPUs. The authors only implemented a 2D version, but claim it is feasible to extend the approach to any d-dimensional Euclidean space. Gowanlock et al. [12,13] proposed a hybrid CPU/GPU algorithm; however, their approach is developed only for shared-memory systems. Mustafa et al. [24] reviewed the existing GPU algorithms and conducted an experimental study comparing them to identify the best performing algorithm.

As the computation time needed to calculate clusters precisely on large datasets is intolerable, several approximate Dbscan algorithms have been developed to replace exact Dbscan, for example, ρ-Dbscan [8], Pardicle [29], and Ng-Dbscan [21]. If slight inaccuracy in the clustering results is permitted, the running time can be brought down to O(n). However, for all approximate computing techniques, there exists a speed-versus-accuracy tradeoff, and tuning the degree of approximation is a nontrivial task.

3. The DBSCAN algorithm

Dbscan is a density-based clustering algorithm which defines density via two parameters, ε and minpts. Dbscan operates by finding dense regions and expanding them in order to form clusters. In the following, based on [6,28], we first review some definitions of concepts used in Dbscan and then present a brief description of sequential Dbscan.

Let X be the set of data points to be clustered. The neighborhood of a point x ∈ X within a given radius ε is called the ε-neighborhood of x, denoted by N_ε(x). More formally, N_ε(x) = {y | dist(x, y) ≤ ε, y ≠ x}, where dist(x, y) is the distance measure between x and y.

Definition 1 (Core point). A point x is referred to as a core point if its ε-neighborhood contains at least minpts points, i.e., |N_ε(x)| ≥ minpts.

Definition 2 (Directly density-reachable). A point y ∈ X is directly density-reachable from x if it is within the ε-neighborhood of a core point x.

Definition 3 (Density-reachable). A point y ∈ X is density-reachable from x ∈ X if there exists a chain of points x_1, x_2, ..., x_n, with x_1 = x and x_n = y, such that x_{i+1} is directly density-reachable from x_i for all 1 ≤ i < n.
Definition 4 (Density-connected). A pair of points (x, y) ∈ X is density-connected if there exists a point o ∈ X such that both x and y are density-reachable from o.

Definition 5 (Border point). A point x is a border point if it is not a core point but is density-reachable from another core point.

Definition 6 (Noise point). A point x is a noise point if it is not directly density-reachable from any core point.

Definition 7 (Cluster). A cluster C is a maximal set of density-connected points satisfying the following two conditions:

• Maximality: If a core point x ∈ C, then all the points density-reachable from x also belong to C.
• Connectivity: For any pair (x, y) ∈ C, x is density-connected to y.

The pseudocode for the sequential Dbscan algorithm is given in Algorithm 1. Initially all points are marked unvisited. The algorithm starts with an arbitrary point. Once an unvisited core point is found, a cluster is formed by recursively retrieving directly density-reachable or density-reachable points. The clustering process continues until all points have been assigned to a cluster or classified as noise.

Algorithm 1 Pseudocode of the sequential Dbscan algorithm [28]. Input: Point dataset X, radius ε, and density threshold minpts. Output: A set of clusters C.
1: procedure Dbscan(X, ε, minpts)
2:   for each unvisited point x ∈ X do
3:     mark x as visited;
4:     if |N_ε(x)| < minpts then
5:       mark x as noise;
6:     else
7:       C ← {x};
8:       for each point x′ ∈ N_ε(x) do
9:         if x′ is not visited then
10:          mark x′ as visited;
11:          if |N_ε(x′)| ≥ minpts then
12:            N_ε(x) ← N_ε(x) ∪ N_ε(x′);
13:        if x′ is not yet a member of any cluster then
14:          C ← C ∪ {x′};

The time complexity of Algorithm 1 is O(n · Q), where n is the number of points and Q is the complexity of an ε-neighborhood query [32]. Dbscan without a spatial index is O(n²) in time complexity because all points in the dataset must be compared with each other. To overcome this high time complexity, previous studies often employ spatial indexing methods (e.g., trees or grids) to speed up the Dbscan algorithm.
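For concreteness, the following C++ sketch implements the control flow of Algorithm 1 with a brute-force O(n²) neighborhood query; the point layout, the label encoding (-2 unvisited, -1 noise) and the function names are illustrative choices of ours rather than part of the original pseudocode.

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };

// Brute-force ε-neighborhood query: indices of all points within eps of point i.
static std::vector<std::size_t> epsNeighborhood(const std::vector<Point>& pts,
                                                std::size_t i, double eps) {
    std::vector<std::size_t> nbrs;
    for (std::size_t j = 0; j < pts.size(); ++j) {
        if (j == i) continue;                       // N_eps(x) excludes x itself
        double dx = pts[i].x - pts[j].x;
        double dy = pts[i].y - pts[j].y;
        double dz = pts[i].z - pts[j].z;
        if (std::sqrt(dx*dx + dy*dy + dz*dz) <= eps) nbrs.push_back(j);
    }
    return nbrs;
}

// Sequential Dbscan following Algorithm 1. Returns one label per point:
// -2 = unvisited (never left in the output), -1 = noise, otherwise a cluster ID.
std::vector<int> dbscan(const std::vector<Point>& pts, double eps, std::size_t minpts) {
    std::vector<int> label(pts.size(), -2);
    int nextCluster = 0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        if (label[i] != -2) continue;                          // already visited
        std::vector<std::size_t> seeds = epsNeighborhood(pts, i, eps);
        if (seeds.size() < minpts) { label[i] = -1; continue; } // noise (for now)
        int cid = nextCluster++;
        label[i] = cid;
        // Expand the cluster: 'seeds' plays the role of the growing set N_eps(x).
        for (std::size_t k = 0; k < seeds.size(); ++k) {
            std::size_t j = seeds[k];
            if (label[j] == -1) label[j] = cid;   // border point, previously noise
            if (label[j] != -2) continue;          // already claimed by a cluster
            label[j] = cid;
            std::vector<std::size_t> nbrs = epsNeighborhood(pts, j, eps);
            if (nbrs.size() >= minpts)             // j is a core point: extend seeds
                seeds.insert(seeds.end(), nbrs.begin(), nbrs.end());
        }
    }
    return label;
}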
4. The proposed algorithm: HY-DBSCAN

4.1. Overview

Hy-Dbscan is an exact, scalable algorithm for density-based clustering, consisting of a number of steps. The algorithm starts by decomposing the domain into a kd-tree and sorting points into blocks. Each block then contains a defined region of space and all the points inside that space. Before each block starts local clustering based on its subset of points, points in ghost zones are exchanged between neighbor blocks. During local Dbscan, the disjoint-set data structure is used to efficiently merge points into clusters, and also to store clustering information. Next, representative points are sent to intermediate blocks to start the merge step, where clusters are progressively merged in rounds. Finally, we traverse the clusters globally and compute their IDs. To the best of our knowledge, other existing parallel Dbscan algorithms also follow this same framework, and the only difference is the redesign of some individual steps [11,28,30,31,37].

In the following, we describe Hy-Dbscan in detail, including geometric partitioning, point exchange with ghost points, local Dbscan, merging clusters, and computing cluster IDs.

4.2. Geometric partitioning

After file reads, the following step is to partition the input data points into blocks (subdomains) such that the computation in each block is only based on its own data. One of the most naive approaches is to decompose the domain into a regular grid of blocks of uniform size. Although regular decomposition is easy to implement, points are not distributed evenly throughout the domain. To cope with this problem, the kd-tree has been previously studied in the context of clustering algorithms. The motivation of kd-tree partitioning is to decompose the domain into irregularly sized blocks that have approximately the same number of points (see Fig. 1(a)).

Fig. 1. (a) Illustration of kd-tree geometric partitioning from a molecular dynamics simulation. The domain is divided into 8 irregularly sized blocks. (b) 2D example showing kd-tree partitioning with ghost zones. The width of the ghost zone is ε. The splitting line is selected to ensure that the number of points falling inside each partition (including its ghost zone) is approximately equal.

Different from other state-of-the-art algorithms [30,31], we use a histogram-based median computation technique to distribute points across blocks. We start with the entire bounds of the point data as input to create a kd-tree with n_p leaves, where n_p is the number of MPI processes. Initially, the root node of the kd-tree contains all points. In the first iteration, each process computes a local histogram of its points along the current dimension and sends it to a root process for gathering. The root process determines the median and broadcasts it to the other processes within the group. Once the median is received, processes exchange data with their partners. For this, the processes are divided into two subgroups, and each process from one subgroup chooses a partner from the other subgroup to exchange points. Half the processes send their points above the median to their partners, and the partners send their points below the median back to the former processes. In subsequent iterations, the communication happens only within each newly created subgroup. Given n_p processes, the algorithm cycles through the dimensions to split the data domain until log₂ n_p iterations have been executed. By construction, the result is a kd-tree.

During the actual process of clustering, we usually need to gather ghost points between neighbor blocks, producing an uneven redistribution of points again. When a point set is highly clustered, this step can lead to severe load imbalance. We overcome this problem by a novel redesign of the median selection. Fig. 1(b) shows the basic idea. In this design, we use a sliding window to compute the median. The sliding window is an overlapped region between two partitions with a width of 2ε. Having computed the histogram, we slide the window, attempting to find the split location that makes the number of points in each partition (including its ghost zone) approximately equal. This scheme results in much better load balance in the following steps of parallel Dbscan.

Each of the resulting blocks is exclusively mapped to an MPI process that is responsible for data clustering. Throughout this paper, we ignore the distinction between block and process, and use these terms interchangeably.
til log 2 n p iterations are executed. By construction, the result is a which would be very expensive. To avoid excessive updates to clus-
kd-tree. ter assignments, we use the disjoint set data structure [28] which
During the actual process of clustering, we usually need to stores label equivalence information of points. It has two main op-
gather ghost points between neighbor blocks, producing uneven erations: Find and Union. Initially, each point has a pointer point-
redistribution of points again. When a point set is highly clus- ing to itself. During the cluster expansion phase, when a point is to
tered, this step can lead to severe load imbalance. We overcome be added to a cluster, only pointers need to be updated to repre-
this problem by a novel redesign of median selection. Fig. 1(b) sent a point and the cluster it belongs to. There exist several ways
shows the basic idea. In this design, we use a sliding window to of implementing the Union-Find algorithm. The two well known
compute a median. The sliding window is an overlapped region techniques are known as Union-by-rank and Union-by-size. In this
between two partitions with the width of 2 . Having computed paper, we have used Rem’s Union-Find algorithm (a higher in-
the histogram, we slide the window, attempting to find the split dexed root points to a lower indexed root) [5]. The basic outline of
location to ensure the number of points in each partition (in- Rem’s algorithm for merging two points is shown in Algorithm 3.
cluding ghost zone) is approximately equal. This scheme results Two pointers move upwards along their respective Find-path and
in much better load balance in the following steps of parallel Db- perform the Union operation as soon as one pointer reaches its
scan. root with higher number. It also uses a compression technique
Each of the resulting blocks is exclusively mapped to an MPI known as splicing to compress the tree. An experimental study
process that is responsible for data clustering. Throughout this pa- shows that Rem’s algorithm is faster than other implementations
per, we ignore the distinction of block and process, and use these in practice [26].
terms interchangeably.
Algorithm 3 Rem’s algorithm.
4.3. Point exchange with ghost points 1: procedure Rem(x, y)
2: r x ← x, r y ← y
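The test dist(p, b) ≤ ε in Algorithm 2 is a point-to-box distance. A minimal sketch, assuming blocks are represented by axis-aligned bounding boxes:

#include <algorithm>
#include <cmath>

struct Box { double lo[3], hi[3]; };   // axis-aligned block bounds

// Euclidean distance from point p to box b; zero if p lies inside the box.
// A point belongs to b's ghost zone when this distance is at most eps.
double distToBox(const double p[3], const Box& b) {
    double sq = 0.0;
    for (int d = 0; d < 3; ++d) {
        double excess = std::max({b.lo[d] - p[d], 0.0, p[d] - b.hi[d]});
        sq += excess * excess;
    }
    return std::sqrt(sq);
}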
4.4. Local DBSCAN

Having exchanged ghost points between blocks, the local Dbscan computation follows. The original Dbscan paper proposes the use of spatial indexing to reduce the time complexity [6]. Šidlauskas et al. [34] show that grids are superior to trees (e.g., kd-tree, R-tree) in index creation and queries. In this paper, we use a far more efficient approach to query and merge points.

4.4.1. Merging points using disjoint sets

There can be a large number of merges in the local computation, which would be very expensive. To avoid excessive updates to cluster assignments, we use the disjoint-set data structure [28], which stores label equivalence information of points. It has two main operations: Find and Union. Initially, each point has a pointer pointing to itself. During the cluster expansion phase, when a point is to be added to a cluster, only pointers need to be updated to represent a point and the cluster it belongs to. There exist several ways of implementing the Union-Find algorithm; the two well-known techniques are Union-by-rank and Union-by-size. In this paper, we have used Rem's Union-Find algorithm (a higher-indexed root points to a lower-indexed root) [5]. The basic outline of Rem's algorithm for merging two points is shown in Algorithm 3. Two pointers move upwards along their respective Find-paths and perform the Union operation as soon as one pointer reaches its root with the higher number. The algorithm also uses a compression technique known as splicing to compress the tree. An experimental study shows that Rem's algorithm is faster than other implementations in practice [26].

Algorithm 3 Rem's algorithm.
1: procedure Rem(x, y)
2:   r_x ← x, r_y ← y
3:   while p(r_x) ≠ p(r_y) do
4:     if p(r_x) > p(r_y) then
5:       if r_x = p(r_x) then
6:         p(r_x) ← p(r_y), break
7:       z ← p(r_x), p(r_x) ← p(r_y), r_x ← z
8:     else
9:       if r_y = p(r_y) then
10:        p(r_y) ← p(r_x), break
11:      z ← p(r_y), p(r_y) ← p(r_x), r_y ← z
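A direct C++ rendering of Algorithm 3 over an index-based parent array may look as follows (a sketch; the array representation of the parent function p() is our assumption):

#include <cstddef>
#include <vector>

// Rem's Union-Find with splicing (Algorithm 3): roots carry the smaller index,
// and every step redirects the visited pointer to a smaller parent, which
// compresses the tree as a side effect of the traversal.
void remUnion(std::vector<std::size_t>& p, std::size_t x, std::size_t y) {
    std::size_t rx = x, ry = y;
    while (p[rx] != p[ry]) {
        if (p[rx] > p[ry]) {
            if (rx == p[rx]) { p[rx] = p[ry]; break; } // rx is a root: link, done
            std::size_t z = p[rx];                     // splice: redirect rx to the
            p[rx] = p[ry];                             // smaller parent, then climb
            rx = z;
        } else {
            if (ry == p[ry]) { p[ry] = p[rx]; break; }
            std::size_t z = p[ry];
            p[ry] = p[rx];
            ry = z;
        }
    }
}

// Find with path halving, used when comparing cluster representatives.
std::size_t find(std::vector<std::size_t>& p, std::size_t x) {
    while (p[x] != x) { p[x] = p[p[x]]; x = p[x]; }
    return x;
}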
4.4.2. Grid-based spatial indexing and inference

As we described in Section 3, the main reason why Dbscan is expensive is that it requires retrieving each point's ε-neighborhood, for which the distances between all pairs of points need to be computed. In [12,37], the authors use an equal-sized square-shaped grid with ε/(2√d)-length cells to detect dense boxes, where d denotes the data dimension. They mark all cells that contain at least minpts points as dense boxes. By definition, this guarantees that adjacent dense boxes contain points that are within ε of each other. Then, they use dense boxes to avoid computing the distances between points that lie in very dense regions.

Hy-Dbscan uses a similar approach. However, in our work, the point data in each block is divided into a grid with a cell side width of ε/√d. In this case, it is guaranteed that the distance between every pair of points within the same cell is at most ε. All grid cells that contain at least minpts points are marked as core cells. Points within core cells are marked as core points. In return, there is a reduction in the complexity of the local computation. While we use this grid-based spatial indexing, we make a minor modification, which is to index only non-empty grid cells. Thus, the size of the index is a function of the space occupied by the data, and not of the entire bounding volume.
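A sketch of this indexing step, storing only the non-empty cells in a hash map keyed by quantized coordinates; the 21-bit key packing, the container choice, and the function names are illustrative assumptions of ours.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { double x[3]; };

// Index only non-empty cells of a grid with side eps/sqrt(d) (Section 4.4.2):
// any two points in the same cell are at distance <= eps. Following the paper,
// a cell holding at least minpts points is a core cell and its points are
// marked as core points. 'lo' is the lower corner of the block's bounds.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
buildCellIndex(const std::vector<Point>& pts, double eps, const double lo[3]) {
    const double side = eps / std::sqrt(3.0);      // d = 3 in our implementation
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> cells;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        std::uint64_t key = 0;
        for (int d = 0; d < 3; ++d) {              // pack 3 cell coords, 21 bits each
            std::uint64_t c = (std::uint64_t)((pts[i].x[d] - lo[d]) / side);
            key = (key << 21) | (c & 0x1FFFFF);
        }
        cells[key].push_back(i);                   // only touched cells are stored
    }
    return cells;
}
// Usage: any entry with cells[key].size() >= minpts is a core cell.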
In the local Dbscan step, points are first merged using the Union-Find algorithm [26] within each core cell. The step then proceeds by attempting to merge core cells that are adjacent to each other. For this, a Bichromatic Closest Pair (BCP) [1] problem needs to be solved. The goal of the BCP problem is to find the pair of points (x, y) ∈ c_i × c_j with the smallest distance, where c_i, c_j are core cells and x ∈ c_i, y ∈ c_j. If dist(x, y) ≤ ε, then x is density-reachable from y, and c_i, c_j should be merged. As an optimization, when performing the BCP operations, once a point pair with distance at most ε is found, the rest of the comparisons for those two cells are skipped.

Moreover, we do not need to solve the BCP problem for all pairs of neighboring core cells. We find that the merging of core cells has redundancies, described as follows [3].

• Symmetry: A core cell c_i needs to perform a merging operation with a core cell c_j and vice versa. The two cases are equivalent, and either one can be skipped.
• Transitivity: If the core cells c_i and c_j are in the same cluster, and so are c_j and c_k, then c_i and c_k must be in the same cluster, and the merging operation between c_i and c_k can be skipped.

However, these redundancies are neglected by conventional grid-based algorithms [11,12,37]. To alleviate the above-mentioned redundancies when merging adjacent core cells, we check whether they already belong to the same cluster and merge them only if needed. At the moment of processing a core cell c, we loop over every neighboring core cell c′ of c. We select one point arbitrarily from c and from c′, respectively. If the Find operation returns the same representative element of the set for both, the two cells are in the same cluster. This scheme turns out to be useful since it skips the real BCP comparisons between any cells that can be inferred to be already in the same cluster.

While core cells can be detected and processed, it is still necessary to process non-core cells. Note that a point x in a non-core cell may or may not be a core point. To perform an ε-neighborhood search around x, all points that may be within ε are guaranteed to lie within the 2-neighborhood cells of x, plus the cell containing x (125 total cells in 3D). If x is a core point, it should merge all of its ε-neighborhood points; otherwise, it should be added to some core point within its ε-neighborhood. The full algorithm for local Dbscan is given as Algorithm 4.

Algorithm 4 Local Dbscan. Input: Point dataset X, radius ε, and density threshold minpts.
1: procedure LocalDbscan(X, ε, minpts)
2:   for each block b do in parallel
3:     construct grid and sort points into grid cells
4:     mark core cells and core points
5:     merge core points within each core cell
6:     for each core cell c do                        ▷ process core cells
7:       G ← NeighbourGridQuery(c)
8:       for each core cell c′ ∈ G do
9:         select two points x ∈ c, x′ ∈ c′
10:        if Find(x) = Find(x′) then
11:          c and c′ are in the same cluster; skip
12:        else if Bcp(c, c′) is true then
13:          Union(x, x′)
14:    for each non-core cell c do                    ▷ process non-core cells
15:      for each point x ∈ c do
16:        if |N_ε(x)| ≥ minpts then
17:          mark x as core point
18:          for each point y ∈ N_ε(x) do
19:            if y is a core point then
20:              Union(x, y)
21:            else if y is not yet a member of any cluster then
22:              Union(x, y)
23:        else if x is not yet a member of any cluster then
24:          for each point y ∈ N_ε(x) do
25:            if y is a core point then
26:              Union(x, y); skip remaining neighbors

4.5. Merging clusters

The result of local Dbscan is a list of local disjoint-set trees formed by the local points of the blocks. Merging the clusters generated by the blocks is needed to produce the final correct clusters.

4.5.1. Path compression

During parallel merging, there exists a large number of merge requests between blocks. Path compression can reduce communication and make distributed Union operations more efficient. Therefore, we shorten the paths in the local disjoint-set trees before merging. For this, non-root points modify their parent pointers from their parents to the parents' parents until each point's parent pointer points to a local root. As a result, every non-root point is a child of a local root; all points within the same tree have a common parent. In the cluster merging step, only local tree roots require communication with other blocks, and non-root points are not involved in communication.

4.5.2. Merging

Inspired by the classical Rem's algorithm (see also Algorithm 3), we propose a distributed Rem's Union-Find algorithm to merge the clusters on the block boundaries. Merging is performed on a pair of one ghost point and one local point that belong to the same disjoint-set tree. Each block sends Union query messages to the other blocks where its ghost points originally belong. A Union query message is a structure which contains the point ID and the block ID of the two endpoints. The merging is done by setting the parent pointer of one local root to another local root.

To ensure that no communication cycle happens, the distributed Union operation on two disjoint-set trees is based on two ID rules: (1) if the trees are within the same block, the parent of the root of one tree is set to the root with the smaller point ID in the other tree; and (2) if the trees are within different blocks, the parent of the root of one tree is set to the root with the smaller block ID in the other tree.

The details of our distributed Rem's Union-Find algorithm are given in Algorithm 5. If x ∈ b_x is a local root and y ∈ b_y is a ghost point that belongs to the same tree, block b_x sends a Union query message (b_x, x; b_y, y) to block b_y asking it to perform a Union operation. Once block b_y receives the message, it gets the local root of the tree containing y using the parent function p(). If x and y are already in the same tree, the current message is consumed. Otherwise, if the parent r of y is a global root and it satisfies the required criteria of the Union operation (i.e., the ID rules), we merge the trees by setting p(r) to x. If r is not a global root, we repack the Union query message as (b_x, x; b_{p(r)}, p(r)) and send it to block b_{p(r)}. At the same time, we apply a path compression technique by setting p(r) to x. The effect of path compression is that the new parent has a smaller ID than the old parent, which hopefully shortens the height of the distributed disjoint-set trees. If the message (b_x, x; b_y, y) does not satisfy the required criteria of the Union operation, we reverse the two endpoints and send the message back to block b_x, asking it to perform the Union operation. The algorithm proceeds in rounds until all Union query messages have been consumed. After merging, the distributed disjoint-set trees have been formed.
Algorithm 5 Distributed Rem's Union-Find algorithm. Input: A list of local disjoint-set trees T, a queue U containing Union query messages. Output: A list of global disjoint-set trees.
1: procedure MergeClusters(T, U)
2:   while any block has any Union query messages do
3:     for each block b do in parallel
4:       for each Union query (b_x, x; b_y, y) ∈ U do
5:         r ← p(y)
6:         if x = r and b_x = b_r then
7:           already linked; skip
8:         if b_x < b_r then
9:           if p(r) = r and b_r = b then             ▷ r is a global root
10:            p(r) ← x, b_{p(r)} ← b_x; skip
11:          else
12:            enqueue Union query (b_x, x; b_{p(r)}, p(r))
13:            p(r) ← x, b_{p(r)} ← b_x               ▷ path compression
14:        else
15:          enqueue Union query (b_{p(r)}, p(r); b_x, x)
16:     exchange Union query messages
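For illustration, the Union query message and the two ID rules can be expressed compactly as follows; the struct layout and the predicate name are our assumptions, not the paper's actual wire format.

#include <cstdint>

// One endpoint of a Union query: the owning block and the local point ID.
struct Endpoint {
    std::int32_t block;  // b_x or b_y
    std::int64_t point;  // point ID within the owning block
};

// A Union query message (b_x, x; b_y, y) as used in Algorithm 5.
struct UnionQuery {
    Endpoint from;  // the local root initiating the merge
    Endpoint to;    // the ghost point's home-block endpoint
};

// The ID rules that orient every link and rule out communication cycles:
// within one block the smaller point ID wins; across blocks the smaller
// block ID wins. Returns true when root a should become the parent of
// root b (i.e., b is linked under a).
bool wins(const Endpoint& a, const Endpoint& b) {
    if (a.block == b.block) return a.point < b.point;  // rule (1)
    return a.block < b.block;                          // rule (2)
}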
4.6. Computing cluster IDs

At this point, we have a list of clusters that are represented by distributed disjoint-set trees. However, in many applications, each point needs to be assigned a unique cluster ID. This final labeling enables tasks such as the calculation of aggregate quantities for each cluster, and the filtering and visualization of clusters. In this step, we parallelize the process of computing cluster IDs from the distributed disjoint-set trees.

Algorithm 6 shows the pseudocode for computing cluster IDs. First, it traverses the trees to find the global roots in each block and labels them locally with IDs ranging from 0 to n_b, where n_b is the number of global roots in block b. Second, it constructs a unique labeling across blocks by adding an offset to each range. After computing the cluster IDs of the global roots, each point traverses its tree upwards and gets the cluster ID from the global root. If a global root is local to the block it belongs to, no communication is required, and child points get their cluster IDs from the root immediately. Otherwise, two types of messages are exchanged across blocks. A Grandparent query message is used when a local point requests its grandparent; this message is sent from the local root to its parent. Such a message might traverse multiple blocks before reaching the global root. A Grandparent reply message is used to answer a grandparent query by sending the cluster ID directly to the request initiator. The algorithm executes in multiple rounds until there are no more messages to be forwarded, ultimately constructing a unique ID for each cluster and marking each point with its corresponding cluster ID.

Algorithm 6 Computing cluster IDs. Input: A list of global disjoint-set trees T. Output: Cluster ID of each point.
1: procedure ComputeClusterIds(T)
2:   for each block b do in parallel
3:     gid ← global ID of the block
4:     n_b ← number of global roots in b
5:     label the global roots in b with IDs ranging from 0 to n_b
6:   Vec ← AllReduce(n_b)
7:   for each block b do in parallel
8:     offset ← Σ_{i=0}^{gid−1} Vec[i]
9:     add offset to the IDs of global roots in b
10:    for each point x ∈ b do
11:      r ← p(x)
12:      if r is a global root then
13:        clusterID(x) ← clusterID(r)
14:      else
15:        enqueue Grandparent query (b, x; b_{p(r)}, p(r))
     exchange messages
16: while any block has any messages do
17:   for each block b do in parallel
18:     if current message (b_x, x; b_y, y) is a Grandparent query then
19:       if p(y) is a global root then
20:         enqueue Grandparent reply (b_{p(y)}, p(y); b_x, x)
21:       else
22:         enqueue Grandparent query (b_x, x; b_{p(y)}, p(y))
23:     else if current message (b_x, x; b_y, y) is a Grandparent reply then
24:       clusterID(y) ← clusterID(x)
25:   exchange messages
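Lines 6-8 of Algorithm 6 turn the per-block root counts into a global offset. A sketch that obtains the same offset with a single exclusive prefix sum; using MPI_Exscan instead of the AllReduce-then-sum of the pseudocode is our illustrative simplification.

#include <mpi.h>

// Each rank labels its nb global roots 0..nb-1; adding this offset makes the
// labels globally unique. MPI_Exscan returns the sum of nb over all lower
// ranks, which equals the offset of Algorithm 6 (rank 0 receives no value
// from Exscan and must use 0).
long long rootIdOffset(long long nb, MPI_Comm comm) {
    long long offset = 0;
    MPI_Exscan(&nb, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    return rank == 0 ? 0 : offset;
}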
5. Hybrid MPI+OpenMP parallelization

Hy-Dbscan is implemented in C++11, using MPI for distributed-memory parallelism and OpenMP for shared-memory parallelism. There are two main performance advantages to this approach: (1) as the number of MPI processes increases, the number of ghost points increases, which results in more redundant data; explicitly using threads locally within a compute node reduces this redundant data and is thus more efficient than going through MPI; and (2) it is easier to balance the workload among threads than across a large number of MPI processes.

As Rem's Union-Find algorithm adds (merges) points to its clusters (trees) without any specific ordering, we can easily divide the computation of local Dbscan among all threads using OpenMP's omp for pragma. Note that merging points could cause a thread to change the parent pointer of a point owned by another thread. Therefore, we use locks to resolve data races, similar to what was done in [27]. To balance the workload among threads, a work stealing approach is used: we add a schedule(dynamic) clause to the parallel for pragma. When the dynamically pulled workload is appropriately chosen, optimal performance is achieved. Empirically, we found that a dynamic chunk size of 100 gave the best performance for the benchmark datasets used in this paper.
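The threading pattern described above can be sketched as follows; the per-point lock granularity and the skeleton loop body are illustrative assumptions of ours (each lock must be initialized with omp_init_lock beforehand).

#include <omp.h>
#include <cstddef>
#include <vector>

// Guarded parent-pointer update: a merge may rewrite the parent of a point
// owned by another thread, so the write is protected by that point's lock.
static void lockedLink(std::vector<std::size_t>& parent,
                       std::vector<omp_lock_t>& locks,
                       std::size_t child, std::size_t newParent) {
    omp_set_lock(&locks[child]);
    parent[child] = newParent;
    omp_unset_lock(&locks[child]);
}

void localDbscanThreaded(std::size_t numCells,
                         std::vector<std::size_t>& parent,
                         std::vector<omp_lock_t>& locks) {
    // schedule(dynamic, 100): idle threads pull the next chunk of 100 cells,
    // implementing the work-stealing behavior described in Section 5.
    #pragma omp parallel for schedule(dynamic, 100)
    for (std::ptrdiff_t c = 0; c < (std::ptrdiff_t)numCells; ++c) {
        // ... neighborhood queries for the points of cell c decide the merges;
        // each resulting Union goes through lockedLink(parent, locks, i, r).
        (void)c;
    }
}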
6. Experimental evaluation

6.1. Experimental setup

We conducted all experiments on a supercomputing system, a 2-petaflop machine with 2190 compute nodes. Each node is configured with two 16-core Xeon E5-2682 v4 processors clocked at 2.5 GHz and 128 GB RAM. GCC v9.3.0 with -O2 optimization was used to compile the test code. The MPI distribution is MPICH version 3.0.

By default, experiments were run using the MPI+OpenMP hybrid features of Hy-Dbscan. We spawned an MPI process for each available compute node and fixed the number of OpenMP threads per process to 32, which is equal to the number of cores on each node. We set the number of blocks in the domain decomposition to be equal to the number of MPI processes, in order to ensure that each process is assigned a single block. In all experiments, as we concentrate on the performance of parallel clustering, I/O time is excluded from the total time.

6.2. Datasets

An overview of the scientific datasets and the default parameters used in this paper is given in Table 1. We give more details about the datasets below.

Table 1 Attributes of benchmark datasets.

Dataset | Points | Dims | Bounding box | ε | minpts | Noise
MD | 37,959,191 | 3 | 1 × 1 × 1 | 0.02 | 200 | 0.007%
LIDAR | 81,031,165 | 3 | 30 K × 10 K × 30 K | 100 | 1000 | 2.45%

6.2.1. MD data

A microjet is a complex dynamics phenomenon: when a shock wave reflects from a material surface, a microjet forms, which expands continuously and finally fragments into a number of ejecta. Molecular dynamics simulation has been adopted to study the microscopic mechanism of ejecta production, for example, the influence of shock intensity, and the total amount and size distribution of ejecta [17]. The dataset used in our work is from a middle time step of a simulation carried out at the Institute of Applied Physics and Computational Mathematics. The whole dataset contains roughly 37 million particles.

6.2.2. LIDAR data

Light detection and ranging (LIDAR) is a remote sensing process which collects measurements used to describe the surface of objects. The LIDAR dataset used in this paper is a 3D point cloud of the old town of Bremen, Germany. The raw data was collected by Dorit Borrmann and Andreas Nüchter from Jacobs University Bremen, Germany. A post-processed data file in HDF5 format, created by Götz et al., is available from B2SHARE [10]. Dbscan can be used here to identify distinct objects, such as houses, trees or roads. The whole point cloud contains roughly 81 million data points.

6.3. Clustering quality

Our first test examines the clustering quality of our algorithm. In this test, we compare the result of our parallel algorithm with a sequential version. While in principle the two versions should produce the same results, in practice that will not happen: Dbscan clustering results can vary slightly if the order in which clusters are explored changes. For instance, a border point might be density-reachable from different core points and is added to the cluster that visits it first. Therefore, Dbscan has slightly non-deterministic behavior depending on the order in which the data points are visited.

We use a metric from [19,22] to evaluate the correctness of the parallel clustering results. The metric assigns a score between 0 and 1 to each point as |E ∩ F| / |E ∪ F|, where E is the cluster assigned to the point by sequential Dbscan and F is the cluster assigned to the point by parallel Dbscan. If a point is misidentified as a noise (or non-noise) point, then the point gets a score of 0. The final quality score is the average of the points' quality scores. Therefore, if both algorithms have identical output, this metric is maximized (i.e., the quality score is equal to 1). In this test, the number of blocks varies from 1 to 64, using up to the maximum of 2048 cores. Fig. 2 plots the accuracy of Hy-Dbscan on the MD and LIDAR datasets. We observe that Hy-Dbscan achieves nearly identical output compared to the sequential implementation. Fig. 3 shows the spatial distribution of the clusters identified by Hy-Dbscan.

Fig. 2. Quality score of the clustering output on the MD dataset with ε = 0.02 and minpts = 200, and the LIDAR dataset with ε = 100 and minpts = 1000.

Fig. 3. Clusters identified in the datasets with the default parameters listed in Table 1. Clusters are colored by ID; each set of points in the same color represents a single cluster.
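A sketch of how this score can be computed from the two label arrays; the noise encoding (-1) and the choice to score agreeing noise points as 1 are our assumptions, not specified by the metric's definition above.

#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Per-point score |E ∩ F| / |E ∪ F| (Section 6.3), averaged over all points.
// seq and par hold the sequential and parallel labels; -1 denotes noise.
double qualityScore(const std::vector<int>& seq, const std::vector<int>& par) {
    std::map<int, std::size_t> sizeE, sizeF;            // cluster sizes
    std::map<std::pair<int, int>, std::size_t> inter;   // |E ∩ F| per label pair
    for (std::size_t i = 0; i < seq.size(); ++i) {
        sizeE[seq[i]]++;
        sizeF[par[i]]++;
        inter[{seq[i], par[i]}]++;
    }
    double total = 0.0;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        bool noiseE = seq[i] == -1, noiseF = par[i] == -1;
        if (noiseE != noiseF) continue;                 // misidentified: score 0
        if (noiseE) { total += 1.0; continue; }         // noise in both results
        double common = (double)inter[{seq[i], par[i]}];
        double uni = (double)sizeE[seq[i]] + (double)sizeF[par[i]] - common;
        total += common / uni;
    }
    return total / (double)seq.size();
}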
6.4. Impact of ε and minpts

The next experiment aims to understand the impact of the ε and minpts parameters on the execution time. Fig. 4(a) and 4(b) plot the runtime as a function of ε. For the MD dataset, we fix minpts = 200 and vary ε over 0.001, 0.02, 0.04, ..., 0.2. For the LIDAR dataset, we fix minpts = 1000 and vary ε over 1, 100, 200, ..., 1000. We find that on both datasets, smaller ε values require more execution time. This is because, with a small ε value, the probability of at least minpts points being contained within a grid cell is very low. When the fraction of data points that can be detected using core cells is low, our algorithm degenerates to grid-based Dbscan. We also note that there is an increase in execution time as the value of ε is increased beyond a certain value, because larger ε values put a greater load on the ε-neighborhood query.

Fig. 4(c) and 4(d) plot the runtime as a function of minpts. For the MD dataset, we fix ε = 0.02 and vary minpts over 50, 100, 150, ..., 500. For the LIDAR dataset, we fix ε = 100 and vary minpts over 200, 400, ..., 2000. We clearly see that the execution time increases with larger values of minpts. This is because, if the value of minpts is increased, there are likely to be fewer core cells. Consequently, the computational cost of Hy-Dbscan increases with minpts.

Fig. 4. (a)-(b) Impact of the ε parameter on runtime. (c)-(d) Impact of the minpts parameter on runtime.

6.5. Scalability

We next investigate Hy-Dbscan's strong scaling properties on the MD and LIDAR datasets. In our experiments, we vary the number of cores from 1 to 2048. Given a fixed number of cores, we arrange them either in an MPI-only configuration or in a hybrid parallel configuration.

Fig. 5 presents the obtained results. As shown in this figure, the hybrid configuration shows consistent improvement over the MPI-only configuration. We observe a speedup of 556× when running the MD dataset on 2048 cores compared to running on one core. For the LIDAR dataset, the speedup achieved is 378× using 2048 cores. On the LIDAR dataset, we achieved relatively low speedups due to imbalanced workload. Specifically, points in the LIDAR dataset are highly clustered, and the computational cost of the ε-neighborhood query in different blocks varies dramatically. In contrast, the data distribution of the MD dataset is not as skewed, and hence performance on it scales well. On both datasets, the time needed to compute the clustering results is on the order of seconds, demonstrating that Hy-Dbscan is viable in real-world applications.

Fig. 5. (a)-(b) Execution time and (c)-(d) speedup curves of the Hy-Dbscan application analyzing the MD and LIDAR datasets.

6.6. Runtime breakdown

In this section, we discuss the timing breakdown on both datasets. The detailed breakdown of the hybrid configuration in Section 6.5 is shown in Fig. 6.

In the parallel Dbscan clustering, local computation is the most dominant component of the runtime. For the MD dataset, Dbscan clustering takes about 99% of the time at 1 core, dropping to 63% at 2048 cores. For the LIDAR dataset, local Dbscan takes about 99% at 1 core and 80% at 2048 cores. Point exchange is the next largest component, taking around 28% and 16% of the total runtime at 2048 cores for the MD and LIDAR datasets, respectively. As the number of cores increases, the share of point exchange increases. The reason is that point exchange occurs at block boundaries, and more points are exchanged as the number of blocks increases. We also see that the cluster merging and computing-IDs steps take the least time on both datasets (< 1% of the total time).

Fig. 6. Time taken by the different steps of Hy-Dbscan on the MD and LIDAR datasets using 1 ∼ 64 nodes.

6.7. Efficiency details

We contend that the superior efficiency of Hy-Dbscan is mainly attributable to the optimizations of the individual steps. In this section, we test the performance of each of these optimizations.

Load balance. Fig. 7 illustrates the load balance as the number of blocks varies. The load balance is defined as the ratio of the maximum to the average number of points in any block. This ratio remains nearly constant for kd-tree partitioning. After gathering ghost points, the degree of load imbalance tends to increase. For the MD dataset, the ratio varies from 1.0 to 1.14 at 64 blocks. In contrast, for the LIDAR dataset the ratio varies from 1.03 to 1.78 at 64 blocks. Therefore, the load balance of the MD dataset is much better than that of the LIDAR dataset. With our proposed method for selecting median planes, the load balance is improved for skewed point data.

Fig. 7. Point balance for kd-tree domain decomposition. Each block receives roughly the same amount of data after kd-tree decomposition (black). Neighbor point exchange results in imbalance (red). With the modified kd-tree proposed in Section 4.2, the load balance is improved (blue).

Spatial indexing. Since the spatial indexing method is crucial to Dbscan, we also compare our results with those obtained using other index-based implementations. For this, we integrate kd-tree [20], R-tree [15], and grid [34] (a grid with ε-length cells, in the following just "ε-grid") approaches into Hy-Dbscan for performing neighborhood queries. A runtime comparison of the proposed method with the considered implementations is shown in Fig. 8. We observe that our approach remarkably outperforms the other index-based implementations. This is because the merging of core cells avoids expensive region queries and greatly reduces the computational complexity of local Dbscan.

Fig. 8. Runtime comparison of different spatial indexing methods. In the framework of Hy-Dbscan, we use indexes such as kd-tree, R-tree, and ε-grid as substitutes for the approach proposed in Section 4.4.2.

Cluster merging. Fig. 9 shows the performance of cluster merging based on the distributed Union-Find algorithm. We can see that the algorithm executes more rounds as the number of blocks increases. For the MD and LIDAR datasets, the loop terminates in 4 and 10 rounds, respectively. As only local roots are involved in communication, the overall time cost of cluster merging is negligible.

Fig. 9. Performance of cluster merging based on the distributed Union-Find algorithm. MPI processes exchange messages for cluster merging in rounds. When all processes finish their local work and no messages are being exchanged, the iteration terminates. The execution time is negligible compared with the other steps.

6.8. Comparison to Hpdbscan and Pdsdbscan

As discussed in the related work, there are a number of other parallel implementations of exact Dbscan. However, almost all authors do not provide the source code of their implementations, which prevents us from performing a direct comparison on the same hardware platform and benchmark datasets. To our knowledge, Pdsdbscan [28] and Hpdbscan [11] are open source. However, Pdsdbscan throws a runtime error and cannot produce correct results; therefore, we have implemented a new version of Pdsdbscan using MPI. In the following, we compare Hy-Dbscan with Pdsdbscan and Hpdbscan.

In this test, we use the default parameters listed in Table 1 and scale the core count from 1 to 2048. Hy-Dbscan and Hpdbscan run in hybrid parallel mode, and Pdsdbscan runs in MPI-only mode. Fig. 10 presents the obtained results. Hy-Dbscan performs consistently better than Hpdbscan and Pdsdbscan on both datasets considered in this paper. On the MD dataset, Hy-Dbscan is 1.8× ∼ 3.9× faster than Hpdbscan and 3.8× ∼ 6.2× faster than Pdsdbscan. On the LIDAR dataset, Hy-Dbscan is 12.2× ∼ 24.1× faster than Hpdbscan and 45× ∼ 285× faster than Pdsdbscan. As local Dbscan is the most dominant component of the runtime (recall Section 6.6), we conclude that our spatial indexing scheme eliminates searching the ε-neighborhoods of a large fraction of points and thus provides a significant improvement in performance.

Fig. 10. Runtime comparison of Hy-Dbscan with Hpdbscan and Pdsdbscan using the default parameters listed in Table 1. Given the same number of cores, Hy-Dbscan and Hpdbscan run in a hybrid parallel mode, and Pdsdbscan runs in an MPI-only mode.

7. Discussion

In this section, we first discuss the advantages of our algorithm compared with other exact parallel algorithms running on HPC platforms. We then discuss the limitations of our algorithm.

7.1. Comparison with other methods

We compare our algorithm with other parallel algorithms in different aspects, which are detailed in Table 2.

Table 2 Comparison with other exact parallel Dbscan algorithms running on HPC platforms.

Name | Parallel model | Partitioner | Spatial indexing | Merging | Data structure
Hpdbscan [11] | MPI+OpenMP | grid | grid | rule-based | N/A
Mr.Scan [37] | CPU+GPU | grid | kd-tree+dense box | overlapping cell | N/A
Pdsdbscan [28] | MPI or OpenMP | kd-tree | kd-tree | parallel union | disjoint-set
μDbscan [31] | MPI | kd-tree | μR-tree | parallel union | disjoint-set
BPS-HDbscan [12] | shared-memory | N/A | grid+dense box | N/A | disjoint-set
Our work | MPI+OpenMP | modified kd-tree | grid+inference | distributed Rem's | disjoint-set

Geometric partitioning. In previous algorithms, such as [28,31], the input point data is partitioned using sampling-based kd-tree decomposition. In these studies, as more points are sampled, the computed median more precisely matches the real median. However, for large datasets, it is difficult to determine an appropriate sample count. In our work, a histogram-based approach is used to find the median when creating the kd-tree. Moreover, we choose the median considering the influence of the ghost zones.

Spatial indexing. Welton et al. [37] used a kd-tree to detect dense data regions and a grid with ε/(2√d)-length cells to detect dense boxes. Gowanlock [12] employed two grids, one with ε-length cells and another with ε/(2√d)-length cells, for the dense box calculation. In their algorithms, using cells of length ε/(2√d) guarantees that adjacent dense boxes contain points that are within ε of each other. In contrast, our algorithm uses a grid with cells of length ε/√d. Grid cells that have at least minpts points are marked as core cells, and points within them are marked as core points. Compared with [12,37], our grid cells are coarser-grained and the condition is less strict. Moreover, we employ an inference scheme to alleviate the redundancies of core cell merging. Consequently, local Dbscan can avoid more distance calculations.

Cluster merging. Götz et al. [11] presented a rule-based cluster merging scheme which employs collective synchronous communications to relabel clusters. Patwary et al. [28] presented a parallel technique to merge disjoint-set trees on distributed-memory architectures. In comparison, we propose a distributed Rem's Union-Find algorithm to merge clusters. Since our algorithm employs a compression technique, it has minimal communication overhead and is more efficient than previous methods.
G.Q. Wu, L. Cao, H. Tian et al. Journal of Parallel and Distributed Computing 168 (2022) 57–69

Table 2
Comparison with other exact parallel Dbscan algorithms running on HPC platforms.

Name Parallel model Partitioner Spatial indexing Merging Data structure


Hpdbscan [11] MPI+OpenMP grid grid rule-based N/A
Mr.Scan [37] CPU+GPU grid kd-tree+dense box overlapping cell N/A
Pdsdbscan [28] MPI or OpenMP kd-tree kd-tree parallel union disjoint-set
μDbscan [31] MPI kd-tree μR-tree parallel union disjoint-set
BPS-HDbscan [12] shared-memory N/A grid+dense box N/A disjoint-set

Our work MPI+OpenMP modified kd-tree grid+inference distributed Rem’s disjoint-set

employs a compression technique, it has minimal communication Acknowledgments


overhead and is more efficient than previous methods.
This work was supported by National Natural Science Founda-
7.2. Limitations tion of China [61403036, 11871109, and 61672003].

Our algorithm distributes points to blocks as evenly as possible. References


However, due to the fact that the cost of retrieving  -neighborhood
[1] P.K. Agarwal, H. Edelsbrunner, O. Schwarzkopf, Euclidean minimum spanning
of different points varies, the workload of blocks with the same trees and bichromatic closest pairs, Discrete Comput. Geom. 6 (1991) 407–422.
number of points may be unbalanced. For example, a core point [2] D. Arlia, M. Coppola, Experiments in parallel clustering with DBSCAN, in: Euro-
needs to cluster almost all the data points, while a noise point Par Parallel Processing, Springer, Heidelberg, 2001, pp. 326–331.
idles most of the time. Therefore, point count is not an ideal mea- [3] T. Boonchoo, X. Ao, Y. Liu, et al., Grid-based DBSCAN: indexing and inference,
Pattern Recognit. 90 (2019) 271–284.
sure for parallel Dbscan implementation. To improve algorithm
[4] S. Brecheisen, H.-P. Kriegel, M. Pfeifle, Parallel density-based clustering of com-
scalability, we need to employ a cost heuristic to determine a more plex objects, in: Advances in Knowledge Discovery and Data Mining, Springer,
balanced data space subdivision. Heidelberg, 2006, pp. 179–188.
Our current implementation considers 3D space, and only sup- [5] E.W. Dijkstra, A Discipline of Programming, Prentice-Hall, 1976.
ports Euclidean distance. As data dimensionality increases, the [6] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discov-
ering clusters in large spatial databases with noise, in: Proceedings of the
computational complexity of spatial indexing grows quickly. This 2nd International Conference on Knowledge Discovery and Data Mining, 1996,
is a task to be solved in the future. pp. 226–231.
[7] Y. Fu, W. Zhao, H. Ma, Research on parallel DBSCAN algorithm design based on
MapReduce, Adv. Mater. Res. 301–303 (2011) 1133–1138.
8. Conclusion
[8] J. Gan, Y. Tao, DBSCAN revisited: mis-claim, un-fixability, and approximation,
in: Proceedings of the ACM SIGMOD International Conference on Management
Scientific research has entered the era of big data. Parallel and of Data, Melbourne, Australia, 2015, pp. 519–530.
distributed data analysis techniques are very important for accel- [9] Y. Gong, R.O. Sinnott, P. Rimba, RT-DBSCAN: real-time parallel clustering of
spatio-temporal data using spark-streaming, in: International Conference on
erating knowledge discovery. In this paper, we have presented Hy-
Computational Science, Wuxi, China, 2018, pp. 524–539.
Dbscan - a hybrid parallel Dbscan algorithm to perform clustering [10] M. Götz, C. Bodenstein, Hpdbscan benchmark test files, http://hdl.handle.net/
analysis on large-scale scientific data. Specifically, we optimized 11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f. (Accessed 5 June 2020).
kd-tree decomposition to provide better load balancing. Further, [11] M. Götz, C. Bodenstein, M. Riedel, HPDBSCAN - highly parallel DBSCAN, in:
SC15: International Conference for High Performance Computing, Network-
we use grid-based spatial indexing and inference to avoiding the
ing, Storage and Analysis, Machine Learning in HPC Environments Workshop,
region queries of a large fraction of points in datasets. We also pro- Austin, TX, USA, vol. 2, 2015, pp. 1–10.
pose a distributed Rem’s Union-Find algorithm to solve the dead- [12] M. Gowanlock, Hybrid CPU/GPU clustering in shared memory on the billion
locks of communication. Hy-Dbscan exploits hybrid MPI+OpenMP point scale, in: Proceedings of the ACM International Conference on Supercom-
parallelization, and thus can be used in a diverse architectural puting, Phoenix, Arizona, USA, 2019, pp. 35–45.
[13] M. Gowanlock, C.M. Rude, D.M. Blair, A hybrid approach for optimizing parallel
landscape. As verified by detailed experiments using two scientific clustering throughput using the GPU, IEEE Trans. Parallel Distrib. Syst. 30 (4)
datasets on up to 2048 cores, Hy-Dbscan is efficient, scalable, and (2019) 766–777.
can produce exact Dbscan clustering. We believe that our work can [14] A. Gunawan, A faster algorithm for DBSCAN, Master’s thesis, Technische Uni-
significantly raise the usability of Dbscan in the era of big data. versity Eindhoven, March 2013.
[15] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Pro-
As future work, we intend to develop a hybrid CPU+GPU Dbscan algorithm that is efficient on heterogeneous architectures. We also plan to use Hy-Dbscan to help scientists gain insights into large scientific datasets.
CRediT authorship contribution statement
Guoqing Wu: Conceptualization, Funding acquisition, Methodology, Software, Validation, Writing – original draft. Liqiang Cao: Formal analysis, Validation, Visualization, Writing – review & editing. Hongyun Tian: Formal analysis, Resources, Software. Wei Wang: Data curation, Resources, Software, Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References

[4] S. Brecheisen, H.-P. Kriegel, M. Pfeifle, Parallel density-based clustering of complex objects, in: Advances in Knowledge Discovery and Data Mining, Springer, Heidelberg, 2006, pp. 179–188.
[5] E.W. Dijkstra, A Discipline of Programming, Prentice-Hall, 1976.
[6] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[7] Y. Fu, W. Zhao, H. Ma, Research on parallel DBSCAN algorithm design based on MapReduce, Adv. Mater. Res. 301–303 (2011) 1133–1138.
[8] J. Gan, Y. Tao, DBSCAN revisited: mis-claim, un-fixability, and approximation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 2015, pp. 519–530.
[9] Y. Gong, R.O. Sinnott, P. Rimba, RT-DBSCAN: real-time parallel clustering of spatio-temporal data using Spark Streaming, in: International Conference on Computational Science, Wuxi, China, 2018, pp. 524–539.
[10] M. Götz, C. Bodenstein, HPDBSCAN benchmark test files, http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f. (Accessed 5 June 2020).
[11] M. Götz, C. Bodenstein, M. Riedel, HPDBSCAN: highly parallel DBSCAN, in: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, Machine Learning in HPC Environments Workshop, Austin, TX, USA, vol. 2, 2015, pp. 1–10.
[12] M. Gowanlock, Hybrid CPU/GPU clustering in shared memory on the billion point scale, in: Proceedings of the ACM International Conference on Supercomputing, Phoenix, Arizona, USA, 2019, pp. 35–45.
[13] M. Gowanlock, C.M. Rude, D.M. Blair, A hybrid approach for optimizing parallel clustering throughput using the GPU, IEEE Trans. Parallel Distrib. Syst. 30 (4) (2019) 766–777.
[14] A. Gunawan, A faster algorithm for DBSCAN, Master's thesis, Technische Universiteit Eindhoven, March 2013.
[15] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, vol. 2, 1984, pp. 47–57.
[16] D. Han, A. Agrawal, W. Liao, A. Choudhary, Parallel DBSCAN algorithm using a data partitioning strategy with Spark implementation, in: IEEE International Conference on Big Data, Seattle, WA, USA, 2018, pp. 305–312.
[17] A. He, P. Wang, J. Shao, Molecular dynamics simulations of ejecta size distributions for shock-loaded Cu with a wedged surface groove, Comput. Mater. Sci. 98 (2015) 271–277.
[18] Y. He, H. Tan, W. Luo, et al., MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce, in: Proceedings of the IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 2011, pp. 473–480.
[19] E. Januzaj, H.-P. Kriegel, M. Pfeifle, DBDC: density based distributed clustering, in: Proceedings of the 9th Intl. Conf. on Advances in Database Technology, vol. 2992, 2004, pp. 88–105.
[20] M.B. Kennel, KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space, 2004.
[21] A. Lulli, M. Dell'Amico, P. Michiardi, L. Ricci, NG-DBSCAN: scalable density-based clustering for arbitrary data, Proc. VLDB Endow. 10 (3) (2016) 157–168.
[22] M. Gowanlock, D.M. Blair, V. Pankratius, Optimizing parallel clustering throughput in shared memory, IEEE Trans. Parallel Distrib. Syst. 28 (9) (2017) 2595–2607.
[23] D. Morozov, T. Peterka, Efficient Delaunay tessellation through K-D tree decomposition, in: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society Press, Salt Lake City, UT, USA, 2016, pp. 728–738.
[24] H. Mustafa, E. Leal, L. Gruenwald, An experimental comparison of GPU techniques for DBSCAN clustering, in: IEEE International Conference on Big Data, 2019.
[25] H. Neukirchen, Elephant against Goliath: performance of big data versus high-performance computing DBSCAN clustering implementations, in: First International Workshop, SimScience, Göttingen, Germany, 2017, pp. 251–271.
[26] M. Patwary, J. Blair, F. Manne, Experiments on union-find algorithms for the disjoint-set data structure, in: Proceedings of the 9th International Symposium on Experimental Algorithms, LNCS, vol. 6049, Springer, 2010, pp. 411–423.
[27] M.A. Patwary, P. Refsnes, F. Manne, Multi-core spanning forest algorithms using the disjoint-set data structure, in: Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium, IPDPS 2012, 2012, pp. 827–835.
[28] M.M.A. Patwary, D. Palsetia, A. Agrawal, et al., A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, in: SC12: International Conference for High Performance Computing, Networking, Storage and Analysis, Los Alamitos, CA, USA, 2012, pp. 1–11.
[29] M.M.A. Patwary, N. Satish, N. Sundaram, et al., PARDICLE: parallel approximate density-based clustering, in: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 560–571.
[30] M.M.A. Patwary, S. Byna, N. Satish, et al., BD-CATS: big data clustering at trillion particle scale, in: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, Austin, TX, USA, vol. 6, 2015, pp. 1–12.
[31] A. Sarma, P. Goyal, S. Kumari, A. Wani, J.S. Challa, S. Islam, N. Goyal, μDBSCAN: an exact scalable DBSCAN algorithm for big data exploiting spatial locality, in: IEEE International Conference on Cluster Computing, Albuquerque, NM, USA, 2019, pp. 1–11.
[32] E. Schubert, J. Sander, M. Ester, H.-P. Kriegel, X. Xu, DBSCAN revisited, revisited: why and how you should (still) use DBSCAN, ACM Trans. Database Syst. 42 (3) (2017) 19:1–19:21.
[33] T.P. Shibla, K.B.S. Kumar, Improving efficiency of DBSCAN by parallelizing kd-tree using Spark, in: International Conference on Intelligent Computing and Control Systems, Madurai, India, 2018, pp. 1197–1203.
[34] D. Šidlauskas, S. Šaltenis, C.W. Christiansen, J.M. Johansen, D. Šaulys, Trees or grids? Indexing moving objects in main memory, in: Proceedings of the 17th ACM SIGSPATIAL, 2009, pp. 236–245.
[35] H. Song, J.-G. Lee, RP-DBSCAN: a superfast parallel DBSCAN algorithm based on random partitioning, in: SIGMOD, ACM, Houston, TX, USA, 2018, pp. 1173–1187.
[36] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, 1st edition, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
[37] B. Welton, E. Samanas, B.P. Miller, Mr. Scan: extreme scale density-based clustering using a tree-based network of GPGPU nodes, in: SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, USA, 2013, pp. 84:1–84:11.
[38] X. Xu, J. Jäger, H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, Data Min. Knowl. Discov. 3 (1999) 263–290.

Guoqing Wu is an associate professor at the Institute of Applied Physics and Computational Mathematics, Beijing. He received his Ph.D. in computer science from the China Academy of Engineering Physics in 2009. His research interests include data-parallel processing, scientific data visualization and software development. In particular, he focuses on developing parallel algorithmic approaches for large-scale simulation data processing.

Liqiang Cao is an associate professor at the Institute of Applied Physics and Computational Mathematics, Beijing. He received his Ph.D. in computer science from the Institute of Computing Technology, Chinese Academy of Sciences in 2005. His research interests include performance optimization of large-scale parallel programs and scientific data management in HPC.

Hongyun Tian is a Ph.D. candidate at the Graduate School of China Academy of Engineering Physics. He holds a B.Sc. degree in electronic engineering from Zhejiang University and an M.Sc. degree in computer architecture from Beihang University, China. His research interests include computer system optimization and assessment.

Wei Wang is an assistant professor at the High Performance Computing Center of the Institute of Applied Physics and Computational Mathematics, Beijing. He received a master's degree in computer science from Xi'an Jiaotong University, China, in 2009. His research interests include performance evaluation and optimization of large-scale parallel applications.