Modular design of data-parallel graph algorithms
Santanu Kumar Dash ∗ , Sven-Bodo Scholz† , Bruce Christianson∗
∗ University of Hertfordshire, Hatfield, United Kingdom
{s.dash, b.christianson}@herts.ac.uk
† Heriot-Watt University, Edinburgh, United Kingdom
[email protected]
contact author: Santanu Kumar Dash
The authors would like to thank Dr. Keshav Pingali and Dimitrios Prountzos of the University of Texas at Austin for fruitful discussions while this work was
conducted there.
The authors would also like to thank the University of Hertfordshire for supporting this research work through a research grant.
Abstract—Amorphous Data Parallelism has proven to be a
suitable vehicle for implementing concurrent graph algorithms
effectively on multi-core architectures. In view of the growing
complexity of graph algorithms for information analysis, there
is a need to facilitate modular design techniques in the context
of Amorphous Data Parallelism. In this paper, we investigate
what it takes to formulate algorithms possessing Amorphous
Data Parallelism in a modular fashion enabling a large degree
of code re-use. Using the betweenness centrality algorithm, a
widely used algorithm in the analysis of social networks,
we demonstrate that a single optimisation technique can suffice
to enable a modular programming style without losing the
efficiency of a tailor-made monolithic implementation.
Index Terms—multi-core, parallelisation, programming model
I. INTRODUCTION
With the recent advent of the internet and social networks,
massive amounts of digital information are being generated
today. A need to analyze this information has triggered the
development of sophisticated graph algorithms [1]. The complexity of these algorithms coupled with the massive datasets
that they are typically applied to creates a strong demand for
an effective utilisation of multi- and many-core systems.
Unfortunately, graph algorithms do not lend themselves well
to implementation on multi-core architectures. The irregular
structure of graphs makes static analysis a hard task and it is
difficult to foretell the execution footprint of graph algorithms.
Much of the parallelism inherent in graph algorithms comes
from the complex interplay of runtime factors which cannot be
modelled statically. While some of the algorithms have been
successfully deployed on multi-core architectures through the
use of clever heuristics [2][3], a large class of graph algorithms
has eluded a parallel implementation. To ameliorate this
situation, researchers have focused on identifying generic
models for parallelism in graph algorithms.
One recent approach known as Tao Analysis abstracts away
from the algorithmic specification and instead, formulates
graph algorithms in terms of operations that are performed
on the graph structure as a whole [4][5]. The key idea of the
Tao Analysis is to identify a set of nodes named active nodes
and then to collectively apply a combination of operators to all
these nodes. When an operator is applied to an active node, it
usually affects not only the node itself but also an area of the
graph around that node. This operator-specific and potentially
also node-specific area is referred to as the neighbourhood.
The concurrency in an algorithm is exposed when operators
with non-overlapping neighbourhoods are executed in parallel.
This form of parallelism is called Amorphous Data Parallelism
(ADP) and has been shown to be prevalent in a broad class
of graph algorithms [5].
ADP offers several benefits to the application programmer.
Firstly, ADP paves the way for realising scalable performance
on multi-core systems for a large group of graph algorithms [6]
[5] [4]. Secondly, algorithms possessing ADP properties are
formulated on a single layer of abstraction that enables the
encapsulation of the low-level concurrency mechanisms into
highly optimised libraries. These libraries are fine-tuned to deliver efficient multi-core execution without explicit instructions
from the programmer. Consequently, application programmers
can focus on writing the graph algorithms without worrying
about low-level concurrency tuning.
This paper investigates how far this separation of programming concerns can be driven forward. In most published works, individual algorithms under investigation have been handcrafted
with a view to performance. Re-using pre-developed codebase
of other applications has been of minor concern. However, if
ADP is to be used in mainstream computing, it is very likely
that programmers will try to adopt a programming style that
attempts to maximise modularity and, with it, the amount of
code re-use possible. In this paper, we investigate the potential
impact such a modularised programming style may have on
the overall performance achieved.
As an example, we look at the betweenness centrality
algorithm - a widely used algorithm in the analysis of social
networks. We implement two versions of this algorithm: a
monolithic version that implements the entire algorithm within
one ADP operation, and a modularised version where the
algorithm is formulated as a composition of several much
simpler ADP operations developed for potential re-use in
varied contexts.
The main contribution of the paper is not only an extensive
experimental comparison of the two versions but also an
identification of an optimisation technique that allows one
form to be transformed into the other. Furthermore, we briefly
discuss the challenges in automating the optimisation within
a compiler setting and highlight how expressing the algorithm
in a functional language can simplify the applicability of the
optimisation.
The rest of the paper is organized as follows. In section
II, we give an introduction to the Tao analysis of algorithms
and ADP. We give an overview of the betweenness algorithm
in section III. The operator formulation of the betweenness
algorithm is discussed in section IV. Section V describes
the advantages of operator fusion and presents experimental
results to reinforce the advantages of operator fusion. A
case for implementing operator formulations in functional
languages is presented in section VI. Finally, the paper is
concluded in section VII.
II. AMORPHOUS DATA PARALLELISM
The operator formulation of graph algorithms expresses
algorithms in terms of actions they perform on the graph.
Vertices on which these operators are applied are called active
nodes. Upon application of an operator to an active node,
other nodes or edges in the vicinity of the active node may
be modified. If there are no conflicting neighbourhoods for
activities at two different active nodes, then those activities can
be executed in parallel. Otherwise, we may need some form of
locking to enable the activities to execute. The parallelism thus
uncovered is known as Amorphous Data Parallelism (ADP).
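As a rough illustration of this execution model, the following Python sketch (it is not taken from any ADP runtime; the operator and neighbourhood callables, and the use of a thread pool, are assumptions made purely for illustration) batches activities whose neighbourhoods do not overlap into rounds and executes each round in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def run_adp_rounds(active_nodes, operator, neighbourhood):
    # `neighbourhood(n)` is assumed to return the set of vertices that the
    # operator applied at node n may touch; `operator(n)` performs the activity.
    pending = list(active_nodes)
    while pending:
        locked, batch, deferred = set(), [], []
        for n in pending:
            region = neighbourhood(n)
            if locked.isdisjoint(region):   # no overlap with this round's activities
                locked |= region
                batch.append(n)
            else:                           # conflicting neighbourhood: retry later
                deferred.append(n)
        with ThreadPoolExecutor() as pool:  # activities within a round are independent
            list(pool.map(operator, batch))
        pending = deferred
```

A production runtime detects conflicts optimistically while activities execute rather than partitioning them up front, but the sketch captures the defining property of ADP: activities with disjoint neighbourhoods may proceed in parallel.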
While ADP seems like an intuitive concept, exploiting latent
ADP in applications is a non-trivial task. Amongst other
things, one needs to understand the nature of active nodes and
neighbourhoods in the operator formulation. An understanding
of the life-cycle of active nodes is also necessary i.e. the way
in which vertices become active. It is important to comprehend
whether active nodes can be executed in parallel or whether they need
to be executed in a certain order. This largely depends on the
nature of the operators that are applied to the active nodes.
Therefore, it is important to study the operators themselves
before deciding on a runtime scheme for scheduling and
synchronizing parallel activities. In order to unravel ADP in
an algorithm, therefore, we need a combination of a rigorous
analytical framework and powerful runtime support for
different scheduling and synchronization policies.
While there are standard schemes available for scheduling
and synchronization, it is the initial analysis of the algorithm
that is challenging and needs further elaboration. For uncovering the latent ADP in the algorithm and coming up with a
suitable scheduling scheme, the analytical framework that is
used is termed Tao analysis [5]. There are three dimensions
to the Tao analysis which are enumerated below.
1) Topology: It is important to understand the structure of
the graph on which the operators are executed. This
information is necessary as regular graphs are amenable
to many compile-time optimizations. Also, it is easier
to come up with a compile-time scheduling policy for
algorithms on regular graphs.
2) Active Node: Often the execution of operators on currently active nodes spawns new ones. It is necessary
to understand the manner in which active nodes come
into being. This information can be used to work out
the right scheduling policy while applying operators to
active nodes. For the same purpose, it is important to
understand whether operators on the currently active
nodes must be executed in a certain order or whether they can
be executed in any order.
3) Operators: A good understanding of the nature of operators is needed to decide what kind of locking
and roll-back mechanism may be necessary for the
algorithm. There are three classes of operators that have
been discovered so far in the context of graph algorithms
[5]. The first class of operators is the morph operator.
This operator modifies the graph in the neighbourhood
of the active nodes. The second class of operators is
called local computations. They do not modify the graph
but update the values on the vertices and edges in
the vicinity of the active nodes. Finally, readers do
not modify or update values on vertices and edges.
Instead, they are commonly used to read these values.
For most reader and local computation operators, no
locking scheme is normally necessary. However, morph
operators necessitate the use of locking or roll-back
schemes to ensure program correctness. A small illustrative
sketch of the three operator classes is given below, after
Table I.

TABLE I
EXAMPLE OF DEPENDENCE VALUE COMPUTATION FOR VERTEX d IN FIGURE 1

Vertex (ω) | Shortest paths from ω          | Paths through d | δω•(d)
a          | a→b, a→b→c, a→b→c→e, a→b→d→e   | a→b→d→e         | 1÷4 = 0.25
b          | b→c, b→c→e, b→d→e              | b→d→e           | 1÷3 = 0.33
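To make the three operator classes concrete, the following Python fragment (our own illustration, using a plain adjacency-list dictionary and a per-vertex value map rather than any particular graph library) shows one hypothetical operator of each kind.

```python
# Graph as an adjacency list plus a map of per-vertex values.
graph = {'a': ['b'], 'b': ['c', 'd'], 'c': ['e'], 'd': ['e'], 'e': []}
value = {v: 1.0 for v in graph}

def reader_neighbour_sum(v):
    """Reader: only reads values in the neighbourhood of v."""
    return sum(value[w] for w in graph[v])

def local_neighbour_average(v):
    """Local computation: updates values but leaves the graph structure intact."""
    if graph[v]:
        value[v] = sum(value[w] for w in graph[v]) / len(graph[v])

def morph_add_edge(v, w):
    """Morph: modifies the graph structure itself around the active node v."""
    if w not in graph[v]:
        graph[v].append(w)
```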
III. THE BETWEENNESS CENTRALITY MEASURE
The betweenness centrality measure is indicative of the
relative importance of a vertex in a graph. For the subsequent
discussions let us assume that the algorithm operates on a
graph G = (V, E), where V is the set of vertices in the graph
and E is the set of edges. Betweenness centrality for a vertex
v is defined as the sum of the dependence of all vertices s ∈ V
on v in reaching all other vertices t ∈ V .
The computation of betweenness values for a vertex v is
shown in equation 1. Here, δs• (v) is the dependence of vertex
s on v to reach all other vertices in the graph. Equation 2 shows
how to compute the dependence of a vertex s on another vertex
v. This equation sums up the dependence of s on v in reaching
a target vertex t for all t ∈ V . Here, δst (v) is the target
specific dependence of s on v. Computation of target-specific
dependence is further highlighted in equation 3. Target-specific
dependence of a given source s on another vertex v for a
given target t is defined as the ratio of the number of shortest
paths between s and t passing through v (denoted by σst(v)
in equation 3) to the total number of shortest paths
between s and t (denoted by σst in equation 3).
BC(v) = Σ_{s ∈ V, s ≠ v} δs•(v)                (1)

δs•(v) = Σ_{t ∈ V, t ≠ v} δst(v)                (2)

δst(v) = σst(v) / σst                           (3)
Take the case of the graph shown in figure 1 as an
example. The matrix alongside the graph shows all shortest
paths between pairs of vertices. Let us try to compute the
betweenness value for vertex d here. Only shortest paths
from a and b to other vertices pass through d. Therefore, the
dependence of other vertices on d can be ignored in this case.
Now let us look at the paths themselves. Table I shows the
shortest path information from a and b to other vertices. It also
shows shortest paths from a and b to other vertices passing
through d and the computation of dependence values from this
information. Thus, the betweenness value for vertex d based
on the information from Table I is 0.25 + 0.33 = 0.58.
Fig. 1. All-pair all-shortest-paths for a directed graph with vertices a–e (the matrix beside the graph lists, for every pair, the shortest paths: a→b, a→b→c, a→b→d, a→b→c→e, a→b→d→e, b→c, b→d, b→c→e, b→d→e, c→e, d→e)

As can be observed, computation of the betweenness values
involves computation of all-pair all-shortest-paths. This can
be very expensive even for small graphs. However, it was
observed in [7] that the computation of the number of shortest
paths i.e. σst values can be done using a modified breadth first
search. For the subsequent discussion, we define the depth of
a node v in a breadth first tree as the hop distance of v from
the source of the breadth first tree. The observation in [7] was
based on the fact that if a parent p has a depth of one less
than the depth of its child q in a given breadth-first tree
spanning the graph, then any shortest path from the source to
p, augmented with the edge from p to q, is also a shortest
path from the source to q.
If for every vertex t, we can compute a parent sub-set Ps (t)
such that these parents have a depth of one less than t in the
breadth first search originating at s, we can compute σst as
shown in equation 4. It was further shown in [7] that the δs• values
can be computed recursively as well using the σst values, as
shown in equation 5. The details of the derivation of this
expression can be found in [7]; the derivation uses a similar
observation to the one used for simplifying the σst computation.
σst = Σ_{r ∈ Ps(t)} σsr                        (4)

δs•(r) = Σ_{t : r ∈ Ps(t)} (σsr / σst) · (1 + δs•(t))        (5)
There are two distinct advantages of equations 4 and 5.
Firstly, they do away with the added complexity of computing
the number of shortest paths passing through a given vertex v
(denoted by σst(v)). Secondly, and more importantly, they simplify the computation of dependence values to the extent that
it can be done recursively using breadth-first backtracking. In
other words, if we start at the leaves of a breadth first tree and
backtrack we can recursively compute the dependence values
by using equation 5.
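The following Python sketch restates this single-source computation (a simplified rendering of Brandes' approach [7]; the adjacency-list representation and the function name are our own choices for illustration): a breadth-first pass computes depths, the parent sets Ps(t) and the σst values as in equation 4, and a backtracking pass over the vertices in reverse breadth-first order accumulates the dependence values as in equation 5.

```python
from collections import deque

def source_dependencies(graph, s):
    """For one source s, return delta[v], the dependence of s on every vertex v.
    `graph` is assumed to be a dict mapping each vertex to its out-neighbours."""
    depth   = {s: 0}
    sigma   = {v: 0 for v in graph}; sigma[s] = 1   # sigma_st, built up as in equation 4
    parents = {v: [] for v in graph}                # Ps(t)
    order   = []                                    # vertices in breadth-first order
    queue   = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in depth:                      # first visit: record its depth
                depth[w] = depth[v] + 1
                queue.append(w)
            if depth[w] == depth[v] + 1:            # v is a parent of w
                sigma[w] += sigma[v]
                parents[w].append(v)
    delta = {v: 0.0 for v in graph}
    for t in reversed(order):                       # backtrack from the leaves
        for p in parents[t]:                        # equation 5
            delta[p] += (sigma[p] / sigma[t]) * (1 + delta[t])
    return delta
```

Summing the returned dependences over all sources s ≠ v then yields BC(v) as in equation 1.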
IV. OPERATOR FORMULATION OF THE ALGORITHM
In this section, we present the Tao analysis and operator
formulation of the betweenness centrality algorithm in order
to uncover latent amorphous data parallelism. We discuss the
nature of active nodes in the algorithm as well as the operators
that operate on these active nodes. Finally, we comment upon
the source of parallelism based on the operator formulation.
It must be noted here that the first dimension of Tao analysis
is inapplicable to the betweenness centrality algorithm as it
operates on unstructured graphs. Therefore, we focus on the
second and third dimensions.
A. Active Nodes
For every vertex in the graph, we need to compute the
dependence of every other vertex on it. For a given vertex
v, once we have the dependence of other vertices on v,
we need to add the dependence values of all vertices on v
to get the betweenness centrality for v. This exercise needs
to be repeated for every vertex. Therefore, it can be safely
assumed that all vertices in the graph are active nodes in the
betweenness centrality algorithm. The more important observation here is that no new active nodes are spawned during the
application of operators to the currently active nodes. In the
Tao analysis framework such algorithms are called topology-driven algorithms. For topology-driven algorithms, application
of operators at an active node does not cause other vertices
to become active. This is in contrast to data-driven algorithms
where application of an operator at an active node causes other
vertices to become active.
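The distinction can be pictured as two worklist disciplines. The sketch below is our own illustration (the operator arguments are placeholders) and not code from any ADP runtime.

```python
from collections import deque

def run_topology_driven(graph, operator):
    # Topology-driven: every vertex is active up front and the worklist never
    # grows while the operator executes (the betweenness centrality case).
    worklist = deque(graph)
    while worklist:
        operator(graph, worklist.popleft())

def run_data_driven(graph, operator, initially_active):
    # Data-driven: applying the operator may activate further vertices, which
    # are returned by the operator and pushed back onto the worklist.
    worklist = deque(initially_active)
    while worklist:
        worklist.extend(operator(graph, worklist.popleft()))
```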
B. Operators
As is evident from the discussion in section III, there
are four operators in the betweenness algorithm which are
enumerated below.
1) Breadth first search and registration of leaf nodes: The
first operator that is applied to the active nodes is the
breadth first walk that is initiated at each active node.
Each walk computes the level of every vertex relative to
its source. At the end of each breadth first walk, we store the
leaf nodes of the breadth first tree for the backtracking and
dependence value computation as described in section III.
2) Parent sub-set and all-pair shortest path counting: The
next operator that is applied traverses the graph starting
at each active node s ∈ V . The operator computes
the parent sub-set Ps (t) for the given active node s
and a given vertex t such that all vertices in Ps (t) are
parents of t and have a depth of one less than that of t in the
breadth first tree having s as its root. As the vertices in
Ps (t) are discovered, the computation of σst can also
be completed as described in equation 4.
3) Computation of dependence values: A backtracking operator is then applied which starts at the leaf nodes
of the breadth first trees that are recorded in the first
step. During the backtracking the dependence values are
recursively computed as shown in equation 5.
4) Betweenness centrality computation: The final operator
sums up, for each active node, the dependence values of
all other vertices on that node. This gives the final betweenness
centrality value for the active node.
Each of the operators stated above touches the entire graph.
Therefore, the neighbourhood of the operators is the entire
graph. One would think that this large a neighbourhood would
surely impede parallel execution of the operators at active
nodes. However, upon closer inspection it can be seen that
for the same operator, none of the active nodes depend on
results from operator applications at other active nodes. In
other words, the application of operators to active nodes need
not be ordered for the same operator. This observation unravels
the latent amorphous data parallelism in the application. If the
active nodes have their own private copies of data, then the
operators can be applied in parallel without paying attention
to runtime coordination.
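Because each active node only reads the (unmodified) graph and writes to its own per-source data, the per-source work can be distributed with an off-the-shelf parallel map. The sketch below reuses the hypothetical source_dependencies function from the sketch in section III and is meant only to illustrate the absence of coordination, not the tuned runtime used in our experiments.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def betweenness_parallel(graph, workers=8):
    """Each task owns private depth/sigma/delta state, so no locking is needed;
    only the per-source results are combined at the end.
    (Run under an `if __name__ == "__main__":` guard on platforms that spawn.)"""
    bc = {v: 0.0 for v in graph}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        per_source = pool.map(partial(source_dependencies, graph), graph)
        for s, delta in zip(graph, per_source):
            for v, d in delta.items():
                if v != s:
                    bc[v] += d
    return bc
```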
V. IMPROVING RUNTIME AND SCALABILITY
While the Tao analysis framework provides a nice theoretical basis for uncovering latent amorphous data parallelism
in an algorithm, it does not address subtle runtime issues
like memory access overheads. If the sequence of operations
is not scheduled properly, it may lead to cache thrashing
and the operation may ultimately become memory bound.
This deteriorates algorithm runtimes and is detrimental to the
scalability of the algorithm as well. In this section, we propose
operator fusion as an optimization for operator formulation in
order to improve runtime and scalability of the algorithm as
the number of threads executing the algorithm is increased.
Consider the operator formulation of betweenness centrality
as an example. There are two ways in which the operators
can be applied to the active nodes in this algorithm. In the
first way, each operator can be applied to all active nodes
before the next operator is chosen. The other way is to
apply all operators to one active node (or a set of active
nodes) before moving on to other active nodes. We call the former way
of applying the operators the segmented operator formulation and
the latter the fused operator formulation. Both versions of the
operator formulation for the betweenness centrality algorithm
are shown in Table II.
In both versions of the formulation, all vertices in the graphs
are marked as active nodes and the active nodes are stored on
a worklist - a data structure that holds currently active nodes.
The runtime system fetches active nodes from the worklist
and executes operators on them. In the segmented version of
the algorithm, the four operators identified in section IV are
applied one at a time to all the active nodes in the worklist.
On the other hand, in the fused version, all operators are
applied to each of the active nodes in one go. This ensures
better cache utilization by improving data locality. As we
will demonstrate shortly with experimental results, the fused
version of the algorithm performs far better in terms of runtime
and scalability as the number of processor threads is increased.
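The locality argument can also be seen directly in the shape of the two drivers. The sketch below is our own restatement of Table II in Python (the stage functions are hypothetical placeholders for the four operators of section IV): the segmented driver keeps a per-source record for every source alive across all stages, whereas the fused driver needs only one record at a time.

```python
def betweenness_segmented(graph, bfs_stage, sigma_stage, delta_stage, bc_stage):
    # Each stage sweeps over every source before the next stage starts, so the
    # intermediate records of all sources are live simultaneously.
    bc = {v: 0.0 for v in graph}
    records = {s: {} for s in graph}
    for stage in (bfs_stage, sigma_stage, delta_stage):
        for s in graph:
            stage(graph, s, records[s])
    for s in graph:
        bc_stage(graph, s, records[s], bc)
    return bc

def betweenness_fused(graph, bfs_stage, sigma_stage, delta_stage, bc_stage):
    # All four stages are applied to one source before moving on, so only that
    # source's record is live and it stays cache-resident across the stages.
    bc = {v: 0.0 for v in graph}
    for s in graph:
        record = {}
        bfs_stage(graph, s, record)
        sigma_stage(graph, s, record)
        delta_stage(graph, s, record)
        bc_stage(graph, s, record, bc)
    return bc
```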
We executed both the segmented and fused operator formulations using two random graphs as inputs. The first graph
had 8192 vertices and 60743 edges while the second graph
had 16384 vertices and 123329 edges. The experiments were
run on a machine with 2 Intel X5570 quad-core processors
(giving a total of 8 cores) running the Linux 2.6.32 kernel. The
cores shared an L3 cache of 8MB. Figure 2 shows the runtime
and scalability measures for the two graph sizes for both
the fused and segmented operators. The scalability measures
are obtained by dividing the runtime for a given number of
processor threads by the runtime for a single thread. The
figure shows clearly how the segmented operator formulation
suffers from poor runtime performance. This is attributed to
ineffective cache usage. More importantly, it can be observed
that the segmented version of the algorithm does not scale
beyond 4 cores. Our investigations have shown that, due to the
massive amount of intermediate data that is generated in the
segmented operator formulation, there is a significant amount
of L3 cache thrashing beyond 4 threads. This makes the
computation memory bound and the runtime of the algorithm
does not improve despite increasing the number of threads.
VI. A CASE FOR FUNCTIONAL PROGRAMMING
In section V, we demonstrated the importance of operator
fusion in improving the runtimes of operator formulation of
algorithms. While operator fusion is an essential ingredient to
ensure better runtimes and scalability through better cache usage, it is not feasible to handcraft the fusion process for every
composition of operators. Moreover, handcrafting the fusion
process is at odds with code reuse and ease of programming
where the programmer wishes to compose new formulations
from a library of pre-developed operators. This creates a
need for the fusion process to be a part of the compilation
framework so that the programmer can compose formulations
without being concerned about runtime overheads.
However, expressing operators in an imperative language
often makes the analysis process complicated. Identification of
avenues for fusion in an imperative setting is a non-trivial task
given that an imperative program may contain global variables
and state information which complicates data-flow analysis.
On the other hand, if the operator formulation is written in a
functional language, we do not have any notion of mutable state.
Data flow analysis of a purely functional language is a much
simpler task due to the referential transparency in functional
languages. Consequently, the process of identifying avenues
for operator fusion is greatly simplified if the operator formulation is written in a functional style. In fact, a process similar
to operator fusion has already been successfully applied in
the context of array programming in the Single Assignment
C (SAC) language [8][9]. In SAC, this process is called with-loop folding and operations on multiple array elements are
folded together to form a unified operation wherever possible
to reduce memory overheads.
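The same folding idea can be illustrated outside SAC. In the hypothetical Python fragment below (not SAC code), two separate element-wise passes materialise an intermediate list, whereas the fused version computes the composed operation in a single traversal with no intermediate storage, which is essentially the effect of with-loop folding and of the operator fusion discussed in section V.

```python
def unfused(xs):
    doubled = [x * 2 for x in xs]    # first pass: builds an intermediate list
    return [d + 1 for d in doubled]  # second pass: traverses it again

def fused(xs):
    return [x * 2 + 1 for x in xs]   # one pass, no intermediate storage
```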
Apart from simplifying the analysis process for operator
fusion, functional languages offer other benefits as well. Functional languages typically come with rich features like pattern
matching, higher order functions and partial application. These
features are especially useful in expressing algorithms in a succinct manner thereby boosting productivity of programmers.
The power of higher order functions and pattern matching in
expressing succinct graph algorithms has already been presented in [10]. With improved modularity through higher order
TABLE II
SEGMENTED VS FUSED OPERATOR FORMULATION FOR THE BETWEENNESS CENTRALITY ALGORITHM

Segmented operators:
    G = (V, E)
    for all v ∈ V do
        markAsActiveNode(v)
    end for
    WL ← activeNodes(G)
    for all s ∈ WL do
        BFS(s)
    end for
    for all s ∈ WL do
        σcalc(s)
    end for
    for all s ∈ WL do
        δcalc(s)
    end for
    for all s ∈ WL do
        BCcalc(s)
    end for

Fused operators:
    G = (V, E)
    for all v ∈ V do
        markAsActiveNode(v)
    end for
    WL ← activeNodes(G)
    for all s ∈ WL do
        BFS(s);
        σcalc(s);
        δcalc(s);
        BCcalc(s);
    end for
Fig. 2. Runtime and scalability measures for random graphs of 8192 and 16384 vertices for the segmented and fused operator formulations
functions and the lack of side effects, understanding complex
code segments is much easier in a functional setting than in an
imperative setting. This makes maintenance of large software
codebases much simpler in functional languages. With a thrust
towards developing complicated techniques for information
analysis using graph theory, functional languages are a good
candidate for implementing complex graph algorithms.
Finally, the biggest selling point of functional languages
is that they greatly simplify analysis of source code for
identifying task parallelism. This is due to the Church-Rosser
property of functional languages which states that in the
absence of side effects, any two operations can be executed in
parallel provided there is no data dependency between them.
This is the reason why functional languages have been hugely
successful at harnessing the power of multi-core architectures.
Due to the above reasons, we believe that extending preexisting ideas in functional languages to the domain of amorphous data parallelism presents a compelling case for further
work. It will be very interesting to explore the synergies
between the two approaches and we intend to embark on this
task as a part of our future research work.
VII. CONCLUSION
In this paper, we discussed a parallel implementation of
the betweenness centrality algorithm. We did a Tao analysis
of the betweenness algorithm in order to better understand
the nature of active nodes and operators in the algorithm.
Then, we formulated the algorithm in terms of operators that
mutate information stored at active nodes. While the operator formulation of the algorithm exposed latent amorphous
data parallelism in the algorithm, we showed that a naïve
implementation of the operator formulation suffers from cache
thrashing and memory access overheads. We showed that
performance and scalability of operator formulations can be
significantly improved by augmenting operator formulation
with operator fusion and presented experimental results to
reinforce the value of the fusion process.
While operator fusion was shown to be invaluable in optimizing operator formulations, it was discussed that analysis of
source code to identify avenues for operator fusion was a non-trivial task. At the same time, it was showcased how the use of
functional languages to code operator formulations eases the
fusion analysis by doing away with side effects. Given
the added benefits of functional languages like code-reuse,
modularity and the Church-Rosser property, it can be said that
functional languages are highly relevant to graph algorithms
in general and parallel implementation of graph algorithms in
particular. Therefore, it is worthwhile to consider the benefits
of functional languages in the context of amorphous data
parallelism.
REFERENCES
[1] Ulrik Brandes and Thomas Erlebach, editors. Network Analysis: Methodological Foundations [outcome of a Dagstuhl seminar, 13-16 April
2004], volume 3418 of Lecture Notes in Computer Science. Springer,
2005.
[2] David A. Bader and Guojing Cong. Fast shared-memory algorithms for
computing the minimum spanning forest of sparse graphs. J. Parallel
Distrib. Comput., 66:1366–1378, November 2006.
[3] John R. Gilbert and Robert Schreiber. Highly parallel sparse cholesky
factorization. SIAM Journal on Scientific and Statistical Computing,
13:1151–1172, 1992.
[4] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan,
Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN conference on
Programming language design and implementation, PLDI ’07, pages
211–222, New York, NY, USA, 2007. ACM.
[5] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher,
M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew
Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos,
and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the
32nd ACM SIGPLAN conference on Programming language design and
implementation, PLDI ’11, pages 12–25, New York, NY, USA, 2011.
ACM.
[6] Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali.
Ordered vs. unordered: a comparison of parallelism and work-efficiency
in irregular algorithms. In PPOPP, pages 3–12, 2011.
[7] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal
of Mathematical Sociology, 25:163–177, 2001.
[8] S.-B. Scholz. With-loop-folding in SAC–Condensing Consecutive Array
Operations. In C. Clack, K. Hammond, and T. Davie, editors, Implementation of Functional Languages, 9th International Workshop, IFL'97, St.
Andrews, Scotland, UK, September 1997, Selected Papers, volume 1467
of LNCS, pages 72–92. Springer, 1998.
[9] Sven-Bodo Scholz. Single Assignment C — efficient support for high-level array operations in a functional setting. Journal of Functional
Programming, 13(6):1005–1059, 2003.
[10] Martin Erwig. Inductive graphs and functional graph algorithms. Journal
of Functional Programming, 11:467–492, September 2001.