A Benchmark For Betweenness Centrality Approximation Algorithms On Large Graphs

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

A Benchmark for Betweenness Centrality Approximation

Algorithms on Large Graphs


Ziyad AlGhamdi Fuad Jamour
King Abdullah University of Science and Technology King Abdullah University of Science and Technology
Thuwal, Saudi Arabia 23955 Thuwal, Saudi Arabia 23955
[email protected] [email protected]

Spiros Skiadopoulos Panos Kalnis


University of the Peloponnese King Abdullah University of Science and Technology
Tripoli, Greece 22100 Thuwal, Saudi Arabia 23955
[email protected] [email protected]

ABSTRACT ACM Reference format:


Betweenness centrality quantifies the importance of graph nodes Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis. 2017.
A Benchmark for Betweenness Centrality Approximation Algorithms on
in a variety of applications including social, biological and commu-
Large Graphs. In Proceedings of SSDBM ’17, Chicago, IL, USA, June 27-29,
nication networks. Its computation is very costly for large graphs; 2017, 12 pages.
therefore, many approximate methods have been proposed. Given DOI: http://dx.doi.org/10.1145/3085504.3085510
the lack of a golden standard, the accuracy of most approximate
methods is evaluated on tiny graphs and is not guaranteed to be rep-
resentative of realistic datasets that are orders of magnitude larger.
1 INTRODUCTION
In this paper, we develop BeBeCA, a benchmark for betweenness Graphs represent complex relationships among entities, such as
centrality approximation methods on large graphs. Specifically: (i) friendships in social networks, interactions between proteins in
We generate a golden standard by deploying a parallel implementa- bioinformatics, or links among web pages. Centrality indices, such
tion of Brandes algorithm using 96,000 CPU cores on a supercom- as degree [38] closeness [33] and betweenness centrality [12], quan-
puter to compute exact betweenness centrality values for several tify node importance in a graph. In this paper, we focus on be-
large graphs with up to 126M edges. (ii) We propose an evaluation tweenness centrality [12], which defines the importance of a node
methodology to assess various aspects of approximation accuracy, u proportionally to the fraction of all-pairs shortest paths passing
such as average error and quality of node ranking. (iii) We survey through u. Betweenness centrality is used in many graph analytics
a large number of existing approximation methods and compare applications such as finding influential persons in social networks
their performance and accuracy using our benchmark. (iv) We [15]; identifying hubs in wireless ad-hoc networks [26]; community
publicly share our benchmark, which includes the golden standard detection [14]; identifying potential leaders in terrorist networks
exact betweenness centrality values together with the scripts that [21]; and optimizing routing protocols [10].
implement our evaluation methodology; for researchers to compare The best known algorithm for computing the exact values of
their own algorithms and practitioners to select the appropriate betweenness centrality was proposed by Brandes [5]. The algorithm
algorithm for their application and data. needs to compute all-pairs shortest paths in the input graph G.
Let G contain |V | vertices and |E| edges. The time complexity
CCS CONCEPTS of Brandes algorithm is O (|V ||E|) and O (|V ||E| + |V | 2loд|V |) for
unweighted and weighted graphs, respectively. Unfortunately, such
•Theory of computation →Graph algorithms analysis; Ap-
cost is prohibitive for real-life graphs. For example, for a relatively
proximation algorithms analysis; Parallel algorithms; Shortest
small graph with 4M vertices and 34M edges, the algorithm would
paths;
require more than a year on a typical workstation1 . Moreover,
many networks evolve dynamically [8], rendering the computation
KEYWORDS of updates infeasible in practice.
Social Networks, Betweenness Centrality, Approximation Algo- To mitigate the cost of computing exact betweenness centrality,
rithms, Experimental Evaluation many approximation algorithms have been introduced. Instead
of computing all-pairs shortest paths, they either consider only a
subset of them [1, 7, 13, 31, 32], or approximate the shortest path
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed computation itself [11, 30]. There exist numerous variations of each
for profit or commercial advantage and that copies bear this notice and the full citation approach. Nevertheless, assessing their accuracy and efficiency is
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
difficult. Because of the unavailability of large-scale ground truth
to post on servers or to redistribute to lists, requires prior specific permission and/or a data, accuracy is typically calculated on tiny inputs, with a few
fee. Request permissions from [email protected]. thousands of vertices and edges. Even if the algorithms provide
SSDBM ’17, Chicago, IL, USA
1 Serial implementation of Brandes algorithm on a 64-bit machine with Intel Xeon
© 2017 ACM. 978-1-4503-5282-6/17/06. . . $15.00
DOI: http://dx.doi.org/10.1145/3085504.3085510 X5550 @ 2.67GHz CPU and 192GB RAM
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

theoretical bounds, these are typically loose. Therefore, the reported 1 2 3


results are not guaranteed to be representative for large, real-life
graphs.
Motivated by the importance of betweenness centrality in many 4 5
applications, the necessity of utilizing approximation algorithms
and the lack of a golden standard to evaluate them, in this paper 6 7 8
we propose a benchmark for approximate betweenness centrality,
called BeBeCA2 . Our benchmark contains:
(1) pre-calculated exact betweenness centrality values for many Figure 1: Graph G contains two shortest paths from node 2
large real-life graphs; to node 4; node 1 lies on one of them, so σ24 =2 and σ24 (1)=1
(2) an evaluation methodology consisting of several relevant
accuracy metrics;
(3) a collection of implementations of the most representative years of continuous calculations. We publish the computed ground
existing approximation algorithms; and truth values as part of our benchmark, BeBeCA.
(4) all necessary scripts to automate the evaluation. We also propose an evaluation methodology that considers the
following metrics:
BeBeCA can be used by researchers to evaluate their novel algo-
rithms in a systematic manner, or by practitioners to select the most • Execution time.
appropriate existing approximation algorithm, depending on their • Absolute accuracy, represented by the maximum and aver-
application and characteristics of their datasets. To the best of our age error between the exact and approximate betweenness
knowledge, this is the first effort toward standardizing the process centrality values.
of evaluating betweenness centrality approximation algorithms. • Node ranking accuracy, measured by the Kendall-tau dis-
Our contributions can be summarized as follows: tance.
• Top-1% accuracy.
Survey of approximate methods: We present an extensive sur-
• Accuracy for specific nodes.
vey of existing approximation algorithms for betweenness central-
ity. Most of them are variations of the exact Brandes algorithm [5], These metrics encapsulate the requirements of user applications
which performs breadth-first and reverse breadth-first traversals and, consequently, the practical usefulness of the approximation.
(BFT ) from all vertices in the graph and accumulates the contribu- We publish the scripts for calculating the metrics as part of BeBeCA.
tions of all node-pairs to the betweenness centrality of each vertex. Experimental evaluation: As a case study of our benchmark, we
We identify three categories of approximation algorithms: select 7 representative approximate methods and 5 representative
Source sampling: Instead of performing BFTs from every ver- graphs. We perform extensive experimental evaluation and discuss
tex, the methods in this category first sample a subset of the trade-offs of each method. We observed that, in general, source
vertices and start BFTs only from that subset, thus, re- sampling methods perform well in terms of accuracy. With the
ducing the number of BFTs; each method differs on the exception of some code provided by the original authors, we imple-
sampling technique [1, 7, 9, 13, 19, 25, 28]. mented most of the methods on a common platform for fairness.
Node-pair sampling: This category first samples pairs of ver- All our implementations are available online.
tices. Then, they approximate the betweenness centrality The remainder of this paper is organized as follows: In Section
of each graph node by taking into account only the contri- 2 we introduce the necessary background. Section 3 contains the
butions of the sampled vertex pairs. Each method uses a survey of existing approximation algorithms following our catego-
different sampling technique [3, 4, 16, 27, 31, 32, 39]. rization. In Section 4 we describe the datasets and metrics of the
Bounded traversal: This category reduces the cost of the shortest- proposed benchmark. The experimental results of our case study
path algorithm by bounding the number of hops of the BFTs are presented in Section 5, whereas, Section 6 concludes the paper.
and approximating the actual shortest distance [11, 20, 23,
29, 30]. 2 PRELIMINARIES
Benchmark and golden standard: We select 20 real-life graphs A graph G = (V , E) is composed of a set of nodes V and a set of
to obtain the ground truth betweenness centrality values. The edges E ⊆ V × V . For simplicity, we assume that G is unweighted,
graphs are characterised by a variety of sizes (i.e., number of ver- undirected and connected.
tices and edges), average node degrees and diameters. The largest The betweenness centrality [5] value BC G [v] of node v in graph
graph contains 18M nodes and 126M edges, which is orders of mag- G is the fraction of the shortest paths between all pairs of nodes in
nitude larger than any graph reported in the literature. To tackle the graph that pass through v. Formally:
the excessive cost of computing the exact values, we implemented a X σst (v)
parallel version of the Brandes algorithm and executed it on a Cray BC G [v] = (1)
σst
s,t ∈V
XC40 supercomputer, using 96,000 CPU cores and roughly 400TB s,t ,v
of RAM. We spent 2M core hours; for comparison, a high-end work-
where σst is the number of shortest paths from node s to node t, and
station (24-core CPU, 256GB RAM) would have required around 10
σst (v) is the number of shortest paths from s to t that pass through
2 BeBeCA: Benchmark for Betweenness Centrality Approximation v. For example in Figure 1, σ24 =2, σ24 (1)=1, and BC G [1]=0.5.
A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs SSDBM ’17, June 27-29, 2017, Chicago, IL, USA

1 of the input graph. Such computational complexity is prohibitive


1:
σ12 = 1 2:
σ14 = 1 for large graphs with millions of nodes. To this end, approximation
P1 (2) = {1} P1 (4) = {1}
algorithms were introduced. Such methods trade the accuracy of the
2 4 13: δ 1• (4) = 3.5
14: δ 1• (2) = 1.5 betweenness centrality values for a reduced number of traversals
σ13 = 2 σ16 = 1 (reducing the number of traversals also reduces execution cost). The
3: 4:
P1 (3)={2, 4}
3 6
P1 (6) = {4} ideas incorporated by the state-of-the-art approximation methods
12: δ 1• (3) = 2 11: δ 1• (6) = 1 are described in the following section.
σ15 = 2 σ18 = 2 σ17 = 1
5: 6: 7:
P1 (5)={3} 5 P1 (8)={3} 8 7 P1 (7) = {6} 3 APPROXIMATION METHODS
10: δ 1• (5)=0 9: δ 1• (8) = 0 8: δ 1• (7) = 0 The exact betweenness centrality values are computed by the pair
dependencies δst of all pairs of nodes (s, t ) of the input graph
Figure 2: Computing δ 1• (v) (illustrations of Example 2.1. (Equation 3). The general idea of approximation is to reduce the
The breadth-first search DAGs of node 1 in the example computational cost by using a subset of the pair dependencies (in-
graph of Figure 1. stead of the complete set required by the exact computation). The
key difference between approximation methods is the mechanism
used to select a sample (i.e., choosing the specific pairs of nodes to
The fastest known algorithm for computing betweenness central- form the particular subset). Such methods can be categorized into
ity of the nodes of a graph was proposed by Brandes [5]. For any the following three categories (see also Table 1).
triplet of nodes s, t, and v in V , Brandes defines pair dependency,
denoted by δst (v), and source dependency, denoted by δs• (v) as: Source sampling methods (described in Section 3.1) form
the subset by selecting all pairs starting from several source
σst (v) X
nodes.
δst (v) = and δs• (v) = δst (v) (2)
σst Node-pair sampling methods (described in Section 3.2) form
t ∈V
t ,v
the subset by selecting several pairs of nodes.
respectively. Using the above notions, Equation 1 can be equiva- Bounded traversal methods (described in Section 3.3) form
lently written as: the subset by selecting all pairs with a specific shortest dis-
tance starting from several source nodes.
XX X
BC G [v] = δst (v) = δs• (v) (3)
s ∈V t ∈V s ∈V
s,v t ,v s,v 3.1 Source sampling
Brandes proved that δs• (v) can be computed using the following Source sampling methods initially sample several nodes of the
formula: X σsv graph, commonly referred to as pivots. Briefly, these methods com-
δs• (v) = · (1 + δs• (w )) (4) pute all pair dependencies that start from the selected pivots and
σsw
v ∈Ps (w ) use these values to approximate betweenness centrality values.
where σi j is the number of shortest paths from i to j and Ps (w ) is More specifically, if P is the set of pivots, source sampling meth-
the list of parents of w in the breadth-first search DAG of s. Brandes ods compute the source dependencies δp• (v) for all pivots p ∈ P
performs a two-phase process. The first phase executes breadth-first and all graph nodes v ∈ V . Following, pivot source dependencies
search initiated from s to compute σst and Ps (t ) for all nodes t ∈ V are scaled up to approximate the source dependencies values of
with s , t. The second phase performs reverse breadth-first search non-pivot nodes (i.e., δu• (v) for all u ∈ V − P and all graph nodes
(i.e., from the leaves to the root) and uses Equation 4 to compute v ∈ V ). Typically, the scale up is performed by multiplying pivot
δs• (v). The process is highlighted in the following example: source dependencies by |V |/|P |. Finally the computation of the
betweenness centrality values is performed using Equation 3.
Example 2.1. Consider graph G of Figure 1. The computation
of δ 1• (v) for all v ∈ V is illustrated in Figure 2. The first phase of Initial methods. Brandes and Pich [7] explored and evaluated the
Brandes algorithm performs breadth-first search and computes σ1v following different strategies for selecting pivots.
and P1 (v) (these results are illustrated in light gray boxes num- • The first strategy selects pivots uniformly at random. This
bered according to the search order). For instance, in the 3rd step, strategy is simple, easy to implement and totally unbiased.
we have σ13 =2 and P1 (3)={2, 4}. The second phase performs re- • The second strategy selects as pivots the nodes with the
verse breadth-first search and uses σ1v , P 1 (v) and Equation 4 to highest degree since such nodes are likely to be middle
compute δ 1• (v) (these results are illustrated in dark gray boxes nodes in many shortest paths.
numbered according to the search order). For instance, in the 12th • The third strategy selects pivots having maximum distance
step, δ 1• (3)= σσ13
15
(1+δ 1• (5)) + σσ13
18
(1+δ 1• (8))=2. in order to spread the selection in the whole graph.
To summarize, computing exact betweenness centrality values • Selecting pivots having large distance may result in over-
with Brandes algorithm requires a graph traversal starting from estimating betweenness centrality values [7]. Thus, the
every node of the graph. Each graph traversal takes linear time in fourth strategy selects pivots having minimum distance.
the number of nodes |V | and the number of edges |E| of the input Brandes and Pich [7] evaluated all the above selection strategies.
graph. Thus, since |V | traversals are required to produce the exact Highest degree and maximum and minimum distance strategies do
results, the total time complexity is quadratic in the number of nodes not show a consistent behavior. Their performance highly depends
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

Table 1: Summary of the presented approximation methods (methods presented in bold are evaluated in Section 5)

Category Method Main idea


RAND1 [7] Uniform random sampling.
Brandes and Pich [7] Pivots with highest degree.
Brandes and Pich [7] Pivots with maximum distance.
Brandes and Pich [7] Pivots with minimum distance.
Source RAND2 [13] Uniform random sampling with lower contributions for near to pivots nodes.
sampling Khopkar et al. [19] Non-uniform random sampling based on proximity to previously selected pivots.
GSIZE [1] Progressive sampling based on the number of nodes in the graph.
Chehreghani [9] Random sampling with different probabilities assigned to the graph nodes.
Lim et al. [25] Random walks with expansion based sampling.
Maiya and Berger-Wolf [28] Expansion sampling via crawling using BFS, DFS, and random walks.
ABRA [32] Progressive sampling based on Rademacher averages.
Node-pair Ji and Yan [16] Combining ABRA [32] and GSIZE [1] for estimating certain vertex value.
sampling DIAM [31] Sampling paths based on the graph vertex diameter.
Bergamini et al. [4] Built on top of DIAM [31] and supports only incremental updates on a graph.
Bergamini and Meyerhenke [3] Built on top of DIAM [31] and supports fully-dynamic graphs.
Yoshida [39] Finding top nodes using hypergraphs from all shortest paths between the node-pairs.
Mahmoody et al. [27] Finding top nodes using hyperedges from any shortest path between the node-pairs.
KPATH [30] K-Bounded distance shortest path calculation.
Bounded Ostrowski [29] Bounded-distance of betweenness centrality following MapReduce paradigm.
traversal EGO [11] Calculating centrality based on direct neighbors.
Kourtellis et al. [20] Sampling random walks with k bounded steps.
Lee et al. [23] Approximate values on subgraph where the graph’s update occurs.

4 6 centrality value of node 2 (that is close 1 and has degree two), ini-
tial sampling methods compute δ 1• (2) = 5 and scale it up using
1 2 3 |V |/|P | = 7/1 = 7 where V and P is the set of nodes and piv-
ots respectively. Overall the betweeness centrality value of node
2, is approximated by 5 · 7 = 35. Compared with the exact be-
5 7
tweeness centrality value of node 2, which is 5, the approximation
overestimates by a factor of |V |/|P | = 7/1 = 7. Although this ex-
Figure 3: Example of overestimating node 1 using RAND1 ample shows an extreme overestimation case, Geisberger et al. [13]
showed that it is rather common particularly for the nodes lying
close to pivots.
on the input graph. Particularly, common imbalances in a graph’s To address this overestimation problem, Geisberger et al. [13]
structure may result in poor pivot choices [7]. There are even cases and Khopkar et al. [19] introduce ideas that reduce the contribution
where the introduction of additional pivots decreases the quality of of pivots to the betweenness centrality values of nodes in their
the betweenness centrality approximation. proximity. To this end, Geisberger et al. [13] determine the contri-
The experimental analysis of [7] showed that uniform random bution of the node in computing the betweenness centrality based
sampling forms the best selection strategy. The random selection on the distance from the sampled pivots; nodes closer to sampled
of the pivots allows the strategy to handle structural imbalances of pivots have lower contribution. Geisberger et al. [13] introduced
the input graph and avoids the drawbacks of the other strategies. and evaluated several scaling functions that consider the distance
In total, uniform random sampling is a simple method that is both to the sampled pivots and showed that better results were achieved
effective and consistent in all tested graphs. For these reasons, it using the linear scaling function f (x ) = x, where x is the distance
is used in several studies as a comparison method [13, 19, 20, 31]. from a node to a pivot.
Thus, we have also decided to include the uniform random source In the same spirit, Khopkar et al. [19] introduced a non-uniform
sampling selection in our evaluation framework (Section 4). This random sampling method that discourages near to pivot nodes form
method is referred to as RAND1 (see Table 1). being sampled. To this end, they incorporate a penalty box that
contains nodes that cannot be sampled. Nodes near the sampled
Different contributions sampling. Initial sampling methods pivots are put in the penalty box. If a node stays in the penalty
tend to overestimate betweenness centrality values for nodes that box for more than k subsequent sampling iterations (where k is a
are near to the sampled pivots [13, 19]. For example, consider parameter) it is released and can be used as a pivot in a following
Figure 3 and assume that we only sample as a pivot node 1 (with sampling iteration.
degree one). To compute the approximation of the betweeness
A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs SSDBM ’17, June 27-29, 2017, Chicago, IL, USA

Algorithm: GSIZE as computing the exact betweenness centrality values [9]. As an


Input: Graph G (V , E ), approximation constant α and an arbitrary alternative, Chehreghani [9] proposed to sample pivots k with the
node v ∈ V a probability pk , defined as:
Output: The approximate betweenness centrality values BC G for all 1
nodes of graph G d (k,v )
pk = P |V | (6)
1
1 Set s ← 0 and k ← 0 j=1 d (j,v )
2 while s ≤ α · |V | do
which can be computed efficiently, but does not have the same
3 Sample a pivot w ∈ V and set k ← k + 1
guaranties.
4 For all nodes u ∈ V , compute all the dependencies δw • (u ) and
set BCG (u ) ← BCG (u ) + δw • (u ) Online sampling. Finally, there are research efforts that com-
5 Set s = s + δw • (v ) pute the approximate betweeness centrality values by using online
6 for all u ∈ V do source sampling methods. Such methods are applicable in cases
7 BCG (u ) ← BCG (u ) · |V | where full access to the graphs is not possible. For instance, Lim
k
et al. [25] and Maiya and Berger-Wolf [28], sample pivots on the
8 return all BCG values
available set of nodes, and as the graph expands, new nodes are
added as potential pivots. Such methods are beyond the scope of
our evaluation.
In our evaluation framework (Section 4), we have included Geis-
berger et al. [13] method with the linear scaling function. This 3.2 Node-pair sampling
method is referred to as RAND2 (see Table 1). The general idea behind node-pair sampling is to sample pairs of
Graph size sampling. Bader et al. [1] proposed an approximation nodes at random directly and computing pair dependencies be-
algorithm that is based on progressive sampling and aims to quickly tween them to approximate betweennes centrality values for all
approximate a particular node’s betweenness centrality value. The nodes. Each node-pair sampling approximation method has a dif-
proposed algorithm keeps sampling pivots until a stopping con- ferent approach of computing the pair dependencies between the
dition is met. Without changing the stopping condition, we have sampled pairs, and a different way of producing the approximate
slightly modified the original algorithm in order to compute be- betweenness centrality values.
tweenness centrality values of all nodes. The resulting algorithm, Some node-pair sampling methods use the structural properties
named GSIZE, takes as input a graph G, an arbitrary node v and of the input graph to decide the number of pairs to be sampled, and
an approximation constant α and computes the approximated be- then use the shortest paths between sampled pairs to produce ap-
tweenness centrality values BC (u) for the all nodes u of graph G. proximate betweenness centrality values [3, 4, 31]. Other methods
Constant α determines the sample size; increasing α also increases use progressive sampling of node-pairs to reach a certain accuracy
the sample size, and vice versa. Bader et al. [1] showed that α = 5 with some guarantees [16, 32]. In other words, they keep sampling
is an appropriate choice since it gives a good trade-off between the node-pairs and updating betweenness centrality values until a pre-
sample size and the quality of the approximation value in most of defined stopping condition is met. Finally, some methods attempt
the tested graphs. to find the approximate top-k betweenness centrality nodes by pro-
Algorithm GSIZE continues to sample pivots (Lines 2–5) until ducing a hypergraph where a hyperedge is constructed with the
the sum of the dependencies s on the chosen node v (computed nodes along the shortest paths between the sampled node-pairs,
in Line 5) is larger than α · |V |. At the same time, it updates be- and then using the hyperedge degrees of nodes in the resulting
tweeness centrality values (Line 4). When the sample is complete, hypergraph to approximate the betweenness centrality ranking of
the betweeness centrality values are scaled up to approximate the the input graph nodes [27, 39].
remaining pair dependencies (Line 7) and the final approximate Sampling based on graph structural properties. Riondato and
betweeness centrality values are returned (Line 8). Kornaropoulos [31] showed that the number of sampled paths
In our evaluation framework (Section 4), we have included Bader should be determined by the structure of the graph and not by the
et al. [1] method with α = 5. This method is referred to as GSIZE number of its nodes. Therefore, they introduced an approximation
(see Table 1). algorithm that determines the size of the sample based on vertex-
Changing sampling probabilities. Common sampling assigns diameter of the graph, i.e., the length of graphs’ longest shortest
to the pivots of a graph G (V , E), a uniform sampling probability path plus one. Their approach uses the Vapnik-Chervonenkis di-
1/|V |. Chehreghani [9] proposed to assign probabilities in a non- mension [37] to decide the number of node-pairs to be sampled and
uniform fashion and proved that considering a single node v and provide a guarantee on the average error of the produced approxi-
sampling pivots i with the a probability pi , defined as: mate betweenness centrality values.
Given the graph vertex-diameter VD(G), and the allowed addi-
δ •i (v)
pi = P |V | (5) tive error ϵ with probability 1 − δ , we can compute the required
j=1 δ •j (v) sample size using the following formula [31]:
results in the exact betweenness centrality value for the node v c  1
r = 2 log2 (VD(G) − 2) + 1 + ln (7)

(δ •i (v) ia defined analogously to δi• (v), see also Equation 2). Un- ϵ δ
fortunately, computing the probabilities of Equation 5 is as hard where c is a constant typically set to 0.5 [31].
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

Algorithm: AVD Algorithm: ABRA


Input: Graph G (V , E ) Input: Graph G (V , E ), allowed error ϵ and probability δ
Output: The approximated vertex-diameter VD(G ) of graph G Output: The approximate betweenness centrality values BC G for all
1 Choose a node v ∈ V uniformly at random. nodes of graph G
2 Perform a single-source shortest path problem from node v to all 1 Set S to be the initial sample size
other nodes of the graph. 2 while true do
3 Set VD(G ) to be the sum of the lengths of the two longest shortest 3 for j ← 1 to S do
paths initiated from v. 4 Sample distinct pair of nodes v ∈ V and w ∈ V uniformly at
4 return VD(G ) random
5 Compute all shortest paths P between v and w
6 Choose uniformly at random a path between v and w
7 for all nodes z in the chosen path do
Algorithm: DIAM
8 Compute f z (v, w ) = σvσvww(z )
Input: Graph G (V , E ), allowed error ϵ and probability δ
9 Add f z (v, w ) to the set of functions F z
Output: The approximate betweenness centrality values BC G for all
nodes of graph G 10 Use all available sets of functions to check the stopping condition
1 VD(G ) ← AVD(G ) and if it is met exit the while loop
2 r ← ϵc2 ( log2 (VD(G ) − 2) + 1 + ln δ1 )
  11 Set S 0 to the next sample size and S =S 0 − S
3 for i ← 1 to r do 12 for all nodes u in graph G do
Sample distinct pair of nodes v ∈ V and w ∈ V uniformly at BCG (u ) ← ( f )/S
P
4 13
f ∈ Fu
random
5 Compute all shortest paths between v and w 14 return all BCG values
6 Choose uniformly at random a path between v and w
7 for all nodes u in the chosen path do
8 BCG (u ) ← BCG (u ) + r1 In our evaluation framework (Section 4), we have included Rion-
9 return all BCG values dato and Kornaropoulos [31] method with an allowed error ϵ = 0.01
and probability δ = 0.1. This method is referred to as DIAM (see
Table 1).
Progressive sampling. Riondato and Upfal [32] suggested an
Unfortunately, computing the vertex-diameter is as hard as com- approximation algorithm based on progressive sampling. Using the
puting the exact betweenness centrality values of a graph (since analysis of Rademacher Averages and pseudodimension [34, 36],
both require finding all-pairs shortest paths). Thus, Riondato and they set a stopping condition when the desired level of accuracy
Kornaropoulos [31] use Algorithm AVD to efficiently approximate is reached. In a nutshell, giving a finite domain D, a uniformly
the vertex-diameter (G) of a graph G. random and independent sample from that domain S and a family
Riondato and Kornaropoulos [31] proposed Algorithm DIAM of functions F that map the domain to [0,1]; the maximum deviation
that takes as input a graph G, an allowed error ϵ and a probability δ . from the sum of functions on the domain and the sum of functions
The algorithm outputs the approximated betweenness centrality for on the sample could be bounded using the Rademacher average
all the graph’s nodes. The main idea of this algorithm is to compute formula.
the graph’s vertex-diameter (Line 1) and the required sample size Specifically for betweenness centrality computation, the goal is
(Line 2). Following, DIAM samples node-pairs (Line 4) and paths to bound the maximum deviation between the exact betweenness
(Line 6), and then compute the betweenness centrality values for centrality mean and the approximated mean. Assuming a pair of
nodes laid on the sampled paths (Lines 7-8). sampled nodes u and v, for each node w, we can define fw : D →
Bergamini et al. [4] proposed a method based on [31] that sup- IR+ as the function:
ports incremental updates on a graph. They showed that it is pos-
σuv (w )
sible to recompute an upper bound of the diameter after inserting fw (u, v) = (8)
edges in linear time instead of computing the approximate diameter σuv
from scratch, and then sample additional paths accordingly. In the Let F be the set of these functions defined on the pair of nodes u and
same setting, Bergamini and Meyerhenke [3] introduced an algo- v. Using F , we can define our stopping condition and as we sample
rithm that approximates betweenness centrality in fully-dynamic node-pairs and compute the approximated betweenness centrality
graphs supporting both edge insertions and deletions. The pro- values, we keep checking if our desired accuracy is reached or not.
posed algorithm is also based on [31] and provides an alternative However, computing or even estimating Rademacher average is
algorithm for recomputing an upper bound for the vertex-diameter an expensive operation [32]. Therefore, Riondato and Upfal [32]
of the graph. Moreover, Bergamini and Meyerhenke [3] adopt the proposed alternative methods to upper bound the Rademacher aver-
Tuned-SWSF [2] to update the necessary single source shortest age without computing it. With the use of their proposed formula,
paths after an edge update. Overall, the proposed method approxi- they were able to compute the stopping condition efficiently.
mates betweenness centrality values in fully-dynamic graphs with In total, Riondato and Upfal [32] introduced Algorithm ABRA
an error ϵ with probability of at least 1 − δ . that takes as input a graph G, an allowed error ϵ and a probability
A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs SSDBM ’17, June 27-29, 2017, Chicago, IL, USA

δ and outputs the approximate betweenness centrality values of the betweenness centrality of all nodes in an ego network. There-
all nodes of the graph G. The algorithm starts by computing the fore, Everett and Borgatti [11] suggested using all neighbors with
initial sample size S (Line 1). After sampling the required number distance one or two hops.
of node-pairs and paths, it computes Equation 8 (Line 8) for all In addition to the previously mentioned methods, Kourtellis et
nodes that lie on the sampled paths (Line 7). These are used to al. introduced a new metric called k-path centrality that is based
compute the stopping condition (Line 10). If the stopping condition on random walks with at most k steps [20]. They reported that the
is satisfied, the algorithm computes and returns the betweenness nodes with high k-path centrality values also have high between-
centrality values (Line 13). Otherwise, it increases the sample size ness centrality values. Lee et al. [23] showed that the exact or the
and samples new node-pairs and paths (i.e., repeats the While loop approximate betweenness centrality values of a modified graph
of Line 2-11). could be found by recalculating the betweenness centrality values
Note that there are two critical points that may affect the per- on the portion of the graph where the update occurred.
formance dramatically; the initial sample size (Line 1) and the In our evaluation framework (Section 4), we have included Pfeffer
incremental factor of the sample size at each iteration (Line 11). and Carley [30], and Everett and Borgatti [11] methods, referred to
These two factors affect the performance because expensive compu- as KPATH and EGO, respectively (see Table 1).
tations are performed whenever the stopping condition is checked.
If we increase the sample size dramatically, we might end up having 4 OUR BENCHMARK: BeBeCA
unnecessary samples. On the other hand, if we increase the sample In this section we present BeBeCA, our benchmark for the evalua-
size slightly, we will have too many iterations with too many ex- tion of approximate betweenness centrality algorithms. BeBeCA
pensive operations to check the stopping condition. Riondato and consists of:
Upfal [32] proposed formulas to compute the initial size and the
• 20 real-world graphs with different sizes, average degrees
incremental factor in each iteration.
and diameters; their characteristics cover a wide range of
In our evaluation framework (Section 4), we have included Rion-
real applications.
dato and Upfal [32] method with an allowed error ϵ = 0.01 and
• The exact betweenness centrality values for all the above
probability δ = 0.1. This method is referred to as ABRA (see Table
graphs. These values form the golden standard that is
1).
needed to assess the accuracy of the tested methods.
Other approaches. All previously mentioned node-pair sampling • A set of metrics that allow us to assess various aspects of
methods are meant to approximate the betweenness centrality val- the approximation efficiency and effectiveness: execution
ues for all nodes. Ji and Yan [16] introduced an approximation time, absolute accuracy (i.e., maximum and average error),
algorithm that combines both techniques introduced by Bader et whole graph ranking and top-1% ranking accuracy, plus
al. [1], and Riondato and Upfal [32] to have a progressive sampling accuracy for specific nodes.
approach for approximating betweenness centrality for a specific • Implementations of the most representative state-of-the-
node. art approximation methods, and the scripts that calculate
the above-mentioned evaluation metrics.
BeBeCA can be used by researchers to evaluate any betweenness
centrality approximation method in a systematic way, or by practi-
3.3 Bounded traversal tioners to select the most appropriate method for the characteristics
of their applications and datasets. BeBeCA is available online3 .
The general idea behind bounded traversal methods is to sample
pairs of nodes with a specific shortest distance, compute their pair Computing infrastructure: We used two machines:
dependencies and use these values to approximate the betweenness Workstation: We have used a server with twin AMD Opteron-
centrality values. The sum of all dependencies on a node v will 6172 CPUs (24 cores in total) running at 2.1GHz and 192GB
form the approximated betweenness centrality value for that node. of RAM. We used the workstation for all approximate al-
Pfeffer and Carley [30] proposed a bounded traversal method gorithms and for computing the exact values for small
that works similarly to Brandes algorithm, except that all traversals graphs.
are bounded by K steps, i.e., the height of the DAGs considered Supercomputer: We have used a Cray XC40 supercomputer
is K. Given a node v, the intuition behind this approach is that with 6,174 nodes. Each node contains four 8-core Intel
nodes that are distant from v are unlikely to affect the betweenness Haswell CPUs running at 2.3GHz and 128GB RAM. We
centrality value of v. Ostrowski [29] proposed a methodology that used roughly half of the machine, that is 96,000 CPU cores
follows the MapReduce paradigm to enable parallel implementation and almost 400TB of RAM. The supercomputer was used
of bounded-distance approximation of betweenness centrality. for computing the exact betweenness centrality values of
Everett and Borgatti [11] devise a method utilizing the notion all large graphs.
of ego network. The ego network of a node v is a subgraph that
consists of node v and all its direct neighbors [11]. Specifically, 4.1 Datasets
their experiments showed that there is a correlation between the BeBeCA contains 20 real-world datasets (see Table 2) downloaded
betweenness centrality value of a node in its ego network and its from SNAP [24] and Konect [22] repositories. They cover a variety
betweenness centrality value in the whole graph. However, con-
sidering only direct neighbors will not give a good indicator of 3 http://bit.ly/BeBeCa
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

Table 2: Summary of datasets. The bolded datasets are used for evaluation in Section 5.

Graph |V | |E| Average degree Diameter Exact result computed on


ca-GrQc 4,158 13,428 6.46 17 Workstation
ca-HepTh 8,638 24,827 5.75 17 Workstation
ca-HepPh 11,204 117,649 21.00 13 Workstation
ca-AstroPh 17,903 197,031 22.01 14 Workstation
ca-CondMat 21,363 91,342 8.55 14 Workstation
email-Enron 33,696 180,811 10.73 11 Workstation
com-DBLP 317,080 1,049,866 6.62 21 Supercomputer
com-Amazon 334,863 925,872 5.53 44 Supercomputer
com-Youtube 1,134,890 2,987,624 5.27 20 Supercomputer
flickr-links 1,624,992 15,472,576 19.04 24 Supercomputer
as-skitter 1,694,616 11,094,209 13.09 31 Supercomputer
Amazon 2,146,057 5,743,146 5.35 28 Supercomputer
Wiki-Talk 2,388,953 4,656,682 3.90 9 Supercomputer
orkut-links 3,072,441 117,185,083 76.28 10 Supercomputer
youtube-u-growth 3,216,075 9,369,874 5.83 31 Supercomputer
cit-Patents 3,764,117 16,511,741 8.77 26 Supercomputer
com-LiveJournal 3,997,962 34,681,189 17.35 17 Supercomputer
Dblp 4,000,148 8,649,011 4.32 50 Supercomputer
livejournal-links 5,189,809 49,151,786 18.94 23 Supercomputer
dbpedia-link 18,265,512 126,888,089 13.89 12 Supercomputer

of domains and properties. We extract the largest connected com- discussed in the literature so far. We publish the calculated exact
ponent in each graph and perform our analysis on it. The ground values as a golden standard for the evaluation of existing and future
truth for all 20 datasets is included in our benchmark. However, approximation algorithms. It is worth noting that we spent 2M
for clarity, in the experimental section of this paper we consider core hours for the calculations; for comparison, this corresponds to
only 5 representative datasets (highlighted in bold in Table 2). The roughly 10 years of calculations on a high-end workstation with 24
following are brief descriptions of the chosen datasets: CPU cores and 256GB RAM.
ca-GrQc is a collaboration network between authors in arXiv, pub- Our parallel Brandes algorithm implementation works as follows:
lished under category “General Relativity and Quantum Cosmol-
(1) The sources of the input graph are divided uniformly among
ogy”. Nodes are authors, whereas edges represent co-authorship.
available threads.
The data covers all papers between 1993 and 2003.
(2) Each thread solves the single-source shortest path problem
email-Enron captures email communications. Nodes are email (SSSP) from its assigned source and computes the contri-
addresses, wheras edges represent email exchange between two bution of these sources to the betweenness centrality of all
addresses. nodes in the graph.
com-Amazon is a customer-product network created by crawling (3) The results is accumulated and summed by the master
the Amazon online store. Nodes are customers and edges denote thread to have the exact betweenness centrality values for
that two customers bought the same product. all nodes.
com-LiveJournal is a social network where nodes are users and
edges are friendships. 4.3 Evaluation metrics
dbpedia-link is a hyperlink network created from Wikipedia; In this section, we will detailed the measures incorporated by our
pages are nodes and hyperlinks are edges. evaluation framework.
Execution time. To measure the efficiency of the tested algorithm,
4.2 Golden standard: Exact values our framework incorporates wall-clock time.
In order to evaluate the various accuracy metrics for the approxi- Accurancy. To evaluate the accuracy of the tested approximation
mation algorithms, it is necessary to compare against the ground algorithm, we measure the maximum and the average error between
truth. However, this is a difficult issue because of the excessive the exact and the approximate betweenness centrality values. Since
cost of computing the ground truth for large realistic graphs. To there are nodes with betweenness centrality equal to zero in all the
solve this problem, we used 96,000 CPU cores with 400TB RAM chosen datasets, we do not measure the minimum error because
on a supercomputer to compute the exact betweenness centrality all approximation algorithms do not overestimate nodes with be-
values of the above-mentioned 20 datasets. The largest one contains tweenness centrality values equal to zero. Thus, the maximum and
126M edges; that is 3 orders of magnitudes larger than any graph the average error suffice to assess the deviation between the exact
A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs SSDBM ’17, June 27-29, 2017, Chicago, IL, USA

and the approximate betweenness centrality values and, thus, the GSIZE. The implementation of GSIZE is also based on our imple-
accuracy of the tested method. mentation of Brandes’ algorithm. However, instead of sampling all
Ranking. Many applications use the ranking of the nodes based the pivots from the beginning, we use an adaptive sampling where
on the betweenness centrality values rather than the actual be- we sample until we meet a certain stopping condition then we ter-
tweenness centrality values. To measure the quality of the ranking minate. Therefore, we modified our Brandes’ implementation such
produced by the tested method, we use the Kendall-tau distance that it will take as an additional arguments node v and constant
[17, 18] to compare it with the ranking produced by the exact C, and then it will keep sampling pivots until the sum of source
method. Kendall-tau distance [17, 18] calculates the number of dependencies on v is larger than C |V |. Meanwhile, we compute and
disagreements between the rankings of the compared methods. maintain the sum of dependencies for all nodes in the input graph.
The value of Kendall-tau distance is in the range [-1,1], where 1 After that, for any node v, we scale up our betweenness centrality
means that the two rankings are in total agreement, while -1 means approximation by multiplying the sum of dependencies by |V |
|P | ,
that the two rankings are in complete disagreement. For instance, where |P | is the number of sampled pivots, to find its approximate
say we have a graph with n = 5 nodes A, B, C, D and E. Let their betweenness centrality value.
exact and approximate rankings be τ1 (A, B, C, D, E) = (1, 2, 3, 4, 5) DIAM. For the implementation of DIAM, we used the available
and τ2 (A, B, C, D, E) = (1, 5, 2, 4, 3), respectively. Let the number of parallel implementation by the authors in NetworKit library [35].
concordant pairs be α = 6 and the number of discordant pairs be
ABRA. For the implementation of ABRA, we used the available
β = 4. The Kendall-tau distance for the two rankings is:
parallel implementation by the authors in NetworKit library [35].
α −β KPATH. According to Brandes [6], KPATH can be implemented by
K (τ1 , τ2 ) = 1 = 0.2 (9)
2 n(n − 1) modifying Brandes’ algorithm by backtracking whenever a node of
K distance is reached. Therefore, we modified our implementation
Identifying top 1%. Some applications require only the nodes of Brandes’ algorithm to implement KPATH algorithm. For the
with highest betweenness centrality values. Therefore, we measure value of K, we choose it to be 20% of the diameter of the graph.
how many nodes the approximation methods managed to identify EGO. For the implementation of EGO, we followed the same ap-
in the top 1% set of nodes with the highest betweenness centrality proach as KPATH, except that we traverse at most two hops from
values. any node instead of K steps.
Betweenness centrality for specific nodes. Some approxima-
tion algorithms showed that they could estimate certain nodes’ However, we need to emphasize that all of the above implemen-
betweenness centrality values in very limited time. Therefore, we tations of the approximation algorithms are serial except for DIAM
use these algorithms to estimate the betweenness centrality values and ABRA where the available source codes by the authors run in
of some nodes and compare these values with the respective val- parallel.
ues that were produced by other approximation algorithms that
estimate all nodes’ values. We measure the average error for ap- 5 EXPERIMENTS
proximating the chosen nodes’ values.
In this section, we compare the selected approximation methods
(Section 3) using our evaluation methodology (Section 4). We show
4.4 Approximation algorithms the results for 5 real world datasets (Section 4.1) . Note that we set
Our framework includes the implementations of seven methods a limit of 7 days on the execution time for an algorithm. So if an
for the approximate computation of betweenness centrality values, algorithm does not finish within 7 days, we terminate it and we do
namely RAND1, RAND2, GSIZE, DIAM, ABRA, KPATH and EGO. not include its results in our comparisons.
As discussed in Section 3 the chosen methods are representative
of the related state-of-the-art literature. Let us now briefly discuss
the implementation of these methods. Note that for all tested al- 5.1 Runtime
gorithms, we have used their original implementation whenever As we can see in Figure 4, the runtime follows the same behavior
it was publicly available or it was kindly provided by the authors, for all approximation methods; the runtime increases as the dataset
otherwise, we have implemented them using C++. size increases.
We can see that GSIZE outperforms almost by a factor of 10,
RAND1. To implement RAND1, we first implemented Brandes’
all other approximation methods and terminates within a limited
algorithm. Then, instead of using all nodes of the graph, pivots are
time in all datasets. On the other hand, EGO and KPATH outper-
sampled uniformly at random and Brandes’ algorithm is executed
form most approximation algorithms for the smallest dataset, but
from the selected pivots. The sums of dependencies are kept for all
their running time increase dramatically as the size of the dataset
nodes in the input graph. Finally, for any node v, the betweenness
increases and they become the worst. EGO and KPATH did not
centrality approximation is scaled up by multiplying the sum of
finish within 7 days for the largest datasets com-LiveJournal and
dependencies by |V |
|P | , where |P | is the number of sampled pivots to dbpedia-link, so we terminated them. The data points for EGO and
find its approximate betweenness centrality value. KPATH are not available in all the figures. DIAM did not finish
RAND2. For the implementation of RAND2, we used the available within 7 days for the largest dataset, and its data points for that
source code in NetworKit library [35]. dataset are not available in all figures as well.
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

107
RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO
10-1
106

105
Time(seconds)

104

Max Error
10-2

103

102

101 10-3

100
c

on

on

nk

on

on

nk
na

na
rQ

rQ
-li

-li
nr

az

nr

az
ur

ur
-G

-G
ia

ia
l-E

l-E
m

m
Jo

Jo
ed

ed
ca

ca
-A

-A
ai

ai
ve

ve
p

p
em

em

m
db

db
i

i
-L

-L
co

co
m

m
co

co
Dataset Dataset

Figure 4: Runtimes of approximation algorithms. Figure 6: Max error of approximation algorithms.

10-2 RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO


RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO
1
10-3
0.9

0.8
10-4
Kendal Tau

0.7
Avg. Error

10-5 0.6

0.5

10-6 0.4

0.3
10-7
0.2

0.1
10-8
c

on

on

al

k
rQ

lin
rn
nr

az
c

on

on

al

a-
-G

ou
rQ

lin

l-E
rn

m
nr

az

di
ca
a-

eJ
-G

ou

-A

pe
ai
l-E

di

iv
ca

em

m
eJ
-A

db
pe
ai

-L
co
iv
em

m
db
-L
co

co
m
co

Dataset
Dataset

Figure 5: Average error of approximation algorithms. Figure 7: Kendall correlation coefficient for approximation
algorithms.

As the dataset size increase, the differences in the runtime be- approximation algorithms, so it offers a good trade off between
come more clear. Hence, by taking into consideration that DIAM accuracy and speed.
and ABRA ran in parallel, the approximation algorithms perfor- On the other hand, KPATH and EGO provide the worst accuracy
mances could be generally, in all datasets, ranked as follows: GSIZE, among all other approximation algorithms in all datasets, since
RAND2, RAND1, ABRA, DIAM, EGO then KPATH. they result in the highest maximum and average error.

5.2 Max and average error 5.3 Ranking


To measure the approximation accuracy for all nodes, we measured As mentioned earlier, Kendall-tau distance is used to evaluate the
the average and maximum error between the exact and the approxi- nodes’ rankings produced by the approximate algorithms to the
mated betweenness centrality values. As we can see in Figures5 and exact rankings produced by Brandes’ algorithm.
6, RAND1, RAND2, DIAM and ABRA perform almost similarly on As shown in Figure 7, all approximation algorithms produce
the datasets ca-GrQc and email-Enron. However, as we move to the good nodes’ rankings in the smallest dataset, ca-GrQc. However, as
larger datasets, com-Amazon, com-LiveJournal and dbpedia-link, the dataset size increases, we can see that KPATH, EGO, DIAM and
we can see that RAND2 provides more accurate results with smaller ABRA ranking quality decreases. Furthermore, DAIM and ABRA
average and maximum error in general. ranking quality becomes the worst among all other approximation
Furthermore, GSIZE provides less accurate approximation in algorithms, especially in com-Amazon and com-LiveJournal.
general than RAND1, RAND2, DIAM and ABRA but better than Although ABRA’s ranking quality improved in the largest dataset,
KPATH and EGO. However, it is almost 10 times faster than all it is still outperformed by all the approximation methods. When
A Benchmark for Betweenness Centrality Approximation Algorithms on Large Graphs SSDBM ’17, June 27-29, 2017, Chicago, IL, USA

RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO RAND1 RAND2 DIAM ABRA GSIZE KPATH EGO
10-1
100

90

80 10-2
Top 1% Hits

Avg. Error
70

60
10-3

50

40
10-4
c

on

on

nk

on

on

nk
na

na
rQ

rQ
-li

-li
nr

az

nr

az
ur

ur
-G

-G
ia

ia
l-E

l-E
m

m
Jo

Jo
ed

ed
ca

ca
-A

-A
ai

ai
ve

ve
p

p
em

em

m
db

db
i

i
-L

-L
co

co
m

m
co

co
Dataset Dataset

Figure 8: Percentage of top 1% nodes identified by approxi- Figure 9: Average error of selected nodes.
mation algorithms.

we investigated this case, we found that dbpedia-link has lots of As we can see in Figure 9, for all datasets, KPATH and EGO
nodes with the same betweenness centrality values. Therefore, provide the worst approximation for the chosen nodes. On the other
these nodes will always be ranked by their node IDs, and this will hand, RAND1, RAND2, DIAM and ABRA produce more accurate
improve the overall ranking quality of the approximation method. approximations than GSIZE.
On the other hand, RAND1, RAND2 and GSIZE provide high Therefore, GSIZE does not provide more accurate estimation
quality rankings consistently for all datasets. However, for ap- for the chosen vertices, but it does give a good estimation in very
plications that care only about the nodes’ ranking, GSIZE is the limited time. As we were observing the behavior of GSIZE, when
best option taking into consideration its running time performance we change its input from c = 5 to c = 10, it will take a longer time
demonstrated previously. with larger sample size, but it will produce a better approximation
for the chosen nodes.
5.4 Identifying top 1%
As we can see in Figure 8, for ca-GrQc, EGO was the worst in
6 CONCLUSIONS
identifying the top betweenness centrality values; it managed to
identify almost half of them. GSIZE and KPATH identified almost The lack of a golden standard benchmark for evaluating approx-
80% of the top ranked nodes. On the other hand, RAND1, RAND2, imate betweenness centrality algorithms made it difficult for re-
DIAM and ABRA performed the best and managed to identify more searchers to compare their algorithms to existing ones, and difficult
than 90% of the top ranked nodes. for users to choose the appropriate approximation algorithm for
For the datasets email-Enron and com-Amazon, GSIZE, KPATH their applications. Motivated by that, we developed the golden stan-
and EGO perform almost similarly and they managed to identify dard BeBeCA in an attempt to standardize and facilitate evaluating
almost 80% of the top ranked nodes. On the other hand, RAND1, betweenness centrality approximation algorithms.
RAND2, DIAM and ABRA still performed better and they managed Our extensive evaluation of existing betweenness centrality ap-
to identify around 90% of top ranked nodes on both datasets. proximation algorithms showed that simple approximation algo-
However, as we move toward the largest dataset, we can see that rithms based on source sampling offer a good trade-off between
RAND2 starts outperforming other approximation algorithms. performance and accuracy. More specifically, algorithms based on
a uniform random sampling of sources perform well consistently
5.5 Betweenness centrality for specific nodes as the graph size grows, with tolerable runtimes, and they outper-
form more sophisticated and recent algorithms both in terms of
GSIZE’s goal is to produce a good estimation of an individual node performance and accuracy.
in very limited time. We have witnessed in the previous subsections Our plans for future work include extending BeBeCA to support
that it has the best runtime performance among the approximation other centrality metrics to facilitate evaluating other metrics on
algorithms, and we want to test if it will give a good estimation for large graphs.
certain nodes we choose at random.
To that end, we select 10 nodes with high betweenness centrality,
and we run GSIZE to approximate their betweenness centrality. We ACKNOWLEDGEMENTS
report in Figure 9 the average error produced by GSIZE for these For computer time, this research used the resources of the Super-
nodes only. For comparison, we include the average error of the computing Laboratory at King Abdullah University of Science and
other approximation algorithms for these nodes. Technology (KAUST) in Thuwal, Saudi Arabia.
SSDBM ’17, June 27-29, 2017, Chicago, IL, USA Ziyad AlGhamdi, Fuad Jamour, Spiros Skiadopoulos, and Panos Kalnis

REFERENCES [22] Jérôme Kunegis. 2013. Konect: the koblenz network collection. In Proceedings of
[1] David A Bader, Shiva Kintali, Kamesh Madduri, and Milena Mihail. 2007. Ap- the 22nd International Conference on World Wide Web. ACM, 1343–1350.
proximating betweenness centrality. In International Workshop on Algorithms [23] Min-Joong Lee, Sunghee Choi, and Chin-Wan Chung. 2016. Efficient algorithms
and Models for the Web-Graph. Springer, 124–137. for updating betweenness centrality in fully dynamic graphs. Information Sciences
[2] Reinhard Bauer and Dorothea Wagner. 2009. Batch dynamic single-source 326 (2016), 278–296.
shortest-path algorithms: An experimental study. In International Symposium on [24] Jure Leskovec and Rok Sosič. 2016. Snap: A general-purpose network analysis and
Experimental Algorithms. Springer, 51–62. graph-mining library. ACM Transactions on Intelligent Systems and Technology
[3] Elisabetta Bergamini and Henning Meyerhenke. 2016. Approximating Between- (TIST) 8, 1 (2016), 1.
ness Centrality in Fully Dynamic Networks. Internet Mathematics 12, 5 (2016), [25] Yeon-sup Lim, Daniel S Menasché, Bruno Ribeiro, Don Towsley, and Prithwish
281–314. Basu. 2011. Online estimating the k central nodes of a network. In Network
[4] Elisabetta Bergamini, Henning Meyerhenke, and Christian L Staudt. 2014. Ap- Science Workshop (NSW), 2011 IEEE. IEEE, 118–122.
proximating betweenness centrality in large evolving networks. In 2015 Pro- [26] Leandros A Maglaras and Dimitrios Katsaros. 2012. New measures for char-
ceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments acterizing the significance of nodes in wireless ad hoc networks via localized
(ALENEX). SIAM, 133–146. path-based neighborhood analysis. Social Network Analysis and Mining 2, 2
[5] Ulrik Brandes. 2001. A faster algorithm for betweenness centrality. Journal of (2012), 97–106.
mathematical sociology 25, 2 (2001), 163–177. [27] Ahmad Mahmoody, Charalampos E Tsourakakis, and Eli Upfal. 2016. Scalable Be-
[6] Ulrik Brandes. 2008. On variants of shortest-path betweenness centrality and tweenness Centrality Maximization via Sampling. arXiv preprint arXiv:1609.00790
their generic computation. Social Networks 30, 2 (2008), 136–145. (2016).
[7] Ulrik Brandes and Christian Pich. 2007. Centrality estimation in large networks. [28] Arun S Maiya and Tanya Y Berger-Wolf. 2010. Online sampling of high centrality
International Journal of Bifurcation and Chaos 17, 07 (2007), 2303–2318. individuals in social networks. In Pacific-Asia Conference on Knowledge Discovery
[8] Kathleen M Carley. 1999. On the evolution of social and organizational networks. and Data Mining. Springer, 91–98.
Research in the Sociology of Organizations 16, 0 (1999). [29] David Alfred Ostrowski. 2015. An approximation of betweenness centrality
[9] Mostafa Haghir Chehreghani. 2014. An efficient algorithm for approximate for Social Networks. In Semantic Computing (ICSC), 2015 IEEE International
betweenness centrality computation. Comput. J. (2014), bxu003. Conference on. IEEE, 489–492.
[30] Jürgen Pfeffer and Kathleen M Carley. 2012. k-Centralities: local approximations
[10] Shlomi Dolev, Yuval Elovici, and Rami Puzis. 2010. Routing betweenness central-
of global measures based on shortest paths. In Proceedings of the 21st International
ity. Journal of the ACM (JACM) 57, 4 (2010), 25.
Conference on World Wide Web. ACM, 1043–1050.
[11] Martin Everett and Stephen P Borgatti. 2005. Ego network betweenness. Social
[31] Matteo Riondato and Evgenios M Kornaropoulos. 2016. Fast approximation of
networks 27, 1 (2005), 31–38.
betweenness centrality through sampling. Data Mining and Knowledge Discovery
[12] Linton C Freeman. 1977. A set of measures of centrality based on betweenness.
30, 2 (2016), 438–475.
Sociometry (1977), 35–41.
[32] Matteo Riondato and Eli Upfal. 2016. ABRA: Approximating betweenness cen-
[13] Robert Geisberger, Peter Sanders, and Dominik Schultes. 2008. Better approx-
trality in static and dynamic graphs with Rademacher averages. arXiv preprint
imation of betweenness centrality. In Proceedings of the Meeting on Algorithm
arXiv:1602.05866 (2016).
Engineering & Expermiments. Society for Industrial and Applied Mathematics,
[33] Gert Sabidussi. 1966. The centrality index of a graph. Psychometrika 31, 4 (1966),
90–100.
581–603.
[14] M. Girvan and M. E. J. Newman. 2002. Community structure in social and
[34] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding machine learning:
biological networks. Proceedings of the National Academy of Sciences 99, 12
From theory to algorithms. Cambridge university press.
(2002), 7821–7826.
[35] Christian Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. 2014. Net-
[15] K-I Goh, Eulsik Oh, Byungnam Kahng, and Doochul Kim. 2003. Betweenness
workit: An interactive tool suite for high-performance network analysis. CoRR,
centrality correlation in social networks. Physical Review E 67, 1 (2003), 017101.
abs/1403.3005 (2014).
[16] Shiyu Ji and Zenghui Yan. 2016. Refining approximating betweenness centrality
[36] Vladimir Naumovich Vapnik. 2000. The nature of statistical learning theory, ser.
based on samplings. arXiv preprint arXiv:1608.04472 (2016).
Statistics for engineering and information science. New York: Springer 21 (2000),
[17] Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2
1003–1008.
(1938), 81–93.
[37] Vladimir N Vapnik and A Ya Chervonenkis. 2015. On the uniform convergence
[18] Maurice George Kendall. 1948. Rank correlation methods. (1948).
of relative frequencies of events to their probabilities. In Measures of Complexity.
[19] Sushant S Khopkar, Rakesh Nagi, and Gregory Tauer. 2016. A penalty box
Springer, 11–30.
approach for approximation betweenness and closeness centrality algorithms.
[38] Stanley Wasserman and Katherine Faust. 1994. Social network analysis: Methods
Social Network Analysis and Mining 6, 1 (2016), 1–13.
and applications. Vol. 8. Cambridge university press.
[20] Nicolas Kourtellis, Tharaka Alahakoon, Ramanuja Simha, Adriana Iamnitchi,
[39] Yuichi Yoshida. 2014. Almost linear-time algorithms for adaptive betweenness
and Rahul Tripathi. 2013. Identifying high betweenness centrality nodes in large
centrality using hypergraph sketches. In Proceedings of the 20th ACM SIGKDD
social networks. Social Network Analysis and Mining 3, 4 (2013), 899–914.
international conference on Knowledge discovery and data mining. ACM, 1416–
[21] Valdis E Krebs. 2002. Mapping networks of terrorist cells. Connections 24, 3
1425.
(2002), 43–52.

You might also like