GreenBST: Energy-Efficient Concurrent Search
Tree
Ibrahim Umar(
)
, Otto Anshus, and Phuong Ha
Department of Computer Science
UiT The Arctic University of Norway
{ibrahim.umar, phuong.hoai.ha, otto.anshus}@uit.no
Abstract. Like other fundamental abstractions for energy-efficient computing, search trees need to support both high concurrency and finegrained data locality. However, existing locality-aware search trees such
as ones based on the van Emde Boas layout (vEB-based trees), poorly
support concurrent (update) operations while existing highly-concurrent
search trees such as the non-blocking binary search trees do not consider
data locality.
We present GreenBST, a practical energy-efficient concurrent search tree
that supports fine-grained data locality as vEB-based trees do, but unlike
vEB-based trees, GreenBST supports high concurrency. GreenBST is
a k-ary leaf-oriented tree of GNodes where each GNode is a fixed size
tree-container with the van Emde Boas layout. As a result, GreenBST
minimizes data transfer between memory levels while supporting highly
concurrent (update) operations. Our experimental evaluation using the
recent implementation of non-blocking binary search trees, highly concurrent B-trees, conventional vEB trees, as well as the portably scalable
concurrent trees shows that GreenBST is efficient: its energy efficiency (in
operations/Joule) and throughput (in operations/second) are up to 65%
and 69% higher, respectively, than the other trees on a high performance
computing (HPC) platform (Intel Xeon), an embedded platform (ARM),
and an accelerator platform (Intel Xeon Phi). The results also provide
insights into how to develop energy-efficient data structures in general.
1
Introduction
Recent researches have suggested that the energy consumption of future computing systems will be dominated by the cost of data movement [12, 34, 35]. It is
predicted that for 10nm technology chips, the energy required between accessing
data in nearby on-chip memory and accessing data across the chip, will differ as
much as 75× (2pJ versus 150pJ), whereas the energy required between accessing on-chip data and accessing off-chip data will only differ 2× (150pJ versus
300pJ) [12]. Therefore, in order to construct energy-efficient software systems,
data structures and algorithms must not only be concerned with whether the
data is on-chip (e.g., in cache) or not (e.g., in DRAM), but must consider also
data locality in finer-granularity: where the data is located on the chip.
·105
2.5 Energy efficiency
·107
Throughput
??CBTree [28]
BSTTK [13]
2
DeltaTree [36]
1.5
1
operations / second
operations / Joule
LFBST [30]
1.5
??CBTree [28]
LFBST [30]
BSTTK [13]
DeltaTree [36]
1
0.5
100% 95% 90% 80% 50%
percentage of search workload
100% 95% 90% 80% 50%
percentage of search workload
Fig. 1: Result of 5 millions tree operations of decreasing search percentage workloads
using 12 cores (1 CPU). DeltaTree’s energy efficiency and throughput are lower than
the other concurrent search trees after 95% search workload on a dual Intel Xeon
E5-2650Lv3 CPU system with 64GB RAM.
Concurrent search trees are crucial data structures that are widely used as a
backend in many important systems such as databases (e.g., SQLite [24]), filesystems (e.g., Btrfs [32]), and schedulers (e.g., Linux’s Completely Fair Scheduler
(CFS)), among others. These important systems can access and organize data
in a more energy efficient manner by adopting the energy-efficient concurrent
search trees as their backend structures.
Devising fine-grained data locality layout for concurrent search trees is challenging, mainly because of the trade-offs needed: (i) a platform-specific locality
optimization might not be portable (i.e., not work on different platforms while
there are big interests of concurrent data structures for unconventional platforms [18, 21]), (ii) the usage of transactional memory [20, 23] and multi-word
synchronization [19, 22, 27] complicates locality because each core in a CPU needs
to consistently track read and write operations that are performed by other cores,
and (iii) fine-grained locality-aware layouts (e.g., van Emde Boas layout) poorly
support concurrent update operations. Some of the fine-grained locality-aware
search trees such as Intel Fast [25] and Palm [33] are optimized for a specific
platform. Concurrent B-trees (e.g., B-link tree [28]) only perform well if their B
size is optimal. Highly concurrent search trees such as non-blocking concurrent
search trees [14, 30] and Software Transactional Memory (STM)-based search
trees [1, 11], however, do not take into account fine-grained data locality.
Fine-grained data locality for sequential search trees can be theoretically
achieved using the van Emde Boas (vEB) layout [15, 31], which is analyzed using
cache-oblivious (CO) models [16]. An algorithm is categorized as cache-oblivious
for a two-level memory hierarchy if it has no variables that need to be tuned with
respect to cache size and cache-line length, in order to optimize its data transfer
complexity, assuming that the optimal off-line cache replacement strategy is used.
If a cache-oblivious algorithm is optimal for an arbitrary two-level memory, the
algorithm is also asymptotically optimal for any adjacent pair of available levels
of the memory hierarchy [9]. Therefore, cache-oblivious algorithms are expected
to be locality-optimized irrespective of variations in memory hierarchies, enabling
less data transfer between memory levels and thereby saving energy.
However, the throughput of a vEB-based tree when doing concurrent updates
is lower compared to when it is doing sequential updates. Inserting or deleting
a node may result in relocating a large part of the tree in order to maintain
the vEB layout. Solutions to this problem have been proposed [7]. The first
proposed solution’s structure requires each node to have parent-child pointers.
Update operations may result in updating the pointers. Pointers will also increase
the tree memory footprint. The second proposed solution uses the exponential
tree algorithm [3]. Although the exponential tree is an important theoretical
breakthrough, it is complex [10]. The exponential tree grows exponentially in
size, which not only complicates maintaining its inter-node pointers, but also
exponentially increases the tree’s memory footprint. Recently, we have proposed
a concurrency-aware vEB layout [36], which has a higher throughput when doing
concurrent updates compared to when it is doing sequential updates. In the same
study, we have proposed DeltaTree, a B+tree that uses the concurrency-aware
vEB layout. We have documented that the concurrency-aware vEB layout can
improve DeltaTree’s concurrent search and update throughput over a concurrent
B+tree [36].
Nevertheless, we find DeltaTree’s throughput and energy efficiency are lower
than the state-of-the-art concurrent search trees (e.g., the portably scalable search
tree [13]) for the update-intensive workloads (cf. Figure 1). Our investigation
reveals that the cost of DeltaTree’s runtime maintenance (i.e., rebalancing the
nodes) dominates the execution time. However, reducing the frequency of the
runtime maintenance lowers DeltaTree’s energy efficiency and throughput for
the search-intensive workloads, because DeltaTree nodes will then be sparsely
populated and frequently imbalanced. Note that DeltaTree energy efficiency and
throughput are already optimized for the search intensive workloads [36, 37].
In this paper, we present GreenBST, an energy-efficient concurrent search tree
that is more energy efficient and has higher throughput for both the concurrent
search- and update-intensive workloads than the other concurrent search trees (cf.
Table 1). GreenBST applies two significant improvements on DeltaTree in order
to lower the cost of the tree runtime maintenance and reduce the tree memory
footprint. First, unlike DeltaTree, GreenBST rebalances incrementally (i.e., finegrained node rebalancing). In DeltaTree, the rebalance procedure has to rebalance
all the keys within a node and the frequency of rebalancing cannot be lowered
as they are necessary to keep DeltaTree in good shape (i.e., keeping DeltaTree’s
height low and its nodes are densely populated). Incremental rebalance makes
the overall cost of each rebalance in GreenBST lower than DeltaTree. Second, we
reduce the tree memory footprint by using a different layout for GreenBST’s leaf
nodes (heterogeneous layout). Reduction in the memory footprint also reduces
GreenBST’s data transfer, which consequently increases the tree’s energy efficiency
and throughput in both update- and search- intensive workloads. We will show
that with these improvements, GreenBST can become up to 195% more energy
efficient than DeltaTree (cf. Section 3).
We evaluate GreenBST’s energy efficiency (in operations/Joule) and throughput (in operations/second) against six prominent concurrent search trees (cf.
Table 1) using parallel micro-benchmarks Synchrobench [17] and STAMP database
benchmark Vacation [29] (cf. Section 3). We present memory and cache profile
data to provide insights into what make GreenBST energy efficient (cf. Section
# Algorithm Ref Description
1 SVEB
[8]
2
3
4
5
CBTree
Citrus
LFBST
BSTTK
[28]
[4]
[30]
[13]
6 DeltaTree
[36]
7 GreenBST -
Synchronization Code
authors
Conventional vEB layout search global mutex
U. Aarhus
tree
Concurrent B-tree (B-link tree)
lock-based
U. Tromsø
RCU-based search tree
lock-based
Technion
Non-blocking binary search tree
lock free
UT Dallas
Portably
scalable
concurrent lock-based
EPFL
search tree
Locality aware concurrent search lock-based
U. Tromsø
tree
Improved locality aware concur- lock-based
this paper
rent search tree
Data
structure
binary-tree
b+tree
binary tree
binary tree
binary tree
b+tree
b+tree
Table 1: List of the evaluated concurrent search tree algorithms.
3). We also provide insights into what are the key ingredients for developing
energy-efficient data structures in general (cf. Section 4).
Our contributions. Our contributions are threefold:
1. We have devised a new portable fine-grained locality-aware concurrent search
trees, GreenBST (cf. Section 2.1). GreenBST are based on our proposed
concurrency-aware vEB layout [36] with the two improvements, namely the
incremental node rebalance and the heterogeneous node layouts.
2. We have evaluated GreenBST throughput (in operations/second) and energy
efficiency (in operations/Joule) with six prominent concurrent search trees (cf.
Table 1) on three different platforms (cf. Section 3). We show that compared
to the state of the art concurrent search trees, GreenBST has the best energy
efficiency and throughput across different platforms for most of the concurrent
search- and update- intensive workloads.
GreenBST code and evaluation benchmarks are available at: https://github.
com/uit-agc/GreenBST.
3. We have provided insights into how to develop energy-efficient data structures
in general (cf. Section 4).
2
Design overview
We devise GreenBST based on the concurrency-aware vEB layout [36], based
on the idea that the layout has the same data transfer efficiency between two
memory levels as the conventional sequential vEB layout [15, 31]. Therefore,
theoretically, we can use the concurrency-aware layout within a concurrent search
tree to minimize data movements between memory levels, which can eventually
be a basis of an energy-efficient concurrent search tree. This section starts with
brief descriptions about the original vEB layout and the concurrency-aware vEB
layout for concurrent search tree, followed by detailed description of GreenBST
structure and algorithms.
The van Emde Boas (vEB) layout. The vEB layout has inspired several
cache-oblivious (CO) search trees such as the concurrent CO B-trees [5, 6] and
the CO binary trees [8]. The vEB layout based trees recursively arrange related
(a)
(b)
1
3
2
4
8
10
2
6
5
9
11
12
7
13
14
5
7
6
B2=16
4x
3
4
15
8
10
9
B4= 1024
10x
(c)
1
11
L
1
13
12
14
15
B1=16
4x
L
2
C
L
L
C
D
R
A
M
D
I
S
K
B3=16
4x
Fig. 2: Illustration of the required data block transfer in searching for (a) key 13 in
BFS tree and (b) key 12 in vEB tree, where a node’s value is its address in the physical
memory. Note that in (b), adjacent nodes are grouped together (e.g., (1,2,3) and
(10,11,12)) because of the recursive tree building. The similarly colored nodes indicates
a single block transfer B. An example of multi-level memory is shown in (c), where Bx
is the block transfer size B between levels of memory.
data in contiguous memory locations, minimizing data transfer between any two
adjacent levels of the memory hierarchy.
Figure 2 illustrates the vEB layout, where B size is 3. B is the data block
transfer between two memory levels (e.g., RAM and disk) in the I/O model [2].
Traversing a complete binary tree with the Breadth First Search layout (or BFS
tree for short) with height 4 will need three data block transfers to locate the
key at leaf-node 13 (cf. Figure 2a). The first two levels with three nodes (1, 2, 3)
fit within a single block transfer while the next two levels need to be loaded in
two separate block transfers that contain nodes (6, 7, 8)1 and nodes (13, 14, 15),
respectively. Generally, the number of data block transfers for a BFS tree of size
N is (log2 N − log2 B) = log2 N/B ∼ log2 N for N ≫ B.
For a vEB tree with the same height, the required block transfers is only two.
As shown in Figure 2b, locating the key in leaf-node 12 requires only a transfer
of nodes (1, 2, 3), followed by a transfer of nodes (10, 11, 12). Generally, the
data transfer (or I/O) complexity of searching for a key in a tree of size N is
2N
now reduced to log
log2 B = logB N , simply by using an efficient tree layout so that
nearby nodes are located in adjacent memory locations. If B = 1024, searching a
BFS tree for a key at a leaf requires 10× (or log2 B) more I/Os than searching a
vEB tree with the same size N , where N ≫ B.
On commodity machines with multi-level memory, the vEB layout is even
more efficient. So far the vEB layout is shown to have log2 B less I/Os for twolevel memory. In a typical machine having three levels of inclusive caches (with
cache line size of 64B), a RAM (with page size of 4KB) and a disk, a vEB tree
search can intuitively give 640× less I/Os than a BFS tree search, assuming the
node size is 4 bytes (cf. Figure 2c). However, the drawback of the vEB layout is
in its recursive structure. For example if the tree is full, a new bigger tree needs
to be built, recursively in one contiguous block of memory, which also means
that the old tree needs to be invalidated and its members copied to the new tree.
This drawback prevents an effective way to implement concurrency.
The concurrency-aware vEB layout. Our proposed concurrency-aware vEB
layout has been proved to have the same data transfer efficiency between two
1
For simplicity, we assume that the memory controller transfers a block of 3 nodes
starting at the address of the requested node in memory.
1: Struct Map:
2:
member fields:
3:
lef t ∈ N, left child pointer address interval
4:
right ∈ N, right child pointer address intvl.
5: Map map[UB]
6: function right(p, base)
7:
nodesize ← sizeOf(node)
8:
idx ← (p − base)/nodesize
9:
if (map[idx].right != 0) then
10:
11:
12:
return base + map[idx].right
else
return 0
13: function left(p, base)
14:
nodesize ← sizeOf(node)
15:
idx ← (p − base)/nodesize
16:
if (map[idx].lef t != 0) then
17:
return base + map[idx].lef t
18:
else
19:
return 0
Fig. 3: Map structure and the mapping functions.
memory levels as the conventional sequential vEB layout [36]. Because of the
limited space, we spared the full details of our layout design in this paper, but
in brief, a concurrency-aware vEB layout tree (U ) is a tree consisting of |U |
GNodes Ti , i = 1, . . . , |U |. Nodes of tree Ti are called internal nodes in order to
distinguish them from GNodes. Each GNode contains a pre-allocated vEB-layout
binary search tree (BST) structure that can hold a maximum of UB internal
nodes. Each GNode’s internal leaf nodes may link to another GNode’s internal
root node, which eventually form a k-ary tree of GNodes at the higher level. Note
that this k-ary tree does not required to have a cache-oblivious layout [36].
2.1
GreenBST
GreenBST and DeltaTree is designed by devising three major strategies, namely
it uses a common GNode map instead of pointers or arithmetic-based implicit
BST (i.e., a node’s successor memory address is calculated on the fly) for node
traversals, crafting an efficient inter-node connection, and using balanced layouts.
In addition to the shared common traits with DeltaTree, GreenBST also employs
two new major strategies: (i) GreenBST uses incremental GNode rebalance and
(ii) GreenBST uses heterogeneous GNode layouts.
Data structures. GreenBST is a collection of GNodes where each GNode
consists of an UB internal nodes that hold the tree keys and a 1 /2 UB link array
that links the GNode internal leaf nodes to another GNode’s root node. Chain of
GNodes formed a B+tree (to avoid confusion, from this point onward, we refer
the "fat" nodes of GreenBST as GNode and the GNode’s internal tree nodes
as internal nodes or nodes). Each GNode also contains a lock (locked); a rev
counter that is used for optimistic concurrency [26]; nextRight variable, which
is a pointer that points to the GNode’s right sibling; and highKey variable,
which contains the lowest key member of the right sibling GNode. These last
four variables are used for GreenBST concurrency control.
Cache-resident map instead of pointers or arithmetic implicit array.
GreenBST does not use pointers to link between its internal nodes, instead it uses
a single map-based implicit BST array. This approach is unique to the concurrencyaware vEB layout as it benefits from the usage of the fixed-size GNodes. The
usage of pointers and arithmetic-based implicit array in cache-oblivious (CO)
trees has been previously studied [8] and both are found to have weaknesses.
1: function Search(key, GNode, maxDepth)
2:
while GNode is not leaf do
3:
rev ← GNode.rev ⊲ Get revision
4:
bits ← 0
5:
depth ← 0
6:
p ← GNode.nodes[0 ]
7:
base ← p
8:
link ← GNode.link
⊲ continue until leaf node:
9:
while (p & p.key! = EMPTY ) do
⊲ increment depth:
10:
depth ← depth + 1
⊲ shift one bit to the left in each level
11:
bits ← bits << 1
12:
if (key < p.key) then
13:
14:
15:
p ← left(p, base)
else
p ← right(p, base)
⊲ right child color is 1:
16:
bits ← bits + 1
⊲ pad the bits:
17:
bits ← bits << (maxDepth − depth) − 1
18:
if (GNode.rev != rev or not even) then
19:
Goto 3
⊲ Re-try GNode search
⊲ follow nextRight if key ≤ highKey:
20:
if (GNode.highKey ≤ key) then
21:
GNode ← GNode.nextRight
22:
else
23:
GNode ← link [bits] ⊲ child GNode
24:
return GNode
Fig. 4: Search within pointer-less GNode. This function will return the leaf GNode
containing the searched key. From there, an implicit array search using left and right
functions is adequate to pinpoint the key location. The search operations are utilizing
both the nextRight pointers and highKey variables to handle concurrent search even
during GNode split.
Pointer based CO tree search operation is slow, mainly because overheads in every
data transfer between memory (although CO tree can minimize data transfers,
the inclusion of pointers can lower the amount of meaningful data (e.g., keys) in
each block transfer). The implicit array that uses arithmetic calculation for every
node traversal may increase the cost of computation, especially if the tree is big.
The cache-resident-maps technique emulates BST’s (left and right) child
traversals inside a GNode using a combination of a cache-resident GNode map
structure and left and right functions (cf. Figure 3). The left and right
functions, given an arbitrary node v and its GNode’s root memory addresses,
return the addresses of the left and right child nodes of v, or 0 if v has no children
(i.e., v is an internal leaf node of a GNode). The left and right operations
throughout GreenBST share a common cache-resident map instance (cf. Figure 3,
line 5). All GNodes use the same fixed-size vEB layout, so only one map instance
with size UB is needed for all traversing operations. This makes GreenBST’s
memory footprint small and keeps the frequently used map instance in cache.
Note that the mapping approach does not induce memory fragmentation.
This is because mapping approach applies only for each GNode, and map is
only used to point to internal nodes within a GNode. GNode layout uses a
contiguous memory block of fixed size UB and update operations can only change
the values of GNode internal nodes (e.g., from EMPTY to a key value in the
case of insertion), but cannot change GNode’s memory layout.
Inter-GNode connection. To enable traversing from a GNode to its child
GNodes, we develop a new inter-GNode connection mechanism. We logically
assign binary values to GNode’s internal edges so that each path from GNode root
to an internal leaf node is represented by a unique bit-sequence. The bit-sequence
is then used as an index in a link array containing pointers to child GNodes. As
GNode’s internal node has only left and right edges, we assign 0 and 1 to the
left and right edges, respectively. The maximum size of the bit representation
is GNode’s height or log(UB ) bits. We allocate a link pointer array whose size
is half UB length. The algorithm in Figure 4 explains how the inter-GNode
connection works in a pointer-less search function.
Balanced and concurrent tree. GreenBST adopts the concurrent algorithms
of B-link tree that provides lock-free search operations and adopts the B+tree
structure for its high-level structure [28]. However, unlike B-link tree, GreenBST is
an in-memory tree and uses optimistic concurrency to handle lock-free concurrent
search operations even in the occurrences of the unique "in-place" GNodes
maintenance operations.
Similar to B-link tree, GreenBST insert operations built the tree from the
bottom up, but unlike B-link tree, GreenBST insert operation can trigger rebalance
operation, a unique GreenBST feature to maintain GNode’s small height.
Function rebalance(Ti ) is responsible for rebalancing a GNode Ti after an
insertion. If a new node v is inserted at the last level node of a GNode, that
GNode is rebalanced to a complete BST. Rebalance sets all GNode leaves node
height to ⌊log N ⌋ + 1, where N is the count of the GNode’s internal nodes and
N ≤ UB . Note that this is the default rebalance strategy used by DeltaTree, the
incremental rebalance used by GreenBST is explained further in this section.
The delete operation in GreenBST simply marks the requested key (v) as
deleted. This function fails if v does not exist in the tree or v is already marked.
GreenBST does not employ merge operation between GNodes as node reclamation
is done by the rebalance and split operations. The offline memory reclamation
techniques used in the B-link tree [28] can be deployed to merge nearly empty
GNodes in the case where delete operations are the majority. Our new search
trees aim at workloads dominated by search operations.
GreenBST concurrency control uses locks and nextRight and highKey
variables to coordinate between search and update operations [28] in addition to
rev variable that is used for the search’s optimistic concurrency. When a GNode
needs to be maintained by either rebalance or split operations, the GNode’s rev
counter is incremented by one before the operation starts. The GNode counter is
incremented by one again after the maintenance operation finishes. Note that all
maintenance procedures happen when the lock is still held by the insert operation
and therefore, only one operation may update rev counter and maintain a GNode
at a time. The usage of rev counter is to prevent search from returning wrong
key because of the "in-place" GNode maintenance operation.
The search operation in GreenBST uses a combination of function Search (cf.
Figure 4) and an implicit tree traversal using map. Function Search traverses
the tree from the internal root node of the root GNode down to a leaf GNode, at
which the search is handed over to the implicit tree traversal to find the searched
key within the leaf GNode. GreenBST search operation does not wait or use lock,
even in the occurrence of the concurrent updates.
GreenBST search uses optimistic concurrency [26] to ensure the operation
always returns the correct answer even if it arrives at a GNode that is undergoing
the in-place maintenance operation (i.e., rebalance and split). First, before starting
to traverse a GNode, a search operation records the GNode rev counter. Before
following a link to a child GNode or returning a key, the search operation re-checks
again the counter. If the current counter value is an odd number or if it is not
equal to the recorded value, the search operation needs to retry search as this
indicates that GNodes are being or have been maintained.
Incremental Rebalance. As explained earlier, the rebalance in DeltaTree
always involves UB keys, which eventually makes insertions require amortized
O(UB ) time. GreenBST borrows the incremental rebalance idea similar to the
conventional vEB layout [8] that has the amortized O((log2 UB )/(1 − Γ1 )) time
if used in GreenBST. However, unlike the conventional vEB layout that might
have to rebalance the whole tree, we only apply the incremental rebalance to
GNodes. To briefly explain the idea, we denote density(w) as the ratio of number
of keys inside a subtree rooted at w divided by the number of maximum keys
that a subtree rooted at w can hold. For example, a subtree with root w that
is located three levels away from an internal leaf of a GNode can hold at most
23 − 1 keys. If the subtree only contains 3 keys, then density(w) =3 /7 = 0.42.
We also denote a density threshold 0 < Γ1 < Γ2 < ... < ΓH = 1, where H is the
GNode’s height. The main idea is after a new key is inserted at an internal leaf
position v, we find the nearest ancestor w of v where density(w) ≤ Γdepth(w) and
depth(w) is the level where w resides, counted from the root of the GNode. If
that w is found, we rebalance the subtree rooted at w.
Heterogeneous GNodes. We aim to reduce the overhead of rebalancing and
lower the GreenBST height with the usage of different layout for the leaf GNodes
(or heterogeneous). All DeltaTree’s GNodes use the leaf-oriented BST layout, or
DeltaTree uses homogeneous GNodes. Unlike DeltaTree, leaf GNodes in GreenBST
use the internal tree layout instead of the external (or leaf-oriented) tree layout.
In the internal tree layout, keys are located in all nodes of a tree, while in the
external tree layout, keys are only located in the leaf nodes. The reasoning behind
this choice is although leaf-oriented GNodes layout is required for inter-GNode
connection (i.e., between parent- and child- GNodes), leaf GNodes do not have any
children and therefore, need not to adopt same structure as the other GNodes.
3
Experiments
We run several different benchmarks to evaluate GreenBST throughput and energy
efficiency. We combine the benchmark results with the last level cache (LLC) and
memory profiles of the trees to draw a conclusion of whether GreenBST improved
fine-grained data locality layout (i.e., heterogeneous layout) and concurrency
(i.e., lower overall cost of runtime maintenance) over DeltaTree are able to make
GreenBST the most energy-efficient tree across different platforms, even when
processing the update-intensive workloads. Note that we are not collecting the
computation profiles (e.g., Mflops/second) because all the tree operations are
data-intensive instead of compute-intensive.
We conduct an experiment on GreenBST and several prominent concurrent
search trees (cf. Table 1) using parallel micro-benchmark that is based on Synchrobench [17] (cf. Figure 5). The trees’ LLC and memory profiles during the
micro-benchmarks are collected and presented in Figure 5d and 5e, respectively.
To investigate GreenBST behavior in real-world applications, we implement
GreenBST and CBTree as the backend structures in the STAMP database benchmark Vacation [29], alongside the Vacation’s original backend structure red-black
tree (rbtree) (cf. Figure 6).
All the experimental benchmarks are conducted on an Intel high performance
computing (HPC) platform with 24 core 2× Intel Xeon E5-2650Lv3 CPU and
64GB of RAM, an ARM embedded platform with an 8 core Samsung Exynos
5410 CPU and 2GB of RAM (Odroid XU+E), and an accelerator platform based
on the Intel Xeon Phi 31S1P with 57 cores and 6GB of RAM (MIC platform).
For the parallel micro-benchmark, the trees are pre-initialized with several initial
keys before running 5 million operations of 100% (search-intensive) and 50%
searches (update-intensive), respectively. The initial keys given to both the ARM
and MIC platforms are 222 keys and to the HPC platform are 223 keys. All
experiments are repeated at least 5 times to guarantee consistent results.
Energy efficiency metrics (in operations/Joule) are the energy consumption divided by the number of operations and throughput metrics (in operations/second)
are the number of operations divided by the maximum time for the threads
to finish the whole operations. Energy metrics are collected from the on-board
power measurement on the ARM platform, Intel RAPL interface on the HPC
platform, and micras sysfs interface (i.e., /sys/class/micras/power) on the
MIC platform.
Experimental results. Based on the results in Figure 5 and 6, GreenBST’s
energy efficiency and throughput are the highest compared to DeltaTree and
the other trees. Because of its incremental rebalance, GreenBST outperformed
DeltaTree (and the other trees) in the update-intensive workloads. With its
heterogeneous layout, GreenBST is able to outperform DeltaTree in the searchintensive workloads. GreenBST energy efficiency and throughput are up to 195%
higher than DeltaTree for the update intensive benchmark and up to 20% higher
for the search intensive benchmark (cf. Figure 5b). Compared to the other trees,
GreenBST energy efficiency and throughput are up to 65% and 69% higher,
respectively. Note that CBTree (B-link tree) is a highly-concurrent B-tree variant
that it’s still used as a backend in popular database systems such as PostgreSQL.
The reason behind GreenBST good results is GreenBST’s data transfer (cf.
Figure 5e) and LLC misses (cf. Figure 5d) are among the lowest of all the trees.
These facts prove that combination of locality-aware layout and the optimizations
that GreenBST has over DeltaTree are beneficial to both fine-grained locality and
concurrency, of which are the key ingredients of an energy-efficient concurrent
search tree.
4
Discussions
Some of the benchmark results showed that besides data movements, efficient
concurrency control is also necessary in order to produce energy-efficient data
structures. For example, the conventional vEB tree (SVEB) always transferred
the smallest amount of data between memory to CPU, but unfortunately, its
CBTree
CBTree
·105
100% Search
3 Energy efficiency
citrus
citrus
LFBST
LFBST
BSTTK
operations / Joule
operations / Joule
operations / Joule
1.5
1
GreenBST
DeltaTree
GreenBST
·105
100% Search
Energy efficiency
·105
50% Search
Energy efficiency
2
DeltaTree
BSTTK
1
0.5
·105
50% Search
Energy efficiency
8
operations / Joule
SVEB
SVEB
6
4
4
3
2
2
1
6
·107
Throughput
12
18
24 cores
1.5
1
6
·107
Throughput
12
18
1
2
·106
Throughput
24 cores
3
4 cores
1
2
·106
Throughput
1
0.5
0.5
0
0
1
6
12
18
24 cores
1
6
12
18
6
operations / Joule
1.5
1
0.5
1
1
0.5
0
1
2
3
4 cores
1
2
3
4 cores
(b) ARM platform. GreenBST is up to
65% more energy efficient than CBTree in
the 50% search benchmark using 4 cores.
Its throughput is up to 69% higher than
CBTree in the 50% search benchmark using
4 cores.
·104
50% Search
Energy efficiency
6
100% Search
LLC-DRAM data transfer (R/W)
50% Search
10 LLC-DRAM data transfer (R/W)
8
4
Gigabytes
·105
100% Search
Energy efficiency
2
1.5
0
24 cores
(a) HPC platform. GreenBST is up to
50% more energy efficient than CBTree in
the 50% search benchmark using 12 cores
and its throughput is up to 40% higher than
CBTree in the 100% search benchmark using
24 cores.
3
operations / second
1.5
1
2
4
Gigabytes
2
operations / second
operations / second
operations / second
4 cores
2
2.5
operations / Joule
3
2
6
4
2
28
57 cores
1
14
·106
6
Throughput
operations / second
operations / second
1
14
·107
1.5 Throughput
1
0.5
28
57 cores
0
14
28
57 cores
12
18
24 cores
1
6
12
18
24 cores
2
0
1
6
(d) Data movement between CPU’s last level
cache (LLC) and DRAM on the HPC platform.
4
1.4
0
0
1
1
14
28
57 cores
(c) MIC platform. GreenBST is up to 50%
more energy efficient than BSTTK in the
50% search benchmark using 14 cores and
its throughput is up to 20% higher than
BSTTK in the 100% search benchmark using 14 cores.
·108
L2 miss
100% Search
9
1.41·10
2.54·109
5.96·109
1.4
·108
L2 miss
50% Search
4.59·109
9.17·109
14
28
1.79·1010
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
1
14
28
57 cores
1
57 cores
(e) L2 cache misses on the MIC platform.
❛❛
❛❛ Tree name
SVEB
❛❛
❛❛
❛ 0.1
Memory used (in GB)
CBTree citrus LFBST BSTTK DeltaTree GreenBST
0.4
(f) The tree memory footprint after 2
0.8
23
0.7
1.0
0.6
0.4
integer keys insertion on the HPC platform.
Fig. 5: (a,b,c) Energy efficiency and throughput comparison of the trees. On the HPC
platform, DeltaTree and GreenBST energy efficiency and throughput decreases in the
50% search benchmark using 18 and 24 cores (i.e., with 2 chips) because of the coherence
overheads between two CPUs (cf. Section 4). In the 50% search benchmark using 57
cores (MIC platform), BSTTK energy efficiency and throughput beats GreenBST
by 20% because of the coherence overheads in the MIC platform (cf. Section 4). (d)
LLC-DRAM data movements on the HPC platform, collected from the CPU counters
using Intel PCM. (e) L2 cache miss counter on the MIC platform, collected using PAPI
library. (f) The tree memory footprint.
HPC time required
0.53
0.33
0.25
0.98
0.65
0.48
24
12
6.33
5.43
4
3.35
ARM time required 57
10.74
1
0
2
25.01
15.92
12.14
40.19
25.69
17.92
24
12
4
6
20.74
1
6.05
4.67
8
10
0
5
10
15
4
18.85
HPC energy consumption
181.49
141.74
0
50 100 150 200 250 300 350
20
27.68
26.93
66.54
61.78
1
17.73
11.42
36.03
0
57
31.66
30.37
1
21
0
5 10 15 20 25 30
GreenBST CBTree rbtree
10
112.86
115.71
67.83
ARM energy consumption
318.84
1
MIC time required
1.3
1.28
0.88
20
30
40 50 60 70
seconds (shorter is better)
MIC energy consumption
4,684.44
4,326.72
1
2,555.3
0
1,000 2,000 3,000 4,000 5,000
Joules (shorter is better)
Fig. 6: GreenBST energy efficiency and throughput against CBTree and STAMP’s builtin red-black tree (rbtree) for the vacation benchmark. At best, GreenBST consumes
41% less energy and requires 42% less time than CBTree (in the 57 clients benchmark
on the MIC platform).
energy efficiency and throughput failed to scale when using 2 or more cores.
SVEB is not designed for concurrent operations and an inefficient concurrency
control (a global mutex) had to be implemented in order to include the tree in this
study (note that we are unable to use a more fine-grained concurrency because
SVEB uses recursive layout in a contiguous memory block). Therefore, even if
SVEB has the smallest amount of data transfer during the micro-benchmarks,
the concurrent cores have to spend a lot of time waiting and competing for a
lock. This is inefficient as a CPU core still consumes power (e.g., static power)
even when it is waiting (idle).
Finally, an important lesson that we have learned is that minimizing overheads
in locality-aware data structures can reduce the structure’s energy consumption.
One of the main differences between DeltaTree and GreenBST is that DeltaTree
uses the homogeneous (leaf-oriented) layout, while GreenBST does not. Leaforiented leaf GNodes increased DeltaTree’s memory footprint by 50% compared
to GreenBST (cf. Figure 5f) and has caused higher data transfer between LLC
and DRAM (cf. Figure 5d). Bigger leaf size also increases maintenance cost for
each leaf GNode, because there more data that need to be arranged in every
rebalance or split operation, which leads to lower update concurrency. Therefore,
DeltaTree energy efficiency and throughput are lower than GreenBST.
Inter-CPU and many-core coherence issue Our experimental analysis has
revealed that multi-CPU and many-core cache coherence, if triggered, can degrade
concurrent update throughput and energy efficiency of the locality-aware trees.
Figure 5a shows the "dips" in GreenBST’s 50% update energy efficiency and
throughput on the HPC platform (i.e., in the 50% update/18 cores and 50%
update/24 cores cases). Figure 5c also shows that BSTTK beats GreenBST in
the 50% update/57 cores case on the MIC platform.
Using the CPU performance counters, we have found that the GreenBST
concurrent updates frequently triggered the inter-CPU coherency mechanism. In
the HPC platform, coherency mechanism causes heavy bandwidth saturation in
the CPU interconnect. In the MIC platform, it causes most of the L2 data cache
misses to be serviced from other cores and saturates the platform’s bidirectional
ring interconnect. These facts highlight the challenge faced by the locality-aware
concurrent search tree: because of its locality awareness (i.e., related data are
kept nearby and often re-used), the tree concurrent update operations might
trigger heavy interconnect traffic on the multi-CPU platforms. The coherency
mechanisms increase the total number of data transfer and the platform’s energy
consumption.
5
Conclusions
The results presented in this paper not only show that GreenBST is an energyefficient concurrent search tree, but also provide an important insight into how
to develop energy efficient data structures in general. On single core systems,
having locality-aware data structures that can lower data movement has been
demonstrated to be good enough to increase energy-efficiency. However, on
multi-CPU and many cores systems, data-structures’ locality-awareness alone
is not enough and good concurrency and multi-CPU cache strategy are needed.
Otherwise, the energy overhead of "waiting/idling" CPUs or multi-CPU coherency
mechanism can exceed the energy saving obtained by fewer data movements.
Acknowledgments
This work has received funding from the European Union Seventh Framework
Programme (EXCESS project, grant n◦ 611183) and from the Research Council
of Norway (PREAPP project, grant n◦ 231746/F20).
References
1. Afek, Y., Kaplan, H., Korenfeld, B., Morrison, A., Tarjan, R.E.: Cbtree: a practical concurrent
self-adjusting search tree. In: Proc. 26th international Conf. Distributed Computing. pp. 1–15.
DISC’12 (2012)
2. Aggarwal, A., Vitter, Jeffrey, S.: The input/output complexity of sorting and related problems.
Commun. ACM 31(9), 1116–1127 (1988)
3. Andersson, A.: Faster deterministic sorting and searching in linear space. In: Proc. 37th Annual
Symp. on Foundations of Computer Science. pp. 135–141. FOCS ’96 (Oct 1996)
4. Arbel, M., Attiya, H.: Concurrent updates with rcu: Search tree as an example. In: Proc. 2014
ACM Symposium on Principles of Distributed Computing. pp. 196–205. PODC ’14 (2014)
5. Bender, M., Demaine, E.D., Farach-Colton, M.: Cache-oblivious b-trees. SIAM Journal on
Computing 35, 341 (2005)
6. Bender, M.A., Farach-Colton, M., Fineman, J.T., Fogel, Y.R., Kuszmaul, B.C., Nelson, J.:
Cache-oblivious streaming b-trees. In: Proc. 19th annual ACM Symp. Parallel algorithms and
architectures. pp. 81–92. SPAA ’07 (2007)
7. Bender, M.A., Fineman, J.T., Gilbert, S., Kuszmaul, B.C.: Concurrent cache-oblivious b-trees.
In: Proc. 17th annual ACM Symp. Parallelism in algorithms and architectures. pp. 228–237.
SPAA ’05 (2005)
8. Brodal, G.S., Fagerberg, R., Jacob, R.: Cache oblivious search trees via binary trees of small
height. In: Proc. 13th ACM-SIAM Symp. Discrete algorithms. pp. 39–48. SODA ’02 (2002)
9. Brodal, G.: Cache-oblivious algorithms and data structures. In: Hagerup, T., Katajainen, J.
(eds.) Algorithm Theory - SWAT 2004, Lecture Notes in Computer Science, vol. 3111, pp. 3–13
(2004)
10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, Third Edition. The MIT Press (2009)
11. Crain, T., Gramoli, V., Raynal, M.: A speculation-friendly binary search tree. In: Proc. 17th
ACM SIGPLAN Symp. Principles and Practice of Parallel Programming. pp. 161–170. PPoPP
’12 (2012)
12. Dally, B.: Power and programmability: The challenges of exascale computing. In: DoE Arch-I
presentation (2011)
13. David, T., Guerraoui, R., Trigonakis, V.: Asynchronized concurrency: The secret to scaling
concurrent search data structures. In: Proc. 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems. pp. 631–644. ASPLOS ’15 (2015)
14. Ellen, F., Fatourou, P., Ruppert, E., van Breugel, F.: Non-blocking binary search trees. In: Proc.
29th ACM SIGACT-SIGOPS Symp. Principles of distributed computing. pp. 131–140. PODC
’10 (2010)
15. van Emde Boas, P.: Preserving order in a forest in less than logarithmic time. In: Proc. 16th
Annual Symp. Foundations of Computer Science. pp. 75–84. SFCS ’75 (1975)
16. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc.
40th Annual Symp. Foundations of Computer Science. p. 285. FOCS ’99 (1999)
17. Gramoli, V.: More than you ever wanted to know about synchronization: Synchrobench, measuring the impact of the synchronization on concurrent algorithms. In: Proceedings of the
20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 1–10.
PPoPP 2015 (2015)
18. Ha, P.H., Tsigas, P., Anshus, O.J.: Wait-free programming for general purpose computations on
graphics processors. In: Proc. 2008 IEEE International Symposium on Parallel and Distributed
Processing. pp. 1–12. IPDPS’08 (2008)
19. Ha, P.H., Tsigas, P.: Reactive multi-word synchronization for multiprocessors. In: Proc. 12th
Intl. Conf. on Parallel Architectures and Compilation Techniques. pp. 184–193. PACT ’03 (2003)
20. Ha, P.H., Tsigas, P., Anshus, O.J.: Nb-feb: A universal scalable easy-to-use synchronization
primitive for manycore architectures. In: Proc. 13th Intl. Conf. on Principles of Distributed
Systems. pp. 189–203. OPODIS ’09 (2009)
21. Ha, P.H., Tsigas, P., Anshus, O.J.: The synchronization power of coalesced memory accesses.
IEEE Transactions on Parallel and Distributed Systems 21(7), 939–953 (2010)
22. Ha, P.H., Tsigas, P., Wattenhofer, M., Wattenhofer, R.: Efficient multi-word locking using randomization. In: Proc. 24th Annual ACM Symp. on Principles of Distributed Computing. pp.
249–257. PODC ’05 (2005)
23. Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: Proc. 20th Annual Intl. Symp. on Computer Architecture. pp. 289–300. ISCA ’93
(1993)
24. Hipp, D.R.: Sqlite (2015), http://www.sqlite.org
25. Kim, C., Chhugani, J., Satish, N., Sedlar, E., Nguyen, A.D., Kaldewey, T., Lee, V.W., Brandt,
S.A., Dubey, P.: Fast: fast architecture sensitive tree search on modern cpus and gpus. In: Proc.
2010 ACM SIGMOD Intl. Conf. Management of data. pp. 339–350. SIGMOD ’10 (2010)
26. Kung, H.T., Robinson, J.T.: On optimistic methods for concurrency control. ACM Trans.
Database Syst. 6(2), 213–226 (Jun 1981)
27. Larsson, A., Gidenstam, A., Ha, P.H., Papatriantafilou, M., Tsigas, P.: Multi-word atomic
read/write registers on multiprocessor systems. In: Proc. 12th Annual European Symposium on
Algorithms (ESA ’04). pp. 736–748. LNCS 3221 (2004)
28. Lehman, P.L., Yao, s.B.: Efficient locking for concurrent operations on b-trees. ACM Trans.
Database Syst. 6(4), 650–670 (Dec 1981)
29. Minh, C.C., Chung, J., Kozyrakis, C., Olukotun, K.: Stamp: Stanford transactional applications
for multi-processing. In: Workload Characterization, 2008. IISWC 2008. IEEE International
Symposium on. pp. 35–46 (Sept 2008)
30. Natarajan, A., Mittal, N.: Fast concurrent lock-free binary search trees. In: Proc. 19th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 317–328. PPoPP
’14 (2014)
31. Prokop, H.: Cache-oblivious algorithms. Master’s thesis, MIT (1999)
32. Rodeh, O.: B-trees, shadowing, and clones. Trans. Storage 3(4), 2:1–2:27 (Feb 2008)
33. Sewall, J., Chhugani, J., Kim, C., Satish, N.R., Dubey, P.: Palm: Parallel architecture-friendly
latch-free modifications to b+ trees on many-core processors. Proc. VLDB Endowment 4(11),
795–806 (2011)
34. Tran, V., Barry, B., Ha, P.H.: RTHpower: Accurate fine-grained power models for predicting
race-to-halt effect on ultra-low power embedded systems. In: Proc. 17th IEEE International
Symposium on Performance Analysis of Systems and Software. ISPASS ’16 (2016), pages to
appear
35. Tran, V., Barry, B., Ha, P.H.: Supporting energy-efficient co-design on ultra-low power embedded
systems. In: Proc. 2016 Intl. Conf. on Embedded Computer Systems: Architectures, Modeling,
and Simulation. SAMOS XVI (2016), pages to appear
36. Umar, I., Anshus, O.J., Ha, P.H.: Deltatree: A locality-aware concurrent search tree. In: Proc.
2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer
Systems. pp. 457–458. SIGMETRICS ’15 (2015)
37. Umar, I., Anshus, O.J., Ha, P.H.: Effect of portable fine-grained locality on energy efficiency
and performance in concurrent search trees. In: Proc. 21th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming. pp. 36:1–36:2. PPoPP ’16 (2016)