RRB Vector: A Practical General Purpose Immutable Sequence

Nicolas Stucki†  Tiark Rompf‡  Vlad Ureche†  Phil Bagwell∗

† EPFL, Switzerland: {first.last}@epfl.ch
‡ Purdue University, USA: {first}@purdue.edu

Abstract

State-of-the-art immutable collections have wildly differing performance characteristics across their operations, often forcing programmers to choose different collection implementations for each task. Thus, changes to the program can invalidate the choice of collections, making code evolution costly. It would be desirable to have a collection that performs well for a broad range of operations.

To this end, we present the RRB-Vector, an immutable sequence collection that offers good performance across a large number of sequential and parallel operations. The underlying innovations are: (1) the Relaxed-Radix-Balanced (RRB) tree structure, which allows efficient structural reorganization, and (2) an optimization that exploits spatio-temporal locality on the RRB data structure in order to offset the cost of traversing the tree.

In our benchmarks, the RRB-Vector speedup for parallel operations is lower bounded by 7× when executing on 4 CPUs of 8 cores each. The performance for discrete operations, such as appending on either end, or updating and removing elements, is consistently good and compares favorably to the most important immutable sequence collections in the literature and in use today. The memory footprint of RRB-Vector is on par with arrays and an order of magnitude less than competing collections.

Categories and Subject Descriptors  E.1 [Data Structures]: Arrays; E.1 [Data Structures]: Trees; E.1 [Data Structures]: Lists, stacks, and queues

Keywords  Data Structures, Immutable, Sequences, Arrays, Trees, Vectors, Radix-Balanced, Relaxed-Radix-Balanced

∗ Phil Bagwell passed away on October 6, 2012. He made significant contributions to this work, and to the field of data structures in general.

ICFP'15, August 31 – September 2, 2015, Vancouver, BC, Canada.
ACM 978-1-4503-3669-7/15/08. http://dx.doi.org/10.1145/2784731.2784739

1. Introduction

In functional programs, immutable sequence data structures are used in two distinct ways:

• to perform discrete operations, such as accessing, updating, inserting or deleting random collection elements;
• for bulk operations, such as mapping a function over the entire collection, filtering using a predicate or grouping using a key function.

Bulk operations on immutable collections lend themselves to implicit parallelization. This allows the execution to proceed either sequentially, by traversing the collection one element at a time, or in parallel, by delegating parts of the collection to be traversed in different execution contexts and combining the intermediate results. Therefore, the bulk operations allow programs to scale to multiple cores without explicit coordination, thus lowering the burden on programmers.

Most state-of-the-art collection implementations are tailored to some specific operations, which are executed very fast, at the expense of the others, which are slow. For example, the ubiquitous Cons list is extremely efficient for prepending elements and accessing the head of the list, performing both operations in O(1) time. However, it has a linear O(n) cost for reading and updating random elements. And although sequential scanning is efficient, requiring O(1) time per element, it cannot benefit from parallel execution, since both splitting and combining take sequential O(n) time, cancelling out any gains from the parallel execution.

This non-uniform behavior across different operations forces programmers to carefully choose the collections they use based on the operations required by the task at hand. This ad-hoc choice also stifles code evolution, as new features often rely on different operations, forcing the programmers to revisit their choice of collections. Furthermore, having different collection choices for each module prevents good end-to-end performance, since data must be passed from one collection to another, adding overhead.

Instead of asking programmers to choose a collection which performs well for their needs, it would be much better to provide a default collection that performs well across a broad range of operations, both discrete and bulk. Having such a collection readily available would allow programmers to rely on it without worrying about performance, except in extremely critical places, and would encourage modules to standardize their interfaces around it.

To this end, we present the RRB-Vector, an immutable indexed sequence collection that inherits and improves the fast discrete operations of tree-based structures while supporting efficient parallel execution by providing fast split and combine primitives. The RRB-Vector is a good candidate for a default immutable collection, thanks to its good all-around performance, allowing programs to use it without the risk of running into unexpected linear or supra-linear overheads.

Bulk data parallel operations on the RRB-Vector are executed with effectively-constant1 sequential overheads thanks to the underlying wide Relaxed-Radix-Balanced (RRB) tree structure. The key property is the relaxed balancing requirement, which allows efficient structural changes without introducing extremely unbalanced states. Data parallel operations, such as map, are executed in three phases: (1) the RRB-Vector is split into chunks in an effectively-constant sequential operation, (2) each execution context traverses one or more chunks, with an amortized-constant overhead per element, and (3) the intermediate results are concatenated in a final effectively-constant sequential operation.

1 Proportional to a logarithm of the size with a large base. In practice our choice of index representation as a signed integer limits the height to log32(2^31) + 1, which corresponds to approximately 6.2 indirections.
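To preview how the three phases fit together, here is a minimal sketch in Scala; splitIntoChunks and concatenate are hypothetical stand-ins for the split and concatenation primitives developed in §3, not functions from the actual implementation:

  // Three-phase data-parallel map (illustrative sketch only):
  //   phase 1: split into chunks              -- effectively constant
  //   phase 2: map each chunk in parallel workers
  //   phase 3: concatenate the partial results -- effectively constant each
  def parallelMap[A, B](v: Vector[A], f: A => B): Vector[B] = {
    val chunks = splitIntoChunks(v)            // hypothetical helper
    val partials = chunks.par.map(_.map(f))    // fork-join execution
    partials.reduce(concatenate)               // hypothetical RRB concatenation
  }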
Discrete operations, such as appends on either side, updates and deletions, are performed in amortized-constant time. This is achieved thanks to a lightweight fragmented representation of the tree that reduces the propagation of updates to branches, thus exploiting locality of operations. This provides an adapted and more efficient treatment compared to the widely-used tree structural sharing [11], thus lowering the asymptotic complexity from effectively constant to amortized constant. In the worst case, if operations are called in an adversarial manner, the behavior remains effectively constant and the additional overhead is limited to a range check and a single set of assignment operations.

We implemented the RRB-Vector data structure in Scala2 and measured the performance of its most important operations. On 4 cores, bulk operation performance is at least 2.3× faster compared to sequential execution, scaling better with heavier workloads. Discrete operations take either amortized- or effectively-constant time, with good constants: compared to mutable arrays, sequential reads are at most 2× slower, while random access is 2-3.5× slower.

2 https://github.com/nicolasstucki/scala-rrb-vector

We compare our RRB-Vector implementation to other immutable collections such as red-black trees, finger trees, copy-on-write arrays and the current Vector implementation in the Scala library. Overall, the RRB-Vector is at most 2.5× slower than the best collection for each operation and consistently delivers good performance across all benchmarks. The memory footprint of RRB-Vector is on par with copy-on-write arrays and an order of magnitude better than red-black trees and finger trees.

We claim the following contributions:

• We present the Relaxed-Radix-Balanced (RRB) tree data structure and show how it enables the efficient splitting and concatenation operations necessary for data parallel operations (§3);
• We describe the additional structural optimizations that exploit spatio-temporal locality on RRB-Trees (§4);
• We discuss the technical details of our Scala RRB-Vector implementation (which is under consideration for inclusion in the Scala standard library as a replacement for the current Vector collection) in the hope that other language implementers will benefit from our experience (§5);
• We evaluate the performance of our implementation and compare the results of 7 different core operations across 5 different immutable collections (§6).

2. Background: Vectors as Balanced Trees

In this section we present a base version of the immutable Vector, which is based on Radix-Balanced trees3. This simple version provides logarithmic complexities: O(log_m(n)) on random accesses and O(m·log_m(n)) while updating, appending at either end, or splitting. The constant m is the branching factor (ideally a power of two).

3 The base structure for the Relaxed-Radix-Balanced Vector.

2.1 Radix-Balanced Tree Structure

A Radix-Balanced tree is a shallow and complete (or perfect) m-ary tree with the data located only in the leaves. The nodes have a fixed branching size m, and are either internal nodes linking to subtrees or leaves containing elements. In practice the branching used is 32 [4, 37], but as we will later see, the node size can be any power of 2, allowing efficient radix-based implementations. Figure 1 shows this structure for m children on each node. Logically, each node is a copy-on-write array that contains subtrees or elements.

Figure 1. Radix-Balanced tree structure.

Apart from the tree itself, a Vector keeps the tree height as a field, in order to improve performance. This height is upper bounded by log_m(n−1)+1 for nodes of m branches. For example, if m is 32, the tree becomes quite shallow and the complexity to traverse it from root to leaf is considered effectively constant4, taking into account that the number of elements will never be larger than the maximum index representable with 32-bit signed integers, which corresponds to a maximum height of 7 levels5.

4 There exists a small enough constant bound (due to practical limitations).
5 The maximum height is log32(2^31) + 1, with log32(2^31) = 6.2.

Usually the number of elements in a Vector does not exactly match a full tree (m^i for some i > 0). To mark the start and end of the elements in the tree, the vector keeps these indices as fields. All subtrees on the left and right that are outside of the filled index range are represented by empty references.

2.2 Core Operations

Indexing  Elements are fetched from the tree using radix search on the index. If the tree has a branching factor of 32, the index can be split bitwise into blocks of 5 bits (2^5 = 32) and used to determine the path that must be taken from the root down to the element. The sub-index at each level L can be computed as (index >> (5·L)) & 31. For example, the index 526843 decomposes as:

  526843 = 00 00000 00000 10000 00010 01111 11011
               0     0    16     2    15    27

Figure 2. Accessing the element at index 526843 in a tree of depth 5. Empty nodes represent collapsed subtrees.

This scheme can be generalized to any branching size m where m = 2^i for 0 < i ≤ 31. The formula is:

  (index >> (i·L)) & (m−1)

It is also possible to generalize to other values of m using the modulo, division and power operations. In that case the formula would become (index / m^L) mod m.
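To make the bit slicing concrete, the decomposition above can be checked with a few lines of Scala (an illustrative snippet, not part of the paper's implementation):

  // Sub-indices of index 526843 for m = 32, i.e. 5-bit blocks,
  // read from the root (level 5) down to the leaf (level 0).
  val index = 526843
  val subIndices = (5 to 0 by -1).map(level => (index >> (5 * level)) & 31)
  // subIndices == Vector(0, 0, 16, 2, 15, 27), matching Figure 2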
The base implementation of the indexing operation requires a single traversal from the root to the leaf containing the element, with the path taken defined by the index of the element and extracted using efficient bitwise operations. With this traversal of the tree, the complexity of the operation is O(log_m(n)).

The same radix-based traversal of the tree is used in the rest of the operations to find the leaf corresponding to a given index. It can be optimized for the first and last leaf by removing computations, to improve performance of operations on the ends.
The nodes of the tree are represented as copy-on-write arrays:

  type Node = Array[AnyRef]
  val Node = Array

  val i = // bits in each block of the index
  val mask = (1 << i) - 1
  def get(index: Int): A = {
    def getRadix(idx: Int, nd: Node, level: Int): A = {
      if (level == 0) nd(idx & mask)   // leaf: select the element
      else {
        val indexInLevel = (idx >> (level * i)) & mask
        getRadix(idx, nd(indexInLevel), level - 1)
      }
    }
    getRadix(index, vectorRoot, vectorDepth)
  }

Updating  Since the structure is immutable, the updated operation has to recreate the entire path from the root to the element being updated. The leaf update creates a fresh copy of the leaf array with one updated element. Then the parent of the leaf is also updated with the reference to the new leaf, then the parent's parent, and so on all the way up to the root.

  def updated(index: Int, elem: A) = {
    def updatedNode(node: Node, level: Int): Node = {
      val indexInNode = // compute index
      val newNode = copy(node)
      if (level == 0) {
        newNode(indexInNode) = elem
      } else {
        newNode(indexInNode) = updatedNode(node(indexInNode), level - 1)
      }
      newNode
    }
    new Vector(updatedNode(vectorRoot, vectorDepth), ...)
  }

Therefore the complexity of this operation is O(m·log_m(n)), since it traverses and recreates O(log_m(n)) nodes of size O(m). For example, if some leaf has all its elements updated from left to right, the branch will be copied as many times as there are updates. We will later explain how this can be optimized by allowing transient states that avoid re-creating the path to the root tree node with each update (described in §4).
Appending front and back  The implementation of appended front/back has two cases, depending on the current state of the Radix-Balanced tree: if the first/last leaf is not full, the element is inserted directly and all nodes of the first/last branch are copied. If the leaf is full, we must find the lowest node in the last branch where there is still room left for a new branch. Then a new branch that only contains the new element is appended to it.

In both cases the new vector object will have the start/end index decreased/increased by one. When the root is full, the depth of the vector will also increase by one.
  val m = // branching factor
  def appended(elem: A, whr: Where): Vector[A] = {
    def appended(node: Node, level: Int) = {
      val indexInNode = // compute index based on start/end index
      if (level == 1)
        copyAndUpdate(node, indexInNode, elem)
      else
        copyAndUpdate(node, indexInNode,
          appended(node(indexInNode), level - 1))
    }
    def newBranch(depth: Int): Node = {
      val newNode = Node.ofDim(m)
      val idx = whr match {
        case Frt => m - 1
        case Bck => 0
      }
      newNode(idx) =
        if (depth == 1) elem
        else newBranch(depth - 1)
      newNode
    }
    if (needNewRoot()) {
      val newRoot = whr match {
        case Frt => Node(newBranch(depth), root)
        case Bck => Node(root, newBranch(depth))
      }
      new Vector(newRoot, depth + 1, ...)
    } else {
      new Vector(appended(root, depth), depth, ...)
    }
  }

In the code above, isTreeFull and needNewRoot are operations that compute the answer using efficient bitwise operations on the start/end index of the vector.

Since the algorithm traverses and creates new nodes from the root to a leaf, the complexity of the operation is O(m·log_m(n)). Like the updated operation, it can be optimized by keeping transient states of the immutable vector (described in §4).

Splitting  The core operations to remove elements from a Radix-Balanced tree are take and drop. They are used to implement many other operations such as splitAt, tail, init and others.

The take and drop operations are similar. The first step is traversing the tree down to the leaf where the cut will be done. Then the branch is copied and cleared on one side. The tree may become shallower during this operation, in which case some of the nodes at the top are dropped instead of being copied. Finally, the start and end indices are adjusted according to the changes on the tree.

  def take(index: Int) = split(index, Right)
  def drop(index: Int) = split(index, Left)
  def split(index: Int, removeSide: Side) = {
    def splitRec(node: Node, level: Int): (Node, Int) = {
      val indexInNode = // compute index
      if (level == 0) {
        (copyAndSplitNode(node, indexInNode, removeSide), 1)
      } else removeSide match {
        case Left if indexInNode == node.length - 1 =>
          splitRec(node(indexInNode), level - 1)
        case Right if indexInNode == 0 =>
          splitRec(node(indexInNode), level - 1)
        case _ =>
          val newNode = copyAndSplitNode(node, indexInNode, removeSide)
          val (newSubnode, depth) = splitRec(node(indexInNode), level - 1)
          newNode(indexInNode) = newSubnode
          (newNode, level)
      }
    }
    val (newRoot, newDepth) = splitRec(vectorRoot, vectorDepth)
    new Vector(newRoot, newDepth, ...)
  }

The computational complexity of any split operation is O(m·log_m(n)), due to the traversal and the copying of nodes on the branch where the cut index is located: O(log_m(n)) for the traversal of the branch and then O(m·log_m(n₂)) for the creation of the new branch, where n₂ is the size of the new vector (with 0 ≤ n₂ < n).

3. Immutable Vectors as Relaxed Radix Trees

Relaxed-Radix-Balanced vectors use a new tree structure that extends the Radix-Balanced trees to allow fast concatenation of vectors without losing performance on the other core operations [4]. Relaxing the vector consists in using a slightly unbalanced extension of the tree that combines balanced subparts. This vector still ensures the log_m(n) bound on the height of the tree and on the operations presented in the previous section.
3.1 Relaxed-Radix-Balanced Tree Structure

The basic difference in the structure is that in a Relaxed-Radix-Balanced (or RRB) tree we allow nodes that contain subtrees that are not completely full. As a consequence, the start and end index are no longer required, as the branches on the ends can be truncated. The structure of the RRB trees does not ensure by itself that the tree height is bounded by log_m(n). This bound is maintained by each operation using an additional invariant on the tree balance. In our case the concatenation operation is the only one that can affect the inner structure (excluding ends) of the tree, and as such it is the only one that needs to worry about this invariant.

Tree balance  As the tree will not always be perfectly balanced, we define an additional invariant on the tree that will ensure an equivalent logarithmic bound on the height. We use the relation between the maximum and minimum branching factors m_max and m_min at each level. These give the corresponding maximum height h_max and least height h_min needed to represent a given number of elements n. Then h_min = log_{m_max}(n) and h_max = log_{m_min}(n), or equivalently h_min = lg(n)/lg(m_max) and h_max = lg(n)/lg(m_min). Trees that are better balanced have a height ratio h_r = lg(m_min)/lg(m_max) that is closer to 1, perfect balance. In our tree we use m_max = m_min + 1 to make h_r as close to 1 as possible. In practice (using m = 32), in the worst-case scenario there is an increase from around 6.2 to 6.26 in the maximum possible height (i.e. 7 levels in both cases).

Sizes metadata  When one of these trees (or subtrees) is unbalanced, it is no longer possible to know the location of an index just by applying radix manipulation on it. To avoid losing the performance of traversing down the tree in such cases, each unbalanced node keeps metadata on the sizes of its subtrees. The sizes are kept in a separate6 copy-on-write array as accumulated sizes. This way, they represent the location of the ranges of the indices in the current subtree. To avoid creating additional objects in memory, these sizes are attached at the end of the node. To have a homogeneous representation of nodes, the balanced subtrees have an empty reference attached at the end. For leaves, however, we make an exception: since they will always be balanced, they only contain the data elements but not the size metadata.

6 To be able to share them across different vectors. This is a common case when using updated.

Figure 3. Relaxed-Radix-Balanced tree structure.

3.2 Relaxed Core Operations

Algorithms for the relaxed version assume that the tree is unbalanced and use a relaxed version of the code for Radix-Balanced trees. But as soon as a balanced subtree is encountered, the more efficient radix-based algorithm is used. We also favor the creation of balanced trees/subtrees when possible, to improve performance on subsequent operations.

Indexing  When the tree is relaxed it is not possible to compute the sub-indices directly from the index. By keeping the accumulated sizes in the node, the computation of sub-indices becomes trivial: the sub-index is the first index subIndex in the sizes array where index < sizes[subIndex]. The fastest way to find it is to use binary search to reduce the search space and, once the range is small enough to take advantage of cache lines, switch to linear search.

  def getBranchIndex(sizes: Array[Int], indexInTree: Int): Int = {
    var (lo, hi) = (0, sizes.length)
    while (linearThreshold < hi - lo) {          // binary search on large ranges
      val mid = (hi + lo) / 2
      if (sizes(mid) <= indexInTree) lo = mid
      else hi = mid
    }
    while (sizes(lo) <= indexInTree) lo += 1     // linear search on small ranges
    lo
  }

Note that to traverse the tree down to the leaf where the index is located, the sub-indices are computed from the sizes only as long as the tree node is unbalanced. If the node is balanced, the more efficient radix-based method is used from there down to the leaf, to avoid accessing the additional array at each level. In the worst case the complexity of indexing becomes O(log2(m)·log_m(n)), where log2(m) is a constant factor that is only added on unbalanced nodes.

  def get(index: Int): A = {
    def getRadix(idx: Int, node: Node, depth: Int): A = ... // as in §2.2
    def get(idx: Int, node: Node, depth: Int): A = {
      val sizes = // get sizes from node
      if (isUnbalanced(sizes)) {
        val branchIdx = getBranchIndex(sizes, idx)
        // subtract the accumulated size of the preceding subtrees
        val subIdx = if (branchIdx == 0) idx else idx - sizes(branchIdx - 1)
        get(subIdx, node(branchIdx), depth - 1)
      } else getRadix(idx, node, depth)
    }
    get(index, root, depth)
  }

Updating and Appending  For each one of these operations, the only fundamental difference with the Radix-Balanced tree is that when a node of a branch is updated, the sizes must be updated with it (if needed). In the case of updating, the structure does not change, and as such it always keeps the same sizes object reference. The traversal down the tree is done using the new abstraction used in the relaxed version of indexing.

In the case of appending to the back, an updated unbalanced node must increment the accumulated size of its last subtree by one. When a new branch is appended, a new size is appended to the sizes. The newBranch operation is simplified by using truncated nodes and letting the node on which it gets appended handle any index shifting required.

  def appended(elem: A, whr: Where): Vector[A] = {
    ...
    def newBranch(depth: Int): Node = {
      val newNode = Node.ofDim(1)   // truncated single-slot node
      newNode(0) = if (depth == 1) elem else newBranch(depth - 1)
      newNode
    }
    ...
  }

In the case of appending to the front, an updated node must increment the accumulated size of each of its subtrees by one. When a new branch is appended, a 1 is prepended to the sizes and all other accumulated sizes are incremented by one.

The complexity of these operations is still O(m·log_m(n)): log2(m)·log_m(n) for the traversal plus m·log_m(n) for the branch update or creation.

Splitting  While splitting, the traversal down the tree is done using the relaxed version of indexing. The splitting operation just truncates the node on the left/right. In addition, when encountering an unbalanced node, the sizes are truncated and adjusted. The complexity of this operation is still O(m·log_m(n)): log2(m)·log_m(n) for the traversal plus m·log_m(n) for the branch update.
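Before moving on to concatenation, the unbalanced-node layout described in §3.1 can be sketched as follows; helper names such as mkUnbalancedNode are ours, for illustration only, and are not part of the paper's implementation:

  // An unbalanced node stores its children plus the accumulated-sizes
  // array in one extra slot at the end; balanced internal nodes carry a
  // null there instead, and leaves omit the slot entirely.
  def mkUnbalancedNode(children: Array[Array[AnyRef]],
                       sizes: Array[Int]): Array[AnyRef] = {
    val node = new Array[AnyRef](children.length + 1)
    System.arraycopy(children, 0, node, 0, children.length)
    // e.g. subtrees of 32, 26 and 27 elements yield sizes Array(32, 58, 85)
    node(children.length) = sizes
    node
  }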
3.3 Concatenation

The concatenation algorithm used on RRB-Vectors is a slightly modified version of the one proposed in the RRB-Trees technical report [4]. This version favors nodes of size m over m−1, making the trees more balanced. With this approach we sacrifice a bit of performance on concatenation, but we gain performance on all other operations: better balancing implies a higher chance of using the fast radix operations on the trees.

From a high level, the algorithm merges the rightmost branch of the vector on the LHS with the leftmost branch of the vector on the RHS. While merging the nodes, each of them is rebalanced in order to ensure the O(log_m(n)) bound on the height of the tree and avoid the degeneration of the structure. The RRB version of concatenation has a time complexity of O(m²·log_m(n)), where m is constant.

  def concatenate(left: Vector[A], right: Vector[A]) = {
    val newTree = mergedTrees(left.root, right.root)
    val maxDepth = max(left.depth, right.depth)
    if (newTree.hasSingleBranch)
      new Vector(newTree.head, maxDepth)
    else
      new Vector(newTree, maxDepth + 1)
  }
  def mergedTrees(left: Node, right: Node, depth: Int) = {
    if (depth == 1) {
      mergedLeaves(left, right)
    } else {
      val merged =
        if (depth == 2) mergedLeaves(left.last, right.first)
        else mergedTrees(left.last, right.first, depth - 1)
      mergeRebalance(left.init, merged, right.tail)
    }
  }
  def mergedLeaves(left: Node, right: Node) = {
    // create a balanced new tree of height 2
    // with all elements in the nodes
  }

The concatenation operation starts at the bottom of the branches by merging the leaves into a balanced tree of height 2 using mergedLeaves. Then, for each level on top of it, the newly created merged subtree and the remaining branches on that level are merged and rebalanced into a new subtree. This new subtree always adds a new level to the tree, even though it might get dropped later. New sizes of nodes are computed each time a node is created, based on the sizes of the children nodes.

The rebalancing algorithm has two proposed variants. The first consists of completely rebalancing the nodes on the two top levels of the subtree. The second also rebalances the top two levels of the subtree, but it only rebalances the minimum amount of nodes that ensures the logarithmic bound. The first one leaves the tree better balanced, while the second is faster. As we aim to have good performance on all operations, we use the first variant7. The following snippet of code shows a high-level implementation of this first variant. Details on the second variant can be found in [4], in case concatenation is prioritized over all other operations.

7 Performance of operations using the second variant was analyzed in [37].

  def mergeRebalance(left: Node, center: Node, right: Node) = {
    // join all branches
    val merged = left ++ center ++ right
    var newRoot = new ArrayBuilder
    var newSubtree = new ArrayBuilder
    var newNode = new ArrayBuilder
    def checkSubtree() = {
      if (newSubtree.length == m) {
        newRoot += computeSizes(newSubtree.result())
        newSubtree.clear()
      }
    }
    for (subtree <- merged; node <- subtree) {
      if (newNode.length == m) {
        checkSubtree()
        newSubtree += computeSizes(newNode.result())
        newNode.clear()
      }
      newNode += node
    }
    checkSubtree()
    newSubtree += computeSizes(newNode.result())
    newRoot += computeSizes(newSubtree.result())  // flush the last subtree
    computeSizes(newRoot.result())
  }

Figures 4, 5, 6 and 7 show a concrete step-by-step (level-by-level) example of the concatenation of two vectors. In the example, some of the subtrees were collapsed. This is not only to make the diagrams fit, but also to expose only the nodes that are referenced during the execution of the algorithm. Nodes with colors represent new nodes and changes, to help track them from figure to figure.

Figure 4. Concatenation example: Rebalancing level 0.
Figure 5. Concatenation example: Rebalancing level 1.
Figure 6. Concatenation example: Rebalancing level 2.
Figure 7. Concatenation example: Rebalancing level 3.

The concatenation algorithm chosen for the RRB Vector is the one that is slower but better at rebalancing. The reason behind this decision is that with better balanced trees all other operations on the trees are more efficient. In fact, choosing the less efficient option does not need to be seen as a reduction in performance, because the improvement is in relation to the Radix-Balanced tree concatenation of linear complexity. An interesting consequence of this choice is that all trees (or subtrees) of size at most m² (the maximum size of a two-level RRB tree) that were created by concatenation will be completely balanced.

It is important to have a smart rebalancing implementation, due to the m² elements that can possibly be accessed. The first crucial factor is the speed of copying the nodes. With an implementation that takes advantage of spatial locality by using arrays (§5.2), the amount of work required can be reduced to m fast node copies rather than m² element copies. Another crucial but obvious implementation detail is to never duplicate a node if it does not change. This requires a small amount of additional logic, comes with a benefit in memory used, and in good cases can reduce the number of node copies required, potentially reducing the effective work to o(m·log_m(n)) if there is a good alignment.
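The regrouping policy at the heart of mergeRebalance (variant one) can be pictured with plain Scala collections; this sketch only illustrates how children are packed greedily into full nodes, not the actual node code:

  // Stand-ins for the children of the merged level; with m = 4 for
  // readability, ten children regroup into two full nodes and one partial.
  val m = 4
  val children = (1 to 10).toList
  val regrouped = children.grouped(m).toList
  // List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10))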
When improving the vector on locality (§4), concatenating a small vector using the concatenation algorithm is less efficient than appending directly onto the other tree. That case is identified by a simple bound on the lengths, and then all elements from the smaller vector are appended to the larger one.

Other Operations  Having efficient concatenation and splitting allows us to also implement several other operations that change the structure of the tree. Some of these operations are: inserting an element/vector at any position, deleting an element or a subrange of the vector, and patching/replacing part of the vector. The complexity of these operations is bounded by the complexity of the core operations used.

Parallelizing the Vector  To parallelize operations we use the fork-join pool model from Java [18, 31]. In this model, processing is achieved by splitting the work into smaller parts until they are deemed small enough to ensure good parallelism. This can be achieved using the efficient splitting of the RRB-Tree. For certain operations, like map, filter and reduce, the results obtained in parallel must be aggregated, such as concatenating the partial vectors produced by the parallel workers. The aggregation can occur in several steps, where partial results from different workers are aggregated in parallel, recursively, until a single result is produced. The overhead associated with the distribution of work is O(m²·log_m(n)).
O(m2 · logm (n)). branch where that operations could be used again, that branch is
used as focus. If the first one can’t be applied, the focus is set to the
4. Improvements on Operation Locality and first element as this helps key collection operations (such as getting
Amortization of Costs an iterator).
The vector iterator and builder use this to switch from one leaf
In this section we present different approaches aimed at improving to the next one with the minimal number of steps. In fact, this
the performance using explicit caching on the subtrees in a branch. effectively amortizes out the cost of traversing the tree structure
This is a new generalization of the Clojure [15] (and current Scala) over the accesses in the leaves as each edge of the tree will only
optimizations on their Radix-Balanced vectors. All optimizations be accessed once. In the case of RRB tree iteration there is an
we describe rely on the vector object keeping a set of fields that additional abstraction for switching from one balanced subtree to
track the entire branch of the tree, from the root to a leaf. Figure 4 the next one.
shows such an RRB-vector with the levels numbered starting from
0 at the bottom. The explicit caches focus on the nodes reaching 4.2 Amortizing Costs using Transient States
from the root to the tree0 leaf. Transient states are the key to providing amortized constant-time
appending, local updating and local splits. To achieve this, we de-
tree2 couple the tree by creating an equivalent tree that does not contain
redundant edges on the current focused branch. The information
tree1 missing in the edges of the tree is represented and can be recon-
structed from the trees in the focused branch.
tree0

Figure 8. Branch trees in cache.

To know on which branch the vector is focused there is also a


focus index field. It can be the index of any element in the cur-
rent tree0. To follow the simple implementations scheme of im-
mutable objects in concurrent contexts, the focus is also immutable. Figure 9. Transient tree with current focus branch marked in white
Therefore each vector object will have a single focused branch dur- and striped nulled edges.
ing its existence. Each method that creates a new vector must de-
cide which focus to set. Without transient states when a leaf is updated, the entire branch
These optimizations depend heavily on the radix operations for must be updated. On the other hand, if the state is transient, it is
efficiency. To avoid losing these completely on unbalanced RRB only necessary to update the subtree affected by the change. In the
trees we will only use these operation on a balanced subtree of the case of updates on the same leaf, only the leaf must be updated.
branch. The vector will keep extra meta data on the start and end When appending or updating consecutive indices, m−1 m
opera-
index of this subtree as well as its height. In the case of a balanced tions must only update the leaf, then m−1
m2
need to update two lev-

347
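The closed form used above follows from the standard geometric-series identity Σ_{k≥1} k·x^k = x/(1−x)²; a short derivation for completeness (for m = 32 the bound is 32/31 ≈ 1.03 node updates per operation):

  \sum_{k=1}^{\infty} \frac{k\,(m-1)}{m^{k}}
    = (m-1)\sum_{k=1}^{\infty} k\left(\frac{1}{m}\right)^{k}
    = (m-1)\cdot\frac{1/m}{\left(1-\tfrac{1}{m}\right)^{2}}
    = (m-1)\cdot\frac{m}{(m-1)^{2}}
    = \frac{m}{m-1}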
There is a cost associated with the transformation from canonical to transient state and back. This cost is equivalent to one update of the focused branch. The transient state operations only start paying off after 3 consecutive operations; with 2 consecutive operations they break even, and with 1 there is a loss in performance.

Canonicalization  The transient state aims to improve the performance of some operations by amortizing costs. But the transient state is not ideal for the performance of other operations. For example, an indexing operation on an unbalanced vector may lack the size information it requires to efficiently access certain indices, and an iterator relies on a canonical tree for performance. It is possible to implement these operations on a transient state, but this involves both code duplication and additional overhead on each call.

The solution we used involves converting the transient representation to a canonical one. This conversion, called canonicalization, is applied when an operation that requires the canonical form is called on an instance of the immutable vector. The mutation of the vector is not visible from the outside and only happens at most once (Figure 10). This transformation only affects the nodes that are on the focused branch, as it copies each one (except the leaf) and links the trees. If the node is unbalanced, the size of the subtree in focus is inserted. This transformation can be seen as a lazy initialization of the current branch.

Figure 10. Object states and the effects of operations.

Vector objects can only be in the transient state if they were created this way. For example, the appending operations will create a new object that is in the transient state and focused on the last/first branch. If the source object was not focused on the last branch, it is canonicalized (if needed) before the change-of-branch operation. Vectors of depth 1 are special cases: they are always in canonical form and their operations are equivalent to those in transient form.

4.3 Comparison

In Table 1 we show the difference in complexities between the Radix-Balanced Vector and the Relaxed-Radix-Balanced Vector. In Table 2 we compare the RRB-Vector to other possible implementations. The operations in the tables represent all the different complexities that can be reached by the core operations. The operations are divided into four categories: (i) fetching, (ii) updating, (iii) inserting, and (iv) removing. In (i) there is the indexing (or random access) of an element given its index, and the sequential scanning of elements. Category (ii) is divided into normal (or random) update, with a special case for updates that are done locally. Category (iii) is divided into building (with/without a result size hint), concatenation and insertion of elements (or vectors), with a special case for insertions at the ends (appended front/back and small concat). Removing (iv) is divided into splits (split, take, drop, ...), splits at the ends (tail/init, drop(n)/take(n) with n in the first/last leaf) and the general removal of a range of indices.

Table 1. Comparison between the Radix and Relaxed Radix Vectors. All aC entries have a log_m worst-case scenario.

                 Radix-Balanced Vector   RRB Vector   with m = 32
  indexing       log_m                   log_m        eC
  scanning       aC                      aC           aC
  update         m·log_m                 m·log_m      eC
  update local   aC                      aC           aC
  concat/insert  L                       m²·log_m     L vs. eC
  insert ends    aC                      aC           aC
  building       aC                      aC           aC
  split          m·log_m                 m·log_m      eC
  split ends     aC                      aC           aC
  remove         L                       m²·log_m     L vs. eC

We use the notation eC for effectively constant time: complexities that are log_m(n), assuming that n is bounded by the index integer representation and m is large enough. In our case Int is a 32-bit signed integer and m = 32, giving the bound log_m(n) < 7, hence we argue that this is bounded by a constant. In Table 1, aC (amortized constant) has a worst case of log_m or m·log_m; in other words, it is eC in the worst case. In Table 2, the aC entries of the RRB Vector have a worst case of eC, the aC of the COW Array has a linear worst case, and the remaining aC entries have a worst case of log2. C and L denote constant and linear time, respectively.

Table 2. Comparison with other data structures that could be used to implement indexed sequences.

                 RRB Vector   COW Array   Finger Tree   Red-Black Tree
  indexing       eC           C           log2          log2
  scanning       aC           C           aC            aC
  update         eC           L           log2          log2
  update local   aC           L           log2          log2
  concat/insert  eC           L           log2          L
  insert ends    aC           L           aC            log2
  building       aC           C/aC        aC            log2
  split          eC           L           log2          log2
  split ends     aC           L           aC            log2
  remove         eC           L           log2          L

5. Implementation

5.1 Scala Indexed Sequences

Our implementation of the RRB-Vector8 is based on the Scala Collections [26] IndexedSeq, which acts as a decorator exposing many predefined generic operations that are based on just a few core primitives. For example, most operations that involve the entire collection (such as map, filter and foreach) use iterators and builders. To improve performance, we overrode several of the decorator operations to use the efficient vector primitives directly, without going through the IndexedSeq code.

8 Along with all other sequences we compare against.

Parallel RRB Vector  The implementation of the parallel RRB Vector is a trivial wrapper over the sequential RRB Vector using the Scala Parallel Collections API [25, 33, 34]. This only requires the presence of the split and combine operations, which, in our case, are simply splitting an iterator on the tree and combining using concatenation. Both of these use the efficient core RRB tree operations. When splitting, we additionally have heuristics that yield a well-aligned concatenation and create a balanced tree.

5.2 Arrays as Nodes

One of the aims of Scala Collections [25, 26, 33, 34] is the elimination of code duplication, and one of the mechanisms to achieve this is the use of generic types [9, 22].
But this also has a drawback: the need to box primitive values in order for them to be stored in the collection. We implemented all our sequences in this context.

All nodes are stored in arrays of type Array[AnyRef], since this allows us to quickly access elements (which are boxed anyway due to generics9) without dispatching on the primitive array type. A welcome side effect of this decision is that elements are already boxed when they are passed to the Vector; thus accessing and storing them does not incur any intermediate boxing/unboxing operations, which would add overhead. However, it is known that using the boxed representation for primitive types is inefficient when operating on the values themselves, so the sizes of unbalanced nodes are stored in Array[Int] objects, guaranteeing the most compact and efficient data representation.

9 A limitation that could be circumvented by Miniboxing [40].

Most of the memory used in the vector data structure is composed of arrays. There are three key operations used on these arrays: creation, update and access. Since the arrays are used with copy-on-write semantics, actual update operations are only allowed while the array is being initialized. This also implies that each time there is a modification on some part of an array, a new array must be created and the old elements must be copied.

The size of the arrays will affect the performance of the vector. With larger arrays in the nodes, access times will be reduced because the depth of the tree decreases. But, on the other hand, increasing the size of the arrays will slow down the update operations, as they have to copy the entire array to execute an element update, due to the copy-on-write semantics.

For an individual reference to an RRB-Vector of size n and branching of m, the memory used is composed of the arrays located in the leaves10 and the ones that form the tree structure. In our case we store references, and hence we need ⌈n/m⌉ arrays of m references11. The structure requires at least the references to the child nodes and, in the worst-case scenario, an additional integer per child for the sizes. Going up level by level, the reference count decreases by a factor of m, and hence the total is bounded by Σ_{k=2}^{log_m(n)} ⌈n/m^k⌉ < Σ_{k=2}^{∞} ⌈n/m^k⌉ ≤ (n+m)/(m·(m−1)) references. As for the sizes of the nodes, given our choice of rebalancing algorithm they only appear on nodes of height 3 or larger, and hence they are bounded by Σ_{k=3}^{log_m(n)} ⌈n/m^k⌉ < Σ_{k=3}^{∞} ⌈n/m^k⌉ ≤ (n+m)/(m²·(m−1)) integers.

10 Note that the memory used in the leaves is equivalent to the memory used for an array that contains all the elements.
11 It could be any kind of data.
5.3 Running on a JVM

In practice, Scala compiles to Java bytecode and executes on a Java Virtual Machine (JVM); we used the Oracle Java SE distribution [29] as a reference. This imposes additional performance characteristics that cannot be evaluated at the algorithmic level alone, and asks for a more nuanced discussion.

One of the JVM components that directly affects vectors is the garbage collector (GC). Vector operations tend to create a large number of Array objects, some of which are only necessary for a short time. These objects use up memory and thus degrade overall performance as the GC is invoked more often. For this reason our code is optimized to avoid the redundant creation of intermediary objects, delaying the GC cycles and thus improving performance.

Instead of directly compiling bytecode to native code, the JVM uses a just-in-time (JIT) compilation mechanism in order to take advantage of run-time profiling information. At first it runs the compiled bytecode inside an interpreter and collects execution statistics (profiles). Later, once a method has executed enough times, it compiles it using the statistics to guide optimizations. The Vector code tries to gain performance by aligning with the JIT heuristics and hence taking advantage of its optimizations. The most important such optimization is inlining, which eliminates the overhead of calling a method and, furthermore, enables other optimizations to improve the inlined code. Critical parts of the Vector code are carefully designed to match the heuristics of the JVM. In particular, a heuristic that arose commonly is that only methods smaller than 35 bytes of bytecode are inlined, which meant we had to split the code into several methods to stay below this threshold.

6. Evaluation

6.1 Methodology

ScalaMeter [30] is used to measure the performance of operations on different implementations of indexed sequences.

To have reproducible results with low error margins, ScalaMeter was configured on a per-benchmark basis. Each test is run on 32 different JVM instances to average out badly allocated VMs. On each JVM, 32 measurements were taken and filtered using outlier elimination to remove runs that were exceptionally different. This can happen if a more thorough garbage collection cycle occurs in a particular run, due to JIT compilation, or if the operating system switches to a more important task during the benchmark process [12]. Before taking measurements, the JVM is warmed up by running the benchmark code several times without taking the measurements into account. This allows us to measure the time after JIT compilation has occurred, when the system is in a steady state.

There are three main directions in the performance comparisons. The first compares the Radix-Balanced vectors with well-balanced RRB Vectors, with the goal of having equivalent performance, even if the RRB Vectors have an inherent additional overhead. The second axis shows the effects of unbalanced nodes on the RRB-Tree. For this we compare the same perfectly balanced vector with an extremely unbalanced vector. The latter vector is generated by concatenating pseudo-random small vectors together. The amount of unbalanced nodes is in part affected by the size of the vector. The third axis is the comparison between vectors in general and other well-known functional and/or immutable data structures used to implement sequences. We used copy-on-write (COW) arrays, finger trees [16] (FingerTreeSeq12) and red-black trees [13] (RedBlackSeq13).

12 Adapted version of https://github.com/Sciss/FingerTree where abstractions that did not involve sequences were removed.
13 Adaptation of the standard Scala Collections RedBlackTree where keys are used as indices.

6.2 Results

For the results of this section, benchmarks were executed on a Java HotSpot(TM) 64-Bit Server VM on a machine with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 32GiB of RAM. Each benchmarking VM instance was set up with 16GiB of heap memory. The parallel vector split-combine was executed on a machine with 4 Intel(R) Xeon(R) E5-4640 processors @ 2.40GHz and 128GiB of RAM.

Iterating  The benchmark in Figure 11 shows the time it takes to scan the whole sequence using a specialized iterator. Unsurprisingly, the results show that the best option is the array. But the vector is only 1-2× slower, closer to 1× in the most common cases. It is also possible to see that vectors are 7-15× faster than the other, deeper trees, mainly due to the reduction in indirections and increased locality.

Figure 11. Iterating through the sequence.
Building  The benchmark in Figure 12 shows the time it takes to build a sequence using a specialized builder. In general, the builder for these sequences does not know the size of the resulting sequence. In the case of the array builder there is the possibility of giving it a hint of the result size (Hinted in the benchmarks). In this case the vector wins against all other implementations. It is faster than the other trees because they require re-balancing during the building, whereas the vector behaves more like array building, by allocating chunks of memory and filling them. Array building requires resizing of the array whenever it is filled or the result is returned, which implies a copy of the whole array. By contrast, the vector only requires a copy of the last branch when returned. This is the main reason the vector is able to outperform the array building process. Also, the standard array builder uses the hint only as such, and therefore still requires some copies of the array.

Figure 12. Building a sequence.
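The builder interplay described above can be observed with the standard library types (illustrative only; sizeHint corresponds to the "Hinted" array builder in the benchmarks):

  import scala.collection.mutable.ArrayBuilder

  // Vector builder: fills fixed-size leaves, so result() only has to
  // finalize the last branch instead of copying everything.
  val vb = Vector.newBuilder[Int]
  (1 to 100000).foreach(vb += _)
  val vec = vb.result()

  // Array builder with a size hint: avoids repeated resizing while
  // filling, but may still copy the backing array on result().
  val ab = ArrayBuilder.make[Int]
  ab.sizeHint(100000)
  (1 to 100000).foreach(ab += _)
  val arr = ab.result()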
almost equivalent in performance (bottom planes). The array up to
Indexing  Figure 13 shows the time taken to access 10k elements at consecutive indices, while Figure 14 shows the same for randomly chosen indices. From the algorithmic point of view they are exactly the same; the difference is in how the memory is kept in the processor caches. It shows that in either case the vector access behaves effectively as constant time, like the array, whereas the finger trees and red-black trees degenerate with randomness. A vector of depth 3 is 2-3.5× slower than the array, the cost of accessing the arrays in the 3 levels of the branches.

Figure 13. Accessing 10k consecutive indices.
Figure 14. Accessing 10k random indices.

Updating  Figure 15 shows the time taken to update 10k elements at consecutive indices and Figure 16 shows the same for randomly chosen indices. In this case the array is clearly the worst option because it creates a new version and copies the contents with each update. The vector behaves effectively as having constant time while taking advantage of locality, and degenerates slightly with randomness. The vector is 4.3× faster on local updates and 1-2.3× faster on random updates than the red-black tree.

Figure 15. Updating on 10k consecutive indices.
Figure 16. Updating on 10k random indices.

Concatenating  Figures 17 and 18 show the time it takes to concatenate two sequences (two points of view of the same 3D plot). The two axes on the bottom represent the sizes of the LHS (left-hand side) and RHS (right-hand side) of the concatenation operation. It can be seen that the RRB Vector and the finger trees are almost equivalent in performance (bottom planes). The array, up to a result size of 4096, is able to concatenate faster thanks to locality, but then grows linearly with the result size (middle plane). The vector without efficient concatenation (on Radix-Balanced trees) behaves just like the array but with worse constant factors (top plane). The red-black tree was omitted from this graph due to its inefficient concatenation operation.

Figure 17. Concatenating two sequences (point of view 1). RRB Vector and Finger Tree are the planes at the bottom, COW Array is the plane in the middle and Vector is the plane on the top.
Figure 18. Concatenating two sequences (point of view 2). More information on the first point of view in Figure 17.

Appending  Figures 19 and 20 show the time it takes to append 256 elements one by one. In the first case we append them to the front and in the second to the back of the sequence. The large number of elements was chosen in order to show the amortized time of the operation on the vectors. In this case the array is clearly the worst option because it creates a new version and copies the contents with each append. The vector is around 2.5× slower than the finger trees, a structure that specifically focuses on these operations. The vector can be 1-2× faster than a red-black tree.

Figure 19. Appending front 256 elements one by one.
Figure 20. Appending back 256 elements one by one.
Splitting  Figures 21 and 22 show the time it takes to split a sequence on the left and on the right. We fixed the cut point to the middle of the sequence to be able to compare the time it takes to take or drop the same number of elements. It can be seen that splitting a vector is more efficient than the other structures. Even more, the vector behaves with effectively constant time.

Figure 21. Taking the first half of the sequence.
Figure 22. Dropping the first half of the sequence.

Parallel Vector Split-combine Overhead  The benchmarks in Figures 23 and 24 show the amount of overhead associated with the parallelization of the vector with and without efficient concatenation. They show the typical overhead of a parallel map, filter or other similar operation that creates a new version of the sequence. The benchmark computes a map operation using the identity function, such that the execution time is dominated by the time it takes to split and combine the sequence rather than by the function computations. As a base for comparison we used the sequential map on both versions, where the results are identical. We then parallelized it on fork-join thread pools of 1, 2, 4, 8, 16, 32 and 64 threads on a 64-thread (32-core) machine. Without concatenation, there is a loss of performance when passing from sequential to parallel, and although the performance increases with the addition of threads, even with 64 threads it is only slightly better than the sequential version. By contrast, with our new vector, the gain in performance starts with one thread in the pool (dedicated thread) and then increases: a 1.55× increase with 2 threads, 2.46× for 4 threads, 3.52× for 8 threads, 4.60× for 16 threads, 5.52× for 32 threads (core limit) and 7.18× for 64 threads (hardware thread limit).

Figure 23. Parallel (non-RRB) Vector overhead on a map operation.
Figure 24. Parallel RRB Vector overhead on a map operation.

Memory Overhead  Figure 25 shows the memory overhead of the data structures used in the benchmarks. This overhead is the additional space used in relation to the COW Array. The overhead of the vector is 17.5× smaller than that of the finger tree and 40× smaller than that of the red-black trees.

Figure 25. Memory overhead of the sequences in relation to arrays.

7. Related Work

Related data structures  There is a strong relation between RRB Trees and various data structures in the literature. Patricia tries [24] are one of the earliest documented uses of radix trees, performing lookups and updates one bit or character at a time. Wider radix trees were used to build space-efficient sparse arrays, Array Mapped Tries (AMT) [1], and on top of that Hash Array Mapped Tries (HAMT) [2], which have been popularized and adapted to an immutable setting in Clojure [15]. Radix trees are also used as index structures for in-memory databases [19]. In databases, B-Trees [6] are a ubiquitous index data structure similar to RRB Trees, in the sense that they are wide trees and allow a certain degree of slack to enable efficient merging. However, the chosen trade-offs are different and B-Trees are not normally used as sequence data structures. Ropes [5] are a sequence data structure with efficient concat and rebalancing operations. Immutable sequence data structures include VLists [3], Finger Trees [16] and various kinds of random access lists [27].
Figure 19. Appending front 256 elements one by one Figure 21. Taking the first half of the sequence

Figure 20. Appending back 256 elements one by one Figure 22. Dropping the first half of the sequence

Memory Overhead Figure 25 shows the memory overhead of the data structures used in the benchmarks. This overhead is the additional space used in relation to the COW array. The overhead of a vector is 17.5× smaller than that of the finger tree and 40× smaller than that of the red-black trees.

7. Related Work
Related data structures There is a strong relation between RRB Trees and various data structures in the literature. Patricia tries [24] are one of the earliest documented uses of radix trees, performing lookups and updates one bit or character at a time. Wider radix trees were used to build space-efficient sparse arrays, Array Mapped Tries (AMT) [1], and on top of that Hash Array Mapped Tries (HAMT) [2], which have been popularized and adapted to an immutable setting in Clojure [15]. Radix trees are also used as index structures for in-memory databases [19]. In databases, B-Trees [6] are a ubiquitous index data structure similar to RRB Trees, in the sense that they are wide trees and allow a certain degree of slack to enable efficient merging. However, the chosen trade-offs are different and B-Trees are not normally used as sequence data structures. Ropes [5] are a sequence data structure with efficient concat and rebalancing operations. Immutable sequence data structures include VLists [3], Finger Trees [16] and various kinds of random access lists [27].
Parallelism Parallel execution is achieved in the RRB-Vector by relying on the fork-join pools in Java [18, 31]. The vector is split into chunks which are processed in parallel. The overhead of splitting can be offset by using cooperative tasks [14, 21], but in the case of the RRB-Vector the cost of splitting is much smaller than the cost of combining (assembling) the partial results returned by the parallel execution contexts. This is where the RRB trees make a difference: by allowing efficient structural changes, they enable the concatenation to occur in effectively constant time, much better than the previous O(n) for the Scala Vector.
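The split-then-combine pattern described above can be modeled with plain futures instead of the library's fork-join tasks (a simplified sketch, not the actual implementation; the chunk count is illustrative):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Split the vector into chunks, map each chunk in parallel,
    // then combine the partial results with ++. On an RRB-Vector
    // each ++ runs in effectively constant time; on the standard
    // radix-balanced vector it costs O(n).
    def parallelMap[A, B](xs: Vector[A], chunks: Int)(f: A => B): Vector[B] = {
      val size   = math.max(1, (xs.length + chunks - 1) / chunks)
      val parts  = xs.grouped(size).toVector
      val mapped = parts.map(part => Future(part.map(f)))
      val joined = Future.sequence(mapped).map(_.foldLeft(Vector.empty[B])(_ ++ _))
      Await.result(joined, Duration.Inf)
    }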
RRB Trees The core data structure was first described in a technical report [4] as a way to improve the concatenation of immutable vectors like the ones found in Scala Collections [26] and Clojure [15], allowing the implementation of an additional wide range of efficient structural modifications on the vectors. Later, a concrete version for Scala, in which all the locality optimizations are adapted to RRB vectors, was implemented; this is the implementation used in this paper, presented with more technical details in [37]. Another related project is [20], in which more detailed mathematical proofs are given for the RRB Trees and their operations. They also introduce a different approach to transience using semi-mutable vectors with efficient snapshot persistence, and provide a C implementation.

Scala Library In Scala, most general purpose data structures are contained in Scala Collections [26]. This framework aims to reduce code duplication to a minimum using polymorphism, higher-order functions [8], higher-kinded types [23], implicit parameters [28] and other language features. It also aims to simplify the integration of new data structures with a minimum of effort. In addition, the Scala Parallel Collections API [31–34] allows parallel collections to integrate seamlessly with the rest of the library. Behind the scenes, Scala Parallel Collections use the Java fork-join pools [18] as a backend for implicit parallelism in the data structure operations.
Figure 23. Parallel (non-RRB) Vector overhead on a map operation
Figure 24. Parallel RRB Vector overhead on a map operation
Figure 25. Memory overhead of the sequences in relation to arrays

Low level optimizations We took into account the capabilities of the VM to perform escape analysis and inlining of the compiled bytecode. These are influenced by the concrete implementation of the Java HotSpot VM compilers (C1 and C2) [17, 29]. At the compiler level there are optimization techniques that remove the cost of type-generic abstractions, such as specialization [10], Miniboxing [39, 40] and Scala Blitz [33]; the latter can additionally perform fusion on collection operations [7]. It is also possible to use staging techniques [35, 36, 38] to further optimize the code.
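As an illustration of one such technique, specialization [10] instructs the compiler to generate primitive-type variants of generic code (a minimal example, not taken from the vector implementation):

    // The @specialized annotation makes scalac emit extra variants
    // of this class for Int and Double, so `value` is stored and
    // passed unboxed for those element types.
    class Cell[@specialized(Int, Double) T](val value: T) {
      def map[@specialized(Int, Double) U](f: T => U): Cell[U] =
        new Cell(f(value))
    }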

Benchmarks Running code on a virtualized environment like the JVM, where the factors that influence performance are not under our control and can vary from execution to execution, complicates the benchmarking process [12]. In Scala there is a tool, ScalaMeter [30], designed to overcome these issues.
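A minimal ScalaMeter benchmark has the following shape (a sketch assuming ScalaMeter 0.7's Bench.LocalTime API; the generator range and the measured operation are illustrative):

    import org.scalameter.api._

    // Measures map over vectors of exponentially growing sizes;
    // ScalaMeter takes care of warm-up and statistically sound timing.
    object VectorMapBenchmark extends Bench.LocalTime {
      val sizes   = Gen.exponential("size")(1 << 10, 1 << 20, 2)
      val vectors = for (s <- sizes) yield Vector.range(0, s)

      performance of "Vector" in {
        measure method "map" in {
          using(vectors) in { v => v.map(identity) }
        }
      }
    }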
8. Conclusions
In this paper we presented the RRB-Vector, an immutable sequence collection that offers good performance across a broad range of sequential and parallel operations. The underlying innovations are the Relaxed-Radix-Balanced (RRB) Tree structure, which allows efficient structural changes, and an optimization that exploits spatio-temporal locality on the RRB data structure in order to offset the cost of navigating from the tree root to the leaves.
The RRB-Vector implementation in Scala speeds up bulk operation performance on 4 cores by at least 2.33× compared to sequential execution, scaling better with light workloads. Discrete operations take either amortized or effectively constant time, with good constants: compared to mutable arrays, sequential reads are at most 2× slower, while random access is 2-3.5× slower. The implementation of the project is open-source (https://github.com/nicolasstucki/scala-rrb-vector) and is being considered for inclusion in the Scala standard library.

Acknowledgements
We would like to thank the ICFP reviewers for their feedback and suggestions, which allowed us to improve the paper both in terms of clarity and in terms of breadth. We are grateful to Martin Odersky for allowing the Master Thesis [37] that forms the base of this paper to be supervised in the LAMP laboratory at EPFL. We would also like to thank Vera Salvisberg for her thorough review of both the paper and the code, which led to remarkable improvements in the quality of the final submission. Last but not least, we would like to thank Sébastien Doeraene for allowing Nicolas to dedicate time to improving this paper while he was working on Scala.js.

References
[1] P. Bagwell. Fast and Space-efficient Trie Searches. Technical report, EPFL, 2000.
[2] P. Bagwell. Ideal Hash Trees. Technical report, EPFL, 2001.
[3] P. Bagwell. Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays. In Implementation of Functional Languages, 2002.
[4] P. Bagwell and T. Rompf. RRB-Trees: Efficient Immutable Vectors. Technical report, EPFL, 2011.
[5] H.-J. Boehm, R. Atkinson, and M. Plass. Ropes: An alternative to strings. Software: Practice and Experience, 25(12):1315–1330, 1995.
[6] D. Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137, 1979.
[7] D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming, ICFP '07, pages 315–326, New York, NY, USA, 2007. ACM.
[8] I. Dragos. Optimizing Higher-Order Functions in Scala. In ICOOOLPS, 2008.
[9] I. Dragos. Compiling Scala for Performance. PhD thesis, IC, 2010.
[10] I. Dragos and M. Odersky. Compiling Generics through User-directed Type Specialization. In ICOOOLPS '09. ACM, 2009.
[11] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. J. Comput. Syst. Sci., 38(1):86–124, Feb. 1989.
[12] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, OOPSLA '07, pages 57–76, New York, NY, USA, 2007. ACM.
[13] S. Hanke. The Performance of Concurrent Red-Black Tree Algorithms. In J. Vitter and C. Zaroliagis, editors, Algorithm Engineering, volume 1668 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 1999.
[14] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Apr. 2008.
[15] R. Hickey. The Clojure programming language, 2006.
[16] R. Hinze and R. Paterson. Finger Trees: A Simple General-purpose Data Structure. J. Funct. Program., 16(2), 2006.
[17] T. Kotzmann, C. Wimmer, H. Mössenböck, T. Rodriguez, K. Russell, and D. Cox. Design of the Java HotSpot Client Compiler for Java 6. ACM Trans. Archit. Code Optim., 5(1), May 2008.
[18] D. Lea. A Java Fork/Join Framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, New York, NY, USA, 2000. ACM.
[19] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 38–49. IEEE Computer Society, 2013.
[20] J. N. L'orange. Improving RRB-Tree Performance through Transience. Master's thesis, Norwegian University of Science and Technology, June 2014.
[21] M. Moir and N. Shavit. Concurrent data structures. In Mehta and Sahni, editors, Handbook of Data Structures and Applications. Chapman & Hall/CRC, 2005.
[22] A. Moors. Type Constructor Polymorphism for Scala: Theory and Practice. PhD thesis, Department of Computer Science, Faculty of Engineering Science, May 2009.
[23] A. Moors, F. Piessens, and M. Odersky. Generics of a Higher Kind. ACM SIGPLAN Notices, 43, 2008.
[24] D. R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, Oct. 1968.
[25] M. Odersky. Future-Proofing Collections: From Mutable to Persistent to Parallel. In Compiler Construction, volume 6601 of Lecture Notes in Computer Science. Springer, 2011.
[26] M. Odersky and A. Moors. Fighting Bit Rot with Types (Experience Report: Scala Collections). In R. Kannan and K. N. Kumar, editors, IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, volume 4 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2009. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[27] C. Okasaki. Purely Functional Data Structures. Cambridge University Press, New York, NY, USA, 1998.
[28] B. C. d. S. Oliveira, A. Moors, and M. Odersky. Type Classes as Objects and Implicits. In OOPSLA '10. ACM, 2010.
[29] M. Paleczny, C. Vick, and C. Click. The Java HotSpot server compiler. In Proceedings of the 2001 Symposium on Java Virtual Machine Research and Technology, JVM '01, Berkeley, CA, USA, 2001. USENIX Association.
[30] A. Prokopec. ScalaMeter. https://scalameter.github.io/.
[31] A. Prokopec. Data Structures and Algorithms for Data-Parallel Computing in a Managed Runtime. PhD thesis, IC, Lausanne, 2014.
[32] A. Prokopec and M. Odersky. Near optimal work-stealing tree scheduler for highly irregular data-parallel workloads. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, pages 55–86. Springer International Publishing, 2014.
[33] A. Prokopec, D. Petrashko, and M. Odersky. Efficient Lock-Free Work-stealing Iterators for Data-Parallel Collections. 2015.
[34] A. Prokopec, T. Rompf, P. Bagwell, and M. Odersky. On a generic parallel collection framework, 2011.
[35] T. Rompf and M. Odersky. Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs. Communications of the ACM, 55, 2012.
[36] T. Rompf, A. K. Sujeeth, K. J. Brown, H. Lee, H. Chafi, and K. Olukotun. Surgical precision JIT compilers. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom, June 9-11, 2014. ACM, 2014.
[37] N. Stucki. Turning Relaxed Radix Balanced Vector from Theory into Practice for Scala Collections. Master's thesis, EPFL, 2015.
[38] W. Taha and T. Sheard. MetaML and Multi-Stage Programming with Explicit Annotations. In Theoretical Computer Science. ACM Press, 1999.
[39] V. Ureche, E. Burmako, and M. Odersky. Late data layout: Unifying data representation transformations. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '14, pages 397–416, New York, NY, USA, 2014. ACM.
[40] V. Ureche, C. Talau, and M. Odersky. Miniboxing: Improving the Speed to Code Size Tradeoff in Parametric Polymorphism Translations. In OOPSLA '13, pages 73–92, New York, NY, USA, 2013. ACM.
[41] B. Venner, G. Berger, and C. C. Seng. ScalaTest. http://www.scalatest.org/.
