Nodes are represented as plain arrays of references:

  type Node = Array[AnyRef]
  val Node = Array

The get operation slices the index into blocks of i bits and uses one block as a sub-index on each level, from the root down to the leaf:

  val i = // bits in blocks of the index
  val mask = (1 << i) - 1
  def get(index: Int): A = {
    def getRadix(idx: Int, nd: Node, level: Int): A = {
      if (level == 0) nd(idx & mask)
      else {
        val indexInLevel = (idx >> (level * i)) & mask
        getRadix(idx, nd(indexInLevel), level-1)
      }
    }
    getRadix(index, vectorRoot, vectorDepth)
  }
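As a concrete illustration (our own sketch, not part of the implementation), with m = 32 and hence i = 5 an index decomposes into 5-bit blocks, one per level:

  object RadixDemo {
    val i = 5                // bits per block, so m = 2^i = 32
    val mask = (1 << i) - 1  // 31, i.e. 0b11111

    // Sub-indices of `index` from the root (highest block) down to the leaf.
    def subIndices(index: Int, depth: Int): Seq[Int] =
      (depth to 0 by -1).map(level => (index >> (level * i)) & mask)

    def main(args: Array[String]): Unit =
      // 70 = 2*32 + 6: slot 2 in the root, slot 6 in the leaf
      println(subIndices(70, depth = 1)) // Vector(2, 6)
  }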
Updating  Since the structure is immutable, the updated operation has to recreate the entire path from the root to the element being updated. The leaf update creates a fresh copy of the leaf array with one updated element. Then the parent of the leaf is also updated with the reference to the new leaf, then the parent's parent, and so on all the way up to the root.
  def updated(index: Int, elem: A) = {
    def updatedNode(node: Node, level: Int): Node = {
      val indexInNode = // compute index
      val newNode = copy(node)
      if (level == 0) {
        newNode(indexInNode) = elem
      } else {
        newNode(indexInNode) = updatedNode(node(indexInNode), level-1)
      }
      newNode
    }
    new Vector(updatedNode(vectorRoot, vectorDepth), ...)
  }
Therefore the complexity of this operation is O(m · log_m(n)), since it traverses and recreates O(log_m(n)) nodes of size O(m). For example, if some leaf has all its elements updated from left to right, the branch will be copied as many times as there are updates. We will later explain how this can be optimized by allowing transient states that avoid re-creating the path to the root tree node with each update (described in §4).
Appending front and back  The implementation of appended front/back has two cases, depending on the current state of the Radix-Balanced tree: If the first/last leaf is not full, the element is inserted directly and all nodes of the first/last branch are copied. If the leaf is full, we must find the lowest node in the last branch where there is still room left for a new branch. Then a new branch that only contains the new element is appended to it.

In both cases the new vector object will have the start/end index decreased/increased by one. When the root is full, the depth of the vector will also increase by one.
  val m = // branching factor
  def appended(elem: A, whr: Where): Vector[A] = {
    def appended(node: Node, level: Int) = {
      val indexInNode = // compute index based on start/end index
      if (level == 1)
        copyAndUpdate(node, indexInNode, elem)
      else
        copyAndUpdate(node, indexInNode, appended(node(indexInNode), level-1))
    }
    def newBranch(depth: Int): Node = {
      val newNode = Node.ofDim(m)
      val idx = whr match {
        case Frt => m-1
        case Bck => 0
      }
      newNode(idx) =
        if (depth == 1) elem
        else newBranch(depth-1)
      newNode
    }
    if (needNewRoot()) {
      val newRoot = whr match {
        case Frt => Node(newBranch(depth), root)
        case Bck => Node(root, newBranch(depth))
      }
      new Vector(newRoot, depth+1, ...)
    } else {
      new Vector(appended(root, depth), depth, ...)
    }
  }

In the code above, isTreeFull and needNewRoot are operations that compute the answer using efficient bitwise operations on the start/end index of the vector. Since the algorithm traverses and creates new nodes from the root to a leaf, the complexity of the operation is O(m · log_m(n)). Like the updated operation, it can be optimized by keeping transient states of the immutable vector (described in §4).
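These helpers are not shown above; one plausible way to implement them (a sketch under our own assumptions, not the actual implementation) for the back case is:

  // Sketch (assumption): with m = 2^i, a tree of the given depth holds at
  // most m^depth = 2^(i*depth) elements, so it is full exactly when the end
  // index reaches that bound. The front case is analogous on the start index.
  def isTreeFull(endIndex: Int, depth: Int): Boolean =
    endIndex == (1 << (i * depth))

  def needNewRoot(endIndex: Int, depth: Int): Boolean =
    isTreeFull(endIndex, depth)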
Splitting  The core operations to remove elements in a Radix-Balanced tree are the take and drop operations. They are used to implement many other operations, such as splitAt, tail, init and others.

The take and drop operations are similar. The first step is traversing the tree down to the leaf where the cut will be done. Then the branch is copied and cleared on one side. The tree may become shallower during this operation, in which case some of the nodes at the top will be dropped instead of being copied. Finally, the start and end are adjusted according to the changes on the tree.

  def take(index) = split(index, Right)
  def drop(index) = split(index, Left)
  def split(index: Int, removeSide: Side) = {
    def splitRec(node: Node, level: Int): (Node, Int) = {
      val indexInNode = // compute index
      if (level == 0) {
        (copyAndSplitNode(node, indexInNode, removeSide), 1)
      } else removeSide match {
        case Left if indexInNode == node.length - 1 =>
          splitRec(node(indexInNode), level - 1)
        case Right if indexInNode == 0 =>
          splitRec(node(indexInNode), level - 1)
        case _ =>
          val newNode = copyAndSplitNode(node, indexInNode, removeSide)
          val (newSubnode, depth) = splitRec(node(indexInNode), level-1)
          newNode(indexInNode) = newSubnode
          (newNode, level)
      }
    }
    val (newRoot, newDepth) = splitRec(vectorRoot, vectorDepth)
    new Vector(newRoot, newDepth, ...)
  }

The computational complexity of any split operation is O(m · log_m(n)) due to the traversal and copying of nodes on the branch where the cut index is located: O(log_m(n)) for the traversal of the branch and then O(m · log_m(n₂)) for the creation of the new branch, where n₂ is the size of the new vector (with 0 ≤ n₂ < n).

3. Immutable Vectors as Relaxed Radix Trees

Relaxed-Radix-Balanced vectors use a new tree structure that extends the Radix-Balanced trees to allow fast concatenation of vectors without losing performance on the other core operations [4]. Relaxing the vector consists in using a slightly unbalanced extension of the tree that combines balanced subparts.
This vector still ensures the log_m(n) bound on the height of the tree and on the operations presented in the previous section.

3.1 Relaxed-Radix-Balanced Tree Structure

The basic difference in the structure is that in a Relaxed-Radix-Balanced (or RRB) tree, we allow nodes that contain subtrees that are not completely full. As a consequence, the start and end index are no longer required, as the branches on the ends can be truncated. The structure of the RRB trees does not ensure by itself that the tree height is bounded by log_m(n). This bound is maintained by each operation using an additional invariant on the tree balance. In our case the concatenation operation is the only one that can affect the inner structure (excluding ends) of the tree, and as such it is the only one that needs to worry about this invariant.

Tree balance  As the tree will not always be perfectly balanced, we define an additional invariant on the tree that will ensure an equivalent logarithmic bound on the height. We use the relation between the maximum and minimum branching factors m_max and m_min at each level. These give the corresponding maximum height h_max and least height h_min needed to represent a given number of elements n: h_min = log_{m_max}(n) and h_max = log_{m_min}(n), or equivalently h_min = (1/lg(m_max)) · lg(n) and h_max = (1/lg(m_min)) · lg(n). Trees that are better balanced will have a height ratio h_r = lg(m_min)/lg(m_max) that is closer to 1, perfect balance. In our tree we use m_max = m_min + 1 to make h_r as close to 1 as possible. In practice (using m = 32), in the worst case scenario there is an increase from around 6.2 to 6.26 in the maximum possible height (i.e. 7 levels in both cases).
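Plugging in the extreme values behind these numbers (n = 2^31 elements, m_max = 32, m_min = 31):

\[
h_{\min} = \log_{32}\!\big(2^{31}\big) = \frac{31}{\log_2 32} = 6.2,
\qquad
h_{\max} = \log_{31}\!\big(2^{31}\big) = \frac{31}{\log_2 31} \approx 6.26,
\]

so even a maximally relaxed tree needs at most ⌈6.26⌉ = 7 levels, the same as the perfectly balanced one.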
Sizes metadata  When one of these trees (or subtrees) is unbalanced, it is no longer possible to know the location of an index just by applying radix manipulation on it. To avoid losing the performance of traversing down the tree in such cases, each unbalanced node will keep metadata on the sizes of its subtrees. The sizes are kept in a separate copy-on-write array as accumulated sizes. This way, they represent the location of the ranges of the indices in the current subtree. To avoid creating additional objects in memory, these sizes are attached at the end of the node. To have a homogeneous representation of nodes, the balanced subtrees have an empty reference attached at the end. For leaves, however, we make an exception: since they will always be balanced, they only contain the data elements but not the size metadata.
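A minimal sketch of this node layout (helper names are ours, not the implementation's):

  object NodeLayout {
    type Node = Array[AnyRef]

    // An inner node [c0, ..., ck-1, meta], where meta is the accumulated
    // sizes array for unbalanced nodes and null for balanced ones.
    def innerNode(children: Array[Node], sizes: Array[Int]): Node = {
      val node = new Array[AnyRef](children.length + 1)
      System.arraycopy(children, 0, node, 0, children.length)
      node(children.length) = sizes // null when the subtree is balanced
      node
    }

    def sizesOf(node: Node): Array[Int] =
      node(node.length - 1).asInstanceOf[Array[Int]] // null if balanced
  }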
Indexing  In an unbalanced node, the branch that contains a given index is the first subIndex where index < sizes[subIndex]. The fastest way to find it is by using binary search to reduce the search space and, when it is small enough to take advantage of cache lines, switching to linear search.

  def getBranchIndex(sizes: Array[Int], indexInTree: Int): Int = {
    var (lo, hi) = (0, sizes.length)
    while (linearThreshold < hi - lo) {
      val mid = (hi + lo) / 2
      if (sizes(mid) <= indexInTree) lo = mid
      else hi = mid
    }
    while (sizes(lo) <= indexInTree) lo += 1
    lo
  }
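A worked example with hypothetical sizes (using the linear variant of the search for brevity):

  object SizesSearchDemo {
    // Linear-search variant of getBranchIndex, for illustration only.
    def getBranchIndex(sizes: Array[Int], indexInTree: Int): Int = {
      var lo = 0
      while (sizes(lo) <= indexInTree) lo += 1
      lo
    }

    def main(args: Array[String]): Unit = {
      // Subtrees of 32, 31 and 32 elements: accumulated sizes [32, 63, 95].
      val sizes = Array(32, 63, 95)
      val branch = getBranchIndex(sizes, 70) // 2, since 63 <= 70 < 95
      val subIdx = 70 - sizes(branch - 1)    // 7, the index inside branch 2
      println((branch, subIdx))              // (2,7)
    }
  }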
Note that to traverse the tree down to the leaf where the index is located, the sub-indices are computed from the sizes only as long as the tree node is unbalanced. If the node is balanced, then the more efficient radix-based method is used from there to the leaf, to avoid accessing the additional array on each level. In the worst case the complexity of indexing becomes O(log_2(m) · log_m(n)), where log_2(m) is a constant factor that is only added on unbalanced nodes.

  def get(index: Int): A = {
    def getRadix(idx: Int, node: Node, depth: Int): A = ...
    def get(idx: Int, node: Node, depth: Int): A = {
      val sizes = // get sizes from node
      if (isUnbalanced(sizes)) {
        val branchIdx = getBranchIndex(sizes, idx)
        val subIdx = if (branchIdx == 0) idx else idx - sizes(branchIdx-1)
        get(subIdx, node(branchIdx), depth-1)
      } else getRadix(idx, node, depth)
    }
    get(index, root, depth)
  }

Updating and Appending  For each one of these operations, the only fundamental difference with the Radix-Balanced tree is that when a node of a branch is updated, the sizes must be updated with it (if needed). In the case of updating, the structure does not change and as such it always keeps the same sizes object reference. The traversal down the tree is done using the new abstraction used in the relaxed version of indexing.

In the case of appending to the back, an updated unbalanced node must increment the accumulated size of its last subtree by one. When a new branch is appended, a new size is appended to the sizes. The newBranch operation is simplified by using truncated nodes and letting the node on which it gets appended handle any index shifting required.
The complexity of this operation is still O(m · log_m(n)): log_2(m) · log_m(n) for the traversal plus m · log_m(n) for the branch update.

3.3 Concatenation

The concatenation algorithm used on RRB-Vectors is a slightly modified version of the one proposed in the RRB-Trees technical report [4]. This version favors nodes of size m over m − 1, making the trees more balanced. With this approach we sacrifice a bit of performance on concatenations, but we gain performance on all other operations: better balancing implies a higher chance of using the fast radix operations on the trees.

From a high level, the algorithm merges the rightmost branch of the vector on the LHS with the leftmost branch of the vector on the RHS. While merging the nodes, each of them is rebalanced in order to ensure the O(log_m(n)) bound on the height of the tree and avoid the degeneration of the structure. The RRB version of concatenation has a time complexity of O(m² · log_m(n)), where m is constant.

  def concatenate(left: Vector[A], right: Vector[A]) = {
    val maxDepth = max(left.depth, right.depth)
    val newTree = mergedTrees(left.root, right.root, maxDepth)
    if (newTree.hasSingleBranch)
      new Vector(newTree.head, maxDepth)
    else
      new Vector(newTree, maxDepth+1)
  }
  def mergedTrees(left: Node, right: Node, depth: Int) = {
    if (depth == 1) {
      mergedLeaves(left, right)
    } else {
      val merged =
        if (depth == 2) mergedLeaves(left.last, right.first)
        else mergedTrees(left.last, right.first, depth-1)
      mergeRebalance(left.init, merged, right.tail)
    }
  }
  def mergedLeaves(left: Node, right: Node) = {
    // create a balanced new tree of height 2
    // with all elements in the nodes
  }

The concatenation operation starts at the bottom of the branches by merging the leaves into a balanced tree of height 2 using mergedLeaves. Then, for each level on top of it, the newly created merged subtree and the remaining branches on that level are merged and rebalanced into a new subtree. This new subtree always adds a new level to the tree, even though it might get dropped later. New sizes of nodes are computed each time a node is created, based on the sizes of the children nodes.

There are two variants of the rebalancing that ensure the logarithmic bound. The first one leaves the tree better balanced, while the second is faster. As we aim to have good performance on all operations, we use the first variant. The following snippet of code shows a high-level implementation of this first variant. Details on the second variant can be found in [4], in case concatenation is prioritized over all other operations.

  def mergeRebalance(left: Node, center: Node, right: Node) = {
    // join all branches
    val merged = left ++ center ++ right
    var newRoot = new ArrayBuilder
    var newSubtree = new ArrayBuilder
    var newNode = new ArrayBuilder
    def checkSubtree() = {
      if (newSubtree.length == m) {
        newRoot += computeSizes(newSubtree.result())
        newSubtree.clear()
      }
    }
    for (subtree <- merged; node <- subtree) {
      if (newNode.length == m) {
        checkSubtree()
        newSubtree += computeSizes(newNode.result())
        newNode.clear()
      }
      newNode += node
    }
    checkSubtree()
    newSubtree += computeSizes(newNode.result())
    newRoot += computeSizes(newSubtree.result()) // flush the last subtree
    computeSizes(newRoot.result())
  }

Figures 4, 5, 6 and 7 show a concrete step-by-step (level-by-level) example of the concatenation of two vectors. In the example, some of the subtrees were collapsed. This is not only to make the diagrams fit, but also to expose only the nodes that are referenced during the execution of the algorithm. Nodes with colors represent new nodes and changes, to help track them from figure to figure.

Figure 6. Concatenation example: Rebalancing level 2

Figure 7. Concatenation example: Rebalancing level 3
The rebalancing is implemented so that it moves whole nodes whenever possible, rather than performing m² element copies. Another crucial but obvious implementation detail is to never duplicate a node if it does not change. This requires a small amount of additional logic and comes with a benefit on the memory used; in good cases it can reduce the number of node copies required, potentially reducing the effective work to o(m · log_m(n)) if there is a good alignment.

When improving the vector on locality (§4), concatenating a small vector using the concatenation algorithm is less efficient than appending directly on the other tree. That case is identified by a simple bound on the lengths, and then all elements from the smaller vector are appended to the larger one.

Other Operations  Having efficient concatenation and splitting allows us to also implement several other operations that change the structure of the tree. Some of these operations are: inserting an element/vector at any position, deleting an element/subrange of the vector, and patching/replacing part of the vector. The complexity of these operations is bounded by the complexity of the core operations used.
Parallelizing the Vector  To parallelize operations we use the fork-join pool model from Java [18, 31]. In this model, processing is achieved by splitting the work into smaller parts until they are deemed small enough to ensure good parallelism. This can be achieved using the efficient splitting of the RRB-Tree. For certain operations, like map, filter and reduce, the results obtained in parallel must be aggregated, such as by concatenating the partial vectors produced by the parallel workers. The aggregation can occur in several steps, where partial results from different workers are aggregated in parallel, recursively, until a single result is produced. The overhead associated with the distribution of work is O(m² · log_m(n)).
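As an illustration of this model (our own sketch; the standard library Vector stands in for the RRB-Vector, with splitAt and ++ playing the roles of the efficient split and concatenation):

  import java.util.concurrent.{ForkJoinPool, RecursiveTask}

  class MapTask[A, B](xs: Vector[A], f: A => B, threshold: Int)
      extends RecursiveTask[Vector[B]] {
    override def compute(): Vector[B] =
      if (xs.length <= threshold) xs.map(f) // small enough: work sequentially
      else {
        val (l, r) = xs.splitAt(xs.length / 2)  // efficient split
        val left = new MapTask(l, f, threshold)
        left.fork()                             // process the left half asynchronously
        val rightResult = new MapTask(r, f, threshold).compute()
        left.join() ++ rightResult              // aggregate by concatenation
      }
  }

  object ParallelMapDemo {
    def main(args: Array[String]): Unit = {
      val pool = new ForkJoinPool()
      val res = pool.invoke(new MapTask(Vector.range(0, 1024), (x: Int) => x * 2, 64))
      println(res.length) // 1024
    }
  }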
4. Improvements on Operation Locality and Amortization of Costs

In this section we present different approaches aimed at improving performance using explicit caching of the subtrees in a branch. This is a new generalization of the Clojure [15] (and current Scala) optimizations on their Radix-Balanced vectors. All optimizations we describe rely on the vector object keeping a set of fields that track an entire branch of the tree, from the root to a leaf. Figure 4 shows such an RRB-vector with the levels numbered starting from 0 at the bottom. The explicit caches focus on the nodes reaching from the root to the tree0 leaf. For a balanced RRB tree this covers the entire tree and will effectively only use the more efficient radix based operations.
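A sketch of this state (field names follow the tree0–tree2 labels used here; the real implementation keeps one such field per possible level, plus bookkeeping we omit):

  object FocusedBranch {
    type Node = Array[AnyRef]

    class VectorState[A](
      val depth: Int,
      val focus: Int,  // index that the cached branch points into
      val tree2: Node, // level-2 node of the focused branch
      val tree1: Node, // level-1 node of the focused branch
      val tree0: Node  // level-0 leaf of the focused branch
    )
  }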
4.1 Faster Access

One of the uses of the focused branch is as direct access to a cached branch. If the same leaf node is used in the following operation, there is no need for a vertical tree traversal, which is key to amortizing operations to constant time. If another branch is needed, it can be fetched from the lowest common node of the two branches.

To know the level of the lowest common node in a vector of branching size m (where m = 2^i and i is the number of bits in the sub-indices), only the focused index and the index being fetched are needed. The operation index XOR focus returns a number that is bounded by the maximum number of elements in a tree of that level. The actual level can be extracted with some if statements. This is bounded by the same number of operations that would be needed to traverse the tree back down through the new branch, and is computed in O(log_m(n)) without accesses to memory.

  val i = // number of bits of sub-indices
  def lowestCommonLevel(idx: Int, focus: Int): Int = {
    val xor = idx ^ focus
    if (xor < (1<<(1*i)) ) 0
    else if (xor < (1<<(2*i)) ) 1
    else if (xor < (1<<(3*i)) ) 2
    ...
    else 5
  }
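Two hypothetical look-ups illustrate the XOR trick (with i = 5, m = 32):

  object XorLevelDemo {
    val i = 5 // 5-bit sub-indices, m = 32

    def lowestCommonLevel(idx: Int, focus: Int): Int = {
      val xor = idx ^ focus
      if (xor < (1 << (1 * i))) 0
      else if (xor < (1 << (2 * i))) 1
      else if (xor < (1 << (3 * i))) 2
      else 5 // levels 3 and 4 elided for brevity
    }

    def main(args: Array[String]): Unit = {
      println(lowestCommonLevel(69, 70)) // 0: 69 ^ 70 = 3, same leaf
      println(lowestCommonLevel(31, 32)) // 1: 31 ^ 32 = 63, adjacent leaves
    }
  }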
When deciding which will be the focused branch of a new vector, two heuristics are used: If there was an update operation on some branch where that operation could be repeated, that branch is used as the focus. If the first one can't be applied, the focus is set to the first element, as this helps key collection operations (such as getting an iterator).

The vector iterator and builder use this to switch from one leaf to the next one with the minimal number of steps. In fact, this effectively amortizes out the cost of traversing the tree structure over the accesses in the leaves, as each edge of the tree will only be accessed once. In the case of RRB tree iteration there is an additional abstraction for switching from one balanced subtree to the next one.

4.2 Amortizing Costs using Transient States

Transient states are the key to providing amortized constant-time appending, local updating and local splits. To achieve this, we decouple the tree by creating an equivalent tree that does not contain redundant edges on the current focused branch. The information missing in the edges of the tree is represented in, and can be reconstructed from, the trees of the focused branch.
With the transient representation, appending typically modifies only the leaf; every m operations it must also update the level above, every m² operations the two lowest levels of the tree, and so on. These operations will thus be amortized to constant time if they are executed in succession. This is due to the bound given by the average number of node updates per operation: ∑_{k=1}^{∞} k·(m−1)/m^k = m/(m−1).
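The closed form follows from the geometric-series identity ∑_{k≥1} k·x^k = x/(1−x)² with x = 1/m (a derivation we add for completeness):

\[
\sum_{k=1}^{\infty} \frac{k\,(m-1)}{m^{k}}
 = (m-1)\sum_{k=1}^{\infty} k\left(\frac{1}{m}\right)^{\!k}
 = (m-1)\cdot\frac{1/m}{\left(1-\frac{1}{m}\right)^{2}}
 = \frac{m}{m-1}.
\]

For m = 32 this averages 32/31 ≈ 1.03 node updates per operation.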
There is a cost associated with the transformation from canonical to transient state and back. This cost is equivalent to one update of the focused branch. The transient state operations only start paying off after 3 consecutive operations: with 2 consecutive operations they break even, and with 1 there is a loss in performance.

Canonicalization  The transient state aims to improve the performance of some operations by amortizing costs. But the transient state is not ideal for the performance of other operations. For example, an indexing operation on an unbalanced vector may lack the size information it requires to efficiently access certain indices, and an iterator relies on a canonical tree for performance. It is possible to implement these operations on a transient state, but this involves both code duplication and additional overhead on each call.

The solution we used involves converting the transient representation to a canonical one. This conversion, called canonicalization, is applied when an operation that requires the canonical form is called on an instance of the immutable vector. The mutation of the vector is not visible from the outside and only happens at most once (Figure 10). This transformation only affects the nodes that are on the focused branch, as it copies each one (except the leaf) and links the trees. If the node is unbalanced, the size of the subtree in focus is inserted. This transformation could be seen as a lazy initialization of the current branch.

Figure 10. Object states and the effect of operations.

Vector objects can only be in the transient state if they were created this way. For example, the appending operations will create a new object that is in transient state and focused on the last/first branch. If the source object was not focusing the last branch, then it is canonicalized (if needed) before the change of branch operation. Vectors of depth 1 are special cases: they are always in canonical form and their operations are equivalent to those in transient form.

4.3 Comparison

In Table 1 we show the difference in complexities between the Radix-Balanced Vector and the Relaxed-Radix-Balanced Vector. In Table 2 we compare the RRB-Vector to other possible implementations. The operations in the tables represent all the different complexities that can be reached for the core operations. The operations are divided into four categories: (i) fetching, (ii) updating, (iii) inserting, and (iv) removing. In (i) there is the indexing or random access given an element index, and the sequential scanning of elements. Category (ii) is divided into normal (or random) update and has a special case for updates that are done locally. Category (iii) is divided into building (with/without result size), concatenation and insertion of elements (or vectors), and has a special case for insertions on the ends (appended front/back and small concat). Removing (iv) is divided into splits (split, take, drop, ...), split ends (tail/init, drop(n)/take(n) with n in the first/last leaf) and a general removal of elements in a range of indices.

Table 1. Comparison between the Radix and Relaxed Radix Vectors. In this table all aC entries have a log_m worst case scenario.

  operation      Radix-Balanced Vector   RRB Vector    with m = 32
  indexing       log_m                   log_m         eC
  scanning       aC                      aC            aC
  update         m·log_m                 m·log_m       eC
  update local   aC                      aC            aC
  concat/insert  L                       m²·log_m      L vs. eC
  insert ends    aC                      aC            aC
  building       aC                      aC            aC
  split          m·log_m                 m·log_m       eC
  split ends     aC                      aC            aC
  remove         L                       m²·log_m      L vs. eC

We use the notation eC for effective constant time: operations with log_m(n) complexity, assuming that n is bounded by the index integer representation and m is large enough. In our case Int is a 32-bit signed integer and m = 32, giving the bound log_m(n) < 7, and hence we argue that this is bounded by a constant. In Table 1, aC (amortized constant) has a worst case scenario of log_m or m·log_m; in other terms, it is eC in the worst case. In Table 2, for the RRB Vector the aC has a worst case of eC, for the COW Array aC has a linear worst case, and the rest of the aC entries have a worst case of log_2. C and L denote constant and linear time respectively.

Table 2. Comparisons with other data structures that could be used to implement indexed sequences.

  operation      RRB Vector   COW Array   FingerTree   RedBlack Tree
  indexing       eC           C           log_2        log_2
  scanning       aC           C           aC           aC
  update         eC           L           log_2        log_2
  update local   aC           L           log_2        log_2
  concat/insert  eC           L           log_2        L
  insert ends    aC           L           aC           log_2
  building       aC           C/aC        aC           log_2
  split          eC           L           log_2        log_2
  split ends     aC           L           aC           log_2
  remove         eC           L           log_2        L

5. Implementation

5.1 Scala Indexed Sequences

Our implementation of the RRB-Vector (along with all other sequences we compare against) is based on the Scala Collections [26] IndexedSeq, which acts as a decorator exposing many predefined generic operations based on just a few core primitives. For example, most operations that involve the entire collection (such as map, filter and foreach) use iterators and builders. To improve performance, we overrode several of the decorator operations to use the efficient vector primitives directly, without going through the IndexedSeq code.

Parallel RRB Vector  The implementation of the parallel RRB Vector is a trivial wrapper over the sequential RRB Vector using the Scala Parallel Collections API [25, 33, 34]. This only requires the presence of the split and combine operations, which, in our case, are simply splitting an iterator on the tree and combining using concatenation. Both of these use the efficient core RRB tree operations. When splitting, we additionally have heuristics that yield a well aligned concatenation and create a balanced tree.

5.2 Arrays as Nodes

One of the aims of Scala Collections [25, 26, 33, 34] is the elimination of code duplication, and one of the mechanisms to achieve this
is the use of generic types [9, 22]. But this also has a drawback: the need to box primitive values in order for them to be stored in the collection. We implemented all our sequences in this context.

All nodes are stored in arrays of type Array[AnyRef], since this allows us to quickly access elements (which are boxed anyway due to generics, a limitation that could be circumvented by Miniboxing [40]) without dispatching on the primitive array type. A welcome side effect of this decision is that elements are already boxed when they are passed to the Vector, thus accessing and storing them does not incur any intermediate boxing/unboxing operations, which would add overhead. However, it is known that using the boxed representation for primitive types is inefficient when operating on the values themselves, so the sizes of unbalanced nodes are stored in Array[Int] objects, guaranteeing the most compact and efficient data representation.

Most of the memory used in the vector data structure will be composed of arrays. There are three key operations used on these arrays: creation, update and access. Since the arrays are used with copy-on-write semantics, actual update operations are only allowed when the array is initialized. This also implies that each time there is a modification on some part of an array, a new array must be created and the old elements must be copied.
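A minimal sketch of this discipline, consistent with the copyAndUpdate helper referenced by the listings in §2 (the exact signature in the implementation may differ):

  object CowArrays {
    // Copy-on-write "update": allocate a fresh array, copy the old
    // contents, and write the single changed slot. The original array
    // is never mutated after initialization.
    def copyAndUpdate(node: Array[AnyRef], index: Int, elem: AnyRef): Array[AnyRef] = {
      val newNode = node.clone()
      newNode(index) = elem
      newNode
    }
  }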
The size of the arrays will affect the performance of the vector. With larger arrays in the nodes the access times will be reduced, because the depth of the tree will decrease. But, on the other hand, increasing the size of the arrays will slow down the update operations, as they have to copy the entire array to execute the element update, due to the copy-on-write semantics.

For an individual reference to an RRB-Vector of size n and branching of m, the memory usage will be composed of the arrays located in the leaves (note that the memory used in the leaves is equivalent to the memory used for an array that contains all the elements) and the ones that form the tree structure. In our case we save references (though it could be any kind of data), and hence we need ⌈n/m⌉ arrays of m references. The structure requires at least the references to the child nodes and, in the worst case scenario, an additional integer for the size of each child. Going up level by level, the reference count decreases by a factor of m, and hence the total is bounded by ∑_{k=2}^{log_m(n)} ⌈n/m^k⌉ ≤ (n+m)/(m·(m−1)) references. For the sizes of the nodes, given our choice of rebalancing algorithm, they will only appear on nodes that are of height 3 or larger, and hence the sizes will be bounded by ∑_{k=3}^{log_m(n)} ⌈n/m^k⌉ ≤ (n+m)/(m²·(m−1)) integers.

5.3 Running on a JVM

In practice, Scala compiles to Java bytecode and executes on a Java Virtual Machine (JVM); we used the Oracle Java SE distribution [29] as a reference. This imposes additional performance characteristics that can't be evaluated on the algorithmic level alone, and asks for a more nuanced discussion.

One of the JVM components that directly affects vectors is the garbage collector (or GC). Vector operations tend to create a large number of Array objects, some of which are only necessary for a short time. These objects will use up memory and thus degrade overall performance as the GC is invoked more often. For this reason our code is optimized to avoid the redundant creation of intermediary objects, delaying the GC cycles and thus improving performance.

Instead of directly compiling bytecode to native code, the JVM uses a just-in-time compilation (JIT) mechanism in order to take advantage of run-time profiling information. At first it runs the compiled bytecode inside an interpreter and collects execution statistics (profiles). Later, once a method has executed enough times, it compiles it using the statistics to guide optimizations. The Vector code tries to gain performance by aligning with the JIT heuristics and hence taking advantage of its optimizations. The most important such optimization is inlining, which eliminates the overhead of calling a method and, furthermore, enables other optimizations to improve the inlined code. Critical parts of the Vector code are carefully designed to match the heuristics of the JVM. In particular, a heuristic that arose commonly is that only methods of size less than 35 bytes are inlined, which meant we had to split the code into several methods to stay below this threshold.

6. Evaluation

6.1 Methodology

ScalaMeter [30] is used to measure the performance of operations on different implementations of indexed sequences.

To have reproducible results with low error margins, ScalaMeter was configured on a per-benchmark basis. Each test is run on 32 different JVM instances to average out badly allocated VMs. On each JVM, 32 measurements were taken and they were filtered using outlier elimination to remove those runs that were exceptionally different. This could happen if a more thorough garbage collection cycle occurs in a particular run, due to JIT compilation, or if the operating system switches to a more important task during the benchmark process [12]. Before taking measurements, the JVM is warmed up by running the benchmark code several times without taking the measurements into account. This allows us to measure the time after the JIT compilation has occurred, when the system is in a steady state.

There are three main directions in the performance comparisons. The first compares the Radix-Balanced vectors with well-balanced RRB Vectors, with the goal of having an equivalent performance, even if the RRB Vectors have an inherent additional overhead. The second axis shows the effects of unbalanced nodes on the RRB-Tree. For this we compare the same perfectly balanced vector with an extremely unbalanced vector. The latter vector is generated by concatenating pseudo-random small vectors together. The amount of unbalanced nodes is in part affected by the size of the vector. The third axis is the comparison between vectors in general and other well known functional and/or immutable data structures used to implement sequences. We used copy-on-write (COW) arrays, finger trees [16] (FingerTreeSeq, an adapted version of https://github.com/Sciss/FingerTree where abstractions that did not involve sequences were removed) and red-black trees [13] (RedBlackSeq, an adaptation of the standard Scala Collections RedBlackTree where keys are used as indices).

6.2 Results

For the results of this section, benchmarks were executed on a Java HotSpot(TM) 64-Bit Server VM on a machine with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with 32GiB of RAM. Each benchmarking VM instance was set up with 16GiB of heap memory. The parallel vector split-combine was executed on a machine with 4 Intel(R) Xeon(R) E5-4640 processors @ 2.40GHz with 128GiB of RAM.

Iterating  The benchmark in Figure 11 shows the time it takes to scan the whole sequence using a specialized iterator. Unsurprisingly, the results show that the best option is the array. But the vector is only 1-2× slower, closer to 1× in the most common cases. It is also possible to see that vectors are 7-15× faster than other, deeper trees, mainly due to the reduction in indirections and increased locality.

Building  The benchmark in Figure 12 shows the time it takes to build a sequence using a specialized builder. In general, the …
Figure 11. Iterating through the sequence

Figure 12. Building a sequence.

Figure 13. Accessing 10k consecutive indices

Figure 14. Accessing 10k random indices
Figure 15. Updating on 10k consecutive indices

Figure 16. Updating on 10k random indices

Figure 17. Concatenating two sequences (point of view 1). RRB Vector and Finger Tree are the planes at the bottom, COW Array is the plane in the middle and Vector is the plane on the top.

Appending  Figures 19 and 20 show the time it takes to append 256 elements one by one, in the first case to the front and in the second to the back of the sequence. The large number of elements was chosen in order to show the amortized time of the operation on the vectors. In this case the array is clearly the worst option because it creates a new version and copies the contents with each append. The vector is around 2.5× slower than the finger trees, a structure that specifically focuses on these operations. The vector can be 1-2× faster than a red-black tree.

Splitting  Figures 21 and 22 show the time it takes to split a sequence on the left and on the right. We fixed the cut point to the middle of the sequence to be able to compare the time it takes to take or drop the same number of elements. It can be seen that splitting a vector is more efficient than for the other structures. Even more, the vector behaves with an effectively constant time.

Parallel Vector Split-combine Overhead  The benchmarks in Figures 23 and 24 show the amount of overhead associated with the parallelization of the vector with and without efficient concatenation. They show the typical overhead of a parallel map, filter or other similar operations that create a new version of the sequence. The benchmark computes a map operation using the identity function, such that the execution time is dominated by the time it takes to split and combine the sequence rather than the function computations. As a base for comparison we used the sequential map on both versions, where the results are identical. We then parallelized it on fork-join thread pools of 1, 2, 4, 8, 16, 32 and 64 threads on a 64-thread (32-core) machine. Without concatenation, there is a loss of performance when passing from sequential to parallel and, although the performance increases with the addition of threads, even with 64 threads it is only slightly better than the sequential version. By contrast, with our new vector, the gain in performance starts with one thread in the pool (dedicated thread) and then increases: a 1.55× speedup with 2 threads, 2.46× for 4 threads, 3.52× for 8, 4.60× for 16, 5.52× for 32 (core limit) and 7.18× for 64 (hardware thread limit).

Memory Overhead  Figure 25 shows the memory overhead of the data structures used in the benchmarks. This overhead is the additional space used in relation to the COW Array. The overhead of a vector is 17.5× smaller than that of the finger tree and 40× smaller than that of the red-black trees.
Figure 19. Appending front 256 elements one by one

Figure 20. Appending back 256 elements one by one

Figure 21. Taking the first half of the sequence

Figure 22. Dropping the first half of the sequence
7. Related Work

Related data structures  There is a strong relation between RRB Trees and various data structures in the literature. Patricia tries [24] are one of the earliest documented uses of radix trees, performing lookups and updates one bit or character at a time. Wider radix trees were used to build space-efficient sparse arrays, Array Mapped Tries (AMT) [1], and on top of that Hash Array Mapped Tries (HAMT) [2], which have been popularized and adapted to an immutable setting in Clojure [15]. Radix trees are also used as index structures for in-memory databases [19]. In databases, B-Trees [6] are a ubiquitous index data structure similar to RRB Trees, in the sense that they are wide trees and allow a certain degree of slack to enable efficient merging. However, the chosen trade-offs are different and B-Trees are not normally used as sequence data structures. Ropes [5] are a sequence data structure with efficient concat and rebalancing operations. Immutable sequence data structures include VLists [3], Finger Trees [16] and various kinds of random access lists [27].

Parallelism  Parallel execution is achieved in the RRB-Vector by relying on the fork-join pools in Java [18, 31]. The vector is split into chunks which are processed in parallel. The overhead of splitting can be offset by using cooperative tasks [14, 21], but in the case of the RRB-Vector the cost of splitting is much smaller compared to the cost of combining (assembling) the partial results returned by parallel execution contexts. This is where the RRB trees make a difference: by allowing efficient structural changes, they enable the concatenation to occur in effectively constant time, much better than the previous O(n) for the Scala Vector.

RRB Trees  The core data structure was first described in a technical report [4] as a way to improve the concatenation of immutable vectors like the ones found in Scala Collections [26] and Clojure [15], allowing the implementation of a wide range of additional efficient structural modifications on the vectors. Later, a concrete version for Scala was implemented, in which all the locality optimizations are adapted to RRB vectors. This is the implementation used in this paper; it is presented with more technical detail in [37]. Another related project is [20], where more detailed mathematical proofs were given for the RRB Trees and their operations. They also introduce a different approach to transience using semi-mutable vectors with efficient snapshot persistence, and provide a C implementation.

Scala Library  In Scala, most general purpose data structures are contained in Scala Collections [26]. This framework aims to reduce code duplication to a minimum using polymorphism, higher-order functions [8], higher-kinded types [23], implicit parameters [28] and other language features. It also aims to simplify the integration of new data structures with a minimum of effort. In addition, the Scala Parallel Collections API [31–34] allows parallel collections to integrate seamlessly with the rest of the library. Behind the scenes, Scala Parallel Collections use the Java fork-join pools [18] as a backend for implicit parallelism in the data structure operations.
Figure 23. Parallel (non-RRB) Vector overhead on a map operation

Figure 24. Parallel RRB Vector overhead on a map operation

Figure 25. Memory overhead of the sequences in relation to arrays

Low level optimizations  We took into account the capabilities of the VM to do escape analysis and inlining of the compiled bytecode. These are influenced by the concrete implementation of the Java HotSpot VM compilers (C1 and C2) [17, 29]. At the compiler level there are optimization techniques that remove the cost of type-generic abstractions, such as specialization [10], Miniboxing [39, 40] or Scala Blitz [33]; the last one can go further and perform fusion on collections [7]. Additionally, it is possible to use staging techniques [35, 36, 38] to further optimize the code.

Benchmarks  Running code in a virtualized environment like the JVM, where the factors that influence performance are not under our control and can vary from execution to execution, complicates the benchmarking process [12]. In Scala there is a tool (ScalaMeter [30]) designed to overcome these issues.

8. Conclusions

In this paper we presented the RRB-Vector, an immutable sequence collection that offers good performance across a broad range of sequential and parallel operations. The underlying innovations are the Relaxed-Radix-Balanced (RRB) Tree structure, which allows efficient structural changes, and an optimization that exploits locality through focused branches and transient states. Our implementation is available at https://github.com/nicolasstucki/scala-rrb-vector.

Acknowledgements

We would like to thank the ICFP reviewers for their feedback and suggestions, which allowed us to improve the paper both in terms of clarity and in terms of breadth. We are grateful to Martin Odersky for allowing the Master Thesis [37] that forms the base of this paper to be supervised in the LAMP laboratory at EPFL. We would also like to thank Vera Salvisberg for her thorough review of both the paper and the code, which led to remarkable improvements in the quality of the final submission. Last but not least, we would like to thank Sébastien Doeraene for allowing Nicolas to dedicate time to improving this paper while he was working on Scala.js.

References

[1] P. Bagwell. Fast and Space-efficient Trie Searches. Technical report, EPFL, 2000.

[2] P. Bagwell. Ideal Hash Trees. Technical report, EPFL, 2001.

[3] P. Bagwell. Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays. In Implementation of Functional Languages, 2002.

[4] P. Bagwell and T. Rompf. RRB-Trees: Efficient Immutable Vectors. Technical report, EPFL, 2011.

[5] H.-J. Boehm, R. Atkinson, and M. Plass. Ropes: An alternative to strings. Software: Practice and Experience, 25(12):1315–1330, 1995.

[6] D. Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137, 1979.

[7] D. Coutts, R. Leshchinskiy, and D. Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming, ICFP '07, pages 315–326, New York, NY, USA, 2007. ACM.

[8] I. Dragos. Optimizing Higher-Order Functions in Scala. In ICOOOLPS, 2008.
[9] I. Dragos. Compiling Scala for Performance. PhD thesis, IC, 2010.

[10] I. Dragos and M. Odersky. Compiling Generics through User-directed Type Specialization. In ICOOOLPS '09. ACM, 2009.

[11] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. J. Comput. Syst. Sci., 38(1):86–124, Feb. 1989.

[12] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, OOPSLA '07, pages 57–76, New York, NY, USA, 2007. ACM.

[13] S. Hanke. The Performance of Concurrent Red-Black Tree Algorithms. In J. Vitter and C. Zaroliagis, editors, Algorithm Engineering, volume 1668 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 1999.

[14] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Apr. 2008.

[15] R. Hickey. The Clojure programming language, 2006.

[16] R. Hinze and R. Paterson. Finger Trees: A Simple General-purpose Data Structure. J. Funct. Program., 16(2), 2006.

[17] T. Kotzmann, C. Wimmer, H. Mössenböck, T. Rodriguez, K. Russell, and D. Cox. Design of the Java HotSpot™ Client Compiler for Java 6. ACM Trans. Archit. Code Optim., 5(1), May 2008.

[18] D. Lea. A Java Fork/Join Framework. In Proceedings of the ACM 2000 Conference on Java Grande, JAVA '00, New York, NY, USA, 2000. ACM.

[19] V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In C. S. Jensen, C. M. Jermaine, and X. Zhou, editors, 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 38–49. IEEE Computer Society, 2013.

[20] J. N. L'orange. Improving RRB-Tree Performance through Transience. Master's thesis, Norwegian University of Science and Technology, June 2014.

[21] M. Moir and N. Shavit. Concurrent data structures. In D. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications. Chapman & Hall/CRC, 2005.

[22] A. Moors. Type Constructor Polymorphism for Scala: Theory and Practice (Type constructor polymorfisme voor Scala: theorie en praktijk). PhD thesis, Informatics Section, Department of Computer Science, Faculty of Engineering Science, May 2009. Joosen, Wouter and Piessens, Frank (supervisors).

[23] A. Moors, F. Piessens, and M. Odersky. Generics of a Higher Kind. ACM SIGPLAN Notices, 43, 2008.

[24] D. R. Morrison. PATRICIA — Practical Algorithm To Retrieve Information Coded In Alphanumeric. J. ACM, 15(4):514–534, Oct. 1968.

[25] M. Odersky. Future-Proofing Collections: From Mutable to Persistent to Parallel. In Compiler Construction, volume 6601 of Lecture Notes in Computer Science. Springer, 2011.

[26] M. Odersky and A. Moors. Fighting Bit Rot with Types (Experience Report: Scala Collections). In R. Kannan and K. N. Kumar, editors, IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, volume 4 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany, 2009. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[27] C. Okasaki. Purely Functional Data Structures. Cambridge University Press, New York, NY, USA, 1998.

[28] B. C. d. S. Oliveira, A. Moors, and M. Odersky. Type Classes as Objects and Implicits. In OOPSLA '10. ACM, 2010.

[29] M. Paleczny, C. Vick, and C. Click. The Java HotSpot™ server compiler. In Proceedings of the 2001 Symposium on JavaTM Virtual Machine Research and Technology Symposium, JVM'01, Berkeley, CA, USA, 2001. USENIX Association.

[30] A. Prokopec. ScalaMeter. https://scalameter.github.io/.

[31] A. Prokopec. Data Structures and Algorithms for Data-Parallel Computing in a Managed Runtime. PhD thesis, IC, Lausanne, 2014.

[32] A. Prokopec and M. Odersky. Near optimal work-stealing tree scheduler for highly irregular data-parallel workloads. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, pages 55–86. Springer International Publishing, 2014.

[33] A. Prokopec, D. Petrashko, and M. Odersky. Efficient Lock-Free Work-stealing Iterators for Data-Parallel Collections. 2015.

[34] A. Prokopec, T. Rompf, P. Bagwell, and M. Odersky. On a generic parallel collection framework, 2011.

[35] T. Rompf and M. Odersky. Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs. Communications of the ACM, 55, 2012.

[36] T. Rompf, A. K. Sujeeth, K. J. Brown, H. Lee, H. Chafi, and K. Olukotun. Surgical precision JIT compilers. In M. F. P. O'Boyle and K. Pingali, editors, ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom, June 09-11, 2014, page 8. ACM, 2014.

[37] N. Stucki. Turning Relaxed Radix Balanced Vector from Theory into Practice for Scala Collections. Master's thesis, EPFL, 2015.

[38] W. Taha and T. Sheard. MetaML and Multi-Stage Programming with Explicit Annotations. In Theoretical Computer Science. ACM Press, 1999.

[39] V. Ureche, E. Burmako, and M. Odersky. Late data layout: Unifying data representation transformations. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '14, pages 397–416, New York, NY, USA, 2014. ACM.

[40] V. Ureche, C. Talau, and M. Odersky. Miniboxing: Improving the Speed to Code Size Tradeoff in Parametric Polymorphism Translations. In OOPSLA '13, pages 73–92, New York, NY, USA, 2013. ACM.

[41] B. Venner, G. Berger, and C. C. Seng. ScalaTest. http://www.scalatest.org/.