Rust As A Language For High Performance GC Implementation
Rust As A Language For High Performance GC Implementation
Rust As A Language For High Performance GC Implementation
Abstract 1. Introduction
High performance garbage collectors build upon performance- A fast yet robust garbage collector (GC) is the key to garbage col-
critical low-level code, typically exhibit multiple levels of concur- lected language runtimes. However, implementing such a GC is not
rency, and are prone to subtle bugs. Implementing, debugging and easy. First, a collector must manipulate raw memory, depending
maintaining such collectors can therefore be extremely challeng- on carefully optimized code to do so, making it naturally prone to
ing. The choice of implementation language is a crucial considera- memory bugs. Second, high performance GCs are rich in concur-
tion when building a collector. Typically, the drive for performance rency, typically featuring thread parallelism, including thread-local
and the need for efficient support of low-level memory operations allocation, parallel tracing, and possibly mutator concurrency, mak-
leads to the use of low-level languages like C or C++, which offer ing it prone to race conditions and extremely time consuming bugs.
little by way of safety and software engineering benefits. This risks What makes the situation worse is that the implementation lan-
undermining the robustness and flexibility of the collector design. guage usually does not provide help in terms of memory safety
Rust’s ownership model, lifetime specification, and reference bor- and thread safety. The imperative of performance encourages the
rowing deliver safety guarantees through a powerful static checker use of languages such as C and C++ in collector implementations.
with little runtime overhead. These features make Rust a com- But their weak type system, lack of memory safety, and lack of
pelling candidate for a collector implementation language, but they integrated support for concurrency [5] throws memory and thread
come with restrictions that threaten expressiveness and efficiency. safety squarely back into the hands of developers.
We describe our experience implementing an Immix garbage Poor software engineering leads not only to hard-to-find bugs
collector in Rust and C. We discuss the benefits of Rust, the obsta- and performance pitfalls, but decreases reuse, inhibiting progress
cles encountered, and how we overcame them. We show that our by thwarting creativity and innovation. Unfortunately, program-
Immix implementation has almost identical performance on micro ming languages often place positive traits such as abstraction and
benchmarks, compared to its implementation in C, and outperforms safety at odds with performance. However, we are encouraged:
the popular BDW collector on the gcbench micro benchmark. We first, by prior work [10, 11] that shows that in an implementation
find that Rust’s safety features do not create significant barriers to such as a garbage collector, low-level code is the exception, not
implementing a high performance collector. Though memory man- the rule; and second by the Rust programming language, which
agers are usually considered low-level, our high performance im- rather boldly describes itself as a systems programming language
plementation relies on very little unsafe code, with the vast major- that runs blazingly fast, prevents segfaults, and guarantees thread
ity of the implementation benefiting from Rust’s safety. We see our safety [14].
experience as a compelling proof-of-concept of Rust as an imple- We evaluate the software engineering of a high performance
mentation language for high performance garbage collection. collector, and our experience confirms the prior work. In particu-
lar, we confirm that: (i) performance-critical code is very limited in
Categories and Subject Descriptors D.3.4 [Programming Lan- its scope, (ii) memory-unsafe code is very limited in its scope, and
guages]: Processors—Memory management (garbage collection), (iii) language-supported, high performance thread-safe data struc-
Run-time environments tures are fundamental to collector implementation. For these rea-
sons, a well-chosen language may greatly benefit collector imple-
General Terms Experimentation, Languages, Performance, Mea- mentations without compromising performance.
surement Our prior experience in collector implementation includes both
C/C++ and high level languages. This, and the emergence of Rust
Keywords memory management, garbage collection, Rust led us to evaluate it as a language for high performance collector
implementation. Rust is type, memory, and thread safe; all safety
features that we believe will help in delivering a robust collector.
Rust also permits unsafe operations (and inline assembly1 ) in un-
safe blocks, allowing us to access bare memory, and to fine-tune
Permission to make digital or hard copies of part or all of this work for personal or performance on fast paths when needed. Furthermore, Rust uses a
classroom use is granted without fee provided that copies are not made or distributed powerful compile-time safety checker to shift as much as possible
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored. of the safety burden to compile time, avoiding runtime overheads
For all other uses, contact the Owner/Author. where possible. The checker is based on Rust’s model of object
Copyright is held by the owner/author(s). ownership, lifetimes, and reference borrowing. The model elimi-
ISMM’16, June 14, 2016, Santa Barbara, CA, USA
ACM. 978-1-4503-4317-6/16/06...$15.00 1 Inline assembly is currently only available in nightly releases, and not
http://dx.doi.org/10.1145/2926697.2926707 used in this work.
89
nates the possibility of dangling pointers and races, and ensures 3. Rust Background
memory safety. However, this model also restricts the language’s We now introduce some of the key concepts in Rust used in this
expressiveness. Mapping semantics required of a collector imple- paper.
mentation into Rust’s language model was the major challenge that
we faced during the work. Ultimately, we found that not only was Ownership. In Rust, variable binding grants a variable unique
this achievable, but that we could do so with no discernable over- ownership of the value it is bound to. This is similar to C++11’s
head compared to an implementation of the same collector written std::unique_ptr, but it is mandatory for Rust as it is the key
in C. concept upon which Rust’s memory safety is built. Unbound vari-
The principal contributions of this paper are: (i) a discussion of ables are not allowed, and rebinding involves move semantics,
the challenges of implementing a high-performance collector in a which transfers the ownership of the object to the new variable
type, memory and thread-safe language, (ii) a discussion of the se- while invalidating the old one.3 When a variable goes out of scope,
mantic impedance between Rust’s language model and the seman- the associated ownership expires and resources are reclaimed.
tic requirements of a collector implementation, (iii) a performance References. Acquiring the ownership of an object for accessing is
comparison evaluating Rust and C implementations of the same expensive because the compiler must emit extra code for its proper
high performance collector design, and (iv) a comparison with the destruction on expiry. A lightweight approach is to instead borrow
popular Boehm-Demers-Weiser (BDW) collector implemented in references to access the object. Rust allows one or more co-existing
C [7, 8]. immutable references to an object or exactly one mutable reference
We start by describing how we are able to use Rust’s particular with no immutable references. The ownership of an object cannot
language features in our high performance collector implementa- be moved when it is borrowed. This rule eliminates data races, as
tion. We then discuss cases where we found it necessary to abuse mutable (write) and immutable (read) references are made mutually
Rust’s unsafe escape hatches, avoiding its restrictive semantics, and exclusive by the rule. More interestingly, this mutual exclusion is
ensuring the performance and semantics we required. Finally, we guaranteed mostly at compile time by Rust’s borrow checker.
conduct a head-to-head performance comparison between our col-
lector implementation in Rust and a mostly identical implementa- Data Guarantees (Wrapper Types). An important feature of Rust
tion in C to demonstrate that if used properly, the safety and ab- is that the language and its library provide various wrapper types
straction cost from Rust is minimal, compared to an unsafe lan- with different guarantees and tradeoffs. For example, plain ref-
guage such as C. Also we show that both implementations outper- erences such as &T and &mut T statically guarantee a read-write
form BDW, a production-level GC. This suggests that it is possible ‘lock’ for single-threaded code with no runtime overhead, while
to have a fast GC implementation that also benefits from its imple- RefCell<T> offers the same guarantee at the cost of runtime
mentation language’s safety guarantees. checks but is useful when the program has complicated data flow.
Our implementation uses the following wrapper types as described
in later sections of the paper. Box<T> represents a pointer which
uniquely owns a piece of heap-allocated data. Arc<T> is another
frequently used wrapper which provides an atomically reference-
2. Related Work counted shared pointer to data of type T, and guarantees the data
There has been much previous work addressing the implementation stays accessible until every Arc<T> to it goes out of scope (i.e.,
of efficient collectors in safe languages [1, 2, 4, 11, 13, 18]. The the count drops to zero). A common idiom to share mutable data
advantages of using safe languages demonstrated by these projects among threads is Arc<Mutex<T>> which provides a mutual ex-
motivate our work. However, our use of Rust takes some further clusive lock for type T, and allows sharing the mutex lock across
steps: (i) Rust guarantees type, memory, and thread safety. Previous threads.
work uses languages that make weaker safety guarantees such as
Unsafe. Rust provides a safe world where there are no data races
type safety and weaken memory safety (by disabling GC) and leave
and no memory faults. However, the semantics in safe Rust are in
thread safety exposed. (ii) Rust has considerably more restricted se-
some cases either too restrictive or too expensive. Rust allows un-
mantics and expressiveness, since it performs most safety checks at
safe code, such as raw pointers (e.g., *mut T), forcefully allowing
compile time. This constrains GC implementation, and invites the
sharing data across threads (e.g., unsafe impl Sync for T{}),
question of whether implementing a high performance GC in Rust
intrinsic functions (e.g., mem::transmute() for bit casting with-
is even viable. (iii) Rust is an off-the-shelf language. We take the
out check), and external functions from other languages (e.g.,
challenge to map GC features efficiently to the language semantics
libc::malloc()). Unsafe is a powerful weapon for program-
and we use Rust without changes to the base language. Previous
mers to wield at their own risk. Rust alerts programmers by re-
work changed or augmented the semantics of the implementation
quiring unsafe code to be contained within a block that is marked
languages to favor GC implementation [11, 18].
unsafe, or exposed to the caller by marking the containing function
There are also projects2 that implement collectors in Rust for
as itself unsafe.
Rust. Though these projects use Rust as the implementation lan-
guage, their focus is in introducing GC as a language feature to
Rust. Their implementations do not reflect the state of the art of 4. Using Rust
GC, and they do not report performance: the delivery of both sig- We now describe key aspects of how we use Rust’s language fea-
nificantly adds difficulty in a collector implemented in Rust. Our tures to construct a high performance garbage collector. In Sec-
work takes on the challenge of achieving both, to deliver a high- tion 5 we discuss how we found it necessary to abuse Rust, se-
performance advanced collector implementation. lectively bypassing its restrictive semantics to achieve the perfor-
mance and semantics necessary for a high performance collector.
For the sake of this proof of concept implementation, we imple-
ment the Immix garbage collector [3]. We use it because it: (i) is
2A reference counted type with cycle collection for Rust: https://
github.com/fitzgen/bacon-rajan-cc; a simple tracing (mark 3 Rebinding of Copy types, such as primitives, makes a copy of the value
and sweep) garbage collector for Rust: https://github.com/ for the new variable instead of moving ownerships; the old variable remains
Manishearth/rust-gc. valid.
90
1 #[derive(Copy, Clone, Eq, Hash)]
arbitrary location in the memory space managed by the GC and
2 pub struct Address(usize); address arithmetic is allowed (and necessary) on the address type,
3 while an object reference maps directly to a language-level object,
4 impl Address { pointing to a piece of raw memory that lays out an object and that
5 // address arithmetic
6 #[inline(always)]
assumes some associated language-level per-object meta data (such
7 pub fn plus(&self, bytes: usize) -> Address { as an object header, dispatch table, etc). Converting an object ref-
8 Address(self.0 + bytes) erence to an address is always valid, while converting an address to
9 } an object reference is unsafe.
10
11 // dereference a pointer
Abstracting and differentiating addresses is important, but since
12 #[inline(always)] addresses are used pervasively in a GC implementation, the ab-
13 pub unsafe fn load<T: Copy> (&self) -> T { straction must be efficient, both in space and time. We use a single-
14 *(self.0 as *mut T) field tuple struct to provide Address and ObjectReference,
15 }
16
abstracting over Rust’s word-width integer usize to express ad-
17 // bit casting dresses, as shown in Figure 1. This approach disables the oper-
18 #[inline(always)] ations on the inner type, and allows a new set of operations on
19 pub fn from_ptr<T> (ptr: *const T) -> Address { the abstract type. This abstraction adds no overhead in type size,
20 unsafe {mem::transmute(ptr)}
21 }
and the static invocation of its methods can be further marked as
22 #[inline(always)] to remove any call overhead. So while the
23 // cons a null types have the appearance of being boxed, they are materialized as
24 #[inline(always)] unboxed values with zero space and time overheads compared to an
25 pub unsafe fn zero () -> Address {
26 Address(0)
untyped alternative, whilst providing the benefits of strong typing
27 } and encapsulation.
28 We restrict the creation of Addresss to be either from raw
29 ... pointers, which may be acquired from mmap and malloc, or de-
30 }
rived from an existing Address. Address creation from arbitrary
integers is forbidden, with the single exception of the constant
Address::zero(). This serves as an initial value for some fields
Figure 1. An excerpt of our Address type, showing some of its
safe and unsafe methods. of type Address within other structs, since Rust does not allow
structs with uninitialized fields. A safer alternative in Rust is to
use Option<Address> initialized as None to indicate that there
a high-performance garbage collector, (ii) has interesting charac- is no valid value. However, this adds an additional conditional and
teristics beyond a simple mark-sweep or copying collector, and a few run-time checks to extract the actual address value in the
(iii) has a well-documented publicly available reference implemen- performance-critical path of allocation, which adds around 4% per-
tation. Our implementation supports parallel (thread-local) alloca- formance overhead. We deem this tradeoff not to be worthwhile
tion and parallel collection. We have not yet implemented oppor- given the paramount importance of the allocation fast path and the
tunistic compaction, nor generational or reference counting vari- infrequency with which this idiom arises within the GC implemen-
ants [16]. We do not limit our discussion to the Immix algorithm, tation. Thus we choose to allow Address::zero() but mark it as
but rather we consider Rust’s broader suitability as a GC imple- unsafe so that implementors are explicitly tasked with the burden
mentation language. of ensuring safety.
Our implementation follows three key principles: (i) the collec- Our implementation of ObjectReference follows a very sim-
tor must be high performance, with all performance-critical code ilar pattern. The ObjectReference type provides access to per-
closely scrutinized and optimized, (ii) we do not use unsafe code object memory manager metadata (such as mark-bits/-bytes). An
unless absolutely unavoidable, (iii) we do not modify the Rust lan- Address cannot be safely cast to an ObjectReference; the
guage in any way. allocator code responsible for creating objects must do so via
The remainder of this section considers four distinct elements an unsafe cast, explicitly imposing the burden of correctness
of our experience of Rust as a GC implementation language: (i) the for fabricating objects onto the implementer of the allocator. An
encapsulation of Address and ObjectReference types, (ii) man- ObjectReference can always be cast to an Address.
aging ownership of address blocks, (iii) managing global ownership
of thread-local allocations, and (iv) utilizing Rust libraries to sup-
port efficient parallel collection. 4.2 Ownership of Memory Blocks
Thread-local allocation is an essential element of high performance
4.1 Encapsulating Address Types memory management for multithreaded languages. The widely
Memory managers manipulate raw memory, conjuring language- used approach is to maintain a global pool of raw memory re-
level objects from raw memory. Experience shows the importance gions from which thread-local allocators take memory as they need
of abstracting over both arbitary raw addresses and references to it, and to which thread-local collectors push memory as they re-
user-level objects [4, 11]. Such abstraction offers type safety and cover it [1]. This design means that the common case for allocation
disambiguation with respect to implementation-language (Rust) involves no synchronization, whilst still facilitating sharing of a
references. Among the alternatives, raw pointers can be mislead- global memory resource. The memory manager must ensure that
ing and dereferencing an untyped arbitrary pointer may yield un- it correctly manages raw memory blocks to thread-local alloca-
expected data, while using integers for addresses implies arbitrary tors, ensuring exclusive ownership of any given raw block. Note,
type casting between pointers and integers, which is dangerous. however that once objects are fabricated from these raw blocks,
Abstracting address types also allows us to distinguish ad- they may (according to the implemented language’s semantics) be
dresses from object references for the sake of software engineering shared among all threads. Furthermore, at collection time a parallel
and safety. Addresses and object references are two distinct ab- collector may have no concept of memory ownership, with each
stract concepts in GC implementations: an address represents an thread marking objects at any place in the heap, regardless of any
91
1 // thread local allcator
ple, allocators might be told to yield by a collector thread via this
2 pub struct AllocatorLocal { data structure). Rust will not allow for a mixed ownership model
3 ... like this except by making the data structure shared, which means
4 space: Arc<Space>, that all accesses are vectored through a synchronized wrapper type,
5
6 // allocator may own a block it can allocate into
ensuring that every allocation is synchronized, thus defeating the
7 // Option suggests the possibily of being None, very purpose of the thread-local allocator.
8 // which leads to the slow path to acquire a block We deal with this by breaking the per-thread Allocator into
9 block: Option<Box<Block>> two parts, a thread-local part and a global part, as shown in FIg-
10 }
11
ure 3. The thread-local part includes the data that is accessible
12 // global space, shared among multiple allocators strictly within current thread and an Arc reference to its global part.
13 pub struct Space { All shared data goes to the global part (with a safe wrapper if muta-
14 ... bility is required). This allows efficient access to thread local data,
15 usable_blocks : Mutex<LinkedList<Box<Block>>>,
16 used_blocks : Mutex<LinkedList<Box<Block>>>
while allowing shared per-thread data to be accessed globally.
17 }
18
19 impl AllocatorLocal { 1 pub struct AllocatorLocal {
20 fn alloc_from_global (&mut self, 2 // fields that are strictly thread local
21 size: usize, align: usize) -> Address { 3 ...
22 // allocator will return the ownership of 4
23 // current block (if any) to global space 5 // fields that are logically per allocator
24 if block.is_some() { 6 // but need to be accessed globally
25 let block = self.block.take().unwrap(); 7 global: Arc<AllocatorGlobal>
26 self.space.return_used_block(block); 8 }
27 } 9
28 10 pub struct AllocatorGlobal {
29 // keep trying acquiring a new block from space 11 // any field in this struct that requires
30 loop { 12 // mutability needs to be either be atomic
31 let new_block 13 // or lock-guarded
32 = self.space.get_next_usable_block(); 14 ...
33 ... 15 }
34 } 16
35 } 17 // statics that involve dynamic allocation
36 } 18 lazy_static! {
19 pub static ref ALLOCATORS
20 : Vec<Arc<AllocatorGlobal>> = vec![];
21 }
Figure 2. Ownership transfer between the global memory pool and
a thread local allocator.
Figure 3. Separating a per-thread Allocator into two parts. The
local part is strictly thread local, while the global part can be
notion of ownership over the object’s containing block. We make accessed globally.
this guarantee by using Rust’s ownership semantics.
Ownership is the key part of Rust’s approach to delivering both
performance and safety. We map the ownership semantics to this
scenario to make the guarantee that each block managed by our 4.4 Library-Supported Parallelism
GC is in a coherent state among usable, used, or being allocated Parallelism is essential to high performance collector implementa-
into by a unique thread. To achieve this, we create Block objects, tions. Aside from the design of the high level algorithm, the ef-
each of which uniquely represents the memory range of the block ficiency of a collector depends critically on the implementation
and its meta data. The global memory pool owns the Blocks, of fast, correct, parallel work queues [12]. In a marking collector
and arranges them into a list of usable Blocks and a list of used such as Immix and most tracing collectors, a work queue (or ‘mark
Blocks (Figure 2). Whenever an allocator attempts to allocate, it stack’) is used to manage pending work. When a thread finds new
acquires the ownership from the usable Block list, gets the memory marking work, it adds a reference to the object to the work queue,
address and allocation context from the Block, then allocates into and when a thread needs work, it takes it from the work queue.
the corresponding memory. When the thread-local memory block is Ensuring efficient and correct operation of a parallel work queue
full, the Block is returned to the global used list, and waits there for is a challenging aspect of high performance collector implementa-
collection. The Rust’s ownership model ensures that allocation will tion [4, 12].
not happen unless the allocator owns the Block, and, further every We were pleased to find that Rust provides a rich selection of
Block is guaranteed to be in one of the three states: (i) owned by safe abstractions that perform well as part of its standard and ex-
the global space as a usable Block, (ii) owned by a single allocator, ternal libraries (known as crates in Rust parlance). Using an ex-
and being allocated into, (iii) owned by the global space as a used ternal crate is as simple as adding a dependency in the project
Block. During collection, the collector scavenges memory among configuration, which greatly benefits code reusability. The use of
used Blocks, and classifies them as usable for further allocation if standard libraries is deeply problematic when using a modified
they are free. or restricted language subset, as has been commonly used in the
past [1, 2, 4, 11, 13, 18]. For example, if using a restricted subset of
4.3 Globally Accessible Per-Thread State Java, one must be able to guarantee that any library used does not
A thread-local allocator avoids costly synchronization on the allo- violate the preconditions of the subset, which may be extremely re-
cation fast path because mutual exclusion among allocators is en- strictive (such using only the fully static subset of the language, ex-
sured. This is something that Rust’s ownership model ensures can cluding allocation and dynamic dispatch). Consequently, standard
be implemented very efficiently. However parts of the thread-local libraries are off limits when using restricted Java to build a garbage
allocator data structure may be shared at collector time (for exam- collector.
92
We utilize two crates in Rust, std::sync::mpsc, which pro- 1 pub struct AddressMapTable {
vides a multiple-producers single-consumer FIFO queue, and 2 start : Address,
crossbeam::sync::chase_lev4 , which is a lock-free Chase- 3 end : Address,
4
Lev work stealing deque that allows multiple stealers and one sin-
5 len : usize,
gle worker [9]. We use these two abstraction types as the backbone 6 ptr : *mut u8
of our parallel collector with a modest amount of additional code 7 }
to integrate them. 8 // allow sharing of AddressMapTable across threads
Our parallel collector starts single-threaded, to work on a local 9 unsafe impl Sync for AddressMapTable {};
10 unsafe impl Send for AddressMapTable {};
queue of GC roots; if the length of the local queue exceeds a 11
certain threshold, the collector turns into a controller and launches 12 impl AddressMapTable {
multiple stealer collectors. The controller creates an asynchronous 13 pub unsafe fn set (&self, addr: Address, value: u8)
mpsc channel and a shared deque; it keeps the receiver end for 14 {
15 let index = addr.diff(self.start) >> LOG_PTR_SIZE;
the channel, and the worker for the deque. The sender portion and 16 unsafe {
stealer portion are cloned and moved to each stealer collector. The 17 let ptr = self.ptr.offset(index);
controller is responsible for receiving object references from stealer 18 // intrinsics::atomic_store_relaxed(ptr, value);
threads and pushing them onto the shared deque, while the stealers 19 *ptr = value;
20 }
steal work (ObjectReferences) from the deque, do marking and 21 }
tracing on them, and then either push the references that need to 22 }
be traced to their local queue for thread-local tracing or, when the
local queue exceeds a threshold, send the references back to the
controller where the references will be pushed to the global deque. Figure 4. Our AddressMapTable allows concurrent access with
When the local queue is not empty, the stealer prioritizes getting unsafe methods. The user of this data structure is responsible for
work from the local queue; it only steals when the local queue is ensuring that it is used safely.
empty.
Using those existing abstract types makes our implementation
straightforward, performant and robust: our parallel marking and mark table. However, in Rust, if we were to create the line mark
tracing features only 130 LOC (shown as Appendix B) while there table as a Rust array of u8, Rust would forbid concurrent writing
are over one thousand lines of well tested code from the libraries into the array. Ways to bypass this within the confines of Rust are
to support our implementation. We measure and discuss the per- to either break the table down into smaller tables, or to use a coarse
formance of our parallel marking and tracing implementation in lock on the large table, both of which are impractical.
Section 6. On the other hand, during collection, the mutual exclusion en-
joyed by the allocator does not exist: two collector threads may race
5. Abusing Rust to mark adjacent lines, or even the same line. The algorithm ensures
In the previous section, we described how Rust’s semantics af- that such races are benign, as both can only set the line to ‘live’ and
fect the implementation of a high performance garbage collector. storing to a byte is atomic on the target architecture. However, in
Though Rust’s model is sometimes restrictive, in most cases we Rust, it is strictly forbidden to modify a shared object’s non-atomic
were able to fairly straightforwardly adapt the collector design to fields without going through a lock. We are unaware of a reliable
take full advantage of Rust’s safety and performance. However, solution to this in stable Rust releases, which do not support an
there are a few places where we found that Rust’s safety model AtomicU8 type, nor intrinsic atomic operations as in the nightly
was too restrictive to express the necessary semantics efficiently, releases.
and thus found ourselves having to dive into unsafe code, where Instead, we use the work-around shown in Figure 4. We general-
the programmer bears responsibility for safety, rather than Rust and ize the line mark table as an AddressMapTable. We wrap the nec-
its compiler. essary unsafety into the AddressMapTable implementation which
almost entirely comprises safe code. We acknowledge also that for
5.1 Shared Bit and Byte Maps proper atomicity of the byte store (with respect to both the compiler
Garbage collectors often implement bit maps and byte maps to rep- and target) we should also be using an atomic operation to store the
resent collection state, mapping addresses to table offsets. Exam- value rather than a normal assignment. Here we rely on the Rust
ples include card tables (which remember modified memory re- compiler to generate an x86 byte store which is atomic. Otherwise,
gions), and mark tables (which remember marked objects). To im- there are reasonable compiler optimizations that could defeat the
plement these correctly and efficiently, they are frequently byte correctness of our code [6]. What is more, the target architecture
maps (allowing atomic update). Semantics may include, for exam- might not have an atomic byte store operation. The availability of
ple, multiple writers but idempotent transitions: during the mark LLVM intrinsics in the non-stable nightly Rust releases would al-
phase, the writers may only set the mark byte (not clear it). For ex- low us to use a relaxed atomic store to achieve the correct code,
ample, an object map indicates the start of objects: in a heap where as shown in the comment. This exposes a shortcoming in Rust’s
every object is 8-byte aligned, every bit in such a bitmap can repre- current atomic types where we desire an AtomicU8 type, along
sent whether an 8-byte aligned address is the start of an object. In the lines of the existing AtomicUsize. This need is reflected in
Immix, a line mark table is used to represent the state of every line the recently accepted Rust RFC #1543: ‘Add more integer atomic
in the memory space — an unsigned byte (u8) for every 256-bytes types’ [15].
of allocated memory.
During allocation, the line mark table may be accessed by mul-
tiple allocator threads, exclusively for the addresses that they are al-
6. Evaluation
locating into. Since every allocator allocates into a non-overlapping The two primary objectives of our proof-of-concept implementa-
memory block, they access non-overlapping elements in the line tion were to establish: (i) to what extent we are able to exploit
Rust’s safety (hopefully minimizing the amount of unsafe code),
4 https://github.com/aturon/crossbeam and (ii) the impact of Rust on performance. In this section, we dis-
93
cuss our evaluation of our proof-of-concept collector, focusing on to be efficient and allows a head-to-head comparison for Rust per-
these concerns. formance. We took particular care to ensure that the performance-
critical hot paths were implemented efficiently in the respective lan-
6.1 Safe Code guages.
Our first major challenge was to map our collector design into the We chose three performance-critical paths of the collector to run
Rust language. As we discuss in Sections 4 and 5, for the main single-threaded as micro benchmarks: allocation, object marking,
part, the collector implementation can be expressed entirely in safe and object tracing. Each micro benchmark allocates 50 million
Rust code. As shown in Table 1, 96 % of 1449 lines of the code objects of 24 bytes each, which takes 1200 MB of heap memory; we
are safe. This suggests that though GC is usually considered to be use a 2000 MB memory for each run so that the GC will not collect
a low-level module that operates heavily on raw memory, the vast spontaneously (we control when tracing and collection occurs in
majority of its code can in fact be safe, and can benefit from the the respective micro benchmarks). In each micro benchmark, we
implementation language if that language offers safety. measure the time spent on allocating, marking, and tracing the 50
million objects. We use rustc 1.6.0 stable release for Rust, and clang
Language Files Lines of Code Unsafe LOC (%) 3.7 for C, both of which use LLVM 3.7 as backend. We run each
Rust 13 1449 58 (4.0%) implementation with 20 invocations on a 22 nm Intel Core i7 4770
processor (Haswell, 3.4 GHz) with Linux kernel version 3.17.0.
Table 1. Unsafe code is minimal in our implementation. The results appear in Table 2.
The unsafe code amounts to 4.0 % and mainly comes from just C Rust (% to C)
two sources. The first is where unsafe is required for access to alloc 370 ± 0.1 ms 374 ± 2.9 ms (101%)
raw memory, such as dereferencing raw pointers during tracing, mark 63.7 ± 0.5 ms 64.0 ± 0.7 ms (100%)
manipulating object headers, zeroing memory, etc. This is unavoid- trace 267 ± 2.1 ms 270 ± 1.0 ms (101%)
able in memory manager implementations. Our experience shows
that through proper abstraction, the unsafe code for accessing raw Table 2. Average execution time with 95% confidence interval for
memory can be restricted to a small proportion of the code base. micro benchmarks of performance critical paths in GC. Our imple-
The second source of unsafety is due to Rust’s restricted seman- mentation in Rust performs the same as the C implementation.
tics. Rust trades expressiveness for the possibility of statically en-
forcing safety. Section 4 shows that for most of the cases, we are From the micro benchmark results, we can see that with careful
able to adapt our collector implementation to Rust’s constraints. In performance tuning, the Rust implementation matches the perfor-
the exceptional case described in Section 5 where Rust stands in mance of our C implementation across all the three micro bench-
our way, we are able to encapsulate it in a small amount of unsafe marks (identifying most performance critical paths in a collector
code. implementation). In our initial implementation (without fine per-
Our experience demonstrates that a garbage collector can be formance tuning), Rust was within 10% slowdown of C on micro
implemented in a safe language such as Rust with very little unsafe benchmarks. We found it encouraging, considering: (i) our source
code. Furthermore, we can report that, subjectively, the discipline code in Rust offers stronger abstraction than C, a low-level imper-
imposed upon us by Rust was a real asset when we went about this ative language, and (ii) the source code enjoys Rust’s safety guar-
non-trivial systems programming task with its acute performance antees. We then fine-tuned the performance, examining assembly
and correctness focus. code generated for each implementation, where necessary altering
fast path code to avoid idioms with negative performance implica-
6.2 Performance tions. Our micro benchmarks have tiny kernels and are memory in-
Our second challenge was to deliver on our goal of high per- tensive, and one instruction may affect results. We found although
formance. Since at this stage we are implementing a standalone rustc is agressive it is quite predicatable making it not difficult to
garbage collector, not yet integrated into a larger language run- generate highly performant code. Lack of tools for finer control on
time, it is hard to provide performance evaluation via comprehen- the generated code such as branch hints may be a drawback of the
sive benchmarks; instead we use micro benchmarks to evaluate the Rust compiler, but did not hinder performance in the micro bench-
collector. We are not interested in evaluating garbage collection al- marks. Appendix A shows Rust code for the allocation fast path,
gorithms per se (we take an existing algorithm off the shelf). Rather, and Appendix B shows the code for mark and trace fast path in a
we simply wish to provide proof of concept for an implementation parallel collector.
of a high performance collector in Rust and show that it performs
well in comparison to an equivalent collector written in C. To this 6.2.2 Library-based Parallel Mark and Trace
end, we are particularly interested in the performance of critical hot We evaluate the performance scaling of parallel GC in our im-
paths, both for collection and allocation since the performance of plementation. As described in Section 4.4, we quickly imple-
the algorithm itself is already established [3], and our prior experi- mented the parallel mark and trace collector by completely bas-
ence demonstrates the overwhelming criticality of these hot paths ing its parallelism on existing Rust crates: std::sync::mpsc and
to performance. crossbeam:: sync::chase_lev. They provide all the concur-
rency as the backbone of our parallel collector. This implementa-
6.2.1 Micro Benchmarks tion approach is high-level and productive, but as we shall show, it
To evaluate the performance of our implementation in Rust, we is also performant.
also implemented the collector in C, following the same Immix al- We use a micro benchmark to trace 50 quad trees of depth ten to
gorithm. We did not try to make the two implementations exactly allow parallel collectors to build a reasonable local work queue for
identical, but used the features of the available language in a natu- thread-local tracing and to push excessive references to the global
rally fluent way. For most scenarios described in Section 4 and 5, it deque. We use a large heap to avoid spontaneous collections during
is either unnecessary or simply impossible to write C code the same the tree allocation. We run this with twenty invocations on the
way as Rust code. The C implementation allows us to set a base- same i7 Haswell machine, using from zero to seven GC worker
line for performance in an implementation language that is known threads, and measure the tracing time. Note that zero means no
94
mainly a bump pointer allocator, while BDW uses a free list alloca-
tor. Immix outperforms freelist allocators by 16% in large bench-
marks [17]; we expect the performance advantage is even bigger in
micro benchmarks that allocate in a tight loop. We ran our alloc
micro benchmark for BDW, and we find that the performance dif-
ference between our allocator and the BDW allocator is similar,
confirming our belief. Our GC implementation is different from
the BDW collector in a few other respects, which contribute to the
performance difference: (i) Our GC is conservative with stacks but
precise with the heap, while the BDW collector is conservative with
both; (ii) Our GC presumes a specified heap size and reserves con-
tiguous memory space for the heap and metadata side tables during
initialization, while the BDW collector allows dynamic growing of
a discontiguous heap.
We also compared the two collectors with a multi-threaded
version of gcbench (as mt-gcbench in Table 3). Both collectors
use eight allocator threads, and eight GC threads. The results show
that our implementation performs 2× faster than BDW on this
workload. Our implementation outperforms the BDW collector on
Figure 5. Performance scaling for our fast implemented libraries- gcbench and mt-gcbench (respectively by 79 % and 2×), which
based parallel mark and trace collector. suggests our implementation in Rust delivers good performance
compared to the widely used BDW collector.
parallel GC while one to seven reflect the number of GC worker BDW Immix(Rust) % of BDW
threads (with one additional controller thread). Seven workers with gcbench 172 ± 0.8 ms 97 ± 0.3 ms 56%
one controller is the full capacity of our machine (four cores with mt-gcbench 1415 ± 3.1 ms 466 ± 1.9 ms 33%
eight SMT threads). Figure 5 shows the results along with a line
indicating hypothetical perfect scaling (which assumes workloads Table 3. Performance comparison between our Immix GC in Rust
are equally divided among threads with no overhead compared to a and BDW on gcbench and multi-threaded gcbench.
single-threaded collector).
With parallel GC disabled, single-threaded marking and tracing
takes 605 ms, while with one worker thread, the benchmark takes We conclude that using Rust to implement GC does not preclude
716 ms. The overhead is due to sending object references back to high performance, and justify this with the following observations:
the global deque through an asynchronous channel, and stealing (i) our implementation in Rust performs as well as our C imple-
references from the shared deque when the local work queue is mentation using the same algorithm in performance-critical paths,
empty. With two and three worker threads, the scaling is satis- and (ii) our implementation in Rust outperforms the widely-used
factory, with execution times of 378 ms and 287 ms (52.8 % and BDW collector on gcbench and mt-gcbench. This result shows
40.0 % compared with one worker). When the number of worker the capability of Rust for high-performance GC implementations,
threads exceed four, the scaling starts to fall off slightly. With seven as a language with memory, thread and type-safety.
worker threads, the execution time is 166 ms, which is 23.2 % of
one worker thread. The performance degradation is most likely 7. Conclusion
from two sources: (i) GC workers start to share resources from the Rust is a compelling language that makes strong claims about its
same core after every core hosts one worker, (ii) having one central suitability for systems programming, promising both performance
controller thread to receive object references and push them to the and safety. We explored Rust as the implementation language for
global deque starts to be a performance bottleneck. These results a high performance garbage collector by implementing the Immix
could undoubtedly be improved with further tuning. However, as it garbage collector in both C and Rust, and used micro benchmarks
is one tricky part of the implementation, and we worked towards a to evaluate their performance alongside the well-established BDW
working (and safe) implementation with limited time and limited collector.
lines of code by using existing libraries, the approach itself is inter- We found that the Rust programming model is quite restrictive,
esting and demonstrates the performance tradeoff due to improved but not needlessly so. In practice we were able to use Rust to imple-
productivity. We believe the performance scaling is good, and that ment Immix. We found that the vast majority of the collector could
having a language that provides higher level abstractions can ben- be implemented naturally, without difficulty, and without violat-
efit a parallel GC implementation (and possibly a concurrent GC) ing Rust’s restrictive static safety guarantees. In this paper we have
greatly. discussed each of the cases where we ran into difficulties and how
we overcame those challenges. Our experience was very positive:
6.2.3 GCBench
we enjoyed programming in Rust, we found its restrictive program-
We compare our Immix implementation in Rust with the BDW ming model helpful in the context of a garbage collector implemen-
collector on the gcbench micro benchmark. We enable thread-local tation, we appreciated access to its standard libraries (something
allocators and parallel marking with eight GC threads on BDW. We missing when using a restricted language such as restricted Java),
run on the same machine for the comparison, and use a moderate and we found that it was not difficult to achieve excellent perfor-
heap size of 25 MB (which is roughly 2× minimal heap size). mance. Our experience leads us to the view that Rust is very well
In a run of 20 invocations (as shown in Table 3), the aver- suited to garbage collection implementation.
age execution time for BDW is 172 ms, while the average for
our implementation is 97 ms (79 % faster). We do not find the re-
sult surprising. Our GC implements an Immix allocator which is
95
Acknowledgments [14] Rust. The Rust Language. URL https://www.rust-lang.
org.
We gratefully acknowledge the comments and thoughtful feedback
that we received from Kathryn McKinley, Eliot Moss, Hans Boehm [15] Rust RFC 1543. Add more integer atomic types, May 2016. URL
https://github.com/rust-lang/rfcs/pull/1543.
and our anonymous reviewers. Data61 is funded by the Australian
Government through the Department of Communications and the [16] R. Shahriyar, S. M. Blackburn, X. Yang, and K. M. McKinley. Tak-
Australian Research Council through the ICT Centre of Excellence ing off the gloves with reference counting Immix. In ACM SIG-
PLAN Conference on Object Oriented Programming Systems Lan-
Program. This work is also supported by National Science Foun-
guages and Applications, Indianapolis, Indiana, Oct. 2013. doi: 10.
dation grant no. CCF-1408896 and Australian Research Council 1145/2509136.2509527.
Discovery grant no. DP140103878.
[17] R. Shahriyar, S. M. Blackburn, and K. S. McKinley. Fast conservative
garbage collection. In ACM International Conference on Object Ori-
References ented Programming, Systems, Languages, and Applications, Portland,
Oregon, Oct. 2014. doi: 10.1145/2660193.2660198.
[1] B. Alpern, C. R. Attanasio, A. Cocchi, D. Lieber, S. Smith, T. Ngo, J. J.
[18] C. Wimmer, M. Haupt, M. L. van de Vanter, M. Jordan, L. Daynès,
Barton, S. F. Hummel, J. C. Sheperd, and M. Mergen. Implementing
and D. Simon. Maxine: An approachable virtual machine for, and in,
Jalapeño in Java. In ACM SIGPLAN Conference on Object-oriented
Java. ACM Transactions on Architecture and Code Optimization, 9(4):
Programming, Systems, Languages, and Applications, Denver, Col-
30:1–30:24, Jan. 2013. doi: 10.1145/2400682.2400689.
orado, Nov. 1999. doi: 10.1145/320384.320418.
[2] B. Alpern, S. Augart, S. M. Blackburn, M. Butrico, A. Cocchi,
P. Cheng, J. Dolby, S. Fink, D. Grove, M. Hind, K. S. McKinley,
M. Mergen, J. E. B. Moss, T. Ngo, V. Sarkar, and M. Trapp. The Appendices
Jikes Research Virtual Machine project: Building an open source re-
search community. IBM Systems Journal, 44(2):399–417, May 2005.
Here we present the key Rust code in our implementation, notably
doi: 10.1147/sj.442.0399.
fast-path allocation and parallel mark and trace.
[3] S. M. Blackburn and K. S. McKinley. Immix: A mark-region garbage
collector with space efficiency, fast collection, and mutator perfor-
mance. In ACM SIGPLAN Conference on Programming Language A. Allocation Fastpath
Design and Implementation, Tucson, Arizona, June 2008. doi: 10.
1 pub struct ImmixMutatorLocal {
1145/1375581.1375586. 2 id : usize,
[4] S. M. Blackburn, P. Cheng, and K. S. McKinley. Oil and water? High 3
performance garbage collection in Java with MMTk. In International 4 // use raw pointer here instead of AddressMapTable
Conference on Software Engineering, Edinburgh, Scotland, May 2004. 5 // to avoid indirection in the fast path
6 alloc_map : *mut u8,
doi: 10.1109/icse.2004.1317436. 7 space_start: Address,
[5] H.-J. Boehm. Threads cannot be implemented as a library. In ACM 8
SIGPLAN Conference on Programming Language Design and Imple- 9 // cursor and limit will be invalid after GC.
mentation, Chicago, Illinois, June 2005. doi: 10.1145/1065010. 10 // we avoid using Option<Address>. instead,
11 // we reset both cursor and limit to Address::zero()
1065042. 12 // so that alloc will go to slow path
[6] H.-J. Boehm. How to miscompile programs with “benign” data races. 13 cursor : Address,
In USENIX Conference on Hot Topics in Parallelism, Berkeley, Cal- 14 limit : Address,
ifornia, May 2011. URL http://www.usenix.org/events/ 15 line : usize,
16
hotpar11/tech/final_files/Boehm.pdf.
17 space : Arc<ImmixSpace>,
[7] H.-J. Boehm and M. Weiser. Garbage collection in an uncooperative 18 block : Option<Box<ImmixBlock>>,
environment. Software Practice and Experience, 18(9):807–820, Sept. 19
96
50 #[inline(always)] 70 // never happened for us. defining a reasonable
51 pub fn init_object(&mut self, 71 // PUSH_BACK_THRESHOLD will allow stealers enough
52 obj: Address, encode: u8) { 72 // time to work through their local queue before
53 let index = (obj.diff(self.space_start) 73 // trying to fetch work from global deque.
54 >> LOG_PTR_SIZE) as isize; 74 // for robustness, we need a backup solution here,
55 unsafe { 75 // and it has little implications on performance
56 *self.alloc_map.offset(index) = encode; 76 // unless triggered
57 } 77 match worker.try_pop() {
58 } 78 // leftover work in the deque (the rare case)
59 } 79 // we will relaunch gc threads
80 Some(obj_ref) => worker.push(obj_ref),
B. Parallel Mark and Trace 81 // parallel gc finishes
82 None => break
1 extern crate crossbeam; 83 }
2 84 }
3 #[cfg(feature = "parallel-gc")] 85 }
4 use self::crossbeam::sync::chase_lev::*; 86
5 87 #[cfg(feature = "parallel-gc")]
6 #[cfg(feature = "parallel-gc")] 88 fn start_steal_trace
7 pub fn start_trace(roots: &mut Vec<ObjectReference>, 89 (stealer: Stealer<ObjectReference>,
8 immix_space: Arc<ImmixSpace>, 90 job_sender: Sender<ObjectReference>,
9 lo_space: Arc<FreeListSpace>) 91 immix_space: Arc<ImmixSpace>,
10 { 92 lo_space: Arc<FreeListSpace>)
11 // create work deque: one Worker with several 93 {
12 // Stealers. Worker can push to deque, while 94 let mut local_queue = vec![];
13 // Stealer can only pull 95
14 let (mut worker, stealer) = deque(); 96 // load invariant fields used in the loop
15 97 let line_mark_table = &immix_space.line_mark_table;
16 // push roots to the shared deque 98 let alloc_map = immix_space.alloc_map.ptr;
17 while !roots.is_empty() { 99 let trace_map = immix_space.trace_map.ptr;
18 worker.push(roots.pop().unwrap()); 100 let immix_start = immix_space.start();
19 } 101 let immix_end = immix_space.end();
20 102 let mark_state = objectmodel::MARK_STATE
21 loop { 103 .load(Ordering::Relaxed) as u8;
22 // since the deque allows only one Worker, 104
23 // we create an asynchronous channel for 105 loop {
24 // stealers to pass back references to the 106 let obj = {
25 // controller, then the controller 107 // fetch work from local queue if possible
26 // with the Worker will push them to deque 108 if !local_queue.is_empty() {
27 let (sender, receiver) 109 local_queue.pop().unwrap()
28 = channel::<ObjectReference>(); 110 } else {
29 111 // otherwise, steal from global deque
30 // launch parallel GC threads 112 let work = stealer.steal();
31 let mut gc_threads = vec![]; 113 match work {
32 for _ in 114 // global deque is empty, the thread quits
33 0..GC_THREADS.load(atomic::Ordering::Relaxed) 115 Steal::Empty => return,
34 { 116 // lost a race to steal, retry
35 let new_immix_space = immix_space.clone(); 117 Steal::Abort => continue,
36 let new_lo_space = lo_space.clone(); 118 // get work load, proceed tracing object
37 let new_stealer = stealer.clone(); 119 Steal::Data(obj) => obj
38 let new_sender = sender.clone(); 120 }
39 let t = thread::spawn(move || { 121 }
40 start_steal_trace(new_stealer, new_sender, 122 };
41 new_immix_space, 123
42 new_lo_space); 124 // as steal_trace_object() is inlined,
43 }); 125 // passing arguments is no-op
44 gc_threads.push(t); 126 steal_trace_object(obj, &mut local_queue,
45 } 127 &job_sender, alloc_map,
46 128 trace_map, line_mark_table,
47 // controller gives up its Sender, 129 immix_start, immix_end,
48 // thus only stealers own Sender. 130 mark_state, &lo_space);
49 // when all stealers quit (Senders dropped), 131 }
50 // the loop ends 132 }
51 drop(sender); 133
52 134 // functions are inlined so we omit
53 // main loop for the controller 135 // parameters/arguments
54 loop { 136 #[inline(always)]
55 // fetch from the channel 137 #[cfg(feature = "parallel-gc")]
56 let recv = receiver.recv(); 138 pub fn steal_trace_object(...) {
57 match recv { 139 let addr = obj.to_address();
58 // push obj reference to deque 140
59 Ok(obj) => worker.push(obj), 141 if addr >= immix_start && addr < immix_end {
60 // job finishes 142 // mark object in immix space
61 Err(_) => break 143 mark_as_traced(trace_map, immix_start,
62 } 144 obj, mark_state);
63 } 145 // and associated lines
64 146 line_mark_table.mark_line_live(addr);
65 // a sanity check: 147
66 // since we use an asynchronous channel, it is 148 let mut base = addr;
67 // possible that stealers find an empty deque 149 loop {
68 // and quit before the controller receives and 150 let value
69 // pushes more work to the deque. however, this
97
151 = objectmodel::get_ref_byte(alloc_map, 187 return;
152 immix_start, obj); 188 } else {
153 let ref_bits = 189 // for larger objects, we adjust base pointer
154 lower_bits(value, REF_BITS_LEN); 190 // and trace again
155 let short_encode = 191 base = base.plus(REF_BITS_LEN * 8);
156 test_nth_bit(value, SHORT_ENCODE_BIT); 192 }
157 193 }
158 // fast path trace for common patterns 194 } else {
159 match ref_bits { 195 // mark and trace a large object
160 // first word is reference 196 lo_space.mark_trace(obj)
161 0b0000_0001 => { 197 }
162 steal_process_edge(base, ...); 198 }
163 }, 199
164 // first two words are references 200 pub const PUSH_BACK_THRESHOLD : usize = 50;
165 0b0000_0011 => { 201
166 steal_process_edge(base, ...); 202 #[inline(always)]
167 steal_process_edge(base.plus(8), ...); 203 #[cfg(feature = "parallel-gc")]
168 }, 204 pub fn steal_process_edge(...) {
169 // first 4 words are references 205 // load the objectreference from the field
170 0b0000_1111 => { 206 let obj = unsafe{addr.load::<ObjectReference>()};
171 steal_process_edge(base, ...); 207
172 steal_process_edge(base.plus(8), ...); 208 // push the object reference to work queue
173 steal_process_edge(base.plus(16), ...); 209 // if it is not zero and not traced
174 steal_process_edge(base.plus(24), ...); 210 if !obj.to_address().is_zero()
175 }, 211 && !is_traced(trace_map, immix_start,
176 // more patterns 212 obj, mark_state) {
177 ... 213 // prioritize using local queue
178 // slow path 214 if local_queue.len() < PUSH_BACK_THRESHOLD {
179 _ => trace_object_slow(base, ref_bits) 215 local_queue.push(obj_addr);
180 } 216 } else {
181 217 job_sender.send(obj_addr).unwrap();
182 // for objects that are smaller than 48 bytes, 218 }
183 // its reference locations can be encoded within 219 }
184 // 1 byte 220 }
185 // we finished tracing its fields
186 if short_encode {
98