CHERI Concentrate:
Practical Compressed Capabilities
Jonathan Woodruff, Alexandre Joannou, Hongyan Xia, Anthony Fox, Robert Norton, Thomas Bauereiss,
David Chisnall, Brooks Davis, Khilan Gudka, Nathaniel W. Filardo, A. Theodore Markettos, Michael Roe,
Peter G. Neumann, Robert N. M. Watson, Simon W. Moore
Abstract—We present CHERI Concentrate, a new fat-pointer compression scheme applied to CHERI, the most developed
capability-pointer system at present. Capability fat pointers are a primary candidate to enforce fine-grained and non-bypassable
security properties in future computer systems, although increased pointer size can severely affect performance. However, several existing proposals for capability compression do not support legacy instruction sets, ignore features critical to the existing software base, or introduce design inefficiencies into RISC-style processor pipelines. CHERI Concentrate
improves on the state-of-the-art region-encoding efficiency, solves important pipeline problems, and eases semantic restrictions of
compressed encoding, allowing it to protect a full legacy software stack. We present the first quantitative analysis of compiled capability
code, which we use to guide the design of the encoding format. We analyze and extend logic from the open-source CHERI prototype
processor design on FPGA to demonstrate encoding efficiency, minimize delay of pointer arithmetic, and eliminate additional
load-to-use delay. To verify correctness of our proposed high-performance logic, we present a HOL4 machine-checked proof of the
decode and pointer-modify operations. Finally, we measure a 50% to 75% reduction in L2 misses for many compiled C-language
benchmarks running under a commodity operating system using compressed 128-bit and 64-bit formats, demonstrating both
compatibility with and increased performance over the uncompressed, 256-bit format.
Index Terms—Capabilities, Fat Pointers, Compression, Memory Safety, Computer Architecture
1 INTRODUCTION

Intel Memory Protection Extensions (MPX) and Software Guard Extensions (SGX), as well as Oracle Silicon Secured Memory (SSM), signal an unprecedented industrial willingness to implement hardware mechanisms for memory safety
and security. As industry looks to the next generation, capability pointers have become a primary candidate to conclusively solve memory safety problems. Capability pointers
are stronger than fault detection schemes such as MPX and
SSM, and are able to achieve provable containment at the
granularity of program-defined objects that is as strong as
address-space separation.
The greatest cost for capability pointers involves the
object bounds encoded with each pointer to enforce memory
safety. Encoding both upper and lower bounds as well as
a pointer address requires either larger capabilities [1] or
• Jonathan Woodruff, Alexandre Joannou, Hongyan Xia, Anthony Fox, Robert Norton, Thomas Bauereiss, David Chisnall, Khilan Gudka, Nathaniel Filardo, Theo Markettos, Michael Roe, Robert Watson, and Simon Moore are with the Department of Computer Science and Technology, University of Cambridge, England. Email: {firstname.lastname}@cl.cam.ac.uk.
• Brooks Davis and Peter Neumann are with SRI International. Email: {firstname.lastname}@sri.com.
This work is part of the CTSRD, ECATS, and CIFV projects sponsored by
the Defense Advanced Research Projects Agency (DARPA) and the Air Force
Research Laboratory (AFRL), under contracts FA8750-10-C-0237, HR0011-18-C-0016, and FA8650-18-C-7809. The views, opinions, and/or findings
contained in this paper are those of the authors and should not be interpreted
as representing the official views or policies, either expressed or implied, of
the Department of Defense or the U.S. Government. Approved for Public
Release, Distribution Unlimited. We also acknowledge the EPSRC REMS
Programme Grant [EP/K008528/1], the EPSRC Impact Acceleration Account
[EP/K503757/1], Arm Limited, and Google, Inc.
restrictions on region properties, semantics, and address
space [2], [3].
This paper presents CHERI Concentrate (CC), a compression scheme applied to CHERI, the most developed
capability-pointer system at present. CC achieves the best
published region encoding efficiency, solves important
pipeline problems caused by a decompressed register file,
and eases semantic restrictions due to the compressed encoding. The contributions of this paper are:
• A floating-point bounds encoding with an Internal Exponent that provides maximum precision for small objects, spending bits to encode an exponent only for larger and less common objects.
• The first quantitative characterization of capability operations in compiled programs to inform capability instruction optimization.
• A power-of-two Representable Region beyond object bounds to allow temporarily out-of-bounds pointers, enabling compatibility with a broad legacy code base.
• A Representability Check for pointer arithmetic with delay comparable to a pointer add, enabling integration with standard processor designs.
CC improves efficiency over Low-Fat Pointers, the previous best capability bounds format, by inferring the most
significant bit of the Top field and by encoding the exponent
within the bounds. CC also improves both semantics and timing by allowing out-of-bounds pointer manipulations, which simplifies the pointer-arithmetic check, allowing it to be performed directly on the compressed format.
Submitted for review to IEEE Transactions on Computers
2 BACKGROUND
The importance of architectural support for fine-grained
memory protection has been demonstrated in research (e.g.,
Mondriaan Memory Protection [4], [5]; Hardbound [6]) and
in industry (e.g., MPX [7]). In particular, Mondriaan pointed
out that current operating systems use paged memory for
both protection and virtualization, creating tension between
granularity and performance. Capability pointer systems
use bounded pointers for fine-grained protection, and use
paged memory only for virtualization.
Early capability-pointer machines include the CAP computer [8], the Intel iAPX 432 [9], and the Intel i960 [10]. These machines used indirection to efficiently store object bounds in
an object table. In contrast, more recent capability machines,
including the M-Machine [2] and the CHERI processor [1],
[11], [12], [13] encode object bounds directly in unforgeable
fat pointers, avoiding an additional memory access to an
object table, or an associative lookup in an object cache.
The increased memory footprint due to encoding bounds
in every pointer can be a major challenge for fat-pointer capability schemes. Fat-pointer compression techniques, used
by the M-Machine [2], Aries [14], and Low-Fat Pointers [3],
exploited redundancy between a pointer address and its
bounds to reduce their storage requirements. We learn from
these techniques, improve upon them, and solve additional
challenges to apply them in the context of a conventional
RISC pipeline and a large legacy software base.
Notes on notation: In this paper we use the notation U'V to indicate a field named U composed of V bits. U[W:X] indicates a selection of bits W down to X from bit vector U. {Y, Z} indicates a concatenation of the bits of Y above Z. Thus, U'16 is a 16-bit field, and {U[7:0], U[15:8]} indicates a byte-swap of U.
Furthermore, we use lower-case letters for full-length
values (e.g., a for address and t for top); we use upper-case
letters for fields used for compression (e.g., T for select bits
of the top (t), and E for the exponent).
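As an illustrative sketch (our own; the helper names are not from the paper), the bit-slice and concatenation notation above maps directly onto shift-and-mask operations in C:

```c
#include <assert.h>
#include <stdint.h>

/* U[W:X] - select bits W down to X of u (assumes W >= X and W - X < 63). */
uint64_t bits(uint64_t u, unsigned w, unsigned x) {
    return (u >> x) & ((1ULL << (w - x + 1)) - 1);
}

/* {Y, Z} - concatenate y above a z_width-bit value z. */
uint64_t concat(uint64_t y, uint64_t z, unsigned z_width) {
    return (y << z_width) | z;
}

/* Byte-swap of a 16-bit U: {U[7:0], U[15:8]}. */
uint64_t byte_swap16(uint64_t u) {
    return concat(bits(u, 7, 0), bits(u, 15, 8), 8);
}
```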
2.1 CHERI-256
The CHERI instruction-set architecture [1], [13] uses a large
256-bit capability format that encodes the base, length, and
address as independent 64-bit fields. The simplicity of 64-bit integers is attractive, as complex encoding adds latencies
to common operations. Intel MPX also uses a 256-bit format
with base, top, and address as full 64-bit values [7], enabling
simple low-latency bounds checking but wasting memory.
We have built upon the uncompressed CHERI implementation shown in Figure 1, which we refer to as CHERI-256.
CHERI-256 supports out-of-bounds pointers; that is, the
address may stray outside of bounds during address calculations with bounds enforced only on dereference. The
term out-of-bounds pointers carries this meaning throughout
this paper. CHERI-256 naturally supports such promiscuous
arithmetic as the upper and lower bounds are independent
of the address, each fully represented with 64-bit values that
are unperturbed by wild modifications to the address field.
Previous fat-pointer systems have found that representing out-of-bounds pointers was necessary for compatibility
with a broad code base [6], [15], and CHERI-256 relies on
this feature to compile and execute a considerable amount
Fig. 1: CHERI-256 capability format (256 bits). Fields: p'31 | s | otype'24 | l'64 | b'64 | a'64 (p: permissions; s: sealed; otype: object type; l: object length; b: base; a: pointer address).
of legacy C code [12], [13]. Despite these benefits, 256-bit
pointers exact a heavy toll on cache footprint and processor
data-path size. While we seek to encode fat-pointers more
efficiently, we maintain both reasonable latencies for pointer
arithmetic and out-of-bounds pointers.
2.2 M-Machine
The M-Machine [2] is a highly efficient capability-pointer
design developed in the early 1990s. The M-Machine was
not designed to support legacy software, but its capability format is elegantly simple and encodes base, top, and
pointer address within 64 bits.
Fig. 2: M-Machine capability format (64 bits). Fields: p'4 | E'6 | a'54 (p: permissions; E: exponent; a: address).
The M-Machine format, shown in Figure 2, encodes the base as the bits of a above E (a[53:E]) with the lower bits set to zero, the top as the base plus 2^E, and the current pointer address as simply a in its entirety. This capability-pointer compression introduces limitations on pointer arithmetic: address modifications must not change the decoded
bounds without invalidating the capability pointer. For the
M-Machine, pointer arithmetic may change only the bits
below E , but any modification of the bits above E changes
the decoded base, and must not produce a valid capability.
Thus, all valid capability pointers maintain their original
bounds, and out-of-bounds pointers cannot be represented.
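A minimal sketch of this decoding (our own illustration, with hypothetical helper names, not the M-Machine's implementation):

```c
#include <assert.h>
#include <stdint.h>

/* M-Machine-style bounds from an address a and exponent e:
 * base = a with the bits below e cleared; top = base + 2^e.
 * The segment is naturally 2^e-aligned and 2^e-sized. */
uint64_t mm_base(uint64_t a, unsigned e) {
    return a & ~((1ULL << e) - 1);
}

uint64_t mm_top(uint64_t a, unsigned e) {
    return mm_base(a, e) + (1ULL << e);
}
```

Any arithmetic that changes the bits of a above e moves the decoded base, which is why such modifications must invalidate the capability.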
The M-Machine supports segments that are naturally 2^E-aligned and 2^E-sized. This power-of-two alignment restriction prevents precise enforcement of irregular object sizes. While the M-Machine could mitigate coarse-grained memory safety issues by padding large objects with unallocated pages, this results in severe memory fragmentation, as demonstrated in Section 4.3. Any imprecise fat-pointer encoding would need allocator support to ensure memory safety; memory savings from pointer compression must be balanced against waste due to memory fragmentation.
Finally, the M-Machine supports an unusual 54-bit address space due to dedicating upper bits to encode the bounds. This non-standard address size can cause compatibility issues, and somewhat limits future address-space expansion. The C language allows integers to be stored in pointers (e.g., intptr_t); the CHERI C compiler enables this behavior by placing the integer in the address field.
2.3 Low-fat pointers
Similar to the M-Machine, the Low-fat pointer scheme ('Low-fat') compresses its capabilities into 64-bit pointers [3], with just 46 bits of usable address space (Figure 3). As in the M-Machine, memory is described in 2^E-sized blocks; however, regions may be described as a contiguous range of blocks. Thus, Low-fat supports a finer granularity for bounds than the M-Machine.
To encode a contiguous segment, Low-fat stores the 6-bit top (T) and base (B) blocks of the region, with the 6-bit block size, or exponent (E). T or B is simply inserted into the pointer address at E to produce the corresponding bound of the region, as illustrated in Figure 3. Low-fat can encode any span of blocks, up to a length of 2^6, regardless of alignment, by inferring any difference in the upper bits of the bounds. As a result, the only restriction on Low-fat regions is that the top and base must be aligned at 2^E.
Fig. 3: Low-fat capability format (our notation). Fields: E'6 | T'6 | B'6 | a'46 (E: exponent; T & B: top & base block number; a: address). The bounds decode as:
  t = {a[45:E+6] + ct, T'6, 0'E}
  b = {a[45:E+6] − cb, B'6, 0'E}
where Amid = a[E+5:E], ct = (Amid > T) ? 1 : 0, and cb = (Amid ≤ B) ? 1 : 0.
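The decode in Figure 3 can be sketched as follows (our own simplified illustration in C, mirroring the correction terms as given in the figure, not the published hardware implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Insert the 6-bit T block at position e, correcting the upper address
 * bits (+1) when the address block Amid lies above the top block. */
uint64_t lowfat_top(uint64_t a, unsigned e, uint64_t t_blk) {
    uint64_t amid  = (a >> e) & 0x3F;   /* a[E+5:E]  */
    uint64_t upper = a >> (e + 6);      /* a[45:E+6] */
    uint64_t ct    = (amid > t_blk) ? 1 : 0;
    return ((upper + ct) << (e + 6)) | (t_blk << e);
}

/* Same for the base, borrowing (-1) per the figure's cb term. */
uint64_t lowfat_base(uint64_t a, unsigned e, uint64_t b_blk) {
    uint64_t amid  = (a >> e) & 0x3F;
    uint64_t upper = a >> (e + 6);
    uint64_t cb    = (amid <= b_blk) ? 1 : 0;
    return ((upper - cb) << (e + 6)) | (b_blk << e);
}
```

For example, with e = 4, block numbers B = 2 and T = 5, and a = 0x48, the decoded region is [0x20, 0x50); a top block numerically below Amid decodes into the next alignment region.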
3 SHORTCOMINGS OF THE STATE-OF-THE-ART
The Low-fat work, an improvement over the similar Aries
[14] scheme, presents an attractive middle ground between
the restrictive M-Machine encoding and the verbose CHERI-256 approach. The decoding simplicity of Low-fat is promising; however, it still does not fit naturally into conventional
pipelines, and is not compatible with common language
semantics. CHERI Concentrate clears these hurdles, while
also improving encoding efficiency.
3.1 Encoding Inefficiencies
Low-fat misses two opportunities for a more efficient encoding. First, if the exponent E is always chosen to be as small
as possible, E directly implies the most significant bit of the
size. This principle is used to save a bit in IEEE floating-point formats; with careful thought, we can save a bit in
region encodings as well (see Section 4.2).
Second, Low-fat devotes equal encoding space for all
values of E . That is, small regions and big regions have
the same number of bits in T and B , despite the fact that
small objects are far more common than large allocations.
3.2 Pipeline Problems
The Low-fat encoding requires all valid capability pointers
to be in-bounds; therefore, pointer arithmetic must never
produce an out-of-bounds pointer. This implies a bounds-check on all pointer arithmetic. A simple bounds-check
requires decoding the bounds and comparing the bounds
against the arithmetic result. Performing a pointer add,
decoding the bounds, and the final comparison must all
be completed after forwarded operands are available. Crucially, the comparison would need to be done after arithmetic is complete, extending the critical path of this logic.
The published implementation of Low-fat solves these
issues by decoding capability pointers in the register file
to eliminate additional delay for pointer arithmetic. The
register file does not directly hold the decoded bounds, but
holds the distance from the current address to the base and
to the top of the region. These distances can be compared to
the offset operand directly, in parallel with pointer addition,
which results in no additional delay for the bounds check.
This optimization has three costs:
• Delaying pointer loads due to decoding;
• Widening the register file to 164 bits;
• Updating the offsets on pointer arithmetic.
More than doubling the width of the register file for no
architecturally visible benefit is undesirable; also, unpacking
pointers on loads from memory is detrimental to performance. Load-to-use delay is a key performance parameter,
and rarely has slack in a balanced design. Low-fat attempts
to mitigate this issue by making the bounds available later
in the pipeline than the address, although this introduces
undue complexity; we demonstrate that there is a more
efficient solution.
3.3 Out-of-Bounds Pointers being Unrepresentable
The authors of Low-fat note that their system can accommodate C-pointer calculations going out of bounds by padding
allocations with unusable space – i.e., by simply widening
the bounds beyond what was requested, and tagging the
padded space as unusable with a separate fine-grained
memory-type mechanism. Ideally, we would neither sacrifice memory nor require another complex mechanism to
accommodate temporarily out-of-bounds pointers. While
addresses should be allowed to wander temporarily beyond
the bounds during pointer arithmetic, strict segment bounds
should be enforced on dereference.
Both M-Machine and Low-fat invalidate pointers when
arithmetic temporarily pushes them out of bounds, with the
result that all valid pointers are in-bounds and no bounds
check is required on dereference. While conceptually attractive, this optimization is neither realistic nor necessary.
Dereference without a bounds check means that the instruction set (ISA) cannot support indexed addressing; while
avoided in the bespoke Low-fat ISA, indexed addressing is
required by every widely used ISA. Memory access already
supports exceptions due to address translation, and any
bounds check on the virtual address can be performed in
parallel to translation, making memory access a particularly
convenient time to perform a bounds check. In contrast,
pointer-arithmetic operations are far more timing sensitive.
3.4 No Evaluation of Compiled Programs
4 CC PRINCIPLES I — IMPROVING ON LOW-FAT
CHERI Concentrate (CC) includes innovations in encoding
efficiency, execution efficiency, and semantic flexibility. We
describe an 18-bit encoding for bounds for direct comparison with Low-fat, and use this bounds field in our CHERI-64
encoding in Figure 9. However, the principles are independent of the field size; our 128-bit implementation, which supports a full 64-bit address space (Figure 13), uses a 41-bit field for high-precision bounds. This section introduces
innovations that directly improve the Low-fat model before
introducing support for CHERI semantics in Section 5. A
complete specification for decoding 64-bit CC (CHERI-64) is
given in Figure 9.
4.1 Implied Most-Significant Bit of Top
If we consistently choose the smallest possible E to encode
any set of bounds, the most significant bit of the length will
be implied directly by E . Thus, we designed our instruction
that encodes bounds (CSetBounds) to deterministically
choose the smallest possible E when assigning an encoding
from a full-precision base and length. All capabilities in the
system are in this normal form.
For capabilities in the normal form, we can derive the
top bit of T from the remainder of the encoding. We may
conceptually imagine a 6-bit Length field, L = (t − b)[E+5:E], where T = B + L. As the top bit of L is known to be 1, to
calculate the top bit of T , we need only the top bit of B and
the carry-out from the lower bits of B + L. This carry-out
is 1 if T [4 : 0] < B[4 : 0] – that is, when adding the lower
bits of L to B has produced a value smaller than B . For the
format in Figure 4, the formula to reconstitute the MSB of
Top is
Lcarry_out = 1 if T[4:0] < B[4:0], else 0
L[5] = 1 (implied)
T[5] = B[5] + Lcarry_out + L[5]
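This reconstruction can be sketched in C (our own illustration for the 6-bit example above; the actual formats use different field widths):

```c
#include <assert.h>

/* Reconstruct the full 6-bit T from B and the stored low bits T[4:0],
 * assuming the capability is in normal form so that L[5] = 1 is implied. */
unsigned reconstruct_t(unsigned t_low5, unsigned b) {
    /* Carry out of B[4:0] + L[4:0]: the sum wrapped iff T[4:0] < B[4:0]. */
    unsigned lcarry = (t_low5 < (b & 0x1F)) ? 1 : 0;
    unsigned lmsb   = 1;  /* L[5], implied by the minimal-E normal form */
    unsigned t5     = (((b >> 5) & 1) + lcarry + lmsb) & 1;
    return (t5 << 5) | t_low5;
}
```

For instance, with b = 4 and length L = 35 (so T = 39, of which only T[4:0] = 7 is stored), the top bit is recovered from the implied L[5] alone.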
Thus, as all capabilities in the system are in normal form, one bit can be saved in the encoding. The improved Low-fat format in Figure 4 uses this bit to indicate an internal exponent, as described in Section 4.2.
4.2 Internal Exponent Encoding
Fig. 4: Bounds with Embedded Exponent and Implied T8. Fields: T[7:3] | IE | T[2:0] or E[2:0] | B[8:3] | B[2:0] or E[5:3].

As exponents of zero are most common, we encode a zero exponent with one bit (internal exponent, IE), allowing 8-bit precision for small objects, improving on Low-fat for the common case. For larger objects, the lower bits of the
bounds are used for a 6-bit exponent field.
The most-significant bit of T is implied to be 1 only when E is nonzero. For all objects with sizes between 2^9 and 2^64, IE is set, T[8] is implied, and T[2:0] = B[2:0] = 0, leaving 6 bits of precision with a 5-bit T and 6-bit B field. In
summary, CC can encode 8 bits of precision for small objects
and 6 bits of precision for all others in the same 18 bits used
by Low-fat, which offers a uniform 6 bits of precision.
4.3 Evaluation of Representability
In order to evaluate the usable precision of CC against
the Low-fat encoding, we used the dtrace framework on
Mac OS X 10.9 to collect traces from every allocator
found in six real-world applications: Chrome 38.0.2125,
Firefox 31, Apache 2.4, iTunes 12, MPlayer build #127, and
mySQL 5. Allocators included many forms of malloc(), several application-specific allocators, driver internal allocators,
and many other variants. We eliminated duplicate entries
in the trace due to allocators passing the same requested
allocation down through multiple lower-level allocators.
Figure 5 shows the precision required for several sizes that
would lose precision in Low-fat. While Figure 5 broadly justifies a 6-bit precision, nearly 2% of all allocations have a 7-bit length, require 7 bits of precision, and are representable in CC but not in Low-fat. There are also notable
collections of allocations that require up to 11 bits of precision, indicating the utility of the greater precision available
in a 128-bit format.
Figure 6 gives representability results for specific applications. CC improves the precision of capabilities over Low-fat without increasing the number of bits required to encode
bounds. As is clear from the encoding, CHERI CC is a strict
improvement, i.e., under no circumstance does CHERI CC
have worse precision than Low-fat.
Finally, the Low-fat pointer work was implemented within
a proprietary instruction set without a TLB or even exception support, and thus was unable to validate support
for compiled languages or an operating system. Indeed,
no previous systems with capability-pointer compression
have evaluated compiled programs. This leaves obvious
questions, such as the frequency of various capability-pointer operations, as well as the appropriate granularity for
bounds. While Low-fat is promising, its utility, as written,
for general-purpose computing has yet to be demonstrated.
Fig. 5: Percentage of total allocations vs. precision required
for a set of requested lengths for applications in Figure 6.
[Bar chart comparing the percentage of imprecise capabilities under Low-fat and CHERI CC for Chrome, Firefox, Apache, iTunes, MySQL, and MPlayer.]
Fig. 6: The percentage of allocations that cannot be precisely
represented in a capability. Lower is better.
4.4 Heap Allocators and Imprecision
The prevalence of small allocations is reflected in the design
of the FreeBSD default memory allocator, jemalloc(),
which assumes that applications will primarily allocate objects under 512 bytes [16]. When object bounds cannot be
precisely represented by CC, the allocator may have to pad
the allocation with unused memory to maintain memory
safety. In practice, we have never observed jemalloc()
requiring more than 6 bits to represent the memory reserved
for an allocation, as the allocator itself requires alignment
to ease memory management. Nevertheless, additional precision can be used to enforce precise object bounds, and
enable precise enforcement of subobjects where possible.
5 CC PRINCIPLES II — CHERI SEMANTICS
Up to this point we have presented encoding improvements
that directly translate to the Low-fat model. That is, we improved the encoding of a memory region in a 46-bit virtual
address space where the address is between bounds. From
this point we introduce some CHERI semantics necessary to
support a legacy software base.
5.1 Permissions Bits

Unlike Low-fat, CHERI supports permissions bits on capabilities for read, write, and execute, and to limit capability propagation, with a few permission bits reserved for software interpretation. CHERI Concentrate includes 12 permission bits in CHERI-64 and 15 bits in CHERI-128 to support the full CHERI permissions model.

5.2 Full Address Space

CHERI semantics require a full 32-bit address space for 32-bit architectures, and a 64-bit address space for 64-bit architectures. CHERI seeks to replace pointers in traditional computer systems with capability fat pointers and to interact naturally with traditional operating systems and software. A non-standard address space, such as Low-fat's 46-bit proposal, would require a deeper rewrite of the modern software stack. CHERI Concentrate therefore chooses to support a 32-bit address space with our 18-bit bounds field in a 64-bit capability format, which we call CHERI-64. The smaller virtual address space reduces our E field by one bit, yielding the final 18-bit CHERI Concentrate format in Figure 7, which provides the same precision as the format in Figure 4, but also guarantees out-of-bounds representable space at least as large as the object itself. We support a full 64-bit address space with CHERI-128, described in Section 6.5.

5.3 Representable Buffer
CHERI-256 supports out-of-bounds pointers, encoding full,
independent 64-bit words for the top, base, and address, to allow arbitrary pointer arithmetic without losing
bounds [12]. This feature enables CHERI to substitute capabilities for pointers in a wide array of C programs without violating programmer expectations, providing memory
safety without requiring unnecessary source modifications.
To confirm that out-of-bounds pointers are required to
run a significant software stack, we implemented our capability format with the Low-fat strict bounds semantics.
That is, we invalidated capabilities that went beyond object
bounds so that they could no longer be dereferenced. Several binaries failed to run with these Low-fat semantics. Zlib
provided the following critical example in inftrees.c:
static const unsigned short lbase[31] = ...;
...
base = lbase;
base -= 257;
...
val = base[work[sym]];
This function fails without temporarily out-of-bounds support because subtraction from base moves it well below the
base of the object, though it actually should continue safely
as base is later dereferenced legally using an offset larger
than 257. Support for temporarily out-of-bounds pointers allows all binaries and libraries depending on zlib (including
gzip, OpenSSH, libpng, etc.) to function as intended with
compressed capabilities.
While out-of-bounds pointer support is desirable, our
bounds are encoded with respect to the current pointer
address, and this encoding cannot support unlimited manipulation without losing the ability to decode the original
bounds. Nevertheless, we enable the vast majority of common behaviors for CC by extending the B field by 1 bit
Fig. 7: 18-bit CHERI Concentrate encoding with a representable buffer, supporting a 32-bit address space. Fields: IE | L[7] or E[4] | T[6:2] | T[1:0] or E[1:0] | B[8:2] | B[1:0] or E[3:2].
Fig. 8: Memory regions implied by a CC encoding. Unrepresentable regions: the bounds cannot be represented if the address falls in these regions. Representable space (spaceR): the address may take any value in this region. Dereferenceable region: base ≤ address < top; memory access is permitted in this region.
to provide a representable space that is at least twice the
size of the object itself. This bit allows us to locate the base
and top addresses with respect to the pointer address as
the pointer moves beyond the object bounds. As a result,
there are three address categories for any capability. Those
between the bounds are in the dereferenceable region. These
addresses are a subset of those within the larger representable
space, spaceR . Addresses outside of spaceR render the region unrepresentable, as depicted in Figures 8 and 10. With
an extra bit of B to extend the representable space, we may
now say that we infer the two most-significant bits of T :
Lcarry_out = 1 if T[6:2] < B[6:2], else 0
Lmsb = 1 if IE = 1, else L[7]
T[8:7] = B[8:7] + Lcarry_out + Lmsb

6 CHERI CONCENTRATE REGION ARITHMETIC
We have carefully balanced the binary arithmetic needed
for the CHERI Concentrate encoding to allow practical
use in a traditional RISC pipeline. All operations that are
traditionally single-cycle for integer pointers are also single-cycle in CHERI Concentrate despite enforcing sophisticated
guarantees. These arithmetic operations are a crucial contribution of this work.
6.1 Encoding the bounds
We have added a CSetBounds instruction to the CHERI
instruction set to allow selecting the appropriate precision
for a capability.1 CSetBounds takes the full pointer address
a as the desired base, and takes a length operand from a
general-purpose register, thus providing full visibility of the
precise base and top to a single instruction – which can select
the new precision without violating a tenet of MIPS (our
base ISA) by requiring a third operand.
Deriving E
The value of E is a function of the requested length, l:
  index_of_msb(x) = size_of(x) − count_leading_zeros(x)
  E = index_of_msb(l[31:8])
This operation chooses a value for E that ensures that
the most significant bit of l will be implied correctly. If l is larger than 2^8, the most significant bit of l will always align with T[7], and indeed T[7] can be implied by E. If l is smaller than 2^8, E is 0, giving more bits to T and B and so
enabling proportionally more out-of-bounds pointers than
otherwise allowed for small objects.
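A sketch of this derivation in C (our own illustration, assuming the GCC/Clang `__builtin_clz` builtin for count-leading-zeros; a hardware implementation would use a priority encoder):

```c
#include <assert.h>
#include <stdint.h>

/* E = index_of_msb(l[31:8]): the exponent is the position of the most
 * significant bit of the length above bit 8, or 0 for l < 2^8. */
unsigned derive_e(uint32_t l) {
    uint32_t x = l >> 8;  /* l[31:8] */
    if (x == 0)
        return 0;         /* small object: full 8-bit precision, E = 0 */
    /* index_of_msb(x) = size_of(x) - count_leading_zeros(x) */
    return 32 - (unsigned)__builtin_clz(x);
}
```

For example, a request of 255 bytes yields E = 0, while 2048 bytes yields E = 4, so that the length's MSB lands on the implied bit.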
We may respond to a request for unrepresentable precision by extending the bounds slightly to the next representable bound, or by throwing an exception. These
two behaviors are implemented in the CSetBounds and
CSetBoundsExact variants respectively.
1. Prior to this instruction, the bounds of a capability were set
sequentially using CIncBase and CSetLength. CIncBase had to
assign a compressed encoding to the base, possibly losing precision
before the desired length was known, and CSetLength had no way to
restore the lost precision if the final length would have allowed it.
Extracting T and B
The CSetBounds instruction derives the values of B and
T by simply extracting bits at E from b and t respectively
(with appropriate rounding):
E = 0:  T = t[6:0];  B = b[8:0]
E > 0:  T[6:2] = t[E+6:E+2] + round, where round = 1 if t[E+1:0] is nonzero
        B[8:2] = b[E+8:E+2]
        T[1:0] and B[1:0] are implied 0s
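These extraction rules can be sketched as follows (our own illustration with hypothetical helper names; the returned values are the raw stored fields, before reconstruction of the inferred upper bits of T):

```c
#include <assert.h>
#include <stdint.h>

/* Extract the stored T field from a full-precision top t, rounding up
 * when nonzero bits are discarded below the representable granularity. */
unsigned extract_t(uint32_t t, unsigned e) {
    if (e == 0)
        return t & 0x7F;  /* T = t[6:0] */
    unsigned round = (t & ((1u << (e + 2)) - 1)) ? 1 : 0;
    /* T[6:2] plus rounding; T[1:0] are implied 0 (they hold E bits).
     * Overflow into the inferred bits T[8:7] is ignored in this sketch. */
    return ((((t >> (e + 2)) & 0x1F) + round) & 0x1F) << 2;
}

/* Extract the stored B field from a full-precision base b (truncating,
 * which rounds the base down to representable alignment). */
unsigned extract_b(uint32_t b, unsigned e) {
    if (e == 0)
        return b & 0x1FF;                 /* B = b[8:0] */
    return ((b >> (e + 2)) & 0x7F) << 2;  /* B[8:2]; B[1:0] implied 0 */
}
```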
Rounding Up Length
The CSetBounds instruction may round up the top or
round down the base to the nearest representable alignment
boundary, effectively increasing the length and potentially
increasing the MSB of length by one, thus requiring that E
increase to ensure that the MSB of the new L can be correctly
implied. Rather than detect whether overflow will certainly
occur (which did not pass timing in our 100MHz CHERI-128
Fig. 9: CC capability format. Fields: p'12 | IE'1 | L[7] | T[6:2] | TE'2 | B[8:2] | BE'2 | a'32 (p: permissions; IE: internal exponent; a: address).

If IE = 0:
  E = 0;  T[1:0] = TE;  B[1:0] = BE
  Lcarry_out = 1 if T[6:0] < B[6:0], else 0
  Lmsb = L[7]
  T[8:7] = B[8:7] + Lcarry_out + Lmsb

If IE = 1:
  E = {L[7], TE, BE};  T[1:0] = 0;  B[1:0] = 0
  Lcarry_out = 1 if T[6:2] < B[6:2], else 0
  Lmsb = 1
  T[8:7] = B[8:7] + Lcarry_out + Lmsb

Bounds: t = {a[31:E+9] + ct, T[8:0], 0'E};  b = {a[31:E+9] + cb, B[8:0], 0'E}

To calculate ct and cb, let Amid = a[E+8:E] and R = {B[8:6] − 1, zeros'6}:
  ct: (Amid < R, T < R) = (false, false) → 0; (false, true) → +1; (true, false) → −1; (true, true) → 0
  cb: (Amid < R, B < R) = (false, false) → 0; (false, true) → +1; (true, false) → −1; (true, true) → 0
FPGA prototype), we choose to detect whether L[7 : 3] is all
1s – i.e., the largest length that would use this exponent
– and force T to round up and increase E by one. This
simplifies the implementation at the expense of precision
for 1/16th of the requestable length values.
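A sketch of this conservative check (our own illustration; `bump_e` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Conservative exponent bump: if L[7:3] of the requested length is all
 * ones, rounding T up could overflow the length's MSB, so increase E by
 * one rather than detecting actual overflow precisely (which did not
 * meet timing in the authors' prototype). */
unsigned bump_e(uint32_t length, unsigned e) {
    unsigned l_top = (length >> (e + 3)) & 0x1F;  /* L[7:3] */
    return (l_top == 0x1F) ? e + 1 : e;
}
```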
6.2 Decoding the bounds
Unlike Low-fat, CHERI Concentrate can decode the full t
and b bounds from the B and T fields even when the pointer
address a is not between the bounds. We now detail how
each bit of the bounds is produced:
Lower bits: The bits below E in t and b are zero, that is,
both bounds are aligned at E .
Middle bits: The middle bits of the bounds, t[E + 8 : E]
and b[E + 8 : E], are simply T and B respectively, with the
top two bits of T reconstituted as in Section 5.3. In addition,
if IE is set, indicating that E is stored in the lower bits of T
and B , the lower two bits of T and B are also zero.
Upper bits: The bits above E +8, for example t[31 : E +9],
are either identical to a[31 : E + 9], or need a correction
of ±1, depending on whether a is in the same alignment
boundary as t, as described below and in Figure 9.
Deriving the representable limit, R
CC allows pointer addresses within a power-of-two-sized
space, spaceR, without losing the ability to decode the original
bounds. The size of spaceR is s = 2^(E+9), fully utilizing
the encoding space of B. Figure 10 shows an example of
object bounds within the larger spaceR. Due to the extra
bit in B, spaceR is twice the maximum object size (2^(E+8)),
ensuring that the out-of-bounds representable buffers are,
in total, at least as large as the object itself.
As portrayed in Figure 10, spaceR is not usually naturally
aligned, but straddles an alignment boundary. Nevertheless,
as spaceR is power-of-two-sized, a bit slice from its base
address rb [E + 9 : E] will yield the same value as a bit slice
from the first address above the top, rt [E + 9 : E]. We call
this value the representable limit, R. Locating b, t, and a either
above or below the alignment boundary in spaceR requires
comparison with this value R. We may choose R to be any
out-of-bounds value in spaceR , but to reduce comparison
logic we have chosen:
R = {B[8 : 6] − 1, zeros’6}
Fig. 10: CHERI Concentrate bounds in an address space.
Addresses increase upwards. To the left are example values
for a 0x600-byte object based at 0x1E00: b = 0x1E00,
t = 0x2400, rb = 0x1A00, rt = 0x2A00, with s = 2^(E+9) = 0x1000.
The dereferenceable region [b, t) sits inside the representable
space spaceR, flanked by the buffers spaceL and spaceU, and
spaceR straddles the alignment boundary at 0x2000 (a multiple
of s).
This choice ensures that R is at least 1/8 and less than 1/4
of the representable space below b, leaving at least as much
representable buffer above t as below b.
For every valid capability, the address a as well as the
bounds b and t lie within spaceR . However the upper bits of
any of these addresses may differ by at most 1 by virtue
of lying in the upper or lower segments of spaceR . For
example, if a is in the upper segment of spaceR , the upper
bits of a bound will be one less than the upper bits of a if the
bound lies in the lower segment. We can determine whether
a falls into upper or lower segment of spaceR by inspecting:
Amid = a[E + 8 : E]
If Amid is less than R, then a must lie in the upper segment
of spaceR , and otherwise in the lower segment. The same
comparison for T and B locates each bound uniquely in the
upper or the lower segment. These locations directly imply
the correction bits ct and cb (computed as shown in Figure 9)
that are needed to compute the upper bits of t and b from
the upper bits of a. As we have chosen to align R such that
R[5 : 0] are zero, only three-bit arithmetic is required for this
comparison, specifically:
a in upper segment = Amid[8:6] < R[8:6]
While Low-fat requires a 6-bit comparison to establish the
relationship between a, t, and b, growing with the precision
of the bounds fields, CC requires a fixed 3-bit comparison
regardless of field size, particularly benefiting CHERI-128,
which uses 21-bit T and B fields. CC enables capabilities
to be stored in the register file in compressed format, often
requiring decoding before use. As a result, this comparison
lies on several critical paths in our processor prototype.
The bounds t and b are computed relative to Aupper :
t = {(Aupper + ct ), T, zeros’E}
b = {(Aupper + cb ), B, zeros’E}
where Aupper = a[31 : E + 9]
The bounds check during memory access is then:
b ≤ computed address < t
In summary, CC generalizes Low-fat arithmetic to allow
full use of the power-of-two-sized encoding space for representing addresses outside of the bounds, while improving
speed of decoding.
Encoding full address space
The largest encodable 32-bit value of t is 0xFF800000,
making a portion of the address space inaccessible to the
largest capability. We can resolve this by allowing t to be
a 33-bit value, but this bit-size mismatch introduces some
additional complication when decoding t. The following
condition is required to correct t for capabilities whose
representable region wraps the edge of the address space:
if (E < 24) and (t[32:31] − b[31] > 1) then t[32] = ¬t[32]
That is, if the length of the capability is larger than E allows,
invert the most significant bit of t.
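The decode just described, including the 33-bit top correction, can be captured in a short behavioural model. The following C sketch follows Figure 9 for the 32-bit format; the struct, field, and function names are our own, and it is an illustration of the algorithm rather than the authors' hardware implementation:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical behavioural model of the 32-bit CC decode (Fig. 9). */
typedef struct {
    bool     ie;    /* IE: internal-exponent flag                */
    uint32_t l7;    /* L[7] (1 bit)                              */
    uint32_t t_mid; /* T[6:2] (5 bits)                           */
    uint32_t te;    /* TE (2 bits): T[1:0] or E[3:2]             */
    uint32_t b_mid; /* B[8:2] (7 bits)                           */
    uint32_t be;    /* BE (2 bits): B[1:0] or E[1:0]             */
    uint32_t a;     /* 32-bit address                            */
} cc_cap;

/* Decode the bounds: t is a 33-bit top, b a 32-bit base. */
static void cc_decode(const cc_cap *c, uint64_t *t_out, uint64_t *b_out)
{
    uint32_t E, T = c->t_mid << 2, B = c->b_mid << 2, Lcarry, Lmsb;
    if (!c->ie) {                             /* byte-granular bounds */
        E = 0;
        T |= c->te;                           /* T[1:0] = TE          */
        B |= c->be;                           /* B[1:0] = BE          */
        Lcarry = (T & 0x7F) < (B & 0x7F);     /* T[6:0] < B[6:0]      */
        Lmsb = c->l7;
    } else {                                  /* E = {L[7], TE, BE}   */
        E = (c->l7 << 4) | (c->te << 2) | c->be;
        Lcarry = (T & 0x7C) < (B & 0x7C);     /* T[6:2] < B[6:2]      */
        Lmsb = 1;
    }
    T |= (((B >> 7) + Lcarry + Lmsb) & 0x3) << 7;  /* imply T[8:7]    */

    /* Locate a, T and B relative to the representable limit R.       */
    uint32_t Amid = (c->a >> E) & 0x1FF;
    uint32_t R = (((B >> 6) - 1) & 0x7) << 6;      /* {B[8:6]-1, 0'6} */
    int ct = (T < R) - (Amid < R);            /* per the Fig. 9 table */
    int cb = (B < R) - (Amid < R);

    uint64_t Aupper = (uint64_t)c->a >> (E + 9);
    uint64_t t = ((((Aupper + ct) << 9) | T) << E) & 0x1FFFFFFFFULL;
    uint64_t b = ((((Aupper + cb) << 9) | B) << E) & 0xFFFFFFFFULL;

    /* Full-address-space fix-up: if t sits impossibly far above b,
     * the representable space wrapped the edge of the address space. */
    if (E < 24 && (int)((t >> 31) & 0x3) - (int)((b >> 31) & 0x1) > 1)
        t ^= 1ULL << 32;
    *t_out = t;
    *b_out = b;
}
```

As a sanity check, a capability holding the 504-byte object of Section 6.4 (IE = 1, E = 2, stored T and B fields zero) with its address moved 16 bytes below the base (a = 0xFFFFFFF0) decodes to b = 0 and t = 512, exercising the wrap-around correction.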
6.3 Fast representable limit checking
Pointer arithmetic is typically performed using addition,
and does not raise an exception. If we wish to preserve
these semantics for capabilities, capability pointer addition
must fit comfortably within the delay of simple arithmetic in
the pipeline, and should not introduce the possibility of an
exception. For CC, as with Low-fat, typical pointer addition
requires adding only an offset to the pointer address, leaving the rest of the capability fields unchanged. However, it
is possible that the address could pass either the upper or
the lower limits of the representable space, beyond which
the original bounds can no longer be reconstituted. In this
case, CC clears the tag of the resulting capability to maintain
memory safety, preventing an illegal reference to memory
from being forged. This check against the representable
limit, R, has been designed to be much faster than a precise
bounds check, thereby eliminating the costly measures the
Low-fat design required to achieve reasonable performance.
To ensure that the critical path is not unduly lengthened,
CC verifies that an increment i will not compromise the
encoding by inspecting only i and the original address field.
We first ascertain if i is inRange, and then if it is inLimit.
The inRange test determines whether the magnitude of i is
greater than that of the size of the representable space, s,
which would certainly take the address out of representable
limits:
inRange = −s < i < s
The inLimit test assumes the success of the inRange test, and
determines whether the update to Amid could take it beyond
the representable limit, outside the representable space:
inLimit = Imid < (R − Amid − 1),               if i > 0
          Imid > (R − Amid) and R ≠ Amid,      if i < 0
The inRange test reduces to a test that all the bits of Itop
(i[63:E+9]) are the same. The inLimit test needs only 9-bit
fields (Imid = i[E+8:E]) and the sign of i.
The Imid and Amid used in the inLimit test do not include
the lower bits of i and a, potentially ignoring a carry in
from the lower bits, presenting an imprecision hazard. We
solve this by conservatively subtracting one from the representable limit when we are incrementing upwards, and
by not allowing any subtraction when Amid is equal to R.
One final test is required to ensure that if E > 23, any
increment is representable. (If E = 23, the representable
space, s, encompasses the entire address space.) This handles
a number of corner cases related to T , B , and Amid describing bits beyond the top of a virtual address. Our final fast
representability check composes these three tests:
representable = (inRange and inLimit) or (E > 23)
To summarize, the representability check depends only
on four 9-bit fields, T , B , Amid , and Imid , and the sign of
i. Only Imid must be extracted during execute, as Amid is
cached in our register file. This operation is simpler than
reconstructing even one full bound, as demonstrated in
Section 8. This fast representability check allows us to perform pointer arithmetic on compressed capabilities directly,
avoiding decompressed capabilities in the register file, which
would introduce both a dramatically enlarged register file and
substantial load-to-use delay.
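As a concrete illustration, the check reduces to a few lines of 9-bit modular arithmetic. This is our own C sketch of the logic described above, for the 32-bit format (the function and parameter names are ours; i = 0 is grouped with the positive case):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the Section 6.3 fast representability check (32-bit format).
 * i: increment; a: current address; E: exponent; B: 9-bit bottom field,
 * from which the representable limit R is derived as in Fig. 9.        */
static bool cc_is_representable(int64_t i, uint32_t a, uint32_t E, uint32_t B)
{
    if (E > 23)                          /* s covers the address space   */
        return true;
    int64_t s = (int64_t)1 << (E + 9);   /* size of representable space  */
    bool inRange = (-s < i) && (i < s);

    uint32_t R    = (((B >> 6) - 1) & 0x7) << 6;   /* {B[8:6]-1, 0'6}    */
    uint32_t Amid = (a >> E) & 0x1FF;
    uint32_t Imid = ((uint32_t)(uint64_t)i >> E) & 0x1FF;
    uint32_t diff = (R - Amid) & 0x1FF;            /* 9-bit modular      */
    bool inLimit = (i >= 0)
        ? (Imid < ((diff - 1) & 0x1FF))  /* conservative -1 guards a     */
        : (Imid > diff && R != Amid);    /*   carry from the low bits    */
    return inRange && inLimit;
}
```

For the 504-byte capability of Section 6.4 (E = 2, B = 0, hence R = 0b111000000), an increment of −16 from the base is accepted, while an increment that would land on the representable limit is conservatively rejected, clearing the tag.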
6.4 Exemplary Encodings
We walk through a pair of exemplary capability encodings
to illustrate the above encoding details. In Figures 11 and
12, the dark fields were requested by CSetBounds, the orange and green fields are stored in the capability encoding,
and the white fields are implied by the encoded fields. Each
example shows only the bottom 12 bits of each field.
Unaligned 128-byte example
A programmer has instantiated a 129-character string on the
stack, and wants to trim the first character and pass the
subset to a function. The capability instructions to create the
subset are as follows:
CIncOffset $trimmed, $str, 1
CSetBounds $trimmed, $trimmed, 128
As this string began on the stack, which is at least word-aligned, the trimmed capability is now unaligned. The requested capability is perfectly representable using CC, as
are all objects less than 255 bytes; the resulting encoding
is shown in Figure 11. We note that this capability would
not be representable in Low-Fat, which can represent with
byte-precision only up to 63-byte objects.
Fig. 11: Example Capability with Exponent = 0. (The figure
tabulates the bottom 12 bits of the Upper Representable
Limit, Encoded Top, Requested Top, Address (a), Requested
Bottom, Encoded Bottom, and Lower Representable Limit,
alongside the stored IE, L[7], T[6:2], B[8:2], T[1:0], and
B[1:0] fields; here IE = 0 and E = 0.)
Spanning alignment boundary: We observe that the
upper bits of the Top are entirely different from the upper
bits of Bottom in Figure 11, though they numerically differ
by only 1. During decode, we can ascertain that the upper
bits of Top must be one larger than the upper bits of Bottom
using the following steps:

T[8:7] = B[8:7] + L[7] + Lcarry_out          (1)
       = 11 + 01 + 00 = 00                   (2)

This produces a T[8:0] of 000000001. We may now infer
any difference in the bits above T[8] in the pointer using the
representable limit, R:

Amid = a[8:0] = 111110000                              (3)
Ru = R[8:6] = B[8:6] − 001 = 101                       (4)
ct = 0, if (Ru < T[8:6]) = (Ru < Amid[8:6]);
     1, otherwise                                      (5)
T[11:9] = Aupper[2:0] + ct = 011 + 001 = 100           (6)
504-byte example
Let us now consider a larger object that cannot be represented
precisely with CC, and that has an out-of-bounds pointer:

char str[504];
skip16(str - 16);

The encoding of the new object is shown in Figure 12, and
the assembly that will generate the capability passed to
skip16 is:

CSetBounds $str, $sp, 504
CIncOffset $str, $str, -16

Fig. 12: Example Capability with Encoded Exponent. (The
figure tabulates the bottom 12 bits of the Upper Representable
Limit, Encoded Top, Requested Top, Address (a), Requested
Base, Encoded Bottom, and Lower Representable Limit,
alongside the stored IE, L[7], T, B, and E fields; here IE = 1
and E = 2.)

Fig. 13: CHERI-128 capability format. (From bit 63 down to
bit 0 of the metadata word: p'15 (p: permissions), IE'1,
s/L[19] (s: sealed), T[18:3], TE'3, B[20:3], BE'3; a second
64-bit word holds the pointer address, a'64.)
Losing precision: The size of this object is 504 bytes,
which is larger than the maximum size of 255 that can
be precisely represented by CC. We select E by inspecting
the length that has been requested, which is 111111000 in
binary.
Et = index of msb(l[11:8]) = 1                  (7)
E  = Et + 1, if L[7:4] = 1111;  Et, otherwise   (8)
E  = 2                                          (9)
In this case, due to the length being the maximum possible
length we could represent with this exponent, we increase
E by one to account for the possibility that the length may
round up. Since E is non-zero, we must encode the exponent
in the capability (IE = 1), and T [1 : 0] and B[1 : 0] are no
longer available for precision. As a result, the bottom four
bits are rounded to the appropriate alignment boundary
(away from the object) in the Encoded Top and Bottom in
Figure 12, and the encoded length is 512 bytes.
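The exponent selection sketched in these equations can be written compactly. The following C sketch is our own illustration, not the authors' implementation: it normalizes the requested length into an 8-bit mantissa and then applies the round-up rule, using the field widths of this section's small worked format rather than those of the shipped CHERI-128 encoding:

```c
#include <stdint.h>

/* Hypothetical exponent selection, following the 504-byte worked
 * example: normalize the length to 8 bits, then bump E when the
 * top nibble is saturated and rounding T up could overflow L.    */
static unsigned cc_choose_exponent(uint32_t l)
{
    unsigned e = 0;
    while ((l >> e) >= 256)              /* smallest E with l>>E in 8 bits */
        e++;
    if (e > 0 && ((l >> e) >> 4) == 0xF) /* L[7:4] = 1111: the maximum     */
        e++;                             /*   length for this exponent     */
    return e;
}
```

For l = 504 the loop yields Et = 1 with L = 252 = 0b11111100, whose top nibble is saturated, so E becomes 2, matching equations (7)–(9); lengths below 256 keep E = 0 and remain byte-precise.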
Representability Check: After the bounds of the
capability are set, we move the pointer 16 bytes below
the base. We assert that the result of the add of -16
(111111110000) will be representable with the following
steps:
inRange = −s < i < s = −2048 < −16 < 2048                 (10)
R − Amid = 0010000 − 0100000 = 1010000                    (11)
inLimit = (Imid > (R − Amid)) and (R ≠ Amid), as i < 0    (12)
        = (1111111 > 1010000) and (0010000 ≠ 0100000)     (13)
rep. = inRange and inLimit = True                         (14)
As the subtraction of 16 will produce a result within representable bounds, we may simply perform the subtraction
on the Address field of the capability, producing the result
in Figure 12.
6.5 CHERI-128
While we have implemented the above format with 18 bits
for bounds for a 32-bit address space in CHERI-64, our
CHERI Concentrate format for a 64-bit address space uses
41 bits for bounds. This CHERI-128 format uses 21 bits for
B and 19 bits for T, and is shown in Figure 13. Our
CHERI-64 and CHERI-128 encodings reserve a few bits for
future use (2 bits and 7 bits respectively), which could be
applied to greater precision if needed.
7 INSTRUCTION FREQUENCY STUDY
CHERI Concentrate pipeline optimizations have a firm
grounding in analysis of compiled capability programs.
Table 1 contains the first published study of the frequency
of capability instructions in compiled programs. These programs include the Duktape Javascript interpreter running
the Splay benchmark from the Octane suite, a SQLite
benchmark developed for the LevelDB project, the P7Zip
benchmark from the LLVM test suite, and a boot of FreeBSD
with all user-space processes compiled in a pure-capability
mode. In each case we traced around 1 billion user-space
instructions from the FPGA implementation, about 10 seconds of execution time on our 100MHz processor, sampled
throughout the benchmark.
This study of capability instructions is unique compared
to studies using conventional instruction sets, because a
capability instruction set distinguishes between pointer and
integer operations: it separates memory operations accessing
non-pointer data from those that support pointers, and
pointer-modification from integer-arithmetic instructions.
According to Table 1, pointer-sized loads constitute up
to 12% of common programs. Thus, pointer loads should
minimize additional delay caused by an unpack operation
to decode capabilities into the register file or else risk
greatly impacting performance. Table 1 further indicates
that pointer arithmetic commonly constitutes over 10% of
executed instructions. Therefore pointer add must remain
simple, fast, and energy-efficient in the face of new
requirements imposed by capabilities. Table 1 shows loads
and stores of data and capabilities constituting as much
as 35% of common programs. Therefore the bounds check
on the offset addressing operation must not impede the
critical path, or else it too would risk greatly impacting
program performance.
As described in Section 8, CHERI Concentrate respects
all three of these requirements by nearly eliminating the
unpack operation, by introducing a fast representability check
for pointer arithmetic using only the compressed format,
and by acknowledging that the more complex bounds check
operation does not lie on the critical path.

TABLE 1: Dynamic capability instruction mix
Percent of total instruction mix for different benchmarks

Category                    DukJS    SQLite   P7Zip    Boot
load/store data             13.69%   21.29%   16.93%   16.26%
load capability             11.69%    7.99%    1.52%    4.69%
store capability             6.91%    5.24%    0.62%    3.17%
cap pointer arithmetic      15.15%    2.82%    3.35%    6.20%
stack pointer               13.16%    5.06%    9.81%    0.93%
other                        7.19%   10.09%    0.99%    1.89%
jump to capability           2.73%    1.68%    0.50%    0.50%
get special capability       0.08%    0.03%    0.00%    0.02%
compare capabilities         1.38%    1.90%    0.19%    0.15%
set capability bounds        0.36%    0.69%    0.00%    0.07%
read a capability field      1.17%    0.68%    0.00%    0.05%

8 CHERI CONCENTRATE PIPELINE
Capability compression adds three operations to a pipeline:
1) Unpack: Any logic that transforms a capability from
memory representation to register format.
2) Pointer Add: Any logic required to add an offset to
a capability, producing a new capability, handling any
checks for an unrepresentable result.
3) Bounds Check: Any logic to verify that an offset lands
within the bounds of a capability for memory access.
Figure 14 shows the placement of these operations in a
typical MIPS pipeline using the Low-Fat and CHERI
Concentrate micro-architectural approaches. In the CC pipeline,
Unpack and Pointer Add lie on a performance-critical path,
and Bounds Check does not. For Low-Fat, all three operations
are moved to a new capability pipeline stage after
memory access where bounds are unpacked, updated, and
checked. Placing all bounds operations in a single stage
after cache access enables full pipelining even as capabilities
are loaded, as long as bounds are not required earlier in
the pipeline. This is true as long as loads and stores can
be issued with the pointer address speculatively, verifying
bounds only in the cycle when a loaded value is available.

Fig. 14: Crucial Capability Functions in the Pipeline. (The
figure places Unpack, Pointer Add, and the Bounds and
Representable Checks within classic MIPS, Low-fat, and
CHERI Concentrate pipelines across the Execute, Memory
Access, and Write Back stages, with annotated delays of
1.9ns, 2.7ns, 4.5ns, 1.7ns, 2.9ns, and 3.9ns for the key
operations discussed in the text.)

Unpack
Low-Fat has a complex Unpack operation that decodes the
distance to the bounds into the register file. Complete Low-Fat
bounds decoding required over 4ns on their Xilinx Virtex
6 [3] and 4.47ns in our experiment in Figure 15, which is
too complex to be performed after memory load and before
register writeback without requiring an additional cycle.
Low-Fat added an Unpack stage to the pipeline and also
performed Pointer Add and Bounds Check there to ensure
that these operations can see capabilities forwarded from a
load. It is preferable to eliminate this stage and maintain a
single set of forwarding paths to Execute.
CHERI Concentrate does not entirely eliminate the Unpack
procedure but reduces its cost dramatically so that it
is not in the critical path. The CC unpack operation requires
1.70ns to decode the bits of the pointer address at Exp
(corresponding to T and B) and the top two bits of T. This
delay is comparable to the data byte select, which is not
required when loading capabilities; thus, CC avoids extra
delay between the cache and the register file.
Pointer Add
Low-Fat Pointer Add consists of 3 operations: address addition, Bounds Check, and a distance update. The address
addition is done in the traditional ALU, but the Bounds
Check and update of the distances to the bounds occur
in the new pipeline stage for bounds. The Bounds Check
operation is highly optimized by fully decoding the distance
to the bounds in the register file, requiring only 1.88ns in our
experiment in Figure 15, and resulting in a full Pointer Add
of only 2.74ns – only 70% longer than the simple 64-bit add
required by CHERI-256.
CC does not have the advantage of fully decoded
bounds, but uses the representability check described in Section 6.3 to assert that the result will be representable without
checking the precise bounds. This check achieves a delay of
only 2.89ns to modify the pointer address and to perform
the representability check – only slightly longer than Low-fat's Pointer Add. CC's representability check fits easily
within the execute stage of our pipeline.
Bounds Check
A standalone Bounds Check operation is very fast for Low-Fat, only 1.88ns, but we consider this a wasted optimization
as the stand-alone bounds check would not lie on the critical
path for memory access. CC required 3.85ns for a precise
Bounds Check, which fits comfortably into the exception path
of our pipeline (which is parallel to cache lookup).
Fig. 15: Complexity and speed of key operations. Test
bench synthesized in Altera's Quartus Prime 15.1 for a Stratix
V 5SGXEA7N2F45C2 FPGA. Each synthesis was performed with
three random seeds. Combinational logic usage (ALUTs) was
constant, but layout variation perturbed timing. The Low-fat
algorithms were reproduced from their descriptions. The CC
algorithm used here has a 48-bit virtual address and an 18-bit
bounds field for direct comparison with Low-fat.
By pushing the full Bounds Check delay from the pointer
load-to-use path to the memory access path, CHERI Concentrate avoids pointer load-to-use delays, an inflated register
file, and unusual pipeline forwarding. While we performed
our integration with a canonical MIPS pipeline, these solutions enable a reasonable capability implementation in any
processor without violating the general conventions of high-performance pipelines.
9 EXECUTION PERFORMANCE
To evaluate CC, we modified our open-source CHERI
processor from http://www.cheri-cpu.org/, extended the
LLVM compiler [17], created a custom embedded OS based
on CHERI protection, and extended the FreeBSD OS [18].
9.1 Microbenchmarks
We used a modified LLVM to compile a number of small
benchmarks to use capabilities for all data pointers (including heap and stack allocations) to enforce spatial memory
safety, and to use capabilities for return addresses and function pointers to enforce Control-Flow Integrity (CFI) [19].
We included the MiBench [20] suite, which is representative
of typical embedded data-centric C code, and the Olden
[21] suite, which is representative of pointer-based data
structure algorithms. These were executed under a custom
embedded operating system running on a 100MHz Stratix
IV FPGA prototype that used a 32KiB, 4-way set-associative,
write-through L1 data cache and a 256KiB, 4-way write-back
L2 cache. Across all benchmarks, we compare CHERI-256,
CHERI-128 and CHERI-64. This study allows us to compare CHERI-256 with CHERI-128, which provides a direct
improvement by halving pointer size while supporting the
full 64-bit virtual address space with negligible alignment
restrictions. On the other hand, we can compare CHERI-128
with CHERI-64, which restricts the virtual address space to
32 bits (based on the MIPS-n32 ABI), using the upper 32
bits to encode capability fields. The three architectures share
code generation and differ only in capability size.
Figure 16 measures the improvement in execution time
and L2 cache misses of CHERI-128 and CHERI-64 against
CHERI-256, with CHERI-256 normalized at 100%. The
two metrics represent overall performance and additional
DRAM traffic respectively. We have ordered the graphs by
pointer memory footprint as measured from core dumps.
The box plots aggregate all benchmarks with a pointer
density of less than 0.2% of allocated memory. These include
bitcount, qsort, stringsearch, rijndael, CRC, SHA, dijkstra
and adpcm, all from the MiBench suite. Apart from Patricia,
MiBench has low pointer density and sees little memory
impact from using capabilities of any size. The Olden
benchmarks with pointer-based data structures show up to a 20%
reduction in run time and a 50% reduction in DRAM traffic
when moving from CHERI-256 to CHERI-128. CHERI-64
further improves over CHERI-128, with a performance
improvement approaching 10%, while achieving an equally
dramatic reduction in DRAM traffic for these pointer-heavy
use cases.
The low-pointer-density benchmarks show almost no
performance improvement with smaller pointers, indicating
that many applications will see very little cost from adopting
extensive capability protections regardless of capability size.
Nevertheless, these low-pointer-density applications
occasionally saw a notable decrease in L2 misses.

Fig. 16: Percentage of run time and total L2 misses
for CHERI-128 and CHERI-64 versus CHERI-256 (dashed
lines, normalized to 100%). MiBench benchmarks with low
pointer density (Low 128/64) are collated on the left. High
pointer-density benchmarks (M-patricia, O-mst, O-bisort,
O-treeadd, O-perimeter) are on the right.

9.2 Larger applications and benchmark suites
We have designed CHERI Concentrate to fit in a standard
RISC architecture and to support compiled C programs. As
a result, we are able to compile many standard applications
and execute them under a full operating system, CheriBSD
(i.e., FreeBSD with capability extensions).
Fig. 17: Percentage of run time and total L2 misses for
CHERI-128 and 64-bit MIPS versus CHERI-256 (dashed line,
normalized to 100%): SPECint 2006 (SP-), Javascript (JS-),
and Sqlite3 (SL-) benchmarks, comprising SL-fillrandom,
SL-fillseq, JS-splay, JS-earley-boyer, 7Zip, SP-gobmk,
SP-h264ref, SP-libquantum, SP-hmmer, SP-xalancbmk,
SP-astar, and SP-bzip2. Origins are non-zero to improve
visibility.
This section uses the same hardware platform and compiler configuration as in Section 9.1. All benchmarks (P7Zip
16.02, Octane with Duktape 1.4.0, Sqlite3 3.21.0, and SPECint
2006) run under CheriBSD. SPECint 2006 benchmarks are
run with test datasets due to time and memory constraints
on our FPGA platform. Unfortunately, due to the lack of
LLVM MIPS-n32 support under FreeBSD, the CHERI-64 results are replaced with classic 64-bit MIPS benchmarks as a
reference. Note that MIPS code not only has smaller pointer
sizes, but also does not have any overhead from CHERI
protection at all. MIPS code generation also differs significantly from CHERI code generation when using integer
registers for pointer access, which would cause CHERI to
be slightly faster in some cases. This paper demonstrates the
benefits of capability size reduction and does not speculate
on optimal capability code generation, although the CHERI
LLVM extension continues to improve.
Figure 17 shows the results obtained from these workloads.
P7Zip is an ALU- and data-heavy benchmark with
very few pointers, resulting in the performance of CHERI-256,
CHERI-128 and MIPS being very close. The Octane
JavaScript benchmarks, Splay and Earley-Boyer, were
selected for pointer density, with around 25% of all data
memory holding pointers. These JavaScript benchmarks
running under Duktape show a dramatic improvement
with capability compression, cutting run time by 10% and
reducing L2 cache misses by 40%. Though the common
database application Sqlite3 has lower pointer density and
relies heavily on file I/O, there is still a sizable reduction
of run time and DRAM traffic. On average, approximately
10% of the total run time and 30% of the DRAM traffic can be
eliminated by deploying CHERI-128 for these benchmarks.
10 PROOF OF CORRECTNESS
The CHERI Concentrate capability compression and decompression algorithms are complex. Consequently, we
have undertaken a machine-checked proof of correctness
in HOL4 for key properties, which identified bugs and
confirmed the necessity of all corner-case handling required
in the hardware implementation. The first four proofs relate
to compression and decompression, and the last two proofs
verify the fast representable bounds check. These proofs
produced counterexamples for our initial E selection algorithm, and for our initial representability-check algorithm;
the corrections are reflected in the algorithms presented in
Section 6. These proofs are available online: ⟨www.cl.cam.ac.uk/research/security/ctsrd/cheri/cheri-concentrate/⟩.
• For any address (a) in the representable region defined
by a requested base and top from which we derive an
encoded b and t:
1) b ≤ base
2) base − b < 2^(E+2), that is, the error on the encoded
base is less than the value of the least significant bit of B
when there is an internal exponent
3) t ≥ top
4) t − top ≤ 2^(E+2), that is, the error on the encoded top
is less than the value of the least significant bit of T
when there is an internal exponent
• For any address a, increment i, valid exponent E < 26,
and representable limit R (see Section 6.2):
1) The fast representability check, IsRep, will succeed
only if p = a + i is within one representable space,
s = 2^(E+9), of the representable base rb:
IsRep ⟹ p − rb < s
2) The fast representability check will succeed if p is
reasonably within s of rb:
2^E ≤ p − rb < s − 2^E ⟹ IsRep
11 RELATED WORK
CHERI Concentrate safely encodes fine-grained memory
protection properties similar to the M-Machine and Low-Fat
pointers, representing a family of capability fat-pointer
machines. Computer architects have also explored several
other useful approaches to encode memory protection metadata in computer systems.
11.1 Table-based encoding
Table-based designs encode protection information in an
external table, keyed either by the data memory address,
choice of segment registers, or by an explicit index in the
memory reference. Whereas an arbitrarily large amount of
protection metadata may be encoded in a small index, the
table approach optimizes for a small fixed set of objects, and
does not fit today’s large and layered software landscape.
Early capability systems described a C-list of capabilities
for a process [22]; systems including the CAP computer [8]
and the i960 [10] implemented pointers as indices into
this table. Some foundational capability systems, in addition to supporting tables of memory descriptor capabilities,
supported capabilities as abstract identifiers that required
interpretation from a trusted object manager [23], [24].
Page-based memory protection is a table-based design that
is ubiquitous in commercial hardware and usually includes
protection metadata. Page-based protection has been extended in various ways to support in-address-space security
domains including domain-page protection [25] and page group
identifiers in the HP PA-RISC [26]. In page-based systems,
protection metadata is associated with the virtual address
and not with the pointer; costs scale with the size of the
address space and not with the number of pointers. However, as a consequence any access to memory in a process is
treated equally, and it is not possible to detect a corrupted
pointer that illegally accesses valid memory in the process.
Mondriaan [4] also uses table indirection to encode protection metadata in a Protection Look-aside Buffer (PLB) for
fine-grained address validation. While Mondriaan avoids
page granularities, performance still benefits greatly from
reducing the number of objects, making this approach undesirable for fine-grained protection.
Segmentation, pioneered in the Multics system [27] and
once common on IA-32 architectures (now deprecated in Intel 64 [28]), encodes protection metadata in segment descriptor tables indexed by segment selector registers. Despite
earlier widespread deployment, this memory protection
primitive was not picked up in scalable language models.
11.2 Tagged memory
Complex tags on memory locations may also encode protection metadata for pointers in memory [29], [30], [31].
Silicon Secured Memory from Oracle [32] tags all data
with version numbers, which must match in-pointer metadata to avoid temporal confusion.
Memory Protection Extensions (MPX) [7] from Intel maps a
shadow space for protection metadata for pointers in memory, and is similar to HardBound [6] (an academic proposal).
Despite hardware support, no form of compression is used,
and the metadata space is four times the size of the protected
address space.
As memory locality suffers from shadow-space or table
lookups, CHERI minimizes the tag to a single bit, which
can be used to protect the integrity of the remainder of the
metadata that may be stored in-line.
11.3 Capability pointer compression
CHERI Concentrate is based on CHERI-256 and inspired by
the M-Machine and Low-fat Pointers (which are thoroughly
discussed in Section 2).
11.4 Software fat-pointer techniques
Software-only techniques for fine-grained protection have
achieved surprising performance, but have had large memory overheads due to lack of compression. Baggy bounds [33],
SoftBound [34] and PAriCheck [35] dynamically check bounds
of fat pointers in software, and trade memory fragmentation
for improved performance. Cyclone [36] explicitly breaks
compatibility with C to define a safer C dialect that provides
fine-grained memory safety. Cyclone’s abstraction is close
to the CHERI model, but adds many static annotations.
Although Cyclone was not widely adopted, it influenced
pointer annotation in current C compilers. CHERI Concentrate can accelerate such systems to allow precise checks,
with negligible performance overhead.
12 CONCLUSIONS
CHERI Concentrate (CC) resolves major roadblocks to the
adoption of capability fat-pointers for fine-grained, deterministic memory protection. CC inherits the mature CHERI
capability semantics, enabling support for a wide software
base. CC also inherits the efficient Low-Fat compression
techniques, improving compression efficiency by eliminating one bit of the bounds without losing precision and by
encoding the exponent within the bounds field, achieving
the highest published capability fat-pointer encoding efficiency. In addition, we developed arithmetic operations that
operate directly on the compressed encoding. These allow
the register file to hold compressed capabilities, eliminating
complexity on the load path, and also allow more flexible
out-of-bounds modifications required by CHERI semantics.
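The floating-point-style bounds encoding summarized above can be illustrated with a simplified model. This is a sketch only, under stated assumptions: the real CC format stores only the low mantissa bits of T and B, recovers the upper bits from the pointer address, and embeds the exponent within the bounds field, all of which this model omits; the field width is hypothetical.

```python
# Simplified model of exponent-scaled bounds compression (illustrative only;
# not the actual CHERI Concentrate bit layout).
MANTISSA_BITS = 8  # hypothetical width; CHERI-128/-64 use different field sizes

def compress_bounds(base, top):
    """Pick the smallest exponent E such that base (rounded down) and top
    (rounded up) to multiples of 2**E differ by less than 2**MANTISSA_BITS."""
    E = 0
    while (((top + (1 << E) - 1) >> E) - (base >> E)) >= (1 << MANTISSA_BITS):
        E += 1
    B = base >> E                  # base mantissa, rounded down
    T = (top + (1 << E) - 1) >> E  # top mantissa, rounded up
    return E, B, T

def decode_bounds(E, B, T):
    # Outward rounding means the decoded region always covers the original.
    return B << E, T << E
```

Small objects are represented exactly at E = 0; larger objects trade alignment of base and top for a compact encoding, which is the fragmentation/precision trade-off the encoding analysis in this paper quantifies.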
We validated our design by building an FPGA prototype that achieves the same frequency as the original 256-bit CHERI implementation. We extended CHERI LLVM to
support CHERI-128 and CHERI-64 capability formats to run
multiple benchmarks and applications under both a custom
embedded OS and FreeBSD with CHERI support. CHERI-64
and CHERI-128 provide a convincing performance improvement over CHERI-256, reducing L2 cache misses by up to
75% for pointer-heavy benchmarks and greatly reducing the
performance overhead versus unprotected MIPS programs.
Finally, formal proofs provide assurance that our aggressive
arithmetic optimizations do not compromise the correctness
of our capability implementation.
In conclusion, this work presents a major maturation of
the state of the art in implementing capability fat pointers,
and prepares the way for commercial adoption to harden
systems against pressing security challenges.
13 ACKNOWLEDGMENTS
This work is part of the CTSRD, ECATS, and CIFV projects
sponsored by the Defense Advanced Research Projects
Agency (DARPA) and the Air Force Research Laboratory
(AFRL), under contracts FA8750-10-C-0237, HR0011-18-C-0016, and FA8650-18-C-7809. The views, opinions, and/or
findings contained in this paper are those of the authors and
should not be interpreted as representing the official views
or policies, either expressed or implied, of the Department
of Defense or the U.S. Government. Approved for Public
Release, Distribution Unlimited. We also acknowledge the
EPSRC REMS Programme Grant [EP/K008528/1], the EPSRC Impact Acceleration Account [EP/K503757/1], Arm
Limited, and Google, Inc. We would also like to acknowledge Alex Richardson, Lawrence Esswood, Peter Rugg, Peter Sewell, Graeme Barnes, and Bradley Smith who assisted
in various capacities to complete this work.
14 REFERENCES
[1] J. Woodruff, R. N. M. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie, P. G. Neumann, R. Norton, and M. Roe, "The CHERI capability model: Revisiting RISC in an age of risk," in Proceedings of the 41st International Symposium on Computer Architecture, June 2014.
[2] N. P. Carter, S. W. Keckler, and W. J. Dally, "Hardware support for fast capability-based addressing," SIGPLAN Not., vol. 29, no. 11, pp. 319–327, Nov. 1994.
[3] A. Kwon, U. Dhawan, J. M. Smith, T. F. Knight, Jr., and A. DeHon, "Low-fat pointers: Compact encoding and efficient gate-level implementation of fat pointers for spatial safety and capability-based security," in 20th Conference on Computer and Communications Security. ACM, November 2013.
[4] E. Witchel, J. Cates, and K. Asanović, "Mondrian memory protection," ACM SIGPLAN Notices, vol. 37, no. 10, pp. 304–316, 2002.
[5] E. Witchel, J. Rhee, and K. Asanović, "Mondrix: Memory isolation for Linux using Mondriaan memory protection," in Proceedings of the 20th ACM Symposium on Operating Systems Principles, October 2005.
[6] J. Devietti, C. Blundell, M. M. K. Martin, and S. Zdancewic, "Hardbound: Architectural support for spatial safety of the C programming language," SIGARCH Comput. Archit. News, vol. 36, no. 1, pp. 103–114, Mar. 2008.
[7] Intel Corporation, "Introduction to Intel® memory protection extensions," http://software.intel.com/en-us/articles/introduction-to-intel-memory-protection-extensions, July 2013.
[8] M. Wilkes and R. Needham, The Cambridge CAP Computer and Its Operating System. Elsevier North Holland, New York, 1979.
[9] F. J. Pollack, G. W. Cox, D. W. Hammerstrom, K. C. Kahn, K. K. Lai, and J. R. Rattner, "Supporting Ada memory management in the iAPX-432," in ACM SIGARCH Computer Architecture News, vol. 10, no. 2, 1982, pp. 117–131.
[10] "BiiN CPU architecture reference manual," BiiN, Hillsboro, Oregon, Tech. Rep., July 1988.
[11] R. N. M. Watson, P. G. Neumann, J. Woodruff, M. Roe, J. Anderson, D. Chisnall, B. Davis, A. Joannou, B. Laurie, S. W. Moore, S. J. Murdoch, R. Norton, and S. Son, "Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture," University of Cambridge, Computer Laboratory, Tech. Rep. UCAM-CL-TR-876, Nov. 2015.
[12] D. Chisnall, C. Rothwell, B. Davis, R. N. Watson, J. Woodruff, M. Vadera, S. W. Moore, P. G. Neumann, and M. Roe, "Beyond the PDP-11: Processor support for a memory-safe C abstract machine," in Proceedings of the 20th Architectural Support for Programming Languages and Operating Systems. ACM, 2015.
[13] R. N. M. Watson, J. Woodruff, P. G. Neumann, S. W. Moore, J. Anderson, D. Chisnall, N. Dave, B. Davis, K. Gudka, B. Laurie, S. J. Murdoch, R. Norton, M. Roe, S. Son, and M. Vadera, "CHERI: A hybrid capability-system architecture for scalable software compartmentalization," in Proceedings of the 36th IEEE Symposium on Security and Privacy, May 2015.
[14] J. Brown, J. Grossman, A. Huang, and T. F. Knight, Jr., "A capability representation with embedded address and nearly-exact object bounds," Project Aries Technical Memo 5, http://www.ai.mit.edu/projects/aries/Documents/Memos/ARIES-05.pdf, Tech. Rep., 2000.
[15] G. C. Necula, S. McPeak, and W. Weimer, "CCured: Type-safe retrofitting of legacy code," ACM SIGPLAN Notices, vol. 37, no. 1, pp. 128–139, 2002.
[16] J. Evans, "A scalable concurrent malloc(3) implementation for FreeBSD," in BSDCan, 2006.
[17] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 2004.
[18] M. K. McKusick, G. V. Neville-Neil, and R. N. M. Watson, The Design and Implementation of the FreeBSD Operating System. Pearson, 2014.
[19] M. Abadi, M. Budiu, Ú. Erlingsson, and J. Ligatti, "Control-flow integrity: Principles, implementations, and applications," in Proceedings of the 12th ACM Conference on Computer and Communications Security. ACM, 2005.
[20] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Proceedings of the IEEE International Workshop on Workload Characterization (WWC-4). IEEE Computer Society, 2001, pp. 3–14. [Online]. Available: http://dx.doi.org/10.1109/WWC.2001.15
[21] A. Rogers, M. C. Carlisle, J. H. Reppy, and L. J. Hendren, "Supporting dynamic data structures on distributed-memory machines," ACM Trans. Program. Lang. Syst., vol. 17, no. 2, pp. 233–263, Mar. 1995.
[22] J. B. Dennis and E. C. Van Horn, "Programming semantics for multiprogrammed computations," Commun. ACM, vol. 9, no. 3, pp. 143–155, 1966.
[23] R. Feiertag and P. Neumann, "The foundations of a Provably Secure Operating System (PSOS)," in Proceedings of the National Computer Conference. AFIPS Press, 1979, pp. 329–334.
[24] P. G. Neumann, R. S. Boyer, R. J. Feiertag, K. N. Levitt, and L. Robinson, "A Provably Secure Operating System: The system, its applications, and proofs," Computer Science Laboratory, SRI International, Tech. Rep. CSL-116, 2nd ed., May 1980.
[25] E. J. Koldinger, J. S. Chase, and S. J. Eggers, "Architecture support for single address space operating systems," in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V). ACM, 1992, pp. 175–186. [Online]. Available: http://doi.acm.org/10.1145/143365.143508
[26] R. B. Lee, "Precision architecture," Computer, vol. 22, no. 1, pp. 78–91, Jan. 1989.
[27] F. J. Corbató and V. A. Vyssotsky, "Introduction and overview of the Multics system," in Proceedings of the 1965 Fall Joint Computer Conference, Part I. ACM, 1965, pp. 185–196.
[28] Intel Corporation, "Intel® 64 and IA-32 architectures software developer's manual, Volume 1: Basic architecture," December 2015.
[29] N. Zeldovich, H. Kannan, M. Dalton, and C. Kozyrakis, "Hardware enforcement of application security policies using tagged memory," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). USENIX Association, 2008, pp. 225–240. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855741.1855757
[30] A. DeHon, B. Karel, T. F. Knight, Jr., G. Malecha, B. Montagu, R. Morisset, G. Morrisett, B. C. Pierce, R. Pollack, S. Ray, O. Shivers, J. M. Smith, and G. Sullivan, "Preliminary design of the SAFE platform," in Proceedings of the 6th Workshop on Programming Languages and Operating Systems (PLOS 2011), October 2011.
[31] U. Dhawan, C. Hritcu, R. Rubin, N. Vasilakis, S. Chiricescu, J. M. Smith, T. F. Knight, B. C. Pierce, and A. DeHon, "Architectural support for software-defined metadata processing," in 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, March 2015.
[32] "Oracle's SPARC T7 and SPARC M7 server architecture," Oracle, Tech. Rep., August 2016.
[33] P. Akritidis, M. Costa, M. Castro, and S. Hand, "Baggy bounds checking: An efficient and backwards-compatible defense against out-of-bounds errors," in Proceedings of the 18th USENIX Security Symposium (SSYM '09). USENIX Association, 2009, pp. 51–66. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855768.1855772
[34] S. Nagarakatte, J. Zhao, M. M. Martin, and S. Zdancewic, "SoftBound: Highly compatible and complete spatial memory safety for C," in Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '09). ACM, 2009, pp. 245–258. [Online]. Available: http://doi.acm.org/10.1145/1542476.1542504
[35] Y. Younan, P. Philippaerts, L. Cavallaro, R. Sekar, F. Piessens, and W. Joosen, "PAriCheck: An efficient pointer arithmetic checker for C programs," in Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security (ASIACCS '10). ACM, 2010, pp. 145–156. [Online]. Available: http://doi.acm.org/10.1145/1755688.1755707
[36] T. Jim, J. G. Morrisett, D. Grossman, M. W. Hicks, J. Cheney, and Y. Wang, "Cyclone: A safe dialect of C," in Proceedings of the USENIX Annual Technical Conference (ATEC '02). USENIX Association, 2002, pp. 275–288. [Online]. Available: http://dl.acm.org/citation.cfm?id=647057.713871
Jonathan Woodruff received a Bachelor's degree in Electrical Engineering from the University
of Texas at Austin, and Master's and PhD degrees
in Computer Science from the University of Cambridge. He is a Research Associate at the University of Cambridge Department of Computer
Science and Technology. His research interests
include instruction-set support for security, microarchitectural optimizations for security features, and FPGA prototyping. He has authored
9 papers.
Alexandre Joannou received an engineering degree from the École Centrale d'Électronique Paris, a Master's in Computer Architecture from Université Pierre et Marie Curie, and a PhD in Computer Science from the
University of Cambridge. He is a Research Associate at the University
of Cambridge Department of Computer Science and Technology. His
research interests include instruction modeling, FPGA prototyping, and
security. He is an author of 3 papers and a member of IEEE.
Hongyan Xia received his Bachelor's degree in Electrical and Computer
Engineering from the University of Birmingham, and a Master's in Computer
Science from the University of Cambridge. He is a PhD student at the
University of Cambridge Department of Computer Science and Technology. His current research interests include memory safety for embedded
systems and secure real-time operating systems.
Anthony Fox received Bachelor's and PhD degrees in Computer Science from Swansea University. He is a Principal Security Engineer at
Arm Ltd. His research interests include formal models of instruction set
architectures, interactive theorem proving, and the formal verification of
compilers and machine code. He has authored sixteen papers.
Robert M. Norton received a Bachelor's degree and PhD in Computer
Science from the University of Cambridge. He is currently a research
associate, also at the University of Cambridge. His research interests
include architectural support for security features including memory
safety and formal semantics of instruction sets. He has authored 3
papers.
Thomas Bauereiss is a research associate at the University of Cambridge working on formal modeling and verification of Instruction Set
Architectures. He previously developed techniques for verifying information flow security properties of software systems at DFKI Bremen.
Khilan Gudka received Bachelor's, Master's, and PhD degrees in Computer Science from Imperial College London. He is a Research Associate
at the University of Cambridge Department of Computer Science and
Technology. His research interests include program analysis, compilers,
application compartmentalisation and concurrency. He has authored 4
papers.
David Chisnall received a Bachelor's degree and PhD in Computer
Science from Swansea University. He is a Researcher at Microsoft Research Cambridge. His research interests include hardware-language
co-design and security.
Brooks Davis is a senior computer scientist at SRI International based
in Walla Walla, Washington. His research interests include operating
systems, security, and tools to aid the incremental adoption of new
technologies. He received a BS in computer science from Harvey Mudd
College. He is the author of numerous papers on the open-source
FreeBSD operating system. He is a member of the ACM and the IEEE
Computer Society.
Nathaniel Filardo received his Bachelor’s degrees in Physics and Computer Science from Carnegie Mellon University and his Masters and
PhD degrees in Computer Science from Johns Hopkins University. He
is a Research Associate at the University of Cambridge Department
of Computer Science and Technology. His research interests include
architectural security and static type systems.
A. Theodore Markettos received the MA and MEng degrees in Electrical and Information Sciences, and subsequently a PhD in Computer
Science, from the University of Cambridge. He is a Senior Research
Associate in the Department of Computer Science and Technology at
the University of Cambridge. His research interests include hardware
security, notably security architectures and I/O security, FPGA design
and electronics manufacturing. He has authored 12 papers on related
topics.
Michael Roe is a senior research associate at the University of Cambridge Computer Laboratory. He received a PhD in computer science
from Swansea University. His research interests include capability systems, cryptographic protocols, and formal methods.
Peter G. Neumann received AM, SM, and PhD degrees from Harvard,
and a Dr rerum naturalium from Darmstadt. He is Chief Scientist of the
SRI International Computer Science Lab (a not-for-profit research institution), involved primarily in system trustworthiness – including security,
safety, and high assurance. In the computer field since 1953, he has
numerous published papers and reports. He is a Fellow of the IEEE,
ACM, and AAAS.
Robert N. M. Watson received his BS in Logic and Computation with
double major in Computer Science from Carnegie Mellon University, and
his PhD in Computer Science from the University of Cambridge. He
is a University Senior Lecturer (Associate Professor) at the University
of Cambridge Department of Computer Science and Technology. His
research interests span computer architecture, compilers, program analysis and transformation, operating systems, networking, and security.
He is a member of the ACM.
Simon Moore is a Professor of Computer Engineering in the Computer Architecture group at the University of Cambridge Department of
Computer Science and Technology, where he conducts research and
teaching in the general area of computer design with particular interests
in secure computer architecture. He is a senior member of the IEEE.