Chisel Constructing Hardware in A Scala Embedded Language

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

52.

Chisel: Constructing Hardware in a


Scala Embedded Language

Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee,


Andrew Waterman, Rimas Avižienis, John Wawrzynek, Krste Asanović
EECS Department, UC Berkeley ∗
{jrb|huytbvo|richards|yunsup|waterman|rimas|johnw|krste}@eecs.berkeley.edu

ABSTRACT designs must be inferred from a subset of the language, com-


In this paper we introduce Chisel, a new hardware construc- plicating tool development and designer education. These
tion language that supports advanced hardware design using languages also lack the powerful abstraction facilities that
highly parameterized generators and layered domain-specific are common in modern software languages, which leads to
hardware languages. By embedding Chisel in the Scala pro- low designer productivity by making it difficult to reuse com-
gramming language, we raise the level of hardware design ab- ponents. Constructing efficient hardware designs requires
straction by providing concepts including object orientation, extensive design-space exploration of alternative system mi-
functional programming, parameterized types, and type in- croarchitectures [9] but these traditional HDLs have limited
ference. Chisel can generate a high-speed C++-based cycle- module generation facilities and are ill-suited to producing
accurate software simulator, or low-level Verilog designed to and composing the highly parameterized module generators
map to either FPGAs or to a standard ASIC flow for syn- required to support thorough design-space exploration. Re-
thesis. This paper presents Chisel, its embedding in Scala, cent extensions such as SystemVerilog improve the type sys-
hardware examples, and results for C++ simulation, Verilog tem and parameterized generate facilities but still lack many
emulation and ASIC synthesis. powerful programming language features.
To work around these limitations, one common approach
is to use another language as a macro processing language
Categories and Subject Descriptors for an underlying HDL. For example, Genesis2 uses Perl to
B.6.3 [Logic Design]: [Design Aids – automatic synthesis, provide more flexible parameterization and elaboration of
hardware description languages] hardware blocks written in SystemVerilog [9]. The language
called Verischemelog [6] provides a Scheme syntax for spec-
ifying modules in a similar format to Verilog. JHDL [1]
General Terms equates Java classes with modules. HML [7] uses standard
Design, Languages, Performance ML functions to wire together a circuit. These approaches
allow familiar and powerful languages to be macro languages
for hardware netlists, but effectively require leaf components
Keywords of the design to be described in the underlying HDL. This
CAD combined approach is cumbersome, combining the poor ab-
straction facilities of the underlying HDL with a completely
different high-level programming model that does not un-
1. INTRODUCTION derstand hardware types and semantics.
The dominant traditional hardware-description languages An alternative approach is to begin from a domain-specific
(HDLs), Verilog and VHDL, were originally developed as application programming language from which a hardware
hardware simulation languages, and were only later adopted block is generated. Esterel [2] uses event-based statements
as a basis for hardware synthesis. Because the semantics of to program hardware for reactive systems. DIL [4] is an in-
these languages are based around simulation, synthesizable termediate language targeted at stream processing and hard-
∗ ware virtualization. Bluespec [3] supports a general concur-
Research supported by DoE Award DE-SC0003624, and
by Microsoft (Award #024263 ) and Intel (Award #024894) rent computation model, based on guarded atomic actions.
funding and by matching funding by U.C. Discovery (Award While these can provide great designer productivity when
#DIG07-10227). the task in hand matches the pattern encoded in the appli-
cation programming model, they are a poor match for tasks
outside their domain. For example, the design of a pro-
grammable microprocessor is not well described in a stream
Permission to make digital or hard copies of all or part of this work for programming model, and guarded atomic actions are not a
personal or classroom use is granted without fee provided that copies are natural way to express a high-level DSP algorithm. Further-
not made or distributed for profit or commercial advantage and that copies more, in general it is difficult to derive an efficient microar-
bear this notice and the full citation on the first page. To copy otherwise, to chitecture from a higher-level computation model, especially
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
if the goal is a programmable engine to run many applica-
DAC 2012, June 3-7, 2012, San Francisco, California, USA. tions, where the human designer would prefer to write a
Copyright 2012 ACM 978-1-4503-1199-1/12/06 ...$10.00.

1212

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
generator to explore this design space in detail. Bundles and Vecs can be arbitrarily nested to build com-
In this paper, we introduce Chisel (Constructing Hard- plex data structures. The set of primitive classes (Bits, Fix,
ware In a Scala Embedded Language), a new hardware de- UFix, Bool) plus the aggregate classes (Bundles and Vecs)
sign language we have developed based on the Scala pro- all inherit from a common superclass, Data. Every object
gramming language [8]. Chisel is intended to be a simple that ultimately inherits from Data can be represented as a
platform that provides modern programming language fea- bit vector in a hardware design.
tures for accurately specifying low-level hardware blocks,
but which can be readily extended to capture many use- 2.2 Combinational Circuits
ful high-level hardware design patterns. By using a flexible A circuit is represented as a graph of nodes in Chisel. Each
platform, each module in a project can employ whichever node is a hardware operator that has zero or more inputs
design pattern best fits that design, and designers can freely and that drives one output. A literal is a degenerate node
combine multiple modules regardless of their programming that has no inputs and drives a constant value on its output.
model. Chisel can generate fast cycle-accurate C++ simu- One way to create and wire together nodes is using textual
lators for a design, or generate low-level Verilog suitable for expressions:
either FPGA emulation or ASIC synthesis with standard (a & b) | (~c & d)
tools. We present several design examples and results from where & and | represent bitwise-AND and -OR respectively,
emulation and synthesis experiments. and ~ represents bitwise-NOT. The names a through d rep-
resent named wires of some (unspecified) width. Any simple
2. CHISEL OVERVIEW expression can be converted directly into a circuit tree, with
Instead of building a new hardware design language from named wires at the leaves and operators forming the inter-
scratch, we chose to embed hardware construction primi- nal nodes. The final circuit output of the expression is taken
tives within the Scala programming language. We chose from the operator at the root of the tree, in this example,
Scala for a number of reasons: Scala 1) is a very powerful the bitwise-OR.
language with features we feel are important for building Simple expressions can build circuits in the shape of trees,
circuit generators, 2) is specifically developed as a base for but to construct circuits in the shape of arbitrary directed
domain-specific languages, 3) compiles to the JVM, 4) has a acyclic graphs (DAGs), we must describe fan-out. In Chisel,
large set of development tools and IDEs, and 5) has a fairly we can name a wire holding a subexpression by declaring a
large and growing user community. Chisel comprises a set variable, then referencing it multiple times in subsequent
of Scala libraries that define new hardware datatypes and a expressions:
val sel = a | b
set of routines to convert a hardware data structure into ei- val out = (sel & in1) | (~sel & in0)
ther a fast C++ simulator or low-level Verilog for emulation
or synthesis. This section describes the features of the base The named Chisel wire sel holds the output of the first
Chisel system, whereas the next two sections describe how bitwise-OR operator so that the output can be used multiple
Chisel supports abstraction and powerful generators. times in the second expression.
Bit widths are automatically inferred unless set manually
2.1 Chisel Datatypes by the user. The bit-width inference engine starts from the
The basic Chisel datatypes are used to specify the type of graph’s input ports and calculates node output bit widths
values held in state elements or flowing on wires. In Chisel, a from their respective input bit widths, always preserving
raw collection of bits is represented by the Bits type. Signed exact results unless an explicit truncation is requested.
and unsigned integers are considered subsets of fixed-point 2.3 Functions
numbers and are represented by types Fix and UFix respec-
We can define functions to factor out a repeated piece of
tively. Boolean values are represented as type Bool. Note
logic that we later reuse multiple times in a design:
that these types are distinct from Scala’s builtin types such
def clb(a: Bits, b: Bits, c: Bits, d: Bits) = (a & b) | (~c & d)
as Int. Constant or literal values are expressed using Scala val out = clb(a,b,c,d)
integers or strings passed to constructors for the types.
Chisel provides a Bundle class, which the user extends The def keyword is part of Scala and introduces a func-
to make collections of values with named fields (similar to tion definition, with each argument followed by a semicolon
structs in other languages): then its type, and the function return type given after the
semicolon following the argument list. The equals (=) sign
class MyFloat extends Bundle { indicates the start of the function definition.
val sign = Bool()
val exponent = UFix(width = 8)
val significand = UFix(width = 23)
2.4 Ports and Components
} Ports are used as interfaces to hardware components. A
val x = new MyFloat()
val xs = x.sign
port is simply any Data object that has directions assigned
to its members. An example port declaration is as follows:
The keyword val is part of Scala, and is used to name vari- class FIFOInput extends Bundle {
ables that have values that won’t change. The width named val ready = Bool(OUTPUT)
val valid = Bool(INPUT)
parameter to the UFix constructor specifies the number of val data = Bits(32, INPUT)
bits in the type. Chisel also provides Vecs for indexable }
collections of values: FIFOInput becomes a new type that can be used in compo-
// Vector of five 23-bit signed integers. nent interfaces or for named collections of wires.
val myVec = Vec(5) { Fix(width = 23) } The direction of an object can also be assigned at instan-
val reg3 = myVec(3) // Connect to one element of vector.
tiation time:

1213

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
class ScaleIO extends Bundle { where Vec takes a size as the first argument and a block
val in = new MyFloat().asInput returning a port as the second argument.
val scale = new MyFloat().asInput
val out = new MyFloat().asOutput We can now compose two filters into a filter block as fol-
} lows:
class Block extends Component {
By folding directions into the object declarations, Chisel is val io = new FilterIO()
able to provide powerful wiring constructs described later. val f1 = new Filter()
In Chisel, components are very similar to modules in Ver- val f2 = new Filter()
ilog, defining a hierarchical structure in the generated cir- f1.io.x <> io.x
f1.io.y <> f2.io.x
cuit. The hierarchical component namespace is accessible in f2.io.y <> io.y
downstream tools to aid in debugging and physical layout. }
A user-defined component is defined as a class which: (1) in- where <> bulk connects interfaces of opposite gender. Bulk
herits from Component, (2) contains an interface stored in a connections connect leaf ports of the same name to each
port field named io, and (3) wires together subcircuits in its other. After all connections are made and the circuit is
constructor. As an example, consider defining a two-input being elaborated, Chisel warns users if ports have other than
multiplexer as a component: exactly one connection.
class Mux2 extends Component {
val io = new Bundle { 2.5 State Elements
val sel = Bits(1, INPUT)
val in0 = Bits(1, INPUT) The simplest form of state element supported by Chisel is
val in1 = Bits(1, INPUT) a positive-edge-triggered register, which can be instantiated
val out = Bits(1, OUTPUT) functionally as:
}
io.out := (io.sel & io.in1) | (~io.sel & io.in0) Reg(in)
} This circuit has an output that is a copy of the input signal
The wiring interface to a component is a collection of ports in delayed by one clock cycle. Note that we do not have to
in the form of a Bundle, held in a field named io. The := specify the type of Reg as it will be automatically inferred
assignment operator, used here in the body of the definition, from its input when instantiated in this way. In the current
is a special operator in Chisel that wires the input of left- version of Chisel, clock and reset are global signals that are
hand side to the output of the right-hand side. implicitly included where needed.
Port classes represent the interface to a component, and Using registers, we can quickly define a number of useful
users can organize interfaces into hierarchies using standard circuit constructs. For example, a rising-edge detector that
Scala facilities. For example, a user could define a simple takes a boolean signal in and outputs true when the current
link for handshaking data as follows: value is true and the previous value is false is given by:
class SimpleLink extends Bundle { def risingedge(x: Bool) = x && !Reg(x)
val data = Bits(16, OUTPUT)

}
val rdy = Bool(OUTPUT) 2.6 Conditional Updates
In the previous examples with registers, we simply wired
We can then extend SimpleLink by adding parity bits using their inputs to combinational logic blocks. When describing
bundle inheritance: the operation of state elements, it is often useful to instead
class PLink extends SimpleLink { specify when updates to the registers will occur and to spec-
val parity = Bits(5, OUTPUT)
} ify these updates spread across several separate statements.
Chisel provides conditional update rules in the form of the
From there we can define a filter interface by nesting two when construct to support this style of sequential logic de-
PLinks into a new FilterIO bundle: scription. For example,
class FilterIO extends Bundle { val r = Reg() { UFix(width = 16) }
val x = new PLink().flip when (c === UFix(0)) {
val y = new PLink() r := r + UFix(1)
} }
where flip recursively changes the “gender” of a bundle, where register r is updated at the end of the current clock
changing input to output and output to input. cycle only if c is zero. The argument to when is a predicate
We can now define a filter by defining a filter class extend- circuit expression that returns a Bool. The update block
ing component: following when can only contain update statements using the
class Filter extends Component { update operator :=, simple expressions, and named wires
val io = new FilterIO()
...
defined with val.
} In a sequence of conditional updates, the last conditional
update whose condition is true takes priority. For example,
where the io field contains FilterIO.
when (c1) { r := Bits(1) }
Beyond single elements, vectors of elements form richer hi- when (c2) { r := Bits(2) }
erarchical interfaces. For example, in order to create a cross-
leads to r being updated according to the following truth
bar with a vector of inputs, producing a vector of outputs,
table:
and selected by a UFix input, we utilize the Vec constructor:
class CrossbarIo(n: Int) extends Bundle {
c1 c2 r
val in = Vec(n){ new PLink().flip() } 0 0 r r unchanged
val sel = UFix(ceilLog2(n), INPUT) 0 1 2
val out = Vec(n){ new PLink() } 1 0 1
} 1 1 2 c2 takes precedence over c1

1214

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
Initial values is constrained to work on inputs of type Num for which Chisel
0 0 e1
multiplication and addition are defined.
Like parameterized functions, we can also parameterize
c1 f t classes to make them more reusable. For instance, we can
when (c1) generalize the Filter class, defined in section 2.4, to use any
{ r <== e1 } kind of link. We do so by parameterizing the FilterIO class
e2
and defining the constructor to take a zero-argument type
c2 f t constructor function as follows:
when (c2) class FilterIO[T <: Data]()(type: => T) extends Bundle {
{ r <== e2 } val x = type.asInput.flip
enable in val y = type.asOutput
}
clock r
We can now define Filter by defining a component class
out
that also takes a link type constructor argument and passes
it through to the FilterIO interface constructor:
Figure 1: Equivalent hardware constructed for con- class Filter[T <: Data]()(type: => T) extends Component {
ditional updates. val io = (new FilterIO()){ type }
...
}
Figure 1 shows how each conditional update can be viewed
as inserting a mux before the input of a register to select 3.2 Abstract Data Types
either the update expression or the previous input according Through support for abstract data types, Chisel permits
to the when predicate. In addition, the predicate is OR-ed much simpler code than would otherwise be possible. For
into a firing signal that drives the load enable of the register. example, consider constructing a block, such as the FFT, re-
The compiler places initialization values at the beginning of quiring arithmetic on complex numbers. In Chisel, complex
the chain so that if no conditional updates fire in a clock numbers can be defined as follows:
cycle, the load enable of the register will be deasserted and class Complex(val real: Fix, val imag: Fix) extends Bundle {
the register value will not change. def +(b: Complex): Complex =
new Complex(real + b.real, imag + b.imag)
...
3. ABSTRACTION }

In this section we discuss abstraction within Chisel. Ab- where we overload infix operators to provide an intuitive
straction is an important aspect of Chisel as it 1) allows algebraic interface. Complex numbers can now be used in
users to conveniently create reusable objects and functions, both the interface and in arithmetic:
2) allows users to define their own data types, and 3) allows class Example extends Component {
users to better capture particular design patterns by writing val io = new Bundle {
their own domain-specific languages on top of Chisel. val a = new Complex(Fix(2, INPUT), Fix(2, INPUT))
val b = new Complex(Fix(2, INPUT), Fix(2, INPUT))
val out = new Complex(Fix(2, OUTPUT), Fix(2, OUTPUT))
3.1 Polymorphism and Parameterized Types }
Scala is a strongly typed language and uses parameterized val c = io.a + io.b
io.out.r := c.r
types to specify generic functions and classes. Chisel users io.out.i := c.i
can define their own reusable functions and classes using }
parameterized classes. For instance we can write a param-
eterized function for defining an inner-product FIR digital
filter generically over Chisel Num’s. The inner product FIR 4. GENERATORS
filter can be mathematically defined as: A key motivation for embedding Chisel in Scala is to sup-
X port highly parameterized circuit generators, a weakness of
y[t] = wj ∗ xj [t − j] (1) traditional HDLs.
j
4.1 Cache Generator
where x is the input and w is a vector of weights. In Chisel
this can be defined as: One example of a highly parameterized subsystem is a
memory cache generator. In Chisel, the basic configuration
def innerProductFIR[T <: Num] (w: Array[Int], x: T) = options can first be defined:
foldR(Range(0, w.length).map(i => Num(w(i))
* delay(x, i)), _ + _) object CacheParams {
val DIR_MAPPED = 0
def delay[T <: Bits](x: T, n: Int): T = val SET_ASSOC = 1
if (n == 0) x else Reg(delay(x, n - 1)) val WRITE_THRU = 0
val WRITE_BACK = 1
def foldR[T <: Bits] (x: Seq[T], f: (T, T) => T): T = }
if (x.length == 1) x(0) else f(x(0),
foldR(x.slice(1, x.length), f)) The main body of the cache generator component can then
be declared with desired generator parameters and optional
where delay creates an n-cycle delayed copy of its input and default values. The io bundle then references two IO inter-
foldR (for “fold right”) constructs a reduction circuit given face bundles, one specifying a connection to a CPU and the
a binary combiner function f. In this case, foldR creates a other defining the memory interface. Computed parameters
summation circuit. Finally, the innerProductFIR function are then defined, followed by the main body of the generator:

1215

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
class Cache(cache_type: Int = DIR_MAPPED, min1.io.in1 <> min_first_half.io.out
associativity: Int = 1, }
line_size: Int = 128, if (in_width < 3) {
cache_depth: Int = 16, min1.io.in2 <> io.in_vec(1)
write_policy: Int = WRITE_THRU } else {
) extends Component { val min_second_half = new SortVector(in_width - midpoint)
val io = new Bundle() { for (i <- midpoint until in_width)
val cpu = new IoCacheToCPU() min_second_half.io.in_vec(i - midpoint) <> io.in_vec(i)
val mem = new IoCacheToMem().flip() min1.io.in2 <> min_second_half.io.out
} }
val addr_idx_width = ( log(cache_depth) / log(2) ).toInt }
val addr_off_width = ( log(line_size/32) / log(2) ).toInt
val addr_tag_width = 32 - addr_idx_width - addr_off_width - 2 Note that Verilog is not able to describe this type of recur-
val log2_assoc = ( log(associativity) / log(2) ).toInt
...
sion, and a designer would need to use a different language,
if (cache_type == DIR_MAPPED) such as Python, to generate Verilog from a recursive routine.
...
} 4.3 Memory
The resulting Cache generator can then be used in a larger Memories are given special treatment in Chisel since hard-
system: ware implementations of memory have many variations, e.g.,
... FPGA memories are instantiated quite differently from ASIC
val data_cache = new Cache(cache_type = SET_ASSOC, line_size = 64) memories. Chisel defines a memory abstraction that can
connection_to_cpu <> data_cache.io.cpu map to either simple Verilog behavioral descriptions, or to
connection_to_mem <> data_cache.io.mem
...
instances of memory modules that are available from exter-
nal memory generators provided by foundry or IP vendors.
In the simplest form, Chisel allows memory to be defined
4.2 Sorting Network with a single write port and multiple read ports as follows:
In addition to offering flexible parameterization, Chisel
Mem(depth: Int,
supports recursive creation of hardware subsystems. In the target: Symbol = ’default, readLatency: Int = 0)
example below a simple sorting network is specified using a
two-input SortBlock defined with handshaking ports. First, where depth is the number of memory locations, target is
a simple queue IO interface data type is defined by extend- the type of memory used, readLatency is the latency of read
ing the Bundle class. This data type will be used to define ports to be defined on the memory. A memory object can
connections between the sorting primitives: then be read from using the read(rdAddress) method. For
class IoSortBlockOut extends Bundle() { example, an audio recorder could be defined as follows:
val output = Bits(sort_data_size, OUTPUT)
def audioRecorder(n: Int) = {
val output_rdy = Bool(OUTPUT)
val addr = counter(UFix(n));
val has_output = Bool(OUTPUT)
val ram = Mem(n).write(button(), addr)
val pop = Bool(INPUT)
ram.read(Mux(button(), UFix(0), addr))
}
}
The SortBlock primitive is then defined to output the
where a counter is used as an address generator into a mem-
minimum of the two inputs, subject to handshaking:
ory. The device records while button is true, or plays back
class SortBlock extends Component() { when false.
override val io = new Bundle() {
val in1 = new IoSortBlockOut() We can use simple memory to create register files. For
val in2 = new IoSortBlockOut() example we can make a one write port, two read port register
val out = new IoSortBlockOut() file with 32 registers as follows:
}
... val regs = Mem(32)
} regs.write(wr_en, wr_addr, wr_data)
val idat = regs.read(iaddr)
Using this sorting primitive, it is then possible to define a val mdat = regs.read(maddr)
recursive architecture to find the minimum of a vector of
numbers. SortVector below recursively finds the minimum where a new read port is created for each call to read.
of the first and second halves of the input vector, and re- Additional parameters are available to mimic common
turns the minimum of the two results. This example also memory behaviors, to aid with the process of mapping to
demonstrates the power of using Bundle to combine inputs real-world hardware. The following is an example of a mem-
and outputs along with arrays of Bundle using Vec. ory that is first defined with no memory ports, after which
class SortVector(in_width: Int) extends Component() {
read, write, or read/write ports are added:
val io = new Bundle() { val regfile =
val in_vec = Vec(in_width) { new IoSortBlockOut().flip } Mem(64, readLatency = 1,
val out = new IoSortBlockOut() hexInitFile = "hex_init_values.txt");
} regfile.write(addr_in, data_in1, wen, w_mask = bit_mask);
val min1 = new SortPair() val read_data = regfile.read(addr_in);
min1.io.out <> io.out
val midpoint = in_width / 2
if (in_width < 4) {
By default, this memory will be compiled to Verilog RTL.
// Connect first input directly to min1 To produce a reference to a Verilog instance of a memory
min1.io.in1 <> io.in_vec(0) module, one adds target = ’inst to the constructor call.
} else { When Chisel compiles to Verilog, a second file will be gener-
val min_first_half = new SortVector(midpoint)
for (i <- 0 until midpoint) ated, e.g., design.conf, which can be used by the synthesis
min_first_half.io.in_vec(i) <> io.in_vec(i) design flow to construct the requested memory objects.

1216

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
5. FAST C++ SIMULATOR faster than VCS.
Fast simulation is crucial to reduce hardware develop- Finally, we have developed a complete 64-bit vector pro-
ment time. Custom logic simulation engines can provide cessor including FPUs, MMUs, and 32 K 4-way set-associative
fast cycle-accurate simulation, but are generally too expen- instruction and data caches. The Chisel code was used to
sive to be used by individual designers. FPGA emulation generate an LVS and DRC-clean GDSII layout in an IBM
approaches are valuable but the FPGA tool flow can take 45 nm SOI 10-metal layer process using memory-compiler-
hours to map a design iteration. Conventional software Ver- generated 6T and 8T SRAM blocks. Total area was 1.76 mm2 ,
ilog RTL simulators are popular, as they can be run by in- with a critical path of 1 ns.
dividual designers on workstations or shared server farms,
but are slow. 7. CONCLUSION
For Chisel, we have developed a fast C++ simulator for Chisel makes the power of a modern software program-
RTL debugging. The Chisel compiler produces a C++ class ming language available for hardware design, supporting
for each Chisel design, with a C++ interface including clock- high-level abstractions and parameterized generators with-
low and clock-high methods. We rely on two techniques to out mandating a particular computational model, while also
accelerate execution. First, the simulator code generation providing high-quality Verilog RTL output and a fast C++
strategy is based on a templated C++ multi-word bit-vector simulator.
runtime library that executes all the basic Chisel operators. The Chisel system and hardware libraries are being made
The C++ templates specialize operations for bit vectors us- available as an open-source project available at:
ing a two-level template scheme that is first parameterized http://chisel.eecs.berkeley.edu
on bits and then on words. In particular, all overhead is to encourage wide adoption.
removed for the case where the RTL bit vector fits into the
host machine’s native word size. Second, we remove as much
branching as possible in the code so that we can best utilize
8. ACKNOWLEDGEMENTS
the ILP available in a modern processor and minimize the We’d like to thank Christopher Batten for sharing his fast
number of stalls. multiword C++ template library that inspired our fast emu-
lation library. We’d also like to thank all the Berkeley EECS
6. RESULTS graduate students who participated in the Chisel bootcamp
and have given feedback on Chisel after using it in various
In this section, we present preliminary results on using classes and research projects.
Chisel for various hardware designs. To measure designer
productivity, we took a simple 3-stage 32-bit RISC processor
that was originally hand-written in Verilog, and converted 9. REFERENCES
it to equivalent Chisel code. The original Verilog code was [1] Bellows, P., and Hutchings, B. JHDL – an HDL
3020 lines of code whereas the resulting Chisel code was only for reconfigurable systems. IEEE Symposium on
1046 lines, yielding a nearly 3× reduction. FPGAs for Custom Computing Machines (1998).
To compare quality of results, we used a set of floating- [2] Berry, G., and Gonthier, G. The Esterel
point primitive components we have designed in Chisel, in- synchronous programming language: Design, semantics,
cluding multiplication, addition, and several data conver- implementation. Science of Computer Programming 10,
sion operators. A 64-bit Fused-Multiply-Add (FMA) unit 2 (1992).
has been mapped to both Verilog and C++ emulation code, [3] Bluespec Inc. Bluespec(tm) SystemVerilog Reference
and both results have been simulated in testbenches using Guide: Description of the Bluespec SystemVerilog
SoftFloat and TestFloat [5] to verify IEEE-754-2008 compli- Language and Libraries. Waltham, MA, 2004.
ance. The generated Verilog was mapped to a commercial [4] Goldstein, S., and Budiu, M. Fast compilation for
65 nm process and compared to the same design described pipelined reconfigurable fabrics. ACM/FPGA
using hand-coded Verilog, and as expected there was no sig- Symposium on Field Programmable Gate Arrays (1999).
nificant difference in results: [5] Hauser, J. The softfloat and testfloat packages.
http://www.jhauser.us/arithmetic/index.html.
Source Clock Period Total Area Logic Area
[6] Jenning, J., and Beuscher, E. Verischemelog:
Chisel 7ns 62197 um2 60801 um2
Verilog embedded in scheme. Proceedings of DSL’99:
Verilog 7ns 62881 um2 61485 um2
The 2nd conference on Domain Specific Languages
Chisel 2.5ns, Retimed 66472 um2 61279 um2 (Oct 1999).
Verilog 2.5ns, Retimed 67034 um2 62227 um2
[7] Li, Y., and Leeser, M. HML – a novel hardware
To compare the speed of simulation using the Chisel C++
description language and its translation to VHDL.
simulator, we used a more sophisticated 64-bit five-stage in-
IEEE Transactions on Very Large Scale Integration
order RISC pipeline with a floating-point unit, MMU, and
(VLSI) Systems 8, 1 (Oct 2000).
caches. We compared the speed of Chisel C++ simulation
and Synopsys VCS Verilog simulation when booting a re- [8] Odersky, M. e. a. Scala programming language.
search OS on this processor (88, 291, 350 cycles total) with http://www.scala-lang.org/.
results as follows: [9] Shacham, O., Azizi, O., Wachs, M., Qadeer, W.,
Asgar, Z., Kelley, K., Stevenson, J.,
Simulator Time (s) Speedup Solomatnikov, A., Firoozshahian, A., Lee, B.,
VCS RTL simulator 5390 1.00 Richardson, S., and M., H. Rethinking digital
Chisel C++ RTL simulator 694 7.77 design: Why design must change. IEEE Micro
The Chisel-generated C++ simulator is approximately 8× (Nov/Dec 2010).

1217

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
10. SUPPLEMENTAL be precisely type checked. Unfortunately, the type system
In this section we give more detailed examples, results, is not able to infer bit widths automatically, so we have to
and discussion of Chisel. add a separate bit-width inference pass, as described below.
The advantage is that our Chisel design is more portable to
10.1 Builtin Operators other host languages.
Chisel defines a set of hardware operators for the builtin
types which can be found in Table 1. 10.4 Bitwidth Inference
Users are required to set bitwidths of ports and registers,
10.2 Layers of Languages but otherwise, bit widths on wires are automatically inferred
Scala was designed to support the creation of embedded unless set manually by the user. The bit-width inference en-
domain-specific languages. In fact, it is easy to create a gine starts from the graph’s input ports and calculates node
series of languages, one layered on top of another, resulting output bit widths from their respective input bit widths ac-
in improved clarity and efficiency in specification. As a small cording to the following set of rules:
example, we can easily build a switch statement involving operation bit width
a series of comparisons against a common key, based on the z = x + y wz = max(wx, wy) + 1
Chisel conditional updates introduced earlier. z = x - y wz = max(wx, wy) + 1
As a small example, we can easily build a switch state- z = x & y wz = max(wx, wy)
ment involving a series of comparisons against a common z = Mux(c, x, y) wz = max(wx, wy)
key, based on the Chisel conditional updates introduced ear- z = w * y wz = wx + wy
lier. z = x << n wz = wx + maxNum(n)
switch construct translates into z = x >> n wz = wx - minNum(n)
switch(idx) { when (idx === v2) { u2 } z = Cat(x, y) wz = wx + wy
is(v1) { u1 } when (idx === v1) { u1 } z = Fill(n, x) wz = wx * maxNum(n)
is(v2) { u2 }
} where for instance wz is the bit width of wire z, and the &
rule applies to all bitwise logical operations.
The switch construct supports simple specification of FSMs: The bit-width inference process continues until no bit width
changes. Except for right shifts by known constant amounts,
val s_even :: s_odd :: Nil = Enum(2){ UFix() }
val state = Reg(resetVal = s_even) the bit-width inference rules specify output bit widths that
switch (s.in) { are never smaller than the input bit widths, and thus, out-
is (s_even) { state <== s_odd } put bit widths either grow or stay the same. Furthermore,
is (s_odd) { state <== s_even }
} the width of a register must be specified by the user either
explicitly or from the bitwidth of the reset value. From these
We are exploring embedding new domain-specific languages
two requirements, we can show that the bit-width inference
in Chisel to provide high-level behavioral synthesis.
process will converge to a fixpoint.
10.3 Scala Embedding Discussion 10.5 BlackBox’s
Embedding Chisel in Scala gave a number of advantages
Users can create wrappers for existing opaque IP com-
but also presented a number of challenges.
ponents using BlackBoxes which are Components with only
In Scala, we are able to cleanly integrate Chisel compo-
IO and no body. For example, a Verilog-based memory con-
nents, bundles and interfaces with Scala classes. Using in-
troller module can be linked in by defining it as a subclass
trospection, we can find all relevant fields and their names in
of BlackBox:
Scala objects. Scala also provides a number of facilities for
writing domain-specific languages including operator over- class MemoryController extends BlackBox {
val io = new MemoryIo();
loading. }
Unfortunately, there are other areas where it is still chal-
lenging to customize the language seamlessly. The first one and then by instantiating it and connecting to it as done
is providing a succinct literal format. Unfortunately, unlike with any other Chisel component. The emitted Verilog will
Common Lisp, in Scala it is impossible to define new tokens. then contain code to create and wire in the module.
The second one is that, at least in standard Scala, it is im-
possible to overload existing syntax, such as if statements. 10.6 Vending Machine FSM Example
In general, there is no way to extend the Scala syntax in ar- Here is an example of a vending machine FSM defined
with a switch statement:
bitrary ways. Higher-order functions and lightweight thunks
class VendingMachine extends Component {
help, but the result is that the Chisel syntax is slightly more val io = new Bundle {
awkward than we’d ideally like. val nickel = Bool(INPUT)
Yet another challenge is providing informative error mes- val dime = Bool(INPUT)
val rdy = Bool(OUTPUT) }
sages. When errors occur, it is possible to provide stack val s_idle :: s_5 :: s_10 :: s_15 :: s_ok :: Nil = Enum(5){UFIx()}
backtraces to report to users on what line number an error val state = Reg(resetVal = s_idle)
occurred. Unfortunately, it is challenging to filter the stack switch (state) {
is (s_idle) {
trace to give the user the exact line the error occurred. when (io.nickel) { state <== s_5 }
Although Scala has a large number of data types, we are when (io.dime) { state <== s_10 }
not able to completely layer our hardware data types on to } is (s_5) {
these Scala ones. We instead built a parallel type hierarchy. when (io.nickel) { state <== s_10 }
when (io.dime) { state <== s_15 }
Scala has a very powerful parameterized type system that } is (s_10) {
allows us to create generic functions and classes that can when (io.nickel) { state <== s_15 }

1218

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
Example Explanation
Bitwise operators. Valid on Bits, Fix, UFix, Bool.
val invertedX = ~x Bitwise-NOT
val hiBits = x & Bits("h_ffff_0000") Bitwise-AND
val flagsOut = flagsIn | overflow Bitwise-OR
val flagsOut = flagsIn ^ toggle Bitwise-XOR
Bitwise reductions. Valid on Bits, Fix, and UFix. Returns Bool.
val allSet = andR(x) AND-reduction
val anySet = orR(x) OR-reduction
val parity = xorR(x) XOR-reduction
Equality comparison. Valid on Bits, Fix, UFix, and Bool. Returns Bool.
val equ = x === y Equality
val neq = x != y Inequality
Shifts. Valid on Bits, Fix, and UFix.
val twoToTheX = Fix(1) << x Logical left shift.
val hiBits = x >> UFix(16) Right shift (logical on Bits & UFix, arithmetic on Fix).
Bitfield manipulation. Valid on Bits, Fix, UFix, and Bool.
val xLSB = x(0) Extract single bit, LSB has index 0.
val xTopNibble = x(15,12) Extract bit field from end to start bit position.
val usDebt = Fill(3, Bits("hA")) Replicate a bit string multiple times.
val float = Cat(sign,exponent,mantissa) Concatenates bit fields, with first argument on left.
Logical operations. Valid on Bools.
val sleep = !busy Logical NOT.
val hit = tagMatch && valid Logical AND.
val stall = src1busy || src2busy Logical OR.
val out = Mux(sel, inTrue, inFalse) Two-input mux where sel is a Bool.
Arithmetic operations. Valid on Nums: Fix and UFix.
val sum = a + b Addition.
val diff = a - b Subtraction.
val prod = a * b Multiplication.
val div = a / b Division.
val mod = a % b Modulus
Arithmetic comparisons. Valid on Nums: Fix and UFix. Returns Bool.
val gt = a > b Greater than.
val gte = a >= b Greater than or equal.
val lt = a < b Less than.
val lte = a <= b Less than or equal.

Table 1: Chisel operators on builtin data types.

when (io.dime) { state <== s_ok } ulation performance depends on the number of target cycles
} is (s_15) { to be simulated. While the Chisel C++ emulator runs ap-
when (io.nickel) { state <== s_ok }
when (io.dime) { state <== s_ok } proximately 10× faster than VCS, as shown in Figure 2, this
} is (s_ok) { advantage is only realized when simulating millions of cycles
state <== s_idle or more. FPGA emulation is only fastest for simulations
}
} exceeding billions of target cycles. We are planning to ex-
io.rdy := (state === s_ok) periment with techniques to improve the compile-time per-
} formance of the Chisel-generated C++ code, possibly with
switches to optimize for compile-time or run-time.
10.7 Simulation Performance
In Section 6 we compared simulation speed for a 64-bit 10.8 FIFO
five-stage RISC processor design using a Chisel-generated A generic FIFO could be defined as shown in Figure 3 and
C++ simulator and Synopsys VCS Verilog simulation. Ta- used as follows:
ble 2 is a more complete breakdown of the results in terms class DataBundle() extends Bundle {
of compile time, run time, and total time. We also include val A = UFix(width = 32);
results for a Chisel-generated FPGA emulation, which pro- val B = UFix(width = 32);
vides the fastest per-cycle emulation performance but with }
a large compile time. object FifoDemo {
Because of compilation time, the fastest backend for sim- def apply () = (new Fifo(32)){ new DataBundle() };

1219

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2
Simulator Compile Compile Run Run Total Total
Time (s) Speedup Time (s) Speedup Time (s) Speedup
VCS RTL simulator 22 1.000 5368 1.00 5390 1.00
Chisel C++ RTL simulator 119 0.184 575 9.33 694 7.77
Virtex-6 FPGA 3660 0.006 76 70.60 3736 1.44

Table 2: Comparison of simulation time between Chisel C++ simulator, Synopsys VCS Verilog simulation,
and FPGA emulation, on a 64-bit five-stage RISC processor running an OS boot test.

class FifoIO[T <: Data]()(gen: => T) extends Bundle() {


val enq_val = Bool(INPUT)
6
10 val enq_rdy = Bool(OUTPUT)
Chisel C++ (gcc −O0) val deq_val = Bool(OUTPUT)
Chisel C++ (gcc −O3) val deq_rdy = Bool(INPUT)
5 val enq_dat = gen.asInput
10 Chisel Verilog (VCS) val deq_dat = gen.asOutput
Chisel Verilog (Virtex−6) }

4 class Fifo[T <: Data] (n: Int)(gen: => T) extends Component {


10
Time (s)

val io = new FifoIO()( gen )


val enq_ptr = Reg(resetVal = UFix(0, sizeof(n)))
val deq_ptr = Reg(resetVal = UFix(0, sizeof(n)))
3 val is_full = Reg(resetVal = Bool(false))
10
val do_enq = io.enq_rdy && io.enq_val
val do_deq = io.deq_rdy && io.deq_val
val is_empty = !is_full && (enq_ptr === deq_ptr)
2
10 val deq_ptr_inc = deq_ptr + UFix(1)
val enq_ptr_inc = enq_ptr + UFix(1)
val is_full_next =
1 Mux(do_enq && ~do_deq && (enq_ptr_inc === deq_ptr), Bool(true),
10 5 6 7 8 9 Mux(do_deq && is_full, Bool(false),
10 10 10 10 10 is_full))
Simulated Cycles enq_ptr <== Mux(do_enq, enq_ptr_inc, enq_ptr)
deq_ptr <== Mux(do_deq, deq_ptr_inc, deq_ptr)
is_full <== is_full_next
Figure 2: A comparison of total time required to val ram = Mem(n, do_enq, enq_ptr, io.enq_dat)
io.enq_rdy := !is_full
compile and simulate a system using various back- io.deq_val := !is_empty
ends from Chisel. ram.read(deq_ptr) <> io.deq_dat
}

}
Figure 3: Parameterized FIFO example.
It is also possible to define a generic decoupled interface:
class ioDecoupled[T <: Data]()(data: => T) extends Bundle() {
val ready = Bool(INPUT)
val valid = Bool(OUTPUT)
simulator due to the low-level structural nature of the Ver-
val bits = data.asOutput ilog code generated by Chisel. However, we have not yet
} tuned the Verilog output for Verilog simulation performance,
This template can then be used to add a handshaking pro- and we believe even the current slowdown is acceptable to
tocol to any set of signals: enable co-simulation.
class decoupledDemo extends ioDecoupled()( new DataBundle() ) 10.10 Chisel Components
The FIFO interface in Figure 3 can be now be simplified as Chisel has been in use for over a year and a number of
follows: components have been written in it. We developed the fol-
class FifoIO[T <: Data]()(gen: => T) extends Bundle() { lowing components as part of our research infrastructure,
val enq = new ioDecoupled()( gen ).flip() many of which are used in the vector processor described in
val deq = new ioDecoupled()( gen )
} Section 10.11:
• clock dividers
10.9 Generated Verilog • queues
Running the Chisel compiler on the FIFO example gener-
• decoders, encoders, popcount
ates the Verilog code shown in Figure 4 .
The Verilog output from Chisel might need to be sim- • scoreboards
ulated together with other existing Verilog IP blocks. We • integer ALUs
compared the Verilog simulation speed of the Chisel-generated
Verilog versus hand-written behavioral Verilog for a 64-bit • LFSR
data-parallel processor design, including pipelined single and • Booth multiplier, iterative divider
double-precision FMA units, and a pipelined 64-bit integer • ROMs, RAMs, CAMs
multiplier. We ran 92 test assembly programs on both VCS-
generated simulators. The Chisel-generated Verilog simula- • TLB
tor was 1.65× slower in total than the behavioral Verilog • direct-mapped caches, set-associative blocking caches

1220

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.
52.2

module Fifo(input clk, input reset,


input io_enq_val,
output io_enq_rdy,
output io_deq_val,
input io_deq_rdy,
input [31:0] io_enq_dat_A,
input [31:0] io_enq_dat_B,
output[31:0] io_deq_dat_A,
output[31:0] io_deq_dat_B);

wire T0;
wire is_empty;
wire T1;
reg[4:0] deq_ptr;
wire[4:0] T2;
wire[4:0] deq_ptr_inc;
wire do_deq;
reg[4:0] enq_ptr;
wire[4:0] T3;
wire[4:0] enq_ptr_inc; Figure 5: Data-parallel processor layout results
wire do_enq;
wire T4;
reg[0:0] is_full;
wire is_full_next;
• direct-mapped caches, set-associative non-blocking caches
wire T5; • prefetcher
wire T6;
wire T7; • fixed-priority arbiters, round-robin arbiters
wire T8;
wire T9; • single-precision, double-precision floating-point units
wire T10;
wire T11;
• 64-bit decoupled in-order single-issue 5-stage processor
• 64-bit vector unit (data-parallel processor)
assign io_deq_val = T0;
assign T0 = ! is_empty; We are working to factor these components into a standard
assign is_empty = T11 && T1;
assign T1 = enq_ptr == deq_ptr; library from which developers can more readily build large-
assign T2 = do_deq ? deq_ptr_inc : deq_ptr; scale designs.
assign deq_ptr_inc = deq_ptr + 1’h1/* 1*/; We have taught a class in advanced computer architecture
assign do_deq = io_deq_rdy && io_deq_val;
assign T3 = do_enq ? enq_ptr_inc : enq_ptr;
design where all students produced projects in Chisel. Ex-
assign enq_ptr_inc = enq_ptr + 1’h1/* 1*/; ample projects included accelerators for security, FFT, and
assign do_enq = io_enq_rdy && io_enq_val; spatial computing.
assign io_enq_rdy = T4;
assign T4 = ! is_full;
Additionally, Berkeley EECS graduate student Chris Ce-
assign is_full_next = T7 ? 1’h1/* 1*/ : T5; lio is developing a number of educational processor microar-
assign T5 = T6 ? 1’h0/* 0*/ : is_full; chitectures with associated labs to help undergraduates learn
assign T6 = do_deq && is_full; computer architecture. These included a microcoded pro-
assign T7 = T9 && T8;
assign T8 = enq_ptr_inc == deq_ptr; cessor, one-stage, two-stage, and five-stage pipelines, and
assign T9 = do_enq && T10; an out-of-order processor, all with accompanying visualiza-
assign T10 = ~ do_deq; tions.
assign T11 = ! is_full;

always @(posedge clk) begin 10.11 Data-Parallel Processor Layout Results


if(reset) begin The data-parallel processor layout results using IBM 45nm
deq_ptr <= 5’h0/* 0*/;
end else if(1’h1/* 1*/) begin SOI 10-metal layer process using memory compiler gener-
deq_ptr <= T2; ated 6T and 8T SRAM blocks are shown in Figure 5.
end
if(reset) begin
enq_ptr <= 5’h0/* 0*/;
end else if(1’h1/* 1*/) begin
enq_ptr <= T3;
end
if(reset) begin
is_full <= 1’h0/* 0*/;
end else if(1’h1/* 1*/) begin
is_full <= is_full_next;
end
end
endmodule

Figure 4: Verilog Generated from Chisel for the


FIFO example

1221

Authorized licensed use limited to: University of Athens. Downloaded on September 29,2021 at 18:28:07 UTC from IEEE Xplore. Restrictions apply.

You might also like