

ACM/IEEE Design Automation Conference (DAC) 2013

FPGA Code Accelerators - The Compiler Perspective


Walid Najjar
University of California Riverside
Computer Science & Engineering
[email protected]

Jason Villarreal
Jacquard Computing Inc.
Riverside, CA
[email protected]

ABSTRACT
FPGA-based accelerators have repeatedly demonstrated superior speed-ups on an ever-widening spectrum of applications. However, their use remains beyond the reach of traditionally trained applications code developers because of the complexity of their programming tool-chain. Compilers for high-level languages targeting FPGAs have to bridge a huge abstraction gap between two divergent computational models: a temporal, sequentially consistent, control-driven execution in the stored-program model versus a spatial, parallel, data-flow driven execution in the spatial hardware model. In this paper we discuss these challenges to the compiler designer and report on our experience with the ROCCC toolset.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—Retargetable compilers; optimizations; B.5.2 [Register-Transfer-Level Implementation]: Design Aids

General Terms
Algorithms, Design, Performance.

Keywords
FPGAs. Compiler. Hardware Accelerators.

1. INTRODUCTION
In recent years we have witnessed a dramatic widening of the scope of use of FPGAs as computing devices. It is driven by a variety of factors: their larger size enabling a very large degree of parallelism, a richer set of embedded functionalities (RAM, DSP, etc.), high efficiency compared to software coupled with re-programmability, lower energy per task, and a high I/O bandwidth that eliminates the need for memory off-loading of data. A general-purpose use of FPGAs as accelerators was already described a few years after the introduction of the first device [2]. However, the main obstacle facing a wider use of FPGAs as code accelerators is their poor programmability using high-level programming languages (HLLs). The challenge lies in the translation of a stored-program machine, the high-level language, to a spatial and parallel computing structure with no instruction set architecture and no pre-determined control structures.

In this paper we revisit the earliest documented use of FPGAs as code accelerators and discuss various aspects of the challenge of compiling HLLs to accelerators mapped onto FPGAs (Section 2). Section 3 describes the difficulties inherent in bridging the abstraction gap between high-level languages and hardware circuits. In Section 4 we describe the ROCCC (Riverside Optimizing Compiler for Configurable Computing) [1] approach to compiling code accelerators for FPGAs, along with its programming model and its code optimizing transformations. Section 5 uses these transformations in a high-level design space exploration.

2. THE HISTORICAL PERSPECTIVE
The first documented use of FPGAs as code accelerators appeared, to our knowledge, just four years after the introduction of the first SRAM-based FPGA device (Xilinx, 1985). The PAM (Programmable Active Memory) [2], built at the DEC Paris Research Lab, is described as a "universal hardware co-processor closely coupled to a standard host computer." Ten benchmark codes were implemented and evaluated on PAM [3], including long multiplication, RSA cryptography, Ziv-Lempel compression, edit distance calculations, heat and Laplace equations, N-body calculations, binary 2D convolution, the Boltzmann machine model, 3D graphics (including translation, rotation, clipping and perspective projection) and the discrete cosine transform. The authors' conclusion was that PAM delivered a performance comparable to that of the ASIC chips or supercomputers of the time, and was one to two orders of magnitude faster than software. They also state that because of the PAM's large off-chip I/O bandwidth (6.4 Gb/s) it was ideally suited for "... on-the-fly data acquisition and filtering ...".

What has changed in the nearly 25 years since the first PAM? FPGAs are much larger and faster, and the application domains have grown in scope following the growth in the size and speed of the devices. However, the main challenge to FPGAs as code accelerators, namely the abstraction gap between application development and FPGA programming, not only remains unchanged but has probably gotten worse due to the increase in complexity of the applications enabled by the larger device sizes.

3. THE ABSTRACTION GAP
In this section we discuss two issues that define the complexity of compiling HLLs to hardware circuits: (1) the semantic gap between the sequential stored-program execution model implicit in these languages and the spatial, parallel model of hardware, and (2) the effects of virtualization, or lack thereof, on the complexity of the compiler.
3.1 Semantics of the Execution Model
CPUs and GPUs are inherently stored-program machines (or von Neumann machines) and so are the programming languages used on them, essentially most of the languages in use today. As such they are bound by the sequential consistency of that model, both at the language and machine levels. While CPU and GPU architectures exploit various forms of parallelism (instruction, data and thread-level), they do so by circumventing the sequential consistency implied in the source code internally (branch prediction, out-of-order execution, SIMD parallelism, etc.) while preserving the appearance of a sequentially consistent execution externally (reorder buffers, precise interrupts, etc.). Von Neumann, or imperative, languages are even more constrained by sequential consistency: sequential execution, pre-determined control flow structures, etc. The compilation of HLL code to a CPU or GPU is therefore the translation from one stored-program machine model, the HLL, to another, the machine's ISA. A digital circuit, on the other hand, is inherently parallel and spatial, with distributed storage, timed behavior, etc. The abstraction and semantic gap between the hardware and software computing models is summarized in Table 1. Translating an HLL to a circuit requires a transformation of the sequential model to a spatial/parallel one, with the creation of custom sequencing, timed synchronizations, distributed storage, pipelining, etc.

Table 1. Features of hardware and software computing models

                        Stored Program                     Spatial Computing
Storage & data access   Central, large, virtualized.       Distributed, small, physical.
                        Memory resident,                   Streaming, limited memory and
                        multi-level caches                 caching. No virtualization
Parallelism             Dynamic - separate                 Static - combined
                        ILP, DLP, TLP                      ILP, DLP, TLP
Sequencing              Central, static,                   Data-flow,
                        appearance of SC                   asynchronous
Data-Path               Pre-designed, one size fits all.   Customized, very deep pipelines.
                        Dynamic data dependencies          No dynamic data dependencies
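To make the gap concrete, consider a one-line kernel (our example, not from the paper). In the stored-program model the multiply and the add of each iteration are fetched and issued as instructions, one iteration after another; a circuit instead instantiates one multiplier and one adder in space and streams all N iterations through them as a pipeline.

void Scale(int* a, int* b, int* c, int k, int N) {
  int i ;
  for (i = 0 ; i < N ; ++i)
    c[i] = a[i] * b[i] + k ;  // in hardware: one multiplier feeding one adder
}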
Raising the abstraction level of FPGA programming to that of CPU or GPU programming is a daunting task that is yet to be fully completed. It is of critical importance in the programming of accelerators, as opposed to the high-level design of arbitrary digital circuits, which is the focus of high-level synthesis.

[Figure 1: The programming abstraction gap between HLLs and FPGAs. The ROCCC approach focuses on a subset of C to generate code accelerators.]

Code accelerators differ from general-purpose logic design in one important way: the starting point of logic design is a device whose behavior is specified by a hardware description code implemented in a hardware description language (HDL) such as VHDL, Verilog, SystemC, SystemVerilog, or Bluespec. The starting point of a code accelerator is an existing software application, a subset of which, being frequently executed, is translated into hardware. That subset is, quasi by definition, a loop nest. Hopefully that loop nest is parallelizable and can therefore exploit the FPGA resources. By focusing on loop nests, the task of compiling HLLs to FPGAs is simplified and opportunities for loop transformations and optimizations abound. This is the approach taken by the ROCCC compiler (Figure 1) and is described in the rest of this paper.

3.2 Virtualization
Virtualization is probably one of the greatest achievements of modern computer system design: when a CPU issues a load instruction to an address in memory it is not aware of its actual physical location; the loaded data may not be in its cache or physical memory, it may not even be in the same time zone as the CPU! Thanks to multiple layers of hardware and software support, the world of the CPU is a single-dimensional memory as large as its address space¹. Obviously, this storage model is a perfect fit to the one implicit in all commonly used HLLs: one large flat array of bytes.

¹ This has not always been the case; there was a time when users, or compilers, had to actively manage the memory allocation of data and code.

FPGA-based accelerators do not enjoy, yet, such sophisticated levels of virtualization. Rather, the compiler must be aware of all data placements, on and off-chip, and actively manage the interfaces to one or more memory modules as well as data streaming interfaces (e.g. PCI, USB, Ethernet, etc.). Furthermore, most HLLs do not support streams as programming constructs or indications of physical data locations. The compiler must therefore manage the data location, both off and on-chip, with no support in the HLL, through pragmas or GUI-based indications from the user.

Each of these interfaces, to physical memories or streams, implies a preset data width, an addressing modality (bursts or singletons) and a mapping to a single or multiple data channels on the circuit. None of these parameters is supported in the HLL, let alone in the intermediate representation (IR) the compiler uses to generate the code. On a CPU or GPU, all data values entering the data-path come from the L1 cache with a pre-determined timing pattern. Consider a loop body that accesses four arrays: two resident in separate memory modules and two arriving as streams. Its data-path on an FPGA requires four physical data interfaces, each with its own timing pattern, which could raise very significant timing and synchronization issues.
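As a sketch of the scenario just described (the arrays and names here are ours, purely illustrative): suppose A and B are resident in two distinct memory modules while C and D arrive on two streaming interfaces. On a CPU all four reads would be served by the L1 cache; the FPGA data-path needs four physical interfaces, each with its own width, burst behavior and timing, all of which the compiler must schedule.

void FourInterfaces(int* A, int* B, int* C, int* D, int* Out, int N) {
  int i ;
  for (i = 0 ; i < N ; ++i)
    Out[i] = (A[i] + B[i]) * (C[i] - D[i]) ;  // four independent data interfaces feed one data-path
}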
4. THE ROCCC APPROACH
As mentioned above, the objective of ROCCC [1] is the generation of efficient customized hardware accelerators for frequently executing code segments, namely loop nests. Its target audience is application code developers with hardly any training as hardware designers, the objective being to make FPGA-based code accelerators accessible to a wider spectrum of users. As such, ROCCC is not a general-purpose high-level logic design tool; rather, its focus is on generating hardware accelerators from existing C codes with minimal modifications to the source code. The same code can be compiled for software execution or translated to hardware.

The ROCCC compiler supports an extensive set of loop optimizations and transformations. One of the driving philosophies of ROCCC is that there should be one source code description of an accelerator that can be compiled, using different transformations, into multiple hardware implementations. All transformations are therefore selected in the GUI by the user, and not by re-implementing the source code. Users are given the flexibility to choose which optimizations and transformations to perform on each individual loop and exactly how to apply them to different parts of their application. Control of optimizations is given to the user as options in the GUI, and each can have a dramatic effect on the generated hardware [7][5][4]. It would have been possible, in some limited instances, to let the compiler automatically decide which transformations should be applied. This would imply having, in the compiler, knowledge of all possible FPGA platforms, i.e. system and board architectures, current and future.

The objectives of ROCCC in generating code accelerators are (1) maximizing throughput through parallelism, (2) minimizing the area occupied by the circuit, (3) reusing data fetched off-chip [6] and (4) pipelining to reduce clock cycle time. ROCCC favors throughput over space, so, under user control, it can generate as much hardware as necessary to maximize parallelism. The data-path generated is purely data driven, with no FSM created for resource sharing: data is pushed onto the top and flows through without any control. Minimal control logic is generated to keep track of which pipeline stages are active so the hardware can output values at the correct clock cycle.

4.1 The ROCCC Programming Model
ROCCC code is a subset of C. All ROCCC code can be compiled and run with a normal software compiler such as gcc and will generate the same output as the ROCCC-generated hardware from the same source. The limitations of ROCCC, compared to C, are (1) no recursion and (2) no arbitrary use of pointers that the compiler cannot un-alias statically. The use of dynamic pointers inside loop bodies would result in multiple memory de-referencing accesses being serialized, for consistency reasons, and would eliminate the parallelism.
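A hypothetical fragment outside the ROCCC subset illustrates the second restriction: the loads through node cannot be un-aliased statically, so successive dereferences would have to be serialized and no parallel pipeline could be built.

struct Node { int value ; struct Node* next ; } ;

int SumList(struct Node* node) {
  int sum = 0 ;
  while (node != 0) {     // pointer-chasing loop: not in the ROCCC subset
    sum += node->value ;
    node = node->next ;   // next address known only after the previous load
  }
  return sum ;
}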
4.1.1 Bottom-up design and code reuse
Just as in software construction, designs for hardware accelerators can benefit from opportunities for code reuse and raising the abstraction level. ROCCC is designed to support a modular approach to hardware accelerator design, enabling reuse of components and ease of design space exploration [8][9].

C code compiled by ROCCC falls under one of two categories: modules or systems. Both modules and systems are represented in C as a function call and can be compiled with gcc to perform the same operations in software as in hardware.

Module code describes components, which perform a computation on scalar inputs and generate a set of scalar outputs. They are translated into pipelined hardware structures that can take a set of inputs every clock cycle and generate a set of outputs every clock cycle. Each module is itself a complete hardware implementation and may be used by itself as a complete design, or may be included as a component in larger modules or systems.

Each module may have different optimizations performed on it in order to best suit the user's specific needs with regard to clock speed or area. Modules included in larger designs are treated as black boxes by the compiler so as not to affect any implementation decisions made at the lower level. Treating module instantiations as black boxes could obscure some optimization opportunities, so inlining is given as an option if the user wants to take advantage of coding at a higher level but has no fixed requirement for the low-level components. All modules and systems are stored in a database, supported in the GUI, from where the user can drag and drop them into other projects.

An example module that sorts two values is shown in Figure 2. In C, this code takes two integers and returns two integers in sorted order. When compiled with ROCCC, this generates a pipelined component that can take two integers every clock cycle and generate two sorted integers every clock cycle. The generated hardware is purely computational and consists of a pipeline that performs a comparison and two multiplexors.

void BitonicSort2(int a, int b, int& o1, int& o2)
{
  if (a < b) { o1 = a ; o2 = b ; }
  else       { o1 = b ; o2 = a ; }
}

Figure 2: Bitonic sort module for two numbers
System code describes computational kernels that may apply large amounts of computation on input streams of data and generate output streams of data. Stream connections in hardware are inferred from array accesses in C. Figure 3 shows an example system that performs the median filter operation on a 3x3 window of an NxN stream. The call to BitonicSort9 is a function call in C, but is translated into an instantiation of the BitonicSort9 module and placed in a pipeline when converted to hardware. The BitonicSort9 module is not shown, but is constructed by instantiating many copies of the BitonicSort2 module in a butterfly network. An input stream and an output stream are inferred from the parameters A and Out respectively, and result in hardware that communicates with memory in order to feed the pipeline elements from A and store output to Out.

#include "roccc-library.h"

void MedianFilter(int** A, int N, int** Out) {
  int i, j ;
  int s1, s2, s3, s4, s5, s6, s7, s8, s9 ;
  for (i = 0 ; i < N ; ++i) {
    for (j = 0 ; j < N ; ++j) {
      BitonicSort9(A[i][j],   A[i][j+1],   A[i][j+2],
                   A[i+1][j], A[i+1][j+1], A[i+1][j+2],
                   A[i+2][j], A[i+2][j+1], A[i+2][j+2],
                   s1, s2, s3, s4, s5, s6, s7, s8, s9) ;
      Out[i][j] = s4 ;
    }
  }
}

Figure 3: Median filter on a 3x3 window using the bitonic sort module
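The paper does not show BitonicSort9 itself. As a rough sketch of how a nine-input sorter can be assembled from the module of Figure 2, the following uses an odd-even transposition network, our substitution for the authors' butterfly network: it is simpler to write but larger, and when the loops are fully unrolled it is a fixed network of 36 BitonicSort2 instances (nine passes of four compare-exchanges each).

void Sort9(int a1, int a2, int a3, int a4, int a5, int a6, int a7, int a8, int a9,
           int& o1, int& o2, int& o3, int& o4, int& o5, int& o6, int& o7, int& o8, int& o9)
{
  int v[9] = { a1, a2, a3, a4, a5, a6, a7, a8, a9 } ;
  int pass, k ;
  for (pass = 0 ; pass < 9 ; ++pass)          // n passes sort n values
    for (k = pass % 2 ; k + 1 < 9 ; k += 2)   // alternate (0,1),(2,3),... and (1,2),(3,4),...
      BitonicSort2(v[k], v[k+1], v[k], v[k+1]) ;  // one comparator module per pair
  o1 = v[0] ; o2 = v[1] ; o3 = v[2] ; o4 = v[3] ; o5 = v[4] ;
  o6 = v[5] ; o7 = v[6] ; o8 = v[7] ; o9 = v[8] ;
}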
The generated data-path with no optimizations specified requires nine elements from A each clock cycle in order to generate one output each clock cycle. On the first iteration all of these values must be fetched from memory, but subsequent iterations need to fetch only three new elements from memory and can reuse six elements.

The parameter N to the function in Figure 3 is translated into an input scalar. A connection is made in the generated hardware to a register that is read once at the beginning of execution and then kept constant.
4.1.2 Data types
In addition to the standard C data types (char, int, long, float, double), ROCCC supports variable bit width data types, both integer and fixed-point.

ROCCC does not assume a fixed target data-path width, so operations such as addition and multiplication do not need to be truncated after every step. The user can elect to maximize precision or adopt a C-like truncation model. For example, the addition of two eight-bit numbers in software will result in an eight-bit value, but in the generated hardware the result can be stored and used as a nine-bit value.
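At the source level the eight-bit example might read as follows; the typedef spelling below is ours and only illustrative, as ROCCC's actual declaration syntax for variable bit width types may differ.

typedef int roccc_int8 ;   // hypothetical 8-bit integer type
typedef int roccc_int9 ;   // hypothetical 9-bit integer type

void Add8(roccc_int8 a, roccc_int8 b, roccc_int9& sum)
{
  // In software the sum of two 8-bit values is truncated back to 8 bits;
  // in the generated hardware the full 9-bit result can be kept and used.
  sum = a + b ;
}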
Floating-point operations are assumed to be present in software, but require hardware components. Different FPGA platforms may have varying levels of support for floating-point operations, and since the hardware generated is not specific to a certain platform, no assumptions can be made about the target platform's resources. As a solution, ROCCC gives the user the ability to manage a library of intrinsics, which include floating-point operations and integer division. These libraries are reflections of hardware libraries, such as cores generated by Xilinx Core Generator [12], and generate connections to platform-specific cores that handle the floating-point operations. Changing the libraries can affect the performance of the hardware, but is done purely through the GUI and has no effect on the source code.

ROCCC also supports user-defined tables that are accessed by the data-path. These can be read-only or random access. Some operations are more efficient when implemented as a look-up table rather than an actual circuit; division is used as an example later in this paper. These tables are implemented using BRAMs when available. Random access tables may be written once per loop iteration but may be read as many times as necessary in each loop iteration.
4.1.3 Importing external modules
In many cases the development of hardware accelerators requires IP that was created outside the scope of the project and must be integrated into a larger design. Just as ROCCC is designed to integrate modules into larger designs, external IP can be imported and instantiated. Importing external IP requires the user to provide a description of the inputs, outputs, and latency of the core through the GUI. A wrapper with default parameters, such as stall and done signals, connects the external IP to the generated data-paths. Calling external IP is identical to a module instantiation and appears as a function call in C. External IP calls, as well as module instantiations, can be inserted into application code directly through the GUI.

4.2 Transformations and Optimizations
A major goal of ROCCC is to enable the exploration of large design spaces through the tuning of optimizations on unchanging source code. Two types of transformations are exposed to the user in the GUI to facilitate design space exploration. High-level transformations control the overall structure of the generated hardware and can be used to create different memory configurations; these include inlining of modules, redundancy, loop optimizations, temporal common subexpression elimination, and systolic array generation. Low-level transformations control the utilization of the underlying hardware; these include pipelining control and fan-out tree generation.

Loop unrolling typically increases parallelism but also increases the necessary bandwidth to sustain a high throughput, as well as the area used. For example, if the loop in Figure 3 is unrolled once so that two loop bodies are performed each iteration, throughput is doubled as two values are generated every clock cycle. However, the resulting hardware requires two new data elements from the A input stream every clock cycle in order to maintain this. ROCCC provides fine-grained control over loop unrolling and stream connections to a degree not normally seen. Individual loops can be unrolled different amounts independently of one another, creating memory access requirements specific to individual streams.
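In source form, unrolling the inner loop of Figure 3 once corresponds to the sketch below (shown only for illustration; ROCCC applies the transformation internally and the source is never edited). Two windows are evaluated per iteration, doubling both the throughput and the per-cycle demand on the A stream.

int t1, t2, t3, t4, t5, t6, t7, t8, t9 ;
for (i = 0 ; i < N ; ++i) {
  for (j = 0 ; j < N ; j += 2) {
    BitonicSort9(A[i][j],   A[i][j+1],   A[i][j+2],
                 A[i+1][j], A[i+1][j+1], A[i+1][j+2],
                 A[i+2][j], A[i+2][j+1], A[i+2][j+2],
                 s1, s2, s3, s4, s5, s6, s7, s8, s9) ;
    Out[i][j] = s4 ;
    BitonicSort9(A[i][j+1],   A[i][j+2],   A[i][j+3],    // second, shifted window
                 A[i+1][j+1], A[i+1][j+2], A[i+1][j+3],
                 A[i+2][j+1], A[i+2][j+2], A[i+2][j+3],
                 t1, t2, t3, t4, t5, t6, t7, t8, t9) ;
    Out[i][j+1] = t4 ;
  }
}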
By default, each input and output stream has one channel to memory through which all values must go. If there are multiple values generated in one clock cycle but only one output stream channel, the data must be serialized. The number of channels to memory may be configured on a stream-by-stream basis for each input and output stream. Each stream may be configured to have the number of memory channels needed to support the highest possible throughput. Conversely, the streams can be tuned to read fewer elements per clock cycle on hardware platforms that cannot support the ideal bandwidth. For multidimensional stream support, the memory channels are further split into address channels and data channels. Loop unrolling in multiple dimensions has different consequences on the resulting hardware depending on which loop is unrolled. Unrolling the outer loop results in more rows being fetched every clock cycle, which can be processed by increasing the number of data channels. Unrolling the inner loop results in an increase in the size of each burst that is fetched, but not in the number of channels available. Unrolling either loop has the potential to increase parallelism.

Temporal common subexpression elimination [9] identifies computations that are identical across consecutive iterations of a loop and replaces those computations with a register. This can drastically reduce the area requirements by eliminating large blocks of hardware. A consequence of this optimization is that some memory fetches might be determined to be unnecessary, changing the stream interface.
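A minimal before/after picture of the optimization (our example): in c[i] = k*a[i] + k*a[i+1], the product k*a[i+1] of iteration i is exactly k*a[i] of iteration i+1, so one of the two multipliers can be replaced by a register that carries the product across iterations.

int i ;
int t = k * a[0] ;                // priming read before the loop
for (i = 0 ; i < N ; ++i) {
  int u = k * a[i+1] ;            // the only multiplier left in the data-path
  c[i] = t + u ;
  t = u ;                         // register: reused by the next iteration
}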
Systolic array generation [5] completely transforms a two-dimensional computation into a one-dimensional computation with much less area and high throughput. The memory connections of the generated hardware are changed by this optimization.

Different hardware platforms have different characteristics, such as the number of inputs per LUT, which can have an effect on the relative cost of individual operations. When generating a hardware pipeline, the decision of how many basic operations to put into each level of the pipe depends on this information. As the compiler has no knowledge of the underlying restrictions, this control is again passed to the user.

The GUI provides both a basic slider to control the pipeline construction and the advanced capability to specify the relative cost of each basic operation on the underlying platform. Without changing the source code, many different pipelines can be created, exploring the tradeoff of clock speed versus latency and area.

Another characteristic that differs from device to device is the amount of routing resources. While high fan-out is to be avoided in general, the specific limit on the amount of fan-out per element is platform specific. Again, this control is given to the user in order to control potential routing issues at the high level without rewriting the application.

5. DESIGN SPACE EXPLORATION
In this section we examine the effect of the high- and low-level transformations on clock speed and area on a concrete hardware platform. The implementations were synthesized and placed and routed for a Xilinx Virtex 6 LX760 FPGA.

Median Filter – Loop Unrolling and Throughput.
Shown in Figure 3, the median filter works on a 3x3 sliding window of 8-bit data over a large 2D array. The 8-bit data is meant to be representative and not restrictive; similar results can be achieved for other bit widths. It uses the bitonic sort module (Figure 2) and has a latency of 50 cycles. The application is synthesized, placed and routed on the Xilinx Virtex 6 LX760, with a generic wrapper consisting of two sets of dual-clock BRAMs connected to the I/O pins that act as input and output to the ROCCC-generated code.

Results for Median Filter are shown in Table 2. Each row shows the effect on area, clock speed, throughput, and throughput per unit area resulting from unrolling the outer loop and adjusting the input and output memory channels appropriately. Throughput per unit area is reported in MB/s/slice and represents the gain in throughput with respect to the amount of area added with each transformation.

Table 2: Impact of loop unrolling on Median Filter

In/Out Channels   Clock (MHz)   Area (slices)   Throughput (MB/s)   Throughput/area (MB/s/slice)
1/1               225           735             75                  0.102
3/1               225           766             225                 0.294
4/2               225           1215            450                 0.370
8/6               200           3160            1200                0.380

The first row of Table 2 represents the base configuration, where no transformations have taken place and the code was compiled with the default options. In this case ROCCC generates hardware that has only one input channel and one output channel. Before any input can be processed, the hardware has to read three elements from the one input channel, which takes three clock cycles, effectively cutting the throughput to one third of its potential.

The second row shows the effect of specifying three input memory channels with no other transformations. This allows all the necessary data to be read in one clock cycle, allowing an output to be generated every clock cycle and resulting in a tripling of throughput. The hardware has to deal with multiple connections, but some internal components that serialized the incoming data are simplified in this implementation, so the net increase in area is small.

The third and fourth rows show the effect of unrolling the outer loop once and six times, corresponding to connecting to an interface of 32 bits and 64 bits respectively. Each unrolling allows the number of input and output channels to increase and still produce all outputs every clock cycle, resulting in a large increase in throughput and maximizing the throughput per unit area for this experiment.
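The throughput column of Table 2 can be checked directly from the clock rate and the number of 8-bit outputs sustained per cycle (our arithmetic, shown for verification):

  1/1:  225 MHz x 1 byte / 3 cycles per output = 75 MB/s
  3/1:  225 MHz x 1 byte per cycle             = 225 MB/s
  4/2:  225 MHz x 2 bytes per cycle            = 450 MB/s
  8/6:  200 MHz x 6 bytes per cycle            = 1200 MB/s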
Average Filter – Lookup Tables and Arithmetic Cores.
Average Filter computes the average of each 3x3 sliding window in the input array. We compare two versions where the division is either implemented as a look-up table or as an instantiation of an IP core generated by Xilinx Core Generator. Results are shown in Table 3.

For all transformations the achievable clock speed was 225 MHz. Again, the first row shows the default configuration with one input and one output, the second row shows three input channels but no other transformations, and the third and fourth rows show the configurations of unrolling the outer loop once and six times to interface with a 32-bit and a 64-bit interface. In addition to the loop transformations, the Average Filter example was synthesized using both a ROCCC-compiled look-up table (reported in the column labeled Area Table) and a generated integer division core (reported in the column labeled Area Divider).

Table 3: Average Filter implementations using table lookup or integer division core (clock in both cases is 225 MHz)

In/Out Channels   Area Table (slices)   Area Divider (slices)   Throughput (MB/s)   Throughput/area (MB/s/slice)
1/1               218                   283                     75                  0.344
3/1               225                   275                     225                 1.00
4/2               253                   351                     450                 1.78
8/6               498                   826                     1350                2.71

The two versions provide similar throughput, while the table implementation takes less area, even when unrolling causes duplication of the table to support multiple reads per clock cycle. The throughput per unit area in Table 3 is reported for the table implementation, which is the more space-efficient design. Using lookup tables and unrolling the loop provides nearly an 8X improvement in throughput per unit area over the default case.
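A sketch of the look-up-table version (our illustration; the actual table is produced by the compiler from user settings): for 8-bit pixels the 3x3 window sum lies in 0..2295, so the division by 9 can be a single table read, which maps naturally onto a BRAM.

int divByNine[2296] ;   // precomputed so that divByNine[s] == s / 9

void AverageFilter(int** A, int N, int** Out) {
  int i, j, sum ;
  for (i = 0 ; i < N ; ++i) {
    for (j = 0 ; j < N ; ++j) {
      sum = A[i][j]   + A[i][j+1]   + A[i][j+2]
          + A[i+1][j] + A[i+1][j+1] + A[i+1][j+2]
          + A[i+2][j] + A[i+2][j+1] + A[i+2][j+2] ;
      Out[i][j] = divByNine[sum] ;   // one table read replaces a divider core
    }
  }
}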
Max Filter – Temporal Common Sub-expression Elimination.
Max Filter computes the maximum value in a sliding 3x3 window on a 2D array (image) of height x width, as shown in Figure 4. We use it to show the impact of temporal common sub-expression elimination (TCSE), when combined with loop unrolling, on area and throughput.

void MaxFilterSystem(int** A, int N, int** Out) {
  int i, j ;
  int maxCol1, maxCol2, maxCol3, winMax ;
  for (i = 0 ; i < N ; ++i) {
    for (j = 0 ; j < N ; ++j) {
      MaxFilter(A[i][j],   A[i][j+1],   A[i][j+2],   maxCol1) ;
      MaxFilter(A[i+1][j], A[i+1][j+1], A[i+1][j+2], maxCol2) ;
      MaxFilter(A[i+2][j], A[i+2][j+1], A[i+2][j+2], maxCol3) ;
      MaxFilter(maxCol1, maxCol2, maxCol3, winMax) ;
      Out[i][j] = winMax ;
    }
  }
}

Figure 4: Max filter on a 3x3 window

The results are shown in Table 4. The original implementation, with no optimizations, is in the first row; it has three input channels and generates one output element every clock cycle. It consists of four Max modules taking up 311 slices. When TCSE is applied, two of these components are removed and only one new data element is needed each cycle, resulting in a lower area for the same throughput.

Table 4: Impact of TCSE on Max Filter with loop unrolling

In/Out Channels   Clock (MHz)   Area (slices)   Throughput (MB/s)   Throughput/area (MB/s/slice)
3/1               225           311             225                 0.723
1/1 with TCSE     225           266             225                 0.846
7/5               220           526             1100                2.092
5/5 with TCSE     225           460             1125                2.446

The third row of Table 4 shows the results when the outer loop is unrolled five times, taking in seven elements each clock cycle and generating five outputs. Applying TCSE (fourth row) results in smaller area, increased clock speed and two variables being reused across iterations, requiring only five input elements every clock cycle. Assuming the necessary memory bandwidth is available, this exploration shows that a 48% increase in area results in 5X higher throughput and a 3.38X higher throughput per unit area.
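A source-level picture of the reuse TCSE finds in Figure 4 (our rewrite, shown only to illustrate the effect; the compiler realizes it with registers in the pipeline). When consecutive iterations slide the 3x3 window down one row, the two upper row maxima are the values computed in the two previous iterations, so two of the four MaxFilter instances disappear; priming of the first two row maxima is omitted here.

maxCol1 = maxCol2 ;                  // row maximum computed two iterations ago
maxCol2 = maxCol3 ;                  // row maximum computed one iteration ago
MaxFilter(A[i+2][j], A[i+2][j+1], A[i+2][j+2], maxCol3) ;  // the only new row max
MaxFilter(maxCol1, maxCol2, maxCol3, winMax) ;
Out[i][j] = winMax ;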
6. CONCLUSION
The automatic translation of programs written in HLLs to FPGA-based hardware accelerators is a daunting task. These tools have to (1) overcome a large semantic gap between temporal, sequential and control-driven programs and spatial, parallel and data/event-driven circuits, and (2) do so without any of the virtualizations commonly available with CPUs and GPUs. In this paper we describe the ROCCC C-to-VHDL compilation tool, one of over 40 similar tools developed in academia and industry. The focus of ROCCC is on compiling a subset of C into hardware accelerators while providing an extensive set of compile-time transformations and optimizations under user control via a GUI-based console. We report the experimental evaluation of the impacts of some of these transformations on the circuit costs (area) and performance (throughput).

7. ACKNOWLEDGMENTS
This work was supported in part by NSF Awards CCF-1219180 and IIS-1161997 and by AFRL Contract FA945309C0173.

8. REFERENCES
[1] ROCCC 2.0. www.jacquardcomputing.com, 2013.
[2] Bertin, P., Roncin, D., and Vuillemin, J. Introduction to programmable active memories, pages 300-309. Prentice Hall, 1989.
[3] Bertin, P., Roncin, D., and Vuillemin, J. Programmable active memories: a performance assessment. In Parallel Architectures and Their Efficient Use, Paderborn, Germany, Nov. 1992. Lecture Notes in Computer Science, pages 119-130. Springer Verlag, 1992.
[4] Buyukkurt, B., Cortes, J., Villarreal, J., and Najjar, W. A. Impact of high-level transformations within the ROCCC framework. ACM Trans. on Architecture and Code Optimization (TACO), 7(4):17:1-17:36, Dec. 2010.
[5] Buyukkurt, B. and Najjar, W. A. Compiler generated systolic arrays for wavefront algorithm acceleration on FPGAs. In Int. Conference on Field Programmable Logic and Applications (FPL), September 2008.
[6] Buyukkurt, B., Guo, Z., and Najjar, W. A. Impact of loop unrolling on area, throughput and clock frequency in ROCCC: C to VHDL compiler for FPGAs. In Proc. Int. Workshop on Applied Reconfigurable Computing (ARC), March 2006.
[7] Guo, Z., Buyukkurt, B., and Najjar, W. A. Input data reuse in compiling window operations onto reconfigurable hardware. In ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), pages 249-256, New York, NY, USA, June 2004. ACM Press.
[8] Guo, Z., Najjar, W., and Buyukkurt, B. Efficient hardware code generation for FPGAs. ACM Trans. on Architecture and Code Optimization (TACO), 5(1):26, May 2008.
[9] Hammes, J., Bohm, A. P. W., Ross, C., Chawathe, M., Draper, B., Rinker, R., and Najjar, W. Loop fusion and temporal common sub-expression elimination in window-based loops. In Reconfigurable Architectures Workshop, April 2001.
[10] Villarreal, J., Park, A., Najjar, W., and Halstead, R. Designing modular hardware accelerators in C with ROCCC 2.0. In 18th IEEE Ann. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 127-134, May 2010.
[11] Villarreal, J. Compiled acceleration of C programs on FPGAs. Ph.D. Thesis, U. California Riverside, USA, 2008. AAI3332643.
[12] Xilinx Core Generator System. www.xilinx.com/tools/coregen.htm