FPGA Code Accelerators - The Compiler Perspective: Conference Paper
Walid A. Najjar
University of California, Riverside
All content following this page was uploaded by Walid A. Najjar on 10 September 2014.
The first row of Table 2 represents the base configuration, where no transformations have taken place and the code was compiled with the default options. In this case ROCCC generates hardware that has only one input channel and one output channel. Before any input can be processed, the hardware has to read three elements from the single input channel, which takes three clock cycles, effectively cutting the throughput to one third of its potential.

The second row shows the effect of specifying three input memory channels with no other transformations. This allows all the necessary data to be read in one clock cycle, so an output is generated every clock cycle, tripling the throughput. The area is slightly larger because the hardware has to deal with multiple connections, but some internal hardware components that serialized the incoming data are simplified in this implementation, so the increase in area is small.

The third and fourth rows show the effect of unrolling the outer loop once and six times, corresponding to connecting to an interface of 32 bits and 64 bits respectively. Each unrolling allows the number of input and output channels to increase while still producing all outputs every clock cycle, resulting in a large increase in throughput and maximizing the throughput per unit area for this experiment.

Average Filter – Lookup Tables and Arithmetic Cores.
Average Filter computes the average of each sliding 3x3 window in the input array. We compare two versions where the division is implemented either as a look-up table or as an instantiation of an IP core generated by Xilinx Core Generator. Results are shown in Table 3.

For all transformations the achievable clock speed was 225 MHz. Again, the first row shows the default configuration with one

Max Filter computes the maximum value in a sliding 3x3 window on a 2D array (image) of height x width, as shown in Figure 4. We use it to show the impact of temporal common sub-expression elimination (TCSE), when combined with loop unrolling, on area and throughput.

void MaxFilterSystem(int** A, int N, int** Out) {
  int i, j ;
  int maxCol1, maxCol2, maxCol3, winMax ;

  for (i = 0 ; i < N ; ++i) {
    for (j = 0 ; j < N ; ++j) {
      MaxFilter(A[i][j], A[i][j+1], A[i][j+2], maxCol1);
      MaxFilter(A[i+1][j], A[i+1][j+1], A[i+1][j+2], maxCol2);
      MaxFilter(A[i+2][j], A[i+2][j+1], A[i+2][j+2], maxCol3);
      MaxFilter(maxCol1, maxCol2, maxCol3, winMax);
      Out[i][j] = winMax ;
    }
  }
}

Figure 4: Max filter on a 3x3 window

The results are shown in Table 4. The original implementation, with no optimizations, is in the first row; it has three input channels and generates one output element every clock cycle. It consists of four Max modules taking up 311 slices. When TCSE is applied, two of these modules are removed and only one new data element is needed each cycle, resulting in a lower area for the same throughput.

The third row of Table 4 shows the results when the outer loop is unrolled five times, taking in seven elements each clock cycle and generating five outputs. Applying TCSE (fourth row) results in a smaller area, an increase in clock speed, and two variables being reused across iterations, requiring only five input elements every
clock cycle. Assuming the necessary memory bandwidth is available, this exploration shows that a 48% increase in area results in 5X higher throughput and a 3.38X higher throughput per unit area.

Table 4: Impact of TCSE on Max Filter with loop unrolling

In/Out Channels   Clock (MHz)   Area (slices)   Throughput (MB/s)   Throughput / area
3/1               225           311             225                 0.723
1/1 with TCSE     225           266             225                 0.846
7/5               220           526             1100                2.092
5/5 with TCSE     225           460             1125                2.446

6. CONCLUSION
The automatic translation of programs written in HLLs to FPGA-based hardware accelerators is a daunting task. These tools have to (1) overcome a large semantic gap between temporal, sequential and control driven programs and spatial, parallel and data/event driven circuits; and (2) do so without any of the virtualizations commonly available with CPUs and GPUs. In this paper we describe the ROCCC C to VHDL compilation tool, one of over 40 similar tools developed in academia and industry. The focus of ROCCC is on compiling a subset of C into hardware accelerators while providing an extensive set of compile-time transformations and optimizations under user control via a GUI-based console. We report the experimental evaluation of the impact of some of these transformations on circuit cost (area) and performance (throughput).

7. ACKNOWLEDGMENTS
This work was supported in part by NSF Awards CCF-1219180 and IIS-1161997 and by AFRL Contract FA945309C0173.

8. REFERENCES
[1] ROCCC 2.0 - www.jacquardcomputing.com, 2013.
[2] Bertin, P., Roncin, D., and Vuillemin, J. Introduction to programmable active memories, pages 300–309. Prentice Hall, 1989.
[3] Bertin, P., Roncin, D., and Vuillemin, J. Programmable active memories: a performance assessment. In Parallel Architectures and Their Efficient Use. Paderborn, Germany, Nov. 1992, Lecture Notes in Computer Science, pages 119–130. Springer Verlag, 1992.
[4] Buyukkurt, B., Cortes, J., Villarreal, J. and Najjar, W. A. Impact of high-level transformations within the ROCCC framework. ACM Trans. Architecture and Code Optimizations (TACO), 7(4):17:1–17:36, Dec. 2010.
[5] Buyukkurt, B. and Najjar, W. A. Compiler Generated Systolic Arrays for Wavefront Algorithm Acceleration on FPGAs. In Int. Conference on Field Programmable Logic and Applications (FPL), September 2008.
[6] Buyukkurt, B., Guo, Z., and Najjar, W. A. Impact of loop unrolling on area, throughput and clock frequency in ROCCC: C to VHDL compiler for FPGAs. In Proc. Int. Workshop On Applied Reconfigurable Computing (ARC), March 2006.
[7] Guo, Z., Buyukkurt, B. and Najjar, W. A. Input data reuse in compiling window operations onto reconfigurable hardware. In ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), pages 249–256, New York, NY, USA, June 2004. ACM Press.
[8] Guo, Z., Najjar, W. and Buyukkurt, B. Efficient hardware code generation for FPGAs. ACM Trans. on Architecture and Code Optimizations (TACO), 5(1):26, May 2008.
[9] Hammes, J., Bohm, A.P.W., Ross, C., Chawathe, M., Draper, B., Rinker, R., and Najjar, W. Loop Fusion and Temporal Common Sub-expression Elimination in Window-based Loops. In Reconfigurable Architecture Workshop, April 2001.
[10] Villarreal, J., Park, A., Najjar, W. and Halstead, R. Designing modular hardware accelerators in C with ROCCC 2.0. In 18th IEEE Ann. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), 2010, pages 127–134, May 2010.
[11] Villarreal, J. Compiled acceleration of C programs on FPGAs. Ph.D. Thesis, U. California Riverside, USA, 2008. AAI3332643.
[12] Xilinx Core Generator System www.xilinx.com/tools/coregen.htm
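
As a cross-check of the Table 4 figures discussed in the Max Filter experiment, the derived columns can be recomputed from the clock rate, the number of outputs per cycle, and the slice count. The sketch below is only illustrative: the type and function names (Table4Row, TABLE4, throughput_mbs, throughput_per_area) are hypothetical, and it assumes each output element is one byte, which is inferred from the table (225 MHz with one output per cycle is listed as 225 MB/s) and is not stated in the text.

```c
#include <assert.h>

/* One row of Table 4 (hypothetical struct, not part of ROCCC). */
typedef struct {
    const char *config;        /* In/Out channel configuration        */
    double      clock_mhz;     /* achieved clock in MHz               */
    int         area_slices;   /* area in slices                      */
    int         outputs_per_cycle; /* the "Out" channel count         */
} Table4Row;

/* Values copied from Table 4. */
static const Table4Row TABLE4[4] = {
    {"3/1",           225.0, 311, 1},
    {"1/1 with TCSE", 225.0, 266, 1},
    {"7/5",           220.0, 526, 5},
    {"5/5 with TCSE", 225.0, 460, 5},
};

/* ASSUMPTION: one byte per output element, so
 * MHz * outputs per cycle gives MB/s directly. */
double throughput_mbs(const Table4Row *r) {
    return r->clock_mhz * r->outputs_per_cycle;
}

/* Throughput per unit area, in MB/s per slice. */
double throughput_per_area(const Table4Row *r) {
    assert(r->area_slices > 0);
    return throughput_mbs(r) / r->area_slices;
}

/* Relative area of an optimized row versus a baseline row. */
double area_ratio(const Table4Row *opt, const Table4Row *base) {
    assert(base->area_slices > 0);
    return (double)opt->area_slices / base->area_slices;
}

/* Relative throughput per area of an optimized row versus a baseline row. */
double efficiency_ratio(const Table4Row *opt, const Table4Row *base) {
    return throughput_per_area(opt) / throughput_per_area(base);
}
```

Under these assumptions, comparing the last row (5/5 with TCSE) against the first row (3/1) gives an area ratio of about 1.48, a throughput ratio of 5, and a throughput-per-area ratio of about 3.38, consistent with the 48%, 5X and 3.38X figures quoted in the text.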