J Sign Process Syst (2011) 65:245–259
DOI 10.1007/s11265-011-0606-x
Design Methodology for Offloading Software
Executions to FPGA
Tomasz Patyk · Perttu Salmela · Teemu Pitkänen ·
Pekka Jääskeläinen · Jarmo Takala
Received: 29 January 2011 / Revised: 4 July 2011 / Accepted: 4 July 2011 / Published online: 30 July 2011
© Springer Science+Business Media, LLC 2011
Abstract A field programmable gate array (FPGA) is a flexible solution for offloading part of the computations from a processor. In particular, it can be used to accelerate the execution of a computationally heavy part of a software application, e.g., in DSP, where small kernels are repeated often. Since the application code for a processor is software, a design methodology is needed to convert the code into a hardware implementation applicable to the FPGA. In this paper, we propose a design method that uses the Transport Triggered Architecture (TTA) processor template and the TTA-based Co-design Environment toolset to automate the design process. With software as a starting point, we generate an RTL implementation of an application-specific TTA processor together with the hardware/software interfaces required to offload
This work has been supported by the Academy of Finland
under research grant decision 128126.
T. Patyk (B) · P. Salmela · T. Pitkänen · P. Jääskeläinen · J. Takala
Department of Computer Systems, Tampere University of Technology,
P. O. Box 553, 33101 Tampere, Finland
e-mail: [email protected]

P. Salmela
e-mail: [email protected]

T. Pitkänen
e-mail: [email protected]

P. Jääskeläinen
e-mail: [email protected]

J. Takala
e-mail: [email protected]
computations from the system's main processor. To exemplify how the integration of a customized TTA with a new platform can be carried out, we describe the process of developing the required interfaces from scratch. Finally, we present how to take advantage of the scalability of the TTA processor to meet platform- and application-specific requirements.
Keywords Application-specific integrated circuits ·
Hardware accelerator · Computer aided engineering ·
System-on-a-chip · Coprocessors ·
Field programmable gate arrays
1 Introduction
The growing complexity of software applications running on portable devices, such as mobile phones, smartphones, and PDAs, calls for an increase in the processing power offered by their CPUs. Typically, a RISC processor employed as a general-purpose processing unit does not provide enough computational resources, and the use of a specialized hardware accelerator is inevitable. A DSP co-processor is a common solution to speed up multimedia applications. However powerful the DSP processor is, dedicated hardware will perform the same task faster, consume less power, and occupy a smaller silicon area.
Reconfigurable hardware in the form of a field programmable gate array (FPGA) makes an excellent solution for increasing the performance of an embedded system, as part of the application code can be offloaded from the processor. The performance increase requires careful planning, though. Quite often the overhead of such an arrangement, e.g., the cost of data transfers between a
CPU and an FPGA, may be higher than the performance gain. Also, the clock frequency of the FPGA is often much lower than that of the CPU. Therefore, the inherent parallelism of the application needs to be exploited efficiently. Finally, the traditional development style for FPGAs resembles a hardware design process, which requires that the designer has expertise in hardware structures. Additionally, the application is often given as software code, hence, offloading requires the description to be converted to an RTL structure. Therefore, there is a need for a design methodology converting a software partition into a hardware structure. Such a methodology could be used, e.g., by software designers without deep knowledge of hardware implementation, as a rapid way of offloading computations to an FPGA.
In this paper, we describe a design methodology for offloading computations from a CPU to an FPGA. The proposed method allows a part of the application code, described in the C language, to be executed on an application-tailored processor implemented on the FPGA. The method supports the full ANSI C language; supports targets with an operating system; exploits DMA transfers to minimize the overheads; and allows the user to scale the computational resources up or down. Our experiments show that this method is scalable and can exploit the inherent parallelism of the application. In addition, the designer works at a higher abstraction level, thus deep knowledge of hardware design is not needed. The paper extends our previous work in [1] by providing details of the proposed design methodology.
The remainder of the paper is organized as follows. Section 2 presents a brief survey of other available tools automating the offloading process. Section 3 sketches the offloading of computations, Section 4 details the implementation methodology for the described accelerator blocks, Section 5 describes the platform-specific interfacing, Section 6 discusses results for two different TTA designs, and Section 7 concludes the paper.
2 Related Work
Traditionally, the C language has been used to implement DSP algorithms and applications. The large amount of legacy C code turns attention to design methods capable of converting functionality described in the C language into a hardware structure as easily as possible. A large number of tools taking a C program as the initial description are already available on the market. In theory, such tools could be used for FPGA-based acceleration. However, many tools have serious limitations, e.g., only a subset of C is fully supported, which makes the C-to-hardware conversion process more complex and time consuming. Furthermore, many tools generate only the RTL description of a hardware accelerator without support for system integration. The user has to manually design the scheduling and communication mechanisms between the accelerator and the host processor, build the interface units, and provide device drivers.
Synphony C Compiler [2] generates processor arrays from C programs. However, it supports only a limited subset of C. It also requires manual setting of parameters affecting the scheduling of operations. CoWare Processor Designer [3] is a toolset for designing application-specific processors (ASIPs) and is not a generic tool for converting C to HDL descriptions. Target IP Designer [4, 5] is another similar tool. AutoESL [6] supports high- and low-level parallelism but does not support the full ANSI C language. Impulse CoDeveloper [7] is targeted at FPGA-based acceleration but assumes a computational model comprising sequential processes communicating with each other. Therefore, it suits well only applications consisting of independent processes receiving and emitting data streams. In addition, it does not support the full ANSI C language.
Binachip-FPGA [8] also targets FPGA acceleration. In contrast to other tools, the description of the system is given as a compiled binary for a supported processor architecture instead of C language source code. Cascade [9] is another tool which uses ARM, PowerPC, or MicroBlaze binaries as the description of the desired functionality. These tools, inputting binaries instead of source code, are presumably targeted at cases where the source code of the program is not available. Otherwise, it is hard to justify the lower-level input format, given that even the C language is a very low-level sequential language from which producing a parallel implementation is already often very challenging, even if the described algorithm is inherently parallel.
Catapult-C [10] generates a fixed-function implementation instead of a processor-based one. As a drawback, generating the hardware implementation requires a lot of user attention. C2H is a tool only for Altera FPGA devices. It requires direct access to a memory shared with the master processor. The tool supports only a subset of C, and its external connectivity is based on Altera's Avalon bus. Cynthesizer [11] is another tool for rapid hardware generation. However, it requires using SystemC, and, in general, extensive modifications are required to the original ANSI C code [12]. NISC [13] is a tool for generating no-instruction-set-computer architecture processors from C. On the architectural level, the basic idea of NISC, the use of an extremely "bare bone" processor template, is similar to the TTA template used in this work. However, full ANSI C is not supported.
In the proposed method, we target support for full ANSI C descriptions and allow the user to trade off execution time against area according to the given requirements. In addition, the proposed method supports offloading on targets with operating systems (OS).
3 Design Method for Offloading Computations
An FPGA in an embedded system gives system designers a unique opportunity to offload some of the computation from the host processor, hence reducing its computational load. The FPGA can serve as a hardware accelerator for some specific, e.g., DSP, algorithm that cannot be computed efficiently enough by the main unit. Another common case is to simply offload some computationally intensive tasks from the host processor in a multi-task system and let the processor execute other tasks while waiting for the results of the offloaded computation. Either way, the system designer is faced with the following design challenges:
– host processor utilization;
– hardware (HW) / software (SW) interface between the host processor and the offloaded unit; and
– co-design methodology to produce a HW-accelerated implementation from the SW implementation.
When considering the host utilization, several issues need to be taken into account. Firstly, since multitasking systems governed by an operating system are our primary interest, it is essential that the offloaded execution is non-blocking. This means that the host processor should be able to continue execution while the offloading hardware is doing its job. Quite often this means that the operating system schedules different tasks/processes to the processor until the execution can be resumed. Secondly, in some cases the FPGA system does not have random access to the local memory of the processor where the operands of the computations are stored. This imposes the requirement of transferring data to and from the local memory of the FPGA device. Not only does this take time but, if done actively by the host processor, it also keeps the processor busy. A common way to avoid occupying the host processor with the data transfers is the use of Direct Memory Access (DMA) transfers. In platforms supporting DMA, this method offloads the data transfers from the processor to a peripheral hardware unit. Thirdly, the FPGA circuit usually runs at a clock frequency
several times lower than that of the host processor. The actual acceleration expected from using the FPGA needs to be calculated keeping this in mind. Naturally, the gain arising from the fact that the host processor can perform other tasks in the meantime is preserved. These factors lead us to the following conclusion: in order to speed up application execution with an FPGA accelerator, the speed of the accelerator hardware should compensate for the additional data transfer penalties, the potentially lower clock frequency of the accelerator, and the overhead of task switching in the operating system. Preferably, the accelerator design technique should be scalable, so it can be used to design accelerators that meet the required computational efficiency while staying within the silicon area limits of the platform.
The communication interface is specific to the platform and the hardware accelerator used. If the accelerator is manually designed for a certain platform, the interface will map directly to the interface exposed by the platform. If the accelerator is generated with an automated approach, e.g., using a processor template, an adapter interface is almost certainly needed. Should DMA be exploited, the interface needs to implement the means to enable this functionality. The interface is comprised of a hardware (HW) part and a software (SW) part. The HW interface establishes the signal connections between the system platform and the accelerator. The SW interface, in its basic form, allows performing the data transfers, initiating the computations, and signaling the host processor about their completion.
For the design methodology, our approach is to design an application-specific processor for the task to be offloaded, and then use a retargetable C compiler to generate binary code for the customized processor. We will also show how to create a HW/SW interface for an arbitrary platform. This interface requires non-recurring engineering work. Once created, it can be reused on the particular platform with different application-customized accelerators. The HW and SW interfaces can later be distributed, e.g., in the form of reusable libraries.
4 Accelerator Implementation
In this work, the transport triggered architecture (TTA) [14] was used as a processor template for designing the accelerators. For design automation, the TTA-based Codesign Environment (TCE) [15–17], which uses the TTA paradigm as a template for customizing application-specific processors, was used.
4.1 Processor Template
Transport Triggered Architectures (TTA) belong to the class of exposed data path VLIW architectures, i.e., the details of the data path transfers are exposed to the programmer. This enables various unique optimizations in code generation and the customization of the data path interconnection.
In contrast to traditional "operation triggered" architectures, where operations are decoded into control signals that initiate operand transports, TTA instructions explicitly define and schedule the operand transports. The operation executions are side effects of the operand transports; schematically, an addition r3 = r1 + r2 is programmed as the moves r1 -> add.in, r2 -> add.trigger, add.result -> r3. The internal buses are used efficiently, as the data transports on each bus can be controlled independently.
The modular structure of the TTA is illustrated in Fig. 1. The basic building blocks of TTA processors are function units (FU), register files (RF), a control unit, and an interconnection network between the data path resources. TTA processors are programmed by data transports between the computing resources, and the programming paradigm resembles data flow programming. Each function unit contains one or more input ports. One of the input ports is a trigger port, which triggers the operation execution when the operand is moved to this port. This means that the other operands have to be moved to their corresponding ports in an earlier or in the same instruction cycle as the move to the trigger port. This requires careful scheduling of data transports. Operands can be passed directly from one function unit to another (software bypassing). Furthermore, the data can often be fully bypassed without the need for storing temporary results in a register file at all. In addition to reducing the number of needed general-purpose registers to avoid spills, software bypassing lowers register file pressure, one of the biggest bottlenecks of VLIW machines [18].

Figure 1 TTA processors consist of the control unit (CU) and a variable number of function units (FU), special FUs (SFU), register files (RF), and load/store units (LSU), attached through sockets to the buses of the programmer-visible interconnection network; the socket connections can be customized according to the application. Unused connections between the resources can be excluded from the interconnection network because of the data transport programming.
One of the main benefits of the TTA template is its flexibility; the architectures generated using the TTA template can be scaled to the requirements at hand. For instance, there are no limits on the number of parallel FUs or RFs. The FUs can have an arbitrary number of pipeline stages or an arbitrary delay. Furthermore, there is no limit on the number of input and output ports of FUs, and the FUs can be connected directly to an external interface of the processor. The external interface is simply extended with the connected FU signals, which allows, e.g., using local memories freely. A second significant benefit is the simplicity and modularity of the processor, which eases verification and pre-synthesis cost estimation.
4.2 TTA-based Codesign Environment
The TTA-based Codesign Environment (TCE) [15–17] is a toolset that uses the TTA paradigm for developing application-specific instruction set processors. TCE offers a set of tools which allow a designer to customize the processor architecture; compile high-level language programs for the designed architectures; simulate the program execution; and evaluate the cost functions of execution cycles, area, and energy. The toolset includes both command-line and graphical user interface tools for powerful scripting and comfortable usability.
TCE allows the designer to design processors completely manually or in a semi-automated fashion. In the first case, the designer uses a graphical tool to instantiate an architecture template and to populate it with resources. The library of predefined processor units includes register files, functional units, long immediate units, etc. Additionally, the designer can add their own customized application-specific units. The graphical tool allows connecting processor resources with each other through the transport buses.
In the semi-automated design flow, the designer can automatically create an architecture based on the requirements of the application. Starting from an initial architecture provided by the designer, the design space explorer automatically adds and removes resources. Finally, the designer is given a database of architectures with associated information about the cycle counts required for executing the application.
The TCE design flow for FPGA circuits is illustrated in Fig. 2. The input is a high-level language program. The first design space exploration loop is performed at the architecture level, where the designed TTA is modified using graphical tools and evaluated using a retargetable compiler and a processor simulator. It should be noted that the "design space explorer" can be an automatic tool or the designer, depending on the desired design flow. The next phase is the hardware generation, where a platform-specific implementation of the architecture is produced. The implementation is then evaluated with platform-vendor-specific tools, which can return the design space exploration back to the architecture exploration in case the desired constraints (area, clock frequency, speed, power consumption) are not met.
The design variations are evaluated at the architectural level by compiling programs for them and running architectural simulations. The C compiler is ANSI C compliant, hence, there are no restrictions on the C syntax. Once the designer is satisfied with the architecture, the processor and the corresponding program image can be generated. TCE tools generate the HDL files for the selected architecture and a bit image of the application. The processor architecture can be synthesized from the HDL files using third-party tools.

Figure 2 TCE design flow for FPGA circuits: a high-level language (HLL) program in C, C++, or OpenCL enters the TCE design tools (Processor Designer tool, retargetable compiler, retargetable instruction set simulator, processor generator, and program image generator), supported by an FPGA-specific hardware database and an FPGA-specific platform description used by the Platform Integrator; third-party FPGA synthesis tools then produce the FPGA programming files, with feedback loops returning to the designer (or an automated "explorer").
In order to overcome the disadvantage of the long instructions in VLIW designs, instruction compression can be used at this point. The binary image of the application is compressed, and a corresponding decompression block is added to the control unit of the target processor. For a more detailed description of the TCE FPGA design flow, the reader is referred to our previous paper [16].
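To give an intuition of how such compression can work, the sketch below shows a simple dictionary scheme, in which the program image stores short indices into a table of distinct wide instructions. This is only an illustrative model written by us; we do not claim that TCE's compressor uses exactly this scheme or these parameters.

/* Illustrative dictionary-based decompression model (not TCE's actual
 * implementation). Each distinct wide instruction is stored once in the
 * dictionary; the compressed program image holds only 8-bit indices. */
#include <stdint.h>

#define DICT_SIZE    256   /* assumed number of distinct instructions */
#define INSTR_WORDS  4     /* assumed instruction width: 4 x 32 bits  */

static uint32_t dict[DICT_SIZE][INSTR_WORDS];

/* In hardware this lookup is performed by the decompression block
 * added to the control unit. */
const uint32_t *decompress(uint8_t index)
{
    return dict[index];
}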
4.3 Accelerator Design
The method for designing an accelerator on an FPGA for offloading computations from the host processor contains the following steps:
1. select a piece of code to be offloaded from the processor to the FPGA;
2. replace the selected code with calls to the device driver that initiate the operand transfers and the execution on the FPGA (a sketch of this step is given at the end of this section);
3. customize a TTA processor for the selected code with the aid of the TCE toolkit;
4. using TCE tools, generate an HDL description of the customized TTA processor with the required interfaces from platform-specific hardware databases, and obtain the FPGA configuration with commercial synthesis and place & route tools; and
5. generate machine code with the TCE retargetable compiler for the customized TTA.
At runtime, the FPGA configuration is downloaded to the FPGA and the TTA binary program code is loaded to the FPGA memory. After the initializations, the FPGA accelerator can be used by the software running on the host processor. The interfaces are loaded from platform-specific component libraries.
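As an illustration of step 2, the sketch below contrasts a host-side call to the kernel with its replacement by device driver calls. This is only a minimal sketch: the kernel name mdct_backward comes from the case study in Section 6 (its signature is simplified here), while fd0, dma_buffer, and the driver call semantics anticipate the interface described in Section 5.2.

#include <string.h>   /* memcpy */
#include <unistd.h>   /* write, read */

typedef int ogg_int32_t;           /* stand-in for Tremor's fixed-point type */
extern int fd0;                    /* driver file descriptor (cf. Fig. 6)    */
extern ogg_int32_t *dma_buffer;    /* mmap'ed buffer shared with the DMAC    */
void mdct_backward(int n, ogg_int32_t *in);  /* hypothetical simplified kernel */

void compute_mdct(int n, ogg_int32_t *buf)
{
#ifdef ON_HOST
    mdct_backward(n, buf);                      /* original code on the CPU */
#else
    memcpy(dma_buffer, buf, n * sizeof(*buf));  /* stage the operands       */
    write(fd0, NULL, 0);   /* DMA SDRAM -> FPGA; the TTA starts processing  */
    read(fd0, NULL, 0);    /* blocking read: returns when results are back  */
    memcpy(buf, dma_buffer, n * sizeof(*buf));  /* collect the results      */
#endif
}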
5 Target-Specific Interfacing
The communication between the host processor and
the application-specific TTA processor(s) configured
on the FPGA is target-dependent. Therefore, the interfaces and protocols, together with the device drivers, are tailored for each target platform. However, once the tailoring has been done, the interfaces and protocols can be stored in libraries and reused for new applications.
Figure 3 Organization of the target platform. MPMC: Multiport memory controller. DMAC: DMA controller.
5.1 Hardware Interface
Our example target platform was the RealView Platform Baseboard for ARM926EJ-S, which contains the ARM processor and an FPGA chip. A simplified block diagram presenting the main components and their connections is shown in Fig. 3.
In this platform, all peripherals which have a memory-mapped interface communicate with the processor through the ARM-specific AMBA AHB bus.
Figure 4 shows the basic connection of the slave peripherals to the tri-state AMBA AHB bus [19]. All AHB slave modules have their inputs permanently connected to the AHB signals. The outputs, on the other hand, are multiplexed. The Decoder component resolves addresses from the AHB address bus (HADDR) and activates the right component both by setting its HSEL signal high and by multiplexing its output back to the AHB bus.
Since TTAs use the Harvard architecture, their interface is comprised of separate buses to the instruction and data memories. Additionally, our TTA included two control signals: the input TTA_START and the output TTA_COMPLETE. These signals were used to start the computations and to indicate that the results are ready. Once the TTA_COMPLETE signal is asserted, the TTA is locked and does not perform any tasks. This prevents possible data corruption and allows safe copying of results from the memory on the FPGA to a memory accessed by the host processor.
Figure 4 Connection of AMBA AHB slaves [19].

The adapting interface between the target platform and the TTA, presented in Fig. 5, is realized through three distinct components instantiated on the FPGA: the
data memory, the instruction memory, and the DMA module (DMAM). Both memories are AMBA AHB slaves. The data memory is a dual-port RAM built from the on-chip memory cells of the FPGA. One port is connected to the TTA data memory interface, while the second port, which has an AHB interface, is connected to the AMBA bus.
Figure 5 Principal block diagram of application-specific processor in FPGA.
The instruction memory is implemented in a similar way, with one exception: the ports are asymmetric in width. This is due to the very long instruction word of the TTA and, at the same time, the 32-bit width of the port connected to the AMBA bus. Because of this asymmetry, additional control logic is needed on the AMBA port to store and assemble several data words from the host processor into a complete instruction word. This control logic is described with generic parameters, thus it can be reused easily by obtaining the details of the binary code from the TCE compiler: the memory size (the number of instructions to be stored); the memory width (the instruction width); and the word width (the width of the data words obtained from the host interface; in this case, the AHB uses 32-bit words).
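The following behavioral C model sketches what this width-adapting logic does; it is not the actual RTL, and the parameter values are placeholders for the generics named above.

#include <stdint.h>

/* Placeholder values for the generic parameters described in the text. */
#define MEM_SIZE     1024  /* memory size: number of instructions  */
#define INSTR_WORDS  4     /* memory width: instruction width / 32 */

static uint32_t imem[MEM_SIZE][INSTR_WORDS];
static unsigned word_idx, instr_idx;

/* Models one 32-bit write from the host over the AHB port: consecutive
 * words are assembled until a complete wide instruction is stored. */
void imem_write_word(uint32_t word)
{
    imem[instr_idx][word_idx++] = word;
    if (word_idx == INSTR_WORDS) {      /* full instruction assembled */
        word_idx = 0;
        instr_idx = (instr_idx + 1) % MEM_SIZE;
    }
}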
The actual data transfers on the FPGA are managed by the DMAM, a simple finite state machine (FSM) that synchronizes the DMA transfers with the TTA processing and interleaves the accesses to the data memory. Typically, the following steps occur (a behavioral sketch is given at the end of this subsection):
1. the TTA is idle (locked) and does not access the data memory; the DMAM enables DMA transfers;
2. the DMA controller transfers data (divided into bursts) and the DMAM acknowledges consecutive bursts;
3. after the last burst, the DMAM acknowledges the transfer and unlocks the TTA, which starts processing the data in the memory;
4. once the processing is done, the TTA locks itself and informs the DMAM about the task completion; and
5. the DMAM enables DMA transfers (pending or upcoming).
After the last step, the DMA controller can set up the transfer back to the SDRAM. From the host processor's point of view, offloading computations is nothing more than pushing data back and forth. An additional advantage comes from the fact that locking the TTA processor can result in significant power savings, as the processor itself is neither polling nor waiting for an external interrupt.
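The two-phase handshake implied by these steps can be summarized by the following behavioral sketch; the state, signal, and function names are ours, chosen for illustration, and do not come from the actual RTL.

/* Behavioral sketch of the DMAM synchronization (illustrative names). */
typedef enum { DMA_PHASE, TTA_PHASE } dmam_state_t;

typedef struct {
    dmam_state_t state;
    int tta_start;   /* drives the TTA_START input of the TTA        */
    int dma_enable;  /* gates DMA bursts to the dual-port data memory */
} dmam_t;

/* One evaluation step: last_burst_done and tta_complete model the
 * DMA controller handshake and the TTA_COMPLETE output, respectively. */
void dmam_step(dmam_t *d, int last_burst_done, int tta_complete)
{
    switch (d->state) {
    case DMA_PHASE:              /* steps 1-2: TTA locked, DMA owns memory */
        d->dma_enable = 1;
        d->tta_start  = 0;
        if (last_burst_done) {   /* step 3: unlock the TTA                 */
            d->dma_enable = 0;
            d->tta_start  = 1;
            d->state = TTA_PHASE;
        }
        break;
    case TTA_PHASE:              /* TTA processing, DMA held off           */
        if (tta_complete) {      /* steps 4-5: TTA locks itself again      */
            d->state = DMA_PHASE;
        }
        break;
    }
}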
5.2 Software Interface
The software interface is a platform-specific driver. Our software platform was a Linux-based OS, Maemo Scirocco [20], which is tailored for mobile systems. Therefore, we implemented the driver as a Linux kernel module.
The host-slave communication is managed by the host processor through the DMA controller configured by the device driver. The driver is implemented as a kernel module, thus it can be dynamically loaded at runtime. The driver is implemented as a character device driver, which means that all the operations are performed on the file corresponding to the physical device. The list of system calls implemented by the DMA driver can be found in Table 1.
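As a reminder of how a character device driver exposes such system calls, the sketch below registers a set of handlers through the standard Linux file_operations structure. The handler bodies, the device name, and the major number are ours, chosen for illustration; they are not taken from the actual driver.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Stub handlers corresponding to the system calls of Table 1. */
static int dma_open(struct inode *i, struct file *f)  { return 0; }
static int dma_close(struct inode *i, struct file *f) { return 0; }
static long dma_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{ /* ... set DMACCx registers according to cmd/arg ... */ return 0; }
static ssize_t dma_write(struct file *f, const char __user *b,
                         size_t n, loff_t *o) { return 0; }
static ssize_t dma_read(struct file *f, char __user *b,
                        size_t n, loff_t *o)  { return 0; }
static int dma_mmap(struct file *f, struct vm_area_struct *vma) { return 0; }

static const struct file_operations dma_fops = {
    .owner          = THIS_MODULE,
    .open           = dma_open,
    .release        = dma_close,   /* invoked by the close system call */
    .unlocked_ioctl = dma_ioctl,
    .write          = dma_write,
    .read           = dma_read,
    .mmap           = dma_mmap,
};

static int __init dma_driver_init(void)
{
    /* Register a character device; the major number and name are
     * illustrative placeholders. */
    return register_chrdev(240, "pl08x_dma", &dma_fops);
}
module_init(dma_driver_init);
MODULE_LICENSE("GPL");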
The developed driver supports both non-blocking and blocking data transfers. The driver also implements the DMA interrupt service, which is used to wake up the application during a blocking read. Since the DMA interrupt is enabled per transfer, it is important to enable it with the ioctl system call before the blocking read is issued. Figure 6 presents a typical use of the DMA controller driver system calls in an application program.
Table 1 System calls implemented by the DMA controller Linux driver.

System call   Implementation description

open    Initiates the driver-specific structure. Opens a channel to communicate with the DMA controller.

close   Finishes an on-going transfer (if any) and clears private data.

ioctl   Sets the transfer parameters, e.g., channel, number of bytes to be transferred, source and destination addresses, etc. In the basic case, configuring a DMA transfer requires setting parameters in four registers in the DMA controller (DMACCxSrcAddr, DMACCxDestAddr, DMACCxControl, DMACCxConfiguration). More details can be found in the DMA controller documentation [21].

write   Triggers a transfer from the SDRAM to the FPGA. All necessary parameters need to be set up with the ioctl beforehand.

read    Implements a blocking read operation. Triggers a transfer from the FPGA to the SDRAM. This transfer might be blocked by the DMAM until the TTA finishes processing. The transfer parameters need to be set up with the ioctl beforehand.

mmap    Maps a buffer from the kernel space to the user space. Mmap is required to make the same buffer visible in both spaces: it must be visible in the kernel space for the DMA controller and in the user space for the application. Data copying between the kernel and user spaces is avoided by using the same buffer.
/* Open device */
fd0 = open(CHANNEL0, O_RDWR);
/* Allocate DMA buffer in kernel space */
ioctl(fd0, PL08X_IOC_ALLOC_SDRAM_BUFF, 8 * BUF_SIZE);
/* Map buffer from kernel to user space */
buffer = mmap(NULL, BUF_SIZE,
              PROT_READ | PROT_WRITE, MAP_SHARED, fd0, 0);
/* Set up DMA controller registers */
ioctl(fd0, PL08X_IOC_SET_ALL, &dmac_c_params);
/* Enable DMA interrupt */
ioctl(fd0, PL08X_IOC_SET_DMA_IRQ, 1);
...
/* The write and read system calls
 * replace the call to the offloaded function
 * in the original code. */
/* Transfer data from SDRAM to FPGA */
write(fd0, NULL, 0);
/* Blocking read until offloading is done */
read(fd0, NULL, 0);
...
/* Unmap buffer */
munmap(buffer, BUF_SIZE);
/* Close device */
close(fd0);
Figure 6 Example of offloading code with blocking call.
5.3 Processor–Accelerator Interaction

Figure 7 Sequence diagram of a program execution with offloading: ARM cycles alternate between the Tremor decoder task, an OS context switch to another task during the offloaded computation, and a context switch back to the Tremor decoder.

Figure 7 presents the sequence diagram describing how the host processor operates with the accelerator during
the program execution. Assuming that the FPGA has already been configured for the given application, the interaction is carried out in the following fashion. First, the application is started on the ARM processor. When the offloading should start, the host processor configures the DMA controller to perform a block transfer from the SDRAM to the TTA local memory in the FPGA and starts the transfer. The host processor is now
free to execute other tasks. After the DMA block transfer is completed, the TTA processor immediately starts processing the data. Once the TTA processor has completed the processing, it signals the end of the processing to the DMA controller so that the DMA transfer from the FPGA to the SDRAM can be initiated. When the transfer is finished, the DMA controller signals the host processor with the interrupt that the offloading is completed and the results are available. The interrupt service routine of the DMA device driver signals the operating system for a context switch, and the application continues its execution. On consecutive offloading events, the procedure is repeated.
6 Experiments
To prove the feasibility of the proposed methodology, we carried out experiments with the RealView Platform Baseboard, equipped with a Xilinx Virtex-II family FPGA. At the heart of the board is the ARM926EJ-S, a 32-bit RISC processor with a wide range of peripherals, including the DMA controller (DMAC) and the memory management unit (MMU). The board also contains 128 MB of 32-bit wide SDRAM and 128 MB of 32-bit wide NOR flash memory.

Figure 8 Flow diagram of the Tremor Ogg Vorbis audio decoder (ov_read, fetch_and_process_packet, vorbis_dsp_synthesis, mapping_inverse, vorbis_dsp_pcmout), with the share of clock cycles per stage: MDCT inverse (mdct_backward) 51%, Residue (residue_inverse) 23%, MDCT unroll/lapping (mdct_unroll_lap, mdct_shift_right) 14%, and Floor (floor_inverse1, floor_inverse2, vorbis_lsp_to_curve) 9%.
The proposed design methodology was evaluated by using the Tremor Ogg Vorbis audio decoder [22] as an example application. It is an open-source, fixed-point implementation of the standard, designed especially for platforms without floating-point arithmetic. Instead of compiling the code directly on the board, we decided to cross-compile it with the Scratchbox cross-compiling toolkit [23] run on an i686 Linux-based host machine.
Finding the part of the application suitable for offloading is not trivial in the general case, especially with large programs. Fortunately, in situations where the computational kernel cannot be easily identified, profiling tools, like TCE's proxim or GNU's gprof, can be used. The profiling provided information about the most complex functions in the Tremor Ogg Vorbis decoder. In Fig. 8, the flow diagram of the decoder is presented along with the percentage of clock cycles used by the most significant parts of the application.

Figure 9 Principal organization of the customized TTAs: a the machine with limited resources, "smallTTA", with two ALUs, two LSUs, LOGIC, MUL, and SHIFT units, an IO_SFU, two 32x32 register files, a 2x1 Boolean register file, and a GCU connected by five buses; and b the higher performance machine, "fastTTA", with additional ALUs, multipliers, shifters, and register files connected by 17 buses.

Nearly 50%
of the computation time was used to compute the modified discrete cosine transform. As this function processes data in a contiguous memory range, it was an obvious candidate for offloading. We built the customized TTA processor for the MDCT with the aid of the TCE tools.
To illustrate the scalability of the tools, we developed two processors: the first targeting a short execution time, and the second aiming at a small area. We call them fastTTA and smallTTA, respectively. The starting point was a so-called minimal architecture, which contains just enough resources for the TCE compiler to compile any program. The function computing the MDCT was extracted from the Tremor code and wrapped in a main function in a separate file. The TCE compiler supports the ANSI C language, so no other modifications were made to the original code. The code was compiled and profiled in a cycle-accurate simulator. The profiling tool shows the utilization of each function unit; hence, the often-used FUs were duplicated to improve performance. The final configuration is given in Fig. 9a.
The fastTTA, partially presented in Fig. 9b, was obtained with the design space explorer tool from the TCE toolset. This tool automates the design process by adding resources iteratively until the cycle count cannot be reduced any more. Compared to the smallTTA, the fastTTA has the following additional components: two multipliers, three ALUs, two shifters, two register files, and 12 buses. The profiling, code modification, and design of the two application-specific TTA processors took approximately two days of work.
The two TTA machines were integrated with the rest of the hardware system from Fig. 5. Both designs were synthesized with the Xilinx ISE Design Suite 10.1. Table 2 presents some results taken from the synthesis and place & route reports. As we can see, the fastTTA takes almost three times more FPGA slices than the smallTTA due to the large number of FUs and interconnect buses. The difference in the number of multipliers is also significant: the fastTTA uses nine embedded 18-bit multipliers, while the smallTTA has only three. The on-chip memory consumption is also roughly 60% larger when the fastTTA is used.
The difference in the performance of the TTAs is reflected in the TComp value, measured with the clock-cycle-accurate simulator from the TCE toolkit. The smallTTA takes 68,315 cycles to execute the offloaded task, while the fastTTA executes the same routine in 50,639 cycles.
The critical path of both designs, given in Table 2, is affected by two factors. Firstly, the synthesis was made for a relatively old FPGA architecture, namely the Xilinx Virtex-II. As an example, a clock frequency of 191 MHz was obtained with a similar TTA processor synthesized for a modern Xilinx Virtex-5 FPGA [16]. Secondly, no manual optimizations were applied to the critical path. Much higher clock frequencies could be obtained with manually optimized interconnect buses. Such manual optimization can be applied easily with the help of the graphical user interface of one of the tools in TCE. It is worth mentioning that this optimization process does not require hardware design expertise from the designer. The longer critical path of the fastTTA is due to its more complex interconnection bus. The complexity of the bus increases with the number of FUs and RFs in the design, and the bus is often on the critical path of FPGA implementations of TTAs.
To measure the execution time of the application as accurately as possible, we instantiated one additional component on the FPGA, a cycle counter, which simply counts FPGA clock cycles. The component has memory-mapped registers, which allow starting, stopping, resetting, and reading the measured clock cycles. The cycle counter is an AMBA AHB slave and can be accessed by the ARM processor in exactly the same way as any other memory-mapped peripheral in the system.
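A sketch of how such a counter can bracket an offloading event is given below. The register offsets and command encodings are invented for illustration, and the register window is assumed to have already been mapped into the application's address space.

#include <stdint.h>
#include <unistd.h>

extern int fd0;  /* DMA driver file descriptor, opened as in Fig. 6 */

/* Illustrative register map of the cycle counter (not the real one);
 * both pointers are assumed to point into the mmap'ed register window. */
extern volatile uint32_t *cc_ctrl;   /* write: 2 = reset, 1 = start, 0 = stop */
extern volatile uint32_t *cc_count;  /* read: measured FPGA clock cycles      */

uint32_t measure_offload(void)
{
    *cc_ctrl = 2;          /* reset the counter                */
    *cc_ctrl = 1;          /* start counting                   */
    write(fd0, NULL, 0);   /* DMA to FPGA, TTA computes        */
    read(fd0, NULL, 0);    /* blocking read: DMA back to SDRAM */
    *cc_ctrl = 0;          /* stop counting                    */
    return *cc_count;      /* TOffload in FPGA clock cycles    */
}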
The execution time of the accelerated function, TOffload, can be split into three distinct parts:

TOffload = TComp + TTrans + TOS    (1)

Table 2 Characteristics of the offloading compared to the software-only implementation.

                              FastTTA   SmallTTA   ARM
FPGA slices                   15,412    4,960      N/A
FPGA memory [kB]              86        54         N/A
FPGA Mul18 blocks             9         3          N/A
Max clock frequency [MHz]     35.90     36.02      210.00
Critical path [ns]            27.85     27.77      N/A
TComp [clock cycles]          50,639    68,315     682,500
TTrans [clock cycles]         9,216     9,216      N/A
TOS [clock cycles]            ∼1,000    ∼1,000     N/A
TOffload [clock cycles]       60,855    78,531     682,500
Offloaded code size [bytes]   21,064    10,986     7,380
where TComp is the time used by the TTA for the computations, TTrans indicates the time of the data transfers, and TOS reflects the OS overhead of the master/slave communication. In our experiments, TOffload was measured with the cycle counter. The exact value of TComp can be calculated with the cycle-accurate simulator from the TCE tools. TTrans can be computed based on the transfer protocol and the number of data elements to be
transferred. Based on these two, TOS can be calculated according to Eq. 1; e.g., for the fastTTA, TOS = 60,855 − 50,639 − 9,216 = 1,000 clock cycles.

Figure 10 AMBA HTRANS signal messages during a data transfer: a burst and b non-burst transfer. Each box corresponds to one clock cycle. Dark boxes indicate cycles when valid data is presented on the bus.
Table 2 also lists the execution time results. The number of clock cycles the offloading takes is compared to the number of cycles the host processor needs to perform the same computations. However, to interpret these results correctly, we need to take into account that, generally, the host processor runs at a higher frequency than the slave processor. In our case, the ratio equals 7. Keeping that in mind, we obtain 1.6× and 1.24× speedups when offloading with the fastTTA and the smallTTA, respectively; e.g., for the fastTTA, 682,500 / (7 × 60,855) ≈ 1.6. An additional gain comes from the fact that the host processor can perform other tasks while waiting for the offloading to complete. We are running a multitasking OS, and other processes can be scheduled to run on the CPU during that time, as shown in Fig. 7.
The number of bytes the offloaded function takes after compilation is given in the last row of Table 2. As can be seen, the program for the fastTTA is almost twice as big as the binary for the smallTTA. Conserving the available memory on the FPGA for other purposes can be another reason to customize the accelerator exactly to the application requirements.
Finally, the reason for the relatively low data transfer throughput is the transfer protocol used on the AMBA bus. Figure 10 shows the messages transferred from the master to the slave on HTRANS, one of the AMBA signals. There are four distinct messages, but only NONSEQ and SEQ indicate valid data on the bus. If we take a closer look at the messages during the burst transfer, shown in Fig. 10a, we see that 18 clock cycles are required to transfer four words of data, in other words, 4.5 cycles per word. In a non-burst transfer, depicted in Fig. 10b, one data word is sent every 6 cycles. If we transfer 1,024 words, which is the common case for the Tremor decoder, the transfer will take 4,608 or 6,144 clock cycles in the burst and non-burst modes, respectively; the two burst-mode transfers, to and from the FPGA, match the TTrans value of 2 × 4,608 = 9,216 cycles in Table 2. The bus we are using for the transfers is not used for any other purpose, so it is safe to claim that the calculated numbers hold in the general case. The burst mode can be set with the DMACCxControl registers of the DMA controller.
7 Conclusion
In this paper, we described a method for offloading computations from a host processor to an FPGA. The proposed approach supports platforms with an operating system and offloads both the computation and the data transfers between the host and slave processors. The computations are implemented on a TTA processor, which is customized for the given application and exploits the inherent instruction-level parallelism of the application. The interfaces and communication between the host processor and the slave TTA are target-specific but can be reused on the same target. The communication packages and interfaces are generic and allow any type of functionality to be offloaded from the host processor in this environment.
As a case study, we customized two TTAs for an audio decoding application, showing the scalability of the TCE toolset. The obtained results demonstrate that the difference in the targeted parameters is significant and that the final product can be a trade-off based on the requirements. The design work is done with the TCE tools at a high abstraction level, thus no hardware design expertise is needed. Finally, in our experiment, the results show that offloading speeds up the application execution when compared to the software-only execution. However, the speedup depends on the characteristics of the processor and the FPGA fabric. An additional gain comes from the fact that the offloading is a non-blocking procedure: in a multitasking operating system, other processes can be scheduled to run on the CPU while the offloading is taking place.
References
1. Patyk, T., Salmela, P., Pitkänen, T., & Takala, J. (2010).
Design methodology for accelerating software executions
with FPGA. In Proc. IEEE workshop signal process. syst.,
Cupertino, CA, USA, 6–8 Oct. 2010 (pp. 46–51).
2. Synopsys Inc. (2011). High-level synthesis with Synphony C Compiler, Mountain View, CA, USA (4 p.) [online]. Available: http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/SynphonyC-Compiler.aspx. Accessed 17 July 2011.
3. Hoffman, A., Kogel, T., Nohl, A., Braun, G., Schliebusch, O., Wahlen, O., et al. (2001). A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11), 1338–1354.
4. Praet, J. V., Lanneer, D., Geurts, W., & Goossens, G. (2001). Processor modeling and code selection for retargetable compilation. ACM Transactions on Design Automation of Electronic Systems, 6(3), 277–307.
5. Target Compiler Technologies (2008). IP Designer | IP Programmer, Leuven, Belgium (4 p.) [online]. Available: http://www.retarget.com. Accessed 17 July 2011.
6. Cong, J. (2008). A new generation of C-based synthesis tool and domain-specific computing. In Proc. IEEE int. SOC conf., Newport Beach, CA, USA, 17–20 Sept. 2008 (pp. 386–386).
7. Impulse Accelerated Technologies Inc. (2007). Accelerate C in FPGA, Kirkland, WA, USA (2 p.) [online]. Available: http://www.impulsec.com. Accessed 17 July 2011.
8. Goering, R. (2006). Programmable logic: Startup moves binaries into FPGAs. EE Times.
9. CriticalBlue Ltd (2007). Cascade programmable application coprocessor generation, Pleasance, Edinburgh, United Kingdom (4 p.) [online]. Available: http://www.criticalblue.com. Accessed 17 July 2011.
10. Mentor Graphics Corporation (2010). Catapult C synthesis datasheet, Wilsonville, OR, USA (4 p.) [online]. Available: http://www.mentor.com/catapult. Accessed 17 July 2011.
11. Forte Design Systems (2008). Cynthesizer™: The most productive path to silicon, San Jose, CA, USA (2 p.) [online]. Available: http://www.forteds.com/products/cynthesizer.asp. Accessed 17 July 2011.
12. ESNUG ELSE 06 Item 7, Subject: Mentor Catapult C (2006). [online]. Available: http://www.deepchip.com/items/else06-07.html. Accessed 17 July 2011.
13. Reshadi, M., & Gajski, D. (2005). A cycle-accurate compilation algorithm for custom pipelined datapaths. In Proc. IEEE/ACM/IFIP int. conf. HW/SW codesign system synthesis, New York, NY, USA, 18–21 Sept. 2005 (pp. 21–26).
14. Corporaal, H. (1994). Design of transport triggered architectures. In Proc. 4th great lakes symp. design autom. high perf. VLSI syst., Notre Dame, IN, USA, 4–5 Mar. 1994 (pp. 130–135).
15. Jääskeläinen, P., Guzma, V., Cilio, A., Pitkänen, T., & Takala, J. (2007). Codesign toolset for application-specific instruction-set processors. In Proc. SPIE multimedia mobile devices, San Jose, CA, USA, 29–30 Jan. 2007 (Vol. 6507, pp. 05070X-1–10).
16. Esko, O., Jääskeläinen, P., Huerta, P., de La Lama, C. S., Takala, J., & Martinez, J. I. (2010). Customized exposed datapath soft-core design flow with compiler support. In Proc. int. conf. field programmable logic and applications, Milano, Italy, 31 Aug.–2 Sept. 2010 (pp. 217–222).
17. TCE: TTA codesign environment (2011). [online]. Available: http://tce.cs.tut.fi. Accessed 17 July 2011.
18. Corporaal, H. (1999). TTAs: Missing the ILP complexity wall. Journal of Systems Architecture, 45(12–13), 949–973.
19. Implementing AHB peripherals in logic tiles (2007). Application Note 119 [online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0119e/index.html. Accessed 17 July 2011.
20. Maemo by Nokia (2011). [online]. Available: http://maemo.org/. Accessed 17 July 2011.
21. AMBA open specifications (2011). [online]. Available: http://www.arm.com/products/system-ip/amba/amba-open-specifications.php. Accessed 17 July 2011.
22. Tremor by the Xiph.Org Foundation (2006). [online]. Available: http://wiki.xiph.org/index.php/Tremor. Accessed 17 July 2011.
23. Scratchbox cross-compilation toolkit project (2011). [online]. Available: http://www.scratchbox.org/. Accessed 17 July 2011.
Tomasz Patyk received his M.Sc. (EE) degree from Poznan University of Technology, Poznan, Poland, in 2007. In 2006 and 2007 he was a Research Assistant in the Department of Computer Systems at Tampere University of Technology (TUT), Tampere, Finland. Since 2008 he has held a position of Research Scientist at the same department and works towards his Dr.Tech (IT) degree. In his work at TUT he has taken part in several industry-sponsored projects. In 2010 and 2011 he was an External Software Designer at Nokia, Tampere. His research interests include embedded systems, HW/SW development for mobile architectures, and application-specific processor design.
Perttu Salmela received his M.Sc. (IT) in 2000 and Dr.Tech (IT) in 2009 from Tampere University of Technology (TUT), Tampere, Finland. His research topics include, but are not limited to, embedded systems, telecommunication and multimedia applications, and HW/SW development for application-specific processors. His main research work was carried out at TUT from 1998 to 2010, beginning as a Research Assistant and ending up as a Postdoctoral Researcher. Currently he is a Senior Engineer at Qualcomm.
Teemu Pitkänen received his M.Sc. (EE) degree from Tampere University of Technology (TUT), Tampere, Finland, in 2005. From 2002 to 2005, he worked as a Research Assistant, and currently he works towards the Dr.Tech degree as a researcher in the Institute of Digital and Computer Systems at TUT. His research interests include parallel architectures, minimization of energy dissipation, and design methodologies for digital signal processing systems.

Pekka Jääskeläinen has been working in the TTA-based Codesign Environment (TCE) project of the Department of Computer Systems, Tampere University of Technology, since its beginning in late 2002. His master's thesis (2005) described the design and implementation of the retargetable processor simulator of TCE. Currently he is pursuing a Dr.Tech degree with research conducted for the TCE project, mainly focusing on its retargetable compiler and multicore design flow issues. His research interests include processor architectures, processor design methodology, and code generation for static architectures.
Jarmo Takala received his M.Sc. (hons) (EE) degree and Dr.
Tech. (IT) degree from Tampere University of Technology,
Tampere, Finland (TUT) in 1987 and 1999, respectively. From
1992 to 1996, he was a Research Scientist at VTT-Automation,
Tampere, Finland. Between 1995 and 1996, he was a Senior
Research Engineer at Nokia Research Center, Tampere, Finland.
From 1996 to 1999, he was a Researcher at TUT. Currently, he is Professor of Computer Engineering at TUT and Head of the Department of Computer Systems at TUT. His research interests include circuit techniques, parallel architectures, and design methodologies for digital signal processing systems.