
Design Methodology for Offloading Software Executions to FPGA

2011, Journal of Signal Processing Systems

Field programmable gate array (FPGA) is a flexible solution for offloading part of the computations from a processor. In particular, it can be used to accelerate the execution of a computationally heavy part of a software application, e.g., in DSP, where small kernels are repeated often. Since application code for a processor is software, a design methodology is needed to convert the code into a hardware implementation applicable to the FPGA. In this paper, we propose a design method which uses the Transport Triggered Architecture (TTA) processor template and the TTA-based Co-design Environment toolset to automate the design process. With software as the starting point, we generate an RTL implementation of an application-specific TTA processor together with the hardware/software interfaces required to offload computations from the system's main processor. To exemplify what the integration of the customized TTA with a new platform could look like, we describe the process of developing the required interfaces from scratch. Finally, we present how to take advantage of the scalability of the TTA processor to target platform- and application-specific requirements.

J Sign Process Syst (2011) 65:245–259, DOI 10.1007/s11265-011-0606-x

Tomasz Patyk · Perttu Salmela · Teemu Pitkänen · Pekka Jääskeläinen · Jarmo Takala
Department of Computer Systems, Tampere University of Technology, P. O. Box 553, 33101 Tampere, Finland

Received: 29 January 2011 / Revised: 4 July 2011 / Accepted: 4 July 2011 / Published online: 30 July 2011
© Springer Science+Business Media, LLC 2011

This work has been supported by the Academy of Finland under research grant decision 128126.
Keywords Application-specific integrated circuits · Hardware accelerator · Computer aided engineering · System-on-a-chip · Coprocessors · Field programmable gate arrays

1 Introduction

The growing complexity of software applications running on portable devices, such as mobile phones, smart phones, and PDAs, calls for an increase in the processing power offered by their CPUs. Typically, a RISC processor employed as a general-purpose processing unit does not provide enough computational resources, and the use of a specialized hardware accelerator is inevitable. A DSP co-processor is a common solution to speed up multimedia applications. However powerful the DSP processor is, dedicated hardware will do the same task faster, consume less power, and take a smaller silicon area. Reconfigurable hardware in the form of a field programmable gate array (FPGA) makes an excellent solution for increasing the performance of an embedded system, as part of the application code can be offloaded from the processor. The performance increase requires careful planning, though. Quite often the overhead of such an arrangement, e.g., the cost of data transfers between a CPU and an FPGA, may be higher than the performance gain. Also, the clock frequency of the FPGA is often much lower than that of the CPU. Therefore, the inherent parallelism of the application needs to be exploited efficiently. Finally, the traditional development style for FPGAs resembles a hardware design process, which requires that the designer has expertise in hardware structures. Additionally, application code is typically in the form of software code, hence offloading requires the description to be converted to an RTL structure. Therefore, there is a need for a design methodology converting a software partition to a hardware structure. The methodology could be used, e.g., by software designers without deep knowledge of hardware implementations, as a rapid way of offloading computations to an FPGA.
In this paper, we describe a design methodology for offloading computations from a CPU to an FPGA. The proposed method allows a part of an application code, described in the C language, to be executed on an application-tailored processor implemented on the FPGA. The method supports the full ANSI C language; targets systems with an operating system; exploits DMA transfers to minimize the overheads; and allows the user to scale the computational resources up or down. Our experiments show that this method is scalable and can exploit the inherent parallelism of the application. In addition, the designer works at a higher abstraction level, thus deep knowledge of hardware design is not needed. The paper extends our previous work in [1] by providing details of the proposed design methodology.

The remaining part of the paper is organized as follows. Section 2 presents a brief survey of other available tools automating the offloading process. Section 3 sketches the offloading of computations, Section 4 details the implementation methodology for the described accelerator blocks, Section 5 describes the platform-specific interfacing, Section 6 discusses results for two different TTA designs, and Section 7 concludes the paper.

2 Related Work

Traditionally, the C language has been used to implement DSP algorithms and applications. The large amount of legacy C code turns attention to design methods capable of converting functionality described in the C language to a hardware structure as easily as possible. A large number of tools taking a C program as the initial description is already available on the market. In theory, such tools could be used for FPGA-based acceleration. However, many tools have serious limitations, e.g., only a subset of C is fully supported, which makes the C-to-hardware conversion process more complex and time consuming.
Furthermore, many tools generate only the RTL description of a hardware accelerator without support for system integration. The user has to manually design the scheduling and communication mechanisms between the accelerator and the host processor, build the interface units, and provide device drivers. Synphony C Compiler [2] generates processor arrays from C programs. However, it supports only a limited subset of C. It also requires manual setting of parameters affecting the scheduling of operations. CoWare Processor Designer [3] is a toolset for designing application-specific processors (ASIP); it is not a generic tool for converting C to HDL descriptions. Target IP Designer [4, 5] is another similar tool. AutoESL [6] supports high- and low-level parallelism but does not support the full ANSI C language. Impulse CoDeveloper [7] is targeted at FPGA-based acceleration, but it assumes a computational model comprised of sequential processes communicating with each other. Therefore, it suits well only applications consisting of independent processes receiving and emitting data streams. In addition, it does not support the full ANSI C language. Binachip-FPGA [8] also targets FPGA acceleration. In contrast to other tools, the description of the system is given as a compiled binary for a supported processor architecture instead of C language source code. Cascade [9] is another tool which uses ARM, PowerPC, or MicroBlaze binaries as the description of the desired functionality. These tools, inputting binaries instead of source code, are presumably targeted at cases where the source code of the program is not available. Otherwise it is hard to justify the lower-level input format, given that even the C language is a very low-level sequential language from which producing a parallel implementation is often very challenging, even if the described algorithm is inherently parallel.
Catapult-C [10] generates a fixed-function implementation instead of a processor-based one. As a drawback, generating the hardware implementation requires a lot of user attention. C2H is a tool only for Altera FPGA devices. It requires direct access to a memory shared with the master processor. The tool supports only a subset of C, and its external connectivity is based on Altera's Avalon bus. Cynthesizer [11] is another tool for rapid hardware generation. However, it requires using SystemC as well. In general, extensive modifications are required to the original ANSI C code [12]. NISC [13] is a tool for generating no-instruction-set-computer architecture processors from C. On the architectural level, the basic idea of NISC, the use of an extremely "bare bone" processor template, is similar to the TTA template used in this work. However, full ANSI C is not supported. In the proposed method, we target support for full ANSI C descriptions and allow the user to trade off execution time against area according to the given requirements. In addition, the proposed method supports offloading on targets with operating systems (OS).

3 Design Method for Offloading Computations

An FPGA in an embedded system gives system designers a unique opportunity to offload some of the computation from the host processor, hence reducing the computational load on it. This hardware can serve as a hardware accelerator for some specific, e.g., DSP, algorithm that cannot be computed efficiently enough by the main unit. Another common case is to simply offload some computationally intensive tasks from the host processor in a multi-task system and let the processor execute other tasks while waiting for the results of the offloaded computation.
Either way, the system designer is faced with the following design challenges:

– host processor utilization;
– the hardware (HW) / software (SW) interface between the host processor and the offloaded unit; and
– a co-design methodology to produce a HW-accelerated implementation from the SW implementation.

When considering host utilization, several issues need to be taken into account. Firstly, since multitasking systems governed by an operating system are of our primary interest, it is essential that the offloaded execution is non-blocking. This means that the host processor should be able to continue execution while the offloading hardware is doing its job. Quite often this means that the operating system schedules other tasks/processes to the processor until the execution can be resumed. Secondly, in some cases the FPGA system does not have random access to the local memory of the processor where the operands of the computations are stored. This imposes the requirement of transferring data to and from the local memory of the FPGA device. Not only does this take time, but, if done actively by the host processor, it also keeps the processor busy. A common way to avoid occupying the host processor with the data transfers is the use of Direct Memory Access (DMA) transfers. In platforms supporting DMA, this method offloads the data transfers from the processor to a peripheral hardware unit. Thirdly, the FPGA circuit usually runs at a clock frequency several times lower than that of the host processor. The actual acceleration expected from using the FPGA needs to be calculated keeping this in mind. Naturally, the gain arising from the fact that the host processor can perform other tasks meanwhile is preserved.
These factors lead us to the following conclusion: in order to speed up application execution with an FPGA accelerator, the speed of the accelerator hardware should compensate for the additional data transfer penalties, the potentially lower clock frequency of the accelerator, and the overhead of task switching in the operating system. Preferably, the accelerator design technique should be scalable so it can be used to design accelerators that meet the required computational efficiency while staying within the silicon area limits of the platform.

The communication interface is specific to the used platform and hardware accelerator. If the accelerator is manually designed for a certain platform, the interface will be a direct map to the interface exposed by the platform. If the accelerator is generated with an automated approach, e.g., using a processor template, the need for an adapter interface is almost certain. Should DMA be exploited, the interface needs to implement the means to enable this functionality. The interface is comprised of a hardware (HW) and a software (SW) part. The HW interface establishes the signal connections between the system platform and the accelerator. The SW interface, in its basic form, allows data transfers to be performed, initiates the computations, and signals the host processor about their completion.

For the design methodology, our approach is to design an application-specific processor for the task to be offloaded, and then use a retargetable C compiler to generate binary code for the customized processor. We will also show how to create a HW/SW interface for an arbitrary platform. This interface requires non-recurring engineering work. Once created, it can be reused on this particular platform with different application-customized accelerators. The HW and SW interfaces can later be distributed, e.g., in the form of reusable libraries.
4 Accelerator Implementation

In this work, the transport triggered architecture (TTA) [14] was used as a processor template for designing the accelerators. For design automation, the TTA-based Co-design Environment (TCE) [15–17], which uses the TTA paradigm as a template for customizing application-specific processors, was used.

4.1 Processor Template

Transport Triggered Architectures (TTA) belong to the class of exposed-datapath VLIW architectures, i.e., the details of the data path transfers are exposed to the programmer. This enables various unique optimizations in code generation and the customization of the data path interconnection. In contrast to traditional "operation triggered architectures", where operations are decoded to control signals that initiate operand transports, TTA instructions explicitly define and schedule the operand transports. The operation executions are side effects of the operand transports. The internal buses are used efficiently, as the data transports on each bus can be controlled independently. The modular structure of the TTA is illustrated in Fig. 1 (TTA processors consist of the control unit (CU) and a variable number of function units (FU), special FUs (SFU), register files (RF), and load/store units (LSU); unused connections between the resources can be excluded from the interconnection network because of the data transport programming). Basic building blocks of TTA processors are function units (FU), register files (RF), a control unit, and an interconnection network between the data path resources. TTA processors are programmed by data transports between the computing resources, and the programming paradigm resembles data flow programming. Each function unit contains one or more input ports. One of the input ports is a trigger port, which triggers the operation execution when the operand is moved to this port.
This means that other operands have to be moved to the corresponding ports earlier or in the same instruction cycle as the move to the trigger port. This requires careful scheduling of data transports. Operands can be passed directly from one function unit to another (software bypassing). Furthermore, the data can often be fully bypassed without the need for storing temporary results in a register file at all. In addition to reducing the number of needed general-purpose registers to avoid spills, software bypassing lowers register file pressure, one of the biggest bottlenecks of VLIW machines [18].

One of the main benefits of the TTA template is its flexibility; architectures generated using the TTA template can be scaled to the requirements at hand. For instance, there are no limits on the number of parallel FUs or RFs. The FUs can have an arbitrary number of pipeline stages or an arbitrary delay. Furthermore, there is no limit on the number of input and output ports of FUs, and the FUs can be connected to an external interface of the processor directly. The external interface is simply extended with the connected FU signals, which allows, e.g., using local memories freely. A second significant benefit is the simplicity and modularity of the processor, which eases verification and pre-synthesis cost estimation.

4.2 TTA-based Codesign Environment

The TTA-based Codesign Environment (TCE) [15–17] is a toolset that uses the TTA paradigm for developing application-specific instruction set processors.
TCE offers a set of tools which allow a designer to customize the processor architecture, compile high-level language programs for the designed architectures, simulate the program execution, and evaluate the cost functions of execution cycles, area, and energy. The toolset includes both command line and graphical user interface tools for powerful scripting and comfortable usability. TCE allows the designer to design processors completely manually or in a semi-automated fashion. In the first case, the designer uses a graphical tool to instantiate an architecture template and to populate it with resources. The library of predefined processor units includes register files, functional units, long immediate units, etc. Additionally, the designer can add his own customized application-specific units. The graphical tool allows connecting processor resources with each other through the transport buses. In the semi-automated design flow, the designer can automatically create an architecture based on the requirements of the application. Starting from an initial architecture provided by the designer, the design space explorer automatically adds and removes resources. Finally, the designer is given a database of architectures with associated information about the cycle counts required for executing the application.

The TCE design flow for FPGA circuits is illustrated in Fig. 2. The input is a high-level language program. The first design space exploration loop is performed at the architecture level, where the designed TTA is modified using graphical tools and evaluated using a retargetable compiler and a processor simulator. It should be noted that the "design space explorer" can be an automatic tool or the designer, depending on the desired design flow. The next phase is the hardware generation, where a platform-specific implementation of the architecture is produced.
The implementation is then evaluated with platform vendor specific tools, which can return the design space exploration back to the architecture exploration in case the desired constraints (area, clock frequency, speed, power consumption) are not met. The design variations are evaluated at the architectural level by compiling programs for them and running architectural simulations. The C compiler is ANSI C compliant, hence, there are no restrictions on the C syntax. (Figure 2: TCE design flow for FPGA circuits. The flow takes a high-level language program (C, C++, OpenCL) through the TCE design tools — Processor Designer, Retargetable Compiler, Retargetable Instruction Set Simulator, Processor Generator, Program Image Generator, and Platform Integrator, backed by FPGA-specific hardware databases and platform descriptions — to third-party FPGA synthesis tools producing FPGA programming files, with feedback to the designer or an automated "explorer".) Once the designer is satisfied with the architecture, the processor and the proper program image can be generated. TCE tools generate the HDL files for the selected architecture and a bit image of the application. The processor architecture can be synthesized from the HDL files using third-party tools. In order to overcome the disadvantage of long instructions in VLIW designs, instruction compression can be used at this point. The binary image of the application is compressed and a corresponding decompressing block is added to the control unit of the target processor. For a more detailed description of the TCE FPGA design flow, the reader is referred to our previous paper [16].

4.3 Accelerator Design

The method for designing an accelerator on an FPGA for offloading computations from the host processor contains the following steps:

1. select a piece of code to be offloaded from the processor to the FPGA;
2. replace the selected code by calls to the device driver to initiate operand transfers and execution on the FPGA;
3. customize a TTA processor for the selected code with the aid of the TCE toolkit;
4.
using TCE tools, generate an HDL description of the customized TTA processor with the required interfaces from platform-specific hardware databases, and obtain the FPGA configuration with commercial synthesis and place & route tools; and
5. generate machine code with the TCE retargetable compiler for the customized TTA.

At runtime, the FPGA configuration is downloaded to the FPGA and the TTA binary program code is loaded to the FPGA memory. After the initializations, the FPGA accelerator can be used by the software running on the host processor. The interfaces are loaded from platform-specific component libraries.

5 Target-Specific Interfacing

The communication between the host processor and the application-specific TTA processor(s) configured on the FPGA is target dependent. Therefore, interfaces and protocols with device drivers are tailored for each target platform. However, once the tailoring has been done, the interfaces and protocols can be stored in libraries and reused for new applications.

5.1 Hardware Interface

Our example target platform was the RealView Platform Baseboard for ARM926EJ-S, which contains the ARM processor and an FPGA chip. A simplified block diagram presenting the main components and their connections is shown in Fig. 3 (organization of the target platform; MPMC: multiport memory controller, DMAC: DMA controller). In this platform, all peripherals which have a memory-mapped interface communicate with the processor through the ARM-specific AMBA AHB bus. Figure 4 shows the basic connection of slave peripherals to the tri-state AMBA AHB bus [19]. All AHB slave modules have their inputs permanently connected to the AHB signals. Outputs, on the other hand, are multiplexed. The Decoder component resolves addresses from the AHB address bus (HADDR) and activates the right component, both by setting its HSEL signal high and by multiplexing its output back to the AHB bus.
Since TTAs use the Harvard architecture, their interface is comprised of separate buses to the instruction and data memories. Additionally, our TTA included two control signals: input TTA_START and output TTA_COMPLETE. Those signals were used to start the computations and to indicate that the results are ready. Once the TTA_COMPLETE signal is asserted, the TTA is locked and does not perform any tasks. This prevents possible data corruption and allows safe copying of the results from the memory on the FPGA to a memory accessed by the host processor.

The adapting interface between the target platform and the TTA, presented in Fig. 5 (principal block diagram of the application-specific processor in the FPGA; Fig. 4 shows the connection of AMBA AHB slaves [19]), is realized through three distinctive components instantiated on the FPGA: the data memory, the instruction memory, and the DMA module (DMAM). Both memories are AMBA AHB slaves. The data memory is a dual-port RAM built from the on-chip memory cells of the FPGA. One port is connected to the TTA data memory interface, while the second port, which has an AHB interface, is connected to the AMBA bus.

The instruction memory is implemented in a similar way, with one exception: the ports are asymmetric in width. This is due to the very long instruction word of the TTA and, at the same time, the 32-bit width of the port connected to the AMBA bus. Because of this asymmetry, additional control logic is needed on the AMBA port to store and assemble several data words from the host processor into a complete instruction word. This control logic is described with generic parameters, thus it can be reused easily by obtaining the details of the binary code from the TCE compiler: memory size (the number of instructions to be stored); memory width (the instruction width); and word width (the width of the data words obtained from the host interface; in this case, the AHB uses 32-bit words).
The actual data transfers on the FPGA are managed by the DMAM, a simple finite state machine (FSM) that synchronizes DMA transfers with TTA processing and interleaves the accesses to the data memory. Typically, the following steps occur:

1. the TTA is idle (locked) and does not access the data memory; the DMAM enables DMA transfers;
2. the DMA controller transfers data (divided into bursts) and the DMAM acknowledges consecutive bursts;
3. after the last burst, the DMAM acknowledges the transfer and unlocks the TTA, which starts processing the data in the memory;
4. once processing is done, the TTA locks itself and informs the DMAM about the task completion; and
5. the DMAM enables DMA transfers (pending or upcoming).

After the last step, the DMA controller can set up the transfer back to the SDRAM. From the host processor's point of view, offloading computations is nothing more than pushing data back and forth. An additional advantage comes from the fact that locking the TTA processor can result in significant power savings, as the processor itself is neither polling nor waiting for an external interrupt.

5.2 Software Interface

The software interface is a platform-specific driver. Our software platform was a Linux-based OS, Maemo Scirocco [20], which is tailored for mobile systems. Therefore, we implemented the driver as a Linux kernel module, which can be loaded dynamically at runtime. The host-slave communication is managed by the host processor through the DMA controller configured with the device driver. The driver is implemented as a character device driver, which means that all the operations are performed on the file corresponding to the physical device. The list of system calls implemented by the DMA driver can be found in Table 1. The developed driver supports both non-blocking and blocking data transfers.
The driver also implements the DMA interrupt service, which is used to wake up the application during a blocking read. Since the DMA interrupt is enabled per transfer, it is important to enable it with the ioctl system call before a blocking read is issued. Figure 6 presents a typical use of the DMA controller driver system calls in an application program.

Table 1 System calls implemented by the DMA controller Linux driver.

open — Initiates the driver-specific structure. Opens a channel to communicate with the DMA controller.
close — Finishes an on-going transfer (if any) and clears private data.
ioctl — Sets transfer parameters, e.g., channel, number of bytes to be transferred, source and destination addresses, etc. In the basic case, configuring a DMA transfer requires setting parameters in four registers of the DMA controller (DMACCxSrcAddr, DMACCxDestAddr, DMACCxControl, DMACCxConfiguration). More details can be found in the DMA controller documentation [21].
write — Triggers the transfer from the SDRAM to the FPGA. All necessary parameters need to be set up with ioctl beforehand.
read — Implements a blocking read operation. Triggers the transfer from the FPGA to the SDRAM. This transfer might be blocked by the DMAM until the TTA finishes processing. The transfer parameters need to be set up with ioctl beforehand.
mmap — Maps a buffer from the kernel space to the user space. Mmap is required to make the same buffer visible in both spaces: it must be visible in the kernel space for the DMA controller and in the user space for the application. Data copying between kernel and user spaces is avoided by using the same buffer.

5.3 Processor—Accelerator Interaction
Figure 6 Example of offloading code with a blocking call:

/* Open device */
fd0 = open(CHANNEL0, O_RDWR);

/* Allocate DMA buffer in kernel space */
ioctl(fd0, PL08X_IOC_ALLOC_SDRAM_BUFF, 8*BUF_SIZE);

/* Map buffer from kernel to user space */
buffer = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
              MAP_SHARED, fd0, 0);

/* Setup DMA controller registers */
ioctl(fd0, PL08X_IOC_SET_ALL, &dmac_c_params);

/* Enable DMA interrupt */
ioctl(fd0, PL08X_IOC_SET_DMA_IRQ, 1);
...
/* The write and read system calls replace the call
 * to the offloaded function in the original code. */

/* Transfer data from SDRAM to FPGA */
write(fd0, NULL, 0);

/* Blocking read until offloading is done */
read(fd0, NULL, 0);
...
/* Unmap buffer */
munmap(v->work[0], 8*BUF_SIZE);

/* Close device */
close(fd0);

Figure 7 presents the sequence diagram describing how the host processor operates with the accelerator during the program execution (Figure 7 Sequence diagram of a program execution with offloading: on the ARM, the Tremor decoder task alternates with other tasks through OS context switches while the offloaded work proceeds). Assuming that the FPGA has already been configured for the given application, the interaction is carried out in the following fashion. First, the application is started on the ARM processor. When the offloading should start, the host processor configures the DMA controller to perform a block transfer from the SDRAM to the TTA local memory in the FPGA and starts the transfer. The host processor is now free to execute other tasks. After the DMA block transfer is completed, the TTA processor immediately starts processing the data. Once the TTA processor has completed the processing, it signals the end of the processing to the DMA controller so that the DMA transfer from the FPGA to the SDRAM can be initiated.
When the transfer is finished, the DMA controller signals the host processor with an interrupt that the offloading is completed and the results are available. The interrupt service routine of the DMA device driver signals the operating system for a context switch, and the application continues its execution. On consecutive offloading events the procedure is repeated.

6 Experiments

To prove the feasibility of the proposed methodology, we carried out experiments with the RealView Platform Baseboard, equipped with a Xilinx Virtex-II family FPGA. At the heart of the board is the ARM926EJ-S, a 32-bit RISC processor with a wide range of peripherals, including the DMA controller (DMAC) and the memory management unit (MMU). The board also contains 128 MB of 32-bit wide SDRAM and 128 MB of 32-bit wide NOR flash memory.

The proposed design methodology was evaluated using the Tremor Ogg Vorbis audio decoder [22] as an example application. It is an open-source, fixed-point implementation of the standard, designed especially for platforms without floating-point arithmetic. (Figure 8 Flow diagram of the Tremor Ogg Vorbis audio decoder: ov_read, fetch_and_process_packet, vorbis_dsp_synthesis, mapping_inverse, vorbis_dsp_pcmout; cycle shares: MDCT inverse 51%, Residue 23%, MDCT unroll/lapping 14%, Floor 9%.) Instead of compiling the code directly on the board, we decided to cross-compile it with the Scratchbox cross-compiling toolkit [23] run on an i686 Linux-based host machine. Finding a part of the application suitable for offloading is not trivial in the general case, especially with large programs. Fortunately, in situations where the computational kernel cannot be easily identified, profiling tools, like TCE's proxim or GNU's gprof, can be used.
The profiling provided information about the most complex functions in the Tremor Ogg Vorbis decoder. Figure 8 shows the flow diagram of the decoder along with the percentage of clock cycles used by the most significant parts of the application. Nearly 50% of the computation time was used to compute the modified discrete cosine transform (MDCT). As this function processes data in a consistent memory range, it was an obvious candidate for offloading.

We built the customized TTA processor for the MDCT with the aid of the TCE tools. To illustrate the scalability of the tools, we developed two processors: the first targeting a short execution time and the second aiming at a smaller area. We call them fastTTA and smallTTA, respectively. The starting point was a so-called minimal architecture, which contains just enough resources for the TCE compiler to compile any program. The function computing the MDCT was extracted from the Tremor code and wrapped in a main function in a separate file. The TCE compiler supports the ANSI C language, so no other modifications were made to the original code. The code was compiled and profiled in a cycle-accurate simulator. The profiling tool shows the utilization of each function unit; hence, the often used FUs were duplicated to improve performance. The final configuration is given in Fig. 9a. The fastTTA, partially presented in Fig. 9b, was obtained with the design space explorer tool from the TCE toolset.

Figure 9 Principal organization of the customized TTAs: (a) machine with limited resources, "smallTTA", and (b) higher performance machine, "fastTTA". [Figure: (a) the smallTTA datapath with ALU, LSU, LOGIC, MUL, SHIFT, and IO_SFU function units, 32x32 register files, a 2x1 Boolean register file, and a GCU connected by 5 buses; (b) the fastTTA datapath with additional ALUs, multipliers, shifters, an LSU, and register files connected by 17 buses.]
This tool automates the design process by adding resources iteratively until the cycle count cannot be reduced any more. Compared to the smallTTA, the fastTTA has the following additional components: two multipliers, three ALUs, two shifters, two register files, and 12 buses. The profiling, code modification, and design of the two application-specific TTA processors took approximately two days of work.

The two TTA machines were integrated with the rest of the hardware system from Fig. 5. Both designs were synthesized with the Xilinx ISE Design Suite 10.1. Table 2 presents some results taken from the synthesis and place & route reports. As we can see, the fastTTA takes almost three times more FPGA slices than the smallTTA due to the large number of FUs and interconnect buses. The difference in the number of multipliers is also significant: the fastTTA uses nine embedded 18-bit multipliers, whereas the smallTTA has only three. The on-chip memory is also almost 50% larger when the fastTTA is used.

The difference in the performance of the TTAs is represented by the TComp value, measured with the clock-cycle-accurate simulator from the TCE toolkit. The smallTTA takes 68,315 cycles to execute the offloaded task, while the fastTTA executes the same routine in 50,639 cycles.

The critical path of both designs, given in Table 2, is affected by two factors. Firstly, the synthesis was made for a relatively old FPGA architecture, namely the Xilinx Virtex-II. As an example, a clock frequency of 191 MHz was obtained with a similar TTA processor when synthesized for a modern Xilinx Virtex-5 FPGA [16]. Secondly, no manual optimizations were used to shorten the critical path. Much higher clock frequencies could be obtained with manually optimized interconnect buses. Such manual optimization can be easily applied with the help of the graphical user interface of one of the tools in TCE. It is worth mentioning that this optimization process does not require hardware design expertise from the designer.
The longer critical path of the fastTTA is due to its more complex interconnection bus. The complexity of the bus increases with the number of FUs and RFs in the design, and the bus is often on the critical path of FPGA implementations of TTAs.

To measure the execution time of the application as accurately as possible, we instantiated one additional component in the FPGA, a cycle counter, which simply measures the number of FPGA clock cycles. The component has memory-mapped registers, which allow starting, stopping, resetting, and reading of the measured clock cycles. The cycle counter is an AMBA AHB slave and can be accessed by the ARM processor in exactly the same way as any other memory-mapped peripheral in the system.

The execution time of the accelerated function TOffload can be split into three distinct parts:

    TOffload = TComp + TTrans + TOS,    (1)

where TComp is the time used by the TTA for the computations, TTrans indicates the time of the data transfers, and TOS reflects the OS overhead of the master/slave communication. In our experiments, TOffload was measured with the cycle counter. The exact value of TComp can be calculated with the cycle-accurate simulator from the TCE tools. TTrans can be computed based on the transfer protocol and the number of data elements to be transferred. Based on these, TOS can be calculated according to Eq. 1.

Table 2 Characteristics of the offloading compared to the software-only implementation.

                               fastTTA   smallTTA   ARM
FPGA slices                    15,412    4,960      N/A
FPGA memory [kB]               86        54         N/A
FPGA Mul18 blocks              9         3          N/A
Max clock frequency [MHz]      35.90     36.02      210.00
Critical path [ns]             27.85     27.77      N/A
TComp [clock cycles]           50,639    68,315     682,500
TTrans [clock cycles]          9,216     9,216      N/A
TOS [clock cycles]             ~1,000    ~1,000     N/A
TOffload [clock cycles]        60,855    78,531     682,500
Offloaded code size [bytes]    21,064    10,986     7,380

Table 2 also lists the execution time results. The number of clock cycles the offloading takes is compared to the number of cycles the host processor needs to perform the same computations. However, to interpret these results correctly, we need to take into account that, generally, the host processor runs at a higher frequency than the slave processor; in our case, the ratio equals 7. Keeping that in mind, we obtain 1.6× and 1.24× speedups when offloading with the fastTTA and the smallTTA, respectively. An additional gain comes from the fact that the host processor can perform other tasks while waiting for the offloading to complete. We are running a multitasking OS, and other processes can be scheduled to run on the CPU during that time, as shown in Fig. 7.

The number of bytes the offloaded function takes after compiling is given in the last row of Table 2. As can be seen, the program for the fastTTA is almost twice as big as the binary for the smallTTA. Conserving the available memory on the FPGA for other purposes can be another reason to customize the accelerator exactly to the application requirements.

Figure 10 AMBA HTRANS signal messages during a data transfer: (a) burst and (b) non-burst transfer. Each box corresponds to one clock cycle. Dark boxes indicate cycles when valid data is presented on the bus.

Finally, the reason for the relatively low data transfer throughput is the transfer protocol used on the AMBA bus. Figure 10 shows the messages transferred from the master to the slave on HTRANS, one of the AMBA signals. There are four distinct messages, but only NONSEQ and SEQ indicate valid data on the bus. If we take a closer look at the messages during the burst transfer, shown in Fig. 10a, we see that 18 clock cycles are required to transfer four words of data, in other words, 4.5 cycles per word. In a non-burst transfer, depicted in Fig. 10b, one data word is sent every 6 cycles. If we transfer 1,024 words, which is the common case for the Tremor decoder, the transfer takes 4,608 or 6,144 clock cycles in the burst and non-burst modes, respectively.
The bus we use for the transfer is not used for any other purpose, so it is safe to claim that the calculated numbers hold in the general case. The burst mode can be set with the DMACCxControl registers of the DMA controller.

7 Conclusion

In this paper, we described a method for offloading computations from a host processor to an FPGA. The proposed approach supports platforms with an operating system and offloads both the computation and the data transfer between the host and slave processors. The computations are implemented on a TTA processor, which is customized for the given application and exploits the inherent instruction-level parallelism of the application. The interfaces and communication between the host processor and the slave TTA are target-specific but can be reused on the same target. The communication packages and interfaces are generic and allow any type of functionality to be offloaded from the host processor in this environment.

As a case study, we customized two TTAs for an audio decoding application, showing the scalability of the TCE toolset. The obtained results demonstrate that the difference in the targeted parameters is significant and the final product can be a trade-off based on the requirements. The design work is done with the TCE tools at a high abstraction level, so no hardware design expertise is needed. Finally, the results of our experiment show that offloading speeds up the application execution when compared to the software-only execution. However, the speedup depends on the characteristics of the processor and the FPGA fabric. An additional gain comes from the fact that the offloading is a non-blocking procedure: in a multitasking operating system, other processes can be scheduled to run on the CPU while the offloading is taking place.

References

1. Patyk, T., Salmela, P., Pitkänen, T., & Takala, J. (2010). Design methodology for accelerating software executions with FPGA. In Proc. IEEE workshop signal process. syst., Cupertino, CA, USA, 6–8 Oct. 2010 (pp.
46–51).
2. Synopsys Inc. (2011). High-level synthesis with Synphony C Compiler. Mountain View, CA, USA (4 p.) [online]. Available: http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/SynphonyC-Compiler.aspx. Accessed 17 July 2011.
3. Hoffman, A., Kogel, T., Nohl, A., Braun, G., Schliebusch, O., Wahlen, O., et al. (2001). A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(11), 1338–1354.
4. Praet, J. V., Lanneer, D., Geurts, W., & Goossens, G. (2001). Processor modeling and code selection for retargetable compilation. ACM Transactions on Design Automation of Electronic Systems, 6(3), 277–307.
5. Target Compiler Technologies (2008). IP designer | IP programmer. Leuven, Belgium (4 p.) [online]. Available: http://www.retarget.com. Accessed 17 July 2011.
6. Cong, J. (2008). A new generation of C-based synthesis tool and domain-specific computing. In Proc. IEEE int. SOC conf., Newport Beach, CA, USA, 17–20 Sept. 2008 (pp. 386–386).
7. Impulse Accelerated Technologies Inc. (2007). Accelerate C in FPGA. Kirkland, WA, USA (2 p.) [online]. Available: http://www.impulsec.com. Accessed 17 July 2011.
8. Goering, R. (2006). Programmable logic: Startup moves binaries into FPGAs. EE Times.
9. CriticalBlue Ltd (2007). Cascade programmable application coprocessor generation. Pleasance, Edinburgh, United Kingdom (4 p.) [online]. Available: http://www.criticalblue.com. Accessed 17 July 2011.
10. Mentor Graphics Corporation (2010). Catapult C synthesis datasheet. Wilsonville, OR, USA (4 p.) [online]. Available: http://www.mentor.com/catapult. Accessed 17 July 2011.
11. Forte Design Systems (2008). Cynthesizer: The most productive path to silicon. San Jose, CA, USA (2 p.) [online]. Available: http://www.forteds.com/products/cynthesizer.asp. Accessed 17 July 2011.
12. ESNUG ELSE 06 Item 7, Subject: Mentor Catapult C (2006). [online]. Available: http://www.deepchip.com/items/else06-07.html. Accessed 17 July 2011.
13. Reshadi, M., & Gajski, D. (2005). A cycle-accurate compilation algorithm for custom pipelined datapaths. In Proc. IEEE/ACM/IFIP int. conf. HW/SW codesign system synthesis, New York, NY, USA, 18–21 Sept. 2005 (pp. 21–26).
14. Corporaal, H. (1994). Design of transport triggered architectures. In Proc. 4th Great Lakes symp. design autom. high perf. VLSI syst., Notre Dame, IN, USA, 4–5 Mar. 1994 (pp. 130–135).
15. Jääskeläinen, P., Guzma, V., Cilio, A., Pitkänen, T., & Takala, J. (2007). Codesign toolset for application-specific instruction-set processors. In Proc. SPIE multimedia mobile devices, San Jose, CA, USA, 29–30 Jan. 2007 (Vol. 6507, pp. 05070X-1–10).
16. Esko, O., Jääskeläinen, P., Huerta, P., de La Lama, C. S., Takala, J., & Martinez, J. I. (2010). Customized exposed datapath soft-core design flow with compiler support. In Proc. int. conf. field programmable logic and applications, Milano, Italy, 31 Aug.–8 Sept. 2010 (pp. 217–222).
17. TCE: TTA codesign environment (2011). [online]. Available: http://tce.cs.tut.fi. Accessed 17 July 2011.
18. Corporaal, H. (1999). TTAs: Missing the ILP complexity wall. Journal of Systems Architecture, 45(12–13), 949–973.
19. Implementing AHB peripherals in logic tiles (2007). Application note 119 [online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0119e/index.html. Accessed 17 July 2011.
20. Maemo by Nokia (2011). [online]. Available: http://maemo.org/. Accessed 17 July 2011.
21. AMBA open specifications (2011). [online]. Available: http://www.arm.com/products/system-ip/amba/amba-open-specifications.php. Accessed 17 July 2011.
22. Tremor by the Xiph.Org foundation (2006). [online]. Available: http://wiki.xiph.org/index.php/Tremor. Accessed 17 July 2011.
23.
Scratchbox cross-compilation toolkit project (2011). [online]. Available: http://www.scratchbox.org/. Accessed 17 July 2011.

Tomasz Patyk received his M.Sc. (EE) degree from Poznan University of Technology, Poznan, Poland, in 2007. In 2006 and 2007 he was a Research Assistant in the Department of Computer Systems at Tampere University of Technology (TUT), Tampere, Finland. Since 2008 he has held a position of Research Scientist at the same department and works towards his Dr.Tech (IT) degree. In his work at TUT he has taken part in several industrially sponsored projects. In 2010 and 2011 he was an External Software Designer at Nokia, Tampere. His research interests include embedded systems, HW/SW development for mobile architectures, and application-specific processor design.

Perttu Salmela received his M.Sc. (IT) in 2000 and Dr.Tech (IT) in 2009 from Tampere University of Technology (TUT), Tampere, Finland. His research topics include, but are not limited to, embedded systems, telecommunication and multimedia applications, and HW/SW development for application-specific processors. His main research work was carried out at TUT from 1998 to 2010, beginning as a Research Assistant and ending up as a Postdoctoral Researcher. Currently he is a Senior Engineer at Qualcomm.

Teemu Pitkänen received his M.Sc. (EE) degree from Tampere University of Technology (TUT), Tampere, Finland, in 2005. From 2002 to 2005, he worked as a Research Assistant, and currently he works towards his Dr.Tech degree as a researcher in the Institute of Digital and Computer Systems at TUT. His research interests include parallel architectures, minimization of energy dissipation, and design methodologies for digital signal processing systems.

Pekka Jääskeläinen has been working in the TTA-based Codesign Environment (TCE) project of the Department of Computer Systems, Tampere University of Technology, since its beginning in late 2002. His master's thesis (2005) described the design and implementation of the retargetable processor simulator of TCE. Currently he is pursuing a Dr.Tech degree with research conducted for the TCE project, mainly focusing on its retargetable compiler and multicore design flow issues. His research interests include processor architectures, processor design methodology, and code generation for static architectures.

Jarmo Takala received his M.Sc. (hons) (EE) degree and Dr.Tech (IT) degree from Tampere University of Technology (TUT), Tampere, Finland, in 1987 and 1999, respectively. From 1992 to 1996, he was a Research Scientist at VTT-Automation, Tampere, Finland. Between 1995 and 1996, he was a Senior Research Engineer at Nokia Research Center, Tampere, Finland. From 1996 to 1999, he was a Researcher at TUT. Currently, he is a Professor in Computer Engineering at TUT and Head of the Department of Computer Systems at TUT. His research interests include circuit techniques, parallel architectures, and design methodologies for digital signal processing systems.