Ug Nios2 c2h Compiler
Copyright 2009 Altera Corporation. All rights reserved. Altera, The Programmable Solutions Company, the stylized Altera logo, specific device designations, and all other words and logos that are identified as trademarks and/or service marks are, unless noted otherwise, the trademarks and service marks of Altera Corporation in the U.S. and other countries. All other product or service names are the property of their respective holders. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.
Contents
Chapter 5. Accelerating Code Using the Nios II Software Build Tools Command Line
Creating an Accelerator from the Command Line
C2H Performance Metrics
9.1
Altera Corporation
Connection Pragma
  Reducing Arbitration Logic
  Optimizing Sequential Memory Access with Arbitration Shares
Flow Control Pragma
Interrupt Pragma
Unshare Pointer Pragma
Additional Information
Referenced Documents
Revision History
How to Contact Altera
Typographic Conventions
The Nios II C-to-Hardware Acceleration (C2H) Compiler is a tool that allows you to create custom hardware accelerators directly from ANSI C source code. A hardware accelerator is a block of logic that implements a C function in hardware, which often improves the execution performance by an order of magnitude. Using the C2H Compiler, you can develop and debug an algorithm in C targeting an Altera Nios II processor, and then quickly convert the C code to a hardware accelerator implemented in a field programmable gate array (FPGA). The C2H Compiler improves the performance of Nios II programs by implementing specific C functions as hardware accelerators. The C2H Compiler is not designed to create arbitrary hardware systems from C code. Rather, the C2H Compiler is a tool for generating a hardware accelerator module, functionally identical to the original C function, that offloads and enhances the performance of the Nios II processor.
- Chapter 1, Introduction to the C2H Compiler, provides a detailed background on the C2H Compiler and the concepts required to use it.
- Chapter 2, Getting Started Tutorial, provides hands-on instructions that teach you the first steps to begin using the C2H Compiler.
- Chapter 3, C-to-Hardware Mapping Reference, provides reference on how the C2H Compiler translates C constructs to hardware structures.
- Chapter 4, Understanding the C2H View, helps you use the C2H view to get performance information and to control the compilation of accelerators.
- Chapter 5, Accelerating Code Using the Nios II Software Build Tools Command Line, explains how to use the C2H Compiler with the Nios II software build tools.
- Chapter 6, Pragma Reference, summarizes usage of all C2H #pragma directives.
- Chapter 7, ANSI C Compliance and Restrictions, documents all sections of the ANSI C specification that the C2H Compiler does not support.
Target Audience
This user guide assumes you have at least a basic understanding of hardware design for field programmable gate arrays (FPGAs). It also assumes you are fluent in the C language and you have experience with software design in C for microprocessors. The C2H Compiler operates in conjunction with the following Altera tools:
- Quartus II software for creating FPGA designs
- SOPC Builder system integration tool for creating Nios II processor hardware systems
- C programming environments for the Nios II processor:
  - Nios II integrated development environment (IDE)
  - Nios II software build tools
To benefit from this user guide, you do not need to be an expert in these tools, and you do not need an understanding of any particular Altera FPGA family. However, at least a basic understanding of each tool is required to use the C2H Compiler practically.
Introduction
This chapter introduces the Nios II C2H Compiler. The sections in this chapter discuss the features, background, and principles of the C2H Compiler, and describe the most appropriate types of C code for acceleration. After reading this chapter, you will understand all the concepts necessary to begin using the C2H Compiler.
Features
The C2H Compiler is founded on the following premises:
- ANSI C syntax is sufficient to describe computationally intensive or memory access-intensive tasks.
- A C-to-hardware tool must not disrupt existing software and hardware development flows.
Based on these premises, the C2H Compiler's design methodology provides the following features:
- ANSI C compliance: The C2H Compiler operates on plain ANSI C code, and supports most C constructs, including pointers, arrays, structures, global and local variables, loops, and subfunction calls. The C2H Compiler does not require special syntax or library functions to specify the structure of the hardware. Unsupported ANSI C constructs are documented.
- Straightforward C-to-hardware mapping: The C2H Compiler maps each element of C syntax to a defined hardware structure, giving you control over the structure of your hardware accelerator.
- Integration with C language development environments for the Nios II processor, including the Nios II integrated development environment (IDE) and the Nios II software build tools: You control the C2H Compiler with the Nios II C development tools. You do not need to learn a new environment to use the C2H Compiler.
- Based on SOPC Builder and Avalon system interconnect fabric: The C2H Compiler uses SOPC Builder as the infrastructure to connect hardware accelerators into Nios II systems. A C2H accelerator becomes a component within an existing Nios II system. SOPC Builder automatically generates system interconnect fabric to connect the accelerator to the system, saving you the time of manually integrating the hardware accelerator.
- Reporting of generated results: The C2H Compiler produces a detailed report of hardware structure, resource usage, and throughput.
Hardware accelerators generated by the C2H Compiler have the following characteristics:
- Parallel scheduling: The C2H Compiler recognizes events that can occur in parallel. Independent statements are performed simultaneously in hardware.
- Direct memory access: Accelerators access the same memories that the Nios II processor does during execution.
- Loop pipelining: The C2H Compiler pipelines the logic implemented for loops, based on memory access latency and the amount of code that operates in parallel.
- Memory access pipelining: The C2H Compiler pipelines memory accesses to reduce the effects of memory latency.
implementing entire systems on a chip. As a result, the tools available to FPGA and software designers have undergone continual transformation of design-entry methods and behind-the-scenes optimization techniques. This transformation has enabled designers to create ever-bigger designs to fill ever-growing chip capacity. Recent years have seen the broad acceptance of FPGA-based microprocessor cores, such as the Nios II processor, and system integration tools, such as SOPC Builder. These tools made it possible, for the first time, to implement C code easily in an FPGA-based system. Optimizing and evolving these tools is a natural next step for C-based design on FPGAs. This background sets the stage for practical advances in C-to-hardware technologies based on an established design methodology. FPGA-based processors and system integration tools offer new ways to improve the performance of embedded systems. Traditional methods to increase performance of processor systems include:
- Increasing clock speed
- Upgrading to a processor with higher Dhrystone MIPS-per-megahertz performance
- Coding critical sections of software in assembly language
FPGA-based processor systems enable additional optimization techniques capable of achieving much higher performance gains. These techniques include:
- The ability to rapidly alter the FPGA design, allowing you to prototype a variety of architectures
- The ability to divide and conquer processing tasks by instantiating multiple processor cores
- The ability to augment a processor with custom hardware that offloads processor-intensive operations into the FPGA fabric
- The ability to adjust memory architecture for memory-intensive operations, such as using high-speed, point-to-point connections to fast memory buffers
The application of these techniques relies on real-world tools to implement them. Consequently, the acceptance of these techniques has grown as system integration tools, such as Altera's SOPC Builder, have matured and gained acceptance. It is a fortunate coincidence that these techniques also directly benefit C-to-gates methodologies. Flexibility of hardware architecture and ease of implementation are at the heart of the appeal of C-to-gates tools.
The Nios II C-to-Hardware Acceleration (C2H) Compiler represents Altera's next step in the evolution of embedded systems design. The C2H Compiler uses the infrastructure provided by SOPC Builder and the Nios II processor, and adds a higher level of abstraction: converting C functions directly to hardware.
The C2H Compiler assumes that your C code runs successfully on a Nios II processor system. The result of using the C2H Compiler is a program that runs on a Nios II processor system.
The C2H Compiler works best on C code that adheres to certain structural rules. It works well for many types of programs, but not all. Through education and habit, programmers structure C programs with an existing compiler in mind. Experienced designers learn the particular structures that produce optimal compiled results.
The C2H Compiler is also a C compiler. It takes ANSI C programs that execute normally on a processor. However, the program structure for producing optimal hardware results with the C2H Compiler often differs from code structured for execution on a processor. You achieve the best results if you have a reasonable understanding of how the C2H Compiler translates C structures to hardware. Refer to Chapter 3, C-to-Hardware Mapping Reference, for details.
The C2H Compiler is not a replacement for traditional HDL-based hardware design. Tasks such as connecting modules together and interfacing to bus protocols are not easily inferred from ANSI C code. In the hands of an experienced user, the C2H Compiler allows considerable control over circuit latency and parallelism. However, it does not provide the ability to define user logic with complex timing requirements. For example, the C2H Compiler does not allow you to create an arbitrary state machine that guarantees a particular operation on a specific clock cycle.
Note: The Nios II processor is little-endian. For Nios II compatibility, C2H accelerators expect to exchange little-endian data with the processor. If your accelerator must handle big-endian data, you can swap the byte order in the accelerated C code. Ensure that the data is in little-endian form when your accelerated function transfers it to any unaccelerated function.
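As a rough illustration of the byte-order note, the following sketch swaps byte order inside code that could be a candidate for acceleration. The function names swap_bytes32() and sum_big_endian_words() are hypothetical and do not come from the Altera documentation. The sketch assumes 32-bit words and uses only shifts, masks, and OR operations.

```c
#include <stdint.h>

/* Hypothetical helper (not part of any Altera API): reverses the byte
 * order of a 32-bit word using only shifts, masks, and ORs. */
uint32_t swap_bytes32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

/* Hypothetical acceleration candidate: sums `count` words that are stored
 * in big-endian order. Each word is swapped before it is used, so the
 * value handed back to unaccelerated code is in little-endian form. */
uint32_t sum_big_endian_words(const uint32_t *data, int count)
{
    uint32_t acc = 0;
    int i;

    for (i = 0; i < count; i++) {
        acc += swap_bytes32(data[i]);
    }
    return acc;
}
```

Keeping the swap inside the function means the caller never sees big-endian values, which is one way to meet the requirement in the note.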
- Debug your function prior to accelerating
- Generate the accelerator and incorporate it into your hardware
- Test and profile your software and hardware with the C2H accelerator
Altera recommends creating new C2H systems with the Nios II IDE.
For information about using the Nios II IDE, refer to the Using the Nios II Integrated Development Environment appendix to the Nios II Software Developer's Handbook, or to the Nios II IDE help system. For information about Nios II tool flows, refer to Development Flows for Creating Nios II Programs in the Overview chapter of the Nios II Software Developer's Handbook. The Nios II Software Build Tools also provide command-line support for pre-existing command-line C2H projects.
For information about using C2H on the command line, refer to Chapter 5, Accelerating Code Using the Nios II Software Build Tools Command Line.
Note: The Nios II Software Build Tools for Eclipse do not support the C2H Compiler.
This section describes fundamental concepts underpinning the C2H Compiler. These concepts help you better understand how the C2H Compiler works and how you can produce optimal results.
C2H Compiler calls other tools in the background to handle the hardware and software integration tasks. Specifically, the C2H Compiler automatically performs the following tasks in the background:
1. Calls SOPC Builder to specify how the accelerator connects to the system, and then generates the system hardware.
2. Calls the Quartus II software to recompile the hardware design and generate an SRAM object file (.sof).
System Architecture
Figure 1-1 shows the architecture of a simple Nios II processor system that includes one hardware accelerator.
Figure 1-1. Example System Topology with Single Hardware Accelerator
[Figure: block diagram. Blocks shown: Nios II Processor (instruction and data master ports), Hardware Accelerator (control slave port and master ports), MUX, arbitrators, Instruction Memory, Peripherals, and two Data Memory blocks. M marks a master port and S marks a slave port.]
SOPC Builder automatically integrates the accelerator logic into the system as an SOPC Builder component. If there is more than one accelerator in the system, multiple accelerators appear in SOPC Builder. Accelerators are separate from the Nios II processor but can access the same memory devices that the Nios II processor can.
The accelerator's connections are managed by the C2H Compiler. You can manually customize the connections using pragma directives in the accelerated C code. Chapter 6, Pragma Reference, describes C2H Compiler pragma usage. You cannot edit the accelerator's connections in the SOPC Builder GUI.
- One or more state machines that manage the sequence of operations defined by the C function. On any clock cycle, an arbitrary number of computations and memory accesses can happen simultaneously, orchestrated by the state machines.
- One or more Avalon Memory-Mapped (Avalon-MM) master ports, which fetch and store data as required by the state machines.
- An Avalon-MM slave port and a set of memory-mapped registers that allow the processor to set up, start, and stop the accelerator.
The software wrapper, executing on the Nios II processor, controls the accelerator by reading and writing the register interface. From the perspective of the calling function, the result of calling the software wrapper is functionally the same as calling the original C function. The basic operation of the software wrapper is as follows:
1. Sets up parameters for the accelerator, similar to passing variables to the original, unaccelerated function.
2. Optionally flushes the processor's data cache to avoid cache coherency problems. Flushing the data cache might be necessary if the accelerator accesses the same memory that the processor does.
3. Starts the accelerator. Once an accelerator is running, it can return a value, terminate, or run continuously, depending on the design of the C source code.
4. Polls registers in the accelerator hardware to determine when the task completes.
5. If the function returns a result, reads the result value, and returns it to the calling function.
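The five wrapper steps can be sketched in C as follows. This is only an illustration: the register layout (accel_regs_t), the mock accelerator, and all function names are hypothetical, because this section does not describe the register map that the C2H Compiler actually generates. The mock adds two numbers so that the sketch can run on a host machine without hardware.

```c
#include <stdint.h>

/* HYPOTHETICAL register layout, for illustration only. The real register
 * map is generated by the C2H Compiler and is not given in this section. */
typedef struct {
    volatile uint32_t arg0;    /* first function parameter   */
    volatile uint32_t arg1;    /* second function parameter  */
    volatile uint32_t start;   /* written with 1 to start    */
    volatile uint32_t done;    /* reads 1 when the task ends */
    volatile uint32_t result;  /* function return value      */
} accel_regs_t;

/* Host-side stand-in for the accelerator hardware (assumption). */
accel_regs_t mock_regs;

/* Simulates one step of the hardware: when started, computes a + b. */
void mock_accel_step(accel_regs_t *r)
{
    if (r->start) {
        r->result = r->arg0 + r->arg1;
        r->start = 0;
        r->done = 1;
    }
}

/* Stand-in for a data cache flush. On a Nios II HAL system, a call such
 * as alt_dcache_flush_all() would likely play this role (assumption). */
void mock_flush_dcache(void)
{
}

/* Sketch of a software wrapper that follows the five steps in the text. */
uint32_t add_wrapper(uint32_t a, uint32_t b)
{
    accel_regs_t *r = &mock_regs;

    r->arg0 = a;                /* 1. set up parameters           */
    r->arg1 = b;
    mock_flush_dcache();        /* 2. optionally flush data cache */
    r->done = 0;
    r->start = 1;               /* 3. start the accelerator       */
    while (!r->done) {          /* 4. poll for completion         */
        mock_accel_step(r);     /* simulation only; real hardware
                                   runs on its own                */
    }
    return r->result;           /* 5. read and return the result  */
}
```

From the caller's point of view, add_wrapper(2, 3) behaves like an ordinary function call, which is the property the text describes.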
- Mathematical operators (such as +, -, *, >>) become direct hardware equivalent circuits (such as add, subtract, multiply and shift circuits). These circuits might be shared between operations, depending on the degree of parallelism inherent in the C code.
- Loops (such as for, while, do-while) become state machines that iterate over the operations inside the loop, until the loop condition is exhausted.
- Pointer dereferences and array accesses (such as *p, array[i][j]) become Avalon-MM master ports that access the same memory that the processor does.
- Statements not dependent on the result of a previous operation are scheduled as early as possible, allowing parallel execution to the extent possible.
Subfunctions called within an accelerated function are also converted to hardware using the same C-to-hardware mapping rules. The C2H Compiler creates only one hardware instance of the subfunction, regardless of how many times the subfunction is called within the top-level function. Isolating accelerated C code into a subfunction provides a method of creating a shared hardware resource within an accelerator.
The C2H Compiler performs certain optimizations when it can reduce logic utilization based on resource sharing. Refer to Chapter 3, C-to-Hardware Mapping Reference for complete details of the C2H Compiler mappings.
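The following sketch shows ordinary C code annotated with the mappings described above. The functions scale_and_clamp() and mix_channels() are hypothetical examples, and the comments restate the rules from this section rather than report actual compiler output.

```c
/* Hypothetical subfunction. Per the rules above, it would become a single
 * hardware instance shared by both call sites in mix_channels(). */
unsigned int scale_and_clamp(unsigned int x, unsigned int gain)
{
    unsigned int y = (x * gain) >> 4;   /* multiply and shift circuits */
    return (y > 255u) ? 255u : y;
}

/* Hypothetical top-level function to accelerate. */
void mix_channels(const unsigned int *left, const unsigned int *right,
                  unsigned int *out, int n)
{
    int i;

    for (i = 0; i < n; i++) {           /* loop: state machine         */
        unsigned int l = left[i];       /* array access: master port   */
        unsigned int r = right[i];      /* does not depend on l, so it
                                           can be scheduled in parallel */
        out[i] = scale_and_clamp(l, 12u) + scale_and_clamp(r, 20u);
    }
}
```

Because both calls share one subfunction instance, this structure trades some parallelism for lower logic use.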
This section describes guidelines for identifying code that is appropriate for the C2H Compiler.
- They contain a relatively small and simple loop or set of nested loops.
- They iterate over a set of data, performing one or more operations on the data per iteration, and then store the result.
Examples of such iterative tasks include memory copy-and-modify tasks, checksum calculations, data encryption, decryption, and filtering operations. In each of these cases, the C code iterates over a set of data many times, with one or more memory reads or writes performed during each iteration. Example 1-1 demonstrates a routine that performs a checksum calculation. This code excerpt is from a TCP/IP stack, and it calculates the checksum over ranges of data in a network protocol stack. Checksum calculations are typically a time-consuming part of an IP stack, because all data transmitted and received must be validated, which requires the processor to loop through all bytes.
Example 1-1. Checksum Calculation
    u16_t standard_chksum(void *dataptr, int len)
    {
        u32_t acc;

        /* Checksum loop: iterate over all data in buffer */
        for(acc = 0; len > 1; len -= 2) {
            acc += *(u16_t *)dataptr;
            dataptr = (void *)((u16_t *)dataptr + 1);
        }

        /* Handle odd buffer lengths */
        if (len == 1) {
            acc += htons((u16_t)((*(u8_t *)dataptr)&0xff)<< 8);
        }

        /* Modify result for IP stack needs */
        acc = (acc >> 16) + (acc & 0xffffUL);
        if ((acc & 0xffff0000) != 0) {
            acc = (acc >> 16) + (acc & 0xffffUL);
        }
        return (u16_t)acc;
    }
Accelerating this function could have a significant impact on execution time, especially the amount of time spent in the for loop. The remaining code executes once per call to format the result and check boundary cases. Accelerating the code outside the loop has little benefit, unless the entire standard_chksum() function is called from another function that is also a good acceleration candidate. The most efficient hardware accelerator for this code would replace only the for loop. To accelerate the for loop only, you need to refactor the code to isolate the loop in a separate function.
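One possible way to isolate the loop is sketched below. The function name chksum_loop() is hypothetical, and the typedefs stand in for the TCP/IP stack's own type definitions. The sketch assumes a POSIX host that provides htons() in <arpa/inet.h>; in the original stack, the stack's own headers would supply it.

```c
#include <stdint.h>
#include <arpa/inet.h>  /* htons(); assumption: POSIX host environment */

/* Assumed stand-ins for the TCP/IP stack's own type definitions. */
typedef uint8_t  u8_t;
typedef uint16_t u16_t;
typedef uint32_t u32_t;

/* Hypothetical function holding only the loop. This is the part you
 * would select for acceleration. */
u32_t chksum_loop(const u16_t *data, int words)
{
    u32_t acc = 0;
    int i;

    for (i = 0; i < words; i++) {
        acc += data[i];
    }
    return acc;
}

/* Remaining code stays in software and runs once per call. */
u16_t standard_chksum(void *dataptr, int len)
{
    int words = (len > 1) ? (len / 2) : 0;
    u32_t acc = chksum_loop((const u16_t *)dataptr, words);

    /* Handle odd buffer lengths: one trailing byte remains. */
    if (len > 0 && (len & 1)) {
        const u8_t *bytes = (const u8_t *)dataptr;
        u8_t last = bytes[2 * words];
        acc += htons((u16_t)((last & 0xff) << 8));
    }

    /* Modify result for IP stack needs */
    acc = (acc >> 16) + (acc & 0xffffUL);
    if ((acc & 0xffff0000) != 0) {
        acc = (acc >> 16) + (acc & 0xffffUL);
    }
    return (u16_t)acc;
}
```

With this split, only chksum_loop() needs to become hardware, while the boundary handling and result folding remain on the processor.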
Code that contains many data or control dependencies must perform many sequential operations, and is a poor candidate for acceleration. A large number of dependencies makes it difficult for the C2H Compiler to fully optimize loops. Processors are designed to perform such operations efficiently.
- If the code contains C syntax not supported by the C2H Compiler, it cannot be accelerated. Examples are floating-point operations and recursive functions. Refer to Chapter 7, ANSI C Compliance and Restrictions.
- Code that calls system and runtime library functions is a poor candidate for acceleration. For example, there is little point in accelerating printf() or malloc(). The underlying code contains a complex set of sequential operations and does not contain performance-critical loops.
- Code that makes extensive use of global or external variables is a poor candidate for acceleration. Each time the C2H accelerator uses a global or external variable, it must access the Nios II processor's data memory, which is likely to cause a bottleneck.
- Experienced C coders often "unroll" iterative algorithms, representing them as a sequential set of operations to work better with an optimizing C compiler. If you can refactor the code and "roll up" the loop, you might be able to create an efficient hardware accelerator.
- A critical inner loop might have a complex set of sequential operations which, if accelerated in hardware, consumes a lot of logic resources. This presents a trade-off: If the processor spends an unacceptable amount of time in this loop, it might be worth the hardware cost to accelerate the whole loop.
- Some runtime library functions are iterative in nature. Examples include common data movement functions and buffer set functions, such as memcpy() or memset(). If your code calls one of these functions, you might consider writing a simple, custom implementation of the function, which you can then accelerate.
- If your code uses global or external variables, it might be easy for you to refactor it to be suitable for acceleration. Refactor your code to copy the global or external variables to local storage, perform the calculation with the local variables, and then copy results back to global or external storage. The C2H Compiler implements local variables as fast, pipelined registers inside of the accelerator.
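The last guideline, copying global variables to local storage, might look like the following sketch. The globals g_gain, g_offset, and g_sum and both function names are hypothetical. The two functions compute the same result, but the second touches the globals only before and after the loop.

```c
/* Hypothetical globals that live in the processor's data memory. */
int g_gain = 3;
int g_offset = 1;
int g_sum = 0;

/* Before: every iteration reads and writes global variables. */
void accumulate_global(const int *data, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        g_sum += data[i] * g_gain + g_offset;
    }
}

/* After: globals are copied to locals once, the loop works only on
 * locals, and the result is copied back once at the end. */
void accumulate_local(const int *data, int n)
{
    int gain = g_gain;
    int offset = g_offset;
    int sum = g_sum;
    int i;

    for (i = 0; i < n; i++) {
        sum += data[i] * gain + offset;
    }
    g_sum = sum;
}
```

In the refactored version the loop body contains one memory read per iteration (data[i]), so per the text the remaining values can be held in registers inside the accelerator.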
extent that you analyze the code and understand it. In either case, the Nios II IDE profiling features can help you determine where the processor spends most of its time. Examine the structure of the code for processor-specific or compiler-specific optimizations written into the structure of the code. These sections of code might result in poor performance with the C2H Compiler, and could benefit from refactoring for the C2H Compiler. It can be difficult to identify the critical loop just by inspecting code, because programs often spend the majority of time iterating on just a few lines of code. The only way to know exactly where the processor spends the most time is to profile the application, and inspect the bottleneck functions.
Refer to AN 391: Profiling Nios II Systems for further information.
Next Steps
Now that you understand the underlying concepts of the Nios II C-to-Hardware Acceleration Compiler, you are ready for hands-on experience accelerating designs. Chapter 2, Getting Started Tutorial, describes the C2H Compiler design flow, and gives step-by-step instructions to accelerate your first design. Altera also provides tutorials and application notes to deepen your understanding of the C2H Compiler.
Refer to the Nios II literature page for further C2H Compiler documentation: www.altera.com/literature/lit-nio2.jsp.
Introduction
This chapter describes the design flow for the Nios II C-to-Hardware Acceleration (C2H) Compiler. This chapter provides a design example and gives you a step-by-step tutorial to guide you through the process of creating your first hardware accelerator. The example software design performs multiple iterations of a data-copy function. By accelerating the data-copy function, you achieve more than a 10-fold improvement in the execution performance. The resulting hardware accelerator resembles a hardware block with direct memory access (DMA) to copy data without processor intervention. This tutorial assumes that you are familiar with the Nios II processor and the Nios II design flow.
For introductory information on designing with the Nios II processor, refer to the Nios II Hardware Development Tutorial available on the Altera Nios II literature page at http://www.altera.com/literature/lit-nio2.jsp, and to the Nios II Software Development Tutorial available in the Nios II integrated development environment (IDE) help system. This section discusses the design flow to create a hardware accelerator with the C2H Compiler.
Identify the functions that require acceleration. Debug the functions first targeting the Nios II processor. After accelerating a function, you can no longer debug individual C statements within the function.
You might have existing C code that you need to accelerate to improve performance. Alternatively, you might develop and debug a function in C with the explicit purpose of converting it to hardware. In either case, you achieve the best results if the C code is structured for the C2H Compiler. To start with, you can accelerate your code as-is, and determine if the results meet the design requirements.
A typical design flow using the C2H Compiler to accelerate a function involves the following steps:
1. Develop and debug your application or algorithm in C targeting a Nios II processor system.
2. Profile the code to identify the areas that would benefit from hardware acceleration.
3. Isolate the code you want to accelerate into an individual C function.
4. Specify the function you want to accelerate in the Nios II IDE.
5. Rebuild the project in the Nios II IDE.
6. Profile the results in hardware, or observe estimates from the C2H report in the Nios II IDE.
7. If the results do not meet the design requirements, modify the C source code and system architecture (for example, the memory topology).
8. Return to Step 5, and iterate.
The typical C2H Compiler design flow is an iterative process of accelerating a function, comparing the performance to design requirements, and modifying C code to improve results. If you start with C code that is not optimized for the C2H Compiler, the first iteration of acceleration might not dramatically improve performance. Further iterations, modifying the C code for optimal hardware structure, often improve the final results significantly over the first pass results.
This tutorial does not describe techniques for optimizing hardware accelerator performance. For further information on optimizing C2H Compiler results, refer to the Accelerating Nios II Systems with the C2H Compiler Tutorial.
Software Requirements
The C2H Compiler in evaluation mode is installed as part of the Altera Quartus II Complete Design Suite. You can download the Quartus II Complete Design Suite free from the Altera website. Visit www.altera.com and click Download.
During the design process with the C2H Compiler, you use the following tools:
- Nios II Integrated Development Environment (IDE): You control acceleration options for individual functions in the Nios II IDE. The results of accelerating functions are reported in the Nios II IDE. The output is an executable and linking format file (.elf) targeting a Nios II CPU. The C2H Compiler also invokes SOPC Builder and optionally the Quartus II software in the background to regenerate the Nios II system and update the SRAM object file (.sof).
- SOPC Builder: SOPC Builder manages the generation of C2H logic and Avalon-MM system interconnect fabric to connect hardware accelerators to the processor. During the software build process, the Nios II IDE can invoke SOPC Builder in the background to update the hardware accelerators when necessary and integrate them into the Nios II hardware design. The output is a set of hardware description language (HDL) files (.v or .vhd) and an SOPC Builder system file (.sopcinfo) defining your system: Nios II processor cores, peripherals, accelerators, on-chip memory, and interfaces to off-chip memory.
- Quartus II software: The Quartus II software compiles and synthesizes HDL produced by the C2H Compiler and SOPC Builder tools, along with any other custom logic in your Quartus II project. During the software build process, the Nios II IDE can invoke the Quartus II software in the background to recompile the Quartus II project. The output is a .sof file that includes the updated Nios II system with accelerators.
- Verify the functionality of your design, as well as evaluate its size and speed easily
- Generate time-limited device programming files for designs that include megafunctions
- Program an FPGA and verify your design in hardware
- Simulate the behavior of an accelerator in your system
OpenCore Plus hardware evaluation supports the tethered mode of operation for C2H. In tethered mode, the accelerator runs indefinitely, as long as the target board remains connected to the host computer by an Altera download cable.
You need to purchase a license for the Nios II C-to-Hardware Acceleration Compiler only when you are completely satisfied with the functionality and performance of your accelerated Nios II system, and want to take your design to production.
For more information on OpenCore Plus hardware evaluation, see AN 320: OpenCore Plus Evaluation of Megafunctions.
Tutorial
This section guides you through the steps to accelerate a function using the C2H Compiler. You create a new software project in the Nios II IDE using the provided example design files, accelerate a function, and observe the performance improvement. This tutorial guides you through the steps to implement the example design. These steps start with a C source file and end with a running application that includes an accelerated function. The steps you perform are described in the following sections:
1. Set up the Hardware for the Project
2. Create the Software Project
3. Run the Project as Software Only
4. Create and Configure a Hardware Accelerator
5. Rebuild the Project
6. Observe Results in the Report File
7. Observe the Accelerator in SOPC Builder
Tutorial Design
The hardware design for this tutorial is based on the standard hardware example design provided with the Nios II EDS. The software design is a C file named dma_c2h_tutorial.c, which is available for download from the Altera website. You can run the tutorial design on any Nios development board available from Altera.
You can download dma_c2h_tutorial.c from the Nios II literature page. The file is located next to this document (Nios II C2H Compiler User Guide) on the Altera Nios II literature page at http://www.altera.com/literature/lit-nio2.jsp. The file dma_c2h_tutorial.c includes two functions:
- do_dma(): This is the function you accelerate. It performs a block memory copy. do_dma() takes a source address pointer, a destination address pointer, and an integer number of bytes to copy. When implemented in hardware, do_dma() resembles DMA copy logic. The prototype for do_dma() is as follows:
    int do_dma( int * __restrict__ dest_ptr, int * __restrict__ source_ptr, int length )
  The __restrict__ qualifier informs the compiler that the pointers dest_ptr and source_ptr point to mutually exclusive address ranges. For further information about the __restrict__ qualifier, see Pointer Aliasing in Chapter 3, C-to-Hardware Mapping Reference.
- main(): main() calls do_dma() and measures the amount of time taken, so that you can compare the software implementation with the hardware accelerator.
main() performs the following actions: 1. 2. 3. 4. 5. 6. Allocates two 1 MB buffers in main memory Fills the source buffer with incrementing values Fills destination buffer with all 0x0. Calls the do_dma() function 100 times Checks the copied data to ensure there were no errors Frees the two allocated buffers
To measure the time it takes for the copy operations to complete, there are timer routines around the loop that calls the do_dma() function. After the application runs, the number of milliseconds that were spent performing the copy operations is displayed to the Console view in the Nios II IDE.
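For orientation, the software behavior of do_dma() and the buffer checks in main() might look like the following minimal sketch. It is not the contents of dma_c2h_tutorial.c: only the prototype and the byte-count parameter come from the description above. The loop body, the return value of 0, and the helper names fill_buffers() and buffers_match() are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical software version of do_dma(): copies `length` bytes,
 * one int-sized word at a time. The body and the return value (0)
 * are assumptions; only the prototype comes from the tutorial text. */
int do_dma(int *__restrict__ dest_ptr, int *__restrict__ source_ptr, int length)
{
    int words = length / (int)sizeof(int);   /* length is given in bytes */
    int i;

    for (i = 0; i < words; i++) {
        *dest_ptr++ = *source_ptr++;
    }
    return 0;
}

/* Hypothetical helper: fill the source buffer with incrementing values
 * and clear the destination buffer, as main() is described to do. */
void fill_buffers(int *src, int *dst, size_t words)
{
    size_t i;

    for (i = 0; i < words; i++) {
        src[i] = (int)i;
        dst[i] = 0x0;
    }
}

/* Hypothetical helper: returns 1 if both buffers hold the same data. */
int buffers_match(const int *src, const int *dst, size_t words)
{
    size_t i;

    for (i = 0; i < words; i++) {
        if (src[i] != dst[i]) {
            return 0;
        }
    }
    return 1;
}
```

In the tutorial design, the call to do_dma() sits inside a loop of 100 iterations with timer routines around it, and the check runs once after the loop.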
2. Set up the hardware project directory.
   a. Using a file management tool on the host computer, locate the standard hardware example design for your Nios development board. For example, on a Windows PC, use Windows Explorer to find the Verilog HDL design files for the Nios development board, Cyclone II Edition at <Nios II EDS install path>/examples/verilog/niosII_cycloneII_2c35/standard.
   b. Copy the standard directory and name the copied directory c2h_tutorial_hw. This new directory serves as the hardware design for the tutorial.
3. Start the Quartus II software.
4. Open the Quartus II project standard.qpf located in the c2h_tutorial_hw directory. The Quartus II software might give a warning "Do you want to overwrite the database ... created by Quartus II Version <version>... The database format is compatible..." if the project was created with an earlier version of the software. If so, click Yes to update the database.
5. Configure the FPGA on the Nios development board.
   a. On the Tools menu click Programmer. The Programmer appears, with the SRAM object file standard.sof automatically ready to download to the FPGA.
   b. Turn on the Program/Configure check box for standard.sof.
   c. Click Start. The programmer downloads the configuration data to the FPGA. If Start is not enabled, click Hardware Setup to configure your JTAG download cable.
3. If the Welcome to the Altera Nios II IDE page displays, close it to view the workbench.
4. Create a new C/C++ Application project.
   a. On the File menu, point to New and click C/C++ Application. The New Project wizard appears.
   b. In the Name box, type c2h_tutorial_sw.
   c. In the Select Project Template list, select Blank Project.
   d. Use the Select Target Hardware settings to browse to and select the SOPC Builder system (.ptf) file in your c2h_tutorial_hw directory. After you specify the SOPC Builder system, the IDE automatically sets the CPU setting to cpu, which is the name of the only Nios II processor core available in this SOPC Builder system.
   e. Click Finish. The IDE generates a new project c2h_tutorial_sw and a new system library project c2h_tutorial_sw_syslib.
5. Download the software file dma_c2h_tutorial.c from the Nios II literature page and save it to a known location on your host computer. The file is located next to this document (Nios II C2H Compiler User Guide) on the Altera Nios II literature page at http://www.altera.com/literature/lit-nio2.jsp.
6. Import the C file dma_c2h_tutorial.c into the c2h_tutorial_sw project. The easiest way to do this is to use an external file management tool, such as Windows Explorer, and drag the file onto the c2h_tutorial_sw project folder in the C/C++ Projects view of the Nios II IDE.
2. Observe the execution time in the Console view. Example 2-1 shows results of approximately 86000 milliseconds. The results you see might be different, depending on the memory characteristics of the target board and the clock speed of the example design.
Example 2-1. Execution Results as Software-Only Implementation
    This simple program copies 1048576 bytes of data from a source buffer to a destination buffer.
    The program performs 100 iterations of the copy operation, and calculates the time spent.
    Copy beginning
    SUCCESS: Source and destination data match. Copy verified.
    Total time: 86520 ms
   b. Turn on Build software, generate SOPC Builder system, and run Quartus II compilation. When you build the project in the Nios II IDE, this option causes the C2H Compiler to invoke SOPC Builder and the Quartus II software in the background to generate a new .sof file. Quartus II compilations can take a long time. You only need to turn on this option when you want to update the .sof file. You must regenerate the .sof file after you make changes that affect one or more hardware accelerators, and you want to run a program on the hardware system.
   c. Expand do_dma() in the C2H view.
   d. Under do_dma(), select Use hardware accelerator in place of software implementation. Flush data cache before each call. At run time, this option causes the program to activate the accelerator hardware for do_dma(). With this option, the C2H wrapper function flushes the processor data cache before activating the accelerator. The wrapper function needs to flush the data cache before activating the hardware accelerator if the processor has a data cache and if the processor writes to the same memory that the accelerated function operates on. Failing to flush the cache might result in cache coherency problems.
In the background, the Nios II IDE performs the following tasks:
1. Launches the C2H Compiler to analyze the do_dma() function, generates the hardware accelerator, and generates the C wrapper function.
2. Invokes SOPC Builder to connect the accelerator into the SOPC Builder system. The build process modifies the SOPC Builder system file (.ptf) in the Quartus II project directory to include the new accelerator as a component in the system.
3. Invokes the Quartus II software to compile the hardware project and regenerate the .sof file.
4. Rebuilds the C/C++ application project and links the accelerator wrapper function into the application.
Progress messages display in the Console view. The build process creates the following files:
- accelerator_c2h_tutorial_sw_do_dma.v (or .vhd): This file is the HDL code for the accelerated function. It is stored in the Quartus II project directory, and the name follows the format accelerator_<IDE project name>_<function name>. This file is not visible in the Nios II IDE.
- alt_c2h_do_dma.c: This file is the C2H accelerator driver file, containing the wrapper function for the accelerator. It is located in the software project's Debug or Release directory, and the name follows the convention alt_c2h_<function name>.c. (If you use the Nios II software build tools, these files are located in your software application directory.)
- c2h_accelerator_base_addresses.h: This is the C2H accelerator base addresses header file. It is located in the same directory as alt_c2h_<function name>.c.
If you copy or move a C2H project to a different directory, you must make sure you have the generated C source files and C2H makefile fragments in the new location. If you regenerate your accelerator in the new location, the C2H Compiler recreates these files for you. This is the simplest way, although not the fastest, to ensure that you have these files. If you want to avoid regenerating, simply copy or move the two files when you copy or move your original C source files. Copy or move the files, alt_c2h_<function_name>.c and c2h_accelerator_base_addresses.h, to the subdirectory in the new project location corresponding to their original location.
4. Expand the Resources section and all subsections, as shown in Figure 2-2.
The Resources section lists all the master ports on the hardware accelerator. Each master port corresponds to a pointer dereference in the source code. In this example, there are two master ports: one for dereferencing the read pointer, *source_ptr, and one for dereferencing the write pointer, *dest_ptr.
5. Expand the Performance section and all subsections, as shown in Figure 2-3.
The Performance section shows the performance characteristics of each loop in the accelerated function. Two metrics determine a loop's performance: loop latency and cycles per loop-iteration (CPLI). Loop latency is the number of cycles required to fill the pipeline. CPLI is the number of cycles required to complete one iteration of the loop, assuming the pipeline is filled and no stalls occur. For example, consider the case of an accelerated function with one loop with a loop latency of 13 and a CPLI value of 1. (These values can differ, depending on the memory latency on your target board.) These numbers indicate that the pipeline takes 13 cycles to fill; once the pipeline is filled, it generates a new result every cycle.
Note: In general, the goal of optimizing an application for better accelerator performance is to reduce loop latency and CPLI.
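To see how the two metrics combine, the following sketch estimates the cycle count of a single pipelined loop. The formula is an approximation assumed here for illustration (fill the pipeline once, then complete one iteration every CPLI cycles, with no stalls). It is not a value reported by the C2H Compiler, and the function name estimate_loop_cycles() is hypothetical.

```c
/* Rough cycle estimate for one pipelined loop, ignoring stalls:
 * the pipeline fills once (loop latency), then one iteration
 * completes every `cpli` cycles. This formula is an assumption
 * made for illustration, not a figure from the C2H build report. */
unsigned long estimate_loop_cycles(unsigned long loop_latency,
                                   unsigned long cpli,
                                   unsigned long iterations)
{
    if (iterations == 0) {
        return 0;   /* the loop body never runs */
    }
    return loop_latency + cpli * iterations;
}
```

With a loop latency of 13 and a CPLI of 1, a loop of 262144 iterations is estimated at 262157 cycles. For long loops the latency term is negligible, so CPLI usually dominates the result.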
For further information on optimizing C2H Compiler results, refer to the Accelerating Nios II Systems with the C2H Compiler Tutorial.
   a. If the Programmer window is not still open, on the Tools menu click Programmer. The Programmer window lists the file standard.sof.
   b. Turn on the Program/Configure box for standard.sof.
      Note: If you don't have a license for the C2H Compiler, the Quartus II software generates a time-limited .sof file with a different name. In this case, select standard.sof and click Delete, then click Add and open the time-limited .sof file.
   c. Click Start. The programmer downloads the new configuration data to the FPGA.
3. Return to the Nios II IDE window.
4. In the C/C++ Projects view, right-click the c2h_tutorial_sw project, point to Run As and click Nios II Hardware. The Nios II IDE downloads the accelerated program to the board and runs it.
5. Observe the execution time in the Console view. Example 2-2 shows timing results of approximately 8470 milliseconds. The results you see might be different, depending on your target board.
Example 2-2. Execution Results with Hardware Acceleration
    This simple program copies 1048576 bytes of data from a source buffer to a destination buffer.
    The program performs 100 iterations of the copy operation, and calculates the time spent.
    Copy beginning
    SUCCESS: Source and destination data match. Copy verified.
    Total time: 8470 ms
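The two reported times can be compared with a small calculation. The sketch below uses only the numbers printed in the software-only and accelerated runs (86520 ms and 8470 ms for 100 copies of 1048576 bytes). The function names are hypothetical, and your own results depend on the target board.

```c
/* Ratio of the software-only run time to the accelerated run time. */
double speedup(double software_ms, double accelerated_ms)
{
    return software_ms / accelerated_ms;
}

/* Effective copy throughput in bytes per second for a run that
 * performs `copies` copies of `bytes_per_copy` bytes each. */
double throughput_bytes_per_s(double bytes_per_copy, double copies,
                              double elapsed_ms)
{
    return bytes_per_copy * copies * 1000.0 / elapsed_ms;
}
```

For the example results, speedup(86520.0, 8470.0) is a little over 10, and the accelerated run moves roughly 12.4 million bytes per second.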
2. You must rebuild the project to remove the hardware accelerator from the SOPC Builder system hardware.
Removing the accelerator removes the hardware accelerator component from the SOPC Builder project, and replaces the C2H software wrapper with the original, unaccelerated function. The next time you build the project in the Nios II IDE, the C2H Compiler regenerates the SOPC Builder system and recompiles the Quartus II project to generate a .sof file without the accelerator hardware.
Note: To remove an accelerator from the hardware system, you must use the Remove C2H Accelerator command in the Nios II IDE. Do not use SOPC Builder to manually delete the component from the system. If you delete the component from the SOPC Builder system using the SOPC Builder GUI, the C2H Compiler produces undefined results the next time you build the software project.
Next Steps
Congratulations! You have successfully converted an ANSI C function to a hardware accelerator using the C2H Compiler and observed a significant performance increase. After accelerating a function and running it for the first time, your next steps vary depending on your system requirements. If your starting goal is to off-load a routine from the processor to reduce CPU load, you might find that no additional action is required. If the hardware accelerator does not meet performance or resource requirements, you can perform one or more iterations of optimization to produce better results. In either case, you can continue developing your system software and hardware, and the accelerator remains in place.
Note: It is common to be able to improve first-pass performance results significantly by optimizing the C code and system architecture.
If you modify the accelerated C code, the Nios II IDE automatically regenerates the accelerator hardware with the C2H Compiler the next time you build the C/C++ application project. Alternatively, you can disable an accelerator after it is built and relink the original software implementation, while leaving the hardware accelerator inactive in the hardware.
To get a better understanding of how the C2H Compiler translates C to hardware, read Chapter 3, C-to-Hardware Mapping Reference. After that, for further information on optimizing C2H Compiler results, refer to the Accelerating Nios II Systems with the C2H Compiler Tutorial.
This chapter describes how the Nios II C-to-Hardware Acceleration (C2H) Compiler translates ANSI C constructs into functional blocks in a hardware accelerator. Understanding the C-to-hardware mappings enables you to write C functions optimized for the C2H Compiler to achieve higher performance and lower resource utilization.
The C2H Compiler translates each element of C syntax to an equivalent hardware structure using straightforward mapping rules. The mapping rules provide a one-to-one association between elements of C syntax and their equivalent hardware structures. By learning the C-to-hardware mappings, you can control the hardware structure of an accelerator, based on the structure of the C code. The C2H Compiler can perform resource-sharing optimizations which reduce the resource utilization for an accelerator. In these cases, the result is a better than one-to-one mapping.
Table 3-1 lists the equivalent hardware structures resulting from this function.

Table 3-1. Hardware Structure for Arithmetic and Logical Operators

Line               C Element      Hardware Structure
while (len > 0) {  while          Finite state machine with nominal control logic. Refer to section Iteration Statements on page 3-5.
                                  32-bit comparator
                                  64-bit adder
                                  Avalon-MM master port to read data (two total for *a++ and *b++). Refer to section Indirection Operator (Pointer Dereference) on page 3-16.
                   ++             32-bit up-counter (two total for *a++ and *b++)
                   * (multiply)   32x32=64-bit multiplier
len--;             --             32-bit down-counter
Assignments
A C assignment operator stores the value of an expression to a variable. As a general rule, every assignment operator in the C code, such as =, translates to a registered signal in hardware. The value of an assignment's expression is calculated in one clock cycle. Figure 3-1 shows the hardware that results from the following statement:
    int sum = x + y;
Figure 3-1. Hardware Resulting from Assignment
- Assignments that require zero logic elements in hardware
- Assignments that use multiple registers to pipeline complex arithmetic operations
Description              Required Condition
Right bit-wise shift     Right-hand side is constant
Left bit-wise shift      Right-hand side is constant
Bit-wise AND             Either operand is constant
Bit-wise inclusive OR    Either operand is constant
Bit-wise exclusive OR    Either operand is constant
Bit-wise inversion       Right-hand side is unregistered
Type cast                Right-hand side is unregistered
The following assignment is an example of a zero logic-element operation.
    int masked_data = data_in & 0x000fffff;
The C2H Compiler generates no register for the variable masked_data, because its value is represented simply by concatenating 12 bits of zeroes with the lower 20 bits of data_in. Additional examples of unregistered assignments:
    shift_by_constant = data_in << 3;
    or_with_constant  = data_in | 0xf0f0f0f0;
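The bit-level effect of these unregistered operations can be checked in plain C. The sketch below restates the three assignments as small functions. The function names are hypothetical, and the code models only the arithmetic results, not the generated hardware.

```c
#include <stdint.h>

/* Keeps the lower 20 bits of data_in; the upper 12 bits become zero. */
uint32_t mask_low_20(uint32_t data_in)
{
    return data_in & 0x000fffffu;
}

/* Shift by a constant: every input bit moves up three positions. */
uint32_t shift_by_constant(uint32_t data_in)
{
    return data_in << 3;
}

/* OR with a constant: the bits set in the constant are forced to 1,
 * all other bits pass through unchanged. */
uint32_t or_with_constant(uint32_t data_in)
{
    return data_in | 0xf0f0f0f0u;
}
```

In each case every output bit is either a fixed constant or a copy of one input bit, which is consistent with the statement that no logic elements are required.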
Table 3-3. Complex Arithmetic Operations Pipelined by the C2H Compiler

Operator  Description           Exceptions
*         Multiplication        Either operand is a constant power of 2, which reduces to a left-shift operation
/         Division              Right-hand operand is a constant power of 2, which reduces to a right-shift operation
%         Modulus               Right-hand operand is a constant power of 2, which reduces to a masking operation
>>        Right bit-wise shift  Right-hand side is constant
<<        Left bit-wise shift   Right-hand side is constant
The general rule "one registered assignment for every = operator" can be amended to read, "one registered assignment for every = operator or complex arithmetic operator".
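The power-of-2 exceptions rest on standard arithmetic identities, which the following sketch spells out. The function names are hypothetical, and the operands are unsigned on purpose: for negative signed values, C division rounds toward zero, so it does not match a plain right shift.

```c
#include <stdint.h>

/* Multiplication by a constant power of 2 equals a left shift. */
uint32_t mul_by_8(uint32_t x)       { return x * 8u; }
uint32_t mul_by_8_shift(uint32_t x) { return x << 3; }

/* Unsigned division by a constant power of 2 equals a right shift. */
uint32_t div_by_8(uint32_t x)       { return x / 8u; }
uint32_t div_by_8_shift(uint32_t x) { return x >> 3; }

/* Unsigned modulus by a constant power of 2 equals a mask. */
uint32_t mod_8(uint32_t x)          { return x % 8u; }
uint32_t mod_8_mask(uint32_t x)     { return x & 7u; }
```

Each pair returns the same value for every 32-bit unsigned input, which is why these cases need no pipelined multiplier or divider.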
Figure 3-2 shows the hardware that results from the following statement:
    int foo = a * b + x;
Figure 3-2. Pipelined Multiplication Operator
Figure 3-3 shows the hardware that results from the following statement:
    int bar = a * b * c + x;
Figure 3-3. Two Stages of Pipelined Multiplication Operators
Iteration Statements
An iteration statement (do, for, or while), also known as a loop statement, translates to a finite state machine in hardware. The state machine controls execution of all the statements inside the loop block. A loop inhibits (stalls) its parent state machine. In other words, if a loop exists within an outer loop, the state machine for the outer loop stalls each iteration and waits for the inner loop state machine to complete.
The fundamental iteration statement for the C2H Compiler is the do loop, which evaluates its condition at the end of each iteration. See section Loop Pipelining on page 3-42 for information about loop state machines and scheduling.
Selection Statements
A selection statement (if-else, case, switch, and ?:) translates to a multiplexer in hardware. The structure of the hardware depends on the type of statement, as described in the following sections.
if Statement
An if-else statement translates to three elements in hardware:
- Logic to perform all operations in the then block
- Logic to perform all operations in the else block
- Selection logic that determines which result to use
The results of each element are registered, and the registered signals feed a multiplexer. If the if statement has both a then and an else block, the operations for both blocks execute in parallel. When all operations have completed, the multiplexer selects which value propagates to subsequent statements, based on the value of the control expression. Figure 3-4 shows the circuit that results from the code in Example 3-2.
Example 3-2. If-else Logic
    if (foo > bar)
        foo += bar;
    else
        foo *= bar;
If the if statement has only a then or only an else block, the resulting logic is a simplification of the if-else case. The multiplexer selects whether to propagate the result of the if block or the initial values from before the if statement. Figure 3-5 shows the circuit that results from the code in Example 3-3.
Example 3-3. if Logic Without else
    if (foo > bar)
        foo += bar;
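The multiplexer view of if and if-else can be modeled in C by computing every branch first and selecting afterward. This is a behavioral sketch only: the function names are hypothetical, and the code says nothing about register placement or timing in the generated hardware.

```c
/* Models the if-else example: both branch results are computed,
 * then the control expression selects one of them. */
int if_else_as_mux(int foo, int bar)
{
    int then_result = foo + bar;   /* logic for the then block */
    int else_result = foo * bar;   /* logic for the else block */
    int select = (foo > bar);      /* control expression */

    return select ? then_result : else_result;
}

/* Models the if-without-else example: the multiplexer chooses between
 * the then-block result and the value from before the if statement. */
int if_only_as_mux(int foo, int bar)
{
    int then_result = foo + bar;
    int select = (foo > bar);

    return select ? then_result : foo;
}

/* Reference version written with an ordinary if-else statement. */
int if_else_reference(int foo, int bar)
{
    if (foo > bar)
        foo += bar;
    else
        foo *= bar;
    return foo;
}
```

For inputs that do not overflow, if_else_as_mux() and if_else_reference() return the same value, which reflects the equivalence between the statement and the parallel-branches-plus-multiplexer structure.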
Conditional Operator ?:
The ?: (conditional) operator is functionally equivalent to the if-else statement, but the placement of registers is different. The condition logic and selection logic compute in the same clock cycle, and the result is registered.
Figure 3-6 shows the circuit that results from the following code:
    foo = (foo > bar) ? (foo + bar) : (foo * bar);
Figure 3-6. if-else Logic
if-else Implementation
    if (byte_select == 1)
        out = in & 0x0000ff00;
    else if (byte_select == 2)
        out = in & 0x00ff0000;
    else if (byte_select == 3)
        out = in & 0xff000000;
    else
        out = in & 0x000000ff;
Figure 3-7 shows the logic that results from translating the if-else code in Table 3-4.
Subfunction Calls
A subfunction is a C function called from within an accelerated function. The C2H Compiler translates subfunctions to hardware using the same mapping rules as for the top-level function. The resulting HDL module for the accelerated subfunction becomes a submodule of the top-level function, as illustrated in Figure 3-8 on page 3-12. The C2H Compiler translates the top-level function and all subfunctions to a single hardware accelerator. The C2H Compiler creates only one instance of the subfunction hardware logic, regardless of how many times the subfunction is called within the top-level function. If the calling function calls the subfunction multiple times, the subfunction logic becomes a shared resource. However, the subfunction is a private resource exclusive to the calling function. In other words, if multiple separate, accelerated functions call a common subfunction, the C2H Compiler creates separate instances of the subfunction logic.
Table 3-5 shows an example of a subfunction called by two different functions. In this case, functions foo() and bar() both call a subfunction foobar_sub().

Table 3-5. Shared Subfunction foobar_sub()

Top-Level Accelerated Function: foo()
    void foo(int *data_in, int *data_out)
    {
        ...
        foobar_sub();
        ...
    }

Top-Level Accelerated Function: bar()
    void bar(int *data_in, int *data_out)
    {
        ...
        foobar_sub();
        ...
        foobar_sub();
        ...
    }

Figure 3-8 shows the hardware structure of the accelerators resulting from the code in Table 3-5. Logic for the function foobar_sub() exists within both accelerators.
Figure 3-8. Shared Subfunction Logic
The C2H Compiler does not support external subfunctions. You must locate the subfunction in the same source file as the accelerated function. This is because, unlike the #include construct, a C external function reference requires the presence of a linker. The C2H Compiler has no linker.
The Nios II C2H Compiler does not perform any type of inline substitution. It ignores the inline function specifier. You can achieve the effect of an inline function through the use of preprocessing macros. If you wish to call a function for which accelerated hardware is replicated for each call, then you must define a macro containing the logic for this function. Before parsing the accelerated code, the C2H Compiler calls the GNU GCC preprocessor, which evaluates the macro and replaces each macro call with your macro definition.
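The macro approach can be sketched as follows. The macro CLAMP_TO_BYTE and the function sum_clamped() are hypothetical names chosen for illustration. The point is only that the preprocessor expands the macro body at every use, so no subfunction call remains in the code that the C2H Compiler parses.

```c
/* Hypothetical helper written as a macro instead of a function.
 * The preprocessor pastes this expression at every use. The argument
 * is evaluated more than once, so pass a plain variable. */
#define CLAMP_TO_BYTE(v) ((v) < 0 ? 0 : ((v) > 255 ? 255 : (v)))

/* Hypothetical function to accelerate: sums the clamped samples. */
int sum_clamped(const int *data_in, int len)
{
    int i;
    int sum = 0;

    for (i = 0; i < len; i++) {
        int sample = data_in[i];

        sum += CLAMP_TO_BYTE(sample);   /* expanded in place */
    }
    return sum;
}
```

Because each expansion is separate source text, using the macro in several places yields a separate copy of the clamping logic for each use, which is the replication effect described above.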
Variable Declarations
This section describes how the C2H Compiler translates variable declarations and other C elements that define data storage.
Scalar Variables
For local scalar variables, the C2H Compiler creates hardware registers inside the accelerator. For example, declaring a char creates an 8-bit register; declaring a short int creates a 16-bit register, and so forth. Declaring a pointer allocates only storage for the pointer itself. For example, declaring a char* creates a 32-bit register to store the address of a char. If you use a scalar variable within a pipelined loop, then its register is replicated for pipelining as needed. The C2H Compiler considers a variable to be scalar if it is not an array, structure, or union. Example 3-5 demonstrates some examples of scalar variables.
Example 3-5. Scalar Variables
    int i;
    int *p;
    char **c;
    struct struct_type *pointer_to_struct;
    int (*pointer_to_array)[8];
Example 3-6 demonstrates some examples of nonscalar variables.
Example 3-6. Nonscalar Variables
    int data[1024];
    struct struct_type tx_rec;
    int *array_of_pointers[8];
Avalon-MM master ports are generated even when local arrays (such as the ones discussed here) are referenced. These master ports only connect to internal slave ports inside the accelerator. However, the master ports do appear in the C2H Compiler build report. Example 3-7 demonstrates the creation of a local 4 KByte memory inside the accelerator to store array data[1024]. Because this memory buffer is large, it translates to an embedded memory block.
Example 3-7. Local Array That Uses 4 KBytes of On-Chip Memory Inside Accelerator
    int my_func(int a_parameter)
    {
        int data[1024]; // 1K 4-byte ints
        ...             // Body of the function
        return data[0];
    }
Memory Accesses
Hardware accelerators generated by the C2H Compiler use Avalon-MM master ports to access memory, similar to the Nios II processor and other SOPC Builder components. The Altera SOPC Builder system integration tool handles the task of physically connecting both accelerators and processors to memory, and creating arbitration logic. As a result, the behavior of a C function accessing memory is the same, regardless of whether the function is implemented as hardware logic or software instructions.
For more information on SOPC Builder, Avalon interfaces, and how SOPC Builder generates system interconnect fabric, refer to the Quartus II Handbook, volume 4: SOPC Builder and the Avalon Memory-Mapped Interface Specification.
In order to maximize bandwidth, the C2H Compiler creates a master port on the accelerator for every C operator that accesses external memory. Multiple master ports allow the accelerator to read and write data to an unlimited number of locations simultaneously, thereby reducing the bandwidth limitations inherent in a CPU with a single data master port. In some cases, the C2H Compiler can determine that master ports can be shared between several external memory operations without sacrificing performance. However, as a general rule, an Avalon-MM master port is created for each of the following:
- Pointer dereference (* operator)
- Index into an array ([] operator)
- Index into a struct or union (. or -> operator)
- Usage of a global or static variable
Example 3-8 demonstrates various lines of code that generate a master port in hardware.
Example 3-8. C Statements that Generate Avalon-MM Master Ports
    *my_ptr = 8;
    data_in = *src;
    dst[index] = data_out;
    pixel = pixel_array[i][j];
    buffer.input = 0x80000400;
    current = s->next;
Because the array subscript operation and the member operation for structures and unions can be expressed in terms of an address computation and a pointer dereference, this section is fundamental to understanding how arrays, structures, and unions translate to hardware as well.
The C2H Compiler identifies certain opportunities for optimization. In some cases it can collapse multiple master ports to a single master port without affecting performance, which reduces resource utilization. If the C2H Compiler determines that two pointers are exactly equivalent, it consolidates them to a single master port. There are considerations for multidimensional arrays. Refer to section Array Subscript Operator on page 3-26.
Example 3-10 shows two dereferences that are identical inside of a loop. The C2H Compiler consolidates them into a single master port.
Example 3-10. Equivalent Pointers in a Loop
    void equivalent_pointers(char *packed_data, int len)
    {
        int i = 0;
        while (i < len) {
            char ms_nibble = *(packed_data) >> 4;
            char ls_nibble = *(packed_data++) & 0x0f;
            ...
            i++;
        }
    }
Example 3-11 demonstrates a case of nonequivalent pointers. Example 3-11 is similar to Example 3-10, but packed_data increments between the two pointer dereferences. In this case the address expressions have different values, which translate to two separate master ports.
Example 3-11. Nonequivalent Pointers
    void nonequivalent_pointers(char *packed_data, int len)
    {
        int i = 0;
        while (i < len) {
            char ms_nibble = *(packed_data++) >> 4;
            char ls_nibble = *(packed_data) & 0x0f;
            ...
            i++;
        }
    }
Example 3-12 demonstrates another case of nonequivalent pointers. Example 3-12 is similar to Example 3-10, but a value is written to address some_other_pointer between the reads from address (packed_data + i).
Example 3-12. Nonequivalent Pointers Due to Potential Aliasing
    void nonequivalent_pointers(char *packed_data, int *some_other_pointer, int len)
    {
        int i = 0;
        while (i < len) {
            char ms_nibble = *(packed_data + i) >> 4;
            char ls_nibble;
            *some_other_pointer = i;
            ls_nibble = *(packed_data + i) & 0x0f;
            ...
        }
    }
In this code, the C2H Compiler cannot determine if some_other_pointer and packed_data overlap addresses (known as aliasing), which would affect the result of the second evaluation of *(packed_data + i). Therefore, the C2H Compiler creates a separate master port for each dereference, creating a total of three master ports. For details on how to inform the C2H Compiler that two pointers do not alias, see section Pointer Aliasing on page 3-32.
Example 3-13 demonstrates the use of volatile to guarantee multiple, distinct reads from a constant address.
Example 3-13. volatile Type Qualifier
    volatile char *DataFIFO = FIFO_BASE;
    char Byte0 = *DataFIFO;
    char Byte1 = *DataFIFO;
    char Byte2 = *DataFIFO;
    char Byte3 = *DataFIFO;
By comparison, Example 3-14 demonstrates two sections of code that are equivalent, due to the consolidation of equivalent pointers. In this case, the type of *DataFIFO is not declared volatile.
Example 3-14. Equivalent Pointers
    char *DataFIFO = FIFO_BASE;
    char Byte0 = *DataFIFO;
    char Byte1 = *DataFIFO;
    char Byte2 = *DataFIFO;
    char Byte3 = *DataFIFO;

    // The code above is equivalent to the following:
    char *DataFIFO = FIFO_BASE;
    char dereferenced_DataFIFO = *DataFIFO;
    char Byte0 = dereferenced_DataFIFO;
    char Byte1 = dereferenced_DataFIFO;
    char Byte2 = dereferenced_DataFIFO;
    char Byte3 = dereferenced_DataFIFO;
- Logic to compute the address signal
- For write transfers only, logic to compute the write-data signal
- Logic to control the read-enable or write-enable signal
Each of these signals is registered at the master port interface of the hardware accelerator. Logic within the accelerator synchronizes these signals to produce coherent Avalon-MM master transfers at the master port.
Address Computation
Consider the pointer dereference in the following code, which performs a read operation:
    int j = *(ptr_to_int + i);
The C2H Compiler generates logic of the following form to compute the address signal:
    ptr_to_int_i_addr = ptr_to_int + i * sizeof(int);
Figure 3-9 shows an example of the logic created for this pointer dereference for a read operation.
Figure 3-9. Address Generation for a Read Operation
In Figure 3-9, the address expression is evaluated first. Assuming sizeof(int) is 4, i must be multiplied by four, which is equivalent to left-shifting by 2 bits. Bitwise shift operators require no logic elements to compute, and the result is not registered. (See section Unregistered Operations and Assignments on page 3-3.) The signal ptr_to_int_i_addr feeds registers that drive the address signals on the Avalon-MM master port. As soon as the address signal ptr_to_int_i_addr is valid, read-enable control logic asserts the signal ptr_to_int_i_read, which initiates a transfer on the master port. After some number of clock cycles determined by the slave memory latency and arbitration delay, valid readdata returns to the master port. (See section Read Operations with Latency on page 3-37.)
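The address arithmetic described above can be reproduced in ordinary C. The following sketch computes the byte address of element i in two ways: with the multiplication by sizeof(int) and with the equivalent shift. The function names are hypothetical, and the shift version assumes sizeof(int) is 4, as in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte address of ptr_to_int[i], written as in the address formula:
 * base address plus i scaled by the element size. */
uintptr_t element_address(const int *ptr_to_int, size_t i)
{
    return (uintptr_t)ptr_to_int + i * sizeof(int);
}

/* Same address using a shift instead of a multiply.
 * Assumes sizeof(int) == 4, so the scale factor is 1 << 2. */
uintptr_t element_address_shift(const int *ptr_to_int, size_t i)
{
    return (uintptr_t)ptr_to_int + (i << 2);
}
```

On a target where int is 4 bytes, both functions return the address that the expression &ptr_to_int[i] produces, which is the value driven on the master port address signal.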
Data Computation
For write operations to dereferenced pointers, data-computation logic in the accelerator computes the value of the expression to write to memory. This value is the write-data for an Avalon-MM master transfer to memory. Data-computation logic operates in parallel with the address-computation logic. Consider the pointer dereference in the following code, which performs a write operation:
    *(ptr_to_int + i) = x + y;
Figure 3-10 shows an example of the logic created for this pointer dereference for a write operation.
Figure 3-10. Data Generation for a Write Operation
The write-data signal for the Avalon-MM master port is computed and registered in parallel with the address assignment. As soon as ptr_to_int_i_addr and ptr_to_int_i_writedata are valid, write-enable control logic asserts the signal ptr_to_int_i_write, which initiates a transfer on the master port. Figure 3-11 shows the logic created for the following write operation to a dereferenced pointer. Translation of the data-computation logic follows the rules described in section Assignments on page 3-2.
    *(ptr_to_int + i) = a*x + y;
Figure 3-11. Complex Write Operation
Master-Slave Connections
The C2H Compiler uses pragmas that allow user control of master-slave connections and arbitration shares. This section describes the pragmas to control master-slave connections. The C language specification dictates that when a compiler implementation encounters a pragma directive it does not recognize, the compiler ignores the pragma. By using pragmas, you can write directives to optimize the C2H Compiler results, without making the C code incompatible with other compilers.
Specifying Master-Slave Connections By default, the C2H Compiler connects all master ports of an accelerator to all the memory slave ports that the CPU data master port connects to. These default connections guarantee that the accelerator can access any memory addresses that the processor can access. However, master-slave connections have an associated hardware resource cost. The extra multiplexing and arbitration logic associated with a master-slave connection often reduces the maximum achievable frequency (fMAX) of the system. If an accelerated function, by design, does not ever access certain memories, you can eliminate the connection to the slave memory to save resources and improve fMAX. The C2H Compiler provides a connection pragma that associates a pointer variable with an Avalon-MM slave port, which is typically a memory. A pointer variable can translate to one or more master ports, depending on how many times it is dereferenced in the C code. The connection pragma directs the C2H Compiler to connect all master ports generated for a particular variable to a specific slave port in the SOPC Builder system. The connection pragma is defined as follows: #pragma altera_accelerate connect_variable \ <function name>/<variable name> to <module>[/<slave name>] The connection pragma must be placed outside the function to accelerate in the same file. <function name> and <variable name> are the exact names of the accelerated function and the pointer variable. <module> is the exact name of the component instance, as specified in SOPC Builder. <slave name> is optional. If <slave name> is provided, it is the exact name of a specific slave port on <module>. If the module only contains one slave, you can omit <slave name>. However, if you omit <slave name> when the module contains multiple slaves, the compiler issues an error. To connect a variable's master ports to multiple slave ports, you can use multiple pragmas. 
If you use the connection pragma for a specific variable, the C2H Compiler connects only the slave ports specified in the pragma statements. In addition to reducing arbitration logic, the connection pragma helps the C2H Compiler determine if two pointers overlap. If the memory connections for two separate variables are mutually exclusive, the compiler concludes that the pointers are never dependent on each other. For more information, refer to section Pointer Aliasing on page 3-32.
Example 3-15 illustrates usage of the connection pragma to connect two master ports for the variable my_ptr to the memory module named onchip_buffer.

Example 3-15. Pragma Connecting a Master Port to a Slave Port

#pragma altera_accelerate connect_variable foo/my_ptr to onchip_buffer
int foo(int *my_ptr)
{
    int x = *my_ptr;
    my_ptr[8] = 23;
    return x;
}
Example 3-16 illustrates using multiple pragmas to connect a pointer variable's master ports to multiple slave ports.

Example 3-16. Pragma Connecting a Master Port to Multiple Slave Ports

#pragma altera_accelerate connect_variable foo/my_ptr to onchip_buffer_0
#pragma altera_accelerate connect_variable foo/my_ptr to ext_ram_bridge
#pragma altera_accelerate connect_variable foo/my_ptr to sdram
#pragma altera_accelerate connect_variable foo/my_ptr to \
    onchip_buffer_1/s2
int foo(int *my_ptr)
{
    int x = *my_ptr;
    my_ptr[8] = 23;
    return x;
}
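Because a compiler that does not recognize a pragma ignores it, a function annotated with connection pragmas still builds and runs as ordinary software. The following is a minimal sketch of that idea, not an excerpt from a real design: the function scale_copy, the check function, and the module names sdram and onchip_buffer are all hypothetical.

```c
#include <assert.h>

/* Hypothetical module names; substitute the component instance names
 * from your own SOPC Builder system. */
#pragma altera_accelerate connect_variable scale_copy/src to sdram
#pragma altera_accelerate connect_variable scale_copy/dst to onchip_buffer

/* Copies len words from src to dst, multiplying each word by factor.
 * Returns the number of words written. */
int scale_copy(int * __restrict__ src, int * __restrict__ dst,
               int len, int factor)
{
    int i;

    for (i = 0; i < len; i++)
        dst[i] = src[i] * factor;
    return len;
}

/* Software-only check: the pragmas above have no effect on a compiler
 * that does not recognize them, so the function runs as plain C. */
void check_scale_copy(void)
{
    int src[4] = {1, 2, 3, 4};
    int dst[4] = {0, 0, 0, 0};

    assert(scale_copy(src, dst, 4, 3) == 4);
    assert(dst[0] == 3 && dst[3] == 12);
}
```

In this sketch the two pointers are connected to different memories, so their connections are mutually exclusive.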
Specifying Arbitration Shares

Arbitration shares benefit memories that have higher efficiency when accessed sequentially, such as SDRAM. You can use arbitration shares to reduce interruptions to sequences of transfers with a specific slave. For example, if a master-slave connection has an arbitration share value of ten, then the arbitrator grants at least ten consecutive transfers to the master port when it begins a sequence of transfer requests. The connection pragma with additional terms for arbitration share is defined as follows, where <shares> is a positive integer from 1 to 100:

#pragma altera_accelerate connect_variable \
    <function name>/<variable name> to <module>[/<slave name>] arbitration_share <shares>
Example 3-17 connects the variable x in function myfunc to the memory module named sdram with an arbitration share of 16.

Example 3-17. Pragma Specifying Arbitration Share

#pragma altera_accelerate connect_variable myfunc/x to sdram \
    arbitration_share 16
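To get a feel for why a larger share reduces interruptions, the following toy model counts how often the grant changes hands when two masters compete for one slave. It is a deliberately simplified sketch and does not reproduce the actual Avalon-MM arbitration logic: it assumes exactly two masters that always have requests pending and an arbiter that grants exactly the share value before switching. The function name is hypothetical.

```c
#include <assert.h>

/* Toy model: two masters each request transfers_per_master back-to-back
 * transfers from one slave. The arbiter grants `share` consecutive
 * transfers to the current owner, then hands the slave to the other
 * master if it is still waiting. Returns the number of handovers. */
int count_grant_switches(int transfers_per_master, int share)
{
    int remaining[2];
    int owner = 0;
    int switches = 0;

    assert(transfers_per_master >= 0 && share >= 1);
    remaining[0] = transfers_per_master;
    remaining[1] = transfers_per_master;

    while (remaining[0] > 0 || remaining[1] > 0) {
        int burst = remaining[owner] < share ? remaining[owner] : share;

        remaining[owner] -= burst;
        if (remaining[1 - owner] > 0) {
            owner = 1 - owner;   /* the other master is waiting */
            switches++;
        }
    }
    return switches;
}
```

Under these assumptions, 32 transfers per master cause 63 handovers with a share of 1 but only 3 with a share of 16, which is the kind of reduction that helps a memory such as SDRAM stay in a sequential access pattern.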
Specifying Flow Control

Avalon-MM transfers with flow control force a master port to obey flow control signals controlled by a slave port. For example, a slave FIFO might assert flow control signals to prevent write transfers when the FIFO memory is full. The C2H Compiler provides a flow control pragma, which enables flow control for all master ports related to a specific pointer variable. The flow control pragma is defined as follows:

#pragma altera_accelerate \
    enable_flow_control_for_pointer \
    <function name>/<variable name>

The flow control pragma must be placed in the same file as the function to be accelerated, but outside the function. <function name> and <variable name> are the exact names of the accelerated function and the pointer variable.
For details about Avalon-MM flow control, refer to the Avalon Memory-Mapped Interface Specification.
Although an array is considered to be a pointer type, dereferencing an array variable does not always mean the same thing as dereferencing a pointer. For example, dereferencing or indexing once into a multidimensional array returns a pointer to the first element in another array. For an N-dimensional array, a dereference or index into any of the first (N-1) dimensions does not read a value from the array memory; it computes an offset from the array's base address to determine the address of a subsection of the array. Example 3-18 demonstrates that indexing to the first level of a two-dimensional array does not result in a memory access.

Example 3-18. Indexing a Multidimensional Array without Causing a Memory Access

char a[LENGTH][WIDTH]; // Here's a two-dimensional array
// The following assignments are all equivalent.
char *subscripting = a[3];
char *dereferencing = *(a + 3);
char *offset = (char *) (a + 3);
char *ptr_arithmetic = (char *) ((void *)a + 3*WIDTH);
Indexing into any of the first (N-1) dimensions of an N-dimensional array requires a multiplication operation, as demonstrated by the evaluation of ptr_arithmetic in Example 3-18. If the size of the resultant array is an integer power of two, then the multiplication operation is reduced to a constant-shift operation, which does not require a hardware multiplier. (Refer to section Unregistered Operations and Assignments on page 3-3.)

A series of subscript operations that index into all N dimensions of an N-dimensional array is equivalent to an indirection operation, which creates an Avalon-MM master port. Example 3-19 illustrates several cases that generate an Avalon-MM master port to dereference an array variable. Because each expression reads an element, the results are of type char, not char *.

Example 3-19. Indexing an Array and Causing a Memory Access

char a[LENGTH][WIDTH]; /* a is a two-dimensional array */
// The following assignments are equivalent.
char subscripting_a = a[3][2];
char dereferencing_a = *(*(a + 3) + 2);
char b[LENGTH]; /* b is a one-dimensional array */
// The following assignments are equivalent.
char subscripting_b = b[1];
char dereferencing_b = *(b + 1);
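The offset arithmetic described above can be checked in software. The sketch below assumes illustrative dimensions (LENGTH of 8 and WIDTH of 16, chosen so that WIDTH is a power of two); the helper function names are hypothetical. It compares a multiply-based row offset, the equivalent constant shift, and the offset that the C compiler itself computes for a[row][col].

```c
#include <assert.h>
#include <stddef.h>

enum { LENGTH = 8, WIDTH = 16 };   /* WIDTH is a power of two: 1 << 4 */

static char a[LENGTH][WIDTH];

/* Byte offset of a row from the base of the array, using a multiply. */
size_t row_offset_multiply(size_t row)
{
    assert(row < LENGTH);
    return row * WIDTH * sizeof(char);
}

/* The same offset as a constant shift; valid only because WIDTH == 1 << 4. */
size_t row_offset_shift(size_t row)
{
    assert(row < LENGTH);
    return row << 4;
}

/* Byte offset that the C compiler computes for a[row][col]. */
size_t element_offset(size_t row, size_t col)
{
    assert(row < LENGTH && col < WIDTH);
    return (size_t)((char *)&a[row][col] - (char *)a);
}
```

For row 3, both row helpers return 48, and element_offset(3, 2) returns 50, that is, the row offset plus the column index.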
Member Operator .
The member operator (.) accesses a single member of a structure or union. The struct member operation (mystruct.a) is equivalent to Example 3-20.

Example 3-20. Converted struct Member Operation

*((pointer_to_type_of_a) ((void *)&mystruct + offset_of_a))
Similarly, the union member operation (myunion.a) is equivalent to Example 3-21.

Example 3-21. Converted union Member Operation

*((pointer_to_type_of_a)(&myunion))
Consider the simple struct declaration shown in Example 3-22.

Example 3-22. Structure Declaration

struct s
{
    int element_a;
    int element_b;
    int element_c;
} my_struct;

In this example, the expression (my_struct.element_c) translates to the following:

*((int *)((void *)&my_struct + 2*sizeof(int)))
The structure pointer operation on a union (e.g. myunion->a) is equivalent to Example 3-24.

Example 3-24. Converted union Pointer Operation

*((pointer_to_type_of_a)myunion)
Consider the simple struct declaration shown in Example 3-25.

Example 3-25. Structure Pointer Declaration

struct s_ptr
{
    int element_a;
    int element_b;
    int element_c;
} * my_struct;

In this example, the expression (my_struct->element_c) translates to the following:

*((int *)((void *)my_struct + 2*sizeof(int)))
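The base-plus-offset translation of a member access can be reproduced in portable C with offsetof. The following sketch is for illustration only; the function name read_element_c is hypothetical, and the assertion records the assumption that three consecutive int members are laid out without padding.

```c
#include <assert.h>
#include <stddef.h>

struct s {
    int element_a;
    int element_b;
    int element_c;
};

/* Reads my_struct->element_c written out as base address plus offset. */
int read_element_c(const struct s *my_struct)
{
    const char *base = (const char *)my_struct;

    /* Assumption: no padding between consecutive int members. */
    assert(offsetof(struct s, element_c) == 2 * sizeof(int));
    return *(const int *)(base + offsetof(struct s, element_c));
}
```

The sketch casts through char * rather than void * because arithmetic on void * is a compiler extension rather than standard C.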
Scheduling
This section describes how the C2H Compiler schedules operations. The C2H Compiler is similar to a traditional C compiler in many respects: It parses code, creates a graph of the dependencies, performs some optimizations, schedules the sequence to execute each operation, and outputs an object file in the form of a hardware accelerator. However, fundamental differences exist between scheduling for a microprocessor and scheduling for a hardware accelerator.
State Machines
Sections One-to-One C-to-Hardware Mapping on page 3-1, Variable Declarations on page 3-13, and Memory Accesses on page 3-15 described how the C2H Compiler translates individual operations, assignments, and memory accesses to atomic functional units in hardware. After the C2H Compiler creates the functional units, it generates a hierarchy of state machines to control the operation and interaction of these units. The C2H Compiler generates a distinct state machine for each of the following:
The states comprise a sequence of stages that compute the results of the C function. The C2H Compiler assigns each operation to a state of the state machine. An arbitrary number of operations can execute during one state, allowing multiple operations to execute in parallel. Generally, the time for one state to execute equates to one clock cycle, although certain conditions cause stalls in the state machine's progression through states.
Data Dependencies
Scheduling of assignments within an accelerator is based on the data dependencies between the assignments. If assignment B depends on a value calculated in assignment A, then B cannot execute until A has completed. If two or more assignments are not dependent on each other, they can be scheduled in parallel.

One way to illustrate data dependencies is through a dependency graph. For each expression, arrows in a dependency graph represent where the inputs come from, and where the output is used. These arrows illustrate the flow of data through the function. Figure 3-12 shows the dependency graph for Example 3-26.

Example 3-26. Data Dependency

int foo(int a, int b, int c)
{
    int x = a * b;
    int y = b * c;
    int z = x + y;
    return z;
}
The C2H Compiler uses the dependency graph to assign each assignment to a state in the state machine. Figure 3-13 shows how the C2H Compiler assigns each assignment to a state.

Figure 3-13. Scheduling Assignments in a Dependency Graph
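The idea of deriving states from a dependency graph can be sketched with a small as-soon-as-possible scheduler. This is a simplified model under stated assumptions, not the C2H Compiler's algorithm: every assignment is assumed to occupy exactly one state, and the states it produces need not match those the compiler actually chooses. All type and function names are hypothetical.

```c
#include <assert.h>

enum { MAX_DEPS = 4 };

struct assignment {
    int n_deps;            /* number of earlier assignments this one reads */
    int deps[MAX_DEPS];    /* indices of those assignments */
};

/* Places each assignment in the first state after all of its inputs are
 * available. ops[] must list producers before their consumers. */
void assign_states(const struct assignment *ops, int n_ops, int *state)
{
    int i, d;

    for (i = 0; i < n_ops; i++) {
        state[i] = 0;
        assert(ops[i].n_deps <= MAX_DEPS);
        for (d = 0; d < ops[i].n_deps; d++) {
            int producer = ops[i].deps[d];

            assert(producer >= 0 && producer < i);
            if (state[producer] + 1 > state[i])
                state[i] = state[producer] + 1;
        }
    }
}
```

Applied to the three assignments x = a * b, y = b * c, and z = x + y, the model places x and y together in state 0, because neither depends on the other, and z in state 1.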
Pointer Aliasing
Aliasing is a situation where it is possible for a change to one variable or reference to affect another. In the C language, aliasing can occur due to the indirection introduced by pointers. If the address ranges referenced by two pointers overlap, the pointers alias. Pointer aliasing is another form of data dependency that the C2H Compiler must consider. Any read or write operation with a pointer is dependent on all pointer write operations that come before it. Because arrays and structures are equivalent to pointer operations, the same considerations apply when indexing into an array or structure. This section describes the implications of aliasing on the C2H Compiler and outlines methods to prevent unnecessary dependencies. Figure 3-14 shows the dependency graph for Example 3-27.

Example 3-27. Pointer Aliasing

void foo(int *ptr_a, int *ptr_b)
{
    int a, b;
    a = *ptr_a;
    *ptr_a = a + 7;
    b = *ptr_b;
    *ptr_b = b + 8;
}
In this example, the C2H Compiler cannot determine whether or not ptr_a and ptr_b ever point to the same address. Therefore, it schedules conservatively, under the assumption that they do. The dependency graph shows that the read operation from ptr_b depends on the write operation to ptr_a. This is not a dependency on the variable ptr_a, but rather a dependency on a location in memory that is unknown at compile-time due to the possibility of aliasing. This dependency causes the read operation from ptr_b to be scheduled at State 2, rather than at State 0.

__restrict__ Pointer Type Qualifier to Break Dependencies

If you know that a pointer never overlaps with another, you can inform the compiler by declaring the pointer to be a restricted pointer. The restrict type qualifier is introduced in the ISO C99 specification. The compiler ignores any or all aliasing implications of a pointer qualified by __restrict__. The C99 specification states that if a pointer p is declared with restrict, and another pointer q accesses any location also accessed by p, the behavior is undefined. In other words, a restricted pointer promises to never alias another pointer. Example 3-28 demonstrates several pointers declared using the __restrict__ type qualifier.

Note: The qualifier comes after the *; __restrict__ qualifies the pointer type, not the type that the pointer points to.
Example 3-28. Pointer Declarations with __restrict__

int * __restrict__ my_restricted_pointer_to_integer;
const int * __restrict__ my_restricted_pointer_to_constant_integer;
int * const __restrict__ my_constant_restricted_pointer_to_integer;
Figure 3-15 shows the dependency graph for Example 3-29, which uses the __restrict__ type qualifier to inform the C2H Compiler that ptr_a and ptr_b do not alias:

Example 3-29. Using __restrict__

void foo(int * __restrict__ ptr_a, int * __restrict__ ptr_b)
{
    int a, b;
    a = *ptr_a;
    *ptr_a = a + 7;
    b = *ptr_b;
    *ptr_b = b + 8;
}
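The reason for the conservative schedule can be seen by running both orderings in software. The sketch below is illustrative: update_in_order repeats the body of the pointer aliasing example, and update_reads_first is a hand-written model (not compiler output) of a schedule in which both reads happen before either write. Both function names are hypothetical.

```c
#include <assert.h>

/* Program order: the read of *ptr_b follows the write to *ptr_a. */
void update_in_order(int *ptr_a, int *ptr_b)
{
    int a, b;

    assert(ptr_a && ptr_b);
    a = *ptr_a;
    *ptr_a = a + 7;
    b = *ptr_b;
    *ptr_b = b + 8;
}

/* Model of the parallel schedule: both reads are issued first. This is
 * equivalent to program order only if the pointers never alias. */
void update_reads_first(int *ptr_a, int *ptr_b)
{
    int a, b;

    assert(ptr_a && ptr_b);
    a = *ptr_a;
    b = *ptr_b;
    *ptr_a = a + 7;
    *ptr_b = b + 8;
}
```

With two distinct variables both versions give the same results. When both arguments point to one variable holding 1, the in-order version leaves 16 while the reads-first version leaves 9, which is why the read cannot be moved early unless the pointers are known not to alias.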
Although a pointer qualified with __restrict__ creates no dependencies with other pointers, it can create dependencies with itself. Figure 3-16 shows the dependency graph for Example 3-30.

Example 3-30. Pointers Always Depend on Themselves

void foo(int * __restrict__ my_ptr, int offset_a, int offset_b)
{
    int a, b;
    a = my_ptr[offset_a];
    my_ptr[offset_a] = a + 7;
    b = my_ptr[offset_b];
    my_ptr[offset_b] = b + 8;
}
In this example, the C2H Compiler cannot schedule the two read operations in parallel, because it assumes that the two address expressions of my_ptr could overlap. Assuming that offset_a never equals offset_b, to make these operations execute in parallel, you need to declare another restricted pointer. Figure 3-17 shows the dependency graph for Example 3-31, which introduces a new restricted pointer, my_ptr_b, to prevent the data dependency present in Figure 3-16.

Example 3-31. Using Another Pointer to Avoid Self-Dependence

void foo(int * __restrict__ my_ptr, int offset_a, int offset_b)
{
    int a, b;
    int * __restrict__ my_ptr_b = my_ptr;
    a = my_ptr[offset_a];
    my_ptr[offset_a] = a + 7;
    b = my_ptr_b[offset_b];
    my_ptr_b[offset_b] = b + 8;
}
If a data structure is referenced by two pointers and one or more of them is restrict-qualified, the ISO C99 standard specifies that the behavior is undefined. Therefore, make sure that you fully understand the range of values that a pointer can take on during the execution of your application before applying the __restrict__ qualifier. Improper application can result in undesirable functional changes to the code that cannot be debugged in software, due to the limitations of restrict-based optimizations in conventional compilers.

The ISO C99 standard specifies that the volatile type qualifier overrides the __restrict__ pointer type. This means that __restrict__ has no effect on volatile pointers. To break pointer dependencies between volatile pointers, use separate interrupt-enabled accelerators instead of multiple loops in the same accelerator. For details about interrupt-enabled accelerators, see Interrupt Pragma on page 6-4.
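One way to gain confidence before applying __restrict__ is to check, in the software build, that the buffers passed to a function really are disjoint. The following is a minimal sketch of such a check; ranges_overlap and check_no_alias are hypothetical helper names, and the check is meant to run in software on the caller's side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the byte ranges [p, p + p_bytes) and
 * [q, q + q_bytes) share at least one byte. */
int ranges_overlap(const void *p, size_t p_bytes,
                   const void *q, size_t q_bytes)
{
    uintptr_t p_start = (uintptr_t)p;
    uintptr_t q_start = (uintptr_t)q;

    return p_start < q_start + q_bytes && q_start < p_start + p_bytes;
}

/* Software-only guard to call before a function whose pointer
 * parameters are declared __restrict__. */
void check_no_alias(const void *p, size_t p_bytes,
                    const void *q, size_t q_bytes)
{
    assert(!ranges_overlap(p, p_bytes, q, q_bytes));
}
```

A check of this kind covers only the calls that are actually exercised, so it complements an analysis of the pointer ranges rather than replacing it.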
and mitigate the effects of memory latency. Through close integration with SOPC Builder, the C2H Compiler can determine the latency characteristics of the slave ports connected to the accelerator. The C2H Compiler generates logic to maximize bandwidth for the specific memories in the system.

Avalon-MM pipelined read transfers increase the bandwidth for synchronous slave ports that require several cycles of latency to return data for the first access, but can return data every cycle thereafter. Using pipelined read transfers, a slave port can begin a new transfer before data from the previous transfer returns. There are only pipelined read transfers; Avalon-MM write transfers do not benefit from pipelined functionality.

The C2H Compiler takes memory latency into account when scheduling operations, allowing an accelerator to perform nondependent operations while waiting for data to return from a memory with latency. The master ports associated with a pointer might connect to multiple slave ports with different latency properties. In this case, the C2H Compiler uses the maximum latency of all slave ports.

Figure 3-18 shows the dependency graph for function foo(), shown in Example 3-32. This example uses the connection pragma to exclusively connect a pointer named ptr_in to a memory with two cycles of read latency. (Refer to section Master-Slave Connections on page 3-23.)

Example 3-32. Early Scheduling of Read Operation with Latency

#pragma altera_accelerate connect_variable \
    foo/ptr_in to \
    my_memory_with_two_cycles_read_latency
int foo(int *ptr_in, int x, int y, int z)
{
    int xy = x * y;
    int xy_plus_z = xy + z;
    int ptr_data = *ptr_in;
    int prod = ptr_data * xy_plus_z;
    return prod;
}
The C2H Compiler optimizes the dependency graph for this function by moving the read operation for ptr_in up to state 0. This optimization allows the calculation of xy and xy_plus_z to occur during the two cycles of latency required to fetch data for ptr_in.
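The benefit of issuing the read early can be expressed as a small calculation. The sketch below is a simplified model, assuming that a read issued in state s with latency L delivers its data in state s + L and that the consuming operation runs as soon as both the data and its other operand are ready. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Earliest state in which an operation can consume read data together
 * with another operand that becomes ready in operand_ready_state. */
int consumer_state(int read_issue_state, int read_latency,
                   int operand_ready_state)
{
    int data_ready_state;

    assert(read_issue_state >= 0 && read_latency >= 0);
    data_ready_state = read_issue_state + read_latency;
    return data_ready_state > operand_ready_state
               ? data_ready_state
               : operand_ready_state;
}
```

With two cycles of read latency and an operand that is ready in state 2, issuing the read in state 0 lets the consumer run in state 2. Issuing the same read in state 2, where it appears in program order, would delay the consumer to state 4.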
Stalling
A state machine stalls when data needed for an operation is not available. A state machine might stall while waiting for one or more of the following actions to complete:
The state machine does not proceed until all reasons for stalling are resolved.
Inner Loops
Each loop is implemented as a state machine, and an inner loop translates to a particular state within the state machine for its containing function or outer loop. In other words, an inner loop translates to a state machine within a state machine. As the state machine for an inner loop executes, the outer state machine stalls until the inner loop has completed.
For the purposes of scheduling, the C2H Compiler treats a loop and its dependencies as a unit. No lines of code past the loop block execute until the whole loop completes. Figure 3-19 shows the dependency graph for the function transform_and_hash_matrix(), shown in Example 3-33.

Example 3-33. Dependency Graph for a Function Containing a Loop

int transform_and_hash_matrix(int *matrix, int length, int width)
{
    int n_words = length * width;
    int hash = 1;
    int i;
    for (i=0; i<n_words; i++)
    {
        ...perform some transform...
        hash = ...some hash calculation...
    }
    return hash;
}
As shown in Figure 3-19, some part of the for loop depends on n_words, and so the C2H Compiler does not schedule the loop until after the assignment to n_words completes. The return statement outside the loop depends on hash, which is assigned inside the loop. As a result, the C2H Compiler does not schedule the return statement until the loop completes. In this case, the state machine for transform_and_hash_matrix() has three states. However, the state machine does not complete in three clock cycles, because State 1 consists of a sub-state-machine, which requires multiple clock cycles to complete.
If multiple loops have no interdependencies, the C2H Compiler schedules the loops on the same state, allowing the loops to execute in parallel. The code in Example 3-34 has two while loops with no dependencies on each other. The C2H Compiler schedules these loops on the same state.

Example 3-34. Loops Without Interdependencies Scheduled in Parallel

void double_mac (int* __restrict__ a, int* __restrict__ b,
                 int* __restrict__ c, int* __restrict__ d,
                 long long* __restrict__ res_ab,
                 long long* __restrict__ res_cd, int len)
{
    int len_ab = len;
    int len_cd = len; // duplicate the length index
    // Compute the MAC for a & b
    long long mac_ab = 0;
    while (len_ab > 0)
    {
        mac_ab += *a++ * *b++;
        len_ab--;
    }
    // Compute the MAC for c & d
    long long mac_cd = 0;
    while (len_cd > 0)
    {
        mac_cd += *c++ * *d++;
        len_cd--;
    }
    *res_ab = mac_ab;
    *res_cd = mac_cd;
    return;
}
Subfunction Calls
A subfunction call can stall the state machine in the same way that an inner loop does. When a subfunction contains a looping structure or shares a data dependency with its caller, the subfunction is not pipelined. In this case, when the outer state machine reaches its state for the subfunction, the outer state machine stalls until the subfunction has completed. If the subfunction does not contain loops or shared data dependencies, the C2H Compiler can pipeline the subfunction. For details about pipelined subfunctions, see Subfunction Pipelining on page 3-49.
Memory Transfers
The Avalon-MM system interconnect fabric manages arbitration between multiple Avalon-MM master ports that access a single slave port. A master port might have to wait several clock cycles before beginning a transfer due to arbitration. If a master port on an accelerator is forced to wait, the state machine for the accelerator stalls until the transfer can proceed.
Loop Pipelining
The C2H Compiler structures the state machine for a loop so that iterations of the loop are pipelined. In other words, consecutive iterations of the loop can begin before prior iterations have completed.
Figure 3-21 illustrates how the C2H Compiler schedules successive iterations of the loop shown in Figure 3-20. The C2H Compiler is able to start a new iteration of the loop immediately after the prior iteration completes State 0. This is an example of an ideally pipelined loop. Although the C2H Compiler can pipeline many loops ideally, it is sometimes not possible due to the lack of inherent parallelism in the code, as shown in Loop-Carried Dependencies.

Figure 3-21. Pipelined Loop Iterations
Loop-Carried Dependencies
Loop-carried dependencies are data dependencies that manifest when pipelining successive iterations of a loop. If the result of one calculation in an iteration of the loop is used in a later iteration of the loop, then a loop-carried dependency exists between the two operations.
Figure 3-22 shows the dependency graph for the do loop in Example 3-36, which has loop-carried dependencies.

Example 3-36. Loop-Carried Dependency

int simple_hash(int *data, int len)
{
    int hash = 0;
    do
    {
        int data_word = *data++;
        hash = hash + data_word;
        hash = hash ^ data_word;
    } while (len--);
    return hash;
}
Variables data and hash have loop-carried dependencies, illustrated by the cyclic arrows in Figure 3-22. The arrow for hash indicates that the calculation on State 1 in iteration N is dependent on the result of the calculation from State 2 in iteration (N-1). The arrow for data in State 0 illustrates the ideal case in which a state depends only on its own output. The ideal case does not restrict the scheduling of successive iterations.
Figure 3-23 illustrates how the C2H Compiler schedules successive iterations of the loop shown in Figure 3-22, based on the restrictions imposed by hash. State 1 cannot execute until the previous iteration has completed State 2. The C2H Compiler schedules the states as shown in Figure 3-23 to satisfy the loop-carried dependency.

Figure 3-23. Pipelined Loop Iterations with a Loop-Carried Dependency

In Figure 3-23, the cyclic arrow for hash in Figure 3-22 translates to straight arrows between iterations.
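The cost of a loop-carried dependency can be estimated with a simple model. The sketch below assumes one clock cycle per state and no stalls, and takes CPLI to mean cycles per loop iteration; the function names are hypothetical and the formulas are an approximation for illustration, not the C2H Compiler's own calculation.

```c
#include <assert.h>

/* Cycles between the starts of successive iterations when a value
 * produced in producer_state of iteration N-1 is consumed in
 * consumer_state of iteration N. The ideal case yields 1. */
int loop_cpli(int producer_state, int consumer_state)
{
    int distance = producer_state - consumer_state + 1;

    return distance > 1 ? distance : 1;
}

/* Total cycles for a pipelined loop: the first iteration takes n_states
 * cycles, and each later iteration finishes cpli cycles after the
 * previous one. */
int pipelined_loop_cycles(int n_states, int n_iterations, int cpli)
{
    assert(n_states >= 1 && n_iterations >= 1 && cpli >= 1);
    return n_states + (n_iterations - 1) * cpli;
}
```

For a dependency from State 2 back to State 1, the model gives a CPLI of 2, so ten iterations of a three-state loop take 21 cycles instead of the 12 cycles of the ideally pipelined case.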
that performs Avalon-MM pipelined read transfers. Inside the accelerator, the master port connects to a FIFO, which guarantees the accelerator can receive data for all pending read transfers, regardless of whether the state machine stalls. Figure 3-24 shows the dependency graph for Example 3-37, which demonstrates a loop that pipelines memory accesses with latency. This example uses the connection pragma to connect the master port for variable list to a slave memory named my_mem_with_two_cycles_read_latency, as described in section Master-Slave Connections on page 3-23.

Example 3-37. Accessing Memory with Latency

#pragma altera_accelerate connect_variable \
    sum_elements/list to \
    my_mem_with_two_cycles_read_latency
int sum_elements (int *list, int len)
{
    int i;
    int sum = 0;
    for (i=0; i<len; i++)
        sum += *list++;
    return sum;
}
State 1 in Figure 3-24 remains empty because list has two cycles of read latency. The loop-carried dependencies on variables sum and list are ideal cases, which do not impose restrictions on the pipeline scheduling. Figure 3-25 illustrates how the C2H Compiler schedules successive iterations of the loop.

Figure 3-25. Pipelined Loop Iterations Reading Memory with Latency

As shown in Figure 3-25, the C2H Compiler is able to start a new iteration of the loop immediately after the prior iteration completes State 0. At Time 1, Iteration 1 starts a new read access from list, even though data from list has not returned for Iteration 0. Due to the two cycles of read latency, at any given time, there can be a maximum of two pending read operations. Over successive iterations of a loop, the C2H Compiler hides the memory latency by pipelining the read transfers. Although multiple cycles of latency are required to fill the pipeline, successive iterations can complete at a rate of one per clock cycle, assuming no stalling occurs (see section Stalling on page 3-39).
Subfunction Pipelining
Each subfunction is implemented as a state machine, and a subfunction call translates to a particular state within the containing function or loop. In other words, a subfunction translates to a state machine within a state machine. The fact that it is a distinct state machine allows it to be a shared resource within the containing function. If the subfunction does not contain loops or shared data dependencies, the C2H Compiler can pipeline the subfunction. The subfunction has its own state machine, but the datapath is pipelined as if it were the body of a loop. When the outer state machine reaches the state to call the subfunction, it can continue to execute other operations in parallel with the inner state machine. Data sets from multiple subfunction calls are pipelined in the subfunction's state machine. The code in Example 3-38 contains a subfunction which is pipelined by the C2H Compiler.
Example 3-38. Pipelined Subfunction

int MAX(int a, int b)
{
    return ((a > b)? a : b);
}

#pragma altera_accelerate connect_variable MAX_loop/a to sdram
#pragma altera_accelerate connect_variable MAX_loop/b to onchip_ram_64_kbytes
int MAX_loop(int * __restrict__ a, int * __restrict__ b)
{
    int i, c = 0;
    for (i = 0; i < 1024; i++)
    {
        c += MAX(a[i], b[i]);
    }
    return c;
}
If the subfunction performs a memory access that stalls, then the outer state machine also stalls. Pipelined subfunctions provide a useful option for controlling shared resources. For further information, see Resource Sharing.
Resource Sharing
The C2H Compiler is capable of sharing resources which consume significant amounts of logic. A resource only becomes shared if it is under-utilized. In other words, the C2H Compiler only shares a resource if the performance of the accelerator is not affected. Table 3-6 lists all of the resources that can be shared automatically by the C2H Compiler.
Description: Memory Access
Required Conditions:
- Multiple dereferences to access data from the same Avalon-MM memory port
- Multiple dereferences occurring within the same level in the algorithm (function or loop)
- Loop CPLI must not increase

Description: Multiply
Required Conditions:
- Promoted data width (32 or 64 bit) must be the same for all multiplications. The promoted data width is shown in the C2H report.
- Multiplications occurring within the same level in the algorithm (function or loop)
- Both operands must be either signed or unsigned.
- Loop CPLI must not increase

Description: Divide
Required Conditions:
- Promoted data width (32 or 64 bit) must be the same for all divisions. The promoted data width is shown in the C2H report.
- Divisions occurring within the same level in the algorithm (function or loop)
- Both operands must be either signed or unsigned.
- Loop CPLI must not increase

Description: Modulo
Required Conditions:
- Promoted data width (32 or 64 bit) must be the same for all modulo operations. The promoted data width is shown in the C2H report.
- Modulo operations occurring within the same level in the algorithm (function or loop)
- Both operands must be either signed or unsigned.
- Loop CPLI must not increase

Description: Left Shift (<<)
Required Conditions:
- Promoted data width (32 or 64 bit) must be the same for all left shift operations. The promoted data width is shown in the C2H report.
- Left shift operations occurring within the same level in the algorithm (function or loop)
- Both operands must be either signed or unsigned.
- Loop CPLI must not increase

Description: Right Shift (>>)
Required Conditions:
- Promoted data width (32 or 64 bit) must be the same for all right shift operations. The promoted data width is shown in the C2H report.
- Right shift operations occurring within the same level in the algorithm (function or loop)
- Both operands must be either signed or unsigned.
- Loop CPLI must not increase
The resource sharing technique used for memory accesses differs slightly from all other sharable resources. Memory accesses which share the same Avalon-MM master port use byte enables to control the width of the access. The master port width is equal to the widest data type being accessed. The other sharable resources do not use byte enables, so operator inputs and result must be of equal width for the resource to be shared. 8-bit and 16-bit operators are automatically promoted to be 32 bits wide, so as long as both operands are either signed or unsigned, the operator can be shared.

As previously mentioned, resources are not shared if the performance of the hardware accelerator degrades. For example, if the algorithm is capable of performing two multiplications in parallel, the C2H Compiler does not share a multiplier resource. The C2H Compiler optimizes for performance by default and enables sharing only when appropriate.

The exception to this rule is when two pointer dereferences occur that access data from the same Avalon-MM memory port. The C2H Compiler knows that memory arbitration already limits the performance of the accelerator and so the resources are shared. This exception is bound to Avalon-MM ports since multi-port components can be accessed concurrently. If you use the appropriate connection pragma statements and __restrict__ qualifier, you can ensure that memory accesses occur concurrently instead of becoming a shared Avalon-MM master port.

The C2H Compiler only shares resources if they reside at the same level within a loop or function. Figure 3-26 shows an algorithm which contains shared and independent resources.
Another way of managing shared resources is to place the code that uses the resource in a subfunction. For example, to ensure that a math-intensive function uses no more than three multipliers, you could place the multiply operation in three subfunctions mul1(), mul2() and mul3(). With pipelined subfunctions, the latency overhead of this approach is not excessive. For further details, see Subfunction Calls on page 3-41.
Introduction
This chapter discusses the Altera Nios II C-to-Hardware Acceleration (C2H) Compiler view available in the Nios II IDE. Understanding the C2H view allows you to estimate the resource usage and the performance of the accelerator. You can use this information to perform optimizations to reduce the logic resource size or increase the performance of the accelerator. You can use the C2H view to do the following:

- Add or update hardware accelerators
- Control how the C2H Compiler compiles your C code to hardware
- Display information about hardware accelerators

Overview
The C2H view displays performance information about your accelerated functions. This information makes it easy to select the best configuration for the next compile. The C2H view contains two sections:
- Generation/Compilation Configurations: The C2H view allows you to specify various generation and compilation configurations for your accelerated functions. The configurations allow you to control the build flow of the entire software project and the individual accelerated functions.
- Build Report: The C2H Compiler creates the build report during a software compilation when it analyzes functions or generates accelerators. Using the build report you can view the resource usage and scheduling information for each accelerated function.
Generation/Compilation Configurations
The following sections discuss the two levels of configurations that you can use to control the build flow.
Configuration: Use software implementation for all accelerators
Meaning: You can use this global override configuration to force the linker to use the software implementation of all the existing accelerators in the system. You can use this configuration to perform functional changes to your algorithm without having to regenerate your system or compile the hardware design. This is a global configuration, and affects all accelerated functions. If you only wish to revert to the software implementation of an individual accelerated function, use the function build configurations section. When you use the software implementation, the hardware accelerator remains in the system, but is not used.

Configuration: Use the existing accelerators
Meaning: This global override configuration allows you to use accelerators that already exist in a system. If you previously switched to Use software implementation for all accelerators, this configuration lets you revert to the accelerator hardware. In this configuration, the C2H Compiler does not regenerate the accelerator even if you make changes to the accelerator source. You can use this configuration near the end of the product development cycle to help prevent accidental hardware regeneration.

Configuration: Analyze all accelerators
Meaning: If you select Analyze all accelerators, the C2H Compiler analyzes the functions you have marked for acceleration, and produces a report of expected accelerator performance. It does not compile the accelerators. If you have existing accelerators (from a previous compile), they are untouched. In other words, Analyze all accelerators is the same as Use the existing accelerators, except that it also produces a report. Analyze all accelerators lets you quickly display build information without regenerating the SOPC Builder system. This configuration does not overwrite the existing accelerator logic; it simply analyzes the source code.
Build software and generate SOPC Builder system
This configuration allows you to update the function accelerators in your system by forcing SOPC Builder to regenerate the logic. SOPC Builder runs in the background and you can view the progress of the logic regeneration from the Nios II IDE console view. Do not open SOPC Builder until logic generation is complete. The Nios II IDE coordinates the software and hardware builds. Once the SOPC Builder regeneration is complete, the Nios II IDE displays the updated build information in the C2H view. The logic regeneration only occurs if the accelerator did not previously exist or the accelerated function source code changes. For these changes to take effect, you must compile the hardware design using the Quartus II software.
Build software, generate SOPC Builder system, and run Quartus II compilation
This configuration is a superset of Build software and generate SOPC Builder system. Once SOPC Builder generates the system, Quartus II runs in the background and compiles the hardware design. While the Quartus II software is running in the background you can view the progress from the Nios II IDE console view.
Resources
The resources section of the build report shows information about the resource usage of the accelerated function. The following are the resources that can appear in this section of the build report:
Each resource entry shows how the resource is configured in the hardware and the line of source code that maps to it. In the following discussion of these resources, refer to Example 4-1.
Use this configuration if you are not certain whether cache coherency is an issue in your system. In this configuration, every time the software calls the accelerated function, the wrapper function flushes the entire Nios II data cache, to prevent cache coherency issues. The C2H Compiler inserts flush code into the wrapper function so no source code modification is necessary. Since flushing the cache is a fixed overhead, if you have strict processing time requirements you need to study the system architecture and determine if this operation is necessary.
This configuration uses the hardware accelerator without flushing the data cache before each invocation. Use this configuration with algorithms that do not require Nios II data cache flushing. Do not use this configuration if you have not studied your algorithm to determine if it could have cache coherency problems. The accelerated function might create cache coherency problems in certain corner cases. To prevent cache coherency problems, use one or more of the following techniques in your code:
Allocate all shared data in uncached Nios II memory space. Refer to Bit-31 Cache Bypass in the Cache and Tightly-Coupled Memory chapter of the Nios II Software Developer's Handbook.
Flush all shared data before calling the accelerated function.
Place all shared data in a tightly coupled memory.
Manage cache coherency in a multiprocessor system by establishing a cache coherency protocol between the processor controlling the accelerated function and all other processors.
Use cache bypass macros to access all shared data.
Do not share memory between the hardware accelerator and the Nios II processor.
Refer to the Cache and Tightly-Coupled Memory chapter of the Nios II Software Developer's Handbook to learn more about cache coherency.
Much like the project-wide configuration, this causes the C2H Compiler to link the software implementation of the accelerated function. Unlike the project-wide configuration, this only affects a single accelerated function and not the entire software project. You can use this configuration to prototype changes to your algorithm without having to regenerate or recompile the hardware. This configuration does not remove the existing accelerator from the system.
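As an illustration of the first cache coherency technique listed above (allocating shared data in uncached memory space), the following minimal sketch computes a bit-31 cache-bypass alias of a data address. It assumes a Nios II core that implements bit-31 cache bypass. The helper name uncached_alias is hypothetical and is not part of the HAL.

```c
#include <stdint.h>

/* Hypothetical helper (not a HAL function): returns the alias of a data
 * address with bit 31 set. On Nios II cores that implement bit-31 cache
 * bypass, loads and stores through this alias skip the data cache. */
uint32_t uncached_alias(uint32_t addr)
{
    return addr | 0x80000000u;
}
```

A pointer built from the aliased address can then be passed to both the processor code and the accelerated function, so that neither side relies on stale cached data.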
Example 4-1 requires Avalon-MM read and write master ports to perform memory accesses. It also requires a multiplier and a barrel shifter to perform the right shift operation. The pragma statements inform the C2H Compiler that the input data is stored in a memory called onchipRAM1 and the output data is to be stored in onchipRAM2. When the C2H Compiler compiles this function, the Nios II IDE generates a build report as shown in Figure 4-1.
The resources section contains a subsection for each type of resource. The report shows Avalon-MM master port resources in a different layout from other operator resources because the two kinds of resource function differently.
memory, making it impossible for the accelerator to read both variables on the same clock cycle. The C2H Compiler creates a single Avalon-MM master port to access both values using interleaved accesses.
Figure 4-2. Avalon-MM Master Port Resources
When memory accesses share a single Avalon-MM master port, the reported data width is that of the largest data type being accessed. The report shows each dereference operation for the shared Avalon-MM master port resource. In Example 4-1 the pointer power requires a separate Avalon-MM master port resource because it resides in a different memory than the input values. For each dereference operation, the report shows the source line on which the C statement appears. It also shows the variable being dereferenced, and the data direction (read or write). Any one statement is either a read or a write. However, when an Avalon-MM master port is shared among two or more dereference statements, it might need to support both directions. In Example 4-1 on page 4-5, the connection pragmas forced the C2H Compiler to create a single, shared Avalon-MM master port called Master Resource 0. If the connection pragmas were omitted from the example software, all dereference operations would have resulted in a single Avalon-MM master port resource connecting to all Avalon-MM memory slave ports.
For more information about connection pragmas, refer to Optimizing Memory Connections in the Optimizing Nios II C2H Compiler Results chapter of the Embedded Design Handbook.
The resource usage does not reflect the final resource utilization of the compiled hardware. When ANSI C code is compiled, small integer data types are promoted to the int data type. Figure 4-3 shows that the multiplier is 32 bits wide even though the operands are short (16 bits). The C2H Compiler performs the same integer data promotion, creating a 32-bit multiplier. When the Quartus II software compiles the hardware design, the synthesized multiplier is 16 bits in width.
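The integer promotion described above can be seen in ordinary C. The following minimal sketch uses a hypothetical function, multiply_shorts, to show that a product of two short operands is computed at int width, which is why the report lists a 32-bit multiplier.

```c
/* Both operands are 16-bit, but C promotes them to int before the
 * multiplication, so the product is computed at int width
 * (32 bits on Nios II). */
int multiply_shorts(short a, short b)
{
    return a * b;
}
```

Because of the promotion, multiply_shorts(300, 300) returns 90000, a value that does not fit in 16 bits.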
The pipeline value associated with the resource specifies the number of clock cycles that the hardware logic requires for the calculation to complete. Pipelined logic can typically operate at higher clock frequencies, at the cost of the additional latency introduced. The C2H Compiler factors in the pipelining of the hardware and schedules the accelerated function accordingly to maximize data throughput. When the report does not show a pipeline value for a resource, the operator is purely combinational, with no latency.
Performance
The performance section of the C2H build report details information about each loop in the accelerated function. For each loop shown, the report contains the following information:
File name and source line number
Loop latency
Cycles per loop iteration (CPLI)
Scheduling information per assignment
Scheduling information per state
In the following discussion of information shown in the performance section, refer to Example 4-2.
Example 4-2. CRC32 (Ethernet CRC)
    #pragma altera_accelerate connect_variable \
        crc_calculation/data to onchipRAM1
    #pragma altera_accelerate connect_variable \
        crc_calculation/table to onchipRAM2
    unsigned long crc_calculation (
        unsigned char * __restrict__ data,
        unsigned long * __restrict__ table,
        unsigned long length)
    {
        unsigned long i, crc = 0xFFFFFFFF;
        unsigned char lut_addr;
        for (i = 0; i < length; i++)
        {
            lut_addr = (crc & 0xFF) ^ *data++;
            crc = (crc >> 8) ^ table[lut_addr];
        }
        return (crc ^ 0xFFFFFFFF);
    }
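To check the software behavior of the loop in Example 4-2 on a host machine, a sketch such as the following can be used. It assumes that the table argument holds the standard reflected CRC-32 lookup table (polynomial 0xEDB88320), which the example does not state. The function names crc32_build_table and crc32_reference are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry lookup table for the reflected CRC-32 polynomial
 * 0xEDB88320 (assumed here to be the table Example 4-2 expects). */
void crc32_build_table(uint32_t table[256])
{
    uint32_t n;
    int bit;
    for (n = 0; n < 256; n++) {
        uint32_t c = n;
        for (bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[n] = c;
    }
}

/* Same loop body as Example 4-2, written with fixed-width types so that
 * it behaves identically on hosts where unsigned long is 64 bits wide. */
uint32_t crc32_reference(const unsigned char *data,
                         const uint32_t *table, size_t length)
{
    uint32_t crc = 0xFFFFFFFFu;
    unsigned char lut_addr;
    size_t i;
    for (i = 0; i < length; i++) {
        lut_addr = (unsigned char)((crc & 0xFFu) ^ *data++);
        crc = (crc >> 8) ^ table[lut_addr];
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With this table, the nine ASCII characters "123456789" produce the well-known CRC-32 check value 0xCBF43926, which gives a quick way to compare the software and accelerated implementations.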
Loop Latency
As mentioned in Chapter 3, C-to-Hardware Mapping Reference, the loop latency is a fixed time overhead, incurred each time the accelerator enters the loop. The loop latency is the number of states needed to set up conditions for efficient loop iteration.
It is important to remember that if the accelerator re-enters the loop multiple times, it incurs the loop latency each time.
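The effect of re-entering a loop can be estimated with simple arithmetic. The following sketch is a rough model under stated assumptions, not a formula taken from the build report: each entry into the loop pays the loop latency once, and each iteration then costs CPLI cycles. The function name and parameters are hypothetical.

```c
/* Rough cycle estimate (an assumption, not a build-report figure):
 * every entry into the loop pays loop_latency once, then each of the
 * iterations_per_entry iterations costs cpli cycles. */
unsigned long estimate_loop_cycles(unsigned long loop_latency,
                                   unsigned long cpli,
                                   unsigned long iterations_per_entry,
                                   unsigned long entries)
{
    return entries * (loop_latency + cpli * iterations_per_entry);
}
```

Under this model, running 100 iterations in one entry costs the latency once, while running the same 100 iterations as 10 entries of 10 iterations costs it ten times.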
The report identifies two assignments containing critical path states. The data dependency graph for these statements is shown in Figure 4-5. The graph shows that data does not depend on any other statement in the loop. However, the variables crc, lut_addr, and table are involved in a mutually dependent chain of calculations in the critical path. crc, lut_addr, and table are therefore the critical path variables. The C2H Compiler displays one critical loop variable (crc) as a way of identifying the critical path.
[Figure 4-5: data dependency graph for the two critical path assignments, lut_addr=(crc&0xFF)^*data++; and crc=(crc>>8)^table[lut_addr];]
Since the critical path does not involve the pointer data, the only operations are on local scalar types and a read operation from the array table. The calculation of lut_addr only depends on the scalar critical loop variable crc, while the calculation of crc depends on a memory reference to the critical loop array variable table. Each time the C2H accelerator finishes calculating the value of crc for loop iteration n, it can start calculating crc for iteration n+1. On the same clock cycle, it can also start calculating the value of lut_addr for iteration n+2. This means that the accelerator always gets a one-loop head start on calculating lut_addr. Thus, although the accelerator requires lut_addr to calculate crc, lut_addr does not limit the loop speed, because it is always ready as soon as crc is. The report shows that the critical path is either 0--->6, or 6--->11. Since the C2H Compiler pipelines the logic contained in loops, multiple states are active concurrently. Figure 4-6 represents the pipelined timing of Example 4-2 on page 4-10. Notice that since the assignment of crc is the critical path, the accelerator begins each execution of that statement as soon as the previous execution is complete. Not surprisingly, the critical path statement is what limits the speed of the loop, and hence what determines CPLI.
Figure 4-6. CRC Critical Path Scheduling by Assignment
[Figure 4-6 timing chart: time axis versus loop iteration. In each iteration, line 13 (lut_addr=(crc&0xFF)^*data++;) occupies states 0 to 6 and line 14 (crc=(crc>>8)^table[lut_addr];) occupies states 6 to 11.]
Notice, also, that line 13 (lut_addr = (crc & 0xFF) ^ *data++) appears to take more clock cycles than line 14 (crc = (crc >> 8) ^ table[lut_addr]). The C2H Compiler stretches out the calculation of lut_addr so that it is available exactly when it is needed. The memory access in line 14 is nonetheless the limiting operation.
Scheduling Information
There are two ways of presenting the loop scheduling information: per assignment, and per state.
Using the methodology from Cycles Per Loop Iteration (CPLI), you can create a chart such as Figure 4-8, corresponding to Example 4-2 on page 4-10.
This section shows you the assignments that occur during each state. This view is the inverse of the per-assignment view in the previous section; however, it can sometimes be easier to interpret. When Example 4-2 is compiled, the states are mapped and presented in this section of the C2H build report, as shown in Figure 4-9.
Figure 4-9. CRC Scheduling Per State
In the case of Example 4-2, a total of 12 states is required to schedule the loop. Figure 4-10 outlines the same information presented in Figure 4-8, organizing it by state to show how multiple states execute concurrently.
Further Reading
For more advice on using the information presented in the C2H view, refer to the Optimizing Nios II C2H Compiler Results chapter of the Embedded Design Handbook.
5. Accelerating Code Using the Nios II Software Build Tools Command Line
The Nios II software build tools support the Nios II C2H Compiler on the command line with the nios2-c2h-generate-makefile command. This command creates a C2H makefile fragment that specifies all accelerators and accelerator options for an application.
C2H Compiler projects created on the command line cannot be imported into the Nios II Software Build Tools for Eclipse. Altera recommends creating new C2H accelerators with the Nios II IDE.
The nios2-c2h-generate-makefile command usage is as follows:
    nios2-c2h-generate-makefile \
        --sopcinfo=<SOPC Builder System File> [OPTIONS]
This command creates a new c2h.mk each time it is called, overwriting the existing c2h.mk.
Table 5-1 lists the command line arguments for the nios2-c2h-generate-makefile command.
Argument: Meaning
--sopcinfo: The path to the SOPC Builder system file (.sopcinfo).
--app_dir: Directory to place the application Makefile and ELF. If omitted, it defaults to the current directory.
--accelerator: Specifies a function to be accelerated. This argument accepts up to four comma-separated values:
    Target function name
    Target function file
    Link hardware accelerator instead of original software. 1 or 0. Defaults to 1.
    Flush data cache before each call. 1 or 0. Defaults to 1.
    Examples:
Disables hardware generation, SOPC Builder system generation, and Quartus II compilation for all accelerators in the application. Building the project with this option only updates the report files. Defaults to 0.
--use_existing_accelerators: Disables all hardware generation steps. The build behaves as if c2h.mk did not exist, with the exception of possible accelerator linking as specified in the --accelerator option. Defaults to 0.
Example 5-1 shows a typical nios2-c2h-generate-makefile command line.
Example 5-1. nios2-c2h-generate-makefile command line
    nios2-c2h-generate-makefile \
        --sopcinfo=../../NiosII_stratix_1s40_standard.sopcinfo \
        --app_dir=./ \
        --accelerator=doDMA,DMA.c \
        --accelerator=analyze,../../../finite.c,1,0 \
        --use_existing_accelerators
For more detail about nios2-c2h-generate-makefile, refer to the Nios II Software Build Tools Reference chapter of the Nios II Software Developer's Handbook.
You must use the --c2h flag when calling nios2-app-generate-makefile in order to build your application with C2H. This flag causes the static C2H make rules to be included in your application makefile. These rules in turn include the c2h.mk fragment generated by nios2-c2h-generate-makefile.
For more information about nios2-app-generate-makefile, refer to the Nios II Software Build Tools Reference chapter of the Nios II Software Developer's Handbook.
C2H Performance Metrics
The C2H Compiler produces a detailed report during software compilation. This report shows hardware structure, resource usage, and throughput. Using the build report you can view the resource usage and scheduling information for each accelerated function.
The report is saved as an XML file in the application directory. The name of the report file is <function_name>.prop, where <function_name> is the name of the accelerated function. For details of the report's contents, refer to Resources on page 4-3 and Performance on page 4-10 of Chapter 4, Understanding the C2H View.
6. Pragma Reference
Introduction
The C2H Compiler uses pragmas that allow user control of master-slave connections and arbitration shares. This chapter describes these pragmas in detail. The C language specification dictates that when a compiler implementation encounters a pragma directive it does not recognize, the compiler ignores the pragma. By using pragmas, you can write directives to optimize the C2H Compiler results, without making the C code incompatible with other compilers.
Connection Pragma
The C2H Compiler provides a connection pragma that associates a pointer variable with an Avalon-MM slave port, which is typically a memory. A pointer variable can translate to one or more master ports, depending on how many times it is dereferenced in the C code. The connection pragma directs the C2H Compiler to connect all master ports generated for a particular variable to a specific slave port in the SOPC Builder system, reducing arbitration logic. The connection pragma syntax is as follows:
    #pragma altera_accelerate connect_variable \
        <function name>/<variable name> to \
        <module>[/<slave name>] [arbitration_share <shares>]
<function name> and <variable name> are the exact names of the accelerated function and the pointer variable.
<module> is the exact name of the component instance, as specified in SOPC Builder.
<slave name> is optional. If provided, <slave name> is the exact name of a specific slave port on <module>; if not provided, the master port connects to all slave ports on <module>.
<shares> is a positive integer from 1 to 100.
Define the connection pragma in the same file as the function to be accelerated, outside the function body. To connect a variable's master ports to multiple slave ports, you can use multiple pragmas. If you use the connection pragma for a specific variable, the C2H Compiler connects only the slave ports specified in pragma statements.
Example 6-2 illustrates using multiple pragmas to connect a pointer variable's master ports to multiple slave ports.
Example 6-2. Pragma Connecting Master Ports to Multiple Slave Ports
    #pragma altera_accelerate connect_variable foo/my_ptr to onchip_buffer_0
    #pragma altera_accelerate connect_variable foo/my_ptr to ext_ram_bridge
    #pragma altera_accelerate connect_variable foo/my_ptr to sdram
    #pragma altera_accelerate connect_variable \
        foo/my_ptr to onchip_buffer_1/s2
    int foo(int *my_ptr)
    {
        int x = *my_ptr;   /* read through my_ptr */
        my_ptr[8] = 23;    /* write through my_ptr */
        return x;
    }
In addition to reducing arbitration logic, the connection pragma helps the C2H Compiler determine if two pointers overlap. If the memory connections for two separate variables are mutually exclusive, the compiler concludes that the pointers are never dependent on each other. For more information, refer to Pointer Aliasing on page 3-32.
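The following minimal sketch illustrates mutually exclusive connections. The function scale_copy and the module names ram_a and ram_b are hypothetical; they stand for an accelerated function and two separate memories in an SOPC Builder system. Because src connects only to ram_a and dst connects only to ram_b, the two pointers cannot refer to the same memory.

```c
/* Hypothetical module names: ram_a and ram_b stand for two separate
 * memories in the SOPC Builder system. Compilers that do not recognize
 * these pragmas ignore them. */
#pragma altera_accelerate connect_variable scale_copy/src to ram_a
#pragma altera_accelerate connect_variable scale_copy/dst to ram_b
void scale_copy(const int *src, int *dst, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * 2;   /* read from ram_a, write to ram_b */
}
```

The same source file still compiles and runs as ordinary software, so the function can be verified on the processor before it is accelerated.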
master port when it begins a sequence of transfer requests. The arbitration share of a shared Avalon-MM master port is the sum of the arbitration shares of all master-slave pairs associated with the master port. The connection pragma with additional terms for arbitration share is defined as follows, where <shares> is a positive integer from 1 to 100:
    #pragma altera_accelerate connect_variable \
        <function name>/<variable name> to \
        <module>[/<slave name>] arbitration_share <shares>
Example 6-3 connects the variable x in function myfunc to the memory module named sdram with an arbitration share of 16.
Example 6-3. Pragma Specifying Arbitration Share
    #pragma altera_accelerate connect_variable myfunc/x to sdram \
        arbitration_share 16
Flow Control Pragma
Avalon-MM transfers with flow control force a master port to obey flow control signals controlled by a slave port. For example, a slave FIFO might assert flow control signals to prevent write transfers when the FIFO memory is full. The C2H Compiler provides a flow control pragma which enables flow control for all master ports related to a specific pointer variable. The flow control pragma is defined as follows:
    #pragma altera_accelerate \
        enable_flow_control_for_pointer <function name>/<variable name>
The flow control pragma must be placed outside the function to accelerate, in the same file. <function name> and <variable name> are the exact names of the accelerated function and the pointer variable.
Using the flow control pragma might result in an accelerator that functions differently from the original function running on the Nios II processor.
For details about Avalon-MM flow control, refer to the Avalon Memory-Mapped Interface Specification.
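As a sketch of how the flow control pragma might be applied, the following hypothetical function, fifo_fill, streams words into a FIFO-style slave through a single write address. The function and variable names are assumptions made for illustration.

```c
/* Hypothetical function: streams n words into a FIFO-style slave
 * through a single write address. Compilers that do not recognize the
 * pragma ignore it. */
#pragma altera_accelerate enable_flow_control_for_pointer fifo_fill/fifo
void fifo_fill(volatile int *fifo, const int *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        *fifo = src[i];   /* every write targets the same FIFO address */
}
```

This also shows why the accelerator can behave differently from the software version: in software, each write simply overwrites the previous value at that address, whereas with flow control enabled the accelerator's master port holds off writes while the slave signals that it cannot accept data.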
Interrupt Pragma
To use a hardware accelerator in interrupt mode, add the following line to your function source code:
    #pragma altera_accelerate \
        enable_interrupt_for_function <function name>
At the next software compilation, the C2H Compiler creates a new header file containing all the macros needed to use the accelerator and service the interrupts it generates. This pragma causes the function (which is assumed to be a top-level accelerated function, not an accelerated subfunction) to be an interrupt-mode accelerator. Specifically, the following things change:
The accelerator's control slave has an IRQ signal, which is asserted every time the function has completed execution.
The polling loop in the generated driver file is removed. When the function is called, the CPU immediately returns after launching the accelerator.
A header file is generated, providing macros and definitions required for you to write an ISR. The macros are summarized in Table 6-1.
Macro Name
ACCELERATOR_<Project Name>_<Function Name>_GET_RETURN_VALUE()
ACCELERATOR_<Project Name>_<Function Name>_CLEAR_IRQ()
ACCELERATOR_<Project Name>_<Function Name>_BUSY()
An example of this header file is shown in Example 6-4 for an accelerated function called coprocess() in a Nios II IDE project called my_project. The file is generated in <Project Path>/<Configuration>, where <Project Path> is the software project directory, and <Configuration> is the project configuration name (Release or Debug). The file name is ACCELERATOR_<Project Name>_<Function Name>_IRQ.h, where <Project Name> is the name of the project (usually the same as <Project Path>), and <Function Name> is the name of the function you are accelerating.
Example 6-4. Interrupt Header File
    #ifndef ALT_C2H_COPROCESS_IRQ_H
    #define ALT_C2H_COPROCESS_IRQ_H
    #include "io.h"
    #include "c2h_accelerator_base_addresses.h"
    #define ACCELERATOR_MY_PROJECT_COPROCESS_GET_RETURN_VALUE() \
        (( int ) IORD_32DIRECT ( \
            ACCELERATOR_MY_PROJECT_COPROCESS_CPU_INTERFACE0_BASE, \
            (1*sizeof(int))))
    #define ACCELERATOR_MY_PROJECT_COPROCESS_CLEAR_IRQ() \
        ( IOWR_32DIRECT ( \
            ACCELERATOR_MY_PROJECT_COPROCESS_CPU_INTERFACE0_BASE, \
            (0*sizeof(int)), 0))
    #define ACCELERATOR_MY_PROJECT_COPROCESS_BUSY() \
        ( IORD_32DIRECT ( \
            ACCELERATOR_MY_PROJECT_COPROCESS_CPU_INTERFACE0_BASE, \
            ((0*sizeof(int))) & 1) ^ 1)
    #endif /* ALT_C2H_COPROCESS_IRQ_H */
The hardware accelerator does not have an IRQ level assigned, so you must open the system in SOPC Builder and manually assign this value. After assigning the IRQ level, click Generate, because this is a change made outside of the Nios II IDE. You only have to do this manual step once. In addition, you can use the ACCELERATOR_MY_PROJECT_COPROCESS_BUSY() macro in a non-interrupt-based system in which the user code polls for the done bit, rather than using the automatically generated C wrapper.
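The polling alternative mentioned above can be sketched as follows. So that the sketch compiles on a host machine, it defines stand-ins for two of the macros from Example 6-4; these stand-ins and the variables behind them are assumptions for illustration only. On the target, you would include the generated header file instead. The function name wait_for_coprocess is hypothetical.

```c
/* Host-side stand-ins (assumption, for illustration only). On the target,
 * include the generated ACCELERATOR_..._IRQ.h header instead. */
int fake_busy_polls = 3;    /* accelerator reports busy for three polls */
int fake_result = 42;       /* value the accelerator "returns" */
#define ACCELERATOR_MY_PROJECT_COPROCESS_BUSY() \
    (fake_busy_polls > 0 ? fake_busy_polls-- : 0)
#define ACCELERATOR_MY_PROJECT_COPROCESS_GET_RETURN_VALUE() \
    (fake_result)

/* Hypothetical helper: poll until the accelerator is no longer busy,
 * then fetch its return value. */
int wait_for_coprocess(void)
{
    while (ACCELERATOR_MY_PROJECT_COPROCESS_BUSY()) {
        /* other work could be done here between polls */
    }
    return ACCELERATOR_MY_PROJECT_COPROCESS_GET_RETURN_VALUE();
}
```

In an interrupt-driven design, the ISR would typically call the CLEAR_IRQ macro and then read the return value, rather than polling.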
Refer to the Exception Handling chapter of the Nios II Software Developer's Handbook for more information about creating interrupt service routines.
Unshare Pointer Pragma
As discussed in Resource Sharing in the C-to-Hardware Mapping Reference chapter, the C2H Compiler automatically shares a master port for multiple pointer dereference operations that connect to the same slave port or group of slave ports. In certain cases, this causes a reduction in performance. For example, in Example 6-5 both ptr_a and ptr_b must be connected to both onchip_memory_0 and onchip_memory_1, but they never access the same memory at the same time. By default, the C2H Compiler attempts to share a single master between ptr_a and ptr_b, preventing these dereference operations from being scheduled concurrently and possibly degrading performance.
Example 6-5. Automatically Shared Master Port
    #pragma altera_accelerate connect_variable ptr_a to onchip_memory_0
    #pragma altera_accelerate connect_variable ptr_a to onchip_memory_1
    #pragma altera_accelerate connect_variable ptr_b to onchip_memory_0
    #pragma altera_accelerate connect_variable ptr_b to onchip_memory_1
    if (x)
    {
        ptr_a = ONCHIP_MEMORY_0_BASE;
        ptr_b = ONCHIP_MEMORY_1_BASE;
    }
    else
    {
        ptr_a = ONCHIP_MEMORY_1_BASE;
        ptr_b = ONCHIP_MEMORY_0_BASE;
    }
    /* ... perform some dereference operations with ptr_a and ptr_b ... */
To overcome this problem, use the unshare_pointer pragma, which instructs the compiler to always optimize for speed, and never defer operations for the purposes of resource scheduling. The syntax is as follows, for a pointer my_ptr in function my_func:
    #pragma altera_accelerate unshare_pointer my_func/my_ptr
Introduction
The Nios II C-to-Hardware Acceleration (C2H) Compiler supports a large subset of the ANSI C language as described in Chapter 5 and Chapter 6 of the ISO/IEC 9899:1999(E) Specification. The current Nios II C2H Compiler does not support the C++ programming language or the library functions described in Chapter 7 of the ISO/IEC 9899:1999(E) Specification. This chapter describes Nios II C2H Compiler restrictions, including unsupported ANSI C language syntax, semantics, and constraints.
Language
This section refers to Chapter 6 of the ISO/IEC 9899:1999(E) Specification. Section and paragraph numbers from the ISO/IEC 9899:1999(E) Specification are cited in parentheses.
Declarations
The C2H Compiler supports the majority of data types used in the C programming language. The following sections describe C2H restrictions on C declarations.
float - section 6.3.1.5
double - section 6.3.1.5
_Complex - section 6.3.1.7
_Bool - section 6.3.1.2
_Imaginary - section 6.3.1.7
Floating constants are supported only after casting to a supported type. For example, the following code casts a floating constant to an integer constant:
    const int pi = (int) 3.142957142957;
Escape character sequences are supported in string literals, but not in character constants. The following declaration is supported because it employs a string literal:
    char *newline = "\n";
The following declaration is not supported because it employs a character constant:
    char newline = '\n';
Composite concatenation is not supported in initialization statements. The C2H Compiler does not support an initialization statement that concatenates two strings, such as the following:
    char s[] = "this" " string";
The following declaration is supported:
    char s[] = "this string";
Delayed Declaration
The C2H Compiler does not support delayed declaration of variables. For example, the following code, which first declares an array of unspecified size and later provides the size, is not supported:
    int a[];
    int a[20];
Instead, establish the size of the array when it is declared:
    int a[20];
Expressions
The C2H Compiler does not support the following C operators.
The following example, which passes the address of a as the argument to the function analyze(), is not supported:
    int analyze(int *p);
    void foo()
    {
        int a = 0;
        int c = analyze(&a);
    }
You can substitute the following code, which initializes the pointer outside of the accelerator:
    int analyze(int *p);
    int a = 0;       /* a is defined at file scope */
    int *pa = &a;    /* pointer initialized outside the accelerator */
    int foo()
    {
        return analyze(pa);
    }
Logical Expressions
All expressions in logical operations are evaluated. The parser does not stop evaluation if the first expression of a compound statement is true. For example, in the following statement, both i and j decrement, even if i is nonzero:
    if (i-- || j--)
In the following code fragment, the C2H Compiler evaluates the divide by 0, which causes an error:
    int i = 2 || 1 / 0;
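One way to keep the software and accelerated versions of a function consistent is to avoid side effects inside logical expressions. The following sketch, with hypothetical function names, contrasts the standard C short-circuit form with a form that hoists the side effects out of the condition, so that both decrements always run regardless of which evaluation rule applies.

```c
/* Standard C: || short-circuits, so (*j)-- runs only when *i was zero. */
int test_short_circuit(int *i, int *j)
{
    return ((*i)-- || (*j)--) ? 1 : 0;
}

/* Side effects hoisted out of the condition: both decrements always run,
 * which matches the evaluate-everything behavior described above. */
int test_hoisted(int *i, int *j)
{
    int a = (*i)--;
    int b = (*j)--;
    return (a || b) ? 1 : 0;
}
```

When compiled as ordinary software with i and j both equal to 1, the first form leaves j unchanged, while the second form decrements both variables.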
Functions
The following sections list restrictions on functions.
Function Arguments
This section lists restrictions on arguments to functions. Composite Types (Section 6.2.7) The C2H Compiler does not support function arguments of different but compatible types in function declarations that refer to the same entity.
For example, the following code shows two declarations of my_func() with compatible arguments, which is not supported:
    int my_func(int (*)(), double (*)[3]);
    int my_func(int (*)(char *), double (*)[]);
These two declarations can be combined into a single composite function prototype that is compatible with the previous declarations:
    int my_func(int (*)(char *), double (*)[3]);
Ellipsis (Section 6.7.5.3, Paragraph 9)
The ellipsis function argument is not supported. The following function includes an incompletely specified parameter list, which is not supported:
    void foo(int a, short b, ...);
The previous example can be replaced with a function declaration that completely specifies the parameter list:
    void foo(int a, short b, char *c);
Struct and Union (Section 6.7.2.1, Paragraph 1)
The C2H Compiler does not support passing struct or union arguments to a function by value. There are two ways to include struct or union types in the C source:
Pass a pointer to a struct or union as an argument to the function.
Define the struct or union globally outside the accelerated function.
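The first option above, passing a pointer to a struct, can be sketched as follows. The struct dma_desc type and the function do_dma are hypothetical names used only for this illustration.

```c
/* Hypothetical descriptor type used only for this illustration. */
struct dma_desc {
    int *src;
    int *dst;
    int  length;
};

/* The struct is passed by pointer rather than by value. */
void do_dma(struct dma_desc *d)
{
    int i;
    for (i = 0; i < d->length; i++)
        d->dst[i] = d->src[i];
}
```

The caller fills in a descriptor and passes its address, so no struct is copied across the function call.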
The following code, which passes a struct MyStruct as an argument, is not supported:
    void doDMA(struct s MyStruct);
The previous example can be replaced with code that defines the struct s outside of the function call:
    struct s MyStruct;
    void doDMA();
Function Pointers (Section 6.7.5.3, Paragraph 8)
Function pointers are supported if used to point to functions that exist inside the hardware accelerator. The C2H Compiler does not support function pointers used as input or output arguments to an accelerator.
Altera Corporation November 2009 9.1 75 Nios II C2H Compiler User Guide
Example 7-1 defines three sub-functions, sub_plus_one(), sub_plus_two() and sub_plus_three(). A fourth function, c2h_fnc(), selects one of the three sub-functions through a function pointer, depending on the value of the input argument one_two_or_three, and calls it. The C2H Compiler supports this use of function pointers as long as all four functions are part of the hardware accelerator.
Example 7-1. Use of Function Pointers Inside the C2H Accelerator
    int sub_plus_one(int in)
    {
        return in + 1;
    }
    int sub_plus_two(int in)
    {
        return in + 2;
    }
    int sub_plus_three(int in)
    {
        return in + 3;
    }
    int c2h_fnc(int in, int one_two_or_three)
    {
        int (*fp)(int);
        fp = ((one_two_or_three == 3) ? sub_plus_three :
             ((one_two_or_three == 2) ? sub_plus_two : sub_plus_one));
        return fp(in);
    }
Function Argument Types (Section 6.9.1, Paragraph 13)
Functions must define the types of the arguments passed. For example, the following declaration of foo() is not supported because the arguments a, b, and c are not typed:
    void foo(a,b,c);
The following function declaration, which defines the argument types inside the function argument list, is supported:
    void foo(char a, char b, char c);
Function Prototypes (Section 6.9.1, Paragraph 14)
A function prototype cannot be encapsulated within a function. For example, the following code is not supported:
    void doDMA(int a)
    {
        void analyze(int i);
        ...
    }
The C2H Compiler supports separate declarations of functions, as follows:
    void analyze(int i);
    void doDMA(int a)
    {
        ...
    }
You can replace recursive functions with equivalent code that implements the function without using recursion. Example 7-3 shows an equivalent implementation of the factorial function without using recursion.
Example 7-3. Nonrecursive Implementation of Factorial Function
    int factorial(int x)
    {
        int tmp = 1, i;
        for (i = 0; i < x; i++)
        {
            tmp *= (i + 1);
        }
        return tmp;
    }
Other Restrictions
The C2H Compiler does not support external subfunctions. You must locate the subfunction in the same source file as the accelerated function. This is because, unlike the #include construct, a C external function reference requires the presence of a linker. The C2H Compiler has no linker.
Additional Information
Referenced Documents
Quartus II Handbook, volume 4: SOPC Builder
Overview chapter of the Nios II Software Developer's Handbook
Cache and Tightly-Coupled Memory chapter of the Nios II Software Developer's Handbook
Exception Handling chapter of the Nios II Software Developer's Handbook
Using the Nios II Integrated Development Environment appendix to the Nios II Software Developer's Handbook
Avalon Memory-Mapped Interface Specification
AN 320: OpenCore Plus Evaluation of Megafunctions
AN 391: Profiling Nios II Systems
Optimizing Nios II C2H Compiler Results chapter of the Embedded Design Handbook
Nios II Hardware Development Tutorial
Nios II Software Development Tutorial available in the Nios II integrated development environment (IDE) help system
Accelerating Nios II Systems with the C2H Compiler Tutorial
Revision History
The table below displays the revision history for the chapters in this user guide.

Document Version    Date
1.6                 November 2009
1.5                 November 2008
1.4                 May 2008
1.3                 October 2007
1.2                 May 2007
1.1                 August 2006
1.0                 May 2006

Chapter (in table order): 1, 5, 7, All, 2, 6, 2, 5, 7, All, 5, 6, 7, 4, 5, All

Changes Made (in table order):
C2H Compiler not supported by Nios II Software Build Tools for Eclipse (November 2009, 1.6)
C2H Compiler generates little-endian hardware (November 2009, 1.6)
C2H accelerators cannot be imported into the Nios II Software Build Tools for Eclipse
Document restriction on implicit function return values.
SOPC Builder system file implemented as .sopcinfo type (instead of .sopc type).
Include makefile fragments when copying project.
Added unshare_pointer pragma.
Added details about how to delete IDE project containing accelerator
List limitation on external subfunctions
Updated for changes between 6.0 and 7.1
Pipelined subfunctions. Extended resource sharing. Support for the software build tools. Interrupt generation.
Additional chapter on Nios II software build tools
Additional chapter on pragmas
Move from chapter 5 to chapter 7
Additional chapter on C2H view (August 2006, 1.1)
Move from chapter 4 to chapter 5 (August 2006, 1.1)
First publication (May 2006, 1.0)
Additional Information
For the most up-to-date information about Altera products, refer to the following table.

Information Type                            Contact (1)
Technical support                           www.altera.com/support
Technical training                          www.altera.com/training
                                            [email protected]
Altera literature services                  [email protected]
Non-technical support (General)             [email protected]
Non-technical support (Software Licensing)  [email protected]

Note to table:
(1) You can also contact your local Altera sales office or sales representative.
Typographic Conventions

Visual Cue
    Meaning

Bold Type with Initial Capital Letters
    Command names, dialog box titles, checkbox options, and dialog box options are shown in bold, initial capital letters. Example: Save As dialog box.
bold type
    External timing parameters, directory names, project names, disk drive names, filenames, filename extensions, and software utility names are shown in bold type. Examples: fMAX, \qdesigns directory, d: drive, chiptrip.gdf file.
Italic Type with Initial Capital Letters
    Document titles are shown in italic type with initial capital letters. Example: AN 75: High-Speed Board Design.
Italic type
    Variable names are shown in italic type. Example: <file name>, <project name>.pof file.
Initial Capital Letters
    Keyboard keys and menu names are shown with initial capital letters. Examples: Delete key, the Options menu.
"Subheading Title"
    References to sections within a document and titles of on-line help topics are shown in quotation marks. Example: "Typographic Conventions."
Courier type
    Signal and port names are shown in lowercase Courier type. Examples: data1, tdi, input. Active-low signals are denoted by suffix n, e.g., resetn. Anything that must be typed exactly as it appears is shown in Courier type. For example: c:\altera\. Also, references to C code are shown in Courier.
(numbered steps)
    Numbered steps are used in a list of items when the sequence of the items is important, such as the steps listed in a procedure.
(bullets)
    Bullets are used in a list of items when the sequence of the items is not important.
(hand icon)
    The hand points to information that requires special attention.
(caution icon)
    A caution calls attention to a condition or possible situation that can damage or destroy the product or the user's work.
(warning icon)
    A warning calls attention to a condition or possible situation that can cause injury to the user.
(angled arrow icon)
    The angled arrow indicates you should press the Enter key.
(feet icon)
    The feet direct you to more information on a particular topic.