Ug871 Vivado High Level Synthesis Tutorial
Tutorial
High-Level Synthesis
Table of Contents
Revision History
Chapter 3: C Validation
  Overview
  Tutorial Design Description
  Lab 1: C Validation and Debug
  Lab 2: C Validation with ANSI C Arbitrary Precision Types
  Lab 3: C Validation with C++ Arbitrary Precision Types
  Conclusion
Tutorial Description
Overview
This Vivado® tutorial is a collection of smaller tutorials that explain and demonstrate all
steps in the process of transforming C, C++, and SystemC code to an RTL implementation
using High-Level Synthesis. The tutorial shows how to create an initial RTL implementation
and then transform it into both a low-area and a high-throughput implementation by
using optimization directives, without changing the C code. The following sections
summarize each tutorial.
C Validation
This tutorial reviews the aspects of a good C test bench and demonstrates the basic
operations of the Vivado High-Level Synthesis C debug environment. The tutorial also
shows how to debug arbitrary precision data types.
Interface Synthesis
This interface synthesis tutorial reviews all aspects of creating ports for the RTL design. You
can learn how to control block-level I/O port protocols and port I/O protocols, how arrays
in the C function can be implemented as multiple ports and types of interface protocol
(RAM, FIFO, AXI4-Stream), and how AXI4 bus interfaces are implemented.
To create an optimal implementation of the design the tutorial concludes with a design
example where I/O accesses and logic are optimized together.
Design Analysis
This tutorial uses a DCT function to explain the interactive design analysis features in
Vivado High-Level Synthesis. The initial design takes you through a number of
analysis and optimization stages that highlight the features of the analysis perspective
and provide the basis for a design optimization methodology.
Design Optimization
Using a matrix multiplier example, this tutorial reviews two design optimization techniques.
The Design Optimization lab explains how a design can be pipelined, contrasting the
approach of pipelining the loops versus pipelining the functions.
The tutorial shows you how to use the insights learned from analyzing to update the initial
C code and create a more optimal implementation of the design.
RTL Verification
This tutorial shows how you can use the RTL CoSimulation feature to automatically verify
the RTL created by synthesis. The tutorial demonstrates the importance of the C test bench
and shows you how to use the output from RTL verification to view the waveform diagrams
in the Vivado and Mentor Graphics ModelSim simulators.
Software Requirements
This tutorial requires that the Vivado Design Suite 2017.1 release or later is installed.
Hardware Requirements
Xilinx recommends a minimum of 2 GB of RAM when using the Vivado tools.
IMPORTANT: All the tutorial examples for Vivado High-Level Synthesis are available at: Reference
Design Files
This tutorial assumes that you have placed the unzipped design files in the location
C:\Vivado_HLS_Tutorial.
Overview
This tutorial introduces Vivado® High-Level Synthesis (HLS). You can learn the primary
tasks for performing High-Level Synthesis using both the Graphical User Interface (GUI) and
Tcl environments.
The tutorial shows how use of optimization directives transforms an initial RTL
implementation into both a low-area and high-throughput implementation.
Lab 1 Description
Explains how to set up a High-Level Synthesis (HLS) project and perform all the major steps
in the HLS design flow:
Lab 2 Description
Demonstrates how to use the Tcl interface.
Lab 3 Description
Shows you how to optimize the design using optimization directives. This lab creates
multiple versions of the RTL implementation and compares the different solutions.
The sample design used in this tutorial is a FIR filter. The hardware goal for this FIR design
project is:
The final design must process data supplied with an input valid signal and produce output
data accompanied by an output valid signal. The filter coefficients are to be stored
externally to the FIR design, in a single port RAM.
IMPORTANT: The figures and commands in this tutorial assume that the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial.
° On Windows systems, open Vivado HLS by double-clicking the Vivado HLS 2019.1
desktop icon.
TIP: You can also open Vivado HLS using the Windows menu Start > All Programs > Xilinx Design
Tools > Vivado 2019.1 > Vivado HLS > Vivado HLS 2019.1.
Vivado HLS opens with the Welcome screen as shown below. If any projects were previously
opened, they are listed in the Recent Project pane; otherwise, this pane is not shown in
the Welcome screen.
IMPORTANT: In this lab there is only one C design file. When there are multiple C files to be
synthesized, you must add all of them to the project at this stage. Any header files that exist in the local
directory lab1 are automatically included in the project. If the header resides in a different location,
use the Edit CFLAGS button to add the standard gcc/g++ search path information (for example,
-I<path_to_header_file_dir>).
Figure 2-5 shows the input window for specifying the test bench files. The test bench and
all files used by the test bench (except header files) must be included. You can add files one
at a time, or select multiple files to add using the Ctrl and Shift keys.
If you do not include all the files used by the test bench (for example, data files read by the
test bench, such as out.gold.dat), C and RTL simulation might fail due to an inability to
find the data files.
The Solution Configuration window (shown in Figure 2-6) specifies the technical
specifications of the first solution.
A project can have multiple solutions, each using a different target technology, package,
constraints, and/or synthesis directives.
In the Solution Configuration dialog box (shown in Figure 2-6, above), the selected part
name now appears under the Part Selection heading.
12. Click Finish to open the Vivado HLS project, as shown in Figure 2-7.
TIP: At any time, you can change project or solution settings using the corresponding Project Settings
and/or Solution Settings buttons in the toolbar.
Explorer Pane
Shows the project hierarchy. As you proceed through the validation, synthesis, verification,
and IP packaging steps, sub-folders with the results of each step are created automatically
inside the solution directory (named csim, syn, sim, and impl respectively).
When you create new solutions, they appear inside the project hierarchy alongside
solution1.
Information Pane
Shows the contents of any files opened from the Explorer pane. When operations complete,
the report file opens automatically in this pane.
Auxiliary Pane
Cross-links with the Information pane. The information shown in this pane dynamically
adjusts, depending on the file open in the Information pane.
Console Pane
Shows the messages produced when Vivado HLS runs. Errors and warnings appear in
Console pane tabs.
Toolbar Buttons
You can perform the most common operations using the Toolbar buttons.
When you hold the cursor over the button, a popup tool tip opens, explaining the function.
Each button also has an associated menu item available from the pull-down menus.
Perspectives
The perspectives provide convenient ways to adjust the windows within the Vivado HLS
GUI.
• Synthesis Perspective
The default perspective allows you to synthesize designs, run simulations, and package the
IP.
• Debug Perspective
Includes panes associated with debugging the C code. You can open the Debug Perspective
after the C code compiles (unless you use the Optimizing Compile mode as this disables
debug information).
• Analysis Perspective
Windows in this perspective are configured to support analysis of synthesis results. You can
use the Analysis Perspective only after synthesis completes.
In this project, the test bench compares the output data from the fir function with known
good values.
• The test bench saves the output from the fir function into the output file, out.dat.
• The output file is compared with the golden results, stored in file out.gold.dat.
• If the output matches the golden data, a message confirms that the results are correct,
and the return value of the test bench main() function is set to 0.
• If the output is different from the golden results, a message indicates this, and the
return value of main() is set to 1.
The Vivado HLS tool can reuse the C test bench to perform verification of the RTL.
If the test bench has the previously described self-checking characteristics, the RTL results
are automatically checked during RTL verification. Vivado HLS re-uses the test bench during
RTL verification and confirms the successful verification of the RTL if the test bench returns
a value of 0. If any other value is returned by main(), including no return value, it indicates
that the RTL verification failed. There is no requirement to create an RTL test bench. This
provides a robust and productive verification methodology.
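The self-checking behavior described above can be sketched in C as follows. This is a minimal illustration, not the test bench shipped with the tutorial: the helper name compare_results is hypothetical, and it assumes the result and golden files hold whitespace-separated integers. The value it returns is the value a test bench main() would return.

```c
#include <stdio.h>

/* Hypothetical helper: compare two text files of integers value by value.
 * Returns 0 when both files hold the same sequence of values, and 1
 * otherwise (including when either file cannot be opened). */
int compare_results(const char *result_path, const char *golden_path)
{
    FILE *res = fopen(result_path, "r");
    FILE *gold = fopen(golden_path, "r");
    int status = 0;

    if (res == NULL || gold == NULL) {
        status = 1;                      /* a missing file counts as a failure */
    } else {
        for (;;) {
            int a = 0, b = 0;
            int ra = fscanf(res, "%d", &a);
            int rb = fscanf(gold, "%d", &b);
            if (ra == EOF && rb == EOF)
                break;                   /* both files ended together */
            if (ra != 1 || rb != 1 || a != b) {
                status = 1;              /* length or value mismatch */
                break;
            }
        }
    }
    if (res != NULL)
        fclose(res);
    if (gold != NULL)
        fclose(gold);

    if (status == 0)
        printf("PASS: results match the golden output\n");
    else
        printf("FAIL: results differ from the golden output\n");
    return status;
}
```

A test bench would typically end with a statement such as return compare_results("out.dat", "out.gold.dat"); so that the same return value drives both C simulation and RTL verification.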
4. Click the Run C Simulation button, or use menu Project > Run C Simulation, to
compile and execute the C design.
5. In the C Simulation dialog box, click OK.
The Console pane (Figure 2-10) confirms the simulation executed successfully.
TIP: If the C simulation ever fails, select the Launch Debugger option in the C Simulation dialog box,
compile the design, and automatically switch to the Debug perspective. There you can use a C
debugger to fix any problems.
The C Validation tutorial module provides more details on using the Debug environment.
1. Click the Run C Synthesis toolbar button or use the menu Solution > Run C Synthesis
> Active Solution.
When synthesis completes, the report file opens automatically. Because the synthesis
report is open in the Information pane, the Outline tab in the Auxiliary pane automatically
updates to reflect the report information.
3. In the Detail section of the Performance Estimates, expand the Loop view.
The clock uncertainty ensures there is some timing margin available for the (at this stage)
unknown net delays due to place and routing.
The estimated clock period (worst-case delay) is 5.772 ns, which meets the 8.75 ns timing
requirement.
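The relationship between the clock target, the uncertainty margin, and the timing requirement can be sketched as below. The 10 ns clock period and the 12.5% uncertainty used in the usage note are assumptions (they are not stated in this section); they are chosen because they reproduce the 8.75 ns requirement quoted above. The function names are hypothetical.

```c
/* Effective timing requirement in ns: the clock period reduced by the
 * uncertainty margin, expressed as a fraction of the period. */
double effective_period_ns(double period_ns, double uncertainty_fraction)
{
    return period_ns * (1.0 - uncertainty_fraction);
}

/* Returns 1 when the estimated worst-case delay fits within the
 * reduced requirement, 0 otherwise. */
int meets_timing(double estimated_ns, double period_ns,
                 double uncertainty_fraction)
{
    return estimated_ns <= effective_period_ns(period_ns, uncertainty_fraction);
}
```

Under those assumed values, effective_period_ns(10.0, 0.125) gives 8.75, and the estimated 5.772 ns delay satisfies meets_timing.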
• The design has a latency of 34 clock cycles: it takes 34 clocks to output the results.
• The interval is 34 clock cycles: the next set of inputs is read after 34 clocks. The design
is not pipelined. The next execution of this function (or next transaction) can only start
when the current transaction completes.
• There are no sub-blocks in this design. Expanding the Instance section shows no
submodules in the hierarchy.
• All of the latency is due to the RTL logic synthesized from the loop named
Shift_Accum_Loop. This logic executes 11 times (Trip Count). Each execution
requires 3 clock cycles (Iteration Latency), for a total of 33 clock cycles, to execute all
iterations of the logic synthesized from this loop (Latency).
• The total latency is one clock cycle greater than the loop latency. It requires one clock
cycle to enter and exit the loop (in this case, the design finishes when the loop finishes,
so there is no exit cycle).
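The latency figures in the list above follow from simple arithmetic, sketched here with hypothetical helper names. The model assumes a rolled, non-pipelined loop, as in this initial solution.

```c
/* Latency of a rolled, non-pipelined loop: every iteration runs to
 * completion before the next one starts. */
int loop_latency(int trip_count, int iteration_latency)
{
    return trip_count * iteration_latency;
}

/* Total function latency: the loop latency plus the cycles spent
 * outside the loop (here, the single entry cycle). */
int function_latency(int loop_cycles, int overhead_cycles)
{
    return loop_cycles + overhead_cycles;
}
```

With a trip count of 11 and an iteration latency of 3, loop_latency returns 33, and adding the one entry cycle gives the reported 34.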
4. In the Outline tab, click Utilization Estimates (Figure 2-12).
° The design uses a single memory implemented as LUTRAM (since it contains fewer
than 1024 elements), 3 DSP48s, and approximately 200 flip-flops and LUTs. At this
stage, the device resource numbers are estimates.
° The resource utilization numbers are estimates because RTL synthesis might be able
to perform additional optimizations, and these figures might change after RTL
synthesis.
° The multiplier instance shown in the Expression view accounts for all the DSP48s.
In Lab 3: Using Solutions for Design Optimization, you optimize this design.
• The design has a clock and reset port (ap_clk and ap_rst). These are associated
with the Source Object fir: the design itself.
• There are additional ports associated with the design as indicated by Source Object fir.
Synthesis has automatically added some block level control ports: ap_start,
ap_done, ap_idle, and ap_ready.
• The Interface Synthesis tutorial provides more information about these ports.
• The function output y is now a 32-bit data port with an associated output valid signal
indicator y_ap_vld.
• Function input argument c (an array) has been implemented as a block RAM interface
with a 4-bit output address port, an output CE port and a 32-bit input data port.
• Finally, scalar input argument x is implemented as a data port with no I/O protocol
(ap_none).
Later in this tutorial, Lab 3: Using Solutions for Design Optimization explains how to
optimize the I/O protocol for port x.
1. Click the Run C/RTL CoSimulation toolbar button or use the menu Solution > Run
C/RTL CoSimulation.
2. Click OK in the C/RTL Co-simulation dialog box to execute the RTL simulation.
The default option for RTL co-simulation is to perform the simulation using the Vivado
simulator and Verilog RTL. To perform the verification using a different simulator or
language use the options in the C/RTL Co-simulation dialog box.
When RTL co-simulation completes, the report opens automatically in the Information
pane, and the Console displays the message shown in Figure 2-14. This is the same
message produced at the end of C simulation.
• The C test bench generates input vectors for the RTL design.
• The RTL design is simulated.
• The output vectors from the RTL are applied back into the C test bench, and the
results-checking in the test bench verifies whether or not the results are correct.
• Vivado HLS indicates that simulation passes if the test bench returns a value of 0. It
is the value of the return variable in the test bench, and this alone, which indicates if
the simulation was successful. It is important that the test bench returns a value of 0
only if the results are correct.
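The last point in the list above can be illustrated with a small sketch that ties the return value strictly to the outcome of the checks. The function names are hypothetical and the in-memory arrays stand in for whatever results the test bench collects.

```c
/* Count the positions where the observed results differ from the
 * expected results. */
int count_mismatches(const int *actual, const int *expected, int n)
{
    int errors = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (actual[i] != expected[i])
            errors++;
    }
    return errors;
}

/* Map the error count to the value main() should return: 0 only when
 * no mismatch was found, 1 for any number of mismatches. */
int testbench_return_value(int errors)
{
    return (errors == 0) ? 0 : 1;
}
```

A test bench that returns 0 unconditionally would be reported as passing even when the RTL output is wrong, which is why the return value should be derived from the comparison in this way.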
Step 5: IP Creation
The final step in the High-Level Synthesis flow is to package the design as an IP block for
use with other tools in the Vivado Design Suite.
1. Click the Export RTL toolbar button or use the menu Solution > Export RTL.
2. Ensure the Format Selection drop-down menu shows IP Catalog.
3. Click OK.
The IP packager creates a package for the Vivado IP Catalog. (Other options available from
the drop-down menu allow you to create IP packages for System Generator for DSP, a
Synthesized Checkpoint format for Vivado, or a Pcore for Xilinx Platform Studio.)
° On Windows, use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt (Figure 2-16).
2. In the GUI, still open from Lab 1, expand the Constraints folder in solution1 and
double-click the file script.tcl to view it in the Information pane.
In this lab exercise, you use the script.tcl from Lab 1 to create a Tcl file for the Lab 2
project.
3. Close the Vivado HLS GUI from Lab 1. This project is no longer needed.
4. In the Vivado HLS Command Prompt, use the following commands (also shown in
Figure 2-18) to create a new Tcl file for Lab 2.
a. Change directory to the Introduction tutorial directory
C:\Vivado_HLS_Tutorial\Introduction.
b. Use the command cp lab1\fir_prj\solution1\script.tcl lab2\run_hls.tcl
to copy the existing Tcl file to Lab 2. (The Windows command prompt supports
auto-completion using the Tab key: press the Tab key repeatedly to see new
selections.)
c. Use the command cd lab2 to change into the lab2 directory.
Vivado HLS executes all the steps covered in lab1. When finished, the results are available
inside the project directory fir_prj.
CAUTION! When copying the RTL results from a Vivado HLS project, you must use the RTL from the
impl directory. Additional processing is performed by Vivado HLS during export_design before you
can use this RTL in other design tools.
° On Windows, use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
4. In the command prompt window, type vivado_hls -p fir_prj to open the project
in the Vivado HLS GUI.
Vivado HLS opens, as shown in Figure 2-20, with the synthesis for solution1 already
complete.
You reviewed the I/O protocol for this design in Lab 1 (Figure 2-13), and you can review the
synthesis report again by navigating to the report folder inside the solution1\syn folder.
The I/O requirements are:
Port C already is a single-port RAM access. However, if you do not explicitly specify the RAM
access type, High-Level Synthesis might use a dual-port interface. HLS takes this action if
doing so creates a design with a higher throughput. If a single-port is required, you should
explicitly add to the design the I/O protocol requirement to use a single-port RAM.
Input Port X is by default a simple 32-bit data port. You can implement it as an input data
port with an associated data valid signal by specifying the I/O protocol ap_vld.
Output Port Y already has an associated output valid signal. This is the default for pointer
arguments. You do not have to specify an explicit port protocol for this port, because the
default implementation is what is required, but if it is a requirement, it is a good practice to
specify it.
1. Click the New Solution toolbar button, or use the menu Project > New Solution, to create a new solution.
2. Leave the default solution name as solution2. Do not change any of the technology
or clock settings.
3. Click Finish.
This creates solution2 and sets it as the active solution. To confirm, verify that
solution2 is highlighted in bold in the Explorer pane.
To add optimization directives to define the desired I/O interfaces to the solution, perform
the following steps.
4. In the Explorer pane, expand the Source container (as shown in Figure 2-21).
5. Double-click fir.c to open the file in the Information pane.
6. Activate the Directive tab in the Auxiliary pane and select the top-level function fir to
jump to the top of the fir function in the source code view.
The steps above specify that array c be implemented using a single-port block RAM
resource. Because array c is in the function argument list, and hence is outside the function,
a set of data ports are automatically created to access a single-port block RAM outside the
RTL implementation.
Because I/O protocols are unlikely to change, you can add these optimization directives to
the source code as pragmas to ensure that the correct I/O protocols are embedded in the
design.
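As an illustration of directives embedded as pragmas, the sketch below shows one way the I/O requirements discussed above could appear in the source. It is not copied from the tutorial's fir.c: the function body, the type names, and the value of N are assumptions based on the report details quoted earlier (the Shift_Accum_Loop label and its trip count of 11), and the pragma spellings should be checked against the tool documentation for the release in use. A standard C compiler ignores the pragmas, so the function still runs in C simulation.

```c
#define N 11            /* assumed tap count, matching the reported trip count */

typedef int coef_t;     /* assumed 32-bit types, per the reported port widths */
typedef int data_t;
typedef int acc_t;

void fir(data_t *y, coef_t c[N], data_t x)
{
/* Input x and output y: data ports with an associated valid signal. */
#pragma HLS INTERFACE ap_vld port=x
#pragma HLS INTERFACE ap_vld port=y
/* Coefficient array c: accessed through a single-port block RAM. */
#pragma HLS RESOURCE variable=c core=RAM_1P_BRAM
    static data_t shift_reg[N];
    acc_t acc = 0;
    data_t data;
    int i;

Shift_Accum_Loop:
    for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            shift_reg[0] = x;            /* newest sample enters the line */
            data = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            data = shift_reg[i];
        }
        acc += data * c[i];              /* multiply-accumulate per tap */
    }
    *y = acc;
}
```

Because the pragmas travel with the source file, every solution created from this code inherits the same I/O protocols.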
10. In the Destination section of the Directive Editor, select Source File.
11. To apply the directive, click OK.
TIP: If you wish to change the destination of any directive, double-click the directive in the Directive
tab and modify the destination.
When complete, verify that the source code and the Directive tab are correct as shown in
Figure 2-23. Right-click on any incorrect directive to modify it.
16. Click the Outline tab to view the Interface results, or simply scroll down to the bottom
of the report file.
Figure 2-24 shows that the ports now have the correct I/O protocols.
Follow the steps below to show the Analysis perspective as shown in Figure 2-25.
• The design starts in the first state with a read operation on port x.
• In the next state, it starts to execute the logic created by the for-loop
Shift_Accum_Loop. Loops are shown in grey, and you can expand or collapse them.
• In the first state, the loop iteration counter is checked: addition, comparison, and a
potential loop exit.
• There is a one-cycle memory read operation on the block RAM synthesized from array
data.
• There is a memory read on the c port.
• The multiplication operation takes 1 cycles to complete.
• The for-loop is executed 11 times.
• At the end of the final iteration, the loop exits in Control Step 1 and the write to port y
occurs.
You can also use the Analysis perspective to analyze the resources used in the design.
• There is a read on port x and a write to port y. Port c is reported in the memory section
because this is also a memory access (the memory is outside the design).
• There is a single pipelined multiplier used in this design.
• One of the adders is being shared: there are two instances of the adder on one row.
With the insight gained through analysis, you can proceed to optimize the design.
The Arbitrary Precision Types tutorial shows how you can create designs with more suitable
data types for hardware. Use of arbitrary precision types allows you to define data types of
any arbitrary bit size (rather than only the standard C/C++ 8-, 16-, 32-, or 64-bit types).
• The for loop. By default loops are kept rolled: one copy of the loop body is
synthesized and re-used for each iteration. This ensures each iteration of the loop is
executed sequentially. You can unroll the for loop to allow all operations to occur in
parallel.
• The block RAM used for shift_reg. Because the variable shift_reg is an array in
the C source code, it is implemented as a block RAM by default. However, this prevents
its implementation as a shift-register. You should therefore partition this block RAM
into individual registers.
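The two optimizations in the list above can be pictured in pragma form, as in the sketch below. This lab applies them through the solution's directive file rather than in the source, so the pragmas here are only an equivalent illustration; the function name fir_unrolled, the types, and N are assumptions, and the pragma spellings should be checked against the tool documentation.

```c
#define N 11            /* assumed tap count */

typedef int coef_t;
typedef int data_t;
typedef int acc_t;

void fir_unrolled(data_t *y, coef_t c[N], data_t x)
{
    static data_t shift_reg[N];
/* Split the array into N individual registers so that every element
 * can be read and written in the same clock cycle. */
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
    acc_t acc = 0;
    data_t data;
    int i;

Shift_Accum_Loop:
    for (i = N - 1; i >= 0; i--) {
/* Fully unroll: create one copy of the loop body per iteration. */
#pragma HLS UNROLL
        if (i == 0) {
            shift_reg[0] = x;
            data = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            data = shift_reg[i];
        }
        acc += data * c[i];
    }
    *y = acc;
}
```

The C behavior is unchanged by either pragma; only the synthesized hardware structure differs, which is why the same test bench can validate every solution.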
The following steps, summarized in Figure 2-27, explain how to unroll the loop.
X-Ref Target - Figure 2-27
IMPORTANT: Reminder: the source code must be open in the Information pane to see any code objects
in the Directive tab.
When optimizing a design, you must often perform multiple iterations of optimizations to
determine what the final optimization should be. By adding the optimizations to the
directive file, you can ensure they are not automatically carried forward to the next solution.
Storing the optimizations in the solution directive file allows different solutions to have
different optimizations. Had you added the optimizations as pragmas in the code, they
would be automatically carried forward to new solutions, and you would have to modify the
code to go back and re-run a previous solution.
Leave the other options in the Directives window unchecked and blank to ensure that the
loop is fully unrolled.
With the directives embedded in the code from solution2 and the two new directives just
added, the directive pane for solution3 appears as shown in Figure 2-28.
11. In the Explorer pane, expand the Constraints folder in solution3 as shown in
Figure 2-29.
12. Double-click the solution3 directives.tcl file to open it in the Information pane.
14. Compare the results of the different solutions. Click the Compare Reports toolbar
button.
Figure 2-30 shows the comparison of the reports. solution3 has the smallest initiation
interval and can process data much faster. As the interval is only 16, it starts to process a
new set of inputs every 16 clock cycles.
As mentioned earlier, you could modify the code itself to use arbitrary precision types. For
example, if the data types are not required to be 32-bit int types, you could use bit accurate
types (for example, 6-bit, 14-bit, or 22-bit types), provided that they satisfy the required
accuracy. For more details on using arbitrary precision types, see Chapter 5, Arbitrary
Precision Types.
Conclusion
In this tutorial, you learned how to:
• Create a Vivado High-Level Synthesis project in the GUI and Tcl environments.
• Execute the major steps in the HLS design flow.
• Create and use a Tcl file to run Vivado HLS.
• Create new solutions, add optimization directives, and compare the results of different
solutions.
C Validation
Overview
Validation of the C algorithm is an important part of the High-Level Synthesis (HLS) process.
Time spent confirming that the C algorithm performs the correct operation, and creating
a C test bench that confirms the results are correct, reduces the time spent analyzing
designs that are incorrect “by design” and ensures that RTL verification can be performed
automatically.
Lab 1 Description
Reviews the aspects of a good C test bench, the basic operations for C validation and the C
debugger.
Lab 2 Description
Validates and debugs a C design using arbitrary precision C types.
Lab 3 Description
Validates and debugs a design using arbitrary precision C++ types.
The sample design used in this tutorial is a Hamming Window FIR. There are three versions
of this design:
This tutorial explains the operation and methodology for C validation using High-Level
Synthesis. There are no design goals for this tutorial.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the
tutorial data directory is unzipped to a different location, or if you are working on a Linux
system, adjust the referenced pathnames to match the location of the Vivado_HLS_Tutorial
directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
This process of checking the results and returning a value of zero if they are correct
automates RTL verification.
You can execute the C code and test bench to confirm that the code is working as expected.
2. Click the Run C Simulation toolbar button to open the C Simulation Dialog box, shown
in Figure 3-4.
As shown in Figure 3-5, the following actions occur when C simulation executes:
Because the C simulation is not executed in the project directory, you must add any data
files to the project as C test bench files (so they can be copied to the csim/build
directory when the simulation runs). Such files would include, for example, input data read
by the test bench.
1. Click the Run C Simulation toolbar button to open the C Simulation Dialog box.
2. Select the Launch Debugger option as shown in Figure 3-6.
3. Click OK to run the simulation.
• Highlighted at the top-right in Figure 3-7, you can see that the perspective has
changed from Synthesis to Debug. Click the perspective buttons to return to the
synthesis environment at any time.
• By default, the code compiles in debug mode. The Launch Debugger option
automatically opens the debug perspective at time 0, ready for debug to begin. To
compile the code without debug information, select the Optimizing Compile option in
the C Simulation dialog box.
For more detailed analysis, to the right of the Step Into button are the Step Over (F6), Step
Return (F7) and the Resume (F8) buttons.
You can use the Run C simulation button to restart the debug session from within the
Debug perspective.
14. Exit the Vivado HLS GUI and return to the command prompt.
More details on using arbitrary precision types are discussed in Chapter 5, Arbitrary
Precision Types. An example of using arbitrary precision types would be to change
this file to use 12-bit input data types: standard C types only support data widths on 8-bit
boundaries.
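To make the 12-bit example concrete, the sketch below emulates the value range of a signed 12-bit type in standard C. It is only a stand-in for reasoning about ranges and wrap-around: the function name wrap_int12 is hypothetical, and a real design would use the arbitrary precision types supplied with the tool instead of masking by hand.

```c
#include <stdint.h>

/* Reduce a value to the range of a signed 12-bit two's-complement
 * integer, [-2048, 2047], wrapping on overflow. */
int32_t wrap_int12(int32_t value)
{
    uint32_t bits = (uint32_t)value & 0xFFFu;   /* keep the low 12 bits */
    if (bits & 0x800u)                          /* bit 11 is the sign bit */
        return (int32_t)bits - 4096;
    return (int32_t)bits;
}
```

For example, 2047 is the largest value that survives unchanged, while 2048 wraps to -2048. A 16-bit short cannot expose this behavior, which is one reason bit-accurate validation needs bit-accurate types.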
The message in the console pane and log file indicates that you cannot debug the arbitrary
precision types used for ANSI C designs in the debug environment.
IMPORTANT: When working with arbitrary precision types, you can use the Vivado HLS debug
environment only with C++ or SystemC. When using arbitrary precision types with ANSI C, the debug
environment cannot be used. With ANSI C, you must instead use printf or fprintf statements for
debugging.
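A minimal sketch of fprintf-based debugging is shown below. The function names and the multiply-accumulate loop are hypothetical and use plain long values; in an ANSI C design with arbitrary precision types, the same pattern would be applied to the design's own variables.

```c
#include <stdio.h>

/* Write one trace line for a single accumulation step. Returns the
 * value returned by fprintf (negative on an output error). */
int trace_step(FILE *log, int step, long data, long coef, long acc)
{
    return fprintf(log, "step=%d data=%ld coef=%ld acc=%ld\n",
                   step, data, coef, acc);
}

/* Multiply-accumulate over n samples, tracing every intermediate
 * value of the accumulator to the given stream. */
long traced_mac(FILE *log, const long *data, const long *coef, int n)
{
    long acc = 0;
    int i;
    for (i = 0; i < n; i++) {
        acc += data[i] * coef[i];
        trace_step(log, i, data[i], coef[i], acc);
    }
    return acc;
}
```

Passing stdout sends the trace to the console; passing a file opened with fopen keeps a log that can be compared between runs.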
The variables in the design are now C++ arbitrary precision types. These types are defined
in header file ap_int.h. When the debugger encounters these types, it follows the
definition into the header file.
As you continue stepping through the code, you have the opportunity to observe in greater
detail how the results for arbitrary precision types are calculated.
8. Click the Step Return button (or press the F7 key) to return to the calling function.
Conclusion
In this tutorial, you learned:
Interface Synthesis
Overview
Interface synthesis is the process of adding RTL ports to the C design. In addition to adding
the physical ports to the RTL design, interface synthesis includes an associated I/O protocol,
allowing the data transfer through the port to be synchronized automatically and optimally
with the internal logic.
This tutorial consists of four lab exercises that cover the primary features and capabilities of
interface synthesis.
Lab 1 Description
Review the function return and block-level protocols.
Lab 2 Description
Understand the default I/O protocol for ports and learn how to select an I/O protocol.
Lab 3 Description
Review how array ports are implemented and can be partitioned.
Lab 4 Description
Create an optimized implementation of the design and add AXI4 interfaces.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or if you are working on a Linux system, adjust the
referenced pathnames to match the location of the Vivado_HLS_Tutorial directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
4. When Vivado HLS completes, open the project in the Vivado HLS GUI using the
command vivado_hls -p adders_prj, as shown in Figure 4-2.
This example uses a simple design to focus on the I/O implementation (and not the logic in
the design). The important points to take from this code are:
• Directives in the form of pragmas have been added to the source code to prevent any
I/O protocol being synthesized for any of the data ports (inA, inB and inC). I/O port
protocols are reviewed in the next lab exercise.
• This function returns a value and this is the only output from the function. As seen in
later exercises, not all functions return a value. The port created for the function return
is discussed in this lab exercise.
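The points above can be pictured with a sketch of such a function. This is an illustration rather than the tutorial's source file: the argument names in1, in2, and in3 are assumed, and the pragma spelling should be checked against the tool documentation. A standard C compiler ignores the pragmas.

```c
/* Three-input adder whose only output is the function return value. */
int adders(int in1, int in2, int in3)
{
/* Request no I/O protocol (plain data ports) on each input. */
#pragma HLS INTERFACE ap_none port=in1
#pragma HLS INTERFACE ap_none port=in2
#pragma HLS INTERFACE ap_none port=in3
    int sum;

    sum = in1 + in2 + in3;
    return sum;
}
```

Because the result leaves through the return statement rather than through a pointer argument, synthesis creates a dedicated return port for it.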
2. Execute the Run C Synthesis command using the dedicated toolbar button or the
Solution menu.
3. To review the RTL interfaces scroll to the Interface summary at the end of the synthesis
report.
The Interface summary and Outline tab are shown in Figure 4-4.
• The design takes more than one clock cycle to complete, so a clock and reset have been
added to the design: ap_clk and ap_rst. Both are single-bit inputs.
• A block-level I/O protocol has been added to control the RTL design: ports ap_start,
ap_done, ap_idle and ap_ready. These ports will be discussed shortly.
• The design has four data ports.
° Input ports In1, In2, and In3 are 32-bit inputs and have the I/O protocol
ap_none (as specified by the directives in Figure 4-4).
° The design also has a 32-bit output port for the function return, ap_return.
The block-level I/O protocol allows the RTL design to be controlled by additional ports
independently of the data I/O ports. This I/O protocol is associated with the function itself,
not with any of the data ports. The default block-level I/O protocol is called ap_ctrl_hs.
Figure 4-5 shows this protocol is associated with the function return value (this is true even
if the function has no return value specified in the code).
Table 4-1 summarizes the behavior of the signals for block-level I/O protocol ap_ctrl_hs.
Note: The explanation here uses the term “transaction”. In the context of high-level synthesis, a
transaction is equivalent to one execution of the C function (or the equivalent operation in the
synthesized RTL design).
Table 4-1: Block Level I/O Protocol ap_ctrl_hs
Signal      Description
ap_start    This signal controls the block execution and must be asserted to logic 1
            for the design to begin operation.
            It should be held at logic 1 until the associated output handshake
            ap_ready is asserted. When ap_ready goes high, the decision can be made
            on whether to keep ap_start asserted and perform another transaction or
            set ap_start to logic 0 and allow the design to halt at the end of the
            current transaction.
            If ap_start is set low before ap_ready is high, the design might not
            have read all input ports and might stall operation on the next input
            read.
ap_ready    This output signal indicates when the design is ready for new inputs.
            The ap_ready signal is set to logic 1 when the design is ready to accept
            new inputs, indicating that all input reads for this transaction have
            been completed.
            If the design has no pipelined operations, new reads are not performed
            until the next transaction starts.
            This signal is used to decide when to apply new values to the input
            ports and whether to start a new transaction using the ap_start input
            signal.
            If the ap_start signal is not asserted high, this signal goes low when
            the design completes all operations in the current transaction.
ap_done     This signal indicates when the design has completed all operations in
            the current transaction.
            A logic 1 on this output indicates the design has completed all
            operations in this transaction. Because this is the end of the
            transaction, a logic 1 on this signal also indicates the data on the
            ap_return port is valid.
            Not all functions have a function return argument and hence not all RTL
            designs have an ap_return port.
ap_idle     This signal indicates if the design is operating or idle (no operation).
            The idle state is indicated by logic 1 on this output port. This signal
            is set low once the design starts operating.
            This signal is set high when the design completes operation and no
            further operations are performed.
You can observe the behavior of these signals by viewing the trace file produced by RTL
CoSimulation. This is discussed in Chapter 8, RTL Verification tutorial, but Figure 4-5 shows
the waveforms for the current synthesis results.
• The design does not start operation until ap_start is set to logic 1.
• The design indicates it is no longer idle by setting port ap_idle low.
• Five transactions are shown. The first three input values (10, 20, and 30) are applied to
input ports In1, In2, and In3 respectively.
• Output signal ap_ready goes high to indicate the design is ready for new inputs on
the next clock.
• Output signal ap_done indicates when the design is finished and that the value on
output port ap_return is valid (the first output value, 60, is the sum of all three
inputs).
• Because ap_start is held high, the next transaction starts on the next clock cycle.
Note: In RTL CoSimulation, all design and port input control signals are always enabled. For
example, in Figure 4-5 signal ap_start is always high.
In the 2nd transaction, notice on port ap_return, the first output has the value 70. The result
on this port is not valid until the ap_done signal is asserted high.
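The handshake behavior described above can be approximated with a small cycle-level model. This is a simplified sketch under stated assumptions, not the generated RTL: the struct name CtrlHsModel is hypothetical, the block is assumed to be non-pipelined with a fixed number of cycles per transaction, and ap_ready is assumed to be asserted in the same cycle as ap_done.

```cpp
#include <cassert>

// Simplified model of ap_ctrl_hs for a non-pipelined block (assumptions above).
struct CtrlHsModel {
    int latency;    // clock cycles per transaction (assumed fixed)
    int remaining;  // cycles left in the current transaction; 0 means idle
    bool ap_idle;
    bool ap_ready;
    bool ap_done;

    explicit CtrlHsModel(int cycles)
        : latency(cycles), remaining(0),
          ap_idle(true), ap_ready(false), ap_done(false) {
        assert(cycles >= 1);
    }

    // Advance the model by one clock cycle with the given ap_start level.
    void clock(bool ap_start) {
        ap_ready = false;
        ap_done = false;
        if (remaining == 0) {
            if (!ap_start) {      // nothing to do: stay (or become) idle
                ap_idle = true;
                return;
            }
            remaining = latency;  // begin a new transaction
        }
        ap_idle = false;
        --remaining;
        if (remaining == 0) {     // last cycle of the transaction
            ap_ready = true;
            ap_done = true;
        }
    }
};
```

With ap_start held high, the model begins a new transaction on the clock after ap_done, and it returns to idle only after a cycle in which ap_start is low and no transaction is in progress.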
1. Select New Solution from the toolbar or Project menu to create a new solution.
2. Leave all settings in the new solution dialog box at their default setting and click Finish.
3. Select the C source code tab (adders.c) in the Information pane (or re-open the C source
code if it was closed).
4. Activate the Directives tab and select the top-level function adders, as shown in
Figure 4-6.
5. In the Directive tab, mouse over the top-level function adders, right-click, and select
Insert Directive.
The Directives Editor dialog box opens. Select the INTERFACE option from the Directive
pull-down list.
Figure 4-7 shows this dialog box with the drop-down menu for the interface mode
activated.
The block-level I/O protocol ap_ctrl_chain is not covered in this tutorial. This protocol
is similar to ap_ctrl_hs protocol but with an additional input signal, ap_continue,
which must be high when ap_done is asserted for the next transaction to proceed. This
allows downstream blocks to apply back-pressure on the system and halt further processing
when they are unable to continue accepting new data.
6. In the Destination section of the Directives Editor dialog box, select Source File.
By default, directives are placed in the directives.tcl file. In this example, the directive
is placed in the source file with the existing I/O directives.
7. From the mode options, select ap_ctrl_none from the drop-down menu.
8. Click OK.
The source file now has a new directive, highlighted in both the source code and directives
tab in Figure 4-8.
The new directive shows the associated function argument/port called return. All
interface directives are attached to a function argument. For block-level I/O protocols, the
return argument is used to specify the block-level interface. This is true even if the
function has no return argument in the source code.
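A minimal sketch of a directive attached to the return argument is shown below. The function scale_by_two is hypothetical and is used only to show that port=return selects the block-level protocol even when the function returns void. A standard C++ compiler ignores the pragma.

```cpp
#include <cassert>

// Hypothetical function: it has no return value, but the block-level
// protocol is still specified on the "return" port.
void scale_by_two(int in, int *out) {
#pragma HLS INTERFACE ap_ctrl_none port=return
    assert(out != 0);
    *out = in * 2;
}
```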
Adding the directive to the source file modified the source file. Figure 4-8 shows the source
file name as *adders.c. The asterisk indicates that the file is modified but not saved.
When the report opens, the Interface summary appears, as shown in Figure 4-9.
Note that without the ap_done signal, the consumer block that accepts data from the
ap_return port now has no indication when the data is valid.
In addition, the RTL CoSimulation feature requires a block-level I/O protocol to sequence
the test bench and RTL design for CoSimulation automatically. Any attempt to use RTL
CoSimulation results in the following error message, and RTL CoSimulation halts:
Exit the Vivado HLS GUI and return to the command prompt.
This time, the code does not have a function return, but instead passes the output of the
function through the pointer argument *in_out1. This also provides the opportunity to
explore the interface options for bidirectional (input and output) ports.
The types of I/O protocol that you can add to C function arguments by interface synthesis
depend on the argument type. These options are fully described in the Vivado Design Suite
User Guide: High-Level Synthesis (UG902) [Ref 2].
The pointer argument in this example is both an input and output to the function. In the RTL
design, this argument is implemented as separate input and output ports.
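A function with a bidirectional pointer argument could look like the following sketch. The function name adders_io and the input names in1 and in2 are assumptions for illustration; only the pointer argument in_out1 is named in the text above.

```cpp
#include <cassert>

// Sketch only: in_out1 is read (input) and then written (output),
// so the RTL needs both an input port and an output port for it.
void adders_io(int in1, int in2, int *in_out1) {
    assert(in_out1 != 0);
    *in_out1 = in1 + in2 + *in_out1;
}
```

The single C argument carries a value into the function and a result out of it, which is why it maps to separate input and output ports in hardware.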
For the code shown in Figure 4-11, the possible options for each function argument are
described in Table 4-2.
Note: The port directives applied in Lab 1 were not actually necessary because ap_none is the
default I/O protocol for these C arguments. The directives were provided to avoid addressing any I/O
port protocol behavior in that exercise, default behavior or not.
This design has an input array and an output array. The comments in the C source explain
how the data in the input array is ordered as channels and how the channels are
accumulated. To understand the design, you can also review the test bench and the input
and output data in file result.golden.dat.
1. Click the Run C Synthesis button in the toolbar and review the Interface summary when the
report opens (Figure 4-16).
The interface summary shows how array arguments in the C source are by default
synthesized into RTL RAM ports.
• The design has a clock, reset, and the default block-level I/O protocol ap_ctrl_hs
(noted on the clock in the report).
• The d_o argument has been synthesized to a RAM port (I/O protocol ap_memory).
In both cases, the data port is the width of the data values in the C source (16-bit integers
in this case) and the width of the address port has been automatically sized to match the
number of addresses that must be accessed (5-bit for 32 addresses).
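The relationship between the number of addresses and the address port width can be checked with a short helper. This is an illustrative sketch, and the function name address_bits is hypothetical.

```cpp
#include <cassert>

// Number of address bits needed to index `depth` memory locations.
unsigned address_bits(unsigned depth) {
    assert(depth >= 1 && depth <= (1u << 31));
    unsigned bits = 0;
    while ((1u << bits) < depth) {
        ++bits;
    }
    return bits;
}
```

For the 32-element arrays in this design, address_bits(32) gives 5, matching the 5-bit address ports in the report.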
Step 2 used a single-port RAM interface because the for-loop in the source code is by
default left rolled: each iteration of the loop is executed in turn:
This ensures only a single input read and output write is ever required. Even if multiple input
and outputs are made available, the internal logic cannot take advantage of any additional
ports.
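A rolled loop of this kind might be sketched as follows. The type names, the sizes (32 samples in 8 interleaved channels), and the use of a local accumulator array are assumptions for illustration and may differ from the tutorial source file.

```cpp
#include <cassert>  // only needed for checks in a test bench

const int N = 32;        // assumed number of samples
const int CHANNELS = 8;  // assumed number of interleaved channels

typedef short din_t;   // 16-bit input samples
typedef short dout_t;  // 16-bit output samples
typedef int dacc_t;    // accumulator type (assumed)

void array_io(dout_t d_o[N], const din_t d_i[N]) {
    dacc_t acc[CHANNELS] = {0};
    // Rolled loop: each iteration performs one read of d_i and one write of d_o.
    For_Loop:
    for (int i = 0; i < N; i++) {
        int rem = i % CHANNELS;  // channel that sample i belongs to
        acc[rem] += d_i[i];
        d_o[i] = (dout_t)acc[rem];
    }
}
```

Because iteration i touches only d_i[i] and d_o[i], one read port and one write port are sufficient while the loop stays rolled.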
Note: If you specify a dual-port RAM and Vivado HLS can determine only a single port is required,
it uses a single-port RAM and overrides the dual-port specification.
In this design, if you want to implement an array argument using multiple RTL ports, you
must first unroll the for-loop to allow all internal operations to happen in parallel.
Otherwise, there is no benefit in multiple ports: the rolled for-loop ensures only one
data sample can be read (or written) at a time.
1. Select New Solution from the toolbar or Project menu to create a new solution.
2. Accept the defaults, and click Finish.
3. Ensure the C source code is visible in the Information pane.
4. In the Directive tab, right-click For_Loop and select Insert Directive to open
the Directives Editor dialog box.
a. In the Directives Editor dialog box activate the Directive drop-down menu at the top
and select UNROLL.
b. With the Directives Editor as shown in Figure 4-17, click OK.
The Directive tab shows the directives now applied to the design (Figure 4-19).
When the report opens in the Information pane, the Interface summary is as shown in
Figure 4-20.
• The design has the standard clock, reset, and block-level I/O ports.
• Array argument d_o has been implemented as a FIFO interface with a 16-bit data port
(d_o_din) and associated output write (d_o_write) and input FIFO full
(d_o_full_n) ports.
• Argument d_i has been implemented as a dual-port RAM interface.
1. Select New Solution from the toolbar or the Project menu and create a new solution.
2. Accept the defaults, and click Finish. This includes copying existing directives from
solution2.
3. Ensure the C source code is visible in the Information pane.
4. In the Directive tab, select d_o and right-click and select Insert Directive to open the
Directive Editor dialog box.
a. In the Directives Editor dialog box activate the Directive drop-down menu at the top
and select ARRAY_PARTITION.
b. Activate the type drop down to partition the array into blocks. Set type to block.
c. In the Vivado HLS Directive Editor dialog box, set the factor (optional) to 4.
d. With the Vivado HLS Directive Editor as shown in Figure 4-21, click OK.
5. In the Directive tab, select d_i and repeat the previous step, but this time partition the
port with a factor of 2.
The directives tab shows the directives now applied to the design (Figure 4-22).
When the report opens in the Information pane, the Interface summary is as shown in
Figure 4-23.
• The design has the standard clock, reset, and block-level I/O ports.
• Array argument d_o has been implemented as four separate FIFO interfaces.
• Argument d_i has been implemented as two separate RAM interfaces, each of which
uses a dual-port interface. (If you see four separate RAM interfaces, confirm that the
partition factor for d_i is two and not four.)
1. Select New Solution from the toolbar and create a new solution.
2. Accept the defaults and click Finish. This includes copying existing directives from
solution3.
3. Ensure the C source code is visible in the Information pane.
4. In the Directive tab, select the existing partition directive for d_o as shown in
Figure 4-24.
5. Right-click and select Modify Directive.
The Directives tab shows the directives now applied to the design (Figure 4-26).
Although this tutorial has focused exclusively on the I/O interfaces, at this point it is worth
examining the differences in performance across all four solutions.
11. Select Compare Reports from the toolbar or the Project menu to open a comparison of
the solutions.
12. In the Solution Selection dialog box, add each of the four solutions to the Selected
Solutions pane (Figure 4-27).
13. Click OK.
14. Exit the Vivado HLS GUI and return to the command prompt.
The key to understanding how best to perform this optimization is to recognize that the
channels in the input and output arrays lend themselves to cyclic partitioning. Cyclic
partitioning is fully explained in the Vivado Design Suite User Guide: High-Level Synthesis
(UG902) [Ref 2], but basically means each array element is, in turn, sorted into a different
partition.
Finally, if the I/O ports are configured to supply and consume individual streams of channel
data, partial unrolling of the for-loop can ensure dedicated hardware processes each
channel.
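The difference between cyclic and block partitioning can be expressed as an index mapping. The sketch below is illustrative only, and the names Slot, cyclic_slot, and block_slot are hypothetical.

```cpp
#include <cassert>

// Where an array element lands after partitioning.
struct Slot {
    int partition;  // which partition (and therefore which port)
    int offset;     // position inside that partition
};

// Cyclic: consecutive elements are dealt out to the partitions in turn.
Slot cyclic_slot(int index, int factor) {
    assert(index >= 0 && factor > 0);
    Slot s;
    s.partition = index % factor;
    s.offset = index / factor;
    return s;
}

// Block: the array is cut into `factor` contiguous pieces of equal size.
Slot block_slot(int index, int size, int factor) {
    assert(factor > 0 && size % factor == 0 && index >= 0 && index < size);
    int per_partition = size / factor;
    Slot s;
    s.partition = index / per_partition;
    s.offset = index % per_partition;
    return s;
}
```

With interleaved channel data and a cyclic factor equal to the number of channels, every sample of a given channel lands in the same partition, which is what allows one port per channel.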
2. In the Directive tab, right-click d_o and select Insert Directive to open the
Directives Editor dialog box.
a. Select the Directive drop-down menu at the top and select ARRAY_PARTITION.
b. Click the type (optional) drop-down menu to specify cyclic partitioning.
c. In the factor (optional) box, enter the value 8, to create eight separate partitions.
(This results in eight ports.)
d. With the Directives Editor dialog box filled in as shown in Figure 4-31, click OK.
Select a factor of 8 to partially unroll the for-loop. This is equivalent to re-writing the
C code to execute eight copies of the loop-body in each iteration of the loop (where
the new loop only executes for four iterations in total, not 32).
Click OK.
c. In the Directive tab, right-click For_Loop again and select Insert
Directive to open the Vivado HLS Directive Editor dialog box.
Activate the Directive drop-down menu at the top and select PIPELINE. Leave the
interval (II) blank and let it default to 1.
When the top-level of the design is a loop, you can use the pipeline rewind option. This
informs Vivado HLS that when implemented in RTL, this loop runs continuously (with no
end of function and function re-start cycles).
After performing the above steps, the Directives tab should be as shown in Figure 4-32. Be
sure to check all options are correctly applied. If not, double-click the directive to re-open
the Directives Editor.
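The effect of partially unrolling the loop by a factor of 8 can be sketched by comparing a version that relies on directives with a hand-written equivalent. The function names, the constants kSamples and kChannels, and the loop body are assumptions for illustration. A standard C++ compiler ignores the HLS pragmas, so both functions compute the same result in C simulation.

```cpp
#include <cassert>  // only needed for checks in a test bench

const int kSamples = 32;  // assumed number of samples
const int kChannels = 8;  // assumed number of interleaved channels

// Version that relies on directives: 32 iterations in the C source.
void channels_with_directive(short d_o[kSamples], const short d_i[kSamples]) {
    int acc[kChannels] = {0};
    For_Loop:
    for (int i = 0; i < kSamples; i++) {
#pragma HLS UNROLL factor=8
#pragma HLS PIPELINE rewind
        int rem = i % kChannels;
        acc[rem] += d_i[i];
        d_o[i] = (short)acc[rem];
    }
}

// Hand-written equivalent: 4 iterations, each holding 8 copies of the body.
void channels_unrolled_by_hand(short d_o[kSamples], const short d_i[kSamples]) {
    int acc[kChannels] = {0};
    for (int i = 0; i < kSamples; i += kChannels) {
        // The inner loop stands for eight copies of the loop body.
        // Because i is a multiple of 8, copy ch always serves channel ch.
        for (int ch = 0; ch < kChannels; ch++) {
            acc[ch] += d_i[i + ch];
            d_o[i + ch] = (short)acc[ch];
        }
    }
}
```

In the hand-written form, each copy of the body always handles the same channel, which is why dedicated hardware per channel becomes possible.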
When the report opens in the Information pane, confirm both d_i and d_o are implemented
as eight separate AXI4-Stream ports.
7. In the performance section of the report, confirm that the for-loop processes one
sample every clock cycle (Initiation Interval 1) with a latency of 4, and that the design
has less area than solutions 2, 3, or 4 in Lab 3 (Figure 4-29).
Cyclic partitioning of the array interfaces and partial for-loop unrolling have allowed
implementation of this C code as eight separate channels in the hardware.
Pipelining the for-loop allows the logic in each channel to process 1 sample per clock.
Varying the partitioning and loop unrolling allows you to create a design which is the
optimal balance of area and performance to satisfy your particular requirements.
1. Select New Solution from the toolbar or the Project menu to create a new solution.
2. Accept the defaults and click Finish. This includes copying existing directives from
solution1.
3. Ensure the C source code is visible in the Information pane.
4. In the Directive tab, right-click the top-level function axi_interfaces and
select Insert Directive to open the Directives Editor dialog box.
a. Select the Directive drop-down menu at the top and select INTERFACE.
b. Select the mode drop-down menu and select s_axilite. This specifies that the ports
associated with the function return (the block-level I/O ports) are implemented as an
AXI4-Lite interface. Since the default mode for the function return is ap_ctrl_hs, there
is no requirement to specify this I/O protocol.
c. Click OK.
When the report opens, review the interface summary to confirm the block-level I/O
protocol ports (ap_start, ap_done, etc.) have been replaced with an AXI4-Lite interface
and that the output interrupt signal has been added to the design. The source of the
interrupt can be selected through the AXI4-Lite interface.
6. Select Export RTL from the toolbar or the Solution menu to create an IP package.
7. Leave the Format Selection as IP Catalog and click OK.
You can see the IP package in the solution2/impl folder. Because you used the Vivado
IP Catalog format, the package is in the ip folder.
When you add an AXI4-Lite interface to the design, the IP packaging process also creates
software driver files to enable an external block, typically a CPU, to control this block (start
it, stop it, set port values, review the interrupt status).
This shows the addresses to access and control the block-level interface signals. For
example, setting control register 0x0 bit 0 to the value 1 will enable the ap_start port, or
alternatively, setting bit 7 will enable the auto-restart and the design will re-start
automatically at the end of each transaction.
The remaining C driver files are used to integrate control of the AXI4 Slave Lite interface
into the code running on a CPU or microcontroller and are included in the packaged IP.
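The bit positions mentioned above can be illustrated with a few helper functions. This is a sketch under stated assumptions and not the generated driver API: the function and constant names are hypothetical, and the pointer stands for wherever the control register at offset 0x0 is mapped on the target system.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t u32;

const u32 kApStartBit = 1u << 0;      // control register bit 0: ap_start
const u32 kAutoRestartBit = 1u << 7;  // control register bit 7: auto-restart

// Set bit 0 to start one transaction (other bits are left unchanged).
void set_ap_start(volatile u32 *ctrl) {
    assert(ctrl != 0);
    *ctrl = *ctrl | kApStartBit;
}

// Set bit 7 so the design restarts at the end of each transaction.
void enable_auto_restart(volatile u32 *ctrl) {
    assert(ctrl != 0);
    *ctrl = *ctrl | kAutoRestartBit;
}

bool auto_restart_enabled(const volatile u32 *ctrl) {
    assert(ctrl != 0);
    return (*ctrl & kAutoRestartBit) != 0;
}
```

On real hardware the register would be accessed through the generated driver files. A plain variable can stand in for the register when experimenting on a host machine.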
Conclusion
In this tutorial, you learned:
Overview
The data types provided by C/C++ are fixed to 8-bit boundaries:
• char (8-bit)
• short (16-bit)
• int (32-bit)
• long long (64-bit)
• float (32-bit)
• double (64-bit)
• Exact width integer types such as int16_t (16-bit) and int32_t (32-bit)
When creating hardware, it is often the case that more accurate bit-widths are required.
Consider, for example, a case in which the input to a filter is 12-bit and the accumulation of
the results only requires a maximum range of 27 bits. Using standard C data types for
hardware design results in unnecessary hardware costs. Operations can use more LUTs and
registers than needed for the required accuracy, and delays might even exceed the clock
cycle, requiring more cycles to compute the result.
Vivado High-Level Synthesis (HLS) provides a number of bit accurate or arbitrary precision
data-types, allowing you to model variables using any (arbitrary) width.
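The cost of rounding widths up to standard C types can be illustrated with plain C++. The helpers below are hypothetical and do not use the Vivado HLS arbitrary precision headers. They only check value ranges and count the bits that a standard container type would leave unused.

```cpp
#include <cassert>
#include <cstdint>

// True if v fits in a signed two's-complement field of `bits` bits.
bool fits_signed(std::int64_t v, int bits) {
    assert(bits >= 1 && bits <= 63);
    std::int64_t lo = -(std::int64_t(1) << (bits - 1));
    std::int64_t hi = (std::int64_t(1) << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

// Bits left unused when `needed_bits` are stored in a wider container.
int unused_bits(int container_bits, int needed_bits) {
    assert(container_bits >= needed_bits && needed_bits >= 1);
    return container_bits - needed_bits;
}
```

For the filter example above, a 12-bit input held in a 16-bit short leaves 4 bits unused, and a 27-bit accumulator held in a 32-bit int leaves 5 bits unused. Those extra bits still consume logic in every operator they pass through.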
Lab 1 Description
Synthesize a design using floating-point types and review the results. The design uses
standard C++ floating-point types.
Lab 2 Description
Synthesize the same function used in Lab 1 using arbitrary precision fixed-point types,
highlighting the benefits in accuracy and results. This exercise shows how this same design
can be converted to the Vivado HLS ap_fixed types, retaining the required accuracy but
creating a more optimal hardware implementation.
In this lab, you synthesize a design using standard C types. You use this design as a
reference for the design using arbitrary precision types, which is the basis for Lab 2.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or on Linux systems, adjust the few pathnames referenced
to the location you have chosen to place the Vivado_HLS_Tutorial directory.
vivado_hls -f run_hls.tcl
When using math functions from math.h or cmath, see the Vivado Design Suite User
Guide: High-Level Synthesis (UG902) [Ref 2] for details on which math functions are
supported for synthesis.
4. Click the Run C Simulation toolbar button to open the C Simulation dialog box.
5. Accept the default setting (no options selected) and click OK.
The Console pane shows that the design simulates with the expected results.
When synthesis completes, the synthesis report opens automatically. Figure 5-5 shows the
synthesis report.
2. Scroll down the report and expand the Instances in the Details section of the Utilization
Estimates (Figure 5-6).
More details on using the Analysis perspective are available in Chapter 6, Design
Analysis tutorial. For the purposes of understanding this design, two of the operations in
the first state are one-cycle read-from-memory operations, and the operation in the final
state is a write-to-memory operation.
Introduction
This lab exercise uses the same design as Lab 1, however, the data types are now arbitrary
precision types. You first review the design and then examine the synthesis results.
4. Open the Source folder in the Explorer pane and double-click window_fn_top.cpp to
open the code as shown in Figure 5-9.
When you revise C code to use arbitrary precision types instead of standard C types, one of
the most common changes you must make is to reduce the size of the data types. In this
case, you change the design to use 8-bit, 24-bit, and 18-bit words instead of 32-bit float
types. This results in smaller operators, reduced area, and fewer clock cycles to complete.
Similar optimizations help when you change more common C types such as int, short, and
char. For example, changing a data type that only needs to be 18-bit from int (32-bit)
ensures that only a single DSP48 is required to perform any multiplications.
In both cases, you must confirm that the design still performs the correct operation and
that it does so with the required accuracy. The benefit of the arbitrary precision types
provided with Vivado High-Level Synthesis is that you can simulate the updated C code to
confirm its function and accuracy.
7. Open the Test Bench folder in the Explorer pane and double-click window_fn_test.cpp
to open the code.
8. Scroll down to see the view shown in Figure 5-11.
This allows the updated design to be validated quickly and efficiently in C, with fast compile
and run times.
9. Click the Run C Simulation toolbar button to open the C Simulation dialog box.
10. Accept the default setting (no options selected) and click OK.
The Console pane shows the results of the C simulation. With the updated data types, the
results are no longer identical to the expected results. However, they are within tolerance.
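The idea of results that differ from the floating-point reference but stay within tolerance can be sketched in plain C++. The helpers below are hypothetical and only emulate quantization to a fixed number of fractional bits by truncation. They do not use the ap_fixed types and do not model overflow.

```cpp
#include <cassert>
#include <cmath>

// Truncate x onto a grid with `frac_bits` fractional bits
// (rounds toward minus infinity).
double quantize(double x, int frac_bits) {
    assert(frac_bits >= 0 && frac_bits <= 52);
    double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    return std::floor(x * scale) / scale;
}

// Test-bench style comparison against a reference value.
bool within_tolerance(double got, double expected, double tol) {
    return std::fabs(got - expected) <= tol;
}
```

For example, 0.3 quantized to 4 fractional bits becomes 0.25. The result is no longer identical to the reference, but it is within a tolerance of one least significant bit (1/16).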
When synthesis completes, the synthesis report opens automatically. Figure 5-13 shows the
synthesis report.
Figure 5-14 shows the data ports are now 8-bit and 24-bit.
Conclusion
In this tutorial, you learned:
• How to update the existing standard C types to Vivado High-Level Synthesis arbitrary
precision types.
• The advantages in terms of hardware performance and area of using bit accurate
data-types.
Design Analysis
Overview
The general design methodology for creating an RTL implementation from C, C++, or
SystemC includes the following tasks:
You can repeat the steps above until the required performance is achieved. Subsequently,
you can revisit the design to improve area.
A key part of this process is the analysis of the results. This tutorial explains how to use the
reports and the GUI Analysis perspective to analyze the design and determine which
optimizations to apply.
• Takes you through one design from the initial implementation through six steps and
multiple optimizations to produce the final optimized design.
As demonstrated throughout the tutorial, performing these steps in a single project gives
you the ability to compare the different solutions.
Lab 1 Description
Synthesize and analyze a DCT design. Use the insights from the design analysis to apply
optimizations and judge the effectiveness of the optimization.
The sample design used in the lab exercise is a 2-D DCT function. To highlight the design
analysis feature, your goal is to have this design operate with an interval of 125 or less. The
design should be able to process a new set of input data at least every 125 clock cycles.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the
tutorial data directory is unzipped to a different location, or if it is on a Linux system, adjust the few
pathnames referenced to the location at which you placed the Vivado_HLS_Tutorial directory.
° On Windows click Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
Step 2: Review the Source Code and Create the Initial Design
1. Double-click the file dct.cpp in the Source folder to open the source code for review.
This example uses a DCT function. Figure 6-3 shows an overview of this code.
° Top-level function dct has three sub-functions: read_data, dct_2d and write_data.
° The read_data function executes, and the data is processed through loop
RD_Loop_Row, which has a sub-loop RD_Loop_Col.
• Click the Run C Synthesis toolbar button to synthesize the design to RTL.
° This block also has no pipelining and accounts for most of the clock cycles.
• Notice that the functions read_data and write_data are not noted here as
instances of the top level.
° Figure 6-5 shows that, during synthesis, these blocks were automatically inlined
(the hierarchy was removed).
° Sub-loop RD_Loop_Col has a latency of 2 cycles for each iteration of the loop
(iteration latency) and a tripcount of 8: 2 x 8 = 16 clock cycles total latency for the
loop.
° From RD_Loop_Row, it takes 1 clock to enter loop RD_Loop_Col and 1 clock cycle to
return to RD_Loop_Row. The iteration latency for RD_Loop_Row is therefore (1 + 16
+1) 18 clock cycles.
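The latency arithmetic above can be reproduced with two small helpers. The function names are hypothetical, and the outer tripcount of 8 used in the note below is an assumption based on the 8 x 8 block size, not a value stated in this step.

```cpp
#include <cassert>

// Total latency of a rolled (non-pipelined) loop.
int loop_latency(int iteration_latency, int tripcount) {
    assert(iteration_latency >= 0 && tripcount >= 0);
    return iteration_latency * tripcount;
}

// Iteration latency of an outer loop that wraps one inner loop, counting
// the cycles needed to enter and to leave the inner loop.
int outer_iteration_latency(int inner_loop_latency,
                            int enter_cycles = 1, int leave_cycles = 1) {
    return enter_cycles + inner_loop_latency + leave_cycles;
}
```

Here loop_latency(2, 8) gives 16 cycles for RD_Loop_Col, and outer_iteration_latency(16) gives 18 cycles per iteration of RD_Loop_Row. If RD_Loop_Row also has a tripcount of 8, its total latency would be loop_latency(18, 8), or 144 cycles.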
To review the details of the instantiated sub-blocks dct_2d and dct_1d, open their respective
reports from the syn/report folder under solution1 in the Explorer pane.
You can also use the design analysis perspective to review these details in a more interactive
manner.
1. Click the Analysis perspective button (Figure 6-6) to begin interactive design analysis.
The Performance view is also shown (on the right side of Figure 6-8). This view shows how
the operations in this particular block are scheduled into clock cycles.
° A loop called RD_Loop_Row. The plus symbol (+) indicates that the loop has
hierarchy and that you can expand the loop to view it.
The top row lists the control states in the design. Control states are the internal states
High-Level Synthesis uses to schedule operations into clock cycles. There is a close
correlation between the control states and the final states in the RTL Finite State Machine
(FSM), but there is no one-to-one mapping.
2. Click loop RD_Loop_Row and sub-loop RD_Loop_Col to fully expand the loop hierarchy
(Figure 6-8).
3. Select the adder in state C1, right-click and select Go to Source (Figure 6-9).
a. When the dialog box opens, press OK to select item 0.
This opens the C source code to highlight the operation in the C source that created this
adder. From the details on screen (also shown in Figure 6-9), you can determine it is indeed
the loop counter. It is the only addition on this line, and the variable is named “r”.
4. Click any of the operations in the RD_Loop_Col to see the source code highlighting
update.
This should help confirm your understanding of how the operations in the C source code
are implemented in the RTL.
Loops in the Performance view mean that the design iterates around these states multiple
times. The number of iterations is noted as the loop tripcount and shown in the
Performance Profile.
To improve performance, these loops should be pipelined. You can review the rest of the
design for other performance optimization opportunities.
You can pipeline the loops to improve the performance. The details in the Performance
Profile show that most of the latency is caused by loops Row_DCT_Loop and Col_DCT_Loop.
7. Click loops Row_DCT_Loop and Col_DCT_Loop of the dct_2d block in the performance
viewer to fully expand them, as shown in Figure 6-11.
Expanding these loops in Performance view shows both loops call function dct_dct_1d2.
Unless this function itself is pipelined, there is no benefit in pipelining the loop. The Module
Hierarchy shows the interval for dct_1d2 is 145 clock cycles, which means it can only accept
a new input every 145 clock cycles.
8. In the Module Hierarchy, click function dct_1d2 to navigate into the view for this
function.
9. Expand the loops in the Performance Profile and Performance view to see the view
shown in Figure 6-11.
• You can pipeline the function and then pipeline the loop that calls it. (Because the
function is pipelined, the loop can take advantage of using a pipelined part.)
• You can pipeline the loops within this function and simply make this function execute
faster.
Pipelining the function unrolls all the loops within it, and thus greatly increases the area. If
the objective is to get the highest possible performance with no regard for area, this may be
the best optimization to perform.
You can find more details on pipelining loops and functions in Chapter 7, Design
Optimization tutorial. For this case, the approach is to optimize the loops and keep the area
at a minimum.
10. Click the Synthesis perspective button to return to the main synthesis view.
When pipelining nested loops, it is generally best to pipeline the inner-most loop.
High-Level Synthesis can typically flatten the loop nest automatically (allowing the outer
loop to simply feed the inner loop). For more information on why it is better to perform
certain loop optimizations rather than others, see the Chapter 7, Design Optimization
tutorial.
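A perfect loop nest with the pipeline directive on the inner-most loop might look like the following sketch. The function name read_block, the loop labels, and the 8 x 8 size are assumptions for illustration. A standard C++ compiler ignores the pragma.

```cpp
#include <cassert>  // only needed for checks in a test bench

const int DIM = 8;  // assumed block dimension

// All work happens inside the inner loop, so nothing sits between the
// two loop headers and the nest can be flattened into a single loop.
void read_block(const short in[DIM * DIM], short buf[DIM][DIM]) {
    Row_Loop:
    for (int r = 0; r < DIM; r++) {
        Col_Loop:
        for (int c = 0; c < DIM; c++) {
#pragma HLS PIPELINE II=1
            buf[r][c] = in[r * DIM + c];
        }
    }
}
```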
1. Select the New Solution toolbar button or use the menu Project > New Solution to
create a new solution.
2. Accept the defaults and click Finish.
3. Ensure that you can see the C source code (dct.cpp) in the Information pane.
4. In the Directive tab, add a pipeline directive to loop DCT_Inner_Loop in function dct_1d.
a. Right-click DCT_Inner_Loop in the Directive pane and select Insert Directive.
b. In the Directives Editor dialog box activate the Directive drop-down menu at the top
and select PIPELINE.
c. Click OK to select the default maximum pipeline rate (II=1).
5. Repeat step 4 for the following loops:
The Directive pane shows the following (highlighted) optimization directives applied.
Figure 6-14 shows the results of comparing solution1 and solution2. Pipelining the loops
has improved the latency of the design with an almost 50% reduction in solution2.
When the Analysis perspective opens, you can see that the majority of the latency is still
due to block dct_2d. Before proceeding to analyze further, you can review how the loops at
this level have been optimized.
The Performance Profile (Figure 6-15) shows that the latency of both loops has been
reduced from 144 clock cycles in solution1 to only 64 clock cycles.
Vivado HLS also made this possible by automatically performing loop flattening (there is no
longer any loop hierarchy). You can see this by reviewing the Console pane, or log file, for
solution2. Figure 6-16 shows the loops that have been automatically optimized.
In the Performance Profile you can see that the latency of all the loops has been
substantially reduced (Row_DCT_Loop and Col_DCT_Loop have been approximately halved
from the earlier report in Figure 6-10). However, the majority of the latency is still due to
these two loops, each of which calls the dct_1d block.
10. In the Module Hierarchy, click function dct_1d2 to navigate into the view for this
function.
The Performance Profile (Figure 6-17) shows the loop latencies have been reduced, but
there is still a loop hierarchy here. (There is still loop DCT_Outer_Loop, shown in
Figure 6-17, so no loop flattening occurred).
11. In the Performance view, click loops DCT_Outer_Loop and DCT_Inner_Loop to view
the loop hierarchy (Figure 6-18).
12. Select the write operation in state C3.
13. Right-click and select Go to Source.
Figure 6-18 shows that this loop was not flattened because additional operations outside of
DCT_Inner_Loop, at the level of DCT_Outer_Loop, prevented loop flattening. One of the
operations that prevented loop flattening is highlighted in Figure 6-18, below.
You should pipeline the outer loop instead. This causes the inner loop to be completely
unrolled. The area increases as a result, but you are still far from the throughput goal of
one new set of samples every 125 clock cycles, and the design is not yet ready for pipelining
the entire function (which would increase the area even further, because the outer loop
would then also be completely unrolled).
14. Click the Synthesis perspective button to return to the main synthesis view.
The Directive pane should show the following (highlighted) optimization directives applied.
5. Click the Run C Synthesis toolbar button to synthesize the design to RTL.
6. When synthesis completes, click the Compare Reports toolbar button to compare
solutions 2 and 3.
Figure 6-20 shows the results of comparing solution2 and solution3. Pipelining the
outer loop has increased both the performance and the area.
The significant latency benefit is achieved because multiple loops in the design call the
dct_1d function multiple times. Saving latency in this block is multiplied because this
function is used inside many loops.
Now that all the loops are pipelined, it is worthwhile to review the design to see if there are
performance-limiting “bottlenecks.” Bottlenecks are limitations in the flow of data that can
prevent the logic blocks from working at their maximum data rate.
Such limitations in the data flow can come from a number of sources, for example, I/O ports
and arrays implemented as block RAM. In both cases, the finite number of ports (on the I/O
or block RAM) limits the rate at which data can be read or written.
Another source of bottlenecks is data dependencies in the original source code. In some
cases, these data dependencies are inherent in how the algorithm operates, as when a
calculation cannot be performed until an earlier calculation has completed. Sometimes,
however, the use of an optimization directive or a minor change to the C code can remove
them.
The first task is to identify such issues in the RTL design. There are a number of approaches
you can take:
• Start with the largest latency or interval in the Module Hierarchy report and navigate
down the hierarchy to find the source of any large latency or interval.
• Click the Resource Profile to examine I/O and memory usage.
• Use the power of the graphical viewer and look for patterns in the Performance view
which indicate a limitation in data flow.
In this case, you will use the last approach. You can use the Analysis perspective to
identify such places in the design quickly.
This loop is implemented in two states. The red arrow in Figure 6-21 shows the path from
the start of the loop to the end of the loop: the arrow is almost vertical (everything happens
in two clock cycles) and this loop is well implemented in terms of latency.
You can use the same analysis process down through the hierarchy. If you perform this analysis
you will discover that all the function blocks and loops have a similar optimal (few cycles)
implementation, until the dct_1d block is examined.
12. In the Performance view, double-click function dct_1d2 and navigate into the
dct_1d2 function.
13. Expand the DCT_Outer_Loop to see the view shown in Figure 6-22.
Figure 6-22 shows a very different view from the earlier loop schedules (which had only a
few cycles of latency). The schedule shows a long drift from input to output (as shown by
the red arrow).
14. In the Performance view, click the Resource Viewer tab at the bottom of the window.
The Resource view shows how the resources in the design are used in different control
states.
The rows list the resources in the design. In Figure 6-23, the memory resources are
expanded.
The columns show the control states in which the resource is used. If a resource is active in
multiple states, the resource is being re-used in different clock cycles.
Figure 6-23 shows the memory accesses on block RAM src are being used to the maximum
in every clock cycle. (At most, a block RAM can be dual-port and both ports are being used).
This is a good indication the design may be bandwidth-limited by the memory resource. To
determine if this really is the case, you can examine further.
16. Select one of the read operations for the src block RAM.
17. Right-click and select Goto Source to see the view shown in Figure 6-24.
The eight reads are being forced to occur over multiple cycles because the array src is
implemented as a block RAM in the RTL and a block RAM can only allow two reads
(maximum) in any one clock cycle. In Figure 6-24, the read operations take 2 clock cycles:
a cycle to generate the address for the block RAM and a cycle to read the data. Only the
launch (address generation cycle) is shown because it overlaps with the operation in the
next clock cycle.
You can optimize the block RAM accesses using optimization directives to partition the
block RAM. The array that function dct_1d accesses is defined as an input argument to the
function and therefore resides outside this block.
• The input array to the first instance of dct_1d is buf_2d_in in function dct.
• The input array to the second instance of dct_1d is col_inbuf in function dct_2d.
In both cases, the arrays are 2-dimensional of size DCT_SIZE by DCT_SIZE (8x8). By default,
this results in a single block RAM with 64 elements. Because the arrays are configured in the
code in the form of Row by Column, you can partition the second dimension and create eight
separate block RAMs, one for each column index, so that the eight elements of a row can be
accessed in parallel.
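As a source-level illustration of this partitioning, the sketch below applies the ARRAY_PARTITION pragma to the second dimension of a local 8x8 buffer. The function row_sum_sketch and the buffer buf are hypothetical and do not appear in the tutorial design. Partitioning with dim=2 splits the array along its column index, so the eight elements of any one row sit in eight different memories.

```cpp
#include <cstdint>

const int DCT_SIZE = 8;

// Hypothetical helper: copies an 8x8 block into a local buffer, then sums
// the eight elements of one row.
int32_t row_sum_sketch(int16_t in[DCT_SIZE][DCT_SIZE], int row) {
    int16_t buf[DCT_SIZE][DCT_SIZE];
// Split the second (column) dimension into eight separate memories so that
// buf[row][0] ... buf[row][7] can all be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2

Copy_Row:
    for (int r = 0; r < DCT_SIZE; r++) {
Copy_Col:
        for (int c = 0; c < DCT_SIZE; c++) {
            buf[r][c] = in[r][c];
        }
    }

    int32_t sum = 0;
Sum_Loop:
    for (int c = 0; c < DCT_SIZE; c++) {
// Fully unroll so that all eight reads are requested together.
#pragma HLS UNROLL
        sum += buf[row][c];
    }
    return sum;
}
```

Without the partition directive, the eight reads in Sum_Loop would compete for the two ports of a single block RAM.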
18. Click the Synthesis perspective button to return to the main synthesis view.
The Directive pane displays optimization directives, as shown in Figure 6-25 (the two new
directives are highlighted).
Figure 6-26 shows the results of comparing solution3 and solution4. Improving access to
the data in the src block RAM in the dct_1d block has improved the overall performance
because the dct_1d block executes frequently.
The important point from the previous optimization is that you can see there are now
additional memories due to the array partitioning optimization.
You still have a goal to ensure that the design can accept a new set of samples every 125
clock cycles. The synthesis report, however, shows that you can only accept new data every
477 clocks. This is much better than the original, pre-optimized design (approx. 2600 clock
cycles), but further optimization is required.
Up to this point, you have focused on improving the latency and interval of each of the
individual loops and functions in the design. You must now apply the dataflow
optimization, which enables the individual loops and functions to execute in parallel, thus
improving the overall design interval.
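The structure that the dataflow optimization targets can be sketched as follows. This is an illustration under assumptions: the three stage functions are hypothetical stand-ins for read_data, dct_2d and write_data, and the processing they perform is invented. With the DATAFLOW pragma, the tool attempts to let each stage start as soon as data is available, instead of waiting for the previous stage to finish completely.

```cpp
#include <cstdint>

const int N = 64;  // hypothetical block size (8x8 samples)

// Hypothetical stages standing in for read_data, dct_2d and write_data.
static void read_stage(const int16_t in[N], int16_t buf[N]) {
    for (int i = 0; i < N; i++) buf[i] = in[i];
}

static void compute_stage(const int16_t buf_in[N], int16_t buf_out[N]) {
    for (int i = 0; i < N; i++) buf_out[i] = buf_in[i] * 2;
}

static void write_stage(const int16_t buf[N], int16_t out[N]) {
    for (int i = 0; i < N; i++) out[i] = buf[i];
}

void top_sketch(const int16_t in[N], int16_t out[N]) {
// Ask the tool to overlap the execution of the three stages below.
#pragma HLS DATAFLOW
    int16_t buf_in[N];   // channel between read and compute
    int16_t buf_out[N];  // channel between compute and write
    read_stage(in, buf_in);
    compute_stage(buf_in, buf_out);
    write_stage(buf_out, out);
}
```

Each intermediate array has exactly one producer and one consumer, which is the pattern this optimization is designed for.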
12. Click the Synthesis perspective button to return to the main synthesis view.
The Directive pane now displays the following optimization directives (the new directive is
highlighted).
Figure 6-29 shows the results of comparing solution4 and solution5, and you can see the
interval has improved. The design takes 476 clock cycles to produce the outputs but can
now accept new inputs every 343 clocks.
• The interval of the dct block is less than the sum of the individual latencies (for
read_data, dct_2d and write_data). This means the blocks are operating in
parallel.
• The interval of dct is nearly the same as the interval for sub-block dct_2d. The
dct_2d block is therefore the limiting factor.
Because the dct_2d block is selected in the Module Hierarchy, the Performance Profile
shows the details for this block. Figure 6-31 shows the interval is the same as the latency, so
none of these blocks operate in parallel.
One way to have the blocks in dct_2d operate in parallel would be to pipeline the entire
function. This, however, would unroll all the loops, which can sometimes lead to a large area
increase. An alternative is to use dataflow optimization on function dct_2d.
Another alternative is to use a less obvious technique: raise these loops up to the top-level
of hierarchy, where they will be included in the dataflow optimization already applied to the
top-level. This can be achieved by using an optimization directive to remove the dct_2d
hierarchy: inline the dct_2d function.
Before performing this optimization, review the area increase caused by using dataflow
optimization.
c. In the Directives Editor dialog box activate the Directive drop-down menu at the
top and select INLINE.
d. Click OK.
The Directive pane now shows the following optimization directives (the new directive is
highlighted).
Figure 6-33 shows the results of comparing solution5 and solution6. You can see the
interval has improved substantially.
Conclusion
In this tutorial, you learned:
Design Optimization
Overview
A crucial part of creating high quality RTL designs using High-Level Synthesis is having the
ability to apply optimizations to the C code. High-Level Synthesis always tries to minimize
the latency of loops and functions. To achieve this, within the loops and functions, it tries to
execute as many operations as possible in parallel. At the level of functions, High-Level
Synthesis always tries to execute functions in parallel.
• Execute multiple tasks in parallel, for example, multiple executions of the same function
or multiple iterations of the same loop. This is pipelining.
• Restructure the physical implementation of arrays (block RAMs), functions, loops and
ports to improve the availability of data and help data flow through the design faster.
• Provide information on data dependencies, or lack of them, allowing more
optimizations to be performed.
The final optimization technique is to modify the C source code to remove unintended
dependencies in the code that may limit the performance of the hardware.
This tutorial consists of two lab exercises. You may perform the analysis in these lab
exercises using the Analysis perspective. A prerequisite for this tutorial is completion of the
Chapter 6, Design Analysis tutorial.
Lab 1 Description
Contrast the uses of loop and function pipelining to create a design that can process one
sample per clock. This lab includes examples that give you the opportunity to analyze the
two most common causes for designs failing to meet performance requirements: loop
dependencies and data flow limitations or bottlenecks.
Lab 2 Description
This lab shows how modifications to the code from Lab 1 can help overcome some
performance limitations inherent, but unintended, in the code.
For this tutorial, you use the design files in the tutorial directory
Vivado_HLS_Tutorial\Design_Optimization.
The sample design you use in the lab exercise is a matrix multiplier function. The design
goal is to process a new sample every clock period and implement the interfaces as
streaming data interfaces.
The analysis includes a comparison of a methodology that optimizes at the loop level with
one that optimizes at the function level.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or on Linux systems, adjust the few pathnames referenced,
to the location you have chosen to place the Vivado_HLS_Tutorial directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
Scroll down the file to see that the source code has two input arrays, a and b, and output
array res. Hold the mouse over the macros (as shown in Figure 7-3) to see that each is
three-by-three for a total of nine elements.
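For orientation, a matrix multiplier of the kind described here can be sketched as shown below. This is a reconstruction under assumptions, not a copy of matrixmul.cpp: the element types and the exact macro definitions are guesses, while the array names, the 3x3 dimensions, and the loop labels Row, Col and Product follow the text of this tutorial.

```cpp
#include <cstdint>

// Assumed dimensions and types; the text only states that each matrix is 3x3.
const int MAT_A_ROWS = 3, MAT_A_COLS = 3;
const int MAT_B_ROWS = 3, MAT_B_COLS = 3;
typedef int8_t  mat_a_t;
typedef int8_t  mat_b_t;
typedef int16_t result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS]) {
Row:
    for (int i = 0; i < MAT_A_ROWS; i++) {
Col:
        for (int j = 0; j < MAT_B_COLS; j++) {
            res[i][j] = 0;  // clear the output element before accumulating
Product:
            for (int k = 0; k < MAT_B_ROWS; k++) {
                // Inner product of row i of a and column j of b.
                res[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
```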
When synthesis completes, the synthesis report opens (Figure 7-4), and the Performance
Estimates appear:
• The interval is 80 clock cycles. Because there are nine elements in each input array, the
design takes approximately nine cycles per input read.
• The interval is one cycle longer than the latency, so there is no parallelism in the
hardware at this point.
• The latency/interval is due to nested loops.
° The top-level loop has a latency of 26 clock cycles per iteration, for a total of 78
clock cycles for all iterations of the loop.
When pipelining loops, the initiation interval of the loops is the important metric to
monitor. As seen in this exercise, even when the design reaches the stage at which the loop
can process a sample every clock cycle, the initiation interval of the function is still reported
as the time it takes for the loops contained within the function to finish processing all data
for the function.
When pipelining nested loops, you realize the greatest benefit by pipelining the inner-most
loop, which processes a sample of data. High-Level Synthesis automatically applies loop
flattening, collapsing the nested loops, removing the loop transitions (essentially creating a
single loop with more iterations but overall fewer clock cycles).
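To make the effect of loop flattening concrete, the sketch below writes the same computation twice: once as a nested loop and once as the single loop that flattening conceptually produces. The functions and the summing operation are hypothetical and serve only to show the change in loop structure; the tool performs this transformation internally and does not rewrite your source.

```cpp
const int ROWS = 3;
const int COLS = 3;

// Nested form: control leaves and re-enters the inner loop once per row.
int sum_nested(int m[ROWS][COLS]) {
    int sum = 0;
Row:
    for (int i = 0; i < ROWS; i++) {
Col:
        for (int j = 0; j < COLS; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}

// Flattened form: one loop with ROWS * COLS iterations and no inner-loop
// entry and exit transitions.
int sum_flattened(int m[ROWS][COLS]) {
    int sum = 0;
Row_Col:
    for (int n = 0; n < ROWS * COLS; n++) {
        sum += m[n / COLS][n % COLS];
    }
    return sum;
}
```

Both functions visit the elements in the same order. The flattened version has more iterations in its single loop but avoids the cycles spent moving between loop levels.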
The Directive pane should show the following optimization directives. (The new directive is
highlighted.)
During synthesis, the information reported in the Console pane shows loop flattening was
performed on loop Row and that the default initiation interval target of 1 could not be
achieved on loop Product due to a dependency.
The synthesis report (Figure 7-6) shows that although the Product loop is pipelined with an
interval of 2, the top-level loop is not pipelined.
The write operation in state C1 is due to the code that sets res to zero before the Product
loop. Because res is a top-level function argument, it is a write to a port in the RTL. This
operation must happen before the operations in loop Product are executed. Because it is
not an internal operation but has an impact on the I/O behavior, this operation cannot be
moved or optimized. This prevents the Product loop from being flattened into the Row_Col
loop.
The message SCHED-68 in the console pane (and file vivado_hls.log) tells you:
From Figure 7-8 you can see line 60 is a read from array res (due to the += operator) and a
write to array res. An array is mapped into a block RAM by default and the details in the
Performance View can show why this conflict occurred.
The Performance view shows in which states the operations are scheduled. Figure 7-8
shows that two of the operations are responsible for the II violation. These are the
operations that have a dependency between loop iterations. The Analysis view can filter
its display down to the operations causing an II violation. To use this feature, select
II Violation in the filter drop-down combo box.
The first iteration of the loop shows the states in which the operations occur: the read
in states 2 and 3, and the write in state 3. The operation in the next iteration must start
1 cycle after this, because the 2nd read cannot occur until the 1st write has finished: the
operations in each iteration of the loop are to a different address, and only 1 address can
be applied at a time.
The next step is to pipeline the loop above, the Col loop. This automatically unrolls the
Product loop and creates more operators and hence more hardware resources, but it
ensures there is no dependency between different iterations of the Product loop.
2. Because solution2 already has a directive added, use the drop-down menu to select
solution1 as the source for existing directives and constraints (solution1 has none).
3. Click Finish and accept the default solution name, solution3.
4. Open the C source code matrixmul.cpp to make it visible in the Information pane.
5. In the Directive tab:
a. Select loop Col.
b. Right-click and select Insert Directive.
c. In the Directive Editor dialog box activate the Directive drop-down menu at the
top and select PIPELINE.
d. Click OK. With the default options, an initiation interval (II) of 1 (one new loop
iteration per clock) becomes the default.
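The effect of pipelining the Col loop can be pictured with the sketch below, in which the Product loop has been written out by hand. This only illustrates what unrolling means for a 3x3 multiply: the function name and element types are assumptions, and the tool performs the unrolling itself when the directive is applied to Col.

```cpp
#include <cstdint>

const int DIM = 3;  // assumed: the 3x3 matrices described earlier

void matrixmul_col_pipelined(int8_t a[DIM][DIM], int8_t b[DIM][DIM],
                             int16_t res[DIM][DIM]) {
Row:
    for (int i = 0; i < DIM; i++) {
Col:
        for (int j = 0; j < DIM; j++) {
// Pipelining this loop unrolls any loop nested inside it.
#pragma HLS PIPELINE II=1
            // Product loop written out by hand: three multiplications per
            // Col iteration, with no loop-carried accumulation on res.
            res[i][j] = a[i][0] * b[0][j]
                      + a[i][1] * b[1][j]
                      + a[i][2] * b[2][j];
        }
    }
}
```

Each Col iteration now needs three values from a and three values from b, so the number of memory ports becomes the next thing to check.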
The Directive pane, shown below, displays the following optimization directives (the new
directive is highlighted).
During synthesis, the information reported in the Console pane shows that loop Product
was unrolled, loop flattening was performed on loop Row, and the default initiation interval
target of 1 could not be achieved on loop Row_Col due to resource limitations on the
memory for array a.
INFO: [XFORM 203-502] Unrolling all sub-loops inside loop 'Col' (matrixmul.cpp:56) in
function 'matrixmul' for pipelining.
INFO: [XFORM 203-501] Unrolling loop 'Product' (matrixmul.cpp:59) in function
'matrixmul' completely.
INFO: [XFORM 203-541] Flattening a loop nest 'Row' (matrixmul.cpp:54:37) in function
'matrixmul'.
...
...
INFO: [SCHED 204-61] Pipelining loop 'Row_Col'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('a_load_1',
matrixmul.cpp:60) on array 'a' due to limited memory ports.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
Reviewing the synthesis report shows, as noted above, that the interval for loop Row_Col is
only two: the target is to process one sample every cycle. Once again, you can use the
Analysis perspective to highlight why the initiation target was not achieved.
The operations on array a (mentioned in the SCHED-69 message above) are highlighted in
Figure 7-10. There are three read operations on array a. One operation in each state C1
through C3.
Arrays are implemented as block RAMs and arrays which are arguments to the function are
implemented as block RAM ports. In both cases a block RAM can only have a maximum of
two ports (for dual-port block RAM). By accessing array a through a single block RAM
interface, there are not enough ports to be able to read all three values in one clock cycle.
In Figure 7-11 the 2-cycle read operations in state C1 overlap with those starting in state C2
and so only a single cycle is visible: however, it is clear that this resource is used in multiple
states.
In looking at this view, it is clear that even when the issue with port a is resolved, the same
issue occurs with port b: it also has to perform 3 reads.
High-Level Synthesis can only report one schedule error or warning at a time because, as
soon as the first issue occurs, the actions taken to create an achievable schedule invalidate
any other infeasible schedules.
Because the loop index for the Product loop is k, both arrays should be partitioned along
their respective k dimension: the design needs to access more than two values of k in each
clock cycle.
For array a, this is dimension 2 because its access pattern is a[i][k]; for array b, this is
dimension 1 because its access pattern is b[k][j].
Partitioning these arrays creates MAT_A_COLS arrays and, in this case, MAT_A_COLS
ports. Alternatively, you can use reshape instead of partition, which creates one wide array
(port) instead of MAT_A_COLS separate ports.
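Expressed as pragmas, the reshape option might look like the sketch below. The function name and element types are assumptions; the dimension numbers follow the access patterns a[i][k] and b[k][j] given above. Replacing ARRAY_RESHAPE with ARRAY_PARTITION in the two pragmas would request separate arrays, and therefore separate ports, instead of one wide port per array.

```cpp
#include <cstdint>

const int MAT_A_ROWS = 3, MAT_A_COLS = 3;
const int MAT_B_ROWS = 3, MAT_B_COLS = 3;

void matrixmul_reshaped(int8_t a[MAT_A_ROWS][MAT_A_COLS],
                        int8_t b[MAT_B_ROWS][MAT_B_COLS],
                        int16_t res[MAT_A_ROWS][MAT_B_COLS]) {
// k indexes dimension 2 of a (a[i][k]) and dimension 1 of b (b[k][j]).
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
Row:
    for (int i = 0; i < MAT_A_ROWS; i++) {
Col:
        for (int j = 0; j < MAT_B_COLS; j++) {
#pragma HLS PIPELINE II=1
            res[i][j] = 0;
Product:
            for (int k = 0; k < MAT_B_ROWS; k++) {
                // All values of k are needed in the same pipeline iteration.
                res[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
```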
After this transformation, the data in the block RAM outside this block must be reshaped in
an identical manner: if this process is not done by HLS, the data must be arranged as:
The synthesis report shows the top-level loop Row_Col is now processing data at 1 sample
per clock period (Figure 7-13).
The function can then complete and return to start to process the next set of data.
Now, change the block RAM interfaces to FIFO interfaces to allow for streaming data.
c. In the Directive Editor dialog box activate the Directive drop-down menu at the
top and select INTERFACE.
d. Click the mode drop-down menu to select ap_fifo.
e. Click OK.
5. Repeat this process for variables b and variable res.
The Directive pane displays the following optimization directives. (The new directives are
highlighted).
Four consecutive writes to address [0][0] do not constitute a streaming access pattern;
this is random access.
Before modifying the code, however, it is worth pipelining the function instead of pipelining
the loops to contrast the difference in the two approaches.
IMPORTANT: In this step, copy the directives from solution4 as this solution does not have FIFO
interfaces specified.
2. Select solution4 from both the drop down menus in the Options section. The Solution
Wizard appears as shown in Figure 7-17.
INFO: [XFORM 203-502] Unrolling all loops for pipelining in function 'matrixmul'
(matrixmul.cpp:49).
INFO: [HLS 200-489] Unrolling loop 'Row' (matrixmul.cpp:54) in
function 'matrixmul' completely with a factor of 3.
Pipelining loops allows the loops to remain rolled, thus providing a good means of
controlling the area. When pipelining a function, all loops contained in the function are
unrolled, which is a requirement for pipelining. The pipelined function design can process a
new set of 9 samples every 5 clock cycles. This exceeds the requirement of 1 sample per
clock because the default behavior of High-Level Synthesis is to produce a design with the
highest performance.
The pipelined function results in the best performance. However, if it exceeds the required
performance, it might take multiple additional directives to slow the design down.
Pipelining loops gives you an easy way to control resources, with the option of partially
unrolling the design to meet performance.
The code intuitively captured the behavior of a matrix multiplication, but it prevented a
required behavior in the hardware: streaming accesses.
This lab exercise uses an updated version of the C code you worked with in Lab 1. The
following explains how the C code was updated.
Figure 7-20 shows the I/O access pattern for the code in Lab 1. Out of necessity the address
values are shown in a small font.
As variables i, j and k iterate from 0 to 2, the lower part of Figure 7-20 shows the
addresses generated to read a, b and write to res. In addition, at the start of each Product
loop, res is set to the value zero.
ensure the design does not have to re-read the port. For the write port res, the data must
be saved into a temporary variable and only written to the port in the cycles shown in red.
• The directives from Lab 1, including the FIFO interfaces, are specified in the code as
pragmas.
• For-loops have been added to cache the row and column reads.
• A temporary variable is used for the accumulation and port res is only written to when
the final result is computed for each value.
• Because the for-loops to cache the row and column would require multiple cycles to
perform the reads, the pipeline directive has been applied to the Col for-loop, ensuring
these cache for-loops are automatically unrolled.
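The changes listed above can be sketched as follows. This is an approximation written for illustration, not the Lab 2 source file: the function name, the element types, and the names a_row, b_copy and tmp are assumptions. It shows the three ideas from the list: directives as pragmas, cache loops for the row and column reads, and a temporary accumulator so that res is written only once per result.

```cpp
#include <cstdint>

const int DIM = 3;  // assumed 3x3 matrices, as in Lab 1

void matrixmul_stream(int8_t a[DIM][DIM], int8_t b[DIM][DIM],
                      int16_t res[DIM][DIM]) {
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res
    int8_t a_row[DIM];        // cache of the current row of a
    int8_t b_copy[DIM][DIM];  // cache of every column of b

Row:
    for (int i = 0; i < DIM; i++) {
Col:
        for (int j = 0; j < DIM; j++) {
// Pipelining Col unrolls the cache loops and the Product loop.
#pragma HLS PIPELINE II=1
            // Read row i of a only once, on the first column.
            if (j == 0) {
Cache_Row:
                for (int k = 0; k < DIM; k++) a_row[k] = a[i][k];
            }
            // Read column j of b only once, on the first row.
            if (i == 0) {
Cache_Col:
                for (int k = 0; k < DIM; k++) b_copy[k][j] = b[k][j];
            }
            int16_t tmp = 0;  // accumulate locally, not on the output port
Product:
            for (int k = 0; k < DIM; k++) tmp += a_row[k] * b_copy[k][j];
            res[i][j] = tmp;  // one write per result
        }
    }
}
```

Every element of a and b is read from its port exactly once, and every element of res is written exactly once, which is what a FIFO interface requires.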
5. Click the Run C Synthesis toolbar button to synthesize the design to RTL.
6. When synthesis completes, use the Run C/RTL CoSimulation toolbar button to launch
the CoSimulation Dialog box.
7. Click OK to start RTL verification.
The design has now been fully synthesized to read one sample every clock cycle using
streaming FIFO interfaces.
Conclusion
In this tutorial, you learned:
• How to analyze pipelined loops and understand exactly which limitations prevent
optimization targets from being achieved.
• The advantages and disadvantages of function versus loop pipelining.
• How unintended dependencies in the code can prevent hardware design goals from
being realized and how they can be overcome by modifications to the source code.
RTL Verification
Overview
The High-Level Synthesis tool automates the process of RTL verification and allows you to
use RTL verification to generate trace files that show the activity of the waveforms in the
RTL design. You can use these waveforms to analyze and understand the RTL output. This
tutorial covers all aspects of the RTL verification process.
To perform RTL verification, you use both the RTL output from High-Level Synthesis
(Verilog, VHDL or SystemC) and the C test bench. RTL verification is often called
CoSimulation or C/RTL CoSimulation, because both C and RTL are used in the verification.
Lab 1 Description
Perform RTL verification steps and understand the importance of the C test bench in
verifying the RTL.
Lab 2 Description
Create RTL trace files and analyze them using the Vivado Design Suite.
Lab 3 Description
Create RTL trace files and analyze them using a third-party RTL simulator. This lab requires
a license for Mentor Graphics ModelSim simulator. (You can use an alternative, third-party
simulator with minor modifications to the steps).
The sample design used in the lab exercise is a DUC (digital up converter) function. The
purpose of this lab is to demonstrate and explain the features of RTL verification. There are
no design goals for these lab exercises.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or on Linux systems, adjust the few pathnames referenced,
to the location you have chosen to place the Vivado_HLS_Tutorial directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
When RTL Verification completes, the simulation report opens automatically (Figure 8-5).
The report indicates if the simulation passed or failed. In addition, the report indicates the
measured latency and interval.
First, the C test bench is executed to generate input stimuli for the RTL design.
At the end of this phase, the simulation shows any messages generated by the C test bench.
The output from the C function is not used in the C test bench at this stage, but any
messages output by the test bench can be seen in the console.
An RTL test bench with newly generated input stimuli is created and the RTL simulation is
then performed.
Finally, the output from the RTL simulation is re-applied to the C test bench to check the
results. Once again, you can see any message output by the C test bench in the console.
Finally, RTL verification issues message SIM-1000 if the RTL verification passed.
To fully understand why the C test bench should check the results and how message
SIM-1000 is generated, you will modify the C test bench.
7. Click the Run C/RTL CoSimulation toolbar button to launch the CoSimulation Dialog
box.
8. Leave the CoSimulation options at their default value and click OK to execute the RTL
CoSimulation.
When RTL CoSimulation completes, the CoSimulation report opens and says the verification
has failed (Figure 8-8).
If required, you can confirm the results are correct. To do this, compare the output files
created by the RTL simulation with the golden results. The RTL simulation is executed in the
simulation directory wrapc, which is inside the solution directory. Figure 8-9 shows the
solution directory, with the output files highlighted.
To ensure that the RTL results are automatically verified, the C test bench must always check
the output from the C function to be synthesized and return 0 (zero) if the results are
correct, or any other value if they are not.
When RTL Verification is performed, the same testing occurs in the test bench, and the
output from the RTL block is automatically checked. This is why it is important for the C test
bench to check the results and return a zero value only if they are correct (or return a
non-zero value if they are incorrect).
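A self-checking test bench of the kind described here can be sketched as follows. The function under test (scale_by_two), the data values, and the name testbench are all hypothetical. In a real test bench the checking code is the body of main(), and the value returned by main() is what reports a pass or a fail.

```cpp
#include <cstdio>

// Hypothetical function under test, standing in for the synthesized function.
void scale_by_two(const int in[4], int out[4]) {
    for (int i = 0; i < 4; i++) out[i] = in[i] * 2;
}

// Body of a self-checking test bench. In a real test bench this is main().
int testbench() {
    const int in[4]     = {1, 2, 3, 4};
    const int golden[4] = {2, 4, 6, 8};  // known-good reference results
    int out[4];

    scale_by_two(in, out);

    // Compare every output against the golden data and count mismatches.
    int errors = 0;
    for (int i = 0; i < 4; i++) {
        if (out[i] != golden[i]) {
            std::printf("Mismatch at %d: got %d, expected %d\n",
                        i, out[i], golden[i]);
            errors++;
        }
    }
    if (errors == 0) std::printf("Test passed\n");
    return errors == 0 ? 0 : 1;  // 0 = pass, non-zero = fail
}
```

A test bench that always returns 0, regardless of the results, would let a faulty RTL design pass verification.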
9. Exit the Vivado HLS GUI and return to the command prompt.
In this case, you will produce a trace file you can open using the Vivado Simulator.
The next step is to view the trace files inside the Vivado Design Suite.
Since waveform trace data has been generated for the Vivado Simulator, the Open Wave
Viewer toolbar button is now highlighted, as shown in Figure 8-13.
Note: The Open Wave Viewer toolbar button can only be used when Vivado Simulator is selected
as the Verilog/VHDL Simulator.
You can then view the waveforms in the waveform viewer. Figure 8-14 shows the zoomed
waveforms where the output data ports and their associated I/O protocol signals (output
valid signals) are expanded to view.
CAUTION! This lab exercise requires that the executable for ModelSim is defined in the system search
path and that the required license to perform HDL simulation is available on the system.
This exercise uses the Mentor Graphics ModelSim RTL simulator. The path to the simulator
executable must be set in your system search path.
7. Exit the Vivado HLS GUI and return to the command prompt.
Conclusion
In this tutorial, you learned how to:
• Perform RTL verification on a design synthesized from C and the importance of the test
bench in this process.
• Create and open waveform trace files using the Vivado Design Suite.
• Create and open waveform trace files using a third-party HDL simulator (ModelSim)
and view the trace file created by RTL verification.
Overview
You can package the RTL from High-Level Synthesis and use it inside IP Integrator. This
tutorial demonstrates how to take HLS IP and use it in IP Integrator as part of a larger
design.
Lab 1 Description
Complete the steps to generate two HLS blocks for the IP catalog and use them in a design
with Xilinx IP, an FFT. You validate and verify the final design using an RTL test bench.
This tutorial uses the design files in the tutorial directory Vivado_HLS_Tutorial\
Using_IP_with_IPI.
The design blocks in this tutorial process the data for a complex FFT.
• The Xilinx FFT IP block only operates on complex data. Although you can perform an
FFT of real data on a complex data set with all imaginary components set to zero, it can
be done more efficiently by pre-processing the data.
• The front-end HLS block in this lab applies a Hamming windowing function to the 1024
(N) real data samples and sends even/odd pairs to an N/2-point XFFT as though they
are complex data.
• The back-end HLS block takes bit-reverse ordered data, puts it in natural order and
applies an O(N) transformation to FFT output to extract the spectral data for the
N-point real data set. Note, the first output pair packs the 0th and 512th (purely real)
spectral data point into the real and imaginary parts, respectively.
• The designs are fully pipelined, streaming designs for high throughput; intended for
continuous processing of data, but with throttling capability (stalls if input stalls).
• AXI4-Stream interfaces are used to connect all blocks in IP Integrator.
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or on Linux systems, adjust the few pathnames referenced,
to the location you have chosen to place the Vivado_HLS_Tutorial directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
The remainder of this tutorial shows how the Vivado HLS IP blocks can be integrated into a
design (in IP Integrator) and verified.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado 2019.1.
A Vivado HLS IP category now appears in the IP Catalog as HLS IP (Figure 9-9).
6. Right-click in any space in the canvas and select Add IP (Figure 9-15).
The design block now has three IP blocks, as shown in Figure 9-16.
8. Hover the cursor over the dout interface connector of the Hls_real2xfft block until
the pencil cursor appears.
a. Left-click and hold down the mouse button to start a connection.
b. Drag the connection line to the S_AXIS_DATA port connector of FFT block and
release (when green check mark appears next to it).
9. In a similar fashion, connect the FFT’s M_AXIS_DATA interface to the din interface of
the Hls_xfft2real block.
10. Right-click the din_V_V interface connector on the hls_real2xfft block and select
Make External (Figure 9-18).
IMPORTANT: Property changes might not take effect if this re-naming step is not done.
Validate Design shows some warnings. These relate to the s_axis_config pin of
the FFT.
a. The XFFT configuration interface is left unconnected because this design always
operates in the default mode of the core.
b. Click OK to close the messages.
19. Click File > Save Block Design.
20. Close the Block Design.
21. The next step is to generate output products.
a. In the Sources tab of the Project Manager pane (Figure 9-23), right-click RealFFT.bd
and select Generate Output Products.
b. Click Generate in the resulting dialog to initiate the generation of all output
products.
c. Select OK to ignore the warnings discussed above.
1. Right-click Simulation Sources in the Sources tab of the Project Manager pane
(Figure 9-24).
2. Select Add Sources.
Conclusion
In this tutorial, you learned:
Overview
A common use of High-Level Synthesis design is to create an accelerator for a CPU – to
move code that executes on the CPU into the FPGA programmable logic to improve
performance. This tutorial shows how you can incorporate a design created with High-Level
Synthesis into a Zynq device.
Lab 1 Description
You create and configure a simple HLS design to work with the CPU on a Zynq device. The
HLS design used in this lab is simple to allow the focus of the tutorial to be on explaining
the connections to the CPU and how to configure the software drivers created by
High-Level Synthesis to control the device and manage interrupts.
Lab 2 Description
This lab illustrates a common high-performance scheme for connecting hardware
accelerator blocks that, in a streaming manner, consume data originating in the CPU
memory and/or produce data destined for it. The lab highlights the software
requirements to avoid cache coherency issues.
The sample design is a simple multiply accumulate block. The focus of this tutorial exercise
is the methodology, connections and integration of the software drivers. (The tutorial does
not focus on the logic in the design itself.)
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial. If the tutorial data
directory is unzipped to a different location, or you are working on a Linux system, adjust the
referenced pathnames to match the location of the Vivado_HLS_Tutorial directory.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado HLS > Vivado HLS 2019.1 Command Prompt.
The remainder of this tutorial exercise shows how the Vivado HLS IP blocks can be
integrated into a Zynq design using IP Integrator.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado 2019.1.
b. In the Project Location text entry box, browse to the location of the tutorial file
directory Using_IP_with_Zynq\lab1 and click Next (Figure 10-3).
c. On the Project Type page, select RTL Project and Do not specify sources at this
time (if it is not the default).
d. Click Next.
2. Press the Add IP button on the main screen to open the IP search dialog.
a. Type zynq into the Search text entry box.
b. Select ZYNQ7 Processing System and press Enter.
3. Double-click the ZYNQ IP symbol to open the associated Re-customize IP dialog box.
a. Click the Presets icon and select ZC702 (Figure 10-12).
6. Click the Run Block Automation link under the title bar (Figure 10-15).
a. Ensure processing_system7_0 is selected.
b. Ensure Apply Board Presets is deselected. If this remains selected it re-applies the
timers that were disabled in step 4 and results in additional ports on the Zynq block
in Figure 10-15.
c. Click OK to complete in the resulting dialog box.
8. Click the Run Connection Automation link at the top of the canvas.
9. Select /hls_macc_0/S_AXI_HLS_MACC_PERIPH_BUS and click OK in the resulting
dialog box to automatically connect the HLS IP to the M_AXI_GP0 interface of the Zynq
Processor.
This adds an AXI Interconnect (block instance: processing_system7_0), a Proc Sys Reset
block and makes all necessary AXI related connections to create the design shown in
Figure 10-17.
10. Mouse over the interrupt pin on the hls_macc_0 IP symbol. When the cursor changes
to a pencil shape, click and drag to the IRQ_F2P[0:0] port of the PS7 and release,
completing the connection.
11. Select the Address Editor tab and confirm that the hls_macc_0 peripheral has been
assigned a master address range. If it has not, click the Auto Assign Address icon.
1. Return to the Project Manager view by clicking on Project Manager in the Flow
Navigator.
2. In the Sources browser in the main workspace pane, a Block Diagram object called
Zynq_Design is at the top of the Design Sources tree view (Figure 10-19). Right-click
this object and select Generate Output Products.
3. In the resulting dialog box, click Generate to start the process of generating the
necessary source files.
The top-level of the Design Sources tree becomes the Zynq_Design_wrapper.v file. The
design is now ready to be synthesized, implemented and to have an FPGA programming
bitstream generated.
1. From the Vivado File menu select Export > Export Hardware.
Note: Both the IPI Block Design and the Implemented Design must be open in the Vivado
workspace for this step to complete successfully.
2. In the Export Hardware dialog box (Figure 10-20), ensure that the Include Bitstream is
enabled and click OK.
8. Click SDK Terminal and click the add button to add a port to the terminal.
3. Define variables for the HLS block and interrupt controller instance data. The variables
will be passed to driver API calls as handles to the respective hardware.
// HLS macc HW instance
XHls_macc HlsMacc;
//Interrupt Controller Instance
XScuGic ScuGic;
4. Define global variables to interface with the interrupt service routine (ISR).
volatile static int RunHlsMacc = 0;
volatile static int ResultAvailHlsMacc = 0;
5. Define a function to wrap all run-once API initialization function calls for the HLS block.
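int hls_macc_init(XHls_macc *hls_maccPtr)
{
    XHls_macc_Config *cfgPtr;
    int status;

    // Look up the configuration for this device instance
    cfgPtr = XHls_macc_LookupConfig(XPAR_XHLS_MACC_0_DEVICE_ID);
    if (!cfgPtr) {
        print("ERROR: Lookup of accelerator configuration failed.\n\r");
        return XST_FAILURE;
    }
    // Initialize the driver instance from that configuration
    status = XHls_macc_CfgInitialize(hls_maccPtr, cfgPtr);
    if (status != XST_SUCCESS) {
        print("ERROR: Could not initialize accelerator.\n\r");
        return XST_FAILURE;
    }
    return status;
}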
6. Define a helper function to wrap the HLS block API calls required to enable its interrupt
and start the block.
void hls_macc_start(void *InstancePtr){
XHls_macc *pAccelerator = (XHls_macc *)InstancePtr;
XHls_macc_InterruptEnable(pAccelerator,1);
XHls_macc_InterruptGlobalEnable(pAccelerator);
XHls_macc_Start(pAccelerator);
}
An interrupt service routine is required in order for the processor to respond to an interrupt
generated by a peripheral.
Each peripheral with an interrupt attached to the PS must have an ISR defined and
registered with the PS’s interrupt handler.
The ISR is responsible for clearing the peripheral’s interrupt and, in this example, setting a
flag that indicates that a result is available for retrieval from the peripheral. In general, ISRs
should be designed to be lightweight and as fast as possible, essentially doing the
minimum necessary to service the interrupt. Tasks such as retrieving the data should be left
to the main application code.
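void hls_macc_isr(void *InstancePtr){
    XHls_macc *pAccelerator = (XHls_macc *)InstancePtr;
    // clear the peripheral's interrupt
    XHls_macc_InterruptClear(pAccelerator,1);
    // flag that a result is available
    ResultAvailHlsMacc = 1;
    // restart the core if it should run again
    if(RunHlsMacc){
        hls_macc_start(pAccelerator);
    }
}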
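The lightweight-ISR guideline above can be tried out on a host PC without any hardware. The following is a minimal sketch, assuming a mocked peripheral: MockMacc, mock_hw_done, mock_isr and poll_result are hypothetical names and are not part of the driver API generated by High-Level Synthesis. The ISR only clears the interrupt and sets a flag; the main-code side fetches the data.

```c
#include <assert.h>

/* Hypothetical stand-in for the peripheral registers. */
typedef struct {
    int irq_pending;   /* interrupt line state */
    int result;        /* result register */
    int starts;        /* number of restarts requested */
} MockMacc;

static volatile int RunMock = 0;          /* restart-on-completion request */
static volatile int ResultAvailMock = 0;  /* set by the ISR, cleared by main code */

static void mock_start(MockMacc *dev)
{
    dev->starts++;
}

/* Pretend the hardware finished: latch a result and raise the interrupt. */
void mock_hw_done(MockMacc *dev, int value)
{
    dev->result = value;
    dev->irq_pending = 1;
}

/* ISR: clear the interrupt, set the flag, optionally restart. Nothing else. */
void mock_isr(void *instance)
{
    MockMacc *dev = (MockMacc *)instance;
    dev->irq_pending = 0;
    ResultAvailMock = 1;
    if (RunMock)
        mock_start(dev);
}

/* Main-code side: consume the flag and read the data outside the ISR.
 * Returns 1 if a result was read into *out, 0 otherwise. */
int poll_result(MockMacc *dev, int *out)
{
    assert(dev && out);
    if (!ResultAvailMock)
        return 0;
    ResultAvailMock = 0;
    *out = dev->result;
    return 1;
}
```

The flags are declared volatile for the same reason as in the tutorial code: they are written in interrupt context and read in the main loop, so the compiler must not cache them in a register.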
7. Define a routine to setup the PS interrupt handler and register the HLS peripheral’s ISR.
int setup_interrupt()
{
//This function sets up the interrupt on the Arm
int result;
XScuGic_Config *pCfg = XScuGic_LookupConfig(XPAR_SCUGIC_SINGLE_DEVICE_ID);
if (pCfg == NULL){
print("Interrupt Configuration Lookup Failed\n\r");
return XST_FAILURE;
}
result = XScuGic_CfgInitialize(&ScuGic,pCfg,pCfg->CpuBaseAddress);
if(result != XST_SUCCESS){
return result;
}
// self-test
result = XScuGic_SelfTest(&ScuGic);
if(result != XST_SUCCESS){
return result;
}
// Initialize the exception handler
Xil_ExceptionInit();
// Register the exception handler
//print("Register the exception handler\n\r");
Xil_ExceptionRegisterHandler(XIL_EXCEPTION_ID_INT,
(Xil_ExceptionHandler)XScuGic_InterruptHandler,&ScuGic);
//Enable the exception handler
Xil_ExceptionEnable();
// Connect the HLS macc ISR to the exception table
//print("Connect the HLS macc ISR to the Exception handler table\n\r");
result = XScuGic_Connect(&ScuGic, XPAR_FABRIC_HLS_MACC_0_INTERRUPT_INTR,
(Xil_InterruptHandler)hls_macc_isr,&HlsMacc);
if(result != XST_SUCCESS){
return result;
}
//print("Enable the Adder ISR\n\r");
XScuGic_Enable(&ScuGic,XPAR_FABRIC_HLS_MACC_0_INTERRUPT_INTR);
return XST_SUCCESS;
}
8. Define a software model of the HLS hardware functionality with which you can compare
reference results.
void sw_macc(int a, int b, int *accum, bool accum_clr)
{
static int accum_reg = 0;
if (accum_clr)
accum_reg = 0;
accum_reg += a * b;
*accum = accum_reg;
}
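Because the reference model keeps its accumulator in a static variable, the order of calls and the use of accum_clr matter when you compare it against the hardware. The following sketch repeats the model so that it compiles on its own and adds a hypothetical helper, sw_macc_repeat, that clears on the first call and then accumulates.

```c
#include <assert.h>
#include <stdbool.h>

/* Copy of the software reference model from the step above. */
void sw_macc(int a, int b, int *accum, bool accum_clr)
{
    static int accum_reg = 0;
    if (accum_clr)
        accum_reg = 0;
    accum_reg += a * b;
    *accum = accum_reg;
}

/* Hypothetical helper: clear on the first call, then accumulate n products. */
int sw_macc_repeat(int a, int b, int n)
{
    int acc = 0;
    assert(n > 0);
    for (int i = 0; i < n; i++)
        sw_macc(a, b, &acc, i == 0);
    return acc;
}
```

With the values used in main() (a = 2, b = 21), one cleared call yields 42 and three calls yield 126. A later call without accum_clr continues from the last stored value, so a test that runs the hardware several times must drive the clear input the same way in both models.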
9. Modify main() to use the HLS device driver API and the functions defined above to test
the HLS peripheral hardware.
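int main()
{
print("Program to test communication with HLS MACC peripheral in PL\n\r");
int a = 2, b = 21;
int res_hw;
int res_sw;
int i;
int status;
// Initialize the HLS peripheral
status = hls_macc_init(&HlsMacc);
if(status != XST_SUCCESS){
print("HLS peripheral setup failed\n\r");
exit(-1);
}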
//Setup the interrupt
status = setup_interrupt();
if(status != XST_SUCCESS){
print("Interrupt setup failed\n\r");
exit(-1);
}
if (XHls_macc_IsReady(&HlsMacc))
print("HLS peripheral is ready. Starting... ");
else {
print("!!! HLS peripheral is not ready! Exiting...\n\r");
exit(-1);
}
printf("Result from HW: %d; Result from SW: %d\n\r", res_hw, res_sw);
if (res_hw == res_sw) {
print("*** Results match ***\n\r");
status = 0;
}
else {
print("!!! MISMATCH !!!\n\r");
status = -1;
}
cleanup_platform();
return status;
}
10. Save the modified source file. When you save the file, SDK automatically attempts to
re-build the application executable. If the build fails, fix any outstanding issues.
Run the new application on the hardware and verify that it works as expected. Ensure that
a TCF hardware server is running, that the FPGA is programmed and a terminal session is
connected to the UART. Then Launch on Hardware, as you did for the previous Hello World
application code.
• This tutorial uses the same Vivado HLS and XFFT IP blocks created in Lab 1 of the
tutorial “Using HLS IP in IP Integrator”. In this lab exercise these blocks are connected
to the HP0 Slave AXI4 port on a Zynq7 processing system via an AXI DMA IP core.
• The hardware accelerator blocks are free-running and do not require drivers, as long as
data is pushed in and pulled out by the CPU (often simply referred to as the Processing
System or PS).
• The lab highlights the software requirements to avoid cache coherency issues.
° On Windows use Start > All Programs > Xilinx Design Tools > Vivado 2019.1 >
Vivado 2019.1.
Add pins to the RealFFT hierarchical block so that you can connect it at the top-level.
Figure 10-35: RealFFT Diagram with Interface Pin and Clock Pin
13. Following the procedures in steps 10 to 12:
a. Create an interface pin called realfft_m_axis_dout connected to the dout_V
pin of the hls_xfft2real component.
b. Create a pin for aresetn (from any one of the blocks).
After this step, the RealFFT diagram appears as shown in Figure 10-36.
14. Right-click the canvas and select Add IP from the context menu.
a. Type const into the search box and press Enter.
b. Double-click the xlconstant_0 component and verify that the Const Val field in
the Customize IP dialog is set to 1.
Leave all other output pins of the components disconnected. The final RealFFT diagram
appears with the connections shown in Figure 10-38.
IPI automatically places several new blocks required to complete the connection,
including an AXI DMA core, an AXI Interconnect and a Processor System Reset block.
22. Make a connection from the RealFFT block’s realfft_m_axis_dout to the Zynq’s
S_AXI_HP0 interface. Accepting the defaults in the Make Connection dialog will cause IPI
to use the existing AXI DMA (which has an unused write channel) and AXI Interconnect
to make the ‘S2MM’ connection.
23. Note that Designer Assistance is again available. Run Connection Automation on
/axi_dma/S_AXI_LITE and click OK in the resulting dialog box.
24. Connect the aclk and aresetn ports of the RealFFT hierarchical block to the
FCLK_CLK0 pin of processing_system7_0 and the peripheral_aresetn pin of
rst_processing_system7_0_100M, respectively.
25. To complete the design, run Validate Design. When validation completes successfully,
the block diagram should look like Figure 10-41.
Before proceeding with the system design, you must generate implementation sources and
create an HDL wrapper as the top-level module for synthesis and implementation.
1. Return to the Project Manager view by clicking Project Manager in the Flow Navigator.
2. In the Sources browser in the main workspace pane, a Block Diagram object called
Zynq_RealFFT appears at the top of the Design Sources tree view. Right-click this
object and select Generate Output Products.
3. In the resulting dialog box, click OK to start the process of generating the necessary
source files.
4. Right-click the Zynq_RealFFT object again, select Create HDL Wrapper, and click OK
to exit the resulting dialog box.
The top-level of the Design Sources tree becomes the Zynq_RealFFT_wrapper.v file.
You are now ready to synthesize, implement, and generate an FPGA programming bitstream
for the design.
1. From the Vivado File menu select Export > Export Hardware for SDK.
Note: Both the IPI Block Design and the Implemented Design must be open in the Vivado
workspace for this step to complete successfully.
2. In the Export Hardware for SDK dialog box, ensure that the Include Bitstream option is
checked, and click OK.
3. From the Vivado File menu, select Launch SDK.
4. Click OK to launch SDK.
5. Create a Hello World application (also creates BSP).
a. Select File > New > Application Project.
b. Enter the project name Zynq_RealFFT_Test.
c. Click Next.
d. Select Hello World (if it is not the default).
e. Click Finish.
6. Power up the ZC702 board and program the FPGA.
Ensure the board has all the connections to allow you to download the bitstream on the
FPGA device. Refer to the documentation that accompanies the ZC702 development
board.
7. Click Xilinx Tools > Program FPGA. The Done LED (DS3) goes on.
i. In the Tool Settings tab, select Arm gcc linker > Libraries.
4. Define a custom complex data type with 16-bit real and imaginary members:
typedef struct {
short re;
short im;
} complex16;
5. Declare helper functions before the definition of main(); they will be defined later.
Note: The init_dma() function wraps all run-once AXI DMA driver initialization calls and
checks that hardware initialization is successful before returning, or exits on an error condition.
The generate_waveform() function fills an array with a simple, periodic waveform to be used as
input stimulus for the RealFFT accelerator.
int init_dma(XAxiDma *axiDma);
void generate_waveform(short *signal_buf, int num_samples);
6. Modify main() to generate and send input data to the RealFFT accelerator and receive
the spectral data from it via the AXI DMA engine. Sections of particular importance will
be discussed in detail.
// Program entry point
int main()
{
e. Before making the DMA transfer request, the buffer containing the data must be
flushed from the processor’s data cache. Without this step, the DMA might pull stale
data from the DRAM.
// *IMPORTANT* - flush contents of 'realdata' from data cache to memory
// before DMA. Otherwise DMA is likely to get stale or uninitialized data
Xil_DCacheFlushRange((unsigned)realdata, 4 * REAL_FFT_LEN * sizeof(short));
f. Request DMA transfer from PS to PL. Enough data to fill the front-end block and the
FFT processing pipelines must be sent in order for spectral data to be ready when
the PL to PS transfer is requested. Therefore, four data sets are sent before the first
output set is requested:
// DMA enough data to push out first result data set completely
status = XAxiDma_SimpleTransfer(&axiDma, (u32)realdata,
4 * REAL_FFT_LEN * sizeof(short), XAXIDMA_DMA_TO_DEVICE);
// Do multiple DMA xfers from the RealFFT core's output stream and
// display data for bins with significant energy. After the first frame,
// there should only be energy in bins around the frequencies specified
// in the generate_waveform() function - currently bins 191~193 only
for (i = 0; i < 8; i++) {
g. Request DMA transfer of a frame of FFT spectral data from PL to PS then poll for
completion of the transfer before proceeding.
// Setup DMA from PL to PS memory using
// AXI DMA's 'simple' transfer mode
status = XAxiDma_SimpleTransfer(&axiDma, (u32)realspectrum,
REAL_FFT_LEN / 2 * sizeof(complex16), XAXIDMA_DEVICE_TO_DMA);
// Poll the AXI DMA core
do {
status = XAxiDma_Busy(&axiDma, XAXIDMA_DEVICE_TO_DMA);
} while(status);
h. Before attempting to use the spectral data, the processor’s data cache copy of the
buffer must be invalidated to avoid use of stale data.
// Data cache must be invalidated for 'realspectrum' buffer after DMA
Xil_DCacheInvalidateRange((unsigned)realspectrum,
REAL_FFT_LEN / 2 * sizeof(complex16));
i. Push another set of stimulus data to the PL in order to start the accelerator
processing the next frame:
// DMA another frame of data to PL
if (!XAxiDma_Busy(&axiDma, XAXIDMA_DMA_TO_DEVICE))
status = XAxiDma_SimpleTransfer(&axiDma, (u32)realdata,
REAL_FFT_LEN * sizeof(short), XAXIDMA_DMA_TO_DEVICE);
printf("\n\rFrame #%d received:\n\r", i);
j. Verify that the accelerator is functioning. In this case, the spectral data is scanned
for bins that contain significant energy. The expectation is to detect energy only in
bins around the single tone (192) generated by the generate_waveform() function.
// Detect energy in spectral data above a set threshold
for (j = 0; j < REAL_FFT_LEN / 2; j++) {
// Convert the fixed point (s.15) values into floating point values
float real = (float)realspectrum[j].re / 32767.0f;
float imag = (float)realspectrum[j].im / 32767.0f;
float mag = sqrtf(real * real + imag * imag);
if (mag > 0.00390625f) {
printf("Energy detected in bin %3d - ",j);
printf("{%8.5f, %8.5f}; mag = %8.5f\n\r", real, imag, mag);
}
}
printf("End of frame.\n\r");
}
printf("***************\n\r");
printf("* End of test *\n\r");
printf("***************\n\r\n\r");
return 0;
}
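The byte counts passed to the cache and DMA calls in main() follow directly from the frame size. The following sketch works through that arithmetic on a host PC. It assumes REAL_FFT_LEN is 1024, matching the 1024-sample frame described for the RealFFT blocks, and a 16-bit short; the helper names mm2s_bytes and s2mm_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define REAL_FFT_LEN 1024   /* assumed value: 1024 real samples per frame */

typedef struct {
    short re;
    short im;
} complex16;

/* Bytes sent PS -> PL for the given number of frames of real samples. */
size_t mm2s_bytes(int frames)
{
    assert(frames > 0);
    return (size_t)frames * REAL_FFT_LEN * sizeof(short);
}

/* Bytes received PL -> PS for one frame of spectral data (N/2 complex bins). */
size_t s2mm_bytes(void)
{
    return REAL_FFT_LEN / 2 * sizeof(complex16);
}
```

Under these assumptions the initial four-frame transfer is 8192 bytes, and each later input frame and each output frame is 2048 bytes. The range given to the cache flush or invalidate call should cover the same number of bytes as the matching DMA transfer, and the buffer behind it must be at least that large.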
7. Define the helper function that generates the waveform data sets. This version simply
fills a buffer with a single tone of 192 cycles per data window of num_samples samples,
with values in S.15 fixed-point format.
void generate_waveform(short *signal_buf, int num_samples)
{
const float cycles_per_win = 192.0f;
const float phase = 0.0f;
const float ampl = 0.9f;
int i;
for (i = 0; i < num_samples; i++) {
float sample = ampl *
cosf((i * 2 * M_PI * cycles_per_win / (float)num_samples) + phase);
signal_buf[i] = (short)(32767.0f * sample);
}
}
8. Define a routine to set up and initialize the AXI DMA engine, wrapping all driver API
calls that only need to be run once at startup.
int init_dma(XAxiDma *axiDmaPtr){
XAxiDma_Config *CfgPtr;
int status;
// Get pointer to DMA configuration
CfgPtr = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);
if(!CfgPtr){
print("Error looking for AXI DMA config\n\r");
return XST_FAILURE;
}
// Initialize the DMA handle
status = XAxiDma_CfgInitialize(axiDmaPtr,CfgPtr);
if(status != XST_SUCCESS){
print("Error initializing DMA\n\r");
return XST_FAILURE;
}
//check for scatter gather mode - this example must have simple mode only
if(XAxiDma_HasSg(axiDmaPtr)){
print("Error DMA configured in SG mode\n\r");
return XST_FAILURE;
}
//disable the interrupts
XAxiDma_IntrDisable(axiDmaPtr, XAXIDMA_IRQ_ALL_MASK,XAXIDMA_DEVICE_TO_DMA);
XAxiDma_IntrDisable(axiDmaPtr, XAXIDMA_IRQ_ALL_MASK,XAXIDMA_DMA_TO_DEVICE);
return XST_SUCCESS;
}
9. Save the modified source file. As soon as you save the file, SDK automatically attempts
to re-build the application executable.
10. Run the new application on the hardware and verify that it works as expected. Ensure
that the FPGA is programmed and a terminal session is connected to the UART. Then
Launch on Hardware, as done for the previous Hello World application code.
Conclusion
In this tutorial, you learned:
Overview
The RTL created by High-Level Synthesis can be packaged as IP and used inside System
Generator for DSP (Vivado). This tutorial shows how this process is performed and
demonstrates how the design can be used inside System Generator for DSP.
Lab 1 Description
Generate a design using Vivado HLS and package the design for use with System Generator
for DSP. Then include the HLS IP in a System Generator for DSP design and execute an RTL
simulation.
The sample design is a FIR filter that uses streaming interfaces modeled with the High-Level
Synthesis hls::stream class. The design is fully pipelined at the function level. The
optimization directives are embedded into the C code as pragmas.
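To make the streaming behavior concrete, the following plain C sketch models a FIR filter that consumes one input sample and produces one output sample per call. This is not the tutorial's source, which is C++ built on hls::stream with pragmas. The tap count, the Fir structure and the function names are hypothetical, and the model is meant for host-side experimentation only.

```c
#include <assert.h>
#include <string.h>

#define FIR_TAPS 4   /* hypothetical tap count */

typedef struct {
    int coeff[FIR_TAPS];
    int delay[FIR_TAPS];   /* shift register holding the most recent inputs */
} Fir;

/* Load the coefficients and clear the delay line. */
void fir_init(Fir *f, const int coeff[FIR_TAPS])
{
    assert(f && coeff);
    memcpy(f->coeff, coeff, sizeof f->coeff);
    memset(f->delay, 0, sizeof f->delay);
}

/* Consume one input sample and produce one output sample. */
int fir_step(Fir *f, int x)
{
    int acc = 0;
    assert(f);
    /* Shift the delay line by one position and insert the new sample. */
    for (int i = FIR_TAPS - 1; i > 0; i--)
        f->delay[i] = f->delay[i - 1];
    f->delay[0] = x;
    /* Multiply-accumulate across all taps. */
    for (int i = 0; i < FIR_TAPS; i++)
        acc += f->coeff[i] * f->delay[i];
    return acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients in order, which is a quick sanity check to repeat against the RTL simulation results.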
IMPORTANT: The figures and commands in this tutorial assume the tutorial data directory
Vivado_HLS_Tutorial is unzipped and placed in the location C:\Vivado_HLS_Tutorial.
If the tutorial data directory is unzipped to a different location, or you are working on a Linux
system, adjust the referenced pathnames to match the location of the Vivado_HLS_Tutorial
directory.
° On Windows, go to Start > All Programs > Xilinx Design Tools > Vivado 2019.1
> Vivado HLS > Vivado HLS 2019.1 Command Prompt.
The remainder of this tutorial exercise shows how to integrate the Vivado HLS IP block into
a System Generator design.
4. Right-click in the canvas and select Xilinx BlockAdd, as shown in Figure 11-5.
IMPORTANT: System Generator for DSP uses the location of the solution folder to identify the IP.
10. Connect the design I/O ports to the ports on the FIR IP block, as shown in Figure 11-8.
Conclusion
In this tutorial, you learned:
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual
property at all stages of the design cycle. Topics include design assistance, advisories, and
troubleshooting tips.
• From the Vivado IDE, select Help > Documentation and Tutorials.
• On Windows, select Start > All Programs > Xilinx Design Tools > DocNav.
• At the Linux command prompt, enter docnav.
Xilinx Design Hubs provide links to documentation organized by design tasks and other
topics, which you can use to learn key concepts and address frequently asked questions. To
access the Design Hubs:
• In the Xilinx Documentation Navigator, click the Design Hubs View tab.
• On the Xilinx website, see the Design Hubs page.
Note: For more information on Documentation Navigator, see the Documentation Navigator page
on the Xilinx website.
Training Resources
Xilinx provides a variety of training courses and QuickTake videos to help you learn more
about the concepts presented in this document. Use these links to explore related training
resources:
1. C-based Design: High-Level Synthesis with the Vivado HLS Tool Training Course
2. C-based HLS Coding for Hardware Designers Training Course
3. C-based HLS Coding for Software Designers Training Course
4. Vivado Design Suite QuickTake Video Tutorials
5. Vivado Design Suite QuickTake Video Tutorials: Vivado High-Level Synthesis
6. Vivado Design Suite QuickTake Video: Getting Started with High-Level Synthesis
7. Vivado Design Suite QuickTake Video: Verifying your Vivado HLS Design
8. Vivado Design Suite QuickTake Video: Creating Different Types of Projects
© Copyright 2012-2019 Xilinx, Inc. Xilinx, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, Vivado, Zynq, and other designated
brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of
their respective owners.