Ug902 Vivado High Level Synthesis
User Guide
High-Level Synthesis
UG902 (v2016.2) June 8, 2016
Revision History
The following table shows the revision history for this document.
06/08/2016: Released with Vivado Design Suite 2016.2 without changes from the previous version.
High-Level Synthesis
Note: For more information on FPGA architectures and Vivado HLS basic concepts, see the
Introduction to FPGA Design with Vivado High-Level Synthesis (UG998) [Ref 1].
Work at a level that is abstract from the implementation details, which consume
development time.
Validate the functional correctness of the design more quickly than with traditional
hardware description languages.
• Create multiple implementations from the C source code using optimization directives
Explore the design space, which increases the likelihood of finding an optimal
implementation.
Retarget the C source into different devices as well as incorporate the C source into new
projects.
• Scheduling
Determines which operations occur during each clock cycle based on:
° Time it takes for the operation to complete, as defined by the target device
If the clock period is longer or a faster FPGA is targeted, more operations are completed
within a single clock cycle, and all operations might complete in one clock cycle.
Conversely, if the clock period is shorter or a slower FPGA is targeted, high-level
synthesis automatically schedules the operations over more clock cycles, and some
operations might need to be implemented as multicycle resources.
• Binding
Extracts the control logic to create a finite state machine (FSM) that sequences the
operations in the RTL design.
If the C code includes a hierarchy of sub-functions, the final RTL design includes a
hierarchy of modules or entities that have a one-to-one correspondence with the
original C function hierarchy. All instances of a function use the same RTL
implementation or block.
When loops are rolled, synthesis creates the logic for one iteration of the loop, and the
RTL design executes this logic for each iteration of the loop in sequence. Using
optimization directives, you can unroll loops, which allows all iterations to occur in
parallel.
• Arrays in the C code synthesize into block RAM or UltraRAM in the final FPGA design
If the array is on the top-level function interface, high-level synthesis implements the
array as ports to access a block RAM outside the design.
To determine if the design meets your requirements, you can review the performance
metrics in the synthesis report generated by high-level synthesis. After analyzing the
report, you can use optimization directives to refine the implementation. The synthesis
report contains information on the following performance metrics:
• Area: Amount of hardware resources required to implement the design based on the
resources available in the FPGA, including look-up tables (LUT), registers, block RAMs,
and DSP48s.
• Latency: Number of clock cycles required for the function to compute all output values.
• Initiation interval (II): Number of clock cycles before the function can accept new input
data.
• Loop iteration latency: Number of clock cycles it takes to complete one iteration of the
loop.
• Loop initiation interval: Number of clock cycles before the next iteration of the loop
starts to process data.
• Loop latency: Number of cycles to execute all iterations of the loop.
[Figure: scheduling and binding example. Recoverable labels: Clock Cycle; Scheduling Phase
(a, x, y, b); Target Binding Phase (DSP, AddSub).]
IMPORTANT: The advantage of implementing the C code in the hardware is that all operations finish
in fewer clock cycles. In this example, the operations complete in only two clock cycles.
In a central processing unit (CPU), even this simple code example takes more clock cycles to complete.
In the initial binding phase of this example, high-level synthesis implements the multiplier
operation using a combinational multiplier (Mul) and implements both add operations
using a combinational adder/subtractor (AddSub).
In the target binding phase, high-level synthesis implements both the multiplier and one of
the addition operations using a DSP48 resource. The DSP48 resource is a computational
block available in the FPGA architecture that provides the ideal balance of
high-performance and efficient implementation.
[Figure labels: Clock; scalar ports a, b, c; internal values x, y; input array ports in_data,
in_addr, in_ce; output array ports out_data, out_addr, out_ce, out_we; Finite State Machine (FSM).]
Figure 1-2: Control Logic Extraction and I/O Port Implementation Example
This code example performs the same operations as the previous example. However, it
performs the operations inside a for-loop, and two of the function arguments are arrays.
The resulting design executes the logic inside the for-loop three times when the code is
scheduled. High-level synthesis automatically extracts the control logic from the C code
and creates an FSM in the RTL design to sequence these operations. High-level synthesis
implements the top-level function arguments as ports in the final RTL design. The scalar
variable of type char maps into a standard 8-bit data bus port. Array arguments, such as in
and out, contain an entire collection of data.
In high-level synthesis, arrays are synthesized into block RAM by default, but other options
are possible, such as FIFOs, distributed RAM, and individual registers. When using arrays as
arguments in the top-level function, high-level synthesis assumes that the block RAM is
outside the top-level function and automatically creates ports to access a block RAM
outside the design, such as data ports, address ports, and any required chip-enable or
write-enable signals.
The FSM controls when the registers store data and controls the state of any I/O control
signals. The FSM starts in the state C0. On the next clock, it enters state C1, then state C2,
and then state C3. It returns to state C1 (and C2, C3) a total of three times before returning
to state C0.
Note: This closely resembles the control structure in the C code for-loop. The full sequence of states
is: C0, {C1, C2, C3}, {C1, C2, C3}, {C1, C2, C3}, and return to C0.
The design requires the addition of b and c only one time. High-level synthesis moves the
operation outside the for-loop and into state C0. Each time the design enters state C3, it
reuses the result of the addition.
The design reads the data from in and stores the data in x. The FSM generates the address
for the first element in state C1. Also in state C1, an adder increments a counter to keep track
of how many times the design must iterate around states C1, C2, and C3. In state C2, the
block RAM returns the data for in and stores it as variable x.
The design reads the data from port a and uses it with the other values to perform the calculation
and generate the first y output. The FSM ensures that the correct address and control
signals are generated to store this value outside the block. The design then returns to state
C1 to read the next value from the array/block RAM in. This process continues until all
output is written. The design then returns to state C0 to read the next values of b and c to
start the process again.
[Figure: latency and initiation interval timeline. State operations, in order: Read B and C;
Addr in[], Read in[], Calc out[] (repeated three times); Read B and C. Marked spans: Function
Latency, Function Initiation Interval, Loop Iteration Latency, Loop Iteration Interval, Loop Latency.]
This is the primary input to Vivado HLS. The function can contain a hierarchy of
sub-functions.
• Constraints
Constraints are required and include the clock period, clock uncertainty, and FPGA
target. The clock uncertainty defaults to 12.5% of the clock period if not specified.
• Directives
Directives are optional and direct the synthesis process to implement a specific
behavior or optimization.
Vivado HLS uses the C test bench to simulate the C function prior to synthesis and to
verify the RTL output using C/RTL Cosimulation.
You can add the C input files, directives, and constraints to a Vivado HLS project
interactively using the Vivado HLS graphical user interface (GUI) or using Tcl commands at
the command prompt. You can also create a Tcl file and execute the commands in batch
mode.
This is the primary output from Vivado HLS. Using Vivado synthesis, you can synthesize
the RTL into a gate-level implementation and an FPGA bitstream file. The RTL is available
in the following industry standard formats:
Vivado HLS packages the implementation files as an IP block for use with other tools in
the Xilinx design flow. Using logic synthesis, you can synthesize the packaged IP into an
FPGA bitstream.
• Report files
The following figure shows an overview of the Vivado HLS input and output files.
[Figure 1-4: Vivado HLS input and output files. Recoverable labels: C Simulation, C Synthesis,
RTL Simulation, Packaged IP, Vivado Design Suite, System Generator, Xilinx Platform Studio.]
Test Bench
When using the Vivado HLS design flow, it is time consuming to synthesize a functionally
incorrect C function and then analyze the implementation details to determine why the
function does not perform as expected. To improve productivity, use a test bench to
validate that the C function is functionally correct prior to synthesis.
The C test bench includes the function main() and any sub-functions that are not in the
hierarchy under the top-level function for synthesis. These functions verify that the
top-level function for synthesis is functionally correct by providing stimuli to the function
for synthesis and by consuming its output.
Vivado HLS uses the test bench to compile and execute the C simulation. During the
compilation process, you can select the Launch Debugger option to open a full C-debug
environment, which enables you to analyze the C simulation. For more information on test
benches, see C Test Bench in Chapter 3.
RECOMMENDED: Because Vivado HLS uses the test bench to both verify the C function prior to
synthesis and to automatically verify the RTL output, using a test bench is highly recommended.
Language Support
Vivado HLS supports the following standards for C compilation/simulation:
Vivado HLS supports many C, C++, and SystemC language constructs and all native data
types for each language, including float and double types. However, synthesis is not
supported for some constructs, including:
An FPGA has a fixed set of resources, and the dynamic creation and freeing of memory
resources is not supported.
All data to and from the FPGA must be read from the input ports or written to output
ports. OS operations, such as file read/write or OS queries like time and date, are not
supported. Instead, the C test bench can perform these operations and pass the data
into the function for synthesis as function arguments.
For details on the supported and unsupported C constructs and examples of each of the
main constructs, see Chapter 3, High-Level Synthesis Coding Styles.
Vivado HLS supports the OpenCL API C language constructs and built-in functions from the
OpenCL API C 1.0 embedded profile.
C Libraries
C libraries contain functions and constructs that are optimized for implementation in an
FPGA. Using these libraries helps to ensure high quality of results (QoR), that is, the final
output is a high-performance design that makes optimal use of the resources. Because the
libraries are provided in C, C++, OpenCL API C, or SystemC, you can incorporate the
libraries into the C function and simulate them to verify the functional correctness before
synthesis.
Vivado HLS provides the following C libraries to extend the standard C languages:
For more information on the C libraries provided by Vivado HLS, see Chapter 2, High-Level
Synthesis C Libraries.
C Library Example
C libraries ensure a higher QoR than standard C types. Standard C types are based on 8-bit
boundaries (8-bit, 16-bit, 32-bit, 64-bit). However, when targeting a hardware platform, it is
often more efficient to use data types of a specific width.
For example, a design with a filter function for a communications protocol requires 10-bit
input data and 18-bit output data to satisfy the data transmission requirements. Using
standard C data types, the input data must be at least 16 bits and the output data must be
at least 32 bits. In the final hardware, this creates a datapath between the input and output
that is wider than necessary, uses more resources, has longer delays (for example, a 32-bit
by 32-bit multiplication takes longer than an 18-bit by 18-bit multiplication), and requires
more clock cycles to complete.
Using an arbitrary precision data type in this design instead, you can specify the exact
bit-sizes in the C code prior to synthesis, simulate the updated C code, and
verify the quality of the output using C simulation prior to synthesis. Arbitrary precision
data types are provided for C and C++ and allow you to model data types of any width from
1 to 1024 bits. Some C++ types can be modeled up to 32768 bits. For more
information on arbitrary precision data types, see Data Types for Efficient Hardware.
Note: Arbitrary precision types are only required on the function boundaries, because Vivado HLS
optimizes the internal logic and removes data bits and logic that do not fanout to the output ports.
Following are the synthesis, optimization, and analysis steps in the Vivado HLS design
process:
After analyzing the results, you can create a new solution for the project with different
constraints and optimization directives and synthesize the new solution. You can repeat this
process until the design has the desired performance characteristics. Using multiple
solutions allows you to proceed with development while still retaining the previous results.
Optimization
Using Vivado HLS, you can apply different optimization directives to the design, including:
• Instruct a task to execute in a pipeline, allowing the next execution of the task to begin
before the current execution is complete.
• Specify a latency for the completion of functions, loops, and regions.
• Specify a limit on the number of resources used.
• Override the inherent or implied dependencies in the code and permit specified
operations. For example, if it is acceptable to discard or ignore the initial data values,
such as in a video stream, allow a memory read before write if it results in better
performance.
• Select the I/O protocol to ensure the final design can be connected to other hardware
blocks with the same I/O protocol.
Note: Vivado HLS automatically determines the I/O protocol used by any sub-functions. You
cannot control these ports except to specify whether the port is registered. For more information
on working with I/O interfaces, see Managing Interfaces.
You can use the Vivado HLS GUI to place optimization directives directly into the source
code. Alternatively, you can use Tcl commands to apply optimization directives. For more
information on the various optimizations, see Optimizing the Design.
Analysis
When synthesis completes, Vivado HLS automatically creates synthesis reports to help you
understand the performance of the implementation. In the Vivado HLS GUI, the Analysis
Perspective includes the Performance tab, which allows you to interactively analyze the
results in detail. The following figure shows the Performance tab for the Extracting Control
Logic and Implementing I/O Ports Example.
[Figure 1-5: Performance tab in the Analysis Perspective]
• C0: The first state includes read operations on ports a, b, and c and the addition
operation.
• C1 and C2: The design enters a loop and checks the loop increment counter and exit
condition. The design then reads data into variable x, which requires two clock cycles.
Two clock cycles are required, because the design is accessing a block RAM, requiring
an address in one cycle and a data read in the next.
• C3: The design performs the calculations and writes output to port y. Then, the loop
returns to the start.
The following OpenCL API C kernel code shows a vector addition design where two arrays
of data are summed into a third. The required size of the work group is 16, that is, this kernel
must execute a minimum of 16 times to produce a valid result.
#include <clc.h>
Vivado HLS synthesizes this design into hardware that performs the following:
RTL Verification
If you added a C test bench to the project, you can use it to verify that the RTL is functionally
identical to the original C. The C test bench verifies the output from the top-level function
for synthesis and returns zero to the top-level function main() if the RTL is functionally
identical. Vivado HLS uses this return value for both C simulation and C/RTL co-simulation
to determine if the results are correct. If the C test bench returns a non-zero value, Vivado
HLS reports that the simulation failed.
IMPORTANT: Even if the output data is correct and valid, Vivado HLS reports a simulation failure if the
test bench does not return the value zero to function main().
TIP: For test bench examples that you can use for reference, see Design Examples and References.
Vivado HLS automatically creates the infrastructure to perform the C/RTL co-simulation and
automatically executes the simulation using one of the following supported RTL simulators:
If you select Verilog or VHDL HDL for simulation, Vivado HLS uses the HDL simulator you
specify. The Xilinx design tools include Vivado Simulator. Third-party HDL simulators
require a license from the third-party vendor. The VCS, NCSim, and Riviera HDL simulators
are only supported on the Linux operating system. For more information, see Using C/RTL
Co-Simulation.
RTL Export
Using Vivado HLS, you can export the RTL and package the final RTL output files as IP in any
of the following Xilinx IP formats:
• Vivado IP Catalog
Import into the Vivado IP catalog for use in the Vivado Design Suite.
Import directly into the Vivado Design Suite the same way you import any Vivado
Design Suite checkpoint.
Note: The synthesized checkpoint format invokes logic synthesis and compiles the RTL
implementation into a gate-level implementation, which is included in the IP package.
For all IP formats except the synthesized checkpoint, you can optionally execute logic
synthesis from within Vivado HLS to evaluate the results of RTL synthesis. This optional step
allows you to confirm the estimates provided by Vivado HLS for timing and area before
handing off the IP package. These gate-level results are not included in the packaged IP.
Note: Vivado HLS estimates the timing and area resources based on built-in libraries for each FPGA.
When you use logic synthesis to compile the RTL into a gate-level implementation, perform physical
placement of the gates in the FPGA, and perform routing of the inter-connections between gates,
logic synthesis might make additional optimizations that change the Vivado HLS estimates.
$ vivado_hls
You can use the Quick Start options to perform the following tasks:
You can use the Documentation options to perform the following tasks:
• Tutorials: Opens the Vivado Design Suite Tutorial: High-Level Synthesis (UG871) [Ref 2].
For details on the tutorial examples, see Design Examples and References.
• User Guide: Opens this document, the Vivado Design Suite User Guide: High-Level
Synthesis (UG902).
• Release Notes Guide: Opens the Vivado Design Suite User Guide: Release Notes,
Installation, and Licensing (UG973) [Ref 3] for the latest software version.
The primary controls for using Vivado HLS are shown in the toolbar in the following figure.
Project control ensures only commands that can be currently executed are highlighted. For
example, synthesis must be performed before C/RTL co-simulation can be executed. The
C/RTL co-simulation toolbar buttons remain gray until synthesis completes.
[Figure 1-8: Vivado HLS toolbar]
The next group of toolbar buttons control the tool operation (from left to right):
The final group of toolbar buttons are for design analysis (from left to right):
• Open Report opens the C synthesis report or drops down to open other reports.
• Compare Reports allows the reports from different solutions to be compared.
Each of the buttons on the toolbar has an equivalent command in the menus. In addition,
the Vivado HLS GUI provides three perspectives. When you select a perspective, the windows
automatically adjust to a more suitable layout for the selected task.
Changing between perspectives can be done at any time by selecting the desired
perspective button.
The remainder of this chapter discusses how to use Vivado HLS. The following topics are
discussed:
This chapter ends with a review of the design examples, tutorials, and resources for more
information.
• Project Name: Specifies the project name, which is also the name of the directory in
which the project details are stored.
• Location: Specifies where to store the project.
CAUTION! The Windows operating system has a 260-character limit for path lengths, which can affect
the Vivado tools. To avoid this issue, use the shortest possible names and directory locations when
creating projects, defining IP or managed IP projects, and creating block designs.
• Top Function: Specifies the name of the top-level function to be synthesized. If you
add the C files first, you can use the Browse button to review the C hierarchy, and then
select the top-level function for synthesis. The Browse button remains grayed out until
you add the source files.
Note: This step is not required when the project is specified as SystemC, because Vivado HLS
automatically identifies the top-level functions.
Use the Add Files button to add the source code files to the project.
IMPORTANT: Do not add header files (with the .h suffix) to the project using the Add Files button (or
with the associated add_files Tcl command).
Vivado HLS automatically adds the following directories to the search path:
• Working directory
Note: The working directory contains the Vivado HLS project directory.
• Any directory that contains C files added to the project
Header files that reside in these directories are automatically included in the project. You
must specify the path to all other header files using the Edit CFLAGS button.
The Edit CFLAGS button specifies the C compiler flags options required to compile the C
code. These compiler flag options are the same used in gcc or g++. C compiler flags include
the path name to header files, macro specifications, and compiler directives, as shown in
the following examples:
TIP: For a complete list of supported Edit CFLAGS options, see the Option Summary page
(gcc.gnu.org/onlinedocs/gcc/Option-Summary.html) on the GNU Compiler Collection (GCC) website.
Note: For SystemC designs with header files associated with the test bench but not the design file,
you must use the Add Files button to add the header files to the project.
In most of the example designs provided with Vivado HLS, the test bench is in a separate
file from the design. Having the test bench and the function to be synthesized in separate
files keeps a clean separation between the process of simulation and synthesis. If the test
bench is in the same file as the function to be synthesized, the file should be added as a
source file and, as shown in the next step, a test bench file.
In addition to the C source files, all files read by the test bench must be added to the
project. In the example shown in Figure 1-11, the test bench opens file in.dat to supply
input stimuli to the design and file out.golden.dat to read the expected results.
Because the test bench accesses these files, both files must be included in the project.
If the test bench files exist in a directory, the entire directory can be added to the project,
rather than the individual files, using the Add Folders button.
If there is no C test bench, there is no requirement to enter any information here and the
Next > button opens the final window of the project wizard, which allows you to specify the
details for the first solution, as shown in the following figure.
• Solution Name: Vivado HLS provides the initial default name solution1, but you can
specify any name for the solution.
• Clock Period: The clock period specified in units of ns or a frequency value specified
with the MHz suffix (for example, 150MHz).
• Uncertainty: The clock period used for synthesis is the clock period minus the clock
uncertainty. Vivado HLS uses internal models to estimate the delay of the operations
for each FPGA. The clock uncertainty value provides a controllable margin to account
for any increases in net delays due to RTL logic synthesis, place, and route. If not
specified in nanoseconds (ns) or a percentage, the clock uncertainty defaults to 12.5%
of the clock period.
• Part: Click to select the appropriate technology, as shown in the following figure.
• On the left hand side, the Explorer pane lets you navigate through the project
hierarchy. A similar hierarchy exists in the project directory on the disk.
• In the center, the Information pane displays files. Files can be opened by
double-clicking on them in the Explorer Pane.
• On the right, the Auxiliary pane shows information relevant to whatever file is open in
the Information pane.
• At the bottom, the Console Pane displays the output when Vivado HLS is running.
• Pre-synthesis validation that validates the C program correctly implements the required
functionality.
• Post-synthesis verification that verifies the RTL is correct.
Before synthesis, the function to be synthesized should be validated with a test bench using
C simulation. A C test bench includes a top-level function main() and the function to be
synthesized. It might include other functions. An ideal test bench has the following
attributes:
• The test bench is self-checking and verifies the results from the function to be
synthesized are correct.
• If the results are correct, the test bench returns a value of 0 to main(). Otherwise, the
test bench returns a non-zero value.
Vivado HLS synthesizes an OpenCL API C kernel. To simulate an OpenCL API C kernel, you
must use a standard C test bench. You cannot use the OpenCL API C host code as the C test
bench. For more information on test benches, see C Test Bench in Chapter 3.
Clicking the Run C Simulation toolbar button opens the C Simulation Dialog box,
shown in the following figure.
[Figure 1-15: C Simulation dialog box]
When the C code simulates successfully, the console window displays a message, as shown in the
following figure. The test bench echoes to the console any printf commands used with the
message “Test Passed!”
[Figure 1-16: console output after a successful C simulation]
• Launch Debugger: This compiles the C code and automatically opens the debug
perspective. From within the debug perspective the Synthesis perspective button (top
left) can be used to return the windows to synthesis perspective.
• Build Only: The C code compiles, but the simulation does not run. Details on executing
the C simulation are covered in Reviewing the Output of C Simulation.
• Clean Build: Remove any existing executable and object files from the project before
compiling the code.
• Optimized Compile: By default the design is compiled with debug information,
allowing the compilation to be analyzed in the debug perspective. This option uses a
higher level of optimization effort when compiling the design but removes all
information required by the debugger. This increases the compile time but should
reduce the simulation run time.
If you select the Launch Debugger option, the windows automatically switch to the debug
perspective and the debug environment opens as shown in the following figure. This is a
full-featured C debug environment. The step buttons (red box in the following figure) allow
you to step through the code. You can also set breakpoints and directly view the values of
variables.
[Figure 1-17: debug perspective]
TIP: Click the Synthesis perspective button to return to the standard synthesis windows.
• Any files read by the test bench are copied to this folder.
• The C executable file csim.exe is created and run in this folder.
• Any files written by the test bench are created in this folder.
If the Build Only option is selected in the C simulation dialog box, the file csim.exe is
created in this folder but the file is not executed. The C simulation is run manually by
executing this file from a command shell. On Windows the Vivado HLS command shell is
available through the start menu.
The next step in the Vivado HLS design flow is to execute synthesis.
The messages include information messages showing how the synthesis process is
proceeding:
The messages also provide details on the synthesis process. The following example shows a
case where some functions are automatically inlined. Vivado HLS automatically inlines
functions that contain small amounts of logic. (If required, use the INLINE directive with the
-off option to prevent this.)
INFO: [XFORM 602] Inlining function 'read_data' into 'dct' (dct.cpp:85) automatically.
INFO: [XFORM 602] Inlining function 'write_data' into 'dct' (dct.cpp:90) automatically.
When synthesis completes, the synthesis report for the top-level function opens
automatically in the information pane as shown in the following figure.
[Figure 1-19: synthesis report]
The report folder contains a report file for the top-level function and one for every
sub-function in the design, provided the function was not inlined using the INLINE directive
or inlined automatically by Vivado HLS. The report for the top-level function provides
details on the entire design.
The verilog, vhdl, and systemc folders contain the output RTL files. Figure 1-20 shows
the verilog folder expanded. The top-level file has the same name as the top-level
function for synthesis. In the C design there is one RTL file for each function (not inlined).
There might be additional RTL files to implement sub-blocks (block RAM, pipelined
multipliers, etc.).
IMPORTANT: Xilinx does not recommend using these files for RTL synthesis. Instead, Xilinx
recommends using the packaged IP output files discussed later in this design flow. Carefully read the
text that immediately follows this note.
In cases where Vivado HLS uses Xilinx IP in the design, such as with floating point designs,
the RTL directory includes a script to create the IP during RTL synthesis. If the files in the
syn folder are used for RTL synthesis, it is your responsibility to correctly use any script files
present in those folders. If the packaged IP is used, this process is performed automatically
by the Xilinx design tools.
• Synthesis reports
• Analysis Perspective
In addition, if you are more comfortable working in an RTL environment, Vivado HLS creates
two projects during the IP packaging process:
Synthesis Reports
When synthesis completes, the synthesis report for the top-level function opens
automatically in the information pane (Figure 1-19). The report provides details on both the
performance and area of the RTL design. The outline tab on the right-hand side can be used
to navigate through the report.
The latency is the number of cycles it takes to produce the output. The
initiation interval is the number of clock cycles before new inputs can be
applied.
In the absence of any PIPELINE directives, the latency is one cycle less than
the initiation interval (the next input is read when the final output is
written).
Performance Estimates > Latency > Detail: The latency and initiation interval for the
instances (sub-functions) and loops in this block. If any loops contain sub-loops, the loop
hierarchy is shown.
The min and max latency values indicate the latency to execute all iterations
of the loop. The presence of conditional branches in the code might make
the min and max different.
The Iteration Latency is the latency for a single iteration of the loop.
If the loop has a variable latency, the latency values cannot be determined
and are shown as a question mark (?). See the text after this table.
Any specified target initiation interval is shown beside the actual initiation
interval achieved.
If the design has no RTL hierarchy, there are no instances reported.
If any instances are present, clicking on the name of the instance opens the
synthesis report for that instance.
Vivado HLS reports a single-port BRAM as using one bank of memory and
reports a dual-port BRAM as using two banks of memory.
Utilization Estimates > Details > FIFO: The resources listed here are those used in the
implementation of any FIFOs implemented at this level of the hierarchy.
Utilization Estimates > Details > Shift Register: A summary of all shift registers mapped
into Xilinx SRL components.
Additional mapping into SRL components can occur during RTL synthesis.
Utilization Estimates > Details > Expressions: This category shows the resources used by
any expressions such as multipliers, adders, and comparators at the current level of hierarchy.
The RTL port names are grouped with their protocol and source object:
these are the RTL ports created when that source object is synthesized with
the stated I/O protocol.
Certain Xilinx devices use stacked silicon interconnect (SSI) technology. In these devices, the
total available resources are divided over multiple super logic regions (SLRs). When you
select an SSI technology device as the target technology, the utilization report includes
details on both the SLR usage and the total device usage.
IMPORTANT: When using SSI technology devices, it is important to ensure that the logic created by
Vivado HLS fits within a single SLR. For information on using SSI technology devices, see Managing
Interfaces with SSI Technology Devices.
A common issue for new users of Vivado HLS is seeing a synthesis report similar to the
following figure. The latency values are all shown as a “?” (question mark).
Figure 1-21
In the following example, the maximum iteration of the for-loop is determined by the value
of input num_samples. The value of num_samples is not defined in the C function, but
comes into the function from the outside.
If the latency or throughput of the design is dependent on a loop with a variable index,
Vivado HLS reports the latency of the loop as being unknown (represented in the reports by
a question mark “?”).
The TRIPCOUNT directive can be applied to the loop to manually specify the number of
loop iterations and ensure the report contains useful numbers. The -max option tells
Vivado HLS the maximum number of iterations the loop performs, the -min option
specifies the minimum number of iterations, and the -avg option specifies an
average tripcount.
Note: The TRIPCOUNT directive does not impact the results of synthesis.
The tripcount values are used only for reporting, to ensure the reports generated by Vivado
HLS show meaningful ranges for latency and interval. This also allows a meaningful
comparison between different solutions.
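As a minimal sketch of this situation, the following function has a loop whose bound is only known at run time, with the TRIPCOUNT directive written as a pragma. The function name accumulate, the data argument, and the values 16, 1024, and 512 are assumptions chosen for illustration; only num_samples comes from the text above.

```cpp
// Hypothetical function: the loop bound num_samples is only known at run time.
int accumulate(const int *data, int num_samples) {
    int result = 0;
loop_1:
    for (int i = 0; i < num_samples; i++) {
// Reporting hint only: assumes callers pass between 16 and 1024 samples.
#pragma HLS LOOP_TRIPCOUNT min=16 max=1024 avg=512
        result += data[i];
    }
    return result;
}
```

A standard C++ compiler ignores the pragma, so the function behaves identically in C simulation with or without it.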
If the C assert macro is used in the code, Vivado HLS can use it to both determine the loop
limits automatically and create hardware that is exactly sized to these limits. See Assertions
in Chapter 3 for more information.
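A minimal sketch of the assert approach is shown below. The function name sum_first, the argument xlimit, and the limit of 32 are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical function: xlimit arrives from outside, so the loop bound is variable.
int sum_first(const int data[32], int xlimit) {
    // States that the loop runs at most 32 times. Per the text above, the tool
    // can use such an assertion to bound the loop.
    assert(xlimit >= 0 && xlimit <= 32);
    int sum = 0;
    for (int i = 0; i < xlimit; i++) {
        sum += data[i];
    }
    return sum;
}
```

Unlike TRIPCOUNT, the assertion is also checked during C simulation, so a test bench that violates the stated limit fails early.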
Analysis Perspective
In addition to the synthesis report, you can use the Analysis Perspective to analyze the
results. To open the Analysis Perspective, click the Analysis button as shown in the
following figure.
Figure 1-22
The Module Hierarchy pane provides an overview of the entire RTL design.
The following figure shows the dct design uses 6 block RAMs, approximately 300 LUTs and
has a latency of around 3000 clock cycles. Sub-block dct_2d contributes 4 block RAMs,
approximately 250 LUTs and about 2600 cycles of latency to the total. It is immediately clear
that most of the resources and latency in this design are due to sub-block dct_2d and this
block should be analyzed first.
The Performance Profile pane provides details on the performance of the block currently
selected in the Module Hierarchy pane, in this case, the dct block highlighted in the Module
Hierarchy pane.
• The performance of the block is a function of the sub-blocks it contains and any logic
within this level of hierarchy. The Performance Profile pane shows items at this level of
hierarchy that contribute to the overall performance.
• Performance is measured in terms of latency and the initiation interval. This pane also
includes details on whether the block was pipelined or not.
• In this example, you can see that two loops (RD_Loop_Row and WR_Loop_Row) are
implemented as logic at this level of hierarchy and both contain sub-loops and both
contribute 144 clock cycles to the latency. Add the latency of both loops to the latency
of dct_2d which is also inside dct and you get the total latency for the dct block.
The Schedule View pane shows how the operations in this particular block are scheduled
into clock cycles. The default view is the Performance view.
° A loop called RD_Loop_Row. In Figure 1-23 the loop hierarchy for loop
RD_Loop_Row has been expanded.
° A loop called WR_Loop_Row. The plus symbol “+” indicates this loop has hierarchy
and the loop can be expanded to view it.
• The top row lists the control states in the design. Control states are the internal states
used by Vivado HLS to schedule operations into clock cycles. There is a close
correlation between the control states and the final states in the RTL FSM, but there is
no one-to-one mapping.
The information presented in the Schedule View is explained here by reviewing the first set
of resources to be executed: the RD_Loop_Row loop.
° The Performance Profile pane indicates it takes 16 clock cycles to execute all
operations of loop RD_Loop_Cols.
° Plus a clock cycle to return to the start of loop RD_Loop_Row for a total of 18 cycles
per loop iteration.
The following figure shows that you can select an operation and right-click the mouse to
open the associated variable in the source code view. You can see that the write operation
is implementing the writing of data into the buf array from the input array variable.
Figure 1-24
The Analysis Perspective also allows you to analyze resource usage. The following figure
shows the resource profile and the resource panes.
Figure 1-25
The Resource Profile pane shows the resources used at this level of hierarchy. In this
example, you can see that most of the resources are due to the instances: blocks that are
instantiated inside this block.
You can see by expanding the Expressions that most of the resources at this level of
hierarchy are used to implement adders.
The Resource pane shows the control state of the operations used. In this example, all the
adder operations are associated with a different adder resource. There is no sharing of the
adders. More than one add operation on each horizontal line indicates the same resource is
used multiple times in different states or clock cycles.
The adders are used in the same cycles in which the memories are accessed, and each adder
is dedicated to one memory. Cross-correlation with the C code can be used to confirm this.
The Analysis Perspective is a highly interactive feature. More information on the Analysis
Perspective can be found in the Design Analysis section of the Vivado Design Suite Tutorial:
High-Level Synthesis (UG871) [Ref 2].
TIP: Remember, even if a Tcl flow is used to create designs, the project can still be opened in the GUI
and the Analysis Perspective used to analyze the design.
Generally after design analysis you can create a new solution to apply optimization
directives. Using a new solution for this allows the different solutions to be compared.
Use the New Solution toolbar button or the menu Project > New Solution to create
a new solution. This opens the Solution Wizard as shown in the following figure.
Figure 1-26
The Solution Wizard has the same options as the final window in the New Project wizard
(Figure 1-12) plus an additional option that allows any directives and custom constraints
applied to an existing solution to be conveniently copied to the new solution, where they
can be modified or removed.
After the new solution has been created, optimization directives can be added (or modified
if they were copied from the previous solution). The next section explains how directives
can be added to solutions. Custom constraints are applied using the configuration options
and are discussed in Optimizing the Design.
Note: To apply directives to objects in other C files, you must open the file and make it active in the
Information pane.
Although you can select objects in the Vivado HLS GUI and apply directives, Vivado HLS
applies all directives to the scope that contains the object. For example, you can apply an
INTERFACE directive to an interface object in the Vivado HLS GUI. Vivado HLS applies the
directive to the top-level function (scope), and the interface port (object) is identified in the
directive. In the following example, port data_in on function foo is specified as an
AXI4-Lite interface:
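Written as a pragma in the source, the same idea can be sketched as follows. The function body and return type are assumptions for illustration; only the names foo and data_in come from the text.

```cpp
// Hypothetical top-level function; only the names foo and data_in come from the text.
int foo(int data_in) {
// The directive sits in the top-level function (the scope) and names the port (the object).
#pragma HLS INTERFACE s_axilite port=data_in
    return data_in + 1;
}
```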
You can apply optimization directives to the following objects and scopes:
• Interfaces
When you apply directives to an interface, Vivado HLS applies the directive to the
top-level function, because the top-level function is the scope that contains the
interface.
• Functions
When you apply directives to functions, Vivado HLS applies the directive to all objects
within the scope of the function. The effect of any directive stops at the next level of
function hierarchy. The only exception is a directive that supports or uses a recursive
option, such as the PIPELINE directive that recursively unrolls all loops in the hierarchy.
• Loops
When you apply directives to loops, Vivado HLS applies the directive to all objects
within the scope of the loop. For example, if you apply a LOOP_MERGE directive to a
loop, Vivado HLS applies the directive to any sub-loops within the loop but not to the
loop itself.
Note: The loop to which the directive is applied is not merged with siblings at the same level of
hierarchy.
• Arrays
When you apply directives to arrays, Vivado HLS applies the directive to the scope that
contains the array.
• Regions
When you apply directives to regions, Vivado HLS applies the directive to the entire
scope of the region. A region is any area enclosed within two braces. For example:
{
the scope between these braces is a region
}
Note: You can apply directives to a region in the same way you apply directives to functions and
loops.
To apply a directive, select an object in the Directives tab, right-click, and select Insert
Directive to open the Directives Editor dialog box. From the drop-down menu, select the
appropriate directive. The drop-down menu only shows directives that you can add to the
selected object or scope. For example, if you select an array object, the drop-down menu
does not show the PIPELINE directive, because an array cannot be pipelined. The following
figure shows the addition of the DATAFLOW directive to the DCT function.
Figure 1-28
In the Vivado HLS Directive Editor dialog box, you can specify either of the following
Destination settings:
• Directive File: Vivado HLS inserts the directive as a Tcl command into the file
directives.tcl in the solution directory.
• Source File: Vivado HLS inserts the directive directly into the C source file as a pragma.
The following table describes the advantages and disadvantages of both approaches.
The following figure shows the DATAFLOW directive being added to the Directive File. The
directives.tcl file is located in the solution constraints folder and opened in the
Information pane using the resulting Tcl command.
When directives are applied as a Tcl command, the Tcl command specifies the scope or the
scope and object within that scope. In the case of loops and regions, the Tcl command
requires that these scopes be labeled. If the loop or region does not currently have a label,
a pop-up dialog box asks for a label and assigns a default name for it.
The following shows examples of labeled and unlabeled loops and regions.
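A minimal sketch of the four cases is given below. The function name label_demo and the loop bodies are assumptions for illustration; the labels My_For_Loop and My_Region are hypothetical names.

```cpp
int label_demo() {
    int count = 0;
    // Loop without a label
    for (int i = 0; i < 3; i++) {
        count++;
    }
    // Loop with the label My_For_Loop
My_For_Loop:
    for (int i = 0; i < 3; i++) {
        count++;
    }
    // Region without a label
    {
        count += 10;
    }
    // Region with the label My_Region
My_Region: {
        count += 100;
    }
    return count;
}
```

A Tcl directive can refer to My_For_Loop and My_Region by name, whereas the unlabeled loop and region must first be given a label.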
TIP: Named loops allow the synthesis report to be easily read. An auto-generated label is assigned to
loops without a label.
The following figure shows the DATAFLOW directive added to the Source File and the
resultant source code open in the information pane. The source code now contains a
pragma which specifies the optimization directive.
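As a rough sketch of what such source code can look like, the following function carries the DATAFLOW directive as a pragma placed inside the function body. The function names, the array size, and the three-stage structure are assumptions for illustration and are not the dct design from the figure.

```cpp
const int N = 8;

static void read_data(const int in[N], int buf[N]) {
    for (int i = 0; i < N; i++) buf[i] = in[i];
}

static void scale(const int buf[N], int scaled[N]) {
    for (int i = 0; i < N; i++) scaled[i] = buf[i] * 2;
}

static void write_data(const int scaled[N], int out[N]) {
    for (int i = 0; i < N; i++) out[i] = scaled[i];
}

// Hypothetical top-level function with three sequential tasks.
void dct_like(const int in[N], int out[N]) {
// The pragma is placed inside the scope it applies to.
#pragma HLS DATAFLOW
    int buf[N];
    int scaled[N];
    read_data(in, buf);
    scale(buf, scaled);
    write_data(scaled, out);
}
```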
In both cases, the directive is applied and the optimization performed when synthesis is
executed. If the code was modified, either by inserting a label or pragma, a pop-up dialog
box reminds you to save the code before synthesis.
A complete list of all directives and custom constraints can be found in Optimizing the
Design. For information on directives and custom constraints, see Chapter 4, High-Level
Synthesis Reference Guide.
Directives can only be applied to scopes or objects within a scope. As such, they cannot be
directly applied to global variables which are declared outside the scope of any function.
To apply a directive to a global variable, apply the directive to the scope (function, loop or
region) where the global variable is used. Open the directives tab on a scope where the
variable is used, apply the directive and enter the variable name manually in the Directives
Editor.
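When the directive is written as a pragma instead, the same rule can be sketched as follows: the pragma is placed in a function that uses the global variable and names that variable explicitly. The array coeffs, the function weighted_sum, and the choice of the ARRAY_PARTITION directive are assumptions for illustration.

```cpp
// Global array: declared outside any function, so it has no enclosing scope of its own.
static const int coeffs[4] = {1, 2, 3, 4};

int weighted_sum(const int x[4]) {
// Applied in the function that uses the global; the variable is named explicitly.
#pragma HLS ARRAY_PARTITION variable=coeffs complete dim=1
    int acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += coeffs[i] * x[i];
    }
    return acc;
}
```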
Optimization directives can be also applied to objects or scopes defined in a class. The
difference is typically that classes are defined in a header file. Use one of the following
actions to open the header file:
• From the Explorer pane, open the Includes folder, navigate to the header file, and
double-click the file to open it.
• From within the C source, place the cursor over the header file name in the #include
statement, hold down the Ctrl key, and click the header file to open it.
The directives tab is then populated with the objects in the header file and directives can be
applied.
CAUTION! Care should be taken when applying directives as pragmas to a header file. The file might be
used by other people or used in other projects. Any directives added as a pragma are applied each time
the header file is included in a design.
To apply optimization directives manually on templates when using Tcl commands, specify
the template arguments and class when referring to class methods. For example, given the
following C++ code:
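A class of roughly the following shape is assumed here. Only the names DES10, SIZE, RATE, and calcRUN are taken from the text; the parameter types, the return type, and the method body are hypothetical.

```cpp
// Hypothetical class: only the names DES10, SIZE, RATE and calcRUN appear in the text.
template <unsigned SIZE, unsigned RATE>
class DES10 {
public:
    unsigned calcRUN();
};

// The method is referred to with its class and template arguments.
template <unsigned SIZE, unsigned RATE>
unsigned DES10<SIZE, RATE>::calcRUN() {
    return SIZE * RATE;
}
```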
The following Tcl command is used to specify the INLINE directive on the function:
set_directive_inline DES10<SIZE,RATE>::calcRUN
Pragma directives do not natively support the use of values specified by the define
statement. The following code seeks to specify the depth of a stream using the define
statement and will not compile.
#include <hls_stream.h>
using namespace hls;
#define STREAM_IN_DEPTH 8
// Illegal pragma
#pragma HLS stream depth=STREAM_IN_DEPTH variable=InStream
// Legal pragma
#pragma HLS stream depth=8 variable=OutStream
You can use macros in the C code to implement this functionality. The key to using macros
is to use a level of hierarchy in the macro. This allows the expansion to be correctly
performed. The code can be made to compile as follows:
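#include <hls_stream.h>
using namespace hls;
// Two-level macro so that STREAM_IN_DEPTH is expanded before _Pragma receives it
#define PRAGMA_SUB(x) _Pragma (#x)
#define PRAGMA_HLS(x) PRAGMA_SUB(x)
#define STREAM_IN_DEPTH 8
// Legal pragmas
PRAGMA_HLS(HLS stream depth=STREAM_IN_DEPTH variable=InStream)
#pragma HLS stream depth=8 variable=OutStream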
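The effect of the extra macro level can be observed with any standard C++ compiler by turning the pragma text into a string instead of passing it to _Pragma. This is a minimal sketch; the macro names TEXT_DIRECT, TEXT_SUB, and TEXT_EXPANDED and the helper mentions_depth_8 are hypothetical.

```cpp
#include <cstring>

#define STREAM_IN_DEPTH 8

// One level: # stringizes the argument as written, so the name is not expanded.
#define TEXT_DIRECT(x) #x
// Two levels: the outer macro expands STREAM_IN_DEPTH first, then the inner one stringizes.
#define TEXT_SUB(x) #x
#define TEXT_EXPANDED(x) TEXT_SUB(x)

static const char *direct_text =
    TEXT_DIRECT(HLS stream depth=STREAM_IN_DEPTH variable=InStream);
static const char *expanded_text =
    TEXT_EXPANDED(HLS stream depth=STREAM_IN_DEPTH variable=InStream);

// True when the text carries the numeric depth rather than the macro name.
bool mentions_depth_8(const char *text) {
    return std::strstr(text, "depth=8") != nullptr;
}
```

Only expanded_text contains the numeric depth, which is why the pragma must be generated through two macro levels.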
When optimization directives are applied, Vivado HLS outputs information to the console
(and log file) detailing the progress. In the following example the PIPELINE directive was
applied to the C function with an II=1 (initiation interval of 1) but synthesis failed to satisfy
this objective.
IMPORTANT: If Vivado HLS fails to satisfy an optimization directive, it automatically relaxes the
optimization target and seeks to create a design with a lower performance target. If it cannot relax the
target, it will halt with an error.
By seeking to create a design which satisfies a lower optimization target, Vivado HLS is able
to provide three important types of information:
• What target performance can be achieved with the current C code and optimization
directives.
• A list of the reasons why it was unable to satisfy the higher performance target.
• A design which can be analyzed to provide more insight and help understand the
reason for the failure.
In message SCHED-69, the reason given for failing to reach the target II is limited
ports. The design must access a block RAM, and a block RAM only has a maximum of two
ports.
The next step after a failure such as this is to analyze what the issue is. In this example,
analyze line 52 of the code and/or use the Analysis perspective to determine the bottleneck
and if the requirement for more than two ports can be reduced or determine how the
number of ports can be increased. More details on how to optimize designs for higher
performance are provided in Optimizing the Design.
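The kind of code that runs into a port limit can be sketched as follows. The function window_sum, the array buf, and the use of ARRAY_PARTITION as a remedy are assumptions for illustration. Whether the target II is actually met depends on the design and the tool.

```cpp
const int N = 16;

// Hypothetical function: three reads of buf per iteration, which is more
// accesses per cycle than a two-port block RAM offers.
void window_sum(const int in[N], int out[N - 2]) {
    int buf[N];
// One possible remedy (an assumption, not the only fix): split buf into separate
// elements so that the reads in one iteration no longer compete for the same ports.
#pragma HLS ARRAY_PARTITION variable=buf complete dim=1
    for (int i = 0; i < N; i++) {
        buf[i] = in[i];
    }
    for (int i = 0; i < N - 2; i++) {
#pragma HLS PIPELINE II=1
        out[i] = buf[i] + buf[i + 1] + buf[i + 2];
    }
}
```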
After the design is optimized and the desired performance achieved, the RTL can be verified
and the results of synthesis packaged as IP.
The C/RTL co-simulation dialog box shown in the following figure allows you to select
which type of RTL output to use for verification (Verilog or VHDL) and which HDL simulator
to use for the simulation.
A complete description of all C/RTL co-simulation options is provided in Verifying the RTL.
Figure 1-31
When verification completes, the console displays message SIM-1000 to confirm the
verification was successful. The output of any printf commands in the C test bench is
echoed to the console.
The simulation report opens automatically in the Information pane, showing the pass or fail
status and the measured statistics on latency and II.
In pipelined designs, the design might read new inputs before the first transaction
completes, and there might be multiple ap_start and ap_ready signals before a
transaction completes. In this case, C/RTL cosimulation measures the latency as the number
of cycles between data input values and data output values. The II is the number of cycles
between ap_ready signals, which the design uses to request new inputs.
Note: For pipelined designs, the II value for C/RTL cosimulation is only valid if the design is
simulated for multiple transactions.
IMPORTANT: The C/RTL co-simulation only passes if the C test bench returns a value of zero.
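A self-checking test bench that follows this rule can be sketched as shown below. The names design_under_test and run_test_bench are hypothetical. In a real test bench the checking logic lives in main(), and its return value is what the tool inspects.

```cpp
// Stand-in for the function under test (hypothetical).
int design_under_test(int x) {
    return 2 * x;
}

// Body of a self-checking test bench. In a real project this logic lives in main().
int run_test_bench() {
    int errors = 0;
    for (int x = 0; x < 8; x++) {
        int expected = x + x;  // golden reference value
        if (design_under_test(x) != expected) {
            errors++;
        }
    }
    // Zero reports a pass; any other value reports a failure.
    return errors == 0 ? 0 : 1;
}
```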
• The report folder contains the report and log file for each type of RTL simulated.
• A verification folder is created for each type of RTL which is verified. The verification
folder is named verilog or vhdl. If an RTL format is not verified, no folder is created.
• The RTL files used for simulation are stored in the verification folder.
• The RTL simulation is executed in the verification folder.
• Any outputs, such as trace files, are written to the verification folder.
• Folders autowrap, tv, wrap and wrap_pc are work folders used by Vivado HLS. There
are no user files in these folders.
If the Setup Only option was selected in the C/RTL Co-Simulation dialog box, an
executable is created in the verification folder but the simulation is not run. The simulation
can be manually run by executing the simulation executable at the command prompt.
Note: For more information on the RTL verification process, see Verifying the RTL.
Packaging the IP
The final step in the Vivado HLS design flow is to package the RTL output as IP. Use the
Export RTL toolbar button or the menu Solution > Export RTL to open the Export RTL
dialog box shown in the following figure.
• The report folder. If the evaluate option is selected, the synthesis report for Verilog
and VHDL synthesis is placed in this folder.
• The verilog folder. This contains the Verilog format RTL output files. If the evaluate
option is selected, RTL synthesis is performed in this folder.
• The vhdl folder. This contains the VHDL format RTL output files. If the evaluate option
is selected, RTL synthesis is performed in this folder.
IMPORTANT: Xilinx does not recommend directly using the files in the verilog or vhdl folders for
your own RTL synthesis project. Instead, Xilinx recommends using the packaged IP output files
discussed next. Please carefully read the text that immediately follows this note.
In cases where Vivado HLS uses Xilinx IP in the design, such as with floating point designs,
the RTL directory includes a script to create the IP during RTL synthesis. If the files in the
verilog or vhdl folders are copied out and used for RTL synthesis, it is your responsibility
to correctly use any script files present in those folders. If the packaged IP is used, this
process is performed automatically by the Xilinx design tools.
The Format Selection drop-down determines which other folders are created. The following
table details the folders created and their content.
The Export RTL process automatically creates a Vivado RTL project. For hardware designers
more familiar with RTL design and working in the Vivado RTL environment, this provides a
convenient way to analyze the RTL.
As shown in Figure 1-34 a project.xpr file is created in the verilog and vhdl folders. This
file can be used to directly open the RTL output inside the Vivado Design Suite.
If C/RTL co-simulation has been executed in Vivado HLS, the Vivado project contains an RTL
test bench and the design can be simulated.
Note: The Vivado RTL project has the RTL output from Vivado HLS as the top-level design. Typically,
this design should be incorporated as IP into a larger Vivado RTL project. This Vivado project is
provided solely as a means for design analysis and is not intended as a path to implementation.
To create the IP Integrator project, execute the ipi_example.* file at the command
prompt then open the Vivado IPI project file which is created.
• By default, only the current active solution is archived. To ensure all solutions are
archived, deselect the Active Solution Only option.
• By default, the archive contains all of the output results from the archived solutions. If
you want to archive the input files only, deselect the Include Run Results option.
On Windows and Linux, using the -i option with the vivado_hls command opens Vivado
HLS in interactive mode. Vivado HLS then waits for Tcl commands to be entered.
vivado_hls>
By default, Vivado HLS creates a vivado_hls.log file in the current directory. To specify
a different name for the log file, the -l <log_file> option can be used.
The help command is used to access documentation on the commands. A complete list of
all commands is provided using:
vivado_hls> help
Any command or command option can be completed using the auto-complete feature.
After a single character has been specified, pressing the tab key causes Vivado HLS to list
the possible options to complete the command or command option. Entering more
characters improves the filtering of the possible options. For example, pressing the tab key
after typing “open” lists all commands that start with “open”.
Pressing the Tab key after typing open_p auto-completes the command open_project,
because there are no other possible options.
Type the exit command to quit interactive mode and return to the shell prompt:
vivado_hls> exit
Commands embedded in a Tcl script are executed in batch mode with the -f
<script_file> option.
$ vivado_hls -f script.tcl
All the Tcl commands for creating a project in the GUI are stored in the script.tcl file within
the solution. If you wish to develop Tcl batch scripts, the script.tcl file is an ideal
starting point.
The following figure shows that either the Linux ls command or the DOS dir
command can be used to list the contents of a directory.
Figure 1-35
Vivado HLS schedules operations hierarchically. The operations within a loop are scheduled
first, then the loop, and then the sub-functions and operations within a function. Run time for
Vivado HLS increases when:
Vivado HLS schedules objects. Whether the object is a floating-point multiply operation or
a single register, it is still an object to be scheduled. The floating-point multiply may take
multiple cycles to complete and use many resources to implement but at the level of
scheduling it is still one object.
Unrolling loops and partitioning arrays creates more objects to schedule and potentially
increases the run time. Inlining functions creates more objects to schedule at this level of
hierarchy and also increases run time. These optimizations may be required to meet
performance but be very careful about simply partitioning all arrays, unrolling all loops and
inlining all functions: you can expect a run time increase. Use the optimization strategies
provided earlier and judiciously apply these optimizations.
If the loops must be unrolled, or if the use of the PIPELINE directive in the hierarchy above
has automatically unrolled the loops, consider capturing the loop body as a separate
function. This will capture all the logic into one function instead of creating multiple copies
of the logic when the loop is unrolled: one set of objects in a defined hierarchy will be
scheduled faster. Remember to pipeline this function if the unrolled loop is used in a
pipelined region.
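A minimal sketch of this restructuring follows. The names mac and dot_product and the constant TAPS are hypothetical, and the INLINE off pragma is an added assumption intended to keep the small function as a separate level of hierarchy.

```cpp
const int TAPS = 4;

// Loop body captured as its own function (hypothetical name).
static int mac(int acc, int a, int b) {
// Assumption: prevent the small function from being dissolved into the caller.
#pragma HLS INLINE off
// Pipelined because the unrolled loop may sit inside a pipelined region.
#pragma HLS PIPELINE II=1
    return acc + a * b;
}

int dot_product(const int a[TAPS], const int b[TAPS]) {
    int acc = 0;
    for (int i = 0; i < TAPS; i++) {
#pragma HLS UNROLL
        acc = mac(acc, a[i], b[i]);
    }
    return acc;
}
```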
The degrees of freedom in the code can also impact run time. Consider Vivado HLS to be an
expert designer who by default is given the task of finding the design with the highest
throughput, lowest latency and minimum area. The more constrained Vivado HLS is, the
fewer options it has to explore and the faster it will run. Consider using latency constraints
over scopes within the code: loops, functions or regions. Setting a LATENCY directive with
the same minimum and maximum values reduces the possible optimization searches within
that scope.
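As an illustration, a LATENCY directive with equal minimum and maximum values can be written as a pragma inside the scope it constrains. The function scale_and_offset and the value 4 are assumptions chosen for the sketch, not recommended settings.

```cpp
// Hypothetical function with a latency constraint on the loop body.
int scale_and_offset(const int in[8], int out[8]) {
    int total = 0;
    for (int i = 0; i < 8; i++) {
// Equal min and max leave the scheduler a single latency target for this scope.
#pragma HLS LATENCY min=4 max=4
        out[i] = in[i] * 3 + 1;
        total += out[i];
    }
    return total;
}
```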
Finally, the config_schedule configuration controls the effort level used during
scheduling. This generally has less impact than the techniques mentioned above, but it is
worth considering. The default strategy is set to Medium.
If this setting is set to Low, Vivado HLS will reduce the amount of time it spends on trying
to improve on the initial result. In some cases, especially if there are many operations and
hence combinations to explore, it may be worth using the low setting. The design may not
be ideal but it may satisfy the requirements and be very close to the ideal. You can proceed
to make progress with the low setting and then use the default setting before you create
your final result.
With a run strategy set to High, Vivado HLS uses additional CPU cycles and memory, even
after satisfying the constraints, to determine if it can create an even smaller or faster design.
This exploration may, or may not, result in a better quality design but it does take more time
and memory to complete. For designs that are just failing to meet their goals or for designs
where many different optimization combinations are possible, this could be a useful
strategy. In general, it is a better practice to leave the run strategies at the Medium default
setting.
Tutorials
Tutorials are available in the Vivado Design Suite Tutorial: High-Level Synthesis (UG871)
[Ref 2]. The following table shows a list of the tutorial exercises.
Design Examples
To open the Vivado HLS design examples from the Welcome Page, click Open Example
Project. In the Examples wizard, select a design from the Design Examples folder.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
You can also open the design examples directly from the Vivado Design Suite installation
area: Vivado_HLS\2015.x\examples\design.
Coding Examples
The Vivado HLS coding examples provide examples of various coding techniques. These are
small examples intended to highlight the results of Vivado HLS synthesis on various C, C++,
and SystemC constructs.
To open the Vivado HLS coding examples from the Welcome Page, click Open Example
Project. In the Examples wizard, select a design from the Coding Style Examples folder.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
You can also open the coding examples directly from the Vivado Design Suite installation
area: Vivado_HLS\2015.x\examples\coding.
The advantage of arbitrary precision data types is that they allow the C code to be updated
to use variables with smaller bit-widths and then for the C simulation to be re-executed to
validate the functionality remains identical or acceptable. The smaller bit-widths result in
hardware operators which are in turn smaller and faster. This in turn allows more logic to
be placed in the FPGA and for the logic to execute at higher clock frequencies.
#include "types.h"
The data types dinA_t, dinB_t etc. are defined in the header file types.h. It is highly
recommended to use a project wide header file such as types.h as this allows for the easy
migration from standard C types to arbitrary precision types and helps in refining the
arbitrary precision types to the optimal size.
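A project-wide header of this kind can be sketched as follows, shown here together with a function that uses it. The type names dinA_t and dinB_t come from the text; dout1_t, the chosen standard C types, and the function multiply are assumptions for illustration.

```cpp
// types.h (sketch): the one place to edit when moving to narrower types.
typedef char  dinA_t;
typedef short dinB_t;
typedef int   dout1_t;
// In Vivado HLS C++ code these typedefs could later be changed to arbitrary
// precision types, for example ap_int<6>, without touching the function below.

// Hypothetical function that refers only to the project-wide type names.
void multiply(dinA_t inA, dinB_t inB, dout1_t *out1) {
    *out1 = inA * inB;
}
```

Because the function never names a concrete type, refining the bit-widths only requires editing the header and re-running C simulation.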
+ Timing (ns):
* Summary:
+---------+-------+----------+------------+
| Clock | Target| Estimated| Uncertainty|
+---------+-------+----------+------------+
|default | 4.00| 3.85| 0.50|
+---------+-------+----------+------------+
If the width of the data is not required to be implemented using standard C types but in
some width which is smaller, but still greater than the next smallest standard C type, such as
the following,
The results after synthesis show an improvement to the maximum clock frequency, the
latency and a significant reduction in area of 75%.
+ Timing (ns):
* Summary:
+---------+-------+----------+------------+
| Clock | Target| Estimated| Uncertainty|
+---------+-------+----------+------------+
|default | 4.00| 3.49| 0.50|
+---------+-------+----------+------------+
The large difference in latency between both designs is due to the division and remainder
operations which take multiple cycles to complete. Using accurate data types, rather than
force fitting the design into standard C data types, results in a higher quality FPGA
implementation: the same accuracy, running faster with fewer resources.
The header files which define the arbitrary precision types are also provided with Vivado
HLS as a standalone package with the rights to use them in your own source code. The
package, xilinx_hls_lib_<release_number>.tgz is provided in the include
directory in the Vivado HLS installation area. The package does not include the C arbitrary
precision types defined in ap_cint.h. These types cannot be used with standard C
compilers - only with Vivado HLS.
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_int.h"
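The ap_int.h header ships with Vivado HLS, so as a portable illustration of the value ranges involved, standard C++ bit-fields of the same widths can stand in for a 9-bit signed and a 10-bit unsigned variable. This is a stand-in technique and not the ap_int API. The names narrow_vars, var1, var2, and next_var2 are hypothetical.

```cpp
// Portable stand-in: bit-fields with the same widths as ap_int<9> and ap_uint<10>.
struct narrow_vars {
    signed int   var1 : 9;   // range -256 .. 255, like a 9-bit signed integer
    unsigned int var2 : 10;  // range 0 .. 1023, like a 10-bit unsigned integer
};

// Stores value + 1 in the 10-bit field; the result is kept modulo 2^10.
unsigned next_var2(unsigned value) {
    narrow_vars v = {0, 0};
    v.var2 = value + 1;
    return v.var2;
}
```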
The default maximum width allowed for ap_[u]int data types is 1024 bits. This default
may be overridden by defining the macro AP_INT_MAX_W with a positive integer value less
than or equal to 32768 before inclusion of the ap_int.h header file.
CAUTION! Setting the value of AP_INT_MAX_W too high may cause slow software compile and run
times.
ap_int<4096> very_wide_var;
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND > my_type;
...
When performing calculations where the variables have different number of bits or different
precision, the binary point is automatically aligned.
The behavior of the C++/SystemC simulations performed using fixed-point matches the
resulting hardware. This allows you to analyze the bit-accurate, quantization, and overflow
behaviors using fast C-level simulation.
Fixed-point types are a useful replacement for floating-point types, which require many
clock cycles to complete. Unless the entire range of the floating-point type is required, the
same accuracy can often be achieved with a fixed-point type, resulting in smaller and
faster hardware.
Q Quantization mode.
This dictates the behavior when greater precision is generated than can be defined by
the smallest fractional bit in the variable used to store the result.
SystemC Types     ap_fixed Types    Description
SC_RND            AP_RND            Round to plus infinity
SC_RND_ZERO       AP_RND_ZERO       Round to zero
SC_RND_MIN_INF    AP_RND_MIN_INF    Round to minus infinity
SC_RND_INF        AP_RND_INF        Round to infinity
SC_RND_CONV       AP_RND_CONV       Convergent rounding
SC_TRN            AP_TRN            Truncation to minus infinity (default)
SC_TRN_ZERO       AP_TRN_ZERO       Truncation to zero
O Overflow mode.
This dictates the behavior when the result of an operation exceeds the maximum (or
minimum in the case of negative numbers) value which can be stored in the result variable.
SystemC Types    ap_fixed Types    Description
SC_SAT           AP_SAT            Saturation
SC_SAT_ZERO      AP_SAT_ZERO       Saturation to zero
SC_SAT_SYM       AP_SAT_SYM        Symmetrical saturation
SC_WRAP          AP_WRAP           Wrap around (default)
SC_WRAP_SM       AP_WRAP_SM        Sign magnitude wrap around
N This defines the number of saturation bits in the overflow wrap modes.
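The effect of the quantization and overflow modes can be explored with a small stand-alone model. The following is a minimal sketch in standard C++, not the ap_fixed implementation: the enum names, quantize(), overflow(), and to_fixed() are hypothetical helpers, only signed types are modeled, and only the wrap-around and saturation overflow modes are covered. W is the total width and I the number of integer bits, so W - I fractional bits remain.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical enums that mirror the mode names in the tables above.
enum QMode { TRN, TRN_ZERO, RND, RND_ZERO, RND_MIN_INF, RND_INF, RND_CONV };
enum OMode { WRAP, SAT };

// Quantize value * 2^frac_bits to a whole number of LSBs.
long long quantize(double value, int frac_bits, QMode q) {
    const double scaled = std::ldexp(value, frac_bits);
    const double lo = std::floor(scaled);
    const bool tie = (scaled - lo) == 0.5;  // exactly halfway between two LSBs
    switch (q) {
    case TRN:         return static_cast<long long>(lo);                       // toward minus infinity
    case TRN_ZERO:    return static_cast<long long>(std::trunc(scaled));       // toward zero
    case RND:         return static_cast<long long>(std::floor(scaled + 0.5)); // ties toward plus infinity
    case RND_MIN_INF: return static_cast<long long>(std::ceil(scaled - 0.5));  // ties toward minus infinity
    case RND_INF:     return static_cast<long long>(std::round(scaled));       // ties away from zero
    case RND_ZERO:    return static_cast<long long>(tie ? std::trunc(scaled) : std::round(scaled));
    case RND_CONV: {                                                           // ties to the even LSB
        if (!tie) return static_cast<long long>(std::round(scaled));
        const bool lo_even = std::fmod(lo, 2.0) == 0.0;
        return static_cast<long long>(lo_even ? lo : lo + 1.0);
    }
    }
    return 0;  // not reached for valid modes
}

// Fit a raw LSB count into a signed two's complement field of `width` bits.
long long overflow(long long raw, int width, OMode o) {
    assert(width > 1 && width < 63);
    const long long max = (1LL << (width - 1)) - 1;
    const long long min = -max - 1;
    if (o == SAT) return raw > max ? max : (raw < min ? min : raw);
    const long long modulus = 1LL << width;  // WRAP: keep the low `width` bits
    long long r = raw % modulus;
    if (r < 0) r += modulus;
    return r > max ? r - modulus : r;
}

// Model of storing `value` in a signed fixed-point variable with W total bits and I integer bits.
double to_fixed(double value, int W, int I, QMode q, OMode o) {
    const int frac_bits = W - I;
    const long long raw = overflow(quantize(value, frac_bits, q), W, o);
    return std::ldexp(static_cast<double>(raw), -frac_bits);
}
```

With this model, to_fixed(-1.25, 3, 2, TRN, WRAP) gives -1.5 while to_fixed(-1.25, 3, 2, TRN_ZERO, WRAP) gives -1.0, and storing 19.0 in a 4-bit integer field gives 7.0 with saturation.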
The default maximum width allowed for ap_[u]fixed data types is 1024 bits. This default
may be overridden by defining the macro AP_INT_MAX_W with a positive integer value less
than or equal to 32768 before inclusion of the ap_int.h header file.
CAUTION! Setting the value of AP_INT_MAX_W too high may cause slow software compile and run
times.
ap_fixed<4096> very_wide_var;
Arbitrary precision data types are highly recommended when using Vivado HLS. As shown in
the earlier example, they typically have a significant positive benefit on the quality of the
hardware implementation. Complete details on the Vivado HLS arbitrary precision data
types are provided in Chapter 4, High-Level Synthesis Reference Guide.
• 1 sign bit
• 5 exponent bits
• 10 mantissa bits
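The 1/5/10 bit split listed above can be made concrete by decoding a 16-bit pattern by hand. The following is a minimal sketch in standard C++, assuming the usual IEEE 754 binary16 conventions (exponent bias of 15, subnormals, infinities and NaNs). The function name half_bits_to_float is hypothetical and is not part of Vivado HLS.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Decode a 16-bit pattern laid out as 1 sign bit, 5 exponent bits, 10 mantissa bits.
// Assumption: IEEE 754 binary16 encoding with an exponent bias of 15.
float half_bits_to_float(uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;
    float value;
    if (exponent == 0) {
        // Zero or subnormal: mantissa * 2^-10 * 2^-14
        value = std::ldexp(static_cast<float>(mantissa), -24);
    } else if (exponent == 31) {
        value = mantissa ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
    } else {
        // Normal number: (1 + mantissa / 2^10) * 2^(exponent - 15)
        value = std::ldexp(1.0f + mantissa / 1024.0f, exponent - 15);
    }
    return sign ? -value : value;
}

// Usage checks for a few well-known bit patterns.
void check_half_examples() {
    assert(half_bits_to_float(0x3C00) == 1.0f);
    assert(half_bits_to_float(0xC000) == -2.0f);
    assert(half_bits_to_float(0x7BFF) == 65504.0f);  // largest finite value
}
```

The 10-bit mantissa gives roughly three decimal digits of precision, which is the trade-off accepted in exchange for the smaller storage.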
The following example shows how Vivado HLS uses the half-precision floating-point data
type:
Note: Vivado HLS only supports the half-precision floating-point data type for pointers or arrays.
You cannot use this data type as a pass-by-value argument.
Vivado HLS supports the following math operations for the half-precision floating-point
data type:
• Addition
• Division
• Multiplication
• Square root
• Subtraction
Managing Interfaces
In a C-based design, all input and output operations are performed, in zero time, through
formal function arguments. In an RTL design, these same input and output operations must
be performed through a port in the design interface, and they typically operate using a specific
I/O (input-output) protocol.
Vivado HLS supports two solutions for specifying the type of I/O protocol used:
• Interface Synthesis, where the port interface is created based on efficient industry
standard interfaces.
• Manual interface specification where the interface behavior is explicitly described in
the input source code. This allows any arbitrary I/O protocol to be used.
° This solution is provided through SystemC designs, where the I/O control signals
are specified in the interface declaration and their behavior specified in the code.
° Vivado HLS also supports this mode of interface specification for C and C++
designs.
Interface Synthesis
When the top-level function is synthesized, the arguments (or parameters) to the function
are synthesized into RTL ports. This process is called interface synthesis.
#include "sum_io.h"
dout_t temp;
return temp;
}
With the default interface synthesis settings, the design is synthesized into an RTL block
with the ports shown in the following figure.
X-Ref Target - Figure 1-36
A chip-enable port can optionally be added to the entire block using Solution > Solution
Settings > General and config_interface configuration.
The operation of the reset is controlled by the config_rtl configuration. More details on the
reset configuration are provided in Clock, Reset, and RTL Output.
By default, a block-level interface protocol is added to the design. These signals control the
block, independently of any port-level I/O protocols. These ports control when the block
can start processing data (ap_start), indicate when it is ready to accept new inputs
(ap_ready), and indicate if the design is idle (ap_idle) or has completed operation
(ap_done).
The final group of signals are the data ports. The I/O protocol created depends on the type
of C argument and on the default. A complete list of all possible I/O protocols is shown in
Figure 1-38. After the block-level protocol has been used to start the operation of the
block, the port-level I/O protocols are used to sequence data into and out of the block.
By default, input pass-by-value arguments and pointers are implemented as simple wire
ports with no associated handshaking signal. In the above example, the input ports are
therefore implemented without an I/O protocol, only a data port. If the port has no I/O
protocol, (by default or by design) the input data must be held stable until it is read.
By default, output pointers are implemented with an associated output valid signal to
indicate when the output data is valid. In the above example, the output port is
implemented with an associated output valid port (sum_o_ap_vld) which indicates when the
data on the port is valid and can be read. If there is no I/O protocol associated with the
output port, it is difficult to know when to read the data. It is always a good idea to use an
I/O protocol on an output.
Function arguments which are both read from and written to are split into separate input and
output ports. In the above example, sum is implemented as input port sum_i and output
port sum_o with associated I/O protocol port sum_o_ap_vld.
If the function has a return value, an output port ap_return is implemented to provide the
return value. When the design completes one transaction - this is equivalent to one
execution of the C function - the block-level protocols indicate the function is complete
with the ap_done signal. This also indicates the data on port ap_return is valid and can
be read.
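The argument kinds discussed above can be collected in one small function. The following is a minimal sketch in standard C++ with hypothetical names (sum_io_sketch and the two typedefs stand in for definitions that would normally live in a header such as sum_io.h). The comments repeat the default port mapping described in the text; the code itself is ordinary C++ and carries no interface directives.

```cpp
#include <cassert>

// Hypothetical data types; a real design would typically define these in a header.
typedef int din_t;
typedef int dout_t;

dout_t sum_io_sketch(din_t in1, din_t in2, dout_t *sum) {
    assert(sum != nullptr);
    // in1, in2: pass-by-value inputs, by default simple wire ports with no handshake.
    dout_t temp = in1 + in2;
    // sum is both read and written, so it is split into input port sum_i and
    // output port sum_o, with the output valid signal sum_o_ap_vld.
    *sum = temp + *sum;
    // The return value is provided on port ap_return, valid when ap_done is asserted.
    return temp;
}
```

In a C test bench the function is called like any other: with sum initialized to 10, the call sum_io_sketch(1, 2, &sum) returns 3 and leaves 13 in sum.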
For the example code shown the timing behavior is shown in the following figure (assuming
that the target technology and clock frequency allow a single addition per clock cycle).
X-Ref Target - Figure 1-37
During synthesis, Vivado HLS groups all interfaces in OpenCL API C as follows:
• All scalar interfaces and the block-level interface into a single AXI4-Lite interface
• All arrays and pointers into a single AXI4 interface
Note: No other interface specifications are allowed for OpenCL API C kernels.
Argument Type: Scalar, Array, Pointer or Reference, HLS::Stream (D = default interface mode)
ap_ctrl_none
ap_ctrl_hs D
ap_ctrl_chain
axis
s_axilite
m_axi
ap_none D D
ap_stable
ap_ack
ap_vld D
ap_ovld D
ap_hs
ap_memory D D D
bram
ap_fifo D
ap_bus
The ap_ctrl_hs mode described in the previous example is the default protocol. The
ap_ctrl_chain protocol is similar to ap_ctrl_hs but has an additional input port
ap_continue which provides back pressure from blocks consuming the data from this
block. If the ap_continue port is logic 0 when the function completes, the block will halt
operation and the next transaction will not proceed. The next transaction will only proceed
when the ap_continue is asserted to logic 1.
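The back-pressure behavior of ap_continue can be pictured with a toy cycle model. The following is a minimal sketch in standard C++, assuming a block with a fixed latency per transaction; ChainBlockModel and its members are hypothetical and only imitate the rule stated above (the next transaction proceeds only once ap_continue is logic 1). It is not generated RTL and does not model ap_start, ap_ready, or ap_idle.

```cpp
#include <cassert>

// Toy model of a block using the ap_ctrl_chain protocol (hypothetical, for illustration only).
struct ChainBlockModel {
    int latency;           // clock cycles needed per transaction (assumed fixed)
    int cycle_in_txn = 0;  // progress inside the current transaction
    int completed = 0;     // transactions whose completion has been accepted downstream
    bool waiting = false;  // true while the block is halted, waiting for ap_continue

    explicit ChainBlockModel(int lat) : latency(lat) { assert(lat > 0); }

    // Advance the model by one clock cycle with the given ap_continue level.
    void clock(bool ap_continue) {
        if (waiting) {
            if (ap_continue) {  // consumer is ready: release the block
                waiting = false;
                ++completed;
            }
            return;             // halted: the next transaction does not proceed
        }
        if (++cycle_in_txn == latency) {
            cycle_in_txn = 0;
            if (ap_continue) ++completed;
            else waiting = true;  // function completed while ap_continue was 0
        }
    }
};
```

Driving a model with a latency of 2 with ap_continue held low shows the block halting after its transaction finishes and resuming only in the cycle in which ap_continue returns to 1.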
The ap_ctrl_none mode implements the design without any block-level I/O protocol.
If the function return is also specified as an AXI4-Lite interface (s_axilite) all the ports in
the block-level interface are grouped into the AXI4-Lite interface. This is a common practice
when another device, such as a CPU, is used to configure and control when this block starts
and stops operation.
The AXI4 interfaces supported by Vivado HLS include the AXI4-Stream (axis), AXI4-Lite
(s_axilite), and AXI4 master (m_axi) interfaces, which you can specify as follows:
For information on additional functionality provided by the AXI4 interface, see Using AXI4
Interfaces.
The ap_none and ap_stable modes specify that no I/O protocol be added to the port.
When these modes are specified the argument is implemented as a data port with no other
associated signals. The ap_none mode is the default for scalar inputs. The ap_stable
mode is intended for configuration inputs which only change when the device is in reset
mode.
Interface mode ap_hs includes a two-way handshake signal with the data port. The
handshake is an industry standard valid and acknowledge handshake. Mode ap_vld is the
same but only has a valid port, and ap_ack only has an acknowledge port.
Mode ap_ovld is for use with in-out arguments. When the in-out is split into separate
input and output ports, mode ap_none is applied to the input port and ap_vld applied to
the output port. This is the default for pointer arguments which are both read and written.
The ap_hs mode can be applied to arrays which are read or written in sequential order. If
Vivado HLS can determine the read or write accesses are not sequential it will halt synthesis
with an error. If the access order cannot be determined Vivado HLS will issue a warning.
The bram interface mode is functionally identical to the ap_memory interface. The only
difference is how the ports are implemented when the design is used in Vivado IP
Integrator:
If the array is accessed in a sequential manner, an ap_fifo interface can be used. As with
the ap_hs interface, Vivado HLS halts if it determines the data access is not sequential,
reports a warning if it cannot determine whether the access is sequential, and issues no
message if it determines the access is sequential. The ap_fifo interface can only be used for
reading or writing, not both.
The ap_bus interface can communicate with a bus bridge. The interface does not adhere to
any specific bus standard but is generic enough to be used with a bus bridge that in turn
arbitrates with the system bus. The bus bridge must be able to cache all burst writes.
Arrays of structs are implemented as multiple arrays, with a separate array for each member
of the struct.
The DATA_PACK optimization directive is used for packing all the elements of a struct into a
single wide vector. This allows all members of the struct to be read and written
simultaneously. The member elements of the struct are placed into the vector in the order
they appear in the C code: the first element of the struct is aligned on the LSB of the vector
and the final element of the struct is aligned with the MSB of the vector. Any arrays in the
struct are partitioned into individual array elements and placed in the vector from lowest to
highest, in order.
Care should be taken when using the DATA_PACK optimization on structs with large arrays.
If an array has 4096 elements of type int, this will result in a vector (and port) of width
4096*32=131072 bits. Vivado HLS can create this RTL design; however, it is very unlikely
that a port of this width can be routed during FPGA implementation.
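The packing order and the resulting widths can be checked with a few lines of ordinary C++. The following is a minimal sketch, not what Vivado HLS generates: SmallData is a hypothetical struct with a 12-bit field A, a two-element array B of 18-bit values, and a 6-bit field C (54 bits in total, so the packed vector fits in a uint64_t), and pack_lsb_first() and packed_width_bits() are hypothetical helpers. The padding helper models only struct-level padding, that is, rounding the whole vector up to the next 8-bit boundary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical struct: A uses 12 bits, each element of B uses 18 bits, C uses 6 bits.
struct SmallData {
    uint16_t A;
    uint32_t B[2];
    uint8_t  C;
};

// Pack the members in declaration order: the first member lands on the LSB,
// array elements go from lowest to highest index, the last member ends at the MSB.
uint64_t pack_lsb_first(const SmallData &d) {
    uint64_t vec = 0;
    int pos = 0;
    auto put = [&](uint64_t field, int width) {
        assert(width > 0 && pos + width <= 64);
        vec |= (field & ((1ULL << width) - 1)) << pos;
        pos += width;
    };
    put(d.A, 12);
    put(d.B[0], 18);
    put(d.B[1], 18);
    put(d.C, 6);
    return vec;
}

// Width of the packed vector, optionally rounded up to the next 8-bit boundary.
int packed_width_bits(int unpadded_bits, bool struct_level_pad) {
    assert(unpadded_bits > 0);
    return struct_level_pad ? (unpadded_bits + 7) / 8 * 8 : unpadded_bits;
}
```

For the 54-bit example, struct-level padding gives a 56-bit vector, and packed_width_bits(4096 * 32, false) reproduces the 131072-bit figure quoted above.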
The single wide-vector created by using the DATA_PACK directive allows more data to be
accessed in a single clock cycle. This is the case when the struct contains an array. When
data can be accessed in a single clock cycle, Vivado HLS automatically unrolls any loops
consuming this data, if doing so improves the throughput. The loop can be fully or partially
unrolled to create enough hardware to consume the additional data in a single clock cycle.
This feature is controlled using the config_unroll command and the option
tripcount_threshold. In the following example, any loops with a tripcount of less than
16 will be automatically unrolled if doing so improves the throughput.
config_unroll -tripcount_threshold 16
If a struct port using DATA_PACK is to be implemented with an AXI4 interface you may wish
to consider using the DATA_PACK byte_pad option. The byte_pad option is used to
automatically align the member elements to 8-bit boundaries. This alignment is sometimes
required by Xilinx IP. If an AXI4 port using DATA_PACK is to be implemented, refer to the
documentation for the Xilinx IP it will connect to and determine if byte alignment is
required.
For the following example code, the options for implementing a struct port are shown in
the following figure.
typedef struct{
int12 A;
int18 B[4];
int6 C;
} my_data;
void foo(my_data *a )
• By default, the members are implemented as individual ports. The array has multiple
ports (data, addr, etc.)
• Using DATA_PACK results in a single wide port.
• Using DATA_PACK with struct_level byte padding aligns the entire struct to the next
8-bit boundary.
• Using DATA_PACK with field_level byte padding aligns each struct member to the
next 8-bit boundary.
Note: The maximum bit-width of any port or bus created by data packing is 8192 bits.
Figure 1-39: Struct Port Implementation (default ports C, B_addr, B_ce, B_data, and A; DATA_PACK optimization as a single packed vector; DATA_PACK optimization with byte_pad on the struct_level; DATA_PACK optimization with byte_pad on the field_level)
If a struct contains arrays, those arrays can be optimized using the ARRAY_PARTITION
directive to partition the array or the ARRAY_RESHAPE directive to partition the array and
re-combine the partitioned elements into a wider array. The DATA_PACK directive performs
the same operation as ARRAY_RESHAPE and combines the reshaped array with the other
elements in the struct.
A struct cannot be optimized with DATA_PACK and then partitioned or reshaped. The
DATA_PACK, ARRAY_PARTITION and ARRAY_RESHAPE directives are mutually exclusive.
#include "pointer_stream_bad.h"
acc += *d_i;
acc += *d_i;
*d_o = acc;
acc += *d_i;
acc += *d_i;
*d_o = acc;
}
After synthesis this code will result in an RTL design which reads the input port once and
writes to the output port once. As with any standard C compiler, Vivado HLS will optimize
away the redundant pointer accesses. To implement the above code with the “anticipated”
4 reads on d_i and 2 writes to the d_o the pointers must be specified as volatile as
shown in the next example.
#include "pointer_stream_better.h"
acc += *d_i;
acc += *d_i;
*d_o = acc;
acc += *d_i;
acc += *d_i;
*d_o = acc;
}
Even this C code is problematic. Using a test bench, there is no way to supply anything but
a single value to d_i or verify any write to d_o other than the final write. Although
multi-access pointers are supported, it is highly recommended to implement the behavior
required using the hls::stream class. Details on the hls::stream class are in HLS
Stream Library in Chapter 2.
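The test bench limitation described above is easy to reproduce in plain C++. The following is a minimal sketch with hypothetical names: pointer_multi_access() uses volatile pointers in the way the text describes, and queue_multi_access() expresses the same four reads and two writes with std::queue, which is used here only as a stand-in for a stream because the hls::stream class is not available outside Vivado HLS.

```cpp
#include <cassert>
#include <queue>

// Hypothetical data types for the sketch.
typedef int din_t;
typedef int dout_t;

// Multi-access pointers: four reads of d_i and two writes of d_o.
// A C test bench can only place one value behind d_i and only sees the last write to d_o.
void pointer_multi_access(volatile dout_t *d_o, volatile din_t *d_i) {
    assert(d_o != nullptr && d_i != nullptr);
    din_t acc = 0;
    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
}

// Stream-style version: every read consumes a distinct value and every write is kept,
// so a test bench can supply four different inputs and check both outputs.
void queue_multi_access(std::queue<dout_t> &d_o, std::queue<din_t> &d_i) {
    assert(d_i.size() >= 4);
    din_t acc = 0;
    for (int write = 0; write < 2; ++write) {
        for (int read = 0; read < 2; ++read) {
            acc += d_i.front();
            d_i.pop();
        }
        d_o.push(acc);
    }
}
```

With the pointer version and an input of 5, the test bench can only observe the final value 20. With the queue version and inputs 1, 2, 3, 4, it observes both 3 and 10.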
Specifying Interfaces
Interface synthesis is controlled by the INTERFACE directive or by using a configuration
setting. To specify the interface mode on ports, select the port in the GUI Directives tab and
right-click the mouse to open the Vivado HLS Directive Editor as shown in the following
figure.
• mode
• register
If you select this option, all pass-by-value reads are performed in the first cycle of
operation. For output ports, the register option guarantees the output is registered. You
can apply the register option to any function in the design. For memory, FIFO, and AXI4
interfaces, the register option has no effect.
• depth
This option specifies how many samples are provided to the design by the test bench
and how many output values the test bench must store. Use whichever number is
greater.
Note: For cases in which a pointer is read from or written to multiple times within a single
transaction, the depth option is required for C/RTL co-simulation. The depth option is not
required for arrays or when using the hls::stream construct. It is only required when using
pointers on the interface.
If the depth option is set too small, the C/RTL co-simulation might deadlock as follows:
° The input reads might stall waiting for data that the test bench cannot provide.
° The output writes might stall when trying to write data, because the storage is full.
• port
This option is required. By default, Vivado HLS does not register ports.
Note: To specify a block-level I/O protocol, select the top-level function in the Vivado HLS GUI,
and specify the port as the function return.
• offset
This option is used for AXI4 interfaces. For information, see Using AXI4 Interfaces.
To set the interface configuration, select Solution > Solution Settings > General >
config_interface. You can use configuration settings to:
Any C function can use global variables: those variables defined outside the scope of any
function. By default, global variables do not result in the creation of RTL ports: Vivado HLS
assumes the global variable is inside the final design. The config_interface
configuration setting expose_global instructs Vivado HLS to create ports for global
variables. For more information on the synthesis of global variables, see Global Variables in
Chapter 3.
The process for performing interface synthesis on a SystemC design is different from
adding the same interfaces to C or C++ designs.
• Memory block RAM and AXI4 master interfaces require that the SystemC data port be
replaced with a Vivado HLS port.
• AXI4-Stream and AXI4-Lite slave interfaces only require directives but there is a
different process for adding directives to a SystemC design.
When adding directives as pragmas to SystemC source code, the pragma directives cannot
be added where the ports are specified in the SC_MODULE declaration. They must be added
inside a function called by the SC_MODULE.
The directives can be applied to any member function of the SC_MODULE, however it is a
good design practice to add them to the function where the variables are used.
SC_MODULE(my_design) {
//"RAM" Port
sc_uint<20> my_array[256];
…
The port my_array is synthesized into an internal block RAM, not a block RAM interface
port.
Including the Vivado HLS header file ap_mem_if.h allows the same port to be specified as
an ap_mem_port<data_width, address_bits> port. The ap_mem_port data type is
synthesized into a standard block RAM interface with the specified data and address
bus-widths and using the ap_memory port protocol.
#include "ap_mem_if.h"
SC_MODULE(my_design) {
//"RAM" Port
ap_mem_port<sc_uint<20>,sc_uint<8>, 256> my_array;
…
#include "ap_mem_if.h"
ap_mem_chn<int,int, 68> bus_mem;
…
// Instantiate the top-level module
my_design U_dut("U_dut");
U_dut.my_array.bind(bus_mem);
…
The header file ap_mem_if.h is located in the include directory located in the Vivado HLS
installation area and must be included if simulation is performed outside Vivado HLS.
An AXI4-Stream interface can be added to any SystemC ports that are of the sc_fifo_in
or sc_fifo_out type. The following shows the top-level of a typical SystemC design. As is
typical, the SC_MODULE and ports are defined in a header file:
SC_MODULE(sc_FIFO_port)
{
  //Ports
  sc_in <bool> clock;
  sc_in <bool> reset;
  sc_in <bool> start;
  sc_out<bool> done;
  sc_fifo_out<int> dout;
  sc_fifo_in<int> din;
  //Variables
  int share_mem[100];
  bool write_done;
  //Process Declaration
  void Prc1();
  void Prc2();
  //Constructor
  SC_CTOR(sc_FIFO_port)
  {
    //Process Registration
    SC_CTHREAD(Prc1,clock.pos());
    reset_signal_is(reset,true);
    SC_CTHREAD(Prc2,clock.pos());
    reset_signal_is(reset,true);
  }
};
To create an AXI4-Stream interface, the RESOURCE directive must be used to specify that the
ports are connected to an AXI4-Stream resource. For the example interface shown above, the
directives are added in the function called by the SC_MODULE: ports din and dout
are specified to have an AXI4-Stream resource.
#include "sc_FIFO_port.h"
void sc_FIFO_port::Prc1()
{
  //Initialization
  write_done = false;
  wait();
  while(true)
  {
    while (!start.read()) wait();
    write_done = false;
    write_done = true;
    wait();
  } //end of while(true)
}
void sc_FIFO_port::Prc2()
{
#pragma HLS resource core=AXI4Stream variable=din
#pragma HLS resource core=AXI4Stream variable=dout
  //Initialization
  done = false;
  wait();
  while(true)
  {
    while (!start.read()) wait();
    wait();
    while (!write_done) wait();
    for(int i=0;i<100; i++)
    {
      dout.write(share_mem[i]+din.read());
    }
    done = true;
    wait();
  } //end of while(true)
}
When the SystemC design is synthesized, it results in an RTL design with standard RTL FIFO
ports. When the design is packaged as IP using the Export RTL toolbar button, the
output is a design with AXI4-Stream interfaces.
An AXI4-Lite slave interface can be added to any SystemC ports of type sc_in or sc_out.
The following example shows the top-level of a typical SystemC design. In this case, as is
typical, the SC_MODULE and ports are defined in a header file:
SC_MODULE(sc_sequ_cthread){
  //Ports
  sc_in <bool> clk;
  sc_in <bool> reset;
  sc_in <bool> start;
  sc_in<sc_uint<16> > a;
  sc_in<bool> en;
  sc_out<sc_uint<16> > sum;
  sc_out<bool> vld;
  //Variables
  sc_uint<16> acc;
  //Process Declaration
  void accum();
  //Constructor
  SC_CTOR(sc_sequ_cthread){
    //Process Registration
    SC_CTHREAD(accum,clk.pos());
    reset_signal_is(reset,true);
  }
};
To create an AXI4-Lite interface the RESOURCE directive must be used to specify the ports
are connected to an AXI4-Lite resource. For the example interface shown above, the
following example shows how ports start, a, en, sum and vld are grouped into the same
AXI4-Lite interface slv0: all the ports are specified with the same bus_bundle name and
are grouped into the same AXI4-Lite interface.
#include "sc_sequ_cthread.h"
void sc_sequ_cthread::accum(){
//Group ports into AXI4 slave slv0
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=start
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=a
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=en
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=sum
#pragma HLS resource core=AXI4LiteS metadata="-bus_bundle slv0" variable=vld
  //Initialization
  acc=0;
  sum.write(0);
  vld.write(false);
  wait();
When the SystemC design is synthesized, it results in an RTL design with standard RTL ports.
When the design is packaged as IP using the Export RTL toolbar button, the output is a
design with an AXI4-Lite interface.
In most standard SystemC designs, you have no need to specify a port with the behavior of
the Vivado HLS ap_bus I/O protocol. However, if the design requires an AXI4 master bus
interface the ap_bus I/O protocol is required.
• Use the Vivado HLS type AXI4M_bus_port to create an interface with the ap_bus I/O
protocol.
• Assign an AXI4M resource to the port.
SC_MODULE(dut)
{
  //Ports
  sc_in<bool> clock; //clock input
  sc_in<bool> reset;
  sc_in<bool> start;
  sc_out<int> dout;
  AXI4M_bus_port<sc_fixed<32, 8> > bus_if;
  //Variables
  //Constructor
  SC_CTOR(dut)
  //:bus_if ("bus_if")
  {
    //Process Registration
    SC_CTHREAD(P1,clock.pos());
    reset_signal_is(reset,true);
  }
};
The following shows how the variable bus_if can be accessed in the SystemC function to
produce standard or burst read and write operations.
//Process Declaration
void P1() {
  //Initialization
  dout.write(10);
  int addr = 10;
  DT tmp[10];
  wait();
  while(1) {
    tmp[0]=10;
    tmp[1]=11;
    tmp[2]=12;
    // Port write
    bus_if->write(addr, tmp);
    dout.write(tmp[0].to_int());
    addr+=2;
    wait();
  }
}
When the port class AXI4M_bus_port is used in a design, it must have a matching HLS bus
interface channel hls_bus_chn<start_addr > in the test bench, as shown in the
following example:
#include <systemc.h>
#include "tlm.h"
using namespace tlm;
#include "hls_bus_if.h"
#include "AE_clock.h"
#include "driver.h"
#ifdef __RTL_SIMULATION__
#include "dut_rtl_wrapper.h"
#define dut dut_rtl_wrapper
#else
#include "dut.h"
#endif
// hls_bus_chn<type>
// bus_variable("name", start_addr, end_addr)
//
hls_bus_chn<sc_fixed<32, 8> > bus_mem("bus_mem",0,1024);
sc_signal<bool> s_clk;
sc_signal<bool> reset;
sc_signal<bool> start;
sc_signal<int> dout;
U_AE_Clock.reset(reset);
U_AE_Clock.clk(s_clk);
U_dut.clock(s_clk);
U_dut.reset(reset);
U_dut.start(start);
U_dut.dout(dout);
U_dut.bus_if(bus_mem);
U_driver.clk(s_clk);
U_driver.start(start);
U_driver.dout(dout);
// start simulation
sc_start(end_time, SC_NS);
return U_driver.ret;
};
The synthesized RTL design contains an interface with the ap_bus I/O protocol.
When the AXI4M_bus_port class is used, it results in an RTL design with an ap_bus
interface. When the design is packaged as IP using Export RTL the output is a design with an
AXI4 master port.
Note: You can also specify an I/O protocol with SystemC designs to provide greater I/O control.
The following examples show the requirements and advantages of manual interface
specifications. In the first code example, the following occurs:
P1: {
read1 = response[0];
opcode = 5;
*request = opcode;
read2 = response[1];
}
C1: {
*z1 = a + b;
*z2 = read1 + read2;
}
}
When Vivado HLS implements this code, the write to request does not need to occur
between the two reads on response. The code uses this I/O behavior, but there are no
dependencies in the code to enforce it. Vivado HLS might schedule the I/O accesses using
the same access pattern as the C code or use a different access pattern.
If there is an external requirement that the I/O accesses must occur in this order, you can
use a protocol block to enforce a specific I/O protocol behavior. Because the accesses occur
in the scope defined by block P1, you can apply an I/O protocol as follows:
The modified code now contains the header file and ap_wait() statements:
#include "ap_utils.h" // Added include file that defines ap_wait()
void test (
  int *z1,
  int a,
  int b,
  int *mode,
  volatile int *request,
  volatile int response[2],
  int *z2
) {
  int read1, read2;
  int opcode;
  P1: {
    read1 = response[0];
    opcode = 5;
    ap_wait(); // Added ap_wait statement
    *request = opcode;
    read2 = response[1];
  }
  C1: {
    *z1 = a + b;
    *z2 = read1 + read2;
  }
}
This instructs Vivado HLS to schedule the code within this region as is. There is no
reordering of the I/O or ap_wait() statements.
This results in the following exact I/O behavior specified in the code:
• Do not use an I/O protocol on the ports used in a manual interface. Explicitly set all
ports to I/O protocol ap_none to ensure interface synthesis does not add any
additional protocol signals.
• You must specify all the control signals used in a manually specified interface in the C
code with volatile type qualifier. These signals typically change value multiple times
within the function (for example, typically set to 0, then 1, then back to zero). Without
the volatile qualifier, Vivado HLS follows standard C semantics and optimizes out all
intermediate operations, leaving only the first read and final write.
• Use the volatile qualifier to specify data signals with values that will be updated
multiple times.
• If multiple clocks are required, use ap_wait_n(<value>) to specify multiple cycles.
Do not use multiple ap_wait() statements.
• Group signals that need to change in the same clock cycle using the LATENCY directive.
For example:
{
#pragma HLS PROTOCOL fixed
  // A protocol block may span multiple clock cycles.
  // To ensure both these signals are scheduled in the exact same clock cycle,
  // create a region { } with a latency = 0
  {
#pragma HLS LATENCY max=0 min=0
    *data = 0xFF;
    *data_vld = 1;
  }
  ap_wait_n(2);
}
This second use model provides additional functionality, allowing the optional
side-channels which are part of the AXI4-Stream standard, to be used directly in the C code.
An AXI4-Stream is used without side-channels when the function argument does not
contain any AXI4 side-channel elements. The following example shows a design where the
data type is a standard C int type. In this example, both interfaces are implemented using
an AXI4-Stream.
int i;
After synthesis, both arguments are implemented with a data port and the standard
AXI4-Stream TVALID and TREADY protocol ports as shown in the following figure.
X-Ref Target - Figure 1-41
Side-channels are optional signals which are part of the AXI4-Stream standard. The
side-channel signals may be directly referenced and controlled in the C code using a struct,
provided the member elements of the struct match the names of the AXI4-Stream
side-channel signals. An example of this is provided with Vivado HLS. The Vivado HLS
include directory contains the file ap_axi_sdata.h. This header file contains the
following structs:
#include "ap_int.h"
ap_uint<TI> id;
ap_uint<TD> dest;
};
Both structs contain as top-level members, variables whose names match those of the
optional AXI4-Stream side-channel signals. Provided the struct contains elements with
these names, there is no requirement to use the header file provided. You can create your
own user defined structs. Since the structs shown above use ap_int types and templates,
this header file is only for use in C++ designs.
Note: The valid and ready signals are mandatory signals in an AXI4-Stream and will always be
implemented by Vivado HLS. These cannot be controlled using a struct.
The following example shows how the side-channels can be used directly in the C code and
implemented on the interface. In this example a signed 32-bit data type is used.
#include "ap_axi_sdata.h"
int i;
After synthesis, both arguments are implemented with data ports, the standard
AXI4-Stream TVALID and TREADY protocol ports and all of the optional ports described in
the struct.
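The idea of carrying side-channels in a struct can be tried out without the Vivado HLS headers. The following is a minimal sketch in standard C++: AxisWord is a hypothetical, user-defined struct whose member names follow the optional AXI4-Stream signals (TKEEP, TSTRB, TUSER, TLAST, TID, TDEST), with fixed-width standard integers used in place of ap_int types, and a std::vector standing in for the stream. The function add_offset() is also hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical user-defined struct: one stream transfer with its side-channels.
// Widths are placeholders; TVALID and TREADY are not part of the struct.
struct AxisWord {
    int32_t data;  // signed 32-bit payload
    uint8_t keep;
    uint8_t strb;
    uint8_t user;
    bool    last;  // marks the final transfer of a packet
    uint8_t id;
    uint8_t dest;
};

// Add an offset to the payload and pass every side-channel through unchanged.
std::vector<AxisWord> add_offset(const std::vector<AxisWord> &in, int32_t offset) {
    std::vector<AxisWord> out;
    out.reserve(in.size());
    for (const AxisWord &word : in) {
        AxisWord result = word;  // copies keep, strb, user, last, id, dest
        result.data = word.data + offset;
        out.push_back(result);
    }
    assert(out.size() == in.size());
    return out;
}
```

Because the side-channels are ordinary struct members, the C code can read or set them directly, for example to assert last on the final element of a packet.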
There is a difference in the default synthesis behavior when using structs with AXI4-Stream
interfaces. The default synthesis behavior for struct is described in Interface Synthesis and
Structs in Chapter 1.
When using AXI4-Stream interfaces without side-channels and the function argument is a
struct:
• Vivado HLS automatically applies the DATA_PACK directive and all elements of the
struct are combined into a single wide-data vector. The interface is implemented as a
single wide-data vector with associated TVALID and TREADY signals.
• If the DATA_PACK directive is manually applied to the struct, all elements of the struct
are combined into a single wide-data vector and the AXI alignment options to the
DATA_PACK directive may be applied. The interface is implemented as a single
wide-data vector with associated TVALID and TREADY signals.
When using AXI4-Stream interfaces with side-channels, the function argument is itself a
struct (AXI-Stream struct), and may contain data which is itself a struct (data struct) along
with the side-channels:
• Vivado HLS automatically applies the DATA_PACK directive to the data struct and all
elements of the data struct are combined into a single wide-data vector. The interface is
implemented as a single wide-data vector with associated side-channels, TVALID and
TREADY signals.
• If the DATA_PACK directive is manually applied to the data struct, all elements of the
data struct are combined into a single wide-data vector and the AXI alignment options
to the DATA_PACK directive may be applied. The interface is implemented as a single
wide-data vector with associated side-channels, TVALID and TREADY signals.
• If the DATA_PACK directive is applied to the AXI-Stream struct (the function argument), the
data struct and the side-channel signals are combined into a single wide vector. The
interface is implemented as a single wide-data vector with TVALID and TREADY signals.
AXI4-Lite Interface
You can use an AXI4-Lite interface to allow the design to be controlled by a CPU or
microcontroller. Using the Vivado HLS AXI4-Lite interface, you can:
The following example shows how Vivado HLS implements multiple arguments, including
the function return, as an AXI4-Lite interface. Because each directive uses the same name
for the bundle option, each of the ports is grouped into the same AXI4-Lite interface.
*c += *a + *b;
}
Note: If you do not use the bundle option, Vivado HLS groups all arguments specified with an
AXI4-Lite interface into the same default bundle and automatically names the port.
You can also assign an I/O protocol to ports grouped into an AXI4-Lite interface. In the
example above, Vivado HLS implements port b as an ap_vld interface and groups port b
into the AXI4-Lite interface. As a result, the AXI4-Lite interface contains a register for the
port b data, a register for the output to acknowledge that port b was read, and a register
for the port b input valid signal.
Each time port b is read, Vivado HLS automatically clears the input valid register and resets
the register to logic 0. If the input valid register is not set to logic 1, the data in the b data
register is not considered valid, and the design stalls and waits for the valid register to be
set.
RECOMMENDED: For ease of use during the operation of the design, Xilinx recommends that you do not
include additional I/O protocols in the ports grouped into an AXI4-Lite interface. However, Xilinx
recommends that you include the block-level I/O protocol associated with the return port in the
AXI4-Lite interface.
IMPORTANT: You cannot assign arrays to an AXI4-Lite interface using the bram interface. You can only
assign arrays to an AXI4-Lite interface using the default ap_memory interface.
You also cannot assign any argument specified with ap_stable I/O protocol to an AXI4-Lite interface.
By default, Vivado HLS automatically assigns the address for each port that is grouped into
an AXI4-Lite interface. Vivado HLS provides the assigned addresses in the C driver files. For
more information, see C Driver Files.
Note: To explicitly define the address, you can use the offset option, as shown for argument c in
the example above.
IMPORTANT: In an AXI4-Lite interface, Vivado HLS reserves addresses 0x0000 through 0x000C for the
block-level I/O protocol signals and interrupt controls.
After synthesis, Vivado HLS implements the ports in the AXI4-Lite port, as shown in the
following figure. Vivado HLS creates the interrupt port by including the function return in
the AXI4-Lite interface. You can program the interrupt through the AXI4-Lite interface. You
can also drive the interrupt from the following block-level protocols:
By default, Vivado HLS uses the same clock for the AXI4-Lite interface and the synthesized
design. Vivado HLS connects all registers in the AXI4-Lite interface to the clock used for the
synthesized logic (ap_clk).
Optionally, you can use the INTERFACE directive clock option to specify a separate clock
for each AXI4-Lite port. When connecting the clock to the AXI4-Lite interface, you must use
the following protocols:
• AXI4-Lite interface clock must be synchronous to the clock used for the synthesized
logic (ap_clk). That is, both clocks must be derived from the same master generator
clock.
• AXI4-Lite interface clock frequency must be equal to or less than the frequency of the
clock used for the synthesized logic (ap_clk).
If you use the clock option with the interface directive, you only need to specify the
clock option on one function argument in each bundle. Vivado HLS implements all other
function arguments in the bundle with the same clock and reset. Vivado HLS names the
generated reset signal with the prefix ap_rst_ followed by the clock name. The generated
reset signal is active Low independent of the config_rtl command. For more
information, see Controlling the Reset Behavior.
The following example shows how Vivado HLS groups function arguments a and b into an
AXI4-Lite port with a clock named AXI_clk1 and an associated reset port.
In the following example, Vivado HLS groups function arguments c and d into AXI4-Lite
port CTRL1 with a separate clock called AXI_clk2 and an associated reset port.
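The two groupings described above can be sketched in one function as follows. The clock names AXI_clk1 and AXI_clk2 and the bundle name CTRL1 come from the text; the function name, argument types, and function body are assumptions made only so that the sketch compiles. Note that the clock option appears on just one argument in each bundle.

```c
/* Sketch: a and b share the default AXI4-Lite bundle, clocked by AXI_clk1.
 * c and d share bundle CTRL1, clocked by AXI_clk2.
 * The function name and body are hypothetical. */
void clk_example(int *a, int *b, int *c, int *d)
{
#pragma HLS INTERFACE s_axilite port=a clock=AXI_clk1
#pragma HLS INTERFACE s_axilite port=b
#pragma HLS INTERFACE s_axilite port=c bundle=CTRL1 clock=AXI_clk2
#pragma HLS INTERFACE s_axilite port=d bundle=CTRL1

    *c = *a + *b;
    *d = *a - *b;
}
```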
C Driver Files
When an AXI4-Lite slave interface is implemented, a set of C driver files is automatically
created. These C driver files provide a set of APIs that can be integrated into any software
running on a CPU and used to communicate with the device via the AXI4-Lite slave
interface.
The C driver files are created when the design is packaged as IP for the IP Catalog. For
more details on packaging IP, see Exporting the RTL Design.
Driver files are created for standalone and Linux modes. In standalone mode the drivers are
used in the same way as any other Xilinx standalone drivers. In Linux mode, copy all the C
files (.c) and header files (.h) into the software project.
The driver files and API functions derive their name from the top-level function for
synthesis. In the above example, the top-level function is called “example”. If the top-level
function was named “DUT” the name “example” would be replaced by “DUT” in the
following description. The driver files are created in the packaged IP (located in the impl
directory inside the solution).
The following table lists each of the API functions provided in the C driver files.
XExample_Get_ARG_BitWidth: Return the bit width of each element in the array. Only
available when ARG is an array grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS groups multiple elements
into the 32-bit data width of the AXI4-Lite interface. If the bit width of the elements
exceeds 32-bit, Vivado HLS stores each element over multiple consecutive addresses.
XExample_Get_ARG_Depth: Return the total number of elements in the array. Only
available when ARG is an array grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS groups multiple elements
into the 32-bit data width of the AXI4-Lite interface. If the bit width of the elements
exceeds 32-bit, Vivado HLS stores each element over multiple consecutive addresses.
XExample_Write_ARG_Words: Write the specified number of 32-bit words into the specified
address of the AXI4-Lite interface. This API requires the offset address from BaseAddress
and the length of the data to be stored. Only available when ARG is an array grouped into
the AXI4-Lite interface.
XExample_Read_ARG_Words: Read the specified number of 32-bit words from the array. This
API requires the data target, the offset address from BaseAddress, and the length of the
data to be stored. Only available when ARG is an array grouped into the AXI4-Lite interface.
XExample_Write_ARG_Bytes: Write the specified number of bytes into the specified address
of the AXI4-Lite interface. This API requires the offset address from BaseAddress and the
length of the data to be stored. Only available when ARG is an array grouped into the
AXI4-Lite interface.
XExample_Read_ARG_Bytes: Read the specified number of bytes from the array. This API
requires the data target, the offset address from BaseAddress, and the length of data to be
loaded. Only available when ARG is an array grouped into the AXI4-Lite interface.
IMPORTANT: The C driver APIs always use an unsigned 32-bit type (U32). You might be required to cast
the data in the C code into the expected type.
C driver files always use a 32-bit unsigned integer (U32) for data transfers. In the
following example, the function uses float type arguments a and r1. It sets the value of a
and returns the value of r1:
*r1 = 0.5f*a;
return (a>0);
}
After synthesis, Vivado HLS groups all ports into the default AXI4-Lite interface and creates
C driver files. However, as shown in the following example, the driver files use type U32:
Xil_AssertNonvoid(InstancePtr != NULL);
Xil_AssertNonvoid(InstancePtr->IsReady == XIL_COMPONENT_IS_READY);
Data = XCaculate_ReadReg(InstancePtr->Hls_periph_bus_BaseAddress,
XCACULATE_HLS_PERIPH_BUS_ADDR_R1_DATA);
return Data;
}
If these functions work directly with float types, the write and read values are not consistent
with the expected float type. When using these functions in software, you can use the following
casts in the code:
float a=3.0f,r1;
u32 ua,ur1;
For a complete description of the API functions, see AXI4-Lite Slave C Driver Reference in
Chapter 4.
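The float-to-U32 conversion discussed above must preserve the bit pattern of the value; a plain value cast such as (u32)a would convert 3.0f to the integer 3 instead. The following is a minimal sketch of one portable way to do this, using memcpy rather than pointer casts. The helper names are hypothetical, uint32_t stands in for the driver's u32 type, and a 32-bit IEEE 754 float is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned value
 * (assumes float is a 32-bit IEEE 754 type). */
uint32_t float_to_u32(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit unsigned value read from a register as a float. */
float u32_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

In software, the value ua = float_to_u32(a) would be passed to the generated set function for a, and the value returned by the generated get function for r1 would be converted back with u32_to_float.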
Controlling Hardware
The hardware header file xexample_hw.h (in this example) provides a complete list of the
memory mapped locations for the ports grouped into the AXI4-Lite slave interface.
To correctly program the registers in the AXI4-Lite slave interface, you must understand
how the hardware ports operate. The block operates with the
same port protocols described in Interface Synthesis.
For example, to start the block operation the ap_start register must be set to 1. The
device will then proceed and read any inputs grouped into the AXI4-Lite slave interface
from the register in the interface. When the block completes operation, the ap_done,
ap_idle and ap_ready registers will be set by the hardware output ports and the results
for any output ports grouped into the AXI4-Lite slave interface can be read from the
appropriate register. This is the same operation described in Figure 1-37.
The implementation of function argument c in the example above also highlights the
importance of understanding how the hardware ports operate. Function
argument c is both read and written to, and is therefore implemented as separate input and
output ports c_i and c_o, as explained in Interface Synthesis.
The first recommended flow for programming the AXI4-Lite slave interface is for a one-time
execution of the function:
• Use the interrupt function to determine how you wish the interrupt to operate.
• Load the register values for the block input ports. In the above example this is
performed using API functions XExample_Set_a, XExample_Set_b, and
XExample_Set_c_i.
• Set the ap_start bit to 1 using XExample_Start to start executing the function.
This register is self-clearing as noted in the header file above. After one transaction, the
block will suspend operation.
• Allow the function to execute. Address any interrupts which are generated.
• Read the output registers. In the above example this is performed using API functions
XExample_Get_c_o_vld, to confirm the data is valid, and XExample_Get_c_o.
Note: The registers in the AXI4-Lite slave interface obey the same I/O protocol as the ports. In
this case, the output valid is set to logic 1 to indicate if the data is valid.
• Repeat for the next transaction.
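The one-time execution sequence above can be sketched against a small software model of the block. Everything in this sketch is hypothetical: the mock_example type and its functions stand in for the generated XExample driver and the hardware, and only the order of the steps (load inputs, start, wait for done, check output valid, read output) follows the text.

```c
#include <stdint.h>

/* Hypothetical host-side model of the block's AXI4-Lite registers. */
typedef struct {
    uint32_t ctrl;            /* bit 0: ap_start, bit 1: ap_done */
    uint8_t  a, b, c_i, c_o;  /* data registers                  */
    uint32_t c_o_vld;         /* output valid register           */
} mock_example;

enum { AP_START = 1u << 0, AP_DONE = 1u << 1 };

/* Model one transaction: consume ap_start, compute, flag done and valid. */
static void mock_step(mock_example *dev)
{
    if (dev->ctrl & AP_START) {
        dev->c_o     = (uint8_t)(dev->c_i + dev->a + dev->b);
        dev->c_o_vld = 1;
        /* ap_start is self-clearing; ap_done is raised on completion. */
        dev->ctrl = (dev->ctrl & ~(uint32_t)AP_START) | AP_DONE;
    }
}

/* One-time execution flow. Returns 0 and stores the result on success. */
int run_once(mock_example *dev, uint8_t a, uint8_t b, uint8_t c_in,
             uint8_t *result)
{
    dev->a = a;                        /* load the input registers     */
    dev->b = b;
    dev->c_i = c_in;
    dev->ctrl |= AP_START;             /* set ap_start to 1            */
    while (!(dev->ctrl & AP_DONE))     /* allow the function to run    */
        mock_step(dev);
    if (!dev->c_o_vld)                 /* confirm the output is valid  */
        return -1;
    *result = dev->c_o;                /* read the output register     */
    dev->c_o_vld = 0;
    dev->ctrl &= ~(uint32_t)AP_DONE;   /* model ap_done being cleared  */
    return 0;
}
```

With the real driver, each register access in run_once would be replaced by the corresponding generated API call, such as XExample_Set_a, XExample_Start, and XExample_Get_c_o.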
The second recommended flow is for continuous execution of the block. In this mode, the
input ports included in the AXI4-Lite slave interface should only be ports which perform
configuration. The block will typically run much faster than a CPU. If the block must wait for
inputs, the block will spend most of its time waiting:
• Use the interrupt function to determine how you wish the interrupt to operate.
• Load the register values for the block input ports. In the above example this is
performed using API functions XExample_Set_a, XExample_Set_b, and
XExample_Set_c_i.
• Set the auto-start function using API XExample_EnableAutoRestart.
• Allow the function to execute. The individual port I/O protocols will synchronize the
data being processed through the block.
• Address any interrupts which are generated. The output registers could be accessed
during this operation but the data may change often.
• Use the API function XExample_DisableAutoRestart to prevent any more
executions.
• Read the output registers. In the above example this is performed using API functions
XExample_Get_c_o and XExample_Get_c_o_vld.
Controlling Software
The API functions can be used in the software running on the CPU to control the hardware
block. An overview of the process is:
An abstracted version of this process is shown below. Complete examples of the software
control are provided in the Zynq-7000 AP SoC tutorials noted in Table 1-4.
// HLS HW instance
XExample HlsExample;
XExample_Config *ExamplePtr;
int main() {
int res_hw;
When an HLS RTL design using an AXI4-Lite slave interface is incorporated into a design in
Vivado IP Integrator, you can customize the block. From the block diagram in IP Integrator,
select the HLS block, right-click, and select Customize Block.
The address width is by default configured to the minimum required size. Modify this to
connect to blocks with address sizes less than 32-bit.
X-Ref Target - Figure 1-44
With individual data transfers, Vivado HLS reads or writes a single element of data for each
address. The following example shows a single read and single write operation. In this
example, Vivado HLS generates an address on the AXI interface to read a single data value
and an address to write a single data value. The interface transfers one data value per
address.
acc += *d;
*d = acc;
}
With burst mode transfers, Vivado HLS reads or writes data using a single base address
followed by multiple sequential data samples, which makes this mode capable of higher
data throughput. Burst mode of operation is possible when you use the C memcpy function
or a pipelined for loop.
Note: The C memcpy function is only supported for synthesis when used to transfer data to or from
a top-level function argument specified with an AXI4 master interface.
The following example shows burst mode transfers using the memcpy function. The
top-level function argument a is specified as an AXI4 master interface.
int i;
int buff[50];
memcpy((int *)a,buff,50*sizeof(int));
}
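A fuller sketch of a memcpy-based burst design is given here for orientation. The m_axi interface on port a and the 50-word buffer follow the text; the function name, the depth value, the s_axilite return port, and the computation between the two copies are assumptions made so that the sketch is complete and can be run in a C test bench.

```c
#include <string.h>

/* Sketch: port a is an AXI4 master interface, and memcpy to and from it
 * is the construct the text associates with burst transfers.
 * depth=50 and the +100 computation are illustrative assumptions. */
void burst_example(volatile int *a)
{
#pragma HLS INTERFACE m_axi port=a depth=50
#pragma HLS INTERFACE s_axilite port=return

    int i;
    int buff[50];

    /* Burst read: copy 50 words from the bus into local storage. */
    memcpy(buff, (const int *)a, 50 * sizeof(int));

    for (i = 0; i < 50; i++) {
        buff[i] = buff[i] + 100;
    }

    /* Burst write: copy the 50 results back to the same addresses. */
    memcpy((int *)a, buff, 50 * sizeof(int));
}
```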
When this example is synthesized, it results in the interface shown in the following figure.
int i;
int buff[50];
When using a for loop to implement burst reads or writes, follow these requirements:
In the following example, Vivado HLS implements the port reads as burst transfers. Port a
is specified without using the bundle option and is implemented in the default AXI
interface. Port b is specified using a named bundle and is implemented in a separate AXI
interface called d2_port.
int i;
int buff[50];
//copy data in
for(i=0; i < 50; i++){
#pragma HLS PIPELINE
buff[i] = a[i] + b[i];
}
...
}
An efficient AXI4 interface is one in which the design never stalls while waiting to access
the bus and, after bus access is granted, the bus never stalls while waiting for the design to
read or write. To create the optimal AXI4 interface, the following options are provided in the
INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the
AXI4 interface.
Some of these options use internal storage to buffer data and may have an impact on area
and resources:
• latency: Specifies the expected latency of the AXI4 interface, allowing the design to
initiate a bus request a number of cycles (latency) before the read or write is expected.
If this figure is too low, the design will be ready too soon and may stall waiting for the
bus. If this figure is too high, bus access may be granted but the bus may stall waiting
on the design to start the access.
• max_read_burst_length: Specifies the maximum number of data values read
during a burst transfer.
• num_read_outstanding: Specifies how many read requests can be made to the AXI4
bus, without a response, before the design stalls. This implies internal storage in the
design, a FIFO of size:
num_read_outstanding*max_read_burst_length*word_size.
• max_write_burst_length: Specifies the maximum number of data values written
during a burst transfer.
The interface is specified as having a latency of 100. Vivado HLS seeks to schedule the
request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
To further improve bus efficiency, the options num_write_outstanding and
num_read_outstanding ensure the design contains enough buffering to store up to 32
read and write accesses. This allows the design to continue processing until the bus
requests are serviced. Finally, the options max_read_burst_length and
max_write_burst_length ensure the maximum burst size is 16 and that the AXI4
interface does not hold the bus for longer than this.
These options allow the behavior of the AXI4 interface to be optimized for the system in
which it will operate. The efficiency of the operation depends on these values being set
accurately.
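As a rough illustration of the settings discussed above (latency of 100, 32 outstanding reads and writes, maximum burst length of 16), the following sketch shows how such a directive might be written and how the read buffer size follows from the formula num_read_outstanding*max_read_burst_length*word_size. The port name, depth value, function names, and function body are assumptions; a 4-byte word is assumed for the size calculation.

```c
#include <stddef.h>

/* Internal read FIFO size implied by the m_axi options, in bytes:
 * num_read_outstanding * max_read_burst_length * word_size. */
size_t read_fifo_bytes(size_t num_read_outstanding,
                       size_t max_read_burst_length,
                       size_t word_size_bytes)
{
    return num_read_outstanding * max_read_burst_length * word_size_bytes;
}

/* Sketch of a directive using the values discussed in the text.
 * The port name, depth, and function body are hypothetical. */
void axi_options_example(int *data)
{
#pragma HLS INTERFACE m_axi port=data depth=1024 latency=100 num_read_outstanding=32 num_write_outstanding=32 max_read_burst_length=16 max_write_burst_length=16

    data[0] = data[0] + 1;
}
```

With 32 outstanding reads, a burst length of 16, and 4-byte words, read_fifo_bytes(32, 16, 4) gives 2048 bytes of buffering, which is why these options can affect area.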
By default, Vivado HLS implements the AXI4 port with a 32-bit address bus. Optionally, you
can implement the AXI4 interface with a 64-bit address bus using the m_axi_addr64
interface configuration option as follows:
IMPORTANT: When you select the m_axi_addr64 option, Vivado HLS implements all AXI4 interfaces in
the design with a 64-bit address bus.
By default, the AXI4 master interface starts all read and write operations from address
0x00000000. For example, given the following code, the design reads data from addresses
0x00000000 to 0x000000c7 (50 32-bit words, which is 200 bytes), which represents 50 address
values. The design then writes data back to the same addresses.
int i;
int buff[50];
memcpy(buff,(const int*)a,50*sizeof(int));
To apply an address offset, use the -offset option with the INTERFACE directive, and
specify one of the following options:
In the final RTL, Vivado HLS applies the address offset directly to any read or write address
generated by the AXI4 master interface. This allows the design to access any address
location in the system.
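The address arithmetic above can be checked with a small helper. This is a sketch only: the type and function names are hypothetical, and it assumes the offset is simply added to every byte address the interface generates.

```c
#include <stdint.h>

/* First and last byte address touched by a transfer. */
typedef struct {
    uint32_t first;
    uint32_t last;
} addr_range;

/* Byte address range for `count` words of `word_bytes` bytes each,
 * starting at the address offset `offset`. */
addr_range axi_range(uint32_t offset, uint32_t count, uint32_t word_bytes)
{
    addr_range r;
    r.first = offset;
    r.last  = offset + count * word_bytes - 1u;
    return r;
}
```

With no offset, 50 words of 4 bytes span 0x00000000 to 0x000000c7, matching the range given earlier. With a hypothetical offset of 0x1000, the same transfer spans 0x00001000 to 0x000010c7.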
If you use the slave option in an AXI interface, you must use an AXI4-Lite port on the
design interface. Xilinx recommends that you implement the AXI4-Lite interface using the
following pragma:
In addition, if you use the slave option and you used several AXI4-Lite interfaces, you
must ensure that the AXI master port offset register is bundled into the correct AXI4-Lite
interface. In the following example, port a is implemented as an AXI master interface with
an offset and AXI4-Lite interfaces called AXI_Lite_1 and AXI_Lite_2:
The following INTERFACE directive is required to ensure that the offset register for port a is
bundled into the AXI4-Lite interface called AXI_Lite_1:
When you incorporate an HLS RTL design that uses an AXI4 master interface into a design
in the Vivado IP Integrator, you can customize the block. From the block diagram in IP
Integrator, select the HLS block, right-click, and select Customize Block to customize any
of the settings provided. A complete description of the AXI4 parameters is provided in the
AXI Reference Guide (UG1037) [Ref 8].
The following figure shows the Re-Customize IP dialog box for the design shown in
Figure 1-45. This design includes an AXI4-Lite port.
X-Ref Target - Figure 1-46
• Register all signals that cross between SLRs at both the SLR output and SLR input.
• You do not need to register a signal if it enters or exits an SLR via an I/O buffer.
• Ensure that the logic created by Vivado HLS fits within a single SLR.
Note: When you select an SSI technology device as the target technology, the utilization report
includes details on both the SLR usage and the total device usage.
If the logic is contained within a single SLR device, Vivado HLS provides a register_io
option to the config_interface command. This option provides a way to automatically
register all block inputs, outputs, or both. This option is only required for scalars. All array
ports are automatically registered.
The following table lists the optimization directives provided by Vivado HLS.
The optimizations are presented in the context of how they are typically applied on a
design.
The Clock, Reset and RTL output are discussed together. The clock frequency along with the
target device is the primary constraint which drives optimization. Vivado HLS seeks to place
as many operations as the target device allows into each clock cycle. The reset style used in
the final RTL is controlled, along with settings such as the FSM encoding style, using the
config_rtl configuration.
The primary optimizations for Optimizing for Throughput are presented together in the
manner in which they are typically used: pipeline the tasks to improve performance,
improve the data flow between tasks and optimize structures to improve address issues
which may limit performance.
Optimizing for Latency uses the techniques of latency constraints and the removal of loop
transitions to reduce the number of clock cycles required to complete.
A focus on how operations are implemented - controlling the number of operations and
how those operations are implemented in hardware - is the principal technique for
improving the area.
For SystemC designs, each SC_MODULE may be specified with a different clock. To specify
multiple clocks in a SystemC design, use the -name option of the create_clock
command to create multiple named clocks and use the CLOCK directive or pragma to
specify which function contains the SC_MODULE to be synthesized with the specified clock.
Each SC_MODULE can only be synthesized using a single clock: clocks may be distributed
through functions, such as when multiple clocks are connected from the top-level ports to
individual blocks, but each SC_MODULE can only be sensitive to a single clock.
The clock period, in ns, is set in Solution > Solution Settings. Vivado HLS uses the
concept of a clock uncertainty to provide a user defined timing margin. Using the clock
frequency and device target information Vivado HLS estimates the timing of operations in
the design but it cannot know the final component placement and net routing: these
operations are performed by logic synthesis of the output RTL. As such, Vivado HLS cannot
know the exact delays.
To calculate the clock period used for synthesis, Vivado HLS subtracts the clock uncertainty
from the clock period, as shown in the following figure.
[Figure: the Clock Period is divided into the Effective Clock Period used by Vivado HLS and
the Clock Uncertainty, which is the margin for logic synthesis and P&R.]
By default, the clock uncertainty is 12.5% of the cycle time. The value can be explicitly
specified beside the clock period.
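The calculation can be written out as a short worked example. The function names are hypothetical; the 12.5% default and the subtraction are as stated above.

```c
/* Default clock uncertainty: 12.5% of the clock period. */
double default_uncertainty_ns(double period_ns)
{
    return 0.125 * period_ns;
}

/* Effective clock period used for synthesis:
 * clock period minus clock uncertainty. */
double effective_period_ns(double period_ns, double uncertainty_ns)
{
    return period_ns - uncertainty_ns;
}
```

For a 10 ns clock period, the default uncertainty is 1.25 ns, so operations are scheduled against an effective period of 8.75 ns. If an uncertainty of 2 ns is specified explicitly, the effective period is 8 ns.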
Vivado HLS aims to satisfy all constraints: timing, throughput, latency. However, if a
constraint cannot be satisfied, Vivado HLS always outputs an RTL design.
If the timing constraints inferred by the clock period cannot be met Vivado HLS issues
message SCHED-644, as shown below, and creates a design with the best achievable
performance.
Even if Vivado HLS cannot satisfy the timing requirements for a particular path, it still
achieves timing on all other paths. This behavior allows you to evaluate if higher
optimization levels or special handling of those failing paths by downstream logic
synthesis can pull in and ultimately satisfy the timing.
IMPORTANT: It is important to review the constraint report after synthesis to determine if all
constraints are met: the fact that Vivado HLS produces an output design does not guarantee the design
meets all performance constraints. Review the “Performance Estimates” section of the design report.
an II value (and an II=1 is implied). If the II value is explicitly specified in the PIPELINE
directive, the relax_ii_for_timing option has no effect.
A design report is generated for each function in the hierarchy when synthesis completes
and can be viewed in the solution reports folder. The worst-case timing for the entire design
is reported as the worst case in each function report. There is no need to review every
report in the hierarchy.
If the timing violations are too severe to be further optimized and corrected by downstream
processes, review the techniques for specifying an exact latency and specifying exact
implementation cores before considering a faster target technology.
Initialization Behavior
In C, variables defined with the static qualifier and those defined in the global scope are, by
default, initialized to zero. Optionally, these variables may be assigned a specific initial
value. For these types of variables, the initial value in the C code is assigned at compile time
(at time zero) and never again. In both cases, the same initial value is implemented in the
RTL.
• During RTL simulation the variables are initialized with the same values as the C code.
• The same variables are initialized in the bitstream used to program the FPGA. When the
device powers up, the variables will start in their initialized state.
The variables start with the same initial state as the C code. However, there is no way to
force a return to this initial state. To return to their initial state the variables must be
implemented with a reset.
The reset port is used in an FPGA to return the registers and block RAM connected to the
reset port to an initial value any time the reset signal is applied. The presence and behavior
of the RTL reset port is controlled using the config_rtl configuration shown in the
following figure. To access this configuration, select Solution > Solution Settings >
General > Add > config_rtl.
IMPORTANT: When AXI4 interfaces are used on a design the reset polarity is automatically changed to
active-Low irrespective of the setting in the config_rtl configuration. This is required by the AXI4
standard.
Finer grain control over reset is provided through the RESET directive. If a variable is a static
or global, the RESET directive is used to explicitly add a reset, or the variable can be
removed from those being reset by using the RESET directive’s off option. This can be
particularly useful when static or global arrays are present in the design.
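A minimal sketch of both uses of the RESET directive follows. The function and variable names are hypothetical; the sketch assumes the pragma form of the directive takes a variable option and an optional off keyword, as described above.

```c
/* Sketch: explicitly add a reset to one static variable and exclude a
 * static array from reset. Names are hypothetical. */
int accumulate(int in)
{
    /* Running total: returns to its initial value when reset is applied. */
    static int acc = 0;
#pragma HLS RESET variable=acc

    /* Small history buffer: kept out of reset to avoid the reset logic
     * and the cycles needed to clear every element. */
    static int last[4];
#pragma HLS RESET variable=last off

    static int idx;

    acc += in;
    last[idx] = in;
    idx = (idx + 1) % 4;
    return acc;
}
```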
IMPORTANT: It is important when using the reset state or all option to consider the effect on
arrays.
Arrays are often defined as static variables, which implies all elements be initialized to zero,
and arrays are typically implemented as block RAM. When reset options state or all are
used, it forces all arrays implemented as block RAM to be returned to their initialized state
after reset. This may result in two very undesirable attributes in the RTL design:
• Unlike a power-up initialization, an explicit reset requires the RTL design to iterate
through each address in the block RAM to set the value: this can take many clock cycles
if the number of elements, N, is large, and requires more area resources to implement.
• A reset is added to every array in the design.
To prevent placing reset logic onto every such block RAM and incurring the cycle overhead
to reset all elements in the RAM:
• Use the default control reset mode and use the RESET directive to specify individual
static or global variables to be reset.
• Alternatively, use reset mode state and remove the reset from specific static or global
variables using the off option to the RESET directive.
RTL Output
Various characteristics of the RTL output by Vivado HLS can be controlled using the
config_rtl configuration shown in Figure 1-48.
• Specify the type of FSM encoding used in the RTL state machines.
• Add an arbitrary comment string, such as a copyright notice, to all RTL files using the
-header option.
• Specify a unique name with the prefix option which is added to all RTL output file
names.
• Force the RTL ports to use lower case names.
The default FSM encoding style is onehot. Other possible options are auto, binary, and
gray. If you select auto, Vivado HLS implements the style of encoding using the onehot
default, but Vivado Design Suite might extract and re-implement the FSM style during logic
synthesis. If you select any other encoding style (binary, onehot, gray), the encoding
style cannot be re-optimized by Xilinx logic synthesis tools.
The names of the RTL output files are derived from the name of the top-level function for
synthesis. If different RTL blocks are created from the same top-level function, the RTL files
will have the same name and cannot be combined in the same RTL project. The prefix
option allows RTL files generated from the same top-level function (and which by default
have the same name as the top-level function) to be easily combined in the same directory.
The lower_case_name option ensures that only lower case names are used in the output
RTL. This option ensures the IO protocol ports created by Vivado HLS, such as those for AXI
interfaces, are specified as s_axis_<port>_tdata in the final RTL rather than the default
port name of s_axis_<port>_TDATA.
Task Pipelining
Pipelining allows operations to happen concurrently: the task does not have to complete all
operations before it begins the next operation. Pipelining is applied to functions and loops.
The throughput improvements in function pipelining are shown in the following figure.
void func(…) {
op_Read; // RD
op_Compute; // CMP
op_Write; // WR
}
[Figure: function pipelining, with annotated intervals of 3 cycles and 1 cycle.]
In the pipelined version of the loop shown in (B), a new input sample is read every cycle
(II=1) and the final output is written after only 4 clock cycles: substantially improving both
the II and latency while using the same hardware resources.
void func(m,n,o) {
for (i=2;i>=0;i--) {
op_Read;
op_Compute;
op_Write;
}
}
[Figure: (A) Without Loop Pipelining; (B) With Loop Pipelining. Each iteration performs RD,
CMP, and WR.]
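The cycle counts in the loop pipelining comparison can be reproduced with a simple model. This is a sketch under stated assumptions: each of the three operations (read, compute, write) takes one clock cycle, so one iteration is 3 cycles deep, and loop entry and exit overhead is ignored. The function names are hypothetical.

```c
/* Total clock cycles to finish `iters` iterations of a loop body that is
 * `depth` cycles long, starting a new iteration every `ii` cycles. */
unsigned loop_total_cycles(unsigned iters, unsigned depth, unsigned ii)
{
    if (iters == 0u)
        return 0u;
    return (iters - 1u) * ii + depth;
}

/* Clock cycles that elapse before the final write begins, taking the
 * write to be the last one-cycle stage of the last iteration. */
unsigned cycles_before_last_write(unsigned iters, unsigned depth, unsigned ii)
{
    unsigned total = loop_total_cycles(iters, depth, ii);
    return total == 0u ? 0u : total - 1u;
}
```

For the three-iteration loop, the pipelined case (II=1) finishes in 5 cycles with the final write starting after 4 cycles, while the unpipelined case (II=3) finishes in 9 cycles.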
Pipelining is applied to the specified task, not to the hierarchy below: all loops in the
hierarchy below are automatically unrolled. Any sub-functions in the hierarchy below the
specified task must be pipelined individually. If the sub-functions are pipelined, the
pipelined tasks above it can take advantage of the pipeline performance. Conversely, any
sub-function below the pipelined task that is not pipelined, may be the limiting factor in the
performance of the pipeline.
• In the case of functions, the pipeline runs forever and never ends.
• In the case of loops, the pipeline executes until all iterations of the loop are completed.
[Figure: Pipelined Function I/O Accesses compared with Pipelined Loop I/O Accesses. The
pipelined function reads (RD) and writes (WR) continuously; the pipelined loop shows a
bubble in the reads and in the writes between executions of the loop.]
An implication from the difference in behavior is the difference in how inputs and outputs
to the pipeline are processed. As seen in the figure above, a pipelined function will
continuously read new inputs and write new outputs. By contrast, because a loop must first
finish all operations in the loop before starting the next loop, a pipelined loop causes a
“bubble” in the data stream: a point when no new inputs are read as the loop completes the
execution of the final iterations, and a point when no new outputs are written as the loop
starts new loop iterations.
Loops which are the top-level loop in a function or are used in a region where the
DATAFLOW optimization is used can be made to continuously execute using the PIPELINE
directive with the rewind option.
The following figure shows the operation when the rewind option is used when pipelining
a loop. At the end of the loop iteration count, the loop immediately starts to re-execute.
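Loop:for(i=1;i<N;i++){
op_Read;
op_Compute;
op_Write;
}
[Figure: loop pipelining with the rewind option. Successive iterations (RD, CMP, WR) overlap
through iteration N, after which the next execution of the loop begins (Execute Next Loop).]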
If the loop is used in a region with the DATAFLOW optimization, Vivado HLS automatically
implements the loop as if it is in a function hierarchy.
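A minimal sketch of a top-level loop pipelined with the rewind option is shown here. The function name, trip count, and loop body are hypothetical; the sketch assumes the rewind option is given as a keyword on the PIPELINE pragma placed inside the loop.

```c
#define RW_N 8 /* hypothetical trip count */

/* Sketch: the only loop in the function is pipelined with rewind, so a
 * new execution of the loop can begin as soon as the last iteration of
 * the previous execution has started. */
void rewind_example(const int in[RW_N], int out[RW_N])
{
    int i;

    for (i = 0; i < RW_N; i++) {
#pragma HLS PIPELINE rewind
        out[i] = in[i] * 2;
    }
}
```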
Flushing Pipelines
Pipelines continue to execute as long as data is available at the input of the pipeline. If there
is no data available to process, the pipeline will stall. This is shown in the following figure,
where the input data valid signal goes low to indicate there is no more data. Once there is
new data available to process, the pipeline will continue operation.
Figure 1-53: [Timing diagram: the Input Data Valid signal and the overlapped RD, CMP, and WR
operations through iteration N.]
IMPORTANT: The pipeline flush feature is only supported for pipelined functions.
The pipeline_loops option sets the iteration limit. All loops with an iteration count below
this limit are automatically pipelined. The default is 0: no automatic loop pipelining is
performed.
If the pipeline_loops option is set to 10 (a value above 5 but below 5*640), the
following pipelining is performed automatically:
If there are loops in the design that you do not want to use automatic pipelining, apply the
PIPELINE directive with the off option to that loop. The off option prevents automatic
loop pipelining.
IMPORTANT: Vivado HLS applies the config_compile pipeline_loops option after performing
all user-specified directives. For example, if Vivado HLS applies a user-specified UNROLL directive to a
loop, the loop is first unrolled, and automatic loop pipelining cannot be applied.
When a task is pipelined, all loops in the hierarchy are automatically unrolled. This is a
requirement for pipelining to proceed. If a loop has variable bounds, it cannot be unrolled.
This will prevent the task from being pipelined. Refer to Variable Loop Bounds in Chapter 3
for techniques to remove such loops from the design.
In this example, Vivado HLS states it cannot reach the specified initiation interval (II) of 1
because it cannot schedule a load (read) operation onto the memory because of limited
memory ports. It reports a final II of 2 instead of the desired 1.
This issue is typically caused by arrays. Arrays are implemented as block RAM which only
has a maximum of two data ports. This can limit the throughput of a read/write (or
load/store) intensive algorithm. The bandwidth can be improved by splitting the array (a
single block RAM resource) into multiple smaller arrays (multiple block RAMs), effectively
increasing the number of ports.
Arrays are partitioned using the ARRAY_PARTITION directive. Vivado HLS provides three
types of array partitioning, as shown in the following figure. The three styles of partitioning
are:
• block: The original array is split into equally sized blocks of consecutive elements of
the original array.
• cyclic: The original array is split into equally sized blocks interleaving the elements of
the original array.
• complete: The default operation is to split the array into its individual elements. This
corresponds to resolving a memory into registers.
Figure 1-54 (not reproduced): a single array partitioned in the block, cyclic, and complete styles.
For block and cyclic partitioning the factor option specifies the number of arrays that are
created. In the preceding figure, a factor of 2 is used, that is, the array is divided into two
smaller arrays. If the number of elements in the array is not an integer multiple of the factor,
the final array has fewer elements.
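The element-to-array mapping described above can be sketched in plain C. This is an illustration only: the names part_loc_t, block_partition, and cyclic_partition are hypothetical and are not part of Vivado HLS, and the block size of ceil(n / factor) is an assumption chosen to match the statement that the final array has fewer elements.

```c
#include <stddef.h>

/* Hypothetical helper type: which smaller array an element lands in,
 * and its position inside that array. Not a Vivado HLS API. */
typedef struct {
    size_t bank;   /* index of the smaller array */
    size_t offset; /* position inside that array */
} part_loc_t;

/* Block partitioning: consecutive elements stay together.
 * Assumes each block holds ceil(n / factor) elements, so the last
 * block is shorter when n is not a multiple of factor. */
part_loc_t block_partition(size_t i, size_t n, size_t factor)
{
    size_t block_size = (n + factor - 1) / factor;
    part_loc_t loc = { i / block_size, i % block_size };
    return loc;
}

/* Cyclic partitioning: elements are interleaved across the arrays. */
part_loc_t cyclic_partition(size_t i, size_t factor)
{
    part_loc_t loc = { i % factor, i / factor };
    return loc;
}
```

With eight elements and a factor of 2, element 5 lands in the second array under both schemes, at offset 1 for block partitioning and at offset 2 for cyclic partitioning.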
The examples in the figure demonstrate how partitioning dimension 3 results in 4 separate
arrays and partitioning dimension 1 results in 10 separate arrays. If zero is specified as the
dimension, all dimensions are partitioned.
Figure 1-55:
my_array[10][6][4], partition dimension 3:
    my_array_0[10][6]
    my_array_1[10][6]
    my_array_2[10][6]
    my_array_3[10][6]
my_array[10][6][4], partition dimension 1:
    my_array_0[6][4]
    my_array_1[6][4]
    my_array_2[6][4]
    my_array_3[6][4]
    my_array_4[6][4]
    my_array_5[6][4]
    my_array_6[6][4]
    my_array_7[6][4]
    my_array_8[6][4]
    my_array_9[6][4]
The partition thresholds can be adjusted and partitioning can be fully automated with the
throughput_driven option. When the throughput_driven option is selected, Vivado
HLS automatically partitions arrays to achieve the specified throughput.
In this example, Vivado HLS does not have any knowledge about the value of cols and
conservatively assumes that there is always a dependence between the write to
buff_A[1][col] and the read from buff_A[1][col].
The issue is highlighted in the following figure. If cols=0, the next iteration of the rows
loop starts immediately, and the read from buff_A[0][cols] cannot happen at the same
time as the write.
Figure 1-56 (not reproduced): recoverable labels are "Buff[...][col] accesses if cols..." and "Row/Col"; the index values are lost.
In an algorithm such as this, it is unlikely cols will ever be zero, but Vivado HLS cannot
make assumptions about data dependencies. To overcome this deficiency, you can use the
DEPENDENCE directive to provide Vivado HLS with additional information about the
dependencies. In this case, state that there is no dependence between loop iterations
for both buff_A and buff_B.
Note: Specifying a false dependency, when in fact the dependency is not false, can result in
incorrect hardware. Be sure dependencies are correct (true or false) before specifying them.
• Inter: Specifies the dependency is between different iterations of the same loop.
If this is specified as false, it allows Vivado HLS to perform operations in parallel if the
loop is pipelined, unrolled, or partially unrolled, and it prevents such concurrent
operation when specified as true.
• Intra: Specifies dependence within the same iteration of a loop, for example an array
being accessed at the start and end of the same iteration.
When intra dependencies are specified as false Vivado HLS may move operations freely
within the loop, increasing their mobility and potentially improving performance or
area. When the dependency is specified as true, the operations must be performed in
the order specified.
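As a rough illustration of an inter-iteration false dependence, the following sketch shifts a small line buffer. The function name shift_rows, the MAX_COLS bound, and the loop structure are assumptions made for this example and are not taken from the original design. Each iteration of the column loop touches only column col, which is the kind of property a designer would need to verify before asserting a false dependence.

```c
#define MAX_COLS 64

/* Hypothetical line-buffer update, loosely modeled on the buff_A
 * description in the text. Each iteration of Col_Loop reads and
 * writes only column `col`, so no iteration of that loop reads a
 * value written by a different iteration of the same loop. */
void shift_rows(int rows, int cols,
                int in[][MAX_COLS],
                int buff_A[3][MAX_COLS])
{
    for (int row = 0; row < rows; row++) {
        Col_Loop: for (int col = 0; col < cols; col++) {
#pragma HLS PIPELINE II=1
/* Assumption: the designer has checked the access pattern first. */
#pragma HLS DEPENDENCE variable=buff_A inter false
            buff_A[2][col] = buff_A[1][col];
            buff_A[1][col] = buff_A[0][col];
            buff_A[0][col] = in[row][col];
        }
    }
}
```

A standard C compiler ignores the pragmas, so the function can be checked in C simulation before the directive is trusted in synthesis.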
Data dependencies are a much harder issue to resolve and often require changes to the
source code. A scalar data dependency could look like the following:
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The next iteration of this loop cannot start until the current iteration has calculated the
updated values of a and b, as shown in the following figure.
Figure 1-57 (not reproduced).
Vivado HLS provides the ability to unroll or partially unroll for-loops using the UNROLL
directive.
The following figure shows both the powerful advantages of loop unrolling and the
implications that must be considered when unrolling loops. This example assumes the
arrays a[i], b[i] and c[i] are mapped to block RAMs. This example shows how easy
it is to create many different implementations by the simple application of loop unrolling.
void top(...) {
...
for_mult:for (i=3;i>0;i--) {
a[i] = b[i] * c[i];
}
...
}
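(Figure not reproduced: loop unrolling diagram. The only recoverable labels are four "Write a[...]" operations; the index values are lost.)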
To perform loop unrolling, you can apply the UNROLL directive to individual loops in the
design. Alternatively, you can apply the UNROLL directive to a function, which unrolls all
loops within the scope of the function.
The following example code demonstrates how loop unrolling can be used to create an
optimal design. In this example, the data is stored in the arrays as interleaved channels. If
the loop is pipelined with II=1, each channel is only read and written every 8th clock cycle.
#define CHANNELS 8
#define SAMPLES 400
#define N (CHANNELS * SAMPLES)
Partially unrolling the loop by a factor of 8 will allow each of the channels (every 8th
sample) to be processed in parallel (if the input and output arrays are also partitioned in a
cyclic manner to allow multiple accesses per clock cycle). If the loop is also pipelined with
the rewind option, this design will continuously process all 8 channels in parallel.
int i, rem;
rem=i%CHANNELS;
acc[rem] = acc[rem] + d_i[i];
d_o[i] = acc[rem];
}
}
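Because the listing above survives only in part, the following is a self-contained sketch of how the interleaved-channel accumulator might look with the directives described in the text. The function name accumulate_channels, the int element type, and the local (non-static) accumulator are assumptions made for this illustration.

```c
#define CHANNELS 8
#define SAMPLES  400
#define N        (CHANNELS * SAMPLES)

/* Sketch only: accumulates interleaved channel data. The function
 * name and the int element type are assumptions. */
void accumulate_channels(int d_o[N], const int d_i[N])
{
/* Cyclic partitioning gives each of the 8 channels its own memory. */
#pragma HLS ARRAY_PARTITION variable=d_i cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=d_o cyclic factor=8 dim=1
    int acc[CHANNELS] = {0};   /* one running sum per channel */
    int i, rem;

    For_Loop: for (i = 0; i < N; i++) {
#pragma HLS PIPELINE rewind
#pragma HLS UNROLL factor=8
        rem = i % CHANNELS;    /* channel that sample i belongs to */
        acc[rem] = acc[rem] + d_i[i];
        d_o[i] = acc[rem];
    }
}
```

The unroll factor matches the channel count, so each copy of the loop body in the unrolled loop would always address the same channel.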
Partial loop unrolling does not require the unroll factor to be an integer multiple of the
maximum iteration count. Vivado HLS adds an exit check to ensure partially unrolled loops
are functionally identical to the original loop. For example, given the following code:
Loop unrolling by a factor of 2 effectively transforms the code to look like the following
example where the break construct is used to ensure the functionality remains the same:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
if (i+1 >= N) break;
a[i+1] = b[i+1] + c[i+1];
}
Because N is a variable, Vivado HLS may not be able to determine its maximum value (it
could be driven from an input port). If you know the unrolling factor, 2 in this case, is an
integer factor of the maximum iteration count N, the skip_exit_check option removes
the exit check and associated logic. The effect of unrolling can now be represented as:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
}
This helps minimize the area and simplify the control logic.
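The role of the exit check can be checked in ordinary C. The two functions below are a hand-written model, not tool output, and the names add_rolled and add_unrolled_checked are hypothetical. They should produce identical results even when the iteration count is odd.

```c
/* Reference loop, as a designer might write it. */
void add_rolled(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hand-written model of unrolling by 2 with the exit check kept. */
void add_unrolled_checked(int *a, const int *b, const int *c, int n)
{
    for (int i = 0; i < n; i += 2) {
        a[i] = b[i] + c[i];
        if (i + 1 >= n) break;   /* exit check: n may be odd */
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

Dropping the check is only safe when the iteration count is known to be a multiple of the unroll factor; with an odd count, the version without the check would access one element past the end of each array.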
In the example without dataflow pipelining (A) in the following figure, the implementation
requires 8 cycles before a new input can be processed by func_A and 8 cycles before an
output is written by func_C.
In the example with dataflow pipelining (B) in the following figure, func_A can begin
processing a new input every 3 clock cycles (lower initiation interval) and it now only
requires 5 clocks to output a final value (shorter latency).
return d;
}
(Figure not reproduced. Panel titles: (A) Without Dataflow Pipelining; (B) With Dataflow Pipelining. Both panels show func_A, func_B, and func_C against an axis measured in cycles.)
For the DATAFLOW optimization to work, the data must flow through the design from one
task to the next. The following coding styles prevent Vivado HLS from performing the
DATAFLOW optimization:
• Single-producer-consumer violations
• Bypassing tasks
• Feedback between tasks
• Conditional execution of tasks
• Loop scopes with variable bounds
• Loops with multiple exit conditions
IMPORTANT: If any of these coding styles are present, Vivado HLS issues a message and does not
perform DATAFLOW optimization.
For Vivado HLS to perform the DATAFLOW optimization, all elements passed between tasks
must follow a single-producer-consumer model. Each variable must be driven from a single
task and only be consumed by a single task. In the following code example, temp1 fans out
and is consumed by both Loop2 and Loop3. This violates the single-producer-consumer
model.
int temp1[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
}
Loop2: for(int j = 0; j < N; j++) {
data_out1[j] = temp1[j] * 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out2[k] = temp1[k] * 456;
}
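One way to restore the single-producer-consumer model is to duplicate temp1 so that each copy has exactly one consumer. The sketch below assumes a wrapper function named scale_two_ways and a Split loop; both names, and the value of N, are chosen for this illustration.

```c
#define N 16

/* Sketch: the Split loop duplicates temp1 so that temp2 and temp3
 * each have one producer and one consumer. */
void scale_two_ways(const int data_in[N], int scale,
                    int data_out1[N], int data_out2[N])
{
#pragma HLS DATAFLOW
    int temp1[N], temp2[N], temp3[N];

    Loop1: for (int i = 0; i < N; i++) {
        temp1[i] = data_in[i] * scale;
    }
    Split: for (int i = 0; i < N; i++) {
        temp2[i] = temp1[i];
        temp3[i] = temp1[i];
    }
    Loop2: for (int j = 0; j < N; j++) {
        data_out1[j] = temp2[j] * 123;
    }
    Loop3: for (int k = 0; k < N; k++) {
        data_out2[k] = temp3[k] * 456;
    }
}
```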
In addition, data must flow from one task into the next task. If you bypass tasks, this inhibits
the DATAFLOW optimization. In this example, Loop1 generates the values for temp1 and
temp2. However, the next task, Loop2, only uses the value of temp1. The value of temp2 is
not consumed until after Loop2. Therefore, temp2 bypasses the next task in the sequence,
which prevents Vivado HLS from performing the DATAFLOW optimization.
Because the loop iteration limits are all the same in this example, you can modify the code
so that Loop2 consumes temp2 and produces temp4 as follows. This ensures that the data
flows from one task to the next.
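A minimal sketch of this restructuring follows. The function name no_bypass and the arithmetic in each loop are assumptions; the point is only that Loop2 forwards temp2 as temp4, so every intermediate array is consumed by the very next task.

```c
#define N 16

/* Sketch: Loop2 passes temp2 through as temp4 instead of letting
 * temp2 bypass it. The arithmetic is illustrative only. */
void no_bypass(const int data_in[N], int scale, int data_out[N])
{
#pragma HLS DATAFLOW
    int temp1[N], temp2[N], temp3[N], temp4[N];

    Loop1: for (int i = 0; i < N; i++) {
        temp1[i] = data_in[i] * scale;
        temp2[i] = data_in[i] >> scale;
    }
    Loop2: for (int j = 0; j < N; j++) {
        temp3[j] = temp1[j] + 123;
        temp4[j] = temp2[j];   /* forward temp2 to the next task */
    }
    Loop3: for (int k = 0; k < N; k++) {
        data_out[k] = temp4[k] + temp3[k];
    }
}
```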
Feedback occurs when the output from a task is consumed by a previous task in the
DATAFLOW region. Feedback between tasks is not permitted in a DATAFLOW region. When
Vivado HLS detects feedback, it issues a warning and does not perform the DATAFLOW
optimization.
The DATAFLOW optimization does not optimize tasks that are conditionally executed. The
following example highlights this limitation. In this example, the conditional execution of
Loop1 and Loop2 prevents Vivado HLS from optimizing the data flow between these loops,
because the data does not flow from one loop into the next.
if (sel) {
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * 123;
temp2[i] = data_in[i];
}
} else {
Loop2: for(int j = 0; j < N; j++) {
temp1[j] = data_in[j] * 321;
temp2[j] = data_in[j];
}
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
To ensure each loop is executed in all cases, you must transform the code as shown in the
following example. In this example, the conditional statement is moved into the first loop.
Both loops are always executed, and data always flows from one loop to the next.
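The transformation can be sketched as follows, assuming a wrapper function named cond_in_loop and N equal to 16. The branch on sel moves inside Loop1, so the task itself always runs and always feeds Loop2.

```c
#define N 16

/* Sketch: the condition is evaluated inside the first task rather
 * than selecting between two tasks. */
void cond_in_loop(int sel, const int data_in[N], int data_out[N])
{
#pragma HLS DATAFLOW
    int temp1[N], temp2[N];

    Loop1: for (int i = 0; i < N; i++) {
        if (sel) {
            temp1[i] = data_in[i] * 123;
        } else {
            temp1[i] = data_in[i] * 321;
        }
        temp2[i] = data_in[i];
    }
    Loop2: for (int k = 0; k < N; k++) {
        data_out[k] = temp1[k] * temp2[k];
    }
}
```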
Loops with multiple exit points cannot be used in a DATAFLOW region. In the following
example, Loop2 has three exit conditions:
• An exit defined by the value of N; the loop will exit when k>=N.
• An exit defined by the break statement.
• An exit defined by the continue statement.
#include "ap_cint.h"
#define N 16
Because a loop’s exit condition is always defined by the loop bounds, the use of break or
continue statements will prohibit the loop being used in a DATAFLOW region.
Vivado HLS implements channels between the tasks as either ping-pong or FIFO buffers,
depending on the access patterns of the producer and the consumer of the data:
• For scalar, pointer, and reference parameters as well as the function return, Vivado HLS
implements the channel as a FIFO.
Note: For scalar values, the maximum channel size is one, that is, only one value is passed from
one function to another.
• If the parameter (producer or consumer) is an array, Vivado HLS implements the
channel as a ping-pong buffer or a FIFO as follows:
° If Vivado HLS determines the data is accessed in sequential order, Vivado HLS
implements the memory channel as a FIFO channel of depth 1.
° If Vivado HLS is unable to determine that the data is accessed in sequential order or
determines the data is accessed in an arbitrary manner, Vivado HLS implements the
memory channel as a ping-pong buffer, that is, as two block RAMs each defined by
the maximum size of the consumer or producer array.
Note: A ping-pong buffer ensures that the channel always has the capacity to hold all
samples without a loss. However, this might be an overly conservative approach in some
cases. For example, if tasks are pipelined with an interval of 1 and use data in a streaming,
sequential manner but Vivado HLS is unable to automatically determine the sequential data
usage, Vivado HLS implements a ping-pong buffer. In this case, the channel only requires a
single register and not two block RAMs defined by the size of the array.
To explicitly specify the default channel used between tasks, use the config_dataflow
configuration. This configuration sets the default channel for all channels in a design. To
reduce the size of the memory used in the channel, you can use a FIFO. To explicitly set the
depth or number of elements in the FIFO, use the fifo_depth option.
Specifying the size of the FIFO channels overrides the default safe approach. If any task in
the design can produce or consume samples at a greater rate than the specified size of the
FIFO, the FIFOs might become empty (or full). In this case, the design halts operation,
because it is unable to read (or write). This might result in a stalled, unrecoverable state.
Note: This issue only appears when executing C/RTL co-simulation or when the block is used in a
complete system.
When setting the depth of the FIFOs, it is recommended that you use FIFOs with the default
depth, confirm the design passes C/RTL co-simulation, and then reduce the size of the
FIFOs and confirm C/RTL co-simulation still completes without issues. If RTL co-simulation
fails, the size of the FIFO is likely too small to prevent stalling.
• If an array on the top-level function interface is set as interface type ap_fifo, axis, or
ap_hs, it is automatically set as streaming.
• The arrays used in a region where the DATAFLOW optimization is applied are
automatically set to streaming if Vivado HLS determines the data is streaming between
the tasks or if the config_dataflow configuration sets the default memory channel
as FIFO.
All other arrays must be specified as streaming using the STREAM directive if a FIFO is
required for the implementation.
The STREAM directive is also used to change any arrays in a DATAFLOW region from the
default implementation specified by the config_dataflow configuration.
When a maximum and/or minimum LATENCY constraint is placed on a scope, Vivado HLS
tries to ensure all operations in the function complete within the range of clock cycles
specified.
The latency directive applied to a loop specifies the required latency for a single iteration of
the loop: it specifies the latency for the loop body, as the following example shows:
If the intention is to limit the total latency of all loop iterations, the latency directive should
be applied to a region that encompasses the entire loop, as in this example:
Region_All_Loop_A: {
#pragma HLS latency max=10
Loop_A: for (i=0; i<N; i++)
{
..Loop Body...
}
}
In this case, even if the loop is unrolled, the latency directive sets a maximum limit on all
loop operations.
If Vivado HLS cannot meet a maximum latency constraint, it relaxes the latency constraint
and tries to achieve the best possible result.
If a minimum latency constraint is set and Vivado HLS can produce a design with a lower
latency than the minimum required, it inserts dummy clock cycles to meet the minimum
latency.
The following figure shows a simple example where a seemingly intuitive coding style has
a negative impact on the performance of the RTL design.
(Figure not reproduced; the only recoverable label is "cycle".)
In this simple example it is obvious that an else branch in the ADD loop would also solve the
issue but in a more complex example it may be less obvious and the more intuitive coding
style may have greater advantages.
Merging loops allows the logic within the loops to be optimized together. In the example
above, using a dual-port block RAM allows the add and subtraction operations to be
performed in parallel.
• If loop bounds are all variables, they must have the same value.
• If loop bounds are constants, the maximum constant value is used as the bound of the
merged loop.
• Loops with both variable bound and constant bound cannot be merged.
• The code between loops to be merged cannot have side effects: multiple executions of
this code should generate the same results (a=b is allowed, a=a+1 is not).
• Loops cannot be merged when they contain FIFO accesses: merging would change the
order of the reads and writes from a FIFO: these must always occur in sequence.
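The rules above can be illustrated with two loops that share the same constant bound and have no code between them. The function names and the value of N are assumptions. The second function is a hand-written model of what merging achieves, not tool output.

```c
#define N 8

/* Two separate loops with the same constant bound. LOOP_MERGE asks
 * the tool to merge the loops in this scope. */
void add_sub_separate(const int a[N], const int b[N],
                      int sum[N], int diff[N])
{
#pragma HLS LOOP_MERGE
    Add: for (int i = 0; i < N; i++) {
        sum[i] = a[i] + b[i];
    }
    Sub: for (int i = 0; i < N; i++) {
        diff[i] = a[i] - b[i];
    }
}

/* Hand-written model of the merged result: one loop, both bodies. */
void add_sub_merged(const int a[N], const int b[N],
                    int sum[N], int diff[N])
{
    Merged: for (int i = 0; i < N; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = a[i] - b[i];
    }
}
```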
In the small example shown here, this implies 200 extra clock cycles to execute loop Outer.
void foo_top(a, b, c, d) {
  ...
  Outer: while(j<100) {
    Inner: while(i<6) { // 1 cycle to enter inner
      ...
      LOOP_BODY
      ...
    } // 1 cycle to exit inner
  }
  ...
}
Vivado HLS provides the set_directive_loop_flatten command to allow labeled perfect and
semi-perfect nested loops to be flattened, removing the need to re-code for optimal
hardware performance and reducing the number of cycles it takes to perform the
operations in the loop.
• Perfect loop nest: only the innermost loop has loop body content, there is no logic
specified between the loop statements and all the loop bounds are constant.
• Semi-perfect loop nest: only the innermost loop has loop body content, there is no
logic specified between the loop statements but the outermost loop bound can be a
variable.
For imperfect loop nests, where the inner loop has variable bounds or the loop body is not
exclusively inside the inner loop, designers should try to restructure the code, or unroll the
loops in the loop body to create a perfect loop nest.
When the directive is applied to a set of nested loops, it should be applied to the innermost
loop that contains the loop body.
set_directive_loop_flatten top/Inner
Loop flattening can also be performed using the directive tab in the GUI, either by applying
it to individual loops or applying it to all loops in a function by applying the directive at the
function level.
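In source code, the same optimization can be requested with a pragma in the innermost loop. The sketch below uses a perfect loop nest with constant bounds; the function names and array shape are assumptions, and the second function is only a hand-written model of a flattened loop.

```c
#define ROWS 100
#define COLS 6

/* Perfect loop nest: constant bounds, body only in the innermost loop. */
int sum_nested(int m[ROWS][COLS])
{
    int total = 0;
    Outer: for (int j = 0; j < ROWS; j++) {
        Inner: for (int i = 0; i < COLS; i++) {
#pragma HLS LOOP_FLATTEN
            total += m[j][i];
        }
    }
    return total;
}

/* Hand-written model of the flattened loop: a single loop of
 * ROWS * COLS iterations, with no inner-loop entry or exit. */
int sum_flattened(int m[ROWS][COLS])
{
    int total = 0;
    Flat: for (int k = 0; k < ROWS * COLS; k++) {
        total += m[k / COLS][k % COLS];
    }
    return total;
}
```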
• Use the appropriate precision for the data types. Refer to Data Types for Efficient
Hardware.
• Confirm the size of any arrays that are to be implemented as RAMs or registers. The
area impact of any over-sized elements is wasteful in hardware resources.
• Pay special attention to multiplications, divisions, modulus or other complex arithmetic
operations. If these variables are larger than they need to be, they negatively impact
both area and performance.
Function Inlining
Function inlining removes the function hierarchy. A function is inlined using the INLINE
directive.
Inlining a function may improve area by allowing the components within the function to be
better shared or optimized with the logic in the calling function. This type of function
inlining is also performed automatically by Vivado HLS. Small functions are automatically
inlined.
Inlining allows function sharing to be better controlled. For functions to be shared they
must be used within the same level of hierarchy. In this code example, function foo_top
calls foo twice and function foo_sub.
foo_sub (p, q) {
  int q1 = q + 10;
  foo(p,q1);// foo_3
  ...
}
void foo_top(a, b, c, d) {
  ...
  foo(a,b);//foo_1
  foo(a,c);//foo_2
  foo_sub(a,d);
  ...
}
Inlining function foo_sub and using the ALLOCATION directive to limit function foo to
one instance results in a design that has only one instance of function foo:
one-third the area of the example above.
foo_sub (p, q) {
#pragma HLS INLINE
  int q1 = q + 10;
  foo(p,q1);// foo_3
  ...
}
void foo_top(a, b, c, d) {
#pragma HLS ALLOCATION instances=foo limit=1 function
  ...
  foo(a,b);//foo_1
  foo(a,c);//foo_2
  foo_sub(a,d);
  ...
}
The INLINE directive optionally allows all functions below the specified function to be
recursively inlined by using the recursive option. If the recursive option is used on the
top-level function, all function hierarchy in the design is removed.
The INLINE off option can optionally be applied to functions to prevent them being
inlined. This option may be used to prevent Vivado HLS from automatically inlining a
function.
The INLINE directive is a powerful way to substantially modify the structure of the code
without actually performing any modifications to the source code and provides a very
powerful method for architectural exploration.
Each array is mapped into a block RAM. The basic block RAM unit provided in an FPGA is 18K.
If many small arrays do not use the full 18K, a better use of the block RAM resources is to map
many of the small arrays into a larger array. If a block RAM is larger than 18K, it is
automatically mapped into multiple 18K units. In the synthesis report, review Utilization
Report > Details > Memory for a complete understanding of the block RAMs in your
design.
The ARRAY_MAP directive supports two ways of mapping small arrays into a larger one:
The following code example has two arrays that would result in two RAM components.
Arrays array1 and array2 can be combined into a single array, specified as array3 in
the following example:
In this example, the ARRAY_MAP directive transforms the arrays as shown in the following
figure.
Figure 1-63 (not reproduced): horizontal expansion produces a longer array with more elements.
When you use the horizontal mapping shown in Figure 1-63, the implementation in the
block RAM appears as shown in the following figure.
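Figure 1-64 (not reproduced): a single RAM (RAM1P) in which the M-element array occupies addresses 0 to M-1 and the N-element array occupies addresses M to M+N-1; the data word runs from MSB to LSB.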
The offset option to the ARRAY_MAP directive is used to specify at which location
subsequent arrays are added when using the horizontal option. Repeating the previous
example, but reversing the order of the commands (specifying array2 then array1) and
adding an offset, as shown below:
(Figure not reproduced. The recoverable label reads "Offset of ... from the end of array ... elements"; the numeric values are lost.)
Although horizontal mapping can result in using fewer block RAM components and therefore
improve area, it does have an impact on the throughput and performance as there are now
fewer block RAM ports. To overcome this limitation, Vivado HLS also provides vertical
mapping.
(Figure not reproduced: vertical expansion produces array[N] with more bits per element, spanning MSB to LSB.)
To map elements from a partitioned array into a single array with horizontal mapping,
the individual elements of the array to be partitioned must be specified in the ARRAY_MAP
directive. For example, the following Tcl commands partition array accum and map the
resulting elements back together.
It is possible to map a global array. However, the resulting array instance is global and any
local arrays mapped onto this same array instance become global. When local arrays of
different functions get mapped onto the same target array, then the target array instance
becomes global.
Array function arguments may only be mapped if they are arguments to the same function.
Array Reshaping
The ARRAY_RESHAPE directive combines ARRAY_PARTITION with the vertical mode of
ARRAY_MAP and is used to reduce the number of block RAMs while still allowing the
beneficial attribute of partitioning: parallel access to the data.
The ARRAY_RESHAPE directive transforms the arrays into the form shown in the following
figure.
Figure 1-68 (not reproduced): array[N] reshaped in the block, cyclic, and complete styles, with the combined elements spanning MSB to LSB.
config_unroll -tripcount_threshold 16
Function Instantiation
Function instantiation is an optimization technique that has the area benefits of
maintaining the function hierarchy but provides an additional powerful option: performing
targeted local optimizations on specific instances of a function. This can simplify the control
logic around the function call and potentially improve latency and throughput.
The FUNCTION_INSTANTIATE directive exploits the fact that some inputs to a function may
be a constant value when the function is called and uses this to both simplify the
surrounding control structures and produce smaller more optimized function blocks. This is
best explained by example.
void foo(){
#pragma HLS FUNCTION_INSTANTIATE variable=select
foo_sub(true);
foo_sub(false);
}
It is clear that function foo_sub has been written to perform multiple but exclusive
operations (depending on whether mode is true or not). Each instance of function foo_sub
is implemented in an identical manner: this is great for function reuse and area optimization
but means that the control logic inside the function must be more complex.
void foo_sub1() {
// code segment 1
}
void foo_sub2() {
// code segment 2
}
void A(){
B1();
B2();
}
If the function is used at different levels of hierarchy such that function sharing is difficult
without extensive inlining or code modifications, function instantiation can provide the
best means of improving area: many small locally optimized copies are better than many
large copies that cannot be shared.
• First, elaborates the C, C++ or SystemC source code into an internal database
containing operators.
• Then, maps the operators onto cores that implement the hardware operations.
Cores are the specific hardware components used to create the design (such as adders,
multipliers, pipelined multipliers, and block RAM).
Control is provided over each of these steps, allowing you to control the hardware
implementation at a fine level of granularity.
Explicitly limiting the number of operators to reduce area may be required in some cases:
the default operation of Vivado HLS is to first maximize performance. Limiting the number
of operators in a design is a useful technique to reduce the area: it helps reduce area by
forcing sharing of the operations.
The ALLOCATION directive allows you to limit how many operators, cores, or functions
are used in a design. For example, suppose a design called foo has 317 multiplications but the
FPGA only has 256 multiplier resources (DSP48s). The ALLOCATION directive shown below
directs Vivado HLS to create a design with a maximum of 256 multiplication (mul) operators:
for (i=0;i<317;i++) {
#pragma HLS UNROLL
acc += acc * d[i];
}
return acc;
}
Note: If you specify an ALLOCATION limit that is greater than needed, Vivado HLS attempts to use
the number of resources specified by the limit, or the maximum necessary, which reduces the
amount of sharing.
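Because the listing above is incomplete, here is a self-contained sketch of the same idea. The unsigned element type and the initial accumulator value are assumptions, and the pragma follows the same form as the function-level ALLOCATION example elsewhere in this chapter, with the operation type in place of function.

```c
#define N 317

/* Sketch: fully unrolled, the loop would need 317 multipliers; the
 * ALLOCATION directive caps the number of mul operators at 256.
 * Unsigned arithmetic is an assumption to keep overflow well defined. */
unsigned int foo(const unsigned int d[N])
{
#pragma HLS ALLOCATION instances=mul limit=256 operation
    unsigned int acc = 1;
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        acc += acc * d[i];
    }
    return acc;
}
```

The directive does not change the C behavior; it only constrains how many multiplier operators the tool may instantiate, which forces sharing.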
You can use the type option to specify if the ALLOCATION directive limits operations,
cores, or functions. The following table lists all the operations that can be controlled using
the ALLOCATION directive.
The ALLOCATION directive, like all directives, is specified inside a scope: a function, a loop
or a region. The config_bind configuration allows the operators to be minimized
throughout the entire design.
The minimization of operators through the design is performed using the min_op option in
the config_bind configuration. Any of the operators listed in Table 1-13 can be
limited in this fashion.
After the configuration is applied it applies to all synthesis operations performed in the
solution: if the solution is closed and re-opened the specified configuration still applies to
any new synthesis operations.
Any configurations applied with the config_bind configuration can be removed by using
the reset option or by using open_solution -reset to open the solution.
When synthesis is performed, Vivado HLS uses the timing constraints specified by the clock,
the delays specified by the target device together with any directives specified by you, to
determine which core is used to implement the operators. For example, to implement a
multiplier operation, Vivado HLS could use the combinational multiplier core or use a
pipelined multiplier core.
The cores which are mapped to operators during synthesis can be limited in the same
manner as the operators. Instead of limiting the total number of multiplication operations,
you can choose to limit the number of combinational multiplier cores, forcing any
remaining multiplications to be performed using pipelined multipliers (or vice versa). This is
performed by specifying the ALLOCATION directive type option to be core.
The RESOURCE directive is used to explicitly specify which core to use for specific
operations. In the following example, a 2-stage pipelined multiplier is specified to
implement the multiplication for variable c. It is left to Vivado HLS which core to use for
variable d.
return d;
}
In the following example, the RESOURCE directive specifies that the add operation for
variable temp is implemented using the AddSub_DSP core. This ensures that the
operation is implemented using a DSP48 primitive in the final design. By default, add
operations are implemented using LUTs.
dout2_t temp;
#pragma HLS RESOURCE variable=temp core=AddSub_DSP
The list_core command is used to obtain details on the cores available in the library. The
list_core command can only be used in the Tcl command interface and a device must be specified
using the set_part command. If a device has not been selected, the command does not
have any effect.
The -operation option of the list_core command lists all the cores in the library that
can be implemented with the specified operation.
The following table lists the cores used to implement standard RTL logic operations (such as
add, multiply, and compare).
MulnS N-stage pipelined multiplier with bit-widths that exceed the size of a standard DSP48
macrocell.
Note: Multipliers that can be implemented with a single DSP48 macrocell are mapped to the
DSP48 core.
In addition to the standard cores, the following floating point cores are used when the
operation uses floating-point types. Refer to the documentation for each device to
determine if the floating-point core is supported in the device.
The following table lists the cores used to implement storage elements, such as registers or
memories.
The RESOURCE directive uses the assigned variable as the target for the resource. Given the
following code, the RESOURCE directive specifies the multiplication for out1 is implemented with a
3-stage pipelined multiplier.
void foo(...) {
#pragma HLS RESOURCE variable=out1 latency=3
If the assignment specifies multiple identical operators, the code must be modified to
ensure there is a single variable for each operator to be controlled. For example, if only the
first multiplication in this example (inA * inB) is to be implemented with a pipelined
multiplier:
The code should be changed to the following with the directive specified on the
Result_tmp variable:
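A minimal sketch of this change is shown here. Only the Result_tmp variable and the inA * inB product come from the text; the function name mul3, the third operand inC, and the latency value (copied from the out1 example) are assumptions.

```c
/* Sketch only: names other than Result_tmp, inA, and inB are
 * assumptions made for this illustration. */
int mul3(int inA, int inB, int inC)
{
    int Result_tmp, Result;
#pragma HLS RESOURCE variable=Result_tmp latency=3
    Result_tmp = inA * inB;      /* the multiplication the directive targets */
    Result = Result_tmp * inC;   /* core selection left to Vivado HLS */
    return Result;
}
```

Splitting the expression gives the first multiplication its own assigned variable, which is what the directive needs as a target.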
The config_bind configuration provides control over the binding process. The
configuration allows you to direct how much effort is spent when binding cores to
operators. By default Vivado HLS chooses cores which are the best balance between timing
and area. The config_bind influences which operators are used.
The config_bind command can only be issued inside an active solution. The default run
strategy for the binding operation is medium.
• Low Effort: Spend less time sharing; run time is faster but the final RTL may be
larger. Useful for cases when the designer knows there is little sharing possible or
desirable and does not wish to waste CPU cycles exploring possibilities.
• Medium Effort: The default, where Vivado HLS tries to share operations but endeavors
to finish in a reasonable time.
• High Effort: Try to maximize sharing and do not limit run time. Vivado HLS keeps
trying until all possible combinations of sharing are explored.
Optimizing Logic
Controlling Operator Pipelining
Vivado HLS automatically determines the level of pipelining to use for internal operations.
You can use the RESOURCE directive with the -latency option to explicitly specify the
number of pipeline stages and override the number determined by Vivado HLS.
RTL synthesis might use the additional pipeline registers to help improve timing issues that
might result after place and route. Registers added to the output of the operation typically
help improve timing in the output datapath. Registers added to the input of the operation
typically help improve timing in both the input datapath and the control logic from the
FSM.
• If the latency is specified as 1 cycle more than the latency decided by Vivado HLS,
Vivado HLS adds new output registers to the output of the operation.
• If the latency is specified as 2 more than the latency decided by Vivado HLS, Vivado
HLS adds registers to the output of the operation and to the input side of the
operation.
• If the latency is specified as 3 or more cycles more than the latency decided by Vivado HLS,
Vivado HLS adds registers to the output of the operation and to the input side of the
operation. Vivado HLS automatically determines the location of any additional
registers.
You can use the config_core configuration to pipeline all instances of a specific core
used in the design that have the same pipeline depth. To set this configuration:
For example, the following configuration specifies that all operations implemented with
the DSP48 core are pipelined with a latency of 4, which is the maximum latency allowed
by this core:
The following configuration specifies that all block RAM implemented with the
RAM_1P_BRAM core are pipelined with a latency of 3:
IMPORTANT: Vivado HLS only applies the core configuration to block RAM with an explicit RESOURCE
directive that specifies the core used to implement the array. If an array is implemented using a
default core, the core configuration does not affect the block RAM.
See Table 1-16 for a list of all the cores you can use to implement arrays.
Expression balancing rearranges operators to construct a balanced tree and reduce latency.
Given the highly sequential code using assignment operators such as += and *= in the
following example:
sum = 0;
sum += a;
sum += b;
sum += c;
sum += d;
return sum;
Without expression balancing, and assuming each addition requires one clock cycle, the
complete computation for sum requires four clock cycles, as shown in the following figure.
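[Figure: adder chain without expression balancing. Inputs a, b, c, and d are added one per
cycle over four cycles to produce sum.]
[Figure: adder tree with expression balancing. Inputs a, b, c, and d are added over two
cycles to produce sum.]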
A=B*C; A=B*F;
D=E*F; D=E*C;
O1=A*D; O2=A*D;
This behavior is a function of the saturation and rounding in the C standard when
performing operations with types float or double. Therefore, Vivado HLS always
maintains the exact order of operations when variables of type float or double are
present and does not perform expression balancing by default.
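The order sensitivity described above can be reproduced with any standard C++ compiler. The following is a minimal sketch, assuming IEEE 754 single precision and no fast-math compiler flags; the function names sum_left and sum_right are illustrative and not part of Vivado HLS:

```cpp
// Two groupings of the same three-term float sum.
// Mathematically equal, but rounding makes them differ for some inputs.

// Evaluates (a + b) + c, the order written in the source.
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

// Evaluates a + (b + c), an order a rebalancing optimization could choose.
float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, and c = 1.0f, sum_left returns 1.0f because a and b cancel exactly before c is added, while sum_right returns 0.0f because c is lost when it is added to the much larger b. This is why reordering floating-point operations can change results.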
You can enable expression balancing with float and double types using the
configuration config_compile option as follows:
With this setting enabled, Vivado HLS might change the order of operations to produce a
more optimal design. However, the results of C/RTL cosimulation might differ from the C
simulation.
x - 0.0 = x;
x + 0.0 = x;
0.0 - x = -x;
x - x = 0.0;
x*0.0 = 0.0;
TIP: When the unsafe_math_operations and no_signed_zero optimizations are used, the RTL
implementation will have different results than the C simulation. The test bench should be capable of
ignoring minor differences in the result: check for a range, do not perform an exact comparison.
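One way to check for a range rather than an exact match is a combined absolute and relative tolerance test. The following is a sketch only; the helper name within_tolerance and the tolerance values are assumptions that must be chosen to suit the algorithm under test:

```cpp
#include <cmath>

// Returns true when actual is close enough to expected.
// abs_tol handles expected values near zero; rel_tol scales with magnitude.
bool within_tolerance(double actual, double expected,
                      double rel_tol, double abs_tol) {
    const double diff = std::fabs(actual - expected);
    return diff <= abs_tol || diff <= rel_tol * std::fabs(expected);
}
```

A test bench would call this helper for each output sample and count a mismatch only when it returns false.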
• The C simulation is executed and the inputs to the top-level function, or the
Device-Under-Test (DUT), are saved as “input vectors”.
• The “input vectors” are used in an RTL simulation using the RTL created by Vivado HLS.
The outputs from the RTL are saved as “output vectors”.
• The “output vectors” from the RTL simulation are applied to the C test bench, after the
function for synthesis, to verify the results are correct. The C test bench performs the
verification of the results.
The following messages are output by Vivado HLS to show the progress of the verification.
C simulation:
At this stage, because the C simulation was executed, any messages written by the C test
bench are output in the console window or log file.
RTL simulation:
At this stage, any messages from the RTL simulation are output in the console window or
log file.
The importance of the C test bench in the C/RTL co-simulation flow is discussed below.
[Figure: the test bench performs result checking around the DUT in C simulation and around
the RTL module in RTL simulation.]
• The test bench must be self-checking and return a value of 0 if the test passes or
a non-zero value if the test fails.
• The correct interface synthesis options must be selected.
• Any 3rd-party simulators must be available in the search path.
• Any arrays or structs on the design interface cannot use the optimization directives or
combinations of optimization directives listed in Unsupported Optimizations for
Cosimulation.
int main () {
  int ret=0;
  …
  // Execute (DUT) Function
  …
  if (ret != 0) {
    printf("Test failed !!!\n");
    ret=1;
  } else {
    printf("Test passed !\n");
  }
  …
  return ret;
}
This self-checking test bench compares the results against known good results in the
output.golden.dat file.
Note: There are many ways to perform this checking. This is just one example.
In the Vivado HLS design flow, the return value of function main() indicates the following:
RECOMMENDED: Because the system environment (for example, Linux, Windows, or Tcl) interprets the
return value of the main() function, it is recommended that you constrain the return value to an 8-bit
range for portability and safety.
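The reason for constraining the return value can be sketched as follows. On POSIX-like systems, only the low 8 bits of the value returned from main() reach the calling environment, so returning a raw mismatch count of 256 would be reported as 0. The helper names below are illustrative assumptions, not part of Vivado HLS:

```cpp
// Models what a POSIX-like environment observes for a given return value:
// only the low 8 bits survive.
int observed_status(long returned_value) {
    return static_cast<int>(returned_value & 0xFF);
}

// Collapses any non-zero mismatch count to 1 so a failure can never
// be truncated to 0.
int to_exit_code(long mismatches) {
    return mismatches == 0 ? 0 : 1;
}
```

A test bench that counts mismatches can end with return to_exit_code(mismatches); instead of returning the count directly.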
CAUTION! You are responsible for ensuring that the test bench checks the results. If the test bench does
not check the results but returns zero, Vivado HLS indicates that the simulation test passed even though
the results were not actually checked.
If one of these conditions is not met, C/RTL co-simulation halts with the following message:
IMPORTANT: To verify an RTL design using the third-party simulators (for example, ModelSim, VCS,
Riviera), you must include the executable to the simulator in the system search path, and the
appropriate license must be available. See the third-party vendor documentation for details on
configuring these simulators.
IMPORTANT: When verifying a SystemC design, you must select the ModelSim simulator and ensure it
includes C compiler capabilities with appropriate licensing.
• Horizontal Mapping
• Vertical Mapping of arrays of different sizes
• Data Pack on structs containing other structs as members
For other supported HDL simulators, the Xilinx floating-point library must be pre-compiled
and added to the simulator libraries. The following example steps demonstrate how the
floating-point library can be compiled in Verilog for use with the VCS simulator:
1. Open Vivado (not Vivado HLS) and issue the following command in the Tcl console
window:
compile_simlib -simulator vcs_mx -family all -language verilog
• Setup Only: This creates all the files (wrappers, adapters, and scripts) required to run
the simulation but does not execute the simulator. The simulation can be run in the
command shell from within the appropriate RTL simulation folder
<solution_name>/sim/<RTL>.
• Dump Trace: This generates a trace file for every function, which is saved to the
<solution>/sim/<RTL> folder. The drop-down menu allows you to select which
signals are saved to the trace file. You can choose to trace all signals in the design,
trace just the top-level ports, or trace no signals. For details on using the trace file, see
the documentation for the selected RTL simulator.
• Optimizing Compile: This ensures a high level of optimization is used to compile the C
test bench. Using this option increases the compile time but the simulation executes
faster.
• Reduce Disk Space: The flow shown in Figure 1-71 saves the results for all transactions
before executing RTL simulation. In some cases, this can result in large data files. The
reduce_diskspace option can be used to execute one transaction at a time and
reduce the amount of disk space required for the file. If the function is executed N
times in the C test bench, the reduce_diskspace option ensures N separate RTL
simulations are performed. This causes the simulation to run slower.
• Compiled Library Location: This specifies the location of the compiled library for a
third-party RTL simulator.
Note: If you are simulating with a third-party RTL simulator and the design uses IP, you must use
an RTL simulation model for the IP before performing RTL simulation. To create or obtain the RTL
simulation model, contact your IP provider.
• Input Arguments: This allows the specification of any arguments required by the test
bench.
where
Any files written by the C test bench during co-simulation and any trace files generated by
the simulator are written to this directory. For example, if the C test bench saves the
output results for comparison, review the output file in this directory and compare it with
the expected results.
• Verilog/VHDL Simulator Selection: Select Vivado Simulator. For Xilinx 7 series and
later devices, you can alternatively select Auto.
• Dump Trace: Select all or port.
When C/RTL cosimulation completes, the Open Wave Viewer toolbar button opens the RTL
waveforms in the Vivado IDE.
Note: When you open the Vivado IDE using this method, you can only use the waveform analysis
features, such as zoom, pan, and waveform radix.
To debug a C/RTL cosimulation failure, run the checks described in the following sections.
If you are unable to resolve the C/RTL cosimulation failure, see Xilinx Support for support
resources, such as answers, documentation, downloads, and forums.
Are you running Linux? Ensure that your setup files (for example .cshrc or .bashrc) do not
have a change directory command. When C/RTL cosimulation starts, it
spawns a new shell process. If there is a cd command in your setup files,
it causes the shell to run in a different location and eventually C/RTL
cosimulation fails.
Optimization Directives
Check the C test bench and C source code as shown in the following table.
Table 1-19: Debugging the C Test Bench and C Source Code

Question: Are you using floating-point math operations in the design?
Actions to Take:
• Check that the C test bench results are within an acceptable error range instead of
performing an exact comparison. For some of the floating-point math operations, the RTL
implementation is not identical to the C. For details, see Verification and Math Functions
in Chapter 2.
• Ensure that the RTL simulation models for the floating-point cores are provided to the
third-party simulator. For details, see Simulating Floating-Point Cores.

Question: Are you using Xilinx IP blocks and a third-party simulator?
Actions to Take: Ensure that the path to the Xilinx IP HDL models is provided to the
third-party simulator.

Question: Are you using the hls::stream construct in the design that changes the data rate
(for example, decimation or interpolation)?
Actions to Take: Analyze the design and use the STREAM directive to increase the size of
the FIFOs used to implement the hls::stream.
Note: By default, an hls::stream is implemented as a FIFO with a depth of 1. If the design
results in an increase in the data rate (for example, an interpolation operation), a
default FIFO size of 1 might be too small and cause the C/RTL cosimulation to stall.

Question: Are you using very large data sets in the simulation?
Actions to Take: Use the reduce_diskspace option when executing C/RTL cosimulation. In this
mode, Vivado HLS only executes 1 transaction at a time. The simulation might run marginally
slower, but this limits storage and system capacity issues.
Note: The C/RTL cosimulation feature verifies all transactions at one time. If the
top-level function is called multiple times (for example, to simulate multiple frames of
video), the data for the entire simulation input and output is stored on disk. Depending on
the machine setup and OS, this might cause performance or execution issues.
The following table shows the formats you can export with details about each.
You can only export designs targeted to 7 series devices, Zynq-7000 AP SoC, and UltraScale
devices to the Vivado Design Suite design flows. For example, if the target is a Virtex®-6
device, the options for packaging as the Vivado IP catalog, System Generator for DSP
(Vivado Design Suite), or Synthesis Checkpoint (.dcp) are not available, because these IP
package formats are only for use in the Vivado Design Suite design flow.
In addition to the packaged output formats, the RTL files are available as standalone files
(not part of a packaged format) in the verilog and vhdl directories located within the
implementation directory <project_name>/<solution_name>/impl.
In addition to the RTL files, these directories also contain project files for the Vivado Design
Suite. Opening the file project.xpr causes the design (Verilog or VHDL) to be opened in
a Vivado project where the design can be analyzed. If C/RTL cosimulation was executed in
the Vivado HLS project, the C/RTL cosimulation files are available inside the Vivado
project.
Before exporting a design, you have the opportunity to execute logic synthesis and confirm
the accuracy of the estimates. The evaluate option shown in the following figure invokes RTL
synthesis during the export process and synthesizes the RTL design to gates.
Note: The RTL synthesis option is provided to confirm the reported estimates. In most cases, these
RTL results are not included in the packaged IP.
X-Ref Target - Figure 1-73
If no values are provided in the configuration setting the following values are used:
• Vendor: xilinx.com
• Library: hls
• Version: 1.0
• Description: An IP generated by Vivado HLS
• Display Name: This field is left blank by default
• Taxonomy: This field is left blank by default
If the Evaluate option was selected, RTL synthesis is executed and the final timing and
resources are reported but not included in the IP package. See the RTL synthesis section
above for more details on this process.
1. Inside the System Generator design, right-click and use the XilinxBlockAdd option to
instantiate a new block.
2. Scroll down the list in the dialog box and select Vivado HLS.
3. Double-click the newly instantiated Vivado HLS block to open the Block Parameters
dialog box.
4. Browse to the solution directory where the Vivado HLS block was exported. Using the
example, <project_name>/<solution_name>/impl/sysgen, browse to the
<project_name>/<solution_name> directory and select Apply.
Optimizing Ports
If any top-level function arguments are transformed during the synthesis process into a
composite port, the type information for that port cannot be determined and included in
the System Generator IP block.
The implication of this limitation is that any design that uses the reshape, mapping, or
data packing optimization on ports must have the port type information for these composite
ports manually specified in System Generator.
To manually specify the type information in System Generator, you should know how the
composite ports were created and then use slice and reinterpretation blocks inside System
Generator when connecting the Vivado HLS block to other blocks in the system.
For example:
• Three 8-bit in-out ports R, G, and B are packed into a 24-bit input port (RGB_in) and a
24-bit output port (RGB_out).
• The 24-bit input port (RGB_in) would need to be driven by a System Generator block
that correctly groups three 8-bit input signals (Rin, Gin, and Bin) into a 24-bit input bus.
• The 24-bit output bus (RGB_out) would need to be correctly split into three 8-bit
signals (Rout, Gout, and Bout).
See the System Generator documentation for details on how to use the slice and
reinterpretation blocks for connecting to composite type ports.
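The grouping and splitting described in the example above can be sketched in standard C++. This is an illustration only: the bit ordering (R in the most significant byte) and the names pack_rgb, unpack_rgb, and Rgb are assumptions, and the actual ordering depends on how the composite port was created during synthesis:

```cpp
#include <cstdint>

struct Rgb {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};

// Groups three 8-bit values into the low 24 bits of a word.
// Assumed layout: bits 23..16 = R, bits 15..8 = G, bits 7..0 = B.
uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b) {
    return (static_cast<uint32_t>(r) << 16) |
           (static_cast<uint32_t>(g) << 8) |
            static_cast<uint32_t>(b);
}

// Splits a 24-bit word back into three 8-bit values using the same layout.
Rgb unpack_rgb(uint32_t rgb) {
    Rgb out;
    out.r = static_cast<uint8_t>((rgb >> 16) & 0xFF);
    out.g = static_cast<uint8_t>((rgb >> 8) & 0xFF);
    out.b = static_cast<uint8_t>(rgb & 0xFF);
    return out;
}
```

The slice and reinterpretation blocks in System Generator would need to implement the same shifts and masks for whichever layout the design actually uses.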
Selecting OK generates the design checkpoint package. This package is written to the
<project_name>/<solution_name>/impl/ip directory. The design checkpoint files
can be used in a Vivado Design Suite project in the same manner as any other design
checkpoint.
You can use each of the C libraries in your design by including the library header file. These
header files are located in the include directory in the Vivado HLS installation area.
IMPORTANT: The header files for the Vivado HLS C libraries do not have to be in the include path if the
design is used in Vivado HLS. The paths to the library header files are automatically added.
Vivado HLS provides both integer and fixed-point arbitrary precision data types for C and
C++ and supports the arbitrary precision data types that are part of SystemC.
The advantage of arbitrary precision data types is that they allow the C code to be updated
to use variables with smaller bit-widths and then for the C simulation to be re-executed to
validate the functionality remains identical or acceptable.
Note: The header files that define the arbitrary precision types are also provided with
Vivado HLS as a standalone package with the rights to use them in your own source code. The
package, xilinx_hls_lib_<release_number>.tgz, is provided in the include directory in the
Vivado HLS installation area.
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_cint.h"
The following example shows how the header file is added and two variables implemented
to use 9-bit integer and 10-bit unsigned integer types:
#include "ap_int.h"
Vivado HLS offers arbitrary precision fixed-point data types for use with C++ and SystemC
functions as shown in the following table.
These data types manage the value of floating point numbers within the boundaries of a
specified total width and integer width, as shown in the following figure.
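Figure 2-1: [Fixed-point word from MSB to LSB with total width W = I + B, where the binary
point separates the I integer bits from the B fractional bits.]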
I  The number of bits used to represent the integer value (the number of bits above the
   decimal point)
Q  Quantization mode
   This dictates the behavior when greater precision is generated than can be defined by
   the smallest fractional bit in the variable used to store the result.
   SystemC Types     ap_fixed Types    Description
   SC_RND            AP_RND            Round to plus infinity
   SC_RND_ZERO       AP_RND_ZERO       Round to zero
   SC_RND_MIN_INF    AP_RND_MIN_INF    Round to minus infinity
   SC_RND_INF        AP_RND_INF        Round to infinity
   SC_RND_CONV       AP_RND_CONV       Convergent rounding
   SC_TRN            AP_TRN            Truncation to minus infinity (default)
   SC_TRN_ZERO       AP_TRN_ZERO       Truncation to zero
O  Overflow mode
   This dictates the behavior when the result of an operation exceeds the maximum (or
   minimum in the case of negative numbers) possible value that can be stored in the
   variable used to store the result.
   SystemC Types     ap_fixed Types    Description
   SC_SAT            AP_SAT            Saturation
   SC_SAT_ZERO       AP_SAT_ZERO       Saturation to zero
   SC_SAT_SYM        AP_SAT_SYM        Symmetrical saturation
   SC_WRAP           AP_WRAP           Wrap around (default)
   SC_WRAP_SM        AP_WRAP_SM        Sign magnitude wrap around
N  This defines the number of saturation bits in overflow wrap modes.
In this example, the Vivado HLS ap_fixed type is used to define an 18-bit variable with 6
bits representing the numbers above the decimal point and 12 bits representing the value
below the decimal point. The variable is specified as signed, the quantization mode is set
to round to plus infinity, and the default wrap-around mode is used for overflow.
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND > my_type;
...
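The effect of the quantization and overflow modes on an 18-bit, 6-integer-bit signed value can be modeled in standard C++. The following is a behavioral sketch only, not the ap_fixed implementation: it covers just two quantization modes (truncation to minus infinity and round to plus infinity) and two overflow modes (wrap and saturate), and the names Quant, Ovf, to_raw, and to_double are hypothetical:

```cpp
#include <cmath>
#include <cstdint>

constexpr int W = 18;      // total width in bits
constexpr int I = 6;       // integer bits, including the sign bit
constexpr int F = W - I;   // fractional bits

enum class Quant { Trn, Rnd };   // models AP_TRN and AP_RND
enum class Ovf { Wrap, Sat };    // models AP_WRAP and AP_SAT

// Converts a real value to the raw two's complement integer that a
// W-bit fixed-point variable with F fractional bits would hold.
int64_t to_raw(double x, Quant q, Ovf o) {
    const double scaled = std::ldexp(x, F);   // x * 2^F
    int64_t raw = static_cast<int64_t>(
        q == Quant::Rnd ? std::floor(scaled + 0.5) : std::floor(scaled));

    const int64_t max_raw = (int64_t{1} << (W - 1)) - 1;
    const int64_t min_raw = -(int64_t{1} << (W - 1));

    if (o == Ovf::Sat) {
        if (raw > max_raw) return max_raw;
        if (raw < min_raw) return min_raw;
        return raw;
    }
    // Wrap: keep the low W bits and reinterpret them as signed.
    const int64_t modulus = int64_t{1} << W;
    int64_t wrapped = ((raw % modulus) + modulus) % modulus;
    if (wrapped > max_raw) wrapped -= modulus;
    return wrapped;
}

// Converts the raw integer back to the real value it represents.
double to_double(int64_t raw) {
    return std::ldexp(static_cast<double>(raw), -F);
}
```

In this model, 1.0002 quantizes to raw 4097 with rounding and 4096 with truncation, and 40.0 (outside the representable range of -32 to just under 32) saturates to raw 131071 or wraps to -24.0.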
In this sc_fixed example a 22-bit variable is shown with 21 bits representing the numbers
above the decimal point: enabling only a minimum accuracy of 0.5. Rounding to zero is
used, such that any result less than 0.5 rounds to 0 and saturation is specified.
#define SC_INCLUDE_FX
#define SC_FX_EXCLUDE_OTHER
#include <systemc.h>
...
sc_fixed<22,21,SC_RND_ZERO,SC_SAT> my_type;
...
Vivado HLS also provides arbitrary precision data types in C++ and supports the arbitrary
precision data types that are part of SystemC. These types are discussed in the respective
C++ and SystemC coding.
If, for example, a 17-bit multiplier is required, you can use arbitrary precision types to
require exactly 17 bits in the calculation.
Arbitrary precision data types in the C code allow the C simulation to be executed using
accurate bit-widths and for the C simulation to validate the functionality (and accuracy)
of the algorithm before synthesis.
For the C language, the header file ap_cint.h defines the arbitrary precision integer data
types [u]int#W. For example:
$HLS_ROOT/include
where
The code shown in the following example is a repeat of the code shown in the Example 3-22
on basic arithmetic. In both examples, the data types in the top-level function to be
synthesized are specified as dinA_t, dinB_t, etc.
#include "apint_arith.h"
The real difference between the two examples is in how the data types are defined. To use
arbitrary precision integer data types in a C function:
° intN
or
° uintN
where
The data types are defined in the header apint_arith.h. See the following example
compared with Example 3-22:
• The input data types have been reduced to represent the maximum size of the real
input data. For example, 8-bit input inA is reduced to 6-bit input.
• The output types have been refined to be more accurate. For example, out2 (the sum
of inA and inB) needs to be only 13-bit, not 32-bit.
#include <stdio.h>
#include "ap_cint.h"
To create arbitrary precision types, attributes are added to define the bit-sizes in file
ap_cint.h. Standard C compilers such as gcc compile the attributes used in the header
file, but they do not know what the attributes mean. This results in computations that do
not reflect the bit-accurate behavior of the code. For example, a 3-bit integer value with
binary representation 100 is treated by gcc (or any other third-party C compiler) as having
a decimal value 4 and not -4.
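The difference between the two interpretations can be shown with a small sign-extension helper in standard C++. This is an illustrative sketch; the name as_signed is hypothetical and the helper assumes a width between 1 and 31 bits:

```cpp
#include <cstdint>

// Interprets the low `width` bits of `bits` as a two's complement
// signed value. Valid for widths from 1 to 31.
int32_t as_signed(uint32_t bits, int width) {
    const uint32_t mask = (1u << width) - 1u;
    const uint32_t value = bits & mask;
    const uint32_t sign = 1u << (width - 1);
    // XOR with the sign bit then subtract it: this sign-extends the value.
    return static_cast<int32_t>(value ^ sign) - static_cast<int32_t>(sign);
}
```

For the 3-bit pattern 100, as_signed(0x4, 3) yields -4, which is the bit-accurate result, whereas a compiler that ignores the width attribute sees the plain integer 4.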
Note: This issue is only present when using C arbitrary precision types. There are no such issues with
C++ or SystemC arbitrary precision types.
Vivado HLS solves this issue by automatically using its own built-in C compiler apcc, when
it recognizes arbitrary precision C types are being used. This compiler is gcc compatible
but correctly interprets arbitrary precision types and arithmetic. You can invoke the apcc
compiler at the command prompt by replacing “gcc” by “apcc”.
When arbitrary precision types are used in C, the design can no longer be analyzed using
the Vivado HLS C debugger. If it is necessary to debug the design, Xilinx recommends one
of the following methodologies:
• Use the printf or fprintf functions to output the data values for analysis.
• Replace the arbitrary precision types with native C types (int, char, short, etc). This
approach helps debug the operation of the algorithm itself but does not help when you
must analyze the bit-accurate results of the algorithm.
• Change the C function to C++ and use C++ arbitrary precision types for which there
are no debugger limitations.
Integer Promotion
Take care when the result of arbitrary precision operations crosses the native 8, 16, 32 and
64-bit boundaries. In the following example, the intent is that two 18-bit values are
multiplied and the result stored in a 36-bit number:
#include "ap_cint.h"
int18 a,b;
int36 tmp;
tmp = a * b;
Integer promotion occurs when using this method. The result might not be as expected.
This results in the behavior and incorrect result shown in the following figure.
Figure 2-2: [Integer promotion example showing, in hex, the values of a and b, the
multiplication result, the result after promotion, and the value stored in tmp.]
To overcome the integer promotion issue, cast operator inputs to the output size. The
following example shows where the inputs to the multiplier are cast to a 36-bit value before
the multiplication. This results in the correct (expected) results during C simulation and
the expected 36-bit multiplication in the RTL.
#include "ap_cint.h"
Casting to avoid the integer promotion issue is required only when the result of an
operation is greater than the next native boundary (8, 16, 32, or 64). This behavior is more
typical with multipliers than with addition and subtraction operations.
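The same class of problem can be reproduced with native types in standard C++. The following sketch is an analogy, not the ap_cint.h behavior itself: it uses unsigned 32-bit operands holding 18-bit values so that wraparound is well defined, it assumes a platform where int is 32 bits wide, and the function names are illustrative:

```cpp
#include <cstdint>

// The product is computed at the 32-bit operand width and only then
// widened, so the upper bits of a 36-bit result are already lost.
uint64_t mul_at_operand_width(uint32_t a, uint32_t b) {
    return a * b;
}

// Casting the operands to the result width first keeps all 36 bits.
uint64_t mul_at_result_width(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * static_cast<uint64_t>(b);
}
```

For a = b = 0x3FFFF (the largest unsigned 18-bit value), the first function returns 0xFFF80001 and the second returns the full 36-bit product 0xFFFF80001.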
There are no integer promotion issues when using C++ or SystemC arbitrary precision
types.
Vivado HLS provides arbitrary precision data types for C++ to allow variables and
operations in the C++ code to be specified with any arbitrary bit-widths: 6-bit, 17-bit,
234-bit, up to 1024 bits.
TIP: The default maximum width allowed is 1024 bits. You can override this default by defining the
macro AP_INT_MAX_W with a positive integer value less than or equal to 32768 before inclusion of the
ap_int.h header file.
C++ supports use of the arbitrary precision types defined in the SystemC standard. Include
the SystemC header file systemc.h, and use SystemC data types. For more information on
SystemC types, see SystemC Synthesis in Chapter 3.
Arbitrary precision data types have two primary advantages over the native C++ types:
• Accurate C++ simulation/analysis: Arbitrary precision data types in the C++ code
allow the C++ simulation to be performed using accurate bit-widths and for the C++
simulation to validate the functionality (and accuracy) of the algorithm before
synthesis.
The arbitrary precision types in C++ have none of the disadvantages of those in C:
• C++ arbitrary types can be compiled with standard C++ compilers (there is no C++
equivalent of apcc, as discussed in Validating Arbitrary Precision Types in C).
• C++ arbitrary precision types do not suffer from Integer Promotion Issues.
It is not uncommon for users to change a file extension from .c to .cpp so the file can be
compiled as C++, where neither of these issues are present.
For the C++ language, the header file ap_int.h defines the arbitrary precision integer
data types ap_(u)int<W>. For example, ap_int<8> represents an 8-bit signed integer
data type and ap_uint<234> represents a 234-bit unsigned integer type.
The ap_int.h file is located in the directory $HLS_ROOT/include, where $HLS_ROOT is the
Vivado HLS installation directory.
The code shown in the following example, is a repeat of the code shown in the earlier
example on basic arithmetic (Example 3-22 and again in Example 2-1). In this example the
data types in the top-level function to be synthesized are specified as dinA_t, dinB_t ...
#include "cpp_ap_int_arith.h"
In this latest update to this example, the C++ arbitrary precision types are used:
The data types are defined in the header cpp_ap_int_arith.h as shown in Example 2-2.
Compared with Example 3-22, the input data types have simply been reduced to represent
the maximum size of the real input data (for example, 8-bit input inA is reduced to 6-bit
input). The output types have been refined to be more accurate, for example, out2, the
sum of inA and inB, need only be 13-bit and not 32-bit.
#ifndef _CPP_AP_INT_ARITH_H_
#define _CPP_AP_INT_ARITH_H_
#include <stdio.h>
#include "ap_int.h"
#define N 9
#endif
[Figure: ap_[u]fixed<W,I,Q,O,N> word layout, with the binary point dividing the W = I + B
bits.]
TIP: Arbitrary precision fixed-point types use more memory during C simulation. If using very large
arrays of ap_[u]fixed types, refer to the discussion of C simulation in Arrays in Chapter 3.
These attributes are summarized by examining the code in Example 2-6. First, the header
file ap_fixed.h is included. The ap_fixed types are then defined using the typedef
statement:
The function contains no code to manage the alignment of the decimal point after
operations are performed. The alignment is done automatically.
#include "ap_fixed.h"
The following table shows the quantization and overflow modes. For detailed information,
see C++ Arbitrary Precision Fixed-Point Types in Chapter 4.
TIP: Quantization and overflow modes that do more than the default behavior of standard hardware
arithmetic (wrap and truncate) result in operators with more associated hardware. It costs logic (LUTs)
to implement the more advanced modes, such as round to minus infinity or saturate symmetrically.
I  The number of bits used to represent the integer value (the number of bits above the
   decimal point)
Q  Quantization mode dictates the behavior when greater precision is generated than can
   be defined by the smallest fractional bit in the variable used to store the result.
   Mode      Description
   AP_RND    Rounding to plus infinity
O  Overflow mode dictates the behavior when more bits are generated than the variable to
   store the result contains.
   Mode      Description
   AP_SAT    Saturation
Using ap_(u)fixed types, the C++ simulation is bit accurate. Fast simulation can validate
the algorithm and its accuracy. After synthesis, the RTL exhibits the identical bit-accurate
behavior.
Arbitrary precision fixed-point types can be freely assigned literal values in the code. See
the test bench (Example 2-7) used with Example 2-6, in which the values of in1 and
in2 are declared and assigned constant values.
When assigning literal values involving operators, the literal values must first be cast to
ap_(u)fixed types. Otherwise, the C compiler and Vivado HLS interpret the literal as an
integer or float/double type and may fail to find a suitable operator. As shown in the
following example, in the assignment of in1 = in1 + din1_t(0.25), the literal 0.25 is
cast to an ap_fixed type.
#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;
#include "ap_fixed.h"
result.open("result.dat");
// Persistent manipulators
result << right << fixed << setbase(10) << setprecision(15);
Modeling designs that use streaming data can be difficult in C. As discussed in Multi-Access
Pointer Interfaces: Streaming Data in Chapter 3, the approach of using pointers to perform
multiple read and/or write accesses can introduce issues, because there are implications for
the type qualifier and how the test bench is constructed.
Vivado HLS provides a C++ template class hls::stream<> for modeling streaming data
structures. The streams implemented with the hls::stream<> class have the following
attributes.
This section shows how the hls::stream<> class can more easily model designs with
streaming data. The topics in this section provide:
Streams can be used only in C++ based designs. Each hls::stream<> object must be
written by a single process and read by a single process. For example, in a DATAFLOW
design each stream can have only one producer and one consumer process.
In the RTL, streams are implemented as a FIFO interface but can optionally be implemented
using a full handshake interface port (ap_hs). The default interface port is an ap_fifo
port. The depth of the FIFO can optionally be set using the STREAM directive.
• Globally-defined streams that are only read from, or only written to, are inferred as
external ports of the top-level RTL block.
• Globally-defined streams that are both read from and written to (in the hierarchy below
the top-level function) are implemented as internal FIFOs.
Streams defined in the global scope follow the same rules as any other global variables. For
more information on the synthesis of global variables, see Data Types and Bit-Widths in
Chapter 1.
#include "ap_int.h"
#include "hls_stream.h"
Streams must use scoped naming. Xilinx recommends using the scoped hls:: naming
shown in the example above. However, if you want to use the hls namespace, you can
rewrite the preceding example as:
#include <ap_int.h>
#include <hls_stream.h>
using namespace hls;
Streams can optionally be named. Providing a name for the stream allows the name to be
used in reporting. For example, Vivado HLS automatically checks to ensure all elements
from an input stream are read during simulation. Given the following two streams:
stream<uint8_t> bytestr_in1;
stream<uint8_t> bytestr_in2("input_stream2");
Any warnings on elements left in the streams are reported as follows, where it is clear
which message relates to bytestr_in2:
When streams are passed into and out of functions, they must be passed-by-reference as in
the following example:
void stream_function (
hls::stream<uint8_t> &strm_out,
hls::stream<uint8_t> &strm_in,
uint16_t strm_len
)
A complete design example using streams is provided in the Vivado HLS examples. Refer to
the hls_stream example in the design examples available from the GUI welcome screen.
In this example, the value of variable src_var is pushed into the stream.
hls::stream<int> my_stream;
int src_var = 42;
my_stream.write(src_var);
The << operator is overloaded such that it may be used in a similar fashion to the stream
insertion operators for C++ stream (for example, iostreams and filestreams). The
hls::stream<> object to be written to is supplied as the left-hand side argument and the
value to be written as the right-hand side.
hls::stream<int> my_stream;
int src_var = 42;
my_stream << src_var;
This method reads from the head of the stream and assigns the values to the variable
dst_var.
hls::stream<int> my_stream;
int dst_var;
my_stream.read(dst_var);
Alternatively, the next object in the stream can be read by assigning (using for example =,
+=) the stream to an object on the left-hand side:
// Usage of T read(void)
hls::stream<int> my_stream;
int dst_var = my_stream.read();
The >> operator is overloaded to allow use similar to the stream extraction operator for
C++ streams (for example, iostreams and filestreams). The hls::stream<> object is supplied
as the left-hand side argument and the destination variable as the right-hand side.
hls::stream<int> my_stream;
int dst_var;
my_stream >> dst_var;
These methods return a Boolean value indicating the status of the access (true if
successful, false otherwise). Additional methods are included for testing the status of an
hls::stream<> stream.
TIP: None of the methods discussed for non-blocking accesses can be used on an hls::stream<>
interface for which the ap_hs protocol has been selected.
During C simulation, streams have an infinite size. It is therefore not possible to validate
with C simulation if the stream is full. These methods can be verified only during RTL
simulation when the FIFO sizes are defined (either the default size of 1, or an arbitrary size
defined with the STREAM directive).
Non-Blocking Writes
This method attempts to push variable src_var into the stream my_stream, returning a
boolean true if successful. Otherwise, false is returned and the queue is unaffected.
hls::stream<int> my_stream;
int src_var = 42;
if (my_stream.write_nb(src_var)) {
// Perform standard operations
...
} else {
// Write did not occur
return;
}
Fullness Test
bool full(void)
hls::stream<int> my_stream;
int src_var = 42;
bool stream_full;
stream_full = my_stream.full();
Non-Blocking Read
bool read_nb(T & rdata)
This method attempts to read a value from the stream, returning true if successful.
Otherwise, false is returned and the queue is unaffected.
hls::stream<int> my_stream;
int dst_var;
if (my_stream.read_nb(dst_var)) {
// Perform standard operations
...
} else {
// Read did not occur
return;
}
Emptiness Test
bool empty(void)
hls::stream<int> my_stream;
int dst_var;
bool stream_empty;
stream_empty = my_stream.empty();
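The non-blocking semantics described above can be explored on a host compiler with a small software model. The following is a minimal sketch, assuming a FIFO with a fixed depth; the class name BoundedFifo is hypothetical and is not part of hls_stream.h. It only mirrors the behavior described in this section (a failed access returns false and leaves the queue unaffected).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical software model of a bounded FIFO. This is NOT hls::stream<>;
// it only mimics the non-blocking behavior described in the text.
template <typename T>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    // Status tests, comparable to the fullness and emptiness tests.
    bool full() const { return q_.size() >= depth_; }
    bool empty() const { return q_.empty(); }

    // Non-blocking write: returns false and leaves the queue unaffected when full.
    bool write_nb(const T& wdata) {
        if (full()) return false;
        q_.push_back(wdata);
        return true;
    }

    // Non-blocking read: returns false and leaves rdata unaffected when empty.
    bool read_nb(T& rdata) {
        if (empty()) return false;
        rdata = q_.front();
        q_.pop_front();
        return true;
    }

private:
    std::size_t depth_;
    std::deque<T> q_;
};
```

With a depth of 1 (the default FIFO size mentioned above), a second write_nb() call without an intervening read returns false. This is the condition that cannot be observed in C simulation, where streams have an infinite size.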
The following example shows how a combination of non-blocking accesses and full/empty
tests can provide error handling functionality when the RTL FIFOs are full or empty:
#include "hls_stream.h"
using namespace hls;
typedef struct {
short data;
bool valid;
bool invert;
} input_interface;
For multirate designs in which the implementation requires a FIFO with a depth greater than
1, you must determine (and set using the STREAM directive) the depth necessary for the RTL
simulation to complete. If the FIFO depth is insufficient, RTL co-simulation stalls.
Because stream objects cannot be viewed in the GUI directives pane, the STREAM directive
cannot be applied directly in that pane.
Right-click the function in which an hls::stream<> object is declared (or is used, or exists
in the argument list) to:
typedef struct {
hls::stream<uint8_t> a;
hls::stream<uint16_t> b;
} strm_strct_t;
These restrictions apply to both top-level function arguments and globally declared
objects. If structs of streams are used for synthesis, the design must be verified using an
external RTL simulator and user-created HDL test bench. There are no such restrictions on
hls::stream<> objects with strictly internal linkage.
Not every function supported by the standard C math libraries is provided in the HLS Math
Library. Only the math functions shown in the following table are supported for synthesis.
This is related to the accuracy of the implemented functions, as listed in the preceding
table.
In some cases, the bit-approximate HLS math library function does not provide the same
accuracy as the standard C function. To achieve the desired result, a bit-approximate
implementation may use a different underlying algorithm than the C or C++ version. The
accuracy of the function is specified in terms of ULP (Unit of Least Precision). This difference
in accuracy has implications, discussed later, for both C simulation and C/RTL co-simulation.
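To make the ULP measure concrete, the following sketch counts how many representable single-precision values lie between two results. It assumes IEEE-754 32-bit float values and does not handle NaN inputs; the function names are illustrative and are not part of the HLS math library.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Map the bit pattern of a float onto a signed integer line so that
// adjacent representable values differ by exactly 1 (assumes IEEE-754).
static std::int64_t ordered_bits(float x) {
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    // Negative floats are stored as sign-magnitude; mirror them below zero.
    return (i < 0) ? static_cast<std::int64_t>(INT32_MIN) - i : i;
}

// Distance between two floats measured in ULPs. NaN inputs are not supported.
std::int64_t ulp_distance(float a, float b) {
    assert(a == a && b == b);  // reject NaN
    std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

For example, ulp_distance(1.0f, 1.0000001f) is 1, because the literal rounds to the next representable value after 1.0f. A test bench could use such a function to check that a result lies within the ULP accuracy listed for a given function.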
In addition, the following seven functions might show some differences, depending on the
C standard used to compile and run the C simulation:
• copysign
• fpclassify
• isinf
• isfinite
• isnan
• isnormal
• signbit
C90 mode
Only isinf, isnan, and copysign are usually provided by the system header files, and
they operate on doubles. In particular, copysign always returns a double result. This might
cause unexpected results after synthesis if it must be returned to a float, because a
double-to-float conversion block is introduced into the hardware.
All seven functions are usually provided under the expectation that the system header files
will redirect them to __isnan(double) and __isnan(float). The usual GCC header
files do not redirect isnormal, but implement it in terms of fpclassify.
All seven are provided by the system header files, and they operate on doubles.
copysign always returns a double result. This might cause unexpected results after
synthesis if it must be returned to a float, because a double-to-float conversion block is
introduced into the hardware.
copysign and copysignf are handled as built-ins even when using namespace std;.
• -std=c99 for C
• -fno-builtin for C and C++
Note: To specify the C compile options, such as -std=c99, use the Tcl command add_files with
the -cflags option. Alternatively, use the Edit CFLAGs button in the Project Settings dialog box.
If the hls_math.h library is used in the C source code, the C simulation and C/RTL
co-simulation results are identical. However, the results of C simulation using hls_math.h
are not the same as those using the standard C libraries. The hls_math.h library simply
ensures the C simulation matches the C/RTL co-simulation results.
The following explains each of the possible options which are used to perform verification
when using math functions.
#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;
In this case, the results of C simulation and C/RTL co-simulation are different. Keep in
mind that, when comparing the outputs of simulation, any results written from the test bench
are written to the working directory where the simulation executes:
where <project> is the project folder, <solution> is the name of the solution folder and
<RTL> is the type of RTL verified (verilog or vhdl). The following figure shows a typical
comparison of the pre-synthesis results file on the left-hand side and the post-synthesis RTL
results file on the right-hand side. The output is shown in the third column.
Figure 2-4
The recommended flow for handling these differences is using a test bench that checks the
results to ensure that they lie within an acceptable error range. This can be accomplished by
creating two versions of the same function, one for synthesis and one as a reference
version. In this example, only function cpp_math is synthesized.
#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;
The test bench to verify the design compares the outputs of both functions to determine
the difference, using variable diff in the following example. During C simulation both
functions produce identical outputs. During C/RTL co-simulation function cpp_math
produces different results and the difference in results is checked.
int main() {
data_t angle = 0.01;
data_t output, exp_output, diff;
int retval=0;
if (retval != 0) {
printf("Test failed !!!\n");
retval=1;
} else {
printf("Test passed !\n");
}
// Return 0 if the test passes
return retval;
}
If the margin of difference is lowered to 0.00000005, this test bench highlights the margin
of error during C/RTL co-simulation:
When using the standard C math libraries (math.h and cmath), create a “smart” test
bench to verify that any differences in accuracy are acceptable.
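The checking step of such a test bench can be sketched as follows. This is a host-only illustration: dut_sin and ref_sin are hypothetical stand-ins for the synthesized function and its reference version, and the single-precision and double-precision standard library calls are used only to produce two slightly different results to compare.

```cpp
#include <cassert>
#include <cmath>

// True when the two values differ by no more than the allowed margin.
bool within_margin(double actual, double expected, double margin) {
    return std::fabs(actual - expected) <= margin;
}

// Hypothetical stand-in for the function to be synthesized (single precision).
float dut_sin(float angle) { return std::sin(angle); }

// Hypothetical reference version (double precision).
double ref_sin(float angle) { return std::sin(static_cast<double>(angle)); }

// Sweep the angles 0.00, 0.01, 0.02, ... and count out-of-margin results.
int count_failures(int steps, double margin) {
    assert(steps >= 0);
    int failures = 0;
    for (int i = 0; i < steps; ++i) {
        float angle = 0.01f * static_cast<float>(i);
        if (!within_margin(dut_sin(angle), ref_sin(angle), margin)) {
            ++failures;
        }
    }
    return failures;
}
```

A test bench main() can then return 0 when count_failures() reports no failures and a non-zero value otherwise, so that the pass or fail decision is based on an acceptable error range rather than on an exact match.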
There is a difference between the C simulation results using the HLS math library and those
previously obtained using the standard C math libraries. These differences should be
validated with C simulation using a “smart” test bench similar to option 1.
In cases where there are many math functions and updating the code is painful, a third
option can be used.
The HLS math library file is located in the src directory in the Vivado HLS installation area.
Simply copy the file to your local folder and add the file as a standard design file.
There is a difference between the C simulation results using the HLS math library file and
those previously obtained without adding this file. These differences should be validated
with C simulation using a “smart” test bench similar to option 1.
The fixed-point type functions are intended as replacements for functions using float type
variables and are therefore fixed to 32-bit input and return. The number of integer bits can
be any value up to 32.
The HLS math library provides fixed-point implementations for some of the most common
math functions. The methodology for using these functions is:
When using fixed-point math functions, the input type must include the decimal point (W
>= I, I >= 0 if unsigned, I >= 1 if signed). The result type has the same width and integer
bits as the input (although some of the leading bits are likely to be zero).
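The relationship between the total width W and the integer bits I can be illustrated with a short host-side calculation. This sketch assumes a two's complement fixed-point format in which, for signed types, I includes the sign bit; the helper names are illustrative and are not library functions.

```cpp
#include <cassert>
#include <cmath>

// Weight of the least significant bit for W total bits and I integer bits.
double fixed_step(int W, int I) {
    assert(W >= I);                 // the type must include the decimal point
    return std::ldexp(1.0, I - W);  // 2^(I - W)
}

// Largest representable value of a signed format (I includes the sign bit).
double fixed_max_signed(int W, int I) {
    assert(I >= 1);
    return std::ldexp(1.0, I - 1) - fixed_step(W, I);
}

// Smallest (most negative) representable value of a signed format.
double fixed_min_signed(int W, int I) {
    assert(I >= 1);
    return -std::ldexp(1.0, I - 1);
}
```

For example, a signed format with W = 8 and I = 4 has a step of 0.0625 and covers the range -8.0 to 7.9375. The same calculation shows how the resolution of a 32-bit type changes as more bits are assigned to the integer part.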
C++ cmath
If the C++ cmath header file is used, the floating point functions (for example, sinf
and cosf) can be used. These result in 32-bit operations in hardware. The cmath header
file also overloads the standard functions (for example, sin and cos) so they can be used
for float and double types.
C math.h
If the C math.h library is used, the floating point functions (for example, sinf and cosf)
are required to synthesize 32-bit floating point operations. All standard function calls (for
example, sin and cos) result in doubles and 64-bit double-precision operations being
synthesized.
Cautions
When converting C functions to C++ to take advantage of math.h support, be sure that the
new C++ code compiles correctly before synthesizing with Vivado HLS. For example, if
sqrtf() is used in the code with math.h, the following extern declaration must be added to
the C++ code to support it:
#include <math.h>
extern "C" float sqrtf(float);
To avoid unnecessary hardware caused by type conversion, follow the warnings on mixing
double and float types discussed in Floats and Doubles in Chapter 3.
• Video Functions
• Data Types
• Memory Line Buffer
• Memory Window
When using the Vivado HLS video library, the only additional usage requirement is as
follows.
#include <hls_video.h>
hls::rgb_8 video_data[1920][1080];
Alternatively, you can import the hls namespace, as shown in the following example:
#include <hls_video.h>
using namespace hls;
rgb_8 video_data[1920][1080];
When using any Xilinx Video IP in your system, refer to the IP data sheet and determine the
format used to send or receive the video data. Use the appropriate video data type in the C
code and the RTL created by synthesis may be connected to the Xilinx Video IP.
The library includes the following data types. All data types support 8-bit data only.
After the hls_video.h library is included, the data types can be freely used in the source
code.
#include "hls_video.h"
hls::rgb_8 video_data[1920][1080];
• shift_pixels_up()
• shift_pixels_down()
• insert_bottom_row()
• insert_top_row()
• getval(row,column)
To illustrate the usage of the LineBuffer class, the following data set is assumed at the start
of all examples.
A line buffer can be instantiated in an algorithm by using the LineBuffer data type,
shown in this example specifying a LineBuffer variable for the data in the table above:
The LineBuffer class assumes the data entering the block instantiating the line buffer is
arranged in raster scan order. Each new data item is therefore stored in a different column
than the previous data item.
Inserting new values, while preserving a finite number of previous values in a column,
requires a vertical shift between rows for a given column. After the shift is complete, a new
data value can be inserted at either the top or the bottom of the column.
For example, to insert the value 100 to the top of column 2 of the line buffer set:
Buff_A.shift_pixels_down(2);
Buff_A.insert_top_row(100,2);
This results in the new data set shown in the following table.
Table 2-8: Data Set After Shift Down and Insert Top Classes Used
Line Column 0 Column 1 Column 2 Column 3 Column 4
Row 0 1 2 100 4 5
Row 1 6 7 3 9 10
Row 2 11 12 8 14 15
To insert the value 100 at the bottom of column 2 of the line buffer set in Table 2-7, use
the following:
Buff_A.shift_pixels_up(2);
Buff_A.insert_bottom_row(100,2);
This results in the new data set shown in the following table.
Table 2-9: Data Set After Shift Up and Insert Bottom Classes Used
Line Column 0 Column 1 Column 2 Column 3 Column 4
Row 0 1 2 8 4 5
Row 1 6 7 13 9 10
Row 2 11 12 100 14 15
The shift and insert methods both require the column value on which to operate.
All values stored by a LineBuffer instance are available using the getval(row,column)
method, which returns the value of any location inside the line buffer. For example, the
following results in variable Value being assigned the value 9:
Value = Buff_A.getval(1,3);
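The shift and insert behavior shown in Tables 2-8 and 2-9 can be reproduced with a small host-side model. The following is a sketch only: LineBufferModel and fill_raster are hypothetical names, the class is not the LineBuffer class from hls_video.h, and the starting data (1 to 15 in raster-scan order) is inferred from the tables above.

```cpp
#include <cassert>

// Hypothetical model of the behavior shown in Tables 2-8 and 2-9.
// This is NOT the LineBuffer class provided by hls_video.h.
template <int ROWS, int COLS, typename T>
class LineBufferModel {
public:
    T val[ROWS][COLS];

    // Move every value in the column one row towards the bottom (row + 1).
    void shift_pixels_down(int col) {
        for (int r = ROWS - 1; r > 0; --r) val[r][col] = val[r - 1][col];
    }
    // Move every value in the column one row towards the top (row - 1).
    void shift_pixels_up(int col) {
        for (int r = 0; r < ROWS - 1; ++r) val[r][col] = val[r + 1][col];
    }
    // Insert a new value at the top or the bottom of the column.
    void insert_top_row(T value, int col) { val[0][col] = value; }
    void insert_bottom_row(T value, int col) { val[ROWS - 1][col] = value; }

    T getval(int row, int col) const {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        return val[row][col];
    }
};

// Fill the model with 1, 2, 3, ... in raster-scan order (assumed start data).
template <int ROWS, int COLS, typename T>
void fill_raster(LineBufferModel<ROWS, COLS, T>& buf) {
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            buf.val[r][c] = static_cast<T>(r * COLS + c + 1);
}
```

With a 3 x 5 model filled by fill_raster(), calling shift_pixels_down(2) followed by insert_top_row(100,2) leaves column 2 holding 100, 3, 8 from top to bottom, which matches Table 2-8.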
The memory window class is supported by the following methods, explained below:
• shift_pixels_up()
• shift_pixels_down()
• shift_pixels_left()
• shift_pixels_right()
• insert_pixel(value,row,column)
• insert_row()
• insert_bottom_row()
• insert_top_row()
• insert_col()
• insert_left_col()
• insert_right_col()
• getval(row, column)
You can instantiate a memory window in an algorithm by specifying a Window variable for
the following data type:
The memory window class examples in this section use the data set in the following table.
The Window class provides methods for moving data stored within the memory window up,
down, left, and right. Each shift operation clears space in the memory window for new data.
The Window class allows you to insert and retrieve data from any location within the
memory window. It also supports block insertion of data on the boundaries of the memory
window.
To insert data into any location of the memory window, use the following:
insert_pixel(value,row,column);
For example, you can place the value 100 into row 1, column 1 of the memory window
using:
Buff_B.insert_pixel(100,1,1);
Table 2-15: Memory Window Data Set After Insertion Operation at Location 1,1
Column 0 Column 1 Column 2 Row
1 2 3 Row 0
6 100 8 Row 1
11 12 13 Row 2
Block level insertion requires that you provide an array of data elements to insert on a
boundary. The methods provided by the window class are:
• insert_row()
• insert_bottom_row()
• insert_top_row()
• insert_col()
• insert_left_col()
• insert_right_col()
Note: insert_row() and insert_col() are not currently documented.
For example, when C is an array of three elements in which each element has the value of 50,
you can insert the value 50 across the bottom boundary of the memory window using the
following operation:
Table 2-16: Memory Window Data Set After Insert Bottom Operation Using an Array
Column 0 Column 1 Column 2 Row
1 2 3 Row 0
6 7 8 Row 1
50 50 50 Row 2
The other edge insertion methods for the window class work in the same way as the
insert_bottom_row() method.
getval(row,column)
For example, with the data set shown in Table 2-16:
A = Buff_B.getval(2,1);
results in:
A = 50
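The insertion and retrieval operations on a memory window can be sketched with a host-side model. WindowModel is a hypothetical name and is not the Window class from hls_video.h. The model assumes row 0 is the top row, as labeled in Tables 2-15 and 2-16, and covers only the operations whose results are shown in those tables.

```cpp
#include <cassert>

// Hypothetical model of the behavior shown in Tables 2-15 and 2-16.
// This is NOT the Window class provided by hls_video.h.
template <int ROWS, int COLS, typename T>
class WindowModel {
public:
    T val[ROWS][COLS];

    // Insert a single value at any location in the window.
    void insert_pixel(T value, int row, int col) {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        val[row][col] = value;
    }

    // Block insertion: overwrite the bottom boundary with COLS new values.
    void insert_bottom_row(const T (&values)[COLS]) {
        for (int c = 0; c < COLS; ++c) val[ROWS - 1][c] = values[c];
    }

    // Retrieve the value stored at any location in the window.
    T getval(int row, int col) const {
        assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
        return val[row][col];
    }
};
```

Because the model is a plain aggregate, a 3 x 3 instance can be initialized directly, for example WindowModel<3, 3, int> w = {{{1, 2, 3}, {6, 7, 8}, {11, 12, 13}}};. Passing an array of three elements set to 50 to insert_bottom_row() then replaces the values 11, 12, 13 in the last row.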
Video Functions
The video processing functions included in the HLS Video library are compatible with
existing OpenCV functions and are similarly named. They do not directly replace existing
OpenCV video library functions. The video processing functions use a data type hls::Mat.
This data type allows the functions to be synthesized and implemented as high
performance hardware.
• OpenCV Interface Functions: Convert data to and from the AXI4 streaming data type
and the standard OpenCV data types. These functions allow any OpenCV functions
executed in software to transfer data, via the AXI4 streaming functions, to and from the
hardware block created by HLS.
• AXI4-Stream Functions: These functions are used to convert the video data specified in
hls::Mat data types into an AXI4 streaming data type. This AXI4 streaming data type
is used as an argument to the function to be synthesized, ensuring a high-performance
interface is synthesized.
• Video Processing Functions: Compatible with standard OpenCV functions for
manipulating and processing video images. These functions use the hls::Mat data
type and are synthesized by Vivado HLS.
Because the AXI4 streaming protocol is commonly used as the interface between the code
that remains on the CPU and the functions to be synthesized, the OpenCV interface
functions are provided to enable the data transfer between the OpenCV code running on
the CPU and the synthesized hardware function running on FPGA fabric.
Using the interface functions to transform the data before passing it to the function to be
synthesized ensures a high-performance system. In addition to transforming the data, the
functions also include the means of converting OpenCV data formats to and from the
Vivado HLS Video Library data types, for example hls::Mat.
To use the OpenCV interface functions, you must include the header file hls_opencv.h.
These functions are used in the code that remains on the CPU.
AXI4-Stream Functions
The AXI4-Stream functions are used to transfer data into and out of the function to be
synthesized. The video functions to be synthesized use the hls::Mat data type for an
image.
The AXI4-Stream I/O functions discussed below allow you to convert the hls::Mat data
type to or from the AXI4-Stream data type (hls::stream) used in the OpenCV Interface
functions.
This process ensures the test bench operates using the standard OpenCV functions used in
many software applications. The test bench may be executed on a CPU with the following:
#include "hls_video.h"
char tempbuf[2000];
sprintf(tempbuf, "diff --brief -w %s %s", OUTPUT_IMAGE, OUTPUT_IMAGE_GOLDEN);
int ret = system(tempbuf);
if (ret != 0) {
printf("Test Failed!\n");
ret = 1;
} else {
printf("Test Passed!\n");
}
return ret;
Using all three types of functions allows you to implement video functions on an FPGA and
maintain a seamless transfer of data between the video functions optimized for synthesis
and the OpenCV functions and data which remain in the test bench (executing on the CPU).
The following table summarizes the functions provided in the HLS Video Library.
As shown in the example above, the video functions are not direct replacements for
OpenCV functions. They use input and output arrays to process the data and typically use
template parameters.
A complete description of all functions in the HLS video library is provided in Chapter 4,
High-Level Synthesis Reference Guide.
The exact performance metrics of the video functions depend on the clock rate and the
target device specifications. Refer to the synthesis report for complete details on the final
performance achieved after synthesis.
The previous example is repeated below to highlight the only optimizations required to
achieve a complete high-performance design.
• Because the functions are already pipelined, adding the DATAFLOW optimization
ensures the pipelined functions will execute in parallel.
• In this example, the data type is an hls::stream which is automatically implemented
as a FIFO of depth 1: there is no requirement to use the config_dataflow
configuration to control the size of the dataflow memory channels.
• Implementing the input and output ports with an AXI4-Stream interface (axis) ensures a
high-performance streaming interface.
• Optionally, implementing the block-level protocol with an AXI4-Lite slave interface
would allow the synthesized block to be controlled from a CPU.
#include "hls_video.h"
typedef hls::stream<ap_axiu<32,1,1,1> > AXI_STREAM;
typedef hls::Scalar<3, unsigned char> RGB_PIXEL;
typedef hls::Mat<MAX_HEIGHT, MAX_WIDTH, HLS_8UC3> RGB_IMAGE;
HLS IP Libraries
Vivado HLS provides C libraries to implement a number of Xilinx IP blocks. The C libraries
allow the following Xilinx IP blocks to be directly inferred from the C source code, ensuring
a high-quality implementation in the FPGA.
FFT IP Library
The Xilinx FFT IP block can be called within a C++ design using the library hls_fft.h. This
section explains how the FFT can be configured in your C++ code.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP Fast Fourier Transform
Product Guide (PG109) [Ref 5] for information on how to implement and use the features of the IP.
The following code examples provide a summary of how each of these steps is performed.
Each step is discussed in more detail below.
First, include the FFT library in the source code. This header file resides in the include
directory in the Vivado HLS installation area which is automatically searched when Vivado
HLS executes.
#include "hls_fft.h"
Define the static parameters of the FFT. This includes such things as input width, number of
channels, and type of architecture, which do not change dynamically. The FFT library includes a
parameterization struct hls::ip_fft::params_t, which can be used to initialize all
static parameters with default values.
In this example, the default values for output ordering and the widths of the configuration
and status ports are over-ridden using a user-defined struct param1 based on the
pre-defined struct.
Define types and variables for both the run time configuration and run time status. These
values can be dynamic and are therefore defined as variables in the C code which can
change and are accessed through APIs.
Next, set the run time configuration. This example sets the direction of the FFT (Forward or
Inverse) based on the value of variable “direction” and also sets the value of the scaling
schedule.
fft_config1.setDir(direction);
fft_config1.setSch(0x2AB);
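The scaling schedule value can be unpacked on the host to see the total scaling it applies. The following sketch assumes the encoding described in the LogiCORE IP Fast Fourier Transform Product Guide (PG109) for scaled operation, in which each stage is given a 2-bit shift value and the first stage occupies the two least significant bits. The function name and the number of stages passed to it are illustrative assumptions; refer to PG109 for the encoding that applies to a given architecture.

```cpp
#include <cassert>
#include <cstdint>

// Sum of the per-stage right shifts encoded in a scaling schedule.
// Assumption: 2 bits per stage, with stage 0 in the two LSBs.
int total_scaling_shift(std::uint32_t schedule, int stages) {
    assert(stages >= 0 && stages <= 16);  // 16 stages fill a 32-bit word
    int total = 0;
    for (int s = 0; s < stages; ++s) {
        total += static_cast<int>((schedule >> (2 * s)) & 0x3u);
    }
    return total;
}
```

Under these assumptions, the value 0x2AB with five stages decodes to shifts of 3, 2, 2, 2, 2 (starting from the first stage), for a total of 11 bits.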
Call the FFT function using the HLS namespace with the defined static configuration
(param1 in this example). The function parameters are, in order, input data, output data,
output status and input configuration.
Finally, check the output status. This example checks the overflow flag and stores the results
in variable “ovflo”.
*ovflo = fft_status1->getOvflo();
Design examples using the FFT C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FFT.
The hls_fft.h header file defines a struct hls::ip_fft::params_t which can be used
to set default values for the static parameters. If the default values are to be used, the
parameterization struct can be used directly with the FFT function.
hls::fft<hls::ip_fft::params_t >
(xn1, xk1, &fft_status1, &fft_config1);
A more typical use is to change some of the parameters to non-default values. This is
performed by creating a new user-defined parameterization struct based on the default
parameterization struct and changing some of the default values.
In this example, a new user struct my_fft_config is defined with a new value for the
output ordering (changed to natural_order). All other static parameters to the FFT use the
default values (shown below in Table 2-20).
hls::fft<my_fft_config >
(xn1, xk1, &fft_status1, &fft_config1);
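The mechanism behind a user-defined parameterization struct is ordinary C++ inheritance with name hiding: the derived struct inherits every default and redeclares only the members it changes. The following host-only sketch illustrates the idea. The namespace demo_fft, its members, and its default values are hypothetical stand-ins chosen for illustration; they are not the definitions in hls_fft.h, and Table 2-20 remains the reference for the actual defaults.

```cpp
#include <cassert>

// Hypothetical stand-in for the library namespace and its default struct.
namespace demo_fft {
enum ordering { bit_reversed_order = 0, natural_order };

struct params_t {
    static const unsigned ordering_opt = bit_reversed_order;  // illustrative default
    static const unsigned max_nfft = 10;                      // illustrative default
};
}  // namespace demo_fft

// User struct: inherits every default and hides only ordering_opt.
struct my_fft_config : demo_fft::params_t {
    static const unsigned ordering_opt = demo_fft::natural_order;
};

// A template reads the static parameters through the struct name.
template <typename PARAMS>
bool uses_natural_order() {
    return PARAMS::ordering_opt == demo_fft::natural_order;
}

// Transform length for a run time setting, limited by the static maximum.
template <typename PARAMS>
unsigned run_time_length(unsigned nfft) {
    assert(nfft <= PARAMS::max_nfft);  // cannot exceed the static maximum
    return 1u << nfft;
}
```

Here uses_natural_order<my_fft_config>() is true while uses_natural_order<demo_fft::params_t>() is false, and both structs share the same inherited max_nfft value.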
The values used for the parameterization struct hls::ip_fft::params_t are explained
in the following table. The default values for the parameters and a list of possible values are
provided in Table 2-20.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP Fast Fourier Transform
Product Guide (PG109) [Ref 5] for details on the parameters and the implication for their settings.
When specifying parameter values which are not integer or boolean, the HLS FFT
namespace should be used.
For example, the possible values for parameter butterfly_type in the following table are
use_luts and use_xtremedsp_slices. The values used in the C program should be
butterfly_type = hls::ip_fft::use_luts and butterfly_type =
hls::ip_fft::use_xtremedsp_slices.
The following table covers all features and functionality of the FFT IP. Features and
functionality not described in this table are not supported in the Vivado HLS
implementation.
The run time configuration and status can be accessed using the predefined structs from
the FFT C library:
• hls::ip_fft::config_t<param1>
• hls::ip_fft::status_t<param1>
Note: In both cases, the struct requires the name of the static parameterization struct, shown in
these examples as param1. Refer to the previous section for details on defining the static
parameterization struct.
The run time configuration struct allows the following actions to be performed in the C
code:
IMPORTANT: The length specified during run time cannot exceed the size defined by max_nfft in the
static configuration.
The output status port can be accessed using the pre-defined struct to determine:
IMPORTANT: After each transaction completes, check the overflow status to confirm the correct
operation of the FFT.
hls::fft<STATIC_PARAM> (
INPUT_DATA_ARRAY,
OUTPUT_DATA_ARRAY,
OUTPUT_STATUS,
INPUT_RUN_TIME_CONFIGURATION);
The STATIC_PARAM is the static parameterization struct discussed in the earlier section FFT
Static Parameters. This defines the static parameters for the FFT.
Both the input and output data are supplied to the function as arrays (INPUT_DATA_ARRAY
and OUTPUT_DATA_ARRAY). In the final implementation, the ports on the FFT RTL block will
be implemented as AXI4-Stream ports. Xilinx recommends always using the FFT function in
a region using dataflow optimization (set_directive_dataflow), because this ensures
the arrays are implemented as streaming arrays. An alternative is to specify both arrays as
streaming using the set_directive_stream command.
IMPORTANT: The FFT cannot be used in a region which is pipelined. If high-performance operation is
required, pipeline the loops or functions before and after the FFT then use dataflow optimization on all
loops and functions in the region.
To use fixed-point data types, the Vivado HLS arbitrary precision type ap_fixed should be
used.
#include "ap_fixed.h"
typedef ap_fixed<FFT_INPUT_WIDTH,1> data_in_t;
typedef ap_fixed<FFT_OUTPUT_WIDTH,FFT_OUTPUT_WIDTH-FFT_INPUT_WIDTH+1> data_out_t;
#include <complex>
typedef std::complex<data_in_t> cmpxData;
typedef std::complex<data_out_t> cmpxDataOut;
In both cases, the FFT should be parameterized with the same correct data sizes. In the case
of floating point data, the data widths will always be 32-bit and any other specified size will
be considered invalid.
TIP: The input and output width of the FFT can be configured to any arbitrary value within the
supported range. The variables which connect to the input and output parameters must be defined in
increments of 8-bit. For example, if the output width is configured as 33-bit, the output variable must
be defined as a 40-bit variable.
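The rounding rule in the tip above amounts to rounding the configured width up to the next multiple of 8. A minimal sketch, using an illustrative helper name:

```cpp
#include <cassert>

// Smallest multiple of 8 that is greater than or equal to width_bits.
unsigned padded_width(unsigned width_bits) {
    assert(width_bits > 0);
    return ((width_bits + 7u) / 8u) * 8u;
}
```

For a configured output width of 33 bits this returns 40, and a width that is already a multiple of 8, such as 32, is returned unchanged.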
The multichannel functionality of the FFT can be used by using two-dimensional arrays for
the input and output data. In this case, the array data should be configured with the first
dimension representing each channel and the second dimension representing the FFT data.
The FFT core consumes and produces data as interleaved channels (for example, ch0-data0,
ch1-data0, ch2-data0, etc, ch0-data1, ch1-data1, ch2-data1, etc.). Therefore, to stream the
input or output arrays of the FFT using the same sequential order that the data was read or
written, you must fill or empty the two-dimensional arrays for multiple channels by
iterating through the channel index first, as shown in the following example:
cmpxData in_fft[FFT_CHANNELS][FFT_LENGTH];
cmpxData out_fft[FFT_CHANNELS][FFT_LENGTH];
}
}
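The loop ordering described above (data index in the outer loop, channel index in the inner loop) can be checked on the host with plain containers. This is a sketch under stated assumptions: fill_channels is a hypothetical helper, int samples stand in for the complex data type, and a std::vector stands in for the sequential input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy an interleaved sequence (ch0-data0, ch1-data0, ..., ch0-data1, ...)
// into a [channel][index] array by iterating through the channel index first.
std::vector<std::vector<int> > fill_channels(const std::vector<int>& interleaved,
                                             std::size_t channels,
                                             std::size_t length) {
    assert(interleaved.size() == channels * length);
    std::vector<std::vector<int> > out(channels, std::vector<int>(length));
    std::size_t k = 0;
    for (std::size_t i = 0; i < length; ++i) {          // data index
        for (std::size_t ch = 0; ch < channels; ++ch) { // channel index
            out[ch][i] = interleaved[k++];
        }
    }
    return out;
}
```

Because the inner loop walks the channels, the elements are consumed in exactly the order in which they arrive, which is the access pattern required for the arrays to be streamed.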
Design examples using the FFT C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FFT.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP FIR Compiler Product
Guide (PG149) [Ref 6] for information on how to implement and use the features of the IP.
The following code examples provide a summary of how each of these steps is performed.
Each step is discussed in more detail below.
First, include the FIR library in the source code. This header file resides in the include
directory in the Vivado HLS installation area. This directory is automatically searched when
Vivado HLS executes. There is no need to specify the path to this directory if compiling
inside Vivado HLS.
#include "hls_fir.h"
Define the static parameters of the FIR. This includes static attributes such as the input
width, the coefficients, and the filter rate (single, decimation, hilbert). The FIR library includes a
parameterization struct hls::ip_fir::params_t which can be used to initialize all
static parameters with default values.
In this example, the coefficients are defined as residing in array coeff_vec and the default
values for the number of coefficients, the input width and the quantization mode are
over-ridden using a user-defined struct myconfig based on the pre-defined struct.
Create an instance of the FIR function using the HLS namespace with the defined static
parameters (myconfig in this example) and then call the function with the run method to
execute the function. The function arguments are, in order, input data and output data.
Optionally, a run time input configuration can be used. In some modes of the FIR, the data
on this input determines how the coefficients are used during interleaved channels or when
coefficient reloading is required. This configuration can be dynamic and is therefore
defined as a variable. For a complete description of which modes require this input
configuration, refer to the LogiCORE IP FIR Compiler Product Guide (PG149) [Ref 6].
When the run time input configuration is used, the FIR function is called with three
arguments: input data, output data and input configuration.
Design examples using the FIR C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FIR.
IMPORTANT: There are no defaults defined for the coefficients. Therefore, Xilinx does not recommend
using the pre-defined struct to directly initialize the FIR. A new user defined struct which specifies the
coefficients should always be used to perform the static parameterization.
In this example, a new user struct my_config is defined with a new value for the
coefficients. The coefficients are specified as residing in array coeff_vec. All other
parameters to the FIR will use the default values (shown below in Table 2-22).
The following table describes the parameters used for the parameterization struct
hls::ip_fir::params_t. Table 2-22 provides the default values for the parameters and
a list of possible values.
RECOMMENDED: Xilinx highly recommends that you refer to the LogiCORE IP FIR Compiler Product
Guide (PG149) [Ref 6] for details on the parameters and the implication for their settings.
When specifying parameter values that are not integer or boolean, the HLS FIR namespace
should be used.
For example, the possible values for rate_change are shown in the following table to be
integer and fixed_fractional. The values used in the C program should be
rate_change = hls::ip_fir::integer and rate_change =
hls::ip_fir::fixed_fractional.
The following table covers all features and functionality of the FIR IP. Features and
functionality not described in this table are not supported in the Vivado HLS
implementation.
The STATIC_PARAM is the static parameterization struct discussed in the earlier section
FIR Static Parameters. This defines most static parameters for the FIR.
Both the input and output data are supplied to the function as arrays (INPUT_DATA_ARRAY
and OUTPUT_DATA_ARRAY). In the final implementation, these ports on the FIR IP will be
implemented as AXI4-Stream ports. Xilinx recommends always using the FIR function in a
region using the dataflow optimization (set_directive_dataflow), because this
ensures the arrays are implemented as streaming arrays. An alternative is to specify both
arrays as streaming using the set_directive_stream command.
IMPORTANT: The FIR cannot be used in a region which is pipelined. If high-performance operation is
required, pipeline the loops or functions before and after the FIR then use dataflow optimization on all
loops and functions in the region.
The multichannel functionality of the FIR is supported through interleaving the data in a
single input and single output array.
• The size of the input array should be large enough to accommodate all samples:
num_channels * input_length.
• The output array size should be specified to contain all output samples: num_channels
* output_length.
The following code example demonstrates, for two channels, how the data is interleaved. In
this example, the top-level function has two channels of input data (din_i, din_q) and
two channels of output data (dout_i, dout_q). Two functions, at the front-end (fe) and
back-end (be) are used to correctly order the data in the FIR input array and extract it from
the FIR output array.
din_t fir_in[FIR_LENGTH];
dout_t fir_out[FIR_LENGTH];
static hls::FIR<myconfig> fir1;
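The front-end and back-end reordering can be sketched in plain C++ as follows. This is a minimal illustration that assumes sample-by-sample interleaving (I0, Q0, I1, Q1, ...); the names interleave_fe, deinterleave_be, data_t, and SAMPLES are hypothetical, and the ordering that the FIR IP expects should be confirmed in the FIR Compiler Product Guide.

```cpp
#include <cassert>

const int NUM_CHANNELS = 2;  // two channels, as in the text
const int SAMPLES = 4;       // samples per channel (illustrative value)

typedef int data_t;          // stand-in for the design's fixed-point types

// Front end: place the samples of both channels alternately in one array.
void interleave_fe(const data_t din_i[SAMPLES], const data_t din_q[SAMPLES],
                   data_t fir_in[NUM_CHANNELS * SAMPLES]) {
    for (int n = 0; n < SAMPLES; ++n) {
        fir_in[NUM_CHANNELS * n]     = din_i[n];
        fir_in[NUM_CHANNELS * n + 1] = din_q[n];
    }
}

// Back end: split one interleaved array back into per-channel arrays.
void deinterleave_be(const data_t fir_out[NUM_CHANNELS * SAMPLES],
                     data_t dout_i[SAMPLES], data_t dout_q[SAMPLES]) {
    for (int n = 0; n < SAMPLES; ++n) {
        dout_i[n] = fir_out[NUM_CHANNELS * n];
        dout_q[n] = fir_out[NUM_CHANNELS * n + 1];
    }
}
```

The single array holds NUM_CHANNELS * SAMPLES elements, which matches the sizing rule given above for the input and output arrays.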
This input configuration can be performed in the C code using a standard ap_int.h 8-bit
data type. In this example, the header file fir_top.h specifies the use of the FIR and
ap_fixed libraries, defines a number of the design parameter values and then defines
some fixed-point types based on these:
#include "ap_fixed.h"
#include "hls_fir.h"
In the top-level code, the information in the header file is included, the static
parameterization struct is created using the same constant values used to specify the
bit-widths, ensuring the C code and FIR configuration match, and the coefficients are
specified. At the top-level, an input configuration, defined in the header file as 8-bit data,
is passed into the FIR.
#include "fir_top.h"
// DUT
void fir_top(s_data_t in[INPUT_LENGTH],
m_data_t out[OUTPUT_LENGTH],
config_t* config)
{
s_data_t fir_in[INPUT_LENGTH];
m_data_t fir_out[OUTPUT_LENGTH];
config_t fir_config;
// Create struct for config
static hls::FIR<param1> fir1;
//==================================================
// Dataflow process
dummy_fe(in, fir_in, config, &fir_config);
fir1.run(fir_in, fir_out, &fir_config);
dummy_be(fir_out, out);
//==================================================
}
Design examples using the FIR C library are provided in the Vivado HLS examples and can
be accessed using menu option Help > Welcome > Open Example Project > Design
Examples > FIR.
DDS IP Library
You can use the Xilinx Direct Digital Synthesizer (DDS) IP block within a C++ design using
the hls_dds.h library. This section explains how to configure DDS IP in your C++ code.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP DDS Compiler Product
Guide (PG141) [Ref 7] for information on how to implement and use the features of the IP.
IMPORTANT: The C IP implementation of the DDS IP core supports the fixed mode for the
Phase_Increment and Phase_Offset parameters and supports the none mode for Phase_Offset, but it
does not support programmable and streaming modes for these parameters.
First, include the DDS library in the source code. This header file resides in the include
directory in the Vivado HLS installation area, which is automatically searched when Vivado
HLS executes.
#include "hls_dds.h"
Define the static parameters of the DDS. For example, define the phase width, clock rate,
and phase and increment offsets. The DDS C library includes a parameterization struct
hls::ip_dds::params_t, which is used to initialize all static parameters with default
values. By redefining any of the values in this struct, you can customize the implementation.
The following example shows how to override the default values for the phase width, clock
rate, phase offset, and the number of channels using a user-defined struct param1, which
is based on the existing predefined struct hls::ip_dds::params_t:
Create an instance of the DDS function using the HLS namespace with the defined static
parameters (for example, param1). Then, call the function with the run method to execute
the function. Following are the data and phase function arguments shown in order:
To access design examples that use the DDS C library, select Help > Welcome > Open
Example Project > Design Examples > DDS.
RECOMMENDED: Xilinx highly recommends that you review the LogiCORE IP DDS Compiler Product
Guide (PG141) [Ref 7] for details on the parameters and values.
The following table shows the possible values for the hls::ip_dds::params_t
parameterization struct parameters.
SRL IP Library
C code is written to satisfy several different requirements: reuse, readability, and
performance. It is unlikely that existing C code was written to produce the most ideal
hardware after high-level synthesis.
As with the requirements for reuse, readability, and performance, certain coding techniques
or pre-defined constructs can ensure that synthesis produces more optimal hardware, or
can better model hardware in C for easier validation of the algorithm.
The most common way to implement a shift register from C into hardware is to completely
partition the array into individual elements, and allow the data dependencies between the
elements in the RTL to imply a shift register.
Logic synthesis typically implements the RTL shift register into a Xilinx SRL resource, which
efficiently implements shift registers. The issue is that sometimes logic synthesis does not
implement the RTL shift register using an SRL component:
• When data is accessed in the middle of the shift register, logic synthesis cannot directly
infer an SRL.
• Sometimes, even when the SRL is ideal, logic synthesis may implement the shift register
in flip-flops due to other factors. (Logic synthesis is also a complex process.)
Vivado HLS provides a C++ class (ap_shift_reg) to ensure that the shift register defined
in the C code is always implemented using an SRL resource. The ap_shift_reg class has
two methods to perform the various read and write accesses supported by an SRL
component.
The ap_shift_reg.h header file that defines the ap_shift_reg class is also included
with Vivado HLS as a standalone package. You have the right to use it in your own source
code. The package xilinx_hls_lib_<release_number>.tgz is located in the
include directory in the Vivado HLS installation area.
When using the ap_shift_reg class, Vivado HLS creates a unique RTL component for
each shifter. When logic synthesis is performed, this component is synthesized into an SRL
resource.
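The two access styles of a shift register can be modeled in plain C++ for algorithm validation. The class shift_reg_model below is a hypothetical behavioral stand-in and is not the ap_shift_reg class; the method names, argument order, and the read-before-shift ordering are assumptions chosen for this sketch.

```cpp
#include <cassert>

// Behavioral stand-in for a fixed-depth shift register; NOT ap_shift_reg.
template <typename T, unsigned DEPTH>
class shift_reg_model {
public:
    shift_reg_model() {
        for (unsigned i = 0; i < DEPTH; ++i) regs[i] = T();
    }

    // Read the value stored at position addr without shifting.
    T read(unsigned addr) const {
        assert(addr < DEPTH);
        return regs[addr];
    }

    // Capture the value at addr, then (if enabled) move every element one
    // position toward the end and load din at position 0.
    T shift(T din, unsigned addr = DEPTH - 1, bool enable = true) {
        assert(addr < DEPTH);
        T out = regs[addr];
        if (enable) {
            for (unsigned i = DEPTH - 1; i > 0; --i) regs[i] = regs[i - 1];
            regs[0] = din;
        }
        return out;
    }

private:
    T regs[DEPTH];
};
```

Reading from an arbitrary position, as read(addr) does here, is the access pattern that the text identifies as preventing logic synthesis from directly inferring an SRL from a partitioned array.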
The linear algebra functions all use two-dimensional arrays to represent matrices. All
functions support float (single precision) inputs, for real and complex data. A subset of the
functions support ap_fixed (fixed-point) inputs, for real and complex data. The precision
and rounding behavior of the ap_fixed types may be user defined, if desired.
A complete description of all linear algebra functions is provided in the HLS Linear Algebra
Library Functions in Chapter 4.
hls::cholesky(In_Array,Out_Array);
cholesky(In_Array,Out_Array);
To simplify the process of optimization, Vivado HLS provides the linear algebra library
functions, which include several C code architectures and embedded optimization
directives. Using a C++ configuration class, you can select the C code to use and the
optimization directives to apply.
Although the exact optimizations vary from function to function, the configuration class
typically allows you to specify the level of optimization for the RTL implementation as
follows:
Vivado HLS provides example projects that show how to use the configuration class for
each function in the linear algebra library. You can use these examples as templates to learn
how to configure Vivado HLS for each of the functions for a specific implementation target.
Each example provides a C++ source file with multiple C code architectures as different
C++ functions.
Note: To identify the top-level C++ function, look for the TOP directive in the directives.tcl
file or the Vivado HLS GUI Directive tab.
You can open these examples from the Vivado HLS Welcome screen:
To determine which optimization works best for your design, you can compare the
performance and utilization estimates for each solution using the Vivado HLS Compare
Reports feature. To compare the estimates, you must run synthesis for all of the project
solutions by selecting Solution > Run C Synthesis > All Solutions. Then, use the Compare
Reports toolbar button.
Cholesky
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Key Factors
Following is additional information about the key factors in the preceding table:
• Architecture
° 1: Uses higher DSP utilization but minimized memory utilization with increased
throughput. This value does not support inner loop unrolling to further increase
throughput.
° 2: Uses highest DSP and memory utilization. This value supports inner loop
unrolling to improve overall throughput with a limited increase in DSP resources.
This is the most flexible architecture for design exploration.
• Inner loop pipelining
° >1: For ARCH 2, enables Vivado HLS to resource share and reduce the DSP
utilization. When using complex floating-point data types, setting the value to 2 or
4 significantly reduces DSP utilization.
• Inner loop unrolling
° For ARCH 2, duplicates the hardware required to implement the loop processing by
a specified factor, executes the corresponding number of loop iterations in parallel,
and increases throughput but also increases DSP and memory utilization.
Specifications
You can specify all factors using a configuration class derived from the following
hls::cholesky_traits base class by redefining the appropriate class member:
struct MY_CONFIG :
hls::cholesky_traits<LOWER_TRIANGULAR,ROWS_COLS_A,MAT_IN_T,MAT_OUT_T>{
static const int ARCH = 2;
static const int INNER_II = 2;
static const int UNROLL_FACTOR = 1;
};
hls::cholesky_top<LOWER_TRIANGULAR,ROWS_COLS_A,MY_CONFIG,MAT_IN_T,MAT_OUT_T>(A,L);
hls::cholesky<LOWER_TRIANGULAR,ROWS_COLS_A,MAT_IN_T,MAT_OUT_T>(A,L);
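When verifying a Cholesky implementation, a test bench needs known good values to compare against. The following plain C++ reference is a sketch of one way to compute them for a real, symmetric, positive-definite matrix. The names cholesky_ref and DIM are hypothetical, the routine produces the lower-triangular factor, and it is a textbook software formulation rather than the library implementation.

```cpp
#include <cassert>
#include <cmath>

const int DIM = 3;  // illustrative matrix dimension

// Software reference: computes L such that A = L * L^T.
// Returns false if a non-positive pivot is found (A is not positive definite).
bool cholesky_ref(const float A[DIM][DIM], float L[DIM][DIM]) {
    for (int i = 0; i < DIM; ++i) {
        for (int j = 0; j < DIM; ++j) L[i][j] = 0.0f;
    }
    for (int j = 0; j < DIM; ++j) {
        // Diagonal element of column j.
        float diag = A[j][j];
        for (int k = 0; k < j; ++k) diag -= L[j][k] * L[j][k];
        if (diag <= 0.0f) return false;
        L[j][j] = std::sqrt(diag);
        // Elements below the diagonal in column j.
        for (int i = j + 1; i < DIM; ++i) {
            float off = A[i][j];
            for (int k = 0; k < j; ++k) off -= L[i][k] * L[j][k];
            L[i][j] = off / L[j][j];
        }
    }
    return true;
}
```

A test bench can call cholesky_ref on the same input as the function under test and compare the two results element by element within a tolerance suited to the data type.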
Examples
The following table shows example implementation solutions for the Cholesky function. The
performance metrics are generated using the Cholesky example project, which defines a
solution for each implementation target. The throughput and latency figures are based on
post-synthesis simulation.
              Architecture  Inner loop  Inner loop                                Throughput  Latency
Solution      (ARCH)        pipelining  unrolling        DSP  BRAM  FF     LUT    cycles      cycles
                            (INNER_II)  (UNROLL_FACTOR)
small         0             N/A         N/A              8    8     5850   4271   33724       33724
balanced      1             N/A         N/A              10   8     4582   3367   14466       14466
alt_balanced  2             4           1                10   6     5115   3552   15412       15412
fast          2             1           1                36   6     7820   5288   9322        9322
faster        2             1           2                72   12    12569  8494   8370        8370
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Key Factors
Following is additional information about the key factors shown in the preceding table:
• Sub-function implementation
° >1: Enables Vivado HLS to resource share and reduce the DSP utilization.
• DATAFLOW directive
° Removes the sub-function hierarchy and allows Vivado HLS to better share
resources and can reduce DSP and memory utilization.
TIP: You can adjust the resources and throughput of the Inverse functions to meet specific requirements
by combining the DATAFLOW directive with the appropriate sub-function implementations.
Specifications
set_directive_dataflow "cholesky_inverse_top"
You can specify the individual sub-function implementations using a configuration class
derived from the following hls::cholesky_inverse_traits or
hls::qr_inverse_traits base class by redefining the appropriate class member:
typedef hls::cholesky_inverse_traits<ROWS_COLS_A,
MAT_IN_T,
MAT_OUT_T> MY_DFLT_CFG;
hls::cholesky_inverse_top<ROWS_COLS_A,MY_CONFIG,MAT_IN_T,MAT_OUT_T>(A,INVERSE_A,inverse_OK);
hls::cholesky_inverse<ROWS_COLS_A,MAT_IN_T,MAT_OUT_T>(A,INVERSE_A,inverse_OK);
Examples
The following table shows example implementation solutions for the Cholesky and matrix
multiply sub-functions. The performance metrics are generated using the Cholesky
Inverse example project, which defines a solution for each implementation target. The
throughput and latency figures are based on post-synthesis simulation.
Table columns: DATAFLOW directive; Cholesky and Multiply Target; Subst. DIAG_II; Subst.
INNER_II; Resources (DSP, BRAM, FF, LUT); Throughput cycles; Latency cycles.
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
The following table shows example implementation solutions for the QRF and matrix
multiply sub-functions. The performance metrics are generated using the QR Inverse
example project, which defines a solution for each implementation target. The throughput
and latency figures are based on post-synthesis simulation.
                                DATAFLOW   QRF and   Subst.   Subst.    Resources                Throughput  Latency
Solution                        directive  Multiply  DIAG_II  INNER_II  DSP  BRAM  FF     LUT    cycles      cycles
                                           Target
smaller                   ✓     N/A        Small     8        8         18   23    13530  9715   10734       10734
small                     N/A   N/A        Small     8        8         33   25    16249  11721  10705       10705
balanced                  N/A   N/A        Balanced  2        2         92   26    39436  21675  6277        6277
balanced_high_throughput  N/A   ✓          Balanced  2        2         92   38    39461  21653  2975        12458
default                   N/A   N/A        Default   1        1         110  26    41254  22532  5982        5982
fast                      N/A   N/A        Fast      1        1         146  26    45026  25471  5576        5576
fast_high_throughput      N/A   ✓          Fast      1        1         146  38    45051  25449  2650        11066
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
Matrix Multiply
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Key Factors
Following is additional information about the key factors in the preceding table:
• Architecture
The ARCH key factor selects the architecture based on the implementation data type.
° >1: When using complex floating-point data types, shares resources and reduces
DSP utilization. Setting the value to 2 or 4 significantly reduces DSP utilization.
• Inner loop unrolling
° For ARCH 2, duplicates the hardware required to implement the loop processing by
a specified factor, executes the corresponding number of loop iterations in parallel,
and increases throughput but also increases DSP and memory utilization.
° For ARCH 2, partially unrolling the accumulation loop results in Vivado HLS splitting
the sum_mult array across multiple Block RAM.
° When the partitioned size does not require using a Block RAM, use the RESOURCE
directive to specify a LUTRAM.
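The effect of partially unrolling an accumulation loop can be illustrated in plain C++. The sketch below is loosely analogous to the behavior described above and is not the library source: dot_unrolled, LEN, and the partial array are hypothetical names, and integer data is used so that the result does not depend on summation order.

```cpp
#include <cassert>

const int LEN = 8;  // illustrative loop trip count

// Accumulation written with FACTOR partial sums. Each outer iteration
// performs FACTOR multiply-accumulate operations, one per accumulator,
// which is the work that unrolled hardware would execute in parallel.
template <int FACTOR>
int dot_unrolled(const int a[LEN], const int b[LEN]) {
    assert(LEN % FACTOR == 0);  // the sketch assumes an exact division
    int partial[FACTOR];
    for (int u = 0; u < FACTOR; ++u) partial[u] = 0;

    for (int i = 0; i < LEN; i += FACTOR) {
        for (int u = 0; u < FACTOR; ++u) {
            partial[u] += a[i + u] * b[i + u];
        }
    }

    // Combine the partial sums into the final result.
    int sum = 0;
    for (int u = 0; u < FACTOR; ++u) sum += partial[u];
    return sum;
}
```

Each partial accumulator needs its own storage, which gives an intuition for why unrolling increases throughput at the cost of additional DSP and memory resources.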
Specifications
Except for the RESOURCE directive, you can specify all factors using a configuration class
derived from the following hls::matrix_multiply_traits base class by redefining
the appropriate class member:
hls::matrix_multiply_top<hls::NoTranspose,hls::NoTranspose,A_ROWS,A_COLS,B_ROWS,B_COLS,
C_ROWS,C_COLS,MY_CONFIG,MATRIX_T,MATRIX_T>(A,B,C);
hls::matrix_multiply<hls::NoTranspose,hls::NoTranspose,A_ROWS,A_COLS,B_ROWS,B_COLS,
C_ROWS,C_COLS,MATRIX_T,MATRIX_T>(A,B,C);
If you select ARCH 2, the RESOURCE directive is applied to the sum_mult array in function
hls::matrix_multiply_alt2 as follows:
Examples
The following table shows example implementation solutions for the matrix multiply
function. The performance metrics are generated using the Matrix Multiply Float
and Matrix Multiply Fixed example projects, which define a solution for each
implementation target. The throughput and latency values are based on post-synthesis
simulation.
Table columns: Architecture (ARCH); Inner loop pipelining (INNER_II); Inner loop unrolling
(INNER_UNROLL…); RESOURCE directive; Resources (DSP, BRAM, FF, LUT); Throughput cycles;
Latency cycles.
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
QRF
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Key Factors
Following is additional information about the key factors in the preceding table:
• Q and R update loop pipelining
° 2: Sets the minimum achievable initiation interval (II) of 2, which satisfies the Q and
R matrix array requirement of two writes every iteration of the update loop.
° >2: Enables Vivado HLS to further resource share and reduce the DSP utilization.
With complex-floating point data types, setting the value to 4 or 8 significantly
reduces DSP utilization.
• Q and R update loop unrolling
° Enables Vivado HLS to resource share and reduce the DSP utilization.
Specifications
You can specify all factors using a configuration class derived from the following
hls::qrf_traits base class by redefining the appropriate class member:
hls::qrf_top<TRANSPOSED_Q,A_ROWS,A_COLS,MY_CONFIG,MAT_IN_T,MAT_OUT_T>(A,Q,R);
hls::qrf<TRANSPOSED_Q,A_ROWS,A_COLS,MAT_IN_T,MAT_OUT_T>(A,Q,R);
Examples
The following table shows example implementation solutions for the QRF function. The
performance metrics are generated using the QRF example project, which defines a solution
for each implementation target. The throughput and latency figures are based on
post-synthesis simulation.
Table columns: Q and R update loop pipelining (UPDATE_II); Q and R update loop unrolling
(UNROLL_FACTOR); DSP; BRAM; FF; LUT; Throughput cycles; Latency cycles.
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
SVD
Implementation Controls
The following table summarizes the key factors that influence resource utilization, function
throughput (initiation interval), and function latency. The values of Low, Medium, and High
are relative to the other key factors.
Key Factors
Following is additional information about the key factors in the preceding table:
• ALLOCATION directive
° Limits the number of implemented 2x1 vector dot products. Vivado HLS schedules
the SVD function to use the specified number 2x1 vector dot product kernels.
Note: The SVD algorithm is computationally intensive, particularly for complex data types.
The ALLOCATION directive is the most effective method to balance resource utilization and
throughput.
• Off-diagonal loop pipelining
° 4: Sets the minimum achievable initiation interval (II) of 4, which satisfies the S, U,
and V array requirement of four writes every iteration of the off-diagonal loop.
° >4: Enables Vivado HLS to further resource share and reduce the DSP utilization.
• Diagonal loop pipelining
• Iterations
Specifications
You can apply the ALLOCATION directive to the hls::svd_pairs function in combination
with the INLINE directive as follows:
config_compile -unsafe_math_optimizations
You can specify all other factors using a configuration class derived from the following
hls::svd_traits base class by redefining the appropriate class member:
hls::svd_top<A_ROWS,A_COLS,MY_CONFIG,MATRIX_IN_T,MATRIX_OUT_T>(A,S,U,V);
hls::svd<A_ROWS,A_COLS,MATRIX_IN_T,MATRIX_OUT_T>(A,S,U,V);
Examples
The following table shows example implementation solutions for the SVD function. The
performance metrics are generated using the SVD example project, which defines a solution
for each implementation target. The throughput and latency figures are based on
post-synthesis simulation.
Table columns: Diagonal and Off-diagonal loop pipelining (DIAG_II / OFF_DIAG_II); Iterations
(NUM_SWEEP); DSP; BRAM; FF; LUT; Throughput cycles; Latency cycles.
Notes:
1. Bold row indicates the default configuration.
2. N/A indicates key factors that are not utilized or have a limited effect.
3. Values are representative only and are not intended to be exact.
Functions use the Vivado HLS fixed precision types ap_[u]int and ap_[u]fixed to
describe input and output data as needed. The functions have the minimum viable interface
type to maximize flexibility. For example, functions with a simple throughput model, such as
one sample out for one sample in, use pointer interfaces. Functions that perform a rate
change, such as viterbi_decoder, use the type hls::stream on the interfaces.
You can copy the existing library and make the interfaces more complex, such as creating
hls::streams for the pointer interfaces and AXI4-Stream interfaces for any function.
However, complex interfaces require more resources.
Vivado HLS provides most library elements as templated C++ classes, which are fully
described in the header file (hls_dsp.h) with constructor, destructor, and operator access
functions.
For a complete description of all DSP functions, see the HLS DSP Library Functions in
Chapter 4.
Functions in the DSP Library include synthesis directives as pragmas in the source code,
which guide Vivado HLS in synthesizing the function to meet typical requirements. The
functions are optimized for maximal throughput, which is the most common use case. For
example, arrays might be completely partitioned to ensure that an Initiation Interval of 1 is
achieved regardless of template parameter configuration.
• To apply optimizations on the DSP functions, open the header file hls_dsp.h in the
Vivado HLS GUI, and do one of the following:
° Use the Explorer Pane and navigate to the file using the Includes folder.
• To add or remove an optimization as a directive, open the header file in the Information
pane, and use the Directives tab.
Note: If you add the optimization as a pragma, Vivado HLS places the optimization in the library
and applies it every time you add the header to a design. File write permissions might be
required to add the optimization as a pragma.
TIP: To modify a function in order to change its RTL implementation, look for comments in the
library source code with the prefix TIP, which indicate where it might be useful to place a pragma or
apply a directive.
IMPORTANT: The term “C code” as used in this guide refers to code written in C, C++, SystemC, and
OpenCL API C, unless otherwise specifically noted.
The coding examples in this guide are part of the Vivado® HLS release. Access the coding
examples using one of the following methods:
Unsupported C Constructs
While Vivado HLS supports a wide range of the C language, some constructs are not
synthesizable, or can result in errors further down the design flow. This section discusses
areas in which coding changes must be made for the function to be synthesized and
implemented in a device.
To be synthesized:
System Calls
System calls cannot be synthesized because they are actions that relate to performing some
task upon the operating system in which the C program is running.
Vivado HLS ignores commonly-used system calls that display only data and that have no
impact on the execution of the algorithm, such as printf() and fprintf(stdout,). In
general, calls to the system cannot be synthesized and should be removed from the
function before synthesis. Other examples of such calls are getc(), time(), sleep(), all
of which make calls to the operating system.
Vivado HLS defines the macro __SYNTHESIS__ when synthesis is performed. This allows
the __SYNTHESIS__ macro to exclude non-synthesizable code from the design.
Note: Only use the __SYNTHESIS__ macro in the code to be synthesized. Do not use this macro in the
test bench, because it is not obeyed by C simulation or C RTL co-simulation.
In the following code example, the intermediate results from a sub-function are saved to a
file on the hard drive. The macro __SYNTHESIS__ is used to ensure the non-synthesizable
file writes are ignored during synthesis.
#include "hier_func4.h"
sumsub_func(&A,&B,&apb,&amb);
#ifndef __SYNTHESIS__
FILE *fp1; // The following code is ignored for synthesis
char filename[255];
sprintf(filename,"Out_apb_%03d.dat",apb);
fp1=fopen(filename,"w");
fprintf(fp1, "%d \n", apb);
fclose(fp1);
#endif
shift_func(&apb,&amb,C,D);
}
CAUTION! If the __SYNTHESIS__ macro is used to change the functionality of the C code, it can
result in different results between C simulation and C synthesis. Errors in such code are inherently
difficult to debug. Do not use the __SYNTHESIS__ macro to change functionality.
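A safe use of the macro guards only code that has no effect on the computed result. In the following minimal sketch, the function scale_and_offset is hypothetical; the guarded statement prints a debug message and does not modify any value that the function returns.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical sub-function: the arithmetic is identical whether or not
// __SYNTHESIS__ is defined.
int scale_and_offset(int x) {
    int result = 3 * x + 1;
#ifndef __SYNTHESIS__
    // Debug-only output: excluded from synthesis, no effect on the result.
    std::fprintf(stderr, "scale_and_offset(%d) = %d\n", x, result);
#endif
    return result;
}
```

If the guarded region assigned to result, C simulation and synthesis would no longer describe the same function, which is the situation that the caution above warns against.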
Memory allocation system calls must be removed from the design code before synthesis.
Because dynamic memory operations are used to define the functionality of the design,
they must be transformed into equivalent bounded representations. The following code
example shows how a design using malloc() can be transformed into a synthesizable
version and highlights two useful coding style techniques:
• The user-defined macro NO_SYNTH is used to select between the synthesizable and
non-synthesizable versions. This ensures that the same code is simulated in C and
synthesized in Vivado HLS.
• The pointers in the original design using malloc() do not need to be rewritten to
work with fixed sized elements.
Fixed sized resources can be created and the existing pointer can simply be made to
point to the fixed sized resource. This technique can prevent manual re-coding of the
existing design.
#include "malloc_removed.h"
#include <stdlib.h>
//#define NO_SYNTH
#ifdef NO_SYNTH
long long *out_accum = malloc (sizeof(long long));
int* array_local = malloc (64 * sizeof(int));
#else
long long _out_accum;
long long *out_accum = &_out_accum;
int _array_local[64];
int* array_local = &_array_local[0];
#endif
int i,j;
*out_accum=0;
LOOP_ACCUM:for (j=0;j<N-1; j++) {
*out_accum += *(array_local+j);
}
return *out_accum;
}
Because the coding changes here impact the functionality of the design, Xilinx does not
recommend using the __SYNTHESIS__ macro. Xilinx recommends that you perform the
following steps:
1. Add the user-defined macro NO_SYNTH to the code and modify the code.
2. Enable macro NO_SYNTH, execute the C simulation, and save the results.
3. Disable the macro NO_SYNTH, and execute the C simulation to verify that the results are
identical.
4. Perform synthesis with the user-defined macro disabled.
This methodology ensures that the updated code is validated with C simulation and that the
identical code is then synthesized.
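The comparison performed in steps 2 and 3 can be sketched as follows. To keep the sketch runnable in a single pass, the two variants are written as two functions rather than selected with the NO_SYNTH macro; that arrangement, the names accum_dynamic and accum_static, and the value of N are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>

const int N = 32;  // assumed array length

// Simulation-only variant: storage comes from malloc().
long long accum_dynamic(const int din[N]) {
    long long *out_accum = static_cast<long long *>(std::malloc(sizeof(long long)));
    int *array_local = static_cast<int *>(std::malloc(N * sizeof(int)));
    assert(out_accum != 0 && array_local != 0);
    for (int i = 0; i < N; ++i) *(array_local + i) = din[i];
    *out_accum = 0;
    for (int j = 0; j < N; ++j) *out_accum += *(array_local + j);
    long long result = *out_accum;
    std::free(array_local);
    std::free(out_accum);
    return result;
}

// Bounded variant: fixed-size storage, same pointers, same loop bodies.
long long accum_static(const int din[N]) {
    long long _out_accum;
    long long *out_accum = &_out_accum;
    int _array_local[N];
    int *array_local = &_array_local[0];
    for (int i = 0; i < N; ++i) *(array_local + i) = din[i];
    *out_accum = 0;
    for (int j = 0; j < N; ++j) *out_accum += *(array_local + j);
    return *out_accum;
}
```

Only the declarations differ between the two variants; the loops that use the pointers are unchanged, which is the point of retaining the pointers in the bounded version.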
As with restrictions on dynamic memory usage in C, Vivado HLS does not support (for
synthesis) C++ objects that are dynamically created or destroyed. This includes dynamic
polymorphism and dynamic virtual function calls.
The following code cannot be synthesized because it creates a new function at run time.
class A {
public:
virtual void bar() {…};
};
void fun(A* a) {
a->bar();
}
A* a = 0;
if (base)
a = new A();
else
a = new B();
fun(a);
Pointer Limitations
General Pointer Casting
Vivado HLS does not support general pointer casting, but supports pointer casting between
native C types. For more information on pointer casting, see Example 3-36.
Pointer Arrays
Vivado HLS supports pointer arrays for synthesis, provided that each pointer points to a
scalar or an array of scalars. Arrays of pointers cannot point to additional pointers. For more
information on pointer arrays, see Example 3-35.
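The form of pointer array described above, in which each pointer refers to an array of scalars, can be sketched as follows. The function sum_selected_row and its data are hypothetical and illustrate only the C++ construct; whether a particular design synthesizes depends on the rest of the code and the tool version.

```cpp
#include <cassert>

const int ROW_LEN = 3;  // illustrative array length

// Sums the elements of one of two scalar arrays, selected through an
// array of pointers. Every pointer refers to an array of scalars; none
// of them refers to another pointer.
int sum_selected_row(int select) {
    static int row0[ROW_LEN] = {1, 2, 3};
    static int row1[ROW_LEN] = {10, 20, 30};
    int *rows[2] = {row0, row1};

    assert(select == 0 || select == 1);
    int sum = 0;
    for (int i = 0; i < ROW_LEN; ++i) {
        sum += rows[select][i];
    }
    return sum;
}
```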
Recursive Functions
Recursive functions cannot be synthesized. This applies to functions that can form endless
recursion, where endless:
Vivado HLS does not support tail recursion in which there is a finite number of function
calls.
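One common coding change, offered here as a general C++ technique rather than a documented Vivado HLS requirement, is to rewrite a recursive function as a loop with a fixed upper bound. The functions factorial_recursive and factorial_loop and the bound MAX_N are hypothetical names used for this sketch.

```cpp
#include <cassert>

// Recursive form, shown for comparison only.
unsigned factorial_recursive(unsigned n) {
    return (n <= 1) ? 1u : n * factorial_recursive(n - 1);
}

const unsigned MAX_N = 12;  // assumed bound: 12! still fits in 32 bits

// Loop form: the trip count is fixed by MAX_N, and iterations beyond n
// leave the result unchanged.
unsigned factorial_loop(unsigned n) {
    assert(n <= MAX_N);
    unsigned result = 1;
    for (unsigned i = 2; i <= MAX_N; ++i) {
        if (i <= n) result *= i;
    }
    return result;
}
```

Both forms return the same values for arguments up to MAX_N, so the recursive version can serve as the known good reference in a test bench while the loop version is the candidate for synthesis.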
Note: Standard data types, such as std::complex, are supported for synthesis.
C Test Bench
The first step in the synthesis of any block is to validate that the C function is correct. This
step is performed by the test bench. Writing a good test bench can greatly increase your
productivity.
C functions execute orders of magnitude faster than RTL simulations. Using C to develop
and validate the algorithm before synthesis is more productive than developing at the RTL.
• The key to taking advantage of C development times is to have a test bench that checks
the results of the function against known good results. Because the algorithm is known
to be correct, any code changes can be validated before synthesis.
• Vivado HLS reuses the C test bench to verify the RTL design. No RTL test bench needs
to be created when using Vivado HLS. If the test bench checks the results from the
top-level function, the RTL can be verified by simulation.
Note: To provide input arguments to the test bench, select Project > Project Settings, click
Simulation, and use the Input Arguments option. The test bench must not require the execution of
interactive user inputs. Vivado HLS GUI does not have a command console and cannot accept user
inputs while the test bench executes.
Xilinx recommends that you separate the top-level function for synthesis from the test
bench, and that you use header files. The following code example shows a design in which
the function hier_func calls two sub-functions:
The data types are defined in the header file (hier_func.h), which is also described:
#include "hier_func.h"
sumsub_func(&A,&B,&apb,&amb);
shift_func(&apb,&amb,C,D);
}
The top-level function can contain multiple sub-functions. There can be only a single
top-level function for synthesis. To synthesize multiple functions, group them into a single
top-level function.
1. Add the file shown in Example 3-4 to a Vivado HLS project as a design file.
2. Specify the top-level function as hier_func.
After synthesis:
• The arguments to the top-level function (A, B, C, and D in Example 3-4) are synthesized
into RTL ports.
• The functions within the top-level (sumsub_func and shift_func in Example 3-4)
are synthesized into hierarchical blocks.
The header file (hier_func.h) in Example 3-4 shows how to use macros and how
typedef statements can make the code more portable and readable. Later sections show
how the typedef statement allows the types and therefore the bit-widths of the variables
to be refined for both area and performance improvements in the final FPGA
implementation.
#ifndef _HIER_FUNC_H_
#define _HIER_FUNC_H_
#include <stdio.h>
#define NUM_TRANS 40
#endif
The header file in this example includes some definitions (such as NUM_TRANS) that are not
required in the design file. These definitions are used by the test bench which also includes
the same header file.
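Putting the header and design file together, a complete version of the design might look like the following sketch. The typedef names follow the style described above, but the use of int for every type and the bodies of sumsub_func and shift_func are assumptions made so that the sketch is self-contained.

```cpp
#include <cassert>

// Types as they might appear in the header file (assumed to be int here).
typedef int din_t;
typedef int dint_t;
typedef int dout_t;

// Assumed body: sum and difference of the two inputs.
void sumsub_func(din_t *in1, din_t *in2, dint_t *out_sum, dint_t *out_sub) {
    *out_sum = *in1 + *in2;
    *out_sub = *in1 - *in2;
}

// Assumed body: scale each intermediate value with a right shift.
void shift_func(dint_t *in1, dint_t *in2, dout_t *out_a, dout_t *out_b) {
    *out_a = *in1 >> 1;
    *out_b = *in2 >> 2;
}

// Top-level function for synthesis: A, B, C, and D become RTL ports,
// and the two sub-functions become hierarchical blocks.
void hier_func(din_t A, din_t B, dout_t *C, dout_t *D) {
    dint_t apb, amb;
    sumsub_func(&A, &B, &apb, &amb);
    shift_func(&apb, &amb, C, D);
}
```

Because every variable is declared through a typedef, changing a bit-width later requires editing only the typedef lines.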
The following code example shows the test bench for the design shown in Example 3-4.
#include "hier_func.h"
int main() {
// Data storage
int a[NUM_TRANS], b[NUM_TRANS];
int c_expected[NUM_TRANS], d_expected[NUM_TRANS];
int c[NUM_TRANS], d[NUM_TRANS];
// Misc
int retval=0, i, i_trans, tmp;
FILE *fp;
fp=fopen("tb_data/inB.dat","r");
for (i=0; i<NUM_TRANS; i++){
fscanf(fp, "%d", &tmp);
b[i] = tmp;
}
fclose(fp);
//Store outputs
c[i_trans] = c_actual;
d[i_trans] = d_actual;
}
fp=fopen("tb_data/outD.golden.dat","r");
for (i=0; i<NUM_TRANS; i++){
fscanf(fp, "%d", &tmp);
d_expected[i] = tmp;
}
fclose(fp);
// Print Results
if(retval == 0){
printf("*** *** *** *** \n");
printf("Results are good \n");
printf("*** *** *** *** \n");
} else {
printf("*** *** *** *** \n");
printf("Mismatch: retval=%d \n", retval);
printf("*** *** *** *** \n");
}
• The top-level function for synthesis (hier_func) is executed for multiple transactions,
as defined by macro NUM_TRANS (specified in the header file Example 3-5). This
execution allows many different data values to be applied and verified. The test bench
is only as good as the variety of tests it performs.
• The function outputs are compared against known good values. The known good
values are read from a file in this example, but can also be computed as part of the test
bench.
• The return value of the main() function is set to zero if the results are correct and to a
non-zero value if they are incorrect.
RECOMMENDED: Because the system environment (for example, Linux, Windows, or Tcl) interprets the
return value of the main() function, Xilinx recommends that you constrain the return value to an 8-bit
range for portability and safety.
CAUTION! You are responsible for ensuring that the test bench checks the results. If the test bench does
not check the results but returns zero, Vivado HLS indicates that the simulation test passed even though
the results were not actually checked.
A test bench that exhibits these attributes quickly tests and validates any changes made to
the C functions before synthesis and is re-usable at RTL, allowing easier verification of the
RTL.
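The attributes above can be sketched as a small self-checking test bench. This is a minimal illustration, not the guide's own listing: the bodies of sumsub_func and shift_func, the stimulus values, and the helper name run_testbench are assumptions made for the example. The expected values are computed inside the test bench instead of being read from a file, and the return value is constrained to 0 or 1.

```c
#include <stdio.h>

#define NUM_TRANS 40

typedef int din_t;
typedef int dout_t;

/* Hypothetical sub-functions: these bodies are assumed for illustration. */
static void sumsub_func(din_t *in1, din_t *in2, dout_t *sum, dout_t *diff) {
    *sum = *in1 + *in2;
    *diff = *in1 - *in2;
}

static void shift_func(dout_t *in1, dout_t *in2, dout_t *outA, dout_t *outB) {
    *outA = *in1 >> 1;
    *outB = *in2 >> 2;
}

/* Top-level function for synthesis. */
void hier_func(din_t A, din_t B, dout_t *C, dout_t *D) {
    dout_t apb, amb;
    sumsub_func(&A, &B, &apb, &amb);
    shift_func(&apb, &amb, C, D);
}

/* Test bench body: main() would simply return this value. */
int run_testbench(void) {
    int errors = 0;
    for (int i = 0; i < NUM_TRANS; i++) {
        din_t a = 100 + i;               /* stimulus chosen so a - b >= 0 */
        din_t b = i;
        dout_t c, d;
        hier_func(a, b, &c, &d);
        /* Reference values computed independently of the design code. */
        dout_t c_expected = (a + b) / 2;
        dout_t d_expected = (a - b) / 4;
        if (c != c_expected || d != d_expected) {
            printf("Mismatch at transaction %d\n", i);
            errors++;
        }
    }
    /* Constrain the return value to a small range: 0 = pass, 1 = fail. */
    return errors ? 1 : 0;
}
```

In a real project, main() would call run_testbench() and return its result, so that a non-zero value reports a failed simulation.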
Files associated with the test bench are any files that are:
Examples of such files include the data files inA.dat and inB.dat in Example 3-6. You
must add these to the Vivado HLS project as test bench files.
The requirement for identifying test bench files in a Vivado HLS project does not require
the design and test bench to be in separate files (although separate files are
recommended).
The same design from Example 3-4 is repeated in Example 3-7. The only difference is that
the top-level function is renamed hier_func2, to differentiate the examples.
Using the same header file and test bench (other than the change from hier_func to
hier_func2), the only changes required in Vivado HLS to synthesize function
sumsub_func as the top-level function are:
Even though function sumsub_func is not explicitly instantiated inside the main()
function, the remainder of the functions (hier_func2 and shift_func) confirm that it is
operating correctly, and thus is part of the test bench.
#include "hier_func2.h"
sumsub_func(&A,&B,&apb,&amb);
shift_func(&apb,&amb,C,D);
}
IMPORTANT: If the test bench and design are in a single file, you must add the file to a Vivado HLS
project as both a design file and a test bench file.
#include <stdio.h>
#define NUM_TRANS 40
sumsub_func(&A,&B,&apb,&amb);
shift_func(&apb,&amb,C,D);
}
int main() {
// Data storage
int a[NUM_TRANS], b[NUM_TRANS];
int c_expected[NUM_TRANS], d_expected[NUM_TRANS];
int c[NUM_TRANS], d[NUM_TRANS];
// Misc
int retval=0, i, i_trans, tmp;
FILE *fp;
// Load input data from files
fp=fopen("tb_data/inA.dat","r");
for (i=0; i<NUM_TRANS; i++){
fscanf(fp, "%d", &tmp);
a[i] = tmp;
}
fclose(fp);
fp=fopen("tb_data/inB.dat","r");
for (i=0; i<NUM_TRANS; i++){
fscanf(fp, "%d", &tmp);
b[i] = tmp;
}
fclose(fp);
//Store outputs
c[i_trans] = c_actual;
d[i_trans] = d_actual;
}
fp=fopen("tb_data/outD.golden.dat","r");
for (i=0; i<NUM_TRANS; i++){
fscanf(fp, "%d", &tmp);
d_expected[i] = tmp;
}
fclose(fp);
// Print Results
if(retval == 0){
printf(" *** *** *** *** \n");
printf(" Results are good \n");
printf(" *** *** *** *** \n");
} else {
printf(" *** *** *** *** \n");
printf(" Mismatch: retval=%d \n", retval);
printf(" *** *** *** *** \n");
The following example shows this methodology. This OpenCL API C kernel code shows a
vector addition design where two arrays of data are summed into a third. The required size
of the work group is 16, that is, this kernel must execute a minimum of 16 times to produce a
valid result.
#include <clc.h>
The following C test bench example is used to verify the preceding example. The following
code is similar to any other C test bench except that it includes the API function
hls_run_kernel. Vivado HLS provides the following function signature to execute the
OpenCL API C kernel:
void hls_run_kernel(
const char *KernelName,
ScalarType0 *Arg0, int size0,
ScalarType1 *Arg1, int size1, …)
Where:
The number of arguments used in the API must match the number of arguments in the
OpenCL API C kernel. The example design vadd includes three arguments (a, b and c) that
read or write 16 data values. The following test bench verifies this function:
#define LENGTH 16
int main(int argc, char** argv)
{
int errors=0, i;
int a[LENGTH];
int b[LENGTH];
int hw_c[LENGTH];
int swref_c[LENGTH];
TIP: Vivado HLS provides OpenCL API C project examples. For an explanation of each design example,
see Table 1-5.
Functions
The top-level function becomes the top level of the RTL design after synthesis.
Sub-functions are synthesized into blocks in the RTL design.
After synthesis, each function in the design has its own synthesis report and RTL HDL file
(Verilog and VHDL).
Inlining functions
Sub-functions can optionally be inlined to merge their logic with the logic of the
surrounding function. While inlining functions can result in better optimizations, it can also
increase run time. More logic and more possibilities must be kept in memory and analyzed.
TIP: Vivado HLS may perform automatic inlining of small functions. To disable automatic inlining of a
small function, set the inline directive to off for that function.
If a function is inlined, there is no report or separate RTL file for that function. The logic and
loops are merged with the function above it in the hierarchy.
If the arguments to a function are sized accurately, Vivado HLS can propagate this
information through the design. There is no need to create arbitrary precision types for
every variable. In the following example, two integers are multiplied, but only the bottom
24 bits are used for the result.
#include "ap_cint.h"
tmp = (x * y);
return tmp;
}
When this code is synthesized, the result is a 32-bit multiplier with the output truncated to
24 bits.
If the inputs are correctly sized to 12-bit types (int12) as shown in the following code
example, the final RTL uses a 24-bit multiplier.
#include "ap_cint.h"
typedef int12 din_t;
typedef int24 dout_t;
tmp = (x * y);
return tmp;
}
Using arbitrary precision types for the two function inputs is enough to ensure Vivado HLS
creates a design using a 24-bit multiplier. The 12-bit types are propagated through the
design. Xilinx recommends that you correctly size the arguments of all functions in the
hierarchy.
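The reason 12-bit inputs need only a 24-bit result can be checked with a short calculation in standard C. This is a sketch under stated assumptions: the helper names fits_signed and mult12 are hypothetical, and plain integer types stand in for the arbitrary precision types, with the caller expected to keep operands inside the signed 12-bit range.

```c
#include <stdint.h>

/* Returns 1 if v is representable as a signed two's-complement value
   of the given width (1 <= bits <= 32), otherwise 0. */
int fits_signed(int64_t v, int bits) {
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Models the multiply with 12-bit signed operands held in plain ints.
   The caller is assumed to pass values in the range [-2048, 2047]. */
int32_t mult12(int32_t x, int32_t y) {
    return x * y;
}
```

The largest magnitude product, (-2048) x (-2048) = 4194304, fits in 24 signed bits but not in 23, which is consistent with a 24-bit multiplier being sufficient once the inputs are sized to 12 bits.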
In general, when variables are driven directly from the function interface, especially from
the top-level function interface, they can prevent some optimizations from taking place. A
typical case of this is when an input is used as the upper limit for a loop index.
Loops
Loops provide a very intuitive and concise way of capturing the behavior of an algorithm
and are used often in C code. Loops are very well supported by synthesis: loops can be
pipelined, unrolled, partially unrolled, merged and flattened.
The optimizations unroll, partially unroll, flatten and merge effectively make changes to the
loop structure, as if the code was changed. These optimizations ensure limited coding
changes are required when optimizing loops. Some optimizations can be applied only in
certain conditions. Some coding changes might be required.
RECOMMENDED: Avoid use of global variables for loop index variables, as this can inhibit some
optimizations.
#include "ap_cint.h"
#define N 32
dout_t out_accum=0;
dsel_t x;
return out_accum;
}
Attempting to optimize the design in Example 3-10 reveals the issues created by variable
loop bounds.
The first issue with variable loop bounds is that they prevent Vivado HLS from determining
the latency of the loop. Vivado HLS can determine the latency to complete one iteration of
the loop, but because it cannot statically determine the exact value of variable width, it does
not know how many iterations are performed and thus cannot report the loop latency (the
number of cycles to completely execute every iteration of the loop).
When variable loop bounds are present, Vivado HLS reports the latency as a question mark
(?) instead of using exact values. The following shows the result after synthesis of
Example 3-10.
Another issue with variable loop bounds is that the performance of the design is unknown.
There are two ways to overcome this issue:
• Use the Tripcount directive. The details on this approach are explained here.
• Use an assert macro in the C code. For more information, see C++ Classes and
Templates.
The Tripcount directive has no impact on the results of synthesis, only on reporting. The
user-provided values for the Tripcount directive are used only for reporting. The Tripcount
value allows Vivado HLS to report latency numbers in the report, allowing the reports from
different solutions to be compared. To have this same loop-bound information used for synthesis,
the C code must be updated. For more information, see C++ Classes and Templates.
The next steps in optimizing Example 3-10 for a lower initiation interval are:
If these optimizations are applied, the output from Vivado HLS highlights the most
significant issue with variable bound loops:
Because loops with variable bounds cannot be unrolled, they not only prevent the unroll
directive from being applied, they also prevent pipelining of the levels above the loop.
IMPORTANT: When a loop or function is pipelined, Vivado HLS unrolls all loops in the hierarchy below
the function or loop. If there is a loop with variable bounds in this hierarchy, it prevents pipelining.
The solution to loops with variable bounds is to make the number of loop iterations a fixed
value with conditional execution inside the loop. The code from Example 3-10 can be
rewritten as shown in the following code example. Here, the loop bounds are explicitly set
to the maximum value of variable width and the loop body is conditionally executed.
#include "ap_cint.h"
#define N 32
dout_t out_accum=0;
dsel_t x;
return out_accum;
}
The for-loop (LOOP_X) in Example 3-11 can be unrolled. Because the loop has fixed upper
bounds, Vivado HLS knows how much hardware to create. There are N (32) copies of the
loop body in the RTL design. Each copy of the loop body has conditional logic associated
with it and is executed depending on the value of variable width.
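The equivalence of the two coding styles can be sketched in standard C. This is a minimal illustration under stated assumptions: the function names accum_variable and accum_fixed are hypothetical, plain int replaces the arbitrary precision types, and width is assumed to lie between 0 and N.

```c
#define N 32

/* Variable-bound form: the trip count depends on width
   (0 <= width <= N assumed). */
int accum_variable(const int A[N], int width) {
    int out_accum = 0;
    for (int x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}

/* Fixed-bound form: always N iterations, with the loop body
   guarded by a condition on width. */
int accum_fixed(const int A[N], int width) {
    int out_accum = 0;
    for (int x = 0; x < N; x++) {
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}
```

Both functions return the same value for any valid width. Only the second has a trip count that is known at compile time.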
Loop Pipelining
When pipelining loops, the best balance between area and performance is
typically found by pipelining the innermost loop. This also results in the fastest run time.
The following code example demonstrates the trade-offs when pipelining loops and
functions.
#include "loop_pipeline.h"
int i,j;
static dout_t acc;
return acc;
}
If the innermost loop (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a single
multiplier), and Vivado HLS uses the outer loop (LOOP_I) to simply feed LOOP_J with new
data. Only 1 multiplier operation and 1 array access need to be scheduled, then the loop
iterations can be scheduled as a single loop-body entity (20x20 loop iterations).
TIP: When a loop or function is pipelined, any loop in the hierarchy below the loop or function being
pipelined must be unrolled.
If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400
array accesses must now be scheduled. It is very unlikely that Vivado HLS will produce a
design with 400 multipliers, because in most designs data dependencies prevent
maximal parallelism. In this case, for example, even if a dual-port RAM is used for A[N], the
design can only access two values of A[N] in any clock cycle.
When selecting the level of the hierarchy at which to pipeline, keep in mind that
pipelining the innermost loop gives the smallest hardware with generally
acceptable throughput for most applications. Pipelining the upper levels of the hierarchy
unrolls all sub-loops and can create many more operations to schedule (which could impact
run time and memory capacity), but typically gives the highest performance design in terms
of throughput and latency.
• Pipeline LOOP_J
Latency is approximately 400 cycles (20x20) and requires less than 100 LUTs and
registers (the I/O control and FSM are always present).
• Pipeline LOOP_I
Latency is approximately 20 cycles but requires a few hundred LUTs and registers: about
20 times the logic of the first option, minus any logic optimizations that can be made.
• Pipeline the top-level function
Latency is approximately 10 cycles (20 dual-port accesses) but requires thousands of LUTs and
registers (about 400 times the logic of the first option, minus any optimizations that can
be made).
Nested loops can only be flattened if the loops are perfect or semi-perfect.
• Perfect Loops
The following code example shows a case in which the loop nest is imperfect:
#include "loop_imperfect.h"
int i,j;
dint_t acc;
The assignment to acc and array B[N] inside LOOP_I, but outside LOOP_J, prevent the
loops from being flattened. If LOOP_J in Example 3-13 is pipelined, the synthesis report
shows the following:
• The pipeline depth shows it takes 2 clocks to execute one iteration of LOOP_J. This
varies with the device technology and clock period.
• A new iteration can begin each clock cycle. Pipeline II is 1. II is the Initiation Interval:
cycles between each new execution of the loop body.
• It takes 2 cycles for the first iteration to output a result. Due to pipelining each
subsequent iteration executes in parallel with the previous one and outputs a value
after 1 clock. The total latency of the loop is 2 plus 1 for each of the remaining 19
iterations: 21.
• LOOP_I requires 480 clock cycles to perform 20 iterations, thus each iteration of
LOOP_I is 24 clock cycles. This means there are 3 cycles of overhead to enter and exit
LOOP_J (24 - 21 = 3).
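The arithmetic in the list above can be written as two small helper functions. This is a sketch for checking the numbers only: the function names are hypothetical, and the formulas are a simplified model of the report values quoted above, not something the tool exposes.

```c
/* Cycles for a pipelined loop: the first iteration takes `depth` cycles,
   and each later iteration completes `ii` cycles after the previous one. */
int pipelined_loop_latency(int depth, int ii, int trips) {
    return depth + ii * (trips - 1);
}

/* Cycles for an outer loop that re-enters a pipelined inner loop on every
   iteration, paying `overhead` cycles each time to enter and exit it. */
int nested_loop_latency(int outer_trips, int inner_latency, int overhead) {
    return outer_trips * (inner_latency + overhead);
}
```

With a depth of 2, an II of 1, and 20 iterations, the inner loop takes 2 + 19 = 21 cycles. With 3 cycles of overhead, 20 outer iterations take 20 x 24 = 480 cycles.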
Imperfect loop nests, or the inability to flatten them, result in additional clock cycles
to enter and exit the loops. The code in Example 3-13 can be rewritten to make the nested
loops perfect and allow them to be flattened.
The following code example shows how conditionals can be added to loop LOOP_J to
provide the same functionality as Example 3-13 but allow the loops to be flattened.
#include "loop_perfect.h"
int i,j;
dint_t acc;
When the design contains nested loops, analyze the results to ensure as many nested loops
as possible have been flattened: review the log file or look in the synthesis report for cases,
as shown above, where the loop labels have been merged (LOOP_I and LOOP_J are now
reported as LOOP_I_LOOP_J).
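The transformation from an imperfect to a perfect loop nest can be sketched as follows. The loop bodies here are assumptions chosen for illustration (an accumulate followed by a divide), and the function names nest_imperfect and nest_perfect are hypothetical. The sketch only demonstrates that moving the statements inside the inner loop, guarded by conditions on the inner index, preserves the results.

```c
#define N 20

/* Imperfect nest: statements sit between the two loop headers. */
void nest_imperfect(const int A[N], int B[N]) {
    for (int i = 0; i < N; i++) {
        int acc = 0;                      /* outside the inner loop */
        for (int j = 0; j < N; j++) {
            acc += A[j] * j;
        }
        B[i] = acc / N;                   /* outside the inner loop */
    }
}

/* Perfect nest: every statement is inside the inner loop body. */
void nest_perfect(const int A[N], int B[N]) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (j == 0) acc = 0;             /* first inner iteration */
            acc += A[j] * j;
            if (j == N - 1) B[i] = acc / N;  /* last inner iteration */
        }
    }
}
```

The added conditionals cost a small amount of logic, but the outer loop no longer contains any work of its own, which is the property that flattening requires.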
Loop Parallelism
Vivado HLS schedules logic and functions as early as possible to reduce latency. To
do this, it schedules as many logic operations and functions as possible in parallel. It
does not schedule loops to execute in parallel.
If the following code example is synthesized, loop SUM_X is scheduled and then loop
SUM_Y is scheduled: even though loop SUM_Y does not need to wait for loop SUM_X to
complete before it can begin its operation, it is scheduled after SUM_X.
#include "loop_sequential.h"
dout_t X_accum=0;
dout_t Y_accum=0;
int i,j;
Because the loops have different bounds (xlimit and ylimit), they cannot be merged. By
placing the loops in separate functions, as shown in the following code example, the
identical functionality can be achieved and both loops (inside the functions), can be
scheduled in parallel.
#include "loop_functions.h"
dout_t X_accum=0;
dout_t Y_accum=0;
int i,j;
sub_func(A,X,xlimit);
sub_func(B,Y,ylimit);
}
If Example 3-16 is synthesized, the latency is half the latency of Example 3-15 because the
loops (as functions) can now execute in parallel.
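The structure described above can be sketched in standard C. The loop body (a running sum) and the array size are assumptions made for this example, and plain int replaces the guide's type names. Whether the two calls are actually scheduled in parallel is decided by the synthesis tool, not by the C code.

```c
#define N 8

/* One loop captured in its own function: running sum of the first
   `limit` inputs (0 <= limit <= N assumed). */
void sub_func(const int in[N], int out[N], int limit) {
    int accum = 0;
    for (int i = 0; i < limit; i++) {
        accum += in[i];
        out[i] = accum;
    }
}

/* The two calls share no data, so nothing forces one to wait
   for the other to finish. */
void loop_functions(const int A[N], const int B[N], int X[N], int Y[N],
                    int xlimit, int ylimit) {
    sub_func(A, X, xlimit);
    sub_func(B, Y, ylimit);
}
```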
The dataflow optimization could also be used in Example 3-15. The principle of capturing
loops in functions to exploit parallelism is presented here for cases in which dataflow
optimization cannot be used. For example, in a larger example, dataflow optimization is
applied to all loops and functions at the top-level and memories placed between every
top-level loop and function.
Loop Dependencies
Loop dependencies are data dependencies that prevent optimization of loops, typically
pipelining. They can be within a single iteration of a loop or between different iterations
of a loop.
The easiest way to understand loop dependencies is to examine an extreme example. In the
following example, the result of the loop is used as the loop continuation or exit condition.
Each iteration of the loop must finish before the next can start.
Minim_Loop: while (a != b) {
if (a > b)
a -= b;
else
b -= a;
}
This loop cannot be pipelined. The next iteration of the loop cannot begin until the previous
iteration ends.
Not all loop dependencies are as extreme as this, but this example highlights the issue:
some operations cannot begin until some other operation has completed. The solution is to
try to ensure the initial operation is performed as early as possible.
Loop dependencies can occur with any and all types of data. They are particularly common
when using arrays, which are discussed in Arrays.
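For reference, the extreme loop shown earlier can be wrapped in a complete function. This is a minimal sketch: the function name gcd_by_subtraction is hypothetical, and both inputs are assumed to be positive so that the loop terminates.

```c
/* Subtraction-based greatest common divisor; both inputs are assumed
   to be positive. The exit test (a != b) reads values written by the
   loop body, so an iteration cannot start until the previous one
   has finished. */
unsigned gcd_by_subtraction(unsigned a, unsigned b) {
    while (a != b) {
        if (a > b)
            a -= b;
        else
            b -= a;
    }
    return a;
}
```

The number of iterations depends entirely on the data, which is another reason this form of loop gives the scheduler nothing to overlap.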
template <typename T0, typename T1, typename T2, typename T3, int N>
class foo_class {
private:
pe_mac<T0, T1, T2> mac;
public:
T0 areg;
T0 breg;
T2 mreg;
T1 preg;
T0 shift[N];
int k; // Class Member
T0 shift_output;
void exec(T1 *pcout, T0 *dataOut, T1 pcin, T3 coeff, T0 data, int col)
{
Function_label0:;
#pragma HLS inline off
SRL:for (k = N-1; k >= 0; --k) {
#pragma HLS unroll// Loop will fail UNROLL
if (k > 0)
shift[k] = shift[k-1];
else
shift[k] = data;
}
*dataOut = shift_output;
shift_output = shift[N-1];
}
For Vivado HLS to be able to unroll the loop as specified by the UNROLL pragma directive,
the code should be rewritten to remove “k” as a class member.
template <typename T0, typename T1, typename T2, typename T3, int N>
class foo_class {
private:
pe_mac<T0, T1, T2> mac;
public:
T0 areg;
T0 breg;
T2 mreg;
T1 preg;
T0 shift[N];
T0 shift_output;
void exec(T1 *pcout, T0 *dataOut, T1 pcin, T3 coeff, T0 data, int col)
{
Function_label0:;
int k; // Local variable
#pragma HLS inline off
SRL:for (k = N-1; k >= 0; --k) {
#pragma HLS unroll// Loop will unroll
if (k > 0)
shift[k] = shift[k-1];
else
shift[k] = data;
}
*dataOut = shift_output;
shift_output = shift[N-1];
}
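The same coding principle can be sketched in plain C, where a struct holds the persistent state and the loop index stays local to the function. This is an analogy to the class example above, not a translation of it: the type name srl_state, the function name srl_exec, and the depth of 4 are assumptions made for illustration.

```c
#define N 4

typedef struct {
    int shift[N];      /* shift-register storage (persistent state) */
    int shift_output;  /* registered output (persistent state) */
} srl_state;

/* One shift step. The loop index k is a local variable, so it is not
   part of the persistent state. Returns the previously registered output. */
int srl_exec(srl_state *s, int data) {
    int out = s->shift_output;
    for (int k = N - 1; k >= 0; --k) {
        if (k > 0)
            s->shift[k] = s->shift[k - 1];
        else
            s->shift[k] = data;
    }
    s->shift_output = s->shift[N - 1];
    return out;
}
```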
Arrays
Before discussing how the coding style can impact the implementation of arrays after
synthesis it is worthwhile discussing a situation where arrays can introduce issues even
before synthesis is performed, for example, during C simulation.
If you specify a very large array, it might cause C simulation to run out of memory and fail,
as shown in the following example:
#include "ap_cint.h"
int i, acc;
// Use an arbitrary precision type
int32 la0[10000000], la1[10000000];
The simulation might fail by running out of memory, because the array is placed on the
stack that exists in memory rather than the heap that is managed by the OS and can use
local disk space to grow.
The likelihood of running out of memory during simulation increases in the following cases:
• PCs often have less available memory than large Linux machines.
• Using arbitrary precision types, as shown above, could make this issue worse as they
require more memory than standard C types.
• Using the more complex fixed-point arbitrary precision types found in C++ and
SystemC might make the issue even more likely as they require even more memory.
A solution is to use dynamic memory allocation for simulation but a fixed sized array for
synthesis, as shown in the next example. This means that the memory required for this is
allocated on the heap, managed by the OS, and which can use local disk space to grow.
A change such as this to the code is not ideal, because the code simulated and the code
synthesized are now different, but this might sometimes be the only way to move the
design process forward. If this is done, be sure that the C test bench covers all aspects of
accessing the array. The RTL simulation performed by cosim_design will verify that the
memory accesses are correct.
#include "ap_cint.h"
#include <stdlib.h> // Provides malloc for the simulation-only branch
int i, acc;
#ifdef __SYNTHESIS__
// Use an arbitrary precision type & array for synthesis
int32 la0[10000000], la1[10000000];
#else
// Use an arbitrary precision type & dynamic memory for simulation
int32 *la0 = malloc(10000000 * sizeof(int32));
int32 *la1 = malloc(10000000 * sizeof(int32));
#endif
for (i=0 ; i < 10000000; i++) {
acc = acc + la0[i] + la1[i];
}
Note: Only use the __SYNTHESIS__ macro in the code to be synthesized. Do not use this macro in the
test bench, because it is not obeyed by C simulation or C RTL co-simulation.
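A complete version of this pattern might look like the following sketch. It is written in standard C so that it can be compiled outside the tool: plain int replaces int32, the array size SIZE is a hypothetical value smaller than the one in the text, and the function name sum_large_arrays, the fill values, and the allocation check are additions made for illustration.

```c
#include <stdlib.h>

#define SIZE 1000000  /* hypothetical size, smaller than in the text */

/* Sums two large arrays. Returns -1 if the simulation memory
   cannot be allocated. */
long sum_large_arrays(void) {
    long acc = 0;
#ifdef __SYNTHESIS__
    /* Fixed-size arrays: this is what the synthesis tool sees. */
    int la0[SIZE], la1[SIZE];
#else
    /* Heap allocation: avoids overflowing the stack in C simulation. */
    int *la0 = malloc(SIZE * sizeof *la0);
    int *la1 = malloc(SIZE * sizeof *la1);
    if (la0 == NULL || la1 == NULL) {
        free(la0);
        free(la1);
        return -1;
    }
#endif
    for (int i = 0; i < SIZE; i++) {
        la0[i] = 1;
        la1[i] = 2;
    }
    for (int i = 0; i < SIZE; i++) {
        acc += la0[i] + la1[i];
    }
#ifndef __SYNTHESIS__
    /* Release the simulation-only allocations. */
    free(la0);
    free(la1);
#endif
    return acc;
}
```

Freeing the memory inside the same guard keeps the simulation-only code paired, so that the synthesized code never sees malloc or free.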
Arrays are typically implemented as a memory (RAM, ROM or FIFO) after synthesis. As
discussed in Arrays on the Interface, arrays on the top-level function interface are
synthesized as RTL ports that access a memory outside the design. Arrays internal to the
design are synthesized to internal block RAM, LUTRAM or registers, depending on the
optimization settings.
Like loops, arrays are an intuitive coding construct and so they are often found in C
programs. Also like loops, Vivado HLS includes optimizations and directives that can be
applied to optimize their implementation in RTL without any need to modify the code.
Vivado HLS supports arrays of pointers. See Pointers. Each pointer can point only to a scalar
or an array of scalars.
Note: Arrays must be sized. Sized arrays are supported (for example, Array[10];), but
unsized arrays are not supported (for example, Array[];).
#include "array_mem_bottleneck.h"
dout_t sum=0;
int i;
SUM_LOOP:for(i=2;i<N;++i)
sum += mem[i] + mem[i-1] + mem[i-2];
return sum;
}
Trying to pipeline SUM_LOOP with an initiation interval of 1 results in the following message
(after failing to achieve a throughput of 1, Vivado HLS relaxes the constraint):
The issue here is that the single-port RAM has only a single data port: only 1 read (and 1
write) can be performed in each clock cycle.
A dual-port RAM could be used, but this allows only two accesses per clock cycle. Three
reads are required to calculate the value of sum, and so three accesses per clock cycle are
required to pipeline the loop with a new iteration every clock cycle.
CAUTION! Arrays implemented as memory or memory ports can often become bottlenecks to
performance.
The code in Example 3-17 can be rewritten as shown in the following code example to allow
the code to be pipelined with a throughput of 1. In the following code example, by
performing pre-reads and manually pipelining the data accesses, there is only one array
read specified in each iteration of the loop. This ensures that only a single-port RAM is
required to achieve the performance.
#include "array_mem_perform.h"
tmp0 = mem[0];
tmp1 = mem[1];
SUM_LOOP:for (i = 2; i < N; i++) {
tmp2 = mem[i];
sum += tmp2 + tmp1 + tmp0;
tmp0 = tmp1;
tmp1 = tmp2;
}
return sum;
}
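The two coding styles can be compared directly in standard C. In this sketch the function names sum_three_reads and sum_one_read, the array size, and the use of plain int are assumptions made for illustration. The loop bodies follow the fragments shown above.

```c
#define N 16

/* Three array reads per loop iteration. */
int sum_three_reads(const int mem[N]) {
    int sum = 0;
    for (int i = 2; i < N; ++i)
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    return sum;
}

/* One array read per loop iteration: the two earlier values are
   carried forward in tmp0 and tmp1 instead of being read again. */
int sum_one_read(const int mem[N]) {
    int sum = 0;
    int tmp0 = mem[0];
    int tmp1 = mem[1];
    for (int i = 2; i < N; i++) {
        int tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}
```

Both functions compute the same sum. The second trades two extra scalar variables for a loop body that needs only one memory port.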
Vivado HLS includes optimization directives for changing how arrays are implemented and
accessed. It is typically the case that directives can be used, and changes to the code are not
required. Arrays can be partitioned into blocks or into their individual elements. In some
cases, Vivado HLS partitions arrays into individual elements. This is controllable using the
configuration settings for auto-partitioning.
When an array is partitioned into multiple blocks, the single array is implemented as
multiple RTL RAM blocks. When partitioned into elements, each element is implemented as
a register in the RTL. In both cases, partitioning allows more elements to be accessed in
parallel and can help with performance; the design trade-off is between performance and
the number of RAMs or registers required to achieve it.
FIFO Accesses
A special case of array access is when arrays are implemented as FIFOs. This is often the
case when dataflow optimization is used.
Accesses to a FIFO must be in sequential order starting from location zero. In addition, if an
array is read in multiple locations, the code must strictly enforce the order of the FIFO
accesses. It is often the case that arrays with multiple fanout cannot be implemented as
FIFOs without additional code to enforce the order of the accesses.
• Memory is off-chip
• Specify the interface as a RAM or FIFO interface using the INTERFACE directive.
• Specify the RAM as a single or dual-port RAM using the RESOURCE directive.
• Specify the RAM latency using the RESOURCE directive.
• Use array optimization directives (Array_Partition, Array_Map, or
Array_Reshape) to reconfigure the structure of the array and therefore, the number
of I/O ports.
TIP: Because access to the data is limited through a memory (RAM or FIFO) port, arrays on the
interface can create a performance bottleneck. Typically, you can overcome these bottlenecks using
directives.
Arrays must be sized when using arrays in synthesizable code. If, for example, the
declaration d_i[4] in Example 3-19 is changed to d_i[], Vivado HLS issues a message
that the design cannot be synthesized.
Array Interfaces
The resource directive can explicitly specify which type of RAM is used, and therefore which
RAM ports are created (single-port or dual-port). If no resource is specified, Vivado HLS
uses:
The partition, map, and reshape directives can re-configure arrays on the interface.
Arrays can be partitioned into multiple smaller arrays, each implemented with its own
interface. This includes the ability to partition every element of the array into its own scalar
element. On the function interface, this results in a unique port for every element in the
array. This provides maximum parallel access, but creates many more ports and might
introduce routing issues in the hierarchy above.
Similarly, smaller arrays can be combined into a single larger array, resulting in a single
interface. While this might map better to an off-chip block RAM, it might also introduce a
performance bottleneck. These trade-offs can be made using Vivado HLS optimization
directives and do not impact coding.
By default, the array arguments in the function shown in the following code example are
synthesized into a single-port RAM interface.
#include "array_RAM.h"
A single-port RAM interface is used because the for-loop ensures that only one element
can be read and written in each clock cycle. There is no advantage in using a dual-port RAM
interface.
If the for-loop is unrolled, Vivado HLS uses a dual-port RAM interface. Doing so allows
multiple elements to be read at the same time and improves the initiation interval. The type
of RAM interface can be explicitly set by applying the resource directive.
Issues related to arrays on the interface are typically related to throughput. They can be
handled with optimization directives. For example, if the arrays in Example 3-19 are
partitioned into individual elements and the for-loop unrolled, all four elements in each
array are accessed simultaneously.
You can also use the RESOURCE directive to specify the latency of the RAM. This allows
Vivado HLS to model external SRAMs with a latency of greater than 1 at the interface.
FIFO Interfaces
Vivado HLS allows array arguments to be implemented as FIFO ports in the RTL. If a FIFO
port is to be used, be sure that the accesses to and from the array are sequential.
Note: If the accesses are in fact not sequential, there is an RTL simulation mismatch.
The following code example shows a case in which Vivado HLS cannot determine whether
the accesses are sequential. In this example, both d_i and d_o are specified to be
implemented with a FIFO interface during synthesis.
#include "array_FIFO.h"
In this case, the behavior of variable idx determines whether or not a FIFO interface can be
successfully created.
Because this interface might not work, Vivado HLS issues a message during synthesis and
creates a FIFO interface.
If the “//Breaks FIFO interface” comment in Example 3-20 is removed, Vivado HLS
can determine that the accesses to the arrays are not sequential, and it halts with an error
message if a FIFO interface is specified.
Note: FIFO ports cannot be synthesized for arrays that are read from and written to. Separate input
and output arrays (as in Example 3-20) must be created.
The following general rules apply to arrays that are to be streamed (implemented with a
FIFO interface):
• The array must be written and read in only one loop or function. This can be
transformed into a point-to-point connection that matches the characteristics of FIFO
links.
• The array reads must be in the same order as the array writes. Because random access is
not supported for FIFO channels, the array must be used in the program following first
in, first out semantics.
• The index used to read and write from the FIFO must be analyzable at compile time.
Array addressing based on run time computations cannot be analyzed for FIFO
semantics and prevents the tool from converting an array into a FIFO.
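The ordering rule above can be expressed as a small check on a recorded sequence of array indices. This helper is an illustration for use in a C test bench, not something provided by the tool: the name is_fifo_order and the idea of recording an index trace are assumptions.

```c
/* Returns 1 if the recorded index trace visits 0, 1, 2, ..., n-1
   exactly once each and in that order, which is the only access
   pattern a FIFO can serve. Returns 0 otherwise. */
int is_fifo_order(const int *idx, int n) {
    for (int k = 0; k < n; k++) {
        if (idx[k] != k)
            return 0;
    }
    return 1;
}
```

A trace such as 0, 1, 2, 3 passes. A trace such as 0, 2, 1, 3 or one that starts at 1 fails, because it would require random access.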
Code changes are generally not required to implement or optimize arrays in the top-level
interface. The only time arrays on the interface may need coding changes is when the array
is part of a struct.
Array Initialization
RECOMMENDED: As discussed in Type Qualifiers, although not a requirement, Xilinx recommends
specifying arrays that are to be implemented as memories with the static qualifier. This not only
ensures that Vivado HLS implements the array with a memory in the RTL, it also allows the
initialization behavior of static types to be used.
In the following code, an array is initialized with a set of values. Each time the function is
executed, array coeff is assigned these values. After synthesis, each time the design
executes, the RAM that implements coeff is loaded with these values. For a single-port
RAM this would take 8 clock cycles. For an array of 1024, it would, of course, take 1024 clock
cycles, during which time no operations depending on coeff could occur.
The following code uses the static qualifier to define array coeff. The array is initialized
with the specified values at start of execution. Each time the function is executed, array
coeff remembers its values from the previous execution. A static array behaves in C code
as a memory does in RTL.
static int coeff[8] = {-2, 8, -4, 10, 14, 10, -4, 8};
In addition, if the variable has the static qualifier, Vivado HLS initializes the variable in the
RTL design and in the FPGA bitstream. This removes the need for multiple clock cycles to
initialize the memory and ensures that initializing large memories is not an operational
overhead.
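A function that uses such a table might be sketched as follows. The function name weighted_sum, the TAPS macro, and the use of const are assumptions made for this example, and the coefficient values are illustrative only.

```c
#define TAPS 8

/* Dot product of TAPS samples with a fixed coefficient table. */
int weighted_sum(const int x[TAPS]) {
    /* static: initialized once, not reloaded on every call.
       const: the table is only ever read. */
    static const int coeff[TAPS] = {-2, 8, -4, 10, 14, 10, -4, 8};
    int acc = 0;
    for (int i = 0; i < TAPS; i++) {
        acc += coeff[i] * x[i];
    }
    return acc;
}
```

Because the table is never written after initialization, no cycles need to be spent loading it each time the function executes.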
The RTL configuration command can specify if static variables return to their initial state
after a reset is applied (not the default). If a memory is to be returned to its initial state after
a reset operation, this incurs an operational overhead and requires multiple cycles to reset
the values. Each value must be written into each memory address.
Implementing ROMs
Vivado HLS does not require that an array be specified with the static qualifier to
synthesize a memory or the const qualifier to infer that the memory should be a ROM.
Vivado HLS analyzes the design and attempts to create the optimal hardware.
Xilinx highly recommends using the static qualifier for arrays that are intended to be
memories. As noted in Array Initialization, a static type behaves in an almost identical
manner as a memory in RTL.
The const qualifier is also recommended when arrays are only read, because Vivado HLS
cannot always infer that a ROM should be used by analysis of the design. The general rule
for the automatic inference of a ROM is that a local, static (non-global) array is written to
before being read. The following practices in the code can help infer a ROM:
• Initialize the array as early as possible in the function that uses it.
• Group writes together.
• Do not interleave array (ROM) initialization writes with non-initialization code.
• Do not store different values to the same array element (group all writes together in
the code).
• Element value computation must not depend on any non-constant (at compile-time)
design variables, other than the initialization loop counter variable.
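The practices listed above can be sketched with a table whose contents depend only on the loop counter. This example is an assumption-laden illustration: the names init_square_table and square_lookup are hypothetical, a simple integer formula stands in for a more complex initialization, and whether a ROM is actually inferred depends on the tool's analysis.

```c
#define TABLE_SIZE 256

/* All writes are grouped in one loop, each element is written once,
   and each value depends only on the loop counter. */
static void init_square_table(int table[TABLE_SIZE]) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        table[i] = i * i;
    }
}

/* The table is fully written before it is read
   (0 <= idx < TABLE_SIZE assumed). */
int square_lookup(int idx) {
    int table[TABLE_SIZE];
    init_square_table(table);
    return table[idx];
}
```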
If complex assignments are used to initialize a ROM (for example, functions from the
math.h library), placing the array initialization into a separate function allows a ROM to be
inferred. In the following example, array sin_table[256] is inferred as a memory and
implemented as a ROM after RTL synthesis.
#include "array_ROM_math_init.h"
#include <math.h>
TIP: Because the sin() function results in constant values, no core is required in the RTL
design to implement the sin() function. The sin() function is not one of the cores listed in Table 3-2
and is not supported for synthesis in C. See SystemC Synthesis for using math.h functions in C++.
Data Types
The data types used in a C function compiled into an executable impact the accuracy of the
result and the memory requirements, and can impact the performance.
• A 32-bit integer int data type can hold more data and therefore provide more precision
than an 8-bit char type, but it requires more storage.
• If 64-bit long long types are used on a 32-bit system, the run time is impacted
because it typically requires multiple accesses to read and write those values.
Vivado HLS supports the synthesis of all standard C types, including exact-width integer
types.
Exact-width integer types are useful for ensuring designs are portable across all types of
systems.
Xilinx highly recommends defining the data types for all variables in a common header file,
which can be included in all source files.
• During the course of a typical Vivado HLS project, some of the data types might be
refined, for example to reduce their size and allow a more efficient hardware
implementation.
• One of the benefits of working at a higher level of abstraction is the ability to quickly
create new design implementations. The same files typically are used in later projects
but might use different (smaller or larger or more accurate) data types.
Both of these tasks are more easily achieved when the data types can be changed in a single
location: the alternative is to edit multiple files.
TIP: When using macros in header files, always use unique names. For example, if a macro named
_TYPES_H is defined in your header file, it is likely that such a common name might be defined in other
system files, and it might enable or disable some other code, causing unforeseen side-effects.
Standard Types
The following code example shows some basic arithmetic operations being performed.
#include "types_standard.h"
The data types in Example 3-22 are defined in the header file types_standard.h shown
in the following code example. They show how the following types can be used:
#define N 9
These different types result in the following operator and port sizes after synthesis:
• The multiplier used to calculate result out1 is a 24-bit multiplier. An 8-bit char type
multiplied by a 16-bit short type requires a 24-bit multiplier. The result is
sign-extended to 32-bit to match the output port width.
• The adder used for out2 is 8-bit. Because the output is an 8-bit unsigned char type,
only the bottom 8-bits of inB (a 16-bit short) are added to 8-bit char type inA.
• For output out3 (32-bit exact width type), 8-bit char type inA is sign-extended to
32-bit value and a 32-bit division operation is performed with the 32-bit (int type)
inC input.
• A 64-bit modulus operation is performed using the 64-bit long long type inD and
8-bit char type inA sign-extended to 64-bit, to create a 64-bit output result out4.
As the result of out1 indicates, Vivado HLS uses the smallest operator it can and extends
the result to match the required output bit-width. For result out2, even though one of the
inputs is 16-bit, an 8-bit adder can be used because only an 8-bit output is required. As the
results for out3 and out4 show, if all bits are required, a full sized operator is synthesized.
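The sizing rules above can be related to source code with a sketch such as the following. The typedefs follow the widths named in the bullets, but the function body and the operand order of the division and modulus are assumptions, because the original listing is not reproduced here.

```c
#include <stdint.h>

typedef char din_A;           /* 8-bit */
typedef short din_B;          /* 16-bit */
typedef int din_C;            /* 32-bit */
typedef long long din_D;      /* 64-bit */
typedef int dout_1;           /* 32-bit */
typedef unsigned char dout_2; /* 8-bit unsigned */
typedef int32_t dout_3;       /* 32-bit exact width */
typedef int64_t dout_4;       /* 64-bit exact width */

void types_standard(din_A inA, din_B inB, din_C inC, din_D inD,
                    dout_1 *out1, dout_2 *out2, dout_3 *out3, dout_4 *out4) {
    *out1 = inA * inB;  /* 8-bit x 16-bit product, extended to 32 bits */
    *out2 = inB + inA;  /* only the low 8 bits of the sum are kept */
    *out3 = inC / inA;  /* 32-bit division (operand order assumed) */
    *out4 = inD % inA;  /* 64-bit modulus (operand order assumed) */
}
```

For example, with inA = 3 and inB = 300 the sum is 303, but out2 holds only the low 8 bits, which is 47.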
• Single-precision 32 bit
° 24-bit fraction
° 8-bit exponent
• Double-precision 64 bit
° 53-bit fraction
° 11-bit exponent
RECOMMENDED: When using floating-point data types, Xilinx highly recommends that you review
Floating-Point Design with Vivado HLS (XAPP599) [Ref 4].
In addition to using floats and doubles for standard arithmetic operations (such as +, -, *),
floats and doubles are commonly used with the math.h functions (and cmath for C++). This
section discusses support for standard operators. For more information on synthesizing the
C and C++ math libraries, see HLS Math Library in Chapter 2.
The following code example shows the header file used with Example 3-22 updated to
define the data types to be double and float types.
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#define N 9
This updated header file is used with the following code example where a sqrtf()
function is used.
#include "types_float_double.h"
void types_float_double(
din_A inA,
din_B inB,
din_C inC,
din_D inD,
dout_1 *out1,
dout_2 *out2,
dout_3 *out3,
dout_4 *out4
) {
If the double-precision square-root function sqrt() was used, it would result in additional
logic to cast to and from the 32-bit single-precision float types used for inD and out4:
sqrt() is a double-precision (double) function, while sqrtf() is a single precision
(float) function.
In C functions, be careful when mixing float and double types as float-to-double and
double-to-float conversion units are inferred in the hardware.
This code:
wire(foo_t)
Float-to-Double Converter unit
Double-Precision Square Root unit
Double-to-Float Converter unit
wire (var_f)
The implication of the cores shown in the following table is that if the target technology
does not support a particular LogiCORE IP, the design cannot be synthesized, and Vivado HLS
halts with an error message.
The cores in the preceding table allow the operation, in some cases, to be implemented
with a core in which many DSP48s are used or none (for example, DMul_nodsp and
DMul_maxdsp ). By default, Vivado HLS implements the operation using the core with the
maximum number of DSP48s. Alternatively, the Vivado HLS resource directive can specify
exactly which core to use.
When synthesizing float and double types, Vivado HLS maintains the order of operations
performed in the C code to ensure that the results are the same as the C simulation. Due to
saturation and truncation, the following are not guaranteed to be the same in single and
double precision operations:
A=B*C;    A=B*F;
D=E*F;    D=E*C;
O1=A*D;   O2=A*D;
With float and double types, O1 and O2 are not guaranteed to be the same.
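A small C sketch makes the difference concrete. The function names and the input values are chosen for illustration only. The volatile qualifier on the intermediates is an assumption used to force each product to be rounded to float on the host compiler.

```c
/* O1 = (B*C) * (E*F) */
float product_o1(float B, float C, float E, float F) {
    volatile float A = B * C;  /* rounded to float before reuse */
    volatile float D = E * F;
    return A * D;
}

/* O2 = (B*F) * (E*C): the same four factors, paired differently */
float product_o2(float B, float C, float E, float F) {
    volatile float A = B * F;
    volatile float D = E * C;
    return A * D;
}
```

With B = C = 1e30f and E = F = 1e-30f, the first pairing overflows in B*C and underflows in E*F, so O1 is not a number, while the second pairing keeps both intermediates near 1 and O2 is approximately 1.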
TIP: In some cases (design dependent), optimizations such as unrolling or partial unrolling of loops
might not be able to take full advantage of parallel computations, because Vivado HLS maintains the strict
order of the operations when synthesizing float and double types.
For C++ designs, Vivado HLS provides a bit-approximate implementation of the most
commonly used math functions.
• struct
• enum
• union
Structs
When structs are used as arguments to the top-level function, the ports created by
synthesis are a direct reflection of the struct members. Scalar members are implemented as
standard scalar ports and arrays are implemented, by default, as memory ports.
In this design example, struct data_t is defined in the header file shown in the
following code example. This struct has two data members:
In the following code example, the struct is used as both a pass-by-value argument (from
i_val to the return of o_val) and as a pointer (*i_pt to *o_pt).
#include "struct_port.h"
data_t struct_port(
data_t i_val,
data_t *i_pt,
data_t *o_pt
) {
data_t o_val;
int i;
return o_val;
}
All function arguments and the function return are synthesized into ports as follows:
There are no limitations in the size or complexity of structs that can be synthesized by
Vivado HLS. There can be as many array dimensions and as many members in a struct as
required. The only limitation with the implementation of structs occurs when arrays are to
be implemented as streaming (such as a FIFO interface). In this case, follow the same
general rules that apply to arrays on the interface (FIFO Interfaces).
The elements of a struct can be packed into a single vector by the data packing
optimization. For more information on performing this optimization, see the
set_directive_data_pack command. Additionally, unused elements of a struct can be removed
from the interface by the -trim_dangling_ports option of the config_interface
command.
Enumerated Types
The header file in the following code example defines some enum types and uses them in
a struct. The struct is used in turn in another struct. This allows an intuitive
description of a complex type to be captured.
The following code example shows how a complex define (MAD_NSBSAMPLES) statement
can be specified and synthesized.
#include <stdio.h>
enum mad_layer {
MAD_LAYER_I = 1,
MAD_LAYER_II = 2,
MAD_LAYER_III = 3
};
enum mad_mode {
MAD_MODE_SINGLE_CHANNEL = 0,
MAD_MODE_DUAL_CHANNEL = 1,
MAD_MODE_JOINT_STEREO = 2,
MAD_MODE_STEREO = 3
};
enum mad_emphasis {
MAD_EMPHASIS_NONE = 0,
MAD_EMPHASIS_50_15_US = 1,
MAD_EMPHASIS_CCITT_J_17 = 3
};
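typedef struct {
enum mad_layer layer; /* member referenced by MAD_NSBSAMPLES; other members not shown */
int flags;
int private_bits;
} header_t;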
# define MAD_NSBSAMPLES(header) \
((header)->layer == MAD_LAYER_I ? 12 : \
(((header)->layer == MAD_LAYER_III && \
((header)->flags & 17)) ? 18 : 36))
The struct and enum types defined in Example 3-28 are used in the following code
example. If the enum is used in an argument to the top-level function, it is synthesized as a
32-bit value to comply with the standard C compilation behavior. If the enum types are
internal to the design, Vivado HLS optimizes them down to only the required number of
bits.
The following code example shows how printf statements are ignored during synthesis.
#include "types_composite.h"
ns = MAD_NSBSAMPLES(&frame->header);
printf("Samples from header %d \n", ns);
Unions
In the following code example, a union is created with a double and a struct. Unlike C
compilation, synthesis does not guarantee using the same memory (in the case of synthesis,
registers) for all fields in the union. Vivado HLS performs the optimization that provides
optimal hardware.
#include "types_union.h"
The synthesis of unions does not support casting between native C types and user-defined
types. The following union contains the native type long long and a user-defined struct.
This union cannot be synthesized because it would require casting from the native type to
a user-defined type.
typedef union {
long long raw[6];
struct {
int b;
int c;
int a[10];
};
} data_t;
Type Qualifiers
The type qualifiers can directly impact the hardware created by high-level synthesis. In
general, the qualifiers influence the synthesis results in a predictable manner, as discussed
below. Vivado HLS is limited only by the interpretation of the qualifier as it affects
functional behavior and can perform optimizations to create a more optimal hardware
design. Examples of this are shown after an overview of each qualifier.
Volatile
The volatile qualifier impacts how many reads or writes are performed in the RTL when
pointers are accessed multiple times on function interfaces. Although the volatile
qualifier impacts this behavior in all functions in the hierarchy, the impact of the volatile
qualifier is primarily discussed in the section on top-level interfaces. See Understanding
Volatile Data.
Arbitrary precision types do not support the volatile qualifier for arithmetic operations. Any
arbitrary precision data types using the volatile qualifier must be assigned to a non-volatile
data type before being used in an arithmetic expression.
Statics
Static types in a function hold their value between function calls. The equivalent behavior in
a hardware design is a registered variable (a flip-flop or memory). If a variable is required to
be a static type for the C function to execute correctly, it will certainly be a register in the
final RTL design. The value must be maintained across invocations of the function and
design.
It is not true that only static types result in a register after synthesis. Vivado HLS
determines which variables are required to be implemented as registers in the RTL design.
For example, if a variable assignment must be held over multiple cycles, Vivado HLS creates
a register to hold the value, even if the original variable in the C function was not a static
type.
Vivado HLS obeys the initialization behavior of statics and assigns zero (or the
explicitly initialized value) to the register during initialization. This means that the static
variable is initialized in the RTL code and in the FPGA bitstream. It does not mean that the
variable is re-initialized each time the reset signal is asserted.
See the RTL configuration (config_rtl command) to determine how static initialization
values are implemented with regard to the system reset.
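As a minimal sketch of a variable that must be static for the C function to behave correctly, consider a running total. The names data_t and running_sum are hypothetical.

```c
typedef int data_t;  /* assumed data type */

/* The total must survive between calls, so it is declared static.
   In hardware this corresponds to a register that holds its value
   across invocations. */
data_t running_sum(data_t in) {
    static data_t total = 0;  /* initialized once, not on every call */
    total += in;
    return total;
}
```

Calling the function with 1, 2, and 3 in turn returns 1, 3, and 6, which is only possible because total keeps its value between calls.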
Const
A const type specifies that the value of the variable is never updated. The variable is read
but never written to and therefore must be initialized. For most const variables, this
typically means that they are reduced to constants in the RTL design (Vivado HLS performs
constant propagation and removes any unnecessary hardware).
In the case of arrays, the const variable is implemented as a ROM in the final RTL design
(in the absence of any auto-partitioning performed by Vivado HLS on small arrays). Arrays
specified with the const qualifier are (like statics) initialized in the RTL and in the FPGA
bitstream. There is no need to reset them, because they are never written to.
The following code example shows a case in which Vivado HLS implements a ROM even
though the array is not specified with a static or const qualifier. This highlights how
Vivado HLS analyzes the design and determines the most optimal implementation. The
qualifiers, or lack of them, influence but do not dictate the final RTL.
#include "array_ROM.h"
In the case of Example 3-31, Vivado HLS is able to determine that the implementation is
best served by having the variable lookup_table as a memory element in the final RTL.
For more information on how this is achieved for arrays, see Implementing ROMs.
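A sketch of code with this property is shown below. The array name lookup_table comes from the text. The typedefs, the function signature, and the initialization formula are assumptions for illustration.

```c
typedef int din1_t;           /* assumed data type */
typedef unsigned char din2_t; /* assumed 8-bit index type */
typedef int dout_t;           /* assumed output type */

dout_t array_ROM(din1_t inval, din2_t idx) {
    din1_t lookup_table[256];  /* neither static nor const */
    int i;

    /* Every element is written once, before any read, and each value
       depends only on the loop counter. */
    for (i = 0; i < 256; i++) {
        lookup_table[i] = 256 * (i - 128);
    }

    return inval * lookup_table[idx];
}
```

Because the contents never depend on the function inputs, the array can be treated as read-only data even without a qualifier.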
Global Variables
Global variables can be freely used in the code and are fully synthesizable. By default, global
variables are not exposed as ports on the RTL interface.
The following code example shows the default synthesis behavior of global variables. It uses
three global variables. Although this example uses arrays, Vivado HLS supports all types of
global variables.
By default, after synthesis, the only port on the RTL design is port idx. Global variables are
not exposed as RTL ports by default. In the default case:
While global variables are not exposed as I/O ports by default, they can be exposed as I/O
ports by one of following three methods:
• If the global variable is defined with the extern qualifier, the variable is exposed as an
I/O port.
• If an I/O protocol is specified on the global variable (using the INTERFACE directive),
the variable is synthesized to an I/O port with the specified interface protocol.
• The expose_global option in the interface configuration can expose all global
variables as ports on the RTL interface. The interface configuration can be set by:
When global variables are exposed using the interface configuration, all global variables in
the design are exposed as I/O ports, including those that are accessed exclusively inside the
design.
Finally, if any global variable is specified with the static qualifier, it cannot be synthesized to
an I/O port.
In summary, while Vivado HLS supports global variables for synthesis, Xilinx does not
recommend a coding style that uses global variables extensively.
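The default behavior described in this section can be pictured with a sketch like the following. The array names, sizes, and contents are hypothetical. The only function argument is idx, which matches the single port described above.

```c
#define N 4  /* hypothetical array size */

typedef int din_t;
typedef int dout_t;

/* Three global arrays. By default these stay internal to the design
   and do not appear as ports on the RTL interface. */
din_t Ain[N] = {1, 2, 3, 4};
din_t Bin[N] = {10, 20, 30, 40};
dout_t Xout[N];

/* idx is the only function argument, so it is the only data port. */
void global_vars(int idx) {
    Xout[idx] = Ain[idx] + Bin[idx];
}
```

A test bench can still read Xout directly in C simulation, which is one reason heavy use of globals makes the RTL interface harder to reason about.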
Pointers
Pointers are used extensively in C code and are well-supported for synthesis. When using
pointers, be careful in the following cases:
• When pointers are accessed (read or written) multiple times in the same function.
• When using arrays of pointers, each pointer must point to a scalar or a scalar array (not
another pointer).
• Pointer casting is supported only when casting between standard C types, as shown.
The following code example shows synthesis support for pointers that point to multiple
objects.
#include "pointer_multi.h"
dout_t* ptr;
if (sel)
ptr = a;
else
ptr = b;
return ptr[pos];
}
Vivado HLS supports pointers to pointers for synthesis but does not support them on the
top-level interface, that is, as argument to the top-level function. If you use a pointer to
pointer in multiple functions, Vivado HLS inlines all functions that use the pointer to
pointer. Inlining multiple functions can increase run time.
#include "pointer_double.h"
x = 0;
// Sum x if AND of local index and pointer to pointer index is true
for(i=0; i<size; ++i)
if (**flagPtr & i)
x += *(ptr+i);
return x;
}
ptrFlag = flag;
Arrays of pointers can also be synthesized. See the following code example in which an
array of pointers is used to store the start location of the second dimension of a global
array. The pointers in an array of pointers can point only to a scalar or to an array of scalars.
They cannot point to other pointers.
#include "pointer_array.h"
data_t A[N][10];
// Array of pointers
data_t* PtrA[N];
return sum1;
}
Pointer casting is supported for synthesis if native C types are used. In the following code
example, type int is cast to type char.
#define N 1024
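A minimal sketch of such a cast between native types follows. The typedef names, the function name, and the loop bound are assumptions, not the original listing.

```c
#define N 1024

typedef int data_t;   /* native type stored in the array */
typedef char dint_t;  /* native type used to view the same storage */

/* Sums the individual bytes of one int element by viewing it
   through a char pointer. */
data_t pointer_cast_native(data_t index, data_t A[N]) {
    dint_t *ptr;
    data_t i, result = 0;

    ptr = (dint_t *)(&A[index]);  /* int pointer cast to char pointer */
    for (i = 0; i < (data_t)sizeof(data_t); ++i) {
        result += *ptr;
        ptr += 1;
    }
    return result;
}
```

If A[5] holds 0x01010101, every byte is 1 and the function returns 4 on a system with 4-byte int, regardless of byte order.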
Vivado HLS does not support pointer casting between general types. For example, if a
(struct) composite type of signed values is created, the pointer cannot be cast to assign
unsigned values.
struct {
short first;
short second;
} pair;
In such cases, the values must be assigned using the native types.
struct {
short first;
short second;
} pair;
// Assigned value
pair.first = -1U;
pair.second = -1U;
Basic Pointers
A function with basic pointers on the top-level interface, such as shown in the following
code example, produces no issues for Vivado HLS. The pointer can be synthesized to either
a simple wire interface or an interface protocol using handshakes.
#include "pointer_basic.h"
acc += *d;
*d = acc;
}
The pointer on the interface is read or written only once per function call. The test bench
is shown in the following code example.
#include "pointer_basic.h"
int main () {
dio_t d;
int i, retval=0;
FILE *fp;
C and RTL simulation verify the correct operation (although not all possible cases) with this
simple data set:
Din Dout
0 0
1 1
2 3
3 6
Test passed!
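The results above are consistent with a function of the following form. This is a sketch: the signature, the typedef, and the static accumulator are assumptions inferred from the fragment and from the Dout column.

```c
typedef int dio_t;  /* assumed data type */

/* Reads the pointer once, accumulates, and writes the pointer once. */
void pointer_basic(dio_t *d) {
    static dio_t acc = 0;  /* assumed static so the total persists */
    acc += *d;
    *d = acc;
}
```

Calling the function with d set to 0, 1, 2, and 3 on successive calls leaves 0, 1, 3, and 6 in d, which matches the table.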
Pointer Arithmetic
Introducing pointer arithmetic limits the possible interfaces that can be synthesized in RTL.
The following code example shows the same code, but in this instance simple pointer
arithmetic is used to accumulate the data values (starting from the second value).
#include "pointer_arith.h"
for (i=0;i<4;i++) {
acc += *(d+i+1);
*(d+i) = acc;
}
}
The following code example shows the test bench that supports this example. Because the
loop to perform the accumulations is now inside function pointer_arith, the test bench
populates the address space specified by array d[5] with the appropriate values.
#include "pointer_arith.h"
int main () {
dio_t d[5], ref[5];
int i, retval=0;
FILE *fp;
Din Dout
0 1
1 3
2 6
3 10
Test passed!
The pointer arithmetic does not access the pointer data in sequence. Wire, handshake, or
FIFO interfaces have no way of accessing data out of order:
• A wire interface reads data when the design is ready to consume the data or writes the
data when the data is ready.
• Handshake and FIFO interfaces read and write when the control signals permit the
operation to proceed.
In both cases, the data must arrive (and is written) in order, starting from element zero. In
Example 3-39, the code states the first data value read is from index 1 (i starts at 0, 0+1=1).
This is the second element from array d[5] in the test bench.
When this is implemented in hardware, some form of data indexing is required. Vivado HLS
does not support this with wire, handshake, or FIFO interfaces. The code in Example 3-39
can be synthesized only with an ap_bus interface. This interface supplies an address with
which to index the data when the data is accessed (read or write).
Alternatively, the code must be modified with an array on the interface instead of a pointer.
See Example 3-41. This can be implemented in synthesis with a RAM (ap_memory)
interface. This interface can index the data with an address and can perform out-of-order, or
non-sequential, accesses.
Wire, handshake, or FIFO interfaces can be used only on streaming data. They cannot be used
in conjunction with pointer arithmetic (unless it indexes the data starting at zero and then
proceeds sequentially).
For more information on the ap_bus and ap_memory interface types, see Chapter 1,
High-Level Synthesis and Chapter 4, High-Level Synthesis Reference Guide.
#include "array_arith.h"
for (i=0;i<4;i++) {
acc += d[i+1];
d[i] = acc;
}
}
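A self-contained version of the array form is sketched below. The signature, the typedef, and the static accumulator are assumptions. The loop body is the one shown in the fragment.

```c
typedef int dio_t;  /* assumed data type */

/* Array on the interface instead of a pointer. The first read is from
   index 1, so accesses do not start at element zero. */
void array_arith(dio_t d[5]) {
    static int acc = 0;  /* assumed accumulator declaration */
    int i;

    for (i = 0; i < 4; i++) {
        acc += d[i + 1];
        d[i] = acc;
    }
}
```

With d initialized to 0, 1, 2, 3, 4, the first four elements become 1, 3, 6, and 10, the same Dout values listed for the pointer arithmetic version.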
Designs that use pointers in the argument list of the top-level function need special
consideration when multiple accesses are performed using pointers. Multiple accesses
occur when a pointer is read from or written to multiple times in the same function.
• You must use the volatile qualifier on any function argument accessed multiple times.
• On the top-level function, any such argument must have the number of accesses on the
port interface specified if you are verifying the RTL using co-simulation within Vivado
HLS.
• Be sure to validate the C before synthesis to confirm the intent and that the C model is
correct.
If modeling the design requires that a function argument be accessed multiple times,
Xilinx recommends that you model the design using streams. See HLS Stream Library in
Chapter 2. Use streams to ensure that you do not encounter the issues discussed in this
section. The designs in the following table use the Coding Examples in Chapter 1.
In the following code example, input pointer d_i is read from four times and output d_o is
written to twice, with the intent that the accesses are implemented by FIFO interfaces
(streaming data into and out of the final RTL implementation).
#include "pointer_stream_bad.h"
acc += *d_i;
acc += *d_i;
*d_o = acc;
acc += *d_i;
acc += *d_i;
*d_o = acc;
}
The test bench to verify this design is shown in the following code example.
#include "pointer_stream_bad.h"
int main () {
din_t d_i;
dout_t d_o;
int retval=0;
FILE *fp;
The code in Example 3-42 is written with intent that input pointer d_i and output pointer
d_o are implemented in RTL as FIFO (or handshake) interfaces to ensure that:
• Upstream producer blocks supply new data each time a read is performed on RTL port
d_i.
• Downstream consumer blocks accept new data each time there is a write to RTL port
d_o.
When this code is compiled by standard C compilers, the multiple accesses to each pointer
are reduced to a single access. As far as the compiler is concerned, there is no indication that
the data on d_i changes during the execution of the function and only the final write to
d_o is relevant. The other writes are overwritten by the time the function completes.
Vivado HLS matches the behavior of the gcc compiler and optimizes these reads and writes
into a single read operation and a single write operation. When the RTL is examined, there
is only a single read and write operation on each port.
The fundamental issue with this design is that the test bench and design do not adequately
model how you expect the RTL ports to be implemented:
• You expect RTL ports that read and write multiple times during a transaction (and can
stream the data in and out).
• The test bench supplies only a single input value and returns only a single output value.
A C simulation of Example 3-42 shows the following results, which demonstrate that
each input is being accumulated four times. The same value is read once and
accumulated each time. It is not four separate reads.
Din Dout
0 0
1 4
2 8
3 12
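These results can be reproduced with a self-contained sketch of the function. The signature, the typedefs, and the non-static accumulator are assumptions inferred from the fragment and from the table.

```c
typedef int din_t;   /* assumed input type */
typedef int dout_t;  /* assumed output type */

/* Without volatile, the four reads of *d_i are indistinguishable from
   one read used four times, and only the last write to *d_o is
   observable after the call. */
void pointer_stream_bad(dout_t *d_o, din_t *d_i) {
    din_t acc = 0;  /* assumed to restart at zero on every call */

    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
    acc += *d_i;
    acc += *d_i;
    *d_o = acc;
}
```

A test bench that passes a single input value per call therefore always observes an output equal to four times the input.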
To make this design read and write to the RTL ports multiple times, use a volatile
qualifier. See the following code example.
The volatile qualifier tells the C compiler (and Vivado HLS) to make no assumptions
about the pointer accesses. That is, the data is volatile and might change.
#include "pointer_stream_better.h"
acc += *d_i;
acc += *d_i;
*d_o = acc;
acc += *d_i;
acc += *d_i;
*d_o = acc;
}
Example 3-44 simulates the same as Example 3-42, but the volatile qualifier:
Even if the volatile keyword is used, this coding style (accessing a pointer multiple
times) still has an issue in that the function and test bench do not adequately model
multiple distinct reads and writes.
In this case, four reads are performed, but the same data is read four times. There are two
separate writes, each with the correct data, but the test bench captures data only for the
final write.
TIP: To see the intermediate accesses, enable cosim_design to create a trace file during RTL
simulation and view the trace file in the appropriate viewer.
Example 3-44 can be implemented with wire interfaces. If a FIFO interface is specified,
Vivado HLS creates an RTL test bench to stream new data on each read. Because no new
data is available from the test bench, the RTL fails to verify. The test bench does not
correctly model the reads and writes.
Unlike software, the concurrent nature of hardware systems allows them to take advantage
of streaming data. Data is continuously supplied to the design and the design continuously
outputs data. An RTL design can accept new data before the design has finished processing
the existing data.
As Example 3-44 has shown, modeling streaming data in software is non-trivial,
especially when writing software to model an existing hardware implementation (where the
concurrent/streaming nature already exists and needs to be modeled).
• Add the volatile qualifier as shown in Example 3-44. The test bench does not model
unique reads and writes, and RTL simulation using the original C test bench might fail,
but viewing the trace file waveforms shows that the correct reads and writes are being
performed.
• Modify the code to model explicit unique reads and writes. See Example 3-45.
• Modify the code to use a streaming data type. A streaming data type allows hardware
using streaming data to be accurately modeled. See Chapter 1, High-Level Synthesis.
The following code example has been updated to ensure that it reads four unique values
from the test bench and writes two unique values. Because the pointer accesses are
sequential and start at location zero, a streaming interface type can be used during
synthesis.
#include "pointer_stream_good.h"
acc += *d_i;
acc += *(d_i+1);
*d_o = acc;
acc += *(d_i+2);
acc += *(d_i+3);
*(d_o+1) = acc;
}
The test bench is updated to model the fact that the function reads four unique values in
each transaction. This new test bench models only a single transaction. To model multiple
transactions, the input data set must be increased and the function called multiple times.
#include "pointer_stream_good.h"
int main () {
din_t d_i[4];
dout_t d_o[4];
int i, retval=0;
FILE *fp;
The test bench validates the algorithm with the following results, showing that:
The final issue to be aware of when pointers are accessed multiple times at the function
interface is RTL simulation modeling.
When pointers on the interface are accessed multiple times, to read or write, Vivado HLS
cannot determine from the function interface how many reads or writes are performed.
Neither of the arguments in the function interface informs Vivado HLS how many values are
read or written.
Unless the interface informs Vivado HLS how many values are required (for example, the
maximum size of an array), Vivado HLS assumes a single value and creates C/RTL
co-simulation for only a single input and a single output.
If the RTL ports are actually reading or writing multiple values, the RTL co-simulation stalls.
RTL co-simulation models the producer and consumer blocks that are connected to the RTL
design. If the design requires more than a single value, the RTL design stalls when trying to
read or write more than one value (because there is currently no value to read or no space
to write).
When multi-access pointers are used at the interface, Vivado HLS must be informed of the
maximum number of reads or writes on the interface. When specifying the interface, use the
depth option on the INTERFACE directive as shown in the following figure.
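In source code, the directive can also be written as a pragma inside the function. The sketch below assumes the pragma form of the INTERFACE directive with the ap_fifo mode, and it assumes a function signature and typedefs for the multi-access example. The depth values state that d_i is read four times and d_o is written twice per call.

```c
typedef int din_t;   /* assumed input type */
typedef int dout_t;  /* assumed output type */

void pointer_stream_good(volatile dout_t *d_o, volatile din_t *d_i) {
/* Assumed pragma placement and mode. depth gives the maximum number
   of accesses on each port. A standard C compiler ignores these lines. */
#pragma HLS INTERFACE ap_fifo depth=4 port=d_i
#pragma HLS INTERFACE ap_fifo depth=2 port=d_o
    din_t acc = 0;

    acc += *d_i;        /* four unique, sequential reads */
    acc += *(d_i + 1);
    *d_o = acc;         /* first write */
    acc += *(d_i + 2);
    acc += *(d_i + 3);
    *(d_o + 1) = acc;   /* second write */
}
```

With inputs 0, 1, 2, and 3, the two outputs are 1 and 6, so a test bench must supply four input values and check two output values per call.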
C Builtin Functions
Vivado HLS supports the following C builtin functions:
The following example shows how these functions may be used. This example returns the sum of
the number of leading zeros in in0 and trailing zeros in in1:
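A minimal sketch of such a function is shown below. It assumes that the two builtins in question are the GCC-style __builtin_clz and __builtin_ctz, and the function name is hypothetical.

```c
/* Returns the number of leading zeros in in0 plus the number of
   trailing zeros in in1. Both builtins are undefined for a zero
   argument, so callers must pass non-zero values. */
int sum_leading_trailing_zeros(unsigned int in0, unsigned int in1) {
    int ldz0 = __builtin_clz(in0);  /* count leading zeros */
    int trz1 = __builtin_ctz(in1);  /* count trailing zeros */
    return ldz0 + trz1;
}
```

For example, with a 32-bit unsigned int, in0 = 1 has 31 leading zeros and in1 = 8 has 3 trailing zeros, so the result is 34.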
The same methodology applies to code written for a DSP or a GPU, and when using an
FPGA: an FPGA device is simply another target.
C code synthesized by Vivado HLS will execute on an FPGA and provide the same
functionality as the C simulation. In some cases, the developer's work is done at this stage.
Typically, however, an FPGA is selected to implement the C code due to the superior
performance of the FPGA device - the massively parallel architecture of an FPGA allows it to
perform operations much faster than the inherently sequential operations of a processor -
and users typically wish to take advantage of that performance.
The focus here is on understanding the impact of the C code on the results which can be
achieved and how modifications to the C code can be used to extract the maximum
advantage from the first three items in this list.
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
HconvW:for(int row = border_width; row < width - border_width; row++){
Hconv:for(int i = - border_width; i <= border_width; i++){
}
}
// Vertical convolution
VconvH:for(int col = border_width; col < height - border_width; col++){
VconvW:for(int row = 0; row < width; row++){
Vconv:for(int i = - border_width; i <= border_width; i++){
}
}
// Border pixels
Top_Border:for(int col = 0; col < border_width; col++){
}
Side_Border:for(int col = border_width; col < height - border_width; col++){
}
Bottom_Border:for(int col = height - border_width; col < height; col++){
}
}
Horizontal Convolution
The first step in this is to perform the convolution in the horizontal direction as shown in
the following figure.
The convolution is performed using K samples of data and K convolution coefficients. In the
figure above, K is shown as 5; however, the value of K is defined in the code. To perform the
convolution, a minimum of K data samples are required. The convolution window cannot
start at the first pixel, since the window would need to include pixels which are outside the
image.
By performing a symmetric convolution, the first K data samples from input src can be
convolved with the horizontal coefficients and the first output calculated. To calculate the
second output, the next set of K data samples are used. This calculation proceeds along
each row until the final output is written.
The final result is a smaller image, shown above in blue. The pixels along the vertical border
are addressed later.
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Note: Only use the __SYNTHESIS__ macro in the code to be synthesized. Do not use this macro in the
test bench, because it is not obeyed by C simulation or C RTL co-simulation.
The code is straightforward and intuitive. However, there are already some issues with this
C code, including three that negatively impact the quality of the hardware results.
The first issue is the requirement for two separate storage requirements. The results are
stored in an internal local array. This requires an array of HEIGHT*WIDTH which for a
standard video image of 1920*1080 will hold 2,073,600 values. On some Windows systems, it
is not uncommon for this amount of local storage to create issues. The data for a local array
is placed on the stack and not the heap which is managed by the OS.
A useful way to avoid such issues is to use the __SYNTHESIS__ macro. This macro is
automatically defined when synthesis is executed. The code shown above will use the
dynamic memory allocation during C simulation to avoid any compilation issues and only
use the static storage during synthesis. A downside of using this macro is the code verified
by C simulation is not the same code which is synthesized. In this case however, the code is
not complex and the behavior will be the same.
The first issue for the quality of the FPGA implementation is the array local. Since this is
an array it will be implemented using internal FPGA block RAM. This is a very large memory
to implement inside the FPGA. It may require a larger and more costly FPGA device. The use
of block RAM can be minimized by using the DATAFLOW optimization and streaming the
data through small efficient FIFOs, but this will require the data to be used in a streaming
manner.
The next issue is the initialization for array local. The loop Clear_Local is used to set
the values in array local to zero. Even if this loop is pipelined, this operation will require
approximately 2 million clock cycles (HEIGHT*WIDTH) to implement. This same initialization
of the data could be performed using a temporary variable inside loop HConv to initialize
the accumulation before the write.
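A minimal sketch of that alternative, assuming a single image row, int samples, and a hypothetical kernel size K of 3. The function name hconv_row is illustrative and is not taken from the original design. The accumulator is initialized inside the loop, so no separate clearing pass over the result array is needed. This sketch addresses only the initialization issue; it still re-reads input samples.

```cpp
#include <cassert>

const int K = 3;  // Hypothetical kernel size for this sketch.

// Horizontal convolution of one row; dst receives width-(K-1) results.
void hconv_row(const int *src, int *dst, int width, const int coeff[K]) {
    assert(width >= K);
    for (int col = 0; col <= width - K; col++) {
        int acc = 0;  // Initialized here, so dst never needs a clearing pass.
        for (int i = 0; i < K; i++) {
            acc += src[col + i] * coeff[i];
        }
        dst[col] = acc;  // Single write per output.
    }
}
```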
Finally, the throughput of the data is limited by the data access pattern.
• For the first output, the first K values are read from the input.
• To calculate the second output, the same K-1 values are re-read through the data input
port.
• This process of re-reading the data is repeated for the entire image.
One of the keys to a high-performance FPGA is to minimize the access to and from the
top-level function arguments. The top-level function arguments become the data ports on
the RTL block. With the code shown above, the data cannot be streamed directly from a
processor using a DMA operation, since the data is required to be re-read time and again.
Re-reading inputs also limits the rate at which the FPGA can process samples.
Vertical Convolution
The next step is to perform the vertical convolution shown in the following figure.
[Figure 3-3. Labels: local, dst]
The process for the vertical convolution is similar to the horizontal convolution. A set of K
data samples is required to convolve with the convolution coefficients, Vcoeff in this case.
After the first output is created using the first K samples in the vertical direction, the next
set of K values is used to create the second output. The process continues down through
each column until the final output is created.
After the vertical convolution, the image is now smaller than the source image src due to
both the horizontal and vertical border effect.
This code highlights similar issues to those already discussed with the horizontal
convolution code.
• Many clock cycles are spent to set the values in the output image dst to zero. In this
case, approximately another 2 million cycles for a 1920*1080 image size.
• There are multiple accesses per pixel to re-read data stored in array local.
• There are multiple writes per pixel to the output array/port dst.
Another issue with the code above is the access pattern into array local. The algorithm
requires the data on row K to be available to perform the first calculation. Processing data
down the rows before proceeding to the next column requires the entire image to be stored
locally. In addition, because the data is not streamed out of array local, a FIFO cannot be
used to implement the memory channels created by DATAFLOW optimization. If DATAFLOW
optimization is used on this design, this memory channel requires a ping-pong buffer: this
doubles the memory requirements for the implementation to approximately 4 million data
samples all stored locally on the FPGA.
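The storage figures quoted above follow from simple arithmetic, sketched here with two hypothetical helper functions. The compile-time checks use the 1920*1080 image size from the text.

```cpp
#include <cassert>

// Samples held when the whole image is stored in one buffer.
constexpr long full_frame_samples(long width, long height) {
    return width * height;
}

// A ping-pong channel keeps two complete copies of the buffer.
constexpr long ping_pong_samples(long width, long height) {
    return 2 * full_frame_samples(width, height);
}

static_assert(full_frame_samples(1920, 1080) == 2073600L, "one 1920*1080 frame");
static_assert(ping_pong_samples(1920, 1080) == 4147200L, "two 1920*1080 frames");
```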
Border Pixels
The final step in performing the convolution is to create the data around the border. These
pixels can be created by simply re-using the nearest pixel in the convolved output. The
following figure shows how this is achieved.
[Figure 3-4. Labels: dst, dst]
The border region is populated with the nearest valid value. The following code performs
the operations shown in the figure.
The code suffers from the same repeated access for data. The data stored outside the FPGA
in array dst must now be available as input data and is re-read multiple times. Even in
the first loop, dst[border_width_offset + border_width] is read multiple times but the
values of border_width_offset and border_width do not change.
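One way to avoid the repeated read is to copy the unchanging source pixel into a local variable before the loop. The sketch below assumes a hypothetical helper, fill_left_border, operating on a pointer to the start of one image row. It is not the original border code, only an illustration of caching a loop-invariant read.

```cpp
#include <cassert>

// Fills the first border_width entries of a row with the nearest valid pixel.
// row is a hypothetical pointer to the start of one image row.
void fill_left_border(int *row, int border_width) {
    assert(border_width >= 0);
    const int edge = row[border_width];  // Read the source pixel once.
    for (int col = 0; col < border_width; col++) {
        row[col] = edge;  // Reuse the cached value; no further reads of row.
    }
}
```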
The final aspect where this coding style negatively impacts the performance and quality of
the FPGA implementation is the structure of how the different conditions are addressed. A
for-loop processes the operations for each condition: top-left, top-row, etc. The
optimization choice here is to:
The question of whether to pipeline the top-level loop and unroll the sub-loops or pipeline
the sub-loops individually is determined by the loop limits and how many resources are
available on the FPGA device. If the top-level loop limit is small, unroll the loops to replicate
the hardware and meet performance. If the top-level loop limit is large, pipeline the lower
level loops and lose some performance by executing them sequentially in a loop
(Top_Border, Side_Border, Bottom_Border).
As shown in this review of a standard convolution algorithm, the following coding styles
negatively impact the performance and size of the FPGA implementation:
• Maximize the flow of data through the system. Refrain from using any coding
techniques or algorithm behavior which limits the flow of data.
• Maximize the reuse of data. Use local caches to ensure there are no requirements to
re-read data and the incoming data can keep flowing.
The first step is to ensure you perform optimal I/O operations into and out of the FPGA. The
convolution algorithm is performed on an image. When data from an image is produced
and consumed, it is transferred in a standard raster-scan manner as shown in the following
figure.
[Figure: raster-scan order. Labels: Width, Height]
Code written using hls::streams will generally create designs in an FPGA which have
high-performance and use few resources because an hls::stream enforces a coding style
which is ideal for implementation in an FPGA.
Multiple reads of the same data from an hls::stream are impossible. Once the data has been
read from an hls::stream it no longer exists in the stream. This helps eliminate the
practice of re-reading data.
If the data from an hls::stream is required again, it must be cached. This is another good
practice when writing code to be synthesized on an FPGA.
The hls::stream forces the C code to be developed in a manner which is ideal for an FPGA
implementation.
There is no requirement to use hls::streams and the same implementation can be performed
using arrays in the C code. The hls::stream construct does help enforce good coding
practices. More details on hls::streams are provided in HLS Stream Library in Chapter 2.
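The read-once behavior can be modeled in plain C++ for illustration. In this sketch std::queue is an assumed stand-in for hls::stream, and stream_read and pair_sums are hypothetical helpers. Because a sample disappears from the queue when it is read, the previous sample must be cached in a local variable to be used again.

```cpp
#include <cassert>
#include <queue>

// Stand-in for a stream read: the front element is removed as it is returned.
int stream_read(std::queue<int> &s) {
    assert(!s.empty());
    int v = s.front();
    s.pop();
    return v;
}

// Sums each adjacent pair of samples. The previous sample is cached locally
// because it cannot be read from the stream a second time.
std::queue<int> pair_sums(std::queue<int> in) {
    std::queue<int> out;
    if (in.empty()) return out;
    int prev = stream_read(in);  // Cache of the last sample consumed.
    while (!in.empty()) {
        int cur = stream_read(in);
        out.push(prev + cur);
        prev = cur;
    }
    return out;
}
```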
With an hls::stream construct the outline of the new optimized code is as follows:
hls::stream<T> hconv("hconv");
hls::stream<T> vconv("vconv");
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
HConv:for(int i = 0; i < K; i++) {
}
}
}
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
VConv:for(int i = 0; i < K; i++) {
}
}
}
In addition, some assert statements are used to specify the maximum of loop bounds. This
is a good coding style which allows HLS to automatically report on the latencies of variable
bounded loops and optimize the loop bounds.
Horizontal Convolution
To perform the calculation in a more efficient manner for FPGA implementation, the
horizontal convolution is computed as shown in the following figure.
[Figure 3-6. Labels: src, hconv]
Using an hls::stream enforces the good algorithm practice of forcing you to start by reading
the first sample first, as opposed to performing a random access into data. The algorithm
must use the K previous samples to compute the convolution result, so it copies each
sample into a temporary cache hwin. For the first calculation there are not enough values
in hwin to compute a result, so no output values are written.
The algorithm keeps reading input samples and caching them into hwin. Each time it reads a
new sample, it pushes an unneeded sample out of hwin. The first time an output value can
be written is after the Kth input has been read.
The algorithm proceeds in this manner along the rows until the final sample has been read.
At this point, only the last K samples are stored in hwin: all that is required to compute the
convolution.
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
T in_val = src.read();
T out_val = 0;
HConv:for(int i = 0; i < K; i++) {
hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
out_val += hwin[i] * hcoeff[i];
}
if (row >= K - 1)
hconv << out_val;
}
}
An interesting point to note in the code above is the use of the temporary variable out_val to
perform the convolution calculation. This variable is set to zero before the calculation is
performed, negating the need to spend 2 million clock cycles to reset the values, as in the
previous example.
Throughout the entire process, the samples in the src input are processed in a
raster-streaming manner. Every sample is read in turn. The outputs from the task are either
discarded or used, but the task keeps constantly computing. This represents a difference
from code written to perform on a CPU.
In a CPU architecture, conditional or branch operations are often avoided. When the
program needs to branch it loses any instructions stored in the CPU fetch pipeline. In an
FPGA architecture, a separate path already exists in the hardware for each conditional
branch and there is no performance penalty associated with branching inside a pipelined
task. It is simply a case of selecting which branch to use.
The outputs are stored in the hls::stream hconv for use by the vertical convolution loop.
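The streaming horizontal convolution described above can be checked with a small software model. This is a sketch under stated assumptions: std::queue stands in for hls::stream, samples are int, and the kernel size K is a hypothetical 3. The loop variable naming (col counts lines, row counts pixels) follows the guide's listings.

```cpp
#include <cassert>
#include <queue>

const int K = 3;  // Hypothetical kernel size for this sketch.

// Software model of the streaming horizontal convolution. Every input sample
// is read exactly once; a result is written only when row >= K - 1.
std::queue<int> hconv_stream(std::queue<int> src, int height, int width,
                             const int hcoeff[K]) {
    std::queue<int> hconv;
    int hwin[K] = {0};  // Small cache of the most recent K samples.
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < width; row++) {
            assert(!src.empty());
            int in_val = src.front();
            src.pop();
            int out_val = 0;  // Local accumulator, reset for every sample.
            for (int i = 0; i < K; i++) {
                hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;  // Shift in the new sample.
                out_val += hwin[i] * hcoeff[i];
            }
            if (row >= K - 1)
                hconv.push(out_val);
        }
    }
    return hconv;
}
```

For a 2-line image of width 4, the model produces height*(width-(K-1)) = 4 outputs, which matches the reduced image width described earlier.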
Vertical Convolution
The vertical convolution represents a challenge to the streaming data model preferred by
an FPGA. The data must be accessed by column but you do not wish to store the entire
image. The solution is to use line buffers, as shown in the following figure.
[Figure: line buffers. Labels: hconv, vconv]
Once again, the samples are read in a streaming manner, this time from the hls::stream
hconv. The algorithm requires at least K-1 lines of data before it can process the first
sample. All the calculations performed before this are discarded.
A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another
sample is pushed out of the line buffer. An interesting point to note here is that the newest
sample is used in the calculation and then the sample is stored into the line buffer and the
old sample ejected. This ensures only K-1 lines are required to be cached, rather than K
lines. Although a line buffer does require multiple lines to be stored locally, the convolution
kernel size K is always much less than the 1080 lines in a full video image.
The first calculation can be performed when the first sample on the Kth line is read. The
algorithm then proceeds to output values until the final pixel is read.
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS DEPENDENCE variable=linebuf inter false
#pragma HLS PIPELINE
T in_val = hconv.read();
T out_val = 0;
VConv:for(int i = 0; i < K; i++) {
T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
out_val += vwin_val * vcoeff[i];
if (i > 0)
linebuf[i - 1][row] = vwin_val;
}
if (col >= K - 1)
vconv << out_val;
}
}
The code above once again processes all the samples in the design in a streaming manner.
The task is constantly running. The use of the hls::stream construct forces you to cache the
data locally. This is an ideal strategy when targeting an FPGA.
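The line-buffer behavior can also be modeled in software. The sketch below makes several assumptions: std::queue stands in for hls::stream, std::vector stands in for the linebuf array, samples are int, and K is a hypothetical 3. The inner loop mirrors the listing above, including the step that ages each cached column by one line.

```cpp
#include <cassert>
#include <queue>
#include <vector>

const int K = 3;  // Hypothetical kernel size for this sketch.

// Software model of the streaming vertical convolution with K-1 cached lines.
std::queue<int> vconv_stream(std::queue<int> hconv, int height, int xlim,
                             const int vcoeff[K]) {
    std::queue<int> vconv;
    // K-1 lines, each xlim samples wide, cleared to zero.
    std::vector<std::vector<int> > linebuf(K - 1, std::vector<int>(xlim, 0));
    for (int col = 0; col < height; col++) {
        for (int row = 0; row < xlim; row++) {
            assert(!hconv.empty());
            int in_val = hconv.front();
            hconv.pop();
            int out_val = 0;
            for (int i = 0; i < K; i++) {
                int vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
                out_val += vwin_val * vcoeff[i];
                if (i > 0)
                    linebuf[i - 1][row] = vwin_val;  // Age the column by one line.
            }
            if (col >= K - 1)
                vconv.push(out_val);
        }
    }
    return vconv;
}
```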
Border Pixels
The final step in the algorithm is to replicate the edge pixels into the border region. Once
again, to ensure the constant flow of data and data reuse, the algorithm makes use of an
hls::stream and caching.
The following figure shows how the border samples are aligned into the image.
• Each sample is read from the vconv output from the vertical convolution.
• The sample is then cached as one of 4 possible pixel types.
• The sample is then written to the output stream.
[Figure: border sample alignment. Labels: vconv, Border, Right Edge, Left Edge, dst, Raw Pixel, Border]
The code for determining the location of the border pixels is:
A notable difference with this new code is the extensive use of conditionals inside the tasks.
This allows the task, once it is pipelined, to continuously process data and the result of the
conditionals does not impact the execution of the pipeline: the result will impact the output
values but the pipeline will keep processing so long as input samples are available.
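The idea of placing conditionals inside a continuously running loop can be illustrated with a simplified one-dimensional model. This is not the guide's border code. It is a hypothetical sketch in which std::queue stands in for hls::stream and add_row_border expands a single row of valid samples by replicating the edge values. The conditional decides whether a new sample is consumed, while the write happens on every iteration.

```cpp
#include <cassert>
#include <queue>

// Hypothetical one-dimensional model: expands one row of valid samples to
// width output samples by replicating the edge values into the border.
std::queue<int> add_row_border(std::queue<int> vconv, int width, int border_width) {
    assert(width > 2 * border_width);
    std::queue<int> dst;
    int pix = 0;  // Cache of the most recent sample read from the stream.
    for (int col = 0; col < width; col++) {
        // A new sample is consumed at the start of the row and at every
        // interior position after the first valid one.
        bool read_new = (col == 0) ||
                        (col > border_width && col < width - border_width);
        if (read_new) {
            assert(!vconv.empty());
            pix = vconv.front();
            vconv.pop();
        }
        dst.push(pix);  // Executes on every iteration, whichever branch was taken.
    }
    return dst;
}
```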
The final code for this FPGA-friendly algorithm has the following optimization directives
used.
hls::stream<T> hconv("hconv");
hls::stream<T> vconv("vconv");
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
#pragma HLS PIPELINE
HConv:for(int i = 0; i < K; i++) {
}
}
}
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS PIPELINE
#pragma HLS DEPENDENCE variable=linebuf inter false
VConv:for(int i = 0; i < K; i++) {
}
}
}
Each of the tasks is pipelined at the sample level. The line buffer is fully partitioned into
registers to ensure there are no read or write limitations due to insufficient block RAM
ports. The line buffer also requires a dependence directive. All of the tasks execute in a
dataflow region which will ensure the tasks run concurrently. The hls::streams are
automatically implemented as FIFOs with 1 element.
Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM
which, like I/O ports, has only a limited number of ports and can be a bottleneck to
performance. Arrays can be partitioned into smaller arrays and even individual registers but
partitioning large arrays will result in many registers being used. Use small localized caches
to hold results such as accumulations and then write the final result to the array.
Seek to perform conditional branching inside pipelined tasks rather than conditionally
execute tasks, even pipelined tasks. Conditionals will be implemented as separate paths in
the pipeline. Allowing the data from one task to flow into the next task, with the conditional
performed inside the next task, will result in a higher performing system.
Minimize output writes for the same reason as input reads: ports are bottlenecks.
Replicating additional ports simply pushes the issue further out into the system.
For C code which processes data in a streaming manner, consider using hls::streams as
these will enforce good coding practices. It is much more productive to design an algorithm
in C which will result in a high-performance FPGA implementation than debug why the
FPGA is not operating at the performance required.
#include "cpp_FIR.h"
return fir1(x);
}
IMPORTANT: Classes and class member functions cannot be the top-level for synthesis. Instantiate the
class in a top-level function.
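A minimal sketch of this arrangement, using a hypothetical MovingSum class rather than the CFir class from the example. The overloaded operator() executes the algorithm, and the plain function moving_sum_top is what would be named as the top-level for synthesis. The static instance keeps its state between calls.

```cpp
#include <cassert>

// Hypothetical class; the overloaded operator() runs the algorithm.
template <typename data_T, typename acc_T, int TAPS>
class MovingSum {
    data_T shift_reg[TAPS];
public:
    MovingSum() {
        for (int i = 0; i < TAPS; i++) shift_reg[i] = 0;
    }
    // Shifts x into the register and returns the sum of the last TAPS inputs.
    acc_T operator()(data_T x) {
        acc_T acc = 0;
        for (int i = TAPS - 1; i > 0; i--) {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i];
        }
        shift_reg[0] = x;
        acc += x;
        return acc;
    }
};

// Top-level function: a plain function that owns the class instance.
int moving_sum_top(int x) {
    static MovingSum<int, int, 3> filter;  // State persists across calls.
    return filter(x);
}
```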
Before examining the class used to implement the design in Example 3-48, it is worth
noting Vivado HLS ignores the standard output stream cout during synthesis. When
synthesized, Vivado HLS issues the following warnings:
The following code example shows the header file cpp_FIR.h, including the definition of
class CFir and its associated member functions. In this example the operator member
functions () and << are overloaded operators, which are respectively used to execute the
main algorithm and used with cout to format the data for display during C simulation.
#include <fstream>
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;
#define N 85
acc_t acc = 0;
data_t m;
The test bench for Example 3-48 is shown in the following code example and demonstrates
how top-level function cpp_FIR is called and validated. This example highlights some of
the important attributes of a good test bench for Vivado HLS synthesis:
For more information on test benches, see Productive Test Benches.
#include "cpp_FIR.h"
int main() {
ofstream result;
data_t output;
int retval=0;
// Apply stimuli, call the top-level function and saves the results
for (int i = 0; i <= 250; i++)
{
output = cpp_FIR(i);
}
result.close();
1. Open the file where the class is defined (typically a header file).
2. Apply the directive using the Directives tab.
As with functions, all instances of a class have the same optimizations applied to them.
Vivado HLS supports virtual functions (including abstract functions) for synthesis, provided
that it can statically determine the function during elaboration. Vivado HLS does not
support virtual functions for synthesis in the following cases:
• Virtual functions can be defined in a multilayer inheritance class hierarchy but only
with a single inheritance.
• Dynamic polymorphism is only supported if the pointer object can be determined at
compile time. For example, such pointers cannot be used in if-else or loop
constructs.
• An STL container cannot contain the pointer of an object and call the polymorphism
function. For example:
vector<base *> base_ptrs(10);
• Vivado HLS does not support cases in which the base object pointer is a global variable.
For example:
Base *base_ptr;
void func()
{
……
base_ptr->virtual_function();
……
}
• The base object pointer cannot be a member variable in a class definition. For example:
// Static elaboration cannot bind base object pointer with correct data type.
class A
{
…..
Base *base_ptr;
void set_base(Base *base_ptr);
void some_func();
…..
};
void A::some_func()
{
….
base_ptr->virtual_function();
….
}
• If the base object pointer or reference is in the function parameter list of a constructor,
Vivado HLS does not convert it. The ISO C++ standard describes this in section 12.7:
the behavior is sometimes undefined.
class A {
A(Base *b) {
b->virtual_function();
}
};
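For contrast with the unsupported cases above, the following sketch shows a virtual call whose target can be determined at compile time: the base pointer is a local variable bound once to an object of known concrete type, outside any if-else or loop. The class and function names are hypothetical. Whether a given tool version resolves this statically is an assumption that should be confirmed by running synthesis.

```cpp
#include <cassert>

class Base {
public:
    virtual int scale(int x) const { return x; }
    virtual ~Base() {}
};

class Doubler : public Base {
public:
    virtual int scale(int x) const { return 2 * x; }
};

// Hypothetical top-level function.
int scale_top(int x) {
    Doubler d;      // Concrete type is fixed at compile time.
    Base *p = &d;   // Local pointer, bound once and never reassigned.
    return p->scale(x);
}
```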
#define TAPS 3
#define PHASES 4
#define DATA_SAMPLES 256
#define CELL_SAMPLES 12
template <typename T0, typename T1, typename T2, typename T3, int N>
class polyd_cell {
private:
public:
T0 areg;
T0 breg;
T2 mreg;
T1 preg;
T0 shift[N];
int k; //line 73
T0 shift_output;
void exec(T1 *pcout, T0 *dataOut, T1 pcin, T3 coeff, T0 data, int col)
{
Function_label0:;
if (col==0) {
SHIFT:for (k = N-1; k >= 0; --k) {
if (k > 0)
shift[k] = shift[k-1];
else
shift[k] = data;
}
*dataOut = shift_output;
shift_output = shift[N-1];
}
*pcout = (shift[4*col]* coeff) + pcin;
}
};
acc_t pcin0 = 0;
acc_t pcout0, pcout1;
data_t dout0, dout1;
int col;
static acc_t accum=0;
static int sample_count = 0;
static polyd_cell<data_t, acc_t, mult_t, coef_t, CELL_SAMPLES>
polyd_cell0;
static polyd_cell<data_t, acc_t, mult_t, coef_t, CELL_SAMPLES>
polyd_cell1;
polyd_cell0.exec(&pcout0,&dout0,pcin0,coeff1[row][col],dataIn[sample_count],
col);
polyd_cell1.exec(&pcout1,&dout1,pcout0,coeff2[row][col],dout0,col);
}
sample_count++;
}
Example 3-51: C++ Class Data Member Used for Loop Index Coding Example
Within class polyd_cell there is a loop SHIFT used to shift data. If the loop index k used
in loop SHIFT was removed and replaced with the global index for k (shown earlier in the
example, but commented static int k), Vivado HLS is unable to pipeline any loop or
function in which class polyd_cell was used. Vivado HLS would issue the following
message:
Using local non-global variables for loop indexing ensures that Vivado HLS can perform all
optimizations.
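A reduced sketch of the recommended style follows. The class shift_cell is a hypothetical simplification of polyd_cell that keeps only the shift register. The loop index k is declared in the for statement, so it is local to the loop rather than a data member or global variable.

```cpp
#include <cassert>

template <typename T, int N>
class shift_cell {
public:
    T shift[N];
    shift_cell() {
        for (int i = 0; i < N; i++) shift[i] = 0;
    }
    // Shifts data in and returns the oldest element, which is shifted out.
    T push(T data) {
        T oldest = shift[N - 1];
        // SHIFT loop: k is local to the loop.
        for (int k = N - 1; k >= 0; --k) {
            if (k > 0)
                shift[k] = shift[k - 1];
            else
                shift[k] = data;
        }
        return oldest;
    }
};
```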
Templates
Vivado HLS supports the use of templates in C++ for synthesis. Vivado HLS does not
support templates for the top-level function.
In addition to the general use of templates shown in Example 3-49 and Example 3-51,
templates can be used to implement a form of recursion that is not supported in standard C
synthesis (Recursive Functions).
The following code example shows a case in which a templatized struct is used to
implement a tail-recursion Fibonacci algorithm. The key to performing synthesis is that a
termination class is used to implement the final call in the recursion, where a template size
of one is used.
// Termination condition
template<> struct fibon_s<1> {
template<typename T>
static T fibon_f(T a, T b) {
return b;
}
};
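To show how the termination class fits into the whole, the sketch below pairs it with a general template that calls the next-smaller instantiation. The general case, the wrapper function cpp_template, the constant FIB_N, and the use of int are assumptions for illustration. The recursion depth is fixed at compile time by the template argument.

```cpp
#include <cassert>

// General case: each instantiation calls the next-smaller one.
template <int N> struct fibon_s {
    template <typename T>
    static T fibon_f(T a, T b) {
        return fibon_s<N - 1>::fibon_f(b, a + b);
    }
};

// Termination condition
template <> struct fibon_s<1> {
    template <typename T>
    static T fibon_f(T a, T b) {
        (void)a;  // Unused in the final call.
        return b;
    }
};

// Hypothetical top-level wrapper; FIB_N fixes the recursion depth.
const int FIB_N = 5;
void cpp_template(int a, int b, int &dout) {
    dout = fibon_s<FIB_N>::fibon_f(a, b);
}
```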
Assertions
The assert macro in C is supported for synthesis when used to assert range information. For
example, the upper limit of variables and loop-bounds.
As noted in Variable Loop Bounds, when variable loop bounds are present, Vivado HLS
cannot determine the latency for all iterations of the loop and reports the latency with a
question mark. The Tripcount directive can inform Vivado HLS of the loop bounds, but this
information is only used for reporting purposes and does not impact the result of synthesis
(the same sized hardware is created, with or without the Tripcount directive).
The following code example shows how assertions can inform Vivado HLS about the
maximum range of variables, and how those assertions are used to produce more optimal
hardware.
Before using assertions, the header file that defines the assert macro must be included. In
this example, this is included in the header file.
#ifndef _loop_sequential_assert_H_
#define _loop_sequential_assert_H_
#include <stdio.h>
#include <assert.h>
#include "ap_cint.h"
#define N 32
void loop_sequential_assert(din_t A[N], din_t B[N], dout_t X[N], dout_t Y[N], dsel_t
xlimit, dsel_t ylimit);
#endif
In the main code two assert statements are placed before each of the loops.
assert(xlimit<32);
...
assert(ylimit<16);
...
These assertions:
• Guarantee that if the assertion is false and the value is greater than that stated, the C
simulation will fail. This also highlights why it is important to simulate the C code
before synthesis: confirm the design is valid before synthesis.
• Inform Vivado HLS that the range of this variable will not exceed this value. This fact
can be used to optimize the variable's size in the RTL and, in this case, the loop iteration count.
#include "loop_sequential_assert.h"
void loop_sequential_assert(din_t A[N], din_t B[N], dout_t X[N], dout_t Y[N], dsel_t
xlimit, dsel_t ylimit) {
dout_t X_accum=0;
dout_t Y_accum=0;
int i,j;
assert(xlimit<32);
SUM_X:for (i=0;i<=xlimit; i++) {
X_accum += A[i];
X[i] = X_accum;
}
assert(ylimit<16);
SUM_Y:for (i=0;i<=ylimit; i++) {
Y_accum += B[i];
Y[i] = Y_accum;
}
}
Except for the assert macros, this code is the same as that shown in Example 3-15. There are
two important differences in the synthesis report after synthesis.
Without the assert macros, the report is as follows, showing that the loop tripcount can vary
from 1 to 256 because the variables for the loop-bounds are of data type dsel_t, which is an
8-bit variable.
* Loop Latency:
+----------+-----------+----------+
|Target II |Trip Count |Pipelined |
+----------+-----------+----------+
|- SUM_X |1 ~ 256 |no |
|- SUM_Y |1 ~ 256 |no |
+----------+-----------+----------+
In the version with the assert macros, the report shows that loops SUM_X and SUM_Y have
maximum Tripcounts of 32 and 16. Because the assertions state that the values will never be
greater than 32 and 16, Vivado HLS can use this in the reporting.
* Loop Latency:
+----------+-----------+----------+
|Target II |Trip Count |Pipelined |
+----------+-----------+----------+
|- SUM_X |1 ~ 32 |no |
|- SUM_Y |1 ~ 16 |no |
+----------+-----------+----------+
In addition, and unlike using the Tripcount directive, the assert statements can provide
more optimal hardware. In the case without assertions, the final hardware uses variables
and counters that are sized for a maximum of 256 loop iterations.
* Expression:
+----------+------------------------+-------+---+----+
|Operation |Variable Name |DSP48E |FF |LUT |
+----------+------------------------+-------+---+----+
|+ |X_accum_1_fu_182_p2 |0 |0 |13 |
|+ |Y_accum_1_fu_209_p2 |0 |0 |13 |
|+ |indvar_next6_fu_158_p2 |0 |0 |9 |
|+ |indvar_next_fu_194_p2 |0 |0 |9 |
|+ |tmp1_fu_172_p2 |0 |0 |9 |
|+ |tmp_fu_147_p2 |0 |0 |9 |
|icmp |exitcond1_fu_189_p2 |0 |0 |9 |
|icmp |exitcond_fu_153_p2 |0 |0 |9 |
+----------+------------------------+-------+---+----+
|Total | |0 |0 |80 |
+----------+------------------------+-------+---+----+
The code which asserts the variable ranges are smaller than the maximum possible range
results in a smaller RTL design.
* Expression:
+----------+------------------------+-------+---+----+
|Operation |Variable Name |DSP48E |FF |LUT |
+----------+------------------------+-------+---+----+
|+ |X_accum_1_fu_176_p2 |0 |0 |13 |
|+ |Y_accum_1_fu_207_p2 |0 |0 |13 |
|+ |i_2_fu_158_p2 |0 |0 |6 |
|+ |i_3_fu_192_p2 |0 |0 |5 |
|icmp |tmp_2_fu_153_p2 |0 |0 |7 |
|icmp |tmp_9_fu_187_p2 |0 |0 |6 |
+----------+------------------------+-------+---+----+
|Total | |0 |0 |50 |
+----------+------------------------+-------+---+----+
Assertions can indicate the range of any variable in the design. It is important to execute a
C simulation that covers all possible cases when using assertions. This will confirm that the
assertions that Vivado HLS uses are valid.
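A test bench that sweeps every legal value of an asserted variable is one way to obtain that coverage. The sketch below is a reduced version of the SUM_X loop with a hypothetical checker. The typedefs are plain C++ stand-ins for the types declared in the header, and sum_x and check_all_limits are illustrative names.

```cpp
#include <cassert>

typedef int din_t;             // Hypothetical stand-ins for the header's types.
typedef int dout_t;
typedef unsigned char dsel_t;  // 8-bit selector, as in the text.
const int N = 32;

// Running sum of A[0..xlimit], with the range of xlimit asserted.
void sum_x(din_t A[N], dout_t X[N], dsel_t xlimit) {
    dout_t X_accum = 0;
    assert(xlimit < 32);
    for (int i = 0; i <= xlimit; i++) {
        X_accum += A[i];
        X[i] = X_accum;
    }
}

// Test-bench helper: exercises every value of xlimit the assertion allows.
bool check_all_limits() {
    din_t A[N];
    for (int i = 0; i < N; i++) A[i] = i + 1;
    for (int lim = 0; lim < 32; lim++) {
        dout_t X[N] = {0};
        sum_x(A, X, (dsel_t)lim);
        // X[lim] must equal 1 + 2 + ... + (lim + 1).
        if (X[lim] != (lim + 1) * (lim + 2) / 2) return false;
    }
    return true;
}
```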
SystemC Synthesis
Vivado HLS supports SystemC (IEEE standard 1666), a C++ class library used to model
hardware. The library is available at the Accellera website (www.accellera.org). For synthesis,
Vivado HLS supports the SystemC Synthesizable Subset (Draft 1.3) for SystemC version 2.1.
This section provides information on the synthesis of SystemC functions with Vivado HLS.
This information is in addition to the information in the earlier chapters, C for Synthesis and
C++ for Synthesis. Xilinx recommends that you read those chapters to fully understand the
basic rules of coding for synthesis.
IMPORTANT: As with C and C++ designs, the top-level function for synthesis must be a function below
the top-level for C compilation sc_main(). The sc_main() function cannot be the top-level function
for synthesis.
Design Modeling
The top-level for synthesis must be an SC_MODULE. Designs can be synthesized if modeled
using the SystemC constructor processes SC_METHOD, SC_CTHREAD and the
SC_HAS_PROCESS macro or if SC_MODULES are instantiated inside other SC_MODULES.
The top-level SC_MODULE in the design cannot be a template. Templates can be used only
on submodules.
The module constructor can only define or instantiate modules. It cannot contain any
functionality.
An SC_MODULE cannot be defined inside another SC_MODULE (although they can be
instantiated, as discussed later).
SC_MODULE(nested1)
{
SC_MODULE(nested2)
{
sc_in<int> in0;
sc_out<int> out0;
SC_CTOR(nested2)
{
SC_METHOD(process);
sensitive<<in0;
}
void process()
{
int var =10;
out0.write(in0.read()+var);
}
};
sc_in<int> in0;
sc_out<int> out0;
nested2 nd;
SC_CTOR(nested1)
:nd("nested2")
{
nd.in0(in0);
nd.out0(out0);
}
};
SC_MODULE(nested2)
{
sc_in<int> in0;
sc_out<int> out0;
SC_CTOR(nested2)
{
SC_METHOD(process);
sensitive<<in0;
}
void process()
{
int var =10;
out0.write(in0.read()+var);
}
};
SC_MODULE(nested1)
{
sc_in<int> in0;
sc_out<int> out0;
nested2 nd;
SC_CTOR(nested1)
:nd("nested2")
{
nd.in0(in0);
nd.out0(out0);
}
};
SC_MODULE(BASE)
{
sc_in<bool> clock; //clock input
sc_in<bool> reset;
SC_CTOR(BASE) {}
};
Cases such as the following (SC_MODULE Example Four) should be transformed as shown
in SC_MODULE Example Five.
SC_MODULE(dut) {
sc_in<int> in0;
sc_out<int>out0;
SC_HAS_PROCESS(dut);
dut(sc_module_name nm);
…
};
dut::dut(sc_module_name nm)
{
SC_METHOD(process);
sensitive<<in0;
}
SC_MODULE(dut) {
sc_in<int> in0;
sc_out<int>out0;
SC_HAS_PROCESS(dut);
dut(sc_module_name nm)
:sc_module(nm)
{
SC_METHOD(process);
sensitive<<in0;
}
…
};
Using SC_METHOD
The following code example shows the header file (sc_combo_method.h) for a small
combinational design modeled using an SC_METHOD to model a half-adder. The top-level
design name (sc_combo_method) is specified in the SC_MODULE.
#include <systemc.h>
SC_MODULE(sc_combo_method){
//Ports
sc_in<sc_uint<1> > a,b;
sc_out<sc_uint<1> > sum,carry;
//Process Declaration
void half_adder();
//Constructor
SC_CTOR(sc_combo_method){
//Process Registration
SC_METHOD(half_adder);
sensitive<<a<<b;
}
};
The design has two single-bit input ports (a and b). The SC_METHOD is sensitive to any
changes in the state of either input port and executes function half_adder. The function
half_adder is specified in the file sc_combo_method.cpp shown in the following code
example. It calculates the values for output ports sum and carry.
#include "sc_combo_method.h"
void sc_combo_method::half_adder(){
bool s,c;
s=a.read() ^ b.read();
c=a.read() & b.read();
sum.write(s);
carry.write(c);
#ifndef __SYNTHESIS__
cout << "Sum is " << a << " ^ " << b << " = " << s << ": " <<
sc_time_stamp() <<endl;
cout << "Car is " << a << " & " << b << " = " << c << ": " <<
sc_time_stamp() <<endl;
#endif
}
Example 3-61 shows how any cout statements used to display values during C simulation
can be protected from synthesis using the __SYNTHESIS__ macro.
Note: Only use the __SYNTHESIS__ macro in the code to be synthesized. Do not use this macro in the
test bench, because it is not obeyed by C simulation or C RTL co-simulation.
The following code example shows the test bench for Example 3-61. This test bench
displays several important attributes required when using Vivado HLS.
#ifdef __RTL_SIMULATION__
#include "sc_combo_method_rtl_wrapper.h"
#define sc_combo_method sc_combo_method_RTL_wrapper
#else
#include "sc_combo_method.h"
#endif
#include "tb_init.h"
#include "tb_driver.h"
sc_signal<bool> s_reset;
sc_signal<sc_uint<1> > s_a;
sc_signal<sc_uint<1> > s_b;
sc_signal<sc_uint<1> > s_sum;
sc_signal<sc_uint<1> > s_carry;
tb_init U_tb_init("U_tb_init");
sc_combo_method U_dut("U_dut");
tb_driver U_tb_driver("U_tb_driver");
// start simulation
sc_start(end_time, SC_NS);
if (U_tb_driver.retval != 0) {
printf("Test failed !!!\n");
} else {
printf("Test passed !\n");
}
return U_tb_driver.retval;
};
To perform RTL simulation using the cosim_design feature in Vivado HLS, the test bench
must contain the macros shown at the top of Example 3-62. For a design named DUT, the
following must be used, where DUT is replaced with the actual design name.
#ifdef __RTL_SIMULATION__
#include "DUT_rtl_wrapper.h"
#define DUT DUT_RTL_wrapper
#else
#include "DUT.h" //Original unmodified code
#endif
You must add this to the test bench in which the design header file is included. Otherwise,
cosim_design RTL simulation fails.
RECOMMENDED: Add the report handler functions shown in Example 3-62 to all SystemC test bench
files used with Vivado HLS.
sc_report_handler::set_actions("/IEEE_Std_1666/deprecated", SC_DO_NOTHING);
sc_report_handler::set_actions( SC_ID_LOGIC_X_TO_BOOL_, SC_LOG);
sc_report_handler::set_actions( SC_ID_VECTOR_CONTAINS_LOGIC_VALUE_, SC_LOG);
sc_report_handler::set_actions( SC_ID_OBJECT_EXISTS_, SC_LOG);
These settings prevent the printing of extraneous messages during RTL simulation.
The adapters placed around the synthesized design start with unknown (X) values. Not all
SystemC types support unknown (X) values. A warning is issued when unknown (X) values
are applied to types that do not support them, typically before the stimuli are
applied from the test bench. This warning can generally be ignored.
Finally, the test bench in Example 3-62 performs checking on the results and
returns a value of zero if the results are correct. In this case, the results are verified inside
function tb_driver but the return value is checked and returned in the top-level test
bench.
if (U_tb_driver.retval != 0) {
printf("Test failed !!!\n");
} else {
printf("Test passed !\n");
}
return U_tb_driver.retval;
Instantiating SC_MODULES
Hierarchical instantiations of SC_MODULEs can be synthesized, as shown in the following
code example. In this code example, two instances of the half-adder design
(sc_combo_method) from Example 3-60 are instantiated to create a full-adder design.
#include <systemc.h>
#include "sc_combo_method.h"
SC_MODULE(sc_hier_inst){
//Ports
sc_in<sc_uint<1> > a, b, carry_in;
sc_out<sc_uint<1> > sum, carry_out;
//Variables
sc_signal<sc_uint<1> > carry1, sum_int, carry2;
//Process Declaration
void full_adder();
//Half-Adder Instances
sc_combo_method U_1, U_2;
//Constructor
SC_CTOR(sc_hier_inst)
:U_1("U_1")
,U_2("U_2")
{
// Half-adder inst 1
U_1.a(a);
U_1.b(b);
U_1.sum(sum_int);
U_1.carry(carry1);
// Half-adder inst 2
U_2.a(sum_int);
U_2.b(carry_in);
U_2.sum(sum);
U_2.carry(carry2);
//Process Registration
SC_METHOD(full_adder);
sensitive<<carry1<<carry2;
}
};
The function full_adder is used to create the logic for the carry_out signal, as shown
in the following code example.
#include "sc_hier_inst.h"
void sc_hier_inst::full_adder(){
carry_out= carry1.read() | carry2.read();
}
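The structure above can be checked without SystemC using a plain C++ model. The following is a minimal sketch, not the guide's code: it assumes the half adder from Example 3-60 computes sum as the XOR of its inputs and carry as their AND, and the names half_adder and full_adder_model are hypothetical.

```cpp
// Plain C++ stand-in for one sc_combo_method half-adder instance.
// Assumption: sum = a XOR b, carry = a AND b.
struct HalfAdderResult {
    bool sum;
    bool carry;
};

HalfAdderResult half_adder(bool a, bool b) {
    return {a != b, a && b};
}

struct FullAdderResult {
    bool sum;
    bool carry_out;
};

// Mirrors the wiring in sc_hier_inst: U_1 adds a and b, U_2 adds the
// intermediate sum and carry_in, and carry_out is the OR of both carries.
FullAdderResult full_adder_model(bool a, bool b, bool carry_in) {
    HalfAdderResult u1 = half_adder(a, b);             // U_1
    HalfAdderResult u2 = half_adder(u1.sum, carry_in); // U_2
    return {u2.sum, u1.carry || u2.carry};             // full_adder()
}
```

Calling full_adder_model(true, true, false) returns a sum of false and a carry_out of true, which matches 1 + 1 = binary 10.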
Using SC_CTHREAD
The constructor process SC_CTHREAD is used to model clocked processes (threads) and is
the primary way to model sequential designs. The following code example shows a case
that highlights the primary attributes of a sequential design.
• The data has associated handshake signals, allowing it to operate with the same test
bench before and after synthesis.
• An SC_CTHREAD sensitive on the clock is used to model when the function is executed.
• The SC_CTHREAD supports reset behavior.
#include <systemc.h>
SC_MODULE(sc_sequ_cthread){
//Ports
sc_in <bool> clk;
sc_in <bool> reset;
sc_in <bool> start;
sc_in<sc_uint<16> > a;
sc_in<bool> en;
sc_out<sc_uint<16> > sum;
sc_out<bool> vld;
//Variables
sc_uint<16> acc;
//Process Declaration
void accum();
//Constructor
SC_CTOR(sc_sequ_cthread){
//Process Registration
SC_CTHREAD(accum,clk.pos());
reset_signal_is(reset,true);
}
};
Function accum is shown in the following code example. This example demonstrates:
• The core modeling process is an infinite while() loop with a wait() statement inside
it.
• Any initialization of the variables is performed before the infinite while() loop. This
code is executed when reset is recognized by the SC_CTHREAD.
• The data reads and writes are qualified by handshake protocols.
#include "sc_sequ_cthread.h"
void sc_sequ_cthread::accum(){
//Initialization
acc=0;
sum.write(0);
vld.write(false);
wait();
Synthesis of Loops
When coding with loops, you must account for the Vivado HLS SystemC scheduling rule in
which Vivado HLS always synthesizes a loop by starting in a new state. For example, given
the following design:
process code:
unsigned count = 0;
while (!start.read()) wait();
for(int i=0;i<100; i++)
{
if(enable.read()) count++;
wait();
}
test bench code:
start = true;
enable=true;
wait(1);
start = false;
wait(99);
enable=false;
This design executes during C simulation and samples the enable signal. Then, count
reaches 100. After synthesis, the SystemC loop scheduling rule requires the loop to start
with a new state and any operations in the loop to be scheduled after this point. For
example, the following code shows a wait statement called First Loop Clock:
sc_in<bool> start;
sc_in<bool> enable;
process code:
unsigned count = 0;
while (!start.read()) wait();
for(int i=0;i<100; i++)
{
wait(); //First Loop Clock
if(enable.read()) count++;
wait();
}
After the initial clock samples the start signal, there is a 2 clock cycle delay before the new
clock samples the enable signal for the first time. This new clock occurs at the same time
as the second clock in the test bench, which is the first clock in the series of 99 clocks. On
the third test bench clock, which is the second clock in the series of 99 clocks, the clock
samples the enable signal for the first time. In this case, the RTL design only counts to 99
before enable is set to false.
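The lost count can be illustrated with a deliberately simplified plain C++ model. This is a sketch under stated assumptions rather than a cycle-accurate reproduction of the RTL: it assumes the test bench holds enable high for exactly 100 consecutive clock edges (numbered 1 to 100), that the loop takes one sample per clock, and that the only difference after synthesis is that the first sample happens one edge later. The function names are hypothetical.

```cpp
#include <cassert>

// Assumed test bench stimulus: enable is observed high on clock edges 1..100.
bool enable_at_edge(int edge) {
    return edge >= 1 && edge <= 100;
}

// Counts how often enable is seen high when the loop samples it once per
// clock for `iterations` clocks, taking its first sample at `first_sample_edge`.
unsigned count_enabled_samples(int first_sample_edge, int iterations) {
    assert(iterations >= 0);
    unsigned count = 0;
    for (int i = 0; i < iterations; ++i) {
        if (enable_at_edge(first_sample_edge + i)) {
            ++count;
        }
    }
    return count;
}
```

In this model, count_enabled_samples(1, 100) gives 100 (the C simulation case), while count_enabled_samples(2, 100) gives 99, because the last sample falls after enable has been deasserted.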
RECOMMENDED: When coding loops in SystemC, Xilinx highly recommends that you place the wait()
statement as the first item in a loop.
In the following example, the wait() statement is the first clock or state in the synthesized
loop:
sc_in<bool> start;
sc_in<bool> enable;
process code:
unsigned count = 0;
while (!start.read()) wait();
for(int i=0;i<100; i++)
{
wait(); // Put the 'wait()' at the beginning of the loop
if(enable.read()) count++;
}
Unlike C and C++ synthesis, SystemC supports designs with multiple clocks. In a multiple
clock design, the functionality associated with each clock must be captured in an
SC_CTHREAD.
The following code example shows a design with two clocks (clock and clock2).
After synthesis, all the sequential logic associated with function Prc1 is clocked by clock,
while clock2 drives all the sequential logic of function Prc2.
#include <systemc.h>
#include <tlm.h>
using namespace tlm;
SC_MODULE(sc_multi_clock)
{
//Ports
sc_in <bool> clock;
sc_in <bool> clock2;
sc_in <bool> reset;
sc_in <bool> start;
sc_out<bool> done;
sc_fifo_out<int> dout;
sc_fifo_in<int> din;
//Variables
int share_mem[100];
bool write_done;
//Process Declaration
void Prc1();
void Prc2();
//Constructor
SC_CTOR(sc_multi_clock)
{
//Process Registration
SC_CTHREAD(Prc1,clock.pos());
reset_signal_is(reset,true);
SC_CTHREAD(Prc2,clock2.pos());
reset_signal_is(reset,true);
}
};
Communication Channels
Communication between threads, methods, and modules (which themselves contain
threads and methods) should only be performed using channels. Do not use simple
variables for communication between threads.
For sc_fifo and tlm_fifo, the following methods are supported for synthesis:
• Non-blocking read/write
• Blocking read/write
• num_available()/num_free()
• nb_can_put()/nb_can_get()
All ports on the top-level interface must be one of the following types:
• sc_in_clk
• sc_in
• sc_out
• sc_inout
• sc_fifo_in
• sc_fifo_out
• ap_mem_if
• AXI4M_bus_port
Except for the supported memory interfaces, all handshaking between the design and the
test bench must be explicitly modeled in the SystemC function. The supported memory
interfaces are:
• sc_fifo_in
• sc_fifo_out
• ap_mem_if
Vivado HLS might add additional clock cycles to a SystemC design if required to meet
timing. Because the number of clock cycles after synthesis might be different, SystemC
designs should handshake all data transfers with the test bench.
Vivado HLS does not support transaction level modeling using TLM 2.0 and event-based
modeling for synthesis.
Unlike the synthesis of C and C++, Vivado HLS does not transform array ports into RTL RAM
ports. In the following SystemC code, you must use Vivado HLS directives to partition the
array ports into individual elements.
SC_MODULE(dut)
{
sc_in<T> in0[N];
sc_out<T>out0[N];
…
SC_CTOR(dut)
{
…
}
};
If N is a large number, this results in many individual scalar ports on the RTL interface.
The following code example shows how a RAM interface can be modeled in SystemC
simulation and fully synthesized by Vivado HLS. In this code example, the arrays are
replaced by ap_mem_if types that can be synthesized into RAM ports.
#include <systemc.h>
#include "ap_mem_if.h"
SC_MODULE(sc_RAM_port)
{
//Ports
sc_in <bool> clock;
sc_in <bool> reset;
sc_in <bool> start;
sc_out<bool> done;
//sc_out<int> dout[100];
//sc_in<int> din[100];
ap_mem_port<int, int, 100, RAM_2P> dout;
ap_mem_port<int, int, 100, RAM_2P> din;
//Variables
int share_mem[100];
sc_signal<bool> write_done;
//Process Declaration
void Prc1();
void Prc2();
//Constructor
SC_CTOR(sc_RAM_port)
: dout ("dout"),
din ("din")
{
//Process Registration
SC_CTHREAD(Prc1,clock.pos());
reset_signal_is(reset,true);
SC_CTHREAD(Prc2,clock.pos());
reset_signal_is(reset,true);
}
};
• The data_type is the type used for the stored data elements. In Example 3-69, these
are standard int types.
• The address_type is the type used for the address bus. This type should have
enough data bits to address all elements in the array, or C simulation fails.
• The number_of_elements specifies the number of elements in the array being
modeled.
• The Mem_Target specifies the memory to which this port will connect and therefore
determines the I/O ports on the final RTL. For a list of the available targets, see the
following table.
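As a quick check for the address_type requirement in the list above, the minimum address width for a given number_of_elements can be computed as follows. This is a minimal sketch in plain C++; the helper name min_address_bits is hypothetical and is not part of ap_mem_if.h.

```cpp
#include <cassert>
#include <cstddef>

// Returns the smallest number of address bits that can index `elements`
// distinct locations (addresses 0 .. elements-1).
unsigned min_address_bits(std::size_t elements) {
    assert(elements > 0);
    std::size_t highest_address = elements - 1;
    unsigned bits = 0;
    while (highest_address != 0) {
        ++bits;
        highest_address >>= 1;
    }
    return bits;
}
```

For the 100-element arrays used in this section, min_address_bits(100) returns 7, so the int address type used there is more than wide enough.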
The memory targets described in the following table influence both the ports created by
synthesis and how the operations are scheduled in the design. For example, a dual-port
RAM:
After the ap_mem_port has been defined on the interface, the variables are accessed in the
code in the same manner as any other arrays:
The test bench to support Example 3-69 is shown in the following code example. The
ap_mem_port type must be supported by an ap_mem_chn type in the test bench. The
ap_mem_chn type is defined in the header file ap_mem_if.h and supports the same fields
as ap_mem_port.
#ifdef __RTL_SIMULATION__
#include "sc_RAM_port_rtl_wrapper.h"
#define sc_RAM_port sc_RAM_port_RTL_wrapper
#else
#include "sc_RAM_port.h"
#endif
#include "tb_init.h"
#include "tb_driver.h"
#include "ap_mem_if.h"
sc_signal<bool> s_reset;
sc_signal<bool> s_start;
sc_signal<bool> s_done;
ap_mem_chn<int,int, 100, RAM_2P> dout;
ap_mem_chn<int,int, 100, RAM_2P> din;
sc_clock s_clk("s_clk",10,SC_NS);
tb_init U_tb_init("U_tb_init");
sc_RAM_port U_dut("U_dut");
tb_driver U_tb_driver("U_tb_driver");
// Sim
int end_time = 1100;
// start simulation
sc_start(end_time, SC_NS);
if (U_tb_driver.retval != 0) {
printf("Test failed !!!\n");
} else {
printf("Test passed !\n");
}
return U_tb_driver.retval;
};
FIFO ports on the top-level interface can be synthesized directly from the standard SystemC
sc_fifo_in and sc_fifo_out ports. For an example of using FIFO ports on the
interface, see the following code example.
After synthesis, each FIFO port has a data port and associated FIFO control signals.
By using FIFO ports, the handshake required to synchronize data transfers is added in the
RTL test bench.
#include <systemc.h>
#include <tlm.h>
using namespace tlm;
SC_MODULE(sc_FIFO_port)
{
//Ports
sc_in <bool> clock;
sc_in <bool> reset;
sc_in <bool> start;
sc_out<bool> done;
sc_fifo_out<int> dout;
sc_fifo_in<int> din;
//Variables
int share_mem[100];
bool write_done;
//Process Declaration
void Prc1();
void Prc2();
//Constructor
SC_CTOR(sc_FIFO_port)
{
//Process Registration
SC_CTHREAD(Prc1,clock.pos());
reset_signal_is(reset,true);
SC_CTHREAD(Prc2,clock.pos());
reset_signal_is(reset,true);
}
};
Instantiating Modules
SC_MODULE(TOP)
{
sc_in<T> din;
sc_out<T> dout;
M1 *t0;
SC_CTOR(TOP){
t0 = new M1("t0");
t0->din(din);
t0->dout(dout);
}
};
SC_MODULE(TOP)
{
sc_in<T> din;
sc_out<T> dout;
M1 t0;
SC_CTOR(TOP)
: t0("t0")
{
t0.din(din);
t0.dout(dout);
}
};
Module Constructors
Only name parameters can be used with module constructors. Passing on variable temp of
type int is not allowed. See the following example.
SC_MODULE(dut) {
sc_in<int> in0;
sc_out<int>out0;
int var;
SC_HAS_PROCESS(dut);
dut(sc_module_name nm, int temp)
:sc_module(nm),var(temp)
{ … }
};
Virtual Functions
Vivado HLS does not support virtual functions. Because the following code uses a virtual
function, it cannot be synthesized.
SC_MODULE(DUT)
{
sc_in<int> in0;
sc_out<int>out0;
void process()
{
int var=foo(in0.read());
out0.write(var);
}
…
};
SC_MODULE(DUT)
{
sc_in<T> in0;
sc_out<T>out0;
…
void process()
{
int var=in0.read()+out0.read();
out0.write(var);
}
};
Command Reference
add_files
Description
The tool searches the current directory for any header files included in the design source. To
use header files stored in other directories, use the -cflags option to add those
directories to the search path.
Syntax
add_files [OPTIONS] <src_files>
where
• <src_files> lists source files with the description of the design.
Options
-tb
These files are not synthesized. They are used when post-synthesis verification is executed
by the cosim_design command.
This option does not allow design files to be included in the list of source files. Use a
separate add_files command to add design files and test bench files.
Pragma
Examples
add_files a.cpp
add_files b.cpp
add_files c.cpp
Add a SystemC file with compiler flags to enable macro USE_RANDOM and specify an
additional search path, subdirectory ./lib_functions, for header files.
Use the -tb option to add test bench files to the project. This example adds multiple files
with a single command, including:
° input_stimuli.dat
° out.gold.dat.
add_files -tb "a_test.cpp input_stimuli.dat out.gold.dat"
If the test bench data files in the previous example are stored in a separate directory (for
example test_data), the directory can be added to the project in place of the individual
data files.
close_project
Description
Closes the current project. The project is no longer active in the Vivado® HLS session.
Syntax
close_project
Options
Pragma
Examples
close_project
close_solution
Description
Closes the current solution. The current solution is no longer active in the Vivado HLS
session.
Syntax
close_solution
Options
Pragma
Examples
close_solution
config_array_partition
Description
Syntax
config_array_partition [OPTIONS]
Options
Sets the threshold for partitioning arrays (including those without constant indexing).
Arrays with fewer elements than the specified threshold limit are partitioned into individual
elements, unless interface or core specification is applied on the array. The default is 4.
Arrays with fewer elements than the specified threshold limit, and that have
constant-indexing (the indexing is not variable), are partitioned into individual elements.
The default is 64.
-exclude_extern_globals
-include_ports
This reduces an array I/O port into multiple ports. Each port is the size of the individual
array elements.
-scalarize_all
-throughput_driven
Vivado HLS determines whether partitioning the array into individual elements allows it to
meet any specified throughput requirements.
Pragma
Examples
Partitions all arrays in the design with less than 12 elements (but not global arrays) into
individual elements.
Instructs Vivado HLS to determine which arrays to partition (including arrays on the
function interface) to improve throughput.
Partitions all arrays in the design (including global arrays) into individual elements.
config_array_partition -scalarize_all
config_bind
Description
Binding is the process in which operators (such as addition, multiplication, and shift) are
mapped to specific RTL implementations. For example, a mult operation can be implemented
as a combinational or pipelined RTL multiplier.
Syntax
config_bind [OPTIONS]
Options
-effort (low|medium|high)
The optimizing effort level controls the trade-off between run time and optimization.
• A Low effort optimization improves the run time and might be useful for cases in which
little optimization is possible. For example, when all if-else statements have
mutually exclusive operators in each branch and no operator sharing can be achieved.
• A High effort optimization results in increased run time, but typically gives better
results.
Minimizes the number of instances of a particular operator. If there are multiple such
operators in the code, they are shared onto the fewest number of RTL resources (cores).
• add - Addition
• sub - Subtraction
• mul - Multiplication
• icmp - Integer Compare
• sdiv - Signed Division
• udiv - Unsigned Division
• srem - Signed Remainder
• urem - Unsigned Remainder
• lshr - Logical Shift-Right
• ashr - Arithmetic Shift-Right
• shl - Shift-Left
Pragma
Examples
Minimizes the number of multiplication operators, resulting in RTL with the fewest number
of multipliers.
config_compile
Description
Syntax
config_compile [OPTIONS]
Options
-name_max_length <threshold>
Specifies the maximum length of the function names. If the length of the name is higher
than the threshold, the last part of the name is truncated. The default is 30.
-no_signed_zeros
Ignores the signedness of floating-point zero so that the compiler can perform aggressive
optimizations on floating-point operations. The default is off.
-pipeline_loops <threshold>
Specifies the lower threshold used when pipelining loops automatically. The default is no
automatic loop pipelining.
If the option is applied, the innermost loop with a tripcount higher than the threshold is
pipelined, or if the tripcount of the innermost loop is less than or equal to the threshold, its
parent loop is pipelined. If the innermost loop has no parent loop, the innermost loop is
pipelined regardless of its tripcount.
The higher the threshold, the more likely it is that the parent loop is pipelined and the run
time is increased.
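The selection rule described above can be written out as a small decision function. This is only an illustrative sketch of the rule as stated in the text, in plain C++; the type and function names are hypothetical and do not correspond to anything in the tool.

```cpp
// Which loop the -pipeline_loops rule selects for pipelining.
enum class PipelineTarget { InnermostLoop, ParentLoop };

// innermost_tripcount: tripcount of the innermost loop
// threshold:           value passed to -pipeline_loops
// has_parent_loop:     whether the innermost loop is nested in another loop
PipelineTarget choose_pipeline_target(unsigned innermost_tripcount,
                                      unsigned threshold,
                                      bool has_parent_loop) {
    // With no parent loop, the innermost loop is pipelined regardless.
    if (!has_parent_loop) {
        return PipelineTarget::InnermostLoop;
    }
    // Tripcount above the threshold: pipeline the innermost loop.
    // Tripcount less than or equal to the threshold: pipeline its parent.
    return innermost_tripcount > threshold ? PipelineTarget::InnermostLoop
                                           : PipelineTarget::ParentLoop;
}
```

With a threshold of 30, a nested innermost loop with a tripcount of 30 selects the parent loop, while a tripcount of 31 selects the innermost loop.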
-unsafe_math_optimizations
Pragma
Examples
Pipeline the innermost loop with a tripcount higher than 30, or pipeline the parent loop of
the innermost loop when its tripcount is less than or equal to 30.
config_compile -pipeline_loops 30
config_compile -no_signed_zeros
Ignore the signedness of floating-point zero and enable the associative floating-point
operations.
config_compile -unsafe_math_optimizations
config_dataflow
Description
• Specifies the default behavior of dataflow pipelining (implemented by the
set_directive_dataflow command).
• Allows you to specify the default channel memory type and depth.
Syntax
config_dataflow [OPTIONS]
Options
-default_channel (fifo|pingpong)
By default, a RAM memory, configured in pingpong fashion, is used to buffer the data
between functions or loops when dataflow pipelining is used. When streaming data is used
(that is, the data is always read and written in consecutive order), a FIFO memory is more
efficient and can be selected as the default memory type.
TIP: Set arrays to streaming using the set_directive_stream command to perform FIFO accesses.
This option has no effect when pingpong memories are used. If not specified, the FIFOs
used in the channel are set to the size of the largest producer or consumer (whichever is
larger). In some cases, this might be too conservative and introduce FIFOs that are larger
than necessary. Use this option when you know that the FIFOs are larger than required.
CAUTION! Be careful when using this option. Incorrect use might result in a design that fails to operate
correctly.
Pragma
Examples
config_dataflow -default_channel fifo -fifo_depth 6
Changes the default channel from pingpong memories to a FIFO with a depth of 6.
CAUTION! If the design implementation requires a FIFO with greater than six elements, this setting
results in a design that fails RTL verification. Be careful when using this option, because it is a user
override.
config_interface
Description
Specifies the default interface option used to implement the RTL port of each function
during interface synthesis.
Syntax
config_interface [OPTIONS]
Options
-clock_enable
The clock enable prevents all clock operations when it is active-Low. It disables all
sequential operations.
-expose_global
If a variable is created as a global, but all read and write accesses are local to the design, the
resource is created in the design. There is no need for an I/O port in the RTL.
RECOMMENDED: If you expect the global variable to be an external source or destination outside the
RTL block, create ports using this option.
-m_axi_addr64
Globally enables 64-bit addressing for all M_AXI ports in the design.
-m_axi_offset (off|direct|slave)
Globally controls the offset ports of all M_AXI interfaces in the design.
• off (default)
• direct
• slave
-register_io (off|scalar_in|scalar_out|scalar_all)
Globally controls turning on registers for all inputs/outputs on the top function. The default
is off.
-trim_dangling_port
By default, all members of an unpacked struct at the block interface become RTL ports
regardless of whether they are used or not by the design block. Setting this switch to on
removes all interface ports that are not used in some way by the block generated.
Pragma
Examples
• Exposes global variables as I/O ports.
• Adds a clock enable port.
config_interface -expose_global -clock_enable
config_rtl
Description
Configures various attributes of the output RTL, the type of reset used, and the encoding of
the state machines. It also allows you to use specific identification in the RTL.
By default, these options are applied to the top-level design and all RTL blocks within the
design. You can optionally specify a specific RTL model.
Syntax
config_rtl [OPTIONS] <model_name>
Options
Places the contents of file <string> at the top (as comments) of all output RTL and
simulation files.
TIP: Use this option to ensure that the output RTL files contain user specified identification.
-reset (none|control|state|all)
Variables initialized in the C code are always initialized to the same value in the RTL and
therefore in the bitstream. This initialization is performed only at power-on. It is not
repeated when a reset is applied to the design.
The setting applied with the -reset option determines how registers and memories are
reset.
• none
• control (default)
Resets control registers, such as those used in state machines and those used to
generate I/O protocol signals.
• state
Resets control registers and registers or memories derived from static or global
variables in the C code. Any static or global variable initialized in the C code is reset to
its initialized value.
• all
Resets all registers and memories in the design. Any static or global variable initialized
in the C code is reset to its initialized value.
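The four -reset settings can be summarized as a lookup from register category to reset behavior. The following plain C++ sketch restates the descriptions above; the enum and function names are hypothetical, and the behavior for none (no registers reset) is an assumption, because that setting is not described in the text above.

```cpp
// The four values accepted by the -reset option.
enum class ResetMode { None, Control, State, All };

// Coarse register categories used in the option descriptions.
enum class RegisterKind {
    Control,        // state machines, I/O protocol signals
    StaticOrGlobal, // derived from static or global C variables
    Other           // any other register or memory
};

// Returns true if a register of the given kind receives a reset.
bool is_reset(ResetMode mode, RegisterKind kind) {
    switch (mode) {
        case ResetMode::None:
            return false; // assumption: no reset is applied at all
        case ResetMode::Control:
            return kind == RegisterKind::Control;
        case ResetMode::State:
            return kind == RegisterKind::Control ||
                   kind == RegisterKind::StaticOrGlobal;
        case ResetMode::All:
            return true;
    }
    return false; // unreachable for valid enum values
}
```

For example, a register derived from a static variable is left out of the reset under the default control setting but is included under state and all.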
-reset_async
-reset_level (low|high)
-encoding (binary|onehot|gray)
Specifies the encoding style used by the state machine of the design.
With auto encoding, Vivado® HLS determines the style of encoding. However, the Xilinx®
logic synthesis tool Vivado can extract and re-implement the FSM style during logic
synthesis. If any other encoding style is selected, the encoding style cannot be re-optimized
by the Xilinx logic synthesis tool.
Pragma
Examples
Configures the output RTL to have all registers reset with an asynchronous active-Low reset.
config_schedule
Description
Syntax
config_schedule [OPTIONS]
Options
-effort (high|medium|low)
-verbose
Prints out the critical path when scheduling fails to satisfy any directives or constraints.
Pragma
Examples
cosim_design
Description
Executes post-synthesis co-simulation of the synthesized RTL with the original C-based test
bench.
To specify the files for the test bench run the following command:
add_files -tb
where
° ap_vld
° ap_ovld
° ap_hs
° ap_memory
° ap_fifo
° ap_bus
The interface modes use a write valid signal to specify when an output is written.
Syntax
cosim_design [OPTIONS]
Options
-reduce_diskspace
This option enables disk space saving flow. It helps to reduce disk space used during
simulation, but with possibly larger run time and memory usage.
-rtl (vhdl|verilog)
Specifies which RTL to use for C/RTL co-simulation. The default is Verilog. You can use the
-tool option to select the HDL simulator. The default is xsim.
-setup
Creates all simulation files in the sim/<HDL> directory of the active solution. The
simulation is not executed.
Specifies the simulator to use to co-simulate the RTL with the C test bench.
Determines the level of waveform tracing during C/RTL co-simulation. Option 'all' results in
all port and signal waveforms being saved to the trace file, and option 'port' only saves
waveform traces for the top-level ports. The trace file is saved in the “sim/<RTL>” directory
of the current solution when the simulation executes. The <RTL> directory depends on the
selection used with the -rtl option: verilog or vhdl.
-compiled_library_dir <string>
Specifies the compiled library directory during simulation with third-party simulators. The
<string> is the path name to the compiled library directory.
-O
Enable optimization to improve the run time performance, if possible, at the expense of
compilation time. Although the resulting executable might potentially run much faster, the
run time improvements are design-dependent. Optimizing for run time might require large
amounts of memory for large functions.
-coverage
Enables the coverage feature during simulation with the VCS simulator.
Disables comparison checking for the first <integer> number of clock cycles.
This is useful when it is known that the RTL will initially start with unknown ('hX) values.
-ldflags <string>
This option is typically used to pass include path information or library information for the
C test bench.
Pragma
Examples
cosim_design
Uses the VCS simulator to verify the Verilog RTL and enable saving of the waveform trace
file.
Verifies the VHDL RTL using ModelSim. Values 5 and 1 are passed to the test bench function
and used in the RTL verification.
create_clock
Description
The command can be executed only in the context of an active solution. The clock period is
a constraint that drives optimization (chaining as many operations as feasible in the given
clock period).
C and C++ designs support only a single clock. For SystemC designs, you can create
multiple named clocks and apply them to different SC_MODULEs using the
set_directive_clock command.
Syntax
create_clock -period <number> [OPTIONS]
Options
-name <string>
Pragma
Examples
create_clock -period 50
create_clock
For SystemC designs, multiple named clocks can be created and applied using
set_directive_clock.
csim_design
Description
Compiles and runs pre-synthesis C simulation using the provided C test bench.
To specify the files for the test bench, use add_files -tb. The simulation working
directory is csim inside the active solution.
Syntax
csim_design [OPTIONS]
Options
-O
-clean
This option is typically used to pass on library information for the C test bench and design.
-setup
Creates the C simulation binary in the csim directory of the active solution. Simulation is
not executed.
Pragma
Examples
csim_design
Compiles source design and test bench to generate the simulation binary. Does not execute
the binary. To run the simulation, execute run.sh in the csim/build directory of the
active solution.
csim_design -O -setup
csynth_design
Description
The command can be executed only in the context of an active solution. The elaborated
design in the database is scheduled and mapped onto RTL, based on any constraints that
are set.
Syntax
csynth_design
Options
Pragma
Examples
csynth_design
delete_project
Description
Syntax
delete_project <project>
where
Options
Pragma
Examples
Deletes Project_1 by removing the directory Project_1 and all its contents.
delete_project Project_1
delete_solution
Syntax
delete_solution <solution>
where
Description
Removes a solution from an active project, and deletes the <solution> subdirectory from
the project directory.
If the solution does not exist in the project directory, the command has no effect.
Pragma
Examples
Deletes solution Solution_1 from the active project by removing the subdirectory
Solution_1 from the active project directory.
delete_solution Solution_1
export_design
Description
Exports and packages the synthesized design in RTL as an IP for downstream tools.
• Vivado IP catalog
• DCP format
• System Generator
The packaged design is under the impl directory of the active solution in one of the
following subdirectories:
• ip
• sysgen
Syntax
export_design [OPTIONS]
Options
-description <string>
-evaluate (verilog|vhdl)
Obtains more accurate timing and utilization data for the specified HDL using RTL synthesis.
-format (sysgen|ip_catalog|syn_dcp)
• sysgen
In a format accepted by System Generator for DSP for Vivado Design Suite (Xilinx 7
series devices only)
• ip_catalog
In format suitable for adding to the Vivado IP Catalog (default for Xilinx 7 series devices)
• syn_dcp
Synthesized checkpoint file for the Vivado Design Suite. If this option is used, RTL
synthesis is automatically executed.
-library <string>
-vendor <string>
-version <string>
Pragma
Examples
Exports RTL in IP catalog. Evaluates the VHDL to obtain better timing and utilization data
(using the Vivado tools).
help
Description
• When used without any <cmd> as an argument, lists all Vivado HLS Tcl commands.
• When used with a Vivado HLS Tcl command as an argument, provides information on
the specified command.
For legal Vivado HLS commands, auto-completion using the tab key is active when typing
the command argument.
Syntax
help [OPTIONS] <cmd>
where
Options
Pragma
Examples
help
help add_files
list_core
Description
Cores are the components used to implement operations in the output RTL (such as adders,
multipliers, and memories).
After elaboration, the operations in the RTL are represented as operators in the internal
database. During scheduling, operators are mapped to cores from the library to implement
the RTL design. Multiple operators can be mapped on the same instance of a core, sharing
the same RTL resource.
The list_core command allows the available operators and cores to be listed by using
the relevant option:
• Operation
• Type
Lists the available cores by type, for example those that implement functional
operations, or those that implement memory or storage operations.
If no options are specified, the command lists all cores in the library.
TIP: Use the information provided by the list_core command with the
set_directive_resource command to implement specific operations onto specific cores.
Syntax
list_core [OPTIONS]
Options
-operation (opers)
Lists the cores in the library that can implement the specified operation. The operations are:
• add - Addition
• sub - Subtraction
• mul - Multiplication
• udiv - Unsigned Division
• urem - Unsigned Remainder (Modulus operator)
• srem - Signed Remainder (Modulus operator)
• icmp - Integer Compare
• shl - Shift-Left
• lshr - Logical Shift-Right
• ashr - Arithmetic Shift-Right
• mux - Multiplexor
• load - Memory Read
• store - Memory Write
• fiforead - FIFO Read
• fifowrite - FIFO Write
• fifonbread - Non-Blocking FIFO Read
• fifonbwrite - Non-Blocking FIFO Write
-type (functional_unit|storage|connector|adapter|ip_block)
• Functional Units
Cores that implement standard RTL operations (such as add, multiply, or compare)
• Storage
• Connectors
Cores used to implement connectivity within the design, including direct connections
and streaming storage elements.
• Adapter
Cores that implement interfaces used to connect the top-level design when IP is
generated. These interfaces are implemented in the RTL wrapper used in the IP
generation flow (Xilinx EDK).
• IP Blocks
Pragma
Examples
Lists all cores in the currently loaded libraries that can implement an add operation.
TIP: Use the set_directive_resource command to implement an array using one of the available
memories.
list_part
Description
• If a family is specified, returns the supported device families or supported parts for that
family.
• If no family is specified, returns all supported families.
TIP: To return parts of a family, specify one of the supported families that was listed when no family
was specified when the command was run.
Syntax
list_part [OPTIONS]
Pragma
Examples
list_part
list_part virtex6
open_project
Description
There can only be one project active at any given time in a Vivado HLS session. A project
can contain multiple solutions.
To close a project:
Use the delete_project command to completely delete the project directory (removing
it from the disk) and any solutions associated with it.
Syntax
open_project [OPTIONS] <project>
where
Options
-reset
• Resets the project by removing any project data that already exists.
• Removes any previous project information on design source files, header file search
paths, and the top level function. The associated solution directories and files are kept,
but might now have invalid results.
Note: The delete_project command accomplishes the same as the -reset option and also
removes all solution data.
RECOMMENDED: Use this option when executing Vivado HLS with Tcl scripts. Otherwise, each new
add_files command adds additional files to the existing data.
Pragma
Examples
open_project Project_1
RECOMMENDED: Use this method with Tcl scripts to prevent adding source or library files to the
existing project data.
open_solution
Description
Opens an existing solution or creates a new one in the currently active project.
CAUTION! Attempting to open or create a solution when there is no active project results in an error.
There can only be one solution active at any given time in a Vivado HLS session.
Each solution is managed in a subdirectory of the current project directory. A new solution
is created if the solution does not yet exist in the current work directory.
To close a solution:
Use the delete_solution command to remove a solution from the project and delete the
corresponding subdirectory.
Syntax
open_solution [OPTIONS] <solution>
where
Options
-reset
• Resets the solution data if the solution already exists. Any previous solution
information on libraries, constraints, and directives is removed.
• Removes synthesis, verification, and implementation results.
Pragma
Examples
open_solution Solution_1
RECOMMENDED: Use this method with Tcl scripts to prevent adding to the existing solution data.
set_clock_uncertainty
Description
The margin is subtracted from the clock period to create an effective clock period. If the
clock uncertainty is not defined in ns or as a percentage, it defaults to 12.5% of the clock
period.
Vivado HLS optimizes the design based on the effective clock period, providing a margin
for downstream tools to account for logic synthesis and routing. The command can be
executed only in the context of an active solution. Vivado HLS still uses the specified clock
period in all output files for verification and implementation.
For SystemC designs in which multiple named clocks are specified by the create_clock
command, you can specify a different clock uncertainty on each named clock by specifying
the named clock.
Syntax
set_clock_uncertainty <uncertainty> <clock_list>
where
• <uncertainty> is a value, specified in ns, representing how much of the clock period is
used as a margin.
• <clock_list> is a list of clocks to which the uncertainty is applied. If none is provided, it
is applied to all clocks.
Pragma
Examples
Specifies an uncertainty or margin of 0.5 ns on the clock. This effectively reduces the clock
period that Vivado HLS can use by 0.5 ns.
set_clock_uncertainty 0.5
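The arithmetic behind the effective clock period can be sketched in C as follows. This is only an illustration of the subtraction described above, not tool code: the function name effective_clock_period and the use of a negative value to mean "no uncertainty specified" are assumptions made for the sketch.

```c
/* Effective clock period used for scheduling, in ns.
 * A negative uncertainty_ns is a hypothetical sentinel meaning
 * "not specified", in which case the 12.5% default applies. */
double effective_clock_period(double period_ns, double uncertainty_ns)
{
    if (uncertainty_ns < 0.0) {
        /* Default margin: 12.5% of the clock period. */
        uncertainty_ns = 0.125 * period_ns;
    }
    return period_ns - uncertainty_ns;
}
```

With an assumed 10 ns clock, an uncertainty of 0.5 ns leaves 9.5 ns for scheduling, and the default margin leaves 8.75 ns.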
This SystemC example creates two clock domains. A different clock uncertainty is
specified on each domain.
TIP: SystemC designs support multiple clocks. Use the set_directive_clock command to apply the clock
to the appropriate function.
set_directive_allocation
Description
This defines, and can limit, the number of RTL instances used to implement specific
functions or operations. For example, if the C source has four instances of a function
foo_sub, the set_directive_allocation command can ensure that there is only one
instance of foo_sub in the final RTL. All four instances are implemented using the same
RTL block.
Syntax
set_directive_allocation [OPTIONS] <location> <instances>
where
The function can be any function in the original C code that has not been:
The list of operators is as follows (provided there is an instance of such an operation in the
C source code):
• add - Addition
• sub - Subtraction
• mul - Multiplication
• icmp - Integer Compare
• sdiv - Signed Division
• udiv - Unsigned Division
• srem - Signed Remainder
• urem - Unsigned Remainder
• lshr - Logical Shift-Right
• ashr - Arithmetic Shift-Right
• shl - Shift-Left
Options
-limit <integer>
Sets a maximum limit on the number of instances (of the type defined by the -type option)
to be used in the RTL design.
-type (function|operation)
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Given a design foo_top with multiple instances of function foo, limits the number of
instances of foo in the RTL to 2.
Limits the number of multipliers used in the implementation of My_func to 1. This limit
does not apply to any multipliers that might reside in sub-functions of My_func. To limit
the multipliers used in the implementation of any sub-functions, specify an allocation
directive on the sub-functions or inline the sub-function into function My_func.
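A minimal sketch of how the two allocation limits described above might appear as pragmas in C source. The function bodies are hypothetical and exist only to give the pragmas something to act on; a standard C compiler ignores the pragmas.

```c
/* Hypothetical sub-function; foo_top calls it from four places. */
int foo(int a, int b)
{
    return a * b + 1;
}

int foo_top(int a, int b, int c, int d)
{
/* Limit the RTL to at most two instances of function foo. */
#pragma HLS allocation instances=foo limit=2 function
    int r0 = foo(a, b);
    int r1 = foo(b, c);
    int r2 = foo(c, d);
    int r3 = foo(d, a);
    return r0 + r1 + r2 + r3;
}

int My_func(int a, int b, int c, int d)
{
/* Limit My_func itself (not its sub-functions) to one multiplier. */
#pragma HLS allocation instances=mul limit=1 operation
    return a * b + c * d;
}
```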
set_directive_array_map
Description
Use the -mode option to determine whether the new target is a concatenation of:
The arrays are concatenated in the order the set_directive_array_map commands are
issued starting at:
Syntax
set_directive_array_map [OPTIONS] <location> <array>
where
• <location> is the location (in the format function[/label]) which contains the array
variable.
• <array> is the array variable to be mapped into the new target array instance.
Options
-instance <string>
Specifies the new array instance name where the current array variable is to be mapped.
-mode (horizontal|vertical)
• Horizontal mapping (the default) concatenates the arrays to form a target with more
elements.
• Vertical mapping concatenates the arrays to form a target with longer words.
-offset <integer>
Specifies an integer value indicating the absolute offset in the target instance for the
current mapping operation. For example:
• Element 0 of the array variable maps to element <int> of the new target.
• Other elements map to <int+1>, <int+2>... of the new target.
If the value is not specified, Vivado HLS calculates the required offset automatically to avoid
any overlap. For example, the arrays are concatenated starting at the next unused element in
the target.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
These commands map arrays A[10] and B[15] in function foo into a single new array AB[25].
Concatenates arrays C and D into a new array CD with same number of bits as C and D
combined. The number of elements in CD is the maximum of C or D.
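The horizontal mapping of A[10] and B[15] into AB[25] might be written with pragmas as in the sketch below. The function body and the macro names NA and NB are hypothetical; only the array names and sizes come from the example text.

```c
#define NA 10
#define NB 15

int foo(const int in[NA + NB])
{
    int A[NA];
    int B[NB];
/* Map A first, then B, into one target AB:
 * A occupies AB[0..9] and B occupies AB[10..24]. */
#pragma HLS array_map variable=A instance=AB horizontal
#pragma HLS array_map variable=B instance=AB horizontal
    int i, sum = 0;

    for (i = 0; i < NA; i++) A[i] = in[i];
    for (i = 0; i < NB; i++) B[i] = in[NA + i];
    for (i = 0; i < NA; i++) sum += A[i];
    for (i = 0; i < NB; i++) sum -= B[i];
    return sum;
}
```

The C behavior is unchanged by the mapping; only the storage used in the RTL differs.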
set_directive_array_partition
Description
This partitioning:
• Results in RTL with multiple small memories or multiple registers instead of one large
memory.
• Effectively increases the number of read and write ports for the storage.
• Potentially improves the throughput of the design.
• Requires more memory instances or registers.
Syntax
set_directive_array_partition [OPTIONS] <location> <array>
where
• <location> is the location (in the format function[/label]) which contains the array
variable.
• <array> is the array variable to be partitioned.
Options
-dim <integer>
• If a value of 0 is used, all dimensions are partitioned with the specified options.
• Any other value partitions only that dimension. For example, if a value 1 is used, only
the first dimension is partitioned.
-type (block|cyclic|complete)
• Block partitioning creates smaller arrays from consecutive blocks of the original array.
This effectively splits the array into N equal blocks where N is the integer defined by
the -factor option.
• Cyclic partitioning creates smaller arrays by interleaving elements from the original
array. For example, if -factor 3 is used:
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Partitions array AB[13] in function foo into four arrays. Because 13 is not an integer
multiple of four, the four arrays cannot all contain the same number of elements.
Partitions array AB[6][4] in function foo into two arrays, each of dimension [6][2].
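A sketch of how these two partitioning examples might look as pragmas. The function bodies are hypothetical, and the second function is named foo_2d here only so that both fit in one listing.

```c
int foo(const int in[13])
{
    int AB[13];
/* Block-partition the single dimension of AB into four smaller arrays. */
#pragma HLS array_partition variable=AB block factor=4 dim=1
    int i, sum = 0;

    for (i = 0; i < 13; i++) AB[i] = 2 * in[i];
    for (i = 0; i < 13; i++) sum += AB[i];
    return sum;
}

int foo_2d(int in[6][4])
{
    int AB[6][4];
/* Partition only dimension 2 into two blocks: two arrays of dimension [6][2]. */
#pragma HLS array_partition variable=AB block factor=2 dim=2
    int i, j, sum = 0;

    for (i = 0; i < 6; i++)
        for (j = 0; j < 4; j++)
            AB[i][j] = in[i][j];
    for (i = 0; i < 6; i++)
        for (j = 0; j < 4; j++)
            sum += AB[i][j];
    return sum;
}
```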
set_directive_array_reshape
Description
Combines array partitioning with vertical array mapping to create a single new array with
fewer elements but wider words.
Syntax
set_directive_array_reshape [OPTIONS] <location> <array>
where
• <location> is the location (in the format function[/label]) that contains the array
variable.
• <array> is the array variable to be reshaped.
Options
-type (block|cyclic|complete)
• Block reshaping creates smaller arrays from consecutive blocks of the original array.
This effectively splits the array into N equal blocks where N is the integer defined by
the -factor option and then combines the N blocks into a single array with
word-width*N. The default is complete.
• Cyclic reshaping creates smaller arrays by interleaving elements from the original array.
For example, if -factor 3 is used, element 0 is assigned to the first new array, element
1 to the second new array, element 2 is assigned to the third new array, and then
element 3 is assigned to the first new array again. The final array is a vertical
concatenation (word concatenation, to create longer words) of the new arrays into a
single array.
• Complete reshaping decomposes the array into temporary individual elements and
then recombines them into an array with a wider word. For a one-dimension array this
is equivalent to creating a very-wide register (if the original array was N elements of M
bits, the result is a register with N*M bits).
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Reshapes 8-bit array AB[17] in function foo, into a new 32-bit array with five elements.
Reshapes array AB[6][4] in function foo, into a new array of dimension [6][2], in which
dimension 2 is twice the width.
Reshapes 8-bit array AB[4][2][2] in function foo into a new single element array (a register),
4*2*2*8(=128)-bits wide.
set_directive_clock
Description
C and C++ designs support only a single clock. The clock period specified by
create_clock is applied to all functions in the design.
SystemC designs support multiple clocks. Multiple named clocks can be specified using the
create_clock command and applied to individual SC_MODULEs using the
set_directive_clock command. Each SC_MODULE is synthesized using a single clock.
Syntax
set_directive_clock <location> <domain>
where
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_dataflow
Description
Specifies that dataflow optimization be performed on the functions or loops, improving the
concurrency of the RTL implementation.
Data dependencies can limit this. For example, functions or loops that access arrays must
finish all read/write accesses to the arrays before they complete. This prevents the next
function or loop that consumes the data from starting operation.
It is possible for the operations in a function or loop to start operation before the previous
function or loop completes all its operations.
If no initiation interval (number of cycles between the start of one function or loop and the
next) is specified, Vivado HLS attempts to minimize the initiation interval and start
operation as soon as data is available.
Syntax
set_directive_dataflow <location>
where
• <location> is the location (in the format function[/label]) at which dataflow
optimization is to be performed.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_dataflow foo
#pragma HLS dataflow
set_directive_data_pack
Description
Packs the data fields of a struct into a single scalar with a wider word width.
Any arrays declared inside the struct are completely partitioned and reshaped into a wide
scalar and packed with other scalar fields.
The bit alignment of the resulting new wide-word can be inferred from the declaration
order of the struct fields. The first field takes the least significant sector of the word and so
forth until all fields are mapped.
Syntax
set_directive_data_pack [OPTIONS] <location> <variable>
where
• <location> is the location (in the format function[/label]) that contains the variable
to be packed.
• <variable> is the variable to be packed.
Options
-instance <string>
Specifies the name of resultant variable after packing. If none is provided, the input
variable is used.
-byte_pad (struct_level|field_level)
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Packs struct array AB[17] with three 8-bit fields (typedef struct {unsigned char R, G, B;}
pixel) in function foo, into a new 17 element array of 24 bits.
set_directive_data_pack foo AB
#pragma HLS data_pack variable=AB
Packs struct pointer AB with three 8-bit fields (typedef struct {unsigned char R, G, B;} pixel)
in function foo, into a new 24-bit pointer.
set_directive_data_pack foo AB
#pragma HLS data_pack variable=AB
set_directive_dependence
Description
These dependencies impact when operations can be scheduled, especially during function
and loop pipelining.
• Loop-independent dependence
for (i=0;i<N;i++) {
A[i]=x;
y=A[i];
}
• Loop-carry dependence
for (i=0;i<N;i++) {
A[i]=A[i-1]*2;
}
Under certain circumstances, such as variable-dependent array indexing or when an external
requirement needs to be enforced (for example, two inputs are never the same index), the
dependence analysis might be too conservative. The set_directive_dependence
command allows you to explicitly specify the dependence and resolve a false dependence.
Syntax
set_directive_dependence [OPTIONS] <location>
where
• <location> is the location (in the format function[/label]) at which the dependence is
to be specified.
Options
-class (array|pointer)
Specifies a class of variables in which the dependence needs clarification. This is mutually
exclusive with the option -variable.
-dependent (true|false)
-direction (RAW|WAR|WAW)
The read instruction gets a value that is overwritten by the write instruction.
-distance <integer>
Note: Relevant only for loop-carry dependencies where -dependent is set to true.
-type (intra|inter)
-variable <variable>
Specifies the specific variable to consider for the dependence directive. Mutually exclusive
with the option -class.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Removes the dependence between Var1 in the same iterations of loop_1 in function foo.
The dependence on all arrays in loop_2 of function foo informs Vivado HLS that all reads
must happen after writes in the same loop iteration.
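The two dependence examples above might be expressed as pragmas as sketched below. The loop bodies, the array sizes, and the names in, buf, and out are hypothetical; they are chosen so that the stated dependence information is actually true for the code.

```c
#define N 8

void foo(int Var1[2 * N], const int in[N], int buf[N], int out[N])
{
    int i;

loop_1:
    for (i = 0; i < N; i++) {
/* Reads touch Var1[0..N-1] and writes touch Var1[N..2N-1], so the read
 * and the write in one iteration never alias: the dependence is false. */
#pragma HLS dependence variable=Var1 intra false
        int v = Var1[i];
        Var1[i + N] = v + in[i];
    }

loop_2:
    for (i = 0; i < N; i++) {
/* For all arrays, a read must follow the write in the same iteration. */
#pragma HLS dependence array intra RAW true
        buf[i] = in[i] * 2;
        out[i] = buf[i] + 1;
    }
}
```

Declaring a dependence false when the accesses can in fact overlap would let the scheduler reorder them, so the claim has to hold for every input.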
set_directive_expression_balance
Description
Sometimes a C-based specification is written with a sequence of operations. This can result
in a lengthy chain of operations in RTL. With a small clock period, this can increase the
design latency.
By default, Vivado HLS rearranges the operations through associative and commutative
properties. This rearrangement creates a balanced tree that can shorten the chain,
potentially reducing latency at the cost of extra hardware.
Syntax
set_directive_expression_balance [OPTIONS] <location>
where
• <location> is the location (in the format function[/label]) where the balancing should
be enabled or disabled.
Options
-off
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_expression_balance My_Func2
#pragma HLS expression_balance
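The effect of balancing can be pictured with two hand-written forms of the same sum. This is only an illustration of the rearrangement described above, with hypothetical function names; Vivado HLS performs the transformation internally and the source does not need to be rewritten.

```c
/* A chain: each addition waits for the previous one (three adders in series). */
int sum_chain(int a, int b, int c, int d)
{
    return ((a + b) + c) + d;
}

/* A balanced tree: (a + b) and (c + d) can be computed at the same time,
 * followed by one final addition (two adder levels instead of three). */
int sum_balanced(int a, int b, int c, int d)
{
    return (a + b) + (c + d);
}
```

Both forms return the same value for integer inputs that do not overflow, which is what allows the rearrangement.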
set_directive_function_instantiate
Description
By default:
By default, the following code results in a single RTL implementation of function foo_sub
for all three instances.
Using the directive as shown in the example section below results in three versions of
function foo_sub, each independently optimized for variable incr.
Syntax
set_directive_function_instantiate <location> <variable>
where
• <location> is the location (in the format function[/label]) where the instances of a
function are to be made unique.
• <variable> specifies which function argument is to be specified as constant.
Options
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
For the example code shown above, the following Tcl (or pragma placed in function
foo_sub) allows each instance of function foo_sub to be independently optimized with
respect to input incr.
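A minimal sketch of code with the shape described above, assuming a function foo that calls foo_sub three times with a different constant value of incr each time. The argument names other than incr are hypothetical.

```c
char foo_sub(char inval, char incr)
{
/* Create a separate, independently optimized instance per constant incr. */
#pragma HLS function_instantiate variable=incr
    return inval + incr;
}

void foo(char inval1, char inval2, char inval3,
         char *outval1, char *outval2, char *outval3)
{
    *outval1 = foo_sub(inval1, 1);
    *outval2 = foo_sub(inval2, 2);
    *outval3 = foo_sub(inval3, 3);
}
```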
set_directive_inline
Description
Removes a function as a separate entity in the hierarchy. After inlining, the function is
dissolved and no longer appears as a separate level of hierarchy.
In some cases, inlining a function allows operations within the function to be shared and
optimized more effectively with surrounding operations. An inlined function cannot be
shared. This can increase area.
Syntax
set_directive_inline [OPTIONS] <location>
where
• <location> is the location (in the format function[/label]) where inlining is to be
performed.
Options
-off
Disables function inlining to prevent particular functions from being inlined. For example, if
the -recursive option is used in a caller function, this option can prevent a particular
called function from being inlined when all others are.
-recursive
By default, only one level of function inlining is performed. The functions within the
specified function are not inlined. The -recursive option inlines all functions recursively
down the hierarchy.
-region
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Inlines all functions in foo_top (but not any lower level functions).
Inlines only function foo_sub1. Place the pragma in function foo_sub1.
set_directive_inline foo_sub1
#pragma HLS inline
Inlines all functions in foo_top, recursively down the hierarchy, except function foo_sub2.
The first pragma is placed in function foo_top. The second pragma is placed in function
foo_sub2.
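The recursive inlining example might be written with pragmas as in this sketch. The function bodies are hypothetical; only the function names and the pragma placement follow the description.

```c
int foo_sub2(int x)
{
/* Keep foo_sub2 as its own level of hierarchy. */
#pragma HLS inline off
    return x * 3;
}

int foo_sub1(int x)
{
    return foo_sub2(x) + 1;
}

int foo_top(int x)
{
/* Inline every function called in foo_top, recursively down the hierarchy. */
#pragma HLS inline region recursive
    return foo_sub1(x) + foo_sub2(x);
}
```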
set_directive_interface
Description
Specifies how RTL ports are created from the function description during interface
synthesis.
Function-level handshakes:
° Ends
° Is idle
Each function argument can be specified to have its own I/O protocol (such as valid
handshake or acknowledge handshake).
If a global variable is accessed, but all read and write operations are local to the design, the
resource is created in the design. There is no need for an I/O port in the RTL. If however, the
global variable is expected to be an external source or destination, specify its interface in a
similar manner as standard function arguments. See the examples below.
Syntax
set_directive_interface [OPTIONS] <location> <port>
where
• <location> is the location (in the format function[/label]) where the function interface
or registered output is to be specified.
• <port> is the parameter (function argument or global variable) for which the interface
has to be synthesized. This is not required when modes ap_ctrl_none or
ap_ctrl_hs are used.
Options
-bundle <string>: Groups function arguments into AXI ports. By default, Vivado HLS
groups all function arguments specified as an AXI4-Lite interface into a single AXI4-Lite
port. Similarly, Vivado HLS groups all function arguments specified as an AXI4 interface into
a single AXI4 port. The -bundle option explicitly groups all function arguments with the
same <string> into the same interface port and names the RTL port <string>.
-mode (ap_none|ap_stable|ap_vld|ap_ack|ap_hs|ap_ovld|ap_fifo|
ap_bus|ap_memory|bram|axis|s_axilite|m_axi|ap_ctrl_none|ap_ctrl_hs
|ap_ctrl_chain)
Following is a summary of how Vivado HLS implements the -mode options. For detailed
descriptions, see Interface Synthesis Reference.
• ap_memory: Implements array arguments as a standard RAM interface. If you use the
RTL design in Vivado IP integrator, the memory interface appears as discrete ports.
• bram: Implements array arguments as a standard RAM interface. If you use the RTL
design in Vivado IP integrator, the memory interface appears as a single port.
• axis: Implements all ports as an AXI4-Stream interface.
• s_axilite: Implements all ports as an AXI4-Lite interface. Vivado HLS produces an
associated set of C driver files during the Export RTL process.
• m_axi: Implements all ports as an AXI4 interface. You can use the
config_interface command to specify either 32-bit (default) or 64-bit address
ports and to control any address offset.
• ap_ctrl_none: No block-level I/O protocol.
Note: Using the ap_ctrl_none mode might prevent the design from being verified using the
C/RTL co-simulation feature.
• ap_ctrl_hs: Implements a set of block-level control ports to start the design
operation and to indicate when the design is idle, done, and ready for new input
data.
Note: The ap_ctrl_hs mode is the default block-level I/O protocol.
• ap_ctrl_chain: Implements a set of block-level control ports to start the design
operation, continue operation, and indicate when the design is idle, done, and
ready for new input data.
-depth: Specifies the maximum number of samples for the test bench to process. This
setting indicates the maximum size of the FIFO needed in the verification adapter that
Vivado HLS creates for RTL co-simulation. This option is required for pointer interfaces
using ap_fifo or ap_bus modes.
-register: Registers the signal and any relevant protocol signals and instructs the signals
to persist until at least the last cycle of the function execution. This option applies to the
following scalar interfaces for the top-level function:
• ap_none
• ap_ack
• ap_vld
• ap_ovld
• ap_hs
• ap_fifo
• axis
-offset <string>: Controls the address offset in AXI4-Lite and AXI4 interfaces. In an
AXI4-Lite interface, <string> specifies the address in the register map. In an AXI interface,
<string> specifies the following:
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Argument InData in function foo is specified to have an ap_vld interface and the input
should be registered.
Exposes global variable lookup_table used in function foo as a port on the RTL design,
with an ap_memory interface.
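A sketch of the two interface examples as pragmas in one function. The size of lookup_table, the second argument idx, and the function body are assumptions made for illustration.

```c
/* Global variable read by foo; the size of 16 is hypothetical. */
int lookup_table[16];

int foo(int InData, int idx)
{
/* InData gets a registered input port with an associated valid signal. */
#pragma HLS interface ap_vld register port=InData
/* The global array is exposed as a RAM interface on the RTL design. */
#pragma HLS interface ap_memory port=lookup_table
    return InData + lookup_table[idx & 15];
}
```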
set_directive_latency
Description
Vivado HLS always aims for minimum latency. The behavior of Vivado HLS when minimum
and maximum latency values are specified is as follows:
If Vivado HLS can achieve less than the minimum specified latency, it extends the
latency to the specified value, potentially increasing sharing.
If Vivado HLS cannot schedule within the maximum limit, it increases effort to achieve
the specified constraint. If it still fails to meet the maximum latency, it issues a warning.
Vivado HLS then produces a design with the smallest achievable latency.
Syntax
set_directive_latency [OPTIONS] <location>
where
• <location> is the location (function, loop or region) (in the format function[/label]) to
be constrained.
Options
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
In function foo, loop loop_row is specified to have a maximum latency of 12. Place the
pragma in the loop body.
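As a sketch, the latency constraint on loop_row might be placed as follows. The array dimensions and the loop body are hypothetical.

```c
#define ROWS 4
#define COLS 4

int foo(int m[ROWS][COLS])
{
    int r, c, sum = 0;

loop_row:
    for (r = 0; r < ROWS; r++) {
/* Constrain loop_row to a maximum latency of 12 clock cycles. */
#pragma HLS latency max=12
        for (c = 0; c < COLS; c++)
            sum += m[r][c];
    }
    return sum;
}
```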
set_directive_loop_flatten
Description
In the RTL implementation, it costs a clock cycle to move between loops in the loop
hierarchy. Flattening nested loops allows them to be optimized as a single loop. This saves
clock cycles, potentially allowing for greater optimization of the loop body logic.
RECOMMENDED: Apply this directive to the inner-most loop in the loop hierarchy. Only perfect and
semi-perfect loops can be flattened in this manner.
When the inner loop has variable bounds (or the loop body is not exclusively inside the
inner loop), try to restructure the code, or unroll the loops in the loop body to create a
perfect loop nest.
Syntax
set_directive_loop_flatten [OPTIONS] <location>
where
Options
-off
Can prevent some loops from being flattened while all others in the specified location are
flattened.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Flattens loop_1 in function foo and all (perfect or semi-perfect) loops above it in the loop
hierarchy, into a single loop. Place the pragma in the body of loop_1.
set_directive_loop_flatten foo/loop_1
#pragma HLS loop_flatten
Prevents loop flattening in loop_2 of function foo. Place the pragma in the body of
loop_2.
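A sketch of the second example, in which flattening is switched off for loop_2. The enclosing loop_1, the array dimensions, and the loop body are hypothetical.

```c
#define ROWS 4
#define COLS 4

int foo(int m[ROWS][COLS])
{
    int i, j, sum = 0;

loop_1:
    for (i = 0; i < ROWS; i++) {
    loop_2:
        for (j = 0; j < COLS; j++) {
/* Keep loop_2 as a separate loop; it is not flattened into loop_1. */
#pragma HLS loop_flatten off
            sum += m[i][j];
        }
    }
    return sum;
}
```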
set_directive_loop_merge
Description
Merging loops:
• Reduces the number of clock cycles required in the RTL to transition between the
loop-body implementations.
• Allows the loops to be implemented in parallel (if possible).
• If the loop bounds are variables, they must have the same value (number of iterations).
• If loop bounds are constants, the maximum constant value is used as the bound of the
merged loop.
• Loops with both variable bound and constant bound cannot be merged.
• The code between loops to be merged cannot have side effects. Multiple execution of
this code should generate the same results.
- a=b is allowed
- a=a+1 is not allowed.
• Loops cannot be merged when they contain FIFO reads. Merging changes the order of
the reads. Reads from a FIFO or FIFO interface must always be in sequence.
Syntax
set_directive_loop_merge <location>
where
• <location> is the location (in the format function[/label]) at which the loops reside.
Options
-force
Forces loops to be merged even when Vivado HLS issues a warning. You must ensure that
the merged loop will function correctly.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_loop_merge foo
#pragma HLS loop_merge
All loops inside loop_2 of function foo (but not loop_2 itself) are merged by using the
-force option. Place the pragma in the body of loop_2.
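A sketch of the forced merge inside loop_2. The inner loop labels loop_a and loop_b, the array names, and the loop bodies are hypothetical; the two inner loops have the same constant bound and no code between them.

```c
#define N 8

void foo(const int in[N], int a[N], int b[N])
{
    int i, j;

loop_2:
    for (j = 0; j < 2; j++) {
/* Merge the loops inside loop_2 (loop_a and loop_b), not loop_2 itself. */
#pragma HLS loop_merge force
    loop_a:
        for (i = 0; i < N; i++)
            a[i] = in[i] + j;
    loop_b:
        for (i = 0; i < N; i++)
            b[i] = in[i] - j;
    }
}
```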
set_directive_loop_tripcount
Description
The loop tripcount is the total number of iterations performed by a loop. Vivado HLS reports
the total latency of each loop (the number of cycles to execute all iterations of the loop).
This loop latency is therefore a function of the tripcount (number of loop iterations).
The tripcount can be a constant value. It may depend on the value of variables used in the
loop expression (for example, x<y) or control statements used inside the loop.
Vivado HLS cannot determine the tripcount in some cases. These cases include, for
example, those in which the variables used to determine the tripcount are:
• Input arguments, or
• Variables calculated by dynamic operation
Syntax
set_directive_loop_tripcount [OPTIONS] <location>
where
• <location> is the location of the loop (in the format function[/label]) at which the
tripcount is specified.
Options
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
• A minimum tripcount of 12
• An average tripcount of 14
• A maximum tripcount of 16
set_directive_loop_tripcount -min 12 -max 16 -avg 14 foo/loop_1
#pragma HLS loop_tripcount min=12 max=16 avg=14
set_directive_occurrence
Description
When pipelining functions or loops, specifies that the code in a location is executed at a
lesser rate than the code in the enclosing function or loop.
This allows the code that is executed at the lesser rate to be pipelined at a slower rate, and
potentially shared within the top-level pipeline. For example:
If N is pipelined with an initiation interval II, any function or loops protected by the
conditional statement:
Identifying a region with an occurrence allows the functions and loops in this region to be
pipelined with an initiation interval that is slower than the enclosing function or loop.
Syntax
set_directive_occurrence [OPTIONS] <location>
where
Options
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
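A sketch of an occurrence region, assuming a pipelined loop in which a conditional block runs once for every four iterations. The loop label loop_n, the variable names, and the rate of four are hypothetical.

```c
#define N 16

int foo(const int in[N])
{
    int i, acc = 0, slow = 0;

loop_n:
    for (i = 0; i < N; i++) {
#pragma HLS pipeline II=1
        acc += in[i];
        if (i % 4 == 0) {
/* This region executes once for every four iterations of loop_n. */
#pragma HLS occurrence cycle=4
            slow += acc;
        }
    }
    return slow;
}
```

Because the region runs at a quarter of the loop rate, the operations inside it can be pipelined with a slower initiation interval than the enclosing loop.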
set_directive_pipeline
Description
• Function pipelining
• Loop pipelining
A pipelined function or loop can process new inputs every N clock cycles, where N is the
initiation interval (II). The default initiation interval is 1, which processes a new input every
clock cycle, or it can be specified by the -II option.
If Vivado HLS cannot create a design with the specified II, it:
• Issues a warning.
• Creates a design with the lowest possible II.
You can then analyze this design with the warning message to determine what steps must
be taken to create a design that satisfies the required initiation interval.
Syntax
set_directive_pipeline [OPTIONS] <location>
where
Options
-II <integer>
Vivado HLS tries to meet this request. Based on data dependencies, the actual result might
have a larger II.
-enable_flush
Implements a pipeline that can flush pipeline stages if the input of the pipeline stalls.
This feature:
-rewind
Enables rewinding. Rewinding enables continuous loop pipelining, with no pause between
one loop iteration ending and the next starting.
Rewinding is effective only if there is one single loop (or a perfect loop nest) inside the
top-level function. The code segment before the loop:
• Is considered as initialization.
• Is executed only once in the pipeline.
• Cannot contain any conditional operations (if-else).
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_pipeline foo
#pragma HLS pipeline
Loop loop_1 in function foo is pipelined with an initiation interval of 4. Pipelining flush is
enabled.
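A sketch of this pipelining example as a pragma in the loop body. The array size and the loop body are hypothetical.

```c
int foo(const int in[8])
{
    int i, acc = 0;

loop_1:
    for (i = 0; i < 8; i++) {
/* Start a new iteration every four clock cycles; allow the pipeline to flush. */
#pragma HLS pipeline II=4 enable_flush
        acc += in[i] * in[i];
    }
    return acc;
}
```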
set_directive_protocol
Description
Specifies a region of the code (a protocol region) in which no clock operations are inserted
by Vivado HLS unless explicitly specified in the code.
A protocol region can manually specify an interface protocol. Vivado HLS does not insert
any clocks between any operations, including those that read from or write to function
arguments. The order of reads and writes is therefore obeyed in the RTL.
The ap_wait and wait statements have no effect on the simulation of C and C++ designs
respectively. They are only interpreted by Vivado HLS.
io_section:{..lines of C code...}
Syntax
set_directive_protocol [OPTIONS] <location>
where
Options
-mode (floating|fixed)
The default floating mode allows the code corresponding to statements outside the
protocol region to overlap with the statements in the protocol statements in the final RTL.
The protocol region remains cycle accurate, but other operations can occur at the same
time.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Defines region io_section in function foo as a fixed protocol region. Place the pragma
inside region io_section.
set_directive_reset
Description
Syntax
set_directive_reset [OPTIONS] <location> <variable>
Options
<location> <string>
The location (in the format function[/label]) at which the variable is defined.
<variable> <string>
Pragma
Place the pragma in the C source within the boundaries of the variable life cycle.
Examples
Adds reset to variable static int a in function foo even when the global reset setting
is none or control.
set_directive_reset foo a
#pragma HLS reset variable=a
Removes reset from variable static int a in function foo even when the global reset
setting is state or all.
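A sketch of the second example, in which the reset is removed from the static variable. The function body is hypothetical; the pragma is placed after the declaration of a, within the scope of the variable.

```c
int foo(int x)
{
    static int a = 0;
/* Do not reset a, regardless of the global reset setting. */
#pragma HLS reset variable=a off
    a += x;   /* a keeps its value between calls */
    return a;
}
```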
set_directive_resource
Description
Specifies the resource (core) to implement a variable in the RTL. The variable can be any of
the following:
• array
• arithmetic operation
• function argument
Vivado HLS implements the operations in the code using hardware cores. When multiple
cores in the library can implement the operation, you can specify which core to use with the
set_directive_resource command. To generate a list of cores, use the list_core
command. If no resource is specified, Vivado HLS determines the resource to use.
To specify which memory element in the library to use to implement an array, use the
set_directive_resource command. For example, this allows you to control whether
the array is implemented as a single or a dual-port RAM. This usage is important for arrays
on the top-level function interface, because the memory associated with the array
determines the ports in the RTL.
You can use the -latency option to specify the latency of the core. For block RAMs on the
interface, the -latency option allows you to model off-chip, non-standard SRAMs at the
interface, for example to support an SRAM with a latency of 2 or 3. For internal operations,
the -latency option allows the operation to be implemented using more pipelined
stages. These additional pipeline stages can help resolve timing issues during RTL synthesis.
IMPORTANT: To use the -latency option, the operation must have an available multi-stage core.
Vivado HLS provides a multi-stage core for all basic arithmetic operations (add, subtract, multiply and
divide), all floating-point operations, and all block RAMs.
RECOMMENDED: For best results, Xilinx recommends that you use -std=c99 for C and
-fno-builtin for C and C++. To specify the C compile options, such as -std=c99, use the Tcl
command add_files with the -cflags option. Alternatively, use the Edit CFLAGs button in the
Project Settings dialog box.
Syntax
set_directive_resource -core <string> <location> <variable>
where
• <location> is the location (in the format function[/label]) at which the variable can be
found.
• <variable> is the variable.
Options
-core <string>
-port_map <string>
Specifies port mappings when using the IP generation flow to map ports on the design with
ports on the adapter.
The variable <string> is a Tcl list of the design port and adapter ports.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Given code Result=A*B in function foo, specifies the multiplication be implemented with
a two-stage pipelined multiplier core.
set_directive_stream
Description
If the data stored in the array is consumed or produced in a sequential manner, a more
efficient communication mechanism is to use streaming data, where FIFOs are used instead
of RAMs.
When an argument of the top-level function is specified as interface type ap_fifo, the
array is identified as streaming.
Syntax
set_directive_stream [OPTIONS] <location> <variable>
where
• <location> is the location (in the format function[/label]) which contains the array
variable.
• <variable> is the array variable to be implemented as a FIFO.
Options
Overrides the default FIFO depth specified (globally) by the config_dataflow command.
-off
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_stream foo A
#pragma HLS STREAM variable=A
Array B in named loop loop_1 of function foo is set to streaming with a FIFO depth of 12.
In this case, place the pragma inside loop_1.
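The following sketch shows the kind of access pattern that suits streaming. The function `foo`, the array `A`, and the size `N` are hypothetical. The depth is set equal to `N` as a conservative assumption, so that the FIFO could hold every element even if the two loops ran one after the other. A standard C compiler ignores the pragma.

```c
#include <assert.h>

#define N 16

/* Hypothetical function: a producer loop fills A in order and a consumer
   loop drains it in the same order, so A can be a FIFO instead of a RAM. */
int foo(const int in[N]) {
    int A[N];
/* Pragma form of the stream directive; depth=16 is an illustrative value. */
#pragma HLS STREAM variable=A depth=16
    int i;
    int sum = 0;

    assert(in != 0);

    for (i = 0; i < N; i++) {    /* producer: writes A[0], A[1], ... */
        A[i] = in[i] * 2;
    }
    for (i = 0; i < N; i++) {    /* consumer: reads A[0], A[1], ... */
        sum += A[i];
    }
    return sum;
}
```

Because every element is written once and read once in the same order, no address generation is needed for `A`.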
set_directive_top
Description
Attaches a name to a function, which can then be used for the set_top command.
RECOMMENDED: Specify the directive in an active solution. Use the set_top command with the new
name.
Syntax
set_directive_top [OPTIONS] <location>
where
Options
-name <string>
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
set_directive_unroll
Description
A loop is executed for the number of iterations specified by the loop induction variable. The
number of iterations may also be impacted by logic inside the loop body (for example,
break or modifications to any loop exit variable). The loop is implemented in the RTL by a
block of logic representing the loop body, which is executed for the same number of
iterations.
The set_directive_unroll command allows the loop to be fully unrolled, which creates as
many copies of the loop body in the RTL as there are loop iterations, or partially unrolled
by a factor N, which creates N copies of the loop body and adjusts the loop iteration count
accordingly.
If the factor N used for partial unrolling is not an integer multiple of the original loop
iteration count, the original exit condition must be checked after each unrolled fragment of
the loop body.
To unroll a loop completely, the loop bounds must be known at compile time. This is not
required for partial unrolling.
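The effect of partial unrolling can be sketched by hand in plain C. The two functions below are illustrative only and are not output from the tool: `sum_rolled` is the original loop, and `sum_unrolled4` is a manual equivalent for a factor of 4. Because the iteration count is not assumed to be a multiple of 4, the exit condition is checked after each copy of the loop body.

```c
#include <assert.h>

/* Rolled form: one copy of the loop body, executed n times. */
int sum_rolled(const int *a, int n) {
    int i, acc = 0;
    for (i = 0; i < n; i++) {
        acc += a[i];
    }
    return acc;
}

/* Hand-written equivalent of partial unrolling by a factor of 4.
   Because n need not be a multiple of 4, the exit condition is
   re-checked after each copy of the body. */
int sum_unrolled4(const int *a, int n) {
    int i, acc = 0;
    assert(n >= 0);
    for (i = 0; i < n; i += 4) {
        acc += a[i];
        if (i + 1 >= n) break;
        acc += a[i + 1];
        if (i + 2 >= n) break;
        acc += a[i + 2];
        if (i + 3 >= n) break;
        acc += a[i + 3];
    }
    return acc;
}
```

When the iteration count is known to be a multiple of the factor, the intermediate checks are redundant and can be dropped.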
Syntax
set_directive_unroll [OPTIONS] <location>
where
• <location> is the location of the loop (in the format function[/label]) to be unrolled.
Options
The loop body is repeated this number of times. The iteration information is adjusted
accordingly.
-region
Unrolls all loops within a loop without unrolling the enclosing loop itself.
• Loop loop_1 contains multiple loops at the same level of loop hierarchy (loops
loop_2 and loop_3).
• A named loop (such as loop_1) is also a region or location in the code.
• A section of code is enclosed by braces { }.
• If the unroll directive is specified on location <function>/loop_1, it unrolls loop_1.
The -region option specifies that the directive be applied only to the loops enclosed by the
named region, so the loop that names the region is itself left rolled.
-skip_exit_check
• Fixed bounds
No exit condition check is performed if the iteration count is a multiple of the factor.
If the iteration count is not an integer multiple of the factor, the tool:
° Prevents unrolling.
Pragma
Place the pragma in the C source within the boundaries of the required location.
Examples
Unrolls loop L1 in function foo. Place the pragma in the body of loop L1.
set_directive_unroll foo/L1
#pragma HLS unroll
Specifies an unroll factor of 4 on loop L2 of function foo. Removes the exit check. Place the
pragma in the body of loop L2.
Unrolls all loops inside loop L3 in function foo, but not loop L3 itself. The -region option
specifies the location be considered an enclosing region and not a loop label.
set_part
Description
Syntax
set_part <device_specification>
where
• <device_specification> is a device specification that sets the target device for
Vivado HLS synthesis and implementation.
• <device_family> is the device family name, which uses the default device in the family.
• <device><package><speed_grade> is the target device name including device,
package, and speed-grade information.
Options
Pragma
Examples
The FPGA libraries provided with Vivado HLS can be added to the current solution by
providing the device family name as shown below. In this case, the default device, package,
and speed-grade specified in the Vivado HLS FPGA library for this device family are used.
set_part virtex7
The FPGA libraries provided with Vivado HLS can optionally specify the specific device with
package and speed-grade information.
set_part xc6vlx240tff1156-1
set_top
Description
Any functions called from this function will also be part of the design.
Syntax
set_top <top>
where
Options
Pragma
Examples
set_top foo_top
GUI Reference
This reference section explains how to use, control and customize the Vivado HLS GUI.
Monitoring Variables
You can view the values of variables and expressions directly in the Debug perspective. The
following figure shows how you can monitor the value of individual variables.
You can monitor the value of expressions using the Expressions tab.
Undefined references occur when code defined in a header file (.h or .hpp extension)
cannot be resolved. The primary causes of undefined references are:
If the code is new, ensure the header file is saved. After saving the header file, Vivado
HLS automatically indexes the header files and updates the coding references.
Ensure the header file is included in the C code using an include statement, the
location to the header file is in the search path, and the header file is in the same
directory as the C files added to the project.
Note: To explicitly add the search path, select Solution > Solution Settings, click Synthesis or
Simulation, and use the Edit CFLAGs button. For more information, see Creating a New
Synthesis Project in Chapter 1.
• Automatic indexing is disabled.
Ensure that Vivado HLS is parsing all header files automatically. Select Project > Project
Settings to open the Project Settings dialog box. Click General, and make sure Disable
Parsing All Header Files is deselected, as shown in the following figure. This might
result in a reduced GUI response time, because Vivado HLS uses CPU cycles to
automatically check the header files.
Note: To manually force Vivado HLS to index all C files, click the Index C files toolbar button.
2. Right-click and select the appropriate language encoding using Properties > Resource.
In the section titled Text File Encoding, select Other and choose the appropriate encoding
from the drop-down menu.
The default buffer size for this window is 80,000 characters. To ensure all messages can be
reviewed, you can change the size or remove the limit by using the menu Window >
Preferences > Run/Debug > Console.
By default, the key combination Ctrl+Tab toggles the active tab in the Information Pane
between the source code and the header file. You can change this behavior so that Ctrl+Tab
makes each tab in turn the active tab:
• In the Preferences menu, select General > Keys, select the Command value Toggle
Source/Header, and remove the Ctrl+Tab binding by using the Unbind Command key.
• Select Next Tab in the Command column, place the cursor in the Binding dialog box,
and press the Ctrl key and then the Tab key. This associates Ctrl+Tab with making the
next tab active.
A find-next hot key can be implemented by using the Microsoft Visual Studio scheme. To do
so, select Window > Preferences > General > Keys and replace the Default scheme with the
Microsoft Visual Studio scheme.
Reviewing the sub-menus in the Preferences menu allows every aspect of the GUI
environment to be customized to ensure the highest levels of productivity.
You can specify these block-level I/O protocols on the function or the function return. If the
C code does not return a value, you can still specify the block-level I/O protocol on the
function return. If the C code uses a function return, Vivado HLS creates an output port
ap_return for the return value.
The ap_ctrl_hs block-level I/O protocol is the default. The following figure shows the
resulting RTL ports and behavior when Vivado HLS implements ap_ctrl_hs on a function.
In this example, the function returns a value using the return statement, and Vivado HLS
creates the ap_return output port in the RTL design. If a function return statement is
not included in the C code, this port is not created.
ap_ctrl_none
If you specify the ap_ctrl_none block-level I/O protocol, the handshake signal ports
(ap_start, ap_idle, ap_ready, and ap_done) shown in Figure 4-5 are not created. If
you do not specify block-level I/O protocols on the design, you must adhere to the
conditions described in Interface Synthesis Requirements in Chapter 1 when using C/RTL
cosimulation to verify the RTL design.
ap_ctrl_hs
The following figure shows the behavior of the block-level handshake signals created by the
ap_ctrl_hs I/O protocol.
° In pipelined designs, the ap_ready signal might go High at any cycle after
ap_start is sampled High. This depends on how the design is pipelined.
° If the ap_start signal is Low when ap_ready is High, the design executes until
ap_done is High and then stops operation.
° If the ap_start signal is High when ap_ready is High, the next transaction starts
immediately, and the design continues to operate.
8. The ap_idle signal indicates when the design is idle and not operating. Following is
additional information about the ap_idle signal:
° If the ap_start signal is Low when ap_ready is High, the design stops operation,
and the ap_idle signal goes High one cycle after ap_done.
° If the ap_start signal is High when ap_ready is High, the design continues to
operate, and the ap_idle signal remains Low.
ap_ctrl_chain
The ap_ctrl_chain block-level I/O protocol is similar to the ap_ctrl_hs protocol but
provides an additional input port named ap_continue. An active High ap_continue
signal indicates that the downstream block that consumes the output data is ready for new
data inputs. If the downstream block is not able to consume new data inputs, the
ap_continue signal is Low, which prevents upstream blocks from generating additional
data.
The ap_ready port of the downstream block can directly drive the ap_continue port.
Following is additional information about the ap_continue port:
• If the ap_continue signal is High when ap_done is High, the design continues
operating. The behavior of the other block-level I/O signals is identical to those
described in the ap_ctrl_hs block-level I/O protocol.
• If the ap_continue signal is Low when ap_done is High, the design stops operating,
the ap_done signal remains High, and data remains valid on the ap_return port if
the ap_return port is present.
In the following figure, the first transaction completes, and the second transaction starts
immediately because ap_continue is High when ap_done is High. However, the design
halts at the end of the second transaction until ap_continue is asserted High.
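The interaction between ap_done and ap_continue can be sketched with a small software model. This is a simplified illustration under stated assumptions, not a cycle-accurate description of the generated RTL: the type `chain_model`, the function `chain_step`, and the fixed `latency` parameter are all hypothetical, and ap_ready and ap_idle are not modeled.

```c
#include <assert.h>

typedef struct {
    int cycles_left;  /* cycles remaining in the current transaction */
    int ap_done;      /* held High until ap_continue is sampled High */
    int ap_return;    /* value held valid while ap_done is High */
} chain_model;

/* Advance the model by one clock cycle and return ap_done. */
int chain_step(chain_model *m, int ap_start, int ap_continue, int latency) {
    assert(m != 0 && latency > 0);
    if (m->ap_done) {
        if (!ap_continue) {
            return 1;              /* stalled: ap_done and ap_return are held */
        }
        m->ap_done = 0;            /* downstream block accepted the result */
    }
    if (m->cycles_left == 0 && ap_start) {
        m->cycles_left = latency;  /* begin a new transaction */
    }
    if (m->cycles_left > 0) {
        m->cycles_left--;
        if (m->cycles_left == 0) {
            m->ap_done = 1;
            m->ap_return++;        /* stand-in for a computed result */
        }
    }
    return m->ap_done;
}
```

Calling `chain_step` repeatedly with `ap_continue` Low after a transaction completes leaves `ap_done` High and `ap_return` unchanged, which mirrors the halted case described above.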
An ap_none interface does not require additional hardware overhead. However, the
ap_none interface does require the following:
° Hold data for the length of a transaction until the design completes
• Consumer blocks to read output ports at the correct time
Note: The ap_none interface cannot be used with array arguments.
ap_stable
Like ap_none, the ap_stable port-level I/O protocol does not add any interface control
ports to the design. The ap_stable type is typically used for data that can change but
remains stable during normal operation, such as ports that provide configuration data. The
ap_stable type informs Vivado HLS of the following:
• The data applied to the port remains stable during normal operation but is not a
constant value that can be optimized.
• The fanout from this port is not required to be registered.
Note: The ap_stable type can only be applied to input ports. When applied to inout ports, only
the input of the port is considered stable.
• Data port
• Acknowledge signal to indicate when data is consumed
• Valid signal to indicate when data is read
The following figure shows how an ap_hs interface behaves for both an input and output
port. In this example, the input port is named in, and the output port is named out.
Note: The control signal names are based on the original port name. For example, the valid port for
data input in is named in_vld.
ap_ack
The ap_ack port-level I/O protocol is a subset of the ap_hs interface type. The ap_ack
port-level I/O protocol provides the following signals:
• Data port
• Acknowledge signal to indicate when data is consumed
CAUTION! You cannot use C/RTL cosimulation to verify designs that use ap_ack on an output port.
ap_vld
The ap_vld is a subset of the ap_hs interface type. The ap_vld port-level I/O protocol
provides the following signals:
• Data port
• Valid signal to indicate when data is read
° For input arguments, the design reads the data port as soon as the valid is active.
Even if the design is not ready to read new data, the design samples the data port
and holds the data internally until needed.
° For output arguments, Vivado HLS implements an output valid port to indicate
when the data on the output port is valid.
ap_ovld
The ap_ovld is a subset of the ap_hs interface type. The ap_ovld port-level I/O protocol
provides the following signals:
• Data port
• Valid signal to indicate when data is read
° For input arguments and the input half of inout arguments, the design defaults to
type ap_none.
° For output arguments and the output half of inout arguments, the design
implements type ap_vld.
ap_memory, bram
The ap_memory and bram interface port-level I/O protocols are used to implement array
arguments. This type of port-level I/O protocol can communicate with memory elements
(for example, RAMs and ROMs) when the implementation requires random accesses to the
memory address locations.
Note: If you only need sequential access to the memory element, use the ap_fifo interface
instead. The ap_fifo interface reduces the hardware overhead, because address generation is not
performed. For more information, see ap_fifo.
The ap_memory and bram interface port-level I/O protocols are identical. The only
difference is the way Vivado IP integrator shows the blocks:
When using an ap_memory interface, specify the array targets using the RESOURCE
directive. If no target is specified for the arrays, Vivado HLS determines whether to use a
single or dual-port RAM interface.
TIP: Before running synthesis, ensure array arguments are targeted to the correct memory type using
the RESOURCE directive. Re-synthesizing with corrected memories can result in a different schedule
and RTL.
The following figure shows an array named d specified as a single-port block RAM. The port
names are based on the C function argument. For example, if the C argument is d, the
chip-enable is d_ce, and the input data is d_q0 based on the output/q port of the BRAM.
ap_fifo
An ap_fifo interface is the most hardware-efficient approach when the design requires
access to a memory element and the access is always performed in a sequential manner,
that is, no random access is required. The ap_fifo port-level I/O protocol supports the
following:
In the following example, in1 is a pointer that accesses the current address, then two
addresses above the current address, and finally one address below.
If in1 is specified as an ap_fifo interface, Vivado HLS checks the accesses, determines
the accesses are not in sequential order, issues an error, and halts. To read from
non-sequential address locations, use an ap_memory or bram interface. For more
information, see ap_memory, bram.
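The difference between the two access patterns can be sketched in plain C. Both functions are hypothetical illustrations: `read_random` follows the order described above (current address, two above, one below), while `read_sequential` reads consecutive addresses in order, which is the pattern an ap_fifo interface supports.

```c
#include <assert.h>

/* Non-sequential reads: current address, two above, then one below.
   This ordering is what an ap_fifo interface cannot support. */
int read_random(const int *in1) {
    int data1 = *in1;
    int data2 = *(in1 + 2);
    int data3 = *(in1 - 1);
    return data1 + data2 + data3;
}

/* Sequential reads: each access uses the next address in order,
   which matches the first-in, first-out behavior of ap_fifo. */
int read_sequential(const int *in1) {
    int data1, data2, data3;
    assert(in1 != 0);
    data1 = *in1++;
    data2 = *in1++;
    data3 = *in1++;
    return data1 + data2 + data3;
}
```

Both functions compile as ordinary C. Only the second is a candidate for an ap_fifo interface on `in1`.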
You cannot specify an ap_fifo interface on an argument that is both read from and
written to. You can only specify an ap_fifo interface on an input or an output argument.
A design with input argument in and output argument out specified as ap_fifo
interfaces behaves as shown in the following figure.
ap_bus
An ap_bus interface can communicate with a bus bridge. Because the ap_bus interface
does not follow specific bus standards, you can use this interface with a bus bridge that
communicates with the system bus. The bus bridge must be able to cache all burst writes.
Note: Functions that can use an ap_bus interface use pointers and might access the same variable
multiple times. To understand the importance of the volatile qualifier when using this coding
style, see Multi-Access Pointer Interfaces: Streaming Data in Chapter 3.
• Standard Mode: This mode performs individual read and write operations, specifying
the address of each.
• Burst Mode: This mode performs data transfers if the C function memcpy is used in the
C source code. In burst mode, the interface indicates the base address and the size of
the transfer. The data samples are then transferred in consecutive cycles.
Note: Arrays accessed by the memcpy function cannot be partitioned into registers.
Figure 4-11 and Figure 4-12 show the behavior for read and write operations in standard
mode when an ap_bus interface is applied to argument d, as shown in the following
example:
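void foo(int *d) {
  /* Function header and declarations reconstructed; names are assumed. */
  static int acc = 0;
  int i;
  for (i = 0; i < 4; i++) {
    acc += d[i + 1];
    d[i] = acc;
  }
}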
Figure 4-13 and Figure 4-14 show the behavior when the C memcpy function and burst
mode are used, as shown in the following example:
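#include <string.h>
void foo(int *d) {
  /* Function header and buffer declarations reconstructed; names are assumed. */
  int buf1[4], buf2[4];
  int i;
  memcpy(buf1, d, 4 * sizeof(int));
  for (i = 0; i < 4; i++) {
    buf2[i] = buf1[3 - i];
  }
  memcpy(d, buf2, 4 * sizeof(int));
}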
• If a read must be performed and data is available in the bus bridge FIFO, indicated by
d_rsp_empty_n High, the following occurs:
° Output signal d_rsp_read is asserted in the next clock cycle and data is read at
the next clock edge.
° The output signal d_req_din is immediately asserted to indicate the data is valid
at the next clock edge.
• If a write must be performed and space is available in the bus bridge FIFO, indicated by
d_req_full_n High, the following occurs:
° The output signal d_req_din is asserted to indicate the data is valid at the next
clock edge.
° The base address for the transfer and the size are output.
° The output signal d_req_din is immediately asserted to indicate the data is valid
at the next clock edge.
° Output signal d_req_din is immediately deasserted if the FIFO becomes full and
reasserted when space is available.
° The transfer stops after N data values are transferred, where N is the value on the
size output port.
• If a write must be performed and space is available in the bus bridge FIFO, indicated by
d_rsp_full_n High, transfer begins and the design stalls and waits until the FIFO is
full.
axis
The axis mode specifies an AXI4-Stream I/O protocol. For a complete description of the
AXI4-Stream interface, including timing and ports, see the Vivado Design Suite AXI
Reference Guide (UG1037) [Ref 8]. For information on using the full capabilities of this I/O
protocol, see Using AXI4 Interfaces in Chapter 1.
s_axilite
The s_axilite mode specifies an AXI4-Lite slave I/O protocol. For a complete description
of the AXI4-Lite slave interface, including timing and ports, see the Vivado Design Suite AXI
Reference Guide (UG1037) [Ref 8]. For information on using the full capabilities of this I/O
protocol, see Using AXI4 Interfaces in Chapter 1.
m_axi
The m_axi mode specifies an AXI4 master I/O protocol. For a complete description of the
AXI4 master interface including timing and ports, see the Vivado Design Suite AXI Reference
Guide (UG1037) [Ref 8]. For information on using the full capabilities of this I/O protocol,
see Using AXI4 Interfaces in Chapter 1.
The API functions derive their names from the top-level function for synthesis. This reference
section assumes the top-level function is called DUT. The following table lists each of the
API functions provided in the C driver files.
XDut_Get_ARG_BitWidth: Return the bit width of each element in the array. Only available
when ARG is an array grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS
groups multiple elements into the 32-bit data width of the AXI4-Lite
interface. If the bit width of the elements exceeds 32-bit, Vivado HLS
stores each element over multiple consecutive addresses.
XDut_Get_ARG_Depth: Return the total number of elements in the array. Only available
when ARG is an array grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS
groups multiple elements into the 32-bit data width of the AXI4-Lite
interface. If the bit width of the elements exceeds 32-bit, Vivado HLS
stores each element over multiple consecutive addresses.
XDut_Write_ARG_Words: Write the specified number of 32-bit words into the specified
address of the AXI4-Lite interface. This API requires the offset address
from BaseAddress and the length of the data to be stored. Only
available when ARG is an array grouped into the AXI4-Lite
interface.
XDut_Initialize
Synopsis
int XDut_Initialize(XDut *InstancePtr, const char* InstanceName);
Description
For use on Linux systems, initialize a specifically named uio device. Create up to 5 memory
mappings and assign the slave base addresses by mmap, utilizing the uio device information
in sysfs.
XDut_CfgInitialize
Synopsis
Description
Initialize a device when an MMU is used in the system. In such a case, the effective address
of the AXI4-Lite slave is different from that defined in xparameters.h, and this API is
required to initialize the device.
XDut_LookupConfig
Synopsis
Description
This function is used to obtain the configuration information of the device by ID.
XDut_Release
Synopsis
Description
Release the uio device. Delete the mappings by munmap. (The mappings are deleted
automatically when the process terminates.)
XDut_Start
Synopsis
Description
Start the device. This function asserts the ap_start port on the device. Available only
if there is an ap_start port on the device.
XDut_IsDone
Synopsis
Description
Check if the device has finished the previous execution: this function will return the value of
the ap_done port on the device. Available only if there is an ap_done port on the device.
XDut_IsIdle
Synopsis
Description
Check if the device is in idle state: this function will return the value of the ap_idle port.
Available only if there is an ap_idle port on the device.
XDut_IsReady
Synopsis
Description
Check if the device is ready for the next input: this function will return the value of the
ap_ready port. Available only if there is an ap_ready port on the device.
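A typical use of the start and status functions is to start one transaction and poll for completion. The sketch below shows that pattern only. The `XDut` type and the two functions are stand-in stubs written for this example so that it compiles without the generated driver; their bodies and return types are assumptions, and the real driver accesses device registers instead. The helper `run_once` and its `max_polls` limit are also hypothetical.

```c
#include <assert.h>

/* Stand-in for the generated driver instance (assumption for this sketch). */
typedef struct {
    int busy_polls;   /* polls before the fake device reports done */
    int started;
} XDut;

/* Stub: the real function asserts ap_start on the device. */
void XDut_Start(XDut *InstancePtr) {
    assert(InstancePtr != 0);
    InstancePtr->started = 1;
}

/* Stub: the real function returns the value of ap_done. */
unsigned int XDut_IsDone(XDut *InstancePtr) {
    assert(InstancePtr != 0);
    if (!InstancePtr->started) {
        return 0;
    }
    if (InstancePtr->busy_polls > 0) {
        InstancePtr->busy_polls--;
        return 0;
    }
    return 1;
}

/* Start one transaction and poll until it completes or max_polls is reached.
   Returns the number of polls used, or -1 on timeout. */
int run_once(XDut *dut, int max_polls) {
    int polls;
    XDut_Start(dut);
    for (polls = 1; polls <= max_polls; polls++) {
        if (XDut_IsDone(dut)) {
            return polls;
        }
    }
    return -1;
}
```

Bounding the poll count is a design choice for this sketch. It avoids an endless loop if the device never reports completion.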
XDut_Continue
Synopsis
Description
Assert port ap_continue. Available only if there is an ap_continue port on the device.
XDut_EnableAutoRestart
Synopsis
Description
• Port ap_start will be asserted as soon as ap_done is asserted by the device and the
device will auto-start the next transaction.
• Alternatively, if the block-level I/O protocol ap_ctrl_chain is implemented on the device,
the next transaction will auto-restart (ap_start will be asserted) when ap_ready is
asserted by the device and if ap_continue is asserted when ap_done is asserted by the
device.
XDut_DisableAutoRestart
Synopsis
Description
Disable the “auto restart” function. Available only if there is an ap_start port.
XDut_Set_ARG
Synopsis
Description
Write a value to port ARG (a scalar argument of the top-level function). Available only if ARG
is an input port.
XDut_Set_ARG_vld
Synopsis
Description
Assert port ARG_vld. Available only if ARG is an input port and implemented with an ap_hs
or ap_vld interface protocol.
XDut_Set_ARG_ack
Synopsis
Description
Assert port ARG_ack. Available only if ARG is an output port and implemented with an
ap_hs or ap_ack interface protocol.
XDut_Get_ARG
Synopsis
Description
Read a value from ARG. Only available if port ARG is an output port on the device.
XDut_Get_ARG_vld
Synopsis
Description
Read a value from ARG_vld. Only available if port ARG is an output port on the device and
implemented with an ap_hs or ap_vld interface protocol.
XDut_Get_ARG_ack
Synopsis
Description
Read a value from ARG_ack. Only available if port ARG is an input port on the device and
implemented with an ap_hs or ap_ack interface protocol.
XDut_Get_ARG_BaseAddress
Synopsis
Description
Return the base address of the array inside the interface. Only available when ARG is an
array grouped into the AXI4-Lite interface.
XDut_Get_ARG_HighAddress
Synopsis
Description
Return the address of the uppermost element of the array. Only available when ARG is an
array grouped into the AXI4-Lite interface.
XDut_Get_ARG_TotalBytes
Synopsis
Description
Return the total number of bytes used to store the array. Only available when ARG is an
array grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS groups multiple elements into the
32-bit data width of the AXI4-Lite interface. If the bit width of the elements exceeds 32-bit, Vivado
HLS stores each element over multiple consecutive addresses.
• InstancePtr: A pointer to the device instance.
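The grouping of narrow elements into 32-bit words can be illustrated with 8-bit elements. This is a sketch under an explicit assumption: element 0 is placed in the least significant byte of word 0, and later elements fill successively higher bytes. The actual lane order used by the generated interface is not stated here and should be checked against the generated driver. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit words needed to hold n elements of 8 bits each. */
unsigned words_for_u8(unsigned n) {
    return (n + 3u) / 4u;
}

/* Pack n 8-bit elements into 32-bit words.
   Assumption: element 0 occupies the least significant byte of word 0. */
void pack_u8(const uint8_t *src, unsigned n, uint32_t *dst) {
    unsigned i;
    assert(src != 0 && dst != 0);
    for (i = 0; i < words_for_u8(n); i++) {
        dst[i] = 0;
    }
    for (i = 0; i < n; i++) {
        dst[i / 4u] |= (uint32_t)src[i] << (8u * (i % 4u));
    }
}
```

Under this assumption, five 8-bit elements occupy two 32-bit words, with the upper three bytes of the second word left at zero.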
XDut_Get_ARG_BitWidth
Synopsis
Description
Return the bit width of each element in the array. Only available when ARG is an array
grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS groups multiple elements into the
32-bit data width of the AXI4-Lite interface. If the bit width of the elements exceeds 32-bit, Vivado
HLS stores each element over multiple consecutive addresses.
• InstancePtr: A pointer to the device instance.
XDut_Get_ARG_Depth
Synopsis
Description
Return the total number of elements in the array. Only available when ARG is an array
grouped into the AXI4-Lite interface.
Note: If the elements in the array are less than 16-bit, Vivado HLS groups multiple elements into the
32-bit data width of the AXI4-Lite interface. If the bit width of the elements exceeds 32-bit, Vivado
HLS stores each element over multiple consecutive addresses.
• InstancePtr: A pointer to the device instance.
XDut_Write_ARG_Words
Synopsis
Description
Write the specified number of 32-bit words into the specified address of the AXI4-Lite
interface. This API requires the offset address from BaseAddress and the length of the data
to be stored. Only available when ARG is an array grouped into the AXI4-Lite interface.
XDut_Read_ARG_Words
Synopsis
Description
Read the specified number of 32-bit words from the array. This API requires the data target,
the offset address from BaseAddress, and the length of the data to be stored. Only available
when ARG is an array grouped into the AXI4-Lite interface.
XDut_Write_ARG_Bytes
Synopsis
Description
Write the specified number of bytes into the specified address of the AXI4-Lite interface.
This API requires the offset address from BaseAddress and the length of the data to be
stored. Only available when ARG is an array grouped into the AXI4-Lite interface.
XDut_Read_ARG_Bytes
Synopsis
Description
Read the specified number of bytes from the array. This API requires the data target, the
offset address from BaseAddress, and the length of data to be loaded. Only available when
ARG is an array grouped into the AXI4-Lite interface.
XDut_InterruptGlobalEnable
Synopsis
Description
Enable the interrupt output. Interrupt functions are available only if there is an ap_start port.
XDut_InterruptGlobalDisable
Synopsis
Description
XDut_InterruptEnable
Synopsis
Description
Enable the interrupt source. There may be at most 2 interrupt sources (source 0 for ap_done
and source 1 for ap_ready).
° Bit n = 0: no change.
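The per-source bit convention can be sketched with two small helpers. The source numbers come from the description above (source 0 for ap_done, source 1 for ap_ready). The enum and function names are hypothetical and are not part of the generated driver.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupt source numbers taken from the description above. */
enum { IRQ_SRC_AP_DONE = 0, IRQ_SRC_AP_READY = 1 };

/* Build a mask with bit n set for interrupt source n. */
uint32_t irq_mask(int source) {
    assert(source == IRQ_SRC_AP_DONE || source == IRQ_SRC_AP_READY);
    return (uint32_t)1 << source;
}

/* Return nonzero if source n is flagged in a value read back from the device. */
int irq_is_set(uint32_t value, int source) {
    return (value & irq_mask(source)) != 0;
}
```

A combined mask such as `irq_mask(IRQ_SRC_AP_DONE) | irq_mask(IRQ_SRC_AP_READY)` selects both sources, and a bit left at 0 leaves the corresponding source unchanged.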
XDut_InterruptDisable
Synopsis
Description
° Bit n = 0: no change.
XDut_InterruptClear
Synopsis
Description
° Bit n = 0: no change.
XDut_InterruptGetEnabled
Synopsis
Description
° Bit n = 1: enabled.
° Bit n = 0: disabled.
XDut_InterruptGetStatus
Synopsis
Description
° Bit n = 1: triggered.
Converts data to and from the standard OpenCV data types to AXI4 streaming protocol.
Allows the AXI4 streaming protocol to be converted into the hls::Mat data types used
by the video processing functions.
Compatible with standard OpenCV functions for manipulating and processing video
images.
For more information and a complete methodology for working with the video functions in
the context of an existing OpenCV design, see Accelerating OpenCV Applications with
Zynq-7000 All Programmable SoC Using Vivado HLS Video Libraries (XAPP1167) [Ref 9].
Parameters
Description
• Converts data from OpenCV IplImage format to AXI4 video stream (hls::stream)
format.
• Image data must be stored in img.
• AXI_video_strm must be empty before invoking.
• The data width (in bits) of a pixel in img must be no greater than W, the data width of
TDATA in AXI4-Stream protocol.
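The width constraint can be checked with simple arithmetic before choosing W. The sketch below assumes that the pixel width is the number of bits per channel multiplied by the number of channels. Both helper functions are hypothetical and are not part of the video library.

```c
#include <assert.h>

/* Bits per pixel = bits per channel x number of channels. */
int pixel_bits(int bits_per_channel, int channels) {
    assert(bits_per_channel > 0 && channels > 0);
    return bits_per_channel * channels;
}

/* Nonzero if a pixel fits in a TDATA field of W bits. */
int fits_tdata(int bits_per_channel, int channels, int W) {
    return pixel_bits(bits_per_channel, channels) <= W;
}
```

For example, an image with three 8-bit channels needs 24 bits per pixel, so under this assumption W must be at least 24.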
AXIvideo2IplImage
Synopsis
template<int W> void AXIvideo2IplImage (
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm,
IplImage* img);
Parameters
Description
• Converts data from AXI4 video stream (hls::stream) format to OpenCV IplImage
format.
• Image data must be stored in AXI_video_strm.
• Invoking this function consumes the data in AXI_video_strm.
• The data width of a pixel in img must be no greater than W, the data width of TDATA in
AXI4-Stream protocol.
cvMat2AXIvideo
Synopsis
template<int W> void cvMat2AXIvideo (
cv::Mat& cv_mat,
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm);
Parameters
Description
• Converts data from OpenCV cv::Mat format to AXI4 video stream (hls::stream)
format.
• Image data must be stored in cv_mat.
• AXI_video_strm must be empty before invoking.
• The data width (in bits) of a pixel in cv_mat must be no greater than W, the data width
of TDATA in AXI4-Stream protocol.
AXIvideo2cvMat
Synopsis
template<int W> void AXIvideo2cvMat (
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm,
cv::Mat& cv_mat);
Parameters
Description
• Converts data from AXI4 video stream (hls::stream) format to OpenCV cv::Mat
format.
• Image data must be stored in AXI_video_strm.
• Invoking this function consumes the data in AXI_video_strm.
• The data width of a pixel in cv_mat must be no greater than W, the data width of
TDATA in AXI4-Stream protocol.
CvMat2AXIvideo
Synopsis
template<int W> void CvMat2AXIvideo (
CvMat* cvmat,
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm);
Parameters
Description
• Converts data from OpenCV CvMat format to AXI4 video stream (hls::stream)
format.
• Image data must be stored in cvmat.
• AXI_video_strm must be empty before invoking.
• The data width (in bits) of a pixel in cvmat must be no greater than W, the data width
of TDATA in AXI4-Stream protocol.
AXIvideo2CvMat
Synopsis
template<int W> void AXIvideo2CvMat (
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm,
CvMat* cvmat);
Parameters
Description
• Converts data from AXI4 video stream (hls::stream) format to OpenCV CvMat
format.
• Image data must be stored in AXI_video_strm.
• Invoking this function consumes the data in AXI_video_strm.
• The data width of a pixel in cvmat must be no greater than W, the data width of TDATA
in AXI4-Stream protocol.
IplImage2hlsMat
Synopsis
template<int ROWS, int COLS, int T> void IplImage2hlsMat (
IplImage* img,
hls::Mat<ROWS, COLS, T>& mat);
Parameters
Description
• Converts data from OpenCV IplImage format to hls::Mat format.
• Image data must be stored in img.
• mat must be empty before invoking.
• Arguments img and mat must have the same size and number of channels.
hlsMat2IplImage
Synopsis
template<int ROWS, int COLS, int T> void hlsMat2IplImage (
hls::Mat<ROWS, COLS, T>& mat,
IplImage* img);
Parameters
Description
• Converts data from hls::Mat format to OpenCV IplImage format.
• Image data must be stored in mat.
• Invoking this function consumes the data in mat.
• Arguments mat and img must have the same size and number of channels.
cvMat2hlsMat
Synopsis
template<int ROWS, int COLS, int T> void cvMat2hlsMat (
cv::Mat* cv_mat,
hls::Mat<ROWS, COLS, T>& mat);
Parameters
Description
• Converts data from OpenCV cv::Mat format to hls::Mat format.
• Image data must be stored in cv_mat.
• mat must be empty before invoking.
• Arguments cv_mat and mat must have the same size and number of channels.
hlsMat2cvMat
Synopsis
template<int ROWS, int COLS, int T> void hlsMat2cvMat (
hls::Mat<ROWS, COLS, T>& mat,
cv::Mat& cv_mat);
Parameters
Description
• Converts data from hls::Mat format to OpenCV cv::Mat format.
• Image data must be stored in mat.
• Invoking this function consumes the data in mat.
• Arguments mat and cv_mat must have the same size and number of channels.
CvMat2hlsMat
Synopsis
template<int ROWS, int COLS, int T> void CvMat2hlsMat (
CvMat* cvmat,
hls::Mat<ROWS, COLS, T>& mat);
Parameters
Description
• Converts data from OpenCV CvMat format to hls::Mat format.
• Image data must be stored in cvmat.
• mat must be empty before invoking.
• Arguments cvmat and mat must have the same size and number of channels.
hlsMat2CvMat
Synopsis
template<int ROWS, int COLS, int T> void hlsMat2CvMat (
hls::Mat<ROWS, COLS, T>& mat,
CvMat* cvmat);
Parameters
Description
• Converts data from hls::Mat format to OpenCV CvMat format.
• Image data must be stored in mat.
• Invoking this function consumes the data in mat.
• Arguments mat and cvmat must have the same size and number of channels.
CvMat2hlsWindow
Synopsis
template<int ROWS, int COLS, typename T> void CvMat2hlsWindow (
CvMat* cvmat,
hls::Window<ROWS, COLS, T>& window);
Parameters
Description
• Converts data from OpenCV CvMat format to hls::Window format.
• Image data must be stored in cvmat.
• window must be empty before invoking.
• Arguments cvmat and window must be single-channel, and have the same size. This
function is mainly for converting image processing kernels.
hlsWindow2CvMat
Synopsis
template<int ROWS, int COLS, typename T> void hlsWindow2CvMat (
hls::Window<ROWS, COLS, T>& window,
CvMat* cvmat);
Parameters
Description
• Converts data from hls::Window format to OpenCV CvMat format.
• Image data must be stored in window.
• Invoking this function consumes the data in window.
• Arguments cvmat and window must be single-channel and have the same size. This
function is mainly for converting image processing kernels.
Parameters
Description
• Converts image data stored in AXI4 video stream (hls::stream) format to an image
of hls::Mat format.
• Image data must be stored in AXI_video_strm.
• The data field of mat must be empty before invoking.
• Invoking this function consumes the data in AXI_video_strm and fills the image data
of mat.
• The data width of a pixel in mat must be no greater than W, the data width of TDATA in
AXI4-Stream protocol.
• This function is able to perform frame sync for the input video stream, by detecting the
TUSER bit to mark the top-left pixel of an input frame. It returns a bit error of
ERROR_IO_EOL_EARLY or ERROR_IO_EOL_LATE to indicate an unexpected line length, by
detecting the TLAST input.
hls::Mat2AXIvideo
Synopsis
template<int W, int ROWS, int COLS, int T> int hls::Mat2AXIvideo (
hls::Mat<ROWS, COLS, T>& mat,
hls::stream<ap_axiu<W,1,1,1> >& AXI_video_strm);
Parameters
Description
• Converts image data stored in hls::Mat format to an AXI4 video stream
(hls::stream) format.
• Image data must be stored in mat.
• The data field of AXI_video_strm must be empty before invoking.
• Invoking this function consumes the data in mat and fills the image data of
AXI_video_strm.
• The data width of a pixel in mat must be no greater than W, the data width of TDATA in
AXI4-Stream protocol.
• To fill image data to AXI4 video stream, this function also sets TUSER bit of stream
element for indicating the top-left pixel, as well as setting TLAST bit in the last pixel of
each line to indicate the end of line.
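The framing convention described above can be sketched in plain C++. This is a minimal illustration, not the library implementation: Beat and frame_to_stream are hypothetical names, and the struct only models the TDATA, TUSER, and TLAST fields as ordinary members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one AXI4-Stream video beat (not ap_axiu).
struct Beat {
    std::uint32_t data;  // pixel value (TDATA)
    bool user;           // TUSER: set only on the top-left pixel of the frame
    bool last;           // TLAST: set on the last pixel of each line
};

// Serializes a row-major frame into beats using the framing rules above.
std::vector<Beat> frame_to_stream(const std::vector<std::uint32_t>& pixels,
                                  int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(pixels.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<Beat> out;
    out.reserve(pixels.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            Beat b;
            b.data = pixels[static_cast<std::size_t>(r) * cols + c];
            b.user = (r == 0 && c == 0);  // start of frame
            b.last = (c == cols - 1);     // end of line
            out.push_back(b);
        }
    }
    return out;
}
```

For a 2x3 frame, the sketch marks beat 0 with the user flag and beats 2 and 5 with the last flag.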
Parameters
Description
• Computes the absolute difference between two input images src1 and src2 and
saves the result in dst.
• Image data must be stored in src1 and src2.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src1 and src2 and fills the image data of
dst.
• src1 and src2 must have the same size and number of channels.
• dst must have the same size and number of channels as the inputs.
OpenCV Reference
• cvAbsDiff
• cv::absdiff
hls::AddS
Synopsis
Without Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::AddS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst);
With Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::AddS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst,
hls::Mat<ROWS, COLS, HLS_8UC1>& mask,
hls::Mat<ROWS, COLS, DST_T>& dst_ref);
Parameters
Description
• Computes the per-element sum of an image src and a scalar scl.
• Saves the result in dst.
• If computed with mask:
• Image data must be stored in src (if computed with mask, mask and dst_ref must
have data stored), and the image data of dst must be empty before invoking.
• Invoking this function consumes the data in src (if computed with mask, the data of
mask and dst_ref are also consumed) and fills the image data of dst.
• src and scl must have the same number of channels. dst and dst_ref must have the
same size and number of channels as src. mask must have the same size as the src.
OpenCV Reference
• cvAddS
• cv::add
hls::AddWeighted
Synopsis
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T, typename P_T>
void hls::AddWeighted (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
P_T alpha,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
P_T beta,
P_T gamma,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Computes the weighted per-element sum of two images src1 and src2.
• Saves the result in dst.
• The weighted sum computes as follows:
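As a rough sketch of the weighted sum, the following plain C++ assumes the same per-element rule that OpenCV documents for cv::addWeighted, that is dst = src1*alpha + src2*beta + gamma, saturated to the destination range. The names add_weighted and saturate_u8, the 8-bit element type, and the round-to-nearest step are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rounds and clamps a real value to the 8-bit unsigned range (assumed rounding mode).
inline std::uint8_t saturate_u8(double v) {
    return static_cast<std::uint8_t>(std::min(255.0, std::max(0.0, std::round(v))));
}

// Per-element weighted sum of two single-channel 8-bit images.
std::vector<std::uint8_t> add_weighted(const std::vector<std::uint8_t>& src1, double alpha,
                                       const std::vector<std::uint8_t>& src2, double beta,
                                       double gamma) {
    assert(src1.size() == src2.size());  // inputs must have the same size
    std::vector<std::uint8_t> dst(src1.size());
    for (std::size_t i = 0; i < src1.size(); ++i) {
        dst[i] = saturate_u8(src1[i] * alpha + src2[i] * beta + gamma);
    }
    return dst;
}
```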
OpenCV Reference
• cvAddWeighted
• cv::addWeighted
hls::And
Synopsis
Without Mask:
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T>
void hls::And (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst);
With Mask:
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T>
void hls::And (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst,
hls::Mat<ROWS, COLS, HLS_8UC1>& mask,
hls::Mat<ROWS, COLS, DST_T>& dst_ref);
Parameters
Description
• Calculates the per-element bitwise logical conjunction of two images src1 and src2
• Returns the result as image dst.
• If computed with mask:
OpenCV Reference
• cvAnd
• cv::bitwise_and
hls::Avg
Synopsis
Without Mask:
With Mask:
Parameters
Description
• Calculates an average of elements in image src.
• Returns the result in hls::Scalar format.
• If computed with mask:
OpenCV Reference
• cvAvg
• cv::mean
hls::AvgSdv
Synopsis
Without Mask:
With Mask:
Parameters
Description
• Calculates an average and standard deviation of elements in image src.
• Returns the result in hls::Scalar format.
• If computed with mask:
OpenCV Reference
• cvAvgSdv
• cv::meanStdDev
hls::Cmp
Synopsis
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T>
void hls::Cmp (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst,
int cmp_op);
Parameters
Description
• Performs the per-element comparison of two input images src1 and src2.
• Saves the result in dst.
• If the comparison result is true, the corresponding element of dst is set to 255.
Otherwise, it is set to 0.
• Image data must be stored in src1 and src2.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src1 and src2 and fills the image data of
dst.
• src1 and src2 must have the same size and number of channels.
• dst must have the same size and number of channels as the inputs.
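The 255/0 output convention described above can be sketched as follows. This is an illustration in plain C++ rather than the library code: CmpOp and compare_images are hypothetical names, and the library's own cmp_op constants are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical comparison selector used only by this sketch.
enum class CmpOp { Equal, Greater, GreaterEqual, Less, LessEqual, NotEqual };

// Returns 255 where the comparison holds and 0 elsewhere.
std::vector<std::uint8_t> compare_images(const std::vector<int>& src1,
                                         const std::vector<int>& src2,
                                         CmpOp op) {
    assert(src1.size() == src2.size());  // inputs must have the same size
    std::vector<std::uint8_t> dst(src1.size());
    for (std::size_t i = 0; i < src1.size(); ++i) {
        bool hit = false;
        switch (op) {
            case CmpOp::Equal:        hit = src1[i] == src2[i]; break;
            case CmpOp::Greater:      hit = src1[i] >  src2[i]; break;
            case CmpOp::GreaterEqual: hit = src1[i] >= src2[i]; break;
            case CmpOp::Less:         hit = src1[i] <  src2[i]; break;
            case CmpOp::LessEqual:    hit = src1[i] <= src2[i]; break;
            case CmpOp::NotEqual:     hit = src1[i] != src2[i]; break;
        }
        dst[i] = hit ? 255 : 0;
    }
    return dst;
}
```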
OpenCV Reference
• cvCmp
• cv::compare
hls::CmpS
Synopsis
template<int ROWS, int COLS, int SRC_T, typename P_T, int DST_T>
void hls::CmpS (
hls::Mat<ROWS, COLS, SRC_T>& src,
P_T value,
hls::Mat<ROWS, COLS, DST_T>& dst,
int cmp_op);
Parameters
Description
• Performs the comparison between the elements of input image src and the input
value and saves the result in dst.
• If the comparison result is true, the corresponding element of dst is set to 255.
Otherwise it is set to 0.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• src and dst must have the same size and number of channels.
OpenCV Reference
• cvCmpS
• cv::compare
hls::CornerHarris
Synopsis
template<int blockSize,int Ksize,typename KT,int SRC_T,int DST_T,int ROWS,int COLS>
void CornerHarris(
hls::Mat<ROWS, COLS, SRC_T> &_src,
hls::Mat<ROWS, COLS, DST_T> &_dst,
KT k);
Parameters
Description
• This function implements a Harris edge/corner detector. The horizontal and vertical
derivatives are estimated using a Ksize*Ksize Sobel filter. The local covariance matrix M
of the derivatives is smoothed over a blockSize*blockSize neighborhood of each pixel
(x,y). This function outputs the function.
OpenCV Reference
• cvCornerHarris
• cv::cornerHarris
hls::CvtColor
Synopsis
template<int code, int ROWS, int COLS, int SRC_T, int DST_T>
void hls::CvtColor (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Converts a color image from or to a grayscale image. The type of conversion is defined
by the value of code:
OpenCV Reference
• cvCvtColor
• cv::cvtColor
hls::Dilate
Synopsis
Default:
Custom:
template<int ROWS, int COLS, int SRC_T, int DST_T, int K_ROWS, int K_COLS, typename
K_T, int Shape_type, int ITERATIONS>
void hls::Dilate (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst,
hls::Window<K_ROWS, K_COLS, K_T> & kernel);
Parameters
Description
• Dilates the image src using the specified structuring element constructed within the
kernel.
• Saves the result in dst.
• The dilation determines the shape of a pixel neighborhood over which the maximum is
taken.
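A minimal sketch of dilation as a neighborhood maximum is shown below, assuming a fixed 3x3 rectangular structuring element, a single-channel 8-bit image, and border pixels that simply ignore neighbors outside the image. The name dilate3x3 and these border rules are assumptions for illustration; they are not taken from the library. Erosion follows the same loop with the minimum in place of the maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dilates a row-major single-channel image with a 3x3 rectangular element.
std::vector<std::uint8_t> dilate3x3(const std::vector<std::uint8_t>& src,
                                    int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<std::uint8_t> dst(src.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            std::uint8_t best = 0;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = r + dr;
                    const int cc = c + dc;
                    // Assumption: neighbors outside the image are skipped.
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    best = std::max(best, src[static_cast<std::size_t>(rr) * cols + cc]);
                }
            }
            dst[static_cast<std::size_t>(r) * cols + c] = best;
        }
    }
    return dst;
}
```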
OpenCV Reference
• cvDilate
• cv::dilate
hls::Duplicate
Synopsis
template<int ROWS, int COLS, int SRC_T, int DST_T>
void hls::Duplicate (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst1,
hls::Mat<ROWS, COLS, DST_T>& dst2);
Parameters
Description
• Copies the input image src to two output images dst1 and dst2, for use at the point
where two datapaths diverge.
• Image data must be stored in src.
• The image data of dst1 and dst2 must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst1 and
dst2.
• src, dst1, and dst2 must have the same size and number of channels.
OpenCV Reference
Not applicable.
hls::EqualizeHist
Synopsis
template<int SRC_T, int DST_T,int ROW, int COL>
void EqualizeHist(
Mat<ROW, COL, SRC_T>&_src,
Mat<ROW, COL, DST_T>&_dst);
Parameters
Description
• Computes a histogram of each frame and uses it to normalize the range of the
following frame.
• The delay avoids the use of a frame buffer in the implementation.
• The histogram is stored as static data internal to this function, allowing only one call to
EqualizeHist to be made.
• The input is expected to have type HLS_8UC1.
OpenCV Reference
• cvEqualizeHist
• cv::equalizeHist
hls::Erode
Synopsis
Default:
Custom:
Parameters
Description
• Erodes the image src using the specified structuring element constructed within
kernel.
• Saves the result in dst.
• The erosion determines the shape of a pixel neighborhood over which the minimum is
taken. Each channel of image src is processed independently:
OpenCV Reference
• cvErode
• cv::erode
hls::FASTX
Synopsis
template<int SRC_T,int ROWS,int COLS>
void FASTX(
hls::Mat<ROWS,COLS,SRC_T> &_src,
hls::Mat<ROWS,COLS,HLS_8UC1> &_mask,
int _threshold,
bool _nomax_supression);
Parameters
Description
• Implements the FAST corner detector, generating either a mask of corners, or an array
of coordinates.
OpenCV Reference
• cvFAST
• cv::FASTX
hls::Filter2D
Synopsis
template<typename BORDERMODE, int SRC_T, int DST_T, typename KN_T, typename POINT_T,
int IMG_HEIGHT,int IMG_WIDTH,int K_HEIGHT,int K_WIDTH>
void Filter2D(
Mat<IMG_HEIGHT, IMG_WIDTH, SRC_T>&_src,
Mat<IMG_HEIGHT, IMG_WIDTH, DST_T> &_dst,
Window<K_HEIGHT,K_WIDTH,KN_T>&_kernel,
Point_<POINT_T>anchor)
Parameters
Description
• Applies an arbitrary linear filter to the image src using the specified kernel.
• Saves the result to image dst.
• This function filters the image by computing correlation using kernel:
OpenCV Reference
Usage:
hls::Filter2D<3,3,BORDER_CONSTANT>(src,dst)
hls::Filter2D<3,3>(src,dst)
• cv::filter2D
• cvFilter2D (see the note below in the discussion of border modes)
hls::GaussianBlur
Synopsis
template<int KH,int KW,typename BORDERMODE,int SRC_T,int DST_T,int ROWS,int COLS>
void GaussianBlur(
Mat<ROWS, COLS, SRC_T> &_src,
Mat<ROWS, COLS, DST_T> &_dst,
double sigmaX=0,
double sigmaY=0);
Parameters
Description
• Applies a normalized 2D Gaussian Blur filter to the input.
• The filter coefficients are determined by the KH and KW parameters, which must either
be 3 or 5.
• The 3x3 filter taps are given by:
[1, 2, 1,
2, 4, 2,
1, 2, 1] * 1/16
• The 5x5 filter taps are given by:
[1, 2, 3, 2, 1,
2, 5, 6, 5, 2,
3, 6, 8, 6, 3,
2, 5, 6, 5, 2,
1, 2, 3, 2, 1] * 1/84
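The normalization factors can be checked against the taps themselves: the 3x3 taps sum to 16 and the 5x5 taps sum to 84. The following sketch is a plain C++ illustration in which tap_sum and apply_taps are hypothetical helpers, and the round-to-nearest division for non-negative pixel values is an assumption.

```cpp
#include <cassert>

// Filter taps as listed above.
const int kGauss3[3][3] = {{1, 2, 1}, {2, 4, 2}, {1, 2, 1}};
const int kGauss5[5][5] = {{1, 2, 3, 2, 1},
                           {2, 5, 6, 5, 2},
                           {3, 6, 8, 6, 3},
                           {2, 5, 6, 5, 2},
                           {1, 2, 3, 2, 1}};

// Sums all taps of an N x N kernel.
template <int N>
int tap_sum(const int (&k)[N][N]) {
    int total = 0;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            total += k[r][c];
    return total;
}

// Applies the taps to one N x N window of non-negative pixels and divides by
// the tap sum, rounding to nearest (assumed rounding mode).
template <int N>
int apply_taps(const int (&k)[N][N], const int (&win)[N][N]) {
    const int norm = tap_sum(k);
    assert(norm > 0);
    int acc = 0;
    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            acc += k[r][c] * win[r][c];
    return (acc + norm / 2) / norm;
}
```

Because the taps are divided by their sum, a window of constant value passes through unchanged.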
OpenCV Reference
Usage:
hls::GaussianBlur<3,3,BORDER_CONSTANT>(src,dst)
hls::GaussianBlur<3,3>(src,dst)
• cv::GaussianBlur
• BORDER_REPLICATE: The input is extended at the boundary with the boundary value.
Given the series of pixels “abcde”, the border is completed as “abcdeeee”.
• BORDER_REFLECT: The input is extended at the boundary with the edge pixel
duplicated. Given the series of pixels “abcde”, the border is completed as
“abcdeedc”.
• BORDER_REFLECT_101: The input is extended at the boundary with the edge pixel not
duplicated. Given the series of pixels “abcde”, the border is completed as
“abcdedcb”.
• BORDER_DEFAULT: Same as BORDER_REFLECT_101.
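The three extension rules can be expressed as an index mapping. The sketch below is plain C++ written for illustration: Border, border_index, and extend_right are hypothetical names, and the mapping only handles a single reflection (an overshoot smaller than the line length).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical selector mirroring the border modes described above.
enum class Border { Replicate, Reflect, Reflect101 };

// Maps an index i that may fall outside [0, n) back onto a valid pixel index.
int border_index(int i, int n, Border mode) {
    assert(n > 0 && i > -n && i < 2 * n - 1);  // single reflection only
    if (i >= 0 && i < n) return i;
    switch (mode) {
        case Border::Replicate:  return i < 0 ? 0 : n - 1;
        case Border::Reflect:    return i < 0 ? -1 - i : 2 * n - 1 - i;
        case Border::Reflect101: return i < 0 ? -i : 2 * n - 2 - i;
    }
    return 0;  // not reached
}

// Extends a line of pixels to the right by `extra` positions.
std::string extend_right(const std::string& line, int extra, Border mode) {
    const int n = static_cast<int>(line.size());
    std::string out = line;
    for (int i = n; i < n + extra; ++i) {
        out.push_back(line[static_cast<std::size_t>(border_index(i, n, mode))]);
    }
    return out;
}
```

Extending "abcde" by three pixels with each mode reproduces the three strings given in the list above.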
hls::Harris
Synopsis
template<int blockSize,int Ksize,typename KT,int SRC_T,int DST_T,int ROWS,int COLS>
void Harris(
hls::Mat<ROWS, COLS, SRC_T> &_src,
hls::Mat<ROWS, COLS, DST_T> &_dst,
KT k,
int threshold);
Parameters
Description
• This function implements a Harris edge or corner detector.
• The horizontal and vertical derivatives are estimated using a Ksize*Ksize Sobel filter.
• The local covariance matrix M of the derivatives is smoothed over a blockSize*blockSize
neighborhood of each pixel (x,y).
• Points where the function
has a maximum, and is greater than the threshold are marked as corners/edges in the
output image.
OpenCV Reference
• cvCornerHarris
• cv::cornerHarris
hls::HoughLines2
Synopsis
template<typename AT,typename RT>
struct Polar_ {
AT angle;
RT rho;
};
Parameters
Description
• Implements the Hough line transform.
OpenCV Reference
• cvHoughLines2
• cv::HoughLines
hls::Integral
Synopsis
template<int SRC_T, int DST_T, int ROWS,int COLS>
void Integral(
Mat<ROWS, COLS, SRC_T>&_src,
Mat<ROWS+1, COLS+1, DST_T>&_sum);
template<int SRC_T, int DST_T,int DSTSQ_T, int ROWS,int COLS>
void Integral(
Mat<ROWS, COLS, SRC_T>&_src,
Mat<ROWS+1, COLS+1, DST_T>&_sum,
Mat<ROWS+1, COLS+1, DSTSQ_T>&_sqsum);
Parameters
Description
• Implements the computation of an integral image.
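As an illustration of why the output has ROWS+1 rows and COLS+1 columns, the sketch below assumes the OpenCV convention in which the first row and first column of the integral image are zero and entry (r+1, c+1) holds the sum of all source pixels up to and including (r, c). The function name integral_image and the use of std::vector are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes a (rows+1) x (cols+1) integral image from a row-major source.
std::vector<long long> integral_image(const std::vector<int>& src,
                                      int rows, int cols) {
    assert(rows > 0 && cols > 0);
    const std::size_t R = static_cast<std::size_t>(rows);
    const std::size_t C = static_cast<std::size_t>(cols);
    assert(src.size() == R * C);
    const std::size_t W = C + 1;  // width of the integral image
    std::vector<long long> sum((R + 1) * W, 0);
    for (std::size_t r = 0; r < R; ++r) {
        for (std::size_t c = 0; c < C; ++c) {
            // Inclusion-exclusion over the three previously computed neighbors.
            sum[(r + 1) * W + (c + 1)] = src[r * C + c]
                                       + sum[r * W + (c + 1)]
                                       + sum[(r + 1) * W + c]
                                       - sum[r * W + c];
        }
    }
    return sum;
}
```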
OpenCV Reference
• cvIntegral
• cv::integral
hls::InitUndistortRectifyMap
Synopsis
template< typename CMT, typename RT, typename DT, int ROW, int COL, int MAP1_T, int MAP2_T,
int N>
void InitUndistortRectifyMap(
Window<3,3, CMT> cameraMatrix,
DT (&distCoeffs)[N],
Window<3,3, RT> R,
Window<3,3, CMT> newcameraMatrix,
Mat<ROW, COL, MAP1_T> &map1,
Mat<ROW, COL, MAP2_T> &map2);
template< typename CMT, typename ICMT, typename DT, int ROW, int COL, int MAP1_T, int MAP2_T,
int N>
void InitUndistortRectifyMapInverse(
Window<3,3, CMT> cameraMatrix,
DT (&distCoeffs)[N],
Window<3,3, ICMT> ir,
Mat<ROW, COL, MAP1_T> &map1,
Mat<ROW, COL, MAP2_T> &map2);
Parameters
Description
• Generates map1 and map2, based on a set of parameters, where map1 and map2 are
suitable inputs for hls::Remap().
• In general, InitUndistortRectifyMapInverse() is preferred for synthesis, because the
per-frame processing to compute ir is performed outside of the synthesized logic. The
various parameters may be floating point or fixed-point. If fixed-point inputs are used,
then internal coordinate transformations are done with at least the precision given by
ICMT.
• As the coordinate transformations implemented in this function can be hardware
resource intensive, it may be preferable to compute the results of this function offline
and store map1 and map2 in external memory if the input parameters are fixed and
sufficient external memory bandwidth is available.
Limitations
map1 and map2 are only supported as HLS_16SC2. cameraMatrix and newcameraMatrix
are normalized in the sense that their form is:
[f_x,0,c_x,
0,f_y,c_y,
0,0,1]
[a,b,c,
d,e,f,
0,0,1]
OpenCV Reference
• cv::initUndistortRectifyMap
hls::Max
Synopsis
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T>
void hls::Max (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Calculates per-element maximum of two input images src1 and src2 and saves the
result in dst.
• Image data must be stored in src1 and src2.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src1 and src2 and fills the image data of
dst.
• src1 and src2 must have the same size and number of channels. dst must have the
same size and number of channels as the inputs.
OpenCV Reference
• cvMax
• cv::max
hls::MaxS
Synopsis
template<int ROWS, int COLS, int SRC_T, typename P_T, int DST_T>
void hls::MaxS (
hls::Mat<ROWS, COLS, SRC_T>& src,
P_T value,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Calculates the maximum between the elements of input image src and the input value
and saves the result in dst.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• src and dst must have the same size and number of channels.
OpenCV Reference
• cvMaxS
• cv::max
hls::Mean
Synopsis
Without Mask:
With Mask:
Parameters
Description
• Calculates an average of elements in image src, and returns the value of the first channel
of the result scalar.
• If computed with mask:
• Image data must be stored in src (if computed with mask, mask must have data
stored).
• Invoking this function consumes the data in src (if computed with mask, the data of
mask is also consumed).
• src and mask must have the same size. mask must have at least one non-zero element.
OpenCV Reference
• cvMean
• cv::mean
hls::Merge
Synopsis
Parameters
Description
• Composes a multichannel image dst from several single-channel images.
• Image data must be stored in input images.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in inputs and fills the image data of dst.
• Input images must have the same size and be single-channel. dst must have the same
size as the inputs, and the number of channels of dst must equal the number of input
images.
OpenCV Reference
• cvMerge
• cv::merge
hls::Min
Synopsis
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T>
void hls::Min (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Calculates per-element minimum of two input images src1 and src2 and saves the
result in dst.
• Image data must be stored in src1 and src2.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src1 and src2 and fills the image data of
dst.
• src1 and src2 must have the same size and number of channels.
• dst must have the same size and number of channels as the inputs.
OpenCV Reference
• cvMin
• cv::min
hls::MinMaxLoc
Synopsis
Without Mask:
With Mask:
Parameters
Description
• Finds the global minimum and maximum and their locations in input image src.
• Image data must be stored in src (if computed with mask, mask must have data
stored).
• Invoking this function consumes the data in src (if computed with mask, the data of
mask is also consumed).
• min_val and max_val must have the same data type. src and mask must have the same
size.
OpenCV Reference
• cvMinMaxLoc
• cv::minMaxLoc
hls::MinS
Synopsis
template<int ROWS, int COLS, int SRC_T, typename P_T, int DST_T>
void hls::MinS (
hls::Mat<ROWS, COLS, SRC_T>& src,
P_T value,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Calculates the minimum between the elements of input image src and the input value
and saves the result in dst.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• src and dst must have the same size and number of channels.
OpenCV Reference
• cvMinS
• cv::min
hls::Mul
Synopsis
template<int ROWS, int COLS, int SRC1_T, int SRC2_T, int DST_T, typename P_T>
void hls::Mul (
hls::Mat<ROWS, COLS, SRC1_T>& src1,
hls::Mat<ROWS, COLS, SRC2_T>& src2,
hls::Mat<ROWS, COLS, DST_T>& dst,
P_T scale=1);
Parameters
Description
• Calculates the per-element product of two input images src1 and src2.
• Saves the result in image dst. An optional scaling factor scale can be used.
OpenCV Reference
• cvMul
• cv::multiply
hls::Not
Synopsis
template<int ROWS, int COLS, int SRC_T, int DST_T>
void hls::Not (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Performs per-element bitwise inversion of image src.
• Outputs the result as image dst.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• src and dst must have the same size and number of channels.
OpenCV Reference
• cvNot
• cv::bitwise_not
hls::PaintMask
Synopsis
template<int SRC_T,int MASK_T,int ROWS,int COLS>
void PaintMask(
hls::Mat<ROWS,COLS,SRC_T> &_src,
hls::Mat<ROWS,COLS,MASK_T> &_mask,
hls::Mat<ROWS,COLS,SRC_T> &_dst,
hls::Scalar<HLS_MAT_CN(SRC_T),HLS_TNAME(SRC_T)> _color);
Parameters
Description
• Each pixel of the destination image is either set to color (if mask is not zero) or the
corresponding pixel from the input image.
• src, mask, and dst must all be the same size.
hls::PyrDown
Synopsis
template<int SRC_T,int DST_T,int ROWS,int COLS, int DROWS, int DCOLS>
void PyrDown(
Mat<ROWS, COLS, SRC_T> &_src,
Mat<DROWS, DCOLS, DST_T> &_dst)
Parameters
Description
• Blurs an image by performing the Gaussian pyramid construction and then downsizes
the image by a factor of 2.
• First, this function convolves the source image with the following kernel:
[1, 4, 6, 4, 1,
4, 16, 24, 16, 4,
6, 24, 36, 24, 6,
4, 16, 24, 16, 4,
1, 4, 6, 4, 1] * 1/256
• Then, this function downsamples the image by rejecting even rows and columns.
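The 1/256 scale factor is consistent with a separable 5x5 kernel built from the weights [1, 4, 6, 4, 1], which is the kernel OpenCV documents for pyrDown. The sketch below assumes that construction; Kernel5, kernel_sum, and pyr_kernel are hypothetical names used only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Kernel5 = std::array<std::array<int, 5>, 5>;

// Sums all 25 weights of the kernel.
int kernel_sum(const Kernel5& k) {
    int total = 0;
    for (const auto& row : k)
        for (int w : row)
            total += w;
    return total;
}

// Builds the 5x5 kernel as the outer product of [1, 4, 6, 4, 1] with itself.
Kernel5 pyr_kernel() {
    const int w[5] = {1, 4, 6, 4, 1};
    Kernel5 k{};
    for (std::size_t r = 0; r < 5; ++r)
        for (std::size_t c = 0; c < 5; ++c)
            k[r][c] = w[r] * w[c];
    // The weights must sum to 256 so that the 1/256 scale gives unity gain.
    assert(kernel_sum(k) == 256);
    return k;
}
```

The same weights scaled by 1/64 instead of 1/256 give a gain of 4, which compensates for the zero rows and columns injected during upsampling.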
OpenCV Reference
• cvPyrDown
• cv::pyrDown
hls::PyrUp
Synopsis
template<int SRC_T,int DST_T,int ROWS,int COLS, int DROWS, int DCOLS>
void PyrUp(
Mat<ROWS, COLS, SRC_T> &_src,
Mat<DROWS, DCOLS, DST_T> &_dst)
Parameters
Description
• Upsamples the image by a factor of 2 and then blurs it.
• The function performs the upsampling step of the Gaussian pyramid construction,
though it can actually be used to construct the Laplacian pyramid.
• First, this function upsamples the source image by injecting even zero rows and
columns.
• Then, this function convolves the result with the following kernel (same as in pyrDown()
but multiplied by 4).
[1, 4, 6, 4, 1,
4, 16, 24, 16, 4,
6, 24, 36, 24, 6,
4, 16, 24, 16, 4,
1, 4, 6, 4, 1] * 1/64
OpenCV Reference
• cvPyrUp
• cv::pyrUp
hls::Range
Synopsis
template<int ROWS, int COLS, int SRC_T, int DST_T, typename P_T>
void hls::Range (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst,
P_T start,
P_T end);
Parameters
Description
• Sets all values in image src by the following rule and returns the result as image dst.
OpenCV Reference
• cvRange
hls::Remap
Synopsis
template <int WIN_ROW, int ROW, int COL, int SRC_T, int DST_T, int MAP1_T, int
MAP2_T>
void Remap(
hls::Mat<ROW, COL, SRC_T> &src,
hls::Mat<ROW, COL, DST_T> &dst,
hls::Mat<ROW, COL, MAP1_T> &map1,
hls::Mat<ROW, COL, MAP2_T> &map2);
Parameters
Description
• Remaps the source image src to the destination image dst according to the given
remapping. For each pixel in the output image, the coordinates of an input pixel are
specified by map1 and map2.
• This function is designed for streaming operation for cameras with small vertical
disparity. It contains an internal linebuffer to enable the remapping that contains
WIN_ROW rows of the input image. If the row r_i of an input pixel corresponding to an
output pixel at row r_o is not in the range [r_o-(WIN_ROW/2-1), r_o+(WIN_ROW/2-1)],
then the output is black.
• In addition, because of the architecture of the line buffer, the function uses fewer
resources if WIN_ROW and COL are powers of 2.
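The row-range condition described above can be written as a small predicate. This is a sketch in plain C++ using integer division for WIN_ROW/2; row_in_window is a hypothetical helper name and is not part of the library.

```cpp
#include <cassert>

// Returns true when input row r_i can be served from the line buffer while
// output row r_o is being produced, for a buffer holding win_row rows.
bool row_in_window(int r_i, int r_o, int win_row) {
    assert(win_row >= 2);
    const int reach = win_row / 2 - 1;  // rows available above and below r_o
    return r_i >= r_o - reach && r_i <= r_o + reach;
}
```

With win_row set to 8, the reach is 3 rows, so an output pixel at row 10 can read input rows 7 through 13; anything outside that range would produce a black output pixel.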
OpenCV Reference
• cvRemap
hls::Reduce
Synopsis
template<typename INTER_SUM_T, int ROWS, int COLS, int SRC_T, int DST_ROWS, int
DST_COLS, int DST_T>
void hls::Reduce (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<DST_ROWS, DST_COLS, DST_T>& dst,
int dim,
int reduce_op=HLS_REDUCE_SUM);
Parameters
Description
• Reduces 2D image src along dimension dim to a vector dst.
• Image data must be stored in src.
• The data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
OpenCV Reference
• cvReduce
• cv::reduce
hls::Resize
Synopsis
template<int SRC_T, int ROWS,int COLS,int DROWS,int DCOLS>
void Resize (
Mat<ROWS, COLS, SRC_T> &_src,
Mat<DROWS, DCOLS, SRC_T> &_dst);
Parameters
Description
• Resizes the input image to the size of the output image using bilinear interpolation.
OpenCV Reference
• cvResize
• cv::resize
hls::Set
Synopsis
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::Set (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(DST_T), _T> scl,
hls::Mat<ROWS, COLS, DST_T>& dst);
Parameters
Description
• Sets elements in image src to a given scalar value scl.
• Saves the result as image dst.
• If there is no input image, generates a dst image in which every element has the scalar value scl.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• src and scl must have the same number of channels.
• dst must have the same size and number of channels as src.
OpenCV Reference
• cvSet
hls::Scale
Synopsis
template<int ROWS, int COLS, int SRC_T, int DST_T, typename P_T>
void hls::Scale (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst,
P_T scale=1.0,
P_T shift=0.0);
Parameters
Description
• Converts an input image src with optional linear transformation.
• Saves the result as image dst.
OpenCV Reference
• cvScale
• cvConvertScale
hls::Sobel
Synopsis
template<int XORDER, int YORDER, int SIZE, typename BORDERMODE, int SRC_T, int DST_T,
int ROWS,int COLS,int DROWS,int DCOLS>
void Sobel (
Mat<ROWS, COLS, SRC_T>&_src,
Mat<DROWS, DCOLS, DST_T>&_dst)
template<int XORDER, int YORDER, int SIZE, int SRC_T, int DST_T, int ROWS,int
COLS,int DROWS,int DCOLS>
void Sobel (
Mat<ROWS, COLS, SRC_T>&_src,
Mat<DROWS, DCOLS, DST_T>&_dst)
Parameters
Description
• Computes a horizontal or vertical Sobel filter, returning an estimate of the horizontal or
vertical derivative, using a filter such as:
[-1, 0, 1,
-2, 0, 2,
-1, 0, 1]
OpenCV Reference
Usage:
hls::Sobel<1,0,3,BORDER_CONSTANT>(src,dst)
hls::Sobel<1,0,3>(src,dst)
• cv::Sobel
• cvSobel (see the note below in the discussion of border modes).
hls::Split
Synopsis
Parameters
Description
• Divides a multichannel image src into several single-channel images.
• Image data must be stored in image src.
• The image data of outputs must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of outputs.
• Output images must have the same size and be single-channel.
OpenCV Reference
• cvSplit
• cv::split
hls::SubRS
Synopsis
Without Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::SubRS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst);
With Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::SubRS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst,
hls::Mat<ROWS, COLS, HLS_8UC1>& mask,
hls::Mat<ROWS, COLS, DST_T>& dst_ref);
Parameters
Description
• Computes the differences between scalar value scl and elements of image src.
• Saves the result in dst.
• If computed with mask:
OpenCV Reference
• cvSubRS
• cv::subtract
hls::SubS
Synopsis
Without Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::SubS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst);
With Mask:
template<int ROWS, int COLS, int SRC_T, typename _T, int DST_T>
void hls::SubS (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Scalar<HLS_MAT_CN(SRC_T), _T>& scl,
hls::Mat<ROWS, COLS, DST_T>& dst,
hls::Mat<ROWS, COLS, HLS_8UC1>& mask,
hls::Mat<ROWS, COLS, DST_T>& dst_ref);
Parameters
Description
• Computes the differences between elements of image src and scalar value scl.
• Saves the result in dst.
OpenCV Reference
• cvSubS
• cv::subtract
hls::Sum
Synopsis
template<typename DST_T, int ROWS, int COLS, int SRC_T>
hls::Scalar<HLS_MAT_CN(SRC_T), DST_T> hls::Sum(
hls::Mat<ROWS, COLS, SRC_T>& src);
Parameters
Description
• Sums the elements of an image src.
• Returns the result as a scalar value.
• Image data must be stored in src.
• Invoking this function consumes the data in src.
OpenCV Reference
• cvSum
• cv::sum
hls::Threshold
Synopsis
template<int ROWS, int COLS, int SRC_T, int DST_T, typename P_T>
void hls::Threshold (
hls::Mat<ROWS, COLS, SRC_T>& src,
hls::Mat<ROWS, COLS, DST_T>& dst,
P_T thresh,
P_T maxval,
int thresh_type);
Parameters
Description
Applies a fixed-level threshold to each element of a single-channel image src and returns
the result as a single-channel image dst. The thresholding type applied by this function
is determined by thresh_type:
HLS_THRESH_BINARY
HLS_THRESH_BINARY_INV
HLS_THRESH_TRUNC
HLS_THRESH_TOZERO
HLS_THRESH_TOZERO_INV
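The five threshold types can be sketched per pixel as follows, assuming they follow the semantics OpenCV documents for cv::threshold. ThreshType and threshold_pixel are hypothetical names; the enumerators stand in for the HLS_THRESH_* constants listed above, whose numeric values are not reproduced here.

```cpp
#include <cassert>

// Hypothetical selector mirroring the HLS_THRESH_* names listed above.
enum class ThreshType { Binary, BinaryInv, Trunc, ToZero, ToZeroInv };

// Applies one threshold rule to a single pixel value.
int threshold_pixel(int src, int thresh, int maxval, ThreshType type) {
    const bool above = src > thresh;
    switch (type) {
        case ThreshType::Binary:    return above ? maxval : 0;
        case ThreshType::BinaryInv: return above ? 0 : maxval;
        case ThreshType::Trunc:     return above ? thresh : src;
        case ThreshType::ToZero:    return above ? src : 0;
        case ThreshType::ToZeroInv: return above ? 0 : src;
    }
    return src;  // not reached
}
```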
OpenCV Reference
• cvThreshold
• cv::threshold
hls::Zero
Synopsis
Parameters
Description
• Sets elements in image src to 0.
• Saves the result as image dst.
• If there is no input image, generates a dst image in which every element is 0.
• Image data must be stored in src.
• The image data of dst must be empty before invoking.
• Invoking this function consumes the data in src and fills the image data of dst.
• dst must have the same size and number of channels as src.
OpenCV Reference
• cvSetZero
• cvZero
matrix_multiply
Synopsis
template<
class TransposeFormA,
class TransposeFormB,
int RowsA,
int ColsA,
int RowsB,
int ColsB,
int RowsC,
int ColsC,
typename InputType,
typename OutputType>
void matrix_multiply(
const InputType A[RowsA][ColsA],
const InputType B[RowsB][ColsB],
OutputType C[RowsC][ColsC]);
Description
C=AB
Parameters
The function throws an assertion and fails to compile or synthesize if ColsA != RowsB.
The transpose requirements for A and B are resolved before the check is made.
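The dimension check can be illustrated with a compile-time assertion in plain C++. This sketch is not the library function: mat_mul_sketch is a hypothetical name, it omits the transpose options, and it uses a single element type for inputs and output.

```cpp
#include <cassert>

// Computes C = A * B for fixed-size arrays. Compilation fails if the inner
// dimensions do not agree.
template <int RowsA, int ColsA, int RowsB, int ColsB, typename T>
void mat_mul_sketch(const T (&A)[RowsA][ColsA],
                    const T (&B)[RowsB][ColsB],
                    T (&C)[RowsA][ColsB]) {
    static_assert(ColsA == RowsB, "ColsA must equal RowsB");
    for (int r = 0; r < RowsA; ++r) {
        for (int c = 0; c < ColsB; ++c) {
            T acc = T();
            for (int k = 0; k < ColsA; ++k) {
                acc += A[r][k] * B[k][c];
            }
            C[r][c] = acc;
        }
    }
}
```

Calling the sketch with a 2x3 matrix and a 3x2 matrix compiles and yields a 2x2 result, while a 2x3 by 2x3 call is rejected by the static_assert.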
Arguments
Return Values
• Not applicable (void function)
cholesky
Synopsis
template<
bool LowerTriangularL,
int RowsColsA,
typename InputType,
typename OutputType>
int cholesky(
const InputType A[RowsColsA][RowsColsA],
OutputType L[RowsColsA][RowsColsA])
Description
A=LL*
Parameters
Arguments
Return Values
• 0 = success
• 1 = failure. The function attempted to find the square root of a negative number, that
is, the input matrix A was not Hermitian/symmetric positive definite.
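The return convention can be illustrated with a real-valued, lower-triangular sketch in plain C++. This is not the library implementation. The name cholesky_sketch is hypothetical, complex types and the LowerTriangularL option are omitted, and rejecting a zero pivot is an assumption made here to avoid a division by zero.

```cpp
#include <cassert>
#include <cmath>

// Real-valued Cholesky sketch computing lower-triangular L with A = L * L^T.
// Returns 0 on success and 1 on failure, following the convention above.
template <int N>
int cholesky_sketch(const double A[N][N], double L[N][N]) {
    assert(A != nullptr && L != nullptr);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            L[i][j] = 0.0;

    for (int j = 0; j < N; ++j) {
        double diag = A[j][j];
        for (int k = 0; k < j; ++k) diag -= L[j][k] * L[j][k];
        // A negative value here means A is not positive definite.
        // Zero is also rejected in this sketch (assumption).
        if (diag <= 0.0) return 1;
        L[j][j] = std::sqrt(diag);

        for (int i = j + 1; i < N; ++i) {
            double s = A[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return 0;
}
```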
qrf
Synopsis
template<
bool TransposeQ,
int RowsA,
int ColsA,
typename InputType,
typename OutputType>
void qrf(
const InputType A[RowsA][ColsA],
OutputType Q[RowsA][RowsA],
OutputType R[RowsA][ColsA])
Description
A=QR
Parameters
Arguments
Return Values
• Not applicable (void function)
cholesky_inverse
Synopsis
template <
int RowsColsA,
typename InputType,
typename OutputType>
void cholesky_inverse(const InputType A[RowsColsA][RowsColsA],
OutputType InverseA[RowsColsA][RowsColsA],
int& cholesky_success)
Description
AA^{-1} = I
• Computes the inverse of symmetric positive definite input matrix A by the Cholesky
decomposition method, producing matrix InverseA.
Parameters
Arguments
Return Values
• Not applicable (void function)
• For floating point types, subnormal input values are not supported. If used, the
synthesized hardware will flush these to zero, and behavior will differ versus software
simulation.
qr_inverse
Synopsis
template <
int RowsColsA,
typename InputType,
typename OutputType>
void qr_inverse(const InputType A[RowsColsA][RowsColsA],
OutputType InverseA[RowsColsA][RowsColsA],
int& A_singular)
Description
AA^{-1} = I
Parameters
Arguments
Return Values
• Not applicable (void function)
svd
Synopsis
template<
int RowsA,
int ColsA,
typename InputType,
typename OutputType>
void svd(
const InputType A[RowsA][ColsA],
OutputType S[RowsA][ColsA],
OutputType U[RowsA][RowsA],
OutputType V[ColsA][ColsA])
Description
A=USV*
Parameters
• The function will throw an assertion and fail to compile, or synthesize, if RowsA !=
ColsA.
Arguments
Return Values
• Not applicable (void function)
Examples
The examples provide a basic test-bench and demonstrate how to parameterize and
instantiate each Linear Algebra function. One or more examples for each function are
available in the Vivado HLS examples directory:
<VIVADO_HLS>/examples/design/linear_algebra
awgn
Synopsis
template<
int OutputWidth>
class awgn {
public:
typedef ap_ufixed<8,4, AP_RND, AP_SAT> t_input_scale;
static const int LFSR_SECTION_WIDTH = 32;
static const int NUM_NOISE_GENS = 4;
static const int LFSR_WIDTH = LFSR_SECTION_WIDTH*NUM_NOISE_GENS;
awgn(ap_uint<LFSR_WIDTH> seed);
~awgn();
void operator()(t_input_scale &snr,
ap_int<OutputWidth> &noise);
Description
• Outputs Gaussian noise of a magnitude determined by input signal-to-noise ratio
(SNR). 0 dB for a BPSK signal results in a bit error rate (BER) of approximately 7%. This is
because for Eb/N0 = 0, Eb = 1, but N0 / 2 = noise power for a BPSK channel, resulting
in noise variance half that of the signal variance. For more information, see the AWGN
page (www.mathworks.com/help/comm/ug/awgn-channel.html) on the MathWorks
website.
• The SNR input represents signal-to-noise ratio in decibels in the range [0.0 to 16.0) in
steps of 1/16 of a decibel.
• If the noise value exceeds that which can be described by the configuration, it saturates
at the maximum positive or negative value appropriately.
• The function sums multiple individual noise generators, taking advantage of the
central limit theorem, to create the output value. By default, these generators are
pipelined and unrolled, because the expected target application is high-rate BER
testing, where a high clock rate, and therefore an Initiation Interval of 1, is
expected.
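For reference when checking BER results against the SNR input, the textbook bit error rate of BPSK over an AWGN channel is 0.5 * erfc(sqrt(Eb/N0)). The following sketch evaluates that formula in plain C++. It is a standard communications result and not part of the awgn class, and the function name bpsk_ber is hypothetical. At 0 dB it evaluates to roughly 0.079, in the same region as the figure quoted above.

```cpp
#include <cassert>
#include <cmath>

// Theoretical BPSK bit error rate over an AWGN channel.
// ebn0_db is Eb/N0 in decibels; it is converted to a linear ratio first.
double bpsk_ber(double ebn0_db) {
    assert(std::isfinite(ebn0_db));
    const double ebn0_linear = std::pow(10.0, ebn0_db / 10.0);
    return 0.5 * std::erfc(std::sqrt(ebn0_linear));
}
```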
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_ufixed
• Output
° ap_int
qam_mod
Synopsis
template<
class Constellation,
int OutputWidth>
class qam_mod {
public:
typedef ap_int<OutputWidth> t_outcomponent;
typedef std::complex< t_outcomponent > t_iq;
qam_mod();
~qam_mod();
void operator()(const typename Constellation::t_symbol &symbol,
t_iq &outputdata);
Description
• Converts an input symbol (one of four values for QPSK, one of sixteen for QAM16, or
one of sixty-four for QAM64) into an output value in complex form with I and Q
components each of OutputWidth bits.
• Where OutputWidth is greater than the minimum required to describe the I and Q
values, zeros are concatenated to the least significant bits until the output word is
OutputWidth wide. For example, symbol 0 of QPSK output to 8 bits is I = Q = 01100000.
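The padding rule amounts to left-aligning the minimal-width value in the output word. The following sketch shows this in plain C++. The function name pad_lsbs is hypothetical, and the 3-bit minimal pattern 011 used in the usage note is inferred from the example above, not stated by the library documentation.

```cpp
#include <cassert>
#include <cstdint>

// Left-align a value of min_width bits in a word of out_width bits by
// appending zeros at the least significant end.
uint32_t pad_lsbs(uint32_t value, int min_width, int out_width) {
    assert(min_width > 0 && min_width <= out_width && out_width <= 32);
    return value << (out_width - min_width);
}
```

Under that assumption, pad_lsbs(0x3, 3, 8) gives 0x60, which is the bit pattern 01100000.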
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_uint
• Output
qam_demod
Synopsis
template<
class Constellation,
int InputWidth>
class qam_demod {
public:
typedef ap_int<InputWidth> t_incomponent;
typedef std::complex< t_incomponent > t_in;
qam_demod();
~qam_demod();
void operator()(const t_in &inputData,
typename Constellation::t_symbol &symbol);
Description
• Accepts an input of complex type with I and Q components each of InputWidth,
matches this to the nearest point in the QAM type selected, and outputs the
corresponding symbol value of that point in the constellation.
• The output is a hard decision.
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_uint
nco
Synopsis
template<
int AccumWidth,
int PhaseAngleWidth,
int SuperSampleRate,
int OutputWidth,
class DualOutputCmpyImpl,
class SingleOutputCmpyImpl,
class SingleOutputNegCmpyImpl>
class nco {
public:
nco(const ap_uint<AccumWidth> InitPinc,
const ap_uint<AccumWidth> InitPoff);
~nco();
void operator()(
stream< ap_uint<AccumWidth> > &pinc,
stream< ap_uint<AccumWidth> > &poff,
stream< t_nco_output_data<SuperSampleRate,OutputWidth> >
&outputData
);
Description
• Performs a numerically controlled oscillator (NCO) function.
• Supports super sample rate (SSR), where the sample rate exceeds the clock rate, so
multiple parallel data samples must be output on each clock cycle.
• When in SSR mode, a change to the phase increment (pinc) prompts an internal interrupt.
This does not disturb the output samples unless two or more changes to pinc occur
fewer than N cycles apart, where N is SuperSampleRate/2 + 1.
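The roles of the phase increment and phase offset can be illustrated with a generic phase accumulator in plain C++. This is a sketch of the usual NCO structure and not the library implementation. The struct PhaseAccumulator is hypothetical, the sine and cosine lookup and the SSR handling are omitted, and the order in which the offset is applied and the accumulator is updated is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Generic phase accumulator: the phase wraps modulo 2^accum_width.
struct PhaseAccumulator {
    uint64_t mask;
    uint64_t acc;

    PhaseAccumulator(int accum_width, uint64_t init_phase)
        : mask((1ULL << accum_width) - 1), acc(init_phase & mask) {
        assert(accum_width > 0 && accum_width < 64);
    }

    // Returns the phase for this sample (accumulator plus offset), then
    // advances the accumulator by the phase increment.
    uint64_t step(uint64_t pinc, uint64_t poff) {
        const uint64_t phase = (acc + poff) & mask;
        acc = (acc + pinc) & mask;
        return phase;
    }
};
```

With an 8-bit accumulator and a phase increment of 64, the phase sequence is 0, 64, 128, 192 and then wraps back to 0, which corresponds to one output cycle every four samples.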
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_uint
• Output
convolution_encoder
Synopsis
template<
int OutputWidth,
bool Punctured,
bool DualOutput,
int InputRate,
int OutputRate,
int ConstraintLength,
int PunctureCode0,
int PunctureCode1,
int ConvolutionCode0,
int ConvolutionCode1,
int ConvolutionCode2,
int ConvolutionCode3,
int ConvolutionCode4,
int ConvolutionCode5,
int ConvolutionCode6>
class convolution_encoder {
public:
convolution_encoder();
~convolution_encoder();
void operator()(stream< ap_uint<1> > &inputData,
stream< ap_uint<OutputWidth> > &outputData);
Description
• Performs convolutional encoding of an input data stream based on user-defined
convolution codes and constraint length
• Optional puncturing of data
• Optional dual channel output
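The basic encoding operation can be illustrated with a rate 1/2 encoder in plain C++. This is a sketch only: puncturing and dual output are omitted, the function names are hypothetical, and the bit ordering of the shift register relative to the convolution codes is an assumption that may differ from the library.

```cpp
#include <cassert>
#include <vector>

// Parity (XOR of all bits) of v.
static int parity(unsigned v) {
    int p = 0;
    while (v != 0) {
        p ^= 1;
        v &= v - 1;  // clear the lowest set bit
    }
    return p;
}

// Rate 1/2 convolutional encoder sketch. The newest input bit enters at the
// most significant position of the shift register (assumed convention).
std::vector<int> conv_encode(const std::vector<int>& bits,
                             int constraint_length,
                             unsigned code0, unsigned code1) {
    assert(constraint_length >= 2 && constraint_length <= 16);
    unsigned reg = 0;
    std::vector<int> out;
    for (int b : bits) {
        reg = (reg >> 1) |
              (static_cast<unsigned>(b & 1) << (constraint_length - 1));
        out.push_back(parity(reg & code0));
        out.push_back(parity(reg & code1));
    }
    return out;
}
```

With a constraint length of 3 and the codes 7 and 5 (octal), the input bits 1, 0, 1, 1 encode to the pairs 11, 10, 00, 01.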
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
viterbi_decoder
Synopsis
template<
int ConstraintLength,
int TracebackLength,
bool HasEraseInput,
bool SoftData,
int InputDataWidth,
int SoftDataFormat,
int OutputRate,
int ConvolutionCode0,
int ConvolutionCode1,
int ConvolutionCode2,
int ConvolutionCode3,
int ConvolutionCode4,
int ConvolutionCode5,
int ConvolutionCode6>
class viterbi_decoder {
public:
viterbi_decoder();
~viterbi_decoder();
void operator()(stream<
viterbi_decoder_input<OutputRate,InputDataWidth,HasEraseInput> > &inputData,
stream< ap_uint<1> > &outputData)
Description
• Performs Viterbi decoding of a convolutionally encoded data stream
• Supports hard or soft data
• Supports offset binary and signed magnitude soft data formats
• Supports erasures (puncturing)
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_uint
• Output
° ap_uint
atan2
Synopsis
template <
int PhaseFormat,
int InputWidth,
int OutputWidth,
int RoundMode>
void atan2(const typename atan2_input<InputWidth>::cartesian &x,
typename atan2_output<OutputWidth>::phase &atanX)
Description
• CORDIC-based fixed-point implementation of two-argument arctangent
• Configurable input and output widths
• Configurable phase format
• Configurable rounding mode
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
• Output
° ap_fixed
sqrt
Synopsis
template <
int DataFormat,
int InputWidth,
int OutputWidth,
int RoundMode>
void sqrt(const typename sqrt_input<InputWidth, DataFormat>::in &x,
typename sqrt_output<OutputWidth, DataFormat>::out &sqrtX)
Description
• CORDIC-based fixed-point implementation of square root
• Unsigned fractional or unsigned integer data formats supported
• Configurable rounding mode
Parameters
Note: Parameters are checked during C simulation to verify that the template parameter
configuration is legal.
Arguments
Return Values
• Not applicable (void function)
° ap_ufixed
° ap_uint
• Output
° ap_ufixed
° ap_uint
cmpy
Synopsis
• Scalar Interface
template <
class Architecture,
int W1, int I1, ap_q_mode Q1, ap_o_mode O1, int N1,
int W2, int I2, ap_q_mode Q2, ap_o_mode O2, int N2>
void cmpy (const ap_fixed<W1, I1, Q1, O1, N1> &ar,
const ap_fixed<W1, I1, Q1, O1, N1> &ai,
const ap_fixed<W1, I1, Q1, O1, N1> &br,
const ap_fixed<W1, I1, Q1, O1, N1> &bi,
ap_fixed<W2, I2, Q2, O2, N2> &pr,
ap_fixed<W2, I2, Q2, O2, N2> &pi);
• std::complex interface
template <
class Architecture,
int W1, int I1, ap_q_mode Q1, ap_o_mode O1, int N1,
int W2, int I2, ap_q_mode Q2, ap_o_mode O2, int N2>
void cmpy (const std::complex< ap_fixed<W1, I1, Q1, O1, N1> > &a,
const std::complex< ap_fixed<W1, I1, Q1, O1, N1> > &b,
std::complex< ap_fixed<W2, I2, Q2, O2, N2> > &p);
Description
• Performs fixed-point complex multiplication
• Implements either three-multiplier or four-multiplier structure
• Supports scalar or std::complex interfaces
Parameters
Arguments
Return Values
• Not applicable (void function)
To open the Vivado HLS design examples from the Welcome Page, click Open Example
Project. In the Examples wizard, select a design from the Design Examples > dsp folder.
Note: The Welcome Page appears when you invoke the Vivado HLS GUI. You can access it at any
time by selecting Help > Welcome.
You can also open the design examples directly from the Vivado Design Suite installation
area: Vivado_HLS\2015.x\examples\design\dsp.
Note: Some of the design examples also include a directives.tcl file, which provides additional
Tcl commands for applying optimization and implementation directives.
• The Arbitrary Precision (AP) types provided for C language designs by Vivado HLS.
• The associated functions for C int#w types.
When compiling software models that use these types, it may be necessary to specify the
location of the Vivado HLS header files, for example, by adding the
“-I/<HLS_HOME>/include” option for gcc compilation.
• int#W
• uint#W
where
User-defined types may be created with the C/C++ ‘typedef’ statement as shown in the
following examples:
#include "ap_cint.h"
uint15 a = 0;
uint52 b = 1234567890U;
uint52 c = 012345670UL;
uint96 d = 0x123456789ABCDEFULL;
For bit-widths greater than 64-bit, the following functions can be used.
apint_string2bits()
This section also discusses use of the related functions:
• apint_string2bits_bin()
• apint_string2bits_oct()
• apint_string2bits_hex()
These functions convert a constant character string of digits, specified within the
constraints of the radix (decimal, binary, octal, hexadecimal), into the corresponding value
with the given bit-width N. For any radix, the number can be preceded with the minus sign
to indicate a negative value.
This is used to construct integer constants with values that are larger than those already
permitted by the C language. While smaller values also work, they are easier to specify with
existing C language constant value constructs.
#include <stdio.h>
#include "ap_cint.h"
int128 a;
apint_vstring2bits()
This function converts a character string of digits, specified within the constraints of the
hexadecimal radix, into the corresponding value with the given bit-width N. The number can
be preceded with the minus sign to indicate a negative value.
This is used to construct integer constants with values that are larger than those already
permitted by the C language. The function is typically used in a test bench to read
information from a file.
123456789ABCDEF
-123456789ABCDEF
-5
The function, used in the test bench, supplies the following values:
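#include <stdio.h>
#include "ap_cint.h"
typedef int128 data_t;
/* Function under test: returns the input value plus one. */
data_t test (
data_t a
) {
return a+1;
}
int main () {
FILE *fp;
char vstring[33];
fp = fopen("test.dat","r");
if (fp == NULL) {
return 1;
}
/* Read one hexadecimal string per line, at most 32 digits. */
while (fscanf(fp,"%32s",vstring)==1) {
test(apint_vstring2bits_hex(vstring,128));
printf("\n");
}
fclose(fp);
return 0;
}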
#include "ap_cint.h"
uint164 c = 0x123456789ABCDEFULL;
apint_print()
This is used to print integers with values that are larger than those already permitted by the
C language. This function prints a value to stdout, interpreted according to the radix (2, 8,
10, 16).
#include <stdio.h>
#include "ap_cint.h"
apint_print(tmp,2);
//00000000000000000000000000000000000000000000000000000000000101100
apint_print(tmp,8); // 0000000000000000000054
apint_print(tmp,10); // 44
apint_print(tmp,16); // 0000000000000002C
apint_fprint()
This is used to print integers with values that are bigger than those already permitted by the
C language. This function prints a value to a file, interpreted according to the radix (2, 8, 10,
16).
Explicit casting of the source variable might be necessary to ensure expected behavior on
assignment.
There is no special handling of the sign information during truncation, which may lead to
unexpected behavior. Explicit casting may help avoid this unexpected behavior.
Standard binary integer arithmetic operators are overloaded to provide arbitrary precision
arithmetic. All of the following operators take either two operands of [u]int#W or one
[u]int#W type and one C/C++ fundamental integer data type, for example, char, short,
int.
The width and signedness of the resulting value is determined by the width and signedness
of the operands, before sign-extension, zero-padding or truncation are applied based on
the width of the destination variable (or expression). Details of the return value are
described for each operator.
When expressions contain a mix of ap_[u]int and C/C++ fundamental integer types, the
C++ types assume the following widths:
• char: 8-bits
• short: 16-bits
• int: 32-bits
• long: 32-bits
• long long: 64-bits
Addition
[u]int#W::RType [u]int#W::operator + ([u]int#W op)
Produces the sum of two ap_[u]int or one ap_[u]int and a C/C++ integer type.
The sum is treated as signed if either (or both) of the operands is of a signed type.
Subtraction
[u]int#W::RType [u]int#W::operator - ([u]int#W op)
° Two bits if and only if the wider is unsigned and the narrower signed
• This applies before assignment, at which point it is sign-extended, zero-padded, or
truncated based on the width of the destination variable.
• The difference is treated as signed regardless of the signedness of the operands.
Multiplication
[u]int#W::RType [u]int#W::operator * ([u]int#W op)
Division
[u]int#W::RType [u]int#W::operator / ([u]int#W op)
Modulus
[u]int#W::RType [u]int#W::operator % ([u]int#W op)
• Returns the modulus, or remainder of integer division, for two integer values.
• The width of the modulus is the minimum of the widths of the operands, if they are
both of the same signedness; if the divisor is an unsigned type and the dividend is
signed then the width is that of the divisor plus one.
• The quotient is treated as having the same signedness as the dividend.
Note: Vivado HLS synthesis of the modulus (%) operator leads to instantiation of
appropriately parameterized Xilinx LogiCORE divider cores in the generated RTL.
Sign-extension (or zero-padding) may occur, based on the signedness of the expression,
not the destination variable.
Bitwise OR
[u]int#W::RType [u]int#W::operator | ([u]int#W op)
Bitwise AND
[u]int#W::RType [u]int#W::operator & ([u]int#W op)
Bitwise XOR
[u]int#W::RType [u]int#W::operator ^ ([u]int#W op)
Shift Operators
Each shift operator comes in two versions, one for unsigned right-hand side (RHS) operands
and one for signed RHS.
A negative value supplied to the signed RHS versions reverses the shift operations
direction, that is, a shift by the absolute value of the RHS operand in the opposite direction
occurs.
The shift operators return a value with the same width as the left-hand side (LHS) operand.
As with C/C++, if the LHS operand of a shift-right is a signed type, the sign bit is copied into
the most significant bit positions, maintaining the sign of the LHS operand.
CAUTION! When assigning the result of a shift-left operator to a wider destination variable, some (or
all) information may be lost. Xilinx recommends that you explicitly cast the shift expression to the
destination type to avoid unexpected behavior.
• *=
• /=
• %=
• +=
• -=
• <<=
• >>=
• &=
• ^=
• |=
The RHS expression is first evaluated then supplied as the RHS operand to the base
operator. The result is assigned back to the LHS variable. The expression sizing, signedness,
and potential sign-extension or truncation rules apply as discussed above for the relevant
operations.
Relational Operators
Vivado HLS supports all relational operators. They return a Boolean value based on the
result of the comparison. Variables of ap_[u]int types may be compared to C/C++
fundamental integer types with these operators.
Equality
bool [u]int#W::operator == ([u]int#W op)
Inequality
bool [u]int#W::operator != ([u]int#W op)
Less than
bool [u]int#W::operator < ([u]int#W op)
Greater than
bool [u]int#W::operator > ([u]int#W op)
Bit Manipulation
The following methods are included to facilitate common bit-level operations on the value
stored in ap_[u]int type variables.
Length
apint_bitwidthof()
int apint_bitwidthof(type_or_value)
Returns an integer value that provides the number of bits in an arbitrary precision integer
value. It can be used with a type or a value.
Var1= -1;
Res1 = apint_bitwidthof(Var1); // Res1 is assigned 5
Res1 = apint_bitwidthof(int7); // Res1 is assigned 7
Concatenation
apint_concatenate()
Concatenates two [u]int#W variables. The width of the returned value is the sum of the
widths of the operands.
The High and Low arguments are placed in the higher and lower order bits of the result
respectively.
RECOMMENDED: To avoid unexpected results, explicitly cast C native types (including integer literals)
to an appropriate [u]int#W type before concatenating.
Bit Selection
apint_get_bit()
Selects one bit from an arbitrary precision integer value and returns it.
The source must be an [u]int#W type. The index argument must be an int value. It
specifies the index of the bit to select. The least significant bit has index 0. The highest
permissible index is one less than the bit-width of this [u]int#W.
apint_set_bit()
• Sets the specified bit, index, of the [u]int#W instance source to the value specified
(zero or one).
Range Selection
apint_get_range()
• Returns the value represented by the range of bits specified by the arguments.
• The High argument specifies the most significant bit (MSB) position of the range.
• The Low argument specifies the least significant bit (LSB) position of the range.
• The LSB of the source variable is in position 0. If the High argument has a value less
than Low, the bits are returned in reverse order.
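The range selection rule, including the reversed case, can be sketched in plain C++ on a 64-bit value. The function get_range_sketch is hypothetical and only models the behavior described above. It is not the apint_get_range() implementation.

```cpp
#include <cassert>
#include <cstdint>

// Extract bits high..low of src. If high < low, the same bits are returned
// in reverse order, so src[high] still lands in the MSB of the result.
uint64_t get_range_sketch(uint64_t src, int high, int low) {
    assert(high >= 0 && high < 64 && low >= 0 && low < 64);
    const int width = (high >= low ? high - low : low - high) + 1;
    uint64_t result = 0;
    for (int i = 0; i < width; ++i) {
        // Normal order walks upward from low; reversed order walks downward.
        const int src_bit = (high >= low) ? low + i : low - i;
        result |= ((src >> src_bit) & 1ULL) << i;
    }
    return result;
}
```

For the source value 0xD6 (binary 11010110), selecting bits 5 down to 2 gives binary 0101, and swapping the arguments gives the reversed pattern 1010.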
apint_set_range()
• Sets the bits of the source between High and Low to the value of the part.
Bit Reduction
AND Reduce
apint_and_reduce()
Var1= -1;
Res1 = apint_and_reduce(Var1); // Res1 is assigned 1
Var1= 1;
Res1 = apint_and_reduce(Var1); // Res1 is assigned 0
OR Reduce
apint_or_reduce()
Var1= 1;
Res1 = apint_or_reduce(Var1); // Res1 is assigned 1
Var1= 0;
Res1 = apint_or_reduce(Var1); // Res1 is assigned 0
XOR Reduce
apint_xor_reduce()
Var1= 0;
Res1 = apint_xor_reduce(Var1); // Res1 is assigned 0
Var1= 1;
Res1 = apint_xor_reduce(Var1); // Res1 is assigned 1
NAND Reduce
apint_nand_reduce()
Var1= 1;
Res1 = apint_nand_reduce(Var1); // Res1 is assigned 1
Var1= -1;
Res1 = apint_nand_reduce(Var1); // Res1 is assigned 0
NOR Reduce
apint_nor_reduce()
Var1= 0;
Res1 = apint_nor_reduce(Var1); // Res1 is assigned 1
Var1= 1;
Res1 = apint_nor_reduce(Var1); // Res1 is assigned 0
XNOR Reduce
apint_xnor_reduce()
Var1= 0;
Res1 = apint_xnor_reduce(Var1); // Res1 is assigned 1
Var1= 1;
Res1 = apint_xnor_reduce(Var1); // Res1 is assigned 0
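The six reductions can be modelled in plain C++ on the low bits of a 64-bit value. The sketch below is an illustration of the logic only. The function names are hypothetical stand-ins and take an explicit width argument, which the apint_*_reduce() functions do not need because the type carries the width.

```cpp
#include <cassert>
#include <cstdint>

// Mask selecting the low `width` bits.
static uint64_t low_mask(int width) {
    assert(width > 0 && width <= 64);
    return (width == 64) ? ~0ULL : ((1ULL << width) - 1);
}

// AND of all bits: true only if every bit in the width is 1.
bool and_reduce(uint64_t v, int width) {
    const uint64_t m = low_mask(width);
    return (v & m) == m;
}

// OR of all bits: true if any bit in the width is 1.
bool or_reduce(uint64_t v, int width) {
    return (v & low_mask(width)) != 0;
}

// XOR of all bits: true if an odd number of bits are 1.
bool xor_reduce(uint64_t v, int width) {
    uint64_t bits = v & low_mask(width);
    bool odd = false;
    while (bits != 0) {
        odd = !odd;
        bits &= bits - 1;  // clear the lowest set bit
    }
    return odd;
}

// The negated forms are the complements of the three above.
bool nand_reduce(uint64_t v, int width) { return !and_reduce(v, width); }
bool nor_reduce(uint64_t v, int width)  { return !or_reduce(v, width); }
bool xnor_reduce(uint64_t v, int width) { return !xor_reduce(v, width); }
```

For a 5-bit variable, the value -1 corresponds to the bit pattern 11111 (0x1F), which reproduces the results listed in the examples above.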
This class provides all arithmetic, bitwise, logical and relational operators allowed for native
C integer types. In addition, this class provides methods to handle some useful hardware
operations, such as allowing initialization and conversion of variables of widths greater than
64 bits. Details for all operators and class methods are discussed below.
When compiling software models that use these classes, it may be necessary to specify the
location of the Vivado HLS header files, for example by adding the
-I/<HLS_HOME>/include option for g++ compilation.
• ap_int<int_W> (signed)
• ap_uint<int_W> (unsigned)
The template parameter int_W specifies the total width of the variable being declared.
User-defined types may be created with the C/C++ typedef statement as shown in the
following examples:
The default maximum width allowed is 1024 bits. This default may be overridden by
defining the macro AP_INT_MAX_W with a positive integer value less than or equal to
32768 before inclusion of the ap_int.h header file.
CAUTION! Setting the value of AP_INT_MAX_W too high may cause slow software compile and run
times.
#define AP_INT_MAX_W 4096 // Must be defined before ap_int.h is included
#include "ap_int.h"
ap_int<4096> very_wide_var;
To allow assignment of values wider than 64-bits, the ap_[u]fixed<> classes provide
constructors that allow initialization from a string of arbitrary length (less than or equal to
the width of the variable).
The following are examples of initialization and assignment, including for values greater
than 64 bits:
A compilation error occurs if the string literal contains any characters that are invalid as
digits for the radix specified.
The radix of the number encoded in the string can also be inferred by the constructor, when
it is prefixed with a zero (0) followed by one of the following characters: “b”, “o” or “x”. The
prefixes “0b”, “0o” and “0x” correspond to binary, octal and hexadecimal formats
respectively.
If the bit-width is greater than 53-bits, the ap_[u]fixed value must be initialized with a
string, for example:
ap_ufixed<72,10> Val("2460508560057040035.375");
The stream insertion operator (<<) is overloaded to correctly output the full range of values
possible for any given ap_[u]fixed variable. The following stream manipulators are also
supported:
• dec (decimal)
• hex (hexadecimal)
• oct (octal)
#include <iostream.h>
// Alternative: #include <iostream>
ap_ufixed<72> Val("10fedcba9876543210");
1. Convert the value to a C++ std::string using the ap_[u]fixed classes method
to_string().
2. Convert the result to a null-terminated C character string using the std::string class
method c_str().
• 2 (binary)
• 8 (octal)
• 10 (decimal)
• 16 (hexadecimal) (default)
ap_int<72> Val("80fedcba9876543210");
Explicit casting of the source variable may be necessary to ensure expected behavior on
assignment. See the following example:
ap_uint<10> Result;
There is no special handling of the sign information during truncation. This may lead to
unexpected behavior. Explicit casting may help avoid this unexpected behavior.
int main() {
if (nonzero((ap_uint<65>)1 << 64)) {
return 0;
}
printf("FAIL\n");
return 1;
}
To convert wide ap_[u]int types to built-in integers, use the explicit conversion functions
included with the ap_[u]int types:
• to_int()
• to_long()
• to_bool()
In general, any valid operation that can be done on a native C/C++ integer data type is
supported using operator overloading for ap_[u]int types.
In addition to these overloaded operators, some class specific operators and methods are
included to ease bit-level operations.
For example:
• char
• short
• int
The width and signedness of the resulting value is determined by the width and signedness
of the operands, before sign-extension, zero-padding or truncation are applied based on
the width of the destination variable (or expression). Details of the return value are
described for each operator.
When expressions contain a mix of ap_[u]int and C/C++ fundamental integer types, the
C++ types assume the following widths:
• char (8-bits)
• short (16-bits)
• int (32-bits)
• long (32-bits)
• long long (64-bits)
Addition
ap_(u)int::RType ap_(u)int::operator + (ap_(u)int op)
• Two ap_[u]int, or
• One ap_[u]int and a C/C++ integer type
The sum is treated as signed if either (or both) of the operands is of a signed type.
Subtraction
ap_(u)int::RType ap_(u)int::operator - (ap_(u)int op)
Multiplication
ap_(u)int::RType ap_(u)int::operator * (ap_(u)int op)
The width of the product is the sum of the widths of the operands.
The product is treated as a signed type if either of the operands is of a signed type.
Division
ap_(u)int::RType ap_(u)int::operator / (ap_(u)int op)
The width of the quotient is the width of the dividend if the divisor is an unsigned type.
Otherwise, it is the width of the dividend plus one.
The quotient is treated as a signed type if either of the operands is of a signed type.
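The width and signedness rules stated above for multiplication and division can be written down as small compile-time helpers in plain C++. This is a sketch of the rules as described in the text. The function names are hypothetical and are not part of the ap_int.h header.

```cpp
#include <cassert>

// Width of a product: the sum of the operand widths.
constexpr int product_width(int w_lhs, int w_rhs) {
    return w_lhs + w_rhs;
}

// Width of a quotient: the dividend width, plus one if the divisor is signed.
constexpr int quotient_width(int w_dividend, bool divisor_signed) {
    return divisor_signed ? w_dividend + 1 : w_dividend;
}

// A product or quotient is signed if either operand is signed.
constexpr bool result_signed(bool lhs_signed, bool rhs_signed) {
    return lhs_signed || rhs_signed;
}

// Example: a 42-bit unsigned value combined with a 23-bit signed value.
static_assert(product_width(42, 23) == 65, "42 + 23 bits");
static_assert(quotient_width(42, true) == 43, "signed divisor adds one bit");
```

Assigning such a result to a narrower destination truncates it, so it can help to size the destination from these widths.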
Modulus
ap_(u)int::RType ap_(u)int::operator % (ap_(u)int op)
Returns the modulus, or remainder of integer division, for two integer values.
The width of the modulus is the minimum of the widths of the operands, if they are both of
the same signedness.
If the divisor is an unsigned type and the dividend is signed, then the width is that of the
divisor plus one.
IMPORTANT: Vivado HLS synthesis of the modulus (%) operator leads to instantiation of
appropriately parameterized Xilinx LogiCORE divider cores in the generated RTL.
ap_uint<71> Rslt;
ap_uint<42> Val1 = 5;
ap_int<23> Val2 = -8;
Sign-extension (or zero-padding) may occur, based on the signedness of the expression,
not the destination variable.
Bitwise OR
ap_(u)int::RType ap_(u)int::operator | (ap_(u)int op)
Bitwise AND
ap_(u)int::RType ap_(u)int::operator & (ap_(u)int op)
Bitwise XOR
ap_(u)int::RType ap_(u)int::operator ^ (ap_(u)int op)
Unary Operators
Addition
ap_(u)int ap_(u)int::operator + ()
Subtraction
ap_(u)int::RType ap_(u)int::operator - ()
• The negated value of the operand with the same width if it is a signed type, or
• Its width plus one if it is unsigned.
Bitwise Inverse
ap_(u)int::RType ap_(u)int::operator ~ ()
Returns the bitwise-NOT of the operand with the same width and signedness.
Logical Invert
bool ap_(u)int::operator ! ()
Returns a Boolean false value if and only if the operand is not equal to zero (0).
Ternary Operators
When you use the ternary operator with the standard C int type, you must explicitly cast
from one type to the other to ensure that both results have the same type. For example:
Shift Operators
Each shift operator comes in two versions: one for unsigned right-hand side (RHS) operands
and one for signed RHS operands.
A negative value supplied to the signed RHS versions reverses the shift operations
direction. That is, a shift by the absolute value of the RHS operand in the opposite direction
occurs.
The shift operators return a value with the same width as the left-hand side (LHS) operand.
As with C/C++, if the LHS operand of a shift-right is a signed type, the sign bit is copied into
the most significant bit positions, maintaining the sign of the LHS operand.
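The behavior of the signed RHS version can be modelled in plain C++ on a 32-bit signed value. The sketch below is an illustration only. The function shift_left_signed_rhs is hypothetical, and the fixed 32-bit width stands in for the width of the LHS operand.

```cpp
#include <cassert>
#include <cstdint>

// Shift-left with a signed shift amount: a negative amount shifts right by
// its absolute value. The result keeps the 32-bit width of the LHS.
int32_t shift_left_signed_rhs(int32_t lhs, int rhs) {
    assert(rhs > -32 && rhs < 32);
    const uint32_t bits = static_cast<uint32_t>(lhs);
    if (rhs >= 0) {
        // Bits shifted past the MSB are discarded.
        return static_cast<int32_t>(bits << rhs);
    }
    const int n = -rhs;  // n is in the range 1..31 here
    uint32_t shifted = bits >> n;
    if (lhs < 0) {
        // Copy the sign bit into the vacated most significant positions.
        shifted |= ~(0xFFFFFFFFu >> n);
    }
    return static_cast<int32_t>(shifted);
}
```

For example, shifting -16 left by -2 is treated as a right shift by 2 and gives -4, with the sign preserved.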
CAUTION! When assigning the result of a shift-left operator to a wider destination variable, some or all
information may be lost. Xilinx recommends that you explicitly cast the shift expression to the
destination type to avoid unexpected behavior.
ap_uint<13> Rslt;
• *=
• /=
• %=
• +=
• -=
• <<=
• >>=
• &=
• ^=
• |=
The RHS expression is first evaluated then supplied as the RHS operand to the base
operator, the result of which is assigned back to the LHS variable. The expression sizing,
signedness, and potential sign-extension or truncation rules apply as discussed above for
the relevant operations.
// Val1 = ap_uint<10>(ap_int<11>(Val1) +
// ap_int<11>((ap_int<6>(Val2) -
// ap_int<6>(Val3))));
Pre-Increment
ap_(u)int& ap_(u)int::operator ++ ()
Post-Increment
const ap_(u)int ap_(u)int::operator ++ (int)
Returns the value of the operand before assignment of the incremented value to the
operand variable.
Pre-Decrement
ap_(u)int& ap_(u)int::operator -- ()
Returns the decremented value of, as well as assigning the decremented value to, the
operand.
Post-Decrement
const ap_(u)int ap_(u)int::operator -- (int)
Returns the value of the operand before assignment of the decremented value to the
operand variable.
Relational Operators
Vivado HLS supports all relational operators. They return a Boolean value based on the
result of the comparison. You can compare variables of ap_[u]int types to C/C++
fundamental integer types with these operators.
Equality
bool ap_(u)int::operator == (ap_(u)int op)
Inequality
bool ap_(u)int::operator != (ap_(u)int op)
Less than
bool ap_(u)int::operator < (ap_(u)int op)
Greater than
bool ap_(u)int::operator > (ap_(u)int op)
Bit-Level Operations
The following methods facilitate common bit-level operations on the value stored in
ap_[u]int type variables.
Length
int ap_(u)int::length ()
Returns an integer value providing the total number of bits in the ap_[u]int variable.
Concatenation
ap_concat_ref ap_(u)int::concat (ap_(u)int low)
ap_concat_ref ap_(u)int::operator , (ap_(u)int high, ap_(u)int low)
Concatenates two ap_[u]int variables. The width of the returned value is the sum of the
widths of the operands.
The High and Low arguments are placed in the higher and lower order bits of the result
respectively; the concat() method places the argument in the lower order bits.
When using the overloaded comma operator, the parentheses are required. The comma
operator version may also appear on the LHS of assignment.
RECOMMENDED: To avoid unexpected results, explicitly cast C/C++ native types (including integer
literals) to an appropriate ap_[u]int type before concatenating.
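The placement of the High and Low operands can be modelled in plain C++ on 64-bit values. The function concat_sketch is hypothetical. Unlike the ap_[u]int version, it needs the width of the low operand passed explicitly, which also shows why an operand of unexpected width (such as an uncast integer literal) changes the result.

```cpp
#include <cassert>
#include <cstdint>

// Place `high` above the low_width bits of `low`.
uint64_t concat_sketch(uint64_t high, uint64_t low, int low_width) {
    assert(low_width > 0 && low_width < 64);
    const uint64_t low_mask = (1ULL << low_width) - 1;
    return (high << low_width) | (low & low_mask);
}
```

Concatenating 0xA with the 4-bit value 0x5 gives 0xA5. Treating the same low operand as 8 bits wide gives 0xA05 instead.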
ap_uint<10> Rslt;
Bit Selection
ap_bit_ref ap_(u)int::operator [] (int bit)
Selects one bit from an arbitrary precision integer value and returns it.
The returned value is a reference value that can set or clear the corresponding bit in this
ap_[u]int.
The bit argument must be an int value. It specifies the index of the bit to select. The least
significant bit has index 0. The highest permissible index is one less than the bit-width of
this ap_[u]int.
The result type ap_bit_ref represents the reference to one bit of this ap_[u]int
instance specified by bit.
Range Selection
ap_range_ref ap_(u)int::range (unsigned Hi, unsigned Lo)
ap_range_ref ap_(u)int::operator () (unsigned Hi, unsigned Lo)
Returns the value represented by the range of bits specified by the arguments.
The Hi argument specifies the most significant bit (MSB) position of the range, and Lo
specifies the least significant bit (LSB).
The LSB of the source variable is in position 0. If the Hi argument has a value less than Lo,
the bits are returned in reverse order.
ap_uint<4> Rslt;
AND reduce
bool ap_(u)int::and_reduce ()
OR reduce
bool ap_(u)int::or_reduce ()
XOR reduce
bool ap_(u)int::xor_reduce ()
NAND reduce
bool ap_(u)int::nand_reduce ()
NOR reduce
bool ap_(u)int::nor_reduce ()
XNOR reduce
bool ap_(u)int::xnor_reduce ()
Bit Reverse
void ap_(u)int::reverse ()
Sets the specified bit of the ap_(u)int instance to the value of integer V.
Sets the specified bit of the ap_(u)int instance to the value 1 (one).
Sets the specified bit of the ap_(u)int instance to the value 0 (zero).
Invert Bit
void ap_(u)int::invert(unsigned i)
Inverts the bit specified in the function argument of the ap_(u)int instance. The specified
bit becomes 0 if its original value is 1 and vice versa.
Rotate Right
void ap_(u)int::rrotate(unsigned n)
Rotate Left
void ap_(u)int::lrotate(unsigned n)
Bitwise NOT
void ap_(u)int::b_not()
Test Sign
bool ap_int::sign()
• Returns native C/C++ (32-bit on most systems) integers with the value contained in the
ap_[u]int.
• Truncation occurs if the value is greater than can be represented by an [unsigned]
int.
• Returns native C/C++ 64-bit integers with the value contained in the ap_[u]int.
• Truncation occurs if the value is greater than can be represented by a 64-bit
[unsigned] integer.
To C/C++ “double”
double ap_(u)int::to_double ()
• Returns a native C/C++ double 64-bit floating point representation of the value
contained in the ap_[u]int.
• If the ap_[u]int is wider than 53 bits (the number of bits in the mantissa of a
double), the resulting double may not have the exact value expected.
Sizeof
When the standard C++ sizeof() function is used with ap_[u]int types, it returns the
number of bytes. The following sets the value of var1 to 32.
You can use the width data member to extract the data width of an existing ap_[u]int<>
data type to create another ap_[u]int<> data type at compile time. The following
example shows how the size of variable Res is defined as 1-bit greater than variables Val1
and Val2:
This ensures that Vivado HLS correctly models the bit-growth caused by the addition even
if you update the value of INPUT_DATA_WIDTH for data_t.
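The idea of deriving one type's width from another at compile time can be sketched without the Vivado HLS headers. In the sketch below, int_w is a hypothetical stand-in that models only the width data member of ap_int<>; the names res_t and result_width are likewise illustrative.

```cpp
#include <cassert>

// Stand-in for ap_int<W>: models only the compile-time `width` member.
template <int W>
struct int_w {
    static const int width = W;
};

const int INPUT_DATA_WIDTH = 8;          // name taken from the text above
typedef int_w<INPUT_DATA_WIDTH> data_t;  // type of the inputs
typedef int_w<data_t::width + 1> res_t;  // one bit wider, tracks data_t

// Returns the width chosen for the result type.
int result_width() { return res_t::width; }
```

Changing INPUT_DATA_WIDTH changes both widths together, so the result type always keeps one bit of headroom for the addition.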
Even though Var1 and Var2 have different precisions, the fixed-point type ensures that
the decimal point is correctly aligned before the operation (an addition in this case) is
performed. You are not required to perform any operations in the C code to align the
decimal point.
The type used to store the result of any fixed-point arithmetic operation must be large
enough (in both the integer and fractional bits) to store the full result.
• overflow handling (when the result has more MSBs than the assigned type supports)
• quantization (or rounding, when the result has fewer LSBs than the assigned type
supports)
The ap_[u]fixed type provides various options on how the overflow and
quantization are performed. The options are discussed below.
ap_[u]fixed Representation
In ap_[u]fixed types, a fixed-point value is represented as a sequence of bits with a
specified position for the binary point.
• Bits to the left of the binary point represent the integer part of the value.
• Bits to the right of the binary point represent the fractional part of the value.
ap_[u]fixed<int W,
int I,
ap_q_mode Q,
ap_o_mode O,
ap_sat_bits N>;
• The W attribute takes one parameter, the total number of bits for the word. Only a
constant integer expression can be used as the parameter value.
• The I attribute takes one parameter, the number of bits to represent the integer part.
Quantization Modes
• Rounding to plus infinity AP_RND
• Rounding to zero AP_RND_ZERO
• Rounding to minus infinity AP_RND_MIN_INF
• Rounding to infinity AP_RND_INF
• Convergent rounding AP_RND_CONV
• Truncation AP_TRN
• Truncation to zero AP_TRN_ZERO
AP_RND
• Round the value to the nearest representable value for the specific ap_[u]fixed type.
ap_fixed<3, 2, AP_RND, AP_SAT> UAPFixed4 = 1.25; // Yields: 1.5
ap_fixed<3, 2, AP_RND, AP_SAT> UAPFixed4 = -1.25; // Yields: -1.0
AP_RND_ZERO
• Round the value to the nearest representable value.
• Round towards zero.
° For negative values, add the least significant bits to get the nearest representable
value.
ap_fixed<3, 2, AP_RND_ZERO, AP_SAT> UAPFixed4 = 1.25; // Yields: 1.0
ap_fixed<3, 2, AP_RND_ZERO, AP_SAT> UAPFixed4 = -1.25; // Yields: -1.0
AP_RND_MIN_INF
• Round the value to the nearest representable value.
• Round towards minus infinity.
AP_RND_INF
• Round the value to the nearest representable value.
• The rounding depends on the least significant bit.
° For positive values, if the least significant bit is set, round towards plus infinity.
Otherwise, round towards minus infinity.
° For negative values, if the least significant bit is set, round towards minus infinity.
Otherwise, round towards plus infinity.
ap_fixed<3, 2, AP_RND_INF, AP_SAT> UAPFixed4 = 1.25; // Yields: 1.5
ap_fixed<3, 2, AP_RND_INF, AP_SAT> UAPFixed4 = -1.25; // Yields: -1.5
AP_RND_CONV
• Round the value to the nearest representable value.
• The rounding depends on the least significant bit.
AP_TRN
• Round the value to the nearest representable value.
• Always round the value towards minus infinity.
ap_fixed<3, 2, AP_TRN, AP_SAT> UAPFixed4 = 1.25; // Yields: 1.0
ap_fixed<3, 2, AP_TRN, AP_SAT> UAPFixed4 = -1.25; // Yields: -1.5
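The worked values shown for AP_RND, AP_RND_ZERO, AP_RND_INF, and AP_TRN can be reproduced with plain integer arithmetic. The sketch below is an emulation, not the ap_[u]fixed implementation: the enum and function names are hypothetical, and values are passed as raw integers scaled by a power of two (1.25 with two fraction bits is the raw value 5; removing one fraction bit leaves units of 0.5).

```cpp
#include <cassert>

// Hypothetical mode names mirroring AP_RND, AP_RND_ZERO, AP_RND_INF, AP_TRN.
enum QuantMode { Q_RND, Q_RND_ZERO, Q_RND_INF, Q_TRN };

// Floor division for a positive divisor (C++ '/' truncates toward zero).
long long floor_div(long long a, long long b) {
    long long q = a / b;
    if (a % b != 0 && a < 0) --q;
    return q;
}

// `raw` is a value scaled by 2^(f + drop). The result is scaled by 2^f:
// `drop` low-order fraction bits are removed according to `mode`.
long long quantize(long long raw, int drop, QuantMode mode) {
    assert(drop > 0 && drop < 62);
    const long long step = 1LL << drop;
    const long long half = step / 2;
    switch (mode) {
    case Q_RND:       // nearest, ties toward plus infinity
        return floor_div(raw + half, step);
    case Q_RND_ZERO:  // nearest, ties toward zero
        return floor_div(raw + half - (raw >= 0 ? 1 : 0), step);
    case Q_RND_INF:   // nearest, ties away from zero
        return floor_div(raw + half - (raw < 0 ? 1 : 0), step);
    case Q_TRN:       // always toward minus infinity
        return floor_div(raw, step);
    }
    return 0;
}
```

For example, quantize(5, 1, Q_RND) returns 3 (1.5) and quantize(-5, 1, Q_TRN) returns -3 (-1.5), matching the values listed for those modes.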
AP_TRN_ZERO
Overflow Modes
• Saturation AP_SAT
• Saturation to zero AP_SAT_ZERO
• Symmetrical saturation AP_SAT_SYM
• Wrap-around AP_WRAP
• Sign magnitude wrap-around AP_WRAP_SM
AP_SAT
AP_SAT_ZERO
AP_SAT_SYM
AP_WRAP
If N>0:
AP_WRAP_SM
If N>0:
When compiling software models that use these classes, it may be necessary to specify the
location of the Vivado HLS header files, for example by adding the
“-I/<HLS_HOME>/include” option for g++ compilation.
• ap_fixed<W,I> (signed)
• ap_ufixed<W,I> (unsigned)
You can create user-defined types with the C/C++ typedef statement:
That is, a floating-point value that is typically either single precision or double
precision.
Such floating-point constants are interpreted and translated into the full width of the
arbitrary precision fixed-point variable depending on the sign of the value. Support is also
provided for using the C99 standard hexadecimal floating-point constants.
#include <ap_fixed.h>
The ap_[u]fixed types do not support initialization if they are used in an array of
std::complex types.
The easiest way to output any value stored in an ap_[u]fixed variable is to use the C++
standard output stream, std::cout (#include <iostream> or <iostream.h>).
The stream insertion operator, “<<”, is overloaded to correctly output the full range of
values possible for any given ap_[u]fixed variable. The following stream manipulators
are also supported, allowing formatting of the value as shown.
• dec (decimal)
• hex (hexadecimal)
• oct (octal)
#include <iostream.h>
// Alternative: #include <iostream>
1. Convert the value to a C++ std::string using the ap_[u]fixed class method
to_string().
2. Convert the result to a null-terminated C character string using the std::string class
method c_str().
• 2 (binary)
• 8 (octal)
• 10 (decimal)
• 16 (hexadecimal) (default)
The ap_[u]fixed types are supported by the following C++ manipulator functions:
• setprecision
• setw
• setfill
The setprecision manipulator sets the decimal precision to be used. It takes one parameter
n as the value of decimal precision, where n specifies the maximum number of meaningful
digits to display in total (counting both those before and those after the decimal point).
The example above displays the following results where the printed results are rounded
when the actual precision exceeds the specified precision:
3.1416
3.14159
1.2346e+05
where
If the standard width of the representation is shorter than the field width, the
representation is padded with fill characters. Fill characters are controlled by the setfill
manipulator which takes one parameter f as the padding character.
ap_fixed<65,32> aa = 123456;
int precision = 5;
cout<<setprecision(precision)<<setw(13)<<setfill('T')<<aa<<endl;
TTT1.2346e+05
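The combined effect of setprecision, setw, and setfill can be tried out in standard C++ without the Vivado HLS headers. The sketch below uses a double as a stand-in for an ap_[u]fixed variable, on the assumption that the formatting follows the results printed above; format_value is a hypothetical helper name.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Formats `value` with the three manipulators discussed above.
// A double is used as a stand-in for an ap_[u]fixed variable.
std::string format_value(double value, int precision, int width, char fill) {
    std::ostringstream os;
    os << std::setprecision(precision)  // maximum number of meaningful digits
       << std::setw(width)              // minimum field width
       << std::setfill(fill)            // padding character
       << value;
    return os.str();
}
```

Calling format_value(123456.0, 5, 13, 'T') pads the 10-character representation to a field of 13 characters with the fill character.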
All values of smaller bit-width are zero or sign-extended depending on the sign of the
source value. You may need to insert casts to obtain alternative signs when assigning
smaller bit-widths to larger.
• Truncations
Truncation occurs when you assign an arbitrary precision fixed-point of larger bit-width
than the destination variable.
• ap_[u]fixed
• ap_[u]int
• C/C++
The result type ap_[u]fixed::RType depends on the type information of the two
operands.
Because Val2 has the larger bit-width in both the integer part and the fraction part, the
result type has the same bit-widths plus one, so that it can store all possible result values.
Subtraction
ap_[u]fixed::RType ap_[u]fixed::operator - (ap_[u]fixed op)
The result type ap_[u]fixed::RType depends on the type information of the two
operands.
Because Val2 has the larger bit-width in both the integer part and the fraction part, the
result type has the same bit-widths plus one, so that it can store all possible result values.
Multiplication
ap_[u]fixed::RType ap_[u]fixed::operator * (ap_[u]fixed op)
This shows the multiplication of Val1 and Val2. The result type is the sum of their integer
part bit-width and their fraction part bit width.
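The bit-growth rules described for addition, subtraction, and multiplication can be written down as a small calculation. This is a sketch under assumptions: FixedFormat and the helper functions are hypothetical, both operands are taken to be signed, and the extra bit for addition is assumed to be added to the integer part.

```cpp
#include <cassert>

// Hypothetical description of a fixed-point format: total width w and
// integer width i (the fraction width is w - i).
struct FixedFormat {
    int w;
    int i;
    int frac() const { return w - i; }
};

FixedFormat make_format(int w, int i) {
    FixedFormat f;
    f.w = w;
    f.i = i;
    return f;
}

static int max_int(int a, int b) { return a > b ? a : b; }

// Addition/subtraction: widest integer and fraction parts, plus one
// growth bit (assumed here to go to the integer part).
FixedFormat add_result(FixedFormat a, FixedFormat b) {
    const int i = max_int(a.i, b.i) + 1;
    const int f = max_int(a.frac(), b.frac());
    return make_format(i + f, i);
}

// Multiplication: integer widths add and fraction widths add.
FixedFormat mul_result(FixedFormat a, FixedFormat b) {
    return make_format(a.w + b.w, a.i + b.i);
}
```

For operands with formats (W = 5, I = 2) and (W = 75, I = 62), this gives (W = 76, I = 63) for the sum and (W = 80, I = 64) for the product.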
Division
ap_[u]fixed::RType ap_[u]fixed::operator / (ap_[u]fixed op)
This shows the division of Val1 and Val2. To preserve enough precision:
• The integer bit-width of the result type is the sum of the integer bit-width of Val1 and
the fraction bit-width of Val2.
• The fraction bit-width of the result type is the sum of the fraction bit-width of Val1 and
the whole bit-width of Val2.
Applies a bitwise operation on an arbitrary precision fixed-point and a given operand op.
Bitwise AND
ap_[u]fixed::RType ap_[u]fixed::operator & (ap_[u]fixed op)
Applies a bitwise AND operation on an arbitrary precision fixed-point and a given operand op.
Bitwise XOR
ap_[u]fixed::RType ap_[u]fixed::operator ^ (ap_[u]fixed op)
Applies an xor bitwise operation on an arbitrary precision fixed-point and a given operand
op.
Post-Increment
ap_[u]fixed ap_[u]fixed::operator ++ (int)
Pre-Decrement
ap_[u]fixed ap_[u]fixed::operator -- ()
This operator function prefix decreases this arbitrary precision fixed-point variable by 1.
Post-Decrement
ap_[u]fixed ap_[u]fixed::operator -- (int)
Unary Operators
Addition
ap_[u]fixed ap_[u]fixed::operator + ()
Subtraction
ap_[u]fixed::RType ap_[u]fixed::operator - ()
Equality Zero
bool ap_[u]fixed::operator ! ()
Bitwise Inverse
ap_[u]fixed::RType ap_[u]fixed::operator ~ ()
Shift Operators
Unsigned Shift Left
ap_[u]fixed ap_[u]fixed::operator << (ap_uint<_W2> op)
• char
• short
• int
• long
The return type of the shift left operation is the same width as the type being shifted.
ap_uint<4> sh = 2;
The bit-width of the result is (W = 25, I = 15). Because the shift left operation result type
is the same as the type of Val:
If a result of 21.5 is required, Val must first be cast to a wider type, for example
ap_ufixed<10, 7>(Val).
This operator:
• char
• short
• int
• long
The return type of the shift left operation is the same width as the type being shifted.
ap_int<4> Sh = 2;
Result = Val << Sh; // Shift left, yields -10.25
Sh = -2;
Result = Val << Sh; // Shift right, yields 1.25
• char
• short
• int
• long
The return type of the shift right operation is the same width as the type being shifted.
ap_uint<4> sh = 2;
If it is necessary to preserve all significant bits, extend the fraction part bit-width of Val
first, for example ap_fixed<10, 5>(Val).
This operator:
The operand can be a C/C++ integer type (char, short, int, or long).
The return type of the shift right operation is the same width as the type being shifted. For
example:
ap_int<4> Sh = 2;
Result = Val >> Sh; // Shift right, yields 1.25
Sh = -2;
Result = Val >> Sh; // Shift left, yields -10.5
Relational Operators
Equality
bool ap_[u]fixed::operator == (ap_[u]fixed op)
This operator compares the arbitrary precision fixed-point variable with a given operand.
Returns true if they are equal and false if they are not equal.
The type of operand op can be ap_[u]fixed, ap_int or C/C++ integer types. For
example:
bool Result;
Inequality
bool ap_[u]fixed::operator != (ap_[u]fixed op)
This operator compares this arbitrary precision fixed-point variable with a given operand.
Returns true if they are not equal and false if they are equal.
• ap_[u]fixed
• ap_int
• C or C++ integer types
For example:
bool Result;
Returns true if they are equal or if the variable is greater than the operand and false
otherwise.
For example:
bool Result;
This operator compares a variable with a given operand, and returns true if it is equal to or
less than the operand and false if not.
For example:
bool Result;
Greater than
bool ap_[u]fixed::operator > (ap_[u]fixed op)
This operator compares a variable with a given operand, and returns true if it is greater than
the operand and false if not.
For example:
bool Result;
Less than
bool ap_[u]fixed::operator < (ap_[u]fixed op)
This operator compares a variable with a given operand, and returns true if it is less than
the operand and false if not.
The type of operand op can be ap_[u]fixed, ap_int, or C/C++ integer types. For
example:
bool Result;
Bit Operator
Bit-Select and Set
af_bit_ref ap_[u]fixed::operator [] (int bit)
This operator selects one bit from an arbitrary precision fixed-point value and returns it.
The returned value is a reference value that can set or clear the corresponding bit in the
ap_[u]fixed variable. The bit argument must be an integer value and it specifies the
index of the bit to select. The least significant bit has index 0. The highest permissible index
is one less than the bit-width of this ap_[u]fixed variable.
Value[3]; // Yields 1
Value[4]; // Yields 0
Bit Range
af_range_ref ap_(u)fixed::range (unsigned Hi, unsigned Lo)
af_range_ref ap_(u)fixed::operator () (unsigned Hi, unsigned Lo)
This operation is similar to bit-select operator [] except that it operates on a range of bits
instead of a single bit.
It selects a group of bits from the arbitrary precision fixed-point variable. The Hi argument
provides the upper range of bits to be selected. The Lo argument provides the lowest bit to
be selected. If Lo is larger than Hi the bits selected are returned in the reverse order.
The return type af_range_ref represents a reference in the range of the ap_[u]fixed
variable specified by Hi and Lo. For example:
ap_uint<4> Result = 0;
ap_ufixed<4, 2> Value = 1.25;
ap_uint<8> Repl = 0xAA;
Range Select
af_range_ref ap_(u)fixed::range ()
af_range_ref ap_(u)fixed::operator []
This operation is the special case of the range select operator []. It selects all bits from this
arbitrary precision fixed-point value in the normal order.
The return type af_range_ref represents a reference to the range specified by Hi = W - 1 and
Lo = 0. For example:
ap_uint<4> Result = 0;
Length
int ap_[u]fixed::length ()
This function returns an integer value that provides the number of bits in an arbitrary
precision fixed-point value. It can be used with a type or a value. For example:
This member function returns this fixed-point value in form of IEEE double precision format.
For example:
Fixed-to-ap_int
ap_int ap_[u]fixed::to_ap_int ()
This member function explicitly converts this fixed-point value to ap_int that captures all
integer bits (fraction bits are truncated). For example:
Fixed-to-integer
int ap_[u]fixed::to_int ()
unsigned ap_[u]fixed::to_uint ()
ap_slong ap_[u]fixed::to_int64 ()
ap_ulong ap_[u]fixed::to_uint64 ()
These member functions explicitly convert this fixed-point value to C built-in integer types.
For example:
You can use these data members to extract the following information from any existing
ap_[u]fixed<> data type:
For example, you can use these data members to extract the data width of an existing
ap_[u]fixed<> data type to create another ap_[u]fixed<> data type at compile time.
The following example shows how the size of variable Res is automatically defined as 1-bit
greater than variables Val1 and Val2 with the same quantization modes:
This ensures that Vivado HLS correctly models the bit-growth caused by the addition even
if you update the value of INPUT_DATA_WIDTH, IN_INTG_WIDTH, or the quantization modes
for data_t.
There are some differences in the behavior between Vivado HLS types and SystemC types.
These differences are discussed in this section and cover the following topics.
• Default constructor
• Integer division
• Integer modulus
• Negative shifts
• Over-shift left
• Range operation
• Fixed-point division
• Fixed-point right-shift
• Fixed-point left-shift
Default Constructor
In SystemC, the constructor for the following types initializes the values to zero before
execution of the program:
• sc_[u]int
• sc_big[u]int
• sc_[u]fixed
The following Vivado HLS types are not initialized by the constructor:
• ap_[u]int
• ap_[u]fixed
In summary:
• ap_[u]int: No default initialization
• ap_[u]fixed: No default initialization
• sc_[u]int: Default initialization to 0
• sc_big[u]int: Default initialization to 0
• sc_[u]fixed: Default initialization to 0
CAUTION! When migrating SystemC types to Vivado HLS types, be sure that no variables are read or
used in conditionals until they are written to.
SystemC designs can be started showing all outputs with a default value of zero, whether or
not the output has been written to. The same variables expressed as Vivado HLS types
remain unknown until written to.
Integer Division
When using integer division, Vivado HLS types are consistent with sc_big[u]int types
but behave differently than sc_[u]int types. The following figure shows an example.
X-Ref Target - Figure 4-15
The SystemC sc_int type returns a zero value when an unsigned integer is divided by a
negative signed integer. The Vivado HLS types, like the SystemC sc_bigint type,
represent the negative result.
Integer Modulus
When using the modulus operator, Vivado HLS types are consistent with sc_big[u]int
types, but behave differently than sc_[u]int types. The following figure shows an
example.
X-Ref Target - Figure 4-16
The SystemC sc_int type returns the value of the dividend of a modulus operation when:
The Vivado HLS types (like the SystemC sc_bigint type) return the positive result of
the modulus operation.
Negative Shifts
When the shift amount of a shift operation is a negative number, Vivado HLS ap_[u]int
types shift the value in the opposite direction. For example, a right-shift operation with a
negative shift amount performs a left shift.
The SystemC types sc_[u]int and sc_big[u]int behave differently in this case. The
following figure shows an example of this operation for both Vivado HLS and SystemC
types.
[Figure: Negative shift example with op = 24 and shift = -2 applied with the >> operator.
The three panels show return values of 96, 24, and 0.]
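The Vivado HLS behavior for negative shift amounts can be sketched in standard C++. This emulation is illustrative only: shift_right_hls_style is a hypothetical helper on a 64-bit integer, it does not model the bit-width of an ap_[u]int, and it is intended for non-negative values.

```cpp
#include <cassert>
#include <cstdint>

// Emulates the ap_[u]int behavior described above for operator >>:
// a negative shift amount shifts in the opposite direction.
std::int64_t shift_right_hls_style(std::int64_t value, int shift) {
    assert(shift > -63 && shift < 63);
    if (shift >= 0)
        return value >> shift;
    // Left shift by |shift|, written as a multiplication by a power of two.
    return value * (std::int64_t(1) << -shift);
}
```

With value = 24, a shift of 2 returns 6 and a shift of -2 returns 96.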
Over-Shift Left
When a shift operation is performed and the result overflows the input variable but not the
output or assigned variable, Vivado HLS types and SystemC types behave differently.
• Vivado HLS ap_[u]int shifts the value and then assigns it, meaning that the upper bits
are lost (or overflowed).
• Both SystemC sc_big[u]int and sc_[u]int types assign the result and then shift,
preserving the upper bits.
The following figure shows an example of this operation for both Vivado HLS and
SystemC types.
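The difference between the two orders of operation can be sketched with fixed-width C++ integers. The 8-bit source and 16-bit destination widths, and both function names, are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Vivado HLS order (as described above): shift in the 8-bit source type,
// then assign. Bits shifted past bit 7 are lost.
std::uint16_t shift_then_assign(std::uint8_t v, unsigned n) {
    assert(n < 8);
    const std::uint8_t shifted = static_cast<std::uint8_t>(v << n);
    return shifted;
}

// SystemC order: widen to the 16-bit destination first, then shift,
// so the upper bits are preserved.
std::uint16_t assign_then_shift(std::uint8_t v, unsigned n) {
    assert(n < 8);
    return static_cast<std::uint16_t>(static_cast<std::uint16_t>(v) << n);
}
```

For v = 0x81 and n = 4, the first function returns 0x10 and the second returns 0x810.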
Range Operation
There are differences in behavior when the range operation is used and the size of the
range is different between the source and destination. The following figure shows an
example of this operation for both Vivado HLS and SystemC types. See the summary below.
X-Ref Target - Figure 4-19
[Figure: Range operation example with repl = 0000 0000 00AB CDEF and
value = 0000 0000 1234 5678. After the range assignment, two panels show
value = 0000 0000 00AB CD78 and one panel shows value = 0000 0000 12AB CD78.]
• Vivado HLS ap_[u]int types and SystemC sc_big[u]int types replace the
specified range and extend to fill the target range with zeros.
• SystemC sc_[u]int types update only with the range of the source.
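The two range-assignment behaviors can be sketched on a 32-bit value. The function names, the 32-bit container, and the choice of a 24-bit target range with a 16-bit source are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

static std::uint32_t mask_bits(unsigned width) {
    return (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

// Replace bits [hi:lo] of `value` with `src`, zero-extending `src` so that
// the whole target range is overwritten.
std::uint32_t assign_range_zero_extend(std::uint32_t value, unsigned hi,
                                       unsigned lo, std::uint32_t src,
                                       unsigned src_w) {
    assert(hi < 32 && lo <= hi && src_w > 0);
    const std::uint32_t target = mask_bits(hi - lo + 1) << lo;
    return (value & ~target) | (((src & mask_bits(src_w)) << lo) & target);
}

// Replace only as many bits as the source provides. Target bits above
// lo + src_w - 1 keep their old value.
std::uint32_t assign_range_source_only(std::uint32_t value, unsigned hi,
                                       unsigned lo, std::uint32_t src,
                                       unsigned src_w) {
    assert(hi < 32 && lo <= hi && src_w > 0);
    const unsigned target_w = hi - lo + 1;
    const unsigned w = (src_w < target_w) ? src_w : target_w;
    return assign_range_zero_extend(value, lo + w - 1, lo, src, w);
}
```

With value = 0x12345678, a target range of bits 31 to 8, and the 16-bit source 0xABCD, the first function gives 0x00ABCD78 and the second gives 0x12ABCD78.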
For ap_[u]fixed types, the fraction is no greater than that of the dividend. SystemC
sc_[u]fixed types retain the fractional precision on divide. The fractional part can be
retained when using the ap_[u]fixed type by casting to the new variable width before
assignment.
The following figure shows an example of this operation for both Vivado HLS and SystemC
types.
X-Ref Target - Figure 4-20
ap_(u)fixed ≠ sc_(u)fixed

Vivado HLS (ap_(u)fixed):
#include "ap_fixed.h"
ap_fixed<3,3> dividend=2;
ap_fixed<4,4> divisor=4;
ap_fixed<4,2> ret=dividend/divisor;
//casting required to keep precision
ap_fixed<4,2> ret2=ap_fixed<4,2>(dividend)/divisor;

SystemC (sc_(u)fixed):
#define SC_INCLUDE_FX
#include "systemc.h"
sc_fixed<3,3> dividend=2;
sc_fixed<4,4> divisor=4;
sc_fixed<4,2> ret=dividend/divisor;
• With Vivado HLS fixed-point types, the shift is performed and then the value is
assigned.
• With SystemC fixed-point types, the value is assigned and then the shift is performed.
When the result is a fixed-point type with more fractional bits, the SystemC type preserves
the additional accuracy.
The following figure shows an example of this operation for both Vivado HLS and SystemC
types.
ap_(u)fixed ≠ sc_(u)fixed

Vivado HLS (ap_(u)fixed):
#include "ap_fixed.h"
ap_fixed<5,3,AP_RND,AP_SAT> val=3.75;
ap_fixed<5,3,AP_RND,AP_SAT> res=val>>2;
ap_fixed<7,3,AP_RND,AP_SAT> res2=val>>2;

SystemC (sc_(u)fixed):
#define SC_INCLUDE_FX
#include "systemc.h"
sc_fixed<5,3,SC_RND,SC_SAT> val=3.75;
sc_fixed<5,3,SC_RND,SC_SAT> res=val>>2;
sc_fixed<7,3,SC_RND,SC_SAT> res2=val>>2;
The type of quantization mode does not affect the result of the ap_[u]fixed right-shift. Xilinx
recommends that you assign to the size of the result type before the shift operation.
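The effect of the two orders of operation on a fixed-point right shift can be sketched with raw scaled integers. The helper names are hypothetical, values are held as integers scaled by 2^frac_bits, and the sketch assumes non-negative values and a destination with at least as many fraction bits as the source.

```cpp
#include <cassert>

// Vivado HLS order: shift in the source format, then assign to the
// destination format. Bits shifted out of the source format are lost.
double fx_shift_then_assign(long raw, int src_frac, int dst_frac, int n) {
    assert(raw >= 0 && n >= 0 && dst_frac >= src_frac);
    const long shifted = raw >> n;
    const long widened = shifted << (dst_frac - src_frac);
    return static_cast<double>(widened) / static_cast<double>(1L << dst_frac);
}

// SystemC order: assign to the wider destination format, then shift,
// which keeps the extra fraction bits.
double fx_assign_then_shift(long raw, int src_frac, int dst_frac, int n) {
    assert(raw >= 0 && n >= 0 && dst_frac >= src_frac);
    const long widened = raw << (dst_frac - src_frac);
    return static_cast<double>(widened >> n) / static_cast<double>(1L << dst_frac);
}
```

For 3.75 held with two fraction bits (raw value 15), shifted right by 2 into a format with four fraction bits, the first order gives 0.75 and the second gives 0.9375.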
The following figure shows an example of this operation for both Vivado HLS and SystemC
types.
X-Ref Target - Figure 4-22
ap_(u)fixed ≠ sc_(u)fixed

Vivado HLS (ap_(u)fixed):
#include "ap_fixed.h"
ap_fixed<5,3,AP_RND,AP_SAT> val=3.75;
ap_fixed<5,3,AP_RND,AP_SAT> res=val<<2;
ap_fixed<7,5,AP_RND,AP_SAT> res2=val<<2;

SystemC (sc_(u)fixed):
#define SC_INCLUDE_FX
#include "systemc.h"
sc_fixed<5,3,SC_RND,SC_SAT> val=3.75;
sc_fixed<5,3,SC_RND,SC_SAT> res=val<<2;
sc_fixed<7,5,SC_RND,SC_SAT> res2=val<<2;
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual
property at all stages of the design cycle. Topics include design assistance, advisories, and
troubleshooting tips.
References
1. Introduction to FPGA Design with Vivado High-Level Synthesis (UG998)
2. Vivado ® Design Suite Tutorial: High-Level Synthesis (UG871)
3. Vivado Design Suite User Guide: Release Notes, Installation, and Licensing (UG973)
4. Floating-Point Design with Vivado HLS (XAPP599)
5. LogiCORE IP Fast Fourier Transform Product Guide (PG109)
6. LogiCORE IP FIR Compiler Product Guide (PG149)
7. LogiCORE IP DDS Compiler Product Guide (PG141)
8. Vivado Design Suite AXI Reference Guide (UG1037)
9. Accelerating OpenCV Applications with Zynq-7000 All Programmable SoC Using Vivado
HLS Video Libraries (XAPP1167)
10. UltraFast™ High-Level Productivity Design Methodology Guide (UG1197)
Training Resources
Xilinx provides a variety of training courses and QuickTake videos to help you learn more
about the concepts presented in this document. Use these links to explore related training
resources:
1. C-based Design: High-Level Synthesis with the Vivado HLS Tool Training Course
2. C-based HLS Coding for Hardware Designers Training Course
3. C-based HLS Coding for Software Designers Training Course
4. Vivado Design Suite QuickTake Video: Getting Started with Vivado High-Level Synthesis
5. Vivado Design Suite QuickTake Video Tutorials