UG1027: Introduction to the SDSoC Environment
User Guide
Revision History
The following table shows the revision history for this document.
Date          Version    Revision
07/20/2015    2015.2
Chapter 1
Getting Started
Download and install the SDSoC environment according to the directions provided in SDSoC
Environment User Guide: Getting Started (UG1028). The Getting Started guide provides detailed
instructions and hands-on tutorials to introduce the primary work flows for project creation,
specifying functions to run in programmable logic, system compilation, debugging, and
performance estimation. Working through these tutorials is the best way to get an overview of
the SDSoC environment, and should be considered a prerequisite to application development.
When running the SDSoC system compilers from the command-line or through makefile
flows, you must set the shell environment as described in SDSoC Environment User Guide:
Getting Started (UG1028) or the tools will not function properly.
The SDSoC environment includes the entire tool stack to create a bitstream, object
code, and executables. If you have installed the Xilinx Vivado Design Suite and
Software Development Kit tools independently, you should not attempt to combine these
installations with the SDSoC environment.
Feature Overview
The SDSoC environment inherits many of the tools in the Xilinx Software Development Kit
(SDK), including the GNU toolchain and standard libraries (for example, glibc, OpenCV) for the ARM
CPUs within Zynq devices, as well as the Target Communication Framework (TCF) and GDB
interactive debuggers, a performance analysis perspective within the Eclipse/CDT-based GUI,
and command-line tools.
The SDSoC environment includes system compilers (sdscc/sds++) that generate complete
hardware/software systems targeting Zynq devices, an Eclipse-based user interface to create
and manage projects and workflows, and a system performance estimation capability to explore
different "what if" scenarios for the hardware/software interface.
The SDSoC system compilers employ underlying tools from the Vivado Design Suite
(System Edition), including Vivado HLS, IP integrator (IPI), IP libraries for data movement and
interconnect, and the RTL synthesis, placement, routing, and bitstream generation tools.
The principle of design reuse underlies workflows you employ with the SDSoC environment,
using well established platform-based design methodologies. The SDSoC system compiler
generates an application-specific system on chip by extending a target platform. The SDSoC
environment includes a number of platforms for application development and others are
provided by Xilinx partners. SDSoC Environment User Guide: Platforms and Libraries (UG1146)
describes how to capture platform metadata so that a pre-existing design built using the Vivado Design Suite, and its corresponding software run-time environment, can be used to build an SDSoC platform for use in the SDSoC environment.
An SDSoC platform defines a base hardware and software architecture and application context,
including processing system, external memory interfaces, custom input/output, and software
run time, including the operating system (possibly "bare metal"), boot loaders, drivers for platform
peripherals, and the root file system. Every project you create within the SDSoC environment
targets a specific platform, and you employ the tools within the SDSoC IDE to customize
the platform with application-specific hardware accelerators and data motion networks
connecting accelerators to the platform. In this way, you can easily create highly tailored
application-specific systems-on-chip for different base platforms, and can reuse base platforms
for many different application-specific systems-on-chip.
Chapter 2
The first step is to select a development platform, cross-compile the application, and ensure it
runs properly on the platform. You then identify compute-intensive hot spots to migrate into
programmable logic to improve system performance, and to isolate them into functions that
can be compiled into hardware. You then invoke the SDSoC system compiler to generate a
complete system-on-chip and SD card boot image for your application. You can instrument
your code to analyze performance, and if necessary, optimize your system and hardware
functions using a set of directives and tools within the SDSoC environment.
The system generation process is orchestrated by the sdscc/sds++ system compilers through
the SDSoC IDE or in an SDSoC terminal shell using the command line and makefiles. Using the
SDSoC IDE or sdscc command line options, you select functions to run in hardware, specify
accelerator and system clocks, and set properties on data transfers (for example, interrupt vs.
polling for DMA transfers). You can insert pragmas into application source code to control
the system mapping and generation flows, providing directives to the system compiler for
implementing the accelerators and data motion networks.
Because a complete system compile can be time-consuming compared with an "object code"
compile for a CPU, the SDSoC environment provides a faster performance estimation capability
that allows you to approximate the expected speed up over a software-only implementation for
a given choice of hardware functions. This estimate is based on properties of the generated
system and estimates for the hardware functions provided by the IPs when available.
As shown in User Design Flow, the overall design process involves iterating the steps until the
generated system achieves your performance and cost objectives.
It is assumed that you have already worked through the introductory tutorials (see SDSoC
Environment User Guide: Getting Started (UG1028) ) and are familiar with project creation,
hardware function selection, compilation, and running a generated application on the target
platform. If you have not done so, it is recommended you do so before continuing.
If you are writing makefiles outside of the SDSoC IDE, you must include the -sds-pf command
line option on every call to sdscc.
sdscc -sds-pf <platform path name>
where the platform is either a file path or a named platform within the
<sdsoc_root>/platforms directory. To view the available base platforms from the
command line, run the following command.
sdscc -sds-pf-list
In addition to the available base platforms, you can find additional sample platforms in the
<sds_root>/samples/platforms directory. To create a new project for one of these
platforms within the SDSoC IDE, create a new project, select Other for the platform and
navigate to the desired sample platform.
To see the available clocks for a platform from the command line, execute the following:
$ sdscc -sds-pf-info zc702
Platform Description
====================
Basic platform targeting the ZC702 board, which includes 1GB of DDR3, 16MB QuadSPI Flash and an SDIO card interface. More information at http://www.xilinx.com/products/boards-and-kits/EK-Z7-ZC702-G.htm

Platform Information
====================
Name: zc702

Device
------
Architecture: zynq
Device: xc7z020
Package: clg484
Speed grade: -1

System Clocks
-------------
Clock ID    Frequency
----------|-----------
            666.666687
    0       166.666672
    1       142.857132
    2       100.000000
    3       200.000000
The underlying GNU toolchain is defined when you select the operating system during project
creation. The SDSoC system compilers (sdscc/sds++) automatically invoke the corresponding
toolchain when compiling code for the CPUs, including all source files not involved with
hardware functions.
All object code for the ARM CPUs is generated with the GNU toolchains, but the sdscc (and
sds++) compiler, built upon Clang/LLVM frameworks, is generally less forgiving of C/C++
language violations than the GNU compilers. As a result, you might find that some libraries
needed for your application cause front-end compiler errors when using sdscc. In such cases,
compile the source files directly through the GNU toolchain rather than through sdscc, either
in your makefiles or by setting the compiler Command to GCC or g++ by right-clicking on the
file (or folder) in the Project Explorer and selecting C/C++ Build > Settings > SDSCC/SDS++
Compiler.
The SDSoC system compilers generate an SD card image by default in a project subdirectory
named sd_card. For Linux applications, this directory includes the following files:
BOOT.BIN - the boot image, which contains the first stage boot loader (FSBL), the boot program (U-Boot), and the FPGA bitstream
To run the application, copy the contents of sd_card directory onto an SD card and insert
into the target board. Open a serial terminal connection to the target and power up the board
(see SDSoC Environment User Guide: Getting Started (UG1028) for more information). Linux boots, automatically logs you in as root, and enters a bash shell. The SD
card is mounted at /mnt, and from that directory you can run <app>.elf.
For standalone applications, the ELF, bitstream, and board support package (BSP) are contained
within BOOT.BIN, which automatically runs the application after the system boots.
2. In the SDSoC Project Overview window, click on Debug application. Note: the board must be connected to your computer and powered on. The application automatically breaks at the entry to main().
3. Launch the TCF Profiler by selecting Window > Show View > Other > Debug > TCF Profiler.
4. Start the TCF Profiler by clicking on the green Start button at the top of the TCF Profiler tab. Enable Aggregate per function in the Profiler Configuration dialog box.
5. Start the profiling by clicking on the Resume button. The program runs to completion and breaks at the exit() function.
Profiling provides a statistical method for finding hot spots based on sampling the CPU
program counter and correlating to the program in execution. Another way to measure
program performance is to instrument the application to determine the actual duration
between different parts of a program in execution.
The sds_lib library included in the SDSoC environment provides a simple, source code
annotation based time-stamping API that can be used to measure application performance.
/*
* @return value of free-running 64-bit Zynq(TM) global counter
*/
unsigned long long sds_clock_counter(void);
By using this API to collect timestamps and differences between them, you can determine
duration of key parts of your program. For example, you can measure data transfer or overall
round trip execution time for hardware functions as shown in the following code snippet:
#include "sds_lib.h"
unsigned long long total_run_time = 0;
unsigned int num_calls = 0;
unsigned long long count_val = 0;
#define sds_clk_start(){ \
count_val = sds_clock_counter(); \
num_calls++; \
}
#define sds_clk_stop() { \
long long tmp = sds_clock_counter(); \
total_run_time += (tmp - count_val); \
}
#define avg_cpu_cycles()(total_run_time / num_calls)
#define NUM_TESTS 1024
extern void f();
void measure_f_runtime()
{
for (int i = 0; i < NUM_TESTS; i++) {
sds_clock_start();
f();
sds_clock_stop();
}
printf("Average cpu cycles f(): %ld\n", avg_cpu_cycles());
}
The performance estimation feature within the SDSoC environment employs this API by
automatically instrumenting functions selected for hardware implementation, measuring actual
run-times by running the application on the target, and then comparing actual times with
estimated times for the hardware functions.
NOTE: While off-loading CPU-intensive functions is probably the most reliable heuristic
to partition your application, it is not guaranteed to improve system performance without
algorithmic modification to optimize memory accesses. A CPU almost always has much faster
random access to external memory than you can achieve from programmable logic, due to
multi-level caching and a faster clock speed (typically 2x to 8x faster than programmable
logic). Extensive manipulation of pointer variables over a large address range, for example,
a sort routine that sorts indices over a large index set, while very well-suited for a CPU, may
become a liability when moving a function into programmable logic. This does not mean that
such compute functions are not good candidates for hardware, only that code or algorithm
restructuring may be required. This issue is also well-known for DSP and GPU coprocessors.
Click on the icon in the Hardware Functions panel to display the list of candidate functions within your program. This list consists of functions in the call graph rooted at the Root Function listed in the General panel, by default main, but changeable by clicking on the ... button and selecting an alternative root function.
From within the popup window, you can select one or more functions for hardware acceleration
and click OK. The selected functions appear in the list box. Note that the Eclipse CDT indexing
mechanism is not foolproof, and you might need to close and reopen the selection popup
to view available functions. If a function does not appear in the list, you can navigate to
its containing file in the Project Explorer, expand the contents, right-click on the function
prototype, and select Toggle HW/SW.
From the command line, select a function foo in the file foo_src.c for hardware with the
following sdscc command line option.
-sds-hw foo foo_src.c -sds-end
If foo invokes sub-functions contained in files foo_sub0.c and foo_sub1.c, use the
-files option.
-sds-hw foo foo_src.c -files foo_sub0.c,foo_sub1.c -sds-end
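For example, a complete compile-and-link command for such a program might look like the following (a sketch only; the platform, source file names, and output name are placeholders):
sdscc -sds-pf zc702 -sds-hw foo foo_src.c -sds-end foo_src.c main.c -o app.elf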
Although the data motion network runs off of a single clock, it is possible to run hardware
functions at different clock rates to achieve higher performance. In the Hardware Functions
panel, select functions from the list and use the Clock Frequency pull-down menu to choose
their clocks. Be aware that it might not be possible to implement the hardware system with
some clock selections.
To set a clock on the command line, determine the corresponding clock ID using sdscc
-sds-pf-info <platform> and use the -clkid option.
-sds-hw foo foo_src.c -clkid 1 -sds-end
When moving a function optimized for CPU execution into programmable logic, you usually
need to revise the code to achieve the best performance. See A Programmer's Guide to High-Level Synthesis and Coding Guidelines for programming guidelines.
The sdscc command line options that control the performance estimation flow are:
-perf-funcs function_name_list
-perf-root function_name
-perf-est data_file
-perf-est-hw-only
CAUTION! After running the sd_card image on the board for collecting profile data, run cd
/; sync; umount /mnt;. This ensures that the sw_perf_data.xml file is written out
to the SD card.
A complete example of the makefile-based flow for performance estimation can be found in
<sdsoc_root>/samples/mmult_performance_estimation.
Chapter 3
Compile/link time errors can be the result of typical software syntax errors caught by
software compilers, or errors specific to the SDSoC environment flow, such as the design
being too large to fit on the target platform.
Runtime errors can be the result of general software issues such as null-pointer access,
or SDSoC environment-specific issues such as incorrect data being transferred to/from
accelerators.
Performance issues are related to the choice of the algorithms used for acceleration, the
time taken for transferring the data to/from the accelerator, and the actual speed at which
the accelerators and the data motion network operate.
Some tips for dealing with SDSoC environment specific errors follow.
Check for typos in pragmas that might prevent them from being applied to the correct
function.
Vivado Design Suite High-Level Synthesis (HLS) cannot meet the timing requirement.
Select a slower clock frequency for the accelerator in the SDSoC IDE (or with the
sdscc/sds++ command line parameter).
Modify the code structure to allow HLS to generate a faster implementation. See A Programmer's Guide to High-Level Synthesis for more information on how to do this.
In the SDSoC IDE, select a slower clock frequency for the data motion network or
accelerator, or both (from the command line, use sdscc/sds++ command line
parameters).
Modify the C/C++ code passed to HLS, or add more HLS directives to make the HLS
block go faster.
Reduce the size of the design in case the resource usage (see the Vivado tools report
in _sds/ipi/*.log and other log files in the subdirectories there) exceeds 80% or
so. See the next item for ways to reduce the design size.
Change the coding style for an accelerator function to produce a more compact
accelerator. You can reduce the amount of parallelism using the mechanisms described in A Programmer's Guide to High-Level Synthesis.
Modify pragmas and coding styles (pipelining) that cause multiple instances of
accelerators to be created.
Use pragmas to select smaller data movers such as AXIFIFO instead of AXIDMA_SG.
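A sketch of such a hardware function, reconstructed from the description below (the element type and the output argument name out_b are assumptions):

#pragma SDS data access_pattern(in_a:SEQUENTIAL, out_b:SEQUENTIAL)
void f1(int in_a[20], int out_b[20])
{
    // Reads and writes only 19 of the 20 declared elements, so a caller
    // waiting for all 20 values to be consumed or produced hangs.
    for (int i = 0; i < 19; i++) {
        out_b[i] = in_a[i];
    }
}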
Notice that the loop reads the in_a stream 19 times but the size of in_a[] is 20, so the
caller of f1 would wait forever (or hang) if it waited for f1 to consume all the data that
was streamed to it. Similarly, the caller would wait forever if it waited for f1 to send 20 int
values because f1 sends only 19. Program errors that lead to such hangs can be detected by instrumenting the code to flag streaming access errors, such as non-sequential access or incorrect access counts within a function, and running the application in software. Streaming access issues are
typically flagged as improper streaming access warnings in the log file, and it is left to
the user to determine if these are actual errors.
The following list shows other sources of run-time errors:
Software reading invalid data before a hardware accelerator has written the correct value
Inconsistent use of the memory consistency pragma #pragma SDS data mem_attribute, which can result in incorrect results.
For best performance improvement, the time required for executing the accelerated function
must be much smaller than the time required for executing the original software function. If
this is not true, try to run the accelerator at a higher frequency by selecting a different clkid
on the sdscc/sds++ command line. If that does not work, try to determine whether the
data transfer overhead is a significant part of the accelerated function execution time, and
reduce the data transfer overhead. Note that the default clkid corresponds to 100 MHz for all platforms. More details about the clkid values for a given platform can be obtained by running sdscc -sds-pf-info <platform name>.
If the data transfer overhead is large, the following changes might help:
Move more code into the accelerated function so that the computation time increases, and
the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to
transfer only the required data.
Debugging an Application
The SDSoC environment allows projects to be created and debugged using the SDSoC IDE.
Projects can also be created outside the SDSoC IDE (user-defined makefiles) and debugged
either on the command line or using the SDSoC IDE.
See SDSoC Environment User Guide: Getting Started (UG1028), Tutorial: Debugging Your
System for information on using the interactive debuggers in the SDSoC IDE.
Chapter 4
In the SDSoC environment, you control the system generation process by structuring hardware
functions and calls to hardware functions to balance communication and computation, and by
inserting pragmas into your source code to guide the sdscc system compiler.
The hardware/software interface is defined implicitly in your application source code once
you have selected a platform and a set of functions in the program to be implemented in
hardware. The sdscc/sds++ system compilers analyze the program data flow involving
hardware functions, schedule each such function call, and generate a hardware accelerator and
data motion network realizing the hardware functions in programmable logic. They do so
not by implementing each function call on the stack through the standard ARM application
binary interface, but instead by redefining hardware function calls as calls to function stubs
having the same interface as the original hardware function. These stubs are implemented with
low level function calls to a send/receive middleware layer that efficiently transfers data
between the platform memory and CPU and hardware accelerators, interfacing as needed to
underlying kernel drivers.
The send/receive calls are implemented in hardware with data mover IPs based on program
properties like memory allocation of array arguments, payload size, the corresponding hardware
interface for a function argument, as well as function properties such as memory access
patterns and latency of the hardware function.
Every transfer between the software program and a hardware function requires a data mover,
which consists of a hardware component that moves the data, and an operating system-specific
library function. The following table lists supported data movers and various properties for each.
Scalar variables are always transferred over an AXI4-Lite bus interface with the axi_lite data
mover. For array arguments, the data mover inference is based on transfer size, hardware
function port mapping, and function call site information. The axi_dma_simple data mover is
the most efficient bulk transfer engine, but only supports up to 8MB transfers, so for larger
transfers, the axi_dma_sg (scatter-gather DMA) data mover is required. The axi_fifo data
mover does not require as many hardware resources as the DMA, but due to its slower transfer
rates, is preferred only for payloads of up to 300 bytes.
You can override the data mover selection by inserting a pragma into program source
immediately before the function declaration, for example,
#pragma SDS data data_mover(A:AXIDMA_SIMPLE)
Note that #pragma SDS is always treated as a rule, not a hint, so you must ensure that its use conforms with the data mover requirements in the SDSoC Data Movers table.
Memory Allocation
The sdscc/sds++ compilers analyze your program and select data movers to match the
requirements for each hardware function call between software and hardware, based on payload
size, hardware interface on the accelerator, and properties of the function arguments. When the
compiler can guarantee an array argument is located in physically contiguous memory, it can
use the most efficient data movers. Allocating or memory-mapping arrays with the following
sds_lib library functions can inform the compiler that memory is physically contiguous.
void *sds_alloc(size_t size); // guarantees physically contiguous memory
void *sds_mmap(void *paddr, size_t size, void *vaddr); // paddr must point to physically contiguous memory
void *sds_register_dmabuf(void *vaddr, int fd); // assumes physically contiguous memory
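As an illustration (a minimal sketch; mmult_accel, the buffer size, and the calling code are placeholders, not part of the sds_lib API), buffers passed to a hardware function can be allocated with sds_alloc and released with sds_free:

#include <stdlib.h>
#include "sds_lib.h"

#define N 1024

void mmult_accel(float *A, float *B, float *C); // hypothetical hardware function

void run_accel(void)
{
    // sds_alloc returns physically contiguous memory, which lets the
    // compiler select an efficient data mover such as axi_dma_simple.
    float *A = (float *) sds_alloc(N * sizeof(float));
    float *B = (float *) sds_alloc(N * sizeof(float));
    float *C = (float *) sds_alloc(N * sizeof(float));
    if (!A || !B || !C) exit(1);

    mmult_accel(A, B, C);

    sds_free(A);
    sds_free(B);
    sds_free(C);
}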
It is possible that due to the program structure, the sdscc compiler cannot definitively deduce
the memory contiguity, and when this occurs, it issues a warning message, as shown:
WARNING: [SDSoC 0-0] Unable to determine the memory attributes passed to foo_arg_A of function
foo at foo.cpp:102
You can inform the compiler that the data is allocated in physically contiguous memory
by inserting the following pragma immediately before the function declaration (note: the
pragma does not guarantee physically contiguous allocation of memory; your code must
use sds_alloc to allocate such memory).
#pragma SDS data mem_attribute (A:PHYSICAL_CONTIGUOUS) // default is NON_PHYSICAL_CONTIGUOUS
For arrays passed as pointer-typed arguments to hardware functions, the compilers can sometimes infer the transfer size, but if they cannot, they issue the following error:
ERROR: [SDSoC 0:0] The bound callers of accelerator foo have different/indeterminate data size for port p
In this case, you must use the following pragma to specify the size of the data to be transferred:
You can vary the data transfer size on a per function call basis to avoid transferring data that
is not required by a hardware function by setting <array_size> in the pragma definition
to be an expression defined in the scope of the function call (that is, all variables in the size
expression must be scalar arguments to the function), for example:
#pragma SDS data copy(A[0:L+2*T/3]) // scalar arguments L, T to same function
Declaring an array as non-cacheable means the compiler does not need to ensure cache coherency when accessing the specified array in memory, but it becomes your responsibility to do so when necessary. A typical use case is a video application in which some frame buffers are accessed by programmable logic but not by the CPU.
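For reference, a sketch of the corresponding pragma, placed immediately before the function declaration (the argument name A and the function signature are placeholders):

#pragma SDS data mem_attribute(A:NON_CACHEABLE) // A is a placeholder argument name
void foo(int A[1024]);                          // hypothetical hardware function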
At the system level, the sdscc compiler chains together hardware functions when the data flow
between them does not require transferring arguments out of programmable logic and back to
system memory. For example, consider the code in the following figure, where mmult and
madd functions have been selected for hardware.
Because the intermediate array variable tmp1 is used only to pass data between the two
hardware functions, the sdscc system compiler chains the two functions together in hardware
with a direct connection between them.
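A minimal sketch of such a code structure (the array size N and the wrapper signature are assumptions; mmult and madd are the hardware functions named above):

void mmult_madd(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];   // used only to pass data between the two hardware functions
    mmult(A, B, tmp1); // hardware function
    madd(tmp1, C, D);  // hardware function
}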
It is instructive to consider a time line for the calls to hardware as shown in the following figure.
The program preserves the original program semantics, but instead of the standard ARM
procedure calling sequence, each hardware function call is broken into multiple phases
involving setup, execution, and cleanup, both for the data movers (DM) and the accelerators.
The CPU in turn sets up each hardware function (that is, the underlying IP control interface) and
the data transfers for the function call with non-blocking APIs, and then waits for all calls and
transfers to complete. In the example shown in the diagram, the mmult and madd functions
run concurrently whenever their inputs become available. The ensemble of function calls is
orchestrated in the compiled program by control code automatically generated by sdscc
according to the program, data mover, and accelerator structure.
In general, it is impossible for the sdscc compiler to determine side-effects of function calls in
your application code (for example, sdscc may have no access to source code for functions
within linked libraries), so any intermediate access of a variable occurring lexically between
hardware function calls requires the compiler to transfer data back to memory. For example, an injudicious change as simple as uncommenting a debug print statement in the "wrong place", as shown in the figure below, can result in a significantly different data transfer graph and, consequently, an entirely different generated system and application performance.
A program can invoke a single hardware function from multiple call sites. In this case, the
sdscc compiler behaves as follows. If any of the function calls results in "direct connection"
data flow, then sdscc creates an instance of the hardware function that services every similar
direct connection, and an instance of the hardware function that services the remaining calls
between memory ("software") and programmable logic.
Structuring your application code with "direct connection" data flow between hardware
functions is one of the best ways to achieve high performance in programmable logic. You can
create deep pipelines of accelerators connected with data streams, increasing the opportunity
for concurrent execution.
There is another way in which you can increase parallelism and concurrency using the sdscc
compiler. You can direct the compiler to create multiple instances of a hardware function by
inserting the following pragma immediately preceding a call to the function.
#pragma SDS async(<id>) // <id> a non-negative integer
This pragma creates a hardware instance that is referenced by <id>. The generated control
code for the hardware function call returns to the caller as soon as all of the setup has
completed without waiting for the function execution to complete. The program must correctly
synchronize with the function call by inserting a matching wait pragma for the same <id> at
an appropriate point in the program.
#pragma SDS wait(<id>) // <id> synchronizes to hardware function with <id>
A simple code snippet that creates two instances of a hardware function mmult is as follows.
{
#pragma SDS async(1)
mmult(A, B, C); // instance 1
#pragma SDS async(2)
mmult(D, E, F); // instance 2
#pragma SDS wait(1)
#pragma SDS wait(2)
}
The async mechanism gives the programmer the ability to handle "hardware threads" explicitly to achieve very high levels of parallelism and concurrency, but like any explicit multi-threaded programming model, it requires careful attention to synchronization details to avoid non-deterministic behavior or deadlocks.
Chapter 5
Coding Guidelines
This chapter contains general coding guidelines for application programming using the SDSoC system
compilers, with the assumption of starting from application code that has already been
cross-compiled for the ARM CPU within the Zynq device, using the GNU toolchain included
as part of the SDSoC environment.
Source files must be compiled with the sdscc/sds++ compilers if they:
contain hardware functions or calls to hardware functions
use sds_lib functions, for example, to allocate or memory-map buffers that are sent to hardware functions
contain functions in the transitive closure of the downward call graph of the above
All other source files can safely be compiled with the ARM GNU toolchain.
A large software project may include many files and libraries that are unrelated to the hardware
accelerator and data motion networks generated by sdscc. If the sdscc compiler issues errors
on source files unrelated to the generated hardware system (for example, from an OpenCV
library), you can compile these files through GCC instead of sdscc by right-clicking on the file (or folder) in the Project Explorer, selecting Properties > C/C++ Build > Settings, and setting the Command to GCC.
Makefile Guidelines
The makefiles provided with the designs in <sdsoc_root>/samples consolidate all sdscc
hardware function options into a single command line. This is not required, but has the benefit
of preserving the overall control structure and dependencies within a makefile without requiring
change to the makefile actions for files containing a hardware function.
You can define a make variable to capture the entire SDSoC environment command line, for example, CC = sds++ ${SDSFLAGS} for C++ files (or sdscc for C files). In this way, all SDSoC environment options are consolidated in the ${CC} variable. Define the platform and target OS once in this variable.
There must be a separate -sds-hw/-sds-end clause in the command line for each file
that contains a hardware function. For example:
-sds-hw foo foo.cpp -clkid 1 -sds-end
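A minimal makefile fragment along these lines might look as follows (a sketch only; the platform, target OS, file names, and clock ID are placeholders):

# Consolidate all SDSoC environment options in one variable.
SDSFLAGS = -sds-pf zc702 -target-os linux
CC = sds++ ${SDSFLAGS}

# One -sds-hw ... -sds-end clause for the file containing a hardware function.
HW_FLAGS = -sds-hw mmult_accel mmult_accel.cpp -clkid 1 -sds-end

# Recipe lines must begin with a tab character.
mmult.elf: main.o mmult_accel.o
	${CC} $^ -o $@

mmult_accel.o: mmult_accel.cpp
	${CC} ${HW_FLAGS} -c $< -o $@

main.o: main.cpp
	${CC} -c $< -o $@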
For the list of the SDSoC compiler and linker options, see SDSCC/SDS++ Compiler and Linker Options or use sdscc --help.
Hardware functions can execute concurrently under the control of a master thread. A
program can have multiple threads and processes, but must have only a single master
thread that controls hardware functions.
A top-level hardware function must be a global function, not a class method, and it cannot
be an overloaded function.
There is no support for exception handling in hardware functions.
It is an error to refer to a global variable within a hardware function or any of its
sub-functions when this global variable is also referenced by other functions running in
software.
If a hardware function returns a value, then the return type must be a scalar type that
fits in a 32-bit container.
A hardware function must have at least one argument.
An output or inout scalar argument to a hardware function should be assigned once.
Create a local variable when multiple assignments to an output or inout scalar are required within a hardware function.
Use predefined macros to guard code with #ifdef and #ifndef preprocessor statements; the macro names begin and end with two underscore characters (__). For examples, see SDSCC/SDS++ Compiler and Linker Options.
The __SDSCC__ macro is defined and passed as a -D option to sub-tools whenever
sdscc or sds++ is used to compile source files, and can be used to guard code
dependent on whether it is compiled by sdscc/sds++ or by another compiler, for
example a GNU host compiler.
When sdscc or sds++ compiles source files targeted for hardware acceleration using
Vivado HLS, the __SDSVHLS__ macro is defined and passed as a -D option, and can
be used to guard code dependent on whether high-level synthesis is run or not.
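For example, a minimal sketch of such guards (the ALLOC/FREE macros are illustrative conventions, not part of the SDSoC API):

#ifdef __SDSCC__
#include "sds_lib.h"
#define ALLOC(size) sds_alloc(size) /* physically contiguous when built with sdscc/sds++ */
#define FREE(ptr)   sds_free(ptr)
#else
#include <stdlib.h>
#define ALLOC(size) malloc(size)    /* ordinary heap memory with other compilers */
#define FREE(ptr)   free(ptr)
#endif

#ifndef __SDSVHLS__
/* Software-only code here is excluded when Vivado HLS synthesizes this file. */
#endif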
To avoid interface incompatibilities, you should only incorporate Vivado HLS interface type
directives and pragmas in your source code as described in Vivado HLS Function Argument
Types when sdscc fails to generate a suitable hardware interface directive.
To ensure alignment across the hardware/software interface, do not use hardware function
arguments that have type long, or an array of bool or struct.
Use #pragma SDS data zero_copy, which provides pointer semantics using shared memory, or #pragma SDS data copy, which maps the argument onto a stream and requires that array elements are accessed in index order. The data copy pragma is only required when the sdscc system compiler is unable to determine the data transfer size and issues an error. When you require non-sequential access to the array in the hardware function, you should change the pointer argument to an array with an explicit declaration of its dimensions, for example, A[1024].
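As a sketch (the function name, element type, and bound are placeholders):

// Before: pointer argument; the transfer size may be indeterminate and only
// index-order (streaming) access is supported by the generated interface.
void foo(int *A);

// After: an explicit array bound gives a compile-time transfer size and
// allows non-sequential access within the hardware function.
void foo(int A[1024]);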
Stub functions generated in the SDSoC environment transfer the exact number of bytes
according the compile-time determinable array bound of the corresponding argument in
the hardware function declaration. If a hardware function admits a variable data size,
you can use the following pragma to direct the SDSoC environment to generate code to
transfer data whose size is defined by an arithmetic expression:
#pragma SDS data copy|zero_copy(arg[0:<C_size_expr>])
where the <C_size_expr> must compile in the scope of the function declaration.
The zero_copy pragma directs the SDSoC environment to map the argument into shared
memory.
Be aware that mismatches between intended and actual data transfer sizes can cause the
system to hang at runtime, requiring laborious hardware debugging.
Align arrays transferred by DMAs on cache-line boundaries (for L1 and L2 caches). Use the
sds_alloc API provided with the SDSoC environment or posix_memalign() instead of
malloc() to allocate these arrays.
Align arrays to page boundaries to minimize the number of pages transferred with the
scatter-gather DMA, for example, for arrays allocated with malloc.
You must use sds_alloc to allocate an array for the following two cases:
1. You are using the zero_copy pragma for the array.
2. You are using pragmas to explicitly direct the system compiler to use Simple-DMA or 2D-DMA.
Chapter 6
1. Avoid using the long data type. The long data type is not portable between 64-bit architectures (such as x64) and 32-bit architectures (such as the ARM A9 in Zynq devices).
2. Avoid using arrays of bool. An array of bool has a different memory layout between ARM GCC and Vivado HLS.
3. Avoid using arrays of struct. An array of struct has a different memory layout between ARM GCC and Vivado HLS. In the future, the SDSoC environment will support arrays of struct with a compatible memory layout between ARM GCC and Vivado HLS.
4. Avoid using ap_int<>, ap_fixed<>, and hls::stream, except with data widths of 8, 16, 32, or 64 bits. Navigate to <SDSoC Installation Path>/samples/hls_if/hls_stream for a sample design showing how to use hls::stream in the SDSoC environment.
IMPORTANT: If you specify the interface using #pragma HLS interface for a top-level
function argument, the SDSoC environment does not generate an HLS interface directive for
that argument, and it is your responsibility to ensure that the generated hardware interface
is consistent with all other function argument hardware interfaces. Because a function with
incompatible HLS interface types can result in cryptic sdscc error messages, it is strongly
recommended (though not absolutely mandatory) that you omit HLS interface pragmas.
Optimization Guidelines
This section documents several fundamental HLS optimization techniques to enhance hardware
function performance. These techniques are: function inlining, loop and function pipelining,
loop unrolling, increasing local memory bandwidth and streaming data flow between loops
and functions.
Function Inlining
Similar to function inlining in software, it can be beneficial to inline hardware functions. Function inlining replaces a function call by substituting a copy of the function body after resolving the actual and formal arguments. After that, the inlined function is dissolved and no longer appears as a separate level of hierarchy. Function inlining allows operations within the inlined function to be optimized more effectively with surrounding operations, thus improving the overall latency or the initiation interval for a loop.
To inline a function, put #pragma HLS inline at the beginning of the body of the desired
function. The following code snippet directs Vivado HLS to inline the mmult_kernel function:
void mmult_kernel(float in_A[A_NROWS][A_NCOLS],
float in_B[A_NCOLS][B_NCOLS],
float out_C[A_NROWS][B_NCOLS])
{
#pragma HLS INLINE
int index_a, index_b, index_d;
// rest of code body omitted
}
Loop Pipelining
In sequential languages such as C/C++, the operations in a loop are executed sequentially and
the next iteration of the loop can only begin when the last operation in the current loop
iteration is complete. Loop pipelining allows the operations in a loop to be implemented in a
concurrent manner as shown in the following figure.
As shown in the above figure, without pipelining, there are three clock cycles between the two
RD operations and it requires six clock cycles for the entire loop to finish. However, with
pipelining, there is only one clock cycle between the two RD operations and it requires four
clock cycles for the entire loop to finish, that is, the next iteration of the loop can start before
the current iteration is finished.
An important term for loop pipelining is the Initiation Interval (II), which is the number of clock cycles between the start times of consecutive loop iterations. In the Loop Pipelining figure, the Initiation Interval (II) is one, because there is only one clock cycle between the start times of consecutive loop iterations.
To pipeline a loop, put #pragma HLS pipeline at the beginning of the loop body, as
illustrated in the following code snippet. Vivado HLS tries to pipeline the loop with minimum
Initiation Interval.
for (index_a = 0; index_a < A_NROWS; index_a++) {
for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
float result = 0;
for (index_d = 0; index_d < A_NCOLS; index_d++) {
float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
result += product_term;
}
out_C[index_a * B_NCOLS + index_b] = result;
}
}
Loop Unrolling
Loop unrolling is another technique to exploit parallelism between loop iterations. It creates
multiple copies of the loop body and adjusts the loop iteration counter accordingly. The
following code snippet shows a normal rolled loop:
int sum = 0;
for(int i = 0; i < 10; i++) {
sum += a[i];
}
Unrolled by a factor of two, the same loop becomes:
int sum = 0;
for(int i = 0; i < 10; i+=2) {
    sum += a[i];
    sum += a[i+1];
}
So unrolling a loop by a factor of N basically creates N copies of the loop body, the loop variable referenced by each copy is updated accordingly (such as a[i+1] in the above code snippet), and the loop iteration counter is also updated accordingly (such as i+=2 in the above code snippet).
Loop unrolling creates more operations in each loop iteration, so that Vivado HLS can exploit
more parallelism among these operations. More parallelism means more throughput and
higher system performance. If the factor N is less than the total number of loop iterations (10
in the example above), it is called a "partial unroll". If the factor N is the same as the number
of loop iterations, it is called a "full unroll". Obviously, "full unroll" requires the loop bounds
be known at compile time but exposes the most parallelism.
To unroll a loop, simply put #pragma HLS unroll [factor=N] at the beginning of the
desired loop. Without the optional factor=N , the loop will be fully unrolled.
int sum = 0;
for(int i = 0; i < 10; i++) {
#pragma HLS unroll factor=2
    sum += a[i];
}
Factors Limiting the Parallelism Achieved by Loop Pipelining and Loop Unrolling
Both loop pipelining and loop unrolling exploit the parallelism between loop iterations.
However, parallelism between loop iterations is limited by two main factors: one is the data
dependencies between loop iterations, the other is the number of available hardware resources.
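A minimal sketch of a loop with a dependence carried through scalar variables a and b, as discussed in the next paragraph (the specific operations and the bound N are assumptions):

float a = 0.0f, b = 1.0f;
for (int i = 0; i < N; i++) {  // N assumed to be defined elsewhere
    a = b + i;  // this iteration's a depends on the b from the previous iteration
    b = a * 2;  // ... and b depends on the a just computed
}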
Obviously, operations in the next iteration of this loop cannot start until the current iteration has calculated and updated the values of a and b. Array accesses are a common source of loop-carried dependences, as shown in the following example:
for (i = 1; i < N; i++)
mem[i] = mem[i-1] + i;
In this case, the next iteration of the loop must wait until the current iteration updates the
content of the array. In case of loop pipelining, the minimum Initiation Interval is the total
number of clock cycles required for the memory read, the add operation, and the memory write.
Another performance limiting factor for loop pipelining and loop unrolling is the number of available hardware resources. The following figure shows an example of the issues created by resource limitations, which in this case prevent the loop from being pipelined with an initiation interval of 1.
In this example, if the loop is pipelined with an initiation interval of one, there are two read operations. If the memory has only a single port, then the two read operations cannot be executed simultaneously and must be executed in two cycles, so the minimal initiation interval can only be two, as shown in part (B) of the figure. The same can happen with other hardware resources. For example, if op_compute is implemented with a DSP core that cannot accept new inputs every cycle, and there is only one such DSP core, then op_compute cannot be issued to the DSP core each cycle, and an initiation interval of one is not possible.
Array Partitioning
Arrays can be partitioned into smaller arrays. Physical implementations of memory have only a
limited number of read ports and write ports, which can limit the throughput of a load/store
intensive algorithm. The memory bandwidth can sometimes be improved by splitting up
the original array (implemented as a single memory resource) into multiple smaller arrays
(implemented as multiple memories), effectively increasing the number of load/store ports.
Vivado HLS provides three types of array partitioning, as shown in Array Partitioning.
1. block: The original array is split into equally sized blocks of consecutive elements of the original array.
2. cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
3. complete: The default operation is to split the array into its individual elements. This corresponds to implementing an array as a collection of registers rather than as a memory.
The array partition pragma is placed in the hardware function source code. For block and cyclic partitioning, the factor option can be used to specify the number of arrays that are created. In the figure Array Partitioning, a factor of two is used, dividing the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the last array will have fewer than average elements.
When partitioning multi-dimensional arrays, the dim option can be used to specify which
dimension is partitioned. The following figure shows an example of partitioning different
dimensions of a multi-dimensional array.
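For reference, a sketch of the Vivado HLS pragma with its type, factor, and dim options (the variable name and dimensions are placeholders):

float buf[32][64];
#pragma HLS array_partition variable=buf cyclic factor=2 dim=2 // split dimension 2 into two interleaved banks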
Array Reshaping
Arrays can also be reshaped to increase the memory bandwidth. Reshaping takes different
elements from a dimension in the original array, and combines them into a single wider
element. Array reshaping is similar to array partitioning, but instead of partitioning into multiple
arrays, it widens array elements. The following figure illustrates the concept of array reshaping.
The array reshape pragma is likewise placed in the hardware function source code, and its options have the same meaning as those of the array partition pragma.
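A corresponding sketch for reshaping (again with an assumed name and size):

float buf[128];
#pragma HLS array_reshape variable=buf cyclic factor=2 dim=1 // pack pairs of elements into wider words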
An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming the execution of func_A takes three cycles, func_A can begin processing a new input every three clock cycles rather than waiting for all three functions to complete, resulting in increased throughput. The complete execution to produce an output then requires only five clock cycles, resulting in shorter overall latency.
Vivado HLS implements function data flow pipelining by inserting "channels" between the
functions. These channels are implemented as either ping-pong buffers or FIFOs, depending on
the access patterns of the producer and the consumer of the data.
For scalar, pointer and reference parameters as well as the function return, the channel is
implemented as a FIFO, which uses less hardware resources (no address generation) but
requires that the data is accessed sequentially.
To use function data flow pipelining, put #pragma HLS dataflow where the data flow
optimization is desired. The following code snippet shows an example:
// Sketch: parameter and channel types and the bound N are assumptions,
// and func_A/func_B/func_C are assumed to be declared elsewhere.
void top(int a[N], int b[N], int c[N], int d[N]) {
#pragma HLS dataflow
    int i1[N], i2[N];
    func_A(a, b, i1);
    func_B(c, i1, i2);
    func_C(i2, d);
}
With data flow pipelining, these loops can operate concurrently. An example execution with
data flow pipelining is shown in part (B) of the figure above. Assuming the loop M takes 3
cycles to execute, the code can accept new inputs every three cycles. Similarly, it can produce
an output value every five cycles, using the same hardware resources. Vivado HLS automatically
inserts channels between the loops to ensure data can flow asynchronously from one loop to
the next. As with data flow pipelining, the channels between the loops are implemented either
as multi-buffers or FIFOs.
To use loop data flow pipelining, put #pragma HLS dataflow where the data flow
optimization is desired.
None: no software control interface. The hardware function must self-synchronize entirely based on arguments mapped to AXI streams and cannot have any scalar arguments or arguments that are memory mapped. All AXI stream ports must include TLAST and TKEEP sideband signals.
axis_acc_adapter: the default interface in the SDSoC environment for Vivado Design
Suite HLS hardware functions. The SDSoC environment automatically inserts an instance of
the axis_accelerator_adapter IP to interface a Vivado HLS hardware function. This IP
provides pipelined AXI4-Lite control and data interfaces for software pipelining, and clock
domain crossing circuitry to run hardware functions at higher (or lower) clock rates than
the data motion network to balance computation and communication. The adapter also
provides optional multi-buffering for arguments that map to BRAM and FIFO interfaces, and
automatically maps them into AXI4-Streams (see Hardware Buffer Depth for buffer_depth
pragma). The hardware function interface cannot include any arguments with #pragma
HLS interface s_axilite, but can contain any number of arguments that map onto
a single AXI-MM master interface (with pragma attribute offset=direct) and onto
AXI4-Stream interfaces that include TLAST and TKEEP sideband signals.
The axis_accelerator_adapter IP supports up to eight AXI4-Stream inputs and up to
eight AXI4-Stream outputs each of which can map onto either a BRAM or FIFO interface.
The IP also provides an AXI4-Lite register interface to support scalar arguments, with
eight input registers, eight output registers, and eight input/output registers that can be
used either for an input, output, or inout argument. Scalar arguments can be of type
bool, char, short, int, or float. A function return value is mapped into an output
scalar register. A hardware function that cannot adhere to these constraints must employ
the generic_axi_lite control protocol.
generic_axi_lite: the native Vivado HLS control interface when any of the arguments
are mapped via #pragma HLS interface s_axilite. This interface is suitable for
C-callable HDL IP, described in SDSoC Environment User Guide: Platforms and Libraries
(UG1146), Creating a Library. The hardware control register must reside at offset 0x0 with
the following bit encoding.
// 0x00 : Control signals
//        bit 0 - ap_start
//        bit 1 - ap_done
//        bit 2 - ap_idle
//        bit 3 - ap_ready
//        bit 7 - auto_restart
//        others - reserved
This section describes supported hardware interface types for hardware functions compiled
by the SDSoC system compilers using Vivado HLS. The compilers automatically
determine hardware interface types based on the argument type, #pragma SDS data
copy|zero_copy and #pragma SDS data access_pattern.
IMPORTANT: To avoid interface incompatibilities, you should only incorporate Vivado HLS
interface type directives and pragmas in your source code when sdscc fails to generate a
suitable hardware interface directive, and you should only use the HLS interface types described
in this section.
The sdscc compiler selects a hardware function control protocol based on the program
structure, a hardware function prototype, and the types of its arguments. The remainder of
this section describes the hardware interface types supported by the system compilers, but
it should be emphasized that explicit use of Vivado HLS interface pragmas is discouraged to
avoid inadvertent errors due to conflicts between tools defaults and requirements for the
control protocols.
The following diagram describes supported hardware interface types (white boxes) and their
relation to the supported function control protocols (green). Several mappings involve
constraints (yellow). Unsupported HLS interface directives are in gray.
Figure 68: Hardware Function Control Protocols and Supported Hardware Interfaces
The SDSoC environment supports only one explicit AXI4-Lite interface; you must bundle all ports, including ap_control, into a single AXI4-Lite interface.
The example <sdsoc_install_dir>/samples/hls_if/arraycopy_axilite
demonstrates how to use HLS AXI4-Lite interfaces in the SDSoC environment.
AXI memory-mapped (AXI-MM) master: use the Vivado HLS pragma #pragma HLS INTERFACE m_axi port=arg to pass physical addresses over the AXI4-Lite interface. In this mode, the hardware function acts as its own data mover. When a hardware function maps an argument onto an AXI-MM master, it must also include an output scalar argument or a return value.
The example <sdsoc_install_dir>/samples/hls_if/mmult_hls_aximm
demonstrates how to use HLS AXI-MM interfaces in the SDSoC environment.
IMPORTANT: Data transport using a DMA data mover requires the AXI4-Stream TLAST and TKEEP sideband signals, which must be explicitly coded within the HLS code.
Chapter 7
When you are using the SDSoC IDE, you add these sdscc options by right-clicking on your
project, and selecting C/C++ Build Settings > SDSCC Compiler > Directories (or SDS++ Compiler > Directories for C++ compilation).
To link the library into your application, you use the -L<path> and -l<lib> options.
> sdscc -sds-pf zc702 ${OBJECTS} -L<path to library> -l<library_name> -o myApp.elf
As with the standard GNU linkers, for a library called libMyLib.a, you use -lMyLib.
When you are using the SDSoC IDE, you add these sdscc options by right-clicking on your
project, and selecting C/C++ Build Settings > SDS++ Linker > Libraries.
You can find code examples that employ C-callable libraries in the SDSoC environment
installation under the samples/fir_lib/use and samples/rtl_lib/arraycopy/use
directories.
Chapter 8
Chapter 9
For example, a hardware function mmult_accel typically has the following declaration:
// mmult_accel.h
void _p0_mmult_accel_0(float[], float[], float[]);
For any hardware function, the entry point should be straightforward to determine by
expanding the library in the Project Explorer and inspecting functions within the library.
You can find a complete example in the samples/mmult_static_lib/build directory
in the SDSoC environment install.
To create a shared library in the SDSoC IDE, you select the Shared Library check box when
you create a new SDSoC environment project.
The shared library libmySharedLib.so is created along with the SD card boot image. You
can export a design as a shared library from the command line by compiling the source files containing the hardware functions and the functions calling them with the sdscc/sds++ position independent code flag (-fPIC), and linking using the -shared option.
The SDSoC IDE provides a Matrix Multiplication Shared Library example template when you
select the Shared Library check box shown in the figure above. The connectivity of the hardware
blocks is determined using a source file that includes a process function that defines how the
user calls the library. The SDSoC system compiler then determines the system connectivity
based on this function as usual.
File: mmult_call.c
#include "mmult_accel.h"
void mmult_call (float in_A[A_NROWS*A_NCOLS],
float in_B[A_NCOLS*B_NCOLS],
float out_C[A_NROWS*B_NCOLS])
{
mmult_accel(in_A, in_B, out_C);
}
This example specifies that there is a single call to the function mmult_accel that is selected
for hardware implementation, but you can specify multiple hardware functions for the library.
The hardware function is compiled using sdscc with an additional -fPIC flag to make the
object code position independent.
sdscc -sds-pf zc702 -sds-hw mmult_accel mmult_accel.cpp -sds-end \
    -c -fPIC mmult_accel.cpp -o mmult_accel.o
You must also compile the calling function code with the -fPIC flag.
sdscc -sds-pf zc702 -c -fPIC mmult_call.c -o mmult_call.o
Finally, link all object files and specify the shared library options.
sdscc -sds-pf zc702 -shared mmult_accel.o mmult_call.o -o libmmult_accel.so
This creates a libmmult_accel.so library that can be linked using the standard ARM GNU
toolchain on the command line or in any software development environment.
The above command also creates an sd_card image that contains the boot files needed to
execute the program that links against the library.
You can find a complete example in the samples/mmult_shared_lib/build directory
in the SDSoC environment install.
This creates an executable called mmult.elf that you copy onto your SD card along with the boot files. The POSIX Threads (pthread) library is required for the software runtime code generated by sdscc.
To run the program, copy the sd_card directory created in the SDSoC environment into an SD
card, boot the board and wait for the command prompt. Execute the following commands
on the board:
sh-4.3# export LD_LIBRARY_PATH=/mnt
sh-4.3# /mnt/mmult.elf
Chapter 10
Debugging an Application
The SDSoC environment allows projects to be created and debugged using the SDSoC IDE.
Projects can also be created outside the SDSoC IDE (user-defined makefiles) and debugged
either on the command line or using the SDSoC IDE.
See SDSoC Environment User Guide: Getting Started (UG1028), Tutorial: Debugging Your
System for information on using the interactive debuggers in the SDSoC IDE.
www.xilinx.com
Send Feedback
51
www.xilinx.com
Send Feedback
52
For the best performance improvement, the time required to execute the accelerated function
must be much smaller than the time required to execute the original software function. If
this is not the case, try running the accelerator at a higher frequency by selecting a different
clkid on the sdscc/sds++ command line. If that does not help, determine whether the data
transfer overhead is a significant part of the accelerated function's execution time, and reduce
that overhead. Note that the default clkid corresponds to 100 MHz for all platforms. More
details about the clkid values for a given platform can be obtained by running sdscc
-sds-pf-info <platform_name>.
If the data transfer overhead is large, the following changes might help:
Move more code into the accelerated function so that the computation time increases, and
the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to
transfer only the required data.
Chapter 11
Linux Applications
The SDSoC environment supports Linux applications running on Zynq devices, letting users
compile their programs to run on the hardware under the Linux operating system.
The SDSoC environment links in a library that communicates with the hardware using services
provided by the operating system.
Usage
To compile and link an SDSoC environment program for Linux, the makefile should include
-target-os linux in both CFLAGS and LFLAGS. If the -target-os option
is omitted, the SDSoC environment targets the Linux operating system by default.
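As a minimal makefile sketch (the platform and file names, such as zc702 and main.c, are
illustrative), the option can be folded into both flag variables as follows:
SDSFLAGS = -sds-pf zc702 -target-os linux
CFLAGS = ${SDSFLAGS} -c
LFLAGS = ${SDSFLAGS}
CC = sdscc

main.elf: main.o
	${CC} ${LFLAGS} main.o -o main.elf

main.o: main.c
	${CC} ${CFLAGS} main.c -o main.o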
The SD boot image consists of multiple files in the sd_card directory. BOOT.BIN contains the
first stage boot loader (FSBL), which is invoked directly after powering on the board and in
turn invokes U-Boot. The Linux boot uses a device tree, Linux kernel, and ramdisk image. Finally,
the SD boot image also includes the application ELF and the hardware bitstream used to configure
the programmable logic.
Supported Platforms
Linux mode is supported for all SDSoC platforms.
Limitations
The provided Linux operating system uses a pre-built kernel image (3.19, Xilinx branch
xilinx-v2015.2.01) and a ramdisk containing BusyBox. To configure the Linux image or ramdisk
image for your own platform or requirements, follow the instructions at wiki.xilinx.com to
download and build the Linux kernel. SDSoC Environment User Guide: Platforms and Libraries
(UG1146), Linux Boot Files describes the Linux boot files and summarizes the process for
creating them using PetaLinux.
Usage
To compile and link an SDSoC environment program for standalone mode, the makefile
should include -target-os standalone in both CFLAGS and LFLAGS.
The SD boot image consists of a single file BOOT.BIN in the sd_card directory that contains
the first stage boot loader (FSBL) as well as the user application, which is invoked directly
after powering on the board.
Supported Platforms
Standalone mode is supported for the following platforms:
Limitations
Standalone mode does not support multi-threading, virtual memory, or address protection as
documented in OS and Libraries Document Collection (UG643). Access to the file system is
not through the usual C API, but instead through a special API using libxilffs. The sample
program file_io_manr_sobel_standalone shows an example of its use. This program
can be compared with the Linux version file_io_manr_sobel to see what changes are
necessary for accessing the file system. In general, the procedure to access the file system is to
include a few extra files, use different types (for example, FIL instead of FILE), use a slightly
different API for file system access (for example, f_open instead of fopen), and disable DCache
before doing any file operations.
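The sketch below illustrates the libxilffs (FatFs-style) calls the text refers to; the file
name, drive string, and helper function name are illustrative, and the authoritative usage is
in the file_io_manr_sobel_standalone sample:
#include "ff.h"          /* libxilffs (FatFs) types and functions: FATFS, FIL, f_open, f_read */
#include "xil_cache.h"   /* Xil_DCacheDisable() */

static FATFS fatfs;      /* file system object for the SD card volume */

int read_input_file(unsigned char *buf, unsigned int len)
{
    FIL fil;             /* FIL replaces FILE */
    UINT bytes_read = 0;

    Xil_DCacheDisable();                                 /* disable DCache before file operations */
    if (f_mount(&fatfs, "0:/", 0) != FR_OK)              /* mount the SD card volume */
        return -1;
    if (f_open(&fil, "0:/input.bin", FA_READ) != FR_OK)  /* f_open replaces fopen */
        return -1;
    f_read(&fil, buf, len, &bytes_read);                 /* f_read replaces fread */
    f_close(&fil);
    return (int)bytes_read;
}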
IMPORTANT: On the ZedBoard, establishing a serial connection to the board takes a couple of
seconds. If your program runs for a shorter time than that, you will never see its output. When
the ZedBoard is power cycled, the serial connection goes down and it is not possible to see the
output in the subsequent run either. The ZC702 and ZC706 boards keep the serial connection alive
across power cycles and do not suffer from this limitation.
Usage
In order to compile and link an SDSoC environment program for FreeRTOS, the makefile
should include the -target-os freertos option in all compiler and linker calls in the
makefile. This is typically specified in an SDSoC environment variable, which in turn is included
in a compiler toolchain variable, as shown below:
SDSFLAGS = -sds-pf zc702 -target-os freertos \
-sds-hw mmult_accel mmult_accel.cpp -sds-end \
-poll-mode 1
CPP = sds++ ${SDSFLAGS}
CC = sdscc ${SDSFLAGS}
:
all: ${EXECUTABLE}
${EXECUTABLE}: ${OBJECTS}
${CPP} ${LFLAGS} ${OBJECTS} -o $@
%.o: %.cpp
${CPP} ${CFLAGS} $< -o $@
:
When the SDSoC environment links the application ELF file, it builds a standalone (bare-metal)
library for you, provides a predefined linker script, and uses a pre-configured FreeRTOS kernel
(headers and a pre-built library), including their paths when it calls the ARM GNU
toolchain (you do not need to specify the paths in your makefile):
<path_to_install>/SDSoC/2015.2/arm-xilinx-eabi/include/freertos
<path_to_install>/SDSoC/2015.2/arm-xilinx-eabi/lib/freertos
The SD boot image consists of a single file BOOT.BIN in the sd_card directory that contains
the first stage boot loader (FSBL) as well as the user application, which is invoked directly
after powering on the board.
SDSoC environment GUI flows for working with FreeRTOS applications are the same as those
for standalone (bare-metal) applications, except the target OS is specified as FreeRTOS. The
user application code needs to include the following:
Task functions and task creation calls using the xTaskCreate() API function
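As a minimal sketch of such code (the task name, stack size, and priority are illustrative
choices, not values required by the SDSoC environment):
#include "FreeRTOS.h"
#include "task.h"

/* Task that repeatedly performs the application work, including any hardware function calls */
static void vAccelTask(void *pvParameters)
{
    for (;;) {
        /* ... call hardware functions here ... */
        vTaskDelay(pdMS_TO_TICKS(10));   /* yield to other tasks */
    }
}

int main(void)
{
    /* create the task, then hand control to the FreeRTOS scheduler */
    xTaskCreate(vAccelTask, "accel", configMINIMAL_STACK_SIZE + 1024,
                NULL, tskIDLE_PRIORITY + 1, NULL);
    vTaskStartScheduler();   /* only returns if the scheduler could not start */
    for (;;) ;
}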
Simple SDSoC environment applications based on the Zynq-7000 AP SoC series demo
included in the FreeRTOS v8.2.1 software distribution are available in the SDSoC GUI application
wizard and in the SDSoC environment installation:
<path_to_install>/SDSoC/2015.2/samples/mmult_datasize_freertos
<path_to_install>/SDSoC/2015.2/samples/mmult_optimized_sds_freertos
User or sample applications that normally target the standalone BSP can be built using the
-target-os freertos option to compile and link; in that case the FreeRTOS linker script and the
predefined callback functions found in the pre-built FreeRTOS library are used. Applications
built this way do not explicitly call FreeRTOS API functions and run as standalone applications.
While it is possible to begin FreeRTOS application development in this way, Xilinx recommends
that FreeRTOS API functions and callbacks be incorporated as early as possible.
Supported Platforms
FreeRTOS mode is supported for two Zynq-7000 AP SoC platforms:
ZC702
ZC706
The SDSoC environment uses a pre-configured FreeRTOS v8.2.1 library that has been pre-built
for the user, and a dynamically built (at application link time) standalone library. Characteristics
of the FreeRTOS library include:
Uses the standard FreeRTOS v8.2.1 distribution for platform independent code; platform
dependent code uses the default FreeRTOSConfig.h file included as part of FreeRTOS
v8.2.1 (see the FreeRTOS reference http://www.freertos.org/a00110.html, with downloads
available at http://sourceforge.net/projects/freertos/files/FreeRTOS )
Uses heap_3.c for its memory allocation implementation (see the FreeRTOS reference
http://www.freertos.org/a00111.html )
Uses sources from the following directories of the FreeRTOS v8.2.1 distribution:
Demo/CORTEX_A9_Zynq_ZC702/RTOSDemo/src
Source
Source/include
Source/portable/GCC/ARM_CA9
Source/portable/MemMang
Provides the following callback (hook) functions in the pre-built library:
vApplicationMallocFailedHook
vApplicationStackOverflowHook
vApplicationIdleHook
vAssertCalled
vApplicationTickHook
vInitialiseTimerForRunTimeStats
Chapter 12
Since there is no need for synchronization of the input and output with the video hardware, the
software loop in process_frames() is straightforward, creating a hardware function pipeline
when manr and sobel_filter are selected for hardware implementation.
for (int loop_cnt = 0; loop_cnt < frames; loop_cnt++) {
    // set up manr_in_current and manr_in_prev frames
    manr(nr_strength, manr_in_current, manr_in_prev, yc_out_tmp);
    sobel_filter(yc_out_tmp, out_frames[frame]);
}
The input and output video files are in YUV422 format. The platform directory contains sources
for converting these files to/from the frame arrays used in the accelerator code. The makefile
in the top level directory compiles the application sources along with the platform sources
to generate the application binary.
This wrapper function becomes the top-level hardware function that can be invoked from
application code.
Matrix Multiplication
Matrix multiplication is a common compute-intensive operation in many application domains.
The SDSoC IDE provides template examples for all base platforms, and the code for these
templates provides instructive use of the SDSoC environment system optimizations for memory
allocation and memory access described in Improving System Performance, and of Vivado HLS
optimizations such as function inlining, loop unrolling and pipelining, and array partitioning,
described in Hardware Function Guidelines for Software Programmers.
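A condensed sketch of the style of code found in these templates follows; the matrix size N,
the partition factor, and the function name are illustrative and do not reproduce the exact
template source:
#define N 32

void mmult_accel(float in_A[N*N], float in_B[N*N], float out_C[N*N])
{
    float A_tmp[N][N], B_tmp[N][N];
/* partition the local buffers so the unrolled inner loop can read many elements per cycle */
#pragma HLS array_partition variable=A_tmp block factor=16 dim=2
#pragma HLS array_partition variable=B_tmp block factor=16 dim=1

    /* copy the inputs into local, partitioned buffers */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
#pragma HLS pipeline
            A_tmp[i][j] = in_A[i * N + j];
            B_tmp[i][j] = in_B[i * N + j];
        }

    /* pipelining the j loop causes the inner k loop to be fully unrolled */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
#pragma HLS pipeline
            float result = 0.0f;
            for (int k = 0; k < N; k++)
                result += A_tmp[i][k] * B_tmp[k][j];
            out_C[i * N + j] = result;
        }
}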
Chapter 13
The copy form implies that data is explicitly copied between processor memory and the
hardware function; a suitable data mover, as described in Improving System Performance,
performs the data transfer. The zero_copy form means that the hardware function
accesses the data directly from shared memory. In the latter case, the hardware function must
access the array through an AXI4 bus interface.
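A minimal sketch of the two pragma forms (the function, array, and parameter names are
illustrative):
#pragma SDS data copy(A[0:size])       /* A is copied to/from the accelerator by a data mover */
#pragma SDS data zero_copy(B[0:size])  /* B is accessed in place in shared memory over an AXI4 bus */
void my_accel(int A[1024], int B[1024], int size);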
Multiple arrays can be specified in the same pragma, separated by a comma(,). For example:
copy(ArrayName1[offset1:length1], ArrayName2[offset2:length2])
ArrayName must be one of the formal parameters of the function definition, that is, not
from the prototype (where parameter names are optional) but from the function definition.
offset is the number of elements from the first element in the corresponding dimension.
It must be a compile-time constant. This is currently ignored.
length is the number of elements transferred for that dimension. It can be an arbitrary
expression as long as the expression can be resolved at runtime inside the function. For
example:
#pragma SDS data copy(InData[0:num_rows+3*num_coeffs_active + L*(P+1)])
#pragma SDS data copy(OutData[0:2*(L-M-R+2)+4*num_coeffs_active*(1+num_rows)])
void evw_accelerator (uint8_t M,
uint8_t R,
uint8_t P,
uint16_t L,
uint8_t num_coeffs_active,
uint8_t num_rows,
uint32_t InData[InDataLength],
uint32_t OutData[OutDataLength]);
This pragma specifies the number of elements to be transferred for an array argument to a
hardware function, and applies to all calls to the function. As shown in the example above,
length need not be a constant; it can be a C arithmetic expression involving other scalar
parameters to the same function.
If this pragma is not specified for an array argument, the SDSoC environment first checks the
argument type. If the argument type has a compile-time array size, the compiler uses that as the
data transfer size. Otherwise, the SDSoC environment analyzes the calling code to determine
the transfer size based on the memory allocation APIs for the array (for example, malloc or
sds_alloc). If the analysis fails or there is inconsistency between callers about the transfer
size, the compiler generates an error message so that the user can modify the source code.
Memory Attributes
For an operating system like Linux that supports virtual memory, user-space allocated memory
is paged, which can affect system performance. To improve system performance, the pragmas
in this section can be used to declare arguments that have been allocated in physically
contiguous memory, or to tell the compiler that it need not enforce cache coherency.
For example, marking a large video buffer as non-cacheable can be appropriate when:
Cache flushing/invalidating for a large chunk of video data can significantly decrease
system performance.
Software code does not read or write the video data, so cache coherency between the
processor and accelerator is not required.
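As an illustrative sketch only (the function and array names are invented here, and the
attribute keywords should be checked against the pragma reference for your release), such a
frame buffer might be declared physically contiguous and non-cacheable with the mem_attribute
pragma:
#pragma SDS data mem_attribute(frame_buf:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
void video_accel(unsigned char frame_buf[1920*1080], int height, int width);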
Multiple arrays can be specified in one pragma, separated by a comma (,). For example:
#pragma SDS data data_mover(ArrayName:DataMover, ArrayName:DataMover)
This pragma specifies the data mover HW IP type used to transfer an array argument. Typically,
the compiler chooses the type of data mover automatically by analyzing the code. This pragma
can be used to override the compiler inference rules.
There are some additional requirements for using AXIDMA_SIMPLE and AXIDMA_2D. The first
requirement is that the corresponding array must be allocated using sds_alloc().
For AXIDMA_2D, the SDS data dim pragma must be present to specify the size of each
dimension of the 2D array. The SDS data copy pragma is also needed to specify a rectangular
sub-region of the 2D array to be transferred. The array's second-dimension size, sub-region
offset, and column size must all result in addresses aligned to 64-bit boundaries (number of
bytes divisible by 8).
In the example shown below, NUMCOLS, row_offset, col_offset and cols must be
multiples of 8 (each char bitwidth is 8) for AXIDMA_2D to work properly.
#pragma SDS data data_mover(y_lap_in:AXIDMA_SIMPLE, y_lap_out:AXIDMA_2D)
#pragma SDS data dim(y_lap_out[NUMROWS][NUMCOLS])
#pragma SDS data copy(y_lap_out[row_offset:rows][col_offset:cols])
void laplacian_filter(unsigned char y_lap_in[NUMROWS*NUMCOLS],
unsigned char y_lap_out[NUMROWS*NUMCOLS],
int rows, int cols, int row_offset, int col_offset);
port must be ACP, AFI, or MIG. The Zynq-7000 All Programmable SoC provides a cache
coherent interface between programmable logic and external memory (S_AXI_ACP), and
high-performance ports (S_AXI_HP) for non-cache-coherent access (AFI). If no sys_port
pragma is specified for an array argument, the interface to external memory is determined
automatically by the SDSoC system compilers, based on array memory attributes (cacheable
or non-cacheable), array size, the data mover used, and so on. This pragma overrides the
SDSoC compiler's choice of memory port. MIG is valid only for the zc706_mem platform.
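A minimal sketch of the pragma form (the function and array names are illustrative):
#pragma SDS data sys_port(in_buf:AFI)   /* route in_buf through a non-cache-coherent S_AXI_HP (AFI) port */
#pragma SDS data sys_port(out_buf:ACP)  /* route out_buf through the cache-coherent S_AXI_ACP port */
void copy_accel(int in_buf[1024], int out_buf[1024]);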
IMPORTANT: The hardware interpretation of this pragma might be revised in a future release.
Multiple arrays can be specified in one pragma, separated by a comma(,). For example:
#pragma SDS buffer_depth(ArrayName:BufferDepth, ArrayName:BufferDepth)
This pragma applies only to arrays that map to BRAM or FIFO interfaces, and specifies the
number of hardware buffers to allocate for the array argument, for example, to support
pipelining. For a hardware buffer the following must hold:
The async pragma is specified immediately preceding a call to a hardware function, directing
the compiler to return control to the CPU immediately after setting up the hardware function
and its data transfers.
The wait pragma must be inserted at an appropriate point in the program to direct the CPU to
wait until the associated async function (and data transfers) have completed.
The ID must be a compile-time constant unsigned integer, and represents a unique
identifier for the hardware function. That is, using a different ID for the same hardware
function results in a different hardware instance for the function. Consequently, these
pragmas can be used to force the creation of multiple hardware instances.
In the presence of an async pragma, the SDSoC system compiler does not generate
an sds_wait() in the stub function for the associated call. The program must contain
the matching sds_wait(ID) or #pragma SDS wait(ID) at an appropriate point to
synchronize the controlling thread running on the CPU with the hardware function thread.
An advantage of using the #pragma SDS wait(ID) over the sds_wait(ID) function
call is that the source code can then be compiled by compilers other than sdscc (that do
not interpret either async or wait pragmas).
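For example, a sketch that overlaps one accelerator call with CPU work (mmult_accel and
do_cpu_work are illustrative names):
#pragma SDS async(1)
mmult_accel(A, B, C);     /* returns as soon as the hardware function and its transfers are set up */
do_cpu_work();            /* runs on the CPU in parallel with the accelerator */
#pragma SDS wait(1)       /* block here until call 1 and its data transfers complete */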
Partition Specification
The SDSoC system compilers sdscc/sds++ can automatically generate multiple bitstreams for
a single application, which are loaded dynamically at run time. Each bitstream has a corresponding
partition identifier. A platform might not support bitstream reloading, for example, due to
platform peripherals that cannot be shut down and then brought back up after reloading.
Chapter 14
Chapter 15
Name
sdscc - SDSoC C compiler
sds++ - SDSoC C++ compiler
Command Synopsis
sdscc | sds++ [hardware_function_options] [system_options]
[performance_estimation_options] [options_passed_through_to_cross_compiler]
[-sds-pf platform_name] [-sds-pf-info platform_name] [-sds-pf-list] [-target-os os_name]
[-verbose] [ --help] [-version] [files]
System Options
[[-apm] [-dmclkid clock_id_number] [-mno-bitstream] [-mno-boot-files]
[-rebuild-hardware] [-poll-mode <0|1>] [-instrument-stub]]
The sdscc/sds++ compilers compile and link C/C++ source files into an application-specific
hardware/software system on chip implemented on a Zynq-7000 All Programmable SoC.
The command usage and options are identical for sdscc and sds++.
Options not recognized by sdscc are passed to the ARM cross-compiler. Compiler options
within an -sds-hw ... -sds-end clause are ignored when compiling (-c) a file foo.c that
is not the file containing the specified hardware function.
When linking the application ELF, sdscc creates and implements the hardware system, and
generates an SD card image containing the ELF and boot files required to initialize the hardware
system, configure the programmable logic and run the target operating system.
When building a system containing no functions marked for hardware implementation, sdscc
uses pre-built hardware when available for the target platform. To force bitstream generation,
use the -rebuild-hardware option.
Report files are found in the folder _sds/reports.
General Options
The following command line options are applicable to any sdscc invocation or display
information for the user.
-sds-pf platform_name
Specify the target platform that defines the base system hardware and software, including the
operating system and boot files. The platform_name can be the name of a platform in the
SDSoC environment installation, or a file path to a folder containing platform files, with
the last component of the path matching the platform name. Use this option when
compiling accelerator source files and when linking the ELF file. Use the -sds-pf-list option
to list available platforms and their features.
-sds-pf-info platform_name
Display general information about a platform and exit (no other options are specified). Use the
-sds-pf-list option to list available platforms.
-sds-pf-list
Display a list of available platforms and exit (no other options are specified); for example,
sdscc -sds-pf-list.
-target-os os_name
The -target-os option specifies the target operating system. The selected OS determines
the compiler toolchain used, and the include file and library paths added by sdscc. os_name
can be one of the following:
linux : for the Linux OS. This is the default if the command line contains no -target-os
option.
standalone : for bare-metal (standalone) applications.
freertos : for FreeRTOS applications.
-verbose
Print verbose output to STDOUT.
-version
Print the sdscc version information to STDOUT.
--help
Print command line help information. Note that two consecutive hyphen or dash characters are used.
-files file_list
Specify a comma-separated list (without white space) of one or more files required to compile
the current top-level function into hardware using Vivado HLS. If any of these files contain
source code that is not used by HLS but is required to produce the application executable,
they must be compiled separately to create object files (.o), and linked with other object files
during the link phase.
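As a representative command line only (the file names are illustrative), a file needed by Vivado
HLS to compile the hardware function could be passed as follows:
sdscc -sds-pf zc702 -sds-hw mmult_accel mmult_accel.cpp -files mmult_utils.cpp -sds-end \
-c mmult_accel.cpp -o mmult_accel.o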
-hls-tcl hls_tcl_directives_file
When using the Vivado HLS tool to synthesize the hardware accelerator, source the specified
Tcl file containing HLS directives. During HLS synthesis, sdscc creates a run.tcl file to
drive the Vivado HLS tool; this Tcl file contains the following commands:
# synthesis directives
create_clock -period <clock_period>
config_rtl -reset_level low
source <sdsoc_generated_tcl_directives_file>
# end synthesis directives
If the -hls-tcl option is used, the user-defined Tcl file is sourced instead of the Tcl file
generated by the SDSoC environment. Ensure that the specified Tcl file contains commands
that result in a functionally correct directives file. The clock period is platform-specific and
reset levels are required to be active-Low. In this case, the generated run.tcl has the following form:
# synthesis directives
create_clock -period <clock_period>
config_rtl -reset_level low
# user-defined synthesis directives
source <user_hls_tcl_directives_file>
# end user-defined synthesis directives
# end synthesis directives
-clkid <n>
Set the accelerator clock ID to <n>, where <n> has one of the values listed in the table
below. (You can use the command sdscc -sds-pf-info platform_name to display
information about a platform.) If the -clkid option is not specified, the default value for the
platform is used. Use the command sdscc -sds-pf-list to list available platforms and
settings.
Platform        Value of <n>
zc702           0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
zc702_hdmi      1 = 142 MHz, 2 = 100 MHz, 3 = 166 MHz
zc706           0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
                0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
zybo            0 = 25 MHz, 1 = 100 MHz, 2 = 125 MHz, 3 = 50 MHz
Compiler Macros
Predefined macros allow you to guard code with #ifdef and #ifndef preprocessor
statements; the macro names begin and end with two underscore characters (__). The
__SDSCC__ macro is defined whenever sdscc or sds++ is used to compile source files; it
can be used to guard code depending on whether it is compiled by sdscc/sds++ or another
compiler, for example GCC.
When sdscc or sds++ compiles source files targeted for hardware acceleration using Vivado
HLS, the __SDSVHLS__ macro is defined; it can be used to guard code depending on whether
high-level synthesis is being run or not.
The code fragment below illustrates the use of the __SDSCC__ macro to use the sds_alloc()
and sds_free() functions when compiling source code with sdscc/sds++, and malloc()
and free() when using other compilers.
#ifdef __SDSCC__
#include <stdlib.h>
#include "sds_lib.h"
#define malloc(x) (sds_alloc(x))
#define free(x) (sds_free(x))
#endif
In the example below, the __SDSVHLS__ macro is used to guard code in a function definition
that differs depending on whether it is used by Vivado HLS to generate hardware or used in a
software implementation.
#ifdef __SDSVHLS__
void mmult(ap_axiu<32,1,1,1> A[A_NROWS*A_NCOLS],
ap_axiu<32,1,1,1> B[A_NCOLS*B_NCOLS],
ap_axiu<32,1,1,1> C[A_NROWS*B_NCOLS])
#else
void mmult(float A[A_NROWS*A_NCOLS],
float B[A_NCOLS*B_NCOLS],
float C[A_NROWS*B_NCOLS])
#endif
System Options
-apm
Insert an AXI Performance Monitor (APM) IP block to monitor all generated hardware/software
interfaces. Within the SDSoC IDE, in the Debug Perspective, you can activate the APM prior
to running your application by clicking the Start button within the Performance Counters
View. For more information on the SDSoC IDE, see the SDSoC Environment User Guide:
Getting Started (UG1028).
-dmclkid <n>
Set the data motion network clock ID to <n>, where <n> has one of the values listed in the
table below. (You can use the command sdscc -sds-pf-info platform_name to display
information about the platform.) If the -dmclkid option is not specified, the default value
for the platform is used. Use the command sdscc -sds-pf-list to list available platforms
and settings.
Platform             Value of <n>
zc702 platform       0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
zc702_hdmi platform  1 = 142 MHz, 2 = 100 MHz, 3 = 166 MHz
zc706 platform       0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
                     0 = 166 MHz, 1 = 142 MHz, 2 = 100 MHz, 3 = 200 MHz
zybo platform        0 = 25 MHz, 1 = 100 MHz, 2 = 125 MHz, 3 = 50 MHz
-mno-bitstream
Do not generate the bitstream for the design used to configure the programmable logic (PL).
Normally a bitstream is generated by running the Vivado implementation feature, which can
be time-consuming with runtimes ranging from minutes to hours depending on the size and
complexity of the design. This option can be used to disable this step when iterating over
flows that do not impact the hardware generation. The application ELF is compiled before
bitstream generation.
-mno-boot-files
Do not generate the SD card image in the folder sd_card. This folder includes your application
ELF and files required to boot the device and bring up the specified OS. This option disables
the creation of the sd_card folder in case you would like to preserve an earlier version of
this folder.
-rebuild-hardware
When building a software-only design with no functions mapped to hardware, sdscc uses a
pre-built bitstream if one is available within the platform. Use this option to force a full
system build.
-poll-mode <0|1>
The -poll-mode <0|1> option enables DMA polling mode when set to 1 or interrupt
mode when set to 0 (the default). For example, to specify DMA polling mode, add
-poll-mode 1 to the sdscc command line.
-instrument-stub
The instrument-stub option instruments the generated hardware function stubs with
calls to the counter function sds_clock_counter(). When a hardware function stub is
instrumented, the time required to call send and receive functions, as well as the time spent for
waits, is displayed for each call to the function.
When linking application ELF files for non-Linux targets, for example standalone or FreeRTOS,
default linker scripts found in the folder <install_path>/platforms/<platform_name>
are used. If a user-defined linker script is required, it can be added using the -Wl,-T
-Wl,<path_to_linker_script> linker option.
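For example (the object list and linker script path are illustrative):
sds++ -sds-pf zc702 -target-os standalone ${OBJECTS} -Wl,-T -Wl,./src/lscript.ld -o app.elf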
When sdscc/sds++ creates a bitstream .bin file in the sd_card folder, this file can be used to
configure the PL after booting Linux and before running the application ELF. The embedded
Linux command used is cat bin_file > /dev/xdevcfg.
Appendix A
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual
property at all stages of the design cycle. Topics include design assistance, advisories, and
troubleshooting tips.
References
These documents provide supplemental material useful with this guide:
1. SDSoC Environment User Guide: Getting Started (UG1028), also available in the docs folder of the SDSoC environment.
2. SDSoC Environment User Guide (UG1027), also available in the docs folder of the SDSoC environment.
3. SDSoC Environment User Guide: Platforms and Libraries (UG1146), also available in the docs folder of the SDSoC environment.
4. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC User Guide (UG850)
5. Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118)