Introduction To High-Level Synthesis With Vivado HLS
> Summary
Need for High-Level Synthesis
The same hardware is used for each iteration of the loop:
• Small area
• Long latency
• Low throughput
Different hardware is used for each iteration of the loop:
• Higher area
• Short latency
• Better throughput
Different iterations are executed concurrently:
• Higher area
• Short latency
• Best throughput
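The latency side of these three trade-offs can be made concrete with a simple cycle-count model. The sketch below is an illustration only, not a tool report: the function names, the `factor` and `ii` parameters, and the assumptions that every iteration takes a fixed number of cycles and that iterations are independent are all hypothetical.

```c
/* Hypothetical cycle-count model, not produced by any tool.
 * trips:  number of loop iterations
 * depth:  cycles needed by one iteration
 * factor: copies of the loop body built in hardware (must be >= 1)
 * ii:     cycles between the starts of successive iterations
 */

/* Same hardware for every iteration: iterations run back to back. */
unsigned rolled_latency(unsigned trips, unsigned depth) {
    return trips * depth;
}

/* Different hardware per iteration: `factor` iterations run side by side,
 * assuming they do not depend on each other. */
unsigned unrolled_latency(unsigned trips, unsigned depth, unsigned factor) {
    unsigned passes = (trips + factor - 1) / factor;  /* ceiling division */
    return passes * depth;
}

/* Iterations overlap: a new one starts every `ii` cycles. */
unsigned pipelined_latency(unsigned trips, unsigned depth, unsigned ii) {
    return trips == 0 ? 0 : depth + (trips - 1) * ii;
}
```

With 4 iterations of 3 cycles each, this model gives 12 cycles for the rolled loop, 3 cycles when all four copies run in parallel, and 6 cycles when a new iteration starts every cycle.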
      acc=0;
      loop: for (i=3;i>=0;i--) {          /* For-Loop Start */
        if (i==0) {
          acc+=x*c[0];
          shift_reg[0]=x;
        } else {
          shift_reg[i]=shift_reg[i-1];
          acc+=shift_reg[i]*c[i];
        }
      }                                   /* For-Loop End */
      *y=acc;
    }                                     /* Function End */
From any C code example, the loops in the C code are correlated to states of behavior, and this behavior is extracted into a hardware state machine.
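For reference, the FIR fragment can be completed into a function that compiles and runs as ordinary C. This is a sketch under assumptions: the signature `fir(int *y, const int c[N], int x)`, the `int` data type, the `N` macro and the `static` shift register are not given in the source, which elides the function header.

```c
#define N 4  /* number of taps; matches the loop bounds i = 3..0 */

/* Hypothetical signature: the source elides the function header. */
void fir(int *y, const int c[N], int x) {
    static int shift_reg[N];  /* previous input samples, zero-initialized */
    int acc = 0;
    int i;

    /* The label gives the loop a name that directives can refer to. */
    loop: for (i = N - 1; i >= 0; i--) {
        if (i == 0) {
            acc += x * c[0];        /* newest sample times first coefficient */
            shift_reg[0] = x;       /* store the newest sample */
        } else {
            shift_reg[i] = shift_reg[i - 1];  /* shift older samples along */
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients one per call, which is a quick sanity check that the shift and accumulate steps are ordered correctly.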
From any C code example, operations are extracted and the control is known, so a unified control dataflow behavior is created.
[Figure: operations extracted from the FIR code (>=, ==, -, +, *, WRy) combined with the control states]
Scheduling and Binding
[Figure: scheduling and binding, with user directives as input and RTL (Verilog, VHDL, SystemC) as output]
> The operations in the control flow graph are mapped into clock cycles
    void foo (
      …
      t1 = a * b;
      t2 = c + t1;
      t3 = d * t2;
      out = t3 - e;
    }
Schedule 1: * + * -
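The `foo` fragment can be written out as a complete function to make the data dependencies explicit. The argument list, the `int` type and the pointer output are assumptions, since the source elides the function header. Each operation consumes the result of the previous one, so the four operators form a chain, and the scheduler decides how many links of that chain fit into one clock cycle.

```c
/* Hypothetical completion of foo: argument list and types are assumptions. */
void foo(int a, int b, int c, int d, int e, int *out) {
    int t1 = a * b;   /* multiply */
    int t2 = c + t1;  /* add, needs t1 */
    int t3 = d * t2;  /* multiply, needs t2 */
    *out = t3 - e;    /* subtract, needs t3 */
}
```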
> Binding is where operations are mapped to cores from the hardware
library
Operators map to cores
> Binding Decision: to share
Given this schedule, with both multiplications in the same cycle: * + * -
‒ Binding must use 2 multipliers, since both are in the same cycle
‒ It can decide to use an adder and subtractor or share one addsub
Given a schedule in which each multiplication is in a different cycle:
‒ Binding may decide to share the multipliers (each is used in a different cycle)
‒ Or it may decide the cost of sharing (muxing) would impact timing, and not share them
‒ It may make this same decision in the first example above too
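The rule behind "must use 2 multipliers" is that operations scheduled in the same cycle cannot share a core. A minimal sketch of that lower bound follows. The `sched_op` structure, the `min_units` helper and the cycle numbers used with it are hypothetical; the cycle assignments of the original schedule figures are not recoverable from the text.

```c
/* Operator kinds that appear in foo. */
enum op_kind { OP_MUL, OP_ADD, OP_SUB };

/* One scheduled operation: what it is and which clock cycle it runs in. */
struct sched_op {
    enum op_kind kind;
    int cycle;
};

/* Lower bound on the number of cores of one kind: the largest number of
 * operations of that kind that share a single cycle. Binding may still
 * choose to use more cores than this, for example to avoid muxing. */
int min_units(const struct sched_op *ops, int n_ops,
              enum op_kind kind, int n_cycles) {
    int best = 0;
    for (int cyc = 0; cyc < n_cycles; cyc++) {
        int count = 0;
        for (int k = 0; k < n_ops; k++) {
            if (ops[k].kind == kind && ops[k].cycle == cyc) {
                count++;
            }
        }
        if (count > best) {
            best = count;
        }
    }
    return best;
}
```

With both multiplications placed in cycle 0 the helper reports 2 multipliers; with one in cycle 0 and one in cycle 1 it reports 1, which is the case where sharing becomes an option rather than a requirement.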
> Productivity
Verification
‒ Functional
‒ Architectural
Abstraction
‒ Datatypes
‒ Interface
‒ Classes
Automation
Video Design Example
Input                 C Simulation Time   RTL Simulation Time   Improvement
10 frames, 1280x720   10s                 ~2 days (ModelSim)    ~12000x
[Figure: design flows compared: RTL (Spec) to RTL (Sim), and C (Spec/Sim) to RTL (Sim)]
> Portability
Processors and FPGAs
Technology migration
Cost reduction
Power reduction
> Permutability
Architecture Exploration
‒ Timing
Parallelization
Pipelining
‒ Resources
Sharing
Better QoR
Loops: Functions typically contain loops. How these are handled can have a major impact on area and performance
Arrays: Arrays are used often in C code. They can influence the device IO and become performance bottlenecks
Operators: Operators in the C code may require sharing to control area or specific hardware implementations to meet performance
The C types define the size of the hardware used: handled automatically
Loops can be unrolled if their indices are statically determinable at elaboration time
‒ Not when the number of iterations is variable
Unrolled loops result in more elements to schedule but greater operator mobility
‒ Let's look at an example …
‒ If data dependencies allow
‒ If operator timing allows
Design finished faster but uses more operators
‒ 2 multipliers & 2 adders
[Figure: schedule of the unrolled loop: RDx, four multiplies, three additions, WRy]
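Written out by hand, full unrolling of the four-iteration FIR loop looks roughly like the sketch below. This is an illustration of the transformation, not tool output; in the tool it is normally requested through a directive instead of a source rewrite. The function name `fir_unrolled`, its signature and the `int` type are assumptions.

```c
#define N 4  /* number of taps */

/* Hypothetical hand-unrolled version of the FIR loop (i = 3, 2, 1, 0). */
void fir_unrolled(int *y, const int c[N], int x) {
    static int shift_reg[N];  /* previous input samples, zero-initialized */

    /* Shifts in the same order as the loop, so no value is overwritten
     * before it has been copied. */
    shift_reg[3] = shift_reg[2];
    shift_reg[2] = shift_reg[1];
    shift_reg[1] = shift_reg[0];
    shift_reg[0] = x;

    /* The loop counter and the i==0 test are gone. The four products do not
     * depend on each other, so they can be scheduled in parallel when the
     * number of available reads and the operator timing allow it. */
    *y = (shift_reg[3] * c[3] + shift_reg[2] * c[2])
       + (shift_reg[1] * c[1] + x * c[0]);
}
```

The result is the same as the rolled loop for the same input sequence; only the structure exposed to the scheduler changes.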
> Schedule Summary
All the logic associated with the loop counters and index checking is now gone
Two multiplications can occur at the same time
‒ All 4 could, but it's limited by the number of input reads (2) on coefficient port C
Why 2 reads on port C?
‒ The default behavior for arrays now limits the schedule…
    void fir (
      …
      acc=0;
      loop: for (i=3;i>=0;i--) {
        if (i==0) {
          acc+=x*c[0];
          shift_reg[0]=x;
        } else {
          shift_reg[i]=shift_reg[i-1];
          acc+=shift_reg[i]*c[i];
        }
      }
      *y=acc;
    }
> The array can be targeted to any memory resource in the library
The ports (Address, CE active high, etc.) and sequential operation (clocks from address to
data out) are defined by the library model
All RAMs are listed in the Vivado HLS Library Guide
> Arrays can be merged with other arrays and reconfigured
To implement them in the same memory or one of different widths & sizes
> Arrays can be partitioned into individual elements
Implemented as smaller RAMs or registers
© Copyright 2018 Xilinx
Top-Level IO Ports
[Figure: array implemented as a RAM with ports CE0, WE0, CE1, WE1]
> Default RAM resource
Dual-port RAM if performance can be improved, otherwise single-port RAM
> With the C port partitioned into (4) separate ports
All reads and mults can occur in one cycle
If the timing allows
‒ The additions can also occur in the same cycle
‒ The write can be performed in the same cycle
‒ Optionally the port reads and writes could be registered
[Figure: schedule with four RDc reads, RDx, four multiplies, three additions, WRy]
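In source form, partitioning the coefficient array and unrolling the loop are usually requested with pragmas placed inside the function. The sketch below assumes the Vivado HLS pragma spellings `ARRAY_PARTITION` and `UNROLL` with the options shown; check the tool documentation for the exact syntax supported by a given release. The function name `fir_partitioned`, its signature and the `int` type are hypothetical. An ordinary C compiler ignores pragmas it does not recognize (possibly with a warning), so the same file still runs in C simulation.

```c
#define N 4  /* number of taps */

/* Hypothetical FIR with directives asking for one port per coefficient. */
void fir_partitioned(int *y, const int c[N], int x) {
/* Assumed directive: split c into N individual elements. */
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
    static int shift_reg[N];
/* Assumed directive: implement shift_reg as N separate registers. */
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
    int acc = 0;
    int i;

    loop: for (i = N - 1; i >= 0; i--) {
/* Assumed directive: build separate hardware for every iteration. */
#pragma HLS UNROLL
        if (i == 0) {
            acc += x * c[0];
            shift_reg[0] = x;
        } else {
            shift_reg[i] = shift_reg[i - 1];
            acc += shift_reg[i] * c[i];
        }
    }
    *y = acc;
}
```

The directives change only how the hardware is built. The C behavior, and therefore the test bench results, stay the same.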
Operators
> C validation
– A HUGE reason users want to use HLS
• Fast, free verification
– Validate the algorithm is correct before synthesis
• Follow the test bench tips given over
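#include <stdio.h>   /* printf */
#include <stdlib.h>  /* system */

int main () {
  int ret=0;
  …
  /* Compare the results file against the golden reference */
  ret = system("diff --brief -w output.dat output.golden.dat");
  if (ret != 0) {
    printf("Test failed !!!\n");
    ret=1;
  } else {
    printf("Test passed !\n");
  }
  …
  return ret;  /* non-zero reports failure */
}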
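Where calling an external `diff` is inconvenient, the same self-checking idea can be sketched with golden values held in memory. Everything named here is hypothetical: `dut` stands in for the function being synthesized, and `run_checks`, the input vectors and the golden values are made up for illustration. The property carried over from the example above is that the check returns 0 on success and non-zero on failure.

```c
#include <stdio.h>

/* Hypothetical design under test; stands in for the synthesizable function. */
int dut(int a, int b) {
    return a * b + 1;
}

/* Runs every test vector through dut and compares with the golden value.
 * Returns 0 if all match, 1 otherwise. */
int run_checks(void) {
    static const int in_a[]   = {0, 1, 2, 3};
    static const int in_b[]   = {5, 6, 7, 8};
    static const int golden[] = {1, 7, 15, 25};
    const unsigned n = sizeof golden / sizeof golden[0];
    int errors = 0;

    for (unsigned k = 0; k < n; k++) {
        int got = dut(in_a[k], in_b[k]);
        if (got != golden[k]) {
            printf("Mismatch at vector %u: got %d, expected %d\n",
                   k, got, golden[k]);
            errors++;
        }
    }
    printf(errors ? "Test failed !!!\n" : "Test passed !\n");
    return errors ? 1 : 0;
}
```

A test bench `main` would then simply return the value of `run_checks()`.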
Recommendation is to separate test bench and design files
func_AB.c
    #include "func_AB.h"
    func_AB(a,b,c, *i1, *i2) {
      ...
      func_A(a,b,*i1);
      func_B(c,*i1,*i2);
      …
    }
> In HLS
C becomes RTL
Operations in the code map to hardware resources
Understand how constructs such as functions, loops and arrays are synthesized
> HLS design involves
Synthesize the initial design
Analyze to see what limits the performance
‒ User directives to change the default behaviors
‒ Remove bottlenecks
Analyze to see what limits the area
‒ The types used define the size of operators
‒ This can have an impact on what operations can fit in a clock cycle
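As a small illustration of how the types set the operator size, the sketch below multiplies 8-bit and 32-bit operands using the standard `<stdint.h>` types. The function names are hypothetical. The point is only that the first function describes an 8x8 multiplier with a 16-bit result and the second a 32x32 multiplier with a 64-bit result; what the tool actually builds for each depends on the target device.

```c
#include <stdint.h>

/* 8-bit operands: the full product always fits in 16 bits. */
uint16_t mul8(uint8_t a, uint8_t b) {
    return (uint16_t)(a * b);
}

/* 32-bit operands: the full product needs 64 bits, so one operand is
 * widened before the multiplication. */
uint64_t mul32(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```

Choosing the narrowest type that still holds the data is therefore one of the simplest ways to reduce area and to let more operations fit in a clock cycle.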