Gajski HLS
Gajski HLS
Gajski HLS
High−Level Synthesis
Chapter 1
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
1.1
1. capture−and−simulate
2. describe−and−synthesize
1.2
STRUCTURAL BEHAVIORAL
DOMAIN System synthesis DOMAIN
Register−transfer synthesis
Processors, Memories, Buses Flowcharts, algorithms
Logic synthesis Register transfers
Registers, ALUs, MUXs
Gates, flip−flops Circuit synthesis Boolean expressions
Transistors Transistor functions
Transistor layouts
Cells
Chips
Boards, MCMs
PHYSICAL
DOMAIN
1.3
mux1 DBUF
SP PC
mux2 1
+/−
Data bus
STRUCTURE
mux1 DBUF
PC
SP Address
bus
MEM
mux2
ADD/SUB
Data bus
1.4 FLOORPLAN
Copyright © 1993 by Daniel Gajski UC Irvine
DEFINITION OF SYNTHESIS
Behavior−to−structure
Circuit synthesis
Logic synthesis
Register−transfer synthesis
System synthesis
Structure−to−layout
Cell layout generation
1.5
MODELS
DESCRIPTIONS DESIGN
STYLES ABSTRACTIONS
TECHNOLOGY
1.6
LIM CNT
if CNT =/ LIM then
EN <= ENIT;
else comp
EN <= ’0’; < = >
end if;
ENIT EN
Level sensitive
LIM CNT
Edge sensitive
1.7
B EXOR (A,B)
Transmission gates
A
B
EXOR (A,B)
AND−OR−INVERT gate
1.8
STATE X A B
0
REGISTER
STATUS
REGISTER
+/−
CONTROL
LOGIC
if x = 0 then status = 1
if status = 1 then y = a+b else y = a−b
1.9
Chapter 2
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
2.1
LIR RIR
ALU ALU
2.2
A B C in A B C in
Programmable
OR array Programmable
OR array
Decoder
0
Programmable
AND array
Cout S Cout S
2.3
0 1 1 0 A B
Decoder
0
A
1
2
B
3
Output
(EXOR) EXOR
2.4
1. Compilation
2. Minimization
3. Technology mapping
4. Optimization
5. Transistor sizing
2.5
FSM types
1. Autonomous
2. State based
3. Transition based
2.6
Modulo−3 counter
s0 s1 s2
State diagram
s1 = 0 1 s2 = 1 0
s2 = 1 0 s0 = 0 0
Next−state table
D1 Q1 D0 Q0
FF1 FF0
Q’1 Q’0
Clock
Logic implementation
Clock
Q1
Q0
State waveforms
2.7
s0 = 0 0 s1 = 0 1 0
s1 = 0 1 s2 = 1 0 0
s2 = 1 0 s0 = 0 0 1
State table
D1 Q1 D0 Q0
FF1 FF0
Q’1 Q’0
Clock Y
Logic implementation
Clock
Q1
Q0
2.8
Q1Q 0 Count Q 1Q 0 Q Q0 Y
1
s0 = 0 0 1 s0 = 0 1 s0 = 0 0 0
s1 = 0 1 1 s2 = 1 0 s1 = 0 1 0
s2 = 1 0 1 s0 = 0 0 s2 = 1 0 1
don’t care 0 s0 = 0 0
D1 D0
FF1 Q1 FF0
Q0
Q’1 Q’0
Clock Y
Logic implementation
Clock
Count
Q1
Q0
Q1Q 0 Count Q 1Q 0 Y
s0 = 0 0 1 s0 = 0 1 0
s1 = 0 1 1 s2 = 1 0 0
s2 = 1 0 1 s0 = 0 0 1
don’t care 0 s0 = 0 0 0
Count
D1 Q1 D0 Q0
FF FF0
1
Q’1 Q’0
Clock
Y
Logic implementation
Clock
Count
Q1
Q0
1. Compilation
2. State minimization
3. State encoding
4. Synthesis of next−state,
output functions
2.11
where
S = set of states
f = next state function
h = output function
2.12
(Count = 1) AND (x = 2) x = x + 1, Y = 0
s (Count = 1) AND (x = 2) s0 x = 0, Y=1
0
Count = 0 x = 0, Y=0
0 +
Count
0 1
Selector
clock
1 Register
Decoder
0 1 2 3
y
status (x = 2)
Datapath implementation
2.13
Count = 0 s0
s (Count = 1) AND (x = 2) s1 x = 0, Y=0
0
(Count = 1) AND (x = 2) s0
Count = 0 s0
s (Count = 1) AND (x = 0) s1 x = x + 1, Y = 0
1
(Count = 1) AND (x = 1) s2
Count = 0 s0
s2 x = 0, Y=1
Count = 1 s1
1 2
1
0 Selector 1
0 0 +
Count
0 1 0 1
Selector Selector
clock 1 clock
1 State Register
Decoder Decoder
0 1 2 3 0 1 2 3
Y
Datapath implementation
2.14
State reg.
Datapath
control
Next−state Output Datapath
function function
Status
Control unit
2.15
Status bits
0 1
Adder ROM/PLA
Status selector
Address selector
Test bit
2.16
1. Compilation
2. Unit selection
3. Storage binding
4. Unit binding
5. Interconnection binding
6. Control definition
7. Control−unit synthesis
8. Functional−unit synthesis
2.17
Loop forever
if count=1
then
if x=2
then
begin
x=0
y=1
end
else
begin
x=x+1
y=0
end
endif
else
begin
x=0
y=0
end
endif
endloop
2.18
DQ
C Data bus
Clock 1
State
FSM 1
Acknowledge Request
DQ
C
Clock 2
State
FSM 2
2.19
1. Compilation
2. Partitioning
3. Interface synthesis
4. Scheduling
5. FSMD synthesis
2.20
1. Clocking
2. Busing
3. Pipelining
2.21
D
Q
Clock
Q’
D−latch
Clock
I/O waveforms
2.22
C C C
Clock
C C C C C C
Master Slave Master Slave Master Slave
Clock
Clock‘
Clock width
Clock
Clock’
Clock period
Single−phase clock
Phase 1
Phase 2
2−phase clock
2.23
Data
Y
Control
Tri−state driver
Data Bus
Output Latch
D Q
C
Q D
C
Input Latch
Control
Logic
Bus
Released
Bus interface
2.24
Register Register
file file
Register Register
Selector Selector
Two−stage
ALU ALU
2.25
Datapath
control
Control Datapath
unit
Status
Control Datapath
output output
Non−pipelined control unit
Control Datapath
input input
Datapath
control
register
Control Datapath
unit
Status
register
Control Datapath
output output
Pipelined control unit
2.26
2.27
Chapter 3
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
3.1
1. Cost
2. Area
3. Performance
4. Power
5. Testability
6. Verifiability
7. Reliability
8. Manufacturability
3.2
Structural design
Control unit Datapath
Register
Reg file
Mux Mux
FU
Status reg
Floorplan
PLA Bit−sliced
stack
3.3
Control
lines
(metal1)
3.4
Wdp
Wbit
Area = W dp X H dp
LSB MSB
A bit
Unit 1
Unit 2
H dp
Unit n
Extra Routing
Data routing channel
area Power Ground
Ground
Ground
Wunit2
Wunit2
Ground
Over−the−cell
routing track
Power
Ground
Diffusion
Metal 1
Metal 2
Wunitn
Wunitn
Ground
H cell H ch H cell H ch
Wbit W bit
3.5
n1 n2 n3 n1 n2 n3
O1 O1
Clusters
Inputs
I1 I1’ I2 I2’ I3 I3’ I4 I4’ I5 I5’
Input
nets H ch
Internal H sc
nets
O1 O2 O3 O4
Wsc
Wsc
H ch
H sc
H cell
single−row
implementation
Wsc / R
H sc
H H sc
block
H sc
3−row implementation
3.7
WPLA
AND array OR array W
in Wp Wout
r
Product
AND term OR H
array array PLA
buffers lw
Buffer Clock 2 Latch lh
3.8
5. Logic minimization
6. Technology mapping
7. Transistor sizing
3.9
Wire Comp j
Comp i
RT model
Lw
Rw = R ( )
s W
w
E
__
C = (L Ww ) ( )
w w t
t = ( R + R ) ( C + C )
p out w w in
3.10
: Critical Path
In 1
B Out 1
In 2 n5
E
In 3
A C n3 F Out 2
In 4
n4
n2 n1
D Out 3
A C E F
3.11
D
Q
C
D − Latch
t setup t hold
Clock
Data
t t
CQ DQ
Timing Diagram
3.12
MSFF
D Master
QM Slave
Q
MS flip−flop
t (MS)
setup
Clock
D
t (S)
setup
QM
(M)
t
DQ
Q
t (S)
CQ
Timing diagram
3.13
n1 n2 MAX ( t ( n ) , t ( n ) )
Clock p 1 p 2
t p (ALU)
ALU
n3 t p ( n3)
t (Reg3)
setup
Reg3
3.14
Memory
Control unit RAM
Path1
n7 n8
register
n10 n1 n2
State
Next−state Control Reg.
logic logic file AR DR
n4 n n
5 6
n
9 n Functional
3
unit
Datapath
Non−pipelined control
Memory
Control unit RAM
Path1
n7 n8
register
n10 n1 n2
State
One−stage pipeline
Path2
Memory
Control unit RAM
Path1
n10 n1 n2 n7 n8
register
n
Control
reg.
State
Two−stage pipeliine
3.15
Better models
Modeling algorithms
Control Optimization
State encoding
Logic optimization
Microarchitecture optimization
Floorplanning models
Other measures
Other technologies
3.16
Chapter 4
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
4.1
Concept
HDL
description
Manual
design
Synthesis
tools
Register−transfer
design
Design specification
Documentation/Redesign
Verification/Simulation
4.2
1. Conceptual capture
4.3
1. Data types
Integer, boolean, arrays
2. Operators
Arithmetic, logic, manip, access
3. Control
If, case, repeat, decode
4. Conciseness
Macros, subroutines
5. Extensibility
Operator overloading, user definitions
6. Expressivity
Hardware constructs, constraints
1. Design model
Target architecture: DSP, uproc
2. Execution ordering
Sequential, parallel, pipeline
3. Hierarchy
Complex descriptions
4. Timing specification
Clocks, delays
5. Synchronization
Communication protocols
6. Asynchrony
Global events, resets
4.5
Textual
Programming languages
Pascal, Ada, ISPS, VHDL, Verilog
Applicative
DSL
Structural
EDIF, VHDL
Formal
HOP, CIRCAL
Graphical
Hierarchical FSM
StateCharts
Petri−Nets
GDL
FlowCharts
ASM, EXEL
Waveforms
Tabular
Symbolic MicroCode
State Tables
4.6
4.7
ISPS Description
Parser
Fault ....
Analysis
Synthesis
(Design
Automation)
Architecture
Evaluation
Architecture Simulation
Certification
4.8
NETWORK OF ENTITIES
4.9
Control constructs
if x => PC = PC + 1
decode x =>
begin
0 := Acc = 0,
1 := Acc = Acc + 1,
end
Label := repeat
begin
if (done) => leave Label
end
4.10
Concurrency
a PROCESS entity executes asynchronously
Synchronization
process Master := begin
process Slave := begin
...
L2 := repeat begin
nbsend (5) {Messg:Inp1}; nbrecv(:Start)
nbsend (1) {Messg:Start}; if (Start) => leave L2
end;
L1 := repeat begin nbsend (0) {Messg:Done};
nbrecv(:Done) {Messg:Done}; nbrecv (:X) {Messg:Inp1};
if (Done) => leave L1
...
end;
nbsend (0) {Mesg:Start}; nbsend (1) {Messg:Done};
...
end
end
4.11
**Instruction Execution**
Icycle\Instruction.Cycle(main) :=
begin
Mark1 := repeat
PI = M[CR]<0:15> next
begin decode f =>
**Memory.Primary.State** begin
M[0:8191]<31:0>, #0 := CR = M[s]
**Central.Processor.State** #1 := CR = CR + M[s]
#2 := Acc = Acc − M[s]
PI\Present.Instruction<0:15>,
#3 := M[s] = Acc
f\function = PI<0:2>, #4, #5 := Acc = Acc + M[s]
s<0:12> := PI<3:15>, #6 := if Acc < 0 =>
CR = CR + 1
Acc\Accumulator<0:31>, #7 := stop()
end next
CR = CR + 1
end
end
end
4.12
Paradigm
DSL Behavioral Description
Compiler
other
Flow Graphs applications
Automatic Synthesis
Model
APPLICATIVE
SEQUENTIAL M2
END;
END;
4.13
Applicative
Single assignment, concurrent
Global resets, interrupts
Imperative
Pascal−like procedures
Default sequential execution
Operation−level concurrency:
FORK A := B, C := D JOIN;
Delay specification
CLOCK CYCLES
(A := A + 5; B := 3 CYCLES < 4);
ABSOLUTE DELAY
(IF X THEN A := B DELAY < 20);
Chip−level constraints
POWER
VOLTAGE
LEVELS
AREA
FREQUENCY
4.14
CIRCUIT exponentiation;
4.15
a b x y
4.16
Memory board
ABus Addr
CPU MemReq
MR
DataRdy Mem
cntrl ROM
Data
BusAck
BUS
BusReq
CNTRL
DBus
MemReq
BusAck
MR
Addr
Busreq 175
ns
DBus
DataRdy
4.17
Memory board
ABus Addr
CPU MemReq
MR
DataRdy Mem
cntrl ROM
Data
BusAck
BUS
BusReq
CNTRL
DBus
T MR = 0;
Abus 2 Falling(BusAck)
{18..16} Addr = ABus;
== BusReq|( delay 175ns) = 0;
1
Board_Id
F 1 Falling(MemReq)
BusReq = 1;
2 T DBus = Data; 3 Rising(MemReq)
DataRdy = 0;
3 T MR = 1; 0 Rising(BusAck)
Addr = ’X’;
FETCH
INSTR := MEM(PC) ;
PC := PC + 1; EXECUTE
case OPCODE is
when 1 =>
ACCUM := 0;
DECODE when 2 =>
OPCODE <= INSTR/10; ACCUM := ACCUM + 1;
ADDR <= INSTR mod 10; ...
wait for 30 ns; end case;
4.19
VHDL Configuration
Design Design
Entity Entity
Design Interface
Entity
Architectural Body
Design
Entity Process Block DataFlow Block
(sequential behavior) (concurrent behavior)
Structure Block
(netlist)
Reg Reg
ALU
Bus
4.20
Paradigm
VHDL Description
Compiler
other
Internal Model applications
Event−Driven Simulator
Simulation Language
4.21
Operator Overloading
Concurrency
Process Level
Operation Level:
DataFlow Blocks
Timing Specification
Transport & Inertial Delays
A <= {TRANSPORT} B AFTER 5 ns
WAIT
WAIT FOR 20 ns
4.22
Resolution Function
Resolve multiple drivers for a signal
Packages
Definitions, Macros
Attributes
Signals:
S’STABLE
S’QUIET
User Extensions
Constraints, Annotations (not simulated)
attribute Performance : Integer
4.23
begin
A <= B;
B <= A;
B <= guarded A;
4.24
Behavior
Abstract algorithm
Sequential description of functionality
DataFlow
Parallel execution of operations
Structural
Component instantiations, interconnections
Netlist description
4.25
entity FULL_ADDER is
port (X,Y: in BIT; X Y CIN
CIN: in BIT;
SUM: out BIT;
COUT: out BIT ); COUT SUM
end FULL_ADDER;
architecture FA_BOOLEAN of
FULL_ADDER is
signal S1, S2, S3: BIT;
begin
S1 <= X xor Y;
SUM <= S1 xor CIN after 3 ns;
S2 <= X and Y;
S3 <= S1 and CIN;
COUT <= S2 or S3 after 5 ns;
end;
Data flow
X Y
S2
COUT S1
S3 CIN
SUM
Synthesized structure
4.26
architecture FA_BEHAV of
FULL_ADDER is
begin
process(X,Y,CIN)
variable BV: BIT_VECTOR(1 to 3);
variable NUM,I: INTEGER;
variable Stemp, Ctemp: BIT;
begin
NUM := 0; case NUM is
when 0 => Ctemp:=’0’; Stemp:=’0’;
BV := X & Y & CIN;
for I := 1 to 3 loop when 1 => Ctemp:=’0’; Stemp:=’1’;
when 2 => Ctemp:=’1’; Stemp:=’0’;
if (BV(I) = ’1’) then
when 3 => Ctemp:=’1’; Stemp:=’1’;
NUM := NUM + 1; end case;
end if;
SUM <= Stemp after 3 ns;
end loop;
COUT <= Ctemp after 5 ns;
end process;
end FA_BEHAV;
VHDL description
NUM = 0 X Y CIN
3 2 1 S
INC
INC
1 1 0 0 1 0 1 0
INC
3 2 1 0 3 2 1 0
MUX MUX
COUT SUM
entity FULL_ADDER is Y
X CIN
port (X,Y: in BIT;
CIN: in BIT;
SUM: out BIT;
COUT SUM
COUT: out BIT );
end FULL_ADDER;
architecture Structure_View of
A B
FULL_ADDER is
component Half_adder
HA
port (A,B: in BIT;
S,C: out BIT);
end component; AB C S
component Or_gate
port (A,B: in BIT;
O: out BIT); O
end component; X Y CIN
begin
S1
HA1: Half_adder port map
(A=>X, B=>Y, S=>S1, C=>C1); C1 HA2
HA2: Half_adder port map C2
(A=>S1, B=>CIN, S=>SUM, C=>C2);
OR!: Or_gate port map
(A=>C1, B=>C2, O=>COUT);
COUT SUM
end;
4.28
6. Modeling guidelines
4.29
"0000"
CLR
mux1
"0001" CNT1
EN
+
not INC EN
mux3
INC CNT1’quiet
CLK CNT CLR
mux2
CLK
not
CNT2’quiet mux4
CNT2
CNT
4.30
4.31
2. Combinatorial style
(a) concurrent execution semantics
(b) connection of logic gates
(c) no clocks
(d) dataflow VHDL constructs
3. Functional style
(a) one−state FSMD
(b) synchronous and asynchronous behavior
(c) signal typing
(d) dataflow block VHDL constructs
4. Register−transfer style
(a) FSMD model
(b) states, condition, actions
(c) no explicit VHDL constructs
5. Behavioral style
(a) communicating processes
(b) shared memory or message passing
(c) no allocation, no binding, no schedule
(d) VHDL process statements, wait statements
4.32
4.33
end case;
end block;
4.34
4.35
User−interaction / annotation
Design frameworks
Design verification
4.36
Specialized languages
Architectural taxonomies
Modeling guidelines
Design scenarios
4.37
Chapter 5
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
5.1
Synthesis
tools
T2
T1 T3
L1
A1
Input Canonical
L2 Target
intermediate A2
HDLs architectures
representation
A3
L3
− Language independent
5.2
VHDL Description
Compilation, Transformations
Value Lifetimes,
Control & Data Dependencies CDFG
Scheduling Partition into control steps,
FSM DP
Controller Structure
5.3
5.4
Control−flow graph
Data−flow graphs
Read ‘1’
START
0 1
=
‘0’ B"0000"
0 1 Write Write M
DONE
B4 0 1 = <
Read M Read A
Read M Read B
& ‘1’
B2
Write Write
+
M_OUT DONE
Write M
Read A Read
M[0]
Read ‘1’
COUNT SHR Read M ‘0’
B3 + Write A SHR
Write
COUNT Write M
5.5
START = 1
S0 0 1
B1
A := A_PORT; COUNT := 0; A_PORT B_PORT
ENDIF
Initial Allocation
B3
A := SHR(A, M(0)); COUNT := COUNT + 1;
S
3 M := SHR(M, ‘0’);
5.6
Present Next
Condition Value Actions
State State
A := A_PORT;
B := B_PORT;
T COUNT := 0; S1
S0 START = 1 DONE := ’0’;
M := 0000";
A_PORT B_PORT
F S0
START
T S2
S1 COUNT < 4 CLK
M_OUT := M @ A;
F DONE := ’1’; S0 DONE M_OUT
T M := M + B; S3
S2 A(0) = 1
I/O ports
F S3
A := SHR(A, M(0));
M := SHR(M. ’0’);
S3 COUNT := COUNT + 1; S1
A_PORT B_PORT
Mult Count_Reg
Control Unit
A_Reg B_Reg
1 S2
1
S1 Compar.LT Concat(OP: concat, INPS: Mult, A_Reg);
0 M_OUT(OP: load, INPS: Concat);
Mux5(OP: c1, INPS: ’0’, ’1’); S0 Mult(0) Mux3 Mux4
0
DONE(OP: load, INPS: Mux5);
Mux3(OP: c0, INPS: Mult, Count_Reg);
Mux4(OP: c0, INPS: B_Reg, "0001");
1 Adder(OP: add, INPS: Mux3, Mux4); S3 Shift1 Shift2 Adder
START
Mux1(OP: c1, INPS: Shift1, Adder); 4
S2 A_Reg(0) Concat
Mult(OP: load, INPS: Mux1);
Mux5
M_OUT DONE
5.8
A_PORT B_PORT
Mux1 Mux2
Present Condition Actions Next
State Value State Control unit
S0 & START = 1
S0 & ~(START = 1)
S1 & COUNT < 4
S1 & ~(COUNT < 4)
S2 & A(0) = 1
S2 & ~(A(0) = 1)
S3
Mux1 − − − − 1 − 0 Mult Count_Reg
Mux2 1 − − − − − 0
Mux3 − − − − 0 − 1 A_Reg B_Reg
Mux4 − − − − 0 − −
0001
1 S2 Load A_Reg 1 − − − − − 1
Load B_Reg 1 − − − − − − 0
S1 Compar.LT Clear Count_Reg 1 − − − − − − Mult(0)
Load Count_Reg − − − − − − 1
0 DONE := 1; S0 Clear Mult 1 − − − − − − Mux3 Mux4
Load Mult − − − − 1 − 1
Adder − − − − 1 − Shift1 Shift2
1
Mux3.sel := 0; Shift1 − − − − − − 1
Mux4.sel := 0; Shift2 − − − − − − 1
1 Adder.add := 1; S3 DONE 0 − − 1 − − − Adder
A_Reg(0) Mux1.sel := 1; Next State s1 s0 s2 s0 s3 s3 s1
S2 0100
Mult.load := 1;
0 S3 Compar
State Reg
START Concat
A_Reg(0)
Compar.LT
Complete Design
5.9
A := B + C;
D := A * E;
X := D − A;
HDL description
Stmt1
+ +
Write A
Read E
Read A Read E
Stmt2 * *
Write D
Read D Read A
Stmt3 −
−
Write X Write X
5.10
1 2 E
case C is
when 1 => X := X + 2;
A := X + 5; X := X + 2; A := X + 3; A := X + W;
when 2 => A := X + 3; A := X + 5;
when others => A := X + W;
end case;
Read X
+ 5 3 Read W
+ + +
1 2 E 1 2 E
Read C
Write X Write A
Dataflow representation
5.11
b <= a + 1;
a <= b + 1;
Concurrent VHDL
+ +
Write b Write a
Representation
5.12
Read b Const 1 b 1
+ +
a := b + 1; Write a
a
b := a + 1; Read a
+ +
Seq. VHDL
Write b b
5.13
Read a loop_init
Read b
Read req
delay: loop_join
min=500, max=1000 +
delay
delay min 100
min 50 max 1000
Const 1 max 90
shr loop_test
0 1
Write ack
loop_body
Write c loop_exit
5.14
Data flow
1. case C is
Read C Read X Read W
2. when 1 => X := X + 2;
3. A := X + 5; +
4. when 2 => A := X + 3;
5. when others => A := X + W; Control flow Write A
6. end case;
Const 3 Read X
1 2 E
VHDL Description +
Write A
Read X Const 2
+ Const 5
+
Partitioned CDFG
Write X Write A
Data flow
DX DW DC
BR BR
Control flow C Data flow
E E
1 2
1
2 3 (C=1) (C=E) 2 = 1
(C=2)
+ + + 2 + 2 X
5
5 W
+ 3 4 5
3
2 2
1 E 1
E + 3 + 4 +
5
6
ME ME
A
UX UA
5.15
Sequential execution
Control Flow Graph
Data Flow Graph
Data_bus.r
VHDL process
Creg.w
P1: process
begin
CREG := DATA_BUS;
CREG = B"00" Count.r 1
if CREG = B"00" then
COUNT := COUNT + 1;
else
COUNT := 0; +
end if
Count.w
end process;
Count.w
5.16
Parallel execution
DAGs
VHDL Block B.r 1
5.17
1. Compiler transformations
2. Flowgraph transformations
3. Hardware transformations
5.18
Write C Write C
Constant folding
* * *
5.19
1 2 3
= =
NOT
4
6
Read
Signal X
AND
7 sensitivity: EDGE
active edge: POSITIVE
5.20
a := b + c − d + f − g + h + k;
VHDL code
−
+
− +
−
b + + +
+
c b c d f g h k
t1 := h + k −
d
t2 := g + t1 + t1 := b + c Potential
t3 := f − t2 f t2 := d + f
+
parallelism
t4 := d + t3 g t3 := g + h
t5 := c − t4 t4 := t1 − t2
a := b + t5 t5 := t3 + k Potential
h k parallelism
a := t4 − t5
Initial parse tree After tree height reduction
5.21
if (X = 0) then
A := B + C;
D := B − C;
else
D := D − 1;
end if;
Textual representation
Read
0
X
If_test
0 1 Read Read
= B C
Stmt_blk2 Stmt_blk1
Read + −
1
D
Write Write
− A D
If_join
Write
D
CF representation
Write Write
A D
DF representation
5.22
Stmt_blk1
Stmt_blk1
If_test
0 1
If_test
Stmt_blk2 0 1
Stmt_blk3 Stmt_blk6
Stmt_blk3
If_test
0 1
If_join
Stmt_blk4 Stmt_blk5
If_join
If_join
Stmt_blk7
5.23
Read a Read b
NAND Write C
Write C
5.24
A B 1 A B
add
n1 + n3
inc
n2 +
Adder/Incrementer Transformation
5.25
A B "0001"
case F is + −
when "00" => OUT <= A + B;
when "01" => OUT <= A + B + "0001";
when "10" => OUT <= A − B; + + − +
when "11" => OUT <= A − B + "0001";
end case; "00" "01" "10" "11"
F
OUT
VHDL DF graph
A B
A B
+ AI − SI
"00" : +
F "01" : AI
"00" "01" "10" "11"
F "10" : −
"11" : SI
OUT
OUT
5.26
1. Common representation
2. Different views
3. Different architectures
4. Description disambiguation
5. Layout−Driven Transformations
6. Transformation scripts
5.27
Chapter 6
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
6.1
Scheduling
Allocation
Unit selection
Chip partitioning
Problem decomposition
for tractability
6.2
a
FF1 FF2 c
b G1
a v1 v2 v3
c
ni nj b
Cutline e36
d e24
REG. v5
e d
e v6
e v4 25 G2
f
f G
g
g
a b
Chip 1 c
n n
i j
Chip 2
d e f g
Partitioned Design
6.3
−Time utilization
I1 I2
−Component utilization
process1
process2 B B
var : D : integer;
end if process1
wait until (B <= 0);
D := I3 + B;
F <= I3 + I1;
B process2 F process3
end process2;
process3 wait wait
var : G : integer; B <= 0 F>0
Constructive methods
1. Random selection
2. Cluster growth
3. Hierarchical clustering
Iterative−Improvement methods
1. Min−cut partitioning
2. Simulated annealing
6.5
Algorithm 6.1
page 185
6.6
v1
v1 v v v v
5 4 2 3 4 5 v(24)
v2 v1 − − − − −
1 v3
v2 5 − − − − v1 v2 v4 v3 v5
6 3 v3 4 1 − − −
v5 v4 0 6 0 − −
v4 v5 0 3 0 0 −
(a)
v(241)
v1 v1 v(24) v v5
3 v(24)
5 4 v1 − − − −
1 v(24) 5 − − −
v3 v1 v2 v4 v3 v5
v(24) v3 4 1 − −
3 v5 0 3 0 −
v5
(b)
v(2413)
v(241)
v(241) v3 v5 v(241)
v(241) v(24)
v3 − − −
4
v3 4 − −
3 v5 3 0 − v1 v2 v4 v3 v5
v5
(c)
v(24135)
v(2413)
v (2413) v(2413) v5 v(241)
v(24)
v(2413) − −
v5 3 −
3 v1 v2 v4 v3 v5
v5
6.7
Algorithm 6.2
page 188
6.8
Criterion A
a b c
First
3 4 1 2
5 cutline Criterion A
d e f
f c a e d b
Criterion B (a)
a b 3 c
First
2 cutline
1
5
4
Criterion B
d e f
c e f b a d
(b)
{f,c} {d}
3
5 1
2 A then B
{a,e} {b} {a,e} {f,c} {b} {d}
(c)
4
{c,e} {a,d}
5 1
{f} {b}
B then A
{c,e} {f} {a,d} {b}
(d)
Second
cutline 5
{f,c} {a,e,d}
3 1
A then B
{b} {a,e,d} {f,c} {b}
f c a e d b
6.9
G1 G2
Cutline
6.10
Interconnection Reduction
Cutline Cutline
v v v v
i j j i
G1 G2 G1 G2
6.11
Algorithm 6.3
page 194
6.12
GAIN(k)
20
10
5
k
1 2 3 4 5 6 7 8 9 10
−5
6.13
Algorithm 6.4
page 197
6.14
a b c
o1 o2 e13
(+) G1 (+) add1
Two cluster
mult1
G1
e
partition
e13 e23 23 G2
o3
G2
*
( )
a b c
o1 o2
G1 G2 e13
add1
(+) (+)
G1
mult1 Three cluster
e13 e
G3
partition
23 add2
o3 G2 e
G3 23
*
( )
6.15
1. Functional proximity
2. Communication proximity
3. Potential parallelism
4. Closeness distance
6.16
CDFG
User choice
Clustering
Procedure/ Procedure/
Control Data control data Operator
Cutline
Schedule
Area Connections length
No
Done?
Yes
Partitioned CDFG
6.17
main
num_msgs : register(8); reset system_off
user_id_ram : memory(4x4);
system_on not system_on
system_on
initialize respond_to_machine_button
machine_button_pushed
respond_to_external_line
monitor dialtone
answer
dialtone
play_announcement record_msg
tone=1 tone=1
remote_operation
check_user_id respond_to_cmds
code_ok
dialtone not code_ok
Object description
6.18
external ports
17 24
8 3 2 1 1 4
2
monitor (4489) check_user_id (4272) user_id_ram (750) AREA: 9511
7 PINS: 21
6.19
Partitioning of CDFG
Partitioning of specifications
Software/Hardware partitioning
6.20
Chapter 7
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
7.1
VHDL Description
Compilation, Transformations
Value Lifetimes,
Control & Data Dependencies CDFG
Scheduling Partition into control steps,
FSM DP
Controller Structure
7.2
7.3
START = 1
S0 0 1
B1
A := A_PORT; COUNT := 0;
B := B_PORT; DONE <= ‘0’;
M := B"0000";
S1 COUNT < 4
0 1
B4
M_OUT <= M & A; DONE <= ’1’;
A(0) = ’1’
0 1 B2
S M := M + B ;
2
ENDIF
B3
A := SHR(A, M(0)); COUNT := COUNT + 1;
S
3 M := SHR(M, ‘0’);
7.4
Target Architecture:
Pipelining, Clocking, Busing, Component Sets, ...
u dx 3 x u dx x dx
v v
* v * 4 + 10
* v 2
1 e
while (x < a) loop e e2,5
4,9
1,5 y
x1 := x + dx; y
e
10,11
* v * v + v
9
3
u1 := u − ( 3 * x * u * dx ) − (3 * y * dx ); 5
u e e a
y1 := y + (u * dx); 3,6 dx
5,7
x := x1; u := u1; y := y1;
− v < v
end loop; v * 6 11
7
e
7,8 e
6,8
c
− v
8
7.6
Algorithm 7.1
page 217
7.7
Algorithm 7.2
page 219
(Check errata)
7.8
v1 v2 v3 v4 v
10
v1 v2
E=1 * * * + L=1
* * *
v v v v
7 7
v4 10
6 +
E=3 − L=3 − * *
v8 v v9 v11
E=4 − L=4 − 8 + <
7.9
List−Based Scheduling
Static−List Scheduling [JMSW91]
7.10
Algorithm 7.5
page 234
7.11
s2
E=2 * * + v < L=2 * v * v
v5 v6
9
v11 5 3
v v v v
7 s3 7
v4 10
6 +
E=3 − L=3 − * *
v8 s4 v v9 v11
E=4 − L=4 − 8 + <
* * − + <
7.13
* 1 * 2 * 4 + node 8 9 11 7 6 4 10 5 3 1 2
10
ALAP 1 1 1 2 2 2 2 3 3 4 4
* 5 * 3
+ 9 <
11 ASAP 4 2 2 3 2 1 1 2 1 1 1
− 7 * 6
priority 1 2 3 4 5 6 7 8 9 10 11
− 8
− + < − + <
* * * *
+ s * * + 10
* 2 * 1 10 1 2 1
s * 3 * 5 < 11
* 3 * 5 2
s * 4 * 6 − 7
3
s4 − 8 + 9
7.14
List−Based Scheduling
Static−List Scheduling [JMSW91]
7.15
1 2 3 4 5 6 7 8 9 10 11
s1
s2
s3
s4
Operation ranges
s1 v1 v2
* *
v5 v3 v10
s2
* * +
v7 v6 v
s3 − 4
* *
s4 v8 v v11
− + 9 <
Final Schedule
7.16
Formulation on
pages 221−222
(Check errata)
7.17
1 2 3 4 5 6 7 8 9 10 11
s1
s2
s3
s4
s1 v1 v2 (pages 222−223)
* *
v5 v3 v10
s2
* * +
v7 v6 v
s3 − 4
* *
s4 v8 v v11
− + 9 <
Final Schedule
7.18
Iterative Approach:
Determine global effect of scheduling an operation into a state
Compute probability of scheduling op into a state
Compute expected operator cost for each op type
Schedule each operation to balance hardware utilization
Choose assignment of op to state with min cost
Recompute probability distr. and expected op costs
7.19
s s 2.83
1 * * 1
1 2
*
s s 2.33
2 * * + 2
5 3
*
s s 0.83
3 − + < 3
7 6 4 10
0.00
s4 − s4
8 9 11
7.20
s s 2.83
1 * * 1
1 2
*
s s 2.33
2 * * + 2
5 3
*
s s 0.83
3 − + < 3
7 6 4 10
0.00
s4 − s4
8 9 11
s s 2.33
1 * * 1
1 2
s s 2.33
2 * * * + 2
5 3
s s 1.33
3
− * + < 3
7 6 4 10
0.00
s4 − s4
8 9 11
Algorithm 7.3
page 227
7.22
s 1 2 3 4 10 s 1 2 3 4 10
1 1
s 5 6 9 11 s 5 9 11
2 2
s 7 s 7 6
3 3
s4 8 s4 8
Algorithm 7.4
page 231
7.24
s
1 +
s
1
*
s s
1 + + 2
s s1
2 − * * * *
s2 s3
− −
Multi−Functional Units
ALUs, Shift−registers, etc.
time
b
1 2 3 4 5 6 7 8 9 10 11 12
Sequential Execution
1, 2, 3 4, 5, 6 7, 8, 9 10 , 11, 12
m
1 4 7 10
p 2 5 8 11
3 6 9 12
Loop Folding
7.26
A B C
s A C
1
I F E D
s I F B
2
K L J H G
s K L D
3
M
s4 M E G
N s5
J N H
Q P R s6
Q P R
7.27
Loop Overhead
A
s A1 A2 B1 C1
1
I F C
s I1 F1 E1 D1
2
L D B
s L1 J1 H1 G1
3
J M E G A
s4 K1 M1 I2 C2
K N H I F C
s5 N1 L2 F2 D2
Q P R L D B
s P1 R1 Q1 B2
6
J M E G
s7 J2 E2 M2 G2 Loop Body
K N H
s8 K2 N2 H2
Q P R
s9 Q2 P2 R2
Loop Overhead
7.28
s1 v1 = v2 + v3 v4 = v2 * v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
Initial Schedule
s1 v4 = v2 * v3 v1 = v2 + v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
s1 v4 = v2 * v3 v1 = v2 + v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
7.29
1 ppc <= pc 1 1
if (branch = ’1’) 4
4 4
Branch Branch
i1 Branch
Branch 5
5 then pc<= branchpc 5
6 6 6
end if
8 oldpc <= pc 8 8
Write Write
9 9 9
pc <= pc + 4
10 10 10
7.30
a b c d e f a b c d e f
+ + + +
1 5 1 5
+ + +
2 2
3
+ +
3
4
+
4
a b d c a b d c
+ +
1 1
+ + +
2 6 2
* *
3 3
+ + +
4 5 4
+
5
d (a + b + c) + ab + ab d (a + b + c) + 2ab
before after
Redundant Operator Insertion
7.31
7.32
Chapter 8
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
8.1
Approaches
Greedy
Decomposition
Iterative
8.2
a b c d
r1 r2 r3 r4
s1 + o1 + o2 a b, e, g c, f, h d
e f
s2 + o3 + o4
+1, +3 +2, +4
ADD1 ADD2
g h
8.3
Output OutBus1
interconnection
network OutBus2
Output Register file
interconnection
network r3
Register file
r1 r2 r4 r5 r6
r3
r1 r2 r4 r5 r6
InBus1
Input InBus2
Input interconnection
interconnection network InBus3
network
InBus4
ALU1 ALU2
ALU1 ALU2
Mux−oriented DP Bus−oriented DP
8.4
tr t e t w
Bus−oriented DP
Cycle 1 Cycle 2
8.5
OutBus1
OutBus2
Read r1 Read r5 InBus1
r3
Read r2 Read r6 InBus2
r1 r2 r4 r5 r6 Read r3 Read r2 InBus3
Read r4 Read r5 InBus4
InBus1 Execute Execute ALU1
Execute Execute ALU2
InBus2 Write r3 Write r1 OutBus1
Write r1 Write r6 OutBus2
InBus3
Cycle 1 Cycle 2 Cycle 3
InBus4
ALU1 ALU2
r3
r1 r2 r4 r5 r6
te
tw tr
InBus1
Read r Read r5 Read r1 InBus1
InBus2 1
Read r Read r6 Read r6 InBus2
InBus3/OutBus1 2
InBus3/
Read r Read r2 Write r Write r1 OutBus1
InBus4/OutBus2 3 3
Read r4 Read r5 Write r1 Write r6 InBus4/
OutBus2
Execute Execute Execute ALU1
Execute Execute Execute ALU2
ALU1 ALU2
Cycle 1 Cycle 2 Cycle 3 Cycle 4
8.7
Unit Selection
Functional−unit Binding
Storage Binding
Interconnection Binding
8.8
a b c d
r1 r2 r3 r4
s1 + o1 + o2 a b, e, g c, f, h d
e f
s2 + o3 + o4
+1, +4 +2, +3
g h ADD1 ADD2
r1 r2 r3 r4
a,g b,e c,f d,h r1 r2 r3 r4
a, g b, e c, f d, h
+1, +3 +2, +4
+1, +4 +2, +3
ADD1 ADD2
ADD1 ADD2
8.9
r1 r2 r3 r4 r5 r1 r2 r3 r4 r5 r1 r2 r3 r4 r5
Initial partial design add 2 inputs to mux add tristate buffer to bus
r1 r2 r3 r4 r5 r1 r2 r3 r4 r5 r1 r2 r3 r4 r5
add mux to FU input add FU and tristate to bus add FU and mux
r1 r2 r3 r4 r5
Bus1
ALU1 ALU2
convert muxes to shared bus
8.10
Algorithm 8.1
page 276
8.11
8.12
Algorithm 8.2
page 279
8.13
v3 v4 v3 v4
v5 v5 s
s 5
s 5 s
4 134
Supernode Creation 1 Supernode Creation 2
v v2 s
1
25
s = {v1 , v 3 , v 4 }
134
v3 v4
s {v2 , v 5 }
s
134
v5
25 =
Supernode Creation 3 Cliques for Graph G
8.14
v1 v2
s1 s1 R
v6 + v4
v3 W
s2 s2 R
* −
v10 v5
W
v7
s s R
3 / + 3
+
W
v11 v8 v9
R
s4 & | s4
W
v1 v2
v8
v10
Cliques:
v1 r1 = {v1 , v 8 }
v9 v2
v7
r2 = {v2 , v3 , v 9 }
v11 r3 =
v3 v5 {v4 , v5 , v 11 }
r4 = {v6 , v7 }
v4
v6
r5 = {v10 }
Algorithm 8.3
page 285
8.16
v2
s1 v1 v6
v10 v4
s2
v3
v5 v7
s3
v8 v9 v11
s
4
v1’ v2’
Lifetimes Registers
8.17
Algorithm 8.4
page 288
8.18
s1
s2
s3
s
4
r1 v1
v3
r2
r1 = { v1 , v8 , v1’ }
v10
r2 = { v9 , v10 }
r3 v4 v5
r3 = { v4 , v5 , v11 }
r4 v6
v7 r4 = { v6 , v7 }
r5 v2
r5 = { v2 , v3 , v2’ }
Set R Set V
8.19
Algorithm 8.5
page 292
8.20
8.21
Allocation of memories
8.22
Chapter 9
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
9.1
9.2
r1 r2 r3 r4
+ − + Adder − Subtractor
r5 r6
* * Multiplier
r7
y y
Compiler
r1 r2 r3 r4
Scheduler
+ − DFG
Allocator
r5 r6
Netlist
generator
*
r7
Physical design
y
ASIC description
to manufacturing
9.4
r1 r2 r3 r4
+ − State 1 + −
r2
State 2 r4
*
State 3 *
r5
y
y
Control Address
mux mux
unit generator
+/− *
32
r5 AR2
Control bus1
Address bus2
Memory 2
FSMD implementation
Resource utilization
Behavioral Design
description constraints
HL synthesis
Compiler
Netlist
generator
Logic/Sequential synthesis
Physical design
9.6
Behavioral Design
description constraints
HL synthesis
Compiler
Scheduler Architecture,
topology,
CDFG
resource
Allocator selection
Netlist
generator
9.7
Completeness
1. All levels of design
Extensibility
1. Addition of new algorithms and tools
2. Addition of new architecture styles
Controllability
1. Control of tools
Interactivity
1. Partial design definition
Upgradability
1. Capture−and−simulate to describe−and−synthesize
2. Mixing of strategies
9.8
System
synthesis
Chip
SDB
Conceptualization environment
synthesis
Simulation code generators
Intermediate forms
Simulation suite
Logic/Sequential
CDB
synthesis
Physical design
synthesis
ASIC description
to manufacturing
9.9
System description
Compiler
Standard
Estimator SR component
binding
Partitioner
Interface &
arbitration
synthesis
Port minimization
Partitioned system
description
Scheduling
9.10
Behavioral
description
Compiler
Scheduler
Storage
allocator
CDFG Storage
Technology mapping
Architecture, topology, style selection
merger strategies:
Functional unit
allocator 1. Top−down
Interconnection
allocator 2. Meet−in−the−middle
Module
selector 3. Bottom−up
Technology
mapper
CDB
Microarchitecture
optimizer
Logic/Sequential synthesis
To physical design
9.11
State Interface
encoding synthesis
Logic
minimization
Technology
mapping
Physical design
9.12
RT netlist
Style Component
Partitioning instances
selection
from CDB
1D 2D 1Bit
Floorplanning
Routing
To ASIC manufacturing
9.13
Register
Counter
Mux
ALU
Datapath floorplan
Stack 2
Memory
A/D
Register
file Stack 1
Chip floorplan
9.14
9.15
Version Transaction
manager manager
Designer
Database
interface
Design data
Design view Design data
representation
manager manager
graphs
Design
tools
Consistency Design quality
checker evaluator
9.16
Component
descriptions,
Component
Component Schematic Component Component
generators,
netlist diagram request query
Component−
optimization
tools
Schematic Schematic
capture generator
Component database
Component
Fixed store
Parameterized
Component
Component
descriptions
generators
9.17
Synthesis algorithms
9.18
Variables
Begin State Condition CondValue Actions NextState
t, count, opd, result
Ports
R,A,B t=(A<B),
Operators
t=A<B BEGIN R=0, test1
R=0 result = 0
<, +, − result = 0
testT testF
testT count=A, opd=B join
count = A count = B
opd = B opd = A
testF count=B, opd=A join
Join
join loop
loop
loop t=(count>0) test2
t = count > 0
1 R = result END
END END
End
9.19
Datapath1
bus1
Reg4.1
Reg4.2
Memory Control Unit
Reg4.3
ALU4.1
ALU4.2
bus2
Datapath2
MUL4.1
Control Unit
Glue Logic
MUL4.2
9.20
Description capture
Description
partitioning
Component
selection
Scheduling
To physical design
9.21
9.22