Gajski HLS

Introduction to
High−Level Synthesis
Chapter 1
Source: Gajski, Dutt, Wu, Lin
"High−Level Synthesis"
Kluwer Academic Publishers, 1992
1.1
Copyright © 1993 by Daniel Gajski UC Irvine

NEED FOR HIGH−LEVELS
OF ABSTRACTION
VLSI complexity requires hierarchy
VLSI technology reached maturity
First silicon and first specification
Shorter design cycle
Better exploration of design space
Algorithms outperform designers
Two schools of thought:
1. capture−and−simulate
2. describe−and−synthesize
1.2

LEVELS OF ABSTRACTION
BEHAVIORAL DOMAIN: Flowcharts, algorithms; Register transfers; Boolean expressions; Transistor functions
STRUCTURAL DOMAIN: Processors, Memories, Buses; Registers, ALUs, MUXs; Gates, flip-flops; Transistors
PHYSICAL DOMAIN: Transistor layouts; Cells; Chips; Boards, MCMs
Synthesis tasks (one per level): System synthesis; Register-transfer synthesis; Logic synthesis; Circuit synthesis
1.3

THREE DESIGN VIEWS
if IR(3) = ’0’ then
PC := PC + 1;
else
DBUF := MEM(PC);
MEM(SP) := PC + 1;
SP := SP − 1;
PC := DBUF;
end if;
BEHAVIOR
mux1 DBUF
SP PC
Control Address bus

Unit MEM
mux2 1
+/−
Data bus
STRUCTURE
mux1 DBUF
PC
SP Address
bus
MEM
mux2
ADD/SUB
Data bus
1.4 FLOORPLAN
DEFINITION OF SYNTHESIS
Behavior−to−structure
Circuit synthesis
Logic synthesis
Register−transfer synthesis
System synthesis
Structure−to−layout
Cell layout generation
Module layout generation

Chip floorplanning
System partitioning and placement
1.5

DEPENDENCE OF LANGUAGES,
DESIGNS AND TECHNOLOGIES
MODELS
DESCRIPTIONS DESIGN
STYLES ABSTRACTIONS
TECHNOLOGY
Several descriptions for the same behavior
Several styles for the same description
Different abstractions for the same design
1.6

DIFFERENT DESIGNS
FOR THE SAME BEHAVIOR
LIM CNT
if CNT /= LIM then
EN <= ENIT;
else comp
EN <= ’0’; < = >
end if;
ENIT EN
Level sensitive
LIM CNT
if ENIT = ’1’ and not ENIT’stable then

EN <= ’1’; comp
< = >
elsif CNT = LIM then
EN <= ’0’;
end if;
1 D Q EN
ENIT
Edge sensitive
1.7

DIFFERENT STYLES
FOR THE SAME DESCRIPTIONS
B EXOR (A,B)
Transmission gates
A
B
EXOR (A,B)
AND−OR−INVERT gate
1.8

DIFFERENT CONSTRUCTS
FOR THE SAME BEHAVIOR
STATE X A B
0
REGISTER
STATUS
REGISTER
+/−
CONTROL
LOGIC
1 state (no status register)

if x = 0 then y = a+b else y = a−b
2 states (with status register)
if x = 0 then status = 1
if status = 1 then y = a+b else y = a−b
1.9

Architectural
Models in Synthesis
Chapter 2
2.1

DESIGN STYLES AND
TARGET ARCHITECTURE
Left Right Result Left Right

bus bus bus bus bus
Register file Register file
LIR RIR
ALU ALU
              3-bus nonpipelined design    2-bus pipelined design
Program 1:    x <= a + b;  (100ns)         LIR <= a; RIR <= b;              (50ns)
              y <= c - x;  (100ns)         x, RIR <= LIR + RIR; LIR <= c;   (50ns)
                                           y <= LIR - RIR;                  (50ns)
Program 2:    x <= a + b;  (100ns)         LIR <= a; RIR <= b;              (50ns)
              y <= c - d;  (100ns)         x <= LIR + RIR;                  (50ns)
                                           LIR <= c; RIR <= d;              (50ns)
                                           y <= LIR - RIR;                  (50ns)
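Adding up the step times in the two listings gives the total execution time on each target architecture. The following is a minimal Python sketch of that arithmetic; the step counts and delays are taken from the listings, while the names `STEPS`, `total_time_ns` and `compare` are illustrative only.

```python
# Step times in ns as given in the two program listings.
NONPIPELINED_STEP_NS = 100   # one complete operation per step (3-bus design)
PIPELINED_STEP_NS = 50       # one register load or ALU stage per step (2-bus design)

def total_time_ns(num_steps, step_ns):
    """Execution time of a straight-line sequence of equal-length steps."""
    return num_steps * step_ns

# Number of steps per program, counted from the listings.
STEPS = {
    "Program 1": {"nonpipelined": 2, "pipelined": 3},
    "Program 2": {"nonpipelined": 2, "pipelined": 4},
}

def compare(program):
    """Return (nonpipelined_ns, pipelined_ns) for the named program."""
    steps = STEPS[program]
    return (total_time_ns(steps["nonpipelined"], NONPIPELINED_STEP_NS),
            total_time_ns(steps["pipelined"], PIPELINED_STEP_NS))

for name in STEPS:
    print(name, compare(name))
```

Program 1 finishes sooner on the pipelined design because the load of c overlaps the addition; the Program 2 schedule has no such overlap, so the two designs take the same time.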
2.2

COMBINATORIAL LOGIC
A B C in A B C in
Programmable
OR array Programmable
OR array
Decoder
0
Programmable
AND array
Cout S Cout S
ROM implementation PLA implementation

of a FA of a FA
2.3

COMBINATORIAL LOGIC
0 1 1 0 A B
Decoder
0
A
1
2
B
3
Output
(EXOR) EXOR
Decoder implementation Logic gate implementation

of an EXOR gate of an EXOR gate
2.4

DESIGN PROCESS FOR
COMBINATORIAL FUNCTIONS
1. Compilation
2. Minimization
3. Technology mapping
4. Optimization
5. Transistor sizing
2.5

FINITE STATE MACHINES
<S, I, O, f: S x I −> S, h: S x I −> O>
FSM types
1. Autonomous
2. State based
3. Transition based
4. Machines with datapath

5. Communicating machines
2.6

AUTONOMOUS FSM
Modulo−3 counter
s0 s1 s2
State diagram
Present state Next state

Q Q Q Q
1 0 1 0
s0 = 0 0 s1 = 0 1
s1 = 0 1 s2 = 1 0
s2 = 1 0 s0 = 0 0
Next−state table
D1 Q1 D0 Q0
FF1 FF0
Q’1 Q’0
Clock
Logic implementation
Clock
Q1
Q0
State waveforms
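The next-state table can be exercised with a short simulation. This is a sketch only: the flip-flop input equations D1 = Q0 and D0 = Q1' Q0' are derived here from the table (treating the unused code 11 as a don't-care) and are not stated on the slide; the function names are hypothetical.

```python
# States encoded as (Q1, Q0), following the next-state table.
S0, S1, S2 = (0, 0), (0, 1), (1, 0)

def next_state(q1, q0):
    """Assumed D flip-flop inputs: D1 = Q0, D0 = Q1' AND Q0'."""
    d1 = q0
    d0 = (1 - q1) & (1 - q0)
    return (d1, d0)

def run(cycles, start=S0):
    """Return the sequence of states visited over the given number of clock cycles."""
    state = start
    trace = [state]
    for _ in range(cycles):
        state = next_state(*state)
        trace.append(state)
    return trace

print(run(6))
```

Because the machine is autonomous, the trace depends only on the starting state and repeats every three clock cycles.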
2.7

FSM WITH OUTPUT
MODULO−3 DIVIDER
Present state Next state Output

Q Q Q Q Y
1 0 1 0
s0 = 0 0 s1 = 0 1 0
s1 = 0 1 s2 = 1 0 0
s2 = 1 0 s0 = 0 0 1
State table
D1 Q1 D0 Q0
FF1 FF0
Q’1 Q’0
Clock Y
Clock
Q1
Q0
State and output waveforms
2.8

STATE−BASED
MODULO−3 DIVIDER
Present state   Input   Next state      Present state   Output
Q1 Q0           Count   Q1 Q0           Q1 Q0           Y
s0 = 0 0        1       s1 = 0 1        s0 = 0 0        0
s1 = 0 1        1       s2 = 1 0        s1 = 0 1        0
s2 = 1 0        1       s0 = 0 0        s2 = 1 0        1
don't care      0       s0 = 0 0
Next−state and output tables

Count
D1 D0
FF1 Q1 FF0
Q0
Q’1 Q’0
Clock Y
Clock
Count
Q1
Q0
Input and output waveforms

2.9

TRANSITION−BASED
MODULO−3 DIVIDER
Present state   Input   Next state      Output
Q1 Q0           Count   Q1 Q0           Y
s0 = 0 0        1       s1 = 0 1        0
s1 = 0 1        1       s2 = 1 0        0
s2 = 1 0        1       s0 = 0 0        1
don't care      0       s0 = 0 0        0
Next−state and output tables
Count
D1 Q1 D0 Q0
FF1 FF0
Q’1 Q’0
Clock
Y
Clock
Count
Q1
Q0
Input and output waveforms
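The difference between the state-based and transition-based dividers shows up only when Count drops to 0 while the machine is in s2. A minimal Python sketch of both tables follows, assuming the output is sampled before the state update in each cycle; all names are illustrative.

```python
NEXT = {"s0": "s1", "s1": "s2", "s2": "s0"}

def step(state, count):
    """Next state shared by both tables: advance on Count=1, return to s0 on Count=0."""
    return NEXT[state] if count == 1 else "s0"

def state_based(counts, state="s0"):
    """State-based divider: Y depends on the present state only (Y=1 in s2)."""
    ys = []
    for c in counts:
        ys.append(1 if state == "s2" else 0)
        state = step(state, c)
    return ys

def transition_based(counts, state="s0"):
    """Transition-based divider: Y depends on present state and input (Y=1 in s2 with Count=1)."""
    ys = []
    for c in counts:
        ys.append(1 if (state == "s2" and c == 1) else 0)
        state = step(state, c)
    return ys

counts = [1, 1, 1, 1, 1, 0, 1]
print(state_based(counts))
print(transition_based(counts))
```

With this input sequence the two machines agree until the sixth cycle, where Count is 0 in state s2: the state-based machine still asserts Y, the transition-based one does not.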

2.10

DESIGN PROCESS FOR
FINITE−STATE MACHINES
1. Compilation
2. State minimization
3. State encoding
4. Synthesis of next−state,
output functions
2.11

FINITE−STATE MACHINES
WITH A DATAPATH
$FSMD = \langle S, I \cup B, O \cup A, f, h \rangle$
where
S = set of states
f = next state function
h = output function
B = set of some status variables
A = set of storage variable assignments
2.12

TRANSITION−BASED FSMD
Present State   Input                        Next State   Output
s0              (Count = 1) AND (x /= 2)     s0           x = x + 1, Y = 0
s0              (Count = 1) AND (x = 2)      s0           x = 0, Y = 1
s0              Count = 0                    s0           x = 0, Y = 0
Next state and output table
0 +
Count
0 1
Selector
clock
1 Register
Decoder
0 1 2 3
y
status (x = 2)
Control unit Data path
Datapath implementation
2.13

STATE−BASED FSMD
Present State Input Next State Output
Count = 0 s0
s (Count = 1) AND (x = 2) s1 x = 0, Y=0
0
(Count = 1) AND (x = 2) s0
Count = 0 s0
s (Count = 1) AND (x = 0) s1 x = x + 1, Y = 0
1
(Count = 1) AND (x = 1) s2
Count = 0 s0
s2 x = 0, Y=1
Count = 1 s1
Next state and output table
1 2
1
0 Selector 1
0 0 +
Count
0 1 0 1
Selector Selector
clock 1 clock
1 State Register
Decoder Decoder
0 1 2 3 0 1 2 3
Control unit Data path
Y
Datapath implementation
2.14

GENERIC FSMD BLOCK DIAGRAM
Control inputs Datapath inputs
State reg.
Datapath
control
Next−state Output Datapath
function function
Status
Control unit
Control outputs Datapath outputs
FSMD = Control unit + Data Path
2.15

NEXT−STATE FUNCTION
State register Control inputs

1
Status bits
0 1
Adder ROM/PLA
Status selector
Address selector
Test bit
Typical processor implementation
2.16

DESIGN PROCESS FOR FSMDs
1. Compilation
2. Unit selection
3. Storage binding
4. Unit binding
5. Interconnection binding
6. Control definition
7. Control−unit synthesis
8. Functional−unit synthesis
2.17

BEHAVIORAL DESCRIPTION
FOR FSMDs
Loop forever
if count=1
then
if x=2
then
begin
x=0
y=1
end
else
begin
x=x+1
y=0
end
endif
else
begin
x=0
y=0
end
endif
endloop
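The behavioral description above translates almost line for line into executable form. The Python sketch below runs one loop iteration per call; `fsmd_step` and `simulate` are names chosen for illustration, and x is assumed to start at 0.

```python
def fsmd_step(x, count):
    """One iteration of the loop body; returns (new_x, y)."""
    if count == 1:
        if x == 2:
            return 0, 1       # x = 0, y = 1
        return x + 1, 0       # x = x + 1, y = 0
    return 0, 0               # count = 0: x = 0, y = 0

def simulate(counts, x=0):
    """Apply a sequence of count inputs and collect the y outputs."""
    ys = []
    for c in counts:
        x, y = fsmd_step(x, c)
        ys.append(y)
    return ys

print(simulate([1, 1, 1, 1, 1, 1]))
print(simulate([1, 1, 0, 1, 1, 1]))
```

The storage variable x plays the role of the state bits in the earlier modulo-3 dividers, which is why the FSMD needs only a single control state.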
2.18

FSMD COMMUNICATION
DQ
C Data bus
Clock 1
State
Next Output Datapath

state
FSM 1
Acknowledge Request
DQ
C
Clock 2
State
Next Output Datapath

state
FSM 2
2.19

DESIGN PROCESS FROM
SYSTEM DESCRIPTIONS
1. Compilation
2. Partitioning
3. Interface synthesis
4. Scheduling
5. FSMD synthesis
2.20

ENGINEERING CONSIDERATIONS
1. Clocking
2. Busing
3. Pipelining
2.21

CLOCKING AND STORAGE
D
Q
Clock
Q’
D−latch
Clock
I/O waveforms
2.22

CLOCKING STRATEGIES
Data in Data out

D Q D Q D Q
C C C
Clock
Shift−register with latches
MS flip−flop MS flip−flop MS flip−flop

Data in
D Q D Q D Q D Q D Q D Q
C C C C C C
Master Slave Master Slave Master Slave
Clock
Clock’
Shift−register with MS flip−flops
Clock width
Clock
Clock’
Clock period
Single−phase clock
Phase 1
Phase 2
2−phase clock
2.23

BUSING
Data
Y
Control
Tri−state driver
Data Bus
Output Latch
D Q
C
Q D
C
Input Latch
Control
Logic
Bus
Released
Bus interface
2.24

DATAPATH PIPELINING
Register Register
file file
Register Register
Selector Selector
Two−stage
ALU ALU
Non−pipelined Pipelined datapath

datapath with 2−stage adder
2.25

CONTROL PIPELINING
Control Datapath
input input
Datapath
control
Control Datapath
unit
Status
Control Datapath
output output
Non−pipelined control unit
Control Datapath
input input
Datapath
control
register
Control Datapath
unit
Status
register
Control Datapath
output output
Pipelined control unit
2.26

FUTURE DIRECTIONS
Expansion of existing models
Design processes for new models
Algorithms with engineering considerations
Synthesis of mixed synch/asynch systems
2.27

Quality Measures
Chapter 3
3.1

QUALITY MEASURES
1. Cost
2. Area
3. Performance
4. Power
5. Testability
6. Verifiability
7. Reliability
8. Manufacturability
3.2

RELATIONSHIP BETWEEN
STRUCTURAL AND PHYSICAL DESIGNS
Structural design
Control unit Datapath
present next reg. AR

state cond. state transf.
RAM
DR
Register
Reg file
Mux Mux
FU
Status reg
Technology mapping Technology mapping
PLA Std. cells General Bit−sliced Std. cells

cells stack
Floorplan
PLA Bit−sliced
stack
RAM Std. cells
3.3

DATAPATH LAYOUT
ARCHITECTURE
Data lines Routing

(metal1 or poly) channel
LSB MSB LSB MSB
Control
lines
(metal1)
Data lines Control

(metal2) lines
(metal2)
Bit slice Bit slice
Custom Cells Standard Cells
3.4

DATAPATH LAYOUT
Wdp
Wbit
$Area = W_{dp} \times H_{dp}$
LSB MSB
A bit
Unit 1
Unit 2
H dp
Unit n
Extra Routing
Data routing channel
area Power Ground
Ground
Unit 1 Power Unit 1
Ground
Wunit2
Wunit2
Unit 2 Power Unit 2
Ground
Over−the−cell
routing track
Power
Ground
Diffusion
Metal 1
Metal 2
Wunitn
Wunitn
Unit n Poly Unit n

Power
Power/
Ground
Ground
H cell H ch H cell H ch
Wbit W bit
$W_{unit} = const_1 \times tr(unit)$
$H_{ch} = const_2 \times \#tracks_{est}$
3.5

CONTROL UNIT LAYOUT
State table (inputs I1-I5, outputs O1-O4):
          Present state   Conditions/status   Next state   Control signals
          p1  p0          s2  s1  s0          r1  r0       c1  c0
          I1  I2          I3  I4  I5          O1  O2       O3  O4
State1    0   1           1   0   0           1   0        0   1
State2    1   0           1   0   1           1   0        1   0
State3    1   0           0   0   1           1   1        1   1

Output signals: Boolean equations
O1 = (I1' I2 I3 I4' I5') OR (I1 I2' I3 I4' I5) OR (I1 I2' I3' I4' I5)
O2 = (I1 I2' I3' I4' I5)
O3 = (I1 I2' I3 I4' I5) OR (I1 I2' I3' I4' I5)
O4 = (I1' I2 I3 I4' I5') OR (I1 I2' I3' I4' I5)
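Each product term in the Boolean equations corresponds to one row of the state table, so the equations can be checked against the table mechanically. A small Python sketch of that check follows; the function and table names are illustrative.

```python
def outputs(i1, i2, i3, i4, i5):
    """Evaluate O1..O4 from the two-level AND-OR equations."""
    inv = lambda v: 1 - v
    t1 = inv(i1) & i2 & i3 & inv(i4) & inv(i5)     # State1 row: 0 1 1 0 0
    t2 = i1 & inv(i2) & i3 & inv(i4) & i5          # State2 row: 1 0 1 0 1
    t3 = i1 & inv(i2) & inv(i3) & inv(i4) & i5     # State3 row: 1 0 0 0 1
    o1 = t1 | t2 | t3
    o2 = t3
    o3 = t2 | t3
    o4 = t1 | t3
    return (o1, o2, o3, o4)

# (I1..I5) -> (O1..O4), copied from the state table.
STATE_TABLE = {
    "State1": ((0, 1, 1, 0, 0), (1, 0, 0, 1)),
    "State2": ((1, 0, 1, 0, 1), (1, 0, 1, 0)),
    "State3": ((1, 0, 0, 0, 1), (1, 1, 1, 1)),
}

for name, (inputs, expected) in STATE_TABLE.items():
    print(name, outputs(*inputs) == expected)
```

Three distinct product terms feed four OR gates, which matches the "no sharing between signals" and "no logic optimization" assumptions of the standard-cell estimate only if each output's terms are implemented separately.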
I1 I1
I1’ I1’
I2 I2
I2’ I2’
I3 I3
I3’ I3’
I4 I4
I4’ I4’
I5 I ’ I5 I ’
5 5
n1 n2 n3 n1 n2 n3
O1 O1
2−level AND−OR impl. 2−level NAND−NAND impl.
Clusters
Inputs
I1 I1’ I2 I2’ I3 I3’ I4 I4’ I5 I5’
Input
nets H ch
Internal H sc
nets
AND AND AND OR H cell
O1 O2 O3 O4
Wsc
Standard cell layout

Assumptions:
1. single row 2. signal clustering
3. no sharing between signals 4. track per signal
5. no logic optimization
3.6

MULTIPLE−ROW
CONTROL LAYOUT
Wsc
H ch
H sc
H cell
single−row
implementation
Wsc / R
H sc
H H sc
block
H sc
3−row implementation
3.7

PLA LAYOUT ARCHITECTURE
WPLA
AND array OR array W
in Wp Wout
r
Product
AND term OR H
array array PLA
buffers lw
Buffer Clock 2 Latch lh
1 Latch buffer Buffer bh

b
w
I1 I2 I3 I4 I5 O1 O 2 O 3 O4 Inputs Clock Outputs
Logic mapping Layout model
3.8

MODELING PHYSICAL DESIGN
1. Probabilistic distribution for pins,

wire length
2. Placement, routing models
3. Linear algorithms (min−cut)
4. State encoding model
5. Logic minimization
6. Technology mapping
7. Transistor sizing
3.9

WIRE MODELING
Wire Comp j
Comp i
RT model
Vdd Wire model

Rw
Comp i Comp j
Rout
C in
Cw 2
Equivalent RC delay model
$R_w = R_s \left( \frac{L_w}{W_w} \right)$
$C_w = (L_w W_w) \left( \frac{\varepsilon}{t} \right)$
$t_p = (R_{out} + R_w)(C_w + C_{in})$
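The equivalent RC delay model can be evaluated directly once the wire geometry is known. The Python sketch below implements the three relations of the model (sheet resistance times length over width, parallel-plate capacitance, and the lumped delay product); every numeric value in the example is a made-up placeholder, not data from the slides.

```python
def wire_resistance(r_sheet, length, width):
    """R_w = R_s * (L_w / W_w), with R_s in ohms per square."""
    return r_sheet * length / width

def wire_capacitance(length, width, permittivity, thickness):
    """C_w = (L_w * W_w) * (epsilon / t)."""
    return length * width * permittivity / thickness

def propagation_delay(r_out, r_w, c_w, c_in):
    """t_p = (R_out + R_w) * (C_w + C_in)."""
    return (r_out + r_w) * (c_w + c_in)

# Hypothetical example: 1 mm x 2 um wire, values in SI units.
r_w = wire_resistance(r_sheet=0.05, length=1000e-6, width=2e-6)
c_w = wire_capacitance(length=1000e-6, width=2e-6,
                       permittivity=3.45e-11, thickness=1e-6)
t_p = propagation_delay(r_out=1000.0, r_w=r_w, c_w=c_w, c_in=10e-15)
print(r_w, c_w, t_p)
```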
3.10

COMBINATORIAL DELAY
: Critical Path
In 1
B Out 1
In 2 n5
E
In 3
A C n3 F Out 2
In 4
n4
n2 n1
D Out 3
A C E F
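Finding the critical path of a combinational network is a longest-path search over its gate graph. The sketch below shows the idea in Python with memoized recursion. The connectivity and gate delays are invented for illustration and are not read off the figure; they were merely chosen so that the slowest path passes through gates A, C, E and F.

```python
from functools import lru_cache

# Hypothetical gate delays (arbitrary units) and fanout lists.
GATE_DELAY = {"A": 2, "B": 1, "C": 3, "D": 1, "E": 2, "F": 2}
FANOUT = {"A": ["C", "D"], "B": ["E"], "C": ["E", "F"],
          "D": [], "E": ["F"], "F": []}

@lru_cache(maxsize=None)
def longest_from(gate):
    """Return (delay, path) of the slowest path that starts at `gate`."""
    best_delay, best_path = 0, ()
    for succ in FANOUT[gate]:
        delay, path = longest_from(succ)
        if delay > best_delay:
            best_delay, best_path = delay, path
    return GATE_DELAY[gate] + best_delay, (gate,) + best_path

def critical_path():
    """Slowest path over all starting gates."""
    return max((longest_from(g) for g in GATE_DELAY), key=lambda r: r[0])

print(critical_path())
```

Memoization keeps the search linear in the number of gates and connections, which matters once wire delays are added as extra nodes between gates.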
3.11

D−LATCH DELAYS
D
Q
C
D − Latch
t setup t hold
Clock
Data
t t
CQ DQ
Timing Diagram
3.12

MASTER−SLAVE DELAYS
MSFF
D Master
QM Slave
Q
MS flip−flop
t (MS)
setup
Clock
D
t (S)
setup
QM
(M)
t
DQ
Q
t (S)
CQ
Timing diagram
3.13

REGISTER−TRANSFER PATH
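Path: Reg1, Reg2 -> n1, n2 -> ALU -> n3 -> Reg3 (registers driven by Clock)
Delay components along the path:
$MAX(t_p(Reg1), t_p(Reg2))$
$MAX(t_p(n_1), t_p(n_2))$
$t_p(ALU)$
$t_p(n_3)$
$t_{setup}(Reg3)$
The clock period must cover the sum of these components.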
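The register-transfer path delay can be computed by adding the slowest source register, the slowest input net, the ALU, the output net and the destination setup time. The Python sketch below assumes the clock period is bounded by that sum; the parameter names mirror the labels in the figure and the example delays are hypothetical.

```python
def min_clock_period(tp_reg1, tp_reg2, tp_n1, tp_n2, tp_alu, tp_n3, tsetup_reg3):
    """Smallest clock period (same units as the inputs) for the register-to-register path."""
    return (max(tp_reg1, tp_reg2)      # slower of the two source registers
            + max(tp_n1, tp_n2)        # slower of the two input nets
            + tp_alu                   # functional-unit delay
            + tp_n3                    # output net
            + tsetup_reg3)             # setup time of the destination register

# Hypothetical delays in ns.
print(min_clock_period(tp_reg1=3, tp_reg2=4, tp_n1=1, tp_n2=2,
                       tp_alu=20, tp_n3=1, tsetup_reg3=2))
```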
3.14

SYSTEM CLOCKING MODELS
Memory
Control unit RAM
Path1
n7 n8
register
n10 n1 n2
State
Next−state Control Reg.
logic logic file AR DR
n4 n n
5 6
n
9 n Functional
3
unit
Datapath
Non−pipelined control
Memory
Control unit RAM
Path1
n7 n8
register
n10 n1 n2
State
Next−state Control Reg.

n4 n5 n
6
n
9 Functional
Status register
n11 n unit
Path2 3
Datapath
One−stage pipeline
Path2
Memory
Control unit RAM
Path1
n10 n1 n2 n7 n8
register
n
Control
reg.
State
Next−state Control 12 Reg.

n4 n5 n
6
n
9 Functional
Status register unit
Path3 n n
11 3 Datapath
Two-stage pipeline
3.15

FUTURE DIRECTIONS
Better models
Modeling algorithms
Control Optimization
State encoding
Logic optimization
Microarchitecture optimization
Floorplanning models
Other measures
Other technologies
3.16

Design Description
Languages
Chapter 4
4.1
Copyright © 1993 by Daniel Gajski and Nikil Dutt UC Irvine

NEED FOR HARDWARE
DESCRIPTION LANGUAGE
Concept
Schematic capture High−level

English
and simulation specification synthesis
HDL
description
Manual
design
Synthesis
tools
Register−transfer
design
Design specification
Documentation/Redesign
Verification/Simulation
Communication between designers
4.2

DESIGN SPECIFICATION
1. Conceptual capture
2. Higher abstraction level
3. Detect Early Design Errors
4. Model hardware realistically
5. Facilitate synthesis, simulation

and verification
6. Good spec ==> Good design

Poor spec ==> ?
4.3

SPECIFICATION:
Programming Language Features
1. Data types
Integer, boolean, arrays
2. Operators
Arithmetic, logic, manip, access
3. Control
If, case, repeat, decode
4. Conciseness
Macros, subroutines
5. Extensibility
Operator overloading, user definitions
6. Expressivity
Hardware constructs, constraints
7. Bindings, user annotations

Component/Time allocation and binding
4.4

SPECIFICATION: Design Features
1. Design model
Target architecture: DSP, uproc
2. Execution ordering
Sequential, parallel, pipeline
3. Hierarchy
Complex descriptions
4. Timing specification
Clocks, delays
5. Synchronization
Communication protocols
6. Asynchrony
Global events, resets
4.5

HDL FORMATS
Textual
Programming languages
Pascal, Ada, ISPS, VHDL, Verilog
Applicative
DSL
Structural
EDIF, VHDL
Formal
HOP, CIRCAL
Graphical
Hierarchical FSM
StateCharts
Petri−Nets
GDL
FlowCharts
ASM, EXEL
Waveforms
Tabular
Symbolic MicroCode
State Tables
4.6

TYPICAL HDLs
ISPS (Barb 81)
Hardware C (Ku DE 91)
Silage (Hilf 85)
VHDL (IEEE 88)
Verilog (ThMo 91)
4.7

ISPS PARADIGM [Barb81]
ISPS Description
Parser
Global Data Base

(Parse Trees)
Fault ....
Analysis
Synthesis
(Design
Automation)
Architecture
Evaluation
Architecture Simulation
Certification
4.8

ISPS MODEL
NETWORK OF ENTITIES
ONE ACTIVATION ONLY

CRITICAL
MAIN ENTITY3 ...
ENTITY (PARAMETERS)
label :=
BEGIN
** SECTION1 ** CONCURRENT SEQUENTIAL
1 COPY CALL
** SECTION2 **
ENTITY2 ... ENTITY1 ...
END
4.9

ISPS: FEATURES
Data types
constants
prefix meaning example
’ Boolean ’10?0
# Octal #17
{0−9} Decimal 12
" Hexadecimal "A
bit−vectors & arrays

Acc\accumulator<0:31>
Mem\register.file[0:15]<0:31>
Control constructs
if x => PC = PC + 1
decode x =>
begin
0 := Acc = 0,
1 := Acc = Acc + 1,
end
Label := repeat
begin
if (done) => leave Label
end
4.10

ISPS: FEATURES (cont’d)
Operation execution
All operations executed in parallel
Sequentiality enforced by NEXT

PC = PC + 1;
ACC = 0;
next
IR = M[PC]
Concurrency
a PROCESS entity executes asynchronously
a CRITICAL entity is queued
Synchronization
process Master := begin
process Slave := begin
...
L2 := repeat begin
nbsend (5) {Messg:Inp1}; nbrecv(:Start)
nbsend (1) {Messg:Start}; if (Start) => leave L2
end;
L1 := repeat begin nbsend (0) {Messg:Done};
nbrecv(:Done) {Messg:Done}; nbrecv (:X) {Messg:Inp1};
if (Done) => leave L1
...
end;
nbsend (0) {Messg:Start}; nbsend (1) {Messg:Done};
...
end
end
4.11

ISPS MARK 1 DESCRIPTION
**Instruction Execution**
Icycle\Instruction.Cycle(main) :=
begin
Mark1 := repeat
PI = M[CR]<0:15> next
begin decode f =>
**Memory.Primary.State** begin
M[0:8191]<31:0>, #0 := CR = M[s]
**Central.Processor.State** #1 := CR = CR + M[s]
#2 := Acc = Acc − M[s]
PI\Present.Instruction<0:15>,
#3 := M[s] = Acc
f\function = PI<0:2>, #4, #5 := Acc = Acc + M[s]
s<0:12> := PI<3:15>, #6 := if Acc < 0 =>
CR = CR + 1
Acc\Accumulator<0:31>, #7 := stop()
end next
CR = CR + 1
end
end
end
4.12

DSL: Paradigm and Model
Paradigm
DSL Behavioral Description
Compiler
other
Flow Graphs applications
Automatic Synthesis
Model
CIRCUIT chip; SEQUENTIAL M1

CALL
INTERFACE END;
AREA
FREQUENCY
POWER CALL
APPLICATIVE
SEQUENTIAL M2
END;
END;
4.13

DSL FEATURES
Description styles
Applicative
Single assignment, concurrent
Global resets, interrupts
Imperative
Pascal−like procedures
Default sequential execution
Operation−level concurrency:
FORK A := B, C := D JOIN;
Delay specification
CLOCK CYCLES
(A := A + 5; B := 3 CYCLES < 4);
ABSOLUTE DELAY
(IF X THEN A := B DELAY < 20);
Chip−level constraints
POWER
VOLTAGE
LEVELS
AREA
FREQUENCY
4.14

DSL SPECIFICATION (CaRo 85)
CIRCUIT exponentiation;
INTERFACE VAR x,y :FIXED(8,8);

vcc :12v; I :LOGICAL(4..0);
gnd :GND;
input(15..0) :INPUT FANIN 1;
output(15..0) :OUTPUT FANOUT 10; APPLICATIVE
enable :INPUT FANIN 1; output := y;
clk :CLOCK FANIN 1;
IF enable = 0 THEN x :=1, y := 1
ELSE Start calc;
POWER 100 mW; FI;
VOLTAGE 12.00V;
TECHNOLOGY CMOS; END APPLICATIVE;
AREA 30 sq. mm;
FREQUENCY 0 to 500 kHz;
CLOCKBASE clk; IMPERATIVE calc;
(FOR i := 1 to 16 DO
PERFORMED FUNCTION x := x * input / i;
output := #exp(input); y := y + x;
CONTROL OD CYCLES = 3);
(enable := 0 CYCLES = 1); END IMPERATIVE;
(enable := 1 CYCLES = 48);
END.
END;
4.15

PROCESS SYNCHRONIZATION
IN HARDWARE C
Port-based communication (P1 has ports a, b; P2 has ports x, y):
process P1(a,b)
in port a;
out port b;
{
......
}
process P2(x,y)
out port x;
in port y;
{
......
}
Channel-based communication (channels a and b connect P1 and P2):
process P1(a,b)
in channel a;
out channel b;
{
receive (a,buf);
......
send (b,msg);
}
process P2(a,b)
out channel a;
in channel b;
{
send (a,msg);
......
receive (b,buf);
}
4.16

MEMORY BOARD DESIGN
AND TIMING DIAGRAM
Memory board
ABus Addr
CPU MemReq
MR
DataRdy Mem
cntrl ROM
Data
BusAck
BUS
BusReq
CNTRL
DBus
MemReq
BusAck
MR
Addr
Busreq 175
ns
DBus
DataRdy
4.17

MEMORY BOARD DESCRIPTION
IN BIF
Memory board
ABus Addr
CPU MemReq
MR
DataRdy Mem
cntrl ROM
Data
BusAck
BUS
BusReq
CNTRL
DBus
CPU−memory board block diagram
Present Cond Next

Val Actions Event
State State
DBUS = ’X’;
0 T 1 Falling(MemReq)
DataRdy = 1;
BusReq = 1;
T MR = 0;
Abus 2 Falling(BusAck)
{18..16} Addr = ABus;
== BusReq|( delay 175ns) = 0;
1
Board_Id
F 1 Falling(MemReq)
BusReq = 1;
2 T DBus = Data; 3 Rising(MemReq)
DataRdy = 0;
3 T MR = 1; 0 Rising(BusAck)
Addr = ’X’;
Memory board read cycle in BIF

4.18

BEHAVIORAL HIERARCHY IN
SPEC CHARTS (NaVG 91)
SYSTEM declarations: port RESET_IN : in bit;

connections: CPU.CLK : CLK_GEN.CLK;
constraints: num_chips <= 3; area_per_chip <= 60sqmm;
CPU port CLK : in bit; CLK_GEN

variable ACCUM, INSTR, PC : integer;
variable MEM : mem_array (255 downto 0); port CLK : out bit;
/* code behavior that

RESET generates a clock */
rising(RESET_IN)
ACCUM, INSTR, PC := 0;
loop
CLK <= ’0’;
wait for 100 ns;
ACTIVE CLK <= ’1’;
signal OPCODE, ADDR : integer; wait for 100 ns;
end loop;
FETCH
INSTR := MEM(PC) ;
PC := PC + 1; EXECUTE
case OPCODE is
when 1 =>
ACCUM := 0;
DECODE when 2 =>
OPCODE <= INSTR/10; ACCUM := ACCUM + 1;
ADDR <= INSTR mod 10; ...
wait for 30 ns; end case;
(OPCODE=0) not (OPCODE=0)
4.19

DESIGN HIERARCHY IN VHDL
VHDL Configuration
Design Design
Entity Entity
Design Interface
Entity
Architectural Body
Design
Entity Process Block DataFlow Block
(sequential behavior) (concurrent behavior)
p1: process(clock) b1: block

begin begin
..... .....
end process; end block;
Structure Block
(netlist)
Reg Reg
ALU
Bus
4.20

VHDL [IEEE87]
VHDL Hardware Description Language
Paradigm
VHDL Description
Compiler
other
Internal Model applications
Event−Driven Simulator
Simulation Language
Signals, Registers, Ports: containers w/ drivers
Driver Values Scheduled as Events in Simulation Time

Bus Resolution Function for Multiple Drivers
4.21

VHDL FEATURES
Strongly Typed Language
Operator Overloading
Concurrency
Process Level
Operation Level:
DataFlow Blocks
Timing Specification
Transport & Inertial Delays
A <= {TRANSPORT} B AFTER 5 ns
WAIT
WAIT FOR 20 ns
4.22

VHDL FEATURES
Resolution Function
Resolve multiple drivers for a signal
Packages
Definitions, Macros
Attributes
Signals:
S’STABLE
S’QUIET
User Extensions
Constraints, Annotations (not simulated)
attribute Performance : Integer
attribute LayoutSize : Integer
attribute Power : Integer
4.23

SEQUENTIAL AND PARALLEL
EXECUTION IN VHDL
P1: PROCESS (clock)
begin
A <= B;
B <= A;
end PROCESS P1;
B1: BLOCK (clock’event AND

clock = ’1’)
begin
A <= guarded B;
B <= guarded A;
end BLOCK B1;
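Both the process and the guarded block above exchange A and B, because VHDL signal assignments read the current values and take effect only after all right-hand sides have been evaluated. The Python sketch below models that deferred update and, as a contrast added here for illustration, the immediate update that VHDL variables (`:=`) would give; the function names are hypothetical.

```python
def signal_update(signals, assignments):
    """Signal semantics: evaluate every source with the current values, then commit together."""
    scheduled = {target: signals[source] for target, source in assignments}
    updated = dict(signals)
    updated.update(scheduled)
    return updated

def variable_update(values, assignments):
    """Variable semantics: each assignment takes effect before the next one is evaluated."""
    values = dict(values)
    for target, source in assignments:
        values[target] = values[source]
    return values

start = {"A": 1, "B": 0}
swap = [("A", "B"), ("B", "A")]      # A <= B; B <= A;
print(signal_update(start, swap))    # values are exchanged
print(variable_update(start, swap))  # both end up with the old value of B
```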
4.24

VHDL DESCRIPTION STYLES
Behavior
Abstract algorithm
Sequential description of functionality
No implied structural implementation
DataFlow
Parallel execution of operations
Data transformations, register transfers
Hints at structural implementations
Structural
Component instantiations, interconnections
Netlist description
4.25

FULL ADDER:
DATAFLOW DESCRIPTION
entity FULL_ADDER is
port (X,Y: in BIT; X Y CIN
CIN: in BIT;
SUM: out BIT;
COUT: out BIT ); COUT SUM
end FULL_ADDER;
Entity interface description in VHDL
architecture FA_BOOLEAN of
FULL_ADDER is
signal S1, S2, S3: BIT;
begin
S1 <= X xor Y;
SUM <= S1 xor CIN after 3 ns;
S2 <= X and Y;
S3 <= S1 and CIN;
COUT <= S2 or S3 after 5 ns;
end;
Data flow
X Y
S2
COUT S1
S3 CIN
SUM
Synthesized structure
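The dataflow equations of FA_BOOLEAN can be checked exhaustively, since a full adder has only eight input combinations. The Python sketch below mirrors the five concurrent assignments one for one and ignores the `after` delays; the function name is illustrative.

```python
from itertools import product

def full_adder(x, y, cin):
    """Mirror of the dataflow architecture: returns (SUM, COUT)."""
    s1 = x ^ y          # S1 <= X xor Y
    total = s1 ^ cin    # SUM <= S1 xor CIN
    s2 = x & y          # S2 <= X and Y
    s3 = s1 & cin       # S3 <= S1 and CIN
    cout = s2 | s3      # COUT <= S2 or S3
    return total, cout

for x, y, cin in product((0, 1), repeat=3):
    s, c = full_adder(x, y, cin)
    print(x, y, cin, "->", s, c)
```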
4.26

FULL ADDER:
BEHAVIORAL DESCRIPTION
architecture FA_BEHAV of
FULL_ADDER is
begin
process(X,Y,CIN)
variable BV: BIT_VECTOR(1 to 3);
variable NUM,I: INTEGER;
variable Stemp, Ctemp: BIT;
begin
NUM := 0;
BV := X & Y & CIN;
for I in 1 to 3 loop
if (BV(I) = '1') then
NUM := NUM + 1;
end if;
end loop;
case NUM is
when 0 => Ctemp:='0'; Stemp:='0';
when 1 => Ctemp:='0'; Stemp:='1';
when 2 => Ctemp:='1'; Stemp:='0';
when others => Ctemp:='1'; Stemp:='1';
end case;
SUM <= Stemp after 3 ns;
COUT <= Ctemp after 5 ns;
end process;
end FA_BEHAV;
VHDL description
NUM = 0 X Y CIN
3 2 1 S
INC
INC
1 1 0 0 1 0 1 0
INC
3 2 1 0 3 2 1 0
MUX MUX
COUT SUM
Synthesized full adder

4.27

FULL ADDER:
STRUCTURAL DESCRIPTION
entity FULL_ADDER is Y
X CIN
port (X,Y: in BIT;
CIN: in BIT;
SUM: out BIT;
COUT SUM
COUT: out BIT );
end FULL_ADDER;
architecture Structure_View of
A B
FULL_ADDER is
component Half_adder
HA
port (A,B: in BIT;
S,C: out BIT);
end component; AB C S
component Or_gate
port (A,B: in BIT;
O: out BIT); O
end component; X Y CIN
signal C1, S1, C2: BIT;

HA1
begin
S1
HA1: Half_adder port map
(A=>X, B=>Y, S=>S1, C=>C1); C1 HA2
HA2: Half_adder port map C2
(A=>S1, B=>CIN, S=>SUM, C=>C2);
OR1: Or_gate port map
(A=>C1, B=>C2, O=>COUT);
COUT SUM
end;
4.28

MODELING
1. Language and architecture matching
2. General languages induce modeling styles
3. Description disambiguation and

design optimization
4. Real vs simulated delays
5. Language constructs with no hardware

realization
6. Modeling guidelines
4.29

LANGUAGE INDUCED
UNNECESSARY HARDWARE
CNT_CLR: block(CLR = ’1’)

begin
CNT1 <= guarded B"0000" after CLRDEL;
end block;
CNT_UP: block(EN = ’1’ and CLK = ’1’

and CLK’event and INC = ’1’)
begin
CNT2 <= guarded CNT + B"0001" after INCDEL;
end block;
SEL: CNT <= CNT1 when not CNT1’quiet else

CNT2 when not CNT2’quiet else
CNT;
VHDL counter description
"0000"
CLR
mux1
"0001" CNT1
EN
+
not INC EN
mux3
INC CNT1’quiet
CLK CNT CLR
mux2
CLK
not
CNT2’quiet mux4
CNT2
CNT
Initial hardware synthesized RT counter
4.30

VHDL SYNTHESIS PROBLEMS
Identification of storage elements, signals
Language constructs with no realizable

hardware
Collecting, identifying component attributes
Specification of asynchronous events
Use of multiple blocks/processes to

describe one component
4.31

MODELING GUIDELINES
1. Matching of semantic models to architectural models

(a) specialized languages
(b) modeling styles (structured properties)
2. Combinatorial style
(a) concurrent execution semantics
(b) connection of logic gates
(c) no clocks
(d) dataflow VHDL constructs
3. Functional style
(a) one−state FSMD
(b) synchronous and asynchronous behavior
(c) signal typing
(d) dataflow block VHDL constructs
4. Register−transfer style
(a) FSMD model
(b) states, condition, actions
(c) no explicit VHDL constructs
5. Behavioral style
(a) communicating processes
(b) shared memory or message passing
(c) no allocation, no binding, no schedule
(d) VHDL process statements, wait statements
4.32

VHDL FUNCTIONAL
DESCRIPTION STYLE
CNT_UP_CLR: block( CLR = ’1’

or (EN = ’1’ and CLK’event and CLK = ’1’))
begin
CNT <= guarded
B"0000" after CLRDEL when CLR=’1’ else
CNT + B"0001" after INCDEL when INC=’1’ else
CNT;
end block;
4.33

VHDL RT
DESCRIPTION STYLE
State_Fetch: block ( (CLK’event and CLK=’1’) and (state=S0))

begin
IR <= M(PC);
state <= S1;
end block;
State_Decode: block ( (CLK’event and CLK=’1’) and (state=S1))

begin case IR is
when "0000" => ACC <= ACC + 1;
state <= S2;
when "0001" => ACC <= 0;
..... state <= S3;
end case;
end block;
4.34

BEHAVIORAL
DESCRIPTION STYLE
Block symbol: inputs A_PORT, B_PORT, START, CLK; outputs M_OUT, DONE
entity MULT is
port ( A_PORT,
B_PORT: in bit_vector(3 downto 0);
M_OUT: out bit_vector(7 downto 0);
CLK: in CLOCK;
START: in BIT;
DONE: out BIT
);
end MULT;
architecture SHIFT_MULT of MULT is
begin
process
variable A, B, M : BIT_VECTOR;
variable COUNT : INTEGER;
begin
wait until (START = '1');
A := A_PORT; COUNT := 0;
B := B_PORT; DONE <= '0';
M := B"0000";
while (COUNT < 4) loop
if (A(0) = '1') then
M := M + B;
end if;
A := SHR(A, M(0));
M := SHR(M, '0');
COUNT := COUNT + 1;
end loop;
M_OUT <= M & A;
DONE <= '1';
end process;
end SHIFT_MULT;
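The shift-and-add loop of SHIFT_MULT can be traced with ordinary integers. The Python sketch below follows the loop body step by step. One assumption differs from the listing: M is kept as an unbounded integer, so the carry out of `M + B` is retained and shifted back in, whereas the listing shifts a constant '0' into M and leaves the vector widths unconstrained. The function name and the `width` parameter are illustrative.

```python
def shift_multiply(a_port, b_port, width=4):
    """Multiply two unsigned `width`-bit numbers by repeated add and shift."""
    a, b, m = a_port, b_port, 0
    for _ in range(width):                       # while COUNT < 4
        if a & 1:                                # if A(0) = '1'
            m = m + b                            #   M := M + B (carry kept, see assumption)
        a = (a >> 1) | ((m & 1) << (width - 1))  # A := SHR(A, M(0))
        m = m >> 1                               # M := SHR(M, ...)
    return (m << width) | a                      # M_OUT <= M & A

print(shift_multiply(3, 5))
print(shift_multiply(15, 15))
```

After four iterations the high half of the product sits in M and the low half has been shifted into A, which is why the result is the concatenation of the two.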
4.35

CURRENT ISSUES
Raise abstraction level
User−interaction / annotation
Design frameworks
Unified design representation
Constraint specification & representation
Design verification
4.36

FUTURE DIRECTIONS
Specialized languages
Variety of Intermediate forms
Architectural taxonomies
Modeling guidelines
Design scenarios
4.37

Design Representation
Chapter 5
5.1

ROLE OF INTERMEDIATE
REPRESENTATION
Synthesis
tools
T2
T1 T3
L1
A1
Input Canonical
L2 Target
intermediate A2
HDLs architectures
representation
A3
L3
− Database for complete design information
− Uniform view across tools and users
− Language independent
− Support all architectural styles
5.2

High−Level Synthesis Trajectory
VHDL Description
Compilation, Transformations
Value Lifetimes,
Control & Data Dependencies CDFG
Scheduling Partition into control steps,
Allocation Select component types (resources)

Assign resources to ops in each step
FSM DP
Controller Structure
5.3

SHIFT−MULTIPLIER: VHDL BEHAVIOR
Block symbol: inputs A_PORT, B_PORT, START, CLK; outputs M_OUT, DONE
entity MULT is
port ( A_PORT,
B_PORT: in bit_vector(3 downto 0);
M_OUT: out bit_vector(7 downto 0);
CLK: in CLOCK;
START: in BIT;
DONE: out BIT
);
end MULT;
architecture SHIFT_MULT of MULT is
begin
process
variable A, B, M : BIT_VECTOR;
variable COUNT : INTEGER;
begin
wait until (START = '1');
A := A_PORT; COUNT := 0;
B := B_PORT; DONE <= '0';
M := B"0000";
while (COUNT < 4) loop
if (A(0) = '1') then
M := M + B;
end if;
A := SHR(A, M(0));
M := SHR(M, '0');
COUNT := COUNT + 1;
end loop;
M_OUT <= M & A;
DONE <= '1';
end process;
end SHIFT_MULT;
5.4

DESIGN FLOW IN HLS
(Shift−Multiplier Example)
Control−flow graph
Data−flow graphs
Read ‘1’
START
0 1
=
Read Read ‘0’

A_PORT B_PORT
B1
Write
Write A Write B COUNT
‘0’ B"0000"
0 1 Write Write M
DONE
Read Read ‘4’

‘1’ COUNT
A[0]
B4 0 1 = <
Read M Read A
Read M Read B
& ‘1’
B2
Write Write
+
M_OUT DONE
Write M
Read A Read
M[0]
Read ‘1’
COUNT SHR Read M ‘0’
B3 + Write A SHR
Write
COUNT Write M
5.5

SCHEDULED CDFG (Shift−Multiplier Example)
START = 1
S0 0 1
B1
A := A_PORT; COUNT := 0; A_PORT B_PORT
B := B_PORT; DONE <= ‘0’;

M := B"0000";
Mult A_Reg Count_Reg B_Reg
S1 COUNT < 4 Shift1 Shift2 Compar

Adder
0 1
B4
Concat
M_OUT <= M & A; DONE <= ’1’;
START
A(0) = ’1’ CLK

0 1 B2 DONE M_Out
S M := M + B ;
2
ENDIF
Initial Allocation
B3
A := SHR(A, M(0)); COUNT := COUNT + 1;
S
3 M := SHR(M, ‘0’);
5.6

DESIGN VIEW: Finite State Machine with DataPath (FSMD)
Present Next
Condition Value Actions
State State
A := A_PORT;
B := B_PORT;
T COUNT := 0; S1
S0 START = 1 DONE := ’0’;
M := "0000";
A_PORT B_PORT
F S0
START
T S2
S1 COUNT < 4 CLK
M_OUT := M @ A;
F DONE := ’1’; S0 DONE M_OUT
T M := M + B; S3
S2 A(0) = 1
I/O ports
F S3
A := SHR(A, M(0));
M := SHR(M, ’0’);
S3 COUNT := COUNT + 1; S1
FSMD state table

5.7

SHIFT−MULTIPLIER: AFTER ALLOCATION
A_PORT B_PORT
Present Next Mux1 Mux2

Condition Value Actions
State State
Mult Count_Reg
Control Unit
A_Reg B_Reg
1 S2
1
S1 Compar.LT Concat(OP: concat, INPS: Mult, A_Reg);
0 M_OUT(OP: load, INPS: Concat);
Mux5(OP: c1, INPS: ’0’, ’1’); S0 Mult(0) Mux3 Mux4
0
DONE(OP: load, INPS: Mux5);
Mux3(OP: c0, INPS: Mult, Count_Reg);
Mux4(OP: c0, INPS: B_Reg, "0001");
1 Adder(OP: add, INPS: Mux3, Mux4); S3 Shift1 Shift2 Adder
START
Mux1(OP: c1, INPS: Shift1, Adder); 4
S2 A_Reg(0) Concat
Mult(OP: load, INPS: Mux1);
0 S3 CLK A_Reg(0) Compar

Compar.LT 0 1
Mux5
M_OUT DONE
Component−based state table Partial design
5.8

SHIFT−MULTIPLIER: After Control Generation
A_PORT B_PORT
Mux1 Mux2
Present Condition Actions Next
State Value State Control unit
S0 & START = 1
S0 & ~(START = 1)
S1 & COUNT < 4
S1 & ~(COUNT < 4)
S2 & A(0) = 1
S2 & ~(A(0) = 1)
S3
Mux1 − − − − 1 − 0 Mult Count_Reg
Mux2 1 − − − − − 0
Mux3 − − − − 0 − 1 A_Reg B_Reg
Mux4 − − − − 0 − −
0001
1 S2 Load A_Reg 1 − − − − − 1
Load B_Reg 1 − − − − − − 0
S1 Compar.LT Clear Count_Reg 1 − − − − − − Mult(0)
Load Count_Reg − − − − − − 1
0 DONE := 1; S0 Clear Mult 1 − − − − − − Mux3 Mux4
Load Mult − − − − 1 − 1
Adder − − − − 1 − Shift1 Shift2
1
Mux3.sel := 0; Shift1 − − − − − − 1
Mux4.sel := 0; Shift2 − − − − − − 1
1 Adder.add := 1; S3 DONE 0 − − 1 − − − Adder
A_Reg(0) Mux1.sel := 1; Next State s1 s0 s2 s0 s3 s3 s1
S2 0100
Mult.load := 1;
0 S3 Compar
State Reg
START Concat
A_Reg(0)
Compar.LT
Symbolic Control Table CLK

DONE M_OUT
Complete Design
5.9

HDL COMPILATION
A := B + C;
D := A * E;
X := D − A;
HDL description
Read B Read C Read B Read C
Stmt1
+ +
Write A
Read E
Read A Read E
Stmt2 * *
Write D
Read D Read A
Stmt3 −
−
Write X Write X
Parse tree DFG
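Going from the statement list to the data-flow graph amounts to connecting each operator directly to the operator that produced its operand, instead of going through an intermediate write and read. A minimal Python sketch of that construction follows; the tuple format for statements and the name `build_dfg` are assumptions made for illustration.

```python
def build_dfg(statements):
    """statements: list of (target, op, left, right). Returns (nodes, edges)."""
    producer = {}            # variable name -> index of the operator node that last wrote it
    nodes, edges = [], []
    for target, op, left, right in statements:
        index = len(nodes)
        nodes.append(op)
        for operand in (left, right):
            # Use the producing operator if there is one, otherwise a primary-input read.
            source = producer.get(operand, "Read " + operand)
            edges.append((source, index))
        producer[target] = index
    return nodes, edges

# A := B + C;  D := A * E;  X := D - A;
nodes, edges = build_dfg([("A", "+", "B", "C"),
                          ("D", "*", "A", "E"),
                          ("X", "-", "D", "A")])
print(nodes)
print(edges)
```

The adder output feeds both the multiplier and the subtractor, so A and D never appear as stored values in the graph; only X remains to be written.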
5.10

CONTROL AND DATAFLOW
REPRESENTATION
1 2 E
case C is
when 1 => X := X + 2;
A := X + 5; X := X + 2; A := X + 3; A := X + W;
when 2 => A := X + 3; A := X + 5;
when others => A := X + W;
end case;
VHDL Description Control flow representation
Read X
+ 5 3 Read W
+ + +
1 2 E 1 2 E
Read C
Write X Write A
Dataflow representation
5.11

DATAFLOW WITH
PRECEDENCE ARCS
b <= a + 1;
a <= b + 1;
Concurrent VHDL
Read b Const 1 Read a
+ +
Write b Write a
Representation
5.12

VARIABLE ACCESS REPRESENTATION
Read b Const 1 b 1
+ +
a := b + 1; Write a
a
b := a + 1; Read a
+ +
Seq. VHDL
Write b b
DFG with DFG with

variable nodes variable traces
5.13

TIMING REPRESENTATION
Read a loop_init
Read b
Read req
delay: loop_join
min=500, max=1000 +
delay
delay min 100
min 50 max 1000
Const 1 max 90
shr loop_test
0 1
Write ack
loop_body
Write c loop_exit
Dataflow annotation Timing in DFG Timing in CFG
5.14

OTHER FLOW GRAPH
REPRESENTATIONS
Data flow
1. case C is
Read C Read X Read W
2. when 1 => X := X + 2;
3. A := X + 5; +
4. when 2 => A := X + 3;
5. when others => A := X + W; Control flow Write A
6. end case;
Const 3 Read X
1 2 E
VHDL Description +
Write A
Read X Const 2
+ Const 5
+
Partitioned CDFG
Write X Write A
Data flow
DX DW DC
BR BR
Control flow C Data flow
E E
1 2
1
2 3 (C=1) (C=E) 2 = 1
(C=2)
+ + + 2 + 2 X
5
5 W
+ 3 4 5
3
2 2
1 E 1
E + 3 + 4 +
5
6
ME ME
A
UX UA
DeJong’s hybrid flow graph SSIM flow graph
5.15

VHDL PROCESS
REPRESENTATION
Process <=> Control/Data Flow Graph (CDFG)
Sequential execution
Control Flow Graph
Data Flow Graph
Data_bus.r
VHDL process
Creg.w
P1: process
begin
CREG := DATA_BUS;
CREG = B"00" Count.r 1
if CREG = B"00" then
COUNT := COUNT + 1;
else
COUNT := 0; +
end if
Count.w
end process;
Count.w
5.16

VHDL BLOCK
REPRESENTATION
VHDL block <=> Directed acyclic Graph (DAG)
Parallel execution
DAGs
VHDL Block B.r 1
architecture swap of design is +

begin
L: block (clock’event AND clock = ’1’)
begin A.w
L1: A <= guarded B + 1;

L2: B <= guarded A + 1;
A.r 1
end L;
end swap;
+
B.w
5.17

TRANSFORMATIONS
1. Compiler transformations
2. Flowgraph transformations
3. Hardware transformations
5.18

COMPILER TRANSFORMATIONS
Const 4 Const 5 Const 9
Write C Write C
Constant folding
Read A Read B Read A Read B Read A Read B
* * *
Write C Write D Write C Write D
Redundant operator elimination
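Both compiler transformations can be shown on a tiny expression representation. In the Python sketch below, expressions are nested tuples `(op, left, right)` with integers as constants and strings as variable names; this representation and the function names are assumptions for illustration, and the redundancy check is only valid when no operand is reassigned between the two statements.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Constant folding: replace an operator whose operands are both constants by its value."""
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = fold(left), fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return OPS[op](left, right)
        return (op, left, right)
    return expr  # constant or variable name

def eliminate_redundant(assignments):
    """Redundant operator elimination: reuse the target that already holds an identical expression."""
    seen = {}
    result = []
    for target, expr in assignments:
        if isinstance(expr, tuple) and expr in seen:
            result.append((target, seen[expr]))   # copy instead of recomputing
        else:
            if isinstance(expr, tuple):
                seen[expr] = target
            result.append((target, expr))
    return result

print(fold(("+", 4, 5)))
print(eliminate_redundant([("C", ("*", "A", "B")), ("D", ("*", "A", "B"))]))
```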
5.19

SIGNAL ATTRIBUTE TRANSFORMATIONS
Read Read Read Const

X = 1 and not(X’stable) Signal X Const 1 Stable
1 2 3
= =
NOT
4
6
Read
Signal X
AND
7 sensitivity: EDGE
active edge: POSITIVE
5.20

TREE−HEIGHT TRANSFORMATION
a := b + c − d + f − g + h + k;
VHDL code
Initial parse tree:
    t1 := h + k
    t2 := g + t1
    t3 := f − t2
    t4 := d + t3
    t5 := c − t4
    a := b + t5
After tree height reduction:
    t1 := b + c
    t2 := d + f
    t3 := g + h      (t1, t2, t3: potential parallelism)
    t4 := t1 − t2
    t5 := t3 + k     (t4, t5: potential parallelism)
    a := t4 − t5
5.21
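As a rough illustration of tree-height reduction, the sketch below balances a flat chain of additions and subtractions into a tree of logarithmic height. It is not taken from the book: the signed-term encoding and the names `balance`, `height` and `evaluate` are assumptions made for this example, and the expression is read left to right without parentheses.

```python
def balance(terms):
    """terms: list of (sign, name), sign is +1 or -1. Returns (sign, tree)."""
    level = list(terms)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (s1, e1), (s2, e2) = level[i], level[i + 1]
            if s1 == s2:                       # same sign: add, keep the sign
                nxt.append((s1, ("+", e1, e2)))
            elif s1 > 0:                       # +e1 and -e2
                nxt.append((+1, ("-", e1, e2)))
            else:                              # -e1 and +e2
                nxt.append((+1, ("-", e2, e1)))
        if len(level) % 2:                     # odd element moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def height(tree):
    return 0 if isinstance(tree, str) else 1 + max(height(tree[1]), height(tree[2]))


def evaluate(tree, env):
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "+" else a - b


# b + c - d + f - g + h + k, read left to right
TERMS = [(+1, "b"), (+1, "c"), (-1, "d"), (+1, "f"), (-1, "g"), (+1, "h"), (+1, "k")]
sign, tree = balance(TERMS)
chain_height = len(TERMS) - 1        # a pure chain needs one level per operator
balanced_height = height(tree)
env = dict(b=1, c=2, d=3, f=4, g=5, h=6, k=7)
value = sign * evaluate(tree, env)
```

With seven operands the chain has height 6 while the balanced tree has height 3, which is where the potential parallelism comes from.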

CONTROL−TO−DATAFLOW
TRANSFORMATIONS
if (X = 0) then
A := B + C;
D := B − C;
else
D := D − 1;
end if;
Textual representation
[Figure: CF representation. If_test on X = 0 branches to Stmt_blk1 (A := B + C; D := B − C) or Stmt_blk2 (D := D − 1); both rejoin at If_join.]
[Figure: DF representation. B + C, B − C and D − 1 are all computed; the result of X = 0 selects the values written to A and D.]
5.22

CONTROL FLATTENING
Stmt_blk1
Stmt_blk1
If_test
0 1
If_test
Stmt_blk2 0 1
Stmt_blk3 Stmt_blk6
Stmt_blk3
If_test
0 1
If_join
Stmt_blk4 Stmt_blk5
If_join
If_join
Stmt_blk7
Original CFG Final CFG
5.23

LOGIC−LEVEL
TRANSFORMATIONS
c = (a’ NAND (a NAND b)) = a
HDL boolean expression
Read a Read b
NOT NAND Read a
NAND Write C
Write C
Original flow graph After logic optimization
5.24

RT TRANSFORMATION
RT−level function recognition
RT−Component specific transformation
A B 1 A B
add
n1 + n3
inc
n2 +
Adder/Incrementer Transformation
5.25

COMPLEX FUNCTION
RECOGNITION
A B "0001"
case F is + −
when "00" => OUT <= A + B;
when "01" => OUT <= A + B + "0001";
when "10" => OUT <= A − B; + + − +
when "11" => OUT <= A − B + "0001";
end case; "00" "01" "10" "11"
F
OUT
VHDL DF graph
A B
A B
+ AI − SI
"00" : +
F "01" : AI
"00" "01" "10" "11"
F "10" : −
"11" : SI
OUT
OUT
Simplified DF graph Complex−node DF graph
5.26

FUTURE DIRECTIONS
1. Common representation
2. Different views
3. Different architectures
4. Description disambiguation
5. Layout−Driven Transformations
6. Transformation scripts
7. Representation for interactive design
5.27

Partitioning
Chapter 6
6.1

PARTITIONING
Used in HLS for:
Scheduling
Allocation
Unit selection
Chip partitioning
Problem decomposition
for tractability
6.2

COMPONENT PARTITIONING
a
FF1 FF2 c
b G1
a v1 v2 v3
c
ni nj b
Cutline e36
d e24
REG. v5
e d
e v6
e v4 25 G2
f
f G
g
g
Design structure Graph model
a b
Chip 1 c
n n
i j
Chip 2
d e f g
Partitioned Design
6.3

BEHAVIORAL PARTITIONING
−Time utilization
−Component utilization

entity EXAMPLE is
    port (I1, I2, I3 : in integer;
          O1 : out integer);
    signal B, F, H : integer;
end entity;
architecture BEHAVIOR of EXAMPLE is
begin
    process1
        var : A, C, E : integer;
        while (I1 > 0) loop
            if (B > 0)
                B <= C − I2;
            else
                B <= A − I2;
            end if;
            wait until (H > 0);
        end loop;
    end process1;
    process2
        var : D : integer;
        wait until (B <= 0);
        D := I3 + B;
        F <= I3 + I1;
    end process2;
    process3
        var : G : integer;
        wait until (F > 0);
        O1 <= I3 + G;
        H <= I3 + I1;
    end process3;
end
Textual Description

[Figure: Inter−process communication among process1, process2 and process3 through signals B, F and H, with ports I1, I2, I3 and O1]
[Figure: Control/Data Flow Graph of the three processes]
6.4

PARTITIONING TECHNIQUES
Constructive methods
1. Random selection
2. Cluster growth
3. Hierarchical clustering
Iterative−Improvement methods
1. Min−cut partitioning
2. Simulated annealing
6.5

CLUSTER GROWTH ALGORITHM
Algorithm 6.1
page 185
6.6

HIERARCHICAL CLUSTERING
Graph Closeness measure Cluster tree
v1
v1 v v v v
5 4 2 3 4 5 v(24)
v2 v1 − − − − −
1 v3
v2 5 − − − − v1 v2 v4 v3 v5
6 3 v3 4 1 − − −
v5 v4 0 6 0 − −
v4 v5 0 3 0 0 −
(a)
v(241)
v1 v1 v(24) v v5
3 v(24)
5 4 v1 − − − −
1 v(24) 5 − − −
v3 v1 v2 v4 v3 v5
v(24) v3 4 1 − −
3 v5 0 3 0 −
v5
(b)
v(2413)
v(241)
v(241) v3 v5 v(241)
v(241) v(24)
v3 − − −
4
v3 4 − −
3 v5 3 0 − v1 v2 v4 v3 v5
v5
(c)
v(24135)
v(2413)
v (2413) v(2413) v5 v(241)
v(24)
v(2413) − −
v5 3 −
3 v1 v2 v4 v3 v5
v5
Cluster tree formation
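The cluster-tree construction in this figure can be reproduced with a short sketch. The pairwise closeness values are read from table (a). The rule for the closeness of a merged cluster (the maximum over its members) is an assumption; it was chosen because it reproduces the values 5, 4 and 3 shown in tables (b) to (d). Function names are hypothetical.

```python
from itertools import combinations

# Closeness values from table (a); unlisted pairs have closeness 0
CLOSENESS = {
    ("v1", "v2"): 5, ("v1", "v3"): 4, ("v2", "v3"): 1,
    ("v2", "v4"): 6, ("v2", "v5"): 3,
}


def closeness(a, b):
    """Closeness of two clusters: maximum closeness over all member pairs (assumed rule)."""
    return max(CLOSENESS.get((x, y), CLOSENESS.get((y, x), 0)) for x in a for y in b)


def cluster_tree(objects):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(o,) for o in objects]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: closeness(*pair))
        merges.append((a, b, closeness(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges


merges = cluster_tree(["v1", "v2", "v3", "v4", "v5"])
merge_levels = [level for _, _, level in merges]
```

The merge order is v2 with v4, then v1, then v3, then v5, matching the cluster tree v(24), v(241), v(2413), v(24135).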
6.7

HIERARCHICAL CLUSTERING
Algorithm 6.2
page 188
6.8

CLUSTERING WITH SEVERAL
CRITERIA
Criterion A
a b c
First
3 4 1 2
5 cutline Criterion A
d e f
f c a e d b
Criterion B (a)
a b 3 c
First
2 cutline
1
5
4
Criterion B
d e f
c e f b a d
(b)
{f,c} {d}
3
5 1
2 A then B
{a,e} {b} {a,e} {f,c} {b} {d}
(c)
4
{c,e} {a,d}
5 1
{f} {b}
B then A
{c,e} {f} {a,d} {b}
(d)
Second
cutline 5
{f,c} {a,e,d}
3 1
A then B
{b} {a,e,d} {f,c} {b}
f c a e d b
6.9

ITERATIVE ALGORITHMS
G1 G2
Cutline
Two−way partitioning (Kernighan−Lin)
Start with 2 equal subgraphs
Each iteration, exchange

K−pairs between partitions
Continue until no further

improvement
6.10

MIN−CUT PARTITIONING
Interconnection Reduction
Cutline Cutline
v v v v
i j j i
G1 G2 G1 G2
Before interchange After interchange

of Vi and Vj of Vi and Vj
6.11

MIN−CUT PARTITIONING
Algorithm 6.3
page 194
6.12

MIN−CUT SEARCH STRATEGY
GAIN(k)
20
10
5
k
1 2 3 4 5 6 7 8 9 10
−5
GAIN (5) is maximum.
Thus, perform the first 5 exchanges
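The selection rule on this slide amounts to finding the prefix of tentative exchanges with the largest cumulative gain. A minimal sketch follows; the per-exchange gains are hypothetical numbers chosen only so that the maximum falls at k = 5 as in the plot, and `best_prefix` is an illustrative name.

```python
from itertools import accumulate


def best_prefix(step_gains):
    """Return (k, GAIN(k)) for the k that maximizes the cumulative gain.

    Returns (0, 0) when no prefix improves the cut.
    """
    cumulative = list(accumulate(step_gains))
    k = max(range(len(cumulative)), key=lambda i: cumulative[i]) + 1
    if cumulative[k - 1] <= 0:
        return 0, 0
    return k, cumulative[k - 1]


# Hypothetical gains of ten tentative pair exchanges
STEP_GAINS = [5, 5, -2, 4, 8, -6, -9, -7, -3, 3]
k, gain = best_prefix(STEP_GAINS)
```

Accepting a temporarily negative step (the third exchange here) is what lets the method climb out of a local minimum.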
6.13

SIMULATED ANNEALING
Algorithm 6.4
page 197
6.14

CLUSTERING EXAMPLE
a b c
o1 o2 e13
(+) G1 (+) add1
Two cluster
mult1
G1
e
partition
e13 e23 23 G2
o3
G2
*
( )
a b c
o1 o2
G1 G2 e13
add1
(+) (+)
G1
mult1 Three cluster
e13 e
G3
partition
23 add2
o3 G2 e
G3 23
*
( )
6.15

CLUSTERING FOR UNIT
SELECTION
1. Functional proximity
2. Communication proximity
3. Potential parallelism
4. Closeness distance
6.16

PARTITIONING SCRIPTS IN APARTY
CDFG
User choice
Clustering
Procedure/ Procedure/
Control Data control data Operator
Physical constraints User choice
Cutline
Schedule
Area Connections length
No
Done?
Yes
Partitioned CDFG
6.17

DESCRIPTION PARTITIONING
main
num_msgs : register(8); reset system_off
user_id_ram : memory(4x4);
system_on not system_on
system_on
initialize respond_to_machine_button
machine_button_pushed
respond_to_external_line
monitor dialtone
answer
dialtone
play_announcement record_msg
tone=1 tone=1
remote_operation
check_user_id respond_to_cmds
code_ok
dialtone not code_ok
Answering machine description
check_user_id entered_code : memory(4x4);

i : integer range 1 to 5;
i := 1;
while i <= 4 loop
wait until button_tone;
entered_code[i] := button;
i := i + 1;
end loop;
if (entered_code[1] = user_id_ram[1]) and
...
(entered_code[4] = user_id_ram[4]) then
code_ok <= true;
else
code_ok <= false;
end if;
Object description
6.18

PARTITIONED DESCRIPTION
num_msgs (200) AREA: 12641

8
PINS: 62
2
2
main (3412) respond_to_machine_button (3461) respond_to_cmds (5568)
external ports
17 24
8 3 2 1 1 4
2
monitor (4489) check_user_id (4272) user_id_ram (750) AREA: 9511
7 PINS: 21
6.19

FUTURE DIRECTIONS
Partitioning of CDFG
Partitioning of specifications
Estimation of quality measures
Software/Hardware partitioning
6.20

Scheduling
Chapter 7
7.1

High−Level Synthesis Trajectory
VHDL Description
Compilation, Transformations
Value Lifetimes,
Control & Data Dependencies CDFG
Scheduling Partition into control steps,
Allocation Select component types (resources)

Assign resources to ops in each step
FSM DP
Controller Structure
7.2

Scheduling
Definition: Task of assigning behavioral operators to control steps
Input: CDFG (data lifetimes with control and data dependencies)
Output: Temporal ordering of individual operations (FSM states)
Constraints: Hardware Resources, Timing, Testability, Power, ...
Goal: Exploit parallelism to achieve fastest design within constraints
7.3

SCHEDULED CDFG
START = 1
S0 0 1
B1
A := A_PORT; COUNT := 0;
B := B_PORT; DONE <= ‘0’;
M := B"0000";
S1 COUNT < 4
0 1
B4
M_OUT <= M & A; DONE <= ’1’;
A(0) = ’1’
0 1 B2
S M := M + B ;
2
ENDIF
B3
A := SHR(A, M(0)); COUNT := COUNT + 1;
S
3 M := SHR(M, ‘0’);
7.4

Scheduling: Assumptions
Target Architecture:
Pipelining, Clocking, Busing, Component Sets, ...
Language Constructs Permitted:

Conditionals, Loops, Arrays, ADTs,...
Temporary Assumptions for Illustration:

1. No pipelining, single−phase clock
2. Straight−line code, simple data−types
3. Each operation executes in 1 control step
4. Each operation performed by 1 component type
7.5

Scheduling Example: HAL
u dx 3 x u dx x dx
v v
* v * 4 + 10
* v 2
1 e
while (x < a) loop e e2,5
4,9
1,5 y
x1 := x + dx; y
e
10,11
* v * v + v
9
3
u1 := u − ( 3 * x * u * dx ) − (3 * y * dx ); 5
u e e a
y1 := y + (u * dx); 3,6 dx
5,7
x := x1; u := u1; y := y1;
− v < v
end loop; v * 6 11
7
e
7,8 e
6,8
c
− v
8
VHDL Behavior DFG Representation
7.6

ASAP ALGORITHM
Algorithm 7.1
page 217
7.7

ALAP ALGORITHM
Algorithm 7.2
page 219
(Check errata)
7.8

HAL: ASAP and ALAP Schedules
v1 v2 v3 v4 v
10
v1 v2
E=1 * * * + L=1
* * *
E=2 * * + v < L=2 * v * v

v5 v6
9
v11 5 3
v v v v
7 7
v4 10
6 +
E=3 − L=3 − * *
v8 v v9 v11
E=4 − L=4 − 8 + <
ASAP Schedule ALAP Schedule
7.9

Scheduling Formulations
Resource Constrained Scheduling:

Minimize control steps for given resources
List−Based Scheduling
Static−List Scheduling [JMSW91]
Time Constrained Scheduling:

Minimize resources for given number of time steps
Integer Linear Programming(ILP) [LeHL89]
Force−Directed Scheduling (FDS) [PaKn89]
Iterative Refinement [PaKy91]
7.10

LIST−BASED SCHEDULING
Algorithm 7.5
page 234
7.11

HAL: List Scheduling
ASAP Schedule Operator Mobility ALAP Schedule

1 2 3 4 5 6 7 8 9 10 11
v1 v2 v3 v4 v
10 s1 v1 v2
E=1 * * * + L=1
* * *
s2
E=2 * * + v < L=2 * v * v
v5 v6
9
v11 5 3
v v v v
7 s3 7
v4 10
6 +
E=3 − L=3 − * *
v8 s4 v v9 v11
E=4 − L=4 − 8 + <
Node:      v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11
Operation: *  *  *  *  *  *  −  −  +  +   <
Mobility:  0  0  1  2  0  1  0  0  2  2   2
Mobility(op) = ALAP(op) − ASAP(op)
Maintain priority list for each component type

Schedule critical nodes first
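The mobilities in the table can be recomputed from the HAL data-flow graph. The sketch below is an illustration, not the book's Algorithms 7.1 and 7.2: the edge list is read off the e(i,j) labels of the HAL DFG slide, every operation is assumed to take one control step, and the function names are hypothetical.

```python
# Dependence edges (producer, consumer) of the HAL example
EDGES = [(1, 5), (2, 5), (3, 6), (4, 9), (5, 7), (6, 8), (7, 8), (10, 11)]
NODES = range(1, 12)


def asap(nodes, edges):
    """Earliest step: one more than the latest predecessor, 1 for sources."""
    preds = {v: [u for u, w in edges if w == v] for v in nodes}
    step = {}

    def visit(v):
        if v not in step:
            step[v] = 1 + max((visit(u) for u in preds[v]), default=0)
        return step[v]

    for v in nodes:
        visit(v)
    return step


def alap(nodes, edges, last_step):
    """Latest step: one less than the earliest successor, last_step for sinks."""
    succs = {v: [w for u, w in edges if u == v] for v in nodes}
    step = {}

    def visit(v):
        if v not in step:
            step[v] = min((visit(w) for w in succs[v]), default=last_step + 1) - 1
        return step[v]

    for v in nodes:
        visit(v)
    return step


E = asap(NODES, EDGES)
L = alap(NODES, EDGES, last_step=max(E.values()))
mobility = {v: L[v] - E[v] for v in NODES}
```

Operations with mobility 0 (v1, v2, v5, v7, v8) form the critical path and are the ones a list scheduler places first.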
7.12

List Scheduling (Cont’d)
ASAP + Priority function (mobility) for each resource

Resource conflicts resolved by priority function
Constructive Method: schedule, reevaluate priority, sort,...
Ready List (operation<mobility>):
PList* : 1<0>, 2<0>, 3<1>, 4<2>
PList+ : 10<2>
PList− : NIL
PList< : NIL
Resource Constraints:
Resources* : 2
Resources+ : 1
Resources− : 1
Resources< : 1
Scheduled DFG:
s1: *1, *2, +10
s2: *3, *5, <11
s3: *4, *6, −7
s4: −8, +9
[Figure: DFG with mobilities]
7.13
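A minimal sketch of resource-constrained list scheduling on the HAL example follows. It is a simplification of the method on the slide: the priority is the static mobility from the table rather than a value reevaluated after every step, each operation is assumed to take one control step, and all names are illustrative.

```python
EDGES = [(1, 5), (2, 5), (3, 6), (4, 9), (5, 7), (6, 8), (7, 8), (10, 11)]
OP = {1: "*", 2: "*", 3: "*", 4: "*", 5: "*", 6: "*",
      7: "-", 8: "-", 9: "+", 10: "+", 11: "<"}
MOBILITY = {1: 0, 2: 0, 3: 1, 4: 2, 5: 0, 6: 1, 7: 0, 8: 0, 9: 2, 10: 2, 11: 2}
RESOURCES = {"*": 2, "+": 1, "-": 1, "<": 1}


def list_schedule(edges, op, mobility, resources):
    """Return {node: control step} using mobility as the priority function."""
    preds = {v: set() for v in op}
    for u, v in edges:
        preds[v].add(u)
    step_of, step = {}, 0
    while len(step_of) < len(op):
        step += 1
        # Ready: unscheduled ops whose predecessors all finished in an earlier step
        ready = [v for v in op if v not in step_of
                 and all(step_of.get(p, step) < step for p in preds[v])]
        ready.sort(key=lambda v: (mobility[v], v))   # most critical first
        free = dict(resources)
        for v in ready:
            if free[op[v]] > 0:                      # a unit of this type is still free
                free[op[v]] -= 1
                step_of[v] = step
    return step_of


schedule = list_schedule(EDGES, OP, MOBILITY, RESOURCES)
by_step = {s: sorted(v for v in schedule if schedule[v] == s)
           for s in sorted(set(schedule.values()))}
```

With two multipliers and one unit of each other type, the sketch needs four control steps, the same number as the unconstrained ASAP schedule.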

STATIC−LIST SCHEDULING
Priority List:
node      8  9 11  7  6  4 10  5  3  1  2
ALAP      1  1  1  2  2  2  2  3  3  4  4
ASAP      4  2  2  3  2  1  1  2  1  1  1
priority  1  2  3  4  5  6  7  8  9 10 11
[Figure: DFG of the HAL example with nodes *1, *2, *3, *4, *5, *6, −7, −8, +9, +10, <11]
− + < − + <
* * * *
+ s * * + 10
* 2 * 1 10 1 2 1
s * 3 * 5 < 11
* 3 * 5 2
s * 4 * 6 − 7
3
s4 − 8 + 9
Partial Schedule Final Schedule
7.14

Scheduling Formulations
Resource Constrained Scheduling:

Minimize control steps for given resources
List−Based Scheduling
Static−List Scheduling [JMSW91]
Time Constrained Scheduling:

Minimize resources for given number of time steps
Integer Linear Programming(ILP) [LeHL89]
Force−Directed Scheduling (FDS) [PaKn89]
Iterative Refinement [PaKy91]
7.15

HAL: ILP FORMULATION
1 2 3 4 5 6 7 8 9 10 11
s1
s2
s3
s4
Operation ranges
s1 v1 v2
* *
v5 v3 v10
s2
* * +
v7 v6 v
s3 − 4
* *
s4 v8 v v11
− + 9 <
Final Schedule
7.16

ILP FORMULATION
Formulation on
pages 221−222
(Check errata)
7.17

HAL: ILP FORMULATION
1 2 3 4 5 6 7 8 9 10 11
s1
s2
s3
s4
Eq. 7.5 with

Operation ranges
constraints
s1 v1 v2 (pages 222−223)
* *
v5 v3 v10
s2
* * +
v7 v6 v
s3 − 4
* *
s4 v8 v v11
− + 9 <
Final Schedule
7.18

Force−Directed Scheduling [PaKn89]
Time−constrained scheduling, minimize resources
Goal: Achieve high unit utilization by uniformly distributing operations of a particular type over all states
Iterative Approach:
Determine global effect of scheduling an operation into a state
Compute probability of scheduling op into a state
Compute expected operator cost for each op type
Schedule each operation to balance hardware utilization
Choose assignment of op to state with min cost
Recompute probability distr. and expected op costs
7.19

HAL: Force−Directed Scheduling
[Figure: Probabilities of op scheduling (uniform distr from mobility) for states s1 to s4]
Operator cost for Mult (*) (sum of op prob’s per state):
s1: 2.83
s2: 2.33
s3: 0.83
s4: 0.00
For each iteration of FDS do:

Compute Expected Operator Costs (EOC) for scheduling
each operator type in every state
Assign one operation to control step based on minimal EOC
7.20

HAL: Force−Directed Scheduling(Cont’d)
[Figure: Probabilities of op scheduling for states s1 to s4]
Operator cost for Mult (*):
s1: 2.83
s2: 2.33
s3: 0.83
s4: 0.00
[Figure: Probabilities after o3 is scheduled in s2]
Revised operator cost for Mult (*):
s1: 2.33
s2: 2.33
s3: 1.33
s4: 0.00

7.21
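The multiplier costs quoted on the two preceding slides can be recomputed as a sum of scheduling probabilities. The sketch below assumes, as the slides state, a uniform distribution of each operation over its ASAP to ALAP range; the ranges are derived from the HAL mobilities and the function name is hypothetical.

```python
def distribution(ranges, steps):
    """Sum, per control step, the probability that each operation lands there."""
    cost = {s: 0.0 for s in steps}
    for first, last in ranges.values():
        probability = 1.0 / (last - first + 1)   # uniform over the mobility range
        for s in range(first, last + 1):
            cost[s] += probability
    return cost


STEPS = range(1, 5)
# (ASAP, ALAP) ranges of the six multiplications v1..v6 in the HAL example
MULT_RANGES = {1: (1, 1), 2: (1, 1), 3: (1, 2), 4: (1, 3), 5: (2, 2), 6: (2, 3)}
before = distribution(MULT_RANGES, STEPS)

# Fixing v3 in s2 also forces its successor v6 into s3
after = distribution({**MULT_RANGES, 3: (2, 2), 6: (3, 3)}, STEPS)

before_rounded = [round(before[s], 2) for s in STEPS]
after_rounded = [round(after[s], 2) for s in STEPS]
```

Scheduling v3 in s2 lowers the peak multiplier cost from 2.83 to 2.33, which is the balancing effect force-directed scheduling aims for.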

FORCE−DIRECTED SCHEDULING
Algorithm 7.3
page 227
7.22

Iterative Refinement Scheduling [PaKy91]
FDS: Constructive, no backtracking (rescheduling)

Iterative Refinement Scheduling: allows rescheduling
Based on KL Graph Bisection [KeLi70]
Start with any initial schedule
Reschedule one operation at a time, and lock operation
Maximize cumulative gain
s 1 2 3 4 10 s 1 2 3 4 10
1 1
s 5 6 9 11 s 5 9 11
2 2
s 7 s 7 6
3 3
s4 8 s4 8
Initial Schedule After Op 6 is moved and locked

7.23

ITERATIVE RESCHEDULING
Algorithm 7.4
page 231
7.24

Scheduling with Realistic Assumptions
Functional Units with Varying Delays
s
1 +
s
1
*
s s
1 + + 2
s s1
2 − * * * *
s2 s3
− −
unit−delay multicycling chaining pipelining
Multi−Functional Units
ALUs, Shift−registers, etc.
Realistic Design Descriptions

Conditionals, loops, nested loops
7.25

LOOP SCHEDULING
time
b
1 2 3 4 5 6 7 8 9 10 11 12
Sequential Execution
1, 2, 3 4, 5, 6 7, 8, 9 10 , 11, 12
Partial Loop Unrolling
m
1 4 7 10
p 2 5 8 11
3 6 9 12
Loop Folding
7.26

LOOP SCHEDULING (Cont’d)
A B C
s A C
1
I F E D
s I F B
2
K L J H G
s K L D
3
M
s4 M E G
N s5
J N H
Q P R s6
Q P R
DFG with Depend. Seq. Schedule

Across Iterations w/ 3 FU’s
7.27

LOOP UNROLLING AND
LOOP FOLDING
Loop Overhead
A
s A1 A2 B1 C1
1
I F C
s I1 F1 E1 D1
2
L D B
s L1 J1 H1 G1
3
J M E G A
s4 K1 M1 I2 C2
K N H I F C
s5 N1 L2 F2 D2
Q P R L D B
s P1 R1 Q1 B2
6
J M E G
s7 J2 E2 M2 G2 Loop Body
K N H
s8 K2 N2 H2
Q P R
s9 Q2 P2 R2
Loop Overhead
Loop Unrolling Loop Folding
7.28

SIMULATED ANNEALING
FORMULATION
ALU1 ALU2 ALU3
s1 v1 = v2 + v3 v4 = v2 * v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
Initial Schedule
ALU1 ALU2 ALU3
s1 v4 = v2 * v3 v1 = v2 + v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
After Swapping Two Operations
ALU1 ALU2 ALU3
s1 v4 = v2 * v3 v1 = v2 + v3
s2 v5 = v1 + v4 v6 = v4 / v1
s3 v2 = v4 * v5
After Displacing an Operation
7.29

PATH−BASED SCHEDULING
Input Ports: branchpc, ibus, branch, ire

Output Ports: ppc, popc, obus
Variables: pc, oldpc
1 ppc <= pc 1 1
2 Write popc <= oldpc 2 2 Write
obus <= ibus + 4 i2 3

3 3
if (branch = ’1’) 4
4 4
Branch Branch
i1 Branch
Branch 5
5 then pc<= branchpc 5
6 6 6
end if
7 wait until (ire = ’1’) 7 7

Loop Loop
8 oldpc <= pc 8 8
Write Write
9 9 9
pc <= pc + 4
10 10 10
CDFG Constraint Scheduled

Intervals CDFG
7.30

DFG RESTRUCTURING
a b c d e f a b c d e f
+ + + +
1 5 1 5
+ + +
2 2
3
+ +
3
4
+
4
(((a + b) + c) + d) + (e + f) ((a + b) + c) + (d + (e + f))

before after
Tree−Height Reduction
a b d c a b d c
+ +
1 1
+ + +
2 6 2
* *
3 3
+ + +
4 5 4
+
5
d (a + b + c) + ab + ab d (a + b + c) + 2ab
before after
Redundant Operator Insertion
7.31

Scheduling: Future Directions
More realistic libraries

Realistic target architectures
Better cost functions (layout−driven)
Scheduling with allocation
Arbitrary descriptions (nested loops, conditionals)
Loop−pipelining, tree−height reduction
Complex data structures (arrays, ADT)
Application−specific scheduling (RISC, VLIW, DSP,..)
7.32

Allocation
Chapter 8
8.1

Allocation
Selection of components to be used in RT design
Binding of hardware structures (units, regs, connections) to behavioral operators and variables
Define target DP architecture

Clocking, Busing,..
Approaches
Greedy
Decomposition
Iterative
8.2

Unit Selection and Binding: Example
Unit Selection: 2 adders, 4 registers
Mapping Behavior to RT Structure:
a b c d
r1 r2 r3 r4
s1 + o1 + o2 a b, e, g c, f, h d
e f
s2 + o3 + o4
+1, +3 +2, +4
ADD1 ADD2
g h
DFG Allocated RT−Structure
8.3

Point−to−Point DP Interconnection Architectures
Output OutBus1
interconnection
network OutBus2
Output Register file
interconnection
network r3
Register file
r1 r2 r4 r5 r6
r3
r1 r2 r4 r5 r6
InBus1
Input InBus2
Input interconnection
interconnection network InBus3
network
InBus4
ALU1 ALU2
ALU1 ALU2
Mux−oriented DP Bus−oriented DP
Register−transfers
s1: r3 <= ALU1(r1,r2); r1 <= ALU2(r3,r4);
s2: r1 <= ALU1(r5,r6); r6 <= ALU2(r2,r5);
s3: r3 <= ALU1(r1,r6);
8.4

One−Phase Clocking of Point−to−Point DP
Read Read r5 InBus1

Output OutBus1
r1
interconnection
network OutBus2 Read Read InBus2
Register file
r2 r6
r3 InBus3
Read r3 Read r2
r1 r2 r4 r5 r6
Read r4 Read r5 InBus4
InBus1 ALU1
Execute Execute
Input InBus2
interconnection Execute Execute ALU2
network InBus3
InBus4 OutBus1
Write r3 Write r1
Write r6 OutBus2
ALU1 ALU2
Write r1
tr t e t w
Bus−oriented DP
Cycle 1 Cycle 2
Requirement: Cycle Time > tr + te + tw

Sequential execution
8.5

One−Phase Pipelined Datapath
OutBus1
OutBus2
r3
r1 r2 r4 r5 r6 Read r3 Read r2 InBus3
InBus1 Execute Execute ALU1
Execute Execute ALU2
InBus2 Write r3 Write r1 OutBus1
Write r1 Write r6 OutBus2
InBus3
Cycle 1 Cycle 2 Cycle 3
InBus4
ALU1 ALU2
Clock cycle = max ( tr + te, tw)

Overlapping read and write data transfers
8.6

Two−Phase Pipelined Datapath
r3
r1 r2 r4 r5 r6
te
tw tr
InBus1
Read r Read r5 Read r1 InBus1
InBus2 1
Read r Read r6 Read r6 InBus2
InBus3/OutBus1 2
InBus3/
Read r Read r2 Write r Write r1 OutBus1
InBus4/OutBus2 3 3
Read r4 Read r5 Write r1 Write r6 InBus4/
OutBus2
Execute Execute Execute ALU1
Execute Execute Execute ALU2
ALU1 ALU2
Cycle 1 Cycle 2 Cycle 3 Cycle 4
Clock cycle = max (te, tr + tw)

Overlapping data transfers with FU execution
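The three clocking schemes differ only in how the read time tr, execute time te and write time tw combine into a clock period. A small sketch comparing them follows; the delay values are hypothetical and the function name is illustrative.

```python
def cycle_times(t_r, t_e, t_w):
    """Minimum clock period for each datapath clocking scheme on these slides."""
    return {
        "one-phase non-pipelined": t_r + t_e + t_w,        # read, execute, write in sequence
        "one-phase pipelined": max(t_r + t_e, t_w),        # write overlaps the next read/execute
        "two-phase pipelined": max(t_e, t_r + t_w),        # transfers overlap FU execution
    }


# Hypothetical delays in nanoseconds
periods = cycle_times(t_r=10, t_e=40, t_w=10)
```

With these assumed delays the period drops from 60 ns to 50 ns and then to 40 ns, at the cost of extra pipeline latches and, in the two-phase case, shared input/output buses.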
8.7

Allocation Tasks
Unit Selection
Functional−unit Binding
Storage Binding
Interconnection Binding
8.8

Interdependence and Ordering of Binding
a b c d
r1 r2 r3 r4
s1 + o1 + o2 a b, e, g c, f, h d
e f
s2 + o3 + o4
+1, +4 +2, +3
g h ADD1 ADD2
Scheduled DFG FU Binding with 6 Muxes
r1 r2 r3 r4
a,g b,e c,f d,h r1 r2 r3 r4
a, g b, e c, f d, h
+1, +3 +2, +4
+1, +4 +2, +3
ADD1 ADD2
ADD1 ADD2
Register Reallocation: 4 Muxes FU Rebinding: Optimal Design
8.9

Greedy Binding: Examples
r1 r2 r3 r4 r5 r1 r2 r3 r4 r5 r1 r2 r3 r4 r5
Mux1 Bus1 Mux1 Bus1 Mux1 Bus1

ALU1 ALU2 ALU1 ALU2 ALU1 ALU2
Initial partial design add 2 inputs to mux add tristate buffer to bus
r1 r2 r3 r4 r5 r1 r2 r3 r4 r5 r1 r2 r3 r4 r5
Mux2 Mux1 Mux1 Mux2 Bus1

Mux1 Bus1
ALU1 ALU2 ALU1 ALU3 ALU2 ALU1 ALU3 ALU2
add mux to FU input add FU and tristate to bus add FU and mux
r1 r2 r3 r4 r5
Bus1
ALU1 ALU2
convert muxes to shared bus
8.10

CONSTRUCTIVE ALLOCATION
Algorithm 8.1
page 276
8.11

Allocation: Decomposition Methods
Clique Partitioning [TsSi86]

Left−Edge Algorithm [KuPa87]
Weighted Bipartite Matching [HCLH90]
8.12

CLIQUE PARTITIONING
Algorithm 8.2
page 279
8.13

CLIQUE PARTITIONING EXAMPLE
Graph G: vertices v1 to v5; edges e1,3, e1,4, e2,3, e2,5, e3,4, e4,5
Common Neighbors (edge : number of common neighbors):
e’1,3 : 1
e’1,4 : 1
e’2,3 : 0
e’2,5 : 0
e’3,4 : 1
e’4,5 : 0
Supernode Creation 1 (s13 = {v1, v3}):
e’13,4 : 0
e’2,5 : 0
e’4,5 : 0
Supernode Creation 2 (s134 = {v1, v3, v4}):
e’2,5 : 0
Supernode Creation 3 (s25 = {v2, v5})
Cliques for Graph G:
s134 = {v1, v3, v4}
s25 = {v2, v5}
8.14
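The supernode steps of the clique partitioning example can be traced with a short sketch. It is a simplified heuristic, not the book's Algorithm 8.2: it merges the edge with the most common neighbors and breaks ties by the smallest vertex labels, which is an assumption made here. The function name is hypothetical.

```python
def clique_partition(nodes, edges):
    """Greedy clique partitioning by repeated supernode creation."""
    adj = {frozenset([v]): set() for v in nodes}
    for u, v in edges:
        a, b = frozenset([u]), frozenset([v])
        adj[a].add(b)
        adj[b].add(a)
    while True:
        pairs = [(a, b) for a in adj for b in adj[a] if sorted(a) < sorted(b)]
        if not pairs:
            break
        # Edge with the most common neighbors; ties go to the smallest labels
        a, b = min(pairs, key=lambda p: (-len(adj[p[0]] & adj[p[1]]),
                                         sorted(p[0]), sorted(p[1])))
        common = adj[a] & adj[b]          # only these stay adjacent to the supernode
        for n in adj[a] | adj[b]:
            adj[n].discard(a)
            adj[n].discard(b)
        del adj[a], adj[b]
        merged = a | b
        adj[merged] = set(common)
        for n in common:
            adj[n].add(merged)
    return sorted(sorted(clique) for clique in adj)


# Graph G from the example: e1,3 e1,4 e2,3 e2,5 e3,4 e4,5
cliques = clique_partition(
    [1, 2, 3, 4, 5],
    [(1, 3), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5)],
)
```

On graph G the sketch creates s13, then s134, then s25, and ends with the two cliques shown on the slide.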

CLIQUE PARTITIONING FOR
REGISTER BINDING
[Figure: scheduled DFG over states s1 to s4 and lifetime intervals of variables v1 to v11]
[Figure: Graph model]
A CP solution (cliques):
r1 = {v1, v8}
r2 = {v2, v3, v9}
r3 = {v4, v5, v11}
r4 = {v6, v7}
r5 = {v10}

8.15

REGISTER ALLOCATION USING
LEFT−EDGE ALGORITHM
Algorithm 8.3
page 285
8.16

Register Binding with Left Edge Algorithm
v1 v10 v4 v6 v2 v3 v5 v7 v8 v9 v11 v1’ v2’ r1 r2 r3 r4 r5
v2
s1 v1 v6
v10 v4
s2
v3
v5 v7
s3
v8 v9 v11
s
4
v1’ v2’
Lifetimes Registers
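The left-edge idea can be sketched as follows: sort the lifetimes by start time, then fill one register at a time with non-overlapping intervals. This is an illustration rather than the book's Algorithm 8.3. The lifetimes below are hypothetical, not the ones in the figure, and a variable is assumed to occupy its register over the half-open interval [start, end).

```python
def left_edge(lifetimes):
    """lifetimes: {variable: (start, end)}. Returns a list of registers (variable lists)."""
    pending = sorted(lifetimes, key=lambda v: (lifetimes[v][0], v))  # by left edge
    registers = []
    while pending:
        current, free_at, rest = [], None, []
        for v in pending:
            start, end = lifetimes[v]
            if free_at is None or start >= free_at:   # fits after the previous variable
                current.append(v)
                free_at = end
            else:
                rest.append(v)                        # try again in the next register
        registers.append(current)
        pending = rest
    return registers


# Hypothetical lifetimes in control steps
LIFETIMES = {"a": (0, 2), "b": (0, 1), "c": (1, 3), "d": (2, 4), "e": (3, 4)}
registers = left_edge(LIFETIMES)
```

At most two of these hypothetical variables are alive at once, and the sketch uses exactly two registers.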
8.17

WEIGHTED BIPARTITE
MATCHING ALGORITHM
Algorithm 8.4
page 288
8.18

BIPARTITE MATCHING EXAMPLE
FOR REGISTER ALLOCATION
[Figure: Sorted Lifetime Intervals with Clusters. Variables v1, v10, v4, v6, v2, v3, v5, v7, v8, v9, v11, v1’, v2’ over states s1 to s4, grouped into Clusters 1 to 4]
[Figure: Bipartite Graph for Binding between register set R (r1 to r5) and variable set V (vars in Cluster2 after Cluster1)]
Final Register Binding:
r1 = { v1, v8, v1’ }
r2 = { v9, v10 }
r3 = { v4, v5, v11 }
r4 = { v6, v7 }
r5 = { v2, v3, v2’ }
8.19

PAIRWISE EXCHANGE
ALGORITHM
Algorithm 8.5
page 292
8.20

Interdependence of Scheduling and Allocation
Scheduling + Allocation => CU + DP
Scheduling: needs prelim component selection

Allocation: needs "rough" schedule
Which one first?
Iteration and Interleaving
8.21

Allocation: Future Directions
Interaction between scheduling and allocation
Realistic cost functions
Allocation of memories
Different target architectures
8.22

Design Methodology
for High−Level Synthesis
Chapter 9
9.1

DESIGN METHODOLOGY
REQUIREMENTS
1. The syntax and semantics of the input and output descriptions.
2. The set of algorithms for translating input into output descriptions.
3. The set of components to be used in the design implementation.
4. The definition and ranges of design constraints.
5. The mechanism for selection of design styles, architectures, topologies and components.
6. Control strategies (usually called scenarios or scripts) that define synthesis tasks and the order in which they are executed.
9.2

TRIVIAL SYNTHESIS SYSTEM
Assumptions: Sample/clock cycle
Computation/2 clock cycles
Operation/clock cycle
Same bit width
a b c d a b c d
r1 r2 r3 r4
+ − + Adder − Subtractor
r5 r6
* * Multiplier
r7
y y
DFG Annotated DFG
Behavioral Hardware Design

description resources constraints
a b c d
Compiler
r1 r2 r3 r4
Scheduler
+ − DFG
Allocator
r5 r6
Netlist
generator
*
r7
Physical design
y
ASIC description
to manufacturing
Datapath Synthesis system

9.3

PRACTICALITY OF ASSUMPTIONS
1. Not all units have the same bit width or the same propagation delay.
2. Dataflow architecture is too expensive.
3. I/O rates do not match the architecture.
4. Synchronous I/O is not always available.
9.4

EXAMPLE WITH MEMORIES
a b c d
a b c d
r1 r2 r3 r4
+ − State 1 + −
r2
State 2 r4
*
State 3 *
r5
y
y
DFG Annotated DFG
Memory 1 Address bus1

a b c d Control bus2
16 16 16 16
mux mux AR1
r1 r2 r3 r4
Control Address
mux mux
unit generator
+/− *
32
r5 AR2
Control bus1
Address bus2
Memory 2
FSMD implementation
Load1 Load2 Load3 Memory1

+ − + − * − ALU
* * * Multiplier
Store1 Store2 Store3 Memory2
Resource utilization
Load1 Load2 Load3 Memory1

+ − + − + − ALU
* * * Multiplier
Memory2 Store1 Store2 Store3
Improved resource utilization

9.5

EXAMPLE SYNTHESIS SYSTEM
Behavioral Design
description constraints
HL synthesis
Compiler
Style, architecture, resource selection

Scheduler
DFG
Allocator
Netlist
generator
Logic/Sequential synthesis
Memory Control Functional

synthesis synthesis synthesis
CDB
Physical design
Synthesis system with a component database

and user controlled resource selection
9.6

EXAMPLE SYNTHESIS SYSTEM
Behavioral Design
description constraints
HL synthesis
Compiler
Scheduler Architecture,
topology,
CDFG
resource
Allocator selection
Netlist
generator
Logic/Sequential synthesis Designer
Memory Control Functional

synthesis synthesis synthesis
CDB
Design
Physical design quality
assessment
Synthesis system with automatic iterative improvement
9.7

GENERIC SYNTHESIS SYSTEM
Completeness
1. All levels of design
2. Different target architectures
Extensibility
1. Addition of new algorithms and tools
2. Addition of new architecture styles
3. Addition of new libraries
Controllability
1. Control of tools
2. Control of design exploration
3. Quality metrics of design assessment
Interactivity
1. Partial design definition
2. Modification during and after synthesis
Upgradability
1. Capture−and−simulate to describe−and−synthesize
2. Mixing of strategies
9.8

HYPOTHETICAL SYNTHESIS
SYSTEM
System Specification Designer
System
synthesis
Chip
SDB
Conceptualization environment
synthesis
Simulation code generators
Intermediate forms
Simulation suite
Logic/Sequential
CDB
synthesis
Physical design
synthesis
ASIC description
to manufacturing
1. Supports capture−and−simulate and

describe−and−synthesize methodologies.
2. Separation of synthesis and simulation.
3. Hierarchical interactive synthesis.
9.9

SYSTEM SYNTHESIS METHODOLOGY
System description
Compiler
Standard
Estimator SR component
binding
Partitioner
Interface &
arbitration
synthesis
Port minimization
Partitioned system
description
Scheduling
To chip synthesis To RT synthesis
9.10

CHIP SYNTHESIS METHODOLOGY
Behavioral
description
Compiler
Scheduler
Storage
allocator
CDFG Storage
Technology mapping
Architecture, topology, style selection
merger strategies:
Functional unit
allocator 1. Top−down
Interconnection
allocator 2. Meet−in−the−middle
Module
selector 3. Bottom−up
Technology
mapper
CDB
Microarchitecture
optimizer
Logic/Sequential synthesis
To physical design
9.11

LOGIC−SYNTHESIS SYSTEM
State Boolean Timing Memory

tables expressions diagrams specifications
State Timing graph Memory

minimization compiler synthesis
State Interface
encoding synthesis
Logic
minimization
Technology
mapping
Physical design
9.12

PHYSICAL DESIGN METHODOLOGY
RT netlist
Style Component
Partitioning instances
selection
from CDB
1D 2D 1Bit
Stack Glue logic

partitioning partitioning
Floorplanning
Stack Array Glue logic

layout layout layout
Routing
To ASIC manufacturing
9.13

PHYSICAL DESIGN METHODOLOGY
Register
Counter
Mux
ALU
Comparator Glue Logic
Datapath floorplan
Stack 2
Memory
A/D
Register
file Stack 1
Chip floorplan
9.14

SYSTEM DATABASES
Phase 1: Collection of tools
Phase 2: Tool integration
Phase 3: Common data model
Phase 4: Design views and

consistency checks
9.15

DATABASE ARCHITECTURE
Design entity graphs

hierarchy, version control, configuration management
Design data graphs

behavior, structure, geometry, timing
Version Transaction
manager manager
Designer
Schema Design entity Design entity

browser manager graph
Database
interface
Design data
Design view Design data
representation
manager manager
graphs
Design
tools
Consistency Design quality
checker evaluator
9.16

COMPONENT DATABASE
Component
descriptions,
Component
Component Schematic Component Component
generators,
netlist diagram request query
Component−
optimization
tools
Schematic Schematic
capture generator
RT, logic, layout

Knowledge Component descriptions
server server
Estimates
Component database
Component
Fixed store
Parameterized
Component
Component
descriptions
generators
9.17

CONCEPTUALIZATION
ENVIRONMENT
Data and design manager
Displays and editors

Design−quality estimators
Design−consistency checkers
Synthesis algorithms
9.18

DISPLAY
Variables
Begin State Condition CondValue Actions NextState
t, count, opd, result
Ports
R,A,B t=(A<B),
Operators
t=A<B BEGIN R=0, test1
R=0 result = 0
<, +, − result = 0
(t) ("1") testT

test1 test1
t==1 ("0") testF
testT testF
testT count=A, opd=B join
count = A count = B
opd = B opd = A
testF count=B, opd=A join
Join
join loop
loop
loop t=(count>0) test2
t = count > 0
(t) ("1") body

test2 test2
t==1
("0") 1
body
result = result + opd R = result body result=result+opd, loop

count=count−1
count = count −1
1 R = result END
END END
End
Flowchart State table
9.19

FLOORPLAN DISPLAY
Datapath1
bus1
Reg4.1
Reg4.2
Memory Control Unit
Reg4.3
ALU4.1
ALU4.2
bus2
Datapath2
MUL4.1
Control Unit
Glue Logic
MUL4.2
9.20

INTERACTIVE SYNTHESIS
Description capture
Description
partitioning
Component
selection
Module/Port Component Connection

placement binding allocation
Scheduling
To physical design
Possible scenarios for interactive synthesis
9.21

FUTURE DIRECTIONS
Complete synthesis systems/frameworks
Descriptions and modeling guidelines

Quality metrics and estimation
Component taxonomy and generators
Databases and environments
Design exploration strategies

Hardware/software codesign
9.22
