03 Pipeline
03 Pipeline
Pipeline Review
Outline
• Review
• Quantify and summarize performance
– Ratios, Geometric Mean, Multiplicative Standard Deviation
• F&P: Benchmarks age, disks fail,1 point fail
danger
21/11/2 3
Example: MIPS (MIPS)
Register-Register
31 26 25 21 20 16 15 11 10 6 5 0
Register-Immediate
31 26 25 21 20 16 15 0
Op Rs1 Rd immediate
Branch
31 26 25 21 20 16 15 0
Jump / Call
31 26 25 0
Op target
21/11/2 4
Datapath vs Control
Datapath Controller
signals
Control Points
21/11/2 6
5 Steps of MIPS Datapath
Figure A.2, Page A-8
MUX
Next SEQ PC
Adder
4 RS1
Zero?
MUX MUX
RS2
Address
Memory
Reg File
Inst
ALU
Memory
RD L
Data
M
MUX
D
Sign
IR <= mem[PC]; Imm Extend
PC <= PC + 4
WB Data
Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt]
21/11/2 7
5 Steps of MIPS Datapath
Figure A.3, Page A-9
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC
MUX
Next SEQ PC Next SEQ PC
Adder
4 RS1
Zero?
MUX MUX
MEM/WB
Address
Memory
EX/MEM
RS2
Reg File
ID/EX
IF/ID
ALU
Memory
Data
MUX
IR <= mem[PC];
WB Data
Sign
Extend
Imm
PC <= PC + 4
A <= Reg[IRrs]; RD RD RD
B <= Reg[IRrt]
rslt <= A opIRop B
WB <= rslt
21/11/2 8
Reg[IRrd] <= WB
Inst. Set Processor Controller
IR <= mem[PC];
Ifetch
PC <= PC + 4
PC <= PC+IRim
21/11/2 9
5 Steps of MIPS Datapath
Figure A.3, Page A-9
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC
MUX
Next SEQ PC Next SEQ PC
Adder
4 RS1
Zero?
MUX MUX
MEM/WB
Address
Memory
EX/MEM
RS2
Reg File
ID/EX
IF/ID
ALU
Memory
Data
MUX
WB Data
Sign
Extend
Imm
RD RD RD
ALU
n Ifetch Reg DMem Reg
s
t
r.
ALU
Ifetch Reg DMem Reg
O
r
ALU
Ifetch Reg DMem Reg
d
e
r
ALU
Ifetch Reg DMem Reg
21/11/2 11
Pipelining is not quite that easy!
21/11/2 12
One Memory Port/Structural Hazards
Figure A.4, Page A-14
I Load Ifetch
ALU
Reg DMem Reg
n
s
ALU
Reg
t Instr 1
Ifetch Reg DMem
r.
ALU
Ifetch Reg DMem Reg
Instr 2
O
r
ALU
Ifetch Reg DMem Reg
d Instr 3
e
r
ALU
Ifetch Reg DMem Reg
Instr 4
21/11/2 13
One Memory Port/Structural Hazards
(Similar to Figure A.5, Page A-15)
I Load Ifetch
ALU
Reg DMem Reg
n
s
ALU
Reg
t Instr 1
Ifetch Reg DMem
r.
ALU
Ifetch Reg DMem Reg
Instr 2
O
r
Stall Bubble Bubble Bubble Bubble Bubble
d
e
r
ALU
Ifetch Reg DMem Reg
Instr 3
How
21/11/2 do you “bubble” the pipe? 14
Speed Up Equation for Pipelining
21/11/2 15
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
21/11/2 16
Data Hazard on R1
Figure A.6, Page A-17
IF ID/RF EX MEM WB
ALU
Reg
add r1,r2,r3 Ifetch Reg DMem
n
s
t
ALU
Ifetch Reg DMem Reg
sub r4,r1,r3
r.
ALU
O Ifetch Reg DMem Reg
and r6,r1,r7
r
d
ALU
Ifetch Reg DMem Reg
e or r8,r1,r9
r
ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
21/11/2 17
Three Generic Data Hazards
I: add r1,r2,r3
J: sub r4,r1,r3
21/11/2 18
Three Generic Data Hazards
21/11/2 20
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
s
t
r.
ALU
sub r4,r1,r3 Ifetch Reg DMem Reg
O
r
ALU
Ifetch Reg DMem Reg
d and r6,r1,r7
e
r
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
21/11/2 21
HW Change for Forwarding
Figure A.23, Page A-37
NextPC
mux
Registers
MEM/WR
EX/MEM
ALU
ID/EX
Data
mux
Memory
mux
Immediate
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
s
t
r.
ALU
lw r4, 0(r1) Ifetch Reg DMem Reg
O
r
ALU
Ifetch Reg DMem Reg
d sw r4,12(r1)
e
r
ALU
Ifetch Reg DMem Reg
or r8,r6,r9
ALU
Ifetch Reg DMem Reg
xor r10,r9,r11
21/11/2 23
Data Hazard Even with Forwarding
Figure A.9, Page A-21
ALU
lw r1, 0(r2) Ifetch Reg DMem Reg
n
s
t
ALU
Ifetch Reg DMem Reg
sub r4,r1,r6
r.
ALU
Ifetch Reg DMem Reg
and r6,r1,r7
r
d
e
ALU
Ifetch Reg DMem Reg
r or r8,r1,r9
21/11/2 24
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
ALU
Ifetch Reg DMem Reg
s
t
r.
ALU
sub r4,r1,r6 Ifetch Reg Bubble DMem Reg
O
r
d Bubble
ALU
Ifetch Reg DMem Reg
e and r6,r1,r7
r
ALU
Bubble Ifetch Reg DMem
or r8,r1,r9
21/11/2
How is this detected? 25
Software Scheduling to Avoid Load
Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Slow code: Fast code:
LW Rb,b LW Rb,b
LW Rc,c LW Rc,c
ADD Ra,Rb,Rc LW Re,e
SW a,Ra ADD Ra,Rb,Rc
LW Re,e LW Rf,f
LW Rf,f SW a,Ra
SUB Rd,Re,Rf SUB Rd,Re,Rf
SW d,Rd SW d,Rd
ALU
10: beq r1,r3,36 Ifetch Reg DMem Reg
ALU
Ifetch Reg DMem Reg
14: and r2,r3,r5
ALU
Ifetch Reg DMem Reg
18: or r6,r1,r7
ALU
Ifetch Reg DMem Reg
22: add r8,r1,r9
ALU
36: xor r10,r1,r11 Ifetch Reg DMem
21/11/2 29
Pipelined MIPS Datapath
Figure A.24, page A-38
Instruction Instr. Decode Execute Memory Write
Fetch Reg. Fetch Addr. Calc Access Back
Next PC Next
MUX
SEQ PC
Adder
Adder
Zero?
4 RS1
MEM/WB
Address
Memory
EX/MEM
RS2
Reg File
ID/EX
ALU
IF/ID
Memory
MUX
Data
MUX
WB Data
Sign
Extend
Imm
RD RD RD
21/11/2 30
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
21/11/2 31
Four Branch Hazard Alternatives
branch instruction
sequential successor1
sequential successor2
........
sequential successorn Branch delay of length n
branch target if taken
– 1 slot delay allows proper decision and branch target address in 5 stage
pipeline
– MIPS uses this
21/11/2 32
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch B. From branch target C. From fall through
add $1,$2,$3 sub $4,$5,$6 add $1,$2,$3
if $2=0 then if $1=0 then
delay slot delay slot
add $1,$2,$3
if $1=0 then
delay slot sub $4,$5,$6
• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
21/11/2 33
Delayed Branch
21/11/2 34
Evaluating Branch Alternatives
21/11/2 35
Problems with Pipelining
• Exception: An unusual event happens to an
instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the
processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
• Problem: It must appear that the exception or
interrupt must appear between 2 instructions (Ii
and Ii+1)
– The effect of all instructions up to and including I i is totalling
complete
– No effect of any instruction after Ii can take place
• The interrupt (exception) handler either aborts
program or restarts at instruction Ii+1
21/11/2 36
Precise Exceptions in Static Pipelines
21/11/2 38