Systems I: Pipelining II

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 30

Systems I

Pipelining II

Topics
 Pipelining hardware: registers and
feedback paths
 Difficulties with pipelines: hazards
 Method of mitigating hazards
valM

SEQ+ Hardware
data out
read
Mem. Data
Data
Memory control memory
memory
write

Addr Data

 Still sequential
Bch valE
implementation
ALU
 Reorder PC stage to put at Execute CC
CC ALU
ALU fun.

beginning ALU
A
ALU
B

PC Stage
 Task is to select PC for valA valB dstE dstM srcA srcB

current instruction
dstE dstM srcA srcB
 Based on results computed
Decode A B

by previous instruction Register M


Register
file
file E
Write back
Processor State icode ifun rA rB valC valP

 PC is no longer stored in
register Fetch Instruction
Instruction
memory
memory
PC
PC
increment
increment

 But, can determine PC based PC

on other stored information


PC PC

pIcode pBch pValM pValC pValP

2
Adding Pipeline Registers
valE, valM
Write back W_icode, W_valM W_valE, W_valM, W_dstE, W_dstM

valM
valM
W
valM

Data
Data
Memory M_icode, Data
Data
memory
memory Memory
M_Bch, memory
memory
M_valA
Addr, Data
Addr, Data

M
valE

Bch Bch valE


Execute CC
CC
ALU
ALU CC
CC
Execute ALU
ALU
aluA, aluB
aluA, aluB

E
valA, valB

valA, valB
srcA, srcB
Decode d_srcA,
dstA, dstB A B
Decode A B
icode, valC Register M
Register d_srcB
RegisterM
Register
valP file
file
E file
file E
Write back
valP D
icode, ifun valP
rA, rB
valC
Instruction PC icode, ifun,
Fetch Instruction PC
rA, rB, valC valP
memory
memory increment
increment
Fetch Instruction
Instruction PC
PC
memory
memory increment
increment
PC predPC

PC pState
PC f_PC

3
Pipeline Stages
Fetch
 Select current PC
 Read instruction
 Compute incremented PC

Decode
 Read program registers

Execute
 Operate ALU

Memory
 Read or write data memory

Write Back
 Update register file
4
Write back

PIPE- Hardware W icode

read
valE valM
data out
dstE dstM

Mem. Data
Data
control memory
memory
write
Memory
 Pipeline registers hold Addr
data in

M_valA

intermediate values
M_Bch

M icode Bch valE valA dstE dstM


from instruction e_Bch
ALU
execution CC
CC ALU
ALU fun.

ALU ALU

Forward (Upward) Paths


A B
Execute

 Values passed from one


stage to next
E icode ifun valC valA valB dstE dstM srcA srcB

d_srcA d_srcB
Select d_rvalA dstE dstM srcA srcB
 Cannot jump past A
W_valM
Decode A B

stages Register M
Register
file
file
E
W_valE

 e.g., valC passes


D icode ifun rA rB valC valP
through decode Predict
PC
Instruction
Instruction PC
PC
memory
memory increment
increment
Fetch
f_PC
M_valA
Select
PC W_valM

F predPC

5
Signal Naming Conventions
S_Field
 Value of Field held in stage S pipeline
register

s_Field
 Value of Field computed in stage S

6
Write back

Feedback Paths W icode

read
valE valM
data out
dstE dstM

Mem. Data
Data
control memory
memory
write
Memory
Predicted PC M_Bch
Addr
data in

M_valA

 Guess value of next PC M icode Bch valE valA dstE dstM

e_Bch

Branch information
ALU
CC
CC ALU
ALU fun.

ALU ALU
 Jump taken/not-taken Execute
A B

 Fall-through or target
address E icode ifun valC valA valB dstE dstM srcA srcB

d_srcA d_srcB
Select d_rvalA dstE dstM srcA srcB

Return point Decode


A

A B W_valM
Register M
Register
 Read from memory file
file W_valE
E

D icode ifun rA rB valC valP

Register updates Predict


PC

To register file write


Instruction
Instruction PC
PC
 memory
memory increment
increment
Fetch
ports f_PC

Select
M_valA

PC W_valM

F predPC

7
Pipeline Demonstration
1 2 3 4 5 6 7 8 9

irmovl $1,%eax #I1 F D E M W


irmovl $2,%ecx #I2 F D E M W
irmovl $3,%edx #I3 F D E M W
irmovl $4,%ebx #I4 F D E M W
halt #I5 F D E M W

Cycle 5

File: demo-basic.ys W
I1
M
I2
E
I3
D
I4
F
I5
8
Data Dependencies: 3 Nop’s
1 2 3 4 5 6 7 8 9 10 11
# demo-h3.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
0x00d: nop F D E M W
0x00e: nop F D E M W
0x00f: addl %edx,%eax F D E M W
0x011: halt F D E M W

Cycle 6
W
R[ %eax]  3

Cycle 7
D
valA  R[ %edx] = 10
valB  R[ %eax] = 3 9
Data Dependencies: 2 Nop’s
1 2 3 4 5 6 7 8 9 10
# demo-h2.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
0x00d: nop F D E M W
0x00e: addl %edx,%eax F D E M W
0x010: halt F D E M W

Cycle 6
W
R[ %eax]  3

Can’t transport value


produced by first •

instruction back in •
time D
valA  R[ %edx] = 10 Error
valB  R[ %eax] = 0

10
Data Dependencies: 1 Nop
# demo-h1.ys 1 2 3 4 5 6 7 8 9

0x000: irmovl $10,%edx F D E M W


0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
0x00d: addl %edx,%eax F D E M W
0x00f: halt F D E M W

Cycle 5
W
R[ %edx]  10

M
M_valE = 3
M_dstE = %eax
Now a
problem with •
both operands •

D
Error
valA  R[ %edx] = 0
valB  R[ %eax] = 0 11
Data Dependencies: No Nop
1 2 3 4 5 6 7 8
# demo-h0.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: addl %edx,%eax F D E M W
0x00e: halt F D E M W

Cycle 4
M
M_valE = 10
M_dstE = %edx

E
e_valE  0 + 3 = 3
Wow - we really E_dstE = %eax
missed the boat here…
D
Error
valA  R[ %edx] = 0
valB  R[ %eax] = 0

12
M_icode

Predicting the M_Bch


M_valA
W_icode

PC
W_valM
D icode ifun rA rB valC valP

Predict
PC
Need
valC
Instr PC
PC
valid increment
Need increment
regids

Split
Split Align
Align
Byte 0 Bytes 1-5

Instruction
Instruction
memory
memory

Select
PC

F predPC

 Start fetch of new instruction after current one has completed


fetch stage
 Not enough time to reliably determine next instruction
 Guess which instruction will follow
 Recover if prediction was incorrect

13
Our Prediction Strategy
Instructions that Don’t Transfer Control
 Predict next PC to be valP
 Always reliable

Call and Unconditional Jumps


 Predict next PC to be valC (destination)
 Always reliable

Conditional Jumps
 Predict next PC to be valC (destination)
 Only correct if branch is taken
 Typically right 60% of time

Return Instruction
 Don’t try to predict
14
Recovering
M_icode
M_Bch
M_valA

from PC
W_icode
W_valM
D icode ifun rA rB valC valP

Misprediction Instr
Need
valC
Predict
PC

PC
PC
valid increment
Need increment
regids

Split
Split Align
Align
Byte 0 Bytes 1-5

Instruction
Instruction
memory
memory

Select
PC

F predPC

 Mispredicted Jump
 Will see branch flag once instruction reaches memory stage
 Can get fall-through PC from valA
 Return Instruction
 Will get return PC when ret reaches write-back stage
 In both cases
 Need to throw away instructions fetched between prediction and resolution
15
Branch Misprediction Example
demo-j.ys

0x000: xorl %eax,%eax


0x002: jne t # Not taken
0x007: irmovl $1, %eax # Fall through
0x00d: nop
0x00e: nop
0x00f: nop
0x010: halt
0x011: t: irmovl $3, %edx # Target (Should not execute)
0x017: irmovl $4, %ecx # Should not execute
0x01d: irmovl $5, %edx # Should not execute

 Should only execute first 7 instructions

16
Branch Misprediction Trace
# demo-j 1 2 3 4 5 6 7 8 9

0x000: xorl %eax,%eax F D E M W


0x002: jne t # Not taken F D E M W
0x011: t: irmovl $3, %edx # Target F D E M W
0x017: irmovl $4, %ecx # Target+1 F D E M W
0x007: irmovl $1, %eax # Fall Through F D E M W

Cycle 5
M
 Incorrectly execute two M_Bch = 0
M_valA = 0x007
instructions at branch target
E
valE  3
dstE = %edx

D
valC = 4
dstE = %ecx

F
valC  1
rB  %eax 17
Return Example demo-ret.ys

0x000: irmovl Stack,%esp # Intialize stack pointer


0x006: nop # Avoid hazard on %esp
0x007: nop
0x008: nop
0x009: call p # Procedure call
0x00e: irmovl $5,%esi # Return point
0x014: halt
0x020: .pos 0x20
0x020: p: <op> # procedure
0x021: <op>
0x022: <op>
0x023: ret
0x024: irmovl $1,%eax # Should not be executed
0x02a: irmovl $2,%ecx # Should not be executed
0x030: irmovl $3,%edx # Should not be executed
0x036: irmovl $4,%ebx # Should not be executed
0x100: .pos 0x100
0x100: Stack: # Stack: Stack pointer

18
Incorrect Return Example
# demo-ret

0x023: ret F D E M W
0x024: irmovl $1,%eax # Oops! F D E M W
0x02a: irmovl $2,%ecx # Oops! F D E M W
0x030: irmovl $3,%edx # Oops! F D E M W
0x00e: irmovl $5,%esi # Return F D E M W

 Incorrectly execute 3
W
instructions following ret
valM = 0x0e

M
valE = 1
dstE = %eax
E
valE  2
dstE = %ecx
D
valC = 3
dstE = %edx
F
valC  5
rB  %esi 19
Pipeline Summary
Concept
 Break instruction execution into 5 stages
 Run instructions through in pipelined mode

Limitations
 Can’t handle dependencies between instructions when
instructions follow too closely
 Data dependencies
 One instruction writes register, later one reads it
 Control dependency
 Instruction sets PC in way that pipeline did not predict correctly
 Mispredicted branch and return

20
The problem is hazards
Make the pipelined processor work!
Data Hazards
 Instruction having register R as source follows shortly after
instruction having register R as destination
 Common condition, don’t want to slow down pipeline
Control Hazards
 Mispredict conditional branch
 Our design predicts all branches as being taken
 Naïve pipeline executes two extra instructions
 Getting return address for ret instruction
 Naïve pipeline executes three extra instructions

Making Sure It Really Works


 What if multiple special cases happen simultaneously?

21
How do we fix the Pipeline?
Pad the program with NOPs
 Yuck!

Stall the pipeline


 Data hazards
 Wait for producing instruction to complete
 Then proceed with consuming instruction
 Control hazards
 Wait until new PC has been determined
 Then begin fetching

Forward data within the pipeline


 Grab the result from somewhere in the pipe
 After it has been computed
 But before it has been written back

22
Stalling for Data Dependencies
1 2 3 4 5 6 7 8 9 10 11
# demo-h2.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
0x00d: nop F D E M W
bubble E M W
0x00e: addl %edx,%eax F D D E M W
0x010: halt F F D E M W

 If instruction follows too closely after one that writes


register, slow it down
 Hold instruction in decode
 Dynamically inject nop into execute stage

23
Stall Condition
Source Registers
 srcA and srcB of current
instruction in decode stage

Destination Registers
 dstE and dstM fields
 Instructions in execute,
memory, and write-back
stages

Special Case
 Don’t stall for register ID 15
(0xF)
 Indicates absence of
register operand
 Don’t stall for failed
conditional move

24
Detecting Stall Condition
1 2 3 4 5 6 7 8 9 10 11
# demo-h2.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
0x00d: nop F D E M W
bubble E M W
0x00e: addl %edx,%eax F D D E M W
0x010: halt F F D E M W

Cycle 6
W
W_dstE = %eax
W_valE = 3




D
srcA = %edx
srcB = %eax 25
Stalling X3
1 2 3 4 5 6 7 8 9 10 11
# demo-h0.ys
0x000: irmovl $10,%edx F D E M W
0x006: irmovl $3,%eax F D E M W
bubble E M W
bubble E M W
bubble E M W
0x00c: addl %edx,%eax F D D D D E M W
0x00e: halt F F F F D E M W

Cycle 6
W
Cycle 5 W_dstE = %eax
M
Cycle 4 M_dstE = %eax •

E • •
E_dstE = %eax


D D D
srcA = %edx srcA = %edx srcA = %edx
srcB = %eax srcB = %eax srcB = %eax 26
What Happens When Stalling?
# demo-h0.ys
Cycle 8
4
5
6
7
0x000: irmovl $10,%edx
Write Back 0x000: bubble
0x006: irmovl $10,%edx
$3,%eax
0x006: irmovl $3,%eax
Memory 0x000: bubble
0x006: irmovl $10,%edx
$3,%eax
0x00c: addl %edx,%eax
Execute 0x006: bubble
0x00c: irmovl
addl %edx,%eax
$3,%eax
0x00e: halt
Decode 0x00c: halt
0x00e: addl %edx,%eax
Fetch 0x00e: halt

 Stalling instruction held back in decode stage


 Following instruction stays in fetch stage
 Bubbles injected into execute stage
 Like dynamically generated nop’s
 Move through later stages

27
Implementing Stalling
W_dstM
W_dstE

W icode valE valM dstE dstM

M_dstM
M_dstE

M icode Bch valE valA dstE dstM

E_dstM
Pipe E_dstE
control
logic E_bubble
E icode ifun valC valA valB dstE dstM srcA srcB

d_srcB
d_srcA
srcB
D_icode
srcA

D_stall D icode ifun rA rB valC valP

F_stall
F predPC

Pipeline Control
 Combinational logic detects stall condition
 Sets mode signals for how pipeline registers should update
28
Pipeline Register Modes
Rising
Input = y Output = x clock Output = y

Normal x
  y

stall bubble
=0 =0

Rising
Input = y Output = x clock Output = x

Stall x
  x

stall bubble
=1 =0

Rising
Input = y Output = x clock Output = nop

Bubble x
  n
o
p
stall bubble
=0 =1

29
Summary
Today
 Data hazards (read after write)
 Control hazards (branch, return)
 Mitigating hazards through stalling

Next Time
 Hazard mitigation through pipeline forwarding
 Hardware support for forwarding
 Forwarding to mitigate control (branch) hazards

30

You might also like