Helping Slides: Pipelining Hazards and Solutions


Pipelined Processor Design

2
Outline

 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Data Hazards and Forwarding
 Pipelined Control
 Load Delay, Hazard Detection, and Stall Unit
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
3
Pipelining Example

 Laundry Example: Three Stages

1. Wash dirty load of clothes

2. Dry wet clothes

3. Fold and put clothes into drawers

 Each stage takes 30 minutes to complete

 Four loads of clothes to wash, dry, and fold (loads A, B, C, D)
4
Sequential Laundry

[Timing diagram: 6 PM to 12 AM, twelve 30-minute steps — loads A–D washed, dried, and folded back-to-back]

 Sequential laundry takes 6 hours for 4 loads


 Intuitively, we can use pipelining to speed up laundry
5
Pipelined Laundry: Start Load ASAP
[Timing diagram: 6 PM to 9 PM — loads A–D overlapped in 30-minute stages]

 Pipelined laundry takes 3 hours for 4 loads
 Speedup factor is 2 for 4 loads
 Time to wash, dry, and fold one load is still the same (90 minutes)
Serial Execution versus Pipelining 6

 Consider a task that can be divided into k subtasks


 The k subtasks are executed on k different stages
 Each subtask requires one time unit
 The total execution time of the task is k time units

 Pipelining is to start a new task before finishing previous


 The k stages work in parallel on k different tasks
 Tasks enter/leave pipeline at the rate of one task per time unit.

[Schematic: three tasks, each flowing through stages 1, 2, …, k]

Without Pipelining: one completion every k time units
With Pipelining: one completion every 1 time unit
7

 Pipelining is one way of improving the overall processing performance of a processor.
 This architectural approach allows the simultaneous execution of several instructions.
 The pipeline design technique decomposes a sequential process into several subprocesses, called stages or segments.
 A stage performs a particular function and produces an intermediate result.
 Each stage consists of an input latch, also called a register or buffer, followed by a processing circuit.
8

 The processing circuit of a given stage is connected to the input latch of the next stage.
 A clock signal is connected to each input latch.
 At each clock pulse, every stage transfers its intermediate result to the input latch of the next stage.
9

 The period of the clock pulse should be large enough to provide sufficient time for a signal to traverse through the slowest stage, which is called the bottleneck.
 In addition, there should be enough time for a latch to store its input signals.
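In symbols (a standard formulation, not shown on these slides): the clock period P must satisfy P ≥ max(τ1, …, τk) + t_latch, where τi is the propagation delay of stage i and t_latch is the time a latch needs to store its inputs.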
10
Pipeline Performance

 The ability to overlap stages of a sequential process for different input tasks (data or operations) results in an overall theoretical completion time of:

T_pipe = (k + n − 1) × P

where n is the number of input tasks, k is the number of stages in the pipeline, and P is the clock period.

 Ideal speedup of a k-stage pipeline over serial execution:

S = T_serial / T_pipe = (n × k × P) / ((k + n − 1) × P) = n × k / (k + n − 1), which approaches k for large n


11

 To be more specific, when n is large, a pipelined processor can produce output approximately k times faster than a nonpipelined processor.
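As a quick check of these formulas, here is a minimal sketch (not from the slides; the values of n, k, and P are assumed for illustration):

#include <stdio.h>

int main(void) {
    int n = 100;          /* number of tasks (assumed) */
    int k = 5;            /* number of pipeline stages (assumed) */
    double P = 10e-9;     /* clock period: 10 ns, as in the later example */

    double t_serial = (double)n * k * P;          /* nonpipelined: n * k * P    */
    double t_pipe   = (double)(k + n - 1) * P;    /* pipelined: (k + n - 1) * P */

    printf("serial:    %.0f ns\n", t_serial * 1e9);   /* 5000 ns */
    printf("pipelined: %.0f ns\n", t_pipe * 1e9);     /* 1040 ns */
    printf("speedup:   %.2f\n", t_serial / t_pipe);   /* ~4.81, close to k */
    return 0;
}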
12
Next . . .
 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Pipelined Control
 Data Hazards and Forwarding
 Load Delay, Hazard Detection, and Stall Unit
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
13
INSTRUCTION PIPELINE

 In a von Neumann architecture, the process of executing an instruction involves several steps.

1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory).
2. Instruction decoding (ID). Identification of the operation to be performed.
3. Operand fetch (OF). Retrieval of any required operands.
4. Execution (EX). Performing the operation on the operands.
5. Write-back (WB). Updating the destination operands.
14

 An instruction pipeline increases the performance of a processor by overlapping the processing of several different instructions.
 An instruction pipeline overlaps the stages of different instructions to achieve a much lower total completion time, on average, for a series of instructions.
Single-Cycle Datapath 15

 Shown below is the single-cycle datapath


 How to pipeline this single-cycle datapath?
Answer: Introduce registers at the end of each stage

IF = Instruction Fetch | ID = Decode and Register Fetch | EX = Execute and Calculate Address | MEM = Memory Access | WB = Write Back

[Single-cycle datapath diagram: PC and next-PC logic, instruction memory, register file (Rs, Rt, Rd), immediate extender, ALU, and data memory]
Pipelined Datapath 16

 Pipeline registers, in green, separate each pipeline stage


 Pipeline registers are labeled by the stages they separate

IF = Instruction Fetch | ID = Decode | EX = Execute | MEM = Memory | WB = Write Back

[Pipelined datapath diagram: same components as the single-cycle datapath, with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers (in green) separating the stages]
17

[Pipelined execution diagram omitted, assuming the clock period to be 10 ns]


18

 Note that in a nonpipelined design the completion time will be much higher:
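For example (n = 100 instructions assumed, with k = 5 stages and P = 10 ns): the pipelined completion time is (5 + 100 − 1) × 10 ns = 1040 ns, while the nonpipelined time is 100 × 5 × 10 ns = 5000 ns, roughly k times longer.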
Instruction–Time Diagram 19
 Diagram shows:
 Which instruction occupies what stage at each clock cycle
 Instruction execution is pipelined over the 5 stages
 Up to five instructions can be in execution during a single cycle
 ALU instructions skip the MEM stage
 Store instructions skip the WB stage

Instruction Order   CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $7, 8($3)       IF   ID   EX   MEM  WB
lw  $6, 8($5)            IF   ID   EX   MEM  WB
or  $4, $3, 7                 IF   ID   EX   –    WB
sub $5, $2, $3                     IF   ID   EX   –    WB
sw  $2, 10($3)                          IF   ID   EX   MEM  –
22

 The goal of pipelining is to allow multiple instructions to execute at the same time.
 We may need to perform several operations in a cycle:
 Increment the PC and add registers at the same time.
 Fetch one instruction while another one reads or writes data.

 The big benefit of pipelining comes from the fact that once you feed the first element (an instruction, for example) into the pipe, the pipe is free to accept the next one on the next cycle, long before the previous elements have finished the complete datapath.
23
Next . . .
 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Data Hazards and Forwarding
 Load Delay, Hazard Detection, and Stall Unit
 Pipelined Control
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
24
Pipeline Hazards

 Hazards: situations that would cause incorrect execution if the next instruction were launched during its designated clock cycle
1. Structural hazards
 Caused by resource contention: a resource is not available when needed
 Two instructions use the same resource during the same cycle
2. Data hazards
 An instruction may compute a result needed by the next instruction
 Hardware can detect dependencies between instructions
3. Control hazards
 A control hazard refers to a situation in which an instruction, such as a branch or jump, causes a change in the program flow
 Delays in changing the flow of control

 In short, hazards are problems with the instruction pipeline in CPU microarchitectures in which the next instruction cannot execute in the following clock cycle
Structural Hazards 25

 Problem
 Attempt to use the same hardware resource by two different instructions during the same cycle
 Example
 Writing back the ALU result in stage 4
 Conflicts with writing load data in stage 5

Structural hazard: two instructions are attempting to write the register file during the same cycle.

                 CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $6, 8($5)    IF   ID   EX   MEM  WB
or  $4, $3, 7         IF   ID   EX   WB
sub $5, $2, $3             IF   ID   EX   WB
sw  $2, 10($3)                  IF   ID   EX   MEM
Resolving Structural Hazards 26

 Serious Hazard:
 Hazard cannot be ignored
 Solution 1: Delay Access to Resource
 Must have mechanism to delay instruction access to resource
 Delay all write backs to the register file to stage 5
 ALU instructions bypass stage 4 (memory) without doing anything
 Solution 2: Add more hardware resources (more costly)
 Add more hardware to eliminate the structural hazard
 Redesign the register file to have two write ports
 First write port can be used to write back ALU results in stage 4
 Second write port can be used to write back load data in stage 5
27
Next . . .
 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Data Hazards and Forwarding
 Load Delay, Hazard Detection, and Stall Unit
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
 Pipelined Control
28
Data Hazards

 In a pipelined processor, instruction executions are overlapped. An instruction may be started and completed before the previous instruction is completed.
 The data hazard, or data dependency problem, comes about as a result of overlapping (or changing the order of) the execution.
29

[Example figure omitted] This would result in using the old contents of R2 for computing a new value for R5, leading to an invalid result.
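For instance (an instruction pair chosen to match the text, not from the slides): i1: add R2, R3, R4 followed by i2: add R5, R2, R6 — if i2 fetches its operands before i1 writes R2 back, i2 computes R5 from the old contents of R2.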
30

Solution 1: Stalling
Instruction i2 is said to be stalled for two clock cycles; no-ops (bubbles) are inserted.

Often, when an instruction is stalled, the instructions positioned after the stalled instruction are also stalled. However, the instructions before the stalled instruction can continue execution.

 To insert a delay, an extra hardware component called a pipeline interlock can be added to the pipeline. A pipeline interlock detects the dependency and delays the dependent instructions until the conflict is resolved.
31

 Another way is to let the compiler solve the dependency problem.

 During compilation, the compiler detects the dependency between data and instructions.
It then rearranges these instructions so that the dependency is not hazardous to the
system.
32
Compiler Scheduling

 Consider the following statements:
a = b + c; d = e – f;

 Slow code (two stalls):
lw  $10, 0($1)      # $1 = addr b
lw  $11, 0($2)      # $2 = addr c
add $12, $10, $11   # stall until $11 is available
sw  $12, 0($3)      # $3 = addr a
lw  $13, 0($4)      # $4 = addr e
lw  $14, 0($5)      # $5 = addr f
sub $15, $13, $14   # stall until $14 is available
sw  $15, 0($6)      # $6 = addr d

 Fast code (no stalls — the loads are scheduled ahead of the instructions that use them):
lw  $10, 0($1)
lw  $11, 0($2)
lw  $13, 0($4)
lw  $14, 0($5)
add $12, $10, $11
sw  $12, 0($3)
sub $15, $13, $14
sw  $15, 0($6)
33

 There are three primary types of data hazards: RAW (read after write), WAR (write after read), and
WAW (write after write).

1. RAW: This type of data hazard was discussed previously; it refers to the situation in which i2 reads a
data source before i1 writes to it.
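For example (an illustrative pair, not from the slides): i1: add R2, R3, R4 followed by i2: sub R5, R2, R6 — i2 must not read R2 until i1 has written it, or it will use the stale value.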
34

2. WAR: This refers to the situation in which i2 writes to a location before i1 reads it.

 an invalid result may be produced if i2 writes to R4 before i1 reads it; that is, the instruction i1 might use the
wrong value of R4.
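For example (an illustrative pair chosen to match the text): i1: add R5, R4, R3 followed by i2: add R4, R1, R2 — if i2 writes R4 before i1 reads it, i1 uses the wrong value of R4.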
 WAR cannot occur in our basic 5-stage pipeline because:
 Reads are always in stage 2, and
 Writes are always in stage 5
 Instructions are processed in order
35

3. WAW: This refers to the situation in which i2 writes to a location before i1 writes to it.

Here the value of R2 is recomputed by i2.
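For example (an illustrative pair chosen to match the text): i1: add R2, R3, R4 followed by i2: sub R2, R5, R6 — if i2's write to R2 completes before i1's, R2 is left holding i1's stale result.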


 WAW can’t happen in our basic 5-stage pipeline because:
 All writes are ordered and always take place in stage 5

 WAR and WAW hazards can occur in complex pipelines


 Notice that Read After Read – RAR is NOT a hazard.
36

 One way to enhance the architecture of an instruction pipeline is to increase concurrent execution of
the instructions by dispatching several independent instructions to different functional units

 That is, the instructions can be executed out of order, and so their execution may be completed out of
order too.
37
Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 10

• Following execution, register 2 should contain the value 10. However, if Instruction 1 (write 3 to
register 1) does not fully exit the pipeline before Instruction 2 starts executing, it means that Register
1 does not contain the value 3 when Instruction 2 performs its addition

• In such an event, Instruction 2 adds 7 to the old value of register 1 (6), and so register 2 contains 13
instead, i.e.:

Instruction 0: Register 1 = 6
Instruction 2: Register 2 = Register 1 + 7 = 13
Instruction 1: Register 1 = 3

Forwarding helps correct such errors by relying on the fact that the output of Instruction 1 (which is 3) can be used by subsequent instructions before it is stored in Register 1.

https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Operand_forwarding
38
Forwarding

 Forwarding applied to the example means that there is no wait to commit/store the output of Instruction 1 in Register 1 (in this example, the output is 3) before making that output available to the subsequent instruction (in this case, Instruction 2).
 The effect is that Instruction 2 uses the correct (the more recent) value of Register 1: the commit/store was made immediately and not pipelined.
 Forwarding is another solution to data hazards (dependencies), with no bubbles/stalls.

https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Operand_forwarding
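A minimal sketch of the EX-stage forwarding conditions, following the common textbook formulation for a 5-stage pipeline (the signal names and the Forward encoding are assumptions, not taken from these slides):

/* Forwarding-unit sketch for ALU operand A in a classic 5-stage pipeline.
   ex_mem_* and mem_wb_* model fields of the pipeline registers. */
typedef enum { FWD_NONE, FWD_FROM_EX_MEM, FWD_FROM_MEM_WB } Forward;

Forward forward_a(int ex_mem_regwrite, int ex_mem_rd,
                  int mem_wb_regwrite, int mem_wb_rd,
                  int id_ex_rs)
{
    /* Prefer the most recent producer: the instruction now in MEM. */
    if (ex_mem_regwrite && ex_mem_rd != 0 && ex_mem_rd == id_ex_rs)
        return FWD_FROM_EX_MEM;
    if (mem_wb_regwrite && mem_wb_rd != 0 && mem_wb_rd == id_ex_rs)
        return FWD_FROM_MEM_WB;
    return FWD_NONE;  /* no hazard: use the value read from the register file */
}

The same test, with id_ex_rt in place of id_ex_rs, selects the source for operand B.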
39

 In today's architectures, the dependencies between instructions are checked statically by the compiler
and/or dynamically by the hardware at run time.
 This preserves the execution order for dependent instructions, which ensures valid results.
40
Next . . .
 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Data Hazards and Forwarding
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
 Load Delay, Hazard Detection, and Stall Unit
 Pipelined Control
41
Control Hazards

 In any set of instructions, there is normally a need for some kind of branch.
 In general, about 30% of all instructions in a program are branches.
 Thus, branch instructions in the pipeline can reduce the throughput tremendously if not handled properly.
 Each such branch requires a new address to be loaded into the program counter, which may invalidate all the instructions that are either already in the pipeline or prefetched in the buffer.
 This draining and refilling of the pipeline for each branch degrades the throughput of the pipeline to that of a sequential processor.
 A branch that is not taken allows the continued, uninterrupted sequential flow of instructions into the pipeline.
42

 A branch instruction may be one of: (1) unconditional branch, (2) conditional branch, or (3) loop branch.
 An unconditional branch always alters the sequential program flow.
 A conditional branch sets a new target address in the program counter only when a certain condition is satisfied.
 When the condition is satisfied, the path starts from the target address and is called the target path. If it is not, the path starts from the next sequential instruction and is called the sequential path.
 A loop branch in a loop statement usually jumps back to the beginning of the loop and executes it either a fixed or a variable (data-dependent) number of times.
 Among all of the above, conditional branches are the hardest to handle.


43
e.g. [branch example figure omitted]
44

 To reduce the effect of branching, some of the better-known techniques are branch prediction, delayed branching, and multiple prefetching.
45
Branch Prediction

 In this type of design, the outcome of a branch decision is predicted before the branch is actually executed.
 Based on a particular prediction, either the sequential path or the target path is chosen for execution.
 Although the chosen path may often reduce the branch penalty, it may increase the penalty in case of an incorrect prediction.
46
Static vs Dynamic Prediction

 In static prediction, a fixed decision for prefetching one of the two paths is made before the program runs.
 For example, a simple technique would be to always assume that the branch is taken.
 This technique simply loads the program counter with the target address when a branch is encountered.
 Another such technique is to automatically choose one path (sequential or target) for some branch types and another for the rest of the branch types.
 If the chosen path is wrong, the pipeline is drained and instructions corresponding to the correct path are fetched; the penalty is paid.
47
Static vs Dynamic Prediction

 In dynamic prediction, during the execution of the program the processor makes a decision based on the past information of the previously executed branches.
 For example, recording the history of the last two paths taken by each branch instruction: if the last two executions of a branch instruction have chosen the same path, that path will be chosen for the current execution of the branch instruction. If the two paths do not match, one of the paths will be chosen randomly.
48
Counter Based Branch Prediction

 A better approach is to associate an n-bit counter with each branch instruction. This is known as the counter-based branch prediction approach.
 In this method, after executing a branch instruction for the first time, its counter, C, is set to a threshold, T, if the target path was taken, or to T − 1 if the sequential path was taken.
 From then on, whenever the branch instruction is about to be executed, if C ≥ T, then the target path is taken; otherwise, the sequential path is taken.
 If the correct path is the target path, the counter is incremented by 1; if not, C is decremented by 1.
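A minimal sketch of this scheme (a 2-bit saturating counter with threshold T = 2 is assumed for illustration; the slides leave n and T general):

#include <stdbool.h>

enum { T = 2, C_MAX = 3 };   /* threshold and saturation limit for a 2-bit counter */

/* Predict the target path when the counter has reached the threshold. */
bool predict_target(int c) { return c >= T; }

/* First execution: C = T if the target path was taken, else T - 1. */
int init_counter(bool target_taken) { return target_taken ? T : T - 1; }

/* After the branch resolves: move the counter toward the correct path. */
int update_counter(int c, bool target_correct) {
    if (target_correct) return c < C_MAX ? c + 1 : c;  /* increment, saturating */
    return c > 0 ? c - 1 : c;                          /* decrement, saturating */
}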
49
Delayed Branching

 In this type of design, a certain number of instructions after the branch instruction are fetched and executed regardless of which path will be chosen for the branch.
 For example, a processor with a branch delay of k executes a path containing the next k sequential instructions and then either continues on the same path or starts a new path from a new target address.
 The compiler tries to fill the next k instruction slots after the branch with instructions that are independent of the branch instruction. NOP (no-operation) instructions/bubbles are placed in any remaining empty slots.
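For instance (an illustrative sequence, not from the slides): with k = 1, in add R1, R2, R3 followed by beq R4, R5, L, the add does not affect the branch condition, so the compiler can move it into the delay slot after the beq; if no independent instruction is available, a NOP fills the slot.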
e.g. k = 2 [delayed-branch timing figure omitted] 50
51
Multiple Prefetching

 In this type of design, the processor fetches both possible paths.

 Once the branch decision is made, the unwanted path is thrown away.

 By prefetching both possible paths, the fetch penalty is avoided in the case of an incorrect prediction.

 To fetch both paths, two buffers are employed to service the pipeline.
52

 In normal execution, the first buffer is loaded with instructions from the next sequential address of the
branch instruction.

 If a branch occurs, the contents of the first buffer are invalidated, and the secondary buffer, which has
been loaded with instructions from the target address of the branch instruction, is used as the primary
buffer.

 This double-buffering scheme ensures a constant flow of instructions and data to the pipeline and reduces the time delays caused by the draining and refilling of the pipeline.
53
Next . . .
 Pipelining versus Serial Execution
 Pipelined Datapath
 Pipeline Hazards
 Data Hazards and Forwarding
 Control Hazards
 Delayed Branch and Dynamic Branch Prediction
 Load Delay, Hazard Detection, and Stall Unit
Load Delay 54
 Unfortunately, not all data hazards can be forwarded
 Load has a delay that cannot be eliminated by forwarding
 In the example shown below …
 The LW instruction does not have data until end of CC4
 AND instruction wants data at beginning of CC4 - NOT possible

Time (cycles)     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $2, 20($1)    IM   Reg  ALU  DM   Reg
and $4, $2, $5         IM   Reg  ALU  DM   Reg
or  $6, $3, $2              IM   Reg  ALU  DM   Reg
add $7, $2, $2                   IM   Reg  ALU  DM   Reg
(Program order: top to bottom)

However, the load can forward its data to the second next instruction (the or) and to later instructions.

Stall the Pipeline for one Cycle 55

 Freeze the PC and the IF/ID registers
 No new instruction is fetched and the instruction after the load is stalled
 Allow the Load instruction in the ID/EX register to proceed
 Introduce a bubble into the ID/EX register
 Load can then forward its data to the next instruction after delaying it (a sketch of the detection logic follows the diagram below)

Time (cycles)     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8
lw  $2, 20($1)    IM   Reg  ALU  DM   Reg
and $4, $2, $5         IM   bubble  Reg  ALU  DM   Reg
or  $6, $3, $2                  IM   Reg  ALU  DM   Reg
(Program order: top to bottom)

58
Pipelining Summary

 Pipelining doesn’t improve the latency of a single instruction
 However, it improves the throughput of the entire workload
 Instructions are initiated and completed at a higher rate
 In a k-stage pipeline, k instructions operate in parallel
 Overlapped execution using multiple hardware resources
 Potential speedup = number of pipeline stages k
 Unbalanced lengths of pipeline stages reduce speedup
 Pipeline rate is limited by the slowest (bottleneck) pipeline stage
 Also, time to fill and drain the pipeline reduces speedup
59
Pipelining Summary

 Three types of pipeline hazards
 Structural hazards: conflicts using a resource during the same cycle
 Data hazards: due to data dependencies between instructions
 Control hazards: due to branch and jump instructions
 Hazards limit performance and complicate the design
 Structural hazards: eliminated by careful design or more hardware
 Data hazards: eliminated by forwarding
 However, the load delay cannot be eliminated and stalls the pipeline
 Delayed branching can be a solution when the branch delay = 1 cycle
 Branch prediction can reduce the branch delay to zero
 A branch misprediction should drain the wrongly fetched instructions
