Helping Slides Pipelining Hazards Solutions
Outlines
[Figure: laundry analogy — serial execution versus pipelining of 30-minute stages between 6 PM and 12 AM]
Serial Execution versus Pipelining
[Figure: a k-stage pipeline — stages 1 through k overlapped for successive tasks, versus the same stages executed serially]
The processing circuit of a given stage is connected to the input latch of the next stage.
The period of the clock pulse should be large enough to provide sufficient time for a
signal to traverse through the slowest stage, which is called the bottleneck.
In addition, there should be enough time for a latch to store its input signals.
Pipeline Performance
The ability to overlap stages of a sequential process for different input tasks (data or
operations) results in an overall theoretical completion time of

    T = (k + n - 1) × P

where n is the number of input tasks, k is the number of stages in the pipeline, and P is the
clock period.
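A minimal Python sketch of the two timing formulas above (the helper names are my own, not from the slides):

```python
def pipeline_time(n, k, p):
    """Pipelined: the first task takes k cycles; each of the remaining
    n - 1 tasks completes one cycle later, giving (k + n - 1) * p."""
    return (k + n - 1) * p

def serial_time(n, k, p):
    """Nonpipelined: every task passes through all k stages back to back."""
    return n * k * p

# 100 tasks through a 5-stage pipeline with a 1 ns clock period
print(pipeline_time(100, 5, 1))   # 104
print(serial_time(100, 5, 1))     # 500
```

For large n the pipelined time approaches n × P, i.e. a speedup approaching k.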
In a von Neumann architecture, the process of executing an instruction involves several steps.
1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory).
2. Instruction decoding (ID). Identification of the operation to be performed.
3. Operand fetch (OF). Retrieval of any required operands.
4. Execution (EX). Performing the operation on the operands.
5. Write-back (WB). Updating the destination operands.
[Figure: single-cycle MIPS datapath — PC, instruction memory, register file (Rs, Rt, Rd, BusW), immediate extender, ALU, and data memory]
Pipelined Datapath
[Figure: pipelined MIPS datapath — the same components separated by pipeline registers (e.g. MEM/WB) between the stages]
Note that in a nonpipelined design the completion time will be much higher: T = n × k × P.
Instruction–Time Diagram
The diagram shows which instruction occupies what stage at each clock cycle (CC1, CC2, …).
Instruction execution is pipelined over the 5 stages.
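The overlap can be sketched by listing which stage each instruction occupies per cycle (a minimal illustration of the diagram, not from the slides):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(n):
    """One row per instruction; instruction i enters IF at cycle i + 1,
    so each row is shifted one cycle (one '--' slot) to the right."""
    return [["--"] * i + STAGES for i in range(n)]

for row in diagram(3):
    print(" ".join(row))
```

Reading any column vertically shows that, in a steady state, all five stages hold different instructions in the same cycle.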
The goal of pipelining is to allow multiple instructions to execute at the same time.
We may need to perform several operations in a single cycle:
Increment the PC and add registers at the same time.
Fetch one instruction while another one reads or writes data.
The big benefit of pipelining is that once the first element (an instruction, for example) has been
fed into the pipe, the pipe is free to accept the next one in the following cycle, long before the
earlier elements have finished traversing the complete datapath.
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Pipelined Control
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipeline Hazards
Hazards are problems in a CPU's instruction pipeline that arise when the next instruction cannot
execute in the following clock cycle.
Structural Hazards
Problem: attempt to use the same hardware resource by two different instructions during the same cycle.
Example: writing back the ALU result in stage 4 conflicts with writing load data in stage 5 — two
instructions attempt to write the register file during the same cycle.

                  CC1  CC2  CC3  CC4  CC5
or  $4, $3, 7     IF   ID   EX   WB
sub $5, $2, $3         IF   ID   EX   WB
sw  $2, 10($3)              IF   ID   EX   MEM
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall Unit
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Pipelined Control
Data Hazards
If a dependent instruction reads R2 before the preceding instruction has written it, the old contents
of R2 are used to compute the new value of R5, leading to an invalid result.
Solution 1: Stalling
Instruction i2 is said to be stalled for two clock cycles; no-ops (bubbles) are inserted.
Often, when an instruction is stalled, the instructions positioned after it are also stalled,
while the instructions before it can continue execution.
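A toy sketch of bubble insertion (the representation and helper name are my own): each instruction is a (destination, sources) pair, and NOPs are inserted until a consumer starts at least two slots behind its producer.

```python
def insert_bubbles(program, gap=2):
    """program: list of (dest, srcs) pairs. Insert NOP bubbles so that a
    dependent instruction sits at least `gap` slots after the instruction
    that writes its source, mimicking a two-cycle stall."""
    out = []
    last_write = {}                      # register -> index of its writer
    for dest, srcs in program:
        for s in srcs:
            if s in last_write:
                needed = gap - (len(out) - 1 - last_write[s])
                out.extend([("nop", set())] * max(0, needed))
        out.append((dest, srcs))
        if dest != "nop":
            last_write[dest] = len(out) - 1
    return out

# i1 writes R2; i2 reads R2 immediately afterwards
prog = [("r2", {"r1"}), ("r5", {"r2", "r3"})]
print([d for d, _ in insert_bubbles(prog)])  # ['r2', 'nop', 'nop', 'r5']
```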
During compilation, the compiler detects the dependency between data and instructions.
It then rearranges these instructions so that the dependency is not hazardous to the
system.
Compiler Scheduling
There are three primary types of data hazards: RAW (read after write), WAR (write after read), and
WAW (write after write).
1. RAW: This type of data hazard was discussed previously; it refers to the situation in which i2 reads a
data source before i1 writes to it.
2. WAR: This refers to the situation in which i2 writes to a location before i1 reads it.
An invalid result may be produced if i2 writes to R4 before i1 reads it; that is, instruction i1 might
use the wrong value of R4.
WAR cannot occur in our basic 5-stage pipeline because:
Reads are always in stage 2, and
Writes are always in stage 5
Instructions are processed in order
3. WAW: This refers to the situation in which i2 writes to a location before i1 writes to it.
One way to enhance the architecture of an instruction pipeline is to increase concurrent execution of
the instructions by dispatching several independent instructions to different functional units
That is, the instructions can be executed out of order, and so their execution may be completed out of
order too.
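The three hazard types follow directly from which register sets of two in-order instructions overlap. A small classifier (representation is my own) makes the definitions concrete:

```python
def classify_hazards(i1, i2):
    """i1 executes before i2; each instruction is a (writes, reads)
    pair of register sets. Return the data-hazard types between them."""
    w1, r1 = i1
    w2, r2 = i2
    hazards = set()
    if w1 & r2:
        hazards.add("RAW")   # i2 reads what i1 writes
    if r1 & w2:
        hazards.add("WAR")   # i2 writes what i1 reads
    if w1 & w2:
        hazards.add("WAW")   # both write the same location
    return hazards

# i1: r2 = f(r1); i2: r5 = g(r2, r3)  ->  a RAW hazard on r2
print(classify_hazards(({"r2"}, {"r1"}), ({"r5"}, {"r2", "r3"})))  # {'RAW'}
```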
Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 10
• Following execution, register 2 should contain the value 10. However, if Instruction 1 (write 3 to
register 1) does not fully exit the pipeline before Instruction 2 starts executing, it means that Register
1 does not contain the value 3 when Instruction 2 performs its addition
• In such an event, Instruction 2 adds 7 to the old value of register 1 (6), and so register 2 contains 13
instead, i.e.:
Instruction 0: Register 1 = 6
Instruction 2: Register 2 = Register 1 + 7 = 13
Instruction 1: Register 1 = 3
Forwarding corrects such errors by exploiting the fact that the output of Instruction 1 (which
is 3) can be used by subsequent instructions before it is stored in Register 1.
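The Register 1 = 6 / 3 / 10 example above can be replayed with a toy model (entirely my own construction) in which a result reaches the register file one instruction too late, so only the forwarding view sees the fresh value:

```python
def run(program, forwarding):
    """program: list of (dest, src, const); each instruction computes
    dest = regs[src] + const (src=None means dest = const).
    Toy timing: a result reaches the register file only after the next
    instruction has already read its operands, so without forwarding a
    back-to-back consumer sees a stale value."""
    regfile = {}   # stale view: what a consumer without forwarding reads
    newest = {}    # fresh view: what the forwarding network provides
    for dest, src, const in program:
        view = newest if forwarding else regfile
        val = const if src is None else view.get(src, 0) + const
        regfile = dict(newest)           # older results are written back now
        newest = dict(newest)
        newest[dest] = val
    return newest

prog = [(1, None, 6),   # Instruction 0: Register 1 = 6
        (1, None, 3),   # Instruction 1: Register 1 = 3
        (2, 1, 7)]      # Instruction 2: Register 2 = Register 1 + 7
print(run(prog, forwarding=False)[2])  # 13 (stale value of Register 1)
print(run(prog, forwarding=True)[2])   # 10 (forwarded value)
```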
https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Operand_forwarding
Forwarding
The effect is that Instruction 2 uses the correct (the more recent) value of Register 1: the
commit/store was made immediately and not pipelined.
https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Operand_forwarding
In today's architectures, the dependencies between instructions are checked statically by the compiler
and/or dynamically by the hardware at run time.
This preserves the execution order for dependent instructions, which ensures valid results.
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipeline Hazards
Data Hazards and Forwarding
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Load Delay, Hazard Detection, and Stall Unit
Pipelined Control
Control Hazards
In any set of instructions, there is normally a need for some kind of branches.
In general, about 30% of all instructions in a program are branches.
Thus branch instructions in the pipeline can reduce the throughput tremendously if not handled properly.
Each such branch requires a new address to be loaded into the program counter, which may invalidate
all the instructions that are either already in the pipeline or prefetched in the buffer.
This draining and refilling of the pipeline for each branch degrades the throughput of the pipeline to
that of a sequential processor.
A branch not taken allows the continued sequential flow of uninterrupted instructions to the pipeline.
A conditional branch sets a new target address in the program counter only when a certain condition is satisfied.
When the condition is satisfied, the path starts from the target address and is called the target path.
If it is not, the path starts from the next sequential instruction and is called the sequential path.
To reduce the effect of branching, some of the better-known techniques are branch
prediction, delayed branching, and multiple prefetching.
Branch Prediction
In this type of design, the outcome of a branch decision is predicted before the branch is
actually executed.
Although the chosen path often reduces the branch penalty, it may increase the
penalty in the case of an incorrect prediction.
Static vs Dynamic Prediction
In static prediction, a fixed decision for prefetching one of the two paths is made before the program runs.
For example, a simple technique would be to always assume that the branch is taken.
This technique simply loads the program counter with the target address
when a branch is encountered
Another such technique is to automatically choose one path (sequential or target) for some branch types and
another for the rest of the branch types.
If the chosen path is wrong, the pipeline is drained and instructions corresponding to the correct path are
fetched; the penalty is paid.
Static vs Dynamic Prediction
In dynamic prediction, during the execution of the program the processor makes a decision based on
the past information of the previously executed branches.
For example, recording the history of the last two paths taken by each branch instruction. If the last two
executions of a branch instruction have chosen the same path, that path will be chosen for the current
execution of the branch instruction. If the two paths do not match, one of the paths will be chosen
randomly.
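The two-path history rule just described can be stated in a few lines (a sketch; the function name is my own):

```python
import random

def predict_path(history):
    """history: the last two outcomes of this branch, 'T' (target path)
    or 'N' (sequential path). If both agree, choose that path; otherwise
    choose one of the two paths at random, as described above."""
    first, second = history
    if first == second:
        return first
    return random.choice(["T", "N"])

print(predict_path(("T", "T")))  # 'T'
print(predict_path(("N", "N")))  # 'N'
```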
Counter Based Branch Prediction
A better approach is to associate an n-bit counter with each branch instruction. This is known as
the counter-based branch prediction approach.
In this method, after executing a branch instruction for the first time, its counter, C, is set to a
threshold, T, if the target path was taken, or to
T-1 if the sequential path was taken.
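A sketch of the counter-based approach: the initialization rule (C = T after a taken target, T - 1 after the sequential path) follows the slides, while the saturating increment/decrement update and the default prediction for unseen branches are standard-scheme assumptions of mine:

```python
class CounterPredictor:
    """Per-branch n-bit saturating counter; predict the target path
    when the counter is at or above the threshold T."""
    def __init__(self, n=2):
        self.max = 2 ** n - 1          # counter saturates here
        self.T = 2 ** (n - 1)          # prediction threshold
        self.counters = {}             # branch address -> counter value

    def predict_target(self, pc):
        # Unseen branches default to predicting the target path (assumption).
        return self.counters.get(pc, self.T) >= self.T

    def update(self, pc, took_target):
        if pc not in self.counters:    # first execution: the slides' rule
            self.counters[pc] = self.T if took_target else self.T - 1
        elif took_target:
            self.counters[pc] = min(self.max, self.counters[pc] + 1)
        else:
            self.counters[pc] = max(0, self.counters[pc] - 1)

bp = CounterPredictor(n=2)        # T = 2, counters saturate at 3
bp.update(0x40, True)             # first execution took the target: C = 2
print(bp.predict_target(0x40))    # True
bp.update(0x40, False)
bp.update(0x40, False)            # C falls to 0
print(bp.predict_target(0x40))    # False
```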
Delayed Branching
In this type of design, a certain number of instructions after the branch instruction are fetched and
executed regardless of which path will be chosen for the branch.
For example, a processor with a branch delay of k executes a path containing the next k sequential
instructions and then either continues on the same path or starts a new path from a new target address.
The compiler tries to fill the next k instruction slots after the branch with instructions that are
independent of the branch instruction. NOP (no operation) instructions/bubbles are placed in any
remaining empty slots.
e.g., k = 2
Multiple Prefetching
By prefetching both possible paths, the fetch penalty is avoided in the case of an incorrect prediction.
To fetch both paths, two buffers are employed to service the pipeline.
Once the branch decision is made, the unwanted path is thrown away.
In normal execution, the first buffer is loaded with instructions from the next sequential address of the
branch instruction.
If a branch occurs, the contents of the first buffer are invalidated, and the secondary buffer, which has
been loaded with instructions from the target address of the branch instruction, is used as the primary
buffer.
This double buffering scheme ensures a constant flow of instructions and data to the pipeline and
reduces the time delays caused by the draining and refilling of the pipeline.
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath
Pipeline Hazards
Data Hazards and Forwarding
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Load Delay, Hazard Detection, and Stall Unit
Load Delay
Unfortunately, not all data hazards can be resolved by forwarding.
A load has a delay that cannot be eliminated by forwarding.
In the example shown below:
The LW instruction does not have its data until the end of CC4.
The AND instruction wants the data at the beginning of CC4, which is not possible.
Time (cycles)      CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8

lw  $2, 20($1)     IF    Reg   ALU   DM    Reg
and $4, $2, $5           IF    Reg   ALU   DM    Reg

(Program order, top to bottom. The load can forward its data to the second
next instruction, but not to the immediately following one.)
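The hazard-detection condition behind this case is simple: stall when a load's destination register is a source of the immediately following instruction. A sketch (the tuple representation is my own):

```python
def needs_stall(load, nxt):
    """load: ('lw', dest_reg); nxt: (op, dest, src1, src2).
    The load's data is ready only at the end of its DM stage, one cycle
    too late for the next instruction's ALU stage, so one bubble is
    required; forwarding then covers later consumers."""
    op, dest = load
    return op == "lw" and dest in nxt[2:]

print(needs_stall(("lw", "$2"), ("and", "$4", "$2", "$5")))  # True
print(needs_stall(("lw", "$2"), ("or", "$6", "$3", "$5")))   # False
```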