Lec18 Pipeline

Pipeline and Vector Processing
(Chapter 2 and Appendix A)

Dr. Bernard Chen Ph.D.
University of Central Arkansas
Parallel processing
A parallel processing system performs concurrent data processing to achieve
faster execution time.
The system may have two or more ALUs and be able to execute two or more
instructions at the same time.
The goal is to increase throughput: the amount of processing that can be
accomplished during a given interval of time.
Parallel processing classification
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data stream (SIMD)
Multiple instruction stream, single data stream (MISD)
Multiple instruction stream, multiple data stream (MIMD)
Single instruction stream, single data stream (SISD)
A single control unit, a single computer, and a memory unit.
Instructions are executed sequentially. Parallel processing may still be
achieved by means of multiple functional units or by pipeline processing.
Single instruction stream, multiple data stream (SIMD)
An organization that includes many processing units under the supervision
of a common control unit.
All processors receive the same instruction but operate on different data.
Multiple instruction stream, single data stream (MISD)
Of theoretical interest only.
Processors receive different instructions but operate on the same data.
Multiple instruction stream, multiple data stream (MIMD)
A computer system capable of processing several programs at the same time.
Most multiprocessor and multicomputer systems can be classified in this
category.
Pipelining: Laundry Example
A small laundry has one washer, one dryer, and one operator. It takes 90
minutes to finish one load:
Washer takes 30 minutes
Dryer takes 40 minutes
Operator folding takes 20 minutes
Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, and D run one after another in task order; each load takes 30 + 40 + 20 = 90 minutes.]
This operator schedules loads to be delivered to the laundry every 90
minutes, which is the time required to finish one load. In other words, he
does not start a new load until he has finished the previous one.
The process is sequential. Sequential laundry takes 6 hours for 4 loads.
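Efficiently scheduled laundry: Pipelined Laundry
The operator starts each step as soon as possible.
[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, and D overlap in task order: one 30-minute wash, then four back-to-back 40-minute dryer runs, then a final 20-minute fold.]
Another operator asks for loads to be delivered to the laundry every 40
minutes.
Pipelined laundry takes 3.5 hours for 4 loads.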
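The two totals above can be checked with a short sketch. It assumes that a load may wait between steps until the next machine is free, and the function names are illustrative only, not part of the lecture.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Finish time when each stage starts as soon as it and its input are free."""
    # stage_free[j] is the time at which stage j finishes its most recent load.
    stage_free = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for j, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[j])
            ready = start + minutes
            stage_free[j] = ready
    return stage_free[-1]


laundry = [30, 40, 20]  # washer, dryer, folding (minutes)
print(sequential_minutes(laundry, 4) / 60)  # 6.0 hours
print(pipelined_minutes(laundry, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 * 40 + 20 = 210 minutes: after the first wash, the 40-minute dryer is the stage that paces every load.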
Pipelining Facts
[Figure: pipelined laundry timeline from 6 PM to 9 PM for the loads in task order. Annotation: the washer waits for the dryer for 10 minutes.]
Multiple tasks operate simultaneously.
Pipelining does not help the latency of a single task; it helps the
throughput of the entire workload.
The pipeline rate is limited by the slowest pipeline stage.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
The time to fill the pipeline and the time to drain it reduce speedup.
9.2 Pipelining
Pipelining decomposes a sequential process into segments.
The processor is divided into segment processors, each dedicated to a
particular segment.
Each segment is executed in its dedicated segment processor, which operates
concurrently with all other segments.
Information flows through these multiple hardware segments.
9.2 Pipelining
Instruction execution is divided into k segments or stages.
An instruction exits pipe stage k-1 and proceeds into pipe stage k.
All pipe stages take the same amount of time, called one processor cycle.
The length of the processor cycle is determined by the slowest pipe stage.
[Figure: a pipeline of k segments.]
SPEEDUP
Consider a k-segment pipeline operating on n data sets. (In the above
example, k = 3 and n = 4.)
It takes k clock cycles to fill the pipeline and get the first result from
the output of the pipeline.
After that, the remaining (n - 1) results come out one per clock cycle.
It therefore takes (k + n - 1) clock cycles to complete the task.
Example
A non-pipelined system takes 100 ns to process a task.
The same task can be processed in a five-segment pipeline with a clock
cycle of 20 ns per segment.
Determine how much time is required to finish 10 tasks.
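One way to work this example is to apply the (k + n - 1) cycle count from the SPEEDUP slide. The helper name below is an illustrative assumption.

```python
def pipeline_time_ns(k, n, cycle_ns):
    """Total time for n tasks on a k-segment pipeline: (k + n - 1) clock cycles."""
    return (k + n - 1) * cycle_ns


print(pipeline_time_ns(k=5, n=10, cycle_ns=20))  # 280 (ns, pipelined)
print(10 * 100)                                   # 1000 (ns, non-pipelined)
```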
SPEEDUP
If we execute the same task sequentially in a single processing unit, it
takes (k * n) clock cycles.
The speedup gained by using the pipeline is:
S = (k * n) / (k + n - 1)
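As a rough sketch, the ratio of sequential to pipelined clock cycles, (k * n) / (k + n - 1), can be tabulated for a growing number of tasks. It assumes every segment takes exactly one clock cycle; the function name is illustrative.

```python
def speedup_cycles(k, n):
    """Sequential cycles (k * n) divided by pipelined cycles (k + n - 1)."""
    return (k * n) / (k + n - 1)


# For k = 3 the ratio starts at 1 for a single task and climbs toward 3.
for n in (1, 4, 100, 10_000):
    print(n, round(speedup_cycles(3, n), 3))
```

For k = 3 and n = 4 this gives 12 / 6 = 2. As n grows, the fill time of k - 1 cycles matters less and the ratio approaches k, the number of segments.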
Example
A non-pipelined system takes 100 ns to process a task.
The same task can be processed in a five-segment pipeline with a clock
cycle of 20 ns per segment.
Determine the speedup ratio of the pipeline for 1000 tasks.
5-Stage Pipelining
S1: Fetch Instruction (FI)
S2: Decode Instruction (DI)
S3: Fetch Operand (FO)
S4: Execute Instruction (EI)
S5: Write Operand (WO)

Space-time diagram (entries are instruction numbers):
Cycle     1  2  3  4  5  6  7  8  9
S1 (FI)   1  2  3  4  5  6  7  8  9
S2 (DI)      1  2  3  4  5  6  7  8
S3 (FO)         1  2  3  4  5  6  7
S4 (EI)            1  2  3  4  5  6
S5 (WO)               1  2  3  4  5
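A space-time diagram of this kind can be generated with a few lines of code. This is a minimal sketch assuming an ideal pipeline with no stalls; the function name and the "." marker for an idle stage are illustrative choices.

```python
def space_time_diagram(k, n):
    """Rows of a space-time diagram for n instructions on k stages.

    rows[s][c] is the instruction number in stage s+1 during cycle c+1,
    or None when that stage is idle.
    """
    cycles = k + n - 1
    rows = []
    for stage in range(k):
        row = []
        for cycle in range(cycles):
            # Instruction i (1-based) reaches stage s (0-based) in cycle i - 1 + s.
            instr = cycle - stage + 1
            row.append(instr if 1 <= instr <= n else None)
        rows.append(row)
    return rows


for stage, row in enumerate(space_time_diagram(k=5, n=5), start=1):
    cells = " ".join(f"{i:>2}" if i is not None else " ." for i in row)
    print(f"S{stage} {cells}")
```

With k = 5 and n = 5 the diagram spans 5 + 5 - 1 = 9 cycles, so both the fill and the drain of the pipeline are visible.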
Example Answer
Speedup ratio for 1000 tasks:
(100 * 1000) / ((5 + 1000 - 1) * 20) = 4.98
Example
A non-pipelined system takes 100 ns to process a task.
The same task can be processed in a six-segment pipeline whose segment
delays are 20 ns, 25 ns, 30 ns, 10 ns, 15 ns, and 30 ns.
Determine the speedup ratio of the pipeline for 10, 100, and 1000 tasks.
What is the maximum speedup that can be achieved?
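Example Answer
The clock cycle is set by the slowest segment, 30 ns.
Speedup ratio for 10 tasks:
(100 * 10) / ((6 + 10 - 1) * 30) = 2.22
Speedup ratio for 100 tasks:
(100 * 100) / ((6 + 100 - 1) * 30) = 3.17
Speedup ratio for 1000 tasks:
(100 * 1000) / ((6 + 1000 - 1) * 30) = 3.32
Maximum speedup:
(100 * N) / ((6 + N - 1) * 30) approaches 100 / 30 = 10/3 as N grows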
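The same arithmetic can be written as a small function. This sketch assumes the clock cycle equals the slowest segment delay, as the definitions in this lecture state; the function and variable names are illustrative.

```python
def speedup(t_nonpipelined_ns, segment_delays_ns, n):
    """Speedup of a pipeline over a non-pipelined unit for n tasks."""
    k = len(segment_delays_ns)
    cycle_ns = max(segment_delays_ns)  # the slowest segment sets the clock
    return (t_nonpipelined_ns * n) / ((k + n - 1) * cycle_ns)


delays = [20, 25, 30, 10, 15, 30]
for n in (10, 100, 1000):
    print(n, round(speedup(100, delays, n), 2))
print("limit", round(100 / max(delays), 2))
```

Calling `speedup(100, [20] * 5, 1000)` reproduces the five-segment example, which rounds to 4.98.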
Some definitions
Pipeline: an implementation technique in which multiple instructions are
overlapped in execution.
Pipeline stage: the pipeline divides instruction processing into stages.
Each stage completes a part of an instruction and loads a new part in
parallel.
Some definitions
Throughput of the instruction pipeline is determined by how often an
instruction exits the pipeline. Pipelining does not decrease the time for
individual instruction execution. Instead, it increases instruction
throughput.
Machine cycle: the time required to move an instruction one step further in
the pipeline. The length of the machine cycle is determined by the time
required for the slowest pipe stage.
Instruction pipeline versus sequential processing
[Figure: timing of sequential processing compared with an instruction pipeline. Annotation: sequential processing is faster for few instructions.]
Instruction execution is separated into five steps:
1. Fetch the instruction
2. Decode the instruction
3. Fetch the operands from memory
4. Execute the instruction
5. Store the results in the proper place
Five Stage
Instruction Pipeline
Fetch instruction
Decode instruction
Fetch operands
Execute instructions
Write result
Difficulties...
If a complicated memory access occurs in stage 1, stage 2 is delayed and
the rest of the pipe is stalled.
If there is a branch (an if or a jump), then some of the instructions that
have already entered the pipeline should not be processed.
We need to deal with these difficulties to keep the pipeline moving.
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in
the instruction stream from executing during its designated cycle.
There are three classes of hazards:
Structural hazard
Data hazard
Branch hazard
Pipeline Hazards
Structural hazard
Resource conflicts that arise when the hardware cannot support all possible
combinations of instructions simultaneously
Data hazard
An instruction depends on the result of a previous instruction
Branch hazard
Instructions that change the PC
Structural hazard
Some pipelined processors share a single memory pipeline for data and
instructions.
Structural hazard
A memory access is required in both FI and FO, so with a single memory the
two stages can conflict in the same cycle.
[Figure: the 5-stage space-time diagram (S1 FI, S2 DI, S3 FO, S4 EI, S5 WO) with instructions 1 to 9 entering on consecutive cycles; from cycle 3 onward FI and FO are active in the same cycle.]
Structural hazard
To resolve this hazard, we stall the pipeline until the resource is freed.
A stall is commonly called a pipeline bubble, since it floats through the
pipeline taking up space but carrying no useful work.
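The stall can be sketched as a scheduling rule. The model below is an assumption made for illustration, not the lecture's hardware: FI and FO share one memory port, FO has priority, FO happens two cycles after FI, and only some instructions need a memory operand.

```python
def fetch_cycles(uses_memory_operand, fo_offset=2):
    """Cycle in which each instruction is fetched (FI) under a shared memory port.

    uses_memory_operand: one bool per instruction, True if its FO stage
    accesses memory. FO is assumed to win the port, so FI waits.
    """
    busy = set()   # cycles in which an operand fetch occupies the memory
    fetched = []
    cycle = 0
    for needs_memory in uses_memory_operand:
        cycle += 1
        while cycle in busy:   # stall: a bubble enters the pipeline instead
            cycle += 1
        fetched.append(cycle)
        if needs_memory:
            busy.add(cycle + fo_offset)  # FO is two stages after FI
    return fetched


# Only the first instruction reads a memory operand.
print(fetch_cycles([True, False, False, False]))  # [1, 2, 4, 5]
```

In this run the third instruction cannot be fetched in cycle 3, because the first instruction is using the memory for its operand, so one bubble is inserted and every later instruction is delayed by one cycle.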
Structural hazard
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages.]
Data hazard
Example:
ADD   R1 <- R2 + R3
SUB   R4 <- R1 - R5
AND   R6 <- R1 AND R7
OR    R8 <- R1 OR R9
XOR   R10 <- R1 XOR R11
Data hazard
FO: fetch data value
WO: store the executed value
[Figure: space-time diagram over the FI (S1), DI (S2), FO (S3), EI (S4), and WO (S5) stages.]
Data hazard
The delayed load approach inserts no-operation instructions to avoid the
data conflict.
ADD   R1 <- R2 + R3
No-op
No-op
SUB   R4 <- R1 - R5
AND   R6 <- R1 AND R7
OR    R8 <- R1 OR R9
XOR   R10 <- R1 XOR R11
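The no-op insertion above can be sketched as a compiler-style pass. It assumes that a consumer's FO must fall in a later cycle than the producer's WO, which for the FI, DI, FO, EI, WO pipeline means at least three instruction slots between them and matches the two no-ops shown. The tuple encoding and the function name are illustrative.

```python
def insert_noops(program, min_distance=3):
    """Insert NOPs so every reader is at least min_distance slots after the writer.

    program: list of (opcode, destination register, source registers).
    """
    out = []
    last_write = {}  # register -> position in `out` of its most recent writer
    for op, dest, srcs in program:
        need = len(out)  # earliest slot this instruction may occupy
        for reg in srcs:
            if reg in last_write:
                need = max(need, last_write[reg] + min_distance)
        while len(out) < need:
            out.append(("NOP", None, ()))
        last_write[dest] = len(out)
        out.append((op, dest, srcs))
    return out


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]
print([op for op, _, _ in insert_noops(program)])
# ['ADD', 'NOP', 'NOP', 'SUB', 'AND', 'OR', 'XOR']
```

Only SUB needs padding: once it has been pushed back, AND, OR, and XOR are already far enough from ADD.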
Data hazard
It can be further solved by a simple hardware technique called forwarding
(also called bypassing or short-circuiting).
The insight in forwarding is that the result is not really needed by SUB
until after the ADD has actually produced it.
If the forwarding hardware detects that the previous ALU operation has
written the register corresponding to a source for the current ALU
operation, control logic selects the forwarded ALU result as the input
instead of the value fetched in the operand-fetch stage.
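The selection made by the control logic can be illustrated in software. This is only a sketch of the idea, with a dictionary standing in for the register file and made-up register values; real forwarding is a hardware multiplexer in front of the ALU inputs.

```python
def alu_operand(src, regfile, prev_dest, prev_result):
    """Choose the ALU input for source register `src`.

    If the previous ALU operation wrote `src`, take its result directly
    (forwarding); otherwise use the value read from the register file.
    """
    if src == prev_dest:
        return prev_result
    return regfile[src]


# Hypothetical register contents; R1 still holds a stale value.
regfile = {"R1": 0, "R2": 5, "R3": 7, "R5": 2}

# ADD R1 <- R2 + R3 has executed but has not yet written R1 back.
add_result = regfile["R2"] + regfile["R3"]

# SUB R4 <- R1 - R5 issued right behind the ADD.
forwarded = (alu_operand("R1", regfile, "R1", add_result)
             - alu_operand("R5", regfile, "R1", add_result))
stale = regfile["R1"] - regfile["R5"]
print(forwarded, stale)  # 10 -2
```

With forwarding, SUB sees 12 for R1 and computes 10; without it, SUB would use the stale 0 and compute -2.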
Branch hazards
Branch hazards can cause a greater performance loss for pipelines than data
hazards.
When a branch instruction is executed, it may or may not change the PC.
If a branch changes the PC to its target address, it is a taken branch.
Otherwise, it is untaken.
Branch hazards
There are four schemes to handle branch hazards:
Freeze scheme
Predict-untaken scheme
Predict-taken scheme
Delayed branch
Branch Untaken
(Freeze approach)
The simplest method of dealing with branches is to redo the fetch following
a branch.
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for an untaken branch under the freeze approach.]
Branch Taken
(Freeze approach)
The simplest method of dealing with branches is to redo the fetch following
a branch.
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for a taken branch under the freeze approach.]
Branch Taken
(Freeze approach)
The simplest scheme to handle branches is to freeze the pipeline, holding
or deleting any instructions after the branch until the branch destination
is known.
The attractiveness of this solution lies primarily in its simplicity, both
for hardware and software.
Branch Hazards
(Predicted-untaken)
A higher-performance, and only slightly more complex, scheme is to treat
every branch as not taken.
It is implemented by continuing to fetch instructions as if the branch were
a normal instruction.
The pipeline looks the same if the branch is not taken.
If the branch is taken, we need to redo the instruction fetch.
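The difference between freezing and predicting untaken can be put into a rough cost model. Everything numeric here is a hypothetical assumption for illustration: an ideal rate of one cycle per instruction, a fixed penalty per redone fetch, and made-up branch statistics.

```python
def cpi_freeze(branch_fraction, penalty_cycles):
    """Freeze scheme: every branch pays the penalty."""
    return 1 + branch_fraction * penalty_cycles


def cpi_predict_untaken(branch_fraction, taken_fraction, penalty_cycles):
    """Predict-untaken: only taken branches pay the penalty."""
    return 1 + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% branches, 60% of them taken, 2-cycle penalty.
print(cpi_freeze(0.2, 2))
print(cpi_predict_untaken(0.2, 0.6, 2))
```

Under these assumptions the predict-untaken scheme never does worse than freezing, and it gains more the smaller the fraction of taken branches.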
Branch Untaken
(Predicted-untaken)
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for an untaken branch under the predicted-untaken scheme.]
Branch Taken
(Predicted-untaken)
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for a taken branch under the predicted-untaken scheme.]
Branch Taken
(Predicted-taken)
An alternative scheme is to treat every branch as taken.
As soon as the branch is decoded and the target address is computed, we
assume the branch to be taken and begin fetching and executing at the
target.
Branch Untaken
(Predicted-taken)
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for an untaken branch under the predicted-taken scheme.]
Branch taken
(Predicted-taken)
[Figure: space-time diagram over the FI, DI, FO, EI, and WO stages for a taken branch under the predicted-taken scheme.]
Delayed Branch
A fourth scheme in use in some processors is called delayed branch.
It is done at compile time: the compiler modifies the code.
The general format is:
  branch instruction
  delay slot
  branch target if taken
Delayed Branch
(a) Optimal delayed branch
If the optimal choice is not available:
(b) Act like predict-taken (done by the compiler)
(c) Act like predict-untaken (done by the compiler)
Delayed Branch
Delayed branch is limited by:
(1) the restrictions on the instructions that are scheduled into the delay
slots (for example, another branch cannot be scheduled)
(2) our ability to predict at compile time whether a branch is likely to be
taken or not (it is hard to choose between (b) and (c))
Branch Prediction
A pipeline with branch prediction uses some additional logic to guess the
outcome of a conditional branch instruction before it is executed.
Branch Prediction
Various techniques can be used to predict whether a branch will be taken or
not:
Predict never taken
Predict always taken
Predict by opcode
Branch history table
The first three approaches are static: they do not depend on the execution
history up to the time of the conditional branch instruction. The last
approach is dynamic: it depends on the execution history.
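To make the static versus dynamic distinction concrete, the sketch below counts correct predictions on a made-up trace of one loop branch. The history table uses 2-bit saturating counters indexed by the low bits of the branch address, which is one common design and an assumption here, since the slides do not specify the table's organization. The trace, the address 0x40, and the table size are hypothetical.

```python
def never_taken(trace):
    """Static: correct whenever the branch is not taken."""
    return sum(1 for _, taken in trace if not taken)


def always_taken(trace):
    """Static: correct whenever the branch is taken."""
    return sum(1 for _, taken in trace if taken)


def history_table(trace, entries=16):
    """Dynamic: 2-bit saturating counter per table entry (assumed design)."""
    counters = [0] * entries          # 0..1 predict untaken, 2..3 predict taken
    correct = 0
    for address, taken in trace:
        slot = address % entries
        predicted_taken = counters[slot] >= 2
        if predicted_taken == taken:
            correct += 1
        # Update the history with the actual outcome.
        if taken:
            counters[slot] = min(3, counters[slot] + 1)
        else:
            counters[slot] = max(0, counters[slot] - 1)
    return correct


# Hypothetical loop branch: taken 9 times, then falls through, run 3 times.
trace = [(0x40, outcome) for outcome in ([True] * 9 + [False]) * 3]
print(never_taken(trace), always_taken(trace), history_table(trace))
```

On this trace the history table starts out wrong while its counter warms up, then mispredicts only the loop exit on each pass. A static scheme cannot adapt in this way if the branch's behavior changes during execution.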
