Unit 6


Overview

Pipelining is widely used in modern processors.
Pipelining improves system performance in terms of throughput.
Pipelined organization requires sophisticated compilation techniques.
Making the Execution of Programs Faster
Use faster circuit technology to build the processor and the main memory.
Arrange the hardware so that more than one operation can be performed at the same time.
With the latter approach, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.
Traditional Concept
Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Traditional Concept
[Figure: sequential laundry timeline, 6 PM to midnight — loads A–D each take 30 + 40 + 20 minutes, one load at a time.]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
[Figure: pipelined laundry timeline starting at 6 PM — stage intervals 30, 40, 40, 40, 40, 20 minutes as loads A–D overlap in washer, dryer, and folder.]
Pipelined laundry takes 3.5 hours for 4 loads.
Traditional Pipeline Concept
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to “fill” the pipeline and time to “drain” it reduce speedup.
Stall for dependences.
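The laundry timings above can be checked with a short sketch (a minimal model; the variable names are my own, and it assumes the 40-minute dryer is the bottleneck stage, matching the slide's timeline):

```python
# Stage times in minutes for one load: wash, dry, fold.
STAGES = [30, 40, 20]
LOADS = 4

# Sequential: every load runs all stages before the next load starts.
sequential = LOADS * sum(STAGES)  # 4 * 90 = 360 min = 6 hours

# Pipelined: the first load fills the pipeline; after that, a new load
# finishes every "slowest stage" interval (the 40-minute dryer).
bottleneck = max(STAGES)
pipelined = sum(STAGES) + (LOADS - 1) * bottleneck  # 90 + 3*40 = 210 min = 3.5 hours

print(sequential, pipelined)  # 360 210
```

This reproduces the 6-hour sequential and 3.5-hour pipelined totals shown in the diagrams.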
Use the Idea of Pipelining in a Computer: Fetch + Execution
[Figure: (a) sequential execution — instructions I1, I2, I3 each complete fetch (F) and execute (E) before the next begins; (b) hardware organization — an instruction fetch unit and an execution unit connected by interstage buffer B1; (c) pipelined execution — over clock cycles 1–4, F2 overlaps E1 and F3 overlaps E2.]
Figure 8.1. Basic idea of instruction pipelining.
Use the Idea of Pipelining in a Computer: Fetch + Decode + Execute + Write
[Figure: (a) instruction execution divided into four steps — over clock cycles 1–7, instructions I1–I4 pass through F, D, E, W one cycle apart; (b) hardware organization — four stages separated by interstage buffers B1, B2, B3.]
F: fetch instruction
D: decode instruction and fetch operands
E: execute operation
W: write results
Figure 8.2. A 4-stage pipeline.


Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. Given latch delay is 10 ns, calculate:
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speed up ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
6. Throughput
Given-
Four stage pipeline is used
Delay of stages = 60, 50, 90 and 80 ns
Latch delay (delay due to each register) = 10 ns

Part-01: Pipeline Cycle Time-

Cycle time
= Maximum delay due to any stage + Delay due to its
register
= Max { 60, 50, 90, 80 } + 10 ns
= 90 ns + 10 ns
= 100 ns
Part-02: Non-Pipeline Execution Time-

Non-pipeline execution time for one instruction


= 60 ns + 50 ns + 90 ns + 80 ns
= 280 ns

Part-03: Speed Up Ratio-

Speed up
= Non-pipeline execution time / Pipeline execution time
= 280 ns / Cycle time
= 280 ns / 100 ns
= 2.8
Part-04: Pipeline Time For 1000 Tasks-

Pipeline time for 1000 tasks


= Time taken for 1st task + Time taken for remaining 999
tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 100 ns + 999 x 100 ns
= 400 ns + 99900 ns
= 100300 ns
Part-05: Sequential Time For 1000 Tasks-

Non-pipeline time for 1000 tasks


= 1000 x Time taken for one task
= 1000 x 280 ns
= 280000 ns

Part-06: Throughput-

Throughput for pipelined execution


= Number of instructions executed per unit time
= 1000 tasks / 100300 ns
≈ 0.00997 tasks per ns (about 9.97 million tasks per second)
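The formulas in Parts 01–06 can be collected into a small helper (a sketch; the function name and structure are my own):

```python
def pipeline_stats(stage_delays, latch_delay, n_tasks):
    """Compute basic pipeline timing figures (all times in ns)."""
    cycle = max(stage_delays) + latch_delay  # slowest stage + register delay
    non_pipe_one = sum(stage_delays)         # one task without pipelining
    k = len(stage_delays)                    # number of stages
    pipe_total = (k + n_tasks - 1) * cycle   # k cycles to fill, then 1/cycle
    seq_total = n_tasks * non_pipe_one
    speedup = non_pipe_one / cycle
    throughput = n_tasks / pipe_total        # tasks per ns
    return cycle, pipe_total, seq_total, speedup, throughput

cycle, pipe_total, seq_total, speedup, tput = pipeline_stats([60, 50, 90, 80], 10, 1000)
print(cycle, pipe_total, seq_total, round(speedup, 2))  # 100 100300 280000 2.8
```

Note that the speedup here (2.8) uses the per-instruction times; for very large task counts the speedup of total execution time approaches 280000 / 100300 ≈ 2.79.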
A four stage pipeline has stage delays of 150, 120, 160 and 140 ns respectively. Registers are used between the stages and have a delay of 5 ns each. Assuming a constant clocking rate, the total time taken to process 1000 data items on the pipeline will be:
1. 120.4 microseconds
2. 160.5 microseconds
3. 165.5 microseconds
4. 590.0 microseconds
Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and an average of 4 cycles per instruction. The same processor is upgraded to a pipelined processor with five stages, but due to the internal pipeline delay the clock speed is reduced to 2 gigahertz. Assume there are no stalls in the pipeline. The speedup achieved by this pipelined processor is:
1. 3.2
2. 3.0
3. 2.2
4. 2.0
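The standard way to evaluate this kind of question is to compare time per instruction, CPI / clock rate, before and after. A short sketch (assuming, as is usual for these exercises, that an ideal stall-free pipeline retires one instruction per cycle, i.e. effective CPI = 1):

```python
# Speedup of a pipelined design over a non-pipelined one, comparing
# time per instruction = CPI / clock_rate (rates in GHz, so times in ns).
def speedup(cpi_old, ghz_old, cpi_new, ghz_new):
    return (cpi_old / ghz_old) / (cpi_new / ghz_new)

# Non-pipelined: 4 cycles at 2.5 GHz = 1.6 ns per instruction.
# Pipelined:     1 cycle  at 2.0 GHz = 0.5 ns per instruction.
print(speedup(4, 2.5, 1, 2.0))  # 3.2
```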
We have 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5-stage pipeline with stage execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages each with 2 ns execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?

Cycle Time in Design D1-

Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 3, 2, 4, 2, 3 } + 0
= 4 ns
Execution Time For 100 Instructions in Design D1-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99
instructions
= 1 x 5 clock cycles + 99 x 1 clock cycle
= 5 x cycle time + 99 x cycle time
= 5 x 4 ns + 99 x 4 ns
= 20 ns + 396 ns
= 416 ns
Cycle Time in Design D2-
Cycle time
= Delay due to a stage + Delay due to its register
= 2 ns + 0
= 2 ns
Execution Time For 100 Instructions in Design D2-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99
instructions
= 1 x 8 clock cycles + 99 x 1 clock cycle
= 8 x cycle time + 99 x cycle time
= 8 x 2 ns + 99 x 2 ns
= 16 ns + 198 ns
= 214 ns
Time Saved-
Time saved
= Execution time in design D1 – Execution time in design D2
= 416 ns – 214 ns
= 202 ns
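The same "fill the pipeline, then one result per cycle" formula covers both designs; a short sketch (function name my own, register delay taken as 0 as in the solution above):

```python
# Time to execute n instructions on a k-stage pipeline: the first
# instruction takes k cycles, each remaining instruction one more cycle.
def pipeline_time(n, stage_delays, register_delay=0):
    cycle = max(stage_delays) + register_delay
    k = len(stage_delays)
    return (k + n - 1) * cycle

d1 = pipeline_time(100, [3, 2, 4, 2, 3])  # (5 + 99) * 4 ns = 416 ns
d2 = pipeline_time(100, [2] * 8)          # (8 + 99) * 2 ns = 214 ns
print(d1 - d2)  # 202
```

The comparison also shows the trade-off: D2's deeper pipeline wins here because its shorter cycle time outweighs its longer fill time.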
Role of Cache Memory
Each pipeline stage is expected to complete in one clock cycle.
The clock period should be long enough for the slowest pipeline stage to complete.
Faster stages can only wait for the slowest one to complete.
Since main memory is very slow compared to execution, if each instruction had to be fetched from main memory the pipeline would be almost useless.
Fortunately, we have cache.
Pipeline Performance
The potential increase in performance resulting from
pipelining is proportional to the number of pipeline
stages.
However, this increase would be achieved only if all
pipeline stages require the same time to complete,
and there is no interruption throughout program
execution.
Unfortunately, this is not true.
Pipeline Performance
[Figure: clock cycles 1–9 for instructions I1–I5 in the 4-stage pipeline; the execute step E2 of instruction I2 takes three cycles, so the following instructions are held up and complete their remaining stages two cycles late.]
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Pipeline Performance
 The previous pipeline is said to have been stalled for two clock
cycles.
 Any condition that causes a pipeline to stall is called a hazard.
 Data hazard – any condition in which either the source or the
destination operands of an instruction are not available at the time
expected in the pipeline. So some operation has to be delayed, and
the pipeline stalls.
 Instruction (control) hazard – a delay in the availability of an
instruction causes the pipeline to stall.
 Structural hazard – the situation when two instructions require the
use of a given hardware resource at the same time.
Data Hazards
 We must ensure that the results obtained when instructions are
executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially.
 Hazard occurs
A←3+A
B←4×A
 No hazard
A←5×C
B ← 20 + C
 When two operations depend on each other, they must be
executed sequentially in the correct order.
 Another example:
Mul R2, R3, R4
Add R5, R4, R6
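The Mul/Add dependency above is a read-after-write (RAW) relation: the second instruction reads a register the first one writes. A minimal check (a sketch; the tuple encoding and function name are my own, with the destination register last, matching the slides' convention that Mul R2, R3, R4 computes R4 ← R2 × R3):

```python
# An instruction is encoded as (op, src1, src2, dest).
def raw_hazard(writer, reader):
    """True if `reader` reads the register that `writer` writes."""
    return writer[3] in (reader[1], reader[2])

mul = ("Mul", "R2", "R3", "R4")  # R4 <- R2 * R3
add = ("Add", "R5", "R4", "R6")  # R6 <- R5 + R4, reads R4
print(raw_hazard(mul, add))  # True
```

A real pipeline's interlock logic performs this comparison in hardware between the decode stage and the later stages.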
Data Hazards
[Figure: clock cycles 1–9; I1 (Mul) proceeds F1 D1 E1 W1 normally, while I2 (Add), which needs the Mul result, stalls in decode (D2, then D2A) until W1 completes; I3 and I4 are delayed accordingly.]
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Overview
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls. Causes include:
Cache miss
Branch
Unconditional Branches
[Figure: clock cycles 1–6; I1 completes F1, E1; I2 is a branch (F2, E2). I3 is fetched (F3) but discarded (X) when the branch is taken, leaving the execution unit idle for one cycle before the branch target Ik and then Ik+1 are fetched and executed.]
Figure 8.8. An idle cycle caused by a branch instruction.
