Unit 6


Overview

Pipelining is widely used in modern processors.
Pipelining improves system performance in terms of throughput.
Pipelined organization requires sophisticated compilation techniques.
Making the Execution of Programs Faster
Use faster circuit technology to build the processor and the main memory.
Arrange the hardware so that more than one operation can be performed at the same time.
With the latter approach, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.
Traditional Concept
Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Traditional Concept
[Figure: sequential laundry timeline, 6 PM to midnight — loads A–D each take 30 + 40 + 20 minutes, one load at a time.]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
[Figure: pipelined laundry timeline starting at 6 PM — stage intervals 30, 40, 40, 40, 40, 20 minutes as loads A–D overlap in washer, dryer, and folder.]
Pipelined laundry takes 3.5 hours for 4 loads.
Traditional Pipeline Concept
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
Pipeline rate is limited by the slowest pipeline stage.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipe stages.
Unbalanced lengths of pipe stages reduce speedup.
Time to “fill” the pipeline and time to “drain” it reduce speedup.
Stall for dependences.
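The laundry timings above can be checked with a short sketch (a minimal model; the variable names are my own, and it assumes the 40-minute dryer is the bottleneck stage, matching the slide's timeline):

```python
# Stage times in minutes for one load: wash, dry, fold.
STAGES = [30, 40, 20]
LOADS = 4

# Sequential: every load runs all stages before the next load starts.
sequential = LOADS * sum(STAGES)  # 4 * 90 = 360 min = 6 hours

# Pipelined: the first load fills the pipeline; after that, a new load
# finishes every "slowest stage" interval (the 40-minute dryer).
bottleneck = max(STAGES)
pipelined = sum(STAGES) + (LOADS - 1) * bottleneck  # 90 + 3*40 = 210 min = 3.5 hours

print(sequential, pipelined)  # 360 210
```

This reproduces the 6-hour sequential and 3.5-hour pipelined totals shown in the diagrams.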
Use the Idea of Pipelining in a Computer: Fetch + Execution
[Figure: (a) sequential execution — instructions I1, I2, I3 each complete fetch (F) and execute (E) before the next begins; (b) hardware organization — an instruction fetch unit and an execution unit connected by interstage buffer B1; (c) pipelined execution — over clock cycles 1–4, F2 overlaps E1 and F3 overlaps E2.]
Figure 8.1. Basic idea of instruction pipelining.
Use the Idea of Pipelining in a Computer: Fetch + Decode + Execute + Write
[Figure: (a) instruction execution divided into four steps — over clock cycles 1–7, instructions I1–I4 pass through F, D, E, W one cycle apart; (b) hardware organization — four stages separated by interstage buffers B1, B2, B3.]
F: fetch instruction
D: decode instruction and fetch operands
E: execute operation
W: write results
Figure 8.2. A 4-stage pipeline.


Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. Given latch delay is 10 ns, calculate:
1. Pipeline cycle time
2. Non-pipeline execution time
3. Speed up ratio
4. Pipeline time for 1000 tasks
5. Sequential time for 1000 tasks
6. Throughput
Given-
Four stage pipeline is used
Delay of stages = 60, 50, 90 and 80 ns
Latch delay (delay due to each register) = 10 ns

Part-01: Pipeline Cycle Time-

Cycle time
= Maximum delay due to any stage + Delay due to its
register
= Max { 60, 50, 90, 80 } + 10 ns
= 90 ns + 10 ns
= 100 ns
Part-02: Non-Pipeline Execution Time-

Non-pipeline execution time for one instruction


= 60 ns + 50 ns + 90 ns + 80 ns
= 280 ns

Part-03: Speed Up Ratio-

Speed up
= Non-pipeline execution time / Pipeline execution time
= 280 ns / Cycle time
= 280 ns / 100 ns
= 2.8
Part-04: Pipeline Time For 1000 Tasks-

Pipeline time for 1000 tasks


= Time taken for 1st task + Time taken for remaining 999
tasks
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 100 ns + 999 x 100 ns
= 400 ns + 99900 ns
= 100300 ns
Part-05: Sequential Time For 1000 Tasks-

Non-pipeline time for 1000 tasks


= 1000 x Time taken for one task
= 1000 x 280 ns
= 280000 ns

Part-06: Throughput-

Throughput for pipelined execution


= Number of instructions executed per unit time
= 1000 tasks / 100300 ns
≈ 0.00997 tasks per ns (about 9.97 million tasks per second)
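The formulas in Parts 01–06 can be collected into a small helper (a sketch; the function name and structure are my own):

```python
def pipeline_stats(stage_delays, latch_delay, n_tasks):
    """Compute basic pipeline timing figures (all times in ns)."""
    cycle = max(stage_delays) + latch_delay  # slowest stage + register delay
    non_pipe_one = sum(stage_delays)         # one task without pipelining
    k = len(stage_delays)                    # number of stages
    pipe_total = (k + n_tasks - 1) * cycle   # k cycles to fill, then 1/cycle
    seq_total = n_tasks * non_pipe_one
    speedup = non_pipe_one / cycle
    throughput = n_tasks / pipe_total        # tasks per ns
    return cycle, pipe_total, seq_total, speedup, throughput

cycle, pipe_total, seq_total, speedup, tput = pipeline_stats([60, 50, 90, 80], 10, 1000)
print(cycle, pipe_total, seq_total, round(speedup, 2))  # 100 100300 280000 2.8
```

Note that the speedup here (2.8) uses the per-instruction times; for very large task counts the speedup of total execution time approaches 280000 / 100300 ≈ 2.79.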
A four stage pipeline has stage delays of 150, 120, 160 and 140 ns respectively. Registers are used between the stages and have a delay of 5 ns each. Assuming a constant clocking rate, the total time taken to process 1000 data items on the pipeline will be:
1. 120.4 microseconds
2. 160.5 microseconds
3. 165.5 microseconds
4. 590.0 microseconds
Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and an average of 4 cycles per instruction. The same processor is upgraded to a pipelined processor with five stages, but due to the internal pipeline delay the clock speed is reduced to 2 gigahertz. Assume there are no stalls in the pipeline. The speedup achieved by this pipelined processor is:
1. 3.2
2. 3.0
3. 2.2
4. 2.0
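The standard way to evaluate this kind of question is to compare time per instruction, CPI / clock rate, before and after. A short sketch (assuming, as is usual for these exercises, that an ideal stall-free pipeline retires one instruction per cycle, i.e. effective CPI = 1):

```python
# Speedup of a pipelined design over a non-pipelined one, comparing
# time per instruction = CPI / clock_rate (rates in GHz, so times in ns).
def speedup(cpi_old, ghz_old, cpi_new, ghz_new):
    return (cpi_old / ghz_old) / (cpi_new / ghz_new)

# Non-pipelined: 4 cycles at 2.5 GHz = 1.6 ns per instruction.
# Pipelined:     1 cycle  at 2.0 GHz = 0.5 ns per instruction.
print(speedup(4, 2.5, 1, 2.0))  # 3.2
```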
We have 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5-stage pipeline with stage execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages each with 2 ns execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?

Cycle Time in Design D1-

Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 3, 2, 4, 2, 3 } + 0
= 4 ns
Execution Time For 100 Instructions in Design D1-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99
instructions
= 1 x 5 clock cycles + 99 x 1 clock cycle
= 5 x cycle time + 99 x cycle time
= 5 x 4 ns + 99 x 4 ns
= 20 ns + 396 ns
= 416 ns
Cycle Time in Design D2-
Cycle time
= Delay due to a stage + Delay due to its register
= 2 ns + 0
= 2 ns
Execution Time For 100 Instructions in Design D2-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99
instructions
= 1 x 8 clock cycles + 99 x 1 clock cycle
= 8 x cycle time + 99 x cycle time
= 8 x 2 ns + 99 x 2 ns
= 16 ns + 198 ns
= 214 ns
Time Saved-
Time saved
= Execution time in design D1 – Execution time in design D2
= 416 ns – 214 ns
= 202 ns
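The same "fill the pipeline, then one result per cycle" formula covers both designs; a short sketch (function name my own, register delay taken as 0 as in the solution above):

```python
# Time to execute n instructions on a k-stage pipeline: the first
# instruction takes k cycles, each remaining instruction one more cycle.
def pipeline_time(n, stage_delays, register_delay=0):
    cycle = max(stage_delays) + register_delay
    k = len(stage_delays)
    return (k + n - 1) * cycle

d1 = pipeline_time(100, [3, 2, 4, 2, 3])  # (5 + 99) * 4 ns = 416 ns
d2 = pipeline_time(100, [2] * 8)          # (8 + 99) * 2 ns = 214 ns
print(d1 - d2)  # 202
```

The comparison also shows the trade-off: D2's deeper pipeline wins here because its shorter cycle time outweighs its longer fill time.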
Role of Cache Memory
Each pipeline stage is expected to complete in one clock cycle.
The clock period should be long enough for the slowest pipeline stage to complete.
Faster stages can only wait for the slowest one to complete.
Since main memory is very slow compared to execution, if each instruction had to be fetched from main memory the pipeline would be almost useless.
Fortunately, we have cache.
Pipeline Performance
The potential increase in performance resulting from
pipelining is proportional to the number of pipeline
stages.
However, this increase would be achieved only if all
pipeline stages require the same time to complete,
and there is no interruption throughout program
execution.
Unfortunately, this is not true.
Pipeline Performance
[Figure: clock cycles 1–9 for instructions I1–I5 in the 4-stage pipeline; the execute step E2 of instruction I2 takes three cycles, so the following instructions are held up and complete their remaining stages two cycles late.]
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Pipeline Performance
 The previous pipeline is said to have been stalled for two clock
cycles.
 Any condition that causes a pipeline to stall is called a hazard.
 Data hazard – any condition in which either the source or the
destination operands of an instruction are not available at the time
expected in the pipeline. So some operation has to be delayed, and
the pipeline stalls.
 Instruction (control) hazard – a delay in the availability of an
instruction causes the pipeline to stall.
 Structural hazard – the situation when two instructions require the
use of a given hardware resource at the same time.
Data Hazards
 We must ensure that the results obtained when instructions are
executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially.
 Hazard occurs
A←3+A
B←4×A
 No hazard
A←5×C
B ← 20 + C
 When two operations depend on each other, they must be
executed sequentially in the correct order.
 Another example:
Mul R2, R3, R4
Add R5, R4, R6
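The Mul/Add dependency above is a read-after-write (RAW) relation: the second instruction reads a register the first one writes. A minimal check (a sketch; the tuple encoding and function name are my own, with the destination register last, matching the slides' convention that Mul R2, R3, R4 computes R4 ← R2 × R3):

```python
# An instruction is encoded as (op, src1, src2, dest).
def raw_hazard(writer, reader):
    """True if `reader` reads the register that `writer` writes."""
    return writer[3] in (reader[1], reader[2])

mul = ("Mul", "R2", "R3", "R4")  # R4 <- R2 * R3
add = ("Add", "R5", "R4", "R6")  # R6 <- R5 + R4, reads R4
print(raw_hazard(mul, add))  # True
```

A real pipeline's interlock logic performs this comparison in hardware between the decode stage and the later stages.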
Data Hazards
[Figure: clock cycles 1–9; I1 (Mul) proceeds F1 D1 E1 W1 normally, while I2 (Add), which needs the Mul result, stalls in decode (D2, then D2A) until W1 completes; I3 and I4 are delayed accordingly.]
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Overview
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls. Causes include:
Cache miss
Branch
Unconditional Branches
[Figure: clock cycles 1–6; I1 completes F1, E1; I2 is a branch (F2, E2). I3 is fetched (F3) but discarded (X) when the branch is taken, leaving the execution unit idle for one cycle before the branch target Ik and then Ik+1 are fetched and executed.]
Figure 8.8. An idle cycle caused by a branch instruction.
