UNIT-5: Pipeline and Vector Processing
Introduction to pipelining and pipeline hazards
Design issues of pipeline architecture
Instruction-level parallelism and advanced issues
Parallel processing concepts
Vector processing
Array processors
CISC, RISC, VLIW
PARALLEL PROCESSING
1. It is used to provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
EX: While one instruction is being executed in the ALU, the next instruction can be read from memory.
2. Hardware complexity and cost increase.
[Figure: a processor register file distributes data to multiple functional units operating in parallel, each with a path to memory: adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, and floating-point divide.]
PARALLEL COMPUTERS
Architectural Classification
Flynn's classification is based on the multiplicity of instruction streams and data streams.
Instruction stream: the sequence of instructions read from memory.
Data stream: the operations performed on the data in the processor.

                                Single Data Stream   Multiple Data Streams
Single Instruction Stream             SISD                  SIMD
Multiple Instruction Streams          MISD                  MIMD
SISD
Instructions are executed sequentially; the system may or may not have internal parallel-processing capabilities.
SIMD
Many processing units operate under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data.
Pipeline processing: arithmetic suboperations or the phases of the computer instruction cycle overlap in execution.
Vector processing: deals with computations involving large vectors and matrices.
Array processing: performs computations on large arrays of data.
PIPELINING
A technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from the computation in each segment is transferred to the next segment in the pipeline, so the computations overlap. In each segment, a register holds the data and a combinational circuit performs the suboperation.
WHAT IS PIPELINING??
Pipelining is an implementation technique in which multiple instructions are overlapped in execution to make CPUs faster. It exploits parallelism among the instructions in a sequential instruction stream.
THE METHODOLOGY
In a pipeline, each step is called a pipe stage (or pipe segment) and completes a part of an instruction. The stages are connected to one another to form a pipe. Instructions enter at one end, progress through each stage, and exit at the other end.
PIPELINING
Example: evaluate Ai * Bi + Ci for i = 1, 2, 3, ..., 7 with a three-segment pipeline. Ai, Bi, and Ci are read from memory into the input registers.

Segment 1: R1 <- Ai, R2 <- Bi        (input registers)
Segment 2: R3 <- R1 * R2, R4 <- Ci   (multiplier)
Segment 3: R5 <- R3 + R4             (adder)
Successive contents of the registers as the items move through the pipeline:

R3: A1*B1   A2*B2   A3*B3   A4*B4   A5*B5   A6*B6   A7*B7
R4: C1      C2      C3      C4      C5      C6      C7
R5: A1*B1+C1  A2*B2+C2  A3*B3+C3  A4*B4+C4  A5*B5+C5  A6*B6+C6  A7*B7+C7
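The flow of operands through the three segments can be sketched in code. This is an illustrative simulation (the values of A, B, and C are made up), modeling each clock pulse as one loop iteration in which the segments update in reverse order so that every register transfer uses the previous clock's values:

```python
# Hypothetical example data for a 7-item pipeline run
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 10, 10, 10, 10, 10, 10]

def pipeline(A, B, C):
    """Simulate the 3-segment pipeline computing A[i]*B[i] + C[i]."""
    n = len(A)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    # k + n - 1 = 3 + 7 - 1 = 9 clock pulses empty the pipeline
    for clock in range(n + 2):
        # Segment 3 (adder): R5 <- R3 + R4, using last clock's R3, R4
        if r3 is not None:
            r5 = r3 + r4
            results.append(r5)
        # Segment 2 (multiplier): R3 <- R1 * R2, R4 <- Ci
        if r1 is not None:
            r3, r4 = r1 * r2, C[clock - 1]
        else:
            r3 = None
        # Segment 1 (input registers): R1 <- Ai, R2 <- Bi
        if clock < n:
            r1, r2 = A[clock], B[clock]
        else:
            r1 = None
    return results

print(pipeline(A, B, C))  # one result per clock once the pipe is full
```

After the first two clocks fill the pipe, one result emerges per clock pulse, exactly as in the register-contents table above.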
GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
[Figure: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4, where each Si is a combinational segment and each Ri a register; a common clock drives all the registers.]
Space-Time Diagram

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:   T1  T2  T3  T4  T5  T6
Segment 2:       T1  T2  T3  T4  T5  T6
Segment 3:           T1  T2  T3  T4  T5  T6
Segment 4:               T1  T2  T3  T4  T5  T6
PIPELINE SPEEDUP
n: Number of tasks to be performed
Conventional (non-pipelined) machine
  tn: clock cycle
  t1: time required to complete the n tasks
  t1 = n * tn

Pipelined machine (k stages)
  tp: clock cycle (time to complete each suboperation)
  tk: time required to complete the n tasks
  tk = (k + n - 1) * tp

Speedup
  Sk = n * tn / ((k + n - 1) * tp)
  lim (n -> infinity) Sk = tn / tp    ( = k, if tn = k * tp )
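Plugging assumed example numbers (k = 4 segments, tp = 20 ns, tn = k * tp = 80 ns; these values are illustrative, not from the source) into the speedup formula shows Sk approaching k as n grows:

```python
def speedup(n, k, tp, tn):
    """Sk = (n * tn) / ((k + n - 1) * tp)."""
    t_nonpipelined = n * tn             # t1 = n * tn
    t_pipelined = (k + n - 1) * tp      # tk = (k + n - 1) * tp
    return t_nonpipelined / t_pipelined

k, tp = 4, 20          # assumed: 4 stages, 20 ns per stage
tn = k * tp            # assume a non-pipelined task takes k * tp

for n in (1, 10, 100, 1000):
    print(n, round(speedup(n, k, tp, tn), 2))
# As n grows, Sk approaches tn / tp = k = 4
```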
[Figure: as an alternative to pipelining, four identical units P1, P2, P3, P4 operate on separate tasks in parallel.]
Disadvantages of pipelining
Different segments may take different times to complete their suboperations. The clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time, which forces all other segments to waste time waiting for the next clock. Two areas of computer design where pipeline organization is applicable:
1. Arithmetic pipeline: divides an arithmetic operation into suboperations for execution in the pipeline segments.
2. Instruction pipeline: operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.
PIPELINE HAZARDS
WHAT ARE PIPELINE HAZARDS? Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. They reduce the performance from the ideal speedup gained by pipelining.
CLASSIFICATION OF HAZARDS
Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
CLASSIFICATION OF HAZARDS
Control Hazards : arise from the pipelining of branches and other instructions that change the PC
STRUCTURAL HAZARDS
For a system to be free from structural hazards, pipelining of functional units and duplication of resources are necessary to allow all possible combinations of instructions in the pipeline. Structural hazards arise for the following reasons:
STRUCTURAL HAZARDS
When a functional unit is not fully pipelined, the sequence of instructions using that unit cannot proceed at the rate of one per clock cycle. When a resource is not duplicated enough to allow all possible combinations of instructions. Ex: a machine may have only one register-file write port but may want to perform two writes during the same clock cycle.
STRUCTURAL HAZARDS
Consider a machine with a single shared memory for data and instructions. An instruction containing a data-memory reference will conflict with the instruction fetch of a later instruction. This is resolved by stalling the pipeline for one clock cycle when the data-memory access occurs.
DATA HAZARDS
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing the instructions on an unpipelined machine.
WAW (write after write): instruction j tries to write an operand before it is written by an earlier instruction i. The writes are thus performed in the wrong order, leaving the value written by i as the final value. This hazard is present only in pipelines that write in more than one pipe stage. In DLX this is not a hazard, as DLX writes only in the WB stage.
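The read/write-ordering idea can be illustrated with a small sketch. The instruction representation here is hypothetical (a destination register plus a set of source registers, not any real ISA encoding); the function classifies the dependency between an earlier instruction i and a later instruction j:

```python
def hazards(i, j):
    """Classify data hazards between earlier instruction i and later j.

    Each instruction is a hypothetical (dest_register, source_registers) pair.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = []
    if i_dest in j_srcs:
        found.append("RAW")   # j reads an operand before i writes it
    if j_dest in i_srcs:
        found.append("WAR")   # j writes an operand before i reads it
    if i_dest == j_dest:
        found.append("WAW")   # j writes an operand before i writes it
    return found

# ADD R1, R2, R3 followed by SUB R4, R1, R5 -> RAW dependence on R1
print(hazards(("R1", {"R2", "R3"}), ("R4", {"R1", "R5"})))  # ['RAW']
```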
CONTROL HAZARDS
Control hazards cause a greater performance loss than data hazards. The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected in the ID stage, until the MEM stage, where the new PC is finally determined.
CONTROL HAZARDS
Each branch then causes a three-cycle stall in the DLX pipeline, a significant loss given that around 30% of the instructions executed are branches.
The branch delay is reduced by testing the branch condition in the ID stage and computing the destination address there with a separate adder. The branch then costs only one clock cycle.
Causes of exceptions and interrupts include integer arithmetic overflow/underflow, power failure, hardware malfunctions, and I/O device requests.
Arithmetic pipeline
Arithmetic pipelines are usually found in high-speed computers. They are used to implement floating-point operations and multiplication of fixed-point numbers. EX: floating-point addition and subtraction, where A and B are fractions representing the mantissas and a and b are the exponents. The suboperations performed in the four segments are:
[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result
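The four suboperations can be sketched in code. This is a minimal illustration using decimal (mantissa, exponent) pairs rather than real hardware number formats, with the classic textbook operands 0.9504 x 10^3 and 0.8200 x 10^2:

```python
def fp_add(A, a, B, b):
    """Four-segment floating-point addition on (mantissa, exponent) pairs."""
    # Segment 1: compare the exponents (by subtraction)
    diff = a - b
    # Segment 2: choose the larger exponent; align the other mantissa
    if diff >= 0:
        exp, B = a, B / (10 ** diff)
    else:
        exp, A = b, A / (10 ** -diff)
    # Segment 3: add the mantissas
    m = A + B
    # Segment 4: normalize the result (mantissa fraction in [0.1, 1))
    while abs(m) >= 1:
        m, exp = m / 10, exp + 1
    while 0 < abs(m) < 0.1:
        m, exp = m * 10, exp - 1
    return m, exp

# 0.9504 x 10^3 + 0.8200 x 10^2 = 1.0324 x 10^3, normalized to 0.10324 x 10^4
print(fp_add(0.9504, 3, 0.8200, 2))
```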
ARITHMETIC PIPELINE

[Figure: four-segment floating-point adder-subtractor pipeline operating on X = A x 2^a and Y = B x 2^b, with registers (R) separating the segments.
Segment 1: compare the exponents a and b by subtraction.
Segment 2: choose the larger exponent and align the mantissa of the smaller number.
Segment 3: add or subtract the mantissas.
Segment 4: normalize the result and adjust the exponent.]
Instruction pipeline
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.
INSTRUCTION CYCLE
Six phases* in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
* Some instructions skip some phases
* Effective address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase

==> 4-Stage Pipeline
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
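The overlapped FI/DA/FO/EX timing can be generated with a short sketch (purely illustrative; branches and stalls are not modeled here):

```python
# Stage names of the 4-stage instruction pipeline described above
STAGES = ["FI", "DA", "FO", "EX"]

def space_time(n_instructions):
    """Return one text row per instruction, one column per clock cycle."""
    rows = []
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = ["  "] * total_cycles        # blank slots
        for s, name in enumerate(STAGES):
            row[i + s] = name              # instruction i occupies stage s at cycle i+s
        rows.append(" ".join(row))
    return rows

for line in space_time(4):
    print(line)
```

Each instruction enters one cycle after its predecessor, so after the first three cycles one instruction completes per clock.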
[Flowchart: Segment 1 fetches the instruction from memory; Segment 2 decodes the instruction and calculates the effective address. If the instruction is a branch, the PC is updated and the pipe is emptied; otherwise Segment 3 fetches the operand from memory and the instruction is executed. Interrupts are checked before the PC is updated.]
Timing of the instruction pipeline (instruction 3 is a branch):

Step:            1   2   3   4   5   6   7   8   9  10  11  12
Instruction 1:   FI  DA  FO  EX
            2:       FI  DA  FO  EX
   (Branch) 3:           FI  DA  FO  EX
            4:               FI  --  --  FI  DA  FO  EX
            5:                           --  --  FI  DA  FO  EX

Instruction 4 is fetched at step 4, but the segments sit idle (--) until the branch target is known at step 6; fetching resumes at step 7.
Major difficulties
Resource conflicts: caused by access to memory by two segments at the same time. These can be resolved by using separate instruction and data memories.
Data dependency conflicts: an instruction depends on the result of a previous instruction, but this result is not yet available.
Branch difficulties: arise from branch and other instructions that change the value of the PC.
Data dependency: a data dependency occurs when an instruction needs data that are not yet available. Solutions:
1. Hardware interlocks: a circuit detects instructions whose source operands are destinations of instructions farther up in the pipeline; such instructions are delayed by enough clock cycles to resolve the conflict.
2. Operand forwarding: special hardware detects a conflict and avoids it by routing the data through a special path between pipeline segments.
3. Delayed load: the compiler reorders the instructions as necessary to delay the loading of conflicting data by inserting no-operation instructions.
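Operand forwarding can be illustrated with a toy sketch (the data structures here are hypothetical, for illustration only): if the register about to be read is the destination of a result still in the pipeline, take the value from the forwarding path instead of the register file:

```python
def read_operand(reg, regfile, forward_from=None):
    """Read a register, preferring a forwarded in-flight result.

    forward_from is a hypothetical (dest_register, value) pair produced by
    the previous instruction that has not yet been written back.
    """
    if forward_from is not None and forward_from[0] == reg:
        return forward_from[1]        # special forwarding path
    return regfile[reg]               # normal register-file read

regfile = {"R1": 0, "R2": 5, "R3": 7}
# ADD R1, R2, R3 just produced R1 = 12 but has not written it back yet:
print(read_operand("R1", regfile, forward_from=("R1", 12)))  # 12, not the stale 0
```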
Branch Prediction
Guess the branch outcome and fetch an instruction stream based on the guess; a correct guess eliminates the branch penalty.
Delayed Branch
The compiler detects the branch and rearranges the instruction sequence by inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.
RISC pipeline
RISC: a machine with a very fast clock cycle that executes at the rate of one instruction per cycle, enabled by:
- Simple instruction set
- Fixed-length instruction format
- Register-to-register operations
Instruction Cycles of Three-Stage Instruction Pipeline
Data manipulation instructions:
  I: Instruction fetch
  A: Decode, read registers, ALU operation
  E: Write a register
Load and store instructions:
  I: Instruction fetch
  A: Decode, evaluate effective address
  E: Register-to-memory or memory-to-register
Program control instructions:
  I: Instruction fetch
  A: Decode, evaluate branch address
  E: Write register (PC)
DELAYED LOAD
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3

Pipeline timing with data conflict:

Clock cycle:  1  2  3  4  5  6
Load R1:      I  A  E
Load R2:         I  A  E
Add R1+R2:          I  A  E
Store R3:              I  A  E

Pipeline timing with delayed load:

Clock cycle:  1  2  3  4  5  6  7
Load R1:      I  A  E
Load R2:         I  A  E
NOP:                I  A  E
Add R1+R2:             I  A  E
Store R3:                 I  A  E

Advantage
The data dependency is taken care of by the compiler rather than by the hardware.
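The compiler's no-op insertion can be sketched as follows (the tuple instruction format is an assumption made for illustration): insert a NOP after any load whose destination register is used by the immediately following instruction:

```python
def insert_delay_slots(program):
    """Insert a NOP after a LOAD whose result the next instruction needs.

    Instructions are modeled as hypothetical (op, dest, sources) tuples.
    """
    fixed = []
    for idx, inst in enumerate(program):
        fixed.append(inst)
        op, dest, _ = inst
        if op == "LOAD" and idx + 1 < len(program):
            _, _, next_srcs = program[idx + 1]
            if dest in next_srcs:
                fixed.append(("NOP", None, ()))   # fill the load delay slot
    return fixed

program = [
    ("LOAD",  "R1", ()),            # R1 <- M[address 1]
    ("LOAD",  "R2", ()),            # R2 <- M[address 2]
    ("ADD",   "R3", ("R1", "R2")),  # R3 <- R1 + R2
    ("STORE", None, ("R3",)),       # M[address 3] <- R3
]
for inst in insert_delay_slots(program):
    print(inst)
# A NOP is inserted between the second LOAD and the ADD, as in the timing above
```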
DELAYED BRANCH
The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay slots; if none are available, it inserts no-operation instructions.
Clock cycles:      1  2  3  4  5  6  7  8  9  10
1. Load:           I  A  E
2. Increment:         I  A  E
3. Add:                  I  A  E
4. Subtract:                I  A  E
5. Branch to X:                I  A  E
6. NOP:                           I  A  E
7. NOP:                              I  A  E
8. Instr. in X:                         I  A  E
There are simple and complex instructions in a CISC architecture. One approach: adapt the RISC pipeline. Execute the simple, frequently used CISC instructions as in RISC; for the more complex instructions, use microinstructions:
- A sequence of microinstructions is stored in ROM for each complex CISC instruction.
- Complex instructions often involve multiple microoperations or memory accesses in sequence.
- When a complex instruction is decoded in the DOF stage of the pipeline, the microcode address and control are given to the microcode counter. Microinstructions are executed until the instruction is completed. Each microoperation is simply a set of control input signals.
Example: a certain CISC instruction I has its microcode stored in the microcode ROM at address A. When I is decoded in the main pipeline, the main pipeline is stalled and control is given to the microcode control (MC). At each subsequent clock cycle, MC increments and executes the next microoperation (a control word that controls the datapath). The last microoperation in the sequence gives control back to the main pipeline and un-stalls it.
CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations.
EX: MULT 2:3, 5:2
For this task a CISC processor comes prepared with a specific instruction. It loads the two values into separate registers, multiplies the operands in the execution unit, and stores the product in the appropriate register. The entire task of multiplying two numbers is completed with one instruction: MULT, a complex instruction. CISC minimizes the number of instructions per program at the cost of the number of cycles per instruction.
VECTOR PROCESSING
Vector Processing Applications — problems that can be efficiently formulated in terms of vectors:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Image processing

Vector processor (computer): able to process vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers. Vector processors may also be pipelined.
VECTOR PROGRAMMING
      DO 20 I = 1, 100
20    C(I) = B(I) + A(I)

Conventional computer:
      Initialize I = 0
20    Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 goto 20

Vector computer:
      C(1:100) = A(1:100) + B(1:100)
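The same loop in plain Python (the data values are arbitrary examples), contrasting the conventional element-by-element form with a whole-array expression analogous to the single vector instruction:

```python
# Hypothetical 100-element operands
A = list(range(1, 101))
B = list(range(100, 0, -1))

# Conventional computer: explicit fetch-add-store loop, one element per pass
C_loop = []
for i in range(100):
    C_loop.append(A[i] + B[i])

# Vector style: one whole-array operation, analogous to C(1:100) = A + B
C_vec = [a + b for a, b in zip(A, B)]

print(C_loop == C_vec, C_vec[0])  # True 101
```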
VECTOR INSTRUCTIONS
Four types of vector instructions (V: vector operand, S: scalar operand):
f1: V -> V
f2: V -> S
f3: V x V -> V
f4: V x S -> V

Mnemonic  Type  Operation (I = 1, ..., n)                      Description
VSQR      f1    B(I) <- SQR(A(I))                              Vector square root
VSIN      f1    B(I) <- sin(A(I))                              Vector sine
VCOM      f1    B(I) <- complement of A(I)                     Vector complement
VSUM      f2    S <- sum of A(I)                               Vector summation
VMAX      f2    S <- max{A(I)}                                 Vector maximum
VADD      f3    C(I) <- A(I) + B(I)                            Vector add
VMPY      f3    C(I) <- A(I) * B(I)                            Vector multiply
VAND      f3    C(I) <- A(I) AND B(I)                          Vector AND
VLAR      f3    C(I) <- max(A(I), B(I))                        Vector larger
VTGE      f3    C(I) <- 0 if A(I) < B(I); 1 if A(I) >= B(I)    Vector test >
SADD      f4    C(I) <- S + A(I)                               Vector-scalar add
SDIV      f4    C(I) <- A(I) / S                               Vector-scalar divide
[Figure: two operand sources (Source A and Source B) feed a multiplier pipeline; its products feed an adder pipeline.]
Vector Processing
M0
AR
M1
AR
M2
AR
M3
AR
Memory array
Memory array
Memory array
Memory array
DR
DR
DR
DR
Data bus
Address interleaving: different sets of addresses are assigned to different memory modules.
Pipeline and vector processors often require simultaneous access to memory from two or more sources. An instruction pipeline may require fetching an instruction and an operand at the same time from two different segments. Memory can be partitioned into a number of modules connected to common memory address and data buses. A memory module is a memory array together with its own address and data registers. One module can initiate a memory access while other modules are in the process of reading or writing, and each module can honor a memory request independently of the state of the other modules.
In an interleaved memory, different sets of addresses are assigned to different memory modules. A vector processor that uses n-way interleaved memory can fetch n operands from n different modules. Example: in a two-module memory system, the even addresses may be in one module and the odd addresses in the other. A CPU with an instruction pipeline can take advantage of multiple memory modules so that each segment in the pipeline can access memory independently of memory accesses from the other segments.
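The n-way interleaved mapping can be sketched as follows (this assumes low-order interleaving, the scheme the two-module even/odd example describes): the module number is the address modulo n, so consecutive addresses fall in different modules and can be accessed in parallel:

```python
def module_of(address, n_modules):
    """Low-order interleaving: consecutive addresses hit different modules."""
    return address % n_modules

def local_address(address, n_modules):
    """Word position within the selected module."""
    return address // n_modules

n = 4
for addr in range(8):
    print(addr, "-> module", module_of(addr, n), "word", local_address(addr, n))
```

With n = 2 this reproduces the even/odd split: even addresses map to module 0, odd addresses to module 1.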
Array processors
An array processor is a processor that performs computations on large arrays of data.
Attached array processor: an auxiliary processor attached to a general-purpose computer, intended to enhance the performance of the computer by providing vector processing for complex scientific applications.
SIMD array processor: a processor with a single-instruction, multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.
[Figure: attached array processor — the general-purpose computer connects through an input-output interface to the attached array processor; a high-speed memory-to-memory bus links the computer's main memory with the array processor's local memory.]

[Figure: SIMD array processor — a master control unit and main memory drive identical processing elements, each with its own local memory (M1 ... Mn).]
An SIMD array processor is a computer with multiple processing units operating in parallel. It consists of a set of identical processing elements (PEs), each having a local memory. Each processing element includes an ALU, a floating-point arithmetic unit, and working registers. A master control unit controls the operations in the processing elements; its function is to decode the instructions and determine how they are to be executed. Main memory is used for storage of the program. Vector instructions are broadcast to all PEs simultaneously, and vector operands are distributed to the local memories prior to the parallel execution of the instruction.