Unit 6
Pipeline and vector processing
Contents
• Parallel Processing
• General Considerations for Pipelining
• Pipelining
• Arithmetic Pipeline
• Instruction level Pipelining
• Pipeline Hazards and their Solution
• Array and Vector Processing
• RISC Pipelining
Parallel Processing
• A technique used to provide simultaneous data-processing tasks for the purpose of
increasing the computational speed of a computer system.
• The system may have two or more ALUs and be able to execute two or
more instructions at the same time.
• The purpose of parallel processing is to speed up the computer's processing
capability and increase its throughput (the amount of processing that can be
completed during a given interval of time).
Flynn’s Classification of Computers
• There are different ways to classify parallel computers.
• One of the more widely used classifications, in use since 1966, is Flynn's
classification.
• Flynn's classification distinguishes multi-processor computer
architectures according to the two independent dimensions of
Instruction and Data.
• The sequence of instructions read from memory constitutes an
instruction stream.
• The operations performed on the data in the processor constitute a
data stream.
SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data
1. Single Instruction, Single Data (SISD):
• A serial (non-parallel) computer.
• Single Instruction: Only one instruction stream is being acted on by the CPU
during any one clock cycle.
• Single Data: Only one data stream is being used as input during any one clock
cycle.
• Single instruction is performed on a single set of data in a sequential form.
• Deterministic execution.
• This is the oldest and even today, the most common type of computer.
• Examples: Older generation mainframes, minicomputers and workstations;
most modern day PCs.
2. Single Instruction, Multiple Data (SIMD):
• A type of parallel computer.
• Single Instruction: All processing units execute the same instruction at any
given clock cycle.
• Multiple data: Each processing unit can operate on a different data element.
• Best suited for specialized problems characterized by a high degree of regularity,
such as graphics/image processing.
• A single instruction is performed on multiple data. A good example is a "for"
loop over an array: the instruction is the same, but the data stream is different
(see the sketch after the examples below).
• Example: Array processors: Connection Machine CM-2, MasPar
MP-1 & MP-2, ILLIAC IV
• Vector Processors: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP,
NEC SX-2, Hitachi S820, ETA10
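The SIMD idea can be sketched in a few lines by contrasting a scalar loop with a single vectorized operation. The sketch below uses Python with NumPy purely as an illustration; the library choice and the array values are assumptions, not part of these notes:

```python
import numpy as np

# SISD-style view: one instruction acts on one data element per step.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * 4
for i in range(4):        # the "for" loop from the notes: the instruction is the same,
    c[i] = a[i] + b[i]    # but a different data element is used on every iteration

# SIMD-style view: a single vector add is applied to all elements at once.
va, vb = np.array(a), np.array(b)
vc = va + vb              # one instruction, multiple data elements
print(c, vc)
```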
3. Multiple Instruction, Single Data (MISD):
• A type of parallel computer.
• Multiple Instruction: Each processing unit operates on the data independently
via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• N processors work on the same set of data, each executing a different instruction
stream.
• Few actual examples of this class of parallel computer have ever existed. One is
the experimental Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be:
 Multiple frequency filters operating on a single signal stream.
 Multiple cryptography algorithms attempting to crack a single coded message.
4. Multiple Instruction, Multiple Data (MIMD):
• A type of parallel computer.
• Multiple Instruction: Every processor may be executing a different instruction
stream.
• Multiple Data: Every processor may be working with a different data stream.
• N processors interact with one another, each executing its own instruction stream
on its own data stream.
• Execution can be synchronous or asynchronous, deterministic or non-
deterministic.
• Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
• Examples: Most current supercomputers, networked parallel computer clusters
and "grids", multi-processor SMP computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components.
Pipelining
• A technique of decomposing a sequential process into sub operations,
with each sub process being executed in a special dedicated segment
that operates concurrently with all other segments.
• Each segment performs partial processing dictated by the way the task is
partitioned.
• The result obtained from the computation in each segment is
transferred to the next segment in the pipeline.
• The final result is obtained after the data have passed through all
segments.
• It is characteristic of pipelines that several computations can be in
progress in distinct segments at the same time.
• The registers provide isolation between each segment so that each can
operate on distinct data simultaneously.
Suppose we want to perform the combined multiply and add operations with a
stream of numbers. E.g. Ai * Bi + Ci for i = 1 , 2, 3,…….,7
The sub operations performed in each segment of the pipeline are as follows:
R1 ← Ai, R2 ← Bi          Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci     Multiply and input Ci
R5 ← R3 + R4              Add Ci to product
Fig2: Example of pipeline processing
Clock Pulse   Segment 1 (R1, R2)   Segment 2 (R3, R4)   Segment 3 (R5)
    1         A1, B1               -                    -
    2         A2, B2               A1*B1, C1            -
    3         A3, B3               A2*B2, C2            A1*B1+C1
    4         A4, B4               A3*B3, C3            A2*B2+C2
    5         A5, B5               A4*B4, C4            A3*B3+C3
    6         A6, B6               A5*B5, C5            A4*B4+C4
    7         A7, B7               A6*B6, C6            A5*B5+C5
    8         -                    A7*B7, C7            A6*B6+C6
    9         -                    -                    A7*B7+C7
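The segment contents in the table above can be reproduced with a short simulation. This is only an illustrative sketch; the Python register modelling and the sample values of Ai, Bi and Ci are assumptions:

```python
# Sketch of the 3-segment pipeline computing Ai*Bi + Ci for i = 1..7.
A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [5, 5, 5, 5, 5, 5, 5]

R1 = R2 = R3 = R4 = R5 = None           # segment registers, empty at start
results = []

for clock in range(len(A) + 2):         # n tasks need n + (k - 1) clock pulses, k = 3
    # Segment 3: R5 <- R3 + R4 (uses the values clocked in during the previous pulse)
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    else:
        R5 = None
    # Segment 2: R3 <- R1 * R2, R4 <- Ci
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = R4 = None
    # Segment 1: R1 <- Ai, R2 <- Bi
    if clock < len(A):
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

print(results)   # the Segment 3 column of the table, one result per clock pulse
```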
General Considerations for Pipelining
Fig3: General structure of a 4-segment pipeline.
Fig: Space Time Diagram
Speedup Equation
Consider a K-segment pipeline with a clock cycle time tp used to execute n tasks. The first task T1
requires time K*tp to complete. The remaining (n - 1) tasks finish at the rate of one task per clock
cycle and will be completed after time (n - 1)*tp. The total time to complete the n tasks is
K*tp + (n - 1)*tp = (K + n - 1)*tp.
Consider a non–pipeline unit that performs the same operation and takes tn time to complete each
task. The total time required for n tasks would be ntn.
Speedup(S) = (time without pipeline) / (time with pipeline)

S = n*tn / [(K + n - 1)*tp]

As the number of tasks increases, n becomes much larger than K - 1, and (K + n - 1)
approaches n. Then,

S = tn / tp

If we assume that the time to complete a task is the same in both units, tn = K*tp, the
maximum speedup becomes S = K*tp / tp = K, equal to the number of pipeline segments.
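A quick numerical check of the speedup formula, with illustrative (assumed) values of K, tp and n:

```python
# Pipeline speedup S = n*tn / ((K + n - 1) * tp)
def speedup(n, k, tp, tn):
    """n tasks, k segments, tp = pipeline clock period, tn = non-pipelined task time."""
    return (n * tn) / ((k + n - 1) * tp)

k, tp = 4, 20            # 4 segments, 20 ns clock (illustrative numbers only)
tn = k * tp              # assume the non-pipelined unit takes K*tp per task
for n in (1, 10, 100, 1000):
    print(n, round(speedup(n, k, tp, tn), 2))   # approaches K = 4 as n grows
```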
Arithmetic Pipeline
• Usually found in very high speed computers.
• Used to implement floating point operations, multiplication of fixed
point numbers, and similar computations encountered in scientific
problems.
• Floating point addition and subtraction:
• Let us consider two normalized floating-point binary numbers as inputs to the adder
pipeline:
X = A × 2^a
Y = B × 2^b
• Can be performed in four segments, as shown in figure.
Ex:  X = 0.9504 × 10^3
     Y = 0.8200 × 10^2
Segment 1: The exponent difference is 3 - 2 = 1. The larger exponent, 3, is chosen
as the exponent of the result.
Segment 2: Align the mantissa of Y by shifting it right once:
     X = 0.9504 × 10^3
     Y = 0.0820 × 10^3
Segment 3: Add the mantissas: Sum = 0.9504 + 0.0820 = 1.0324, so Z = 1.0324 × 10^3
Segment 4: Normalize the result: Z = 0.10324 × 10^4
Fig4: Pipeline for floating-point addition and subtraction
• The registers labeled R are placed between the segments to store
intermediate results.
• The sub-operations that are performed in the four segments are:
i. Compare the exponents.
ii. Align the mantissas.
iii. Add or subtract the mantissas.
iv. Normalize the result.
• The exponents are compared by subtracting them to determine their
difference.
• The larger exponent is chosen as the exponent of the result.
• The exponent difference determines how many times the mantissa
associated with the smaller exponent must be shifted to the right.
• This produces an alignment of the two mantissas.
• The two mantissas are added or subtracted in segment 3. The result is
normalized in segment 4.
• When an overflow occurs, the mantissa of the sum or difference is
shifted right and the exponent incremented by one.
• If an underflow occurs, the number of leading zeros in the mantissa
determines the number of left shifts in the mantissa and the number
that must be subtracted from the exponent.
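The four sub-operations can be sketched as one function, checked against the decimal example above. The function name and the base-10, fixed-precision handling are assumptions made purely for illustration:

```python
def pipelined_fp_add(a_mant, a_exp, b_mant, b_exp):
    """Add X = a_mant*10^a_exp and Y = b_mant*10^b_exp in four 'segments'."""
    # Segment 1: compare the exponents (by subtracting them).
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)
    # Segment 2: align the mantissa associated with the smaller exponent.
    if diff > 0:
        b_mant /= 10 ** diff
    else:
        a_mant /= 10 ** (-diff)
    # Segment 3: add (or subtract) the mantissas.
    mant = a_mant + b_mant
    # Segment 4: normalize the result (shift right on overflow, left on underflow).
    while abs(mant) >= 1.0:
        mant /= 10
        exp += 1
    while mant != 0 and abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

print(pipelined_fp_add(0.9504, 3, 0.8200, 2))   # ≈ (0.10324, 4)
```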
Instruction pipeline
• Pipeline processing can occur not only in the data stream but in the
instruction as well.
• Generally, an instruction fetch unit and an instruction execution unit are
designed to provide a two-segment pipeline.
• Computers with complex instructions require additional phases beyond these two
to process an instruction completely.
• In the most general case, the computer needs to process each instruction
with the following sequence of steps.
• Fetch the instruction from memory.
• Decode the instruction.
• Calculate the effective address.
• Fetch the operands from memory.
• Execute the instruction.
• Store the result in the proper place.
• Difficulties that will prevent the instruction pipeline from operating at
its maximum rate:
• Different segments may take different times to operate on available data.
• Some segments are skipped for certain operations.
• Two or more segments may require memory access at the same time, causing
one segment to wait until another is finished with the memory.
4-stage Pipeline: For further speedup, the pipeline can be divided into the following
four stages:
a. Fetch instruction (FI)
b. Decode instruction (DI)
c. Fetch operands (FO)
d. Execute instruction (EI)
4-segment instruction pipeline timing (the original figure contrasts conventional,
non-overlapped execution with pipelined execution; the pipelined schedule is shown below):

Clock cycle:    1    2    3    4    5    6
Instr i:        FI   DI   FO   EI
Instr i+1:           FI   DI   FO   EI
Instr i+2:                FI   DI   FO   EI
Fig: Flowchart of 4-segment CPU pipeline
Fig: Space time diagram of instruction pipeline
• Each segment operates on a different instruction at the same time.
• In step 4, instruction 1 is being executed in segment EI; the operand for
instruction 2 is being fetched in segment FO; instruction 3 is being decoded in
segment DI; and instruction 4 is being fetched from memory in segment FI.
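The space-time diagram for the four stages can also be generated programmatically; a minimal sketch (the print layout is an assumption):

```python
# Print a space-time diagram for a 4-stage instruction pipeline (FI, DI, FO, EI).
STAGES = ["FI", "DI", "FO", "EI"]

def space_time(num_instructions):
    cycles = num_instructions + len(STAGES) - 1        # n + k - 1 clock cycles in total
    print("instr " + " ".join(f"{c + 1:>3}" for c in range(cycles)))
    for i in range(num_instructions):
        row = ["   "] * cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name                          # instruction i enters stage s at cycle i + s
        print(f"{i + 1:>5} " + " ".join(f"{x:>3}" for x in row))

space_time(4)
```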
Pipeline Hazards and their Solution
Hazards
1. Resource conflicts
• If one common memory is used for both data and instructions, and there is a need to
read/write data and fetch an instruction at the same time, a resource conflict occurs.
2. Data dependency conflict
• Data dependency conflicts arise when an instruction depends on the result of a
previous instruction, but that result is not yet available.
3. Branch difficulties
• Branch difficulties arise from program control instructions that may change the value
of PC.
• A program is not a straight flow of sequential instructions. There may be branch
instructions that alter the normal flow of the program, which delay pipeline
execution and affect performance.
Solution of pipeline hazards
1. Resource conflicts can be solved by using separate instruction and
data memory.
2. To solve the data dependency conflict, we have following methods:
i) Hardware interlock
• A circuit that detects the data dependency.
• After detecting that an instruction needs data from an instruction that is still being executed,
the hardware interlock delays the dependent instruction until the data is available.
ii) Operand forwarding
• Operand forwarding uses special hardware to detect a conflict and then avoids it by routing the
data through special paths between pipeline segments.
iii) Delayed load
• Delayed load is a procedure that gives the responsibilities for solving data conflicts to the
compiler.
• The compiler is designed to detect a conflict and reorder the instructions as necessary to delay
the loading of the conflicting data, inserting no-operation instructions where needed.
3. Solution to branch difficulties
i. Prefetch target instruction
• Both the branch target instruction and the instruction following the branch are prefetched and
are saved until the branch instruction is executed.
• If the branch is taken, execution continues with the prefetched branch target instruction.
ii. Branch target buffer (BTB)
• An associative memory included in the fetch segment of the pipeline that stores the target
instruction for a previously executed branch.
• Also stores the next few instructions after the branch target instruction.
iii. Loop buffer
• Similar to the BTB but faster.
• Program loops are stored in the loop buffer and can be executed directly without having access to
memory.
iv. Branch prediction
• Special hardware is used to predict the outcome of a conditional branch instruction. On the
basis of the prediction, instructions are prefetched (a small sketch follows this list).
v. Delayed branch
• The compiler detects the branch instructions and rearranges the instruction sequence to fill the
delay slots, inserting no-operation instructions when necessary.
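The branch-prediction item above does not fix a particular scheme; a common hardware choice is a 2-bit saturating counter per branch, sketched below purely as an assumed illustration:

```python
# Sketch of a 2-bit saturating-counter branch predictor (one counter per branch address).
# States 0, 1 -> predict not taken; states 2, 3 -> predict taken.
counters = {}

def predict(branch_addr):
    return counters.get(branch_addr, 1) >= 2          # new branches start "weakly not taken"

def update(branch_addr, taken):
    c = counters.get(branch_addr, 1)
    c = min(c + 1, 3) if taken else max(c - 1, 0)     # saturate at 0 and 3
    counters[branch_addr] = c

# A loop branch that is taken 9 times, then falls through once.
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if predict(0x100) == taken:
        correct += 1
    update(0x100, taken)
print(f"{correct}/{len(outcomes)} predictions correct")   # 8/10 for this pattern
```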
RISC pipeline
• More efficient; most operations need only a single clock cycle.
• Data transfer instructions in RISC - load and store.
• Use register indirect addressing.
• Usually need three or four stages in the pipeline.
• Uses two separate buses, one for instructions and one for data, to avoid conflicts.
• Cache memory is available for high speed operation.
• Most of the instruction executes in single cycle.
• A typical set of instructions are:
Data manipulation,
Data transfer and
Program control
RISC 3-segment instruction pipelining:
• There are three types of instructions:
The data manipulation instructions: operate on data in processor registers
The data transfer instructions.
The program control instructions.
• The instruction cycle can be divided into three sub-operations and
implemented in three segments:
I: Instruction fetch
A: ALU operation
E: Execute instruction
Delayed Load
• Consider the operation of the following four instructions:
1. LOAD:  R1 ← M[address 1]
2. LOAD:  R2 ← M[address 2]
3. ADD:   R3 ← R1 + R2
4. STORE: M[address 3] ← R3
• There will be a data conflict in instruction 3 because the operand in
R2 is not yet available in the A segment (fig: a).
• The E segment in clock cycle 4 is in the process of placing the memory
data into R2.
• The A segment in clock cycle 4 is using the data from R2.
Fig: 3-segment pipelining timing
• It is up to the compiler to make sure that the instruction immediately following a load
instruction does not use the data being fetched from memory.
• This concept of delaying the use of the data loaded from memory is referred
to as delayed load (fig: b).
• The no-op instruction is used to advance one clock cycle in order to
compensate for the data conflict in the pipeline.
• The advantage of the delayed load approach is that the data
dependency is taken care of by the compiler rather than the hardware.
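A compiler-side sketch of delayed load: scan the instruction sequence and insert a no-op whenever an instruction uses a register that the immediately preceding LOAD is still fetching. The tuple representation of instructions is an assumption made for illustration:

```python
# Each instruction: (opcode, destination register, set of source registers).
program = [
    ("LOAD",  "R1", set()),            # R1 <- M[address 1]
    ("LOAD",  "R2", set()),            # R2 <- M[address 2]
    ("ADD",   "R3", {"R1", "R2"}),     # R3 <- R1 + R2   (uses R2 one cycle too early)
    ("STORE", None, {"R3"}),           # M[address 3] <- R3
]

def insert_delayed_load_nops(prog):
    out = []
    for instr in prog:
        prev = out[-1] if out else None
        # If the previous instruction is a LOAD whose destination is a source here,
        # the data is not yet available: delay one cycle with a no-op.
        if prev and prev[0] == "LOAD" and prev[1] in instr[2]:
            out.append(("NOP", None, set()))
        out.append(instr)
    return out

for op, dst, srcs in insert_delayed_load_nops(program):
    print(op, dst, sorted(srcs))       # LOAD, LOAD, NOP, ADD, STORE
```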
Delayed branch
• RISC machines use different techniques to redefine the branches so
that they take effect at the proper time in the pipeline.
• This method is referred to as delayed branch.
The program for this example consists of five instructions:
i. Load from memory to R1
ii. Increment R2
iii. Add R3 to R4
iv. Subtract R5 from R6
v. Branch to address X
Vector Processing
• In many science and engineering applications, the problems can be
formulated in terms of vectors and matrices that lend themselves to
vector processing.
• Applications:
Long-range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulations
Artificial intelligence and expert systems
Mapping the human genome
Image processing
• To achieve the required level of high performance, it is necessary to utilize
the fastest and most reliable hardware and apply innovative procedures from
vector and parallel processing techniques.
Vector Operation
• Many scientific problems require arithmetic operations on large arrays of
numbers.
• A vector is an ordered set of a one-dimensional array of data items.
• A vector V of length n is represented as a row vector by V = [V1, V2, ..., Vn].
• It may also be represented as a column vector if the data items are listed in a column.
Fig: Vector instruction format
• A computer capable of vector processing eliminates the overhead associated
with the time it takes to fetch and execute the instructions in the program.
Matrix multiplication
• One of the most computational intensive operations performed in
computers with vector processors.
• The multiplication of two n × n matrices consists of n² inner products
or n³ multiply-add operations.
• An n x m matrix of numbers has n rows (vectors) and m columns
(vectors). Consider, for example, the multiplication of two 3 x 3
matrices A and B.
𝑎11 𝑎12 𝑎13 𝑏11 𝑏12 𝑏13 𝑐11 𝑐12 𝑐13
𝑎21 𝑎22 𝑎23 × 𝑏21 𝑏22 𝑏23 = 𝑐21 𝑐22 𝑐23
𝑎31 𝑎32 𝑎33 𝑏31 𝑏32 𝑏33 𝑐31 𝑐32 𝑐33
• In general, the inner product consists of the sum of k product terms of the form
C= A1B1+A2B2+A3B3+…+AkBk.
In a typical application k may be equal to 100 or even 1000.
The inner product calculation on a pipeline vector processor is shown below:
Fig: Pipeline for calculating an inner product
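One common arrangement (assumed here, since the figure itself is not reproduced) is to accumulate the inner product as several interleaved partial sums, matching the number of adder-pipeline segments, and to combine them at the end:

```python
# Inner product C = A1*B1 + A2*B2 + ... + Ak*Bk, accumulated as four interleaved
# partial sums, mimicking a 4-segment pipelined adder that keeps four sums in flight.
def inner_product(A, B, segments=4):
    partial = [0.0] * segments
    for i, (a, b) in enumerate(zip(A, B)):
        partial[i % segments] += a * b      # products flow into the adder round-robin
    return sum(partial)                     # final additions combine the partial sums

A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [0.5] * 8
print(inner_product(A, B))                  # 18.0
```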
Memory Interleaving
• Pipeline and vector processors often require simultaneous access to memory from
two or more sources.
An instruction pipeline may require the fetching of an instruction and an operand at the
same time from two different segments.
An arithmetic pipeline usually requires two or more operands to enter the
pipeline at the same time.
• Instead of using two memory buses for simultaneous access, the memory can be
partitioned into a number of modules connected to common address and data buses.
A memory module is a memory array together with its own address and data registers.
• The advantage of a modular memory is that it allows the use of a technique called
interleaving.
Fig: Multiple module memory organization
• In an interleaved memory, different sets of addresses are assigned to different
memory modules.
• By staggering the memory access, the effective memory cycle time can be reduced
by a factor close to the number of modules.
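A sketch of low-order interleaving, where the least significant address bits select the module; the module count of four is an assumption:

```python
# Low-order interleaving across 4 memory modules:
# module number = address mod 4, word within the module = address div 4.
NUM_MODULES = 4

def locate(address):
    module = address % NUM_MODULES          # low-order bits pick the module
    offset = address // NUM_MODULES         # remaining bits address within the module
    return module, offset

for addr in range(8):
    module, offset = locate(addr)
    print(f"address {addr} -> module {module}, offset {offset}")
# Consecutive addresses hit different modules, so a pipeline can start a new
# memory access every cycle while earlier modules are still busy.
```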
Super Computers
• A commercial computer with vector instructions and pipelined
floating-point arithmetic operations is referred to as a supercomputer.
• To speed up the operation, the components are packed tightly together
to minimize the distance that the electronic signals have to travel.
• This is augmented by instructions that process vectors and
combinations of scalars and vectors.
• A computer system best known for its high computational speed, fast
and large memory systems, and the extensive use of parallel
processing.
• It is equipped with multiple functional units and each unit has its own
pipeline configuration.
• It is specifically optimized for the type of numerical calculations involving
vectors and matrices of floating-point numbers.
• They are limited in their use to a number of scientific applications, such as
numerical weather forecasting, seismic wave analysis, and space research.
• A measure used to evaluate computers in their ability to perform a given number
of floating-point operations per second is referred to as flops.
• A typical supercomputer has a basic cycle time of 4 to 20 ns.
• The examples of supercomputer:
• Cray-1: It uses vector processing with 12 distinct functional units in parallel;
a large number of registers (over 150); multiprocessor configuration (Cray X-
MP and Cray Y-MP)
• Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300
megaflops
Array Processing
• An array processor is a processor that performs computations on large
arrays of data.
• The term is used to refer to two different types of processors.
• Attached array processor:
• Is an auxiliary processor.
• It is intended to improve the performance of the host computer in specific
numerical computation tasks.
• SIMD array processor:
• Has a single-instruction multiple-data organization.
• It manipulates vector instructions by means of multiple functional units
responding to a common instruction.
Attached Array Processor
• Its purpose is to enhance the performance of the computer by
providing vector processing for complex scientific applications.
• For example, when the FPS-164/MAX (an attached array processor for the VAX
system, from Floating Point Systems) is attached to a VAX 11 computer, it
increases the computing power of the VAX to 100 megaflops.
• The objective of the attached array processor is to provide vector
manipulation capabilities to a conventional computer at a fraction of
the cost of a supercomputer.
Fig: Attached array processor with host computer
megaFLOPS (millions of floating-point operations per second): a measure of computer performance,
useful in fields of scientific computation that require floating-point calculations.
For such cases it is a more accurate measure than instructions per second.
SIMD Array Processor
• An SIMD array processor is a computer with multiple processing units
operating in parallel.
• Contains a set of identical processing elements (PEs), each having a local memory
M.
• Each PE includes an ALU, a floating-point arithmetic unit, and working registers.
• Vector instructions are broadcast to all PEs simultaneously.
• Masking schemes are used to control the status of each PE during the execution of
vector instructions (a small sketch appears at the end of this section).
• Each PE has a flag that is set when the PE is active and reset when the PE is
inactive.
• For example, the ILLIAC IV computer developed at the University of Illinois and
manufactured by the Burroughs Corp.
• SIMD array processors are highly specialized computers.
• They are suited primarily for numerical problems that can be expressed in vector
or matrix form.
Fig: SIMD array processor organization
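A sketch of the masking idea described above: a vector instruction is broadcast to all PEs, but only PEs whose flag is set update their result. NumPy and the sample data are assumptions used only for illustration:

```python
import numpy as np

# Local memories of 8 processing elements (one data element per PE, for simplicity).
data = np.array([3.0, -1.0, 4.0, -1.5, 9.0, -2.0, 6.0, 5.0])

# Mask: PE flag set (True) = active, reset (False) = inactive.
active = data > 0                      # e.g. only PEs holding positive values participate

# Broadcast a single vector instruction ("multiply by 2") to all PEs;
# only the active PEs update their local result.
result = np.where(active, data * 2, data)
print(result)                          # [ 6.  -1.   8.  -1.5 18.  -2.  12.  10. ]
```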