Parallel Computing Unit Wise

Download as pdf or txt
Download as pdf or txt
You are on page 1of 80

UNIT-1

THE STATE OF COMPUTING


Computer Generations:

A computer is an electronic device that manipulates information or data. It has the ability to store, retrieve, and process
data.
Nowadays, a computer can be used to type documents, send email, play games, and browse the Web. It can also be used
to edit or create spreadsheets, presentations, and even videos. But the evolution of this complex system started around
1940 with the first Generation of Computer and evolving ever since.

There are five generations of computers.

1. FIRST GENERATION
o Introduction:
1. 1946-1959 is the period of first generation computer.
2. J.P.Eckert and J.W.Mauchy invented the first successful electronic computer called ENIAC,
ENIAC stands for “Electronic Numeric Integrated And Calculator”.

Few Examples are:


1. ENIAC
2. EDVAC
3. UNIVAC
4. IBM-701
5. IBM-650

Advantages:
1. It made use of vacuum tubes which are the only electronic component available during those days.
2. These computers could calculate in milliseconds.

Disadvantages:
1. These were very big in size, weight was about 30 tones.
2. These computers were based on vacuum tubes.
3. These computers were very costly.
4. It could store only a small amount of information due to the presence of magnetic drums.
5. As the invention of first generation computers involves vacuum tubes, so another disadvantage of these computers
was, vacuum tubes require a large cooling system.
6. Very less work efficiency.
7. Limited programming capabilities and punch cards were used to take inputs.
8. Large amount of energy consumption.
9. Not reliable and constant maintenance is required.

 Introduction:
1. 1959-1965 is the period of second-generation computer.
2. 3.Second generation computers were based on Transistor instead of vacuum tubes.
 Few Examples are:
1. Honeywell 400
2. IBM 7094
3. CDC 1604
4. CDC 3600
5. UNIVAC 1108and many more
 Advantages:
1. Due to the presence of transistors instead of vacuum tubes, the size of electron component decreased. This
resulted in reducing the size of a computer as compared to first generation computers.
2. Less energy and not produce as much heat as the first genration.
3. Assembly language and punch cards were used for input.
4. Low cost than first generation computers.
5. Better speed, calculate data in microseconds.
6. Better portability as compared to first generation
 Disadvantages:
1. A cooling system was required.
2. Constant maintenance was required.
3. Only used for specific purposes.

 Introduction:
1. 1965-1971 is the period of third generation computer.
2. These computers were based on Integrated circuits.
3. IC was invented by Robert Noyce and Jack Kilby In 1958-1959.
4. IC was a single component containing number of transistors.
 Few Examples are:
1. PDP-8
2. PDP-11
3. ICL 2900
4. IBM 360
5. IBM 370 and many more
 Advantages:
1. These computers were cheaper as compared to second-generation computers.
2. They were fast and reliable.
3. Use of IC in the computer provides the small size of the computer.
4. IC not only reduce the size of the computer but it also improves the performance of the computer as
compared to previous computers.
5. This generation of computers has big storage capacity.

5. Instead of punch cards, mouse and keyboard are used for input.
6. They used an operating system for better resource management and used the concept of time-sharing and
multiple programming.
7. These computers reduce the computational time from microseconds to nanoseconds.
 Disadvantages:
5. IC chips are difficult to maintain.
6. The highly sophisticated technology required for the manufacturing of IC chips.
7. Air conditioning is required.

GENERATION
 Introduction:
1. 1971-1980 is the period of fourth generation computer.
2. This technology is based on Microprocessor.
3. A microprocessor is used in a computer for any logical and arithmetic function to be performed in any
program.
4. Graphics User Interface (GUI) technology was exploited to offer more comfort to users.

Few Examples are:


1. IBM 4341
2. DEC 10
3. STAR 1000
4. PUP 11 and many more

Advantages:
1. Fastest in computation and size get reduced as compared to the previous generation of computer.
2. Heat generated is negligible.
3. Small in size as compared to previous generation computers.
4. Less maintenance is required.
5. All types of high-level language can be used in this type of computers.

Disadvantages:
1. The Microprocessor design and fabrication are very complex.
2. Air conditioning is required in many cases due to the presence of ICs.
3. Advance technology is required to make the ICs.

 Introduction:
1. The period of the fifth generation in 1980-onwards.
2. This generation is based on artificial intelligence.
3. The aim of the fifth generation is to make a device which could respond to natural language input and are
capable of learning and self-organization.
4. This generation is based on ULSI(Ultra Large Scale Integration) technology resulting in the production of
microprocessor chips having ten million electronic component.
 Few Examples are:
1. Desktop
2. Laptop
3. NoteBook
4. UltraBook
5. Chromebook
6. and many more

Advantages:

 It is more reliable and works faster.


 It is available in different sizes and unique features.
 It provides computers with more user-friendly interfaces with multimedia features.

Disadvantages:

 They need very low-level languages.


 They may make the human brains dull and doomed.

Elements of Modern Computers:

The hardware, software, and programming elements of modern computer systems can be
characterized by looking at a variety of factors in context of parallel computing these factors are:
 Computing problems
 Algorithms and data structures
 Hardware resources
 Operating systems
 System software support
 Compiler support
Computing Problems

 Numerical computing complex mathematical formulations tedious integer or floating -


point computation
 Transaction processing accurate transactions large database management information
retrieval
 Logical Reasoning logic inferences symbolic manipulation
Algorithms and Data Structures

 Traditional algorithms and data structures are designed for sequential machines.
 New, specialized algorithms and data structures are needed to exploit the capabilities of
parallel architectures.
 These often require interdisciplinary interactions among theoreticians, experimentalists,
and programmers.
Hardware Resources

 The architecture of a system is shaped only partly by the hardware resources.


 The operating system and applications also significantly influence the overall
architecture.
 Not only must the processor and memory architectures be considered, but also the
architecture of the device interfaces (which often include their advanced processors).
Operating System

 Operating systems manage the allocation and deallocation of resources during user
program execution.
 UNIX, Mach, and OSF/1 provide support for multiprocessors and multicomputers
 multithreaded kernel functions virtual memory management file subsystems network
communication services
 An OS plays a significant role in mapping hardware resources to algorithmic and data
structures.
System Software Support

 Compilers, assemblers, and loaders are traditional tools for developing programs in high-
level languages. With the operating system, these tools determine the bind of resources to
applications, and the effectiveness of this determines the efficiency of hardware
utilization and the system’s programmability.
 Most programmers still employ a sequential mind set, abetted by a lack of popular
parallel software support.
System Software Support
 Parallel software can be developed using entirely new languages designed specifically
with parallel support as its goal, or by using extensions to existing sequential languages.
 New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system software.
 The most common approach is to extend an existing language.
Compiler Support

 Preprocessors use existing sequential compilers and specialized libraries to implement


parallel constructs
 Precompilers perform some program flow analysis, dependence checking, and limited
parallel optimzations
 Parallelizing Compilers requires full detection of parallelism in source code, and
transformation of sequential code into parallel constructs
 Compiler directives are often inserted into source code to aid compiler parallelizing
efforts

A modern computer system consists of computer hardware, instruction sets, application programs, system
software and user interface.

The computing problems are categorized as numerical computing, logical reasoning, and transaction
processing. Some complex problems may need the combination of all the three processing modes.
 Evolution of Computer Architecture − In last four decades, computer architecture has gone
through revolutionary changes. We started with Von Neumann architecture and now we have
multicomputers and multiprocessors.
 Performance of a computer system − Performance of a computer system depends both on machine
capability and program behavior. Machine capability can be improved with better hardware
technology, advanced architectural features and efficient resource management. Program behavior is
unpredictable as it is dependent on application and run-time conditions.

SYSTEM ATTRIBUTES TO PERFORMANCE

Performance of a system depends on


 hardware technology
 architectural features
 efficient resource management
 algorithm design
 data structures
 language efficiency
 programmer skill
 compiler technology
When we talk about performance of computer system we would describe how quickly a
given system can execute a program or programs. Thus we are interested in knowing the
turnaround time. Turnaround time depends on:
 disk and memory accesses
 input and output
 compilation time
 operating system overhead
 CPU time
An ideal performance of a computer system means a perfect match between the machine
capability and program behavior. The machine capability can be improved by using
better hardware technology and efficient resource management. But as far as program
behavior is concerned it depends on code used, compiler used and other run time
conditions. Also a machine performance may vary from program to program. Because
there are too many programs and it is impractical to test a CPU's speed on all of them,
benchmarks were developed. Computer architects have come up with a variety of metrics
to describe the computer performance.
Clock rate and CPI / IPC : Since I/O and system overhead frequently overlaps
processing by other programs, it is fair to consider only the CPU time used by a program,
and the user CPU time is the most important factor. CPU is driven by a clock with a
const
internal operations in the CPU. The clock mostly has the constant cycle time (t in
nanoseconds). The inverse of the cycle time is the clock rate (f
megahertz). A shorter clock cycle time, or equivalently a larger number of cycles per
second, implies more operations can be performed per unit time. The size of the program
is determined by the instruction count (Ic). The size of a program is determined by its
instruction count, Ic, the number of machine instructions to be executed by the program.
Different machine instructions require different numbers of clock cycles to execute. CPI
(cycles per instruction) is thus an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
Of course, any estimate is valid only for a specific set of programs (which defines the
instruction mix), and then only if there are sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given
program mix. The time required to execute a program containing Ic instructions is just T
= Ic * CPI
Each instruction must be fetched from memory, decoded, then operands fetched from
memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k
times the processor cycl
the processor-memory interconnection scheme. The processor cycles required for each
instruction (CPI) can be attributed to cycles needed for instruction decode and execution
(p), and cycles needed for memory references (m* k).
The total time needed to execute a program can then be rewritten as
T = Ic* (p + m*k)*
MIPS: The millions of instructions per second, this is calculated by dividing the number of
instructions executed in a running program by time required to run the program. The MIPS
rate is directly proportional to the clock rate and inversely proportion to the CPI. All four
systems attributes (instruction set, compiler, processor, and memory technologies) affect the
MIPS rate, which varies also from program to program. MIPS does not proved to be
effective as it does not account for the fact that different systems often require different
number of instruction to implement the program. It does not inform about how many
instructions are required to perform a given task. With the variation in instruction styles,
internal organization, and number of processors per system it is almost meaningless for
comparing two systems.
MFLOPS (pronounced ``megaflops'') stands for ``millions of floating point operations per
second.'' This is often used as a ``bottom-line'' figure. If one know ahead of time how many
operations a program needs to perform, one can divide the number of operations by the
execution time to come up with a MFLOPS rating. For example, the standard algorithm for
multiplying n*n matrices requires 2n3 – n operations (n2 inner products, with n
multiplications and n-1additions in each product). Suppose you compute the product of two
100 *100 matrices in 0.35 seconds. Then the computer achieves
(2(100)3 – 100)/0.35 = 5,714,000 ops/sec = 5.714 MFLOPS
The term ``theoretical peak MFLOPS'' refers to how many operations per second would be
possible if the machine did nothing but numerical operations. It is obtained by calculating
the time it takes to perform one operation and then computing how many of them could be
done in one second. For example, if it takes 8 cycles to do one floating point multiplication,
the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not
overlapped with one another, it takes 160ns for one multiplication, and (1,000,000,000
nanosecond/1sec)*(1 multiplication / 160 nanosecond) = 6.25*106 multiplication /sec so the
theoretical peak performance is 6.25 MFLOPS. Of course, programs are not just long
sequences of multiply and add instructions, so a machine rarely comes close to this level of
performance on any real program. Most machines will achieve less than 10% of their peak
rating, but vector processors or other machines with internal pipelines that have an effective
CPI near 1.0 can often achieve 70% or more of their theoretical peak on small programs.
Throughput rate : Another important factor on which system’s performance is measured is
throughput of the system which is basically how many programs a system can execute per
unit time Ws. In multiprogramming the system throughput is often lower than the CPU
throughput Wp which is defined as
Wp = f/(Ic * CPI)
Unit of Wp is programs/second.
Ws <Wp as in multiprogramming environment there is always additional overheads like
timesharing operating system etc. An Ideal behavior is not achieved in parallel computers
because while executing a parallel algorithm, the processing elements cannot devote
100% of their time to the computations of the algorithm. Efficiency is a measure of the
fraction of time for which a PE is usefully employed. In an ideal parallel system
efficiency is equal to one. In practice, efficiency is between zero and one
s of overhead associated with parallel execution
Speed or Throughput (W/Tn) - the execution rate on an n processor system, measured in
FLOPs/unit-time or instructions/unit-time.
Speedup (Sn = T1/Tn) - how much faster in an actual machine, n processors compared to 1
will perform the workload. The rati asymptotic speedup.
Efficiency (En = Sn/n) - fraction of the theoretical maximum speedup achieved by n
processors
Degree of Parallelism (DOP) - for a given piece of the workload, the number of processors
that can be kept busy sharing that piece of computation equally. Neglecting overhead, we
assume that if k processors work together on any workload, the workload gets done k times
as fast as a sequential execution.
Scalability - The attributes of a computer system which allow it to be gracefully and linearly
scaled up or down in size, to handle smaller or larger workloads, or to obtain proportional
decreases or increase in speed on a given application. The applications run on a scalable
machine may not scale well. Good scalability requires the algorithm and the machine to
have the right properties
Thus in general there are five performance factors (Ic, p, m, k, t) which are influenced by
four system attributes:
 instruction-set architecture (affects Ic and p)
 compiler technology (affects Ic and p and m)
 CPU implementation and control (affects p *t cache and memory hierarchy
(affects memory access latency, k ´t
 Total CPU time can be used as a basis in estimating the execution rate of a
processor.
Programming Environments
Programmability depends on the programming environment provided to the users.
Conventional computers are used in a sequential programming environment with tools
developed for a uniprocessor computer. Parallel computers need parallel tools that allow
specification or easy detection of parallelism and operating systems that can perform
parallel scheduling of concurrent events, shared memory allocation, and shared peripheral
and communication links.
Implicit Parallelism
Use a conventional language (like C, Fortran, Lisp, or Pascal) to write the program.
Use a parallelizing compiler to translate the source code into parallel code.
The compiler must detect parallelism and assign target machine resources.
Success relies heavily on the quality of the compiler.
Explicit Parallelism
Programmer writes explicit parallel code using parallel dialects of common languages.
Compiler has reduced need to detect parallelism, but must still preserve existing
parallelism and assign target machine resources.
Needed Software Tools
Parallel extensions of conventional high-level languages.
Integrated environments to provide different levels of program abstraction validation,
testing and debugging performance prediction and monitoring visualization support to aid
program development, performance measurement graphics display and animation of
computational results

PARADIGMS OF PARALLEL COMPUTING

(1) Synchronous:
(i) Vector Supercomputers: In a vector computer, a vector processor is attached to the scalar processor as an
optional feature. The host computer first loads program and data to the main memory. Then the scalar
control unit decodes all the instructions. If the decoded instructions are scalar operations or program
operations, the scalar processor executes those operations using scalar functional pipelines.

On the other hand, if the decoded instructions are vector operations then the instructions will be sent to
vector control unit.
(ii) SIMD Supercomputers: In SIMD computers, ‘N’ number of processors are connected to a
control unit and all the processors have their individual memory units. All the processors are
connected by an interconnection network.
(iii) Systolic arrays: Parallel processing approach diverges from traditional Von Neumann
architecture. One such approach is the concept of Systolic processing using systolic arrays. A
systolic array is a network of processors that rhythmically compute and pass data through the
system. They derived their name from drawing an analogy to how blood rhythmically flows
through a biological heart as the data flows from memory in a rhythmic fashion passing through
many elements before it returns to memory.It is also an example of pipelining along with parallel
computing.It was introduced in 1970s and was used by Intel to make CMU’s iWarp processor in
1990.

In a systolic array there are a large number of identical simple processors or processing elements(PEs) that
are arranged in a well organised structure such as linear or two dimensional array. Each processing element
is connected with the other PEs and has a limited private storage.

A Host station is often used for communication with the outside world in the network.

Characteristics:

1. Parallel Computing –
Many processes are carried out simultaneously. As the arrays have a non-centralized structure,
parallel computing is implemented.
2. Pipelinability –
It means that the array can achieve high speed. It shows a linear rate pipelinability.
3. Synchronous evaluation –
Computation of data is timed by a global clock and then the data is passed through the network.The
global clock synchronizes the array and has fixed length clock cycles.
4. Repetability –
Most of the arrays have the repetition and interconnection of a single type of PE in the entire
network.
5. Spatial Locality –
The cells have a local communication interconnection.
6. Temporal Locality –
One unit time delay is at least required for the transmission of signals from one cell to another.

7. Modularity and regularity –


A systolic array consists of processing units that are modular and have homogeneous interconnection
and the computer network can be extended indefinitely.
Advantages of Systolic array –

 It employs high degree of parallelism and can sustain a very high throughput.
 These are highly compact, robust and efficient.
 Data and control flow are simple and regular.

Disadvantages of Systolic array –

 They are highly specialized and thus are inflexible regarding the problems they can solve.
 These are difficult to build.
 These are expensive.

(2) Asynchronous:
(i) MIMD:
In computing, MIMD (multiple instruction, multiple data) is a technique employed to achieve
parallelism. Machines using MIMD have a number of processors that function asynchronously and
independently. At any time, different processors may be executing different instructions on different
pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided
design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD
machines can be of either shared memory or distributed memory categories. These classifications are
based on how MIMD processors access memory. Shared memory machines may be of the bus-based,
extended, or hierarchical type. Distributed memory machines may have hypercube or mesh
interconnection schemes.
An example of MIMD system is Intel Xeon Phi, descended from Larrabee microarchitecture.[1] These
processors have multiple processing cores (up to 61 as of 2015) that can execute different instructions on
different data. Most parallel computers, as of 2013, are MIMD systems.

FLYNN’S CLASSIFICATION

Parallel computing is a computing where the jobs are broken into discrete parts that can be executed
concurrently. Each part is further broken down to a series of instructions. Instructions from each part execute
simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple computer
resources that can include a single computer with multiple processors, a number of computers connected by
a network to form a parallel processing cluster or a combination of both.
Parallel systems are more difficult to program than computers with a single processor because the
architecture of parallel computers varies accordingly and the processes of multiple CPUs must be
coordinated and synchronized.

The crux of parallel processing are CPUs. Based on the number of instruction and data streams that can be
processed simultaneously, computing systems are classified into four major categories:

Flynn’s classification –

1. Single-instruction, single-data (SISD) systems –


An SISD computing system is a uniprocessor machine which is capable of executing a single
instruction, operating on a single data stream. In SISD, machine instructions are processed in a
sequential manner and computers adopting this model are popularly called sequential computers.
Most conventional computers have SISD architecture. All the instructions and data to be processed
have to be stored in primary memory.

The speed of the processing element in the SISD model is limited(dependent) by the rate at which the
computer can transfer information internally. Dominant representative SISD systems are IBM PC,
workstations.

2. Single-instruction, multiple-data (SIMD) systems –


An SIMD system is a multiprocessor machine capable of executing the same instruction on all the
CPUs but operating on different data streams. Machines based on an SIMD model are well suited to
scientific computing since they involve lots of vector and matrix operations. So that the information
can be passed to all the processing elements (PEs) organized data elements of vectors can be divided
into multiple sets(N-sets for N PE systems) and each PE can process one data set.
3. Multiple-instruction, single-data (MISD) systems –
An MISD computing system is a multiprocessor machine capable of executing different instructions
on different PEs but all of them operating on the same dataset .

Example Z = sin(x)+cos(x)+tan(x)
The system performs different operations on the same data set. Machines built using the MISD model are not
useful in most of the application, a few machines are built, but none of them are available commercially.

4. Multiple-instruction, multiple-data (MIMD) systems –


An MIMD system is a multiprocessor machine which is capable of executing multiple instructions on
multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore
machines built using this model are capable to any kind of application. Unlike SIMD and MISD
machines, PEs in MIMD machines work asynchronously.
MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD
based on the way PEs are coupled to the main memory.

In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are connected to
a single global memory and they all have access to it. The communication between PEs in this model takes
place through the shared memory, modification of the data stored in the global memory by one PE is visible
to all other PEs. Dominant representative shared memory MIMD systems are Silicon Graphics machines and
Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled multiprocessor systems) all PEs have a local
memory. The communication between PEs in this model takes place through the interconnection network
(the inter process communication channel, or IPC). The network connecting PEs can be configured to tree,
mesh or in accordance with the requirement.
The shared-memory MIMD architecture is easier to program but is less tolerant to failures and harder to
extend with respect to the distributed memory MIMD model. Failures in a shared-memory MIMD affect the
entire system, whereas this is not the case of the distributed model, in which each of the PEs can be easily
isolated. Moreover, shared memory MIMD architectures are less likely to scale because the addition of more
PEs leads to memory contention. This is a situation that does not happen in the case of distributed memory,
in which each PE has its own memory. As a result of practical outcomes and user’s requirement , distributed
memory MIMD architecture is superior to the other existing models.

FENG’S CLASSIFICATION
According to Feng’s classification, computer architecture can be classified into four. The classification is based on the
way contents stored in memory are processed. The contents can be either data or instructions.

1. Word Serial Bit Serial (WSBS)


2. Word Serial Bit Parallel (WSBP)
3. Word Parallel Bit Serial (WPBS)
4. Word Parallel Bit Parallel (WPBP)

Word serial bit serial (WSBS)

One bit of one selected word is processed at a time. This represents serial processing and needs maximum processing
time.
Word serial bit parallel (WSBP)

is found in most existing computers and has been called as Word Slice processing because one word of n bit is
processed at a time. All bits of a selected word are processed at a time. Bit parallel means all bits of a word.

Word parallel bit serial (WPBS)

It has been called bit slice processing because m-bit slice is processed at a time. Word parallel signifies selection of all
words. It can be considered as one bit from all words are processed at a time.

Word parallel bit parallel (WPBP)

It is known as fully parallel processing in which an array on n x m bits is processed at one time. Maximum parallelism
is achieved here.

Limitations of Feng's classification

It fails to project the concurrency in pipeline processors, as degree of parallelism doesn't account for concurrency
handle by pipe-lined design.

HANDLER’S CLASSIFICATION
In 1977, Wolfgang Handler proposed an elaborate notation for expressing the pipelining and parallelism of
computers. Handler's classification addresses the computer at three distinct levels:
•Processor control unit (PCU),
•Arithmetic logic unit (ALU),
•Bit-level circuit (BLC).
The PCU corresponds to a processor or CPU, the ALU corresponds to a functional unit or a processing element
and the BLC corresponds to the logic circuit needed to perform one-bit operations in the ALU.

Handler's classification uses the following three pairs of integers to describe a computer:
Computer = (p * p', a * a', b * b')
Where p = number of PCUs
Where p'= number of PCUs that can be pipelined
Where a = number of ALUs controlled by each PCU
Where a'= number of ALUs that can be pipelined
Where b = number of bits in ALU or processing element (PE) word Where b'= number of pipeline segments on
all ALUs or in a single PE

The following rules and operators are used to show the relationship between various elements of the computer:
•The '*' operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running
through all the units.
•The '+' operator is used to indicate that the units are not pipelined but work on independent streams of data.
•The 'v' operator is used to indicate that the computer hardware can work in one of several modes. •The '~'
symbol is used to indicate a range of values for any one of the parameters.
•Peripheral processors are shown before the main processor using another three pairs of integers. If the value of
the second element of any pair is 1, it may omitted for brevity.

Handler's classification is best explained by showing how the rules and operators are used to classify several
machines. 33The CDC 6600 has a single main processor supported by 10 I/O processors. One control unit
coordinates one ALU with a 60-bit word length. The ALU has 10 functional units which can be formed into a
pipeline. The 10 peripheral I/O processors may work in parallel with each other and with the CPU. Each I/O
processor contains one 12-bit ALU. The description for the 10 I/O processors is:

CDC 6600I/O = (10, 1, 12) The description for the main processor is:
Elements of Parallel Computing and Architecture
CDC 6600main = (1, 1 * 10, 60) The main processor and the I/O processors can be regarded as forming a macro-
pipeline so the '*' operator is used to combine the two structures: CDC 6600 = (I/O processors) * (central
processor = (10, 1, 12) * (1, 1 * 10, 60) Texas Instrument's Advanced Scientific Computer (ASC) has one
controller coordinating four arithmetic units. Each arithmetic unit is an eight stage pipeline with 64-bit words.
Thus we have: ASC = (1, 4, 64 * 8) The Cray-1 is a 64-bit single processor computer whose ALU has twelve
functional units, eight of which can be chained together to from a pipeline. Different functional units have from 1
to 14 segments, which can also be pipelined. Handler's description of the Cray-1 is: Cray-1 = (1, 12 * 8, 64 * (1 ~
14))

Another sample system is Carnegie-Mellon University's C.mmp multiprocessor. This system was designed to
facilitate research into parallel computer architectures and consequently can be extensively reconfigured. The
system consists of 16 PDP-11 'minicomputers' (which have a 16-bit word length), interconnected by a crossbar
switching network. Normally, the C.mmp operates in MIMD mode for which the description is (16, 1, 16).

It can also operate in SIMD mode, where all the processors are coordinated by a single master controller. The
SIMD mode description is (1, 16, 16). Finally, the system can be rearranged to operate in MISD mode. Here the
processors are arranged in a chain with a single stream of data passing through all of them.

The MISD modes description is (1 * 16, 1, 16). The 'v' operator is used to combine descriptions of the same piece
of hardware operating in differing modes. Thus, Handler's description for the complete C.mmp is: C.mmp = (16,
1, 16) v (1, 16, 16) v (1 * 16, 1, 16) The '*' and '+' operators are used to combine several separate pieces of
hardware.

The 'v' operator is of a different form to the other two in that it is used to combine the different operating modes
of a single piece of hardware. While Flynn's classification is easy to use, Handler's classification is cumbersome.
The direct use of numbers in the nomenclature of Handler’s classification’s makes it much more abstract and
hence difficult. Handler's classification is highly geared towards the description of pipelines and chains. While it
is well able to describe the parallelism in a single processor, the variety of parallelism in multiprocessor
computers is not addressed well.
UNIT-2

COMBINATIONAL CIRCUITS
Combinational circuit is a circuit in which we combine the different gates in the circuit, for example encoder, decoder,
multiplexer and demultiplexer. Some of the characteristics of combinational circuits are following −

 The output of combinational circuit at any instant of time, depends only on the levels present at input
terminals.
 The combinational circuit do not use any memory. The previous state of input does not have any effect on the
present state of the circuit.
 A combinational circuit can have an n number of inputs and m number of outputs.

Block diagram

We're going to elaborate few important combinational circuits as follows.

Half Adder

Half adder is a combinational logic circuit with two inputs and two outputs. The half adder circuit is designed to add
two single bit binary number A and B. It is the basic building block for addition of two single bit numbers. This circuit
has two outputs carry and sum.

Block diagram
Truth Table

Circuit Diagram

Full Adder

Full adder is developed to overcome the drawback of Half Adder circuit. It can add two one-bit numbers A and B, and
carry c. The full adder is a three input and two output combinational circuit.

Block diagram
Truth Table

Circuit Diagram

N-Bit Parallel Adder

The Full Adder is capable of adding only two single digit binary number along with a carry input. But in practical we
need to add binary numbers which are much longer than just one bit. To add two n-bit binary numbers we need to use
the n-bit parallel adder. It uses a number of full adders in cascade. The carry output of the previous full adder is
connected to carry input of the next full adder.

4 Bit Parallel Adder

In the block diagram, A0 and B0 represent the LSB of the four bit words A and B. Hence Full Adder-0 is the lowest
stage. Hence its Cin has been permanently made 0. The rest of the connections are exactly same as those of n-bit
parallel adder is shown in fig. The four bit parallel adder is a very common logic circuit.
Block diagram

N-Bit Parallel Subtractor

The subtraction can be carried out by taking the 1's or 2's complement of the number to be subtracted. For example we
can perform the subtraction (A-B) by adding either 1's or 2's complement of B to A. That means we can use a binary
adder to perform the binary subtraction.

4 Bit Parallel Subtractor

The number to be subtracted (B) is first passed through inverters to obtain its 1's complement. The 4-bit adder then
adds A and 2's complement of B to produce the subtraction. S3 S2 S1 S0 represents the result of binary subtraction (A-
B) and carry output Cout represents the polarity of the result. If A > B then Cout = 0 and the result of binary form (A-B)
then Cout = 1 and the result is in the 2's complement form.
Block diagram

Half Subtractors

Half subtractor is a combination circuit with two inputs and two outputs (difference and borrow). It produces the
difference between the two binary bits at the input and also produces an output (Borrow) to indicate if a 1 has been
borrowed. In the subtraction (A-B), A is called as Minuend bit and B is called as Subtrahend bit.

Truth Table
Circuit Diagram

Full Subtractors

The disadvantage of a half subtractor is overcome by full subtractor. The full subtractor is a combinational circuit with
three inputs A,B,C and two output D and C'. A is the 'minuend', B is 'subtrahend', C is the 'borrow' produced by the
previous stage, D is the difference output and C' is the borrow output.

Truth Table
Circuit Diagram

Multiplexers

Multiplexer is a special type of combinational circuit. There are n-data inputs, one output and m select inputs with 2m
= n. It is a digital circuit which selects one of the n data inputs and routes it to the output. The selection of one of the n
inputs is done by the selected inputs. Depending on the digital code applied at the selected inputs, one out of n data
sources is selected and transmitted to the single output Y. E is called the strobe or enable input which is useful for the
cascading. It is generally an active low terminal that means it will perform the required operation when it is low.
Block diagram

Multiplexers come in multiple variations

 2 : 1 multiplexer
 4 : 1 multiplexer
 16 : 1 multiplexer
 32 : 1 multiplexer

Block Diagram
Truth Table

Demultiplexers

A demultiplexer performs the reverse operation of a multiplexer i.e. it receives one input and distributes it over several
outputs. It has only one input, n outputs, m select input. At a time only one output line is selected by the select lines
and the input is transmitted to the selected output line. A de-multiplexer is equivalent to a single pole multiple way
switch as shown in fig.

Demultiplexers comes in multiple variations.

 1 : 2 demultiplexer
 1 : 4 demultiplexer
 1 : 16 demultiplexer
 1 : 32 demultiplexer

Block diagram

Truth Table
Decoder

A decoder is a combinational circuit. It has n input and to a maximum m = 2n outputs. Decoder is identical to a
demultiplexer without any data input. It performs operations which are exactly opposite to those of an encoder.

Block diagram

Examples of Decoders are following.

 Code converters
 BCD to seven segment decoders
 Nixie tube decoders
 Relay actuator

2 to 4 Line Decoder

The block diagram of 2 to 4 line decoder is shown in the fig. A and B are the two inputs where D through D are the
four outputs. Truth table explains the operations of a decoder. It shows that each output is 1 for only a specific
combination of inputs.

Block diagram
Truth Table

Logic Circuit

Encoder

Encoder is a combinational circuit which is designed to perform the inverse operation of the decoder. An encoder has n
number of input lines and m number of output lines. An encoder produces an m bit binary code corresponding to the
digital input number. The encoder accepts an n input digital word and converts it into an m bit another digital word.

Block diagram

Examples of Encoders are following.


 Priority encoders
 Decimal to BCD encoder
 Octal to binary encoder
 Hexadecimal to binary encoder

Priority Encoder

This is a special type of encoder. Priority is given to the input lines. If two or more input line are 1 at the same time,
then the input line with highest priority will be considered. There are four input D0, D1, D2, D3 and two output Y0, Y1.
Out of the four input D3 has the highest priority and D0 has the lowest priority. That means if D3 = 1 then Y1 Y1 = 11
irrespective of the other inputs. Similarly if D3 = 0 and D2 = 1 then Y1 Y0 = 10 irrespective of the other inputs.

Block diagram

Truth Table
Logic Circuit

SORTING NETWORK
Comparison networks differ from RAM's in two important respects. First, they can only perform comparisons. Thus,
an algorithm such as counting sort (see Section 9.2) cannot be implemented on a comparison network. Second, unlike
the RAM model, in which operations occur serially--that is, one after another--operations in a comparison network
may occur at the same time, or "in parallel." As we shall see, this characteristic allows the construction of comparison
networks that sort n values in sublinear time.

We begin in Section 28.1 by defining comparison networks and sorting networks. We also give a natural definition for
the "running time" of a comparison network in terms of the depth of the network. Section 28.2 proves the "zero-one
principle," which greatly eases the task of analyzing the correctness of sorting networks.

The efficient sorting network that we shall design is essentially a parallel version of the merge-sort algorithm from
Section 1.3.1. Our construction will have three steps. Section 28.3 presents the design of a "bitonic" sorter that will be
our basic building block. We modify the bitonic sorter slightly in Section 28.4 to produce a merging network that can
merge two sorted sequences into one sorted sequence. Finally, in Section 28.5, we assemble these merging networks
into a sorting network that can sort n values in O(lg2 n) time.

Comparison networks:

Sorting networks are comparison networks that always sort their inputs, so it makes sense to begin our discussion with
comparison networks and their characteristics. A comparison network is comprised solely of wires and comparators. A
comparator, shown in Figure 28.1 (a), is a device with two inputs, x and y, and two outputs, x' and y', that performs the
following function:
Figure 28.1 (a) A comparator with inputs x and y and outputs x' and y'. (b) The same comparator, drawn as a single
vertical line. Inputs x = 7, y = 3 and outputs x' = 3, y' = 7 are shown.
x' = min(x, y) ,
y' = max(x, y) .

Because the pictorial representation of a comparator in Figure 28.1 (a) is too bulky for our purposes, we shall adopt the
convention of drawing comparators as single vertical lines, as shown in Figure 28.1(b). Inputs appear on the left and
outputs on the right, with the smaller input value appearing on the top output and the larger input value appearing on
the bottom output. We can thus think of a comparator as sorting its two inputs.

We shall assume that each comparator operates in O(1) time. In other words, we assume that the time between the
appearance of the input values x and y and the production of the output values x' and y' is a constant.

A wire transmits a value from place to place. Wires can connect the output of one comparator to the input of another,
but otherwise they are either network input wires or network output wires. Throughout this chapter, we shall assume
that a comparison network contains n input wires a1, a2, . . . , an, through which the values to be sorted enter the
network, and n output wires b1, b2, . . . , bn, which produce the results computed by the network. Also, we shall speak
of the input sequence a1, a2, . . . , an and the output sequence b1,b2, . . . ,bn , referring to the values on the input and
output wires. That is, we use the same name for both a wire and the value it carries. Our intention will always be clear
from the context.

Figure 28.2 shows a comparison network, which is a set of comparators interconnected by wires. We draw a
comparison network on n inputs as a collection of n horizontal lines with comparators stretched vertically. Note that a
line does not represent a single wire, but rather a sequence of distinct wires connecting various comparators. The top
line in Figure 28.2, for example, represents three wires: input wire a1, which connects to an input of comparator A; a
wire connecting the top output of comparator A to an input of comparator C; and output wire b1, which comes from the
top output of comparator C. Each comparator input is connected to a wire that is either one of the network's n input
wires a1, a2, . . . , an or is connected to the output of another comparator. Similarly, each comparator output is
connected to a wire that is either one of the network's n output wires b1, b2, . . . , bn or is connected to the input of
another comparator. The main requirement for interconnecting comparators is that the graph of interconnections must
be acyclic: if we trace a path from the output of a given comparator to the input of another to output to input, etc., the
path we trace must never cycle back on itself and go through the same comparator twice. Thus, as in Figure 28.2, we
can draw a comparison network with network inputs on the left and network outputs on the right; data move through
the network from left to right.
Figure 28.2 (a) A 4-input, 4-output comparison network, which is in fact a sorting network. At time 0, the input values
shown appear on the four input wires. (b) At time 1, the values shown appear on the outputs of comparators A and B,
which are at depth l. (c) At time 2, the values shown appear on the outputs of comparators C and D, at depth 2. Output
wires b1 and b4 now have their final values, but output wires b2 and b3 do not. (d) At time 3, the values shown appear
on the outputs of comparator E, at depth 3. Output wires b2 and b3 now have their final values.

Each comparator produces its output values only when both of its input values are available to it. In Figure 28.2(a), for
example, suppose that the sequence 9, 5, 2, 6 appears on the input wires at time 0. At time 0, then, only
comparators A and B have all their input values available. Assuming that each comparator requires one time unit to
compute its output values, comparators A and B produce their outputs at time 1; the resulting values are shown in
Figure 28.2(b). Note that comparators A and B produce their values at the same time, or "in parallel." Now, at time 1,
comparators C and D, but not E, have all their input values available. One time unit later, at time 2, they produce their
outputs, as shown in Figure 28.2(c). Comparators C and D operate in parallel as well. The top output of comparator C
and the bottom output of comparator D connect to output wires b1 and b4, respectively, of the comparison network, and
these network output wires therefore carry their final values at time 2. Meanwhile, at time 2, comparator E has its
inputs available, and Figure 28.2(d) shows that it produces its output values at time 3. These values are carried on
network output wires b2 and b3, and the output sequence 2, 5, 6, 9 is now complete.

Under the assumption that each comparator takes unit time, we can define the "running time" of a comparison
network, that is, the time it takes for all the output wires to receive their values once the input wires receive theirs.
Informally, this time is the largest number of comparators that any input element can pass through as it travels from an
input wire to an output wire. More formally, we define the depth of a wire as follows. An input wire of a comparison
network has depth 0. Now, if a comparator has two input wires with depths dx and dy, then its output wires have depth
max(dx, dy) + 1. Because there are no cycles of comparators in a comparison network, the depth of a wire is well
defined, and we define the depth of a comparator to be the depth of its output wires. Figure 28.2 shows comparator
depths. The depth of a comparison network is the maximum depth of an output wire or, equivalently, the maximum
depth of a comparator. The comparison network of Figure 28.2, for example, has depth 3 because comparator E has
depth 3. If each comparator takes one time unit to produce its output value, and if network inputs appear at time 0, a
comparator at depth d produces its outputs at time d; the depth of the network therefore equals the time for the network
to produce values at all of its output wires.
A sorting network is a comparison network for which the output sequence is monotonically increasing (that is, b1 b2
. . . bn) for every input sequence. Of course, not every comparison network is a sorting network, but the network of
Figure 28.2 is. To see why, observe that after time 1, the minimum of the four input values has been produced by
either the top output of comparator A or the top output of comparator B. After time 2, therefore, it must be on the top
output of comparator C. A symmetrical argument shows that after time 2, the maximum of the four input values has
been produced by the bottom output of comparator D. All that remains is for comparator E to ensure that the middle
two values occupy their correct output positions, which happens at time 3.

A comparison network is like a procedure in that it specifies how comparisons are to occur, but it is unlike a procedure
in that its physical size depends on the number of inputs and outputs. Therefore, we shall actually be describing
"families" of comparison networks. For example, the goal of this chapter is to develop a family SORTER of efficient
sorting networks. We specify a given network within a family by the family name and the number of inputs (which
equals the number of outputs). For example, the n-input, n-output sorting network in the family SORTER is named
SORTER[n].

Figure 28.3 A sorting network based on insertion sort

PRAM MODELS
It is a model, which is considered for most of the parallel algorithms. Here, multiple processors are attached
to a single block of memory. A PRAM model contains −

 A set of similar type of processors.


 All the processors share a common memory unit. Processors can communicate among themselves
through the shared memory only.
 A memory access unit (MAU) connects the processors with the single shared memory.
Here, n number of processors can perform independent operations on n number of data in a particular unit of
time. This may result in simultaneous access of same memory location by different processors.

To solve this problem, the following constraints have been enforced on PRAM model −

 Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed to read from or
write to the same memory location at the same time.
 Exclusive Read Concurrent Write (ERCW) − Here no two processors are allowed to read from the
same memory location at the same time, but are allowed to write to the same memory location at the
same time.
 Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from the
same memory location at the same time, but are not allowed to write to the same memory location at
the same time.
 Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or
write to the same memory location at the same time.

There are many methods to implement the PRAM model, but the most prominent ones are −

 Shared memory model


 Message passing model
 Data parallel model

Shared Memory Model


Shared memory emphasizes on control parallelism than on data parallelism. In the shared memory model, multiple
processes execute on different processors independently, but they share a common memory space. Due to any
processor activity, if there is any change in any memory location, it is visible to the rest of the processors.

As multiple processors access the same memory location, it may happen that at any particular point of time, more than
one processor is accessing the same memory location. Suppose one is reading that location and the other is writing on
that location. It may create confusion. To avoid this, some control mechanism, like lock / semaphore, is implemented
to ensure mutual exclusion.
Shared memory programming has been implemented in the following −

 Thread libraries − The thread library allows multiple threads of control that run concurrently in the same
memory location. Thread library provides an interface that supports multithreading through a library of
subroutine. It contains subroutines for
o Creating and destroying threads
o Scheduling execution of thread
o passing data and message between threads
o saving and restoring thread contexts

Examples of thread libraries include − SolarisTM threads for Solaris, POSIX threads as implemented in Linux, Win32
threads available in Windows NT and Windows 2000, and JavaTM threads as part of the standard JavaTM
Development Kit (JDK).

 Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of shared memory on
loosely coupled architecture in order to implement shared memory programming without hardware support.
They implement standard libraries and use the advanced user-level memory management features present in
modern operating systems. Examples include Tread Marks System, Munin, IVY, Shasta, Brazos, and
Cashmere.
 Program Annotation Packages − This is implemented on the architectures having uniform memory access
characteristics. The most notable example of program annotation packages is OpenMP. OpenMP implements
functional parallelism. It mainly focuses on parallelization of loops.

The concept of shared memory provides a low-level control of shared memory system, but it tends to be tedious and
erroneous. It is more applicable for system programming than application programming.

Merits of Shared Memory Programming


 Global address space gives a user-friendly programming approach to memory.
 Due to the closeness of memory to CPU, data sharing among processes is fast and uniform.
 There is no need to specify distinctly the communication of data among processes.
 Process-communication overhead is negligible.
 It is very easy to learn.

Demerits of Shared Memory Programming


 It is not portable.
 Managing data locality is very difficult.
Message Passing Model
Message passing is the most commonly used parallel programming approach in distributed memory systems. Here, the
programmer has to determine the parallelism. In this model, all the processors have their own local memory unit and
they exchange data through a communication network.

Processors use message-passing libraries for communication among themselves. Along with the data being sent, the
message contains the following components −

 The address of the processor from which the message is being sent;
 Starting address of the memory location of the data in the sending processor;
 Data type of the sending data;
 Data size of the sending data;
 The address of the processor to which the message is being sent;
 Starting address of the memory location for the data in the receiving processor.

Processors can communicate with each other by any of the following methods −

 Point-to-Point Communication
 Collective Communication
 Message Passing Interface

Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here, a message can be sent from the sending
processor to a receiving processor by any of the following transfer modes −

 Synchronous mode − The next message is sent only after the receiving a confirmation that its previous
message has been delivered, to maintain the sequence of the message.
 Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous
message is not required.

Collective Communication
Collective communication involves more than two processors for message passing. Following modes allow collective
communications −

 Barrier − Barrier mode is possible if all the processors included in the communications run a particular bock
(known as barrier block) for message passing.
 Broadcast − Broadcasting is of two types −
o One-to-all − Here, one processor with a single operation sends same message to all other processors.
o All-to-all − Here, all processors send message to all other processors.
Messages broadcasted may be of three types −

 Personalized − Unique messages are sent to all other destination processors.


 Non-personalized − All the destination processors receive the same message.
 Reduction − In reduction broadcasting, one processor of the group collects all the messages from all other
processors in the group and combine them to a single message which all other processors in the group can
access.

Merits of Message Passing


 Provides low-level control of parallelism;
 It is portable;
 Less error prone;
 Less overhead in parallel synchronization and data distribution.

Demerits of Message Passing


 As compared to parallel shared-memory code, message-passing code generally needs more software overhead.

Message Passing Libraries


There are many message-passing libraries. Here, we will discuss two of the most-used message-passing libraries −

 Message Passing Interface (MPI)


 Parallel Virtual Machine (PVM)

Message Passing Interface (MPI)


It is a universal standard to provide communication among all the concurrent processes in a distributed memory
system. Most of the commonly used parallel computing platforms provide at least one implementation of message
passing interface. It has been implemented as the collection of predefined functions called library and can be called
from languages such as C, C++, Fortran, etc. MPIs are both fast and portable as compared to the other message passing
libraries.

Merits of Message Passing Interface

 Runs only on shared memory architectures or distributed memory architectures;


 Each processors has its own local variables;
 As compared to large shared memory computers, distributed memory computers are less expensive.

Demerits of Message Passing Interface

 More programming changes are required for parallel algorithm;


 Sometimes difficult to debug; and
 Does not perform well in the communication network between the nodes.

Parallel Virtual Machine (PVM)


PVM is a portable message passing system, designed to connect separate heterogeneous host machines to form a single
virtual machine. It is a single manageable parallel computing resource. Large computational problems like
superconductivity studies, molecular dynamics simulations, and matrix algorithms can be solved more cost effectively
by using the memory and the aggregate power of many computers. It manages all message routing, data conversion,
task scheduling in the network of incompatible computer architectures.

Features of PVM
 Very easy to install and configure;
 Multiple users can use PVM at the same time;
 One user can execute multiple applications;
 It’s a small package;
 Supports C, C++, Fortran;
 For a given run of a PVM program, users can select the group of machines;
 It is a message-passing model,
 Process-based computation;
 Supports heterogeneous architecture.

Data Parallel Programming


The major focus of data parallel programming model is on performing operations on a data set simultaneously. The
data set is organized into some structure like an array, hypercube, etc. Processors perform operations collectively on
the same data structure. Each task is performed on a different partition of the same data structure.

It is restrictive, as not all the algorithms can be specified in terms of data parallelism. This is the reason why data
parallelism is not universal.

Data parallel languages help to specify the data decomposition and mapping to the processors. It also includes data
distribution statements that allow the programmer to have control on data – for example, which data will go on which
processor – to reduce the amount of communication within the processors.

VLSI Complexity Model

Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching
networks.

Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of
storage (memory) space available in that chip.

We can calculate the space complexity of an algorithm by the chip area (A) of the VLSI chip implementation
of that algorithm. If T is the time (latency) needed to execute the algorithm, then A.T gives an upper bound
on the total number of bits processed through the chip (or I/O). For certain computing, there exists a lower
bound, f(s), such that

A.T2 >= O (f(s))

Where A=chip area and T=time

PARALLELISM APPROACHES

(1) DATA PARALLELISM:


Sequential vs. data-parallel job execution

Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses
on distributing the data across different nodes, which operate on the data in parallel. It can be applied on
regular data structures like arrays and matrices by working on each element in parallel. It contrasts to task
parallelism as another form of parallelism.

A data parallel job on an array of n elements can be divided equally among all the processors. Let us assume
we want to sum all the elements of the given array and the time for a single addition operation is Ta time
units. In the case of sequential execution, the time taken by the process will be n×Ta time units as it sums up
all the elements of an array. On the other hand, if we execute this job as a data parallel job on 4 processors
the time taken would reduce to (n/4)×Ta + merging overhead time units. Parallel execution results in a
speedup of 4 over sequential execution. One important thing to note is that the locality of data references
plays an important part in evaluating the performance of a data parallel programming model. Locality of data
depends on the memory accesses performed by the program as well as the size of the cache.

(2) CONTROL PARALLELISM:

Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization
of computer code across multiple processors in parallel computing environments. Task parallelism focuses
on distributing tasks—concurrently performed by processes or threads—across different processors. In
contrast to data parallelism which involves running the same task on different components of data, task
parallelism is distinguished by running many different tasks at the same time on the same data.[1] A common
type of task parallelism is pipelining which consists of moving a single set of data through a series of
separate tasks where each task can execute independently of the others.

Description

In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or
process) on the same or different data. The threads may execute the same or different code. In the general
case, different execution threads communicate with one another as they work, but this is not a requirement.
Communication usually takes place by passing data from one thread to the next as part of a workflow.
As a simple example, if a system is running code on a 2-processor system (CPUs "a" & "b") in a parallel
environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task "A" and CPU "b"
to do task "B" simultaneously, thereby reducing the run time of the execution. The tasks can be assigned
using conditional statements as described below.

Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as opposed
to the data (data parallelism). Most real programs fall somewhere on a continuum between task parallelism
and data parallelism.

Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at
once. This type of parallelism is found largely in applications written for commercial servers such as
databases. By running many threads at once, these applications are able to tolerate the high amounts of I/O
and memory system latency their workloads can incur - while one thread is delayed waiting for a memory or
disk access, other threads can do useful work.

The exploitation of thread-level parallelism has also begun to make inroads into the desktop market with the
advent of multi-core microprocessors. This has occurred because, for various reasons, it has become
increasingly impractical to increase either the clock speed or instructions per clock of a single core. If this
trend continues, new applications will have to be designed to utilize multiple threads in order to benefit from
the increase in potential computing power. This contrasts with previous microprocessor innovations in which
existing code was automatically sped up by running it on a newer/faster computer.

CONDITIONS OF PARALLELISM

The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to the program statement (instructions), and
directed edges with different labels are used to represent the ordered relations among the
statements. The analysis of dependence graphs shows where opportunity exists for
parallelization and vectorization.
(a) Data dependence: The ordering relationship between statements is indicated by the
data dependence. Five type of data dependence are defined below:
1. Flow dependence: A statement S2 is flow dependent on S1 if an execution path exists
from s1 to S2 and if at least one output (variables assigned) of S1feeds in as input

(operands to be used) to S2 also called RAW hazard and denoted as


2. Antidependence: Statement S2 is antidependent on the statement S1 if S2 follows S1 in
the program order and if the output of S2 overlaps the input to S1 also called RAW
hazard and denoted as
3. Output dependence : two statements are output dependent if they produce (write) the
same output variable. Also called WAW hazard and denoted as
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not
because the same variable is involved but because the same file referenced by both I/O
statement.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in the following situations:
 The subscript of a variable is itself subscribed( indirect addressing)
 The subscript does not contain the loop index variable.
 A variable appears more than once with subscripts having different coefficients
of the loop variable.
 The subscript is non linear in the loop index variable.
Parallel execution of program segments which do not have total data independence can
produce non-deterministic results.
Consider the following fragment of any program:
S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
• here the Forward dependency S1to S2, S3 to S4, S2 to S2
• Anti-dependency from S2to S3
• Output dependency S1 toS3

Figure 2.1 Dependence graph

(b) Control Dependence: This refers to the situation where the order of the execution of
statements cannot be determined before run time. For example all condition statement,
where the flow of statement depends on the output. Different paths taken after a
conditional branch may depend on the data hence we need to eliminate this data
dependence among the instructions. This dependence also exists between operations
performed in successive iterations of looping procedure. Control dependence often
prohibits parallelism from being exploited.
Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];

if (a[i] < 0) a[i] = 1;


}
Control-dependent example:
for (i=1;i<n;i++) {

if (a[i-1] < 0) a[i] = 1;


}
Control dependence also avoids parallelism to being exploited. Compilers are used to
eliminate this control dependence and exploit the parallelism.
(c) Resource dependence:

Data and control dependencies are based on the independence of the work to be done.
Resource independence is concerned with conflicts in using shared resources, such as
registers, integer and floating point ALUs, etc. ALU conflicts are called ALU
dependence. Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions - 1
Bernstein’s conditions are a set of conditions which must exist if two processes can
execute in parallel.
Notation
Ii is the set of all input variables for a process Pi . Ii is also called the read set or domain
of Pi. Oi is the set of all output variables for a process Pi .Oi is also called write set
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:

Bernstein’s Conditions - 2
In terms of data dependencies, Bernstein’s conditions imply that two processes can execute
in parallel if they are flow-independent, antiindependent, and output- independent. The
parallelism relation || is commutative (Pi || Pj implies Pj || Pi ), but not transitive (Pi || Pj and
Pj || Pk does not imply Pi || Pk ) . Therefore, || is not an equivalence relation. Intersection of
the input sets is allowed.

HARDWARE PARALLELISM
Hardware parallelism is defined by machine architecture and hardware multiplicity i.e.,
functional parallelism times the processor parallelism. It can be characterized by the number
of instructions that can be issued per machine cycle. If a processor issues k instructions per
machine cycle, it is called a k-issue processor. Conventional processors are one-issue
machines. This provide the user the information about peak attainable performance.
Examples. Intel i960CA is a three-issue processor (arithmetic, memory access, branch). IBM
RS -6000 is a four-issue processor (arithmetic, floating-point, memory access, branch).A
machine with n k-issue processors should be able to handle a maximum of nk threads
simultaneously.

SOFTWARE PARALLELISM
Software parallelism is defined by the control and data dependence of programs, and is
revealed in the program’s flow graph i.e., it is defined by dependencies with in the code and
is a function of algorithm, programming style, and compiler optimization.

LAWS GOVERNING PERFORMANCE MEASUREMENTS

(a) Amdahl’s Law:

It is named after computer scientist Gene Amdahl( a computer architect from IBM and Amdahl corporation),
and was presented at the AFIPS Spring Joint Computer Conference in 1967. It is also known as Amdahl’s
argument. It is a formula which gives the theoretical speedup in latency of the execution of a task at a fixed
workload that can be expected of a system whose resources are improved. In other words, it is a formula
used to find the maximum improvement possible by just improving a particular part of a system. It is often
used in parallel computing to predict the theoretical speedup when using multiple processors.

Speedup-
Speedup is defined as the ratio of performance for the entire task using the enhancement and performance for
the entire task without using the enhancement or speedup can be defined as the ratio of execution time for
the entire task without using the enhancement and execution time for the entire task using the enhancement.
If Pe is the performance for entire task using the enhancement when possible, Pw is the performance for
entire task without using the enhancement, Ew is the execution time for entire task without using the
enhancement and Ee is the execution time for entire task using the enhancement when possible then,

Speedup = Pe/Pw
or
Speedup = Ew/Ee

Amdahl’s law uses two factors to find speedup from some enhancement –
Fraction enhanced – The fraction of the computation time in the original computer that can be converted
to take advantage of the enhancement. For example- if 10 seconds of the execution time of a program that
takes 40 seconds in total can use an enhancement , the fraction is 10/40. This obtained value is Fraction
Enhanced.
Fraction enhanced is always less than 1.
Speedup enhanced – The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program. For example – If the
enhanced mode takes, say 3 seconds for a portion of the program, while it is 6 seconds in the original mode,
the improvement is 6/3. This value is Speedup enhanced.
Speedup Enhanced is always greater than 1.

The overall Speedup is the ratio of the execution time:-

(b) Gustafson’s Law:

In computer architecture, Gustafson's law gives the theoretical speedup in latency of the execution of a task at fixed
execution time that can be expected of a system whose resources are improved. It is named after computer scientist
John L. Gustafson and his colleague Edwin H. Barsis, and was presented in the article Reevaluating Amdahl's Law in
1988.

Definition
Gustafson estimated the speedup S gained by using N processors (instead of just one) for a task with a serial fraction s
(which does not benefit from parallelism) as follows:

S=N+(1-N)s

Using different variables, Gustafson's law can be formulated the following way:
S latency(s)=1- p + sp

where

 Slatency is the theoretical speedup in latency of the execution of the whole task;
 s is the speedup in latency of the execution of the part of the task that benefits from the improvement of the
resources of the system;
 p is the percentage of the execution workload of the whole task concerning the part that benefits from the
improvement of the resources of the system before the improvement.

Gustafson's law addresses the shortcomings of Amdahl's law, which is based on the assumption of a fixed problem
size, that is of an execution workload that does not change with respect to the improvement of the resources.
Gustafson's law instead proposes that programmers tend to set the size of problems to fully exploit the computing
power that becomes available as the resources improve. Therefore, if faster equipment is available, larger problems
can be solved within the same time.

The impact of Gustafson's law was to shift research goals to select or reformulate problems so that solving a larger
problem in the same amount of time would be possible. In a way the law redefines efficiency, due to the possibility
that limitations imposed by the sequential part of a program may be countered by increasing the total amount of
computation.

METRICS
(a) Speedup:
The speedup is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem
to the time taken by the parallel algorithm to solve the same problem on P processors.
S=Ts/Tp

(b) Efficiency: The efficiency is defined as the ratio of speedup to the number of processors. Efficiency
measures the fraction of time for which a processor is usefully utilized.
E=S/p= Ts/pTp

( c) Utilization: The system utilization in a parallel computation is defined as:


U(n) = O(n)/nT(n)
Where O(n) is the total no. of operations performed by an n-processor system, T(n) is the execution time in
unit time steps and n is the no. of processors.

(c) Communication Overhead: Communication overhead can dramatically affect the performance of parallel
computations. Given the long latencies associated with accessing data stored in remote memories, computations
that repeatedly access remote data can easily spend most of their time communicating rather than performing
useful computation.

(d) Single/Multiple Program performances: There are many ways to measure the performance of a parallel
algorithm running on a parallel processor. The most commonly used measurements are the elapsed time,
price/performance, the speed-up, and the efficiency. The elapsed time to run a particular job on a given machine is
the most important metric.
UNIT-3

SHARED MEMORY MULTIPROCESSORS

Three most common shared memory multiprocessors models are −

Uniform Memory Access (UMA)

In this model, all the processors share the physical memory uniformly. All the processors have equal access
time to all the memory words. Each processor may have a private cache memory. Same rule is followed for
peripheral devices.

When all the processors have equal access to all the peripheral devices, the system is called a symmetric
multiprocessor. When only one or a few processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.

Non-uniform Memory Access (NUMA)

In NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the
shared memory is physically distributed among all the processors, called local memories. The collection of
all local memories forms a global address space which can be accessed by all the processors.
Cache Only Memory Architecture (COMA)

The COMA model is a special case of the NUMA model. Here, all the distributed main memories are
converted to cache memories.
DISTRIBUTED MEMORY MULTICOMPUTERS
A distributed memory multicomputer system consists of multiple computers, known as nodes, inter-
connected by message passing network. Each node acts as an autonomous computer having a processor, a
local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only
to the local processors. This is why, the traditional machines are called no-remote-memory-access
(NORMA) machines.

STATIC INTERCONNECTIONS
In static network the interconnection network is fixed and permanent interconnection path
between two processing elements and data communication has to follow a fixed route to reach the
destination processing element. Thus it Consist of a number of point- to-point links. Topologies
in the static networks can be classified according to the dimension required for layout i.e., it can
be 1-D, 2-D, 3-D or hypercube.
One dimensional topologies include Linear array as shown in figure 2.2 (a) used in some pipeline
architecture.
Various 2-D topologies are:
 The ring (figure 2.2(b))
 Star (figure 2.2(c))
 Tree (figure 2.2(d))
 Mesh (figure 2.2(e))
 Systolic Array (figure 2.2(f))
3-D topologies include:
 Completely connected chordal ring (figure 2.2(g))
 Chordal ring (figure 2.2(h))
 3 cube (figure 2.2(i))
Figure 2.2 Static interconnection network topologies.

Torus architecture is also one of popular network topology, it is extension of the mesh by having
wraparound connections Figure below is a 2D Torus This architecture of torus is a symmetric
topology unlike mesh which is not. The wraparound connections reduce the torus diameter and at
the same time restore the symmetry. It can be
1-D torus
2-D torus
3-D torus
The torus
topology is used in Cray T3E
Figure 2.3 Torus technology
We can have further higher dimension circuits for example 3-cube connected cycle. A D-
dimension W-wide hypercube contains W nodes in each dimension and there is a connection to a
node in each dimension. The mesh and the cube architecture are actually 2-D and 3-D hypercube
respectively. The below figure we have hypercube with dimension 4.

Figure 2.4 4-D hypercube.

DYNAMIC INTERCONNECTIONS
The dynamic networks are those networks where the route through which data move from one
PE to another is established at the time communication has to be performed. Usually all
processing elements are equidistant and an interconnection path is established when two
processing element want to communicate by use of switches. Such systems are more difficult to
expand as compared to static network. Examples: Bus-based, Crossbar, Multistage Networks.
Here the Routing is done by comparing the bit-level representation
of source and destination addresses. If there is a match goes to next stage via pass- through else in
case of it mismatch goes via cross-over using the switch.
There are two classes of dynamic networks namely
 single stage network
 multi stage
Single Stage Networks
A single stage switching network with N input selectors (IS) and N output selectors (OS). Here at
each network stage there is a 1- to-D demultiplexer corresponding to each IS such that 1<D<N
and each OS is an M-to-1 multiplexer such that 1<M<=N. Cross bar network is a single stage
network with D=M=N. In order to establish a desired connecting path different path control
signals will be applied to all IS and OS selectors. The single stage network is also called as
recirculating network as in this network connection the single data items may have to recirculate
several time through the single stage before reaching their final destinations. The number of
recirculation depends on the connectivity in the single stage network. In general higher the
hardware connectivity the lesser is the number of recirculation. In cross bar network only one
circulation is needed to establish the connection path. The cost of completed connected cross bar
network is O(N2) which is very high as compared to other most recirculating networks which
have cost O(N log N) or lower hence are more cost effective for large value of N.
Multistage Networks
Many stages of interconnected switches form a multistage SIMD network. It is basicaaly consist
of three characteristic features
 The switch box,
 The network topology
 The control structure
Many stages of interconnected switches form a multistage SIMD networks. Eachbox is
essentially an interchange device with two inputs and two outputs. The four possible states of a
switch box are which are shown in figure 3.6
 Straight
 Exchange
 Upper Broadcast
 Lower broadcast.
A two function switch can assume only two possible state namely state or exchange states.
However a four function switch box can be any of four possible states. A multistage network is
capable of connecting any input terminal to any output terminal. Multi-stage networks are
basically constructed by so called shuffle-exchange switching element, which is basically a 2 x 2
crossbar. Multiple layers of these elements are connected and form the network.

Figure 2.5 A two-by-two switching box and its four interconnection states

A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output


terminal. Generally it is consist of n stages where N = 2n is the number of input and output lines.
And each stage use N/2 switch boxes. The interconnection patterns from one stage to another
stage is determined by network topology. Each stage is connected to the next stage by at least N
paths. The total wait time is proportional to the number stages i.e., n and the total cost depends on
the total number of switches used and that is Nlog2N. The control structure can be individual
stage control i.e., the same control signal is used to set all switch boxes in the same stages thus we
need n control signal. The second control structure is individual box control where a separate
control signal is used to set the state of each switch box. This provide flexibility at the same time
require n2/2 control signal which increases the complexity of the control circuit. In between path
is use of partial stage control.
Examples of Multistage Networks

Banyan
Baseline
Cube
Delta
Flip
Indirect cube
Omega
Multistage network can be of two types
 One side networks : also called full switch having input output port on the same side
 Two sided multistage network : which have an input side and an output side. It can be
further divided into three class
o Blocking: In Blocking networks, simultaneous connections of more than one
terminal pair may result conflicts in the use of network communication links.
Examples of blocking network are the Data Manipulator, Flip, N cube, omega,
baseline. All multistage networks that are based on shuffle-exchange elements,
are based on the concept of blocking network because not all possible here to
make the input-output connections at the same time as one path might block
another. The figure
(a) show an omega network.
o Rearrangeable : In rearrangeable network, a network can perform all possible
connections between inputs and outputs by rearranging its existing connections
so that a connection path for a new input-output pair can always be established.
An example of this network topology is Benes Network ( see figure 2.6 (b)
showing a 8** Benes network)which support synchronous data permutation and
a synchronous interprocessor communication.
o Non blocking : A non –blocking network is the network which can handle all
possible connections without blocking. There two possible cases first one is the
Clos network ( see figure 2.6(c)) where a one to one connection
is made between input and output. Another case of one to many connections can
be obtained by using crossbars instead of the shuffle- exchange elements. The
cross bar switch network can connect every input port to a free output port
without blocking.

Figure 2.6 Several Multistage Interconnection Networks Mesh-


Connected Illiac Networks

A single stage recirculating network has been implemented in the ILLiac –IV array with N= 64
PEs. Here in mesh network nodes are arranged as a q-dimensional lattice. The
neighboring nodes are only allowed to communicate the data in one step i.e., each PEi is allowed
to send the data to any one of PE(i+1) , PE (i-1), Pe(i+r) and PE(i-r) where r= square root N( in
case of Iliac r=8). In a periodic mesh, nodes on the edge of the mesh have wrap-around
connections to nodes on the other side this is also called a toroidal mesh.

SHARED MEMORY PROGRAMMING


In computer science, shared memory is memory that may be simultaneously accessed by multiple programs with an
intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of
passing data between programs. Depending on context, programs may run on a single processor or on multiple
separate processors.

Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared
memory.

In computer software, shared memory is either

 a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at
the same time. One process will create an area in RAM which other processes can access;
 a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of
data to a single instance instead, by using virtual memory mappings or with explicit support of the program in
question. This is most often used for shared libraries and for Execute in place (XIP).

Since both processes can access the shared memory area like regular working memory, this is a very fast way of
communication (as opposed to other mechanisms of IPC such as named pipes, Unix domain sockets or CORBA). On
the other hand, it is less scalable, as for example the communicating processes must be running on the same machine
(of other IPC methods, only Internet domain sockets—not Unix domain sockets—can use a computer network), and
care must be taken to avoid issues if processes sharing memory are running on separate CPUs and the underlying
architecture is not cache coherent.

IPC by shared memory is used for example to transfer images between the application and the X server on Unix
systems, or inside the IStream object returned by CoMarshalInterThreadInterfaceInStream in the COM libraries under
Windows.

Dynamic libraries are generally held in memory once and mapped to multiple processes, and only pages that had to be
customized for the individual process (because a symbol resolved differently there) are duplicated, usually with a
mechanism known as copy-on-write that transparently copies the page when a write is attempted, and then lets the
write succeed on the private copy.
DISTRIBUTED MEMORY PROGRAMMING

It refers to a multiprocessor computer system in which each processor has its own private memory.
Computational tasks can only operate on local data, and if remote data is required, the computational task
must communicate with one or more remote processors. In contrast, a shared memory multiprocessor offers
a single memory space used by all processors. Processors do not have to be aware where data resides, except
that there may be performance penalties, and that race conditions are to be avoided.

In a distributed memory system there is typically a processor, a memory, and some form of interconnection
that allows programs on each processor to interact with each other. The interconnect can be organised with
point to point links or separate hardware can provide a switching network. The network topology is a key
factor in determining how the multiprocessor machine scales. The links between nodes can be implemented
using some standard network protocol (for example Ethernet), using bespoke network links (used in for
example the Transputer), or using dual-ported memories.

The key issue in programming distributed memory systems is how to distribute the data over the
memories. Depending on the problem solved, the data can be distributed statically, or it can be moved
through the nodes. Data can be moved on demand, or data can be pushed to the new nodes in advance.

As an example, if a problem can be described as a pipeline where data x is processed subsequently through
functions f, g, h, etc. (the result is h(g(f(x)))), then this can be expressed as a distributed memory problem
where the data is transmitted first to the node that performs f that passes the result onto the second node that
computes g, and finally to the third node that computes h. This is also known as systolic computation.

Data can be kept statically in nodes if most computations happen locally, and only changes on edges have to
be reported to other nodes. An example of this is simulation where data is modeled using a grid, and each
node simulates a small part of the larger grid. On every iteration, nodes inform all neighboring nodes of the
new edge data.

Distributed shared memory


Similarly, in distributed shared memory each node of a cluster has access to a large shared memory in
addition to each node's limited non-shared private memory.
OBJECT ORIENTED PROGRAMMING
Very active research has been conducted in the last decade in the area of parallel OOP. A significant number
of parallel OOP languages have been designed, implemented, and tested in diverse parallel applications.
Despite of the progress in the area of parallel OOP, many difficulties and open issues still exist. Existing
parallel OOP languages are often compromised in important areas, such as inheritance capability, ease of
use, efficiency, heterogeneity, automatic memory management, and debugging. Inheritance capability. Many
of the proposed languages fail to provide inheritance for objects which can be distributed in a network and
which can accept and handle messages concurrently. Even languages that permit some amalgamation of
parallelism with inheritance tend to support only single-class inheritance. Most languages are weak in
providing inheritance for the synchronization code of parallel objects. Efficiency and ease of use. The largest
group of experimental languages for parallel OOP consists of C++ extensions. Such extensions can be very
large and complex, and therefore, not easy to learn, use, or implement efficiently. An alternative group
includes interpretation-oriented languages (i.e. Smalltalk, Self, actor-based languages, Java) which do not
provide high run-time efficiency and lack the reliable static-type checking. Heterogeneity. Nowadays
computing environments are becoming more and more heterogeneous. Users typically have access to a
variety of platforms (such as workstations, PCs, specialized high-performance computers) which are
networked locally or are geographically distributed. Most parallel OOP languages, however, are targeted at
specific high-performance platforms, or implemented for homogeneous networks. There are no compilation-
oriented languages that can convert heterogeneous networks into a single high-performance object-oriented
computing environment in which the peculiarities of the specific architectures are transparent for the user.
Researchers have proposed diverse approaches to the problematic points of parallel OOP. For example, the
following alternatives have been advocated: Enhancing popular serial OOP languages with parallelism
versus designing entirely new parallel OOP languages, explicit parallelism versus parallelizing compilers,
parallel OOP languages versus parallel OOP libraries, and so on. The numerous proposals to integrate
objects and parallelism typically follow two distinct scenarios: a) design a new parallel OOP language with
built-in parallelism, or b) use an existing OOP language and enhance it with parallelism mechanisms. Most
recent proposals follow the second approach. Its proponents take the object-oriented paradigm as given (on
the basis of its contribution to the production of quality software) and investigate how to enhance it so that it
covers parallel applications as well. The transition from a sequential language to its parallel enhancement
would be easier and less frustrating than the transition to a new language designed from scratch. The
problem is not so much that of learning a new language as it is of rewriting a 100,000-line program. For
entirely practical reasons like the above one, the parallel extension of an existing language may have better
utility than anew language designed from scratch. Some researchers assume that the programmer is
interested in and capable of specifying parallelism in explicit form. Because OOP means programming by
modelling, and because real-world objects may exist and do things concurrently, OOP languages may have
to provide explicit support to modelling parallelism. Other researchers adhere to the idea that parallelism
should be transparent to the programmer, and that a parallelizing compiler should take the burden of finding
and exploiting potential parallelism. It seems possible to combine both approaches in certain beneficial
proportions. A sequential language can be enhanced with explicit parallelism by means of a special library of
low-level parallelism primitives, such as fork and join for example. Alternatively, the language can be
extended with additional linguistic constructs that implement higher-level parallelism abstractions, such as
monitors. The main motivation for the library approach is that the underlying sequential language does not
need to be changed. External library primitives can provide flexible but uncontrolled access to data and
hardware resources; unfortunately, such access can result in unreliable programs. Some authors overcome
this difficulty through encapsulating the library services in special classes. Parallel class libraries are
extended by end-users and adapted to their specific needs. Parallel OOP has proven to be, or is expected to
be, a beneficial implementation approach in a number of application areas, such as scientific computing,
management information systems, telecommunications, and banking. New developments in the domain of
parallel OOP occur in close interaction with other computer science fields, such as database management
and artificial intelligence. More research is needed to further improve the design and implementation of
parallel OOP platforms, and to increase their applicability.

DATA PARALLEL PROGRAMMING


Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses on
distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data
structures like arrays and matrices by working on each element in parallel. It contrasts to task parallelism as another
form of parallelism.

A data parallel job on an array of n elements can be divided equally among all the processors. Let us assume we want
to sum all the elements of the given array and the time for a single addition operation is Ta time units. In the case of
sequential execution, the time taken by the process will be n×Ta time units as it sums up all the elements of an array.
On the other hand, if we execute this job as a data parallel job on 4 processors the time taken would reduce to (n/4)×Ta
+ merging overhead time units. Parallel execution results in a speedup of 4 over sequential execution. One important
thing to note is that the locality of data references plays an important part in evaluating the performance of a data
parallel programming model. Locality of data depends on the memory accesses performed by the program as well as
the size of the cache.
Below is the sequential pseudo-code for multiplication and addition of two matrices where the result is stored in the
matrix C. The pseudo-code for multiplication calculates the dot product of two matrices A, B and stores the result into
the output matrix C.

If the following programs were executed sequentially, the time taken to calculate the result would be of the

(assuming row lengths and column lengths of both matrices are n) and for multiplication and addition
respectively.

// Matrix multiplication
for (i=0; i<row_length_A; i++)
{
for (k=0; k<column_length_B; k++)
{
sum = 0;
for (j=0; j<column_length_A; j++)
{
sum += A[i][j]*B[j][k];
}
C[i][k] = sum;
}
}
// Array addition
for (i=0; i<n; i++) {
c[i] = a[i]+b[i];
}
We can exploit data parallelism in the preceding codes to execute it faster as the arithmetic is loop independent.
Parallelization of the matrix multiplication code is achieved by using OpenMP. An OpenMP directive, "omp parallel
for" instructs the compiler to execute the code in the for loop in parallel. For multiplication, we can divide matrix A
and B into blocks along rows and columns respectively. This allows us to calculate every element in matrix C

individually thereby making the task parallel. For example: A[m x n] dot B [n x k] can be finished in instead of

when executed in parallel using m*k processors.

Data parallelism in matrix multiplication

// Matrix multiplication in parallel


#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (i=0; i<row_length_A; i++){
for (k=0; k<column_length_B; k++){
sum = 0;
for (j=0; j<column_length_A; j++){
sum += A[i][j]*B[j][k];
}
C[i][k] = sum;
}
}
It can be observed from the example that a lot of processors will be required as the matrix sizes keep on increasing.
Keeping the execution time low is the priority but as the matrix size increases, we are faced with other constraints like
complexity of such a system and its associated costs. Therefore, constraining the number of processors in the system,
we can still apply the same principle and divide the data into bigger chunks to calculate the product of two matrices. [4]

For addition of arrays in a data parallel implementation, let’s assume a more modest system with two central
processing units (CPU) A and B, CPU A could add all elements from the top half of the arrays, while CPU B could
add all elements from the bottom half of the arrays. Since the two processors work in parallel, the job of performing
array addition would take one half the time of performing the same operation in serial using one CPU alone.

The program expressed in pseudocode below—which applies some arbitrary operation, foo, on every element in the
array d—illustrates data parallelism:[nb 1]
if CPU = "a" then
lower_limit := 1
upper_limit := round(d.length / 2)
else if CPU = "b" then
lower_limit := round(d.length / 2) + 1
upper_limit := d.length

for i from lower_limit to upper_limit by 1 do


foo(d[i])
In an SPMD system executed on 2 processor system, both CPUs will execute the code.

Data parallelism emphasizes the distributed (parallel) nature of the data, as opposed to the processing (task
parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

FUNCTIONAL PROGRAMMING

Functional programming is a programming paradigm in which we try to bind everything in pure


mathematical functions style. It is a declarative type of programming style. Its main focus is on “what to
solve” in contrast to an imperative style where the main focus is “how to solve”. It uses expressions instead
of statements. An expression is evaluated to produce a value whereas a statement is executed to assign
variables. Those functions have some special features discussed below.

Functional Programming is based on Lambda Calculus:


Lambda calculus is framework developed by Alonzo Church to study computations with functions. It can be called as
the smallest programming language of the world. It gives the definition of what is computable. Anything that can be
computed by lambda calculus is computable. It is equivalent to Turing machine in its ability to compute. It provides a
theoretical framework for describing functions and their evaluation. It forms the basis of almost all current functional
programming languages.

Programming Languages that support functional programming: Haskell, JavaScript, Scala, Erlang, Lisp, ML,
Clojure, OCaml, Common Lisp, Racket.

Concepts of functional programming:

 Pure functions
 Recursion
 Referential transparency
 Functions are First-Class and can be Higher-Order
 Variables are Immutable

Pure functions: These functions have two main properties. First, they always produce the same output for same
arguments irrespective of anything else.
Secondly, they have no side-effects i.e. they do modify any argument or global variables or output something.
Later property is called immutability. The pure functions only result is the value it returns. They are deterministic.
Programs done using functional programming are easy to debug because pure functions have no side effect or hidden
I/O. Pure functions also make it easier to write parallel/concurrent applications. When the code is written in this style,
a smart compiler can do many things – it can parallelize the instructions, wait to evaluate results when need them, and
memorize the results since the results never change as long as the input doesn’t change.
example of the pure function:

sum(x, y) // sum is function taking x and y as arguments


return x + y // sum is returning sum of x and y without changing them

Recursion: There are no “for” or “while” loop in functional languages. Iteration in functional languages is
implemented through recursion. Recursive functions repeatedly call themselves, until it reaches the base case.
example of the recursive function:

fib(n)
if (n <= 1)
return 1;
else
return fib(n - 1) + fib(n - 2);

Referential transparency: In functional programs variables once defined do not change their value throughout the
program. Functional programs do not have assignment statements. If we have to store some value, we define new
variables instead. This eliminates any chances of side effects because any variable can be replaced with its actual value
at any point of execution. State of any variable is constant at any instant.
example:

x = x + 1 // this changes the value assigned to the variable x.


// So the expression is not referentially transparent.

Functions are First-Class and can be Higher-Order: First-class functions are treated as first-class variable. The first
class variables can be passed to functions as parameter, can be returned from functions or stored in data structures.
Higher order functions are the functions that take other functions as arguments and they can also return functions.
example:

show_output(f) // function show_output is declared taking argument f


// which are another function
f(); // calling passed function

print_gfg() // declaring another function


print("hello gfg");

show_output(print_gfg) // passing function in another function

Variables are Immutable: In functional programming, we can’t modify a variable after it’s been initialized. We can
create new variables – but we can’t modify existing variables, and this really helps to maintain state throughout the
runtime of a program. Once we create a variable and set its value, we can have full confidence knowing that the value
of that variable will never change.

Advantages and Disadvantages of Functional programming

Advantages:

1. Pure functions are easier to understand because they don’t change any states and depend only on the input
given to them. Whatever output they produce is the return value they give. Their function signature gives all
the information about them i.e. their return type and their arguments.
2. The ability of functional programming languages to treat functions as values and pass them to functions as
parameters make the code more readable and easily understandable.
3. Testing and debugging is easier. Since pure functions take only arguments and produce output, they don’t
produce any changes don’t take input or produce some hidden output. They use immutable values, so it
becomes easier to check some problems in programs written uses pure functions.
4. It is used to implement concurrency/parallelism because pure functions don’t change variables or any other
data outside of it.
5. It adopts lazy evaluation which avoids repeated evaluation because the value is evaluated and stored only
when it is needed.

Disadvantages:

1. Sometimes writing pure functions can reduce the readability of code.


2. Writing programs in recursive style instead of using loops can be bit intimidating.
3. Writing pure functions are easy but combining them with rest of application and I/O operations is the difficult
task.
4. Immutable values and recursion can lead to decrease in performance.

Applications:

 It is used in mathematical computations.


 It is needed where concurrency or parallelism is required.

DATA FLOW PROGRAMMING


In computer programming, dataflow programming is a programming paradigm that models a program as a
directed graph of the data flowing between operations, thus implementing dataflow principles and
architecture. Dataflow programming languages share some features of functional languages, and were
generally developed in order to bring some functional concepts to a language more suitable for numeric
processing. Some authors use the term datastream instead of dataflow to avoid confusion with dataflow
computing or dataflow architecture, based on an indeterministic machine paradigm. Dataflow programming
was pioneered by Jack Dennis and his graduate students at MIT in the 1960s.
Properties of dataflow programming languages:

Traditionally, a program is modelled as a series of operations happening in a specific order; this may be
referred to as sequential, procedural, control flow (indicating that the program chooses a specific path), or
imperative programming. The program focuses on commands, in line with the von Neumann vision of
sequential programming, where data is normally "at rest".

In contrast, dataflow programming emphasizes the movement of data and models programs as a series of
connections. Explicitly defined inputs and outputs connect operations, which function like black boxes. An
operation runs as soon as all of its inputs become valid. Thus, dataflow languages are inherently parallel and
can work well in large, decentralized systems.
State:

One of the key concepts in computer programming is the idea of state, essentially a snapshot of various
conditions in the system. Most programming languages require a considerable amount of state information,
which is generally hidden from the programmer. Often, the computer itself has no idea which piece of
information encodes the enduring state. This is a serious problem, as the state information needs to be shared
across multiple processors in parallel processing machines. Most languages force the programmer to add
extra code to indicate which data and parts of the code are important to the state. This code tends to be both
expensive in terms of performance, as well as difficult to read or debug. Explicit parallelism is one of the
main reasons for the poor performance of Enterprise Java Beans when building data-intensive, non-OLTP
application.

Where a sequential program can be imagined as a single worker moving between tasks (operations), a
dataflow program is more like a series of workers on an assembly line, each doing a specific task whenever
materials are available. Since the operations are only concerned with the availability of data inputs, they
have no hidden state to track, and are all "ready" at the same time.

Representation:

Dataflow programs are represented in different ways. A traditional program is usually represented as a series
of text instructions, which is reasonable for describing a serial system which pipes data between small,
single-purpose tools that receive, process, and return. Dataflow programs start with an input, perhaps the
command line parameters, and illustrate how that data is used and modified. The flow of data is explicit,
often visually illustrated as a line or pipe.

In terms of encoding, a dataflow program might be implemented as a hash table, with uniquely identified
inputs as the keys, used to look up pointers to the instructions. When any operation completes, the program
scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs
it. When that operation finishes, it will typically output data, thereby making another operation become
valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of
maintaining state is removed from the programmer and given to the language's runtime. On machines with a
single processor core where an implementation designed for parallel operation would simply introduce
overhead, this overhead can be removed completely by using a different runtime.

Dataflow programming languages include:

 ASCET
 AviSynth scripting language, for video processing
 BMDFM Binary Modular Dataflow Machine
 CAL
 Cuneiform, a functional workflow language.
 CMS Pipelines
 Hume
 Joule (programming language)
 Keysight VEE
 KNIME is a free and open-source data analytics, reporting and integration platform
 LabVIEW, G[3]
 Linda
 Lucid[2]
 Lustre
 Max/MSP
 Microsoft Visual Programming Language - A component of Microsoft Robotics Studio designed for
robotics programming
 Orange - An open-source, visual programming tool for data mining, statistical data analysis, and
machine learning.
 Oz now also distributed since 1.4.0
 Pipeline Pilot
 Prograph
 Pure Data
 Quartz Composer - Designed by Apple; used for graphic animations and effects
 SAC Single assignment C
 SIGNAL (a dataflow-oriented synchronous language enabling multi-clock specifications)
 Simulink
 SISAL
 SystemVerilog - A hardware description language
 Verilog - A hardware description language absorbed into the SystemVerilog standard in 2009
 VHDL - A hardware description language
 VSXu
 XEE (Starlight) XML engineering environment
 XProc

Application programming interfaces

 Apache Beam: Java/Scala SDK that unifies streaming (and batch) processing with several execution
engines supported (Spark, Flink, Google dataflow...)
 Apache Flink: Java/Scala library that allows streaming (and batch) computations to be run atop a
distributed Hadoop (or other) cluster
 SystemC: Library for C++, mainly aimed at hardware design.
 TensorFlow: A machine-learning library based on dataflow programming.
UNIT-4
LOOP PARALLELIZATION AND PIPELINING
PDF ATTACHED

SOFTWARE PIPELINING
Software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software
pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand
written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit
support for software pipelining, notably Intel's IA-64 architecture.

It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from
modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops.
Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism
since such architectures existed. Effective compiler generation of such code dates to the invention of modulo
scheduling by Rau and Glaeser. Lam showed that special hardware is unnecessary for effective modulo scheduling.
Her technique, modulo variable expansion is widely used in practice.

Example

Consider the following loop:

for i = 1 to bignumber
A(i)
B(i)
C(i)
end

In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other
words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could
perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no
dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.

Without software pipelining, the operations execute in the following sequence:

A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...

Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control
flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long
as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9
cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).

Now consider the following sequence of instructions with software pipelining:

A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...
It can be easily verified that an instruction can be dispatched each cycle, which means that the same 3 iterations can be
executed in a total of 9 cycles, giving an average of 3 cycles per iteration.

Implementation

Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far
better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for
the moment that bignumber is divisible by 3):

for i = 1 to (bignumber - 2) step 3


A(i)
A(i+1)
A(i+2)
B(i)
B(i+1)
B(i+2)
C(i)
C(i+1)
C(i+2)
end

Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will
be divisible by the number of iterations we unroll. In the general case, loop unrolling may not be the best way to
implement software pipelining. Consider a loop containing instructions with a high latency. For example, the
following code:

for i = 1 to bignumber
A(i) ; 3 cycle latency
B(i) ; 3
C(i) ; 12(perhaps a floating point operation)
D(i) ; 3
E(i) ; 3
F(i) ; 3
end

would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code
of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache
performance, see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not
divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software
pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if
bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the
execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization
ineffectual.

By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):

prologue
for i = 1 to (bignumber - 6)
A(i+6)
B(i+5)
C(i+4)
D(i+2) ; note that we skip i+3
E(i+1)
F(i)
end
epilogue

Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify
that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration
7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from
iteration 7 of the original loop. The sequence of instructions is:

Iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)

Iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)

Iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)

Iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)

Iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)

Iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)

Iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)

However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12
instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7)
are used for other instructions instead of being wasted.

The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our
example above:

; loop prologue (arranged on lines for clarity)


A(1)
A(2), B(1)
A(3), B(2), C(1)
A(4), B(3), C(2) ; cannot start D(1) yet
A(5), B(4), C(3), D(1)
A(6), B(5), C(4), D(2), E(1)

Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that
have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:

; loop epilogue (arranged on lines for clarity)


B(bignumber), C(bignumber-1), D(bignumber-3), E(bignumber-4), F(bignumber-5)
C(bignumber), D(bignumber-2), E(bignumber-3), F(bignumber-4)
D(bignumber-1), E(bignumber-2), F(bignumber-3)
D(bignumber), E(bignumber-1), F(bignumber-2)
E(bignumber), F(bignumber-1)
F(bignumber)
PROGRAM PARTITIONING AND SCHEDULING
Grain size and latency:
The size of the parts or pieces of a program that can be considered for parallel execution can vary.
The sizes are roughly classified using the term “granule size,” or simply “granularity.” The
simplest measure, for example, is the number of instructions in a
program part. Grain sizes are usually described as fine, medium or coarse, depending on the level of
parallelism involved.

Latency

Latency is the time required for communication between different subsystems in a computer.
Memory latency, for example, is the time required by a processor to access memory.
Synchronization latency is the time required for two processes to synchronize their execution.
Computational granularity and communication latency are closely related. Latency and grain size
are interrelated and some general observation are
 As grain size decreases, potential parallelism increases, and overhead also increases.
 Overhead is the cost of parallelizing a task. The principle overhead is communication
latency.
 As grain size is reduced, there are fewer operations between communication, and hence
the impact of latency increases.
 Surface to volume: inter to intra-node comm.

Levels of Parallelism:

(a) Instruction Level


Parallelism:

This fine-grained, or smallest granularity level typically involves less than 20 instructions per grain.
The number of candidates for parallel execution varies from 2 to thousands, with about five
instructions or statements (on the average) being the average level of parallelism.
Advantages:
There are usually many candidates for parallel execution
Compilers can usually do a reasonable job of finding this parallelism

(b) Loop-level Parallelism:

Typical loop has less than 500 instructions. If a loop operation is independent between iterations,
it can be handled by a pipeline, or by a SIMD machine. Most optimized program construct to
execute on a parallel or vector machine. Some loops (e.g. recursive) are difficult to handle. Loop-
level parallelism is still considered fine grain computation.

(c) Procedure-level Parallelism:


Medium-sized grain; usually less than 2000 instructions. Detection of parallelism is more difficult
than with smaller grains; interprocedural dependence analysis is difficult and history-sensitive.
Communication requirement less than instruction level SPMD (single procedure multiple data) is
a special case Multitasking belongs to this level.

(d) Subprogram-level Parallelism:

Job step level; grain typically has thousands of instructions; medium- or coarse-grain level. Job
steps can overlap across different jobs. Multiprograming conducted at this level No compilers
available to exploit medium- or coarse-grain parallelism at present.

(e) Job or Program-Level Parallelism:

Corresponds to execution of essentially independent jobs or programs on a parallel computer.


This is practical for a machine with a small number of powerful processors, but impractical for a
machine with a large number of simple processors (since each processor would take too long to
process a single job).

Communication Latency
Balancing granularity and latency can yield better performance. Various latencies attributed to
machine architecture, technology, and communication patterns used. Latency imposes a limiting
factor on machine scalability. Ex. Memory latency increases as memory capacity increases,
limiting the amount of memory that can be used with a given tolerance for communication
latency.

Interprocessor Communication Latency

 Needs to be minimized by system designer


 Affected by signal delays and communication patterns Ex. n communicating tasks may
require n (n - 1)/2 communication links, and the complexity grows quadratically,
effectively limiting the number of processors in the system.

Communication Patterns

 Determined by algorithms used and architectural support provided


 Patterns include permutations broadcast multicast conference
 Tradeoffs often exist between granularity of parallelism and communication demand.

Grain Packing and Scheduling:


How can I partition a program into parallel “pieces” to yield the shortest execution time? What
is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing parallel
grains and the speedup obtained by parallel execution.
One approach to the problem is called “grain packing.”

Program Graphs and Packing:

A program graph is similar to a dependence graph Nodes = { (n,s) }, where n = node name, s =
size (larger s = larger grain size).
Edges = { (v,d) }, where v = variable being “communicated,” and d = communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to
other nodes. Packing is done to eliminate unnecessary communication delays or reduce overall
scheduling overhead.
Scheduling

A schedule is a mapping of nodes to processors and start times such that communication delay
requirements are observed, and no two nodes are executing on the same processor at the same
time. Some general scheduling goals
 Schedule all fine-grain activities in a node to the same processor to minimize
communication delays.
 Select grain sizes for packing to achieve better schedules for a particular parallel
machine.

Node Duplication:

Grain packing may potentially eliminate interprocessor communication, but it may not always
produce a shorter schedule. By duplicating nodes (that is, executing some instructions on multiple
processors), we may eliminate some interprocessor communication, and thus
produce a shorter schedule.

Program partitioning and scheduling:

Scheduling and allocation is a highly important issue since an inappropriate scheduling of tasks
can fail to exploit the true potential of the system and can offset the gain from parallelization. In
this paper we focus on the scheduling aspect. The objective of scheduling is to minimize the
completion time of a parallel application by properly
allocating the tasks to the processors. In a broad sense, the scheduling problem exists in two
forms: static and dynamic. In static scheduling, which is usually done at compile time, the
characteristics of a parallel program (such as task processing times, communication, data
dependencies, and synchronization requirements) are known before program execution
A parallel program, therefore, can be represented by a node- and edge-weighted directed acyclic
graph (DAG), in which the node weights represent task processing times and the edge weights
represent data dependencies as well as the communication times between tasks. In dynamic
scheduling only, a few assumptions about the parallel program can be made before execution, and
thus, scheduling decisions have to be made on-the-fly. The goal of a dynamic scheduling
algorithm as such includes not only the minimization of the program completion time but also the
minimization of the scheduling overhead which constitutes a significant portion of the cost paid
for running the scheduler. In general dynamic scheduling is an NP hard problem.

Program flow mechanism:

Conventional machines used control flow mechanism in which order of program execution
explicitly stated in user programs. Dataflow machines which instructions can be executed by
determining operand availability. Reduction machines trigger an instruction’s execution based on
the demand for its results.

Control Flow vs. Data Flow:

In Control flow computers the next instruction is executed when the last instruction as stored in
the program has been executed where as in Data flow computers an instruction executed when the
data (operands) required for executing that instruction is available.Control flow machines used
shared memory for instructions and data. Since variables are updated by many instructions, there
may be side effects on other instructions. These side effects frequently prevent parallel
processing. Single processor systems are inherently sequential. Instructions in dataflow machines
are unordered and can be executed as soon as their operands are available; data is held in the
instructions themselves. Data tokens are passed from an instruction to its dependents to trigger
execution.
Data Flow Features:

No need for shared memory program counter control sequencer Special mechanisms are required
to detect data availability match data tokens with instructions needing them enable chain reaction
of asynchronous instruction execution.
A Dataflow Architecture – 1 The Arvind machine (MIT) has N PEs and an N -by –N
interconnection network. Each PE has a token-matching mechanism that dispatches only
instructions with data tokens available. Each datum is tagged with
 address of instruction to which it belongs
 context in which the instruction is being executed
Tagged tokens enter PE through local path (pipelined), and can also be communicated to other
PEs through the routing network. Instruction address(es) effectively replace the program counter
in a control flow machine. Context identifier effectively replaces the frame base register in a
control flow machine. Since the dataflow machine matches the data tags from one instruction
with successors, synchronized instruction execution is implicit.
An I-structure in each PE is provided to eliminate excessive copying of data structures. Each
word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has
pending read requests.
This is a retreat from the pure dataflow approach. Special compiler technology needed for
dataflow machines.

Demand-Driven Mechanisms:

Data-driven machines select instructions for execution based on the availability of their operands;
this is essentially a bottom-up approach. Demand-driven machines take a top-down approach,
attempting to execute the instruction (a demander) that yields the final result. This triggers the
execution of instructions that yield its operands, and so forth. The demand-driven approach
matches naturally with functional programming languages (e.g. LISP and SCHEME). Pattern
driven computers : An instruction is executed when we obtain a particular data patterns as output.
There are two types of pattern driven computers
(a) String-reduction model: each demander gets a separate copy of the
expression string to evaluate each reduction step has an operator and
embedded reference to demand the corresponding operands each
operator is suspended while arguments are evaluated
(b) Graph-reduction model: expression graph reduced by evaluation of
branches or subgraphs, possibly in parallel, with demanders given
pointers to results of reductions. based on sharing of pointers to
arguments; traversal and reversal of pointers continues until constant
arguments are encountered.

LOOP SCHEDULING
In parallel computing, loop scheduling is the problem of assigning proper iterations of parallelizable loops
among n processors to achieve load balancing and maintain data locality with minimum dispatch
overhead.

Typical loop scheduling methods are:

 static even scheduling: evenly divide loop iteration space into n chunks and assign each chunk to
a processor
 dynamic scheduling: a chunk of loop iteration is dispatched at runtime by an idle processor.
When the chunk size is 1 iteration, it is also called self-scheduling.
 guided scheduling: similar to dynamic scheduling, but the chunk sizes per dispatch keep
shrinking until reaching a preset value.

PARALLELIZATION OF SEQUENTIAL PROGRAMS


The main reason of parallelization a sequential program is to run the program faster. The easiest way to
parallelize a sequential program is to use a compiler that automatically detects the parallelism of the
program and generates the parallel version.

Automatic parallelization, also auto parallelization, autoparallelization, or parallelization, the last one of
which implies automation when used in context, refers to converting sequential code into multi-threaded
or vectorized (or even both) code in order to utilize multiple processors simultaneously in a shared-
memory multiprocessor (SMP) machine. The goal of automatic parallelization is to relieve programmers
from the hectic and error-prone manual parallelization process.[1] Though the quality of automatic
parallelization has improved in the past several decades, fully automatic parallelization of sequential
programs by compilers remains a grand challenge due to its need for complex program analysis and the
unknown factors (such as input data range) during compilation.[2]

The programming control structures on which autoparallelization places the most focus are loops,
because, in general, most of the execution time of a program takes place inside some form of loop. There
are two main approaches to parallelization of loops: pipelined multi-threading and cyclic multi-
threading.[3]
For example, consider a loop that on each iteration applies a hundred operations, and runs for a thousand
iterations. This can be thought of as a grid of 100 columns by 1000 rows, a total of 100,000 operations.
Cyclic multi-threading assigns each row to a different thread. Pipelined multi-threading assigns each
column to a different thread.

You might also like