Parallel Computing Unit Wise
A computer is an electronic device that manipulates information or data. It has the ability to store, retrieve, and process
data.
Nowadays, a computer can be used to type documents, send email, play games, and browse the Web. It can also be used
to edit or create spreadsheets, presentations, and even videos. But the evolution of this complex system started around
1940 with the first generation of computers, and it has been evolving ever since.
1. FIRST GENERATION
o Introduction:
1. 1946-1959 is the period of first generation computers.
2. J. P. Eckert and J. W. Mauchly invented the first successful electronic computer called ENIAC.
ENIAC stands for "Electronic Numerical Integrator And Computer".
Advantages:
1. They made use of vacuum tubes, which were the only electronic components available in those days.
2. These computers could calculate in milliseconds.
Disadvantages:
1. These were very big in size; the weight was about 30 tonnes.
2. These computers were based on vacuum tubes.
3. These computers were very costly.
4. It could store only a small amount of information due to the presence of magnetic drums.
5. As first generation computers used vacuum tubes, another disadvantage was that vacuum
tubes require a large cooling system.
6. Very low work efficiency.
7. Limited programming capabilities; punch cards were used to take input.
8. Large amount of energy consumption.
9. Not reliable, and constant maintenance was required.
2. SECOND GENERATION
Introduction:
1. 1959-1965 is the period of second-generation computers.
2. Second generation computers were based on transistors instead of vacuum tubes.
Few Examples are:
1. Honeywell 400
2. IBM 7094
3. CDC 1604
4. CDC 3600
5. UNIVAC 1108, and many more
Advantages:
1. Due to the use of transistors instead of vacuum tubes, the size of the electronic components decreased. This
resulted in reducing the size of a computer as compared to first generation computers.
2. They consumed less energy and did not produce as much heat as the first generation.
3. Assembly language and punch cards were used for input.
4. Lower cost than first generation computers.
5. Better speed; they could calculate data in microseconds.
6. Better portability as compared to the first generation.
Disadvantages:
1. A cooling system was required.
2. Constant maintenance was required.
3. Only used for specific purposes.
3. THIRD GENERATION
Introduction:
1. 1965-1971 is the period of third generation computers.
2. These computers were based on integrated circuits (ICs).
3. The IC was invented by Robert Noyce and Jack Kilby in 1958-1959.
4. An IC is a single component containing a number of transistors.
Few Examples are:
1. PDP-8
2. PDP-11
3. ICL 2900
4. IBM 360
5. IBM 370 and many more
Advantages:
1. These computers were cheaper as compared to second-generation computers.
2. They were fast and reliable.
3. The use of ICs made the computers small in size.
4. ICs not only reduced the size of the computer but also improved its performance as
compared to previous computers.
5. This generation of computers had larger storage capacity.
6. Instead of punch cards, a mouse and keyboard were used for input.
7. They used an operating system for better resource management and used the concepts of time-sharing and
multiprogramming.
8. These computers reduced the computational time from microseconds to nanoseconds.
Disadvantages:
1. IC chips are difficult to maintain.
2. Highly sophisticated technology is required for the manufacturing of IC chips.
3. Air conditioning is required.
4. FOURTH GENERATION
Introduction:
1. 1971-1980 is the period of fourth generation computer.
2. This technology is based on Microprocessor.
3. A microprocessor is used in a computer for any logical and arithmetic function to be performed in any
program.
4. Graphical User Interface (GUI) technology was exploited to offer more comfort to users.
Advantages:
1. Fastest in computation, and the size was reduced as compared to the previous generation of computers.
2. Heat generated is negligible.
3. Small in size as compared to previous generation computers.
4. Less maintenance is required.
5. All types of high-level languages can be used on this type of computer.
Disadvantages:
1. The Microprocessor design and fabrication are very complex.
2. Air conditioning is required in many cases due to the presence of ICs.
3. Advanced technology is required to make the ICs.
5. FIFTH GENERATION
Introduction:
1. 1980-onwards is the period of fifth generation computers.
2. This generation is based on artificial intelligence.
3. The aim of the fifth generation is to make devices which can respond to natural language input and are
capable of learning and self-organization.
4. This generation is based on ULSI (Ultra Large Scale Integration) technology, resulting in the production of
microprocessor chips having ten million electronic components.
Few Examples are:
1. Desktop
2. Laptop
3. NoteBook
4. UltraBook
5. Chromebook
6. and many more
Advantages:
Disadvantages:
The hardware, software, and programming elements of modern computer systems can be
characterized by looking at a variety of factors in the context of parallel computing. These factors are:
Computing problems
Algorithms and data structures
Hardware resources
Operating systems
System software support
Compiler support
Algorithms and Data Structures
Traditional algorithms and data structures are designed for sequential machines.
New, specialized algorithms and data structures are needed to exploit the capabilities of
parallel architectures.
These often require interdisciplinary interactions among theoreticians, experimentalists,
and programmers.
Operating System
Operating systems manage the allocation and deallocation of resources during user
program execution.
UNIX, Mach, and OSF/1 provide support for multiprocessors and multicomputers, including
multithreaded kernel functions, virtual memory management, file subsystems and network
communication services.
An OS plays a significant role in mapping hardware resources to algorithms and data
structures.
System Software Support
Compilers, assemblers, and loaders are traditional tools for developing programs in high-
level languages. Together with the operating system, these tools determine the binding of resources to
applications, and the effectiveness of this binding determines the efficiency of hardware
utilization and the system's programmability.
Most programmers still employ a sequential mindset, abetted by a lack of popular
parallel software support.
Compiler Support
Parallel software can be developed using entirely new languages designed specifically
with parallel support as its goal, or by using extensions to existing sequential languages.
New languages have obvious advantages (like new constructs specifically for
parallelism), but require additional programmer education and system software.
The most common approach is to extend an existing language.
Elements of Modern Computers
A modern computer system consists of computer hardware, instruction sets, application programs, system
software and user interface.
The computing problems are categorized as numerical computing, logical reasoning, and transaction
processing. Some complex problems may need the combination of all the three processing modes.
Evolution of Computer Architecture − In the last four decades, computer architecture has gone
through revolutionary changes. We started with Von Neumann architecture and now we have
multicomputers and multiprocessors.
Performance of a computer system − Performance of a computer system depends both on machine
capability and program behavior. Machine capability can be improved with better hardware
technology, advanced architectural features and efficient resource management. Program behavior is
unpredictable as it is dependent on application and run-time conditions.
(1) Synchronous:
(i) Vector Supercomputers: In a vector computer, a vector processor is attached to the scalar processor as an
optional feature. The host computer first loads program and data to the main memory. Then the scalar
control unit decodes all the instructions. If the decoded instructions are scalar operations or program
operations, the scalar processor executes those operations using scalar functional pipelines.
On the other hand, if the decoded instructions are vector operations then the instructions will be sent to
vector control unit.
(ii) SIMD Supercomputers: In SIMD computers, ‘N’ number of processors are connected to a
control unit and all the processors have their individual memory units. All the processors are
connected by an interconnection network.
(iii) Systolic arrays: The parallel processing approach diverges from the traditional Von Neumann
architecture. One such approach is the concept of systolic processing using systolic arrays. A
systolic array is a network of processors that rhythmically compute and pass data through the
system. They derive their name from an analogy with how blood rhythmically flows through a
biological heart: data flows from memory in a rhythmic fashion, passing through many elements
before it returns to memory. It is also an example of pipelining combined with parallel
computing. It was introduced in the 1970s and was used by Intel to build CMU's iWarp processor in
1990.
In a systolic array there is a large number of identical simple processors or processing elements (PEs) that
are arranged in a well-organised structure such as a linear or two-dimensional array. Each processing element
is connected with the other PEs and has limited private storage.
A Host station is often used for communication with the outside world in the network.
Characteristics:
1. Parallel Computing –
Many processes are carried out simultaneously. As the arrays have a non-centralized structure,
parallel computing is implemented.
2. Pipelinability –
It means that the array can achieve high speed. It shows a linear rate pipelinability.
3. Synchronous evaluation –
Computation of data is timed by a global clock and then the data is passed through the network.The
global clock synchronizes the array and has fixed length clock cycles.
4. Repeatability –
Most of the arrays have the repetition and interconnection of a single type of PE in the entire
network.
5. Spatial Locality –
The cells have a local communication interconnection.
6. Temporal Locality –
One unit time delay is at least required for the transmission of signals from one cell to another.
Advantages:
It employs a high degree of parallelism and can sustain very high throughput.
These are highly compact, robust and efficient.
Data and control flow are simple and regular.
Disadvantages:
They are highly specialized and thus are inflexible regarding the problems they can solve.
These are difficult to build.
These are expensive.
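A minimal software sketch of this rhythmic data flow (an assumed matrix-vector product on a linear array; the sizes and values are illustrative, not taken from the text): each PE keeps one result, samples of the vector enter at the leftmost PE and are handed from neighbour to neighbour once per global clock tick, and every PE performs one multiply-accumulate per tick.

#include <stdio.h>
#define N 4

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N] = {1, 1, 1, 1};
    double y[N] = {0};          /* result y = A*x, one element per PE */
    double xreg[N] = {0};       /* input register of each PE          */

    for (int t = 0; t < 2 * N - 1; t++) {      /* one iteration = one clock tick  */
        for (int i = N - 1; i > 0; i--)        /* pass the sample to the right    */
            xreg[i] = xreg[i - 1];
        xreg[0] = (t < N) ? x[t] : 0;          /* a new sample enters PE 0        */

        for (int i = 0; i < N; i++) {          /* all PEs "fire" in the same tick */
            int j = t - i;                     /* index of the sample now in PE i */
            if (j >= 0 && j < N)
                y[i] += A[i][j] * xreg[i];     /* one multiply-accumulate per PE  */
        }
    }
    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}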
(2) Asynchronous:
(i) MIMD:
In computing, MIMD (multiple instruction, multiple data) is a technique employed to achieve
parallelism. Machines using MIMD have a number of processors that function asynchronously and
independently. At any time, different processors may be executing different instructions on different
pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided
design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD
machines can be of either shared memory or distributed memory categories. These classifications are
based on how MIMD processors access memory. Shared memory machines may be of the bus-based,
extended, or hierarchical type. Distributed memory machines may have hypercube or mesh
interconnection schemes.
An example of an MIMD system is the Intel Xeon Phi, descended from the Larrabee microarchitecture.[1] These
processors have multiple processing cores (up to 61 as of 2015) that can execute different instructions on
different data. Most parallel computers, as of 2013, are MIMD systems.
FLYNN’S CLASSIFICATION
Parallel computing is a computing where the jobs are broken into discrete parts that can be executed
concurrently. Each part is further broken down to a series of instructions. Instructions from each part execute
simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple computer
resources that can include a single computer with multiple processors, a number of computers connected by
a network to form a parallel processing cluster or a combination of both.
Parallel systems are more difficult to program than computers with a single processor because the
architecture of parallel computers varies accordingly and the processes of multiple CPUs must be
coordinated and synchronized.
The crux of parallel processing is the CPU. Based on the number of instruction and data streams that can be
processed simultaneously, computing systems are classified into four major categories:
Flynn’s classification –
The speed of the processing element in the SISD model is limited(dependent) by the rate at which the
computer can transfer information internally. Dominant representative SISD systems are IBM PC,
workstations.
Example: Z = sin(x) + cos(x) + tan(x)
In the MISD model, the system performs different operations on the same data set (here sin, cos and tan of the same x).
Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of
them are available commercially.
In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are connected to
a single global memory and they all have access to it. The communication between PEs in this model takes
place through the shared memory, modification of the data stored in the global memory by one PE is visible
to all other PEs. Dominant representative shared memory MIMD systems are Silicon Graphics machines and
Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled multiprocessor systems) all PEs have a local
memory. The communication between PEs in this model takes place through the interconnection network
(the inter process communication channel, or IPC). The network connecting PEs can be configured to tree,
mesh or in accordance with the requirement.
The shared-memory MIMD architecture is easier to program but is less tolerant to failures and harder to
extend with respect to the distributed memory MIMD model. Failures in a shared-memory MIMD affect the
entire system, whereas this is not the case of the distributed model, in which each of the PEs can be easily
isolated. Moreover, shared memory MIMD architectures are less likely to scale because the addition of more
PEs leads to memory contention. This is a situation that does not happen in the case of distributed memory,
in which each PE has its own memory. As a result of practical outcomes and users' requirements, distributed
memory MIMD architecture is superior to the other existing models.
FENG’S CLASSIFICATION
According to Feng's classification, computer architecture can be classified into four categories. The classification is based on the
way contents stored in memory are processed. The contents can be either data or instructions.
Word serial bit serial (WSBS)
One bit of one selected word is processed at a time. This represents serial processing and needs maximum processing
time.
Word serial bit parallel (WSBP)
It is found in most existing computers and has been called word slice processing because one word of n bits is
processed at a time. All bits of a selected word are processed at a time; bit parallel means all the bits of a word.
Word parallel bit serial (WPBS)
It has been called bit slice processing because an m-bit slice is processed at a time. Word parallel signifies selection of all
words. It can be considered as one bit from all words being processed at a time.
Word parallel bit parallel (WPBP)
It is known as fully parallel processing, in which an array of n x m bits is processed at one time. Maximum parallelism
is achieved here.
It fails to project the concurrency in pipeline processors, as the degree of parallelism does not account for the concurrency
handled by pipelined designs.
HANDLER’S CLASSIFICATION
In 1977, Wolfgang Handler proposed an elaborate notation for expressing the pipelining and parallelism of
computers. Handler's classification addresses the computer at three distinct levels:
•Processor control unit (PCU),
•Arithmetic logic unit (ALU),
•Bit-level circuit (BLC).
The PCU corresponds to a processor or CPU, the ALU corresponds to a functional unit or a processing element
and the BLC corresponds to the logic circuit needed to perform one-bit operations in the ALU.
Handler's classification uses the following three pairs of integers to describe a computer:
Computer = (p * p', a * a', b * b')
Where p = number of PCUs
Where p'= number of PCUs that can be pipelined
Where a = number of ALUs controlled by each PCU
Where a'= number of ALUs that can be pipelined
Where b = number of bits in ALU or processing element (PE) word
Where b' = number of pipeline segments on all ALUs or in a single PE
The following rules and operators are used to show the relationship between various elements of the computer:
•The '*' operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running
through all the units.
•The '+' operator is used to indicate that the units are not pipelined but work on independent streams of data.
•The 'v' operator is used to indicate that the computer hardware can work in one of several modes.
•The '~' symbol is used to indicate a range of values for any one of the parameters.
•Peripheral processors are shown before the main processor using another three pairs of integers. If the value of
the second element of any pair is 1, it may be omitted for brevity.
Handler's classification is best explained by showing how the rules and operators are used to classify several
machines.
The CDC 6600 has a single main processor supported by 10 I/O processors. One control unit
coordinates one ALU with a 60-bit word length. The ALU has 10 functional units which can be formed into a
pipeline. The 10 peripheral I/O processors may work in parallel with each other and with the CPU. Each I/O
processor contains one 12-bit ALU. The description for the 10 I/O processors is:
CDC 6600 I/O = (10, 1, 12)
The description for the main processor is:
CDC 6600 main = (1, 1 * 10, 60)
The main processor and the I/O processors can be regarded as forming a macro-pipeline, so the '*' operator is
used to combine the two structures:
CDC 6600 = (I/O processors) * (central processor) = (10, 1, 12) * (1, 1 * 10, 60)
Texas Instruments' Advanced Scientific Computer (ASC) has one controller coordinating four arithmetic units.
Each arithmetic unit is an eight-stage pipeline with 64-bit words. Thus we have:
ASC = (1, 4, 64 * 8)
The Cray-1 is a 64-bit single-processor computer whose ALU has twelve functional units, eight of which can be
chained together to form a pipeline. Different functional units have from 1 to 14 segments, which can also be
pipelined. Handler's description of the Cray-1 is:
Cray-1 = (1, 12 * 8, 64 * (1 ~ 14))
Another sample system is Carnegie-Mellon University's C.mmp multiprocessor. This system was designed to
facilitate research into parallel computer architectures and consequently can be extensively reconfigured. The
system consists of 16 PDP-11 'minicomputers' (which have a 16-bit word length), interconnected by a crossbar
switching network. Normally, the C.mmp operates in MIMD mode for which the description is (16, 1, 16).
It can also operate in SIMD mode, where all the processors are coordinated by a single master controller. The
SIMD mode description is (1, 16, 16). Finally, the system can be rearranged to operate in MISD mode. Here the
processors are arranged in a chain with a single stream of data passing through all of them.
The MISD mode description is (1 * 16, 1, 16). The 'v' operator is used to combine descriptions of the same piece
of hardware operating in differing modes. Thus, Handler's description for the complete C.mmp is:
C.mmp = (16, 1, 16) v (1, 16, 16) v (1 * 16, 1, 16)
The '*' and '+' operators are used to combine several separate pieces of hardware.
The 'v' operator is of a different form to the other two in that it is used to combine the different operating modes
of a single piece of hardware. While Flynn's classification is easy to use, Handler's classification is cumbersome.
The direct use of numbers in the nomenclature of Handler's classification makes it much more abstract and
hence difficult. Handler's classification is highly geared towards the description of pipelines and chains. While it
is well able to describe the parallelism in a single processor, the variety of parallelism in multiprocessor
computers is not addressed well.
UNIT-2
COMBINATIONAL CIRCUITS
A combinational circuit is a circuit in which different gates are combined, for example encoders, decoders,
multiplexers and demultiplexers. Some of the characteristics of combinational circuits are the following −
The output of a combinational circuit at any instant of time depends only on the levels present at the input
terminals.
A combinational circuit does not use any memory. The previous state of the input does not have any effect on the
present state of the circuit.
A combinational circuit can have n inputs and m outputs.
Block diagram
Half Adder
The half adder is a combinational logic circuit with two inputs and two outputs. The half adder circuit is designed to add
two single-bit binary numbers A and B. It is the basic building block for the addition of two single-bit numbers. This circuit
has two outputs, carry and sum.
Block diagram
Truth Table
Circuit Diagram
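The half adder's behaviour can be sketched in C as sum = A XOR B and carry = A AND B (an illustrative software model of the gate-level circuit; it simply prints the truth table):

#include <stdio.h>

/* Half adder: sum = A XOR B, carry = A AND B (bit values 0 or 1). */
void half_adder(int a, int b, int *sum, int *carry) {
    *sum   = a ^ b;
    *carry = a & b;
}

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int s, c;
            half_adder(a, b, &s, &c);
            printf("A=%d B=%d -> Sum=%d Carry=%d\n", a, b, s, c);
        }
    return 0;
}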
Full Adder
The full adder is developed to overcome the drawback of the half adder circuit. It can add two one-bit numbers A and B, and
a carry c. The full adder is a three-input and two-output combinational circuit.
Block diagram
Truth Table
Circuit Diagram
The full adder is capable of adding only two single-digit binary numbers along with a carry input. But in practice we
need to add binary numbers which are much longer than just one bit. To add two n-bit binary numbers we need to use
the n-bit parallel adder. It uses a number of full adders in cascade. The carry output of the previous full adder is
connected to the carry input of the next full adder.
In the block diagram, A0 and B0 represent the LSBs of the four-bit words A and B. Hence Full Adder-0 is the lowest
stage and its Cin has been permanently made 0. The rest of the connections are exactly the same as those of the n-bit
parallel adder shown in the figure. The four-bit parallel adder is a very common logic circuit.
Block diagram
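A hedged C sketch of the full adder equations and of a 4-bit ripple-carry (parallel) adder built by cascading them, mirroring the description above (the test values are arbitrary):

#include <stdio.h>

/* Full adder: adds bits a, b and carry-in cin; returns the sum bit and writes the carry-out. */
int full_adder(int a, int b, int cin, int *cout) {
    int sum = a ^ b ^ cin;
    *cout = (a & b) | (b & cin) | (a & cin);
    return sum;
}

/* 4-bit parallel (ripple-carry) adder: Cin of stage 0 is 0, and each stage's
   carry-out feeds the carry-in of the next stage. */
int add4(int a, int b, int *carry_out) {
    int carry = 0, result = 0;
    for (int i = 0; i < 4; i++) {
        int s = full_adder((a >> i) & 1, (b >> i) & 1, carry, &carry);
        result |= s << i;
    }
    *carry_out = carry;
    return result;
}

int main(void) {
    int c, s = add4(9, 6, &c);    /* 1001 + 0110 = 1111, carry 0           */
    printf("9 + 6 = %d, carry = %d\n", s, c);
    s = add4(13, 5, &c);          /* 1101 + 0101 = 1 0010 (low 4 bits = 2) */
    printf("13 + 5 = %d (low 4 bits), carry = %d\n", s, c);
    return 0;
}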
Subtraction can be carried out by taking the 1's or 2's complement of the number to be subtracted. For example, we
can perform the subtraction (A-B) by adding either the 1's or 2's complement of B to A. That means we can use a binary
adder to perform the binary subtraction.
The number to be subtracted (B) is first passed through inverters to obtain its 1's complement; a 1 applied at the carry
input of the 4-bit adder completes the 2's complement. The adder then adds A and the 2's complement of B to produce the
subtraction. S3 S2 S1 S0 represents the result of the binary subtraction (A-B) and the carry output Cout represents the
polarity of the result. If A ≥ B then Cout = 1 and the result is the binary form of (A-B); if A < B then Cout = 0 and the
result is in 2's complement form.
Block diagram
Half Subtractors
A half subtractor is a combinational circuit with two inputs and two outputs (difference and borrow). It produces the
difference between the two binary bits at the input and also produces an output (borrow) to indicate if a 1 has been
borrowed. In the subtraction (A-B), A is called the minuend bit and B is called the subtrahend bit.
Truth Table
Circuit Diagram
Full Subtractors
The disadvantage of a half subtractor is overcome by the full subtractor. The full subtractor is a combinational circuit with
three inputs A, B, C and two outputs D and C'. A is the 'minuend', B is the 'subtrahend', C is the 'borrow' produced by the
previous stage, D is the difference output and C' is the borrow output.
Truth Table
Circuit Diagram
Multiplexers
A multiplexer is a special type of combinational circuit. There are n data inputs, one output and m select inputs, with 2^m
= n. It is a digital circuit which selects one of the n data inputs and routes it to the output. The selection of one of the n
inputs is done by the select inputs. Depending on the digital code applied at the select inputs, one out of the n data
sources is selected and transmitted to the single output Y. E is called the strobe or enable input, which is useful for
cascading. It is generally an active-low terminal, which means it will perform the required operation when it is low.
Block diagram
2 : 1 multiplexer
4 : 1 multiplexer
16 : 1 multiplexer
32 : 1 multiplexer
Block Diagram
Truth Table
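A rough C model of a 4 : 1 multiplexer with an active-low enable (the assumption that an inactive strobe forces the output to 0 is made only for illustration; real devices differ):

#include <stdio.h>

/* 4:1 multiplexer: select lines s1 s0 choose one of four data inputs.
   The active-low enable e must be 0 for the output to follow the inputs. */
int mux4to1(int d[4], int s1, int s0, int e) {
    if (e != 0)                    /* strobe inactive: output assumed to be 0 */
        return 0;
    return d[(s1 << 1) | s0];      /* pick the input whose index is s1 s0     */
}

int main(void) {
    int d[4] = {0, 1, 1, 0};
    for (int sel = 0; sel < 4; sel++)
        printf("s1s0=%d%d -> Y=%d\n", (sel >> 1) & 1, sel & 1,
               mux4to1(d, (sel >> 1) & 1, sel & 1, 0));
    return 0;
}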
Demultiplexers
A demultiplexer performs the reverse operation of a multiplexer, i.e. it receives one input and distributes it over several
outputs. It has only one input, n outputs and m select inputs. At a time only one output line is selected by the select lines
and the input is transmitted to the selected output line. A demultiplexer is equivalent to a single-pole multiple-way
switch, as shown in the figure.
1 : 2 demultiplexer
1 : 4 demultiplexer
1 : 16 demultiplexer
1 : 32 demultiplexer
Block diagram
Truth Table
Decoder
A decoder is a combinational circuit. It has n inputs and a maximum of m = 2^n outputs. A decoder is identical to a
demultiplexer without any data input. It performs operations which are exactly opposite to those of an encoder.
Block diagram
Examples of decoders are the following:
Code converters
BCD to seven segment decoders
Nixie tube decoders
Relay actuators
2 to 4 Line Decoder
The block diagram of the 2 to 4 line decoder is shown in the figure. A and B are the two inputs and D0 through D3 are the
four outputs. The truth table explains the operation of the decoder: it shows that each output is 1 for only a specific
combination of inputs.
Block diagram
Truth Table
Logic Circuit
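The decoder's behaviour can be modelled in C as follows (illustrative sketch: exactly one output is driven to 1 for each input combination):

#include <stdio.h>

/* 2-to-4 line decoder: inputs A (MSB) and B (LSB); output D_i is 1
   exactly when i is the binary value of AB. */
void decoder2to4(int a, int b, int d[4]) {
    for (int i = 0; i < 4; i++)
        d[i] = 0;
    d[(a << 1) | b] = 1;
}

int main(void) {
    int d[4];
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            decoder2to4(a, b, d);
            printf("A=%d B=%d -> D3..D0 = %d%d%d%d\n", a, b, d[3], d[2], d[1], d[0]);
        }
    return 0;
}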
Encoder
An encoder is a combinational circuit which is designed to perform the inverse operation of the decoder. An encoder has n
input lines and m output lines. An encoder produces an m-bit binary code corresponding to the
digital input number. The encoder accepts an n-bit input digital word and converts it into an m-bit digital word.
Block diagram
Priority Encoder
This is a special type of encoder. Priority is given to the input lines. If two or more input lines are 1 at the same time,
then the input line with the highest priority will be considered. There are four inputs D0, D1, D2, D3 and two outputs Y0, Y1.
Of the four inputs, D3 has the highest priority and D0 has the lowest priority. That means if D3 = 1 then Y1 Y0 = 11
irrespective of the other inputs. Similarly, if D3 = 0 and D2 = 1 then Y1 Y0 = 10 irrespective of the other inputs.
Block diagram
Truth Table
Logic Circuit
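A small C sketch of the 4-to-2 priority encoding rule described above (real priority encoders usually also provide a separate "valid" output for the all-zero case, which is omitted here):

#include <stdio.h>

/* 4-to-2 priority encoder: D3 has the highest priority, D0 the lowest.
   Returns the code Y1 Y0 of the highest-priority input that is 1. */
int priority_encode(int d3, int d2, int d1, int d0) {
    if (d3) return 3;   /* Y1 Y0 = 11 */
    if (d2) return 2;   /* Y1 Y0 = 10 */
    if (d1) return 1;   /* Y1 Y0 = 01 */
    if (d0) return 0;   /* Y1 Y0 = 00 */
    return 0;           /* no input active */
}

int main(void) {
    printf("D3..D0 = 0110 -> code %d\n", priority_encode(0, 1, 1, 0));   /* 2 */
    printf("D3..D0 = 1010 -> code %d\n", priority_encode(1, 0, 1, 0));   /* 3 */
    return 0;
}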
SORTING NETWORK
Comparison networks differ from RAMs in two important respects. First, they can only perform comparisons. Thus,
an algorithm such as counting sort (see Section 9.2) cannot be implemented on a comparison network. Second, unlike
the RAM model, in which operations occur serially--that is, one after another--operations in a comparison network
may occur at the same time, or "in parallel." As we shall see, this characteristic allows the construction of comparison
networks that sort n values in sublinear time.
We begin in Section 28.1 by defining comparison networks and sorting networks. We also give a natural definition for
the "running time" of a comparison network in terms of the depth of the network. Section 28.2 proves the "zero-one
principle," which greatly eases the task of analyzing the correctness of sorting networks.
The efficient sorting network that we shall design is essentially a parallel version of the merge-sort algorithm from
Section 1.3.1. Our construction will have three steps. Section 28.3 presents the design of a "bitonic" sorter that will be
our basic building block. We modify the bitonic sorter slightly in Section 28.4 to produce a merging network that can
merge two sorted sequences into one sorted sequence. Finally, in Section 28.5, we assemble these merging networks
into a sorting network that can sort n values in O(lg^2 n) time.
Comparison networks:
Sorting networks are comparison networks that always sort their inputs, so it makes sense to begin our discussion with
comparison networks and their characteristics. A comparison network is comprised solely of wires and comparators. A
comparator, shown in Figure 28.1 (a), is a device with two inputs, x and y, and two outputs, x' and y', that performs the
following function:
Figure 28.1 (a) A comparator with inputs x and y and outputs x' and y'. (b) The same comparator, drawn as a single
vertical line. Inputs x = 7, y = 3 and outputs x' = 3, y' = 7 are shown.
x' = min(x, y) ,
y' = max(x, y) .
Because the pictorial representation of a comparator in Figure 28.1 (a) is too bulky for our purposes, we shall adopt the
convention of drawing comparators as single vertical lines, as shown in Figure 28.1(b). Inputs appear on the left and
outputs on the right, with the smaller input value appearing on the top output and the larger input value appearing on
the bottom output. We can thus think of a comparator as sorting its two inputs.
We shall assume that each comparator operates in O(1) time. In other words, we assume that the time between the
appearance of the input values x and y and the production of the output values x' and y' is a constant.
A wire transmits a value from place to place. Wires can connect the output of one comparator to the input of another,
but otherwise they are either network input wires or network output wires. Throughout this chapter, we shall assume
that a comparison network contains n input wires a1, a2, . . . , an, through which the values to be sorted enter the
network, and n output wires b1, b2, . . . , bn, which produce the results computed by the network. Also, we shall speak
of the input sequence a1, a2, . . . , an and the output sequence b1,b2, . . . ,bn , referring to the values on the input and
output wires. That is, we use the same name for both a wire and the value it carries. Our intention will always be clear
from the context.
Figure 28.2 shows a comparison network, which is a set of comparators interconnected by wires. We draw a
comparison network on n inputs as a collection of n horizontal lines with comparators stretched vertically. Note that a
line does not represent a single wire, but rather a sequence of distinct wires connecting various comparators. The top
line in Figure 28.2, for example, represents three wires: input wire a1, which connects to an input of comparator A; a
wire connecting the top output of comparator A to an input of comparator C; and output wire b1, which comes from the
top output of comparator C. Each comparator input is connected to a wire that is either one of the network's n input
wires a1, a2, . . . , an or is connected to the output of another comparator. Similarly, each comparator output is
connected to a wire that is either one of the network's n output wires b1, b2, . . . , bn or is connected to the input of
another comparator. The main requirement for interconnecting comparators is that the graph of interconnections must
be acyclic: if we trace a path from the output of a given comparator to the input of another to output to input, etc., the
path we trace must never cycle back on itself and go through the same comparator twice. Thus, as in Figure 28.2, we
can draw a comparison network with network inputs on the left and network outputs on the right; data move through
the network from left to right.
Figure 28.2 (a) A 4-input, 4-output comparison network, which is in fact a sorting network. At time 0, the input values
shown appear on the four input wires. (b) At time 1, the values shown appear on the outputs of comparators A and B,
which are at depth 1. (c) At time 2, the values shown appear on the outputs of comparators C and D, at depth 2. Output
wires b1 and b4 now have their final values, but output wires b2 and b3 do not. (d) At time 3, the values shown appear
on the outputs of comparator E, at depth 3. Output wires b2 and b3 now have their final values.
Each comparator produces its output values only when both of its input values are available to it. In Figure 28.2(a), for
example, suppose that the sequence 9, 5, 2, 6 appears on the input wires at time 0. At time 0, then, only
comparators A and B have all their input values available. Assuming that each comparator requires one time unit to
compute its output values, comparators A and B produce their outputs at time 1; the resulting values are shown in
Figure 28.2(b). Note that comparators A and B produce their values at the same time, or "in parallel." Now, at time 1,
comparators C and D, but not E, have all their input values available. One time unit later, at time 2, they produce their
outputs, as shown in Figure 28.2(c). Comparators C and D operate in parallel as well. The top output of comparator C
and the bottom output of comparator D connect to output wires b1 and b4, respectively, of the comparison network, and
these network output wires therefore carry their final values at time 2. Meanwhile, at time 2, comparator E has its
inputs available, and Figure 28.2(d) shows that it produces its output values at time 3. These values are carried on
network output wires b2 and b3, and the output sequence 2, 5, 6, 9 is now complete.
Under the assumption that each comparator takes unit time, we can define the "running time" of a comparison
network, that is, the time it takes for all the output wires to receive their values once the input wires receive theirs.
Informally, this time is the largest number of comparators that any input element can pass through as it travels from an
input wire to an output wire. More formally, we define the depth of a wire as follows. An input wire of a comparison
network has depth 0. Now, if a comparator has two input wires with depths dx and dy, then its output wires have depth
max(dx, dy) + 1. Because there are no cycles of comparators in a comparison network, the depth of a wire is well
defined, and we define the depth of a comparator to be the depth of its output wires. Figure 28.2 shows comparator
depths. The depth of a comparison network is the maximum depth of an output wire or, equivalently, the maximum
depth of a comparator. The comparison network of Figure 28.2, for example, has depth 3 because comparator E has
depth 3. If each comparator takes one time unit to produce its output value, and if network inputs appear at time 0, a
comparator at depth d produces its outputs at time d; the depth of the network therefore equals the time for the network
to produce values at all of its output wires.
A sorting network is a comparison network for which the output sequence is monotonically increasing (that is, b1 ≤ b2 ≤
. . . ≤ bn) for every input sequence. Of course, not every comparison network is a sorting network, but the network of
Figure 28.2 is. To see why, observe that after time 1, the minimum of the four input values has been produced by
either the top output of comparator A or the top output of comparator B. After time 2, therefore, it must be on the top
output of comparator C. A symmetrical argument shows that after time 2, the maximum of the four input values has
been produced by the bottom output of comparator D. All that remains is for comparator E to ensure that the middle
two values occupy their correct output positions, which happens at time 3.
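A small C simulation of this 4-input network (the wiring of comparators A through E follows the description above; comparators at a common depth are executed one after another here, although in the network they operate in parallel):

#include <stdio.h>

/* One comparator: leaves the smaller value on the top wire and the larger on the bottom wire. */
static void comparator(int *top, int *bottom) {
    if (*top > *bottom) {
        int t = *top; *top = *bottom; *bottom = t;
    }
}

int main(void) {
    int w[4] = {9, 5, 2, 6};        /* input sequence from the example above  */

    comparator(&w[0], &w[1]);       /* depth 1: comparator A (wires 1, 2)     */
    comparator(&w[2], &w[3]);       /* depth 1: comparator B (wires 3, 4)     */
    comparator(&w[0], &w[2]);       /* depth 2: comparator C (top outputs)    */
    comparator(&w[1], &w[3]);       /* depth 2: comparator D (bottom outputs) */
    comparator(&w[1], &w[2]);       /* depth 3: comparator E (middle wires)   */

    printf("output: %d %d %d %d\n", w[0], w[1], w[2], w[3]);   /* 2 5 6 9 */
    return 0;
}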
A comparison network is like a procedure in that it specifies how comparisons are to occur, but it is unlike a procedure
in that its physical size depends on the number of inputs and outputs. Therefore, we shall actually be describing
"families" of comparison networks. For example, the goal of this chapter is to develop a family SORTER of efficient
sorting networks. We specify a given network within a family by the family name and the number of inputs (which
equals the number of outputs). For example, the n-input, n-output sorting network in the family SORTER is named
SORTER[n].
PRAM MODELS
The PRAM (Parallel Random Access Machine) is the model considered for most parallel algorithms. Here, multiple
processors are attached to a single block of shared memory. A PRAM model contains a set of similar processors, a single
shared memory unit that all the processors can access, and a memory access unit that connects the processors to the memory.
When several processors try to access the same memory location simultaneously, access conflicts can arise. To resolve this
problem, the following constraints have been enforced on the PRAM model −
Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed to read from or
write to the same memory location at the same time.
Exclusive Read Concurrent Write (ERCW) − Here no two processors are allowed to read from the
same memory location at the same time, but are allowed to write to the same memory location at the
same time.
Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from the
same memory location at the same time, but are not allowed to write to the same memory location at
the same time.
Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or
write to the same memory location at the same time.
There are many methods to implement the PRAM model, but the most prominent ones are the shared memory model, the message passing model and the data parallel model.
As multiple processors access the same memory location, it may happen that at any particular point of time, more than
one processor is accessing the same memory location. Suppose one is reading that location and the other is writing on
that location. It may create confusion. To avoid this, some control mechanism, like lock / semaphore, is implemented
to ensure mutual exclusion.
Shared memory programming has been implemented in the following −
Thread libraries − A thread library allows multiple threads of control that run concurrently in the same
address space. The thread library provides an interface that supports multithreading through a library of
subroutines. It contains subroutines for
o Creating and destroying threads
o Scheduling execution of thread
o Passing data and messages between threads
o Saving and restoring thread contexts
Examples of thread libraries include − Solaris threads for Solaris, POSIX threads as implemented in Linux, Win32
threads available in Windows NT and Windows 2000, and Java threads as part of the standard Java
Development Kit (JDK).
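A minimal POSIX threads sketch in C of the create/join pattern these subroutines support (illustrative only; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

/* Each created thread runs this function concurrently with the main thread. */
static void *worker(void *arg) {
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);   /* create thread 1       */
    pthread_create(&t2, NULL, worker, (void *)2L);   /* create thread 2       */
    pthread_join(t1, NULL);                          /* wait for both threads */
    pthread_join(t2, NULL);
    return 0;
}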
Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of shared memory on
loosely coupled architecture in order to implement shared memory programming without hardware support.
They implement standard libraries and use the advanced user-level memory management features present in
modern operating systems. Examples include Tread Marks System, Munin, IVY, Shasta, Brazos, and
Cashmere.
Program Annotation Packages − This is implemented on the architectures having uniform memory access
characteristics. The most notable example of program annotation packages is OpenMP. OpenMP implements
functional parallelism. It mainly focuses on parallelization of loops.
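A minimal OpenMP sketch of loop parallelization through program annotation (illustrative; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>
#define N 1000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* The annotation divides the loop iterations among the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %g, max threads = %d\n", a[N - 1], omp_get_max_threads());
    return 0;
}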
The concept of shared memory provides low-level control of a shared memory system, but it tends to be tedious and
error-prone. It is more applicable for system programming than application programming.
Processors use message-passing libraries for communication among themselves. Along with the data being sent, the
message contains the following components −
The address of the processor from which the message is being sent;
Starting address of the memory location of the data in the sending processor;
Data type of the sending data;
Data size of the sending data;
The address of the processor to which the message is being sent;
Starting address of the memory location for the data in the receiving processor.
Processors can communicate with each other by any of the following methods −
Point-to-Point Communication
Collective Communication
Message Passing Interface
Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here, a message can be sent from the sending
processor to a receiving processor by any of the following transfer modes −
Synchronous mode − The next message is sent only after the receiving a confirmation that its previous
message has been delivered, to maintain the sequence of the message.
Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous
message is not required.
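A hedged MPI sketch of point-to-point transfer (run with at least two processes, e.g. mpirun -np 2): rank 0 sends one integer to rank 1 with the standard-mode MPI_Send/MPI_Recv; MPI also offers an explicitly synchronous variant (MPI_Ssend) and non-blocking, asynchronous variants (MPI_Isend/MPI_Irecv) corresponding to the two modes above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1      */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}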
Collective Communication
Collective communication involves more than two processors for message passing. Following modes allow collective
communications −
Barrier − Barrier mode is possible if all the processors included in the communication run a particular block
(known as the barrier block) for message passing.
Broadcast − Broadcasting is of two types −
o One-to-all − Here, one processor with a single operation sends same message to all other processors.
o All-to-all − Here, all processors send message to all other processors.
Messages broadcasted may be of three types −
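A hedged MPI sketch of the barrier and one-to-all broadcast modes described above (run with several processes; the value 7 is arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) data = 7;                         /* only the root holds the value    */
    MPI_Barrier(MPI_COMM_WORLD);                     /* all processes synchronize here   */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD); /* one-to-all broadcast from rank 0 */
    printf("rank %d now has data = %d\n", rank, data);

    MPI_Finalize();
    return 0;
}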
Features of PVM
Very easy to install and configure;
Multiple users can use PVM at the same time;
One user can execute multiple applications;
It’s a small package;
Supports C, C++, Fortran;
For a given run of a PVM program, users can select the group of machines;
It is a message-passing model,
Process-based computation;
Supports heterogeneous architecture.
It is restrictive, as not all the algorithms can be specified in terms of data parallelism. This is the reason why data
parallelism is not universal.
Data parallel languages help to specify the data decomposition and mapping to the processors. It also includes data
distribution statements that allow the programmer to have control on data – for example, which data will go on which
processor – to reduce the amount of communication within the processors.
Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching
networks.
Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of
storage (memory) space available in that chip.
We can calculate the space complexity of an algorithm by the chip area (A) of the VLSI chip implementation
of that algorithm. If T is the time (latency) needed to execute the algorithm, then A.T gives an upper bound
on the total number of bits processed through the chip (or I/O). For certain computations, there exists a lower
bound, f(s), such that A·T^2 ≥ O(f(s)).
PARALLELISM APPROACHES
Data parallelism is parallelization across multiple processors in parallel computing environments. It focuses
on distributing the data across different nodes, which operate on the data in parallel. It can be applied on
regular data structures like arrays and matrices by working on each element in parallel. It contrasts with task
parallelism as another form of parallelism.
A data parallel job on an array of n elements can be divided equally among all the processors. Let us assume
we want to sum all the elements of the given array and the time for a single addition operation is Ta time
units. In the case of sequential execution, the time taken by the process will be n×Ta time units as it sums up
all the elements of an array. On the other hand, if we execute this job as a data parallel job on 4 processors
the time taken would reduce to (n/4)×Ta + merging overhead time units. Parallel execution results in a
speedup of 4 over sequential execution. One important thing to note is that the locality of data references
plays an important part in evaluating the performance of a data parallel programming model. Locality of data
depends on the memory accesses performed by the program as well as the size of the cache.
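A hedged C/OpenMP sketch of the array-sum example above as a data-parallel loop (the per-thread partial sums are merged by the reduction clause, which plays the role of the merging overhead mentioned above; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Each thread sums its own block of the array; the partial sums are then merged. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}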
Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization
of computer code across multiple processors in parallel computing environments. Task parallelism focuses
on distributing tasks—concurrently performed by processes or threads—across different processors. In
contrast to data parallelism which involves running the same task on different components of data, task
parallelism is distinguished by running many different tasks at the same time on the same data.[1] A common
type of task parallelism is pipelining which consists of moving a single set of data through a series of
separate tasks where each task can execute independently of the others.
Description
In a multiprocessor system, task parallelism is achieved when each processor executes a different thread (or
process) on the same or different data. The threads may execute the same or different code. In the general
case, different execution threads communicate with one another as they work, but this is not a requirement.
Communication usually takes place by passing data from one thread to the next as part of a workflow.
As a simple example, if a system is running code on a 2-processor system (CPUs "a" & "b") in a parallel
environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task "A" and CPU "b"
to do task "B" simultaneously, thereby reducing the run time of the execution. The tasks can be assigned
using conditional statements as described below.
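The original conditional pseudocode is not reproduced in these notes; a rough C equivalent, using the MPI process rank in place of the CPU names "a" and "b" (an assumption made purely for illustration), might look like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)              /* "CPU a" */
        printf("doing task A\n");
    else if (rank == 1)         /* "CPU b" */
        printf("doing task B\n");

    MPI_Finalize();             /* both tasks ran at the same time on different processors */
    return 0;
}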
Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as opposed
to the data (data parallelism). Most real programs fall somewhere on a continuum between task parallelism
and data parallelism.
Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at
once. This type of parallelism is found largely in applications written for commercial servers such as
databases. By running many threads at once, these applications are able to tolerate the high amounts of I/O
and memory system latency their workloads can incur - while one thread is delayed waiting for a memory or
disk access, other threads can do useful work.
The exploitation of thread-level parallelism has also begun to make inroads into the desktop market with the
advent of multi-core microprocessors. This has occurred because, for various reasons, it has become
increasingly impractical to increase either the clock speed or instructions per clock of a single core. If this
trend continues, new applications will have to be designed to utilize multiple threads in order to benefit from
the increase in potential computing power. This contrasts with previous microprocessor innovations in which
existing code was automatically sped up by running it on a newer/faster computer.
CONDITIONS OF PARALLELISM
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to the program statement (instructions), and
directed edges with different labels are used to represent the ordered relations among the
statements. The analysis of dependence graphs shows where opportunity exists for
parallelization and vectorization.
(a) Data dependence: The ordering relationship between statements is indicated by the
data dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow dependent on S1 if an execution path exists
from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input to S2.
(b) Control dependence: This refers to the situation where the order of execution of
statements cannot be determined before run time, for example in conditional statements,
where the flow of control depends on the outcome of the condition. Different paths taken after a
conditional branch may depend on the data, hence we need to eliminate this data
dependence among the instructions. This dependence also exists between operations
performed in successive iterations of a looping procedure. Control dependence often
prohibits parallelism from being exploited.
Control-independent example:
for (i = 0; i < n; i++) {
    a[i] = c[i];
}
(c) Resource dependence: Data and control dependencies are based on the independence of the work to be done.
Resource independence is concerned with conflicts in using shared resources, such as
registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU
dependence. Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions - 1
Bernstein's conditions are a set of conditions which must be satisfied if two processes are to
execute in parallel.
Notation
Ii is the set of all input variables for a process Pi. Ii is also called the read set or domain
of Pi. Oi is the set of all output variables for a process Pi. Oi is also called the write set of Pi.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
I1 ∩ O2 = Ø, I2 ∩ O1 = Ø, and O1 ∩ O2 = Ø.
Bernstein’s Conditions - 2
In terms of data dependencies, Bernstein's conditions imply that two processes can execute
in parallel if they are flow-independent, anti-independent, and output-independent. The
parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and
Pj || Pk does not imply Pi || Pk). Therefore, || is not an equivalence relation. Intersection of
the input sets is allowed.
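A small C illustration of the conditions (the statements and variable names are invented for the example): treat each assignment as a process with an input set I (variables read) and an output set O (variables written).

#include <stdio.h>

int main(void) {
    int x = 1, y = 2, z = 3;

    int a = x + y;   /* P1: I1 = {x, y}, O1 = {a} */
    int b = x * z;   /* P2: I2 = {x, z}, O2 = {b} */
    int c = a + b;   /* P3: I3 = {a, b}, O3 = {c} */

    /* P1 || P2 holds: I1∩O2, I2∩O1 and O1∩O2 are all empty (both read x,
       but intersection of the input sets is allowed).
       P1 || P3 fails: O1∩I3 = {a}, a flow dependence, so P3 must follow P1. */
    printf("a=%d b=%d c=%d\n", a, b, c);
    return 0;
}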
HARDWARE PARALLELISM
Hardware parallelism is defined by machine architecture and hardware multiplicity i.e.,
functional parallelism times the processor parallelism. It can be characterized by the number
of instructions that can be issued per machine cycle. If a processor issues k instructions per
machine cycle, it is called a k-issue processor. Conventional processors are one-issue
machines. This provides the user with information about the peak attainable performance.
Examples: the Intel i960CA is a three-issue processor (arithmetic, memory access, branch); the IBM
RS/6000 is a four-issue processor (arithmetic, floating-point, memory access, branch). A
machine with n k-issue processors should be able to handle a maximum of nk threads
simultaneously.
SOFTWARE PARALLELISM
Software parallelism is defined by the control and data dependence of programs, and is
revealed in the program's flow graph, i.e., it is defined by dependencies within the code and
is a function of algorithm, programming style, and compiler optimization.
It is named after computer scientist Gene Amdahl (a computer architect from IBM and Amdahl Corporation),
and was presented at the AFIPS Spring Joint Computer Conference in 1967. It is also known as Amdahl's
argument. It is a formula which gives the theoretical speedup in latency of the execution of a task at a fixed
workload that can be expected of a system whose resources are improved. In other words, it is a formula
used to find the maximum improvement possible by just improving a particular part of a system. It is often
used in parallel computing to predict the theoretical speedup when using multiple processors.
Speedup-
Speedup is defined as the ratio of performance for the entire task using the enhancement and performance for
the entire task without using the enhancement or speedup can be defined as the ratio of execution time for
the entire task without using the enhancement and execution time for the entire task using the enhancement.
If Pe is the performance for entire task using the enhancement when possible, Pw is the performance for
entire task without using the enhancement, Ew is the execution time for entire task without using the
enhancement and Ee is the execution time for entire task using the enhancement when possible then,
Speedup = Pe/Pw
or
Speedup = Ew/Ee
Amdahl’s law uses two factors to find speedup from some enhancement –
Fraction enhanced – The fraction of the computation time in the original computer that can be converted
to take advantage of the enhancement. For example- if 10 seconds of the execution time of a program that
takes 40 seconds in total can use an enhancement , the fraction is 10/40. This obtained value is Fraction
Enhanced.
Fraction enhanced is always less than 1.
Speedup enhanced – The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program. For example – If the
enhanced mode takes, say 3 seconds for a portion of the program, while it is 6 seconds in the original mode,
the improvement is 6/3. This value is Speedup enhanced.
Speedup Enhanced is always greater than 1.
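The two factors combine into the usual statement of Amdahl's law (the factor-of-4 speedup used in the numerical check below is an assumed value, not taken from the example above):
Speedup overall = 1 / [ (1 - Fraction enhanced) + (Fraction enhanced / Speedup enhanced) ]
For the 40-second example, Fraction enhanced = 10/40 = 0.25; if the enhanced portion runs 4 times faster, Speedup overall = 1 / (0.75 + 0.25/4) = 1 / 0.8125 ≈ 1.23.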
In computer architecture, Gustafson's law gives the theoretical speedup in latency of the execution of a task at fixed
execution time that can be expected of a system whose resources are improved. It is named after computer scientist
John L. Gustafson and his colleague Edwin H. Barsis, and was presented in the article Reevaluating Amdahl's Law in
1988.
Definition
Gustafson estimated the speedup S gained by using N processors (instead of just one) for a task with a serial fraction s
(which does not benefit from parallelism) as follows:
S = N + (1 - N)s
Using different variables, Gustafson's law can be formulated in the following way:
Slatency(s) = 1 - p + sp
where
Slatency is the theoretical speedup in latency of the execution of the whole task;
s is the speedup in latency of the execution of the part of the task that benefits from the improvement of the
resources of the system;
p is the percentage of the execution workload of the whole task concerning the part that benefits from the
improvement of the resources of the system before the improvement.
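As a numerical check of the second form (the values p = 0.75 and s = 4 are assumed for illustration): Slatency = 1 - p + sp = 1 - 0.75 + 4 × 0.75 = 3.25, i.e. improving the resources for the 75% of the workload that benefits gives a 3.25× speedup at fixed execution time.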
Gustafson's law addresses the shortcomings of Amdahl's law, which is based on the assumption of a fixed problem
size, that is of an execution workload that does not change with respect to the improvement of the resources.
Gustafson's law instead proposes that programmers tend to set the size of problems to fully exploit the computing
power that becomes available as the resources improve. Therefore, if faster equipment is available, larger problems
can be solved within the same time.
The impact of Gustafson's law was to shift research goals to select or reformulate problems so that solving a larger
problem in the same amount of time would be possible. In a way the law redefines efficiency, due to the possibility
that limitations imposed by the sequential part of a program may be countered by increasing the total amount of
computation.
METRICS
(a) Speedup:
The speedup is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem
to the time taken by the parallel algorithm to solve the same problem on P processors.
S=Ts/Tp
(b) Efficiency: The efficiency is defined as the ratio of speedup to the number of processors. Efficiency
measures the fraction of time for which a processor is usefully utilized.
E=S/p= Ts/pTp
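A numerical illustration (the values are assumed): if the best sequential algorithm takes Ts = 100 s and the parallel algorithm takes Tp = 30 s on p = 4 processors, then S = Ts/Tp = 100/30 ≈ 3.33 and E = S/p ≈ 0.83, i.e. each processor is usefully busy about 83% of the time.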
(c) Communication Overhead: Communication overhead can dramatically affect the performance of parallel
computations. Given the long latencies associated with accessing data stored in remote memories, computations
that repeatedly access remote data can easily spend most of their time communicating rather than performing
useful computation.
(d) Single/Multiple Program performances: There are many ways to measure the performance of a parallel
algorithm running on a parallel processor. The most commonly used measurements are the elapsed time,
price/performance, the speed-up, and the efficiency. The elapsed time to run a particular job on a given machine is
the most important metric.
UNIT-3
Uniform Memory Access (UMA)
In this model, all the processors share the physical memory uniformly. All the processors have equal access
time to all the memory words. Each processor may have a private cache memory. The same rule is followed for
the peripheral devices.
When all the processors have equal access to all the peripheral devices, the system is called a symmetric
multiprocessor. When only one or a few processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.
Non-Uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of the memory word. Here, the
shared memory is physically distributed among all the processors, called local memories. The collection of
all local memories forms a global address space which can be accessed by all the processors.
Cache Only Memory Architecture (COMA)
The COMA model is a special case of the NUMA model. Here, all the distributed main memories are
converted to cache memories.
DISTRIBUTED MEMORY MULTICOMPUTERS
A distributed memory multicomputer system consists of multiple computers, known as nodes, inter-
connected by a message-passing network. Each node acts as an autonomous computer having a processor, a
local memory and sometimes I/O devices. In this case, all local memories are private and are accessible only
to the local processors. This is why traditional multicomputers are called no-remote-memory-access
(NORMA) machines.
STATIC INTERCONNECTIONS
In a static network the interconnection pattern is fixed and permanent; there is a dedicated interconnection path
between two processing elements, and data communication has to follow a fixed route to reach the
destination processing element. Thus it consists of a number of point-to-point links. Topologies
in static networks can be classified according to the dimension required for layout, i.e., they can
be 1-D, 2-D, 3-D or hypercube.
One-dimensional topologies include the linear array, as shown in figure 2.2(a), used in some pipeline
architectures.
Various 2-D topologies are:
The ring (figure 2.2(b))
Star (figure 2.2(c))
Tree (figure 2.2(d))
Mesh (figure 2.2(e))
Systolic Array (figure 2.2(f))
3-D topologies include:
Completely connected (figure 2.2(g))
Chordal ring (figure 2.2(h))
3 cube (figure 2.2(i))
Figure 2.2 Static interconnection network topologies.
The torus is another popular network topology; it extends the mesh with wraparound connections. The figure below shows a 2-D torus. The torus is a symmetric topology, unlike the mesh, and the wraparound connections reduce the torus diameter while restoring the symmetry. It can be a
1-D torus
2-D torus
3-D torus
The torus topology is used in the Cray T3E.
Figure 2.3 Torus topology
We can have still higher-dimensional networks, for example the 3-cube-connected cycle. A D-dimensional, W-wide hypercube contains W nodes in each dimension, and there is a connection to a node in each dimension. The mesh and the cube architectures are actually 2-D and 3-D hypercubes respectively. The figure below shows a hypercube of dimension 4.
DYNAMIC INTERCONNECTIONS
Dynamic networks are networks in which the route through which data moves from one PE to another is established at the time the communication is performed. Usually all processing elements are equidistant, and an interconnection path is established through switches when two processing elements want to communicate. Such systems are more difficult to expand than static networks. Examples: bus-based, crossbar and multistage networks.
Routing is done by comparing the bit-level representations of the source and destination addresses: if the corresponding bits match, the data goes to the next stage via the pass-through (straight) setting; if they mismatch, it goes via the cross-over (exchange) setting of the switch.
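A minimal sketch in C of the bit-comparison routing rule just described (the stage count and the 8-input example are assumptions for illustration):
#include <stdio.h>

/* For each stage, compare the corresponding bits of source and destination:
   equal bits select the pass-through (straight) setting, unequal bits select
   the cross-over (exchange) setting of that stage's switch box. */
void route(unsigned src, unsigned dst, int stages) {
    for (int k = stages - 1; k >= 0; k--) {          /* most significant bit first */
        int s_bit = (src >> k) & 1;
        int d_bit = (dst >> k) & 1;
        printf("stage %d: %s\n", stages - 1 - k,
               (s_bit == d_bit) ? "straight (bits match)"
                                : "exchange (bits differ)");
    }
}

int main(void) {
    route(3 /* source 011 */, 6 /* destination 110 */, 3);   /* 8-input network */
    return 0;
}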
There are two classes of dynamic networks, namely:
single stage networks
multistage networks
Single Stage Networks
A single stage switching network has N input selectors (IS) and N output selectors (OS). At each network stage there is a 1-to-D demultiplexer corresponding to each IS, such that 1 < D < N, and each OS is an M-to-1 multiplexer, such that 1 < M <= N. A crossbar network is a single stage network with D = M = N. In order to establish a desired connecting path, different path-control signals are applied to all IS and OS selectors. The single stage network is also called a recirculating network, because a data item may have to recirculate through the single stage several times before reaching its final destination. The number of recirculations depends on the connectivity of the single stage network; in general, the higher the hardware connectivity, the smaller the number of recirculations. In a crossbar network only one pass is needed to establish the connection path. The cost of a completely connected crossbar network is O(N²), which is very high compared with most other recirculating networks, which have cost O(N log N) or lower and hence are more cost-effective for large values of N.
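As an illustrative calculation (not from these notes): for N = 64, a crossbar needs on the order of 64 × 64 = 4096 crosspoints, while a shuffle-exchange-based multistage network needs about (N/2)·log2 N = 32 × 6 = 192 two-by-two switch boxes, which is why the O(N log N) designs scale better in cost.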
Multistage Networks
Many stages of interconnected switches form a multistage SIMD network. It basically consists of three characteristic features:
The switch box
The network topology
The control structure
Each switch box is essentially an interchange device with two inputs and two outputs. The four possible states of a switch box, shown in figure 3.6, are:
Straight
Exchange
Upper broadcast
Lower broadcast
A two-function switch box can assume only two possible states, namely the straight or exchange state, whereas a four-function switch box can be set to any of the four possible states. A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network.
Figure 2.5 A two-by-two switching box and its four interconnection states
Well-known multistage interconnection networks include:
Banyan
Baseline
Cube
Delta
Flip
Indirect cube
Omega
Multistage networks can be of two types:
One-sided networks: also called full switches, having input and output ports on the same side.
Two-sided multistage networks: which have an input side and an output side. They can be further divided into three classes:
o Blocking: In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of blocking networks are the data manipulator, flip, n-cube, omega and baseline networks. All multistage networks that are based on shuffle-exchange elements are blocking networks, because it is not always possible to make all the required input-output connections at the same time, as one path might block another. Figure 2.6(a) shows an omega network.
o Rearrangeable: A rearrangeable network can perform all possible connections between inputs and outputs by rearranging its existing connections, so that a connection path for a new input-output pair can always be established. An example of this network topology is the Benes network (see figure 2.6(b), showing an 8×8 Benes network), which supports synchronous data permutation and asynchronous interprocessor communication.
o Non-blocking: A non-blocking network can handle all possible connections without blocking. There are two possible cases: the first is the Clos network (see figure 2.6(c)), where a one-to-one connection is made between an input and an output. The other case, one-to-many connections, can be obtained by using crossbars instead of the shuffle-exchange elements. The crossbar switch network can connect every input port to a free output port without blocking.
A single stage recirculating network has been implemented in the ILLIAC-IV array with N = 64 PEs. In this mesh network the nodes are arranged as a q-dimensional lattice, and only neighbouring nodes are allowed to communicate data in one step, i.e. each PEi is allowed to send data to any one of PE(i+1), PE(i-1), PE(i+r) and PE(i-r), where r = sqrt(N) (in the case of the ILLIAC-IV, r = 8). In a periodic mesh, nodes on the edge of the mesh have wrap-around connections to nodes on the other side; this is also called a toroidal mesh.
SHARED MEMORY PROGRAMMING
In software, shared memory is either:
a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time, where one process creates an area in RAM which other processes can access; or
a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for Execute in Place (XIP).
Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory.
Since both processes can access the shared memory area like regular working memory, this is a very fast way of
communication (as opposed to other mechanisms of IPC such as named pipes, Unix domain sockets or CORBA). On
the other hand, it is less scalable, as for example the communicating processes must be running on the same machine
(of other IPC methods, only Internet domain sockets—not Unix domain sockets—can use a computer network), and
care must be taken to avoid issues if processes sharing memory are running on separate CPUs and the underlying
architecture is not cache coherent.
IPC by shared memory is used for example to transfer images between the application and the X server on Unix
systems, or inside the IStream object returned by CoMarshalInterThreadInterfaceInStream in the COM libraries under
Windows.
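A minimal sketch of shared-memory IPC using the POSIX API (the region name and size are arbitrary assumptions; error handling is omitted; on Linux, link with -lrt if needed):
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/demo_region";               /* hypothetical region name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600); /* create/open the region   */
    ftruncate(fd, 4096);                             /* set its size             */

    /* both processes that map the same name see the same physical pages */
    char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(mem, "hello from process A");             /* immediately visible to the other process */

    munmap(mem, 4096);
    close(fd);
    shm_unlink(name);                                /* remove the region */
    return 0;
}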
Dynamic libraries are generally held in memory once and mapped to multiple processes, and only pages that had to be
customized for the individual process (because a symbol resolved differently there) are duplicated, usually with a
mechanism known as copy-on-write that transparently copies the page when a write is attempted, and then lets the
write succeed on the private copy.
DISTRIBUTED MEMORY PROGRAMMING
It refers to a multiprocessor computer system in which each processor has its own private memory.
Computational tasks can only operate on local data, and if remote data is required, the computational task
must communicate with one or more remote processors. In contrast, a shared memory multiprocessor offers
a single memory space used by all processors. Processors do not have to be aware where data resides, except
that there may be performance penalties, and that race conditions are to be avoided.
In a distributed memory system there is typically a processor, a memory, and some form of interconnection
that allows programs on each processor to interact with each other. The interconnect can be organised with
point to point links or separate hardware can provide a switching network. The network topology is a key
factor in determining how the multiprocessor machine scales. The links between nodes can be implemented
using some standard network protocol (for example Ethernet), using bespoke network links (used in for
example the Transputer), or using dual-ported memories.
The key issue in programming distributed memory systems is how to distribute the data over the
memories. Depending on the problem solved, the data can be distributed statically, or it can be moved
through the nodes. Data can be moved on demand, or data can be pushed to the new nodes in advance.
As an example, if a problem can be described as a pipeline where data x is processed subsequently through
functions f, g, h, etc. (the result is h(g(f(x)))), then this can be expressed as a distributed memory problem
where the data is transmitted first to the node that performs f that passes the result onto the second node that
computes g, and finally to the third node that computes h. This is also known as systolic computation.
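A minimal sketch of this three-stage pipeline using MPI message passing (MPI is an assumption here, as these notes do not name a particular library; f, g and h are placeholder functions). Run with at least three processes, e.g. mpirun -np 3 ./pipeline:
#include <mpi.h>
#include <stdio.h>

static double f(double x) { return x + 1.0; }   /* placeholder pipeline stages */
static double g(double x) { return x * 2.0; }
static double h(double x) { return x - 3.0; }

int main(int argc, char **argv) {
    int rank;
    double value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* first node: compute f(x) and pass it on */
        value = f(10.0);
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* second node: compute g(f(x)) */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value = g(value);
        MPI_Send(&value, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {                /* third node: compute h(g(f(x))) */
        MPI_Recv(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("result = %f\n", h(value));
    }
    MPI_Finalize();
    return 0;
}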
Data can be kept statically in nodes if most computations happen locally, and only changes on edges have to
be reported to other nodes. An example of this is simulation where data is modeled using a grid, and each
node simulates a small part of the larger grid. On every iteration, nodes inform all neighboring nodes of the
new edge data.
A data parallel job on an array of n elements can be divided equally among all the processors. Let us assume we want
to sum all the elements of the given array and the time for a single addition operation is Ta time units. In the case of
sequential execution, the time taken by the process will be n×Ta time units as it sums up all the elements of an array.
On the other hand, if we execute this job as a data parallel job on 4 processors the time taken would reduce to (n/4)×Ta
+ merging overhead time units. Parallel execution results in a speedup of 4 over sequential execution. One important
thing to note is that the locality of data references plays an important part in evaluating the performance of a data
parallel programming model. Locality of data depends on the memory accesses performed by the program as well as
the size of the cache.
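A minimal sketch of the data-parallel array sum in C with OpenMP (an assumption; the division of iterations across processors and the merging of partial sums are handled by the reduction clause):
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;       /* fill the array */

    #pragma omp parallel for reduction(+:sum)     /* each thread sums its chunk; partial sums are merged */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}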
Below is the sequential pseudo-code for multiplication and addition of two matrices where the result is stored in the
matrix C. The pseudo-code for multiplication calculates the dot product of two matrices A, B and stores the result into
the output matrix C.
If the following programs were executed sequentially, the time taken to calculate the result would be of the order of O(n³) for the multiplication and O(n) for the addition (assuming the row and column lengths of both matrices are n).
// Matrix multiplication
for (i=0; i<row_length_A; i++)
{
for (k=0; k<column_length_B; k++)
{
sum = 0;
for (j=0; j<column_length_A; j++)
{
sum += A[i][j]*B[j][k];
}
C[i][k] = sum;
}
}
// Array addition
for (i=0; i<n; i++) {
c[i] = a[i]+b[i];
}
We can exploit data parallelism in the preceding codes to execute them faster, as the arithmetic is loop-independent. Parallelization of the matrix multiplication code is achieved by using OpenMP. An OpenMP directive, "omp parallel for", instructs the compiler to execute the code in the for loop in parallel. For multiplication, we can divide matrices A and B into blocks along rows and columns respectively. This allows us to calculate every element in matrix C individually, thereby making the task parallel. For example, A[m x n] dot B[n x k] can be finished in O(n) instead of O(m·n·k) time when executed in parallel using m·k processors.
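A hedged sketch of that parallelization: placing the "omp parallel for" directive on the outer loop lets different threads compute different rows of C. The small 3 x 3 matrices are assumptions for illustration only:
#include <omp.h>
#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    double C[N][N];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)            /* each thread computes whole rows of C */
        for (int k = 0; k < N; k++) {
            double sum = 0.0;              /* sum and j are private to each thread */
            for (int j = 0; j < N; j++)
                sum += A[i][j] * B[j][k];
            C[i][k] = sum;
        }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}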
For addition of arrays in a data parallel implementation, let’s assume a more modest system with two central
processing units (CPU) A and B, CPU A could add all elements from the top half of the arrays, while CPU B could
add all elements from the bottom half of the arrays. Since the two processors work in parallel, the job of performing
array addition would take one half the time of performing the same operation in serial using one CPU alone.
The program expressed in pseudocode below—which applies some arbitrary operation, foo, on every element in the
array d—illustrates data parallelism:
if CPU = "a" then
    lower_limit := 1
    upper_limit := round(d.length / 2)
else if CPU = "b" then
    lower_limit := round(d.length / 2) + 1
    upper_limit := d.length

for i from lower_limit to upper_limit by 1 do
    foo(d[i])
Data parallelism emphasizes the distributed (parallel) nature of the data, as opposed to the processing (task
parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.
FUNCTIONAL PROGRAMMING
Programming Languages that support functional programming: Haskell, JavaScript, Scala, Erlang, Lisp, ML,
Clojure, OCaml, Common Lisp, Racket.
Concepts of functional programming:
Pure functions
Recursion
Referential transparency
Functions are First-Class and can be Higher-Order
Variables are Immutable
Pure functions: These functions have two main properties. First, they always produce the same output for the same arguments, irrespective of anything else.
Secondly, they have no side-effects, i.e. they do not modify any arguments, global variables or output streams.
The latter property is called immutability. A pure function's only result is the value it returns; pure functions are deterministic.
Programs written using functional programming are easy to debug because pure functions have no side effects or hidden I/O. Pure functions also make it easier to write parallel/concurrent applications. When code is written in this style, a smart compiler can do many things – it can parallelize the instructions, wait to evaluate results until they are needed, and memoize the results, since the results never change as long as the input doesn't change.
Example of a pure function:
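A small illustrative sketch in C (C is used only for consistency with the other examples in these notes): square() is pure because its result depends only on its argument, while impure_square() is not, because it also modifies a global counter:
#include <stdio.h>

int square(int x) {
    return x * x;                 /* same input always gives the same output, no side effects */
}

int call_count = 0;               /* global state */

int impure_square(int x) {
    call_count++;                 /* side effect: modifies a global variable */
    return x * x;
}

int main(void) {
    printf("%d %d\n", square(5), impure_square(5));   /* both print 25, but only square is pure */
    return 0;
}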
Recursion: There are no "for" or "while" loops in functional languages. Iteration in functional languages is implemented through recursion. Recursive functions repeatedly call themselves until they reach the base case.
Example of a recursive function:
fib(n)
if (n <= 1)
return 1;
else
return fib(n - 1) + fib(n - 2);
Referential transparency: In functional programs variables once defined do not change their value throughout the
program. Functional programs do not have assignment statements. If we have to store some value, we define new
variables instead. This eliminates any chances of side effects because any variable can be replaced with its actual value
at any point of execution. State of any variable is constant at any instant.
example:
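A small illustrative sketch (an assumption, since the original example is missing from these notes): because square() above is pure, the expression square(4) can be replaced by its value 16 anywhere without changing the program, and new bindings are introduced instead of reassigning old ones:
#include <stdio.h>

int square(int x) { return x * x; }

int main(void) {
    const int a = square(4);   /* a is bound to 16 and never changes               */
    const int b = a + 1;       /* a new variable is defined instead of mutating a  */
    printf("%d %d\n", a + 1, 16 + 1);   /* square(4)+1 and 16+1 are interchangeable: both print 17 */
    printf("%d\n", b);
    return 0;
}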
Functions are First-Class and can be Higher-Order: First-class functions are treated like first-class variables: they can be passed to functions as parameters, returned from functions, or stored in data structures.
Higher-order functions are functions that take other functions as arguments and/or return functions.
example:
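A small illustrative sketch using C function pointers (an assumption, since the original example is missing): apply_twice() is higher-order in the sense that it takes another function as a parameter:
#include <stdio.h>

int increment(int x) { return x + 1; }

/* takes a function as a parameter and applies it twice */
int apply_twice(int (*fn)(int), int x) {
    return fn(fn(x));
}

int main(void) {
    printf("%d\n", apply_twice(increment, 5));   /* prints 7 */
    return 0;
}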
Variables are Immutable: In functional programming, we can’t modify a variable after it’s been initialized. We can
create new variables – but we can’t modify existing variables, and this really helps to maintain state throughout the
runtime of a program. Once we create a variable and set its value, we can have full confidence knowing that the value
of that variable will never change.
Advantages:
1. Pure functions are easier to understand because they don’t change any states and depend only on the input
given to them. Whatever output they produce is the return value they give. Their function signature gives all
the information about them i.e. their return type and their arguments.
2. The ability of functional programming languages to treat functions as values and pass them to functions as
parameters make the code more readable and easily understandable.
3. Testing and debugging is easier. Since pure functions take only arguments and produce output, they don't make any hidden changes, take hidden input or produce hidden output. They use immutable values, so it becomes easier to check for problems in programs written using pure functions.
4. It is used to implement concurrency/parallelism because pure functions don’t change variables or any other
data outside of it.
5. It adopts lazy evaluation which avoids repeated evaluation because the value is evaluated and stored only
when it is needed.
Disadvantages:
Applications:
DATAFLOW PROGRAMMING
Traditionally, a program is modelled as a series of operations happening in a specific order; this may be
referred to as sequential, procedural, control flow (indicating that the program chooses a specific path), or
imperative programming. The program focuses on commands, in line with the von Neumann vision of
sequential programming, where data is normally "at rest".
In contrast, dataflow programming emphasizes the movement of data and models programs as a series of
connections. Explicitly defined inputs and outputs connect operations, which function like black boxes. An
operation runs as soon as all of its inputs become valid. Thus, dataflow languages are inherently parallel and
can work well in large, decentralized systems.
State:
One of the key concepts in computer programming is the idea of state, essentially a snapshot of various
conditions in the system. Most programming languages require a considerable amount of state information,
which is generally hidden from the programmer. Often, the computer itself has no idea which piece of
information encodes the enduring state. This is a serious problem, as the state information needs to be shared
across multiple processors in parallel processing machines. Most languages force the programmer to add
extra code to indicate which data and parts of the code are important to the state. This code tends to be both
expensive in terms of performance, as well as difficult to read or debug. Explicit parallelism is one of the
main reasons for the poor performance of Enterprise Java Beans when building data-intensive, non-OLTP
application.
Where a sequential program can be imagined as a single worker moving between tasks (operations), a
dataflow program is more like a series of workers on an assembly line, each doing a specific task whenever
materials are available. Since the operations are only concerned with the availability of data inputs, they
have no hidden state to track, and are all "ready" at the same time.
Representation:
Dataflow programs are represented in different ways. A traditional program is usually represented as a series
of text instructions, which is reasonable for describing a serial system which pipes data between small,
single-purpose tools that receive, process, and return. Dataflow programs start with an input, perhaps the
command line parameters, and illustrate how that data is used and modified. The flow of data is explicit,
often visually illustrated as a line or pipe.
In terms of encoding, a dataflow program might be implemented as a hash table, with uniquely identified
inputs as the keys, used to look up pointers to the instructions. When any operation completes, the program
scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs
it. When that operation finishes, it will typically output data, thereby making another operation become
valid.
For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of
maintaining state is removed from the programmer and given to the language's runtime. On machines with a
single processor core where an implementation designed for parallel operation would simply introduce
overhead, this overhead can be removed completely by using a different runtime.
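A minimal sketch in C of the scheduling idea described above (the operation table, names and dependency counts are assumptions for illustration): each operation records how many inputs are still missing; finishing an operation makes its dependents' inputs valid, and the scheduler repeatedly runs any operation whose inputs are all valid:
#include <stdio.h>

#define MAX_DEPS 4

typedef struct Op {
    const char *name;
    int missing_inputs;                 /* inputs not yet valid   */
    int done;                           /* already executed?      */
    struct Op *dependents[MAX_DEPS];    /* ops fed by this output */
    int n_dependents;
} Op;

static void run(Op *op) {
    printf("running %s\n", op->name);
    op->done = 1;
    for (int i = 0; i < op->n_dependents; i++)
        op->dependents[i]->missing_inputs--;   /* this output makes another op valid */
}

int main(void) {
    /* hypothetical dataflow graph: a and b feed c; c feeds d */
    Op d = {"d", 1, 0, {0}, 0};
    Op c = {"c", 2, 0, {&d}, 1};
    Op a = {"a", 0, 0, {&c}, 1};
    Op b = {"b", 0, 0, {&c}, 1};
    Op *ops[] = {&a, &b, &c, &d};
    int n = 4, remaining = 4;

    /* scan the list for the first operation whose inputs are all valid, run it, repeat */
    while (remaining > 0) {
        for (int i = 0; i < n; i++) {
            if (!ops[i]->done && ops[i]->missing_inputs == 0) {
                run(ops[i]);
                remaining--;
                break;
            }
        }
    }
    return 0;
}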
Dataflow programming languages include:
ASCET
AviSynth scripting language, for video processing
BMDFM Binary Modular Dataflow Machine
CAL
Cuneiform, a functional workflow language.
CMS Pipelines
Hume
Joule (programming language)
Keysight VEE
KNIME is a free and open-source data analytics, reporting and integration platform
LabVIEW, G[3]
Linda
Lucid[2]
Lustre
Max/MSP
Microsoft Visual Programming Language - A component of Microsoft Robotics Studio designed for
robotics programming
Orange - An open-source, visual programming tool for data mining, statistical data analysis, and
machine learning.
Oz now also distributed since 1.4.0
Pipeline Pilot
Prograph
Pure Data
Quartz Composer - Designed by Apple; used for graphic animations and effects
SAC Single assignment C
SIGNAL (a dataflow-oriented synchronous language enabling multi-clock specifications)
Simulink
SISAL
SystemVerilog - A hardware description language
Verilog - A hardware description language absorbed into the SystemVerilog standard in 2009
VHDL - A hardware description language
VSXu
XEE (Starlight) XML engineering environment
XProc
Dataflow libraries include:
Apache Beam: Java/Scala SDK that unifies streaming (and batch) processing with several execution
engines supported (Spark, Flink, Google dataflow...)
Apache Flink: Java/Scala library that allows streaming (and batch) computations to be run atop a
distributed Hadoop (or other) cluster
SystemC: Library for C++, mainly aimed at hardware design.
TensorFlow: A machine-learning library based on dataflow programming.
UNIT-4
LOOP PARALLELIZATION AND PIPELINING
PDF ATTACHED
SOFTWARE PIPELINING
Software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software
pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand
written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit
support for software pipelining, notably Intel's IA-64 architecture.
It is important to distinguish software pipelining, which is a target code technique for overlapping loop iterations, from
modulo scheduling, the currently most effective known compiler technique for generating software pipelined loops.
Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism
since such architectures existed. Effective compiler generation of such code dates to the invention of modulo
scheduling by Rau and Glaeser. Lam showed that special hardware is unnecessary for effective modulo scheduling.
Her technique, modulo variable expansion is widely used in practice.
Example
for i = 1 to bignumber
A(i)
B(i)
C(i)
end
In this example, let A(i), B(i), C(i) be instructions, each operating on data i, that are dependent on each other. In other
words, A(i) must complete before B(i) can start. For example, A could load data from memory into a register, B could
perform some arithmetic operation on the data, and C could store the data back into memory. However, let there be no
dependence between operations for different values of i. In other words, A(2) can begin before A(1) finishes.
Without software pipelining, the operations execute in the following sequence:
A(1) B(1) C(1) A(2) B(2) C(2) A(3) B(3) C(3) ...
Assume that each instruction takes 3 clock cycles to complete (ignore for the moment the cost of the looping control
flow). Also assume (as is the case on most modern systems) that an instruction can be dispatched every cycle, as long
as it has no dependencies on an instruction that is already executing. In the unpipelined case, each iteration thus takes 9
cycles to complete: 3 clock cycles for A(1), 3 clock cycles for B(1), and 3 clock cycles for C(1).
With software pipelining, the reordered sequence of instructions is:
A(1) A(2) A(3) B(1) B(2) B(3) C(1) C(2) C(3) ...
It can easily be verified that an instruction can be dispatched every cycle, so the same 3 iterations complete in a total of 11 cycles (the last C is dispatched in cycle 9 and finishes in cycle 11), an average of about 3.7 cycles per iteration; for a long loop the average approaches 3 cycles per iteration instead of 9.
Implementation
Software pipelining is often used in combination with loop unrolling, and this combination of techniques is often a far
better optimization than loop unrolling alone. In the example above, we could write the code as follows (assume for
the moment that bignumber is divisible by 3):
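The unrolled code itself is missing from these notes; a plausible reconstruction, grouping the unrolled A, B and C operations so that independent copies can overlap, is:
for i = 1 to (bignumber - 2) step 3
    A(i)
    A(i+1)
    A(i+2)
    B(i)
    B(i+1)
    B(i+2)
    C(i)
    C(i+1)
    C(i+2)
end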
Of course, matters are complicated if (as is usually the case) we can't guarantee that the total number of iterations will
be divisible by the number of iterations we unroll. In the general case, loop unrolling may not be the best way to
implement software pipelining. Consider a loop containing instructions with a high latency. For example, the
following code:
for i = 1 to bignumber
A(i) ; 3 cycle latency
B(i) ; 3
C(i) ; 12(perhaps a floating point operation)
D(i) ; 3
E(i) ; 3
F(i) ; 3
end
would require 12 iterations of the loop to be unrolled to avoid the bottleneck of instruction C. This means that the code
of the loop would increase by a factor of 12 (which not only affects memory usage, but can also affect cache
performance, see code bloat). Even worse, the prologue (code before the loop for handling the case of bignumber not
divisible by 12) will likely be even larger than the code for the loop, and very probably inefficient because software
pipelining cannot be used in this code (at least not without a significant amount of further code bloat). Furthermore, if
bignumber is expected to be moderate in size compared to the number of iterations unrolled (say 10-20), then the
execution will spend most of its time in this inefficient prologue code, rendering the software pipelining optimization
ineffectual.
By contrast, here is the software pipelining for our example (the prologue and epilogue will be explained later):
prologue
for i = 1 to (bignumber - 6)
A(i+6)
B(i+5)
C(i+4)
D(i+2) ; note that we skip i+3
E(i+1)
F(i)
end
epilogue
Before getting to the prologue and epilogue, which handle iterations at the beginning and end of the loop, let's verify
that this code does the same thing as the original for iterations in the middle of the loop. Specifically, consider iteration
7 in the original loop. The first iteration of the pipelined loop will be the first iteration that includes an instruction from
iteration 7 of the original loop. The sequence of instructions is:
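The listing is missing from these notes; reconstructed from the pipelined loop body above (each line is one iteration of the pipelined loop), the sequence is:
iteration 1: A(7) B(6) C(5) D(3) E(2) F(1)
iteration 2: A(8) B(7) C(6) D(4) E(3) F(2)
iteration 3: A(9) B(8) C(7) D(5) E(4) F(3)
iteration 4: A(10) B(9) C(8) D(6) E(5) F(4)
iteration 5: A(11) B(10) C(9) D(7) E(6) F(5)
iteration 6: A(12) B(11) C(10) D(8) E(7) F(6)
iteration 7: A(13) B(12) C(11) D(9) E(8) F(7)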
However, unlike the original loop, the pipelined version avoids the bottleneck at instruction C. Note that there are 12
instructions between C(7) and the dependent instruction D(7), which means that the latency cycles of instruction C(7)
are used for other instructions instead of being wasted.
The prologue and epilogue handle iterations at the beginning and end of the loop. Here is a possible prologue for our
example above:
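The prologue listing is missing from these notes; a reconstruction consistent with the pipelined loop body A(i+6) B(i+5) C(i+4) D(i+2) E(i+1) F(i) is:
A(1)
A(2) B(1)
A(3) B(2) C(1)
A(4) B(3) C(2)
A(5) B(4) C(3) D(1)
A(6) B(5) C(4) D(2) E(1)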
Each line above corresponds to an iteration of the main pipelined loop, but without the instructions for iterations that
have not yet begun. Similarly, the epilogue progressively removes instructions for iterations that have completed:
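The epilogue listing is also missing; the matching reconstruction, dropping instructions whose iteration index would exceed bignumber, is:
B(bignumber) C(bignumber-1) D(bignumber-3) E(bignumber-4) F(bignumber-5)
C(bignumber) D(bignumber-2) E(bignumber-3) F(bignumber-4)
D(bignumber-1) E(bignumber-2) F(bignumber-3)
D(bignumber) E(bignumber-1) F(bignumber-2)
E(bignumber) F(bignumber-1)
F(bignumber)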
Latency
Latency is the time required for communication between different subsystems in a computer.
Memory latency, for example, is the time required by a processor to access memory.
Synchronization latency is the time required for two processes to synchronize their execution.
Computational granularity and communication latency are closely related; some general observations are:
As grain size decreases, potential parallelism increases, and overhead also increases.
Overhead is the cost of parallelizing a task. The principal overhead is communication latency.
As grain size is reduced, there are fewer operations between communications, and hence the impact of latency increases.
Surface-to-volume effect: as grain size grows, intra-node computation (the "volume") grows faster than inter-node communication (the "surface").
Levels of Parallelism:
Instruction Level: This fine-grained, or smallest, granularity level typically involves fewer than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements being the average level of parallelism.
Advantages:
There are usually many candidates for parallel execution
Compilers can usually do a reasonable job of finding this parallelism
Loop Level: A typical loop has fewer than 500 instructions. If a loop's iterations are independent of one another, the loop can be handled by a pipeline or by a SIMD machine. Loops are the most optimized program construct to execute on a parallel or vector machine, though some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Subprogram/Job Level: A grain at the job-step level typically has thousands of instructions (medium- or coarse-grain). Job steps can overlap across different jobs, and multiprogramming is conducted at this level. No compilers are currently available to exploit medium- or coarse-grain parallelism.
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies attributed to
machine architecture, technology, and communication patterns used. Latency imposes a limiting
factor on machine scalability. Ex. Memory latency increases as memory capacity increases,
limiting the amount of memory that can be used with a given tolerance for communication
latency.
Communication Patterns
A program graph is similar to a dependence graph. Nodes = { (n, s) }, where n = node name and s = size (a larger s means a larger grain size).
Edges = { (v, d) }, where v = the variable being "communicated" and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to
other nodes. Packing is done to eliminate unnecessary communication delays or reduce overall
scheduling overhead.
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay
requirements are observed, and no two nodes are executing on the same processor at the same
time. Some general scheduling goals
Schedule all fine-grain activities in a node to the same processor to minimize
communication delays.
Select grain sizes for packing to achieve better schedules for a particular parallel
machine.
Node Duplication:
Grain packing may potentially eliminate interprocessor communication, but it may not always
produce a shorter schedule. By duplicating nodes (that is, executing some instructions on multiple
processors), we may eliminate some interprocessor communication, and thus
produce a shorter schedule.
Scheduling and allocation is a highly important issue, since an inappropriate scheduling of tasks can fail to exploit the true potential of the system and can offset the gain from parallelization. Here we focus on the scheduling aspect. The objective of scheduling is to minimize the
completion time of a parallel application by properly
allocating the tasks to the processors. In a broad sense, the scheduling problem exists in two
forms: static and dynamic. In static scheduling, which is usually done at compile time, the
characteristics of a parallel program (such as task processing times, communication, data
dependencies, and synchronization requirements) are known before program execution
A parallel program, therefore, can be represented by a node- and edge-weighted directed acyclic
graph (DAG), in which the node weights represent task processing times and the edge weights
represent data dependencies as well as the communication times between tasks. In dynamic scheduling, only a few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on-the-fly. The goal of a dynamic scheduling
algorithm as such includes not only the minimization of the program completion time but also the
minimization of the scheduling overhead which constitutes a significant portion of the cost paid
for running the scheduler. In general dynamic scheduling is an NP hard problem.
Conventional machines use a control-flow mechanism in which the order of program execution is explicitly stated in the user program. Dataflow machines execute instructions as soon as their operands are available, and reduction machines trigger an instruction's execution based on the demand for its results.
In control-flow computers the next instruction is executed when the last instruction, as stored in the program, has been executed, whereas in dataflow computers an instruction is executed when the data (operands) required for that instruction are available. Control-flow machines use shared memory for instructions and data. Since variables are updated by many instructions, there may be side effects on other instructions; these side effects frequently prevent parallel processing. Single-processor systems are inherently sequential. Instructions in dataflow machines are unordered and can be executed as soon as their operands are available; data is held in the instructions themselves. Data tokens are passed from an instruction to its dependents to trigger execution.
Data Flow Features:
No need for: shared memory, a program counter, or a control sequencer.
Special mechanisms are required to: detect data availability, match data tokens with the instructions needing them, and enable the chain reaction of asynchronous instruction execution.
A Dataflow Architecture: The Arvind machine (MIT) has N PEs and an N-by-N interconnection network. Each PE has a token-matching mechanism that dispatches only instructions with data tokens available. Each datum is tagged with:
the address of the instruction to which it belongs
the context in which the instruction is being executed
Tagged tokens enter PE through local path (pipelined), and can also be communicated to other
PEs through the routing network. Instruction address(es) effectively replace the program counter
in a control flow machine. Context identifier effectively replaces the frame base register in a
control flow machine. Since the dataflow machine matches the data tags from one instruction
with successors, synchronized instruction execution is implicit.
An I-structure in each PE is provided to eliminate excessive copying of data structures. Each
word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has
pending read requests.
This is a retreat from the pure dataflow approach. Special compiler technology is needed for dataflow machines.
Demand-Driven Mechanisms:
Data-driven machines select instructions for execution based on the availability of their operands;
this is essentially a bottom-up approach. Demand-driven machines take a top-down approach,
attempting to execute the instruction (a demander) that yields the final result. This triggers the
execution of instructions that yield its operands, and so forth. The demand-driven approach
matches naturally with functional programming languages (e.g. LISP and SCHEME). Pattern-driven computers: an instruction is executed when we obtain particular data patterns as output. There are two types of pattern-driven computers:
(a) String-reduction model: each demander gets a separate copy of the expression string to evaluate. Each reduction step has an operator and embedded references to demand the corresponding operands; each operator is suspended while its arguments are evaluated.
(b) Graph-reduction model: the expression graph is reduced by evaluation of branches or subgraphs, possibly in parallel, with demanders given pointers to the results of the reductions. It is based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.
LOOP SCHEDULING
In parallel computing, loop scheduling is the problem of assigning proper iterations of parallelizable loops
among n processors to achieve load balancing and maintain data locality with minimum dispatch
overhead.
static even scheduling: evenly divide loop iteration space into n chunks and assign each chunk to
a processor
dynamic scheduling: a chunk of loop iteration is dispatched at runtime by an idle processor.
When the chunk size is 1 iteration, it is also called self-scheduling.
guided scheduling: similar to dynamic scheduling, but the chunk sizes per dispatch keep
shrinking until reaching a preset value.
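A minimal sketch of the three strategies using OpenMP's schedule clause (OpenMP is an assumption here; work() is a placeholder per-iteration task):
#include <omp.h>
#include <stdio.h>

static void work(int i) { (void)i; /* placeholder per-iteration task */ }

int main(void) {
    const int n = 1000;

    #pragma omp parallel for schedule(static)      /* even, predetermined division of the iteration space */
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(dynamic, 1)  /* idle threads grab 1 iteration at a time (self-scheduling) */
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(guided, 4)   /* chunk sizes shrink until they reach the preset value 4 */
    for (int i = 0; i < n; i++) work(i);

    printf("done\n");
    return 0;
}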
Automatic parallelization, also auto parallelization, autoparallelization, or parallelization, the last one of
which implies automation when used in context, refers to converting sequential code into multi-threaded
or vectorized (or even both) code in order to utilize multiple processors simultaneously in a shared-
memory multiprocessor (SMP) machine. The goal of automatic parallelization is to relieve programmers
from the hectic and error-prone manual parallelization process.[1] Though the quality of automatic
parallelization has improved in the past several decades, fully automatic parallelization of sequential
programs by compilers remains a grand challenge due to its need for complex program analysis and the
unknown factors (such as input data range) during compilation.[2]
The programming control structures on which autoparallelization places the most focus are loops,
because, in general, most of the execution time of a program takes place inside some form of loop. There
are two main approaches to parallelization of loops: pipelined multi-threading and cyclic multi-
threading.[3]
For example, consider a loop that on each iteration applies a hundred operations, and runs for a thousand
iterations. This can be thought of as a grid of 100 columns by 1000 rows, a total of 100,000 operations.
Cyclic multi-threading assigns each row to a different thread. Pipelined multi-threading assigns each
column to a different thread.