Slides
Module 1
Introduction to Parallel and Distributed Computing.
Weekly Learning Outcomes
1. What is parallel or distributed computing and why is it necessary?
Required Reading
Chapter 1, Introduction to Parallel and Distributed Computing: (A. Grama, A. Gupta, G.
Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.)
Recommended Reading:
Introduction to parallel and distributed processing:
https://www.youtube.com/watch?v=RNVIcm8-6RE
Implicit parallelism https://www.youtube.com/watch?v=9_uX7L7-7ME
https://www.cs.rochester.edu/users/faculty/sandhya/csc258/
What is Parallel Computing:
Parallel computing is the process of running an application or computation on multiple
processors simultaneously. In general, it refers to a type of computing architecture in
which a large problem is divided into separate, smaller, typically related parts that can
be processed at the same time. Multiple CPUs work together on the problem, exchanging
information through shared memory, and the partial results are then combined. Parallel
computing facilitates the execution of complex computations by distributing an enormous
problem among multiple processors.
In distributed computing, a user sees a single system that is actually made up of several
autonomous computers. There is no shared memory in distributed systems; machines
communicate with one another by exchanging messages. A single task is split up among
several computers in distributed computing.
Types of parallel computing
1. Bit-level parallelism: The form of parallel computing based on increasing the processor
word size. When operating on large data items, a wider word reduces the number of
instructions the processor must execute; otherwise the operation must be split into a series
of instructions. For example, suppose an 8-bit processor must add two 16-bit numbers. It
must first add the 8 lower-order bits and then the 8 higher-order bits, so two instructions
are needed to perform the operation, whereas a 16-bit processor can perform it with one
instruction (see the sketch after this list).
2. Instruction-level parallelism: The processor decides, within a single clock cycle, how
many instructions can be issued at the same time. Independent instructions can be reordered
and grouped so that several of them execute concurrently without changing the result of the
program.
3. Task parallelism: The form of parallelism in which a computation is decomposed into
subtasks, each subtask is assigned to a processor, and the subtasks are executed
concurrently.
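The following is a minimal sketch of the 8-bit example above: a 16-bit addition is emulated with two 8-bit additions plus a carry, mirroring the two instructions an 8-bit processor would need. The function name and the C emulation itself are illustrative assumptions, not hardware code.

/* Illustrative sketch: add two 16-bit numbers using only 8-bit operations,
 * the way an 8-bit processor would (two add steps instead of one). */
#include <stdint.h>
#include <stdio.h>

uint16_t add16_with_8bit_ops(uint16_t x, uint16_t y) {
    uint8_t lo = (uint8_t)(x & 0xFF) + (uint8_t)(y & 0xFF);     /* step 1: low bytes   */
    uint8_t carry = (lo < (uint8_t)(x & 0xFF)) ? 1 : 0;         /* carry out of step 1 */
    uint8_t hi = (uint8_t)(x >> 8) + (uint8_t)(y >> 8) + carry; /* step 2: high bytes  */
    return ((uint16_t)hi << 8) | lo;
}

int main(void) {
    uint16_t a = 0x12F0, b = 0x0345;
    printf("0x%04X\n", add16_with_8bit_ops(a, b));  /* prints 0x1635, same as a + b */
    return 0;
}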
Parallel computing
Observation
High-performance distributed computing started with parallel computing
Multiprocessor and multicore versus multicomputer
Moore’s law (attributed to Gordon Moore, co-founder of Intel): the number of transistors on a chip doubles roughly every 2 years
Leveraging Moore’s Law
More transistors – opportunities for exploiting parallelism
Implicit parallelism
– Pipelining
– Superscalar execution
Explicit parallelism
– Streaming and multimedia processor extensions, e.g., MMX, AltiVec
– Very long instruction words (VLIW)
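As an illustration of such explicit streaming/multimedia extensions, the sketch below uses x86 SSE2 intrinsics (a later successor to MMX; the specific header and intrinsics are assumptions about the reader's platform, not something the slides prescribe) to add eight 16-bit values with a single vector instruction.

/* Minimal sketch of explicit SIMD parallelism with SSE2 intrinsics.
 * Compile with a compiler that supports <emmintrin.h>, e.g. gcc -msse2 simd_add.c */
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    short c[8];

    /* Load eight 16-bit values into 128-bit registers and add them
     * with one vector instruction (eight additions in parallel). */
    __m128i va = _mm_loadu_si128((__m128i const *)a);
    __m128i vb = _mm_loadu_si128((__m128i const *)b);
    __m128i vc = _mm_add_epi16(va, vb);
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);    /* prints: 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}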
Uniprocessor Limits
The power problem!
http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005
S.No.  Parallel Computing                              Distributed Computing
1      Many operations are performed simultaneously    System components are located at different locations
2      A single computer is required                   Multiple computers are used
– Distributed consensus
– Replication, caching consistency
– Security and trust
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and the
datapath. Each of these components presents significant performance bottlenecks.
Parallelism addresses each of these components in significant ways. Different applications
utilize different aspects of parallelism – e.g., data-intensive applications utilize high
aggregate throughput, server applications utilize high aggregate network bandwidth, and
scientific applications typically utilize high processing and memory system performance.
It is important to understand each of these performance bottlenecks.
Implicit Parallelism:
Microprocessor clock speeds have posted impressive gains over the past two decades
(two to three orders of magnitude). Higher levels of device integration have made
available a large number of transistors.
– The question of how best to utilize these resources is an important one.
– Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
– The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.
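As a small illustration of executing multiple instructions in the same cycle, consider the hypothetical loop below (the slides do not reference this code): the four partial sums are mutually independent, so a superscalar processor with several functional units can overlap the additions, whereas a single dependence chain would force them to execute one after another.

/* Hypothetical sketch of instruction-level parallelism: the four partial sums
 * have no dependences among them, so the hardware may issue several of the
 * additions in the same cycle. */
#include <stdio.h>

int main(void) {
    double a[1000];
    for (int i = 0; i < 1000; i++)
        a[i] = i * 0.5;

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < 1000; i += 4) {
        s0 += a[i];       /* independent of the three additions below, */
        s1 += a[i + 1];   /* so a superscalar core can overlap them    */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    printf("%f\n", s0 + s1 + s2 + s3);  /* prints 249750.000000 */
    return 0;
}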
Very Long Instruction Word (VLIW) Processors
The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
• To address these issues, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
• These instructions are packed and dispatched together, hence the name very long instruction word.
• This concept was used with some commercial success in the Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64 processors.
Dichotomy of Parallel Computing Platforms:
• An explicitly parallel program must specify concurrency and
interaction between concurrent subtasks.
• The former is sometimes also referred to as the control structure
and the latter as the communication model.
Control Structure of Parallel Programs
• Parallelism can be expressed at various levels of granularity – from instruction level to processes.
• Between these extremes exists a range of models, along with corresponding architectural support.
• Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
SIMD:
• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc.
• SIMD relies on the regular structure of computations (such as those in image processing).
• It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an “activity mask”, which determines whether a processor should participate in a computation or not.
MIMD:
• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
• A variant of this, called single program multiple data streams (SPMD), executes the same program on different processors.
• It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
• Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
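A scalar sketch of the activity-mask idea follows (purely illustrative; the mask array and names are assumptions): each element of the mask decides whether the corresponding data item participates in the operation, which is what SIMD hardware does in lock-step across all lanes.

/* Illustrative emulation of a SIMD "activity mask": the operation is applied
 * only where the mask is set, mimicking lanes that are switched off. */
#include <stdio.h>

int main(void) {
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int mask[8] = {1, 0, 1, 0, 1, 1, 0, 1};   /* 1 = lane participates */

    for (int i = 0; i < 8; i++) {
        if (mask[i])          /* masked-off lanes keep their old value */
            data[i] *= 10;
    }

    for (int i = 0; i < 8; i++)
        printf("%d ", data[i]);
    printf("\n");             /* prints: 10 2 30 4 50 60 7 80 */
    return 0;
}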
Communication Model of Parallel Platforms
Shared-Memory Parallel Computer
Pros:
Easier to program (probably)
Cons:
Performance may suffer if the memory is located on distant machines
Limited scalability
Distributed-Memory Parallel Computer
Distributed Memory
Programming through processes
Explicit message passing
Networking
Pros:
Tighter control on message passing
Cons:
Harder to program
Modern supercomputers are hybrids!
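Below is a minimal sketch of explicit message passing between two processes. It uses MPI as one common realization of this model; MPI itself is an assumption here, since the slides only say “explicit message passing”.

/* Two processes exchange an integer by message passing; no shared memory.
 * Typical build/run: mpicc send_recv.c -o send_recv && mpirun -np 2 ./send_recv */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Process 0 sends one integer to process 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 receives it over the network/interconnect. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}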
Multicore Resource Management
Problem:
• Read-write access to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem.
• A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.
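A minimal sketch of coordinating read-write access to shared data with a mutex follows, using POSIX threads; pthreads and the counter example are assumptions, since the slides only state that such access must be coordinated.

/* Four threads increment a shared counter; the mutex coordinates the
 * read-modify-write so no updates are lost. Compile: gcc counter.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                  /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);        /* coordinate access */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with coordination */
    return 0;
}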
Shared-Address-Space vs. Shared Memory Machines
• It is important to note the difference between the terms shared address space and shared memory.