
CS476

Parallel and distributed systems

Module 1
Introduction to Parallel and Distributed Computing.

Contents

1. Introduction: scope, issues, applications.


2. Challenges of Parallel and Distributed Computing.
3. Implicit Parallelism: Trends in Microprocessor Architectures.
4. Dichotomy of Parallel Computing Platforms.
5. Physical Organization, co-processing.

Weekly Learning Outcomes
1. What is parallel or distributed computing and why is it necessary?

2. Explain the scope, applications, and challenges of parallel and distributed computing.

3. Understand the communication model of parallel platforms.

Required Reading
Chapter 1, Introduction to Parallel and Distributed Computing: (A. Grama, A. Gupta, G.
Karypis, V. Kumar. Introduction to Parallel Computing (2nd ed.). Addison Wesley, 2003.)

Recommended Reading:
Introduction to parallel and distributed processing:
https://www.youtube.com/watch?v=RNVIcm8-6RE
Implicit parallelism https://www.youtube.com/watch?v=9_uX7L7-7ME
https://www.cs.rochester.edu/users/faculty/sandhya/csc258/

What is Parallel Computing:
Parallel computing is the running of an application or computation on multiple processors
simultaneously. In general, it refers to a type of computing architecture in which large
problems are divided into smaller, independent, typically similar parts that can be
processed at once. Multiple CPUs cooperate on the work, exchanging information through
shared memory, and their results are then combined. Parallel computing makes complex
computations feasible by distributing an enormous problem among multiple processors.

What is Distributed Computing:

In distributed computing, a user sees a single system that is actually made up of several
autonomous computers. There is no shared memory in distributed systems, and
machines communicate with one another by exchanging messages. A single task is split up
among several computers in distributed computing.
Types of parallel computing
1. Bit-level parallelism: the form of parallel computing that depends on increasing the
processor word size. For a task on large-sized data, a wider word reduces the number of
instructions the processor must execute; otherwise the operation must be split into a series
of instructions. For example, suppose an 8-bit processor has to perform an operation on
16-bit numbers. It must first operate on the 8 lower-order bits and then on the 8
higher-order bits, so two instructions are needed, whereas a 16-bit processor can perform
the operation with a single instruction (a C sketch of this two-step addition follows this list).
2. Instruction-level parallelism: here the processor decides, within a single CPU clock cycle,
how many instructions to execute at the same time. By overlapping and reordering
independent instructions (for example through pipelining and superscalar execution), the
processor can complete more than one instruction per cycle.
3. Task parallelism: the form of parallelism in which a task is decomposed into subtasks.
Each subtask is then allocated to a processor for execution, and the subtasks are executed
concurrently.
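
To make the bit-level example above concrete, here is a minimal C sketch (written for illustration, not taken from the textbook) of adding two 16-bit numbers using only 8-bit additions, the way an 8-bit processor would have to: the low-order bytes are added first, and the resulting carry feeds the second addition of the high-order bytes.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a 16-bit add carried out as two 8-bit adds plus a carry. */
uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b) {
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t lo_sum = (uint16_t)a_lo + b_lo;   /* first 8-bit addition (low bytes)   */
    uint8_t  carry  = lo_sum >> 8;             /* carry out of the low byte          */
    uint8_t  hi_sum = a_hi + b_hi + carry;     /* second 8-bit addition (high bytes) */

    return ((uint16_t)hi_sum << 8) | (lo_sum & 0xFF);
}

int main(void) {
    printf("%d\n", (int)add16_with_8bit_ops(300, 500));  /* prints 800 */
    return 0;
}

A 16-bit processor performs the same addition with a single instruction, which is exactly the saving that bit-level parallelism provides.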
Parallel computing

Observation
High-performance distributed computing started
with parallel computing
Multiprocessor and multicore versus multicomputer
Moore’s law (attributed to Gordon Moore, co-founder of Intel): the number
of transistors on a chip doubles roughly every 2 years
Leveraging Moore’s Law
More transistors – more opportunities for exploiting parallelism
Implicit parallelism:
– Pipelining
– Superscalar execution
Explicit parallelism:
– Streaming and multimedia processor extensions (e.g., MMX, Altivec)
– Very long instruction words (VLIW)
Uniprocessor Limits
The power problem!

http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005
S.No. | Parallel Computing | Distributed Computing
1 | Many operations are performed simultaneously | System components are located at different locations
2 | A single computer is required | Uses multiple computers
3 | Multiple processors perform multiple operations | Multiple computers perform multiple operations
4 | May have shared or distributed memory | Has only distributed memory
5 | Processors communicate with each other through a bus | Computers communicate with each other through message passing
6 | Improves the system performance | Improves system scalability, fault tolerance and resource-sharing capabilities
Applications of Parallel Computing
There are various applications of Parallel Computing, which are as
follows:
1. One of the primary applications of parallel computing is Databases
and Data mining.
2. The real-time simulation of systems is another use of parallel
computing.
3. Technologies such as networked video and multimedia.
4. Science and Engineering.
5. Collaborative work environments.
6. The concept of parallel computing is used by augmented reality,
advanced graphics, and virtual reality.
Applications of Distributed computing:

Social networks, mobile systems, online banking, and online gaming (e.g. multiplayer
systems) also use efficient distributed systems. Additional areas of application for
distributed computing include e-learning platforms, artificial intelligence, and
e-commerce.
What are the challenges of parallel and distributed
computing?
Important concerns are workload sharing, which attempts to take advantage of
access to multiple computers to complete jobs faster; task migration, which
supports workload sharing by efficiently distributing jobs among machines; and
automatic task replication, which occurs at different sites for greater
reliability.
Heterogeneity: the Internet enables users to access services and run applications
over a heterogeneous collection of computers and networks.
Transparency
Openness
Concurrency
Security
Scalability
Failure handling
Distributed Systems Issues
• The lack of global knowledge.
• Naming.
• Scalability: Size, geographically, administratively
• Compatibility.
• Process synchronization (requires global knowledge)
• Resource management (requires global knowledge)
• Reliability/fault tolerance: fail-stop, byzantine (arbitrary) failure models
• Problems:

– Distributed consensus
– Replication, caching consistency
– Security and trust
Scope of Parallelism
Conventional architectures coarsely comprise a processor, a
memory system, and a datapath. Each of these components
presents significant performance bottlenecks. Parallelism addresses
each of these components in significant ways. Different applications
utilize different aspects of parallelism – e.g., data intensive
applications utilize high aggregate throughput, server applications
utilize high aggregate network bandwidth, and scientific applications
typically utilize high processing and memory system performance. It
is important to understand each of these performance bottlenecks.
Implicit Parallelism:
Microprocessor clock speeds have posted impressive gains over
the past two decades (two to three orders of magnitude). Higher
levels of device integration have made available a large number of
transistors.
- The question of how best to utilize these resources is an important
one.
- Current processors use these resources in multiple functional units
and execute multiple instructions in the same cycle.
- The precise manner in which these instructions are selected and
executed provides impressive diversity in architectures.
Very Long Instruction Word (VLIW) Processors
The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
# To address this issue, VLIW processors rely on compile-time analysis to identify and
bundle together instructions that can be executed concurrently.
# These instructions are packed and dispatched together, and thus the name very long
instruction word.
# This concept was used with some commercial success in the Multiflow Trace machine
(circa 1984).
# Variants of this concept are employed in the Intel IA64 processors.
Dichotomy of Parallel Computing Platforms:
• An explicitly parallel program must specify concurrency and
interaction between concurrent subtasks.
• The former is sometimes also referred to as the control structure
and the latter as the communication model.
Control Structure of Parallel Programs
• Parallelism can be expressed at various levels of granularity – from instruction
level to processes.
• Between these extremes exists a range of models, along with corresponding
architectural support.
• Processing units in parallel computers either operate under the centralized
control of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to
various processors (that work on different data), the model is referred to as
single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute
different instructions on different data items. This model is called multiple
instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).


SIMD-MIMD Comparison

SIMD:
• Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and
MasPar MP-1, belonged to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX units
in Intel processors and DSP chips such as the Sharc.
• SIMD relies on the regular structure of computations (such as those in image
processing).
• It is often necessary to selectively turn off operations on certain data items. For this
reason, most SIMD programming paradigms allow for an “activity mask”, which
determines whether a processor should participate in a computation or not (see the
sketch after this comparison).

MIMD:
• In contrast to SIMD processors, MIMD processors can execute different programs on
different processors.
• A variant of this, called single program multiple data streams (SPMD), executes the
same program on different processors.
• It is easy to see that SPMD and MIMD are closely related in terms of programming
flexibility and underlying architectural support.
• Examples of such platforms include current generation Sun Ultra Servers, SGI Origin
Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
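
The “activity mask” mentioned above can be illustrated with a small, purely illustrative C sketch (real SIMD hardware applies the mask inside the lanes of a vector unit; here the lanes are simulated with an ordinary array). Every element is offered the same operation, but only the lanes whose mask bit is set actually take part.

#include <stdio.h>

int main(void) {
    int data[8] = {3, -1, 4, -5, 9, -2, 6, -7};
    int mask[8];

    /* Build the activity mask: only lanes holding a negative value participate. */
    for (int i = 0; i < 8; i++) mask[i] = (data[i] < 0);

    /* The same operation (negation) is issued to all lanes, gated by the mask. */
    for (int i = 0; i < 8; i++)
        if (mask[i]) data[i] = -data[i];

    for (int i = 0; i < 8; i++) printf("%d ", data[i]);  /* 3 1 4 5 9 2 6 7 */
    printf("\n");
    return 0;
}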
Communication Model of Parallel Platforms

• There are two primary forms of data exchange between parallel tasks – accessing a
shared data space and exchanging messages.
• Platforms that provide a shared data space are called shared-address-space machines
or multiprocessors.
• Platforms that support messaging are also called message-passing platforms or
multicomputers.
Shared Memory Parallel Systems
• Multiple processors can access the same memory simultaneously
• Challenges:
– One processor’s cache may contain a copy of data that was just
modified by another; this requires hardware support for coherence
– Two processes’ operations may be interleaved in unintuitive ways;
this requires hardware support and guarantees on atomicity and
ordering
Shared-Memory Parallel Computer (figure)
Shared Memory
Programming through threading
Multiple processors share a pool of memory
Problems: cache coherence
UMA vs. NUMA architecture

Pros:
Easier to program (probably)

Cons:
Performance may suffer if the memory is located on
distant machines
Limited scalability
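
As a minimal sketch of the “programming through threading” model (assuming POSIX threads; compile with -pthread), the example below lets several threads update one shared counter. The mutex supplies the atomicity guarantee discussed earlier; without it, the interleaved increments could lose updates.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                                  /* shared by all threads       */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects the shared counter */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                                        /* shared data, updated under the lock */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);                   /* 400000 when the lock is used */
    return 0;
}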
Distributed-Memory Parallel Computer (figure)
Distributed Memory
Programming through processes
Explicit message passing
Networking
Pros:
Tighter control on message passing

Cons:
Harder to program
Modern supercomputers are hybrids!
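
A minimal sketch of explicit message passing between processes with private memories, assuming an MPI implementation is available (compile with mpicc, run with mpirun -np 2). Note that this is also SPMD in style: every process runs the same program and branches on its rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                        /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);             /* data arrived only via a message */
    }

    MPI_Finalize();
    return 0;
}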
Multicore Resource Management

Multicore systems contain many shared resources, e.g.,
memory bandwidth and cache

Problem:
– Management of shared resources for efficiency and fairness


Distributed Memory Parallel Systems
• Parallel systems that do not share memory
• Software system support for communication
(point-to-point, group)
• Data must be explicitly partitioned and transferred
when needed
• Dynamic workload management?
Shared-Address-Space Platforms
• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this
shared-address-space.

• If the time taken by a processor to access any memory
word in the system (global or local) is identical, the
platform is classified as a uniform memory access (UMA)
machine; otherwise, it is a non-uniform memory access
(NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms:
• The distinction between NUMA and UMA platforms is important from the point of
view of algorithm design. NUMA machines require locality from underlying
algorithms for performance.
• Programming these platforms is easier since reads and writes are implicitly visible
to other processors.
• However, read-write access to shared data must be coordinated (this will be discussed
in greater detail when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to
the cache coherence problem.
• A weaker model of these machines provides an address map, but not coordinated
access. These models are called non-cache-coherent shared-address-space machines.

Shared-Address-Space vs. Shared Memory Machines
o It is important to note the difference between the terms
shared address space and shared memory.
o We refer to the former as a programming abstraction and to
the latter as a physical machine attribute.
o It is possible to provide a shared address space using a
physically distributed memory.
Physical Organization of Parallel Platforms:
We begin this discussion with an ideal parallel machine called Parallel Random Access
Machine, or PRAM.
Architecture of an Ideal Parallel Computer

o A natural extension of the Random Access Machine (RAM) serial
architecture is the Parallel Random Access Machine, or PRAM.
o PRAMs consist of p processors and a global memory of unbounded size
that is uniformly accessible to all processors.
o Processors share a common clock but may execute different instructions in
each cycle.
o Depending on how simultaneous memory accesses are handled, PRAMs
can be divided into four subclasses.
Continued…
• Exclusive-read, exclusive-write (EREW) PRAM.

• Concurrent-read, exclusive-write (CREW) PRAM

• Exclusive-read, concurrent-write (ERCW) PRAM.

• Concurrent-read, concurrent-write (CRCW) PRAM.


Physical Complexity of an Ideal Parallel Computer
• Processors and memories are connected via switches.
• Since these switches must operate in O(1) time at the level
of words, for a system of p processors and m words, the
switch complexity is O(mp).
• Clearly, for meaningful values of p and m, a true PRAM is
not realizable.
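As a rough illustration (the numbers here are chosen for concreteness and are not from the slides): with p = 1024 = 2^10 processors and m = 2^30 memory words, the switching network would need on the order of mp = 2^40 (about 10^12) switches, which is why a true PRAM cannot be built.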
