001 - DDS IIIT Jan 10th

Parallel Computing

What is Parallel Computing?


• Doing things simultaneously
• Same thing or different things
• Solving one large problem

• Serial computing
• Problem is broken down into a stream of instructions, executed sequentially
one after the other on a single processor
• Only one instruction is executed at a time

• Parallel computing
• Problem is broken down into parts that can be solved concurrently
• Each part is further divided into a stream of instructions
• Instructions from different parts execute simultaneously on different
processors
Why Parallel Computing?
• The real world is parallel
• Complex interrelated events happen simultaneously
• galaxies, planetary movements, functioning of the brain, weather, traffic

• Why use parallel computing?

• Save time: produce results in a reasonable time for them to be useful
• e.g. weather monitoring, automated stock trading
• Problems interesting to scientists/engineers don't fit on a PC because of huge memory requirements

• Processors are not getting any faster


• Clock frequency is not improving anymore
• Heat dissipation issues and power consumption issues
Why Parallel Computing?

Why do we need ever-increasing performance?

It has enabled dramatic advances in fields such as the internet, entertainment, decoding the human genome, accurate medical imaging, web search, computer games, atmospheric simulations, financial processing, computational biology, drug discovery, data analytics, and more.
Where is parallel computing used?
• Scientific Computing
• Numerically Intensive Simulations

• Database operations and information systems

• Web-based services, web search engines, online transaction processing
• Client inventory database management systems, data mining, MIS systems, and so on

• Artificial intelligence, machine learning and deep learning

• Real time systems and control applications


• Hardware and Robotic Control, Speech Processing, Pattern Recognition
How does parallel computing fit into scientific computing?

Physical processes (e.g., air flow around an airplane)
→ Mathematical models (e.g., Navier-Stokes equations)
→ Numerical solutions: algorithms, solvers, application codes, supercomputers (this is where parallel computing fits in)
→ Visualization, validation, physical insights (data viz software)
Parallel Computers
• Racks of server/processing units (40-80 servers per rack) sitting on a 68,680 sq-ft floor space (a Google data center)

• 40-50 such racks in each row, and 'n' such rows

• A single application does not use all the racks

• Fastest supercomputer in the world (as of November 2022 on the TOP500 list):

• Frontier (HPE Cray EX235a): 1.102 exaFLOPS, 591,872 CPU cores (9,248 x 64-core AMD EPYC processors @ 2.0 GHz) and AMD Instinct MI250X GPUs

• Here a single application can use the entire supercomputer at the same time.
Parallel Computers

The 13 fastest supercomputers in the world (2022)

https://www.top500.org/

Data Center
Traditional Processor

[Figure: block diagram of a traditional processor – controller (control logic, state register, IR, PC), datapath (register file, ALU), program memory and data memory – alongside assembly code for a simple summation loop]

• The traditional logical view of a sequential computer: memory connected to a processor via a datapath

• All of these become bottlenecks for the overall processing rate of the computer

• A number of architectural innovations have addressed these bottlenecks over the years, such as:

• Improvement of clock speed
• More transistors per square inch


Superscalar Processor
• A superscalar processor has multiple copies of the datapath to execute multiple instructions simultaneously.
• A two-way superscalar processor fetches and executes two instructions per cycle, i.e., the datapath fetches two instructions at a time from the instruction memory.
• Desktop and laptop computers often use superscalar execution.
Parallel Programming Platforms

• Pipelined execution enables execution of multiple instructions in a single clock cycle

• By overlapping various stages (fetch, schedule, decode, operand fetch, execute, store), pipelining enables faster execution.

• Microprocessors like the Itanium, Sparc Ultra, MIPS, and Power4 support multiple instruction execution.

• If the assembly of a car takes 100 time units and can be broken into 10 pipelined stages of 10 units each, a single assembly line can produce a car every 10 time units!

• This represents a 10-fold speedup over the serial production process


Parallel Programming Platforms

We also have the concepts of

- Data-level parallelism

Data-level parallelism: Partition the data used in solving the problem among the cores.
• The same operation is performed on different pieces of data

- Instruction/task-level parallelism

Instruction/task-level parallelism: Partition the various tasks carried out in solving the problem among the cores.
• The data is the same, but the operations performed on it are different
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

sum = 0;
for(i=0 ; i<n ; i++) {
x = compute_next_value(…);
sum += x;
}

my_sum = 0;
my_first_i = …;
my_last_i = …;

for(my_i = my_first_i ; my_i < my_last_i ; my_i++) {
    my_x = compute_next_value(…);
    my_sum += my_x;
}

my_ : indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores
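A minimal sketch of how each core's range (my_first_i, my_last_i) might be computed, assuming a simple block partition of the n iterations across the p cores. The function and variable names here are illustrative, not from the slides:

/* Hypothetical block partitioning of n loop iterations across p cores.
   my_rank runs from 0 to p-1; names are illustrative, not the slides'. */
void my_range(int my_rank, int p, int n, int *my_first_i, int *my_last_i) {
    int quotient  = n / p;            /* iterations every core gets            */
    int remainder = n % p;            /* leftover iterations                   */
    int my_n;
    if (my_rank < remainder) {        /* first 'remainder' cores get one extra */
        my_n = quotient + 1;
        *my_first_i = my_rank * my_n;
    } else {
        my_n = quotient;
        *my_first_i = my_rank * my_n + remainder;
    }
    *my_last_i = *my_first_i + my_n;  /* loop runs while my_i < my_last_i      */
}

For example, with n = 24 and p = 8 each core gets 3 iterations, matching the 3 values per core used in the next slide.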
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

sum = 0;
for(i=0 ; i<n ; i++) {
x = compute_next_value(…);
sum += x;
}
If n = 24 and the 24 calls to compute_next_value() return the values:

1,4,3,  9,2,8,  5,1,1,  6,2,7,  2,5,0,  4,1,8,  6,5,1,  2,3,9

then the values stored in my_sum might be:

Core:    0    1    2    3    4    5    6    7
my_sum:  8   19    7   15    7   13   12   14

When the cores are done computing their values of my_sum, the master core computes the global_sum….
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

if ( I’m the master core ) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}

If the master core is core_0, it would add:

Core 0
sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
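A minimal MPI sketch of this master-core global sum (an illustrative send/receive version, not the slides' code; each process's partial sum is a placeholder here):

/* sum_mpi.c - illustrative MPI version of the global sum.
   Compile: mpicc sum_mpi.c   Run: mpiexec -n 8 ./a.out              */
#include <mpi.h>
#include <stdio.h>

int main(void) {
    int my_rank, p;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_sum = my_rank + 1.0;   /* placeholder for this core's partial sum */
    double sum = my_sum;

    if (my_rank == 0) {              /* master core receives and adds           */
        for (int q = 1; q < p; q++) {
            double value;
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        }
        printf("global sum = %f\n", sum);
    } else {                         /* other cores send their partial sums     */
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

The same reduction could be done in one call with MPI_Reduce; the explicit loop mirrors the pseudocode above.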
Parallel Programming Platforms

- Ex: Improving the global sum with task parallelism (when the number of cores is large)

With 1000 cores:

having the master core do all the additions (as above) requires 999 receives and adds, while organizing the additions as a tree of pairwise-sum tasks requires only about 10 steps, since ceil(log2(1000)) = 10.
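A minimal MPI sketch of the tree-structured sum, assuming the number of processes is a power of two (with p = 1024, the power of two nearest 1000, this takes exactly 10 rounds). The function and variable names are illustrative, not from the slides:

/* Illustrative tree-structured global sum using MPI (a sketch; assumes
   p is a power of two and my_rank came from MPI_Comm_rank).            */
#include <mpi.h>

double tree_sum(double my_sum, int my_rank, int p) {
    double sum = my_sum, value;
    int divisor = 2;            /* doubles each round                    */
    int partner_offset = 1;     /* distance to this round's partner      */

    while (divisor <= p) {
        if (my_rank % divisor == 0) {       /* receiver this round       */
            MPI_Recv(&value, 1, MPI_DOUBLE, my_rank + partner_offset, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        } else {                            /* sender: send and drop out */
            MPI_Send(&sum, 1, MPI_DOUBLE, my_rank - partner_offset, 0,
                     MPI_COMM_WORLD);
            break;
        }
        divisor *= 2;
        partner_offset *= 2;
    }
    return sum;                 /* only rank 0 holds the full global sum */
}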
Parallel Programming Platforms
There are two main types of parallel systems:
Shared memory systems and distributed-memory systems.

• In a shared-memory system, all the cores can access the computer's memory.

• In a distributed memory system, each core has its own, private memory, and
the cores must communicate explicitly by doing something like sending
messages across a network.
Parallel Programming Platforms
There are two main types of parallel systems:
Shared memory systems and distributed-memory systems.

• OpenMP was designed for programming shared-memory systems. It provides mechanisms for accessing shared-memory locations and is a high-level extension to C. For example, it can “parallelize” our ‘for’ loop, as sketched below.

• MPI was designed for programming distributed-memory systems. It provides mechanisms for sending messages.
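A minimal OpenMP sketch of parallelizing the earlier summation loop. The pragma is standard OpenMP, but the surrounding program (including the stand-in compute_next_value) is an illustrative assumption, not the slides' code:

/* sum_omp.c - illustrative OpenMP parallelization of the summation loop.
   Compile with: gcc -fopenmp sum_omp.c                                   */
#include <omp.h>
#include <stdio.h>

double compute_next_value(int i) {   /* stand-in for the slides' function */
    return i * 0.5;
}

int main(void) {
    int n = 24;
    double sum = 0.0;

    /* Each thread gets a private x and a chunk of the iterations; the
       reduction clause combines the per-thread partial sums safely.      */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = compute_next_value(i);
        sum += x;
    }

    printf("sum = %f\n", sum);
    return 0;
}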
What we will be doing…

Learning to write programs that are explicitly parallel.

• On parallel computers using the C language and extensions to C:


The Message-Passing Interface or MPI, and OpenMP.

• MPI is a library of type definitions, functions, and macros that can be used in C programs.

• OpenMP consists of pragmas and some modifications to the C compiler.

• CUDA programming on graphics processor/card


Parallel Architectures

How to take single-processor architectures and combine them to form a parallel architecture
Parallel Hardware…

(a) Shared Memory System (b) Distributed Memory System (c) GPU Architecture
Limitation of Memory System Performance

• Performance of a program relies on
• the speed of the processor and
• the speed of the memory system (which feeds data to the processor)

• A memory system typically includes multiple levels of cache (L1, L2, L3)

• Latency and bandwidth determine memory system performance

Limitation of Memory System Performance
Effect of memory latency on performance:

• Consider a processor with a 1 GHz (1 ns) clock connected to DRAM with a latency of 100 ns.

• The processor can execute 4 instructions per 1 ns clock cycle
(assume the processor has 2 multiply-add units)

• The peak processor rate is therefore 4 GFLOPS

• The memory latency is 100 ns and the block size is 1 word.

• Every time a memory request is made, the processor must wait 100 cycles for the data.
Effect of memory latency on performance
Example:

• Consider the problem of computing the dot product of two vectors on such a platform

• With no cache memory, each floating point operation requires one data fetch, which takes 100 ns

• i.e., an effective rate of only 10 MFLOPS:

1 / (100 x 10^-9 s) = 10^7 FLOPS = 10 MFLOPS

• Hence effective memory system performance is needed to achieve high computation rates.
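For reference, a minimal sketch of the dot-product loop being discussed (illustrative code, not from the slides). Each iteration performs one multiply and one add but needs two fetches from memory, so with a 100 ns latency per fetch and no cache, memory access dominates the run time:

/* Illustrative dot product: one fetch of a[i] and one of b[i] per
   multiply-add pair, so memory latency limits the achievable FLOP rate. */
double dot_product(const double a[], const double b[], int n) {
    double result = 0.0;
    for (int i = 0; i < n; i++) {
        result += a[i] * b[i];   /* 2 loads, 1 multiply, 1 add */
    }
    return result;
}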
Basic concepts
• Adding ‘n’ numbers using ‘n’ processing elements.

If 'n' is a power of 2, this operation can be performed in log2(n) steps

i.e., if n = 16, log2(16) = 4, since 2^4 = 16


Basic concepts
• Adding ‘n’ numbers using ‘n’ processing elements.

The problem can be solved in

Θ(n) time on a single processor, Ts, and
Θ(log n) time on n processors, Tp

Ts = Θ(n), Tp = Θ(log(n))

So what is the speedup?


Basic concepts

Ts = Θ(n), Tp = Θ(log(n))

Speed-up S = Ts / Tp, and efficiency E = S / p (where p is the number of processors)

Ideal case: if speed-up = p, then efficiency = 1
Practical case: speed-up < p, so efficiency lies between 0 and 1
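Applying these definitions to the example above (a worked step, not spelled out on the slide):

S = Ts / Tp = Θ(n) / Θ(log n) = Θ(n / log n)
E = S / p = Θ(n / log n) / n = Θ(1 / log n)

so the speedup grows with n, but the efficiency of each of the n processing elements drops as n grows.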
Amdahl's Law

se = F, the fraction of the calculation that is serial

pe = (1 - F), the fraction that is parallel

F + pe = 1
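With these fractions, Amdahl's law (the standard statement, filled in here since the slide gives only the fractions) bounds the speedup on p processors:

speedup = 1 / (F + (1 - F) / p)

so as p grows, the speedup is limited above by 1 / F.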
Amdahl's Law
• Suppose a parallel program is executing on 10 processors, and only 40% of its time is spent executing in parallel.

• What is the overall speedup gained by incorporating parallelism?
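A worked application of the formula above (the slide poses the question; the arithmetic is filled in here). With F = 0.6 (serial fraction) and p = 10:

speedup = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 ≈ 1.56

i.e., even with 10 processors the program runs only about 1.56 times faster, because the serial 60% dominates.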


Amdahl's Law
Basic concepts of
Parallelization

Text Books:
An Introduction to Parallel Programming, by Peter S. Pacheco

Introduction to Parallel Computing, by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
