001 - DDS IIIT Jan 10th

Parallel Computing

What is Parallel Computing?


• Doing things simultaneously
• Same thing or different things
• Solving one large problem

• Serial computing
• Problem is broken down into a stream of instructions, executed sequentially
one after the other on a single processor
• Only one instruction is executed at a time

• Parallel computing
• Problem is broken down into parts that can be solved concurrently
• Each part is further divided into a stream of instructions
• Instructions from different parts execute simultaneously on different
processors
Why Parallel Computing?
• The real world is parallel
• Complex interrelated events happen simultaneously
• galaxies, planetary movements, functioning of the brain, weather, traffic

• Why use parallel computing?

• Save time: produce results in a reasonable time for them to be useful
• e.g. weather monitoring, automated stock trading
• Problems interesting to scientists/engineers don't fit on a PC because of huge memory requirements

• Processors are not getting any faster


• Clock frequency is not improving anymore
• Heat dissipation issues and power consumption issues
Why Parallel Computing?

Why do we need ever-increasing performance?

It has enabled dramatic advances in fields such as the internet, entertainment, decoding the human genome, accurate medical imaging, web search, computer games, atmospheric simulations, financial processing, computational biology, drug discovery, data analytics, and more.
Where is parallel computing used?
• Scientific Computing
• Numerically Intensive Simulations

• Database operations and information systems

• Web-based services, web search engines, online transaction processing
• Client inventory database management systems, data mining, MIS systems, and so on

• Artificial intelligence, machine learning and deep learning

• Real time systems and control applications


• Hardware and Robotic Control, Speech Processing, Pattern Recognition
How does parallel computing fit into scientific computing?

Physical processes (e.g., air flow around an airplane)
→ Mathematical models (e.g., Navier-Stokes equations)
→ Numerical solutions: algorithms, solvers, application codes, supercomputers (this is where parallel computing fits in)
→ Visualization, validation, physical insights (data viz software)
Parallel Computers
• Racks of server/processing units (40-80 servers per rack) sitting on a 68,680 sq-ft floor space (a Google data center)

• 40-50 such racks in each row, and 'n' such rows

• A single application does not use all the racks

• Fastest supercomputer in the world (as of November 2022 on the TOP500 list):

• Frontier (HPE Cray EX235a): 1.102 exaFLOPS, 591,872 CPU cores (9,248 x 64-core AMD EPYC processors @ 2.0 GHz) and AMD Instinct MI250X GPUs

• Here a single application can use the entire supercomputer at the same time.
Parallel Computers

The 13 fastest supercomputers in the world (2022)

https://www.top500.org/

Data Center
Traditional Processor

[Figure: block diagram of a traditional processor – controller (control logic, state register, IR, PC), datapath (register file, ALU), program memory and data memory – alongside assembly code for a simple summation loop]

• The traditional logical view of a sequential computer: memory connected to a processor via a datapath

• All of these become bottlenecks for the overall processing rate of the computer

• A number of architectural innovations have addressed these bottlenecks over the years, such as:

• Improvement of clock speed
• More transistors per square inch


Superscalar Processor
• A superscalar processor has multiple copies of the datapath to execute multiple instructions simultaneously.
• A two-way superscalar processor fetches and executes two instructions per cycle, i.e., the datapath fetches two instructions at a time from the instruction memory.
• Desktop and laptop computers often use superscalar execution.
Parallel Programming Platforms

• Pipelined execution enables execution of multiple instructions in a single clock cycle

• By overlapping various stages (fetch, schedule, decode, operand fetch, execute, store), pipelining enables faster execution.

• Microprocessors like the Itanium, Sparc Ultra, MIPS, and Power4 support multiple instruction execution.

• If the assembly of a car takes 100 time units and can be broken into 10 pipelined stages of 10 units each, a single assembly line can produce a car every 10 time units!

• This represents a 10-fold speedup over the serial production process


Parallel Programming Platforms

We also have the concepts of

- Data-level parallelism

Data-level parallelism: Partition the data used in solving the problem among the cores.
• The same operation is performed on different pieces of data

- Instruction/task-level parallelism

Instruction/task-level parallelism: Partition the various tasks carried out in solving the problem among the cores.
• The data is the same, but the operations performed on it are different
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

sum = 0;
for(i=0 ; i<n ; i++) {
x = compute_next_value(…);
sum += x;
}

my_sum = 0;
my_first_i = …;
my_last_i = …;

for(my_i = my_first_i ; my_i < my_last_i ; my_i++) {
    my_x = compute_next_value(…);
    my_sum += my_x;
}

my_ : indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores
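A minimal sketch of how each core's range (my_first_i, my_last_i) might be computed, assuming a simple block partition of the n iterations across the p cores. The function and variable names here are illustrative, not from the slides:

/* Hypothetical block partitioning of n loop iterations across p cores.
   my_rank runs from 0 to p-1; names are illustrative, not the slides'. */
void my_range(int my_rank, int p, int n, int *my_first_i, int *my_last_i) {
    int quotient  = n / p;            /* iterations every core gets            */
    int remainder = n % p;            /* leftover iterations                   */
    int my_n;
    if (my_rank < remainder) {        /* first 'remainder' cores get one extra */
        my_n = quotient + 1;
        *my_first_i = my_rank * my_n;
    } else {
        my_n = quotient;
        *my_first_i = my_rank * my_n + remainder;
    }
    *my_last_i = *my_first_i + my_n;  /* loop runs while my_i < my_last_i      */
}

For example, with n = 24 and p = 8 each core gets 3 iterations, matching the 3 values per core used in the next slide.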
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

sum = 0;
for(i=0 ; i<n ; i++) {
x = compute_next_value(…);
sum += x;
}
If n = 24 and the 24 calls to compute_next_value() return the values:

1,4,3,  9,2,8,  5,1,1,  6,2,7,  2,5,0,  4,1,8,  6,5,1,  2,3,9

then the values stored in my_sum might be:

Core:    0    1    2    3    4    5    6    7
my_sum:  8   19    7   15    7   13   12   14

When the cores are done computing their values of my_sum, the master core computes the global_sum….
Parallel Programming Platforms

- Ex: Data level parallelism


Compute ‘n’ values and add them together on p cores (p<<<n)

if ( I’m the master core ) {
    sum = my_sum;
    for each core other than myself {
        receive value from core;
        sum += value;
    }
} else {
    send my_sum to the master;
}

If the master core is core_0, it would add:

Core 0
sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
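A minimal MPI sketch of this master-core global sum (an illustrative send/receive version, not the slides' code; each process's partial sum is a placeholder here):

/* sum_mpi.c - illustrative MPI version of the global sum.
   Compile: mpicc sum_mpi.c   Run: mpiexec -n 8 ./a.out              */
#include <mpi.h>
#include <stdio.h>

int main(void) {
    int my_rank, p;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_sum = my_rank + 1.0;   /* placeholder for this core's partial sum */
    double sum = my_sum;

    if (my_rank == 0) {              /* master core receives and adds           */
        for (int q = 1; q < p; q++) {
            double value;
            MPI_Recv(&value, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        }
        printf("global sum = %f\n", sum);
    } else {                         /* other cores send their partial sums     */
        MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

The same reduction could be done in one call with MPI_Reduce; the explicit loop mirrors the pseudocode above.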
Parallel Programming Platforms

- Ex: Improving the global sum with task parallelism (when the number of cores is large)

With 1000 cores:

having the master core do all the additions (as above) requires 999 receives and adds, while organizing the additions as a tree of pairwise-sum tasks requires only about 10 steps, since ceil(log2(1000)) = 10.
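A minimal MPI sketch of the tree-structured sum, assuming the number of processes is a power of two (with p = 1024, the power of two nearest 1000, this takes exactly 10 rounds). The function and variable names are illustrative, not from the slides:

/* Illustrative tree-structured global sum using MPI (a sketch; assumes
   p is a power of two and my_rank came from MPI_Comm_rank).            */
#include <mpi.h>

double tree_sum(double my_sum, int my_rank, int p) {
    double sum = my_sum, value;
    int divisor = 2;            /* doubles each round                    */
    int partner_offset = 1;     /* distance to this round's partner      */

    while (divisor <= p) {
        if (my_rank % divisor == 0) {       /* receiver this round       */
            MPI_Recv(&value, 1, MPI_DOUBLE, my_rank + partner_offset, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += value;
        } else {                            /* sender: send and drop out */
            MPI_Send(&sum, 1, MPI_DOUBLE, my_rank - partner_offset, 0,
                     MPI_COMM_WORLD);
            break;
        }
        divisor *= 2;
        partner_offset *= 2;
    }
    return sum;                 /* only rank 0 holds the full global sum */
}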
Parallel Programming Platforms
There are two main types of parallel systems:
Shared memory systems and distributed-memory systems.

• In a shared-memory system, all the cores can access the computer's memory.

• In a distributed memory system, each core has its own, private memory, and
the cores must communicate explicitly by doing something like sending
messages across a network.
Parallel Programming Platforms
There are two main types of parallel systems:
Shared memory systems and distributed-memory systems.

• OpenMP was designed for programming shared-memory systems. It provides mechanisms for accessing shared-memory locations and is a high-level extension to C. For example, it can “parallelize” our ‘for’ loop, as sketched below.

• MPI was designed for programming distributed-memory systems. It provides mechanisms for sending messages.
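A minimal OpenMP sketch of parallelizing the earlier summation loop. The pragma is standard OpenMP, but the surrounding program (including the stand-in compute_next_value) is an illustrative assumption, not the slides' code:

/* sum_omp.c - illustrative OpenMP parallelization of the summation loop.
   Compile with: gcc -fopenmp sum_omp.c                                   */
#include <omp.h>
#include <stdio.h>

double compute_next_value(int i) {   /* stand-in for the slides' function */
    return i * 0.5;
}

int main(void) {
    int n = 24;
    double sum = 0.0;

    /* Each thread gets a private x and a chunk of the iterations; the
       reduction clause combines the per-thread partial sums safely.      */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = compute_next_value(i);
        sum += x;
    }

    printf("sum = %f\n", sum);
    return 0;
}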
What we will be doing…

Learning to write programs that are explicitly parallel.

• On parallel computers using the C language and extensions to C:


The Message-Passing Interface or MPI, and OpenMP.

• MPI is a library of type definitions, functions, and macros that can be used in C programs.

• OpenMP consists of pragmas and some modifications to the C compiler.

• CUDA programming on graphics processor/card


Parallel Architectures

How to take single-processor architectures and combine them to form a parallel architecture
Parallel Hardware…

(a) Shared Memory System (b) Distributed Memory System (c) GPU Architecture
Limitation of Memory System Performance

• Performance of a program relies on
• the speed of the processor and
• the speed of the memory system (which feeds data to the processor)

• A memory system typically includes multiple levels of cache (L1, L2, L3)

• Latency and bandwidth determine memory system performance

Limitation of Memory System Performance
Effect of memory latency on performance:

• Consider a processor with a 1 GHz (1 ns) clock connected to DRAM with a latency of 100 ns.

• The processor can execute 4 instructions per 1 ns clock cycle
(assume the processor has 2 multiply-add units)

• The peak processor rate is therefore 4 GFLOPS

• The memory latency is 100 ns and the block size is 1 word.

• Every time a memory request is made, the processor must wait 100 cycles for the data.
Effect of memory latency on performance
Example:

• Consider the problem of computing the dot product of two vectors on such a platform

• With no cache memory, each floating point operation requires one data fetch, which takes 100 ns

• i.e., an effective rate of only 10 MFLOPS:

1 / (100 x 10^-9 s) = 10^7 FLOPS = 10 MFLOPS

• Hence effective memory system performance is needed to achieve high computation rates.
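For reference, a minimal sketch of the dot-product loop being discussed (illustrative code, not from the slides). Each iteration performs one multiply and one add but needs two fetches from memory, so with a 100 ns latency per fetch and no cache, memory access dominates the run time:

/* Illustrative dot product: one fetch of a[i] and one of b[i] per
   multiply-add pair, so memory latency limits the achievable FLOP rate. */
double dot_product(const double a[], const double b[], int n) {
    double result = 0.0;
    for (int i = 0; i < n; i++) {
        result += a[i] * b[i];   /* 2 loads, 1 multiply, 1 add */
    }
    return result;
}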
Basic concepts
• Adding ‘n’ numbers using ‘n’ processing elements.

If 'n' is a power of 2, this operation can be performed in log2(n) steps

i.e., if n = 16, log2(16) = 4, since 2^4 = 16


Basic concepts
• Adding ‘n’ numbers using ‘n’ processing elements.

The problem can be solved in

Θ(n) time on a single processor, Ts, and
Θ(log n) time on n processors, Tp

Ts = Θ(n), Tp = Θ(log(n))

So what is the speedup?


Basic concepts

Ts = Θ(n), Tp = Θ(log(n))

Speed-up S = Ts / Tp, and efficiency E = S / p (where p is the number of processors)

Ideal case: if speed-up = p, then efficiency = 1
Practical case: speed-up < p, so efficiency lies between 0 and 1
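Applying these definitions to the example above (a worked step, not spelled out on the slide):

S = Ts / Tp = Θ(n) / Θ(log n) = Θ(n / log n)
E = S / p = Θ(n / log n) / n = Θ(1 / log n)

so the speedup grows with n, but the efficiency of each of the n processing elements drops as n grows.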
Amdahl's Law

se = F, the fraction of the calculation that is serial

pe = (1 - F), the fraction that is parallel

F + pe = 1
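With these fractions, Amdahl's law (the standard statement, filled in here since the slide gives only the fractions) bounds the speedup on p processors:

speedup = 1 / (F + (1 - F) / p)

so as p grows, the speedup is limited above by 1 / F.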
Amdahl's Law
• Suppose a parallel program is executing on 10 processors, and only 40% of its time is spent executing in parallel.

• What is the overall speedup gained by incorporating parallelism?
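A worked application of the formula above (the slide poses the question; the arithmetic is filled in here). With F = 0.6 (serial fraction) and p = 10:

speedup = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 ≈ 1.56

i.e., even with 10 processors the program runs only about 1.56 times faster, because the serial 60% dominates.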


Amdahl's Law
Basic concepts of
Parallelization

Text Books:
An Introduction to Parallel Programming, by Peter S. Pacheco

Introduction to Parallel Computing, by Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
