HPC_introduction_Lecture_3
Uploaded by Shehzad Ahmed

Independent University, Bangladesh

Department of Computer Science and Engineering


Course Title: Introduction to High Performance Computing
Course Code: Autumn-2024-CSC471

SECTION 1: (T) 06:30 PM - 09:30 PM

Presented by
Dr. Rubaiyat Islam
Adjunct Faculty, IUB.
Omdena Bangladesh Chapter Lead
Crypto-economist Consultant
Sifchain Finance, USA.
WHAT IS APPLICATION
TIMING?

• Analysis of a program’s behavior using information gathered as the program runs

• Why do it?
• Good way to improve the efficiency of scripts
• Identify performance problems
• Often required for allocation requests
HOW WOULD YOU DO IT?

• Measure the execution time of an entire program or simply a code snippet
• Loops
• Timing functions within programs
• Python, Fortran, C++, and R have functions that allow you to measure the execution time of small code snippets
• Can also do this with the Linux “time” command
• Can make changes to the code to improve efficiency
• Or… it’s just informational
THE LINUX TIME UTILITY

• The first place to start when profiling your program

time mpirun -np 4 ./prog.mpi

real 0m17.801s (wall-clock time)
user 0m58.125s (CPU time summed across all threads/processes)
sys  0m0.081s  (system overhead)
FINE-GRAINED TIMING

• Often useful to time portions of a program


• Good idea when developing your own code
• Tough when it’s 3rd-party software
• Useful functions:
• Fortran: system_clock
• C++: clock()
• Python: time.time()
• R: Sys.time()
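As a minimal sketch of fine-grained timing in Python, using time.time() as listed above (the workload function here is a made-up placeholder):

```python
import time

def work(n):
    """Placeholder workload: sum the first n integers with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total

start = time.time()            # wall-clock timestamp before the snippet
result = work(1_000_000)
elapsed = time.time() - start  # seconds spent in the snippet
print(f"work() took {elapsed:.4f} s (result={result})")
```

For very short snippets, time.perf_counter() gives a higher-resolution clock than time.time().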
SERIAL VS. PARALLEL
PROCESSING

• Serial processing
• A problem is broken into a set of discrete instructions
• These instructions are carried out sequentially on a single
processor
• Only one instruction is executed at a tifme

• Parallel processing
• Idea where many instructions are carried out simultaneously
across a computing system
• Can divide a large problem up into many smaller problems

WHY PARALLELIZE?

• Single core too slow for solving the problem in a “reasonable” time
• “Reasonable” time: overnight, over lunch, the duration of a PhD thesis

• Memory requirements
• Larger problem
• More physics
• More particles
• Larger images
• Larger neural networks
BASIC COMPUTER
ARCHITECTURE

• Old computers – one unit to execute instructions
• New computers have 4 or more CPU cores

SERIAL PROCESSING –
THOUGHT EXPERIMENT

• Let’s say you own a lawn service company


• You have one hundred clients who each want their lawn mowed, with patterns
• Each of them wants their lawn mowed by the end of the week
• A serial process would be for you to mow all one hundred lawns yourself
• You cannot mow lawn 2 until you mow lawn 1, etc.
• Let’s say doing this takes you the full 7 days to complete, working 16-hour days

SERIAL PROCESSING

• Instructions are executed on one core
• The other cores sit idle
• If a task is running, Task 2 waits for Task 1 to complete, etc.
• Wasting resources
• Want to instead parallelize and
use all cores

SERIAL VS. PARALLEL
PROCESSING

PARALLEL PROCESSING –
THOUGHT EXPERIMENT

• Let’s say that you decide that 100 lawns is too many for one
person to mow in a week
• Or you want to finish it faster
• Therefore you hire one additional person to help you
• How long (in theory) should it take you to finish the lawns?
• Either 3.5 days working 16-hour days, or 7 days working 8-hour days
• You could accomplish this either by both working on one lawn
together or each of you working on a different lawn at the
same time (more on this later)

PARALLEL PROCESSING –
THOUGHT EXPERIMENT

• Similarly, you could hire three more people


• Now five total
• How long should it take you to finish?
• In theory, five times faster
• However, it doesn’t actually work out this way. Why?
• Overhead
• Communication
• Who is mowing which lawn?
• If you split a lawn, who mows which parts?
• How do you make sure the patterns match up?

PARALLEL PROCESSING – THOUGHT EXPERIMENT (CONT.)

• However, it doesn’t actually work out this way. Why?


• Resource contention
• Fights over who gets to use the best lawn mower
• So maybe instead of five times as fast it’s four times as fast
• Still faster
• More people?
• Too many people slows down the process too much to make it
worthwhile
• Diminishing return
• 100 might be too many
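The diminishing return can be illustrated with a toy cost model (an illustrative assumption, not from the slides): total time is the work divided among n workers, plus a per-worker coordination overhead that grows with n.

```python
# Toy model of diminishing returns (assumed for illustration):
# time = work / n_workers + overhead * n_workers
def mowing_time(total_hours, n_workers, overhead_per_worker=0.5):
    """Hypothetical cost model: dividing work helps, coordinating hurts."""
    return total_hours / n_workers + overhead_per_worker * n_workers

work = 112.0  # 7 days x 16 hours for one person
for n in (1, 2, 5, 20, 100):
    t = mowing_time(work, n)
    print(f"{n:3d} workers: {t:6.1f} hours, speedup {work / t:4.1f}x")
```

With these made-up numbers, 5 workers give roughly a 4.5x speedup rather than 5x, and 100 workers are slower than 20 — overhead eventually dominates.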

PARALLEL OVERHEAD

• Should you convert your serial code to parallel?


• Usually do it to speed up
• But need to consider things like overhead
• Overhead because of
• Startup time
• Synchronizations
• Communication
• Overhead by libraries, compilers
• Termination time

https://computing.llnl.gov/tutorials/parallel_comp/#ModelsShared

PROGRAMMING TO USE
PARALLELISM

• Parallelism across processors/threads within a node - OpenMP
• Parallelism across multiple nodes - MPI

www.scan.co.uk
PARALLEL PROCESSING MUSTS
AND TRICKS

• Need to be able to break the problem up into parts that can work independently of each other
• Can’t have the results from one CPU depend on another at each time step

• Do loops are a great place to start looking for bottlenecks in your code

MEMORY MODELS

• There are three common kinds of parallel memory models:
• Shared
• Distributed
• Hybrid
SHARED MEMORY MODEL

SHARED MEMORY MODEL

• All cores share the same pool of memory
• HPC architecture – we talked about the memory available on one node
• Any memory changes are seen by all processors
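A minimal sketch of the shared-memory idea using Python threads, which all live in one process and see the same data (the list, names, and lock are illustrative, not from the slides):

```python
import threading

# All threads share this one list ("the lawn"): shared memory.
counts = [0] * 4
lock = threading.Lock()

def worker(idx, n):
    """Each thread updates its slot of the shared list n times."""
    for _ in range(n):
        with lock:  # synchronization: one source of parallel overhead
            counts[idx] += 1

threads = [threading.Thread(target=worker, args=(i, 1000)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)  # every thread's changes are visible to all: [1000, 1000, 1000, 1000]
```

The lock is the code-level analogue of memory contention: when more workers fight over the same shared resource, waiting time grows.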

THOUGHT EXPERIMENT

• Let’s go back to our lawn mowing example


• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• When all the workers are working on one lawn, they are sharing the memory
• Every “core” is impacted by changes to “memory”

BENEFITS AND DRAWBACK

• Benefit:
• Data sharing is fast

• Drawback:
• Adding more processors may lead to performance issues when accessing the same
shared memory resource (memory contention)
DISTRIBUTED MEMORY MODEL

DISTRIBUTED MEMORY MODEL


• In a distributed memory model, each core has its own memory
• Processors share data only through a network connection and/or a communication protocol (e.g., MPI)
• Changes to the local memory associated with one processor have no impact on other processors
• Remote-memory access must be explicitly managed by the programmer

THOUGHT EXPERIMENT

• Let’s go back to our lawn mowing example


• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• When each worker is working on a different lawn, it is distributed memory

BENEFITS AND DRAWBACKS

• Biggest benefit is scalability


• Adding more processors doesn’t result in resource contention as far as memory is
concerned

• Biggest Drawback
• Can be tedious to program for distributed memory models
• All data relocation must be programmed by hand
HYBRID MEMORY MODEL

HYBRID MEMORY MODEL

• As the name implies, the hybrid memory model is a combination of the shared
and distributed memory models
• Most large, fast clusters today use a hybrid memory model
• A certain number of cores share the memory on one node, but are connected to
the cores sharing memory on other nodes through a network

THOUGHT EXPERIMENT

• Let’s go back to our lawn mowing example


• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• The idea is that you have several workers working on one lawn
• Or, better, several workers working on sections of a lawn, who have to communicate to make it work
• Patterns
BENEFITS AND DRAWBACKS

• Benefit:
• Scalability

• Drawback
• Must know how to program communication between
nodes (e.g., MPI)
DATA AND TASK PARALLELISM

• Earlier discussed data parallel memory methods


• One of them was distributed memory, wherein different memory pools are accessed by different processors
• Data and task parallelism are a similar concept
• Data parallelism
• Distribute the data across processors
• Task parallelism
• Distribute the compute tasks across processors
DATA PARALLELISM

• Different parts of a dataset are distributed across nodes

array1 = [a, b, c, d]

NODE 1: a, b    NODE 2: c, d
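The split sketched above can be written out directly; this is a minimal illustration in Python (the chunking helper is a made-up name, and "nodes" are just labels here):

```python
# Data parallelism sketch: divide one dataset into contiguous chunks,
# one chunk per node/worker.
array1 = ["a", "b", "c", "d"]

def split(data, n_nodes):
    """Divide a dataset into equal contiguous chunks, one per node."""
    k = len(data) // n_nodes
    return [data[i * k:(i + 1) * k] for i in range(n_nodes)]

chunks = split(array1, 2)
print(chunks)  # [['a', 'b'], ['c', 'd']]
```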
TASK PARALLELISM

• Each processor executes a different task on the same dataset
• Tasks (code, instructions) are spread out among the cores
• Might be the same instructions/code or different

• Distributed programming
• Example: calculating wind speed from vector components across a geographic area; divide the vector calculation among processors
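The wind example can be sketched as two different tasks run over the same dataset of (u, v) components (shown sequentially here for clarity; the direction is the plain vector angle, not the meteorological convention):

```python
import math

# Task parallelism sketch: two distinct tasks, one shared dataset.
u = [3.0, 0.0]  # eastward wind components
v = [4.0, 5.0]  # northward wind components

def speed_task(u, v):
    """Task 1: wind speed = sqrt(u^2 + v^2) for each grid point."""
    return [math.hypot(a, b) for a, b in zip(u, v)]

def direction_task(u, v):
    """Task 2: vector angle of the wind, in degrees."""
    return [math.degrees(math.atan2(b, a)) for a, b in zip(u, v)]

print(speed_task(u, v))      # [5.0, 5.0]
print(direction_task(u, v))  # second entry is 90.0
```

In a task-parallel run, one processor would run speed_task while another runs direction_task over the same data.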
DATA PARALLELISM - SIMD

• Two types of data parallelism we’ll discuss here


• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data
• SIMD
• Carry out the same instruction simultaneously multiple times across
different elements of a dataset
• Vector operation
• Addition, subtraction, multiplication, division
• Have to prepare your data to be vectorized
VECTORIZATION

• Simply put, performing multiple math operations at once

Non-vectorized:
a = rand(1,4)
b = rand(1,4)
for i = 1:length(a)
    c(i) = a(i) + b(i)
end

Vectorized:
a = rand(1,4)
b = rand(1,4)
c = a + b

• Interpreted languages (Python, R, etc.): use built-in vectorized operations
• Compiled languages – the compiler can handle it
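The same contrast in Python, assuming NumPy is available (NumPy is not named on the slide; it is the standard way to vectorize array math in Python):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4)
b = rng.random(4)

# Non-vectorized: explicit Python loop over elements
c_loop = np.empty(4)
for i in range(4):
    c_loop[i] = a[i] + b[i]

# Vectorized: one array operation replaces the loop (SIMD-friendly)
c_vec = a + b

print(np.allclose(c_loop, c_vec))  # True
```

For large arrays the vectorized form is typically much faster, since the loop runs in compiled code instead of the Python interpreter.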
DATA PARALLELISM - SPMD

• SPMD
• Carry out the same program multiple times on different elements of a dataset
• Calculate the wind direction from wind components

    (a1, b1)      (a2, b2)
       |             |
    Program       Program
       |             |
      c1            c2
WHY DO THIS?

• Cleaner code
• Faster execution time
• Eliminating loops!
• Usually not too challenging
• Many languages have functions that make this easy to perform

HIGH THROUGHPUT COMPUTING

• Thus far: High Performance Computing (HPC)


• Typical HPC: employ multiple processors to
• Solve a problem faster
• Solve larger problems
• Today: High Throughput Computing (HTC)
• Typical HTC:
• Multiple small jobs spread across many processors

HIGH THROUGHPUT COMPUTING

• HTC useful when have many small jobs that require little computational power
or memory
• Jobs are typically serial, and not parallel
• HTC advantage:
• Small serial jobs can fill in the “gaps” left by large parallel jobs
• E.g., Open Science Grid
• Effectively parallel: batch of jobs completes faster when spread across multiple cores
• Example: Image analysis

ADVANTAGES AND DISADVANTAGES

• Advantages
• Simplicity
• Much easier to match one task to one CPU than many at once
• Doesn’t require knowledge of parallelization from the programmer
• Disadvantage
• Your HPC center might not be set up ideally for HTC
• Might not allow for node sharing
• No batch submission system to manage multiple small jobs
• Possibly requires heavy scripting to manage workflow

HTC MECHANICS

• No real tricks
• Break down your problem and then submit a lot of smaller jobs
• For example, if you are analyzing 1 million images, rather than submitting
one job to analyze all 1 million images, submit one thousand jobs that
analyze 1000 images each
• The resource manager (e.g., Slurm) should take care of the rest

• Is HTC appropriate? Problem dependent!


• Is the serial execution time reasonable?
• Can the problem fit into one core’s worth of memory?
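A common way to submit many small HTC jobs under Slurm is a job array. This is a hedged sketch of such a submission script; the script name, resource limits, and analyze.py with its flags are made-up placeholders, not from the slides:

```shell
#!/bin/bash
#SBATCH --job-name=image-analysis
#SBATCH --array=0-999            # 1000 independent jobs, one per chunk
#SBATCH --ntasks=1               # each job is serial: one core
#SBATCH --mem=2G
#SBATCH --time=00:30:00

# Each array task analyzes its own chunk of 1000 images out of 1 million.
# analyze.py and its --chunk/--chunk-size flags are hypothetical.
python analyze.py --chunk "${SLURM_ARRAY_TASK_ID}" --chunk-size 1000
```

Slurm sets SLURM_ARRAY_TASK_ID (0-999 here) in each job, so one script covers the whole batch and the scheduler fills idle cores as they free up.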
THANK YOU
