COMS7059 Lecture Notes
Scott Hazelhurst
2020
1 Introduction
Goal is to develop high-quality code that can run fast:
• Who is it for?
Amdahl’s law
[Diagram: a program made up of parts A, B, C, D, F]
• Overall speed-up is 1/((1 − p) + p/s), where p is the fraction of the program that can be sped up and s is the speed-up of that part
When s → ∞ the speed-up tends to 1/(1 − p)
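As a quick sanity check of the formula, here is a small Python sketch (the value p = 0.9 is just an illustrative choice):

# Amdahl's law: p is the fraction of the program that can be improved,
# s is the speed-up of that fraction.
def amdahl(p, s):
    return 1.0 / ((1 - p) + p / s)

for s in (2, 4, 16, 1024):
    print("s = %4d  speed-up = %.2f" % (s, amdahl(0.9, s)))
# The speed-up approaches 1/(1 - 0.9) = 10 no matter how large s becomes.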
Summary:
• Be aware that there is a limit to improvement – if you do a good job, other parts become
the bottleneck
[Figure: total working time in hours (0–25) for implementing the same program in C, C++, Java, Perl, Python, Rexx and Tcl – Figure 6 of L. Prechelt, "An Empirical Comparison of Seven Programming Languages", IEEE Computer, 2000. The bad-to-good ratios range from 1.5 for C up to 3.2 for Perl; three Java work times of 40, 49 and 63 hours exceed the chart's bounds and are not shown.]
These results are almost 20 years old – unfortunately there is not much work like this. The result may be different now – the point is not to sell one programming language over another, but just that the choice of language really matters.
Interpretation/Compilation
Each processor has an instruction set
• machine code/assembly code
• very low level
High level code
s = 0
for i in 0..10:
    s = s + a*y[i] + z[i]
Assembly code
     ld  R3, 15000   ;; a
     ldi y, 20000
     ldi z, 30000
     clr R0          ;; s
     clr R1          ;; i
     ldi R2, 10
loop:
     cmp R1, R2
     jeq end
     ld  R4, y
     inc y
     mul R4, R3
     ld  R5, z
     inc z
     add R4, R5
     add R0, R4
     inc i
     jmp loop
end:
Advantages of assembly
• Programmer has direct control over memory
• In almost all cases, tools that translate HLLs produce better quality code
Compilation
Code is converted statically or just-in-time
• Compiler converts HLL program into machine code/assembly before running.
Compilation can take time on very large projects though with modern computers and tools
this is not always a problem
Interpretation
Interpretation is on-the-fly translation of code as it executes
• each time the program executes, the translation happens again
Examples of languages
Compiled Interpreted
C, C++ LISP
Swift Python, Perl
Java ? Java ?
Relatively new (in practice) development
• Just-in-time compilation
def f(x, z):
    y = x[0] + x[1]
    ans = process(x)
    res = ans                       # initialised so the loop has something to update
    for i in range(1, 10000):       # hot loop: this is what a JIT compiles to machine code
        m = blah(ans, z[i])
        if m < 0:
            res = blah1(res)
        else:
            res = blah2(res)
    return res, ans, y
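As a concrete illustration (mine, not from the notes): a JIT such as the one in the numba package compiles a hot Python loop to machine code on its first call. A sketch, assuming numba and NumPy are installed, using the summation loop from earlier:

import numpy as np
from numba import njit

@njit                        # compile this function to machine code on its first call
def scaled_sum(a, y, z):
    s = 0.0
    for i in range(y.shape[0]):
        s = s + a * y[i] + z[i]
    return s

y = np.arange(10, dtype=np.float64)
z = np.arange(10, dtype=np.float64)
print(scaled_sum(2.0, y, z))   # first call compiles; later calls run at compiled speed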
Libraries
Many libraries, including scientific libraries, are available for all languages
• Some code is very specialised and can take advantage of hardware features – difficult to implement yourself
For interpreted languages, libraries allow compiled code to be used for the key operations
1.3 What causes problems?
Source of all the world’s problems
• latency,
• consistency
[Diagram: CPU (ALU, registers) connected to RAM over the data bus, and to disk over I/O lines]
Comparison of latencies
• Execute instruction
Performance determined by
2 Architecture overview
Moore’s law
Since the 1960s, computer performance has improved through improvements in integrated circuit technology
For decades this also meant exponential improvement in chip frequency
[Plot: time in seconds (0–800) against chip frequency in GHz (0.5–4)]
Response 1
Can’t make CPUs much faster by increasing chip speed
• Have more than 1 CPU – parallelise code
Cheaper per FLOP (money and power) to have multiple slower CPUs than 1 fast CPU
• How to exploit this ?
In detail later, but we’ll see how pervasive this is
• Spatial
• Temporal
Examples
• Disk latency extremely high: but a disk transfer rate of 100MB/s is "only" a factor of 20–30 worse than the clock speed
• Not practical or cost-effective to build all memory using this faster memory
• Can be on chip
[Diagram: CPU connected directly to memory]
[Diagram: CPU with a cache sitting between it and memory]
Memory hierarchy
2017 example
            Size          Latency
Register    ~100 bytes    1 cycle
L1-cache    512 KB        4 cycles
L2-cache    8 MB          15 cycles
L3-cache    11 MB         50 cycles
RAM         1 GB–2 TB     150 cycles
• registers
visible to assembly programmer
in most high level languages, registers are invisible to programmer
compiler/interpreter responsible for managing
• cache is invisible – hardware is entirely responsible for moving data between RAM and the different cache levels
NB: We don't just fetch one word into the cache, but a whole line. Depending on the cache, a typical line size is 64 bytes. This exploits locality of reference.
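A small sketch of my own that makes the cache-line effect visible from Python (assuming NumPy; exact times depend on the machine). Both sums touch the same number of float64 values, but in the strided case each value sits on its own 64-byte line, so far more memory traffic is needed:

import time
import numpy as np

n = 2_000_000
a = np.zeros(16 * n)             # float64: 8 bytes each, so 8 values per 64-byte line

contiguous = a[:n]               # n adjacent values (many share a cache line)
strided = a[::16]                # n values spaced 128 bytes apart (one per line)

for name, v in (("contiguous", contiguous), ("strided", strided)):
    start = time.time()
    for _ in range(50):
        v.sum()
    print("%-10s %.3f s" % (name, time.time() - start))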
[Diagram: a range of memory addresses (401204004–401205007) being mapped onto a small set of cache lines (0–4) – we need to map memory addresses to cache]
There are many ways of mapping memory to cache: fully associative, n-way set associative
Virtual Memory
Virtual memory is an abstraction that gives each programmer/process the model of having the entire memory to itself
• Simplifies coding
• Operating system fetches those parts of VM that are needed into physical memory
• If the program references a part of memory that is on disk, the page is swapped in and another page is swapped out (expensive)
Works well if
• the part actively being used – the Resident Set (RES, RSS) – fits in physical memory
Otherwise paging costs start to dominate – thrashing – CPU performance rapidly drops to 0%.
(The top command shows each process's resident set size, RES.)
• Transparent
– Hyper-threading
– Pipelining
• Explicit
– Vector processing
– Multi-processing
– Multi-core
2.3.1 Hyper-threading
Hardware support for running multiple threads on one physical core.
• Two virtual cores share one physical core
• Each virtual core has own copy of state (e.g., registers)
• Can cooperate with each other
• If one thread stalls, the other thread can take over immediately
• Needs OS awareness
Performance benefits vary – depends on workload
Pipelining
Instruction Level Parallelism
• break execution down into steps: simpler, faster steps allow the clock to run faster
• can do some fetching of data in advance
For example:
• Instruction Fetch
• Data Fetch
• Execute
• Write result
              Cycle
Instruction   1   2   3   4   5   6   7   8   9
     1        IF  DF  X   W
     2            IF  DF  X   W
     3                IF  DF  X   W
     4                    IF  DF  X   W
     5                        IF  DF  X   W
     6                            IF  DF  X   W
Problems:
• branch prediction
• hazards : dependencies of one step on the other
Vector Processing
Vector Processing: perform the same operation on each item of a vector, with no dependency between the operations
for i in range(N):
a[i]=x*b[i]+c[i]
• Today the most popular version: Streaming SIMD Extensions (SSE, AVX) found on the x86_64 architecture
Can program explicitly, but typically we rely on an optimising compiler to automatically detect and use SSE.
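From an interpreted language the usual route to these instructions is a vectorised library call rather than hand-written SSE/AVX; a sketch of the loop above using NumPy (whose compiled kernels use SIMD where the hardware supports it):

import numpy as np

N = 1_000_000
x = 2.5
b = np.random.rand(N)
c = np.random.rand(N)

a = x * b + c        # whole-array operation, equivalent to the element-wise loop above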
[Diagram: multi-processor – two CPUs connected to RAM]
[Diagram: multi-core – two processors, each containing several cores, connected to RAM]
Very simple model – works well when it works, but not suitable for all methods.
Cache
Cache can be
• per-core
• per processor
Overview of architecture
[Diagram: a processor with multiple cores, each core with its own L1 cache, sharing an L2 cache, connected to memory]
Common configuration
• multiple processors
• each processor has multiple cores
• L1, L2, L3 cache at different levels
Software framework
Program: code accomplishing a task for a user. Programs reasonably independent of each
other
A program consists of one or more processes
• a process consists of one or more threads
– threads in the same process share memory space
– communicate via shared memory or variables
[Diagram: a shared variable x (initially 0) in RAM/L2-cache, with four threads each executing x = x + 1 – a race]
Mutex
Critical section
• that part of a program that only one thread should be in at a time; and/or
Difficult to design
2.5 Specialised processing units/hardware
Traditional CPU architecture has limits of cost-effective parallelisation
• Many alternative technologies
Will look at 3 different approaches
• GPUs
• FPGAs : Field Programmable Gate Arrays
• Xeon PHI
GPUs
GPUs developed to support video, image rendering
• Same operation performed independently on lots of data
• Try to have massive parallelism (100s, 1000s?) of cores
– each very simple
– good use of power, costs
As GPUs became more sophisticated, used for general purpose programming
Example Architecture
A GPU might contain
• 16 "Streaming Multiprocessors" (SMs)
• Each SM has 32 processors – each processor has an ALU and FPU
• Each SM has SFUs for transcendental instructions
Memory organisation is complex
• 6GB of RAM (Global memory across all SMs)
• Shared memory in each SM (say 48 KB)
• Local memory in each core (say 16K per thread)
• Complex caching
[Figure: block diagram of NVIDIA's G80 graphics processor – an array of streaming processors (SPs); components required only for graphics are not shown]
Programming model restricted
• lockstep execution : e.g., for an if taking different paths there will be stalls
Xeon PHI
– 512-bit wide vector processors
– has L2-cache
Two extremes:
• Software
– Slow, difficult to parallelise on standard hardware
– Cheap and flexible
Basis of implementation
[Diagram: FPGA fabric – a grid of configurable logic blocks (CLBs) connected by routing switches (S)]
3 Software frameworks
We build software abstraction on top of physical architecture
Flynn’s taxonomy
                          Data
                   single      multiple
Instruction single   SISD        SIMD
          multiple   MISD        MIMD
SIMD
• Vector processing
MISD
• Pipelining (arguably)
• Systolic arrays
Data is pumped through computational units which can do different things.
Idea – reduce cost of communication
e.g. multiplication of band matrices
[Diagram: systolic array for band matrix multiplication – a diamond-shaped grid of cells labelled 0:0 … 3:3; the elements a0–a3 and b0–b3 are pumped through the array and the results c0–c6 emerge]
MIMD
Generalised multi-threaded – can have threads executing different code
Even if the code is the same, they can be at different places – not in lockstep
Speed-up
Empirical speed up:
• Let T(n) be the time taken with n processors
• Speed-up S(n) is S(n) = T (1)/T (n)
To be fair, need to have T (1) be the cost of best sequential algorithm
• real super-linear performance is possible – due to the increase in total memory and cache
Efficiency
Extra parallelism comes at a cost – efficiency is an important metric
• E(n) = S(n)/n
• How quickly?
• Alternate view: Asanovic – but practically after a point you don’t get improvement
Gustafson’s law
Constraint on parallelism is communication overhead
• the bigger the data, the easier it is to break into meaningful work blocks. e.g., a search in a 10k file is not easy to parallelise; in a 100G file it may be easy
• W = (1 − p)W + pW, splitting the work W into its serial and parallel parts
• S(n) = (1 − p) + np
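A tiny sketch contrasting the two laws (again with an illustrative p = 0.9): Amdahl's speed-up flattens out at 1/(1 − p), while Gustafson's scaled speed-up keeps growing with n because the problem grows too.

def amdahl(p, n):
    return 1.0 / ((1 - p) + p / n)

def gustafson(p, n):
    return (1 - p) + n * p

for n in (4, 16, 64, 256):
    print("n = %3d  Amdahl = %5.1f  Gustafson = %5.1f"
          % (n, amdahl(0.9, n), gustafson(0.9, n)))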
• Queue
• Pipe
Example
import multiprocessing as mp
import random
from time import sleep

def worker(x):
    print("%d starting" % x)
    sleep(random.randint(5, 10))
    print("%d terminating" % x)

procs = []
for x in range(10):
    p = mp.Process(target=worker, args=(x,))
    procs.append(p)
    p.start()
print("Master going to sleep")
sleep(30)
print("Master wakes")
for p in procs:
    p.join()
print("Master stopping")
Queue
Queue is an object which allows the master to communicate with the workers
• put: put a value on the queue
• get: retrieve a value
• empty: ?
Other methods – optional parameters
def worker(q, x):
    ans = random.randint(5, 10)
    sleep(ans)
    q.put(ans)

procs = []
q = mp.Queue()
for x in range(10):
    p = mp.Process(target=worker, args=(q, x))
    procs.append(p)
    p.start()

sleeps = []
for x in range(10):
    n = q.get()
    sleeps.append(n)
for p in procs:
    p.join()
print("Total sleeping was %d" % sum(sleeps))
print(sleeps)
Could we replace the for x in range(10) with while not q.empty()?
Pipe
A pipe can be created to connect two processes
procs = []
conns = []
for x in range(10):
    c1, c2 = mp.Pipe()
    p = mp.Process(target=worker, args=(c2, x))
    procs.append(p)
    conns.append(c1)
    p.start()

sleeps = []
while len(conns) != 0:
    for conn in list(conns):      # iterate over a copy so we can remove safely
        if conn.poll():
            sleeps.append(conn.recv())
            conns.remove(conn)
....
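The worker used with the pipes is not shown in the notes; a minimal sketch of what it might look like, mirroring the queue version but sending its result back over its end of the pipe:

def worker(conn, x):
    # x is the worker's id, as in the earlier examples
    ans = random.randint(5, 10)
    sleep(ans)
    conn.send(ans)       # send the result back through this end of the pipe
    conn.close()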
Sharing state
By default, no sharing of state
• Depending on the start method/OS, processes can be forked rather than spawned – the process is then initialised with the state of the master
Locks
Locks allow one process to signal that no other process can access some resource
Can't prevent access by force, but by convention a process will:
• acquire the lock
• release the lock
Process A Process P
lock.acquire() lock.acquire()
lock.release() lock.release()
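A minimal sketch (names are mine) of a critical section protected by a multiprocessing lock – the shared counter is exactly the x = x + 1 situation from earlier:

import multiprocessing as mp

def worker(counter, lock, n):
    for _ in range(n):
        with lock:                    # critical section: one process at a time
            counter.value += 1        # read-modify-write on shared memory

if __name__ == "__main__":
    counter = mp.Value('i', 0, lock=False)    # shared int with no implicit lock
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(counter, lock, 100_000))
             for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)   # 400000; dropping the "with lock:" typically loses updates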
• each philosopher goes through a cycle of thinking, being hungry, and eating
Need to avoid:
• deadlock
• livelock
• starvation
[Diagram: P1–P6 arranged in a circle]
Multi-threaded programming
Similar in principle to multi-process programming
• programming is easier in some ways – threads share memory, so no explicit communication is needed
• programming is harder in other ways – shared state means races, and locks are needed
Multi-threaded programming
Widely available
• e.g., Pthreads in C
• Java threads
OpenMP
An API (compiler directives plus a runtime library) that provides multi-threading for several programming languages
• C, C++
• FORTRAN
Goal is to simplify programming
• Programmer provides pragmas – hints to the compiler
• Burden is on the compiler, but relies on well structured code
Programmer can
• Specify how data is shared, if at all
• Identify critical sections, variables that require atomic update
• Specify barriers
• Whether loops should be parallelised
• How work units scheduled
Support for communicating data between “master” and “workers”, producing answers.
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 2;
    #pragma omp parallel num_threads(2) shared(x)
    {
        if (omp_get_thread_num() == 0) {
            x = 5;
        } else {
            printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);
        }
        #pragma omp barrier
        if (omp_get_thread_num() == 0)
            printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
        else
            printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}
Distributed memory processing
Limit to how many cores on a single computer sharing memory
Cluster
[Diagram: a head node connected to a set of worker nodes]
#!/bin/bash
#SBATCH -J admix
#SBATCH -t 01:20:00
#SBATCH -c 8
#SBATCH -p batch
#SBATCH --mem=4096
cd ${DIR}
• Some libraries support shared virtual memory across multiple machines, but this is very expensive
MPI
MPI is the best known parallel library for distributed processing
• Core implementation is in C, but there is support for other languages
• mpi4py for Python
Need to have MPI installed on your system
Key concepts
Communicator
Object which represents processes involved in a parallel task
• Most programs only have one communicator: COMM_WORLD – all the processes
Rank
When we launch n processes all are peers
#!/usr/bin/env python3
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
num_procs = comm.Get_size()
print("I am %d of %d" % (rank, num_procs))
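Running the script under MPI launches several identical copies; with five processes (for example, something like mpiexec -n 5 python3 yourscript.py, depending on the installation) the output is as in the example below.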
Example
I am 2 of 5
I am 4 of 5
I am 1 of 5
I am 3 of 5
Note that in principle all processes are peers and do the same thing. In most programs we'll want some processes (maybe only one) to do something special and act as a coordinator, but that is up to us to program. We usually choose the process with rank 0 for this, but that is up to us to enforce.
Features
• synchronous send, receive
• broadcast
• scatter/gather
Synchronous send/receive
• send / recv
• Can send numpy arrays with structure information – much more efficient.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = [("sensor1", 5), 2019, 11, 4, "ready"]
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print(data)
This is slow because Python has to pickle (marshal) the data, transmit it, and then unpickle it. If you use numpy arrays and specify the type, it's much faster (but more restricted)
Or
from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = numpy.arange(100, dtype=numpy.float64)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    data = numpy.empty(100, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)
• Sometimes it is more natural to use non-blocking (asynchronous) sends and receives
if rank == 0:
    data1 = 10*numpy.arange(1000, dtype='i')
    req1 = comm.Isend([data1, MPI.INT], dest=1, tag=10)
    data2 = numpy.arange(1000, dtype='i')
    req2 = comm.Isend([data2, MPI.INT], dest=1, tag=11)
    req1.wait()
    req2.wait()
elif rank == 1:
    dataA = numpy.empty(1000, dtype='i')
    dataB = numpy.empty(1000, dtype='i')
    req1 = comm.Irecv([dataA, MPI.INT], source=0, tag=11)
    req2 = comm.Irecv([dataB, MPI.INT], source=0, tag=10)
    print("do something")
    req1.wait(); req2.wait()
    print(dataA[0:10])
    print(dataB[0:10])
Broadcasting
if rank == 0:
    data1 = 10*np.arange(100, dtype='i')
else:
    data1 = np.empty(100, dtype='i')
comm.Bcast([data1, MPI.INT], root=0)
print("Proc %d received: " % rank, data1[:10])
Scatter
if rank == 0:
    data1 = 10*np.arange(50, dtype='i').reshape(10,5)
else:
    data1 = None
recv = np.empty(10, dtype='i')
comm.Scatter(data1, recv, root=0)
print("Proc %d received: " % rank, recv)
Gather
comm.Scatter(data1, recv, root=0)
...
response = rank*recv
ans = None
if rank == 0:
    ans = np.empty(50, dtype='i').reshape(10,5)
comm.Gather(response, ans, root=0)
I/O
Can use standard I/O, e.g., ordinary file reads and writes
Problems:
• Performance
NB: processes can be running on different computers – typically we assume a global file
system – all computers see the same file system – but this need not be the case. There are
certainly implementations of MPI where some of the workers do not see the same file system
as the head node.
MPI I-O
MPI provides a library for doing I/O
As an aside, let’s look at the difference between standard network file system and a parallel
file system
[Diagram: head node and workers all accessing a single NFS server]
[Diagram: a parallel file system – the file system is spread over several servers (FS), which the workers access in parallel]
There is only a performance reason to use MPI I-O if there is a parallel file system, but it may still be useful to make programming easier.
Operations can be individual or collective.
File pointer/offset
Often access to file is sequential
• open file – top of file
• subsequent operations advance through file
But can explicitly set the file pointer/offset
• Some operations take an explicit offset parameter
• or use seek to move to that point
Example writing
• Write(buf) : individual operation – each process has its private file handle (presumably
to different files)
• Write_at(buf, offset) : individual operation – write at specified position
• Write_all(buf) : collective operation – all processes write to same file – MPI makes it
look like access serialised
• Write_at_all(buf, offset) ditto, but each process can write to different place
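A hedged sketch of what this looks like from mpi4py (the file name and sizes are mine): every rank writes its own block of a shared file at a byte offset computed from its rank.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.full(100, rank, dtype='i')                    # this rank's block of data
fh = MPI.File.Open(comm, "out.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)  # collective open
fh.Write_at_all(rank * buf.nbytes, buf)                # collective write, each rank at its own offset
fh.Close()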
Reading
MPI I-O normally assumes that a buffer is set up for the data
• In C/C++ typically an array
• In mpi4py, use a bytearray or numpy array
Example read
• Read(buf)
• Read_at(buf,offset) : individual operation – read at specified position
• Read_all(buf) : collective operation – all processes read same file – MPI makes it look
like access serialised
• Read_at_all(buf, offset) ditto, but each process can read from different place
Other operations