
University of the Witwatersrand, Johannesburg

Large-scale computer systems and scientific programming

Selected Topics in Software Engineering

Scott Hazelhurst

2020

1 Introduction
Goal is to develop high-quality code that can run fast:

• Identify how important speed is and potential bottlenecks;

• Choose right software tools;

• Understand hardware environment and impact of choices we can make

1.1 How important is speed?


• Speed of program versus development time?
Often a tension

• How often does it need to run?


Reality check: If it’s only for you, maybe not often.
Reality check: Even if only for you, maybe often.

• Who is it for?

• Computer time may be cheap.


If it parallelises well, a thousand CPU hours may be manageable.

Amdahl’s law
Diagram: a program's runtime divided into parts A, B, C, D, F.

• Sweat the big stuff

• But need to reassess if you improve

Suppose the original time a task takes is T0

• Can improve a proportion p of the original job

• You can make that proportion s times faster

New cost is T0(1 - p) + pT0/s = T0((1 - p) + p/s)

• Overall speed-up is 1/((1 - p) + p/s)

When s → ∞ the speed-up is 1/(1 - p)

• If p = 0.9, the best speed-up is 10
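A tiny sketch (the helper name is mine, not from the notes) that evaluates this formula for p = 0.9:

def amdahl_speedup(p, s):
    # speed-up when a proportion p of the work is made s times faster
    return 1.0 / ((1.0 - p) + p / s)

for s in (2, 10, 100, 1_000_000):
    print("s = %7d: speed-up = %.2f" % (s, amdahl_speedup(0.9, s)))
# as s grows, the speed-up approaches 1/(1 - 0.9) = 10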

Summary:

• Identify the parts that really matter in a program


Is it the I/O?
Is it an inner loop?

• Find better ways of doing things

• Be aware that there is a limit to improvement – if you do a good job, other parts become
the bottleneck

1.2 Choice of software tool


This is only one experiment, but probably reasonably accurate. Taken from Prechelt, Computer 33(10).

Table 1. Programming-language comparison statistics

Language   Number of programs   Compiler or execution platform
Tcl        10                   Tcl 8.2.2
Rexx       4                    Regina 0.08g
Python     13                   Python 1.5.2
Perl       13                   Perl 5.005_02
Java       24                   Sun JDK 1.2.1/1.2.2
C++        11                   GNU g++ 2.7.2
C          5                    GNU gcc 2.7.2

Figure 1 (Prechelt): program runtime on the z1000 data set for each language, on a logarithmic axis. Three programs timed out with no output after about 21 minutes. The bad-to-good ratios range from 1.5 for Tcl up to 27 for C++. The results for C and Rexx are based on only five and four programs respectively and are thus rather coarse estimates; for all other languages the results derive from 10 or more programs.

Figure 6 (Prechelt): total working time in hours for realising the program. The programmers reported the script-group times; the experimenter measured the nonscript-group times. The bad-to-good ratios range from 1.5 for C up to 3.2 for Perl; three Java work times of 40, 49, and 63 hours exceed the chart's bounds. Scripts (total median 3.1 hours) took less than half as long to write as nonscripts (total median 10.0 hours) – a work-time advantage of roughly a factor of two for the scripting languages.

These results are almost 20 years old – unfortunately there has not been much work like this since. The results may be different now. The point is not to sell one programming language over another, but just that the choice of language really matters.
Interpretation/Compilation

Each processor has an instruction set

• machine code/assembly code

• very low level

High level code

s = 0
for i in 0..10:
    s = s + a*y[i] + z[i]

Assembly code

clr R0        ;; s
clr R1        ;; i
ldi R2, 10
ld  R3, 15000 ;; a
ldi y, 20000
ldi z, 30000
loop:
cmp R1, R2
jeq end
ld  R4, y
inc y
mul R4, R3
ld  R5, z
inc z
add R4, R5
add R0, R4
inc i
jmp loop
end:

Advantages of assembly
• Programmer has direct control over memory

• Expert can produce very efficient code (in principle)


But:
• Development costs very high – HLLs much better

• In almost all cases, tools that translate HLLs produce better-quality code

• Many performance issues handled at even lower level (see later)

Compilation
Code is converted statically or just-in-time
• Compiler converts HLL program into machine code/assembly before running.

• Done once – code executes many times

• Static checking of code improves quality

• Does place restrictions on programming language design

• Compile/execute cycle may be a hindrance to development

Compilation can take time on very large projects, though with modern computers and tools this is not always a problem

Interpretation
Interpretation is on-the-fly translation of code as it executes
• each time the program executes, translation happens

• each time a line of code in a loop is executed, it gets translated


Advantage
• more flexible, higher level code
Disadvantage
• Much slower, usually less control of memory

• less error-checking possible

Examples of languages

Compiled       Interpreted
C, C++         LISP
Swift          Python, Perl
Java ?         Java ?
Relatively new (in practice) development

• Just-in-time compilation

def f(x, z):
    y = x[0] + x[1]
    ans = process(x)
    res = 0                        # placeholder initial value
    for i in range(1, 10000):      # hot loop: the part a JIT would compile to machine code
        m = blah(ans, z[i])
        if m < 0:
            res = blah1(res)
        else:
            res = blah2(res)
    return res, ans, y

Aside: Top programming languages – TIOBE (percentage index)

     Language        Feb 18   Jul 19   Aug 20
 1   C                11.8     14.2     17.0
 2   Java             15.0     15.1     14.4
 3   Python            5.0      9.3      9.7
 4   C++               9.0      6.7      6.8
 5   C#                4.4      4.4      4.7
 6   Visual Basic      4.3      4.2      4.7
 7   JavaScript        3.1      2.3      2.9
 8   R                  –        –       2.8
 9   PHP               3.4      2.2      2.2
10   SQL                –       2.0      1.5

Libraries
Many libraries including scientific libraries, available for all languages

• Why re-invent wheel?

• Some code very specialised, can take advantage of hardware features – difficult to im-
plement

For interpreted languages, libraries allow compiled code to be used for the key operations

• Greater control over memory

• Much faster code
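For example, in Python the numpy library (compiled numerical kernels) does the heavy lifting; a minimal sketch:

import numpy as np

n = 1_000_000
b = np.random.rand(n)
c = np.random.rand(n)
x = 2.5

a = x * b + c          # one call: the loop runs in compiled, possibly vectorised, code

# the equivalent interpreted loop is far slower:
# a = [x * b[i] + c[i] for i in range(n)]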

1.3 What causes problems?
Source of all the world’s problems

Why is the world so difficult?


There are some fundamental constraints, including

• latency,

• consistency

• and power and heat

Latency is caused by speed-of-light delays and technology limitations

Diagram: ALU and registers connected to RAM over the data bus, and to disk over I/O lines.

Comparison of latencies

• Chip runs at 2GHz – clock cycle every 0.5ns

• RAM DDR4: 80ns for first word

• SSD Disk: latency 80k ns

• Hard disk: 300k ns

Throughput is not latency though . . . first byte costs a lot
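A quick back-of-the-envelope conversion of these latencies into clock cycles (a small sketch using the numbers above):

clock_hz = 2e9                     # 2 GHz, i.e. 0.5 ns per cycle
latencies_ns = {"RAM (DDR4, first word)": 80,
                "SSD": 80_000,
                "Hard disk": 300_000}
for device, ns in latencies_ns.items():
    cycles = ns * 1e-9 * clock_hz
    print("%-25s %9d ns = %10.0f cycles" % (device, ns, cycles))
# RAM costs ~160 cycles; a hard-disk access costs ~600,000 cycles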

Fetch execute cycle


Most computer architectures are based on the fetch/execute model – controlled by a system clock

• Fetch next instruction

• Fetch data for instruction

• Execute instruction

Performance determined by

• How long it takes to fetch the instruction/data

• How long it takes to execute the instruction (1-4 cycles – so dependent on frequency)

2 Architecture overview
Moore’s law
Since the 1960s, computer performance has improved through technology improvements in integrated circuit design

• number of transistors doubles every 1.5-2 years (or so)

For decades this also meant exponential improvement in chip frequency

• Chip performance also improved

• From about 2003, no longer practical (e.g., heat, power)

Figure: time (s) plotted against chip frequency in GHz.

Response 1
Can’t make CPUs much faster by increasing chip speed
• Have more than 1 CPU – parallelise code
Cheaper per FLOP (money and power) to have multiple slower CPUs than 1 fast CPU
• How to exploit this?
In detail later, but we’ll see how pervasive this is

2.1 Dealing with the latency gap


1. Try to amortise cost of moving data

2. Use principle of locality of reference

• Spatial
• Temporal

3. Overlap things where you can

Examples

• Memory latency is high, but memory bandwidth is 10 GB/s

• Disk latency is extremely high, but a disk rate of 100 MB/s is "only" a factor of 20-30 worse than the clock speed

Caches and caching


A cache is memory that is much faster than main RAM

• Not practical or cost-effective to build all memory using this faster memory

• Can be on chip
Diagram: a CPU connected directly to memory; and a CPU with a cache sitting between it and memory.

Memory hierarchy
2017 example

             Size          Latency
Register     ~100 bytes    1 cycle
L1-cache     512 KB        4 cycles
L2-cache     8 MB          15 cycles
L3-cache     11 MB         50 cycles
RAM          1 GB-2 TB     150 cycles

May be separate caches for data and instruction

Visibility of memory hierarchy


At program level

• registers
visible to assembly programmer
in most high level languages, registers are invisible to programmer
compiler/interpreter responsible for managing

• cache is invisible – the hardware is entirely responsible for managing the movement of data between RAM and the different cache levels

NB: The hardware doesn't fetch just one word into the cache, but a whole line – a typical value, depending on the cache, is 64 bytes. This exploits locality of reference.
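To make the 64-byte figure concrete (a small sketch; the exact line size depends on the CPU):

line_bytes = 64
for elem_bytes, name in [(4, "32-bit int/float"), (8, "64-bit float")]:
    per_line = line_bytes // elem_bytes
    print("%s: %d elements per cache line -> sequential access misses only "
          "once every %d accesses" % (name, per_line, per_line))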

Diagram: memory addresses (e.g., 401204004-401205007) have to be mapped onto a small number of cache lines.

There are many ways of mapping memory to cache: fully-associative, n-way set-associative

• Trade-off: simplicity versus speed

• A more complex mapping may be faster – but takes up more space on the chip

2.2 Virtual Memory


We can use cache to make access to RAM faster. We can use external disk to make RAM
bigger. This works well in many situations.

Virtual Memory
Virtual memory is an abstraction that provides programmer/process a model of having the
entire memory to itself

• Each process sees memory starting at 0!

• Simplifies coding

• Protects users/coders from each other

• Supported and enforced by the hardware and operating system

Essential for a computer running multiple programs, especially multiple users


VM can support memory space greater than RAM

• Part of memory is on disk (swap space)

• Operating system fetches those parts of the VM that are needed into physical memory

• OS and hardware support to map VM to physical memory/disk

• Program does not see this – it just sees the VM

• If the program references part of memory on disk, the page is swapped in and another page swapped out (expensive)

Works well if

• Good locality of reference

• The part actively being used is the Resident Set (RES, RSS)

• RSS must be smaller than the physical memory available

Otherwise paging costs start to dominate – thrashing – CPU performance rapidly drops to
0%.

top – the Unix top utility shows each process's resident set size (RES)

2.3 Parallelism I: Mostly hidden from the programmer


Parallelism
Doing multiple things at once:

• Transparent
Hyper-threading
Pipelining
• Explicit
Vector processing
Multi-processing
Multi-core

2.3.1 Hyper-threading
Hardware support for running multiple threads on one physical core.
• Two virtual cores share one physical core
• Each virtual core has own copy of state (e.g., registers)
• Can cooperate with each other
• If one thread stalls, the other thread can take over immediately
• Needs OS awareness
Performance benefits vary – depends on workload

Pipelining
Instruction Level Parallelism
• break execution down into steps : simpler, faster steps allows clock to run faster
• can do some fetching of data in advance
For example:
• Instruction Fetch
• Data Fetch
• Execute
• Write result
              Cycle
Instruction   1    2    3    4    5    6    7    8    9
    1         IF   DF   X    W
    2              IF   DF   X    W
    3                   IF   DF   X    W
    4                        IF   DF   X    W
    5                             IF   DF   X    W
    6                                  IF   DF   X    W
Problems:
• branch prediction
• hazards: dependencies of one instruction on another

Vector Processing
Vector Processing: perform the same operations on each item of a vector, with no dependency between the operations

for i in range(N):
a[i]=x*b[i]+c[i]

Each operation in loop can be computed separately


Many different architectures dating back decades: e.g. Cray in the mid-70s

• Today the most popular versions: Streaming SIMD Extensions (SSE, AVX) found on the x86_64 architecture

e.g., with AVX-512, 512-bit registers

• Can do 16 32-bit integer additions in parallel

Can program explicitly, but typically rely on optimising compiler to automatically detect
and use SSE:

• NB: choose types judiciously
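A small sketch of why type choice matters: the number of SIMD lanes is simply the register width divided by the element width (register sizes as above; an illustration, not a measurement):

register_bits = 512            # AVX-512; SSE is 128 bits, AVX 256 bits
for bits, name in [(64, "double / int64"), (32, "float / int32"), (16, "int16")]:
    print("%-15s %2d operations per vector instruction" % (name, register_bits // bits))
# halving the element width doubles the number of parallel lanes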

2.4 Multiple processing


Multiprocessor, multicore computers
Also decades old, but they have become ubiquitous in the last 15 years in the mid- and lower-end market

• Can’t make CPUs faster, but can have more CPUs

Need to exploit this computational resource effectively.


Multi-processor

Diagram: two CPUs sharing access to RAM.

Multi-core

Diagram: two CPUs, each containing several cores, all sharing access to RAM.

Very simple model – works well when it works, but not suitable for all methods.

Cache
Cache can be

• per-core

• per processor

• shared between processors (rarer)

NB: Contention between cores and processors becomes a big issue

Overview of architecture

Diagram: each core has its own L1 cache; the cores on a processor share an L2 cache; the processors connect to memory.

Common configuration

• multiple processors

• each processor has multiple cores

• L1, L2, L3 cache at different levels

Software framework
Program: code accomplishing a task for a user. Programs reasonably independent of each
other
A program consists of

• one or more processes


each process has its own (virtual) memory space
processes can communicate with each other with messages, may be other specialised
mechanisms

• a process consists of one or more threads
threads in the same process share memory space
communicate via shared memory or variables

Complications of parallel programming

• How to find parallelism

• Synchronisation: one thread/process needs information from another – implicit or explicit communication

  – Need to wait for data from another process

  – Need to ensure the other process has done some work

• Mutual exclusion: Shared resources need to be managed carefully

Diagram: four threads each execute x = x + 1 on a shared variable x (initially 0) held in RAM/L2 cache.

Mutex
Critical section

• that part of a program that only one thread should be in at a time; and/or

• when a variable should be modified

Difficult to design

• Too small a critical section, and the code will be wrong

• Too big or too many, and less code can run in parallel

• Easy to make mistakes

Consistency versus performance

2.5 Specialised processing units/hardware
Traditional CPU architecture has limits of cost-effective parallelisation
• Many alternative technologies
Will look at 3 different approaches
• GPUs
• FPGAs : Field Programmable Gate Arrays
• Xeon PHI

GPUs
GPUs developed to support video, image rendering
• Same operation performed independently on lots of data
• Try to have massive parallelism (100s, 1000s?) of cores
– each very simple
– good use of power, costs
As GPUs became more sophisticated, they came to be used for general-purpose programming
Example Architecture
A GPU might contain

• 16 "Streaming Multiprocessors" (SMs)

• Each SM has 32 processors – each processor has an ALU and FPU

• Each SM has SFUs for transcendental instructions

Memory organisation is complex

• 6 GB of RAM (global memory, shared across all SMs)

• Shared memory in each SM (say 48 KB)

• Local memory in each core (say 16K per thread)

• Complex caching

Interleaved excerpt (from a GPU-architecture text based on NVIDIA's G80/Fermi and CUDA): a GPU is an accelerator (co-processor) connected to a host CPU via PCI Express; the PCIe bandwidth – 4 Gbit/s (Gen 2) or 8 Gbit/s (Gen 3) – is often one of the biggest bottlenecks, so work should be offloaded to the GPU only if the benefit outweighs the offload cost. Figure 1.1 of the excerpt shows a system overview with a CPU and a discrete GPU.

Figure 1.2 (from the excerpt): block diagram of NVIDIA's G80 graphics processor – several streaming multiprocessors (SMs), each made up of many streaming processors (SPs), connected to the GPU's DRAM; components required only for graphics are not shown.
Programming model restricted

• Programmer specifies a block of threads to execute on an SM

• Divided into "warps" of size 32

• Warps execute synchronously

• Switch between warps when blocked for memory access

(The excerpt describes this as SIMD/SIMT execution combined with hardware multithreading: one instruction is fetched, decoded and scheduled for many data operands, and fast hardware context switching between threads is used to tolerate long memory and operation latencies.)

Programming GPU
GPU programming is complex

• CUDA, OpenCL best known libraries

Difficulties in parallelising

• communication with CPU

• memory management complex

• lockstep execution: e.g., for an if taking different paths there will be stalls

Xeon PHI

Many-core architecture by Intel

• > 50 Cores, connected with fast interconnect to a shared memory


• general x86 architecture

• Each core relatively slow (< 2GHz)

  – can support 4 threads

  – 512-bit wide vector processors

  – has L2-cache

Two extremes:

• ASICs : application-specific integrated circuits


Very fast, efficient (speed, power).
Expensive to design, mistakes very expensive – only really viable in large quantities or
high-end needs

• Software
Slow, difficult to parallelise on standard hardware
Cheap and flexible

FPGAs : fall in between – programmable hardware


Grew from PLDs

Basis of implementation

Diagram: an FPGA fabric – a grid of configurable logic blocks (CLBs) connected through switch blocks (S).

3 Software frameworks
We build software abstraction on top of physical architecture

• Software framework hides some of the complexity


e.g. in HLL we don’t worry about what’s in registers, what’s in RAM

• Abstraction aids ease of programming

• But usually some match between software and hardware

Flynn’s taxonomy
                          Data stream
                      single     multiple
Instruction  single    SISD        SIMD
stream       multiple  MISD        MIMD
SIMD

• Vector processing

• Variant : SIMT – single instruction multiple thread


Multiple threads running per core to hide latency

MISD

• Pipelining (arguably)

• Systolic arrays
Data is pumped through computational units which can do different things.
Idea – reduce cost of communication
e.g. multiplication of band matrices

Diagram: a systolic array for band-matrix multiplication – inputs a0..a3 and b0..b3 flow through a grid of cells (0:0 ... 3:3), producing outputs c0..c6.

MIMD
Generalised multi-threaded – can have threads executing different code
Even if the code is the same, they can be at different places – not in lockstep

Amdahl’s law, speed-up, efficiency


Amdahl’s general law applies
Suppose the original time a task takes is T0
• Can parallelise a proportion p of the original job
• You can have n processors running in parallel
Best you can do: T0((1 - p) + p/n)
• Overall speed-up is 1/((1 - p) + p/n)

When n → ∞ the speed-up is 1/(1 - p)

• If p = 0.9, the best speed-up is 10
Optimal speed-up is hard to achieve
• Typically communication overhead adds work to be done
• Gets worse as n grows
• Work allocation not always optimal so cores may be idle
– Static – hard to load balance
– Dynamic – may be hard to implement, load balancing still not optimal

Speed-up
Empirical speed up:
• Let T(n) be the time taken with n processors
• Speed-up S(n) is S(n) = T(1)/T(n)
To be fair, T(1) should be the cost of the best sequential algorithm
• real super-linear speed-up is possible – due to the increase in total memory and cache

Efficiency
Extra parallelism comes with cost – efficiency an important metric

• E(n) = S(n)/n

As a general rule E(n) → 0 as n → ∞

• How quickly?

• Alternate view: Asanovic – but practically after a point you don’t get improvement

Gustafson’s law
Constraint on parallelism is communication overhead

• Generally depends on granularity of work

• e.g. embarrassingly parallel – essentially independent

• The bigger the data, the easier it is to break into meaningful work blocks, e.g., a search in a 10k file is not easy to parallelise; in a 100G file it may be easy

• Limitation of Amdahl's law – it assumes the work is fixed

• But parallelism allows us to tackle much bigger problems

• For fixed time T, how much bigger a task can we do?

• W = (1 - p)W + pW

• W(n) = (1 - p)W + npW

• S(n) = (1 - p) + np (see the sketch below)

Different insight – but depends on workload.

• If algorithm is cubic for example, may not be possible to scale
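A sketch contrasting the two laws for p = 0.9 (the helper names are mine):

def amdahl(p, n):          # fixed problem size
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(p, n):       # problem grows with n, fixed wall-clock time
    return (1.0 - p) + n * p

for n in (4, 16, 64, 256):
    print("n=%4d  Amdahl %6.2f   Gustafson %7.1f" % (n, amdahl(0.9, n), gustafson(0.9, n)))
# Amdahl saturates at 10; Gustafson's scaled speed-up keeps growing with n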

Python multiprocessing module


Supports parallelism at the process level

• Process : abstraction of a process

• Pool: abstraction of a collection of processes

Can communicate between processes with

• Queue

• Pipe

Example

from time import sleep

import random
import multiprocessing as mp

def worker(x):
    print("%d starting" % x)
    sleep(random.randint(5, 10))
    print("%d terminating" % x)

procs = []
for x in range(10):
    p = mp.Process(target=worker, args=(x,))
    procs.append(p)
    p.start()

print("Master going to sleep")
sleep(30)
print("Master wakes")
for p in procs:
    p.join()

print("Master stopping")
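The example above manages the Process objects by hand; the Pool abstraction mentioned earlier does this bookkeeping for you. A minimal sketch (the worker body is illustrative):

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:           # four worker processes
        results = pool.map(square, range(10))    # the work is divided among the workers
    print(results)                               # [0, 1, 4, 9, ..., 81]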

Queue
A Queue is an object which allows the master to communicate with the workers
• put: put a value on the queue
• get: retrieve a value
• empty: ?
Other methods – optional parameters

def worker(q, x):
    ans = random.randint(5, 10)
    sleep(ans)
    q.put(ans)

procs = []
q = mp.Queue()

for x in range(10):
    p = mp.Process(target=worker, args=(q, x))
    procs.append(p)
    p.start()

sleeps = []
for x in range(10):
    n = q.get()
    sleeps.append(n)

for p in procs:
    p.join()

print("Total sleeping was %d" % sum(sleeps))
print(sleeps)
Could we replace the for x in range(10) with while not q.empty()?

Pipe
A pipe can be created to connect two processes

conn1, conn2 = mp.Pipe()


Can send and receive

• Pipe can be simplex or duplex

def worker(conn, x):
    ans = random.randint(5, 10)
    sleep(ans)
    conn.send(ans)

procs = []
conns = []

for x in range(10):
    c1, c2 = mp.Pipe()
    p = mp.Process(target=worker, args=(c2, x))
    procs.append(p)
    conns.append(c1)
    p.start()

sleeps = []
while len(conns) != 0:
    for conn in list(conns):      # iterate over a copy so we can remove safely
        if conn.poll():
            sleeps.append(conn.recv())
            conns.remove(conn)
....

Sharing state
By default, no sharing of state

• Depending on the mode/OS, processes can be forked rather than spawned – the process is initialised with the state of the master

• Can share objects through multiprocessing.Value and Array


Must be explicitly shared

Locks
Locks allow one process to signal that no other process should access some resource
A lock can't physically prevent access, but cooperating processes use:

• acquire lock

• release lock

Process A Process P
lock.acquire() lock.acquire()

lock.release() lock.release()
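A minimal sketch tying locks to the shared Value objects mentioned above (variable names are illustrative): four processes increment a shared counter, with the lock protecting the read-modify-write critical section.

import multiprocessing as mp

def worker(lock, counter):
    for _ in range(10_000):
        with lock:                  # acquire ... release around the critical section
            counter.value += 1      # the read-modify-write must not be interleaved

if __name__ == "__main__":
    lock = mp.Lock()
    counter = mp.Value('i', 0)      # an explicitly shared integer
    procs = [mp.Process(target=worker, args=(lock, counter)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)            # 40000; without the lock the result is unpredictable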

Dining Philosophers Problem

A set of n philosophers sit down at a round table

• there are n chopsticks

• each philosopher goes through a cycle of thinking, being hungry, and eating

• in order to eat s/he needs to have two chopsticks

Need to avoid:

• deadlock

• livelock

• starvation

Diagram: philosophers P1-P6 seated around a circular table, with a chopstick between each pair.
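A minimal sketch of one standard way to avoid deadlock – always pick up the lower-numbered chopstick first, so no cycle of waiting can form (the sleeps and iteration count are illustrative):

import random
import multiprocessing as mp
from time import sleep

N = 5

def philosopher(i, chopsticks):
    first, second = sorted((i, (i + 1) % N))    # global ordering prevents a deadlock cycle
    for _ in range(3):                          # think / hungry / eat cycle
        sleep(random.random())                  # thinking
        with chopsticks[first]:                 # hungry: acquire both chopsticks in order
            with chopsticks[second]:
                sleep(random.random())          # eating

if __name__ == "__main__":
    chopsticks = [mp.Lock() for _ in range(N)]
    procs = [mp.Process(target=philosopher, args=(i, chopsticks)) for i in range(N)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()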

Multi-threaded programming
Similar in principle to multi-process programming

• Message passing between threads

• Shared memory access common

Shared memory access makes

• programming easier

• programming harder

• data access/communication more efficient

• at least some of the time

Multi-threaded programming
Widely available

• e.g., Pthreads in C

• Java threads

• Python does not support it well

OpenMP
Library that provides multi-threading for many programming languages
• C, C++
• FORTRAN
Goal is to simplify programming
• Programmer provides pragmas – hints to the compiler
• Burden is on the compiler, but relies on well structured code

Programmer can
• Specify how data is shared, if at all
• Identify critical sections and variables that require atomic update
• Specify barriers
• Specify whether loops should be parallelised
• Specify how work units are scheduled

Support for communicating data between “master” and “workers”, producing answers.

void simple(int n, float *a, float *ans) {
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        ans[i] = (a[i] + a[i-1]) / 2.0;
}

But not completely straightforward

x = 2;
#pragma omp parallel num_threads(2) shared(x)
{
    if (omp_get_thread_num() == 0) {
        x = 5;
    } else {
        printf("1: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    }
    #pragma omp barrier
    if (omp_get_thread_num() == 0)
        printf("2: Thread# %d: x = %d\n", omp_get_thread_num(), x);
    else
        printf("3: Thread# %d: x = %d\n", omp_get_thread_num(), x);
}

Distributed memory processing
Limit to how many cores on a single computer sharing memory

• Common model – many computers

• Cluster : connected on fast local network

• Grid: connected via the internet (owned by different organisations)

• Cloud computing as model

Cluster

Diagram: a cluster – a head node connected to a set of worker nodes over a fast local network.

3.0.1 Independent Jobs

Parallelism – Independent jobs

• Different or same user can run independent jobs on cluster

• Effective sharing of resources

• User logs on to head node.

– Creates job request


– Sends it to the scheduler, which allocates the job to a computer

Job request in file e.g. admix.sh

#!/bin/bash
#SBATCH -J admix
#SBATCH -t 01:20:00
#SBATCH -c 8
#SBATCH -p batch
#SBATCH --mem=4096

cd ${DIR}

admixture -j3 ${DATA}.bed ${K}


Run by saying
sbatch admix.sh

3.1 MPI – parallelism within the job


Job is split into multiple cooperating processes:
• Each process has its own separate memory

• Processes can be running on the same node, but typically on separate ones – so no physical shared memory is possible

• Shared (virtual) memory is usually not possible

• Some libraries support shared virtual memory across multiple machines, but it is very expensive

MPI
MPI is the best known parallel library for distributed processing
• Core implementation in C but there’s support for other systems

• mpi4py
Need to have MPI installed on your system

mpiexec -np 4 multiply.py matrix.dat

Runs the job on four processors provided by the system

Key concepts

Communicator
Object which represents processes involved in a parallel task
• Most programs only have one communicator: COMM_WORLD – all the processes

• Could have multiple – but only needed for complex programs

Rank
When we launch n processes all are peers

• Rank: a number identifying a process uniquely – usually we make 0 the master

#!/usr/bin/env python3

from mpi4py import MPI

comm = MPI.COMM_WORLD

rank = comm.Get_rank()
num_procs = comm.Get_size()

print ("I am %d of %d\n"%(rank,num_procs))

Running the code

mpiexec -np 5 python3 try.py

Example

[scott@n01 ~]$ mpiexec -np 5 python3 try.py


I am 0 of 5

I am 2 of 5

I am 4 of 5

I am 1 of 5

I am 3 of 5

Note that, in principle, all processes are peers and do the same thing. In most programs we'll want some processes (maybe only one) to do something special and act as a coordinator, but that is up to us to program. We usually choose the process with rank 0 for this, but that is up to us to enforce.

Features

• synchronous send, receive

• asynchronous send, receive

• broadcast

• scatter/gather

Synchronous send/receive

send/ receive

• Can send arbitrary data but inefficient

• Can send numpy arrays with structure information – much more efficient.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = [("sensor1", 5), 2019, 11, 4, "ready"]
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print(data)

mpiexec -np 5 python3 try1.py

• Process 0 sends data

• Process 1 receives data

• Process 2–4 do nothing

[scott@n01 ~]$ mpiexec -np 5 python3 try1.py

[('sensor1', 5), 2019, 11, 4, 'ready']

This is slow because Python has to pickle (marshal) the data, transmit it, and then unpickle it. If you use numpy arrays and specify the type, it's much faster (but more restricted)

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# passing MPI datatypes explicitly


if rank == 0:
    data = numpy.arange(1000, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    data = numpy.empty(1000, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)
    print(data)

Or

if rank == 0:
    data = numpy.arange(100, dtype=numpy.float64)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    data = numpy.empty(100, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)

Asynchronous communication isend/irecv/wait


With asynchronous communication, the receiver and sender can initiate the communication, continue to do other things, and then check later

• Can be more efficient

• Sometimes natural

• Often makes coding trickier to get right

if rank == 0:
    data1 = 10*numpy.arange(1000, dtype='i')
    req1 = comm.Isend([data1, MPI.INT], dest=1, tag=10)
    data2 = numpy.arange(1000, dtype='i')
    req2 = comm.Isend([data2, MPI.INT], dest=1, tag=11)
    req1.wait()
    req2.wait()
elif rank == 1:
    dataA = numpy.empty(1000, dtype='i')
    dataB = numpy.empty(1000, dtype='i')
    req1 = comm.Irecv([dataA, MPI.INT], source=0, tag=11)
    req2 = comm.Irecv([dataB, MPI.INT], source=0, tag=10)
    print("do something")
    req1.wait(); req2.wait()
    print(dataA[0:10])
    print(dataB[0:10])

• There is also a test routine which checks whether the communication is complete

• note use of tag

Broadcasting

if rank == 0:
    data1 = 10*np.arange(100, dtype='i')
else:
    data1 = np.empty(100, dtype='i')

comm.Bcast([data1, MPI.INT], root=0)
print("Proc %d received: " % rank, data1[:10])

Scatter

if rank == 0:
    data1 = 10*np.arange(50, dtype='i').reshape(10, 5)
else:
    data1 = None
recv = np.empty(10, dtype='i')

comm.Scatter(data1, recv, root=0)
print("Proc %d received: " % rank, recv)


Scatter and Gather

comm.Scatter(data1, recv, root=0)
...
response = rank*recv
ans = None
if rank == 0:
    ans = np.empty(50, dtype='i').reshape(10, 5)
comm.Gather(response, ans, root=0)

I/O
Can use standard I/O, e.g.,

• master process does I/O and broadcasts, sends data; or

• worker processes read/write different files

Problems:

• Coordinating access to the same file

• Performance

NB: processes can be running on different computers – typically we assume a global file
system – all computers see the same file system – but this need not be the case. There are
certainly implementations of MPI where some of the workers do not see the same file system
as the head node.
MPI I-O
MPI provides a library for doing I/O

• higher level of abstraction for shared I/O

• better performance where parallel file system is provided

As an aside, let’s look at the difference between standard network file system and a parallel
file system

NFS: Traditional file system for cluster

Diagram: with NFS, the head node runs the NFS server and all workers access files through it.

Parallel file system

Diagram: with a parallel file system, the data is spread across several file servers (FS), which the head node and workers access in parallel.

There is only a performance reason to use MPI I-O if there is a parallel file system, but it may still be useful to make programming easier.

Introduction to MPI I-O

Operations can be

• independent : each process can read/write to files independently of each other

• collective: processes all read or write from same file


This requires cooperation otherwise can have deadlock – potential performance im-
provement

File pointer/offset
Often access to file is sequential
• open file – top of file
• subsequent operations advance through file
But can explicitly set the file pointer/offset
• Some operations have parameter
• or use seek to move to that point

Example writing

• Write(buf) : individual operation – each process has its private file handle (presumably
to different files)
• Write_at(buf, offset) : individual operation – write at specified position
• Write_all(buf) : collective operation – all processes write to same file – MPI makes it
look like access serialised
• Write_at_all(buf, offset) ditto, but each process can write to different place

Reading
MPI I-O normally assumes that a buffer is set up for the data
• In C/C++ typically an array
• In Python (mpi4py), use a bytearray or a numpy array

Example read

• Read(buf)
• Read_at(buf,offset) : individual operation – read at specified position
• Read_all(buf) : collective operation – all processes read same file – MPI makes it look
like access serialised
• Read_at_all(buf, offset) ditto, but each process can read from different place
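A minimal mpi4py sketch of the collective write/read-at-offset operations listed above (the file name and buffer size are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 100
buf = np.full(n, rank, dtype='i')               # each rank writes its own block

fh = MPI.File.Open(comm, "out.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * buf.nbytes, buf)         # collective write at a per-rank offset
fh.Close()

fh = MPI.File.Open(comm, "out.dat", MPI.MODE_RDONLY)
rbuf = np.empty(n, dtype='i')
fh.Read_at_all(rank * rbuf.nbytes, rbuf)        # collective read of the same block
fh.Close()
print(rank, rbuf[:5])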

Other operations

• Open – need to specify mode(s)

• Close

• Get_size(): how big the file is

• Seek(offset, whence): move the file pointer to the given position

• Seek_all(offset, whence): move the file pointer to the given position
