
CS 408 - Parallel Processing

Professor BenBella S. Tawfik

Lecture 2
Analysis


Useful Diagram

One node per O(1) operation.

Question: why are there no cycles in this graph?
Analysis


Definitions


More Definitions



Amdahl’s Law

• Now it's time for some bad news.
• In practice, your program won't just sum all the elements in an array.
• You will have a program with:
  – Some parts that parallelize well
    • Can turn them into a map or a reduce.
  – Some parts that won't parallelize at all
    • Operations on a linked list.
    • Reading a text file.
    • A computation where each step needs the result of the previous steps.
Amdahl’s Law

• This is BAD NEWS.
• If 1/3 of our program can't be parallelized, we can't get a speedup better than 3.
  – No matter how many processors we throw at our problem.
  – And while the first few processors make a huge difference, the benefit diminishes quickly.
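As a concrete illustration (a minimal C sketch, not part of the slides), the program below tabulates Amdahl's speedup s = 1/(f + (1 − f)/p) for a serial fraction f = 1/3: the values climb toward 3 but never reach it, and each additional processor buys less than the last.

```c
#include <stdio.h>

/* Amdahl's Law: speedup s = 1 / (f + (1 - f) / p),
 * where f is the fraction of the program that cannot be parallelized
 * and p is the number of processors. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    const double f = 1.0 / 3.0;                 /* 1/3 of the program is serial */
    int procs[] = {1, 2, 4, 8, 16, 64, 1024};

    for (size_t i = 0; i < sizeof procs / sizeof procs[0]; i++)
        printf("p = %4d  ->  speedup = %.2f\n",
               procs[i], amdahl_speedup(f, procs[i]));
    /* Output approaches, but never exceeds, 1/f = 3 no matter how large p gets. */
    return 0;
}
```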
Amdahl's Law and Moore's Law

Amdahl's Law: Moving Forward

• Unparallelized code becomes a bottleneck quickly.
• What do we do? Design smarter algorithms!
• Consider the following problem:
  – Given an array of numbers, return an array with the "running sum".

    Input:   3   7   6   2   4
    Output:  3  10  16  18  22
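The running sum is a good example of why the naive algorithm resists parallelization: each output depends on the previous one. A minimal sequential C sketch (not from the slides) makes that dependence explicit.

```c
#include <stdio.h>

/* Naive running sum: each element depends on the previous result,
 * so this loop cannot be parallelized directly -- the point of the slide. */
static void running_sum(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += in[i];
        out[i] = acc;
    }
}

int main(void) {
    int in[]  = {3, 7, 6, 2, 4};
    int out[5];
    running_sum(in, out, 5);
    for (int i = 0; i < 5; i++)
        printf("%d ", out[i]);   /* prints: 3 10 16 18 22 */
    printf("\n");
    return 0;
}
```

The "smarter algorithm" the slide asks for is presumably a parallel prefix-sum (scan), which produces the same output in O(log n) parallel steps by combining partial sums over a tree.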
1.3 Parallel Processing Ups and Downs

Fig. 1.10 Richardson's circular theater for weather forecasting calculations: using thousands of "computers" (humans + calculators), coordinated by a conductor, for 24-hr weather prediction in a few hours.

• 1960s: ILLIAC IV (U Illinois) – four 8 × 8 mesh quadrants, SIMD
• 1980s: Commercial interest – technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt.
• 2000s: Internet revolution – info providers, multimedia, data mining, etc. need lots of power
• 2020s: Cloud, big-data, AI/ML
Trends in High-Technology Development

[Timeline chart, 1960–2000: for each of Graphics, Networking, RISC, and Parallelism, bars show government research (GovRes), industrial research (IndRes), and industrial development (IndDev) phases, with transfer of ideas/people between them, leading to $1B businesses.]

The evolution of parallel processing has been quite different from other high-tech fields.

Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).

Trends in Hi-Tech Development (2003): the same chart, extended to 2010–2020.
Status of Computing Power (circa 2000)

GFLOPS on desktop (later TFLOPS, then PFLOPS; Peta = 10^15):
  Apple Macintosh, with G4 processor
TFLOPS in supercomputer center (later PFLOPS, then EFLOPS; Exa = 10^18):
  1152-processor IBM RS/6000 SP (switch-based network)
  Cray T3E, torus-connected
PFLOPS on the drawing board (later EFLOPS, then ZFLOPS; Zeta = 10^21):
  1M-processor IBM Blue Gene (2005?)
  32 proc's/chip, 64 chips/board, 8 boards/tower, 64 towers
  Processor: 8 threads, on-chip memory, no data cache
  Chip: defect-tolerant, row/column rings in a 6 × 6 array
  Board: 8 × 8 chip grid organized as 4 × 4 × 4 cube
  Tower: boards linked to 4 neighbors in adjacent towers
  System: 32 × 32 × 32 cube of chips, 1.5 MW (water-cooled)
Flynn Taxonomy

  Flynn (1966)           Single Data   Multiple Data
  Single Instruction     SISD          SIMD
  Multiple Instruction   MISD          MIMD

• MISD
  – Fault tolerance
  – Pipeline processing/streaming or systolic array
  – Now extended to SPMD (single program, multiple data)

(Slide: Mikko Lipasti, University of Wisconsin)
Memory Taxonomy

  For Shared Memory     Uniform Memory   Nonuniform Memory
  Cache Coherence       CC-UMA           CC-NUMA
  No Cache Coherence    NCC-UMA          NCC-NUMA

• NUMA wins out for practical implementation
• Cache coherence favors the programmer
  – Common in general-purpose systems
• NCC widespread in scalable systems
  – CC overhead is too high, not always necessary

(Slide: Mikko Lipasti, University of Wisconsin)
Flynn Taxonomy (1966) Recap

                                 Data Stream
                        Single                      Multi
  Instruction   Single  SISD                        SIMD
  Stream                (Single-Core Processors)    (GPUs, Intel SSE/AVX extensions, …)
                Multi   MISD                        MIMD
                        (Systolic Arrays, …)        (VLIW, Parallel Computers)
Flynn Taxonomy Recap

  Single-Instruction, Single-Data (SISD):  Single-Core Processors
  Single-Instruction, Multi-Data (SIMD):   GPUs, Intel SIMD Extensions
  Multi-Instruction, Single-Data (MISD):   Systolic Arrays, …
  Multi-Instruction, Multi-Data (MIMD):    Parallel Computers
Intel SIMD Extensions

• New instructions, new registers
• Introduced in phases/groups of functionality
  – SSE – SSE4 (1999–2006)
    • 128-bit-wide operations
  – AVX, FMA, AVX2, AVX-512 (2008–2015)
    • 256- to 512-bit-wide operations
    • AVX-512 chips not available yet (as of Spring 2019), but soon!
  – F16C, and more to come?


Intel SIMD Registers (AVX-512)

• XMM0 – XMM15: 128-bit registers (SSE)
• YMM0 – YMM15: 256-bit registers (AVX, AVX2)
• ZMM0 – ZMM31: 512-bit registers (AVX-512)

(Each XMMi is the low 128 bits of the corresponding YMMi, which is in turn the low 256 bits of ZMMi.)
SSE/AVX Data Types

A 256-bit YMM register (bits 255..0) can be treated as:
  • 8 × float (32-bit)
  • 4 × double (64-bit)
  • 8 × int32
  • 16 × int16
  • 32 × int8

Operation on 32 8-bit values in one instruction!
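A minimal sketch of the "32 8-bit values in one instruction" idea (not from the slides; assumes gcc or clang with -mavx2 on an AVX2-capable CPU), using the _mm256_add_epi8 intrinsic:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Two arrays of 32 8-bit integers, 32-byte aligned for the aligned AVX loads below. */
    __attribute__((aligned(32))) int8_t a[32], b[32], c[32];
    for (int i = 0; i < 32; i++) { a[i] = (int8_t)i; b[i] = 1; }

    __m256i va = _mm256_load_si256((const __m256i *)a);
    __m256i vb = _mm256_load_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi8(va, vb);      /* 32 byte-wise adds in one instruction */
    _mm256_store_si256((__m256i *)c, vc);

    for (int i = 0; i < 32; i++)
        printf("%d ", c[i]);                   /* prints 1 2 3 ... 32 */
    printf("\n");
    return 0;
}
```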
Sandy Bridge Microarchitecture

e.g., "Port 5 pressure" when code uses too many shuffle operations
Skylake Die Layout
Aside: Do I Have SIMD Capabilities?

  less /proc/cpuinfo
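Besides paging through /proc/cpuinfo, a gcc/clang-specific sketch (an alternative not mentioned on the slide) can query the same feature flags programmatically:

```c
#include <stdio.h>

/* GCC/Clang builtin check for SIMD feature flags. */
int main(void) {
    __builtin_cpu_init();
    printf("SSE4.2:  %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```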
The Global View of Computer Architecture

[Diagram: Computer Architecture — instruction set design, organization, hardware/software boundary — at the center, shaped by Applications, Parallelism, History, Technology, Programming Languages, OS, Compilers, Measurement and Evaluation, and Interface Design (ISA).]

(Slides: Eurípides Montagne, University of Central Florida, CDA 4150)
The Task of a Computer Designer

• Determine what attributes are important for a new machine.
• Design a machine to maximize performance while staying within cost constraints.
Flynn's Taxonomy

• Michael Flynn (from Stanford)
• Made a characterization of computer systems which became known as Flynn's Taxonomy

[Diagram: a Computer fed by an Instructions stream and a Data stream.]
Flynn's Taxonomy

• SISD – Single Instruction, Single Data Systems

[Diagram: one instruction stream (SI) and one data stream (SD) feed a single SISD processing element.]
Flynn's Taxonomy

• SIMD – Single Instruction, Multiple Data Systems: "Array Processors"

[Diagram: a single instruction stream (SI) drives several processing elements, each with its own data stream (SD) — multiple data.]
Flynn's Taxonomy

• MIMD – Multiple Instruction, Multiple Data Systems: "Multiprocessors"

[Diagram: several processing elements, each with its own instruction stream (SI) and its own data stream (SD).]
Flynn's Taxonomy

• MISD – Multiple Instruction, Single Data Systems
  – Some people say "pipelining" lies here, but this is debatable.

[Diagram: several processing elements, each with its own instruction stream (SI), all operating on a single data stream (SD).]
SISD One-Address Machine: Abbreviations

• IP: Instruction Pointer
• MAR: Memory Address Register
• MDR: Memory Data Register
• A: Accumulator
• ALU: Arithmetic Logic Unit
• IR: Instruction Register
• OP: Opcode
• ADDR: Address

[Diagram: datapath with IP, MAR, MEMORY, MDR, IR (OP | ADDR), A, DECODER, and ALU.]
One-Address Format: LOAD X   (instruction layout: OP | ADDRESS)

  MAR ← IP
  MDR ← M[MAR]  ||  IP ← IP + 1
  IR  ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A   ← MDR
One-Address Format: ADD X

  MAR ← IP
  MDR ← M[MAR]  ||  IP ← IP + 1
  IR  ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A   ← A + MDR
One-Address Format: STORE X

  MAR ← IP
  MDR ← M[MAR]  ||  IP ← IP + 1
  IR  ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← A
  M[MAR] ← MDR
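A minimal C sketch that ties the three walkthroughs together (the opcode values and memory layout are hypothetical, not from the slides): it runs the program LOAD 10; ADD 11; STORE 12 through the same register-transfer steps as above.

```c
#include <stdio.h>

/* Toy one-address accumulator machine following the slides' register-transfer steps.
 * Each memory word holds an instruction (OP, ADDR) or a data value.
 * The opcode encoding is made up for illustration; the slides do not fix one. */
enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

typedef struct { int op, addr; } Word;   /* OP | ADDRESS, or a data value in .addr */

int main(void) {
    Word M[16] = {
        {LOAD, 10}, {ADD, 11}, {STORE, 12}, {HALT, 0},   /* program */
        [10] = {0, 3}, [11] = {0, 7}                     /* data: M[10]=3, M[11]=7 */
    };
    int IP = 0, MAR, A = 0;
    Word MDR, IR;

    for (;;) {
        MAR = IP;                   /* MAR <- IP                    */
        MDR = M[MAR]; IP = IP + 1;  /* MDR <- M[MAR] || IP <- IP+1  */
        IR  = MDR;                  /* IR  <- MDR                   */
        if (IR.op == HALT) break;   /* DECODER <- IR.OP             */
        MAR = IR.addr;              /* MAR <- IR.ADDR               */
        switch (IR.op) {
        case LOAD:  MDR = M[MAR]; A = MDR.addr;             break; /* A <- MDR      */
        case ADD:   MDR = M[MAR]; A = A + MDR.addr;         break; /* A <- A + MDR  */
        case STORE: MDR.op = 0; MDR.addr = A; M[MAR] = MDR; break; /* M[MAR] <- MDR */
        }
    }
    printf("M[12] = %d\n", M[12].addr);   /* prints 10 (= 3 + 7) */
    return 0;
}
```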
SISD Stack Machine

• First stack machine: the Burroughs B5000

[Diagram: datapath with IP, MAR, MEMORY, MDR, IR (OP | ADDR), DECODER, and ALU, with a stack (entries 1–4) in place of the accumulator.]
Multiprocessor Machines (MIMD)

[Diagram: several CPUs connected to a shared MEMORY.]
Flynn's Taxonomy (figure 2.20 from Quinn)

                              Data stream
  Instruction stream   Single                     Multiple
  Single                SISD                       SIMD
                        Uniprocessor               Processor arrays
                                                   Pipelined vector processors
  Multiple              MISD                       MIMD
                        Systolic array             Multiprocessors
                                                   Multicomputers
2-D Mesh Network

• Direct topology
• Switches arranged into a 2-D lattice
• Communication allowed only between neighboring switches
• Variants allow wraparound connections between switches on edge of mesh
2-D Meshes
Evaluating 2-D Meshes

• Diameter: Θ(n^(1/2))
• Bisection width: Θ(n^(1/2))
• Number of edges per switch: 4
• Constant edge length? Yes

Binary Tree Network

• Indirect topology
• n = 2^d processor nodes, n − 1 switches
Evaluating Binary Tree Network

• Diameter: 2 log n
• Bisection width: 1
• Edges / node: 3
• Constant edge length? Yes/No?


Hypertree Network

• Indirect topology
• Shares low diameter of binary tree
• Greatly improves bisection width
• From "front" looks like k-ary tree of height d
• From "side" looks like upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree

• Diameter: log n
• Bisection width: n / 2
• Edges / node: 6
• Constant edge length? No


Butterfly Network

• Indirect topology
• n = 2^d processor nodes connected by n(log n + 1) switching nodes

[Diagram: an 8-column butterfly; switching nodes are arranged in ranks 0–3, with node (i, j) the j-th switch in rank i.]
Butterfly Network Routing
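The routing figure itself is not reproduced here, but the sketch below illustrates one common destination-bit routing convention for an 8-column, 3-rank butterfly (an assumption for illustration, not necessarily the exact scheme on the slide): at each rank the packet goes straight or takes the cross edge so that one more bit of its current column matches the destination column.

```c
#include <stdio.h>

/* Destination-based routing in an 8-column (d = 3) butterfly, one common convention:
 * the switch at rank i connects straight down to the same column and across to the
 * column whose bit (d-1-i) differs; we pick whichever matches the destination bit. */
int main(void) {
    const int d = 3;
    int src = 5, dst = 2, col = src;

    printf("route from column %d to column %d:\n", src, dst);
    printf("  rank 0, column %d\n", col);
    for (int i = 0; i < d; i++) {
        int bit = d - 1 - i;                       /* the bit corrected at this rank */
        if (((col >> bit) & 1) != ((dst >> bit) & 1))
            col ^= 1 << bit;                       /* take the cross edge */
        printf("  rank %d, column %d\n", i + 1, col);
    }
    return 0;   /* for 5 -> 2: columns 5, 1, 3, 2 */
}
```

Because exactly one address bit is fixed per rank, any source reaches any destination in log n hops, matching the diameter on the next slide.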
Evaluating Butterfly Network

• Diameter: log n
• Bisection width: n / 2
• Edges per node: 4
• Constant edge length? No


Hypercube

• Direct topology
• 2 × 2 × … × 2 mesh
• Number of nodes a power of 2
• Node addresses 0, 1, …, 2^k − 1
• Node i connected to k nodes whose addresses differ from i in exactly one bit position
Hypercube Addressing

[Diagram: a 4-dimensional hypercube (16 nodes) with nodes labeled 0000 through 1111; each node is adjacent to the four nodes whose labels differ from it in exactly one bit.]
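A minimal sketch (not from the slides) that lists a node's neighbors in the 4-cube above by flipping one address bit at a time:

```c
#include <stdio.h>

/* Print the k-bit binary address of a node. */
static void print_addr(unsigned x, int k) {
    for (int b = k - 1; b >= 0; b--)
        putchar(((x >> b) & 1) ? '1' : '0');
}

int main(void) {
    const int k = 4;          /* 4-dimensional hypercube: 16 nodes, as in the diagram */
    unsigned node = 0x5;      /* node 0101 */

    printf("neighbors of ");
    print_addr(node, k);
    printf(":");
    for (int bit = 0; bit < k; bit++) {   /* flip each bit once -> k neighbors */
        printf(" ");
        print_addr(node ^ (1u << bit), k);
    }
    printf("\n");             /* prints: 0100 0111 0001 1101 */
    return 0;
}
```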
Evaluating Hypercube Network

• Diameter: log n
• Bisection width: n / 2
• Edges per node: log n
• Constant edge length? No


1.4 Types of Parallelism: A Taxonomy

Flynn's categories (instruction streams × data streams):

                           Single data stream        Multiple data streams
  Single instr stream      SISD                      SIMD
                           Uniprocessors             Array or vector processors
  Multiple instr streams   MISD                      MIMD
                           Rarely used               Multiproc's or multicomputers

Johnson's expansion of MIMD (memory organization × communication):

                           Shared variables          Message passing
  Global memory            GMSV                      GMMP
                           Shared-memory             Rarely used
                           multiprocessors
  Distributed memory       DMSV                      DMMP
                           Distributed shared        Distrib-memory
                           memory                    multicomputers

Fig. 1.11 The Flynn-Johnson classification of computer systems.


1.5 Roadblocks to Parallel Processing

• Grosch's law: Economy of scale applies, or power = cost²
  – No longer valid; in fact we can get more bang per buck in micros
• Minsky's conjecture: Speedup tends to be proportional to log p
  – Has roots in analysis of memory bank conflicts; can be overcome
• Tyranny of IC technology: Uniprocessors suffice (×10 faster / 5 yrs)
  – Faster ICs make parallel machines faster too; what about ×1000?
• Tyranny of vector supercomputers: Familiar programming model
  – Not all computations involve vectors; parallel vector machines
• Software inertia: Billions of dollars investment in software
  – New programs; even uniprocessors benefit from parallelism spec
• Amdahl's law: Unparallelizable code severely limits the speedup


Amdahl's Law

  f = fraction unaffected
  p = speedup of the rest

  s = 1 / [f + (1 – f)/p]  ≤  min(p, 1/f)

[Plot: speedup s versus enhancement factor p (0–50) for f = 0, 0.01, 0.02, 0.05, 0.1.]

Fig. 1.12 Limit on speed-up according to Amdahl's law.


1.6 Effectiveness of Parallel Processing

Fig. 1.13 Task graph exhibiting limited inherent parallelism (13 nodes; W(1) = 13, T(1) = 13, T(∞) = 8).

  p      Number of processors
  W(p)   Work performed by p processors
  T(p)   Execution time with p processors;  T(1) = W(1),  T(p) ≤ W(p)
  S(p)   Speedup      = T(1) / T(p)
  E(p)   Efficiency   = T(1) / [p T(p)]
  R(p)   Redundancy   = W(p) / W(1)
  U(p)   Utilization  = W(p) / [p T(p)]
  Q(p)   Quality      = T³(1) / [p T²(p) W(p)]
Reduction or Fan-in Computation

Example: adding 16 numbers, 8 processors, unit-time additions.

[Diagram: a binary-tree (fan-in) computation graph of 15 "+" nodes reducing the 16 inputs to a single Sum.]

Zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 × 4) = 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76

Unit-time communication:
  S(8) = 15 / 7 = 2.14
  E(8) = 15 / (8 × 7) = 27%
  R(8) = 22 / 15 = 1.47
  Q(8) = 0.39

Fig. 1.14 Computation graph for finding the sum of 16 numbers.
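A small check (not from the slides) that plugs the Fig. 1.13 definitions into both scenarios of Fig. 1.14 and reproduces the quoted values:

```c
#include <stdio.h>

/* Metrics from Fig. 1.13 applied to the 16-number reduction of Fig. 1.14.
 * T1 = W1 = 15 unit-time additions on one processor. */
static void metrics(const char *label, double T1, double W1,
                    double Tp, double Wp, double p) {
    double S = T1 / Tp;
    double E = T1 / (p * Tp);
    double R = Wp / W1;
    double Q = (T1 * T1 * T1) / (p * Tp * Tp * Wp);
    printf("%s: S(8)=%.2f  E(8)=%.0f%%  R(8)=%.2f  Q(8)=%.2f\n",
           label, S, 100.0 * E, R, Q);
}

int main(void) {
    metrics("Zero-time communication", 15, 15, 4, 15, 8); /* S=3.75 E=47% R=1.00 Q=1.76 */
    metrics("Unit-time communication", 15, 15, 7, 22, 8); /* S=2.14 E=27% R=1.47 Q=0.39 */
    return 0;
}
```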


ABCs of Parallel Processing in One Slide

A  Amdahl's Law (Speedup Formula)
   Bad news – Sequential overhead will kill you, because:
     Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
   Moral: For f = 0.1, speedup is at best 10, regardless of peak OPS.

B  Brent's Scheduling Theorem
   Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
     T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
   Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≪ T1/T∞), good speedup and efficiency are possible.
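An illustrative calculation of Brent's bound with assumed numbers (T1 = 10,000 unit operations, T∞ = 100; not from the slides):

```c
#include <stdio.h>

/* Brent's bound: T1/p <= Tp <= T1/p + Tinf.
 * Illustrative numbers only: T1 = 10000 operations, Tinf = 100 (critical path). */
int main(void) {
    double T1 = 10000.0, Tinf = 100.0;
    int procs[] = {1, 10, 100, 1000};

    for (int i = 0; i < 4; i++) {
        double p = procs[i];
        double lower = T1 / p;
        double upper = T1 / p + Tinf;
        printf("p = %4d: %8.1f <= Tp <= %8.1f  (speedup >= %.1f)\n",
               procs[i], lower, upper, T1 / upper);
    }
    /* For p much smaller than T1/Tinf = 100 the two bounds nearly coincide,
     * so speedup close to p is guaranteed even with naive scheduling. */
    return 0;
}
```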

C  Cost-Effectiveness Adage
   Real news – The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
   Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
