Parallel Processing: Lecture 2
Analysis
Useful Diagram
  Question: why are there no cycles in this graph?
Definitions
More Definitions
Amdahl's Law
1.3 Parallel Processing Ups and Downs

Fig. 1.10: Richardson's circular theater for weather forecasting calculations. Using thousands of "computers" (humans + calculators) for 24-hr weather prediction in a few hours.
[Figure: development of some technical fields (networking, RISC, parallelism) into $1B businesses and the roles played by government research (GovRes), industrial research (IndRes), and industrial development (IndDev) over time, 1960-2000 (IEEE Computer, early 1990s?).]

Evolution of parallel processing has been quite different from other high-tech fields.

Trends in Hi-Tech Development (2003)

[Figure: updated version of the chart, with the time axis extended through 2010-2020.]
Status of Computing Power (circa 2000)

GFLOPS on desktop: Apple Macintosh, with G4 processor
  [margin note: TFLOPS, PFLOPS (Peta = 10^15)]

TFLOPS in supercomputer center:
  1152-processor IBM RS/6000 SP (switch-based network)
  Cray T3E, torus-connected
  [margin note: PFLOPS, EFLOPS (Exa = 10^18)]

PFLOPS on the drawing board:
  1M-processor IBM Blue Gene (2005?)
  32 proc's/chip, 64 chips/board, 8 boards/tower, 64 towers
  Processor: 8 threads, on-chip memory, no data cache
  Chip: defect-tolerant, row/column rings in a 6 x 6 array
  Board: 8 x 8 chip grid organized as a 4 x 4 x 4 cube
  Tower: boards linked to 4 neighbors in adjacent towers
  System: 32 x 32 x 32 cube of chips, 1.5 MW (water-cooled)
  [margin note: EFLOPS, ZFLOPS (Zetta = 10^21)]
Flynn Taxonomy
• MISD
• Fault tolerance
• Pipeline processing/streaming or systolic array
• Now extended to SPMD
• Single program, multiple data
(Slide credit: Mikko Lipasti, University of Wisconsin)
Memory Taxonomy

  For Shared Memory       Uniform Memory    Nonuniform Memory
  Cache Coherence         CC-UMA            CC-NUMA
  No Cache Coherence      NCC-UMA           NCC-NUMA
Flynn Taxonomy (1966) Recap

                                   Data Stream
                        Single                        Multi
  Instruction  Single   SISD                          SIMD
  Stream                (Single-Core Processors)      (GPUs, Intel SSE/AVX extensions, ...)
               Multi    MISD                          MIMD
                        (Systolic Arrays, ...)        (VLIW, Parallel Computers)
Flynn Taxonomy Recap

  Single-Instruction Single-Data      Single-Instruction Multi-Data
  (Single-Core Processors)            (GPUs, Intel SIMD Extensions)

  Multi-Instruction Single-Data       Multi-Instruction Multi-Data
  (Systolic Arrays, ...)              (Parallel Computers)
Intel SIMD Extensions

  SSE:        XMM0 - XMM15, 128-bit registers
  AVX, AVX2:  YMM0 - YMM15, 256-bit registers
  AVX-512:    ZMM0 - ZMM31, 512-bit registers

  (Each XMM register is the low 128 bits of the corresponding YMM register, which in turn is the low 256 bits of the corresponding ZMM register.)
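
As a concrete illustration of the 256-bit YMM registers (not from the slides), here is a minimal C sketch that adds eight floats with one AVX instruction; it assumes a GCC/Clang-style compiler with <immintrin.h> and AVX enabled (e.g., -mavx), and the array names and values are arbitrary.

#include <immintrin.h>  /* Intel intrinsics: __m256 and _mm256_* map to YMM registers */
#include <stdio.h>

int main(void) {
    /* Eight single-precision floats fill one 256-bit YMM register. */
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    /* load 8 floats into a YMM register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); /* one instruction adds all 8 lanes */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);         /* prints: 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}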
SSE/AVX Data Types

[Figure: packed data-type layouts within a 256-bit YMM register (bits 255 down to 0).]

e.g., "Port 5 pressure" when code uses too many shuffle operations
Skylake Die Layout

[Figure: Intel Skylake die layout.]
Aside: Do I Have SIMD Capabilities?

  less /proc/cpuinfo
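
Besides reading /proc/cpuinfo, a program can test the CPUID feature flags itself. A minimal sketch using the GCC/Clang builtin __builtin_cpu_supports on x86 targets; the particular features queried here are my choice, not from the slide.

#include <stdio.h>

int main(void) {
    /* GCC/Clang builtins that check the CPU's feature flags at run time. */
    __builtin_cpu_init();
    printf("SSE4.2:   %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("AVX:      %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}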
The Global View of Computer Architecture

[Figure: computer architecture (instruction set design, organization, hardware/software boundary, interface design (ISA)) at the center, connected to applications, parallelism, history, technology, programming languages, operating systems, compilers, and measurement and evaluation.]
Flynn's Taxonomy

[Figures: a computer operating on instruction and data streams, and block diagrams of Flynn's classes built from SISD units, showing how single or multiple instruction streams (SI) and data streams (SD) feed the processing units.]

(Slide credit: Eurípides Montagne, University of Central Florida, CDA 4150)
SISD: One-Address Machine

Abbreviations:
  A: Accumulator
  ALU: Arithmetic Logic Unit
  IR: Instruction Register
  OP: Opcode
  ADDR: Address
  IP: Instruction Pointer
  MAR: Memory Address Register
  MDR: Memory Data Register

[Figure: datapath with IP, MAR, MDR, IR (OP | ADDR fields), DECODER, ALU, accumulator A, and MEMORY.]
One-Address Format: LOAD X

  Instruction format: OP | ADDRESS

  MAR ← IP
  MDR ← M[MAR]   ||   IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← MDR
ADD X

  MAR ← IP
  MDR ← M[MAR]   ||   IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← A + MDR
STORE X

  MAR ← IP
  MDR ← M[MAR]   ||   IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← A
  M[MAR] ← MDR
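
The three sequences above share the same fetch and decode steps and differ only in the execute phase. Below is a minimal C sketch of that cycle for a toy one-address accumulator machine; the opcode values, the 16-word memory, and the habit of keeping data in the ADDR field are made-up simplifications, not part of the slides.

#include <stdio.h>

/* Hypothetical opcode encodings (illustration only) */
enum { LOAD = 1, ADD = 2, STORE = 3, HALT = 0 };

/* One memory word holds an opcode (OP) and an address (ADDR). */
typedef struct { int op, addr; } Word;

int main(void) {
    Word M[16] = {                       /* memory: program followed by data */
        {LOAD, 10}, {ADD, 11}, {STORE, 12}, {HALT, 0},
    };
    M[10].addr = 5;                      /* X = 5 (data kept in the addr field) */
    M[11].addr = 7;                      /* Y = 7 */

    int IP = 0, A = 0;                   /* instruction pointer and accumulator */
    Word MDR; int MAR;                   /* memory data/address registers */

    for (;;) {
        MAR = IP;                        /* MAR <- IP                        */
        MDR = M[MAR]; IP = IP + 1;       /* MDR <- M[MAR]  ||  IP <- IP + 1  */
        Word IR = MDR;                   /* IR <- MDR                        */
        if (IR.op == HALT) break;        /* decode: DECODER <- IR.OP         */
        MAR = IR.ADDR == 0 ? IR.addr : IR.addr;  /* MAR <- IR.ADDR */
        switch (IR.op) {                 /* execute according to the opcode  */
        case LOAD:  MDR = M[MAR]; A = MDR.addr;     break;  /* A <- MDR      */
        case ADD:   MDR = M[MAR]; A = A + MDR.addr; break;  /* A <- A + MDR  */
        case STORE: MDR.addr = A; M[MAR] = MDR;     break;  /* M[MAR] <- MDR */
        }
    }
    printf("M[12] = %d\n", M[12].addr);  /* prints 12 (5 + 7) */
    return 0;
}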
First Stack Machine
• B5000

[Figure: stack-machine datapath with IP, MAR, MDR, MEMORY, IR (OP | ADDR), DECODER, ALU, and a four-cell expression stack.]
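
For contrast with the one-address accumulator machine, here is a minimal C sketch of zero-address, stack-based evaluation in the spirit of the B5000; the opcodes and the tiny program are invented for illustration and are not the B5000 instruction set.

#include <stdio.h>

/* Hypothetical zero-address opcodes (illustration only) */
enum { PUSH = 1, ADD = 2, POP = 3, HALT = 0 };
typedef struct { int op, arg; } Ins;

int main(void) {
    int M[16] = {0};  M[10] = 5;  M[11] = 7;   /* data memory: X = 5, Y = 7 */
    Ins prog[] = { {PUSH, 10}, {PUSH, 11}, {ADD, 0}, {POP, 12}, {HALT, 0} };

    int stack[8], sp = 0, ip = 0;              /* expression stack and pointers */
    for (;;) {
        Ins ir = prog[ip++];                   /* fetch and advance */
        if (ir.op == HALT) break;
        switch (ir.op) {
        case PUSH: stack[sp++] = M[ir.arg];          break;  /* push M[addr]        */
        case ADD:  sp--; stack[sp - 1] += stack[sp]; break;  /* replace top two by sum */
        case POP:  M[ir.arg] = stack[--sp];          break;  /* store top to M[addr]   */
        }
    }
    printf("M[12] = %d\n", M[12]);             /* prints 12 */
    return 0;
}

The accumulator machine needs an address in every arithmetic instruction; the stack machine's ADD names no operands at all, which is what made the B5000 style attractive for compiling expressions.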
[Figure: Flynn's taxonomy with example machines. SISD: uniprocessor. SIMD: processor arrays, pipelined vector processors. MISD: systolic array. MIMD: multiprocessors, multicomputers.]
2-D Mesh Network
  Direct topology
  Switches arranged into a 2-D lattice
  Communication allowed only between neighboring switches
  Variants allow wraparound connections between switches on the edge of the mesh

2-D Meshes
[Figure: example 2-D mesh networks.]

Evaluating 2-D Meshes
  Diameter: Θ(n^(1/2))
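
To see where the Θ(n^(1/2)) diameter comes from, here is a small C sketch (my own illustration) that counts the hops of a shortest path between two switches in a square mesh without wraparound; the worst case, corner to corner, needs 2(√n - 1) hops.

#include <stdio.h>
#include <stdlib.h>

/* Minimum hops between switches (r1,c1) and (r2,c2) in a mesh without
 * wraparound: the Manhattan distance. */
static int mesh_hops(int r1, int c1, int r2, int c2) {
    return abs(r1 - r2) + abs(c1 - c2);
}

int main(void) {
    int side = 8;   /* an 8 x 8 mesh, n = 64 switches */
    /* Worst case: opposite corners -> 2 * (side - 1) = 14 hops = Theta(sqrt(n)). */
    printf("diameter of %dx%d mesh: %d hops\n",
           side, side, mesh_hops(0, 0, side - 1, side - 1));
    return 0;
}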
Binary Tree Network
  Indirect topology
  n = 2^d processor nodes, n - 1 switches

Evaluating Binary Tree Network
  Diameter: 2 log n
  Bisection width: 1
  Edges / node: 3
Hypertree Network
  Indirect topology
  Shares low diameter of binary tree
  Greatly improves bisection width
  From "front" looks like k-ary tree of height d
  From "side" looks like upside-down binary tree of height d

Evaluating 4-ary Hypertree
  Diameter: log n
  Bisection width: n / 2
  Edges / node: 6
Butterfly Network
  Indirect topology
  n = 2^d processor nodes connected by n(log n + 1) switching nodes
  [Figure: processor nodes 0-7 above rank 0 switches 0,0 through 0,7 of the butterfly.]
  Diameter: log n
  Bisection width: n / 2
Hypercube Network
  Direct topology
  2 x 2 x ... x 2 mesh
  Number of nodes a power of 2
  Node addresses 0, 1, ..., 2^k - 1
  Node i connected to k nodes whose addresses differ from i in exactly one bit position
Hypercube Addressing

[Figure: 4-dimensional hypercube with 16 nodes labeled 0000 through 1111; each node links to the 4 nodes whose addresses differ in one bit.]
Evaluating Hypercube Network
  Diameter: log n
  Bisection width: n / 2
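
A small C sketch (my own illustration) of the addressing rule: the k neighbors of a node are found by flipping one address bit at a time, and the hop count between two nodes is the number of differing address bits, which is why the diameter is k = log n.

#include <stdio.h>

int main(void) {
    int k = 4;      /* 4-dimensional hypercube: n = 2^k = 16 nodes */
    int i = 0x6;    /* node 0110 */

    /* Neighbors of node i: flip each of the k address bits in turn. */
    printf("neighbors of %d:", i);
    for (int b = 0; b < k; b++)
        printf(" %d", i ^ (1 << b));      /* 0111, 0100, 0010, 1110 */
    printf("\n");

    /* Distance between two nodes = number of differing bits (Hamming distance);
     * at most k bits differ, so the diameter is k = log2(n). */
    int j = 0x9, diff = i ^ j, hops = 0;  /* 0110 vs 1001: all 4 bits differ */
    while (diff) { hops += diff & 1; diff >>= 1; }
    printf("hops from %d to %d: %d\n", i, j, hops);
    return 0;
}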
The Flynn-Johnson Classification

Flynn's categories (single vs. multiple instruction streams, single vs. multiple data streams):
  SISD: Uniprocessors
  SIMD: Array or vector processors
  MISD: Rarely used
  MIMD: Multiprocessors or multicomputers

Johnson's expansion of MIMD (global vs. distributed memory; shared variables vs. message passing):
  GMSV: Shared-memory multiprocessors
  GMMP: Rarely used
  DMSV: Distributed shared memory
  DMMP: Distributed-memory multicomputers
Amdahl's Law

  f = fraction unaffected
  p = speedup of the rest

  s = 1 / [f + (1 - f)/p]  ≤  min(p, 1/f)

[Figure: speedup s vs. enhancement factor p (0 to 50) for f = 0, 0.01, 0.02, 0.05, 0.1.]
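
A quick numeric check of the formula in C (the chosen f and p values are mine): with f = 0.05, the speedup stays below min(p, 1/f) = 20 no matter how large p grows.

#include <stdio.h>

/* Amdahl's law: speedup when a fraction f of the work is unaffected
 * and the remaining (1 - f) is sped up by a factor p. */
static double amdahl(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.05;                   /* 5% of the work stays sequential */
    for (int p = 10; p <= 100; p *= 10)
        printf("f = %.2f, p = %3d  ->  s = %.2f\n", f, p, amdahl(f, p));
    /* s approaches, but never reaches, min(p, 1/f) = 20 for this f. */
    return 0;
}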
Other performance measures, for an example task graph with W(1) = 13, T(1) = 13, T(∞) = 8:

  Redundancy:  R(p) = W(p) / W(1)
  Utilization: U(p) = W(p) / [p T(p)]
  Quality:     Q(p) = T^3(1) / [p T^2(p) W(p)]
Reduction or Fan-in Computation

Example: adding 16 numbers, 8 processors, unit-time additions
[Figure: binary tree of "+" nodes reducing the 16 numbers to a single sum.]

Zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 x 4) = 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76

Unit-time communication:
  S(8) = 15 / 7 = 2.14
  E(8) = 15 / (8 x 7) = 27%
  R(8) = 22 / 15 = 1.47
  Q(8) = 0.39
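
The slide's figures follow from the definitions above together with the usual speedup and efficiency formulas S(p) = T(1)/T(p) and E(p) = S(p)/p, which match the slide's arithmetic. A small C sketch (variable names are mine) that recomputes both columns:

#include <stdio.h>

/* Compute speedup, efficiency, redundancy, and quality for a run on p
 * processors, given total work W and time T with 1 and with p processors. */
static void report(const char *label, int p, double W1, double T1,
                   double Wp, double Tp) {
    double S = T1 / Tp;                              /* S(p) = T(1)/T(p)               */
    double E = S / p;                                /* E(p) = S(p)/p                  */
    double R = Wp / W1;                              /* R(p) = W(p)/W(1)               */
    double Q = (T1 * T1 * T1) / (p * Tp * Tp * Wp);  /* Q(p) = T^3(1)/[p T^2(p) W(p)]  */
    printf("%s: S(%d)=%.2f  E(%d)=%.0f%%  R(%d)=%.2f  Q(%d)=%.2f\n",
           label, p, S, p, E * 100, p, R, p, Q);
}

int main(void) {
    /* Adding 16 numbers on 8 processors: 15 unit-time additions in total. */
    report("zero-time communication", 8, 15, 15, 15, 4);  /* 4 tree levels          */
    report("unit-time communication", 8, 15, 15, 22, 7);  /* additions + transfers  */
    return 0;
}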
C. Cost-Effectiveness Adage
Real news: The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speedup (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.