Architecture


CSE 392/CS 378: High-Performance Computing - Principles and Practice

Parallel Computer Architectures


A Conceptual Introduction for
Software Developers
Jim Browne
browne@cs.utexas.edu
Parallel Computer Architectures
A Conceptual Introduction
Lecture Outline
• Why Do Applications Care about Architecture?
• Example Architecture – Ranger
• What Are the Issues?
• Why Do Application Developers/Users Need to Know about
Parallel Architectures?
– To develop efficient applications efficiently
• What Do Application Developers/Users Need to Know about
Parallel Architectures?
– Characteristics of Memory Hierarchy
– Parallelism in Cluster Computer Architectures
• Homogeneous Multicore Chip Architecture
• Homogeneous Chip Node Architecture
• Interconnect Architecture
• Heterogeneous Multicore Chip Architecture (Not covered in this lecture)
What Are the Issues? Parallelism!

Each layer has its own models/types of parallelism:
• Algorithms: models/types of parallelism
• Programming model/language: models/types of parallelism
• Application program: models/types of parallelism
• Hardware architecture: models/types of parallelism
  – Core – SISD, SIMD
  – Node – MIMD/shared
  – Interconnect – MIMD/distributed

Many different models/types of parallelism exist at the software layers.
The task: map from the models/types of parallel computation of algorithms,
programming languages, and applications to the models and types of
parallelism of hardware architectures.
What Are the Issues? Memory Hierarchy!

• Multiple levels of memory
  – Five levels of caching
• Complex parallelism in memory access
  – Cache line fetches (see the sketch below)
• Interference among processors for access to memory that is shared among them
• Effects of prefetching
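
To make cache line fetches concrete, here is an illustrative sketch (mine,
not from the lecture; it assumes 64-byte lines and 4-byte ints). Unit stride
reuses each fetched line 16 times; a 64-byte stride touches a new line on
every access:

#include <stdio.h>

#define N (1 << 24)
static int a[N];

/* Unit stride: each fetched 64-byte cache line serves 16 ints. */
long sum_unit(void) {
    long s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

/* 64-byte stride: every access touches a new cache line. */
long sum_strided(void) {
    long s = 0;
    for (int i = 0; i < N; i += 16) s += a[i];
    return s;
}

int main(void) {
    printf("%ld %ld\n", sum_unit(), sum_strided());
    return 0;
}

Timing the two loops shows the strided one doing 1/16 the work but taking
far more than 1/16 the time, because each iteration pays for a full line fetch.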
What Knowledge is Needed to Map Applications to
Architectures?

• Characteristics of Memory Hierarchy


– Core
– Chip
– Node
– Interconnect
• Parallelism:
– Algorithms/Libraries
– Resource Control – Multicore Programming
– Programming Models/Languages
– Hardware Architectures
Ranger Architecture - Single Processor Architecture

Processor Core for AMD Barcelona Chip


Ranger Architecture - Chip Architecture

AMD Barcelona Chip


Ranger Architecture - Node Architecture
Ranger - Node Memory Architecture

• 32 GB per node
• Assume 32 KB pages => ~1,000,000 pages/node
• 48 pages mapped in the TLB
• DRAM banks – parallel accessibility to each bank
• DRAM pages – a DRAM page (not to be confused with OS pages) represents a
  row of data that has been read from a DRAM bank and is cached within the
  DRAM for faster access. DRAM pages can be large, 32 KB in the case of
  Ranger, and there are typically two per DIMM. This means that a Ranger
  node shares 32 different 32 KB pages among 16 cores on four chips,
  yielding a megabyte of SRAM cache in the main memory.
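
A back-of-the-envelope sketch of these numbers (using only the figures
quoted above):

#include <stdio.h>

int main(void) {
    long long node_mem  = 32LL << 30;  /* 32 GB per node                 */
    long long page_size = 32LL << 10;  /* 32 KB pages (per the slide)    */
    long long tlb       = 48;          /* pages mapped in the TLB        */
    long long dram_pgs  = 32;          /* open 32 KB DRAM pages per node */

    printf("pages per node:  %lld\n", node_mem / page_size);         /* ~1,000,000 */
    printf("TLB reach:       %lld KB\n", tlb * page_size >> 10);      /* 1536 KB   */
    printf("DRAM page cache: %lld KB\n", dram_pgs * page_size >> 10); /* 1024 KB   */
    return 0;
}

So at any instant the TLB maps only ~1.5 MB of the 32 GB, and the open DRAM
pages cache only 1 MB — which is why access locality matters.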
Ranger Interconnect - InfiniBand

Properties: One Hop Connectivity, Packets and DMA between Nodes


Why Do Applications Care about
Architecture?
• Return on Investment
– Understand What You are Doing!
– Performance
• Choice of Algorithms
• Choice of Libraries
• Optimization of Application Specific Code
– Productivity
• Personal
• Resource
What Do Application Developers/Users Need to
Know about Architecture?
• Parallelism in Software
– Algorithm
– Resource Control
– Language/Library/Programming Model
– Hardware Architecture
• Parallelism in Cluster Computer Architecture
– Processor/Core Architecture
– Homogeneous Multicore Chips
– Node Architecture
– Interconnect Architecture
– Heterogeneous Multicore Chips (Not covered in this lecture.)
Flynn’s Classic Parallel Hardware Taxonomy

Processor organizations:
• Single instruction, single data (SISD) stream
  – Uniprocessor
• Single instruction, multiple data (SIMD) stream
  – Vector processor
  – Array processor
• Multiple instruction, single data (MISD) stream
• Multiple instruction, multiple data (MIMD) stream
  – Shared memory: symmetric multiprocessor (SMP), nonuniform memory access (NUMA)
  – Distributed memory: clusters
Amdahl and Gustafson’s Laws

Amdahl’s Law:   S = 1 / ((1 - F) + F/P)
  where F = fraction parallelizable and P = number of processors, assuming
  perfect parallelism and identical problem size.
  As P -> infinity:  S -> 1 / (1 - F)

Gustafson’s Law:   S = (1 - F(n)) + P * F(n)
  where F(n) = fraction parallelizable as a function of problem size
  measured by n, and P = number of processors, assuming perfect parallelism
  but increasing problem size.
  For P, n >> 1:  S ≈ P * F(n)
  In some problems F(n) -> 1 as n becomes large, so that S -> P.
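
A small sketch (with a hypothetical F = 0.95; the value is mine, not the
lecture’s) makes the contrast concrete:

#include <stdio.h>

/* Amdahl: fixed problem size; f = parallelizable fraction, p = processors */
double amdahl(double f, double p)      { return 1.0 / ((1.0 - f) + f / p); }

/* Gustafson: scaled problem size; f_n = parallelizable fraction at size n */
double gustafson(double f_n, double p) { return (1.0 - f_n) + p * f_n; }

int main(void) {
    double f = 0.95;
    for (int p = 16; p <= 1024; p *= 4)
        printf("P=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               p, amdahl(f, p), gustafson(f, p));
    return 0;
}

Amdahl’s speedup saturates near 1/(1-F) = 20 no matter how many processors
are added; Gustafson’s keeps growing because the problem grows with P.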
Types of Parallelism
• Streaming Parallelism
• Vector Parallelism
• Asynchronous SIMD => SPMD
• Synchronous SIMD
• Asynchronous MIMD
• Shared Name Space Parallel Execution
• Distributed Name Space Parallel Execution
Taxonomy of Parallelism
Attributes {
• mechanism for control – synchronous (implicit by time), asynchronous (explicit)
• resolution of control – static, dynamic
• cardinality of control – single instruction stream, multiple instruction stream
• locality of control – single, distributed
• state data for control – central/shared, distributed
• name space for execution data – shared, distributed
• cardinality of input/output (data) stream – single, multiple
• granularity of data access – discrete, chunk, stream
}
Models - {control: (mechanism, resolution, cardinality, locality, state data),
name space, cardinality of input/output, granularity of data access}
SIMD - {(synchronous, static, multiple, single, central), shared, multiple, stream}

Naturally suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
MISD - {(synchronous, static, multiple, single, central), shared, single, stream}

• Natural uses include:
  – multiple frequency filters operating on a single signal stream
  – multiple cryptography algorithms attempting to crack a single coded message.
MIMD - {(asynchronous, static/dynamic, multiple, single, central/distributed),
shared/distributed, multiple, discrete}

• Most clusters, servers, and grids
Invisible Parallelism

• Invisible parallelism – internal to the core or instruction stream
  – Prefetching from memory
  – Pipelining of instruction execution
  – SIMD instructions
• Mostly addressed by compilers and runtime systems, but compilers often
  need help.
• Will be addressed in later lectures on performance optimization.
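
For example (an illustrative sketch; the compiler flags are for gcc), a loop
the compiler can turn into SIMD instructions without any source changes:

/* Compile with e.g. gcc -O3 -ftree-vectorize; the compiler can unroll,
 * pipeline, prefetch for, and vectorize this loop on its own. */
void saxpy(int n, float a, const float * restrict x, float * restrict y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* independent iterations: SIMD-friendly */
}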
Visible Parallelism

• Four Levels of Visible Parallelism


– Core
– Chip
– Node
– System
• Generally asynchronous and user
controllable.
Threads – Software and Hardware
• Software
  – Multiple operating-system-defined threads (pThreads) can be created
    on each core.
  – The thread scheduler can switch among them, but only one can be
    active at a time.
• But some core/chip architectures have hardware-implemented threads.
  – Simultaneous Multithreading
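
A minimal pThreads sketch (illustrative only; the thread count is arbitrary)
creating software threads that the OS scheduler then places on cores:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4   /* e.g. one per core; pick to match your chip */

void *work(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all threads to finish */
    return 0;
}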
Multi-core:
threads can run on separate cores
[Figure: two cores side by side, each with its own complete pipeline — L1
D-cache and D-TLB, integer and floating-point units, L2 cache and control,
schedulers, uop queues, rename/alloc, BTB/trace cache/uCode ROM, decoder,
bus, and BTB/I-TLB. Core 1 runs Thread 1; Core 2 runs Thread 2.]


Multi-core:
multiple threads can run on each core
[Figure: the same two-core pipeline diagram, now with two threads sharing
each core’s resources. Core 1 – two threads; Core 2 – two threads.]


A technique complementary to multi-core:
Simultaneous multithreading on one core

• Problem addressed: the processor pipeline can get stalled
  – waiting for the result of a long floating-point (or integer) operation
  – waiting for data to arrive from memory
• While one unit is stalled, the other execution units wait unused.

[Figure: single-core pipeline diagram highlighting idle functional units.
Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core

• Example: if one thread is waiting for a floating-point operation to
  complete, another thread can use the integer units
Without SMT, only a single thread can run
at any given time
[Figure: single-core pipeline; only the floating-point units are busy.
Thread 1: floating point.]
Without SMT, only a single thread can run
at any given time
[Figure: the same single-core pipeline; only the integer units are busy.
Thread 2: integer operation.]
SMT processor: both threads can run
concurrently
[Figure: the same single-core pipeline with SMT; the floating-point units
run Thread 1 while the integer units run Thread 2.]
But: Can’t simultaneously use the same
functional unit
[Figure: the same single-core pipeline; Thread 1 and Thread 2 both need the
integer units at once. This scenario is impossible with SMT on a single
core (assuming a single integer unit).]
SMT is not a “true” parallel processor
• Enables better threading (e.g., up to ~30% speedup)
• OS and applications perceive each simultaneous thread as a separate
  “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads can run
concurrently
[Figure: two SMT-enabled cores, each running two threads — Thread 1 and
Thread 3 on Core 1, Thread 2 and Thread 4 on Core 2.]


Comparison: multi-core vs SMT
• Multi-core:
– Since there are several cores,
each is smaller and not as powerful
(but also easier to design and manufacture)
– However, great with thread-level parallelism
• SMT
– Can have one large and fast superscalar core
– Great performance on a single thread
– Mostly still only exploits instruction-level
parallelism
Node Level Parallelism
• Memory sharing and interference issues
become even more complex.
• Ranger nodes have asymmetries which lead to
additional issues of load imbalance and cores
getting out of “phase.”
• I/O is at node level – Each node can be writing
to the file system simultaneously and
asynchronously.
The memory hierarchy
• If simultaneous multithreading only:
– all caches shared
• Multi-core chips:
– L1 caches core private
– L2 caches core private in some architectures
and shared in others
– L3 caches usually shared across cores
– DRAM page cache usually shared
• Memory is always shared
Designs with private L2 caches

[Figure: two chip layouts. Left: CORE0 and CORE1 each have a private L1 and
a private L2 above shared memory — both L1 and L2 are private (examples:
AMD Opteron, AMD Athlon, Intel Pentium D). Right: the same, with an L3
cache added per core — a design with L3 caches (example: Intel Itanium 2).]
Private vs shared caches
• Advantages of private:
– They are closer to core, so faster access
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the same
cache data
– More cache space is available if only one (or a few) high-performance
  threads run on the system
The cache coherence problem
• Since we have private caches:
How to keep the data consistent across caches?
• Each core should perceive the memory as a
monolithic array, shared by all the cores

The cache coherence problem
Suppose variable x initially contains 15213

[Diagram: four cores on a multi-core chip, each with one or more levels of
cache, above main memory. All caches are empty; main memory: x=15213.]
The cache coherence problem
Core 1 reads x

[Diagram: Core 1’s cache now holds x=15213; main memory: x=15213.]
The cache coherence problem
Core 2 reads x

[Diagram: the caches of Core 1 and Core 2 both hold x=15213; main memory:
x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660

[Diagram: Core 1’s cache holds x=21660; Core 2’s cache still holds x=15213.
Main memory: x=21660, assuming write-through caches.]
The cache coherence problem
Core 2 attempts to read x… gets a stale copy

[Diagram: Core 2 reads the stale x=15213 from its own cache even though
Core 1’s cache and main memory hold x=21660.]
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited just to multi-core
• There exist many solution algorithms, coherence protocols, etc.
• We will return to consistency and coherence issues in a later lecture.
• A HARDWARE solution: invalidation-based protocol with snooping
Inter-core bus

[Diagram: the four cores’ caches are connected to each other and to main
memory by an inter-core bus on the multi-core chip.]
Invalidation protocol with snooping
• Invalidation:
If a core writes to a data item, all other copies
of this data item in other caches are
invalidated
• Snooping:
All cores continuously “snoop” (monitor) the
bus connecting the cores.

The cache coherence problem
Revisited: Cores 1 and 2 have both read x

[Diagram: the caches of Core 1 and Core 2 both hold x=15213; main memory:
x=15213.]
The cache coherence problem
Core 1 writes to x, setting it to 21660

[Diagram: Core 1 writes x=21660 and sends an invalidation request over the
inter-core bus; Core 2’s copy of x is INVALIDATED. Main memory: x=21660,
assuming write-through caches.]
The cache coherence problem
After invalidation:

[Diagram: only Core 1’s cache holds x (x=21660); main memory: x=21660.]
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.

[Diagram: Core 2’s read misses and fetches the new value; the caches of
Core 1 and Core 2 both hold x=21660. Main memory: x=21660.]
Invalidation protocols
• This was just the basic invalidation protocol
• More sophisticated protocols use extra cache
state bits
• MSI and MESI (Modified, Exclusive, Shared, Invalid), to which we will
  return later
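
As a rough illustration (a sketch of the bookkeeping only, not a full
protocol implementation), the per-cache-line state a MESI controller tracks:

/* Per-cache-line state bits used by the MESI coherence protocol.
 * A real controller implements the transitions in hardware; this
 * sketch only illustrates the bookkeeping. */
typedef enum {
    MESI_MODIFIED,   /* this cache has the only copy, and it is dirty   */
    MESI_EXCLUSIVE,  /* this cache has the only copy, clean             */
    MESI_SHARED,     /* clean copy; other caches may also hold it       */
    MESI_INVALID     /* copy is stale and must be re-fetched on access  */
} mesi_state_t;

/* Example transition: a snooped write from another core invalidates us. */
mesi_state_t on_remote_write(mesi_state_t s) {
    (void)s;
    return MESI_INVALID;
}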
Another hardware solution: update protocol

Core 1 writes x=21660:

[Diagram: Core 1 writes x=21660 and broadcasts the updated value over the
inter-core bus; Core 2’s copy is UPDATED to x=21660. Main memory: x=21660,
assuming write-through caches.]
Invalidation vs update
• Multiple writes to the same location
– invalidation: only the first time
– update: must broadcast each write
(which includes new variable value)

• Invalidation generally performs better: it generates less bus traffic
User Control of Multi-core Parallelism
• Programmers must use threads or processes

• Spread the workload across multiple cores

• Write parallel algorithms

• OS will map threads/processes to cores

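
One common way to spread a loop across cores in C (a sketch using OpenMP,
which is one option among several and not prescribed by the slides):

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    /* The OpenMP runtime creates a team of threads, and the OS maps
     * them onto the available cores. Compile with: gcc -fopenmp */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f, up to %d threads available\n",
           c[0], omp_get_max_threads());
    return 0;
}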
Thread safety very important
• Pre-emptive context switching:
context switch can happen AT ANY TIME

• True concurrency, not just uniprocessor time-slicing

• Concurrency bugs exposed much faster with multi-core
However: Need to use synchronization even if only
time-slicing on a uniprocessor
int counter = 0;          /* shared by both threads */

void thread1() {
    int temp1 = counter;  /* read                                    */
    counter = temp1 + 1;  /* write: a context switch between the read
                             and the write can lose an increment     */
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
Need to use synchronization even if only time-slicing on a uniprocessor

Schedule 1 (correct):
  thread1: temp1 = counter;
  thread1: counter = temp1 + 1;
  thread2: temp2 = counter;
  thread2: counter = temp2 + 1;   gives counter = 2

Schedule 2 (race):
  thread1: temp1 = counter;
  thread2: temp2 = counter;
  thread1: counter = temp1 + 1;
  thread2: counter = temp2 + 1;   gives counter = 1
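
The standard fix (a pThreads sketch, consistent with the deck’s mention of
pThreads) makes the read-modify-write atomic with a mutex:

#include <pthread.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    pthread_mutex_lock(&counter_lock);   /* no other thread can interleave  */
    counter = counter + 1;               /* read-modify-write is now atomic */
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}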
Assigning threads to the cores

• Each thread/process has an affinity mask

• The affinity mask specifies which cores the thread is allowed to run on

• Different threads can have different masks

• Affinities are inherited across fork()
Affinity masks are bit vectors
• Example: 4-way multi-core, without SMT

    core 3   core 2   core 1   core 0
      1        1        0        1

• The process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
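
On Linux, the mask from this example can be set programmatically (a sketch;
sched_setaffinity and the CPU_* macros are Linux/glibc-specific):

#define _GNU_SOURCE
#include <sched.h>

int pin_to_cores(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* allow core 0                     */
    CPU_SET(2, &mask);   /* allow core 2                     */
    CPU_SET(3, &mask);   /* allow core 3; core 1 is excluded */
    /* pid 0 = apply to the calling thread/process */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

Children inherit the mask across fork(), matching the slide above.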
Affinity masks when multi-core and SMT combined
• Separate bits for each simultaneous thread
• Example: 4-way multi-core, 2 threads per core

     core 3        core 2        core 1        core 0
  thr 1  thr 0  thr 1  thr 0  thr 1  thr 0  thr 1  thr 0
    1      1      0      0      1      0      1      1

• Core 2 can’t run the process
• Core 1 can only use one simultaneous thread
Default Affinities
• Default affinity mask is all 1s:
all threads can run on all processors

• Then, the OS scheduler decides what threads run on what core

• The OS scheduler detects skewed workloads, migrating threads to less
  busy processors
Process migration is costly
• Need to restart the execution pipeline
• Cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible:
  it tends to keep a thread on the same core
• This is called soft affinity
User Set Affinities

• The programmer can prescribe her own affinities (hard affinities)
• Rule of thumb: use the default scheduler unless you have knowledge of
  program behavior
• There are also memory affinity tags which can be set.
When to set your own affinities
• Two (or more) threads share data structures in memory
  – map them to the same core so that they can share the cache
• Real-time threads. Example: a thread running a robot controller
  – must not be context switched, or else the robot can go unstable
  – dedicate an entire core just to this thread
  [Image source: Sensable.com]
Connections to Succeeding Lectures
• Parallel programming for multicore chips and
multichip nodes.
• Concurrent/Parallel execution and consistency
and coherence
• Models for thinking about and formulating
parallel computations
• Performance optimization – Software
engineering
Assignment
• To be submitted on Friday, 9/16 by 5PM
• Go to the User Guide for Ranger
http://www.tacc.utexas.edu/user-services/user-guides/ranger-user-guide
• Read the sections on affinity policy and NUMA
Control
• Take the program which will be distributed on
Wednesday, 9/14 and define the affinity policy
and NUMA control policy you would use and
justify your choices.
