Architecture
• 32 GB per node
• Assume 32 KB pages => ~1,000,000 pages per node (arithmetic check below)
• 48 pages mapped in TLB
• DRAM bank – each bank can be accessed in parallel
• DRAM pages – a DRAM page (not to be confused with an OS page) represents a row of data that has been read from a DRAM bank and is cached within the DRAM for faster access. DRAM pages can be large, 32 KB in the case of Ranger, and there are typically two per DIMM. This means that a Ranger node shares 32 different 32 KB pages among 16 cores on four chips, yielding a megabyte of SRAM cache in the main memory.
• Ranger interconnect – InfiniBand
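Two quick arithmetic checks of the figures above (taking both the assumed OS page size and the DRAM page size to be 32 KB, i.e. 2^15 bytes, as stated):
32 GB / 32 KB = 2^35 B / 2^15 B = 2^20 ≈ 1,000,000 pages per node
32 DRAM pages × 32 KB/page = 1,024 KB = 1 MB of open-row cache per node (two pages per DIMM implies 16 DIMMs)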
Flynn's taxonomy of architectures:
• Single instruction, single data (SISD) stream – uniprocessor
• Single instruction, multiple data (SIMD) stream – vector processor, array processor
• Multiple instruction, single data (MISD) stream
• Multiple instruction, multiple data (MIMD) stream – shared memory, distributed memory (clusters)
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
Without SMT, only a single thread can run
at any given time
[Figure: a superscalar pipeline (L1 D-cache, D-TLB, schedulers, uop queues, rename/alloc, decoder, bus); without SMT, only one thread's instructions, e.g. Thread 2's integer operations, occupy the execution resources at any given time]
SMT processor: both threads can run
concurrently
[Figure: the same pipeline with SMT; instructions from both threads flow through the schedulers, uop queues, and rename/alloc stages concurrently]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads (a way to query the count is sketched below)
• Intel calls them “hyper-threads”
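A quick way to see how many hardware threads (cores × SMT ways) the OS exposes is a minimal sketch using POSIX sysconf; the _SC_NPROCESSORS_ONLN constant assumes Linux/glibc:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors currently online:
       physical cores times SMT threads per core. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld logical CPUs\n", n);
    return 0;
}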
SMT Dual-core: all four threads can run
concurrently
[Figure: SMT dual-core; CORE0 and CORE1 each have their own L1 D-cache, D-TLB, and pipeline resources, and each runs two SMT threads]

Two cache-hierarchy designs with private caches:
• Both L1 and L2 are private per core, with shared main memory below – examples: AMD Opteron, AMD Athlon, Intel Pentium D
• A design that adds (private) L3 caches between L2 and memory – example: Intel Itanium 2
Private vs shared caches
• Advantages of private caches:
– closer to the core, so faster to access
– reduced contention among cores
• Advantages of shared caches:
– threads on different cores can share the same cached data
– more cache space is available if a single (or only a few) high-performance thread runs on the system
The cache coherence problem
• Since we have private caches, how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
The cache coherence problem: an example
Suppose variable x in main memory initially contains 15213.
1. Core 1 reads x and caches a copy (x = 15213).
2. Core 2 reads x and caches its own copy (x = 15213).
3. Core 1 writes to x, setting it to 21660; assuming write-through caches, main memory is updated to 21660, but Core 2's cached copy is not.
4. Core 2 attempts to read x… and gets a stale copy (15213) from its own cache.
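The sequence above can be mimicked with a toy software model of two private caches. This is purely illustrative; the struct, the read_x/write_x helpers, and the two-core layout are all invented for this sketch:

#include <stdio.h>
#include <stdbool.h>

/* Toy model: one cached copy of x per core, plus main memory. */
struct cache { int x; bool has_copy; };

static int memory_x = 15213;                      /* main memory */
static struct cache core[2] = { {0, false}, {0, false} };

static int read_x(int c) {
    if (!core[c].has_copy) {          /* cache miss: fetch from memory */
        core[c].x = memory_x;
        core[c].has_copy = true;
    }
    return core[c].x;                 /* cache hit: may be stale! */
}

static void write_x(int c, int v) {
    core[c].x = v;                    /* update own cached copy */
    memory_x = v;                     /* write-through to memory */
    /* No invalidation of the other core's copy -- the bug. */
}

int main(void) {
    read_x(0);                        /* Core 1 reads x: 15213 */
    read_x(1);                        /* Core 2 reads x: 15213 */
    write_x(0, 21660);                /* Core 1 writes x */
    printf("Core 2 sees x = %d (memory holds %d)\n",
           read_x(1), memory_x);      /* prints stale 15213 */
    return 0;
}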
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited to multi-core
• Many solution algorithms, coherence protocols, etc. exist
• We will return to consistency and coherence issues in a later lecture
[Figure: multi-core chip with private caches connected to main memory via an inter-core bus]
Invalidation protocol with snooping
• Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
• Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores
The cache coherence problem, revisited with invalidation
1. Cores 1 and 2 have both read x and hold cached copies (x = 15213).
2. Core 1 writes to x, setting it to 21660, and sends an invalidation request over the inter-core bus; assuming write-through caches, main memory is updated to 21660, and Core 2's copy is marked INVALIDATED.
3. After invalidation, only Core 1 holds a valid copy (x = 21660).
4. Core 2 reads x, misses in its cache, and loads the new copy (x = 21660).
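In the toy model from the earlier sketch, snooping invalidation amounts to a drop-in replacement for write_x (again, names invented for illustration; real protocols do this in hardware on the bus):

/* Drop-in replacement for write_x in the earlier toy model:
   before writing, "snoop" and invalidate every other core's copy. */
static void write_x_invalidate(int c, int v) {
    for (int other = 0; other < 2; other++)
        if (other != c)
            core[other].has_copy = false;   /* invalidate remote copy */
    core[c].x = v;
    memory_x = v;                           /* write-through to memory */
}

With this change, Core 2's next read_x(1) misses and reloads 21660 from memory, matching step 4 above.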
Invalidation protocols
• This was just the basic invalidation protocol
• More sophisticated protocols use extra cache state bits
• Examples: MSI (Modified, Shared, Invalid) and MESI (Modified, Exclusive, Shared, Invalid), to which we will return later
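As a flavor of what those state bits encode, here is a minimal sketch of the MESI states and two standard transitions (illustrative only, not a complete protocol):

/* The four states a MESI cache line can be in. */
enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

/* Snooping a remote write to the line: our copy becomes invalid
   (a MODIFIED line would first be written back to memory). */
static enum mesi on_remote_write(enum mesi s) {
    (void)s;
    return INVALID;
}

/* Snooping a remote read: any valid copy drops to SHARED. */
static enum mesi on_remote_read(enum mesi s) {
    return (s == INVALID) ? INVALID : SHARED;
}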
Another hardware solution: the update protocol
• The writing core broadcasts the updated value over the inter-core bus, and the other caches update their copies
[Figure: multi-core chip, assuming write-through caches; Core 1 broadcasts the updated value x = 21660 over the inter-core bus, and main memory is updated]
Invalidation vs. update
• Multiple writes to the same location:
– invalidation: bus traffic only on the first write
– update: must broadcast each write (which includes the new variable value)
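A rough back-of-the-envelope comparison (assuming one bus message per coherence event): if a core writes the same location k times before any other core touches it, invalidation costs 1 bus message (the initial invalidate) while update costs k broadcasts; for k = 100, that is 1 vs. 100 bus transactions.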
User Control of Multi-core Parallelism
• Programmers must use threads or processes to exploit multiple cores
Thread safety is very important
• With pre-emptive context switching, a context switch can happen AT ANY TIME
However: synchronization is needed even if threads are only time-sliced on a uniprocessor

int counter = 0;

void thread1() {
    int temp1 = counter;    /* read shared counter */
    counter = temp1 + 1;    /* write back incremented value */
}

void thread2() {
    int temp2 = counter;    /* read shared counter */
    counter = temp2 + 1;    /* write back incremented value */
}
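A runnable version of the same increment with a pthread mutex making it atomic (a minimal sketch; compile with gcc -pthread, and remove the lock/unlock calls to reproduce the race):

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* enter critical section */
    int temp = counter;             /* read */
    counter = temp + 1;             /* modify and write back */
    pthread_mutex_unlock(&lock);    /* leave critical section */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* always 2 with the lock */
    return 0;
}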
Why: two possible schedules of the unsynchronized code

Schedule 1 (no interleaving) gives counter = 2:
thread1: temp1 = counter;        // temp1 = 0
thread1: counter = temp1 + 1;    // counter = 1
thread2: temp2 = counter;        // temp2 = 1
thread2: counter = temp2 + 1;    // counter = 2

Schedule 2 (interleaved) gives counter = 1:
thread1: temp1 = counter;        // temp1 = 0
thread2: temp2 = counter;        // temp2 = 0
thread1: counter = temp1 + 1;    // counter = 1
thread2: counter = temp2 + 1;    // counter = 1
Assigning threads to the cores
[Figure: a per-thread affinity bitmask, e.g. 1 1 0 1, with one bit per core marking the cores the thread is allowed to run on]
Process migration is costly
• The execution pipeline must be restarted
• Cached data is invalidated
• The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
• This is called soft affinity
User-Set Affinities
When to set your own affinities
• Two (or more) threads share data structures in memory
– map them to the same core so that they can share the cache (see the sketch below)
• Real-time threads – for example, a thread running a robot controller must not be context-switched, or else the robot can go unstable
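A minimal sketch of pinning the calling thread to a single core on Linux. pthread_setaffinity_np is a non-portable GNU extension, and core 2 is an arbitrary choice for illustration:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);              /* start from an empty core mask */
    CPU_SET(2, &set);            /* allow core 2 only */

    /* Pin the calling thread to the cores set in the mask. */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    return 0;
}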
Connections to Succeeding Lectures
• Parallel programming for multicore chips and multichip nodes
• Concurrent/parallel execution, and consistency and coherence
• Models for thinking about and formulating parallel computations
• Performance optimization and software engineering
Assignment
• To be submitted on Friday, 9/16 by 5 PM
• Go to the User Guide for Ranger:
http://www.tacc.utexas.edu/user-services/user-guides/ranger-user-guide
• Read the sections on affinity policy and NUMA Control
• Take the program that will be distributed on Wednesday, 9/14, define the affinity policy and NUMA control policy you would use, and justify your choices