
Cache Coherence

Daniel Sanchez
Computer Science & Artificial Intelligence Lab
M.I.T.

April 6, 2021, MIT 6.823 Spring 2021, Lecture 13


The Shift to Multicore

[https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]

• Since 2005, improvements in system performance mainly due to increasing cores per chip
• Why?
  – Technology scaling
  – Limited instruction-level parallelism
Multicore Performance

[Figure: cost (area, energy, …) vs. performance curves of possible core designs, contrasting a single high-performance, expensive core with 2 or 4 moderate-performance, efficient cores.]
What factors may limit multicore performance?
• Limited application parallelism
• Memory accesses and inter-core communication
• Programming complexity
Amdahl’s Law
• Speedup = time_without_enhancement / time_with_enhancement
• Suppose an enhancement speeds up a fraction f of a task by a factor of S:

  time_new = time_old · ((1 − f) + f/S)
  S_overall = 1 / ((1 − f) + f/S)

[Figure: time_old is split into a (1 − f) portion and an f portion; in time_new the f portion shrinks to f/S.]
Corollary: Make the common case fast
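
A quick numeric check of the formula (a minimal sketch; the function and its name are ours, not from the slides):

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the work
 * is sped up by a factor of S. */
double amdahl_speedup(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    printf("f=0.9,  S=10   -> %.2fx\n", amdahl_speedup(0.9, 10));    /* ~5.26x */
    printf("f=0.9,  S=inf  -> %.2fx\n", amdahl_speedup(0.9, 1e12));  /* -> 10x */
    printf("f=0.99, S=1000 -> %.2fx\n", amdahl_speedup(0.99, 1000)); /* ~91x   */
    return 0;
}
```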
Amdahl’s Law and Parallelism
• Say you write a program that can do 90% of the
work in parallel, but the other 10% is sequential
• What is the maximum speedup you can get by
running on a multicore machine?
S_overall = 1 / ((1 − f) + f/S)

f = 0.9, S = ∞ → S_overall = 10

What f do you need to use a 1000-core machine well?
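
As a worked example (our arithmetic, not from the slides): even with f = 0.999, S_overall = 1 / (0.001 + 0.999/1000) ≈ 500, so a 1000-core machine delivers only half its peak; using it well requires all but a tiny fraction of the work to be parallel.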



Communication Models
• Shared memory:
  – Single address space
  – Implicit communication by reading/writing memory
    • Data
    • Control (semaphores, locks, barriers, …)
  – Low-level programming model: threads
• Message passing:
  – Separate address spaces
  – Explicit communication by send/receive messages
    • Data
    • Control (blocking messages, barriers, …)
  – Low-level programming model: processes + inter-process communication (e.g., MPI)

[Figure: shared memory: all processors reach one Mem; message passing: each processor has its own Mem, connected by a network.]

• Pros/cons of each model? (See the sketch below.)
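
A minimal sketch of the two models in C (our example, not from the slides): the threads communicate implicitly through a shared counter, while the forked process has a separate address space and communicates explicitly over a pipe. Compile with `cc -pthread`.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Shared memory: threads communicate implicitly through memory. */
int shared_counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    pthread_mutex_lock(&lock);   /* control: a lock */
    shared_counter += 1;         /* data: a plain store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared memory: counter = %d\n", shared_counter);

    /* Message passing: separate address spaces, explicit send/receive. */
    int fds[2];
    pipe(fds);
    if (fork() == 0) {           /* child gets its own copy of memory */
        int msg = 42;
        write(fds[1], &msg, sizeof msg);   /* explicit send */
        _exit(0);
    }
    int received;
    read(fds[0], &received, sizeof received);  /* explicit receive */
    printf("message passing: received = %d\n", received);
    return 0;
}
```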



Coherence & Consistency
• Shared memory systems:
– Have multiple private caches for performance reasons
– Need to provide the illusion of a single shared memory

• Intuition: A read should return the most recently written value
– What is “most recent”?

• Formally:
– Coherence: What values can a read return?
• Concerns reads/writes to a single memory location
– Consistency: When do writes become visible to reads?
• Concerns reads/writes to multiple memory locations



Cache Coherence Avoids Stale Data

[Figure: four cores with private caches in front of main memory. After the sequence below, one cache holds $[0xA] = 2 while another holds $[0xA] = 3.]

1 LD 0xA → 2
2 ST 3 → 0xA
3 LD 0xA → 2 (stale!)

• A cache coherence protocol controls cache contents to avoid stale cache lines



Implementing Cache Coherence
• Coherence protocols must enforce two rules:
– Write propagation: Writes eventually become visible to all processors
– Write serialization: Writes to the same location are serialized (all
processors see them in the same order)
• How to ensure write propagation?
– Write-invalidate protocols: Invalidate all other cached copies before
performing the write
– Write-update protocols: Update all other cached copies after
performing the write
• How to track sharing state of cached data and serialize
requests to the same address?
– Snooping-based protocols: All caches observe each other’s actions
through a shared bus (bus is the serialization point)
– Directory-based protocols: A coherence directory tracks contents of
private caches and serializes requests (directory is the serialization
point)
Snooping-Based Coherence
(Goodman, 1983)
[Figure: P1, P2, and P3, each with a snoopy cache, plus DMA and disks, attached to physical memory over a shared bus.]

Caches watch (snoop on) the bus to keep all processors' view of memory coherent.



Snooping-Based Coherence
• Bus provides serialization point
  – Broadcast, totally ordered
• One cache controller per core "snoops" all bus transactions; the controller:
  – Responds to requests from its core and from the bus
  – Changes the state of the selected cache block
  – Generates bus transactions to access data or invalidate
• Snoopy protocol (FSM)
  – State-transition diagram
  – Actions
• Handling writes:
  – Write-invalidate
  – Write-update

[Figure: the processor issues ld/st to its cache (State / Tag / Data array); the controller snoops observed bus transactions.]


A Simple Protocol: Valid/Invalid (VI)

• Assume write-through caches
• Transition nomenclature: triggering action / taken action(s)

State-transition diagram (reconstructed as a table):

  Valid:   PrRd / --       (stay Valid)
  Valid:   PrWr / BusWr    (stay Valid; write-through)
  Valid:   BusWr / --      → Invalid
  Invalid: PrRd / BusRd    → Valid
  Invalid: PrWr / BusWr    (stay Invalid)

Actions: Processor Read (PrRd), Processor Write (PrWr), Bus Read (BusRd), Bus Write (BusWr)
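
The VI machine is small enough to write down directly. A minimal sketch in C (our encoding; names like vi_next are not from the slides) returning the next state and the bus transaction to issue:

```c
#include <stdio.h>

typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } vi_event_t;
typedef enum { NONE, ISSUE_BUS_RD, ISSUE_BUS_WR } vi_action_t;

/* Next-state / action function for the VI protocol (write-through). */
vi_state_t vi_next(vi_state_t s, vi_event_t e, vi_action_t *act) {
    *act = NONE;
    switch (s) {
    case VALID:
        if (e == PR_WR) { *act = ISSUE_BUS_WR; return VALID; }   /* write through   */
        if (e == BUS_WR) return INVALID;  /* another cache wrote: invalidate copy */
        return VALID;                     /* PrRd hit, snooped BusRd: no change   */
    case INVALID:
        if (e == PR_RD) { *act = ISSUE_BUS_RD; return VALID; }
        if (e == PR_WR) { *act = ISSUE_BUS_WR; return INVALID; } /* no allocation */
        return INVALID;                   /* snooped traffic: ignore              */
    }
    return s;
}

int main(void) {
    vi_action_t a;
    vi_state_t s = vi_next(INVALID, PR_RD, &a);  /* read miss: BusRd, -> Valid */
    printf("state=%d action=%d\n", s, a);
    return 0;
}
```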



Valid/Invalid Example
[Figure: (1) Core 0 executes LD 0xA; its cache misses and issues BusRd 0xA. Core 0's cache now holds 0xA in state V with data 2; Core 1's cache is empty.]



Valid/Invalid Example
[Figure: (2) Core 1 executes LD 0xA; after BusRd 0xA, both caches hold 0xA in state V with data 2.]

Additional loads satisfied locally, without BusRd



Valid/Invalid Example
[Figure: (3) Core 0 executes ST 0xA ← 3; the write goes through to memory as BusWr 0xA, 3. Core 0's cache now holds 0xA V 3; Core 1 snoops the BusWr and invalidates its copy (V → I).]



Valid/Invalid Example
[Figure: (4) Core 1 executes LD 0xA; its copy is Invalid, so it issues BusRd 0xA and refetches the line from memory, holding 0xA V 3 again.]

VI problems?
• Every write updates main memory
• Every write requires broadcast & snoop
Modified/Shared/Invalid (MSI)
Protocol
• Allows writeback caches + satisfying writes locally

State-transition diagram (reconstructed as a table; processor-initiated and bus-initiated transitions):

  M: PrRd / --         (stay M)
  M: PrWr / --         (stay M; writes satisfied locally)
  M: BusRd / BusWB     → S  (write the line back, keep a shared copy)
  M: BusRdX / BusWB    → I
  S: PrRd / --         (stay S)
  S: PrWr / BusRdX     → M
  S: BusRd / --        (stay S)
  S: BusRdX / --       → I
  I: PrRd / BusRd      → S
  I: PrWr / BusRdX     → M

Actions: Processor Read (PrRd), Processor Write (PrWr), Bus Read (BusRd), Bus Read Exclusive (BusRdX), Bus Writeback (BusWB)
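
To see the protocol in action, here is a small two-cache MSI simulator in C (a sketch under our own simplifications: one address, one line per core, memory modeled as a single int; none of this code is from the slides). It replays the LD/LD/ST/ST/LD trace used in the next slides:

```c
#include <stdio.h>

typedef enum { ST_I, ST_S, ST_M } state_t;
typedef struct { state_t st; int data; } line_t;

line_t cache[2];   /* one line per core, both caching address 0xA */
int mem = 2;       /* main memory's copy of 0xA */

/* The other cache snoops a bus transaction for the line.
 * excl = 1 for BusRdX (invalidate), 0 for BusRd (demote M to S). */
void snoop(int other, int excl) {
    line_t *l = &cache[other];
    if (l->st == ST_M)
        mem = l->data;            /* BusWB: flush the dirty line */
    if (excl)
        l->st = ST_I;
    else if (l->st == ST_M)
        l->st = ST_S;
}

void load(int c) {
    if (cache[c].st == ST_I) {    /* miss: issue BusRd */
        snoop(1 - c, 0);
        cache[c].st = ST_S;
        cache[c].data = mem;
    }                             /* S or M: hit, no bus traffic */
    printf("core %d: LD -> %d, state %c\n", c, cache[c].data, "ISM"[cache[c].st]);
}

void store(int c, int v) {
    if (cache[c].st != ST_M) {    /* S or I: issue BusRdX to own the line */
        snoop(1 - c, 1);
        cache[c].st = ST_M;
    }
    cache[c].data = v;            /* write satisfied locally */
    printf("core %d: ST %d, state M\n", c, v);
}

int main(void) {
    load(0);      /* 1: BusRd, core 0 S=2                                   */
    load(1);      /* 2: BusRd, both S=2                                     */
    store(0, 3);  /* 3: BusRdX, core 0 M=3, core 1 I                        */
    store(1, 10); /* 4: BusRdX, core 0 writes back 3 and goes I; core 1 M=10 */
    load(0);      /* 5: BusRd, core 1 writes back 10 and goes S; core 0 S=10 */
    printf("mem = %d\n", mem);
    return 0;
}
```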



MSI Example
[Figure: (1) Core 0 executes LD 0xA; after BusRd 0xA, Core 0's cache holds 0xA in state S with data 2.]



MSI Example
[Figure: (2) Core 1 executes LD 0xA; after BusRd 0xA, both caches hold 0xA in state S with data 2.]

Additional loads are satisfied locally, without BusRd (as in VI).



MSI Example
[Figure: (3) Core 0 executes ST 0xA ← 3; it issues BusRdX 0xA, which invalidates Core 1's copy (S → I). Core 0's line goes S → M with data 3; memory is not updated.]

Additional loads and stores from Core 0 are satisfied locally, without bus transactions (unlike in VI).
MSI Example
[Figure: (4) Core 1 executes ST 0xA ← 10; it issues BusRdX 0xA. Core 0 writes its dirty line back (BusWB 0xA, 3) and invalidates (M → I); Core 1's line goes I → M with data 10.]



Cache interventions
[Figure: same step as the previous slide: Core 1's BusRdX 0xA while Core 0 holds 0xA in state M.]
• MSI allows caches to serve writes without updating
memory, so main memory can have stale data
– Core 0’s cache needs to supply data
– But main memory may also respond!
• Cache must override response from main memory



MSI Example
[Figure: (5) Core 0 executes LD 0xA and issues BusRd 0xA. Core 1's dirty line is demoted M → S and written back (BusWB 0xA, 10); Core 0's line goes I → S with data 10.]



MSI Optimizations: Exclusive State
• Observation: Doing read-modify-write sequences
on private data is common
– What’s the problem with MSI?

• Solution: E state (exclusive, clean)
  – If no other sharers, a read acquires the line in E instead of S
  – Writes silently cause E → M (exclusive, dirty)



MESI: An Enhanced MSI protocol
increased performance for private read-write data

Each cache line has an address tag plus state bits:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

State-transition diagram (reconstructed as a table):

  M: PrRd / --, PrWr / --    (stay M)
  M: BusRd / BusWB           → S
  M: BusRdX / BusWB          → I
  E: PrRd / --               (stay E)
  E: PrWr / --               → M  (silent upgrade, no bus transaction)
  E: BusRd / --              → S
  E: BusRdX / --             → I
  S: PrRd / --, BusRd / --   (stay S)
  S: PrWr / BusRdX           → M
  S: BusRdX / --             → I
  I: PrRd / BusRd            → E if no other sharers, S if other sharers
  I: PrWr / BusRdX           → M
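
The E-state decision is just a check for other sharers at fill time. A tiny sketch (our names; it assumes the snoop response reports whether any other cache holds the line):

```c
#include <stdio.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

/* On a read miss, fill in E if no other cache holds the line; later
 * writes then upgrade E -> M silently, with no BusRdX, which is the
 * whole point of the E state. */
mesi_t fill_on_read_miss(int other_sharers) {
    return other_sharers ? MESI_S : MESI_E;
}

int main(void) {
    printf("%d %d\n", fill_on_read_miss(0), fill_on_read_miss(1)); /* E, S */
    return 0;
}
```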
MSI Optimizations: Owner State
• Observation: On M → S transitions, the line must be written back!
  – What happens with frequent read-write sharing?
  – Can we defer the writeback until after S?

• Solution: O state (Owner)
  – O = S + responsibility to write back
  – On an M → S transition, one sharer (typically the one that had the line in M) retains the line in O instead of S
  – On eviction, O writes back the line (or hands ownership to another sharer, which does S → O)

• MSI, MESI, MOSI, MOESI…
  – Typically include E if private read-write data >> shared read-only data (common)
  – Typically include O only if writebacks are expensive (main memory vs. a shared L3)



Split-Transaction and Pipelined Buses
Atomic transaction bus:
[Figure: timeline: each request occupies the bus through its delay until the response returns. Simple, but low throughput!]

Split-transaction bus:
[Figure: timeline: Req1, Req2, Req3 overlap, with Resp1 and Resp3 returning later, possibly out of order.]

• Supports multiple simultaneous transactions
  – Higher throughput
  – Responses may arrive out of order
• Often implemented as multiple buses (request + response)
Non-Atomicity → Transient States
• Protocol must handle lack of atomicity
• Two types of states
  – Stable (e.g., MSI)
  – Transient
• Split transactions + races → more complex protocol

State-transition diagram (reconstructed as a table; SM, IM, and IS are transient states, and race transitions are omitted):

  I:  PrRd / BusReq    → IS
  I:  PrWr / BusReq    → IM
  IS: BusGnt / BusRd   → S
  IM: BusGnt / BusRdX  → M
  S:  PrWr / BusReq    → SM
  SM: BusGnt / BusInv  → M
  S:  PrRd / --, BusRd / --   (stay S)
  S:  BusRdX / --      → I
  M:  PrRd / --, PrWr / --    (stay M)
  M:  BusRd / BusWB    → S
  M:  BusRdX / BusWB   → I

Actions: Bus Request (BusReq), Bus Grant (BusGnt)


Scaling Cache Coherence
• Can implement ordered interconnects that scale
better than buses…

[Figure: Starfire E10000 interconnect tree, drawn with only eight processors for clarity. A coherence request is unicast up to the root, where it is serialized, before being broadcast down to all processors.]

• … but broadcast is fundamentally unscalable
  – Bandwidth and energy of transactions snooped by 100s of caches?
Directory-Based Coherence

• Route all coherence transactions through a directory
  – Tracks contents of private caches → no broadcasts
  – Serves as ordering point for conflicting requests → works on unordered networks

(more in the next lecture)

Coherence and False Sharing
Performance Issue #1

  state | blk addr | data0 | data1 | … | dataN

A cache block contains more than one word, and cache coherence is done at block granularity, not word granularity.

Suppose P1 writes word_i and P2 writes word_k, and both words have the same block address.

What can happen? The block may be invalidated (ping-pong) many times, unnecessarily, because the two addresses fall in the same block.

How to address this problem? (One common fix is sketched below.)
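
A minimal sketch of the usual fix, padding per-thread data out to cache-block granularity (our example; the 64-byte line size and all names are assumptions, not from the slides). Compile with `cc -pthread`:

```c
#include <pthread.h>
#include <stdio.h>

/* Two counters in the same cache block: writes by different threads
 * would ping-pong the block between caches (false sharing). */
struct counter { long v; };

/* Fix: pad each counter to an (assumed) 64-byte line so the two
 * threads write to different blocks. */
struct padded_counter { long v; char pad[64 - sizeof(long)]; };

struct padded_counter c[2];

void *bump(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < 100000000; i++)
        c[idx].v++;            /* each thread stays on its own line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)0);
    pthread_create(&t1, NULL, bump, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", c[0].v, c[1].v);
    return 0;
}
```

Swapping `padded_counter` for the unpadded `counter` typically makes this loop severalfold slower on a multicore, which is exactly the ping-pong effect the slide describes.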



Coherence and Synchronization
Performance Issue #2
Processors 1, 2, and 3 each run:

     R ← 1
  L: swap (mutex), R;
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

[Figure: the processors' caches contend over the line holding mutex = 1 on the CPU-memory bus.]

Cache coherence protocols will cause mutex to ping-pong between P1's and P2's caches.

Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing the swap only if it is found to be zero (test&test&set; sketched below).
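
A test&test&set spinlock in C11 atomics (a sketch; the hardware `swap` is expressed with `atomic_exchange`, and all names here are ours):

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int mutex_word = 0;   /* 0 = free, 1 = held */

void lock(void) {
    for (;;) {
        /* Test: spin on plain loads, which hit in the local cache
         * (state S) and generate no bus traffic while the lock is held. */
        while (atomic_load(&mutex_word) != 0)
            ;
        /* Test&set: only now do the swap, which needs exclusive
         * ownership of the line (BusRdX) and causes coherence traffic. */
        if (atomic_exchange(&mutex_word, 1) == 0)
            return;          /* old value was 0: lock acquired */
    }
}

void unlock(void) {
    atomic_store(&mutex_word, 0);
}

int main(void) {
    lock();
    puts("in critical section");
    unlock();
    return 0;
}
```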



Coherence and Bus Occupancy
Performance Issue #3
• In general, an atomic read-modify-write instruction requires two memory (bus) operations with no intervening memory operations by other processors

• In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write:
  – expensive for simple buses
  – very expensive for split-transaction buses

• Modern processors instead use:
  – load-reserve
  – store-conditional



Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, plus the outcome of store-conditional.

  Load-reserve R, (a):
    <flag, adr> ← <1, a>;
    R ← M[a];

  Store-conditional (a), R:
    if <flag, adr> == <1, a>
      then cancel other procs' reservations on a;
           M[a] ← <R>;
           status ← succeed;
      else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve flag is cleared.
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the bus traffic
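
C has no portable load-reserve/store-conditional, but C11's compare-and-swap plays the same optimistic read-then-conditional-write role (our sketch, not from the slides; on LL/SC ISAs such as RISC-V or Arm, compilers lower this loop to lr/sc-style pairs):

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_long counter = 0;

/* Optimistic read-modify-write: read, compute, then commit only if no
 * one else wrote in between, retrying on failure, analogous to a
 * load-reserve / store-conditional pair. */
void atomic_increment(atomic_long *a) {
    long old = atomic_load(a);                          /* "load-reserve" */
    while (!atomic_compare_exchange_weak(a, &old, old + 1))
        ;   /* "store-conditional" failed: old was reloaded; retry */
}

int main(void) {
    atomic_increment(&counter);
    printf("%ld\n", (long)atomic_load(&counter));
    return 0;
}
```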
Performance:
Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

• increases bus utilization (and reduces processor stall time), especially on split-transaction buses

• reduces the cache ping-pong effect, because processors trying to acquire a mutex do not have to perform stores each time



Thank you!

Next lecture: Directory-Based Cache Coherence

