
Cache Coherence

Daniel Sanchez
Computer Science & Artificial Intelligence Lab
M.I.T.

April 6, 2021, MIT 6.823 Spring 2021, Lecture 13


The Shift to Multicore

[https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]

• Since 2005, improvements in system performance mainly due to increasing cores per chip
• Why?
  – Technology scaling
  – Limited instruction-level parallelism
Multicore Performance

[Figure: cost (area, energy, …) vs. performance curves of possible core designs, contrasting a single high-performance, expensive core with 2 or 4 moderate-performance, efficient cores.]
What factors may limit multicore performance?
• Limited application parallelism
• Memory accesses and inter-core communication
• Programming complexity
Amdahl’s Law
• Speedup = time_without_enhancement / time_with_enhancement
• Suppose an enhancement speeds up a fraction f of a task by a factor of S:

  time_new = time_old · ((1 − f) + f/S)
  S_overall = 1 / ((1 − f) + f/S)

[Figure: time_old is split into a (1 − f) portion and an f portion; in time_new the f portion shrinks to f/S.]
Corollary: Make the common case fast
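
A quick numeric check of the formula (a minimal sketch; the function and its name are ours, not from the slides):

```c
#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the work
 * is sped up by a factor of S. */
double amdahl_speedup(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    printf("f=0.9,  S=10   -> %.2fx\n", amdahl_speedup(0.9, 10));    /* ~5.26x */
    printf("f=0.9,  S=inf  -> %.2fx\n", amdahl_speedup(0.9, 1e12));  /* -> 10x */
    printf("f=0.99, S=1000 -> %.2fx\n", amdahl_speedup(0.99, 1000)); /* ~91x   */
    return 0;
}
```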
Amdahl’s Law and Parallelism
• Say you write a program that can do 90% of the
work in parallel, but the other 10% is sequential
• What is the maximum speedup you can get by
running on a multicore machine?
S_overall = 1 / ((1 − f) + f/S)

f = 0.9, S = ∞ → S_overall = 10

What f do you need to use a 1000-core machine well?
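
As a worked example (our arithmetic, not from the slides): even with f = 0.999, S_overall = 1 / (0.001 + 0.999/1000) ≈ 500, so a 1000-core machine delivers only half its peak; using it well requires all but a tiny fraction of the work to be parallel.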



Communication Models
• Shared memory:
  – Single address space
  – Implicit communication by reading/writing memory
    • Data
    • Control (semaphores, locks, barriers, …)
  – Low-level programming model: threads
• Message passing:
  – Separate address spaces
  – Explicit communication by send/receive messages
    • Data
    • Control (blocking messages, barriers, …)
  – Low-level programming model: processes + inter-process communication (e.g., MPI)

[Figure: shared memory: all processors reach one Mem; message passing: each processor has its own Mem, connected by a network.]

• Pros/cons of each model? (See the sketch below.)
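
A minimal sketch of the two models in C (our example, not from the slides): the threads communicate implicitly through a shared counter, while the forked process has a separate address space and communicates explicitly over a pipe. Compile with `cc -pthread`.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Shared memory: threads communicate implicitly through memory. */
int shared_counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    pthread_mutex_lock(&lock);   /* control: a lock */
    shared_counter += 1;         /* data: a plain store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared memory: counter = %d\n", shared_counter);

    /* Message passing: separate address spaces, explicit send/receive. */
    int fds[2];
    pipe(fds);
    if (fork() == 0) {           /* child gets its own copy of memory */
        int msg = 42;
        write(fds[1], &msg, sizeof msg);   /* explicit send */
        _exit(0);
    }
    int received;
    read(fds[0], &received, sizeof received);  /* explicit receive */
    printf("message passing: received = %d\n", received);
    return 0;
}
```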



Coherence & Consistency
• Shared memory systems:
– Have multiple private caches for performance reasons
– Need to provide the illusion of a single shared memory

• Intuition: A read should return the most recently written value
– What is “most recent”?

• Formally:
– Coherence: What values can a read return?
• Concerns reads/writes to a single memory location
– Consistency: When do writes become visible to reads?
• Concerns reads/writes to multiple memory locations



Cache Coherence Avoids Stale Data

[Figure: four cores with private caches in front of main memory. After the sequence below, one cache holds $[0xA] = 2 while another holds $[0xA] = 3.]

1 LD 0xA → 2
2 ST 3 → 0xA
3 LD 0xA → 2 (stale!)

• A cache coherence protocol controls cache contents to avoid stale cache lines



Implementing Cache Coherence
• Coherence protocols must enforce two rules:
– Write propagation: Writes eventually become visible to all processors
– Write serialization: Writes to the same location are serialized (all
processors see them in the same order)
• How to ensure write propagation?
– Write-invalidate protocols: Invalidate all other cached copies before
performing the write
– Write-update protocols: Update all other cached copies after
performing the write
• How to track sharing state of cached data and serialize
requests to the same address?
– Snooping-based protocols: All caches observe each other’s actions
through a shared bus (bus is the serialization point)
– Directory-based protocols: A coherence directory tracks contents of
private caches and serializes requests (directory is the serialization
point)
Snooping-Based Coherence
(Goodman, 1983)
[Figure: P1, P2, and P3, each with a snoopy cache, plus DMA and disks, attached to physical memory over a shared bus.]

Caches watch (snoop on) the bus to keep all processors' view of memory coherent.



Snooping-Based Coherence
• Bus provides serialization point
  – Broadcast, totally ordered
• One cache controller per core "snoops" all bus transactions; the controller:
  – Responds to requests from its core and from the bus
  – Changes the state of the selected cache block
  – Generates bus transactions to access data or invalidate
• Snoopy protocol (FSM)
  – State-transition diagram
  – Actions
• Handling writes:
  – Write-invalidate
  – Write-update

[Figure: the processor issues ld/st to its cache (State / Tag / Data array); the controller snoops observed bus transactions.]


A Simple Protocol: Valid/Invalid (VI)

• Assume write-through caches
• Transition nomenclature: triggering action / taken action(s)

State-transition diagram (reconstructed as a table):

  Valid:   PrRd / --       (stay Valid)
  Valid:   PrWr / BusWr    (stay Valid; write-through)
  Valid:   BusWr / --      → Invalid
  Invalid: PrRd / BusRd    → Valid
  Invalid: PrWr / BusWr    (stay Invalid)

Actions: Processor Read (PrRd), Processor Write (PrWr), Bus Read (BusRd), Bus Write (BusWr)
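
The VI machine is small enough to write down directly. A minimal sketch in C (our encoding; names like vi_next are not from the slides) returning the next state and the bus transaction to issue:

```c
#include <stdio.h>

typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } vi_event_t;
typedef enum { NONE, ISSUE_BUS_RD, ISSUE_BUS_WR } vi_action_t;

/* Next-state / action function for the VI protocol (write-through). */
vi_state_t vi_next(vi_state_t s, vi_event_t e, vi_action_t *act) {
    *act = NONE;
    switch (s) {
    case VALID:
        if (e == PR_WR) { *act = ISSUE_BUS_WR; return VALID; }   /* write through   */
        if (e == BUS_WR) return INVALID;  /* another cache wrote: invalidate copy */
        return VALID;                     /* PrRd hit, snooped BusRd: no change   */
    case INVALID:
        if (e == PR_RD) { *act = ISSUE_BUS_RD; return VALID; }
        if (e == PR_WR) { *act = ISSUE_BUS_WR; return INVALID; } /* no allocation */
        return INVALID;                   /* snooped traffic: ignore              */
    }
    return s;
}

int main(void) {
    vi_action_t a;
    vi_state_t s = vi_next(INVALID, PR_RD, &a);  /* read miss: BusRd, -> Valid */
    printf("state=%d action=%d\n", s, a);
    return 0;
}
```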



Valid/Invalid Example
[Figure: (1) Core 0 executes LD 0xA; its cache misses and issues BusRd 0xA. Core 0's cache now holds 0xA in state V with data 2; Core 1's cache is empty.]



Valid/Invalid Example
[Figure: (2) Core 1 executes LD 0xA; after BusRd 0xA, both caches hold 0xA in state V with data 2.]

Additional loads satisfied locally, without BusRd



Valid/Invalid Example
[Figure: (3) Core 0 executes ST 0xA ← 3; the write goes through to memory as BusWr 0xA, 3. Core 0's cache now holds 0xA V 3; Core 1 snoops the BusWr and invalidates its copy (V → I).]



Valid/Invalid Example
[Figure: (4) Core 1 executes LD 0xA; its copy is Invalid, so it issues BusRd 0xA and refetches the line from memory, holding 0xA V 3 again.]

VI problems?
• Every write updates main memory
• Every write requires broadcast & snoop
Modified/Shared/Invalid (MSI)
Protocol
• Allows writeback caches + satisfying writes locally

State-transition diagram (reconstructed as a table; processor-initiated and bus-initiated transitions):

  M: PrRd / --         (stay M)
  M: PrWr / --         (stay M; writes satisfied locally)
  M: BusRd / BusWB     → S  (write the line back, keep a shared copy)
  M: BusRdX / BusWB    → I
  S: PrRd / --         (stay S)
  S: PrWr / BusRdX     → M
  S: BusRd / --        (stay S)
  S: BusRdX / --       → I
  I: PrRd / BusRd      → S
  I: PrWr / BusRdX     → M

Actions: Processor Read (PrRd), Processor Write (PrWr), Bus Read (BusRd), Bus Read Exclusive (BusRdX), Bus Writeback (BusWB)
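
To see the protocol in action, here is a small two-cache MSI simulator in C (a sketch under our own simplifications: one address, one line per core, memory modeled as a single int; none of this code is from the slides). It replays the LD/LD/ST/ST/LD trace used in the next slides:

```c
#include <stdio.h>

typedef enum { ST_I, ST_S, ST_M } state_t;
typedef struct { state_t st; int data; } line_t;

line_t cache[2];   /* one line per core, both caching address 0xA */
int mem = 2;       /* main memory's copy of 0xA */

/* The other cache snoops a bus transaction for the line.
 * excl = 1 for BusRdX (invalidate), 0 for BusRd (demote M to S). */
void snoop(int other, int excl) {
    line_t *l = &cache[other];
    if (l->st == ST_M)
        mem = l->data;            /* BusWB: flush the dirty line */
    if (excl)
        l->st = ST_I;
    else if (l->st == ST_M)
        l->st = ST_S;
}

void load(int c) {
    if (cache[c].st == ST_I) {    /* miss: issue BusRd */
        snoop(1 - c, 0);
        cache[c].st = ST_S;
        cache[c].data = mem;
    }                             /* S or M: hit, no bus traffic */
    printf("core %d: LD -> %d, state %c\n", c, cache[c].data, "ISM"[cache[c].st]);
}

void store(int c, int v) {
    if (cache[c].st != ST_M) {    /* S or I: issue BusRdX to own the line */
        snoop(1 - c, 1);
        cache[c].st = ST_M;
    }
    cache[c].data = v;            /* write satisfied locally */
    printf("core %d: ST %d, state M\n", c, v);
}

int main(void) {
    load(0);      /* 1: BusRd, core 0 S=2                                   */
    load(1);      /* 2: BusRd, both S=2                                     */
    store(0, 3);  /* 3: BusRdX, core 0 M=3, core 1 I                        */
    store(1, 10); /* 4: BusRdX, core 0 writes back 3 and goes I; core 1 M=10 */
    load(0);      /* 5: BusRd, core 1 writes back 10 and goes S; core 0 S=10 */
    printf("mem = %d\n", mem);
    return 0;
}
```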



MSI Example
[Figure: (1) Core 0 executes LD 0xA; after BusRd 0xA, Core 0's cache holds 0xA in state S with data 2.]



MSI Example
[Figure: (2) Core 1 executes LD 0xA; after BusRd 0xA, both caches hold 0xA in state S with data 2.]

Additional loads are satisfied locally, without BusRd (as in VI).



MSI Example
[Figure: (3) Core 0 executes ST 0xA ← 3; it issues BusRdX 0xA, which invalidates Core 1's copy (S → I). Core 0's line goes S → M with data 3; memory is not updated.]

Additional loads and stores from Core 0 are satisfied locally, without bus transactions (unlike in VI).
MSI Example
[Figure: (4) Core 1 executes ST 0xA ← 10; it issues BusRdX 0xA. Core 0 writes its dirty line back (BusWB 0xA, 3) and invalidates (M → I); Core 1's line goes I → M with data 10.]



Cache interventions
[Figure: same step as the previous slide: Core 1's BusRdX 0xA while Core 0 holds 0xA in state M.]
• MSI allows caches to serve writes without updating
memory, so main memory can have stale data
– Core 0’s cache needs to supply data
– But main memory may also respond!
• Cache must override response from main memory



MSI Example
[Figure: (5) Core 0 executes LD 0xA and issues BusRd 0xA. Core 1's dirty line is demoted M → S and written back (BusWB 0xA, 10); Core 0's line goes I → S with data 10.]



MSI Optimizations: Exclusive State
• Observation: Doing read-modify-write sequences
on private data is common
– What’s the problem with MSI?

• Solution: E state (exclusive, clean)
  – If no other sharers, a read acquires the line in E instead of S
  – Writes silently cause E → M (exclusive, dirty)



MESI: An Enhanced MSI protocol
increased performance for private read-write data

Each cache line has an address tag plus state bits:
  M: Modified Exclusive
  E: Exclusive, unmodified
  S: Shared
  I: Invalid

State-transition diagram (reconstructed as a table):

  M: PrRd / --, PrWr / --    (stay M)
  M: BusRd / BusWB           → S
  M: BusRdX / BusWB          → I
  E: PrRd / --               (stay E)
  E: PrWr / --               → M  (silent upgrade, no bus transaction)
  E: BusRd / --              → S
  E: BusRdX / --             → I
  S: PrRd / --, BusRd / --   (stay S)
  S: PrWr / BusRdX           → M
  S: BusRdX / --             → I
  I: PrRd / BusRd            → E if no other sharers, S if other sharers
  I: PrWr / BusRdX           → M
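
The E-state decision is just a check for other sharers at fill time. A tiny sketch (our names; it assumes the snoop response reports whether any other cache holds the line):

```c
#include <stdio.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

/* On a read miss, fill in E if no other cache holds the line; later
 * writes then upgrade E -> M silently, with no BusRdX, which is the
 * whole point of the E state. */
mesi_t fill_on_read_miss(int other_sharers) {
    return other_sharers ? MESI_S : MESI_E;
}

int main(void) {
    printf("%d %d\n", fill_on_read_miss(0), fill_on_read_miss(1)); /* E, S */
    return 0;
}
```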
MSI Optimizations: Owner State
• Observation: On M → S transitions, the line must be written back!
  – What happens with frequent read-write sharing?
  – Can we defer the writeback until after S?

• Solution: O state (Owner)
  – O = S + responsibility to write back
  – On an M → S transition, one sharer (typically the one that had the line in M) retains the line in O instead of S
  – On eviction, O writes back the line (or hands ownership to another sharer, which does S → O)

• MSI, MESI, MOSI, MOESI…
  – Typically include E if private read-write data >> shared read-only data (common)
  – Typically include O only if writebacks are expensive (main memory vs. a shared L3)



Split-Transaction and Pipelined Buses
Atomic transaction bus:
[Figure: timeline: each request occupies the bus through its delay until the response returns. Simple, but low throughput!]

Split-transaction bus:
[Figure: timeline: Req1, Req2, Req3 overlap, with Resp1 and Resp3 returning later, possibly out of order.]

• Supports multiple simultaneous transactions
  – Higher throughput
  – Responses may arrive out of order
• Often implemented as multiple buses (request + response)
Non-Atomicity → Transient States
• Protocol must handle lack of atomicity
• Two types of states
  – Stable (e.g., MSI)
  – Transient
• Split transactions + races → more complex protocol

State-transition diagram (reconstructed as a table; SM, IM, and IS are transient states, and race transitions are omitted):

  I:  PrRd / BusReq    → IS
  I:  PrWr / BusReq    → IM
  IS: BusGnt / BusRd   → S
  IM: BusGnt / BusRdX  → M
  S:  PrWr / BusReq    → SM
  SM: BusGnt / BusInv  → M
  S:  PrRd / --, BusRd / --   (stay S)
  S:  BusRdX / --      → I
  M:  PrRd / --, PrWr / --    (stay M)
  M:  BusRd / BusWB    → S
  M:  BusRdX / BusWB   → I

Actions: Bus Request (BusReq), Bus Grant (BusGnt)


Scaling Cache Coherence
• Can implement ordered interconnects that scale
better than buses…

[Figure: Starfire E10000 interconnect tree, drawn with only eight processors for clarity. A coherence request is unicast up to the root, where it is serialized, before being broadcast down to all processors.]

• … but broadcast is fundamentally unscalable
  – Bandwidth and energy of transactions snooped by 100s of caches?
Directory-Based Coherence

• Route all coherence transactions through a directory
  – Tracks contents of private caches → no broadcasts
  – Serves as ordering point for conflicting requests → works on unordered networks

(more in the next lecture)

Coherence and False Sharing
Performance Issue #1

  state | blk addr | data0 | data1 | … | dataN

A cache block contains more than one word, and cache coherence is done at block granularity, not word granularity.

Suppose P1 writes word_i and P2 writes word_k, and both words have the same block address.

What can happen? The block may be invalidated (ping-pong) many times, unnecessarily, because the two addresses fall in the same block.

How to address this problem? (One common fix is sketched below.)
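
A minimal sketch of the usual fix, padding per-thread data out to cache-block granularity (our example; the 64-byte line size and all names are assumptions, not from the slides). Compile with `cc -pthread`:

```c
#include <pthread.h>
#include <stdio.h>

/* Two counters in the same cache block: writes by different threads
 * would ping-pong the block between caches (false sharing). */
struct counter { long v; };

/* Fix: pad each counter to an (assumed) 64-byte line so the two
 * threads write to different blocks. */
struct padded_counter { long v; char pad[64 - sizeof(long)]; };

struct padded_counter c[2];

void *bump(void *arg) {
    long idx = (long)arg;
    for (long i = 0; i < 100000000; i++)
        c[idx].v++;            /* each thread stays on its own line */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)0);
    pthread_create(&t1, NULL, bump, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", c[0].v, c[1].v);
    return 0;
}
```

Swapping `padded_counter` for the unpadded `counter` typically makes this loop severalfold slower on a multicore, which is exactly the ping-pong effect the slide describes.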



Coherence and Synchronization
Performance Issue #2
Processors 1, 2, and 3 each run:

     R ← 1
  L: swap (mutex), R;
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

[Figure: the processors' caches contend over the line holding mutex = 1 on the CPU-memory bus.]

Cache coherence protocols will cause mutex to ping-pong between P1's and P2's caches.

Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing the swap only if it is found to be zero (test&test&set; sketched below).
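
A test&test&set spinlock in C11 atomics (a sketch; the hardware `swap` is expressed with `atomic_exchange`, and all names here are ours):

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_int mutex_word = 0;   /* 0 = free, 1 = held */

void lock(void) {
    for (;;) {
        /* Test: spin on plain loads, which hit in the local cache
         * (state S) and generate no bus traffic while the lock is held. */
        while (atomic_load(&mutex_word) != 0)
            ;
        /* Test&set: only now do the swap, which needs exclusive
         * ownership of the line (BusRdX) and causes coherence traffic. */
        if (atomic_exchange(&mutex_word, 1) == 0)
            return;          /* old value was 0: lock acquired */
    }
}

void unlock(void) {
    atomic_store(&mutex_word, 0);
}

int main(void) {
    lock();
    puts("in critical section");
    unlock();
    return 0;
}
```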



Coherence and Bus Occupancy
Performance Issue #3
• In general, an atomic read-modify-write instruction requires two memory (bus) operations with no intervening memory operations by other processors

• In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write:
  – expensive for simple buses
  – very expensive for split-transaction buses

• Modern processors instead use:
  – load-reserve
  – store-conditional



Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, plus the outcome of store-conditional.

  Load-reserve R, (a):
    <flag, adr> ← <1, a>;
    R ← M[a];

  Store-conditional (a), R:
    if <flag, adr> == <1, a>
      then cancel other procs' reservations on a;
           M[a] ← <R>;
           status ← succeed;
      else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve flag is cleared.
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the bus traffic
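
C has no portable load-reserve/store-conditional, but C11's compare-and-swap plays the same optimistic read-then-conditional-write role (our sketch, not from the slides; on LL/SC ISAs such as RISC-V or Arm, compilers lower this loop to lr/sc-style pairs):

```c
#include <stdatomic.h>
#include <stdio.h>

atomic_long counter = 0;

/* Optimistic read-modify-write: read, compute, then commit only if no
 * one else wrote in between, retrying on failure, analogous to a
 * load-reserve / store-conditional pair. */
void atomic_increment(atomic_long *a) {
    long old = atomic_load(a);                          /* "load-reserve" */
    while (!atomic_compare_exchange_weak(a, &old, old + 1))
        ;   /* "store-conditional" failed: old was reloaded; retry */
}

int main(void) {
    atomic_increment(&counter);
    printf("%ld\n", (long)atomic_load(&counter));
    return 0;
}
```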
Performance:
Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:

• increases bus utilization (and reduces processor stall time), especially on split-transaction buses

• reduces the cache ping-pong effect, because processors trying to acquire a mutex do not have to perform stores each time



Thank you!

Next lecture: Directory-Based Cache Coherence

