Cache Coherence: Computer Science & Artificial Intelligence Lab
Cache Coherence: Computer Science & Artificial Intelligence Lab
Cache Coherence: Computer Science & Artificial Intelligence Lab
Daniel Sanchez
Computer Science & Artificial Intelligence Lab
M.I.T.
[https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/]
High-perf,
Cost (area, energy…) expensive Cost/perf curve of
core possible core designs
4 cores
Moderate perf,
efficient core
2 cores
Performance
What factors may limit multicore performance?
• Limited application parallelism
• Memory accesses and inter-core communication
• Programming complexity
April 6, 2021 MIT 6.823 Spring 2021 L13-3
Amdahl’s Law
• Speedup= timewithout enhancement / timewith enhancement
• Suppose an enhancement speeds up a fraction f of
a task by a factor of S
timenew = timeold·( (1-f) + f/S )
Soverall = 1 / ( (1-f) + f/S )
timeold
(1 - f) f
timenew
(1 - f) f/S
Corollary: Make the common case fast
April 6, 2021 MIT 6.823 Spring 2021 L13-4
Amdahl’s Law and Parallelism
• Say you write a program that can do 90% of the
work in parallel, but the other 10% is sequential
• What is the maximum speedup you can get by
running on a multicore machine?
Soverall = 1 / ( (1-f) + f/S )
• Data
• Control (blocking msgs, barriers, …)
– Low-level programming model:
processes + inter-process communication (e.g., MPI)
• Formally:
– Coherence: What values can a read return?
• Concerns reads/writes to a single memory location
– Consistency: When do writes become visible to reads?
• Concerns reads/writes to multiple memory locations
Main Memory
3 LD 0xA 2 (stale!)
Snoopy
P1 Cache Physical
Memory
Snoopy
P2 Cache
Actions
Invalid Processor Read (PrRd)
Processor Write (PrWr)
PrWr / BusWr Bus Read (BusRd)
Bus Write (BusWr)
BusRd 0xA
Cache Cache
Tag State Data Tag State Data
0xA V 2
Core 0 Core 1
1 LD 0xA
BusRd 0xA
Cache Cache
Tag State Data Tag State Data
0xA V 2 0xA V 2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
BusWr 0xA, 3
Cache Cache
Tag State Data Tag State Data
0xA V 2
3 0xA V
I 2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
3 ST 0xA
BusRd 0xA
Cache Cache
Tag State Data Tag State Data
0xA V 3 0xA V
I 3
2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
3 ST 0xA
4 LD 0xA
BusRd / Actions
PrWr / BusWB
BusRdX Processor Read (PrRd)
BusRdX Processor Write (PrWr)
PrWr / S / BusWB
BusRdX PrRd / Bus Read (BusRd)
BusRd BusRdX / --
Bus Read Exclusive
(BusRdX)
PrRd / --
BusRd / -- Bus Writeback (BusWB)
I
BusRd 0xA
Cache Cache
Tag State Data Tag State Data
0xA S 2
Core 0 Core 1
1 LD 0xA
BusRd 0xA
Cache Cache
Tag State Data Tag State Data
0xA S 2 0xA S 2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
BusRdX 0xA
Cache Cache
Tag State Data Tag State Data
0xA M
S 2
3 0xA S
I 2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
3 ST 0xA
Cache Cache
Tag State Data Tag State Data
0xA I
M 3 0xA M
I 10
2
Core 0 Core 1
1 LD 0xA
2 LD 0xA
3 ST 0xA
4 ST 0xA
Cache Cache
Tag State Data Tag State Data
0xA I
M 3 0xA M
I 10
2
Core 0 Core 1
• MSI allows caches to serve writes without updating
memory, so main memory can have stale data
– Core 0’s cache needs to supply data
– But main memory may also respond!
• Cache must override response from main memory
Cache Cache
Tag State Data Tag State Data
0xA S
I 10
3 0xA S
M 10
Core 0 Core 1
1 LD 0xA
2 LD 0xA
3 ST 0xA
4 ST 0xA
5 LD 0xA
BusRdX
/ -- PrRd / BusRd
BusRd / PrWr/
if no other
BusWB BusRdX
sharers
BusRdX / --
S I
PrRd / --
PrRd / BusRd
BusRd / -- if other sharers
April 6, 2021 MIT 6.823 Spring 2021 L13-25
MSI Optimizations: Owner State
• Observation: On MS transitions, must write back
line!
– What happens with frequent read-write sharing?
– Can we defer the write after S?
Split-Transaction Bus
Req1 Req2 Req3
Resp1 Resp3
• Supports multiple simultaneous transactions
– Higher throughput
– Responses may arrive out of order
• Often implemented as multiple buses (req+resp)
April 6, 2021 MIT 6.823 Spring 2021 L13-27
Non-Atomicity Transient States
• Protocol must handle BusGnt /
lack of atomicity BusRdX PrRd /-- PrWr / --
• Two types of states
– Stable (e.g. MSI)
BusGnt / M
– Transient
BusInv
• Split + race
BusRd /
transitions
SM PrWr / BusWB
• More complex BusRdX
BusReq
S / BusWB
cache cache
cache mutex=1
CPU-Memory Bus