
The MESI protocol

• As described earlier, in MSI, a cache block can be in one of three states


• Invalid (uncached): not in the cache (not valid in any cache)
• Shared/clean: cached in one or more processors and memory is up-to-date
• Modified/dirty/exclusive: one processor (owner) has data; memory out-of-date

• The MESI protocol splits the valid MSI states (Shared and Modified) into three states:
• Invalid (uncached): same as in MSI
• Shared: cached in more than one processor and memory is up-to-date
• Exclusive: one processor (owner) has data and it is clean (clean but not shared)
• Modified: one processor (owner) has data, but it is dirty
• If MESI is implemented using a directory, then the information kept for each block in
the directory is the same as in the three-state (MSI) protocol:
• Shared in MESI = shared/clean, with more than one sharer
• Exclusive in MESI = shared/clean, with only one sharer
• Modified in MESI = Exclusive/Modified/dirty
• However, at each cached copy, a distinction is made between shared, exclusive and
modified (rather than only shared and modified).

The MESI protocol


• On a read miss (local block is invalid), load the block and change its state to
• “exclusive” if the block was not cached anywhere (only memory had a valid copy)
• “shared” if it was already shared, modified or exclusive in another cache
- if it was modified, the owner will send you a clean copy
- if it was modified or exclusive, the previous owner will change the state
of the block to “shared” in its cache.
• On a write miss: same as a read miss, except set the state to “modified”
→ copies in other caches (if any) are invalidated
• On a write hit to a “modified” block, do nothing
• On a write hit to an “exclusive” block, change the block to “modified”
→ no need for invalidation; this is the main advantage of MESI over MSI
• On a write hit to a “shared” block change the block to “modified” and
invalidate the other cached copies.
• When a modified block is evicted, write it back.
• In snooping bus implementations of MESI, on a read miss, we need to set its state
correctly to “shared” (if the block is in some other cache(s)) or “exclusive” (if not).
• To take full advantage of MESI, we should know when a block should be changed from
shared to exclusive.
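The transitions above can be summarized as a small state machine. The following is a minimal sketch in C of the local-event transitions only, under a simplified model where the bus/snoop interactions are hidden behind hypothetical helpers (bus_invalidate_other_copies, other_caches_have_copy); it is an illustration, not a full protocol implementation.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    /* Hypothetical stand-ins for the bus/directory interactions. */
    static void bus_invalidate_other_copies(void) { /* post an invalidation */ }
    static bool other_caches_have_copy(void)      { return false; /* snoop result */ }

    /* Read miss: load the block; Exclusive if no other cache holds it, else Shared. */
    mesi_state on_read_miss(void) {
        return other_caches_have_copy() ? SHARED : EXCLUSIVE;
    }

    /* Write hit: Modified stays Modified; Exclusive upgrades silently (no bus traffic);
       Shared must invalidate the other cached copies first. */
    mesi_state on_write_hit(mesi_state s) {
        if (s == SHARED)
            bus_invalidate_other_copies();
        return MODIFIED;
    }

    /* Write miss: load the block as on a read miss, invalidate any other copies,
       and install the block as Modified. */
    mesi_state on_write_miss(void) {
        bus_invalidate_other_copies();
        return MODIFIED;
    }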
The MESI protocol
• If MESI is implemented as a snooping protocol, then the main advantage over
the three-state (MSI) protocol is when a read to an uncached block is followed by a
write to that block.
• After the uncached block is read, it is marked “exclusive” (we need a scheme to
know that it was uncached).
• Note that, when writing to a shared block, the transaction has to be posted on the
bus so that other sharers invalidate their copies.
• But when writing to an exclusive block, there is no need to post the transaction on
the bus. Hence, by distinguishing between shared and exclusive states, we can
avoid bus transactions when writing on an exclusive block.
• However, now a cache that has an “exclusive” block has to monitor the bus for any
read to that block. Such a read will change the state to “shared”.
• What if a block was shared in two caches and is evicted from one of them. Should
we detect this case and set the block to Exclusive in the other cache?
• This advantage disappears in a directory protocol since, after a write to an
exclusive block, the directory has to be notified to change the state to
modified.

Latency optimization
1) Forwarding requests
[Figure: message-flow diagrams between the local node (L), the home node (H), and the
remote node (R), illustrating request forwarding; messages are labeled req, forward,
respond, reply, and revise, numbered in the order they are sent.]

2) Use SRAM for directories (hardware optimization)

3) Overlap activities on the critical path


- send multiple invalidations in parallel
- look up the directory and memory in parallel at the home node
Storage overhead
• In the simplest representation of a directory entry, a full bit vector is
used for each entry (one bit used to indicate presence in each node.)
– storage overhead doesn’t scale well with number of nodes.
• Larger blocks (cache lines) mean lower overhead
• For very large number of nodes, may use a list of sharers instead of a
bit vector
– Lower overhead if only few sharers
– Example: with 1024 processors (10-bit sharer IDs), the list takes less space than
the 1024-bit vector only when there are fewer than about 100 sharers (see the
sketch after this list)
• May reduce overhead further by keeping only directory entries for the
blocks that are cached (uncached blocks do not need an entry)
– Can keep the directory entries for the cached blocks in a hash table
(associative cache structure) – should invalidate cached copies
when the directory entry is removed (evicted) from the hash table.
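As a rough check of the numbers above, the short C program below compares the per-block overhead of a full bit vector against a list of sharer IDs; the node count and sharer counts are illustrative only.

    #include <stdio.h>

    int main(void) {
        int nodes = 1024;              /* number of nodes (illustrative) */
        int id_bits = 10;              /* bits per sharer ID = log2(1024) */
        int bitvec_bits = nodes;       /* full bit vector: one bit per node */

        for (int sharers = 50; sharers <= 150; sharers += 50) {
            int list_bits = sharers * id_bits;   /* list of sharer IDs */
            printf("sharers=%3d  bit vector=%d bits  list=%d bits\n",
                   sharers, bitvec_bits, list_bits);
        }
        /* The list wins only while sharers < nodes / id_bits (about 102 here). */
        return 0;
    }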

Cache-based Directory Schemes

x x
x
cache cache cache

x
Mem

• Keep the information about the sharers of a cached block in the cache by
linking the replicated cached entries in a linked list rather than storing a list of
sharers with the block in the main memory.
• When a processor caches a block, it inserts itself at the front of the linked list
• To invalidate a cache block in the other caches, follow the linked list (easier if it is a
doubly linked list)
• Scalable Coherent Interface (SCI) IEEE Standard
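A minimal C sketch of this linked sharer list; the structure and field names are illustrative and do not follow SCI's actual entry format.

    #include <stddef.h>

    /* One cache's entry for block x; all sharers of the block are chained together.
       Memory (the home) keeps only the pointer to the head of the list. */
    typedef struct sharer {
        int cache_id;
        struct sharer *prev, *next;
    } sharer;

    /* A processor that caches the block inserts itself at the front of the list. */
    void insert_sharer(sharer **head, sharer *self) {
        self->prev = NULL;
        self->next = *head;
        if (*head)
            (*head)->prev = self;
        *head = self;
    }

    /* Invalidation walks the list from the head; the doubly linked form also makes it
       easy to unlink a single entry when one copy is evicted. */
    void remove_sharer(sharer **head, sharer *self) {
        if (self->prev) self->prev->next = self->next;
        else            *head = self->next;
        if (self->next) self->next->prev = self->prev;
    }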
Hierarchical approaches to coherence
• Multiple levels of coherence are especially useful for multi-node systems where each node is
itself a multiprocessor (example: multiple SMPs)
• Examples of two-level systems:

[Figure: two-level organizations. Top: each node is a bus-based SMP (processors with caches on
bus B1, plus main memory and a directory per node) and the nodes are connected by a network:
“snooping-directory”. Bottom: each node uses a directory internally (processors with
memory/directory modules and an adapter on an intra-node network); the nodes are connected
either by a network (“directory-directory”) or by a bus (“directory-snooping”).]

Cache organization in multicore systems


Shared L2 systems (left) vs. private L2 systems (right)

[Figure: left: each core has a private L1 and all cores share a single L2, connected to the memory
controller and memory system. Right: each core has private L1 and L2 caches; the L2s are connected
by a system interconnect to the memory controller and memory system.]

• Shared L2 example: Intel Core Duo (Pentium). Uses the MESI (Modified, Exclusive, Shared,
Invalid) cache coherence protocol to keep the L1 data coherent.
• Private L2 example: AMD dual-core Opteron. Uses the MOESI (MESI + Owned) cache coherence
protocol to keep the L2 data coherent (L1 is inclusive to L2).
Example of distributed directories in CMPs

[Figure: each core Pi has a private L1; the L2 modules together with per-module directories
(L2 Dir) form a distributed shared L2; an on-chip network connects the tiles to off-chip
memory (or an on-chip L3).]

• Directories are used to keep track of the state of shared entities that are
cached in multiple private caches.
• If the L2 modules form a shared cache space, then the directories perform
a role very similar to their roles in distributed shared memory systems.
• Preserve coherence in the private L1 caches
• One directory entry for each entry in L2
• The location of a cache line in L2 is determined by the block's address

Example of distributed directories in CMPs

[Figure: two organizations with private per-core L1 and L2 caches. In one, the directory is kept
with the shared memory; in the other, each L2 module carries a directory slice (L2 dir). In both,
an on-chip network connects the tiles to the shared memory.]

• If each L2 module is private to the corresponding core, then on-chip directories
may be used as a replacement for a centralized directory
• Each cache block is associated with a directory entry.
• Only cache blocks that are on chip need to have directory entries
• How do you organize and distribute the directory entries among tiles?
• location of directory entry (called its home) is determined by the address.
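A minimal sketch of such an address-to-home mapping, assuming simple block-interleaved placement; the block size and tile count are illustrative, not taken from a particular design.

    #include <stdint.h>

    #define BLOCK_BITS 6        /* 64-byte cache blocks (illustrative) */
    #define NUM_TILES  16       /* tiles holding directory slices (illustrative) */

    /* Home tile of a physical address: drop the block offset, then interleave
       consecutive blocks across the tiles. Real designs may hash higher address
       bits instead to avoid hot-spotting. */
    static inline unsigned home_tile(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % NUM_TILES);
    }

Because the mapping is a pure function of the address, any cache controller can compute a block's home locally, without extra lookup traffic.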
Handling atomic operations (§5.5)
Example: “Atomic Swap” exchanges a value in a register with a value in memory
• loads the value from the memory location into the register
• stores the old value of the register into the memory location

Atomic swap can be used to implement locks:


• The lock is represented by a memory location, L
• L=1 → locked
• L=0 → not locked

Lock (L):
Put 1 in Register, R
Loop: Atomic Swap (R, L)
BNEZ R, Loop

Unlock(L):
Store 0 into memory location L
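A minimal C sketch of this lock, assuming C11 atomics (the names are illustrative); atomic_exchange plays the role of the atomic swap.

    #include <stdatomic.h>

    typedef atomic_int lock_t;          /* 1 = locked, 0 = not locked */

    void lock(lock_t *L) {
        /* Keep swapping 1 into L; the lock is acquired when the old value was 0. */
        while (atomic_exchange(L, 1) != 0)
            ;
    }

    void unlock(lock_t *L) {
        atomic_store(L, 0);             /* store 0 into the lock location */
    }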


Possible implementations of Atomic Swap


May use two special load/store instructions and guarantee that no other instruction
will change the loaded/stored memory location in between.
• Load linked: LL R2, xx(R1) /* load from memory location L=xx(R1)
• Store conditional: SC R3, xx(R1)

If a single CPU/memory: the OS does not allow context switching between the LL and the SC

If a shared-memory system:

1. The memory controller may keep a record that an LL was issued to location L and
does not allow a store or another LL to L before an SC is executed on L.

2. A system with no cache coherence can acquire the bus to perform the LL and keep it
until it performs the SC.

3. A cache in an MSI-coherent system executing an LL can set L to Modified (a miss on L
generates a “request to write”) and ignore requests for L on the bus until the SC is
executed.

[Figure: bus-based multiprocessor: processors with caches on a single bus, plus memory and I/O.]
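For comparison, the sketch below writes atomic swap as the kind of retry loop an LL/SC implementation uses. Portable C has no direct LL/SC, so the store-conditional step is emulated here with a C11 compare-and-swap; on an LL/SC machine the loop body would be the LL and SC instructions themselves.

    #include <stdatomic.h>

    int atomic_swap(atomic_int *L, int new_value) {
        int old = atomic_load(L);        /* plays the role of the LL */
        /* "SC": succeeds only if L still holds old; otherwise old is reloaded and we retry. */
        while (!atomic_compare_exchange_weak(L, &old, new_value))
            ;
        return old;
    }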


A more efficient implementation of “Lock”
In a system that applies MSI coherence, using atomic swaps to implement locks (as described
on the last slide) will cause excessive bus traffic if multiple processors compete for a lock.

[Figure: bus-based multiprocessor: processors with caches on a single bus, plus memory and I/O.]

Example:
1) P1, P2 and P3 are competing for a lock, L.

2) P1 gets L (is exclusive in P1)


3) P2 executes a swap to get the lock (gets it in exclusive) but fails
4) P3 executes a swap to get the lock (gets it in exclusive) but fails
5) ….

Optimized Lock (L):


Loop: load R, L /* load L into R
BNEZ R, Loop /* loop on local copy
Put 1 in Register, R
Atomic Swap (R,L) /* LL R1,L; SC R,L; R←R1
BNEZ R, Loop
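The same test-and-test-and-set idea sketched in C, again assuming C11 atomics: spin on ordinary loads of the (locally cached) lock value, and attempt the atomic swap only when the lock looks free.

    #include <stdatomic.h>

    typedef atomic_int lock_t;          /* 1 = locked, 0 = not locked */

    void lock_optimized(lock_t *L) {
        for (;;) {
            while (atomic_load(L) != 0)        /* spin on the local cached copy: */
                ;                              /* no bus traffic while the lock is held */
            if (atomic_exchange(L, 1) == 0)    /* looked free: try to grab it */
                return;
            /* lost the race to another processor; go back to spinning */
        }
    }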


Barrier synchronization

• A barrier synchronization between N threads can be implemented using a
shared variable initialized to N.
• When a processor reaches the barrier, it decrements the shared variable
by 1 and waits (in a busy wait loop) until the value of the variable is equal
to zero before it leaves the barrier (see the sketch below).

• Need locks???

• What if there are no shared variables (distributed memory machines)?

• Can you synchronize using special hardware?
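A minimal sketch of the counter-based barrier in C, assuming C11 atomics and a barrier that is used only once (reusing it needs extra care, e.g. sense reversal). Using an atomic fetch-and-decrement is one answer to the lock question above: no separate lock is needed for the counter update.

    #include <stdatomic.h>

    #define N 4                          /* number of threads (illustrative) */

    atomic_int remaining = N;            /* shared variable initialized to N */

    void barrier_wait(void) {
        atomic_fetch_sub(&remaining, 1); /* atomically decrement the shared variable */
        while (atomic_load(&remaining) != 0)
            ;                            /* busy wait until all N threads have arrived */
    }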

The Tilera TILE-Gx36™ Architecture:
 36 processor cores
 866 MHz, 1.2 GHz, or 1.5 GHz clock
 12 MBytes total cache
 DDR3 memory controllers
 40 Gbps total packet I/O
– 4 ports 10GbE (XAUI)
– 16 ports 1GbE (SGMII)
 48 Gbps PCIe I/O
– 2 16-Gbps StreamIO ports
 Wire-speed packet engine (mPIPE)
– 60 Mpps
 MiCA engine:
– 20 Gbps crypto
– compress & decompress
 Flexible I/O: UART x2, USB x2, JTAG, I2C, SPI

[Figure: TILE-Gx36 block diagram: the tile array surrounded by DDR3 memory controllers, the mPIPE
packet engine, the MiCA engine, SerDes-based 10GbE (XAUI) and 1GbE (SGMII) ports, PCIe 2.0
interfaces, and the flexible I/O block.]

TILE-Gx100™:
Complete System-on-a-Chip with 100 64-bit cores
 1.2 GHz – 1.5 GHz clock
 32 MBytes total cache
 546 Gbps peak memory BW (four DDR3 memory controllers)
 200 Tbps iMesh BW
 80-120 Gbps packet I/O
– 8 ports XAUI / 2 XAUI
– 2 40Gb Interlaken
– 32 ports 1GbE (SGMII)
 80 Gbps PCIe I/O
– 3 StreamIO ports (20Gb)
 Wire-speed packet engine (mPIPE)
– 120 Mpps
 MiCA engines:
– 40 Gbps crypto
– compress & decompress
 Flexible I/O: UART x2, USB x2, JTAG, I2C, SPI

[Figure: TILE-Gx100 block diagram: the tile array surrounded by four DDR3 memory controllers, the
mPIPE packet engine, two MiCA engines, SerDes-based XAUI/SGMII/Interlaken ports, PCIe 2.0
interfaces (8-lane and 4-lane), and the flexible I/O block.]
The Tilera core

• Processor
– Each core is a complete computer
– 3-way VLIW CPU
– Protection and interrupts
• Memory
– L1 caches (16K L1-I, 8K L1-D) and a 64K L2 cache
– Virtual and physical address spaces
– Instruction and data TLBs
– Cache-integrated 2D DMA engine
• Runs SMP Linux
• Runs off-the-shelf C/C++ programs
• Signal processing and general apps

[Figure: core block diagram: register file, three execution pipelines, caches (16K L1-I with I-TLB,
8K L1-D with D-TLB, 64K L2), a 2D DMA engine, and a terabit switch connecting the core to the mesh.]

Tilera Tile64

