
The MESI protocol

• As described earlier, in MSI, a cache block can be in one of three states


• Invalid (uncached): not in the cache (not valid in any cache)
• Shared/clean: cached in one or more processors and memory is up-to-date
• Modified/dirty/exclusive: one processor (owner) has data; memory out-of-date

• The MESI protocol splits the valid MSI states (Shared and Modified) into three states:
• Invalid (uncached): same as in MSI
• Shared: cached in more than one processor and memory is up-to-date
• Exclusive: one processor (owner) has data and it is clean (clean but not shared)
• Modified: one processor (owner) has data, but it is dirty
• If MESI is implemented using a directory, then the information kept for each block in
the directory is the same as in the three-state (MSI) protocol:
• Shared in MESI = shared/clean, with more than one sharer
• Exclusive in MESI = shared/clean, with only one sharer
• Modified in MESI = Exclusive/Modified/dirty
• However, at each cached copy, a distinction is made between shared, exclusive and
modified (rather than only shared and modified).

The MESI protocol


• On a read miss (local block is invalid), load the block and change its state to
• “exclusive” if the block was not cached anywhere (only memory had a valid copy)
• “shared” if it was already shared, modified or exclusive in another cache
- if it was modified, the owner will send you a clean copy
- if it was modified or exclusive, the previous owner will change the state
of the block to “shared” in its cache.
• On a write miss: same as a read miss, except set the state to “modified”
→ copies in other caches (if any) are invalidated
• On a write hit to a “modified” block, do nothing
• On a write hit to an “exclusive” block, change the block to “modified”
→ no need for invalidation; this is the main advantage of MESI over MSI
• On a write hit to a “shared” block change the block to “modified” and
invalidate the other cached copies.
• When a modified block is evicted, write it back.
• In snooping bus implementations of MESI, on a read miss, we need to set its state
correctly to “shared” (if the block is in some other cache(s)) or “exclusive” (if not).
• To take full advantage of MESI, we should know when a block should be changed from
shared to exclusive.
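The transitions above can be summarized as a small state machine. The following is a minimal sketch in C of the local-event transitions only, under a simplified model where the bus/snoop interactions are hidden behind hypothetical helpers (bus_invalidate_other_copies, other_caches_have_copy); it is an illustration, not a full protocol implementation.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

    /* Hypothetical stand-ins for the bus/directory interactions. */
    static void bus_invalidate_other_copies(void) { /* post an invalidation */ }
    static bool other_caches_have_copy(void)      { return false; /* snoop result */ }

    /* Read miss: load the block; Exclusive if no other cache holds it, else Shared. */
    mesi_state on_read_miss(void) {
        return other_caches_have_copy() ? SHARED : EXCLUSIVE;
    }

    /* Write hit: Modified stays Modified; Exclusive upgrades silently (no bus traffic);
       Shared must invalidate the other cached copies first. */
    mesi_state on_write_hit(mesi_state s) {
        if (s == SHARED)
            bus_invalidate_other_copies();
        return MODIFIED;
    }

    /* Write miss: load the block as on a read miss, invalidate any other copies,
       and install the block as Modified. */
    mesi_state on_write_miss(void) {
        bus_invalidate_other_copies();
        return MODIFIED;
    }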
The MESI protocol
• If MESI is implemented as a snooping protocol, then the main advantage over
the three-state (MSI) protocol is when a read to an uncached block is followed by a
write to that block.
• After the uncached block is read, it is marked “exclusive” (we need a scheme to
know that it was uncached).
• Note that, when writing to a shared block, the transaction has to be posted on the
bus so that other sharers invalidate their copies.
• But when writing to an exclusive block, there is no need to post the transaction on
the bus. Hence, by distinguishing between shared and exclusive states, we can
avoid bus transactions when writing on an exclusive block.
• However, now a cache that has an “exclusive” block has to monitor the bus for any
read to that block. Such a read will change the state to “shared”.
• What if a block was shared in two caches and is evicted from one of them. Should
we detect this case and set the block to Exclusive in the other cache?
• This advantage disappears in a directory protocol since, after a write to an
exclusive block, the directory has to be notified to change the state to
modified.

Latency optimization
1) Forwarding requests
[Figure: message-flow diagrams between the local node (L), the home node (H), and the
remote node (R), illustrating request forwarding; messages are labeled req, forward,
respond, reply, and revise, numbered in the order they are sent.]

2) Use SRAM for directories (hardware optimization)

3) Overlap activities on the critical path


- send multiple invalidations in parallel
- look up the directory and memory in parallel at the home node
Storage overhead
• In the simplest representation of a directory entry, a full bit vector is
used for each entry (one bit used to indicate presence in each node.)
– storage overhead doesn’t scale well with number of nodes.
• Larger blocks (cache lines) mean lower overhead
• For very large number of nodes, may use a list of sharers instead of a
bit vector
– Lower overhead if only few sharers
– Example: with 1024 processors (10-bit sharer IDs), the list takes less space than
the 1024-bit vector only when there are fewer than about 100 sharers (see the
sketch after this list)
• May reduce overhead further by keeping only directory entries for the
blocks that are cached (uncached blocks do not need an entry)
– Can keep the directory entries for the cached blocks in a hash table
(associative cache structure) – should invalidate cached copies
when the directory entry is removed (evicted) from the hash table.
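As a rough check of the numbers above, the short C program below compares the per-block overhead of a full bit vector against a list of sharer IDs; the node count and sharer counts are illustrative only.

    #include <stdio.h>

    int main(void) {
        int nodes = 1024;              /* number of nodes (illustrative) */
        int id_bits = 10;              /* bits per sharer ID = log2(1024) */
        int bitvec_bits = nodes;       /* full bit vector: one bit per node */

        for (int sharers = 50; sharers <= 150; sharers += 50) {
            int list_bits = sharers * id_bits;   /* list of sharer IDs */
            printf("sharers=%3d  bit vector=%d bits  list=%d bits\n",
                   sharers, bitvec_bits, list_bits);
        }
        /* The list wins only while sharers < nodes / id_bits (about 102 here). */
        return 0;
    }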

Cache-based Directory Schemes

x x
x
cache cache cache

x
Mem

• Keep the information about the sharers of a cached block in the cache by
linking the replicated cached entries in a linked list rather than storing a list of
sharers with the block in the main memory.
• When a processor caches a block, it inserts itself at the front of the linked list
• To invalidate a cache block in the other caches, follow the linked list (easier if it is a
doubly linked list)
• Scalable Coherent Interface (SCI) IEEE Standard
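A minimal C sketch of this linked sharer list; the structure and field names are illustrative and do not follow SCI's actual entry format.

    #include <stddef.h>

    /* One cache's entry for block x; all sharers of the block are chained together.
       Memory (the home) keeps only the pointer to the head of the list. */
    typedef struct sharer {
        int cache_id;
        struct sharer *prev, *next;
    } sharer;

    /* A processor that caches the block inserts itself at the front of the list. */
    void insert_sharer(sharer **head, sharer *self) {
        self->prev = NULL;
        self->next = *head;
        if (*head)
            (*head)->prev = self;
        *head = self;
    }

    /* Invalidation walks the list from the head; the doubly linked form also makes it
       easy to unlink a single entry when one copy is evicted. */
    void remove_sharer(sharer **head, sharer *self) {
        if (self->prev) self->prev->next = self->next;
        else            *head = self->next;
        if (self->next) self->next->prev = self->prev;
    }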
Hierarchical approaches to coherence
• Multiple levels of coherence are especially useful for multi-node systems where each node is
itself a multiprocessor (example: multiple SMPs)
• Examples of two-level systems:

[Figure: two-level organizations. Top: each node is a bus-based SMP (processors with caches on
bus B1, plus main memory and a directory per node) and the nodes are connected by a network:
“snooping-directory”. Bottom: each node uses a directory internally (processors with
memory/directory modules and an adapter on an intra-node network); the nodes are connected
either by a network (“directory-directory”) or by a bus (“directory-snooping”).]

Cache organization in multicore systems


Shared L2 systems (left) vs. private L2 systems (right)

[Figure: left: each core has a private L1 and all cores share a single L2, connected to the memory
controller and memory system. Right: each core has private L1 and L2 caches; the L2s are connected
by a system interconnect to the memory controller and memory system.]

• Shared L2 example: Intel Core Duo (Pentium). Uses the MESI (Modified, Exclusive, Shared,
Invalid) cache coherence protocol to keep the L1 data coherent.
• Private L2 example: AMD dual-core Opteron. Uses the MOESI (MESI + Owned) cache coherence
protocol to keep the L2 data coherent (L1 is inclusive to L2).
Example of distributed directories in CMPs

[Figure: each core Pi has a private L1; the L2 modules together with per-module directories
(L2 Dir) form a distributed shared L2; an on-chip network connects the tiles to off-chip
memory (or an on-chip L3).]

• Directories are used to keep track of the state of shared entities that are
cached in multiple private caches.
• If the L2 modules form a shared cache space, then the directories perform
a role very similar to their roles in distributed shared memory systems.
• Preserve coherence in the private L1 caches
• One directory entry for each entry in L2
• The location of a cache line in L2 is determined by the block's address

Example of distributed directories in CMPs

[Figure: two organizations with private per-core L1 and L2 caches. In one, the directory is kept
with the shared memory; in the other, each L2 module carries a directory slice (L2 dir). In both,
an on-chip network connects the tiles to the shared memory.]

• If each L2 module is private to the corresponding core, then on-chip directories
may be used as a replacement for a centralized directory
• Each cache block is associated with a directory entry.
• Only cache blocks that are on chip need to have directory entries
• How do you organize and distribute the directory entries among tiles?
• location of directory entry (called its home) is determined by the address.
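A minimal sketch of such an address-to-home mapping, assuming simple block-interleaved placement; the block size and tile count are illustrative, not taken from a particular design.

    #include <stdint.h>

    #define BLOCK_BITS 6        /* 64-byte cache blocks (illustrative) */
    #define NUM_TILES  16       /* tiles holding directory slices (illustrative) */

    /* Home tile of a physical address: drop the block offset, then interleave
       consecutive blocks across the tiles. Real designs may hash higher address
       bits instead to avoid hot-spotting. */
    static inline unsigned home_tile(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % NUM_TILES);
    }

Because the mapping is a pure function of the address, any cache controller can compute a block's home locally, without extra lookup traffic.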
Handling atomic operations (§5.5)
Example: “Atomic Swap” exchanges a value in a register with a value in memory
• loads the value from the memory location into the register
• stores the old value of the register into the memory location

Atomic swap can be used to implement locks:


• The lock is represented by a memory location, L
• L=1 → locked
• L=0 → not locked

Lock (L):
Put 1 in Register, R
Loop: Atomic Swap (R, L)
BNEZ R, Loop

Unlock(L):
Store 0 into memory location L
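A minimal C sketch of this lock, assuming C11 atomics (the names are illustrative); atomic_exchange plays the role of the atomic swap.

    #include <stdatomic.h>

    typedef atomic_int lock_t;          /* 1 = locked, 0 = not locked */

    void lock(lock_t *L) {
        /* Keep swapping 1 into L; the lock is acquired when the old value was 0. */
        while (atomic_exchange(L, 1) != 0)
            ;
    }

    void unlock(lock_t *L) {
        atomic_store(L, 0);             /* store 0 into the lock location */
    }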


Possible implementations of Atomic Swap


May use two special load/store instructions and guarantee that no other instruction
will change the loaded/stored memory location in between.
• Load linked: LL R2, xx(R1) /* load from memory location L=xx(R1)
• Store conditional: SC R3, xx(R1)

If a single CPU/memory: the OS does not allow context switching between the LL and the SC

If a shared-memory system:

1. The memory controller may keep a record that an LL was issued to location L and
does not allow a store or another LL to L before an SC is executed on L.

2. A system with no cache coherence can acquire the bus to perform the LL and keep it
until it performs the SC.

3. A cache in an MSI-coherent system executing an LL can set L to Modified (a miss on L
generates a “request to write”) and ignore requests for L on the bus until the SC is
executed.

[Figure: bus-based multiprocessor: processors with caches on a single bus, plus memory and I/O.]
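For comparison, the sketch below writes atomic swap as the kind of retry loop an LL/SC implementation uses. Portable C has no direct LL/SC, so the store-conditional step is emulated here with a C11 compare-and-swap; on an LL/SC machine the loop body would be the LL and SC instructions themselves.

    #include <stdatomic.h>

    int atomic_swap(atomic_int *L, int new_value) {
        int old = atomic_load(L);        /* plays the role of the LL */
        /* "SC": succeeds only if L still holds old; otherwise old is reloaded and we retry. */
        while (!atomic_compare_exchange_weak(L, &old, new_value))
            ;
        return old;
    }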


A more efficient implementation of “Lock”
In a system that applies MSI coherence, using atomic swaps to implement locks (as described
on the last slide) will cause excessive bus traffic if multiple processors compete for a lock.

[Figure: bus-based multiprocessor: processors with caches on a single bus, plus memory and I/O.]

Example:
1) P1, P2 and P3 are competing for a lock, L.

2) P1 gets L (is exclusive in P1)


3) P2 executes a swap to get the lock (gets it in exclusive) but fails
4) P3 executes a swap to get the lock (gets it in exclusive) but fails
5) ….

Optimized Lock (L):


Loop: load R, L /* load L into R
BNEZ R, Loop /* loop on local copy
Put 1 in Register, R
Atomic Swap (R,L) /* LL R1,L; SC R,L; R←R1
BNEZ R, Loop
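The same test-and-test-and-set idea sketched in C, again assuming C11 atomics: spin on ordinary loads of the (locally cached) lock value, and attempt the atomic swap only when the lock looks free.

    #include <stdatomic.h>

    typedef atomic_int lock_t;          /* 1 = locked, 0 = not locked */

    void lock_optimized(lock_t *L) {
        for (;;) {
            while (atomic_load(L) != 0)        /* spin on the local cached copy: */
                ;                              /* no bus traffic while the lock is held */
            if (atomic_exchange(L, 1) == 0)    /* looked free: try to grab it */
                return;
            /* lost the race to another processor; go back to spinning */
        }
    }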


Barrier synchronization

• A barrier synchronization between N threads can be implemented using a
shared variable initialized to N.
• When a processor reaches the barrier, it decrements the shared variable
by 1 and waits (in a busy wait loop) until the value of the variable is equal
to zero before it leaves the barrier (see the sketch below).

• Need locks???

• What if there are no shared variables (distributed memory machines)?

• Can you synchronize using special hardware?
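A minimal sketch of the counter-based barrier in C, assuming C11 atomics and a barrier that is used only once (reusing it needs extra care, e.g. sense reversal). Using an atomic fetch-and-decrement is one answer to the lock question above: no separate lock is needed for the counter update.

    #include <stdatomic.h>

    #define N 4                          /* number of threads (illustrative) */

    atomic_int remaining = N;            /* shared variable initialized to N */

    void barrier_wait(void) {
        atomic_fetch_sub(&remaining, 1); /* atomically decrement the shared variable */
        while (atomic_load(&remaining) != 0)
            ;                            /* busy wait until all N threads have arrived */
    }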

The Tilera TILE-Gx36™ Architecture:
 36 processor cores
 866 MHz, 1.2 GHz, or 1.5 GHz clock
 12 MBytes total cache
 DDR3 memory controllers
 40 Gbps total packet I/O
– 4 ports 10GbE (XAUI)
– 16 ports 1GbE (SGMII)
 48 Gbps PCIe I/O
– 2 16-Gbps StreamIO ports
 Wire-speed packet engine (mPIPE)
– 60 Mpps
 MiCA engine:
– 20 Gbps crypto
– compress & decompress
 Flexible I/O: UART x2, USB x2, JTAG, I2C, SPI

[Figure: TILE-Gx36 block diagram: the tile array surrounded by DDR3 memory controllers, the mPIPE
packet engine, the MiCA engine, SerDes-based 10GbE (XAUI) and 1GbE (SGMII) ports, PCIe 2.0
interfaces, and the flexible I/O block.]

TILE-Gx100™:
Complete System-on-a-Chip with 100 64-bit cores
 1.2 GHz – 1.5 GHz clock
 32 MBytes total cache
 546 Gbps peak memory BW (four DDR3 memory controllers)
 200 Tbps iMesh BW
 80-120 Gbps packet I/O
– 8 ports XAUI / 2 XAUI
– 2 40Gb Interlaken
– 32 ports 1GbE (SGMII)
 80 Gbps PCIe I/O
– 3 StreamIO ports (20Gb)
 Wire-speed packet engine (mPIPE)
– 120 Mpps
 MiCA engines:
– 40 Gbps crypto
– compress & decompress
 Flexible I/O: UART x2, USB x2, JTAG, I2C, SPI

[Figure: TILE-Gx100 block diagram: the tile array surrounded by four DDR3 memory controllers, the
mPIPE packet engine, two MiCA engines, SerDes-based XAUI/SGMII/Interlaken ports, PCIe 2.0
interfaces (8-lane and 4-lane), and the flexible I/O block.]
The Tilera core

• Processor
– Each core is a complete computer
– 3-way VLIW CPU
– Protection and interrupts
• Memory
– L1 caches (16K L1-I, 8K L1-D) and a 64K L2 cache
– Virtual and physical address spaces
– Instruction and data TLBs
– Cache-integrated 2D DMA engine
• Runs SMP Linux
• Runs off-the-shelf C/C++ programs
• Signal processing and general apps

[Figure: core block diagram: register file, three execution pipelines, caches (16K L1-I with I-TLB,
8K L1-D with D-TLB, 64K L2), a 2D DMA engine, and a terabit switch connecting the core to the mesh.]

Tilera Tile64

