ch5 4
• The MESI protocol regroups the Shared and Modified states of MSI into three states (Shared, Exclusive, Modified), for four states in total:
• Invalid (uncached): same as in MSI
• Shared: cached in more than one processor, and memory is up to date
• Exclusive: one processor (the owner) has the data and it is clean (clean but not shared)
• Modified: one processor (the owner) has the data, and it is dirty
• If MESI is implemented using a directory, then the information kept for each block in
the directory is the same as in the three-state (MSI) protocol:
• Shared in MESI = shared/clean in the directory, with more than one sharer
• Exclusive in MESI = shared/clean in the directory, with only one sharer
• Modified in MESI = Exclusive/Modified/dirty in the directory
• However, at each cached copy, a distinction is made between shared, exclusive and
modified (rather than only shared and modified).
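The split between what the directory records and what each cache records can be sketched with a toy model of a single block. This is only an illustration under simplifying assumptions: the class name `BlockDirectory`, its methods, and the choice to update the directory on every write are hypothetical and not taken from the slides.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    MODIFIED = "M"


class BlockDirectory:
    """Toy model of ONE memory block (hypothetical, for illustration only).

    The directory side keeps just a sharer set and a dirty bit, as in the
    three-state protocol. The per-cache side keeps the four MESI states.
    """

    def __init__(self):
        self.sharers = set()      # directory: which caches hold the block
        self.dirty = False        # directory: is memory out of date?
        self.cache_state = {}     # per-cache MESI state

    def state_of(self, p):
        return self.cache_state.get(p, State.INVALID)

    def read(self, p):
        if self.state_of(p) is not State.INVALID:
            return  # read hit: no state change
        # Any current holder (E or M owner included) is downgraded to Shared;
        # a dirty copy is assumed to be written back at this point.
        for q in self.sharers:
            self.cache_state[q] = State.SHARED
        self.dirty = False
        self.sharers.add(p)
        alone = len(self.sharers) == 1
        self.cache_state[p] = State.EXCLUSIVE if alone else State.SHARED

    def write(self, p):
        # Invalidate every other copy, then the writer holds the dirty copy.
        for q in self.sharers - {p}:
            self.cache_state[q] = State.INVALID
        self.sharers = {p}
        self.dirty = True
        self.cache_state[p] = State.MODIFIED

    def directory_view(self):
        """What the directory alone can tell: the three-state view."""
        if not self.sharers:
            return "uncached"
        return "modified" if self.dirty else "shared"
```

In this sketch a first read leaves the reader in Exclusive while the directory still reports only "shared" (clean, one sharer); the Exclusive/Shared distinction exists only in the cache.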
Latency optimization
1) Forwarding requests
[Figure: three message-flow diagrams among nodes L, H and R, with numbered messages labeled req, forward, reply, respond and revise.]
[Figure: block x held in three caches and in memory (Mem).]
• Keep the information about the sharers of a cached block in the cache by
linking the replicated cached entries in a linked list rather than storing a list of
sharers with the block in the main memory.
• When a processor caches a block, it inserts itself at the front of the linked list
• To invalidate a cache block in the other caches, follow the linked list (easier with a
doubly linked list)
• Scalable Coherent Interface (SCI) IEEE Standard
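The list operations described above can be sketched as follows. This is a minimal model, not the SCI protocol itself: the names `CacheEntry` and `SharingList` are hypothetical, and real hardware performs each pointer update with messages between caches rather than direct assignments.

```python
class CacheEntry:
    """One cache's copy of the block, doubling as a list node."""

    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.valid = True
        self.prev = None
        self.next = None


class SharingList:
    """Memory keeps only a head pointer; the list itself lives in the caches."""

    def __init__(self):
        self.head = None

    def add_sharer(self, cache_id):
        # A new sharer inserts itself at the front of the list.
        entry = CacheEntry(cache_id)
        entry.next = self.head
        if self.head is not None:
            self.head.prev = entry
        self.head = entry
        return entry

    def remove(self, entry):
        # Unlinking (e.g. on replacement) needs no traversal when the list
        # is doubly linked: both neighbours are directly reachable.
        if entry.prev is not None:
            entry.prev.next = entry.next
        else:
            self.head = entry.next
        if entry.next is not None:
            entry.next.prev = entry.prev
        entry.prev = entry.next = None
        entry.valid = False

    def invalidate_others(self, writer):
        # Walk the list and invalidate every copy except the writer's.
        invalidated = []
        node = self.head
        while node is not None:
            following = node.next
            if node is not writer:
                node.valid = False
                node.prev = node.next = None
                invalidated.append(node.cache_id)
            node = following
        writer.prev = writer.next = None
        self.head = writer
        return invalidated

    def sharers(self):
        ids, node = [], self.head
        while node is not None:
            ids.append(node.cache_id)
            node = node.next
        return ids
```

Note that the invalidation cost grows with the length of the list, since the copies are visited one after another.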
Hierarchical approaches to coherence
• Multi-level: especially useful for multi-node systems in which each node is a
multiprocessor (example: multiple SMPs)
• Examples of two-level systems:
[Figure: two-level systems built from processors (P) with caches ($), buses (B1, Bus) and networks (Network, Network 2); panels: Snooping-directory, Directory-directory, Directory-snooping.]
[Figure: two chip organizations with processors (P), per-processor L1 caches, L2 caches, a system interconnect and memory controllers.]
• Examples: Intel Core Duo Pentium
• Examples: AMD Dual Core Opteron
[Figure: processors P0, P1, ..., Pn, each with a private L1 and an L2 module with a directory (Dir); the L2 modules form a distributed shared L2.]
• Directories are used to keep track of the state of shared entities that are
cached in multiple private caches.
• If the L2 modules form a shared cache space, then the directories perform
a role very similar to their roles in distributed shared memory systems.
• Preserve coherence in the private L1 caches
• One directory entry for each entry in L2
• The location of a cache line in L2 is determined by the address of the cache entry
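As an illustration of address-determined placement, the sketch below interleaves blocks across L2 modules and keeps one directory entry per L2 block to track the private L1 copies. The block size, the number of modules, the modulo mapping and all names (`home_slice`, `L2Slice`, `access`) are assumptions made for the example; the slides do not say which address bits are used.

```python
BLOCK_BYTES = 64   # assumed cache block size
NUM_SLICES = 4     # assumed number of L2 modules


def home_slice(addr, num_slices=NUM_SLICES, block_bytes=BLOCK_BYTES):
    """Pick the L2 module for an address by interleaving block numbers."""
    return (addr // block_bytes) % num_slices


class L2Slice:
    """One L2 module with a directory entry per block it holds."""

    def __init__(self):
        self.directory = {}  # block number -> set of L1 ids caching it

    def l1_read(self, block, l1_id):
        self.directory.setdefault(block, set()).add(l1_id)

    def l1_write(self, block, l1_id):
        # Every other L1 copy must be invalidated; return who they were.
        others = self.directory.get(block, set()) - {l1_id}
        self.directory[block] = {l1_id}
        return others


def access(slices, addr, l1_id, is_write=False):
    """Route an L1 miss to the home L2 module of the address."""
    block = addr // BLOCK_BYTES
    home = slices[home_slice(addr, num_slices=len(slices))]
    if is_write:
        return home.l1_write(block, l1_id)
    home.l1_read(block, l1_id)
    return set()
```

Because the home module is a pure function of the address, any L1 can find the directory entry for a block without a lookup table.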
[Figure: two dual-core organizations (P0 and P1, each with an L1): one with L2 caches, one with L2 caches and directories (dir).]
Lock(L):
      Put 1 in register R
Loop: Atomic Swap (R, L)    ; R receives the old value of L, L receives 1
      BNEZ R, Loop          ; old value was nonzero: lock was held, retry
Unlock(L):
      Store 0 into memory location L
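The swap-based lock can be tried out with a small threaded model. This is a sketch under stated assumptions: `MemoryWord`, `acquire`, `release` and `contend` are hypothetical names, and the internal mutex in `MemoryWord` only stands in for the atomicity that the hardware swap instruction provides.

```python
import threading
import time


class MemoryWord:
    """One shared memory location with an atomic swap (modelled, not real)."""

    def __init__(self, value=0):
        self._value = value
        self._atomic = threading.Lock()  # stand-in for hardware atomicity

    def swap(self, new):
        # Atomically write `new` and return the previous contents.
        with self._atomic:
            old = self._value
            self._value = new
            return old

    def store(self, new):
        with self._atomic:
            self._value = new


def acquire(L):
    r = 1                    # Put 1 in register R
    while True:
        r = L.swap(r)        # Atomic Swap (R, L)
        if r == 0:           # old value 0: the lock was free, now ours
            return
        time.sleep(0)        # BNEZ R, Loop (yield so other threads can run)


def release(L):
    L.store(0)               # Store 0 into memory location L


def contend(num_threads=3, iterations=200):
    """Threads increment a shared counter inside the critical section."""
    L = MemoryWord(0)
    counter = [0]

    def worker():
        for _ in range(iterations):
            acquire(L)
            counter[0] += 1  # protected read-modify-write
            release(L)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Every failed attempt in `acquire` still performs a write to `L`, which is the property that makes this simple lock generate traffic while processors spin.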
Example:
1) P1, P2 and P3 are competing for a lock, L.
[Figure: processors on a single bus with memory and I/O.]
Barrier synchronization
• Need locks???
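One way to approach the question is a centralized barrier in which a lock protects the arrival counter, because incrementing the counter is a read-modify-write. The sketch below is an assumption-laden illustration, not the slides' design: `CentralBarrier` and `run_phases` are hypothetical names, sense reversal is one of several possible release schemes, and `threading.Lock` stands in for whatever lock the machine provides.

```python
import threading
import time


class CentralBarrier:
    """Counter-based barrier with sense reversal (illustrative sketch)."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        with self.lock:
            # The value `sense` will take once everyone has arrived.
            my_sense = not self.sense
            self.count += 1
            last = self.count == self.n
            if last:
                self.count = 0          # reset for the next episode
                self.sense = my_sense   # release all waiting threads
        if not last:
            while self.sense != my_sense:
                time.sleep(0)           # spin, yielding to other threads


def run_phases(num_threads=3, phases=5):
    """Return the phases in which a thread passed the barrier too early."""
    barrier = CentralBarrier(num_threads)
    arrived = [0] * phases
    guard = threading.Lock()
    violations = []

    def worker():
        for k in range(phases):
            with guard:
                arrived[k] += 1
            barrier.wait()
            if arrived[k] != num_threads:
                violations.append(k)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return violations
```

The lock is held only while the counter is updated; waiting happens outside it, on the `sense` flag, so arriving threads do not block each other from registering.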
The Tilera TILE-Gx36™ Architecture:
• 36 processor cores
• 866 MHz, 1.2 GHz or 1.5 GHz clock
• 12 MBytes total cache
• 16 ports 1GbE (SGMII)
• 48 Gbps PCIe I/O
– 2 x 16 Gbps Stream IO ports
• 60 Mpps
• MiCA engine:
– 20 Gbps crypto
– Compress & decompress
[Figure: chip block diagram; block labels include Memory Controller (DDR3) x2, MiCA, mPIPE, Flexible I/O, UART x2, PCIe 2.0 4-lane, 8-lane, 10 GbE XAUI, 4x GbE SGMII and SerDes.]
TILE-Gx100™:
Complete System-on-a-Chip with 100 64-bit cores
• 200 Tbps iMesh bandwidth
• 8 ports XAUI / 2 XAUI
• 80 Gbps PCIe I/O
– 3 StreamIO ports (20Gb)
• 120 Mpps
• MiCA engines:
– 40 Gbps crypto
– Compress & decompress
[Figure: chip block diagram; block labels include Memory Controller (DDR3), MiCA, Flexible I/O, UART x2, USB x2, JTAG, I2C, SPI, Interlaken, PCIe 2.0 8-lane, 4-lane, 10 GbE XAUI, 4x GbE SGMII and SerDes.]
The Tilera core
• Processor
– Each core is a complete computer
– 3-way VLIW CPU
– Protection and interrupts
• Memory
[Figure: Tilera Tile64 tile, with labels Core, Switch and x5.]