Muge - Snoop Based Multiprocessor Design
Write propagation
Write serialization
All writes to the same location are seen in the same order by all processors (extended to all locations, this is called write atomicity)
E.g., if w1 followed by w2 is seen by a read from P1, the writes will be seen in the same order by reads from every other processor Pi
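A minimal pthreads sketch of what write serialization permits and forbids; the shared variable x, the reader/writer split, and the observed pairs are illustrative only, not taken from the source. On real coherent hardware the forbidden orders can never be observed, which is exactly the property being defined.

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

/* One writer issues w1 then w2 to the same location x. Write serialization
 * requires every reader to observe x in an order consistent with 0 -> 1 -> 2;
 * a reader that saw 2 and later 1 would violate coherence. */
volatile int x = 0;

void *writer(void *arg) {
    (void)arg;
    x = 1;               /* w1 */
    x = 2;               /* w2 */
    return NULL;
}

void *reader(void *arg) {
    int id = (int)(intptr_t)arg;
    int first  = x;      /* two successive samples of x by the same reader */
    int second = x;
    /* Allowed pairs: (0,0) (0,1) (0,2) (1,1) (1,2) (2,2).
     * Forbidden by write serialization: (1,0) (2,0) (2,1). */
    printf("reader %d saw %d then %d\n", id, first, second);
    return NULL;
}

int main(void) {
    pthread_t w, r1, r2;
    pthread_create(&r1, NULL, reader, (void *)(intptr_t)1);
    pthread_create(&r2, NULL, reader, (void *)(intptr_t)2);
    pthread_create(&w,  NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r1, NULL);
    pthread_join(r2, NULL);
    return 0;
}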
Cache Coherence
Snooping
Shared memory multiprocessor environment:
Main memory is passive
Caches distribute state transitions to other caches and memory
All caches listen to snoop messages and act on them
Most machines use cache coherence protocols with different trade-offs
Physical Design
Protocol intent is implemented in FSMs (a minimal sketch follows the list below)
[State-transition diagram: transitions labeled PrRd/BusRd, PrWr/BusRdX, BusRd/Flush, BusRdX/Flush, PrRd/--, BusRd/--, BusRdX/--]
Support for:
High performance: multiple events in progress, overlapping latencies
Reporting snoop results: how and when
Handling write-backs
Non-atomic state transitions: the overall set of actions for a memory operation is not atomic, leading to race conditions
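Below is a minimal C sketch of protocol intent expressed as per-line FSMs, assuming the 3-state MSI write-back invalidation protocol suggested by the diagram labels; the type and function names are illustrative, not from the source.

/* Per-line coherence state for a 3-state (MSI) write-back invalidation
 * protocol. The bus actions returned here are what the cache controller
 * would drive onto the bus. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
typedef enum { PR_RD, PR_WR } proc_event_t;
typedef enum { BUS_RD, BUS_RDX } bus_event_t;
typedef enum { BUS_NONE, ISSUE_BUSRD, ISSUE_BUSRDX, FLUSH } bus_action_t;

/* Processor-side FSM: react to loads/stores from the local processor. */
bus_action_t on_proc_event(line_state_t *s, proc_event_t e) {
    switch (*s) {
    case INVALID:
        if (e == PR_RD) { *s = SHARED;   return ISSUE_BUSRD;  }
        else            { *s = MODIFIED; return ISSUE_BUSRDX; }
    case SHARED:
        if (e == PR_WR) { *s = MODIFIED; return ISSUE_BUSRDX; }
        return BUS_NONE;                 /* PrRd hits in S */
    case MODIFIED:
        return BUS_NONE;                 /* PrRd/PrWr hit in M */
    }
    return BUS_NONE;
}

/* Bus-side (snooper) FSM: react to transactions observed on the bus. */
bus_action_t on_bus_event(line_state_t *s, bus_event_t e) {
    switch (*s) {
    case MODIFIED:
        /* Another cache wants the block: supply (flush) the dirty data,
         * then downgrade to S on BusRd or invalidate on BusRdX. */
        *s = (e == BUS_RD) ? SHARED : INVALID;
        return FLUSH;
    case SHARED:
        if (e == BUS_RDX) *s = INVALID;
        return BUS_NONE;
    case INVALID:
    default:
        return BUS_NONE;                 /* snoop miss: no action */
    }
}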
The controller must monitor bus operations and respond to processor operations, so there are two controllers: bus-side and processor-side. For bus transactions, the bus-side controller captures the address and performs a tag check:
Miss: snoop miss, no action
Hit: run the cache coherence protocol, read-modify-write on the state bits
For single-level caches this requires either a duplicate set of tags and state or a dual-ported tag and state store
[Figure: single-level snooping cache with cached data and two copies of the tags, one used by the processor and one used by the bus snooper]
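A sketch of the duplicate-tag organization described above; the sizes, field names, and MSI state enum are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES   256
#define BLOCK_BITS  6                       /* 64-byte blocks (assumed) */

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct {
    uint32_t     tag;
    line_state_t state;
} tag_entry_t;

/* Two physically separate copies of tags+state: a processor-side copy and
 * a bus-side (snoop) copy. They hold identical contents, but can be read
 * in parallel without port conflicts. */
static tag_entry_t proc_tags[NUM_LINES];
static tag_entry_t snoop_tags[NUM_LINES];

static uint32_t index_of(uint32_t addr) { return (addr >> BLOCK_BITS) % NUM_LINES; }
static uint32_t tag_of(uint32_t addr)   { return  addr >> BLOCK_BITS; }

/* Bus-side tag check: on a snoop miss nothing happens; on a hit the
 * protocol runs and the new state is written into BOTH tag copies. */
bool snoop_lookup(uint32_t addr, line_state_t *state_out) {
    tag_entry_t *e = &snoop_tags[index_of(addr)];
    if (e->state == INVALID || e->tag != tag_of(addr))
        return false;                       /* snoop miss: no action */
    *state_out = e->state;
    return true;
}

void update_state(uint32_t addr, line_state_t new_state) {
    uint32_t i = index_of(addr);
    snoop_tags[i].state = new_state;        /* keep both copies consistent */
    proc_tags[i].state  = new_state;
}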
The initiator places an address on the bus; the responder must acknowledge within a time-out window (wired-OR), otherwise a bus error occurs.
Snooping Caches
All caches must report on the bus before the transaction can proceed.
The snoop result informs main memory whether it should respond or whether a cache has a modified copy of the block. When and how is the snoop result reported on the bus? For example, to implement the MESI protocol:
Memory needs to know: is the block dirty, i.e. should it respond or not?
The requesting cache needs to know: is the block shared?
Option 1: within a fixed number of clock cycles from the address issue on the bus
Option 2: with variable delay; memory assumes one of the caches will supply the data until all have snooped and indicated their results. Easier to implement in the presence of tag-access conflicts, and still high performance
Option 3: immediately; main memory maintains a state bit per block indicating whether it is modified in a cache. Complexity is introduced to the main memory subsystem
Three wired-OR snoop result lines:
Shared: asserted if any cache has a copy of the block
Dirty: asserted if some cache has a dirty copy
A third line indicating the snoop is valid: the inhibit signal, asserted until all processors have completed their snoop
Cache-to-cache transfers retrieve data from other caches rather than from memory; a priority scheme is needed to choose the supplier. The SGI Challenge and Sun Enterprise Server do this only for blocks in exclusive or modified state; the Challenge updates memory during the cache-to-cache transfer (no shared-modified state).
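A small sketch of how the wired-OR snoop lines could be combined, assuming each cache reports a (shared, dirty, done) triple; the names are illustrative.

#include <stdbool.h>

#define NUM_CACHES 4

/* Per-cache snoop report for one bus transaction. */
typedef struct {
    bool shared;   /* this cache has a copy              */
    bool dirty;    /* this cache has a modified copy     */
    bool done;     /* this cache has finished its snoop  */
} snoop_report_t;

/* Wired-OR combination of all snoop reports.
 * inhibit stays asserted until every cache has completed its snoop;
 * while inhibit is high, memory must not respond.
 * If dirty is asserted, a cache (not memory) supplies the block.
 * If shared is asserted, the requester loads the block in shared state. */
void combine_snoop(const snoop_report_t rep[NUM_CACHES],
                   bool *shared, bool *dirty, bool *inhibit) {
    *shared = false;
    *dirty = false;
    *inhibit = false;
    for (int i = 0; i < NUM_CACHES; i++) {
        *shared  |= rep[i].shared;
        *dirty   |= rep[i].dirty;
        *inhibit |= !rep[i].done;   /* any cache still snooping keeps inhibit high */
    }
}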
[Figure: snooping cache organization with cache data RAM, tag and snoop-state arrays, comparators feeding the controller, a write-back buffer, address/command and data buffers, and the system bus]
Changes made by the processor to the L1 cache may not be visible to the L2 cache controller, which is responsible for bus operations; conversely, bus transactions are not directly visible to the L1 cache.
The L1 cache is usually on the processor chip, and an on-chip snooper consumes pins to monitor the shared bus; duplicating tags consumes too much on-chip area, and snooping at both L1 and L2 duplicates effort.
Instead, L1 accesses L2 for cache miss handling and block state changes, and L2 forwards to L1 the blocks invalidated or updated by bus transactions.
Inclusion Property
Inclusion is maintained automatically only under restrictive conditions:
L1 is direct mapped
L2 is either direct mapped or set associative
Same block size for both caches
Number of sets in L1 is smaller than in L2
Otherwise inclusion must be maintained explicitly, for example:
Propagate L2 replacements to L1
Send all transactions to L1 (even if the given block is not present there)
Add extra state to L2 (a bit per block) recording which blocks in L2 are also in L1 (inclusion bit)
Add another per-block state bit in L2, "modified-but-stale", and request a flush from L1 on a bus read
L2 serves as a filter for the L1 cache, screening out irrelevant transactions from the bus, i.e. dual tags are less critical with multilevel caches
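A sketch of L2 acting as that filter, assuming the inclusion and modified-but-stale bits described above; the structures and hook names are illustrative, with the L1 calls stubbed out.

#include <stdint.h>
#include <stdbool.h>

/* Assumed per-block L2 bookkeeping for a system that maintains inclusion. */
typedef struct {
    bool valid;
    bool in_l1;              /* inclusion bit: block also present in L1   */
    bool modified_but_stale; /* L1 holds a newer (dirty) copy than L2     */
    uint32_t tag;
} l2_block_t;

/* Hypothetical upcalls into the L1 controller. */
static void l1_invalidate(uint32_t addr) { (void)addr; /* drop the L1 copy          */ }
static void l1_flush(uint32_t addr)      { (void)addr; /* L1 writes dirty copy to L2 */ }

/* A bus snoop arrives at L2. Only transactions that actually concern a block
 * cached in L1 are forwarded upward; everything else is filtered out here. */
void l2_snoop(l2_block_t *blk, uint32_t addr, bool is_busrdx) {
    if (!blk->valid)
        return;                          /* snoop miss: irrelevant to this node */

    if (blk->modified_but_stale)
        l1_flush(addr);                  /* fetch the latest data from L1 first */

    if (is_busrdx && blk->in_l1)
        l1_invalidate(addr);             /* keep L1 consistent with the bus     */

    if (is_busrdx)
        blk->valid = false;              /* L2 itself gives up the block        */
}

/* L2 replacements must also be propagated to L1 to preserve inclusion. */
void l2_replace(l2_block_t *victim, uint32_t addr) {
    if (victim->modified_but_stale)
        l1_flush(addr);
    if (victim->in_l1)
        l1_invalidate(addr);
    victim->valid = false;
}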
Only one transaction is on the bus at a time. Transactions are propagated up and down the hierarchy, and bus transactions may be held until propagation completes. The performance penalty for holding a processor write until BusRdX has been granted is high, so there is motivation to decouple these operations.
[Figure: two-level cache hierarchy, each level with tags and cached data; the L1 tags are used by the processor, with the L2 cache below]
Buffering between bus and the cache controllers allows multiple outstanding transactions (waiting for snoop and/or data responses)
Pro: By pipelining bus operations the bus is utilized more efficiently. Con: Increased complexity.
[Timing diagram: overlapped split bus transactions, each with an Address/CMD phase, a memory access delay, and a Data phase]
Bus arbitration
A new request can appear on the bus before the snoop and/or servicing of an earlier request are complete;
The number of buffers for incoming requests and potential data responses from bus to cache controller is usually fixed and small, so flow control is needed. Since requests from the bus are buffered, it must be decided when and how snoop and data responses are produced on the bus:
In the same order as the requests arrive?
Snoop and data response together or separately?
There are 3 phases in a transaction: 1. A request is put on the bus 2. Snoop results are sent by other caches 3. The data response is returned on the bus
Conflicting requests for the same block are not allowed (up to 8 outstanding requests)
NACK flow control:
NACK as soon as the request appears on the bus, and the requestor retries
Separate command (incl. NACK) + address, tag, and data buses
Responses may be in a different order than requests
The order of transactions is determined by the requests; snoop results are presented on the bus with the response
Bus Design
Match each response to outstanding request, since they arrive out of order
Each request is tagged with 3 bits (8 outstanding) when launched; the tag comes back with the corresponding response
Address and data buses can be arbitrated separately
Separate bus lines for arbitration, flow control and snoop results
To keep track of outstanding requests on the bus, each cache controller maintains an eight-entry buffer, the request table:
A new request on the bus is added to all request tables at the same index; the index is the 3-bit tag assigned at arbitration
A table entry contains the block address, request type, state in that cache, etc.
The table is fully associative: a new entry can be placed anywhere in the table
The table is checked for a match by the requesting processor and against all snooped requests and responses on the bus
The entry and tag are freed when the response is observed on the bus
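A sketch of such an eight-entry request table, assuming the 3-bit tag assigned at arbitration indexes it; the field and function names are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define TABLE_ENTRIES 8                 /* up to 8 outstanding requests */

typedef struct {
    bool     valid;
    uint32_t block_addr;
    int      req_type;                  /* e.g. read, read-exclusive, write-back */
    int      local_state;               /* state of the block in this cache      */
} request_entry_t;

/* One table per cache controller; every controller keeps the same contents. */
static request_entry_t request_table[TABLE_ENTRIES];

/* A new request appears on the bus: insert it at the index given by the
 * 3-bit tag that was assigned during arbitration. */
void request_observed(uint8_t tag, uint32_t block_addr, int req_type, int state) {
    request_entry_t *e = &request_table[tag & 0x7];
    e->valid = true;
    e->block_addr = block_addr;
    e->req_type = req_type;
    e->local_state = state;
}

/* Check whether a new request would conflict with an outstanding one for
 * the same block (such requests are not allowed on the bus). */
bool conflicts(uint32_t block_addr) {
    for (int i = 0; i < TABLE_ENTRIES; i++)
        if (request_table[i].valid && request_table[i].block_addr == block_addr)
            return true;
    return false;
}

/* A response carrying `tag` appears on the bus: the matching entry and its
 * tag are freed once the response has been observed. */
void response_observed(uint8_t tag) {
    request_table[tag & 0x7].valid = false;
}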
[Figure: cache controller bus interface with an eight-entry request table (address, tag, request type, originator, miscellaneous information, my response), write-back and data buffers, response queue, request buffer, tag comparators and snoop state, connected to the address+command bus and the data+tag bus]
Variable delay snooping: the snoop portion of the bus consists of three wired-OR lines.
The request phase determines who will respond, but it may take many cycles, with intervening request-response transactions; all controllers present their snoop results on the bus when they see the response. There is no data response or snoop result for write-backs and upgrades.
Flow Control
Implement flow control at: incoming request buffers from bus to cache controller (write-back buffer)
Flow control is also needed at main memory: each of the 8 pending requests can generate a write-back to memory
SGI Challenge: separate NACK lines for address and data buses
The source keeps watch for this retry; space is guaranteed to still be there, so at most two tries are needed.
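A hedged sketch of this NACK-and-retry flow control; the bus-interface function is a stub standing in for driving the bus and sampling the NACK lines, and the bounded retry assumes the space guarantee described above.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t addr; int type; } bus_request_t;

/* Stub for issuing a request: in hardware this drives the address/command
 * bus and samples the wired-OR NACK line. Returns true if accepted. */
static bool issue_on_bus(const bus_request_t *req) {
    (void)req;
    return true;
}

/* Issue with retry. Because space is guaranteed to be available for the
 * retried request, two tries suffice at most. */
bool issue_with_retry(const bus_request_t *req) {
    if (issue_on_bus(req))
        return true;              /* accepted on the first try */
    return issue_on_bus(req);     /* retried request finds guaranteed buffer space */
}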
With multiple outstanding requests on the bus, invalidations are buffered between the bus and the cache and are not applied to the cache immediately: commitment versus completion.
Condition necessary for SC: a processor should not be allowed to see the new value of a write before previous writes (in bus order) are visible to it. Options (the simplest is sketched after this list):
1. Do not let certain types of incoming transactions from bus to cache be reordered in the incoming queues
2. Allow these reorderings in the queues, but ensure that the important orders are preserved at the necessary points in the machine
3. A simpler approach is to treat all requests in FIFO order; although this approach is simpler, it can have performance problems
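A minimal sketch of the simple FIFO approach (option 3), assuming incoming bus transactions are drained strictly in order so a data reply is only delivered after every earlier invalidation/update has been applied; the queue type and hooks are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define QUEUE_DEPTH 16

typedef enum { INVALIDATE, UPDATE, DATA_REPLY } incoming_kind_t;
typedef struct { incoming_kind_t kind; uint32_t addr; } incoming_t;

/* FIFO of transactions coming down from the bus toward the cache. */
typedef struct {
    incoming_t buf[QUEUE_DEPTH];
    int head, tail, count;
} incoming_queue_t;

static bool dequeue(incoming_queue_t *q, incoming_t *out) {
    if (q->count == 0) return false;
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return true;
}

/* Hypothetical hooks into the cache. */
static void apply_invalidate(uint32_t addr) { (void)addr; }
static void apply_update(uint32_t addr)     { (void)addr; }
static void deliver_data(uint32_t addr)     { (void)addr; }

/* Drain strictly in FIFO (bus) order: the processor cannot see a new value
 * before previous writes (in bus order) are visible to it. */
void drain_incoming(incoming_queue_t *q) {
    incoming_t t;
    while (dequeue(q, &t)) {
        switch (t.kind) {
        case INVALIDATE: apply_invalidate(t.addr); break;
        case UPDATE:     apply_update(t.addr);     break;
        case DATA_REPLY: deliver_data(t.addr);     break;
        }
    }
}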
To maintain high bandwidth while allowing the individual units (controllers and caches) to operate at their own rates, queues are placed between levels of the hierarchy.
Deadlock
Fetch deadlock:
One outstanding request per processor => need space to hold p requests plus one reply (latter is essential)
If the buffering is smaller (or if there are multiple outstanding requests), it may be necessary to NACK; then a priority mechanism is needed in the bus arbiter to ensure progress
Buffer deadlock:
The L1-to-L2 queue is filled with read requests waiting for a response from L2, while the L2-to-L1 queue is filled with bus requests waiting for a response from L1
The latter condition occurs only when a cache closer to the processor than the lowest level is write-back; providing enough buffering to avoid it requires a lot of area and is not scalable (one way out is sketched below)
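A sketch of one common way out of buffer deadlock, assuming the controller keeps servicing incoming requests while it waits for its own response; the queue layout and hooks are illustrative, with stubs standing in for the real L1/L2 logic.

#include <stdbool.h>
#include <stdint.h>

#define QDEPTH 8
typedef struct { uint32_t buf[QDEPTH]; int head, count; } queue_t;

static bool queue_pop(queue_t *q, uint32_t *addr) {
    if (q->count == 0) return false;
    *addr = q->buf[q->head];
    q->head = (q->head + 1) % QDEPTH;
    q->count--;
    return true;
}

/* Hypothetical hooks. */
static void service_incoming(uint32_t addr) { (void)addr; /* apply bus request to L1      */ }
static bool response_arrived(uint32_t addr) { (void)addr; return true; /* stub             */ }

/* While an outstanding miss waits for its response from L2, the L1 controller
 * keeps draining the incoming (L2-to-L1) queue. This breaks the circular wait
 * behind buffer deadlock: L1 waiting on L2 for data while L2 waits on L1 to
 * accept buffered bus requests. */
void wait_for_response(queue_t *incoming, uint32_t miss_addr) {
    uint32_t addr;
    while (!response_arrived(miss_addr)) {
        if (queue_pop(incoming, &addr))
            service_incoming(addr);      /* never block on incoming requests */
    }
}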
Sequential Consistency
The separation of commitment from completion is even greater with multilevel caches: do not wait for an invalidation to reach all the way up to L1 and return a reply; consider the write committed when it is placed on the bus. Fortunately the techniques for single-level caches and a split-transaction bus extend, and either method works:
Don't allow certain reorderings of transactions at any level
Don't let an outgoing operation proceed past a level before incoming invalidations/updates at that level are applied (sketched below)
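A sketch of the second rule, assuming each level holds an outgoing operation until the incoming invalidations/updates already queued at that level have been applied upward; the queue and hook names are illustrative.

#include <stdbool.h>
#include <stdint.h>

#define QDEPTH 16
typedef struct { uint32_t addr[QDEPTH]; int head, count; } inval_queue_t;

/* Hypothetical hook: apply one buffered invalidation/update toward L1. */
static void apply_upward(uint32_t addr) { (void)addr; }

static bool pop(inval_queue_t *q, uint32_t *addr) {
    if (q->count == 0) return false;
    *addr = q->addr[q->head];
    q->head = (q->head + 1) % QDEPTH;
    q->count--;
    return true;
}

/* Before an outgoing operation (e.g. a write leaving L1) is allowed past this
 * level, drain and apply every incoming invalidation/update queued here, so
 * the orders needed for SC are preserved at this point in the machine. */
void pass_outgoing(inval_queue_t *incoming, uint32_t outgoing_addr) {
    uint32_t a;
    while (pop(incoming, &a))
        apply_upward(a);
    (void)outgoing_addr;   /* the outgoing operation may now proceed downward */
}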
If L1 cache is shared then there are no multiple copies of a cache block and hence no coherence problem
L1 communication latency is 2-10 clocks, while main memory latency is many times larger; the reduced latency enables finer-grained sharing of data
With private caches each processor incurs the miss penalty separately; sharing reduces the BW requirements at the next level of the hierarchy
A shared cache can be smaller than the combined size of the private caches when the working sets of different processors overlap
Processors are connected to the shared cache by a switch, more likely a crossbar, to allow parallel cache accesses by the processors
Support high BW by interleaving cache and main memory
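A small sketch of address interleaving across cache or memory banks, using an assumed power-of-two bank count and block size; consecutive blocks land in different banks so independent accesses proceed in parallel.

#include <stdint.h>

#define BLOCK_BYTES 64      /* assumed block size      */
#define NUM_BANKS   8       /* assumed number of banks */

/* Consecutive blocks map to different banks, so aggregate bandwidth scales
 * with the number of banks. */
static inline unsigned bank_of(uint64_t addr) {
    return (addr / BLOCK_BYTES) % NUM_BANKS;
}

/* Byte offset of the address within its bank. */
static inline uint64_t offset_in_bank(uint64_t addr) {
    return (addr / BLOCK_BYTES) / NUM_BANKS * BLOCK_BYTES + addr % BLOCK_BYTES;
}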
Higher bandwidth demand
Hit latency to a shared cache is higher than to a private one
Higher cache complexity; shared caches are usually slower
Instead of constructive interference (as in the overlapping working-set example), destructive interference can occur
8 custom processors
Snoopy cache-coherent multiprocessor; each private cache supports two processors instead of one
Practical approach:
Private L1 caches and a shared L2 cache among groups of processors; packaging considerations are also important