CSA Mod 3-Part 2 Notes (Cache Coherence)
Fig:3.19
• Consider a multiprocessor with two processors, each with a private cache, both sharing the main memory. Let X be a shared data element that has been referenced by both processors.
• Before the update, the value of X is the same in memory and in both caches.
• Under a write-through policy, if P1 updates X to X', the new value is written immediately to main memory, but an inconsistency arises between the two cached copies.
• Under a write-back policy, main memory is updated only when the modified data in the cache is replaced or invalidated, so an inconsistency arises between the cached copy and shared memory.
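The inconsistency under both policies can be sketched in a few lines. This is a minimal illustration, not from the notes; the dictionary-based "caches" and the function name are mine.

```python
# Sketch: two private caches over one shared memory, showing why a
# write by P1 leaves a stale copy under either write policy.

def demo():
    memory = {"X": "X"}
    cache_p1 = {"X": "X"}   # both processors have X cached
    cache_p2 = {"X": "X"}

    # Write-through: P1 updates X -> X'; memory is updated immediately,
    # but P2's cached copy is now stale.
    cache_p1["X"] = "X'"
    memory["X"] = "X'"          # written through at once
    wt_stale = cache_p2["X"] != memory["X"]

    # Write-back: reset, then P1 updates only its own cache; memory is
    # updated later (on replacement), so memory holds the old value.
    memory["X"] = cache_p1["X"] = cache_p2["X"] = "X"
    cache_p1["X"] = "X'"        # memory untouched until replacement
    wb_stale = memory["X"] != cache_p1["X"]

    return wt_stale, wb_stale

print(demo())  # (True, True): both policies leave an inconsistent copy
```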
Fig:3.20
• Initially, before migration, P1 holds the value X in its cache, and X is also in shared memory.
• The process now migrates to P2. If P2 reads X and updates it to X' (centre figure), a write-through policy updates shared memory at the same time, but P1's cache still holds X, so an inconsistency exists.
• With write-back (rightmost figure, migration from P1 to P2), if P1 reads X and modifies it to X', the update does not reach shared memory. P2 then reads the stale value from shared memory while executing the process, so the updated value is not used by P2 to perform its task.
Qn: How can I/O operations bypassing the cache result in a cache coherence problem? Also, suggest a solution for preventing it.
Data movement from an I/O device to shared memory usually does not cause the cached copies of the data to be updated.
As a result, an input operation that writes X writes directly to shared primary memory, causing it to become inconsistent with a cached copy of X.
Likewise, writing data to an I/O device usually uses the data in shared primary memory, ignoring any cached copies that may hold different values.
Fig:3.21
• When an I/O processor loads new data X' into main memory (middle figure above), an inconsistency occurs between the caches and shared memory.
• Outputting data directly from shared memory (bypassing the cache) also creates an inconsistency.
• A possible solution is to attach the I/O processors to the private caches (C1 and C2), as shown in the figure below. This way the I/O processors share the caches with the CPUs. I/O consistency can then be maintained as long as cache-to-cache consistency is maintained via the bus.
Fig:3.22
1. Write-invalidate policy – invalidates all remote copies when a local cache block is updated. In the figure below, processor P1 modifies (writes) its cached copy from X to X', and all other copies are invalidated via the bus. Invalidated blocks are called dirty, meaning they should not be used.
Fig:3.23
2. Write-update policy – broadcasts the new data block to all caches containing a copy of the block. In the figure below, the updated block content X' is broadcast to all cache copies via the bus. If write-through caches are used, the memory copy is updated at the same time; if write-back caches are used, the memory copy is updated later, when the block is replaced.
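The contrast between the two policies can be sketched as follows. This is a simplified model, not from the notes; the helper names and dictionary-based caches are mine, and memory traffic is omitted.

```python
# Sketch: P1 (cache 0) writes X' while two other caches hold copies.

def write_invalidate(caches, writer, addr, value):
    # Writer keeps the new value; every remote copy is invalidated.
    for i, c in enumerate(caches):
        if i == writer:
            c[addr] = value
        elif addr in c:
            del c[addr]          # invalidated: must miss on next access

def write_update(caches, writer, addr, value):
    # The new value is broadcast to every cache holding a copy.
    for i, c in enumerate(caches):
        if i == writer or addr in c:
            c[addr] = value

caches = [{"X": "X"}, {"X": "X"}, {"X": "X"}]
write_invalidate(caches, 0, "X", "X'")
print(caches)   # only cache 0 still holds X (now X'); others are empty

caches = [{"X": "X"}, {"X": "X"}, {"X": "X"}]
write_update(caches, 0, "X", "X'")
print(caches)   # all three caches now hold X'
```

Write-invalidate trades one bus transaction now for possible misses later; write-update pays the broadcast cost up front even for copies that are never read again.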
Fig:3.24
Write-Back Caches:
The valid state of a write-back cache can be further split into two cache states, labelled:
• RW (read-write / Exclusive) and
• RO (read-only / Shared), as shown in the figure below.
The INV (invalidated or not-in-cache) state is equivalent to the invalid state mentioned before.
Fig:3.26 State transition graph for a cache block using a write-back, write-invalidate snoopy protocol
• When the memory owns a block, caches can contain only RO (Shared) copies of the block. In other words, multiple copies may exist in the RO state, and every processor holding a copy (called a keeper of the copy) can read it safely (R(i), R(j)).
• The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy.
• From either the RO state or the INV state, the cache block becomes uniquely owned (RW) when a local write (W(i)) takes place.
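The transitions just described can be written down as a small table. This is a sketch under my reading of the state graph (the remaining transitions, e.g. an exclusive copy being demoted to shared on a remote read, follow the usual write-back write-invalidate behaviour); the table and function names are mine.

```python
# Events: R(i)/W(i)/Z(i) = local read/write/replace; R(j)/W(j) = remote.
TRANSITIONS = {
    ("RO", "W(i)"): "RW",    # local write gains exclusive ownership
    ("RO", "W(j)"): "INV",   # remote write invalidates our shared copy
    ("RO", "Z(i)"): "INV",   # we replace our own copy
    ("RW", "R(j)"): "RO",    # remote read demotes exclusive to shared
    ("RW", "W(j)"): "INV",
    ("RW", "Z(i)"): "INV",
    ("INV", "R(i)"): "RO",   # re-read fetches a shared copy
    ("INV", "W(i)"): "RW",   # local write (re)claims ownership
}

def step(state, event):
    # Unlisted (state, event) pairs leave the state unchanged,
    # e.g. R(i) in RO or RW.
    return TRANSITIONS.get((state, event), state)

s = "RO"
for ev in ["R(i)", "W(i)", "R(j)", "W(j)"]:
    s = step(s, ev)
print(s)  # INV  (trace: RO -> RO -> RW -> RO -> INV)
```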
Fig: 3.27
Valid: The cache block, which is consistent with the memory copy, has been read from
shared memory and has not been modified.
Invalid: The block is not found in the cache or is inconsistent with the memory copy.
Reserved: Data has been written exactly once since being read from shared memory. The
cache copy is consistent with the memory copy, which is the only other copy.
Dirty: The cache block has been modified (written) more than once, and the cache copy is
the only one in the system (thus inconsistent with all other copies).
To maintain consistency, the protocol requires two different sets of commands. The solid lines in the figure above correspond to access commands issued by a local processor, labelled read-miss, write-miss, and write-hit. Whenever a read-miss occurs, the valid state is entered.
The first write-hit leads to the reserved state.
The second write-hit leads to the dirty state, and all future write-hits stay in the dirty state.
Whenever a write-miss occurs, the cache block enters the dirty state.
The dashed lines correspond to invalidation commands issued by remote processors via
the snoopy bus.
The read-invalidate command reads a block and invalidates all other copies. The write-invalidate command invalidates all other copies of a block.
Cache Events and Action: The memory-access and invalidation commands trigger
the following events and actions:
Read-Miss: When a processor wants to read a block that is not in its cache, a read-miss occurs. A bus read operation is initiated. If no dirty copy exists, main memory has a consistent copy and supplies it to the requesting cache. If a dirty copy does exist in a remote cache, that cache will inhibit main memory and send a copy to the requesting cache. In all cases, the cache copy enters the valid state after a read-miss.
Write-Hit: If the copy is in the dirty or reserved state, the write is carried out locally and the new state is dirty. If the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies. The shared memory is written through, and the resulting state is reserved after this first write.
Write-Miss: When a write misses in the local cache, the copy must come either from main memory or from a remote cache holding a dirty block. This is accomplished by sending a read-invalidate command, which invalidates all other cache copies. The local copy is then updated and ends up in the dirty state.
Read-hit: Read-hits can always be performed in a local cache without causing a state
transition or using the snoopy bus for invalidation.
Block Replacement: If a copy is dirty, it has to be written back to main memory on block replacement. If the copy is clean (i.e., in the valid, reserved, or invalid state), no write-back takes place on replacement.
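The local-processor events above form a four-state machine (this is the write-once protocol). A minimal sketch of just those transitions, with the remote bus commands forcing the invalid state; the function and event names are mine:

```python
# States: invalid, valid, reserved, dirty (as defined in the notes).

def next_state(state, event):
    if event == "read-miss":
        return "valid"          # read-miss always enters valid
    if event == "write-miss":
        return "dirty"          # read-invalidate fetch, then local write
    if event == "write-hit":
        if state == "valid":
            return "reserved"   # first write: written through once
        return "dirty"          # second and later writes stay local
    if event == "read-hit":
        return state            # never changes state, no bus traffic
    if event in ("read-invalidate", "write-invalidate"):
        return "invalid"        # remote command over the snoopy bus
    raise ValueError(event)

s = "invalid"
history = []
for ev in ["read-miss", "write-hit", "write-hit", "write-hit"]:
    s = next_state(s, ev)
    history.append(s)
print(history)  # ['valid', 'reserved', 'dirty', 'dirty']
```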
• The write-invalidate scheme on a snoopy bus causes heavy bus traffic, since it invalidates the remote cache copies on every update.
• The write-update scheme on a snoopy bus updates all cache copies, some of which may never be used again.
• Hence, in large multiprocessors with hundreds of processors, broadcasting becomes expensive. This motivates directory-based protocols.
Directory Structure
• In a multistage network, cache coherence is supported by using cache directories to store information about where copies of each cache block reside. The directory can be central or distributed.
• A central-directory scheme stores duplicates of all cache directories. Contention and long search times are the drawbacks of a central directory.
• In a distributed-directory scheme, each memory module maintains a separate directory that records the state and presence information for each memory block. The state information is local, but the presence information indicates which caches hold a copy of the block.
Fig: 3.28: Basic concept of a directory based cache coherence scheme
1. Full-map directories: Full-map directories store enough state with each block in global memory that every cache in the system can simultaneously hold a copy of any block of data. That is, each directory entry contains N pointers, where N is the number of processors in the system.
2. Limited directories: These have a fixed number of pointers per entry, regardless of the system size.
Fig: 3.29
Let us examine the transition from the second state to the third state in more detail. Once
processor P3 issues the write to cache C3, the following events will take place:
(1) Cache C3 detects that the block containing location X is valid but that the
processor does not have permission to write to the block, indicated by the
block‘s write-permission bit in the cache.
(2) Cache C3 issues a write request to the memory module containing location X
and stalls processor P3.
(3) The memory module issues invalidate requests to caches C1 and C2.
(4) Caches C1 and C2 receive the invalidate requests, set the appropriate bit to
indicate that the block containing location X is invalid, and send
acknowledgments back to the memory module.
(5) The memory module receives the acknowledgments, sets the dirty bit, clears
the pointers to caches C1 and C2, and sends write permission to cache C3.
(6) Cache C3 receives the write permission message, updates the state in the
cache, and reactivates processor P3.
The memory module waits to receive the acknowledgments before allowing processor P3 to complete its write transaction. By waiting for acknowledgments, the protocol guarantees that the memory system preserves sequential consistency.
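The directory side of this six-step sequence can be sketched as follows. This is a simplified model of steps (3)-(5) only; the class and function names, and the bit-vector representation of the N pointers, are mine.

```python
# Sketch: a full-map directory entry with one presence bit per cache
# and a dirty bit; invalidations are acknowledged before write
# permission is granted.

class FullMapEntry:
    def __init__(self, n_caches):
        self.present = [False] * n_caches   # one pointer/bit per cache
        self.dirty = False

def write_request(entry, writer, caches, addr):
    acks = 0
    # Steps (3)-(4): invalidate every sharer except the writer and
    # collect their acknowledgments.
    for i, c in enumerate(caches):
        if entry.present[i] and i != writer:
            c.pop(addr, None)               # copy invalidated
            entry.present[i] = False        # pointer cleared
            acks += 1
    # Step (5): all acks in -> set dirty bit, grant write permission.
    entry.dirty = True
    entry.present[writer] = True
    return acks

caches = [{"X": "X"}, {"X": "X"}, {"X": "X"}]
entry = FullMapEntry(3)
entry.present = [True, True, True]          # C1, C2, C3 all hold X
acks = write_request(entry, 2, caches, "X") # P3 writes to X
print(acks, entry.present, entry.dirty)     # 2 [False, False, True] True
```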
Fig: 3.30
• The figure above shows the situation when three caches request read copies in a memory system with a Dir2NB protocol (only two copies are allowed). The two-pointer directory can be viewed as a two-way set-associative cache.
• When cache C3 requests a copy of location X, the memory module must invalidate the copy in either cache C1 or cache C2. This process of pointer replacement is called eviction.
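Eviction in a limited directory can be sketched in a few lines. The helper name and the choice of victim (oldest pointer) are mine; a real implementation could use any replacement policy.

```python
# Sketch: a Dir2NB-style entry holds at most two pointers, so a third
# reader forces one existing sharer's copy to be invalidated (evicted).

def request_copy(pointers, caches, requester, addr, limit=2):
    if requester in pointers:
        return None                     # already a sharer, nothing to do
    evicted = None
    if len(pointers) == limit:
        evicted = pointers.pop(0)       # pick a victim (here: oldest)
        caches[evicted].pop(addr, None) # its copy is invalidated
    pointers.append(requester)
    caches[requester][addr] = "X"       # supply the copy
    return evicted

caches = [{"X": "X"}, {"X": "X"}, {}]
pointers = [0, 1]                       # directory points at C1 and C2
evicted = request_copy(pointers, caches, 2, "X")  # C3 requests X
print(evicted, pointers)  # 0 [1, 2]: C1's copy was evicted
```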
Fig: 3.31
• In the singly linked chain in the figure above, initially assume there are no shared copies of location X.
• When processor P1 reads location X, the memory sends a copy to cache C1 along with a chain termination (CT) pointer. The memory also keeps a pointer to cache C1.
• Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
By repeating the above step, all of the caches can hold a copy of location X.
If processor P3 writes to location X, it is necessary to send a data-invalidation message down the chain.
To ensure sequential consistency, the memory module denies processor P3 write permission until the processor with the chain-termination pointer acknowledges the invalidation of the chain.
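The chain construction and its invalidation walk can be sketched as follows. This is a minimal model under my reading of the scheme; the function names and the dictionary of next-pointers are mine.

```python
# Sketch: memory points at the newest reader; each cache points at the
# previous sharer; the oldest sharer holds the CT (chain termination)
# pointer. A write walks the chain invalidating every copy.

CT = None  # chain-termination pointer

def read(memory_ptr, next_ptr, caches, reader):
    caches[reader]["X"] = "X"
    next_ptr[reader] = memory_ptr   # new reader points at the old head
    return reader                   # memory now points at the new reader

def invalidate_chain(memory_ptr, next_ptr, caches):
    acks, node = 0, memory_ptr
    while node is not CT:           # walk down until the CT pointer
        caches[node].pop("X", None)
        acks += 1
        node = next_ptr[node]
    return acks                     # write permission granted after this

caches = [dict(), dict(), dict()]
next_ptr = {}
head = CT
for p in (0, 1):                    # P1 then P2 read location X
    head = read(head, next_ptr, caches, p)
acks = invalidate_chain(head, next_ptr, caches)  # P3 writes X
print(acks)  # 2: both cached copies were invalidated down the chain
```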
Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy
protocol) because information is passed from individual to individual rather than
being spread by covert observation.
Beyond the use of private caches, three design alternatives are suggested below. Each of the design alternatives has its own advantages and shortcomings. There exists insufficient evidence to determine whether any of the alternatives is always better or worse than the use of private caches.
Shared Cache:
An alternative approach to maintaining cache coherence is to eliminate the problem completely by using shared caches attached to the shared-memory modules. No private caches are allowed in this case. This approach reduces the main memory access time but contributes very little to reducing the overall memory-access time or to resolving access conflicts.
Non-cacheable Data:
Another approach is not to cache shared writable data. Shared data are non-cacheable, and only instructions and private data are cacheable in local caches.
Shared data include locks, process queues, and any other data structures protected by critical sections.
The compiler must tag data as either cacheable or non-cacheable, and special hardware tagging must be used to distinguish them.
Cache Flushing:
A third approach is to use cache flushing every time a synchronization primitive is executed. This may work well with transaction-processing multiprocessor systems. Cache flushes are slow unless special hardware is used.
Flushing can be made very selective by the compiler in order to increase efficiency.
Cache flushing at synchronization, I/O, and process migration may be carried out unconditionally or selectively.
Atomic Operations
• Atomic operations such as read, write, or read-modify-write can be used to implement some synchronization primitives.
• Some interprocessor interrupts can also be used for synchronization purposes.
• For example, the Test&Set and Reset (lock) primitives are defined below.
• To synchronize concurrent processes, the software may repeat Test&Set until the returned value (temp) becomes 0.
• This synchronization primitive may tie up some bus cycles while a processor busy-waits on the spin lock.
• To avoid spinning, interprocessor interrupts can be used.
• A lock tied to an interrupt is called a suspended lock.
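The Test&Set spin lock described above can be sketched as follows. A `threading.Lock` stands in for the hardware's atomic read-modify-write bus cycle; the class and method names are mine.

```python
import threading

class SpinLock:
    def __init__(self):
        self._flag = 0
        self._mutex = threading.Lock()   # models hardware atomicity

    def test_and_set(self):
        with self._mutex:                # atomic: return old value, set 1
            temp, self._flag = self._flag, 1
            return temp

    def lock(self):
        # Busy-wait (spin) until the returned value temp becomes 0.
        while self.test_and_set() != 0:
            pass

    def reset(self):
        self._flag = 0                   # Reset: release the lock

counter = 0
lk = SpinLock()

def work():
    global counter
    for _ in range(1000):
        lk.lock()
        counter += 1                     # critical section
        lk.reset()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: no increments lost under the spin lock
```

Each failed `test_and_set` here corresponds to a wasted bus cycle on a real machine, which is exactly why suspended (interrupt-based) locks are attractive.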
When all processes finish their jobs, the Xi bits of the participating processors are all set to 0, and the barrier line is then raised to high (1), signalling that the synchronization barrier has been crossed.
This timing is watched by all processors through snooping on the Yi bits. Thus only one barrier line is needed to monitor the initiation and completion of a single synchronization involving many concurrent processes.
Figure below shows the synchronization of four processes residing on four processors
using one barrier line.
Fig: 3.32
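The single-barrier-line behaviour can be sketched as a wired-NOR of the per-processor X bits: the line is high only when every X bit is 0. This is my reading of the mechanism; the function name and list representation are mine.

```python
# Sketch: each processor sets its X bit on entering the barrier and
# clears it when its job is done; the shared barrier line goes high (1)
# only when all X bits are 0.

def barrier_line(x_bits):
    return 1 if all(b == 0 for b in x_bits) else 0  # wired-NOR

x = [0, 0, 0, 0]            # four processors, four X bits
for i in range(4):
    x[i] = 1                # processor i enters the barrier
print(barrier_line(x))      # 0: barrier not yet crossed

for i in range(4):
    x[i] = 0                # processor i finishes its job
print(barrier_line(x))      # 1: all done, barrier line raised
```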
Another example, using multiple barrier lines to monitor several synchronization points, is shown below (optional):
Fig: 3.33