Computer Science 246
Computer Architecture
Spring 2009
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Memory Hierarchy and Caches (Part 2)
Caches
• Monday lecture
– Review of cache basics, direct-mapped, set-associative
caches
• Today
– More on cache performance, write strategies
Summary of Set Associativity
• Direct Mapped
– One place in cache, One Comparator, No Muxes
• Set Associative Caches
– Restricted set of places
– N-way set associativity
– Number of comparators = number of blocks per set
– N:1 mux
• Fully Associative
– Anywhere in cache
– Number of comparators = number of blocks in cache
– N:1 mux needed
More Detailed Questions
• Block placement policy?
– Where does a block go when it is fetched?
• Block identification policy?
– How do we find a block in the cache?
• Block replacement policy?
– When fetching a block into a full cache, how do we
decide what other block gets kicked out?
• Write strategy?
– Does any of this differ for reads vs. writes?
Block Placement + ID
• Placement
– Invariant: block always goes in exactly one set
– Fully-Associative: Cache is one set, block goes anywhere
– Direct-Mapped: Block goes in exactly one frame
– Set-Associative: Block goes in one of a few frames
• Identification
– Find Set
– Search ways in parallel (compare tags, check valid bits)
Block Replacement
• Cache miss requires a replacement
• No decision needed in direct mapped cache
• More than one place for memory blocks in set-associative caches
• Replacement Strategies
– Optimal
• Replace Block used furthest ahead in time (oracle)
– Least Recently Used (LRU)
• Optimized for temporal locality
– (Pseudo) Random
• Nearly as good as LRU, simpler
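To make the LRU policy concrete, here is a minimal sketch in C of one set of a 4-way cache: every access stamps its way with a counter, and the victim on a miss is the way with the oldest stamp. All names and sizes are illustrative, not from a real design.

#include <stdio.h>

#define WAYS 4

/* Hypothetical per-set LRU state: one timestamp per way. */
typedef struct {
    unsigned long tag[WAYS];
    unsigned long last_used[WAYS];  /* larger = used more recently */
    int valid[WAYS];
} set_t;

static unsigned long now = 0;       /* global access counter */

/* Returns the way that hit, or the victim way chosen by LRU. */
int access_set(set_t *s, unsigned long tag)
{
    now++;
    /* Hit: refresh this way's timestamp. */
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->last_used[w] = now;
            return w;
        }
    }
    /* Miss: prefer an invalid way, else evict the least recently used. */
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }
        if (s->last_used[w] < s->last_used[victim]) victim = w;
    }
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->last_used[victim] = now;
    return victim;
}

int main(void)
{
    set_t s = {0};
    unsigned long refs[] = {1, 2, 3, 4, 1, 5};  /* 5 evicts tag 2, the LRU block */
    for (int i = 0; i < 6; i++)
        printf("tag %lu -> way %d\n", refs[i], access_set(&s, refs[i]));
    return 0;
}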
Write Policies
• Writes are only about 21% of data cache traffic
• Optimize cache for reads, do writes “on the side”
– Reads can do tag check/data read in parallel
– Writes must be sure we are updating the correct data
and the correct amount of data (1-8 byte writes)
– Serial process => slow
• What to do on a write hit?
• What to do on a write miss?
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write back – Information is only written to the cache.
– Next lower level only updated when block is evicted (dirty bits
say when data has been modified)
– Can write at speed of cache
– Caches become temporarily inconsistent with lower levels of the
hierarchy.
– Uses less memory bandwidth/power (multiple consecutive
writes may require only 1 final write)
– Multiple writes within a block can be merged into one write
– Evictions are longer latency now (must write back)
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write through – Information is written to cache
and to the lower-level memory
– Main memory is always “consistent/coherent”
– Easier to implement – no dirty bits
– Reads never result in writes to lower levels (cheaper)
– Higher bandwidth needed
– Write buffers used to avoid write stalls
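To make the contrast between the two hit policies concrete, here is a minimal sketch in C (hypothetical names, one cache line): write-back just sets a dirty bit and defers the memory update until eviction, so consecutive stores merge into one memory write; write-through propagates every store immediately and needs no dirty bit.

#include <stdio.h>

enum policy { WRITE_BACK, WRITE_THROUGH };

typedef struct {
    unsigned long tag;
    int valid, dirty;
} line_t;

/* Stand-in for a write to the next lower level of the hierarchy. */
static void write_to_memory(unsigned long tag)
{
    printf("  memory write for block tag %#lx\n", tag);
}

static void store_hit(line_t *l, enum policy p)
{
    /* ... update the block's data here ... */
    if (p == WRITE_BACK)
        l->dirty = 1;             /* defer: memory updated only on eviction */
    else
        write_to_memory(l->tag);  /* propagate immediately; no dirty bit */
}

static void evict(line_t *l, enum policy p)
{
    if (p == WRITE_BACK && l->dirty)
        write_to_memory(l->tag);  /* eviction is longer latency now */
    l->valid = l->dirty = 0;
}

int main(void)
{
    line_t l = { .tag = 0x40, .valid = 1 };
    printf("write-back: three stores, one memory write at eviction\n");
    store_hit(&l, WRITE_BACK); store_hit(&l, WRITE_BACK); store_hit(&l, WRITE_BACK);
    evict(&l, WRITE_BACK);

    l = (line_t){ .tag = 0x40, .valid = 1 };
    printf("write-through: every store reaches memory\n");
    store_hit(&l, WRITE_THROUGH); store_hit(&l, WRITE_THROUGH);
    evict(&l, WRITE_THROUGH);
    return 0;
}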
Write buffers
[Diagram: CPU → Cache → Write Buffer → Lower Levels of Memory]
• Small chunks of memory to buffer outgoing writes
• Processor can continue when data written to buffer
• Allows overlap of processor execution with memory update
• Write buffers are essential for write-through caches
Write buffers
• Writes can now be pipelined (rather than serial)
• Check tag + Write store data into Write Buffer
• Write data from Write buffer to L2 cache (tags ok)
• Loads must check write buffer for pending stores to same address
[Diagram: a store op places an address|data pair into a write buffer entry; loads check the write buffer, the data cache (tag|data), and then subsequent levels of memory]
Write buffer policies:
Performance/Complexity Tradeoffs
[Diagram: stores and loads flowing between the write buffer and the L2 cache]
• Allow merging of multiple stores? (“coalescing”)
• “Flush Policy” – How to do flushing of entries?
• “Load Servicing Policy” – What happens when a
load occurs to data currently in write buffer?
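A minimal sketch in C of these tradeoffs (hypothetical 4-entry buffer): stores coalesce with a pending store to the same address, and loads search the buffer for pending stores before going on to the L2 cache.

#include <stdio.h>

#define ENTRIES 4

typedef struct {
    int valid;
    unsigned long addr;
    unsigned long data;
} wb_entry_t;

static wb_entry_t wb[ENTRIES];

/* Insert a store; coalesce with a pending store to the same address. */
void wb_store(unsigned long addr, unsigned long data)
{
    for (int i = 0; i < ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) {
            wb[i].data = data;           /* coalescing: overwrite in place */
            return;
        }
    for (int i = 0; i < ENTRIES; i++)
        if (!wb[i].valid) {
            wb[i] = (wb_entry_t){1, addr, data};
            return;
        }
    /* Buffer full: a real design would stall or flush an entry to L2. */
    printf("write buffer full, stall\n");
}

/* Load servicing policy: forward from the buffer when the address matches. */
int wb_load(unsigned long addr, unsigned long *data)
{
    for (int i = 0; i < ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) {
            *data = wb[i].data;          /* hit on a pending store */
            return 1;
        }
    return 0;                            /* miss: go to cache / L2 */
}

int main(void)
{
    unsigned long v;
    wb_store(0x100, 7);
    wb_store(0x100, 9);                  /* coalesces with the first store */
    if (wb_load(0x100, &v))
        printf("load 0x100 serviced from write buffer: %lu\n", v);
    return 0;
}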
Write misses?
• Write Allocate
– Block is allocated on a write miss
– Standard write hit actions follow the block allocation
– Write misses = Read Misses
– Goes well with write-back
• No-write Allocate
– Write misses do not allocate a block
– Only update lower-level memory
– Blocks only allocate on Read misses!
– Goes well with write-through
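A minimal sketch in C of the write-miss decision (all helper names are hypothetical stand-ins for the real cache machinery):

#include <stdio.h>

enum miss_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

static void fetch_block(unsigned long addr) { printf("  fetch block %#lx into cache\n", addr); }
static void write_lower(unsigned long addr) { printf("  write %#lx to lower-level memory\n", addr); }
static void write_cache(unsigned long addr) { printf("  write %#lx in cache (normal hit path)\n", addr); }

void store_miss(unsigned long addr, enum miss_policy p)
{
    if (p == WRITE_ALLOCATE) {
        fetch_block(addr);   /* write miss handled like a read miss */
        write_cache(addr);   /* standard write-hit actions follow */
    } else {
        write_lower(addr);   /* block is never brought into the cache */
    }
}

int main(void)
{
    printf("write-allocate (pairs well with write-back):\n");
    store_miss(0x200, WRITE_ALLOCATE);
    printf("no-write-allocate (pairs well with write-through):\n");
    store_miss(0x200, NO_WRITE_ALLOCATE);
    return 0;
}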
Summary of Write Policies
Write Policy Hit/Miss Writes to
WriteBack/Allocate Both L1 Cache
WriteBack/NoAllocate Hit L1 Cache
WriteBack/NoAllocate Miss L2 Cache
WriteThrough/Allocate Both Both
WriteThrough/NoAllocate Hit Both
WriteThrough/NoAllocate Miss L2 Cache
Cache Performance
CPU time = (CPU execution cycles + Memory Stall
Cycles)*Clock Cycle Time
AMAT = Hit Time + Miss Rate * Miss Penalty
• Reducing these three parameters can have a big
impact on performance
• Out-of-order processors can hide some of the miss
penalty
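To make the formula concrete, here is a tiny worked example in C; the hit time, miss rate, and miss penalty values below are made up for illustration.

#include <stdio.h>

int main(void)
{
    /* Illustrative parameters, not from the slides. */
    double hit_time  = 1.0;    /* cycles */
    double miss_rate = 0.05;   /* 5% of accesses miss */
    double penalty   = 20.0;   /* cycles to fetch from the next level */

    /* AMAT = Hit Time + Miss Rate * Miss Penalty */
    double amat = hit_time + miss_rate * penalty;
    printf("AMAT = %.1f + %.2f * %.0f = %.1f cycles\n",
           hit_time, miss_rate, penalty, amat);   /* 1 + 0.05*20 = 2.0 */
    return 0;
}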
Reducing Miss Penalty
• Have already seen two examples of techniques to
reduce miss penalty
– Write buffers give priority to read misses over writes
– Merging write buffers
• Multiword writes are faster than many single word writes
• Now we consider several more
– Victim Caches
– Critical Word First/Early Restart
– Multilevel caches
Reducing Miss Penalty:
Victim Caches
• Direct mapped caches => many conflict misses
• Solution 1: More associativity (expensive)
• Solution 2: Victim Cache
• Victim Cache
– Small (4 to 8-entry), fully-associative cache between L1
cache and refill path
– Holds blocks discarded from cache because of evictions
– Checked on a miss before going to L2 cache
– Hit in victim cache => swap victim block with cache block
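A minimal sketch in C of the victim-cache lookup path, assuming a hypothetical 4-entry fully-associative buffer and a single conflicting L1 frame:

#include <stdio.h>

#define VC_ENTRIES 4

typedef struct { int valid; unsigned long tag; } block_t;

static block_t l1_frame;           /* the one conflicting L1 frame */
static block_t vc[VC_ENTRIES];     /* small fully-associative victim cache */

/* Called on an L1 miss, before going to L2. Returns 1 on victim-cache hit. */
int victim_lookup(unsigned long tag)
{
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (vc[i].valid && vc[i].tag == tag) {
            /* Swap: requested block moves to L1, evicted L1 block to VC. */
            block_t tmp = l1_frame;
            l1_frame = vc[i];
            vc[i] = tmp;
            return 1;
        }
    }
    return 0;   /* miss: fall through to L2; the L1 victim goes to the VC */
}

int main(void)
{
    l1_frame = (block_t){1, 0xA};
    vc[0] = (block_t){1, 0xB};     /* 0xB was evicted earlier */
    if (victim_lookup(0xB))
        printf("victim-cache hit: L1 now holds %#lx, VC holds %#lx\n",
               l1_frame.tag, vc[0].tag);
    return 0;
}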
Reducing Miss Penalty:
Victim Caches
• Even one entry helps
some benchmarks!
• Helps more for smaller
caches, larger block
sizes
Reducing Miss Penalty:
Critical Word First/Early Restart
• CPU normally just needs one word at a time
• Large cache blocks have long transfer times
• Don’t wait for the full block to be loaded before
sending requested data word to the CPU
• Critical Word First
– Request the missed word first from memory and send it
to the CPU and continue execution
• Early Restart
– Fetch in order, but as soon as the requested block
arrives send it to the CPU and continue execution
Review: Improving Cache
Performance
• How to improve cache performance?
– Reducing Cache Miss Penalty
– Reducing Miss Rate
– Reducing Miss Penalty/Rate via parallelism
– Reducing Hit Time
Non-blocking Caches to reduce stalls
on misses
• Non-blocking cache or lockup-free cache allows the data cache to
continue to supply cache hits during a miss
– requires out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by
working during miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further
lower the effective miss penalty by overlapping multiple
misses
– Significantly increases the complexity of the cache controller as there
can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise cannot support multiple outstanding accesses)
– Pentium Pro allows 4 outstanding memory misses
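One common way to track the outstanding misses is a small table of miss status holding registers (MSHRs); the sketch below in C is a hypothetical 4-entry version that merges a second miss to an in-flight block instead of issuing another memory request.

#include <stdio.h>

#define MSHRS 4

typedef struct { int valid; unsigned long block_addr; } mshr_t;

static mshr_t mshr[MSHRS];

/* Register a miss. Returns 0 if merged with an outstanding miss,
 * 1 if a new memory request was issued, -1 if the cache must block. */
int handle_miss(unsigned long block_addr)
{
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return 0;                      /* miss under miss: merge */
    for (int i = 0; i < MSHRS; i++)
        if (!mshr[i].valid) {
            mshr[i] = (mshr_t){1, block_addr};
            return 1;                      /* new outstanding request */
        }
    return -1;                             /* all MSHRs busy: stall */
}

int main(void)
{
    printf("%d\n", handle_miss(0x1000));   /* 1: new request */
    printf("%d\n", handle_miss(0x1000));   /* 0: merged */
    printf("%d\n", handle_miss(0x2000));   /* 1: overlapped second miss */
    return 0;
}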
Value of Hit Under Miss for SPEC
[Figure: Percentage of Memory Stall Time of a Blocking Cache]
• FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16-cycle miss penalty
Reducing Misses by Hardware
Prefetching of Instructions & Data
• Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer” not the cache
– On Access: check both cache and stream buffer
– On SB Hit: move line into cache
– On SB Miss: Clear and refill SB with successive lines
• Works with data blocks too:
– Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4
streams got 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams got
50% to 70% of misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can
be used without penalty
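A minimal sketch in C of the stream-buffer behavior described above (hypothetical 4-entry FIFO of block addresses): a stream-buffer hit hands the line to the cache, while a miss clears the buffer and refills it with the successive lines.

#include <stdio.h>

#define SB_DEPTH 4

static unsigned long sb[SB_DEPTH];   /* block addresses in the stream buffer */
static int sb_valid[SB_DEPTH];

static void sb_refill(unsigned long miss_block)
{
    /* Clear and refill with the lines after the missing block. */
    for (int i = 0; i < SB_DEPTH; i++) {
        sb[i] = miss_block + 1 + i;
        sb_valid[i] = 1;
    }
}

/* Called on a cache miss. Returns 1 on stream-buffer hit. */
int sb_lookup(unsigned long block)
{
    for (int i = 0; i < SB_DEPTH; i++)
        if (sb_valid[i] && sb[i] == block) {
            sb_valid[i] = 0;         /* line moves into the cache */
            return 1;
        }
    sb_refill(block);                /* SB miss: clear and refill */
    return 0;
}

int main(void)
{
    printf("%d\n", sb_lookup(100));  /* 0: miss, buffer now holds 101..104 */
    printf("%d\n", sb_lookup(101));  /* 1: sequential access hits the buffer */
    return 0;
}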
Hardware Prefetching
• What to prefetch?
– One block ahead (spatially)
• What will this work well for?
– Address prediction for non-sequential data
• Correlated predictors (store miss, next_miss pairs in table)
• Jump-pointers (augment data structures with prefetch pointers)
• When to prefetch?
– On every reference
– On a miss (basically doubles block size!)
– When resident data becomes “dead” -- how do we know?
• No one will use it anymore, so it can be kicked out
Reducing Misses by
Software Prefetching Data
• Data Prefetch
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults; a form of speculative
execution
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
• Must be correct address and register!
– Non-Binding prefetch: Load into cache.
• Can be incorrect. Faults?
• Issuing Prefetch Instructions takes time
– Is cost of prefetch issues < savings in reduced misses?
– Higher superscalar reduces difficulty of issue bandwidth
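As a concrete example of a non-binding cache prefetch, GCC and Clang provide the __builtin_prefetch intrinsic; the sketch below prefetches a few iterations ahead of a summation loop. The prefetch distance of 8 is an illustrative guess, not a tuned value.

#include <stdio.h>

#define N 1024
#define DIST 8                       /* prefetch distance: illustrative */

int main(void)
{
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            /* Non-binding prefetch: a hint only, cannot cause a fault. */
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}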
Reducing Hit Times
• Some common techniques/trends
– Small and simple caches
• Pentium III – 16KB L1
• Pentium 4 – 8KB L1
– Pipelined Caches (actually bandwidth increase)
• Pentium – 1 clock cycle I-Cache
• Pentium III – 2 clock cycle I-Cache
• Pentium 4 – 4 clock cycle I-Cache
– Trace Caches
• Beyond spatial locality
• Dynamic sequences of instructions (including taken branches)
Cache Bandwidth
• Superscalars need multiple memory accesses per cycle
• Parallel cache access: more difficult than parallel ALUs
– Caches have state so multiple accesses will affect each other
• “True Multiporting”
– Multiple decoders, read/write wordlines per SRAM cell
– Pipeline a single port by “double pumping” (Alpha 21264)
– Multiple cache copies, like a clustered register file (POWER4)
• Interleaved Multiporting
– Cache divides into banks – two accesses to same bank =>
conflict
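A minimal sketch in C of the interleaving idea (block size and bank count are illustrative): blocks map to banks by low-order block-address bits, and two same-cycle accesses conflict when they land in the same bank.

#include <stdio.h>

#define BLOCK_SIZE 64   /* bytes; illustrative */
#define NUM_BANKS  4    /* illustrative */

static int bank_of(unsigned long addr)
{
    return (addr / BLOCK_SIZE) % NUM_BANKS;   /* interleave by block address */
}

int main(void)
{
    unsigned long a = 0x1000, b = 0x1100;     /* 4 blocks apart: same bank */
    if (bank_of(a) == bank_of(b))
        printf("bank conflict: both map to bank %d\n", bank_of(a));
    else
        printf("no conflict: banks %d and %d\n", bank_of(a), bank_of(b));
    return 0;
}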