Memory Hierarchies (Part 2)

Review: The Memory Hierarchy
Chapter 7
[Memory hierarchy diagram: L1$ (transfers blocks of 8-32 bytes to/from the L2$), L2$ (transfers 1 to 4 blocks to/from Main Memory), Main Memory, Secondary Memory. Inclusive: what is in the L1$ is a subset of what is in the L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.]
Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.

Q2: How do we find it? Use the next low-order memory address bit to determine which cache set (i.e., block address modulo the number of sets in the cache). The two low-order bits define the byte in the word (32-bit words); blocks are one word.

[Main memory diagram: one-word blocks at word addresses 0000xx through 1111xx, where xx is the byte offset.]
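The address split above can be sketched in a few lines. This is an illustrative helper (the name `split_address` is ours), assuming the toy cache described here: 6-bit byte addresses, 2 sets, one-word blocks.

```python
def split_address(addr):
    """Split a 6-bit byte address for the toy 2-set cache above:
    bits [1:0] = byte offset within the 32-bit word,
    bit  [2]   = set index (block address modulo the number of sets),
    bits [5:3] = tag (the high-order 3 bits compared on lookup)."""
    byte_offset = addr & 0b11         # two low-order bits
    set_index   = (addr >> 2) & 0b1   # next low-order bit
    tag         = addr >> 3           # high-order 3 bits
    return tag, set_index, byte_offset

# Word address 1011, byte 0 -> tag 101, set 1, byte offset 00
print(split_address(0b101100))  # (5, 1, 0)
```

Any two addresses that agree in the set-index bit land in the same set; the tag comparison then decides hit or miss.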
Consider the main memory word reference string 0 4 0 4 0 4 0 4.

Direct mapped (words 0 and 4 map to the same cache block):

0 miss, 4 miss (Mem(4) replaces Mem(0)), 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss

8 requests, 8 misses. Ping pong effect due to conflict misses: two memory locations that map into the same cache block repeatedly evict each other.

Two-way set associative (words 0 and 4 map to the same set, but can occupy different ways):

0 miss, 4 miss, 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit

8 requests, 2 misses. This solves the ping pong effect seen in a direct mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!
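The two traces above can be checked with a small simulation. A minimal sketch, assuming hypothetical word-addressed caches sized to match the example (4 one-word blocks total, so words 0 and 4 collide in both designs); all names are ours:

```python
def direct_mapped_misses(refs, num_blocks=4):
    cache = [None] * num_blocks            # block index -> stored word address
    misses = 0
    for word in refs:
        idx = word % num_blocks
        if cache[idx] != word:
            misses += 1
            cache[idx] = word              # evict whatever was there
    return misses

def two_way_misses(refs, num_sets=2):
    sets = [[] for _ in range(num_sets)]   # each set holds up to 2 words, LRU order
    misses = 0
    for word in refs:
        ways = sets[word % num_sets]
        if word in ways:
            ways.remove(word)              # hit: move to MRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)                # evict the LRU way
        ways.append(word)
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print(direct_mapped_misses(refs))  # 8 misses: the ping pong effect
print(two_way_misses(refs))        # 2 misses: 0 and 4 co-exist in one set
```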
A four-way set associative cache: 2^8 = 256 sets, each with four ways (each way holding one block).

[Figure: the 32-bit address (bits 31 ... 0) splits into a 22-bit Tag, an 8-bit Index, and a 2-bit byte offset. The Index selects one of sets 0 through 255; each of the four ways holds a V (valid) bit, a Tag, and 32 bits of Data. The address tag is compared against the tags of all four ways of the selected set in parallel, and the matching way's data is selected.]
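The lookup path in that figure can be sketched as follows. This is an illustrative model, not hardware: `decode`, `install`, and `lookup` are our names, and the "parallel" four-way tag comparison is written as a loop.

```python
NUM_SETS, NUM_WAYS = 256, 4          # 2^8 sets, four ways, as in the figure

def decode(addr):
    """32-bit byte address -> (tag, index, byte_offset)."""
    byte_offset = addr & 0x3         # bits 1:0
    index = (addr >> 2) & 0xFF       # bits 9:2  (8 index bits)
    tag = addr >> 10                 # bits 31:10 (22 tag bits)
    return tag, index, byte_offset

# One (valid, tag) pair per way, for each of the 256 sets.
cache = [[(False, 0)] * NUM_WAYS for _ in range(NUM_SETS)]

def install(addr, way=0):
    tag, index, _ = decode(addr)
    cache[index][way] = (True, tag)

def lookup(addr):
    tag, index, _ = decode(addr)
    # All four ways of the selected set are compared against the tag.
    return any(valid and stored == tag for valid, stored in cache[index])

install(0x00001230)
print(lookup(0x00001230))  # True: same tag and index
print(lookup(0x80001230))  # False: same index, different tag
```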
Increasing associativity all the way gives a fully associative cache (only one set); the tag is then all the address bits except the block and byte offset.

An N-way set associative cache costs:
- N comparators (delay and area)
- a MUX delay (set selection) before data is available
- data available only after set selection (and the Hit/Miss decision), so it is not possible to just assume a hit, continue, and recover later if it was a miss. In a direct mapped cache, by contrast, the cache block is available before the Hit/Miss decision.
The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).
Key cache design parameters (typical L1 values):
- Total size (blocks): 250 to 2000
- Total size (KB): 16 to 64
- Block size (B): 32 to 64
- Miss penalty (clocks): 10 to 25
- Miss rate (global for L2): 2% to 5%
Secondary cache(s) should focus on reducing miss rate, to reduce the penalty of long main memory access times; they are larger, with larger block sizes. The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) but have a higher miss rate. For the L2 cache, hit time is less important than miss rate: the L2$ hit time determines the L1$'s miss penalty. Note that the L2$ local miss rate >> the global miss rate.
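The local-versus-global distinction is easy to see with numbers. A worked sketch using illustrative rates within the "typical" ranges quoted above (the specific values and latencies are our assumptions, not from the slides):

```python
# Illustrative rates: fractions of ALL memory references.
l1_miss_rate   = 0.05   # 5% of references miss in L1
l2_global_miss = 0.02   # 2% of references miss in both L1 and L2

# Local L2 miss rate: of the references that actually reach L2, how many miss?
l2_local_miss = l2_global_miss / l1_miss_rate
print(l2_local_miss)    # ~0.4 -> 40%, far larger than the 2% global rate

# Average memory access time (in clocks), with hypothetical latencies:
l1_hit, l2_hit, mem_penalty = 1, 10, 100
amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * mem_penalty)
print(amat)             # ~3.5 clocks: 1 + 0.05 * (10 + 0.4 * 100)
```

The same L2 looks terrible locally (40% of the requests it sees miss) yet excellent globally (only 2% of all references go to main memory), which is why L2 design targets miss rate rather than hit time.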
1. Reduce the miss rate:
- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- victim cache: a small buffer holding the most recently discarded blocks
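The victim cache idea can be sketched as follows: a tiny fully associative buffer that catches blocks just evicted from a direct mapped cache, turning ping-pong conflict misses into fast "victim hits". Everything here (class name, 4-entry sizing, block-address granularity) is an illustrative assumption.

```python
from collections import deque

class DirectMappedWithVictim:
    def __init__(self, num_blocks=4, victim_entries=4):
        self.cache = [None] * num_blocks
        self.num_blocks = num_blocks
        self.victim = deque(maxlen=victim_entries)  # oldest victims fall off

    def access(self, block_addr):
        """Return 'hit', 'victim-hit', or 'miss' for one block reference."""
        idx = block_addr % self.num_blocks
        if self.cache[idx] == block_addr:
            return "hit"
        evicted = self.cache[idx]
        self.cache[idx] = block_addr                # replace the resident block
        if evicted is not None:
            self.victim.append(evicted)             # discarded block -> victim buffer
        if block_addr in self.victim:               # recently discarded?
            self.victim.remove(block_addr)
            return "victim-hit"                     # served from the victim buffer
        return "miss"                               # fetched from the next level

c = DirectMappedWithVictim()
print([c.access(b) for b in [0, 4, 0, 4]])  # ['miss', 'miss', 'victim-hit', 'victim-hit']
```

After the two cold misses, the conflicting blocks 0 and 4 keep bouncing between the cache and the victim buffer instead of going back to the next memory level.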
[Figure: miss-rate tradeoff curves for design factors such as associativity and block size.]
Read misses (I$ and D$): stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume.

Write misses (D$ only):
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume.
2. Write allocate: just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall (normally used in write-back caches).
3. No-write allocate: skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full; must invalidate the cache block since it will be inconsistent (now holding stale data). No-write allocate is normally used in write-through caches with a write buffer.
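The two allocate policies above can be contrasted in a few lines. A minimal sketch for one-word blocks; the function names, the dict-per-block layout, and the write-buffer list are our illustrative assumptions, not a specific design.

```python
def write_miss_write_allocate(cache, idx, tag, word):
    """Write allocate (write-back style): write the word into the cache,
    updating both tag and data; for one-word blocks no fetch is needed."""
    cache[idx] = {"valid": True, "tag": tag, "data": word, "dirty": True}

def write_miss_no_allocate(cache, idx, tag, word, write_buffer):
    """No-write allocate (write-through style): skip the cache and send the
    word to the write buffer; invalidate the block if the cache holds a
    copy, since that copy would now be stale."""
    write_buffer.append((tag, idx, word))
    if cache[idx]["valid"] and cache[idx]["tag"] == tag:
        cache[idx]["valid"] = False

cache = [{"valid": False, "tag": 0, "data": 0, "dirty": False} for _ in range(4)]
write_miss_write_allocate(cache, 2, 7, 99)   # block installed in set 2, marked dirty
write_buffer = []
write_miss_no_allocate(cache, 0, 3, 42, write_buffer)  # word goes only to the buffer
print(write_buffer)  # [(3, 0, 42)]
```

Neither path stalls the pipeline: write allocate absorbs the write into the cache, while no-write allocate only stalls if the write buffer is full.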