Cache
Cache
Cache
Adapted from lectures notes of Dr. Patterson and Dr. Kubiatowicz of UC Berkeley
Increase performance by having hierarchy of memory subsystems Temporal Locality and Spatial Locality are big ideas
Disadvantage
Mapping is fixed !!!
Associative Caches
Block 12 placed in 8 block cache: Fully associative, direct mapped, 2-way set associative S.A. Mapping = Block Number Modulo Number Sets
Fully associative: block 12 can go anywhere Block no. 01234567 Block no. Direct mapped: block 12 can go only into block 4 (12 mod 8) 01234567 Block no. Set associative: block 12 can go anywhere in set 0 (12 mod 4) 01234567
Block-frame address
Block no.
1111111111222222222233 01234567890123456789012345678901
Valid
:
Adr Tag
Compare
Sel1 1
Mux
0 Sel0
Compare
:
Adr Tag
Compare
Sel1 1
Mux
0 Sel0
Compare
: :
Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block Cold fact of life: not a whole lot you can do about it Note: If you are going to run billions of instruction, Compulsory Misses are insignificant Capacity: Cache cannot contain all blocks access by the program Solution: increase cache size Conflict (collision): Multiple memory locations mapped to the same cache location Solution 1: increase cache size Solution 2: increase associativity Coherence (Invalidation): other process (e.g., I/O) updates memory
Direct indexing (using index and block offset), tag compares, or combination Increasing associativity shrinks index, expands tag
miss
IR IRex A B
Compulsory
Conflict
Compulsory
3Cs Comparison
100% Miss Rate per Type 1-way 80% 60% 40% 20% 0% 1 2 4 8 16 32 64 Compulsory 128 2-way 4-way 8-way Capacity
Conflict
Compiler Optimizations
Compiler Optimizations to reduce miss rate
Loop Fusion Loop Interchange Merge Arrays
Write Buffer
A Write Buffer is needed between the Cache and Memory Processor: writes data into the cache and the write buffer Memory controller: write contents of the buffer to memory Write buffer is just a FIFO:
Eg: 1 clock cycle to send address and data, 15 for each DRAM access
Case 1: (Main memory of one word) 1 + 4 x 15 + 4 x 1 = 65 clk cycles, bytes/clock = 16/65 = 0.25 Case 2: (Main memory of two words) 1 + 2 x 15 + 2 x 1 = 33 clk cycles, bytes/clock = 16/33 = 0.48 Case 3: (Interleaved memory) 1 + 1 x 15 + 4 x 1 = 20 clk cycles, bytes/clock = 16/20 = 0.80
Read Miss
Stall CPU, fetch, Resume
Write Hit
Write Through Write Back
Write Miss
Write entire block into memory, Read into cache
Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty) Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty Different measure: AMAT Average Memory Access time (AMAT) = Hit Time + (Miss Rate x Miss Penalty) Note: memory hit time is included in execution cycles.
Performance [contd]
Suppose a processor executes at Clock Rate = 200 MHz (5 ns per cycle) Base CPI = 1.1 50% arith/logic, 30% ld/st, 20% control Suppose that 10% of memory operations get 50 cycle miss penalty Suppose that 1% of instructions get same miss penalty CPI = Base CPI + average stalls per instruction [ 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)] + [ 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)] = (1.1 + 1.5 + .5) cycle/ins = 3.1 58% of the time the proc is stalled waiting for memory! AMAT=(1/1.3)x[1+0.01x50]+(0.3/1.3)x[1+0.1x50]=2.54
1.1(cycles/ins) +
Summary
The Principle of Locality: Program likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: Locality in Time Spatial Locality: Locality in Space Three (+1) Major Categories of Cache Misses: Compulsory Misses: sad facts of life. Example: cold start misses. Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! Capacity Misses: increase cache size Coherence Misses: Caused by external processors or I/O devices Cache Design Space total size, block size, associativity replacement policy write-hit policy (write-through, write-back) write-miss policy
Summary [contd]
Cache Size
Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back write allocation The optimal choice is a compromise depends on access characteristics workload use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins
Associativity
Block Size
Bad
Factor B
More