ASA Chapter 4
©2019, dce
ADVANCED SYSTEM ARCHITECTURES
Memory Hierarchy Design
Faculty of Computer Science and Engineering (Khoa Khoa học và Kỹ thuật Máy tính)
Department of Computer Engineering (BM Kỹ thuật Máy tính)
BK TP.HCM
Since 1980, CPU has outpaced DRAM

• Performance (1/latency) over time: CPU performance has improved about 60% per year (2X in 1.5 years), far faster than DRAM, producing a growing processor-DRAM gap.
• A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!

Processor-DRAM Performance Gap Impact

• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• Ignoring other factors, the minimum cost of a full memory access, in terms of the number of wasted CPU cycles, is the memory access time divided by the CPU cycle time. (The original slide tabulates, per year: CPU speed in MHz, CPU cycle time in ns, memory access time in ns, and the minimum CPU memory stall cycles or instructions wasted.)
Addressing the Processor-Memory Performance Gap

• Goal: Illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
• Solution: Put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".

Levels of the Memory Hierarchy

(Upper levels are smaller and faster; lower levels are larger and slower. Each level stages data into the one above it in a characteristic transfer unit, managed as noted. Today's focus: the cache.)

Level          Capacity     Access time            Cost                      Staging transfer unit (managed by)
CPU Registers  100s Bytes   <10s ns                -                         Instr. operands, 1-8 bytes (prog./compiler)
Cache          K Bytes      10-100 ns              1-0.1 cents/bit           Blocks, 8-128 bytes (cache cntl)
Main Memory    M Bytes      200-500 ns             0.0001-0.00001 cents/bit  Pages, 512-4K bytes (OS)
Disk           G Bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents/bit   Files, MBytes (user/operator)
Tape           infinite     sec-min                10^-8 cents/bit           -
Common Predictable Patterns

Caches
Simple view of cache

(Diagram: Processor <-> CACHE <-> Main Memory, exchanging addresses and data.)

• The processor accesses the cache first.
• Cache hit: just use the data.
• Cache miss: replace a block in the cache with a block from main memory, then use the data.
• The data transferred between cache and main memory is in blocks, and is controlled by independent hardware.

• Hit rate: fraction of accesses that hit in the cache.
• Miss rate: 1 - hit rate.
• Miss penalty: time to replace a block + time to deliver the data to the processor.
Simple view of cache

• Example: for (i = 0; i < 10; i++) S = S + A[i];
(The slide diagrams memory block numbers 0-31 and which blocks the array elements occupy.)

Replacement
Basic Cache Design & Operation Issues

Direct-Mapped Cache

(Diagram: the block address splits into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one of 2^k lines, each holding a valid bit V, a tag, and a data block; the stored tag is compared (=) with the address tag to produce HIT, and the offset selects the data word or byte.)

Direct-mapped Cache

• Address: N bits (2^N bytes).
• The cache has 2^k lines (blocks).
• Each line has 2^b bytes.
• Block M is mapped to line M % 2^k.
• Need t = N - k - b tag bits to identify the memory block.
• Advantage: simple.
• Disadvantage: high miss rate.
• What if the CPU accesses blocks N0 and N1, and N0 % 2^k = N1 % 2^k?
Direct-mapped Cache

• Access N0, N1 where N0 % 2^k = N1 % 2^k, i.e. the index fields are equal: each access evicts the other block, even if the rest of the cache is empty.

4KB Direct Mapped Cache Example

• 1K = 1024 blocks, each block holding one 32-bit word; can cache up to 2^32 bytes = 4 GB of memory.
• Address (showing bit positions): byte offset = bits 1-0, index field = bits 11-2 (10 bits), tag field = bits 31-12 (20 bits).
• Each entry holds a valid bit, a 20-bit tag, and 32 bits of data; the stored tag is compared with the address tag to produce Hit, and the data is read out.
• Mapping function: cache block frame number = (block address) MOD 1024, i.e. the index field, the 10 low bits of the block address.
64KB Direct Mapped Cache Example

• 4K entries, each block holding 4 words = 128 bits (16 bytes); can cache up to 2^32 bytes = 4 GB of memory.
• Address fields: block address = 28 bits, split into tag = 16 bits and index = 12 bits; block offset = 4 bits (word or byte select).
• The index selects one of the 4K entries; the stored 16-bit tag is compared (=) with the address tag for Hit or miss; a mux uses the block offset to select one of the four 32-bit words as the data output.
• Larger cache blocks take better advantage of spatial locality and thus may result in a lower miss rate.
• Mapping function: cache block frame number = (block address) MOD 4096, i.e. the index field, the 12 low bits of the block address.
Fully associative cache

• CAM: Content Addressable Memory.
• Each block can be mapped to any line in the cache.
• Tag bits: t = N - b; the address tag is compared to the tags of all lines.
• Advantage: replacement occurs only when no room is available.

Set-Associative Cache

(Diagram: the address splits into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one set; each of the w ways in the set holds a valid bit, a tag, and a data block, and all w tags are compared in parallel to produce HIT.)

• Block M is mapped to one of the w lines in set M % 2^k.
• Tag bits: t = N - k - b.
• A set-associative cache requires parallel tag matching and more complex hit logic, which may increase hit time.
• Example (256 sets, 22-bit tag, caching up to 2^32 bytes = 4 GB): mapping function: cache set number = index = (block address) MOD 256.
Q2: How is a block found?

• The Index selects which set to look in.
• Compare the Tag to find the block.
• Increasing associativity shrinks the index and expands the tag. Fully associative caches have no index field.
• Direct-mapped: 1-way set associative?
• Fully associative: 1 set?
• Memory address layout: Block Address (Tag | Index) | Block Offset.

What causes a MISS?

• Three major categories of cache misses:
– Compulsory misses: first access to a block.
– Capacity misses: the cache cannot contain all blocks needed to execute the program.
– Conflict misses: a block is replaced by another block and then later retrieved (affects set-associative or direct-mapped caches).
• Nightmare scenario: the ping-pong effect!
Block Size and Spatial Locality
Q4: What happens on a write?

• Cache hit:
– Write through: write both cache & memory; generally higher traffic, but simplifies cache coherence.
– Write back: write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic.
• Cache miss:
– No write allocate: only write to main memory.
– Write allocate (aka fetch on write): fetch the block into the cache.
• Common combinations:
– Write through and no write allocate.
– Write back with write allocate.

Reading assignment 1

• Cache performance
– Replacement policy (algorithms)
– Optimization (miss rate, penalty, ...)
• Reference
– Hennessy & Patterson, Computer Architecture: A Quantitative Approach.
– www2.lns.mit.edu/~avinatan/research/cache.pdf
– More on the internet.
Reading Log 1

• List some algorithms for the replacement policy. Can they affect cache performance?
• Multi-level caches