ASA Chapter4

The document discusses the growing performance gap between CPUs and memory (DRAM) over time. It describes how caches were introduced to bridge this gap by exploiting the principles of temporal and spatial locality in memory access patterns. Caches provide smaller, faster memory between the CPU and main memory in a memory hierarchy. The goal is to create the illusion of large, fast, and cheap memory for programs.


ADVANCED SYSTEM ARCHITECTURES
Memory Hierarchy Design

Faculty of Computer Science and Engineering (Khoa Khoa học và Kỹ thuật Máy tính)
Department of Computer Engineering (BM Kỹ thuật Máy tính)
BK TP.HCM

Trần Ngọc Thịnh
http://www.cse.hcmut.edu.vn/~tnthinh

©2019, dce

Since 1980, CPU has outpaced DRAM ...

[Figure: processor vs. DRAM performance (1/latency) over time. CPU performance grows ~60% per year (2X in 1.5 years), DRAM only ~9% per year (2X in 10 years); the gap grows ~50% per year.]

A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!

Processor-DRAM Performance Gap Impact

• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• Ignoring other factors, the minimum cost of a full memory access in terms of the number of wasted CPU cycles:

  Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU memory stall cycles or instructions wasted
  1986         8               125                190            190/125  - 1 = 0.5
  1989        33                30                165            165/30   - 1 = 4.5
  1992        60                16.6               120            120/16.6 - 1 = 6.2
  1996       200                 5                 110            110/5    - 1 = 21
  1998       300                 3.33              100            100/3.33 - 1 = 29
  2000      1000                 1                  90            90/1     - 1 = 89
  2002      2000                 0.5                80            80/0.5   - 1 = 159
  2004      3000                 0.333              60            60/0.333 - 1 = 179
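The stall-cycle column is simply the memory access time divided by the CPU cycle time, minus one. A minimal sketch of that arithmetic, using the values of the 2004 row from the table above:

    #include <stdio.h>

    /* Wasted CPU cycles per memory access = memory latency / CPU cycle time - 1 */
    int main(void) {
        double cycle_ns = 0.333;        /* 3000 MHz CPU, 2004 row of the table */
        double mem_access_ns = 60.0;
        double stall_cycles = mem_access_ns / cycle_ns - 1.0;
        printf("stall cycles per access: %.0f\n", stall_cycles);  /* ~179 */
        return 0;
    }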
Addressing the Processor-Memory Performance Gap

• Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
• Solution: put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".

Levels of the Memory Hierarchy

  Level (upper → lower)   Capacity        Access time             Cost                       Staging/transfer unit (managed by)
  CPU registers           100s of bytes   < 10s of ns             -                          instr. operands, 1-8 bytes (prog./compiler)
  Cache                   K bytes         10-100 ns               1-0.1 cents/bit            blocks, 8-128 bytes (cache controller)
  Main memory             M bytes         200-500 ns              0.0001-0.00001 cents/bit   pages, 512 bytes-4 KB (OS)
  Disk                    G bytes         10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit    files, M bytes (user/operator)
  Tape                    infinite        sec-min                 10^-8 cents/bit            -

  Upper levels are smaller and faster; lower levels are larger and cheaper per bit.

Common Predictable Patterns

Two predictable properties of memory references:
• Temporal Locality: if a location is referenced, it is likely to be referenced again in the near future (e.g., loops, reuse).
• Spatial Locality: if a location is referenced, it is likely that locations near it will be referenced in the near future (e.g., straight-line code, array access).

Caches

Caches exploit both types of predictability:
– Exploit temporal locality by remembering the contents of recently accessed locations.
– Exploit spatial locality by fetching blocks of data around recently accessed locations.
Simple view of cache

  Processor <-- Address/Data --> CACHE <-- Address/Data --> Main Memory

• The processor accesses the cache first.
• Cache hit: just use the data.
• Cache miss: replace a block in the cache with a block from main memory, then use the data.
• Data is transferred between the cache and main memory in blocks, controlled by independent hardware.

• Hit rate: the fraction of accesses that hit in the cache.
• Miss rate: 1 – hit rate.
• Miss penalty: the time to replace a block + the time to deliver the data to the processor.
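These definitions combine into the standard average memory access time (AMAT) formula from the Hennessy & Patterson reference; it is not stated on the slide itself, so the sketch below is illustrative:

    /* AMAT = hit time + miss rate * miss penalty (all in the same unit).
     * Standard formula; not stated explicitly on this slide. */
    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }
    /* Example: 1-cycle hit, 5% miss rate, 100-cycle penalty -> 1 + 0.05 * 100 = 6 cycles. */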

Simple view of cache

• Example: for (i = 0; i < 10; i++) S = S + A[i];
• No cache: at least 12 accesses to main memory (10 for A[i], plus a read of S and a write of S).
• With a cache: if A[i] and S fit in a single block (e.g., 32 bytes), 1 access to load the block into the cache and 1 access to write the block back to main memory.
• Access to S: temporal locality.
• Access to A[i]: spatial locality.

Replacement

[Figure: main memory holds blocks 0-31, but the cache holds only 4 blocks; the CPU needs a block that is not currently in the cache.]

• The cache cannot hold all blocks.
• Replace a block with another block that is currently needed by the CPU.
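Written out as a complete fragment (the array contents are assumed), the slide's loop shows both kinds of locality:

    int A[10];            /* assume A is initialized elsewhere */
    int S = 0;
    for (int i = 0; i < 10; i++)
        S = S + A[i];     /* A[i]: spatial locality (consecutive addresses); S: temporal locality (reused every iteration) */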
Basic Cache Design & Operation Issues

• Q1: Where can a block be placed in the cache?
  (Block placement strategy & cache organization)
  – Fully Associative, Set Associative, Direct Mapped.
• Q2: How is a block found if it is in the cache?
  (Block identification)
  – Tag/Block.
• Q3: Which block should be replaced on a miss?
  (Block replacement)
  – Random, LRU, FIFO.
• Q4: What happens on a write?
  (Cache write policy)
  – Write through, write back.

Q1: Where can a block be placed?

[Figure: memory holds blocks 0-31; the cache holds 8 blocks (4 sets of 2 in the 2-way case). Block 12 can be placed:]
• Fully associative: anywhere in the cache.
• (2-way) set associative: anywhere in set 0 (12 mod 4).
• Direct mapped: only into block 4 (12 mod 8).

Direct-Mapped Cache

[Figure: the address is split into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one of the 2^k lines (valid bit, tag, data block); the stored tag is compared with the address tag, and a match with V = 1 is a HIT that returns the selected data word or byte.]

• Address: N bits (2^N bytes of memory).
• The cache has 2^k lines (blocks).
• Each line holds 2^b bytes.
• Block M is mapped to line M % 2^k.
• Need t = N - k - b tag bits to identify the memory block.
• Advantage: simple.
• Disadvantage: high miss rate.
• What if the CPU accesses blocks N0 and N1, where N0 % 2^k = N1 % 2^k?
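A minimal sketch of the tag/index/offset split described above, assuming 32-bit addresses; the macro values (32-byte lines, 1024 lines) and names are illustrative, not from this slide:

    #include <stdint.h>

    #define OFFSET_BITS 5    /* b: 2^5 = 32-byte lines (assumed) */
    #define INDEX_BITS  10   /* k: 2^10 = 1024 lines (assumed)   */

    static inline uint32_t block_offset(uint32_t addr) {
        return addr & ((1u << OFFSET_BITS) - 1);
    }
    static inline uint32_t cache_index(uint32_t addr) {   /* line M % 2^k */
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }
    static inline uint32_t cache_tag(uint32_t addr) {     /* t = N - k - b bits */
        return addr >> (OFFSET_BITS + INDEX_BITS);
    }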

Direct-mapped Cache

[Figure: memory holds blocks 0-31; the direct-mapped cache holds 8 blocks.]
• Access blocks N0 and N1 where N0 % 2^k = N1 % 2^k, i.e., they share the same index field (the low bits of the block address).
• A block gets replaced even while many other lines are still free!
• Hit or miss?

4KB Direct Mapped Cache Example

• 1K = 1024 blocks; each block = one word (4 bytes).
• Can cache up to 2^32 bytes = 4 GB of memory.
• Address (32 bits): Tag = 20 bits, Index = 10 bits, Byte offset = 2 bits; Block address = 30 bits.
  (With 4-byte blocks the byte offset is log2(4) = 2 bits, with 1024 lines the index is log2(1024) = 10 bits, and the tag is the remaining 32 - 10 - 2 = 20 bits.)
• Mapping function: cache block frame number = (Block address) MOD (1024), i.e., the index field, the 10 low bits of the block address.
[Figure: the 10-bit index selects one of 1024 entries (valid bit, 20-bit tag, 32-bit data word); a tag match with V = 1 signals a hit.]

64KB Direct Mapped Cache Example

• 4K = 4096 blocks; each block = four words = 16 bytes.
• Can cache up to 2^32 bytes = 4 GB of memory.
• Address (32 bits): Tag = 16 bits, Index = 12 bits, Block offset = 4 bits (word and byte select); Block address = 28 bits.
• Mapping function: cache block frame number = (Block address) MOD (4096), i.e., the index field, the 12 low bits of the block address.
• Larger cache blocks take better advantage of spatial locality and thus may result in a lower miss rate.
• Hit or miss?
[Figure: the 12-bit index selects one of 4K entries (valid bit, 16-bit tag, 128-bit data block); on a hit, a multiplexer uses the block offset to select the requested word.]

Fully Associative Cache

[Figure: every line (valid bit, tag, data block) is compared in parallel against the address tag; the matching line drives HIT, and a multiplexer uses the block offset to select the requested word or byte.]
• CAM: Content Addressable Memory.
• Each block can be mapped to any line in the cache.
• Tag bits: t = N - b, compared against the tags of all lines.
• Advantage: replacement occurs only when no free lines are available.
• Disadvantage: resource consumption and delay from comparing against many entries.

Set-Associative Cache

[Figure: the address is split into Tag (t bits), Index (k bits), and Block Offset (b bits); the index selects a set, the tag is compared in parallel against every way in that set, and the matching way supplies the requested word or byte (HIT).]

W-way Set-associative Cache

• A balance between a direct-mapped cache and a fully associative cache.
• The cache has 2^k sets.
• Each set has w lines.
• Block M is mapped to one of the w lines in set M % 2^k.
• Tag bits: t = N - k - b.
• A set-associative cache requires parallel tag matching and more complex hit logic, which may increase hit time.
• Currently widely used (Intel, AMD, ...).

4K Four-Way Set Associative Cache: MIPS Implementation Example

• 1024 block frames; each block = one word.
• 4-way set associative: 1024 / 4 = 256 sets.
• Can cache up to 2^32 bytes = 4 GB of memory.
• Address (32 bits): Tag = 22 bits, Index = 8 bits, Block offset = 2 bits; Block address = 30 bits.
• Mapping function: cache set number = index = (Block address) MOD (256).
[Figure: the 8-bit index selects a set; the 22-bit tag is compared against all four ways in parallel, and a 4-to-1 multiplexer selects the data from the hitting way.]
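A minimal lookup sketch for the set-associative organization above, using the 4-way, 256-set, one-word-block parameters of the MIPS example; the structure and function names are illustrative, not from the slides:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS        4
    #define SETS        256   /* 2^k sets                  */
    #define OFFSET_BITS 2     /* b: 4-byte (1-word) blocks */
    #define INDEX_BITS  8     /* k */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[SETS][WAYS];

    /* Returns true on a hit and copies the word into *out. */
    bool cache_lookup(uint32_t addr, uint32_t *out) {
        uint32_t set = (addr >> OFFSET_BITS) & (SETS - 1);
        uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
        for (int w = 0; w < WAYS; w++) {      /* hardware compares all ways in parallel */
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *out = cache[set][w].data;
                return true;                  /* hit */
            }
        }
        return false;                         /* miss: block must be fetched from memory */
    }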

Q2: How is a block found?

• The index selects which set to look in.
• Compare the tag to find the block.
• Increasing associativity shrinks the index and expands the tag; fully associative caches have no index field.
• Direct-mapped: 1-way set associative?
• Fully associative: 1 set?

  Memory address:  | Tag | Index | Block Offset |
                   |<- Block Address ->|

What causes a MISS?

• Three major categories of cache misses:
  – Compulsory misses: the first access to a block.
  – Capacity misses: the cache cannot contain all the blocks needed to execute the program.
  – Conflict misses: a block is replaced by another block and then retrieved again later (affects set-associative or direct-mapped caches).
• Nightmare scenario: the ping-pong effect!

Block Size and Spatial Locality

• A block is the unit of transfer between the cache and memory.
  [Figure: a 4-word block (b = 2) stores Tag | Word0 | Word1 | Word2 | Word3.]
• Split the CPU address into a block address (32 - b bits) and a block offset (b bits); 2^b = block size, a.k.a. line size (in bytes).
• Larger block sizes have distinct hardware advantages:
  – less tag overhead
  – exploit fast burst transfers from DRAM
  – exploit fast burst transfers over wide buses
• What are the disadvantages of increasing block size?
  – Fewer blocks => more conflicts. Can waste bandwidth.

Q3: Which block should be replaced on a miss?

• Easy for direct-mapped caches.
• Set associative or fully associative:
  – Random
  – Least Recently Used (LRU) (a minimal sketch follows this list)
    • LRU cache state must be updated on every access.
    • A true implementation is only feasible for small sets (2-way, 4-way).
    • A pseudo-LRU binary tree is often used for 4-8 ways.
  – First In, First Out (FIFO), a.k.a. Round-Robin
    • Used in highly associative caches.
• The replacement policy has only a second-order effect, since replacement happens only on misses.
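The "true LRU for small sets" point above can be sketched directly; the per-set age counters and function names below are illustrative, not from the slides:

    #include <stdint.h>

    #define WAYS 4

    /* True-LRU bookkeeping for one set: age[w] = 0 means most recently used
     * (initialize age[w] = w so the ages start out distinct). */
    struct set_lru { uint8_t age[WAYS]; };

    /* Call on every access that hits (or fills) way 'used'. */
    void lru_touch(struct set_lru *s, int used) {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[used])
                s->age[w]++;          /* ways younger than 'used' age by one */
        s->age[used] = 0;             /* 'used' becomes most recently used */
    }

    /* On a miss, the way with the largest age is the victim. */
    int lru_victim(const struct set_lru *s) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;
    }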

Q4: What happens on a write?

• Cache hit:
  – Write through: write both the cache and memory.
    • Generally higher traffic, but simplifies cache coherence.
  – Write back: write the cache only (memory is written only when the entry is evicted); see the sketch after the reading list below.
    • A dirty bit per block can further reduce the traffic.
• Cache miss:
  – No write allocate: only write to main memory.
  – Write allocate (a.k.a. fetch on write): fetch the block into the cache, then write.
• Common combinations:
  – write through and no write allocate
  – write back with write allocate

Reading Assignment 1

• Cache performance
  – Replacement policy (algorithms)
  – Optimization (miss rate, penalty, ...)
• References
  – Hennessy & Patterson, Computer Architecture: A Quantitative Approach
  – www2.lns.mit.edu/~avinatan/research/cache.pdf
  – More on the internet
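The write-through vs. write-back distinction above, sketched for a single cache line; the structure fields and the mem_write interface are assumed for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag, data; };

    void mem_write(uint32_t addr, uint32_t value);   /* assumed memory interface */

    /* Write hit, write-through: update both the cache line and memory. */
    void write_hit_through(struct line *l, uint32_t addr, uint32_t value) {
        l->data = value;
        mem_write(addr, value);
    }

    /* Write hit, write-back: update only the cache line and mark it dirty;
     * memory is updated later, when the line is evicted. */
    void write_hit_back(struct line *l, uint32_t value) {
        l->data  = value;
        l->dirty = true;
    }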

Reading Log 1

• List some replacement policy algorithms. Can they affect cache performance?
• What is cache performance?
• What criteria can affect cache performance?
• Explain an optimization technique for each criterion.
• Multi-level caches
