The Memory Hierarchy
"Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available." (Burks, Goldstine, and von Neumann, 1946)
[Figure: the memory hierarchy. The level closest to the CPU is the fastest, smallest, and has the highest cost per bit; the level farthest from the CPU is the slowest, biggest, and has the lowest cost per bit.]
The Cache
Cache: a safe place for hiding or storing things. (Webster's Dictionary)
The level between main memory and the CPU was called the cache in a 1968 paper describing the IBM 360/85 (the first commercial machine with a cache). Nowadays all levels between main memory and the CPU are called caches.
Let's look at a simple cache in which the processor accesses a word at a time and the block size is 1 word. Suppose the processor requests the word Xn. It isn't in the cache, which results in a cache miss. The word Xn is brought from memory into the cache.
[Figure: the cache before and after the reference to Xn. Before the reference the cache holds X1, X2, X3, X4, Xn-1, and Xn-2; after the miss, Xn has been added.]
[Figure: a direct-mapped cache with eight entries indexed 000 to 111. Each memory address maps to the entry selected by its low-order index bits; for example, address 11101 maps to entry 101.]
Tags
If each cache location is mapped to several memory locations, how do we know whether the data in the cache corresponds to the data requested from memory?
We add a tag to each block. The tag contains the upper bits of the address, the ones not used to index the cache.
We also need a way to determine whether the cache holds valid information. At start-up the cache will hold garbage. To avoid using it by mistake, a valid bit is added to each block. If the bit isn't set, there can't be a match.
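The tag/index/valid-bit check can be made concrete with a small sketch. This is a minimal Python illustration, not code from the text; the 1024-entry size and the bit widths are illustrative assumptions.

```python
# Minimal sketch of a direct-mapped cache lookup with 1-word blocks.
# Illustrative parameters: 1024 entries, 32-bit addresses, 4-byte words.

NUM_ENTRIES = 1024             # 2^10 entries -> 10 index bits
INDEX_BITS = 10
BYTE_OFFSET_BITS = 2           # low 2 bits select a byte within the word

valid = [False] * NUM_ENTRIES  # everything starts as garbage
tags = [0] * NUM_ENTRIES
data = [0] * NUM_ENTRIES

def split_address(addr):
    """Split an address into (tag, index); the byte offset is dropped."""
    index = (addr >> BYTE_OFFSET_BITS) & (NUM_ENTRIES - 1)
    tag = addr >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index

def lookup(addr):
    """Hit only if the entry is valid AND its tag matches; without the
    valid bit, start-up garbage could match a tag by accident."""
    tag, index = split_address(addr)
    if valid[index] and tags[index] == tag:
        return True, data[index]
    return False, None
```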
[Figure: a 1024-entry direct-mapped cache. The 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of the 1024 entries; the stored tag is compared with the address tag to produce the hit signal, and the 32-bit data word is read out.]
Cache Size
Assuming a 32-bit address, a direct-mapped cache of size 2^n words with 1-word (4-byte) blocks will need a tag field of size 32 - (n + 2), because 2 bits are the byte offset and n bits are used for the index.
The total number of bits in a direct-mapped cache is
2^n * (32 + (32 - n - 2) + 1) = 2^n * (63 - n).
How many total bits are needed for a direct-mapped cache with 64KB of data in 1-word blocks?
64KB = 16K words = 2^14 words = 2^14 blocks. Each block has 32 bits of data, 32 - 14 - 2 = 16 bits of tag, and a valid bit. Thus the total cache size is 2^14 * (32 + (32 - 14 - 2) + 1) = 2^14 * 49 bits, or about 98KB. The size of the cache is about 50% more than the size of the data it holds (for this configuration).
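As a sanity check, the arithmetic can be reproduced in a few lines of Python (a sketch of the formula above, nothing more):

```python
def total_bits(n):
    """Direct-mapped cache, 1-word blocks, 32-bit address:
    2^n blocks, each with 32 data bits, (32 - n - 2) tag bits,
    and 1 valid bit."""
    return 2**n * (32 + (32 - n - 2) + 1)

bits = total_bits(14)   # 64 KB of data = 2^14 one-word blocks
print(bits)             # 802816 bits = 2^14 * 49
print(bits / 8 / 1024)  # 98.0 KB, ~50% more than the 64 KB of data
```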
[Figure: the 64KB direct-mapped cache of the example: a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset select among 16K entries, each holding a valid bit, a 16-bit tag, and 32 bits of data; the tag comparison produces the hit signal.]
Writes in a Cache
Writes work differently. On a store instruction we write the data into the D-cache; now main memory has a different value from the cache. The cache and memory are inconsistent.
The simple way to keep consistency is to write the data both to memory and to the cache. This is called write-through.
On a write miss there is no reason to read the block from memory. We can just overwrite the data in the cache block and change the tag. In fact we can do this for a write hit as well. Thus in the DECStation 3000 a write works like this (see the sketch below):
1. Index the cache using bits 15-2 of the address.
2. Write bits 31-16 of the address into the tag, write the data value, and set the valid bit.
3. Write the word to main memory.
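A sketch of that write path in Python. The cache arrays and the memory object with a write method are assumptions for illustration; only the bit fields come from the text.

```python
def write_through_store(addr, value, cache, memory):
    """Write-through store as described above: overwrite the cache
    entry whether it was a hit or a miss, then update memory."""
    index = (addr >> 2) & 0x3FFF  # bits 15-2: 14-bit index
    tag = addr >> 16              # bits 31-16: 16-bit tag
    cache.tags[index] = tag       # no need to check for a hit first
    cache.data[index] = value
    cache.valid[index] = True
    memory.write(addr, value)     # keep memory consistent
```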
Write Buffers and Write-Back Caches
A write buffer holds data while it waits to be written to memory; the data in the write buffer is written to memory in parallel with the CPU continuing computation. This cuts down the penalty of the write-through scheme.
If the write buffer is full, the CPU is stalled until data in the buffer is written to memory. In the DECStation 3000 the write buffer can hold 4 blocks.
The alternative to write-through is write-back. When a write occurs it is written into the cache only. Only when the block is replaced is the data written into main memory. Write-back caches are better when the CPU generates writes faster than the main memory can handle them.
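For contrast, a write-back store might look like the following sketch. The dirty bit and the helper names (split, block_addr) are assumptions, since the text only describes the policy:

```python
def write_back_store(addr, value, cache, memory):
    """Write-back store: write to the cache only, and flush a modified
    (dirty) block to memory only when it is about to be replaced."""
    tag, index = cache.split(addr)
    evicting = cache.valid[index] and cache.tags[index] != tag
    if evicting and cache.dirty[index]:
        # The old block was modified; write it back before replacing it.
        memory.write(cache.block_addr(index), cache.data[index])
    cache.tags[index] = tag
    cache.data[index] = value
    cache.valid[index] = True
    cache.dirty[index] = True  # the memory update is deferred
```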
The miss rates for 2 popular programs on the DECStation are:

Program   I miss rate   D miss rate   Combined miss rate
gcc       6.1%          2.1%          5.4%
spice     1.2%          1.3%          1.2%
[Figure: a 64KB direct-mapped cache with 4-word (16-byte) blocks: a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset select among 4K entries of 128 data bits each; a multiplexor picks the requested 32-bit word out of the block.]
Moving from 1-word to 4-word blocks lowers the miss rates of both programs:

            gcc (1-word)   gcc (4-word)   spice (1-word)   spice (4-word)
D-cache     2.1%           1.7%           1.3%             0.6%
Combined    5.4%           1.9%           1.2%             0.4%
[Figure: miss rate versus block size (16 to 256 bytes) for cache sizes from 1 KB to 256 KB.]
The miss rate rises if the block size is too large, as not all the data in a block is used, so space in the cache is wasted.
For a cache block of 4 words and a 1-word-wide memory bank, the miss penalty is 1 + 4*15 + 4*1 = 65 cycles (1 cycle to send the address, 15 cycles per memory access, and 1 cycle to transfer each word).
If we widen the memory and the buses between memory and cache, we can reduce the miss penalty. For a 2-word-wide memory the miss penalty is 1 + 2*15 + 2*1 = 33 cycles. For a 4-word-wide memory the miss penalty is 1 + 1*15 + 1 = 17 cycles. But we pay the cost of a wide memory and wide buses.
Interleaved Memory
The third option interleaves memory into multiple banks. Sequential addresses are in sequential banks. We can read 4 words at the same time, but send the data to the cache one word at a time. When using interleaved memory the miss penalty is 1 + 1*15 + 4*1 = 20 cycles. Using banks is also helpful with write-through caches: the write bandwidth is quadrupled.
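All three miss penalties follow from the same three cost terms. Here is a small Python calculator reproducing them, using the cycle counts assumed in the arithmetic above:

```python
SEND_ADDR, ACCESS, TRANSFER, BLOCK_WORDS = 1, 15, 1, 4

def wide_memory(width_words):
    """Memory and bus 'width_words' wide: fewer but wider accesses."""
    chunks = BLOCK_WORDS // width_words
    return SEND_ADDR + chunks * ACCESS + chunks * TRANSFER

def interleaved(banks):
    """Banks overlap their accesses, but words still cross the
    one-word-wide bus one at a time."""
    return SEND_ADDR + (BLOCK_WORDS // banks) * ACCESS + BLOCK_WORDS * TRANSFER

print(wide_memory(1), wide_memory(2), wide_memory(4))  # 65 33 17
print(interleaved(4))                                  # 20
```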
[Figure: three memory organizations: (a) one-word-wide memory and bus; (b) a wide memory and bus with a multiplexor between the cache and the CPU; (c) four interleaved memory banks on a one-word-wide bus.]
Cache Associativity
We have seen one mapping scheme: direct mapped, where each memory location can be in only one cache location.
On the other extreme is a fully associative cache. Each memory location can be in any cache block. To find a block we must compare the tags of all blocks with the memory address in parallel.
The middle range is called set associative. A block can be mapped to a fixed number of locations (a set). The tags in the set are compared in parallel to the memory address.
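One way to see the difference between the three schemes is to ask which blocks must be searched for a given address. A small sketch follows; the 8-block cache and the blocks-laid-out-by-set layout are illustrative assumptions.

```python
NUM_BLOCKS = 8

def candidate_blocks(block_addr, ways):
    """Blocks whose tags must be compared in parallel for 'block_addr'.
    ways = 1: direct mapped; ways = NUM_BLOCKS: fully associative;
    anything in between: set associative."""
    num_sets = NUM_BLOCKS // ways
    set_index = block_addr % num_sets
    return [set_index * ways + w for w in range(ways)]

print(candidate_blocks(12, 1))  # direct mapped: exactly one block -> [4]
print(candidate_blocks(12, 2))  # two-way: the 2 blocks of one set -> [0, 1]
print(candidate_blocks(12, 8))  # fully associative: all 8 blocks
```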
[Figure: locating a block under the three schemes in a cache with eight blocks: direct mapped searches a single block, two-way set associative searches the two blocks of one set in parallel, and fully associative searches all eight blocks in parallel.]
[Figure: a four-way set-associative cache with 256 sets: a 22-bit tag and an 8-bit index; the four tags of the selected set are compared in parallel, and on a hit a 4-to-1 multiplexor selects the requested 32-bit data word.]
[Figure: miss rate versus associativity (one-way to eight-way) for cache sizes from 1 KB to 128 KB.]
The 3 Cs
Cache misses can be divided into 3 categories:
Compulsory misses: misses caused because the block was never in the cache. These are also called cold-start misses. Increasing the block size reduces compulsory misses, but too large a block size can cause capacity misses and increases the miss penalty.
Capacity misses: misses caused when the cache can't contain all the blocks needed during a program's execution. The block was in the cache, was replaced, and now is needed again. Enlarging the cache reduces capacity misses, but enlarging it too much raises the access time.
Conflict misses: occur in direct-mapped and set-associative caches. Multiple blocks compete for the same set. Increasing associativity reduces conflict misses, but too high an associativity increases the access time.