Memory Lecture
Spring 2020
Digital Design and Integrated Circuits
Instructor: John Wawrzynek
Lecture 16: Memory Circuits and Blocks
EE141
Outline
❑ Memory Circuits
❑ SRAM
❑ DRAM
❑ Memory Blocks
❑ Multi-ported RAM
❑ Combining Memory blocks
❑ FIFOs
❑ FPGA memory blocks
❑ Caches
❑ Memory Blocks in the Project
First, Some Memory Classifications:
• Hardwired (read-only memory, ROM)
• Programmable
• Volatile
• SRAM - uses positive feedback (and restoration) to
hold state
• DRAM - uses capacitive charge (only) to hold state
• Non-volatile
• Persistent state without power supplied
• Ex: Flash Memory
Memory Circuits
Volatile Storage Mechanisms
[Figure: storage elements with data input D, output Q, and clock CLK]
Generic Memory Block Architecture
❑ Word lines used to select a
row for reading or writing
❑ Bit lines carry data to/from
periphery
❑ Keep the core aspect ratio
close to 1 to help balance
delay on word line versus
bit line
❑ Address bits are divided
between the two decoders
❑ Row decoder used to
select word line
❑ Column decoder used to
select one or more columns
for input/output of data
6-Transistor CMOS SRAM Cell
[Figure: 6T SRAM cell schematic. Word line WL, bit lines BL and BL_B, transistors M1-M6, storage nodes Q and Q_B, supply VDD]
Memory Cells
[Figure: memory cells with Enable signals and complementary data lines D and D_B]
Complementary data values are
written (read) from two sides
6T-SRAM — Older Layout Style
[Figure: 6T SRAM cell layout showing BL, BLB, VDD, GND, and WL]
Bitlines: M2 (purple)
Wordline: poly-silicon (red)
Modern SRAM
❑ ST/Philips/Motorola
[Figure: modern SRAM cell layout with the access transistor, word lines, and bit lines BL and BL_B labelled]
SRAM read/write operations
SRAM Operation - Read
1. Bit lines are “pre-charged” to VDD
2. Word line is driven high (pre-charger is turned off)
3. Cell pulls down one bit line
4. Differential sensing circuit on periphery is activated to capture value on bit lines.
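[Figure: SRAM cell during a read, with WL high, storage node Q, and bit lines BL and BL_B; the two sides of the cell hold 0 and 1]
During a read, Q gets pulled up slightly when WL first goes high, but with correctly sized transistors, reading the cell does not destroy the stored value.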
SRAM Operation - Write
1. Column driver circuit on periphery differentially drives the bit
lines
2. Word line is driven high (column driver stays on)
3. One side of cell is driven low, flips the other side
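[Figure: SRAM cell during a write, with WL high, storage node Q_b, and bit lines BL and BL_B; one side transitions 1-0 and the other 0-1]
For a successful write, the access transistor needs to overpower the cell pull-up.
The transistors are sized to allow this to happen.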
Memory Periphery
Periphery
❑ Decoders
❑ Sense Amplifiers
❑ Input/Output Buffers
❑ Control / Timing Circuitry
Row Decoder
• L total address bits
• K for column decoding
• L-K for row decoding
• Row decoder expands L-K address lines into $2^{L-K}$ word lines
In this case: L=13 total address bits ($2^L$ = 8K), K=5 ($2^K$ = 32), L-K=8 ($2^{L-K}$ = 256)
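The address split in this example can be sketched in a few lines of Python. This is only an illustration: the function name `split_address` is hypothetical, and the choice to send the low K bits to the column decoder (rather than the high bits) is an assumption the slide does not state.

```python
def split_address(addr: int, total_bits: int = 13, column_bits: int = 5):
    """Split a word address into row and column select fields.

    Assumption: the low `column_bits` bits feed the column decoder and the
    remaining high bits feed the row decoder.
    Returns (row, column, number_of_word_lines, number_of_columns).
    """
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address out of range")
    row_bits = total_bits - column_bits          # L - K
    column = addr & ((1 << column_bits) - 1)     # low K bits
    row = addr >> column_bits                    # high L - K bits
    return row, column, 1 << row_bits, 1 << column_bits


row, column, word_lines, columns = split_address(0x1ABC)
print(row, column, word_lines, columns)
```

With the defaults L=13 and K=5, the row decoder drives 256 word lines and the column decoder selects among 32 columns, matching the numbers on the slide.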
Row Decoders
(N)AND Decoder
NOR Decoder
Predecoders
[Figure: decoder gates, each taking one true or complemented form of every address bit a5 ... a0]
❑ Use a single gate for each of the shared terms
▪ E.g., from $a_1$, $\bar{a}_1$, $a_0$, $\bar{a}_0$ generate four signals:
▪ $a_1 a_0$, $a_1 \bar{a}_0$, $\bar{a}_1 a_0$, $\bar{a}_1 \bar{a}_0$
❑ In other words, we are decoding smaller groups of address bits first
▪ And using the “predecoded” outputs to do the rest of the decoding
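As a rough behavioral sketch of the predecoding idea, the Python below builds a 6-to-64 decoder from three 2-bit predecoders followed by 3-input ANDs. The grouping of address bits into pairs (a1 a0, a3 a2, a5 a4) and the function names are assumptions made for illustration, not a description of a particular circuit.

```python
def predecode_pair(hi: int, lo: int) -> list:
    """2-to-4 predecoder: one-hot over the four products of (hi, lo)."""
    index = (hi << 1) | lo
    return [1 if i == index else 0 for i in range(4)]


def decode6(addr: int) -> list:
    """6-to-64 decoder built from three predecoders plus 3-input ANDs."""
    bits = [(addr >> i) & 1 for i in range(6)]   # bits[0] is a0
    p0 = predecode_pair(bits[1], bits[0])        # terms in a1, a0
    p1 = predecode_pair(bits[3], bits[2])        # terms in a3, a2
    p2 = predecode_pair(bits[5], bits[4])        # terms in a5, a4
    # Word line w is the AND of one output from each predecoder.
    return [p2[w >> 4] & p1[(w >> 2) & 3] & p0[w & 3] for w in range(64)]


word_lines = decode6(0b101101)
print(word_lines.index(1))
```

Each final gate needs only three inputs instead of six, and the twelve predecoded signals are shared by all 64 word-line gates.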
Predecoder and Decoder
[Figure: address bits A0-A5 feed the predecoders, whose outputs drive the final decoder]
Column “Decoder”
❑ Is basically a multiplexer
❑ Each row contains $2^K$ words, each M bits wide.
❑ The bits of the $2^K$ words are interleaved
❑ Ex: K=2, M=8
d7c7b7a7d6c6b6a6d5c5b5a5d4c4b4a4d3c3b3a3d2c2b2a2d1c1b1a1d0c0b0a0
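The interleaved layout for K=2, M=8 can be reproduced with a short Python sketch. The helper names `interleaved_row` and `column_mux` are hypothetical, and the model treats the column "decoder" as one $2^K$-to-1 multiplexer per output bit.

```python
def interleaved_row(k: int = 2, m: int = 8) -> list:
    """Return the left-to-right column labels of one row.

    Words are named a, b, c, ...; bit i of word w is labelled "<w><i>".
    """
    words = [chr(ord("a") + w) for w in range(1 << k)]
    # Most significant bit on the left; within a bit group the last word
    # comes first, matching the ordering shown in the example row.
    return [f"{w}{bit}" for bit in reversed(range(m)) for w in reversed(words)]


def column_mux(row_values: dict, select: int, m: int = 8) -> int:
    """Pick word `select` out of a row: one mux per output bit."""
    word = chr(ord("a") + select)
    return sum(row_values[f"{word}{bit}"] << bit for bit in range(m))


print("".join(interleaved_row()))
```

Interleaving keeps the M bits of any one word spread evenly across the row, so each output bit's multiplexer only has to choose among $2^K$ adjacent columns.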
4-input pass-transistor based Column Decoder (for read)
[Figure: address bits A0 and A1 are decoded into select signals S0-S3, which gate pass transistors on bit lines BL0-BL3]
(An actual circuit would use differential signaling.)
Sense Amplifiers
$\tau_p = \frac{C \cdot \Delta V}{I_{av}}$
C is large and $I_{av}$ is small, so make $\Delta V$ as small as possible.
Differential Sense Amplifier
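[Figure: differential sense amplifier. Input transistors M1 and M2 are driven by bit and its complement, M3 and M4 connect to VDD, M5 is gated by the sense-enable signal SE, and the output is Out]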
Differential Sensing ― SRAM
DRAM
3-Transistor DRAM Cell
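[Figure: 3T DRAM cell schematic and waveforms. Write word line WWL, read word line RWL, bit lines BL1 and BL2, transistors M1-M3, storage node X with capacitance CS. Waveform annotations: X at VDD - VT; BL2 at VDD - VT with a swing ΔV]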
Latch-Based Sense Amplifier (DRAM)
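[Figure: latch-based sense amplifier between BL and BL_B, with supply VDD, equalize signal EQ, and sense-enable signal SE]
• Initialized at its metastable point with EQ
• Once an adequate voltage gap is created, the sense amp is enabled with SE
• Positive feedback quickly forces the output to a stable operating point.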
Memory Blocks
❑ Multi-ported RAM
❑ Combining Memory blocks
❑ FIFOs
❑ FPGA memory blocks
❑ Caches
❑ Memory Blocks in the Project
Multi-ported memory
Memory Architecture Review
❑ Word lines used to select a
row for reading or writing
❑ Bit lines carry data to/from
periphery
❑ Keep the core aspect ratio
close to 1 to help balance
delay on word line versus
bit line
❑ Address bits are divided
between the two decoders
❑ Row decoder used to
select word line
❑ Column decoder used to
select one or more columns
for input/output of data
Multi-ported Memory
❑ Motivation:
▪ Consider CPU core register file:
– 1 read or write per cycle limits processor performance.
– Complicates pipelining. Difficult for different instructions to simultaneously read or write regfile.
– Common arrangement in pipelined CPUs is 2 read ports and 1 write port.
▪ I/O data buffering:
– Dual-porting allows both sides (CPU and disk or network interface) to simultaneously access memory at full bandwidth.
[Figure: dual-port memory with two ports, a and b, each with address (Aa, Ab), data in (Dina, Dinb), data out (Douta, Doutb), and write enable (WEa, WEb); a data buffer sits between the CPU and the disk or network interface]
Dual-ported Memory Internals
❑ Add decoder, another set of read/write logic, bit lines, word lines.
• Example cell: SRAM
[Figure: dual-ported SRAM cell with two word lines, WL1 and WL2]
Combining Memory
Blocks
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the width. Example: given 1Kx8, want 1Kx16
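A behavioral sketch of the width cascade, assuming the usual arrangement in which both 1Kx8 blocks share the address and write enable and each stores one byte of the 16-bit word. The class names are hypothetical.

```python
class Ram1Kx8:
    """Behavioral model of a 1K x 8 RAM block (hypothetical helper)."""

    def __init__(self):
        self.cells = [0] * 1024

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = data & 0xFF

    def read(self, addr: int) -> int:
        return self.cells[addr]


class Ram1Kx16:
    """Two 1Kx8 blocks side by side, sharing address and write enable."""

    def __init__(self):
        self.lo = Ram1Kx8()   # holds data bits 7..0
        self.hi = Ram1Kx8()   # holds data bits 15..8

    def write(self, addr: int, data: int) -> None:
        self.lo.write(addr, data & 0xFF)
        self.hi.write(addr, (data >> 8) & 0xFF)

    def read(self, addr: int) -> int:
        return (self.hi.read(addr) << 8) | self.lo.read(addr)


ram = Ram1Kx16()
ram.write(5, 0xBEEF)
print(hex(ram.read(5)))
```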
Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the depth. Example: given 1Kx8, want 2Kx8
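A behavioral sketch of the depth cascade, assuming the extra (top) address bit selects which 1Kx8 block receives the write enable and which block's output is passed through an output multiplexer. The class names are hypothetical.

```python
class Ram1Kx8:
    """Behavioral model of a 1K x 8 RAM block (hypothetical helper)."""

    def __init__(self):
        self.cells = [0] * 1024

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = data & 0xFF

    def read(self, addr: int) -> int:
        return self.cells[addr]


class Ram2Kx8:
    """Two 1Kx8 blocks stacked; address bit 10 picks the block."""

    def __init__(self):
        self.banks = [Ram1Kx8(), Ram1Kx8()]

    def write(self, addr: int, data: int) -> None:
        bank, offset = addr >> 10, addr & 0x3FF
        self.banks[bank].write(offset, data)   # only one bank is enabled

    def read(self, addr: int) -> int:
        bank, offset = addr >> 10, addr & 0x3FF
        return self.banks[bank].read(offset)   # output mux on the bank bit


ram = Ram2Kx8()
ram.write(1024 + 7, 0x42)
print(hex(ram.read(1024 + 7)))
```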
Adding Ports to Primitive Memory Blocks
Adding a read port to a simple dual port (SDP) memory.
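One common way to add a read port is replication: keep two copies of the SDP memory, send every write to both, and give each copy its own read address. The sketch below assumes that approach; the slide's figure may show a different arrangement, and the class names are hypothetical.

```python
class SdpRam:
    """Simple dual-port RAM model: one write port, one read port."""

    def __init__(self, depth: int = 1024):
        self.cells = [0] * depth

    def write(self, addr: int, data: int) -> None:
        self.cells[addr] = data

    def read(self, addr: int) -> int:
        return self.cells[addr]


class OneWriteTwoReadRam:
    """1 write port, 2 read ports, built from two replicated SDP RAMs."""

    def __init__(self, depth: int = 1024):
        self.copy_a = SdpRam(depth)
        self.copy_b = SdpRam(depth)

    def write(self, addr: int, data: int) -> None:
        # The single write port drives both copies so they stay identical.
        self.copy_a.write(addr, data)
        self.copy_b.write(addr, data)

    def read_a(self, addr: int) -> int:
        return self.copy_a.read(addr)

    def read_b(self, addr: int) -> int:
        return self.copy_b.read(addr)


rf = OneWriteTwoReadRam()
rf.write(3, 99)
print(rf.read_a(3), rf.read_b(3))
```

The cost is doubled storage, which is usually acceptable for small structures such as register files.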
Adding Ports to Primitive Memory Blocks
How to add a write port to a simple dual port memory.
Example: given 1Kx8 SDP, want 1 read & 2 write ports.
FIFOs
First-in-first-out (FIFO) Memory
❑ Used to implement queues.
❑ These find common use in computers and communication circuits.
❑ Generally, used to “decouple” actions of producer and consumer:
[Figure: queue contents. Starting state: c b a. After write: d c b a. After read: d c b.]
• Producer can perform many writes without consumer performing any reads (or vice versa). However, because of finite buffer size, on average, need equal number of reads and writes.
• Typical uses:
– Interfacing I/O devices. Example: network interface. Data bursts from network, then processor bursts to memory buffer (or reads one word at a time from interface). Operations not synchronized.
– Example: audio output. Processor produces output samples in bursts (during process swap-in time). Audio DAC clocks it out at constant sample rate.
FIFO Interfaces
[Figure: FIFO block with inputs DIN, WE, RE, RST, and CLK, and outputs DOUT, FULL, HALF FULL, and EMPTY; internally, a write ptr and a read ptr]
• Address pointers are used internally to keep next write position and next read position into a dual-port memory.
• If pointers equal after write ⇒ FULL.
• If pointers equal after read ⇒ EMPTY.
❑ After write or read operation, FULL and EMPTY indicate status of buffer.
❑ Used by external logic to control its own reading from or writing to the buffer.
❑ FIFO resets to EMPTY state.
❑ HALF FULL (or other indicator of partial fullness) is optional.
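The pointer rules above can be sketched as a small behavioral model. This is an illustration only: the class name `Fifo` and the default depth are assumptions, and the model handles one operation at a time rather than a simultaneous read and write in the same cycle.

```python
class Fifo:
    """Circular-buffer FIFO with write/read pointers and FULL/EMPTY flags."""

    def __init__(self, depth: int = 4):
        self.depth = depth
        self.mem = [None] * depth
        self.write_ptr = 0
        self.read_ptr = 0
        self.full = False
        self.empty = True            # FIFO resets to the EMPTY state

    def write(self, data) -> None:
        if self.full:
            raise OverflowError("write to a full FIFO")
        self.mem[self.write_ptr] = data
        self.write_ptr = (self.write_ptr + 1) % self.depth
        self.empty = False
        # Pointers equal after a write means the buffer is full.
        self.full = self.write_ptr == self.read_ptr

    def read(self):
        if self.empty:
            raise IndexError("read from an empty FIFO")
        data = self.mem[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % self.depth
        self.full = False
        # Pointers equal after a read means the buffer is empty.
        self.empty = self.read_ptr == self.write_ptr
        return data


fifo = Fifo()
for item in "abc":
    fifo.write(item)
print(fifo.read(), fifo.full, fifo.empty)
```

Tracking which operation happened last is what disambiguates the two cases in which the pointers are equal.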
Memory on FPGAs
Virtex-5 LX110T memory blocks.
• Distributed RAM using LUTs among the CLBs.
• Block RAMs in four columns.
A SLICEM 6-LUT … (“distributed RAM”)
[Figure: SLICEM 6-LUT with normal 6-LUT inputs, normal 5/6-LUT outputs, and the memory data input]
Distributed RAM Primitives
Distributed RAM Timing
Block RAM Overview
❑ 36K bits of data total, can be configured as:
▪ 2 independent 18Kb RAMs, or one 36Kb RAM.
❑ Each 36Kb block RAM can be configured as:
▪ 64Kx1 (when cascaded with an adjacent 36Kb block
RAM), 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, or 1Kx36
memory.
❑ Each 18Kb block RAM can be configured as:
▪ 16Kx1, 8Kx2, 4Kx4, 2Kx9, or 1Kx18 memory.
❑ Write and Read are synchronous operations.
❑ The two ports are symmetrical and totally
independent (can have different clocks),
sharing only the stored data.
❑ Each port can be configured in one of the
available widths, independent of the other port.
The read port width can be different from the
write port width for each port.
❑ The memory content can be initialized or
cleared by the configuration bitstream.
Block RAM Timing
Ultra-RAM Blocks
State-of-the-Art - Xilinx FPGAs
Virtex Ultra-scale
Caches
1977: DRAM faster than microprocessors
Apple II (1977)
CPU: 1000 ns
DRAM: 400 ns
1980-2003, CPU speed outpaced DRAM ...
Q. How did architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM.
Create a “memory hierarchy”.
[Figure: performance (1/latency) versus year, 1980-2005. CPU: 60% per year (2X in 1.5 years), up to the power wall. DRAM: 9% per year (2X in 10 years). The gap grew 50% per year.]
Review from 61C
❑ Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced,
it will tend to be referenced again soon.
– Spatial Locality (Locality in Space): If an item is referenced,
items whose addresses are close by tend to be referenced
soon.
❑ By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the
cheapest technology.
– Provide access at the speed offered by the fastest
technology.
❑ DRAM is slow but cheap and dense:
– Good choice for presenting the user with a BIG memory
system
❑ SRAM is fast but expensive and not very dense:
– Good choice for providing the user FAST access time.
CPU-Cache Interaction
(5-stage pipeline)
[Figure: 5-stage pipeline datapath with a primary instruction cache (addressed by the PC, returning inst and hit?) and a primary data cache (addr, wdata, rdata, we, hit?), both connected to memory control]
Stall entire CPU on data cache miss.
Nehalem Die Photo (i7, i5)
[Figure: die photo with the L1 and L2 caches labelled]
❑ Per core:
▪ 32KB L1 I-Cache (4-way set associative (SA))
▪ 32KB L1 D-Cache (8-way SA)
▪ 256KB unified L2 (8-way SA, 64B blocks)
❑ Common L3 8MB cache
Example: 1 KB Direct Mapped Cache with 32 B Blocks
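For a $2^N$ byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = $2^M$)
[Figure: 32-bit address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00); tag and index together form the block address. The tag is stored as part of the cache “state”. The data array has 32 rows of 32 bytes: row 0 starts at Byte 0, row 1 holds Bytes 32-63 (shown with tag 0x50), and row 31 holds Bytes 992-1023.]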
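The field boundaries for this example follow directly from the sizes, as the sketch below shows. The function name `split_cache_address` is hypothetical, and the code assumes power-of-two cache and block sizes and a 32-bit address.

```python
def split_cache_address(addr: int, cache_bytes: int = 1024, block_bytes: int = 32):
    """Split an address into (tag, index, byte_select) for a direct-mapped cache.

    Assumes cache_bytes = 2**N and block_bytes = 2**M.
    """
    n = cache_bytes.bit_length() - 1       # N = 10 for a 1 KB cache
    m = block_bytes.bit_length() - 1       # M = 5 for 32 B blocks
    byte_select = addr & (block_bytes - 1)             # lowest M bits
    index = (addr >> m) & ((1 << (n - m)) - 1)         # next N - M bits
    tag = addr >> n                                    # uppermost 32 - N bits
    return tag, index, byte_select


# Address with tag 0x50, index 0x01, byte select 0x00, as in the example.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(field) for field in split_cache_address(addr)])
```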
Fully Associative
Fully Associative Cache
– No Cache Index
– For read, compare the Cache Tags of all cache entries in
parallel
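– Example: with 32 B blocks, we need N 27-bit comparators (one per cache entry)
[Figure: 32-bit address split into a 27-bit Cache Tag (bits 31-5) and a Byte Select (bits 4-0, example 0x01); the tag of every cache entry is compared (=) against the address tag in parallel]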
Set Associative Cache
N-way set associative: N entries for each Cache Index
– (N direct mapped caches operate in parallel)
Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result
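[Figure: two-way set associative cache. The Cache Index selects one entry (Valid, Cache Tag, Cache Data) in each way; each stored tag is compared with the address tag (Adr Tag); the compare results Sel0 and Sel1 drive a mux that picks the Cache Block, and their OR produces Hit]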
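The lookup path of the two-way example can be sketched as follows. This is a simplified model: the class name, the default geometry (16 sets, 32 B blocks), and the explicit `fill` method are assumptions for illustration, and no replacement policy is modeled.

```python
class TwoWaySetAssociativeCache:
    """Lookup path of a two-way set associative cache."""

    def __init__(self, num_sets: int = 16, block_bytes: int = 32):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        # ways[w][index] is a (valid, tag, block_data) entry.
        self.ways = [[(False, 0, None)] * num_sets for _ in range(2)]

    def _tag_and_index(self, addr: int):
        block_addr = addr // self.block_bytes
        return block_addr // self.num_sets, block_addr % self.num_sets

    def fill(self, way: int, addr: int, block_data) -> None:
        tag, index = self._tag_and_index(addr)
        self.ways[way][index] = (True, tag, block_data)

    def lookup(self, addr: int):
        tag, index = self._tag_and_index(addr)
        entries = [way[index] for way in self.ways]      # index selects the set
        # Both tags in the set are compared against the address tag.
        sel = [valid and stored == tag for valid, stored, _ in entries]
        hit = sel[0] or sel[1]                           # OR of the compares
        data = entries[1][2] if sel[1] else entries[0][2]  # 2-to-1 mux
        return hit, (data if hit else None)


cache = TwoWaySetAssociativeCache()
cache.fill(0, 0x0000, "block A")
cache.fill(1, 0x0200, "block B")   # same set as 0x0000, different tag
print(cache.lookup(0x0004), cache.lookup(0x0210), cache.lookup(0x0400))
```

Addresses 0x0000 and 0x0200 map to the same set but can both be resident, which a direct-mapped cache of the same size could not do.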
RAM Blocks and the Project
Processor Design Considerations
(FPGA Version)
❑ Register File: Consider distributed RAM (LUT RAM)
▪ Size is close to what is needed: distributed RAM primitive configurations
are 32 or 64 bits deep. Extra width is easily achieved by parallel
arrangements.
▪ LUT-RAM configurations offer multi-porting options - useful for register
files.
▪ Asynchronous read, might be useful by providing flexibility on where to
put register read in the pipeline.
❑ Instruction / Data Memories : Consider Block RAM
▪ Higher density, lower cost for large number of bits
▪ A single 36kbit Block RAM implements 1K 32-bit words.
▪ Configuration stream based initialization, permits a simple “boot strap”
procedure.
Processor Design Considerations
(ASIC Version)