Memory Lecture


EECS 151/251A

Spring 2020
Digital Design and Integrated Circuits

Instructor:
John Wawrzynek

Lecture 16:
Memory Circuits
and Blocks
EE141
Outline
❑ Memory Circuits
❑ SRAM
❑ DRAM
❑ Memory Blocks
❑ Multi-ported RAM
❑ Combining Memory blocks
❑ FIFOs
❑ FPGA memory blocks
❑ Caches
❑ Memory Blocks in the Project

First, Some Memory Classifications:
• Hardwired (Read-Only Memory, ROM)
• Programmable
• Volatile
• SRAM - uses positive feedback (and restoration) to
hold state
• DRAM - uses capacitive charge (only) to hold state
• Non-volatile
• Persistent state without power supplied
• Ex: Flash Memory

Memory Circuits

Volatile Storage Mechanisms

Static: feedback (a cross-coupled loop restores and holds the value). Dynamic: charge (the value is stored on a node capacitance, with no restoration).

[Figure: clocked static latch with a feedback path vs. a pass transistor storing D on a capacitor.]
Generic Memory Block Architecture
❑ Word lines used to select a
row for reading or writing
❑ Bit lines carry data to/from
periphery
❑ Core aspect ratio is kept
close to 1 to help balance
delay on the word line versus
the bit line
❑ Address bits are divided
between the two decoders
❑ Row decoder used to
select word line
❑ Column decoder used to
select one or more columns
for input/output of data

Storage cell could be either static or dynamic


Memory - SRAM

6-Transistor CMOS SRAM Cell
[Schematic: cross-coupled inverters (M1/M2 and M3/M4, between VDD and GND) store Q and its complement; access transistors M5 and M6, gated by the word line WL, connect the two sides to the bit lines BL and BL_B.]
Memory Cells
Complementary data values are written to (and read from) the two sides of the cell.

Cells are stacked in 2D to form the memory core.

[Figure: array of cells sharing word lines WL0–WL3 horizontally and bit-line pairs BL/BL_B vertically; enable transistors connect each column to the periphery.]
6T-SRAM — Older Layout Style
[Layout: VDD and GND in M1 (blue); bit lines BL/BL_B in M2 (purple); word line WL in polysilicon (red).]
Modern SRAM
❑ ST/Philips/Motorola cell
[Layout: access transistors gated by the word line, with pull-down NMOS and pull-up PMOS devices between the bit lines BL and BL_B.]
SRAM read/write operations

SRAM Operation - Read
1. Bit lines are “pre-charged” to VDD
2. Word line is driven high (pre-charger is turned off)
3. Cell pulls-down one bit line
4. Differential sensing circuit on periphery is activated to capture value on bit lines.

[Figure: during the read, the Q=0 side of the cell pulls its precharged bit line low.]

During the read, Q gets slightly pulled up when WL first goes high, but by sizing the transistors correctly, reading the cell will not destroy the stored value.

SRAM Operation - Write
1. Column driver circuit on periphery differentially drives the bit
lines
2. Word line is driven high (column driver stays on)
3. One side of cell is driven low, flips the other side
[Figure: the bit lines are driven 1→0 and 0→1; Q_b is pulled low, which flips the other side of the cell.]

For a successful write, the access transistor needs to overpower the cell pull-up; the transistors are sized to allow this to happen.

Memory Periphery

Periphery
❑ Decoders
❑ Sense Amplifiers
❑ Input/Output Buffers
❑ Control / Timing Circuitry

Row Decoder
• L total address bits
• K for column decoding
• L-K for row decoding
• Row decoder expands L-K address lines into 2^(L-K) word lines

❑ Example: decoder for an 8Kx8 memory block (8K words of 8 bits each)
▪ Core arranged as 256x256 cells; each row holds 32 8-bit words (8x32=256)
▪ Need 256 AND gates, each driving one word line
In this case: L=13 total address bits (2^13=8K), K=5 (2^5=32), L-K=8 (2^8=256)
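The address split in this example can be sketched behaviorally (field layout follows the slide: the high L-K bits go to the row decoder, the low K bits to the column decoder):

```python
# Address split for the 8Kx8 example: L=13 address bits, K=5 column bits.
L, K = 13, 5

def split_address(addr):
    row = addr >> K               # L-K = 8 bits: selects one of 256 word lines
    col = addr & ((1 << K) - 1)   # K = 5 bits: selects one of 32 column groups
    return row, col

# Word 0 sits in row 0, column 0; word 8191 in row 255, column 31.
assert split_address(0) == (0, 0)
assert split_address(8191) == (255, 31)
```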
Row Decoders
(N)AND Decoder

NOR Decoder

A collection of 2^(L-K) logic gates; they need to be dense and fast.

Predecoders
❑ Use a single gate for each of the shared terms
▪ E.g., from a1, a1', a0, a0' generate four signals: a1·a0, a1·a0', a1'·a0, a1'·a0'
❑ In other words, we are decoding smaller groups of address bits first
▪ And using the "predecoded" outputs to do the rest of the decoding

[Figure: the word-line AND terms over a5…a0 and their complements, built from predecoded groups.]
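The idea can be sketched behaviorally: decode small groups of address bits into one-hot "predecoded" lines, then AND one line from each group to form each word line. A 6-bit decoder built from three 2-bit predecoders:

```python
from itertools import product

def predecode(bits):
    # One-hot decode a small group of address bits (the "predecoder").
    value = int("".join(map(str, bits)), 2)
    return [1 if i == value else 0 for i in range(2 ** len(bits))]

def decode6(a):
    # a = (a5, a4, a3, a2, a1, a0); predecode in 2-bit groups, then AND
    # one line from each group to form the 64 word lines.
    g2 = predecode(a[0:2])  # a5 a4
    g1 = predecode(a[2:4])  # a3 a2
    g0 = predecode(a[4:6])  # a1 a0
    return [x & y & z for x, y, z in product(g2, g1, g0)]

lines = decode6((1, 0, 1, 1, 0, 1))   # address 0b101101 = 45
assert lines[45] == 1 and sum(lines) == 1
```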
Predecoder and Decoder
A0 A1 A2 A3 A4 A5

Predecoders

Final Decoder

Column “Decoder”

❑ Is basically a multiplexer
❑ Each row contains 2^K words, each M bits wide
❑ Bit i of each of the 2^K words is interleaved
❑ Ex: K=2, M=8 (words a, b, c, d), column order:

d7 c7 b7 a7 d6 c6 b6 a6 d5 c5 b5 a5 d4 c4 b4 a4 d3 c3 b3 a3 d2 c2 b2 a2 d1 c1 b1 a1 d0 c0 b0 a0
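The interleaved column order from the example (K=2, M=8, words a through d) can be generated and checked with a short sketch:

```python
# Column interleaving for K=2, M=8: bit 7 of every word comes first,
# then bit 6, and so on, so one column mux per output bit suffices.
words = ["a", "b", "c", "d"]
M = 8
order = [f"{w}{bit}" for bit in range(M - 1, -1, -1) for w in reversed(words)]

assert "".join(order) == (
    "d7c7b7a7d6c6b6a6d5c5b5a5d4c4b4a4"
    "d3c3b3a3d2c2b2a2d1c1b1a1d0c0b0a0"
)
```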
4-input pass-transistor based Column Decoder (for read)

[Circuit: a 2-input NOR decoder turns A0/A1 into select lines S0–S3; each select enables one pass transistor, connecting one of BL0–BL3 to the shared data line D. An actual circuit would use differential signaling.]

The decoder is shared across all 2^K × M bits of the row.

Advantages: speed (only one extra transistor in the signal path, and the sense amp is shared).
Sense Amplifiers

τ_p = (C · ΔV) / I_av

The bit-line capacitance C is large and the cell current I_av is small, so make the sensed swing ΔV as small as possible.

Idea: use a sense amplifier to detect that small swing.
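A numeric sketch of the delay relation above (the component values are illustrative assumptions, not from the slides):

```python
# Bit-line delay t_p = C * dV / I_av, with illustrative values.
C_bl = 300e-15   # bit-line capacitance: 300 fF (assumed)
dV   = 0.25      # sensed swing: 250 mV (assumed)
I_av = 50e-6     # average cell discharge current: 50 uA (assumed)

t_p = C_bl * dV / I_av
assert abs(t_p - 1.5e-9) < 1e-12   # about 1.5 ns to develop the swing
```

Halving ΔV halves the delay, which is why sense amplifiers that detect a small differential swing speed up the read.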
Differential Sense Amplifier

[Schematic: PMOS load pair M3/M4 at VDD; input pair M1/M2 driven by bit and bit_b; tail transistor M5 enabled by SE; single-ended output Out.]

Classic differential amplifier structure, the basis of the op-amp.

Differential Sensing ― SRAM

DRAM

3-Transistor DRAM Cell

[Figure: write port M1 (gated by WWL, data from BL1) charges storage node X; read port M2/M3 (gated by RWL) discharges the precharged BL2. Waveforms: BL2 precharged to VDD − VT, then dropping by ΔV when a stored 1 is read.]

❑ No constraints on device ratios
❑ Reads are non-destructive
❑ Value stored at node X when writing a "1" is V_WWL − V_Tn
❑ Can work with a normal logic IC process

1-Transistor DRAM Cell

[Figure: 1T cell, an access transistor gated by WL connects the storage capacitor C_S to the bit line; V_BIT = 0 or (VDD − VT).]

Write: C_S is charged or discharged by asserting WL and BL.

Read: charge redistribution takes place between the bit line and the storage capacitance. Because C_S << C_BL, the voltage swing is small, typically around 250 mV.

❑ To get sufficient C_S, a special IC process is used
❑ Cell reading is destructive, therefore the read operation is always followed by a write-back
❑ The cell loses charge (it leaks away in ms, highly temperature dependent), therefore cells occasionally need to be "refreshed" with a read/write cycle
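The read swing from charge redistribution can be sketched numerically (the capacitance and voltage values below are illustrative assumptions, not from the slides):

```python
# Charge-redistribution read of a 1T cell: the precharged bit line
# shares charge with the small storage capacitor C_S.
C_s, C_bl = 30e-15, 300e-15   # cell and bit-line capacitance (assumed)
V_pre     = 1.25              # bit line precharged to VDD/2 (assumed VDD=2.5)
V_cell    = 2.0               # stored "1": roughly VDD - VT (assumed)

dV = (V_cell - V_pre) * C_s / (C_s + C_bl)
assert 0.05 < dV < 0.10   # only tens of millivolts, since C_S << C_BL
```

With these numbers the swing is about 68 mV, which is why a sensitive sense amplifier (and a write-back) is required.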
Advanced 1T DRAM Cells

[Cross-sections: Trench Cell (capacitor etched into the Si substrate, with storage-node poly and insulating layers) and Stacked-capacitor Cell (storage electrode and capacitor dielectric built above the transfer gate).]

Latch-Based Sense Amplifier (DRAM)
[Schematic: cross-coupled inverter pair across BL and BL_B, equalized by EQ and enabled by SE.]

• Initialized at its metastable point with EQ
• Once an adequate voltage gap is created, the sense amp is enabled with SE
• Positive feedback quickly forces the output to a stable operating point
Memory Blocks
❑ Multi-ported RAM
❑ Combining Memory
blocks
❑ FIFOs
❑ FPGA memory blocks
❑ Caches
❑ Memory Blocks in the
Project

Multi-ported memory

Memory Architecture Review
❑ Word lines used to select a
row for reading or writing
❑ Bit lines carry data to/from
periphery
❑ Core aspect ratio is kept
close to 1 to help balance
delay on the word line versus
the bit line
❑ Address bits are divided
between the two decoders
❑ Row decoder used to
select word line
❑ Column decoder used to
select one or more columns
for input/output of data
Multi-ported Memory
❑ Motivation:
▪ Consider the CPU core register file:
– 1 read or write per cycle limits processor performance
– Complicates pipelining; difficult for different instructions to simultaneously read or write the regfile
– Common arrangement in pipelined CPUs is 2 read ports and 1 write port
▪ I/O data buffering (e.g., a disk or network interface on one side and the CPU on the other): dual-porting allows both sides to simultaneously access memory at full bandwidth

[Figure: dual-port memory symbol with ports (Aa, Dina, Douta, WEa) and (Ab, Dinb, Doutb, WEb); a data buffer sits between the CPU and the I/O device.]
Dual-ported Memory Internals
❑ Add a second decoder, another set of read/write logic, bit lines, and word lines

• Example cell: SRAM with two word lines (WL1, WL2) and two bit-line pairs (b1/b1_b, b2/b2_b)
• Repeat everything except the cross-coupled inverters
• This scheme extends up to a couple more ports; beyond that, additional transistors must be added

[Figure: cell array with row decoders deca and decb, two r/w logic blocks, address ports, and data ports.]
Combining Memory
Blocks

Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the width. Example: given 1Kx8, want 1Kx16

Cascading Memory-Blocks
How to make larger memory blocks out of smaller ones.
Increasing the depth. Example: given 1Kx8, want 2Kx8
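The depth cascade can be modeled behaviorally. In this sketch (class names are hypothetical) the extra address bit acts as the decoded block select, and the read path muxes between the two blocks:

```python
# Behavioral sketch: build a 2Kx8 memory from two 1Kx8 blocks.
# Address bit 10 selects the block; the low 10 bits go to both.
class Ram1Kx8:
    def __init__(self):
        self.mem = [0] * 1024
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, data):
        self.mem[addr] = data & 0xFF

class Ram2Kx8:
    def __init__(self):
        self.blocks = [Ram1Kx8(), Ram1Kx8()]
    def _select(self, addr):
        return self.blocks[addr >> 10], addr & 0x3FF
    def read(self, addr):
        block, offset = self._select(addr)
        return block.read(offset)     # output mux on the high address bit
    def write(self, addr, data):
        block, offset = self._select(addr)
        block.write(offset, data)     # decoded enable on the high address bit

ram = Ram2Kx8()
ram.write(1023, 0xAB)   # last word of block 0
ram.write(1024, 0xCD)   # first word of block 1
assert ram.read(1023) == 0xAB and ram.read(1024) == 0xCD
```

Widening instead of deepening (1Kx16 from two 1Kx8) needs no extra decoding: both blocks share the address and each supplies half the data word.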

Adding Ports to Primitive Memory Blocks
Adding a read port to a simple dual port (SDP) memory.

Example: given 1Kx8 SDP, want 1 write & 2 read ports.

Adding Ports to Primitive Memory Blocks
How to add a write port to a simple dual port memory.
Example: given 1Kx8 SDP, want 1 read & 2 write ports.

FIFOs

First-in-first-out (FIFO) Memory
❑ Used to implement queues.
❑ These find common use in computers and communication circuits.
❑ Generally used to "decouple" the actions of producer and consumer:
▪ The producer can perform many writes without the consumer performing any reads (or vice versa). However, because of the finite buffer size, on average there must be an equal number of reads and writes.
❑ Typical uses:
– Interfacing I/O devices. Example: a network interface. Data bursts in from the network, then the processor bursts it to a memory buffer (or reads one word at a time from the interface). The operations are not synchronized.
– Audio output. The processor produces output samples in bursts (during process swap-in time); the audio DAC clocks them out at a constant sample rate.

[Figure: FIFO contents over time. Starting state: c b a. After a write: d c b a. After a read: d c b.]

EE141
FIFO Interfaces

[Block symbol: inputs DIN, WE, RE, RST, CLK; outputs FULL, HALF FULL, EMPTY, DOUT.]

❑ After a write or read operation, FULL and EMPTY indicate the status of the buffer.
❑ Used by external logic to control its own reading from or writing to the buffer.
❑ The FIFO resets to the EMPTY state.
❑ HALF FULL (or another indicator of partial fullness) is optional.

• Address pointers are used internally to keep the next write position and next read position into a dual-port memory.
• If the pointers are equal after a write ⇒ FULL.
• If the pointers are equal after a read ⇒ EMPTY.
Note: pointer incrementing is done “mod size-of-buffer”
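The pointer scheme can be sketched behaviorally in Python (a model of the logic described above, not any vendor's implementation):

```python
# FIFO with write/read pointers into a dual-port buffer, incremented
# mod size-of-buffer; FULL/EMPTY come from pointer equality after
# the most recent operation, as on the slide.
class Fifo:
    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.wptr = self.rptr = 0
        self.empty, self.full = True, False   # resets to EMPTY

    def write(self, data):
        assert not self.full
        self.buf[self.wptr] = data
        self.wptr = (self.wptr + 1) % self.size   # mod size-of-buffer
        self.empty = False
        self.full = self.wptr == self.rptr        # equal after write => FULL

    def read(self):
        assert not self.empty
        data = self.buf[self.rptr]
        self.rptr = (self.rptr + 1) % self.size
        self.full = False
        self.empty = self.rptr == self.wptr       # equal after read => EMPTY
        return data

f = Fifo(3)
for x in "abc":
    f.write(x)
assert f.full
assert [f.read() for _ in range(3)] == ["a", "b", "c"] and f.empty
```

Tracking the last operation (or an extra pointer bit) is what disambiguates the two pointer-equal cases.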


Xilinx Virtex5 FIFOs
❑ Virtex5 BlockRAMS include dedicated circuits for FIFOs.
❑ Details in User Guide (ug190).
❑ Takes advantage of the separate dual ports and independent port clocks.

Memory on FPGAs

Virtex-5 LX110T memory blocks:

Distributed RAM using LUTs among the CLBs.

Block RAMs in four columns.
A SLICEM 6-LUT … (“distributed RAM”)

[Figure: SLICEM LUT with its normal 6-LUT inputs and 5/6-LUT outputs, plus a memory data input, a memory write address, and a control output for chaining LUTs to make larger memories.]

Synchronous write / asynchronous read.
A 1.1 Mb distributed RAM can be made if all SLICEMs of an LX110T are used as RAM.
SLICEL vs SLICEM ...

SLICEM adds memory features to the LUTs, plus muxes.
Example Distributed RAM (LUT RAM)
Example configuration:
Single-port 256b x 1,
registered output.

Distributed RAM Primitives

All are built from a single slice or less.


Remember, though, that the SLICEM LUT naturally has only 1 read and 1 write port.

Distributed RAM Timing

Block RAM Overview
❑ 36K bits of data total, can be configured as:
▪ 2 independent 18Kb RAMs, or one 36Kb RAM.
❑ Each 36Kb block RAM can be configured as:
▪ 64Kx1 (when cascaded with an adjacent 36Kb block
RAM), 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, or 1Kx36
memory.
❑ Each 18Kb block RAM can be configured as:
▪ 16Kx1, 8Kx2, 4Kx4, 2Kx9, or 1Kx18 memory.
❑ Write and Read are synchronous operations.
❑ The two ports are symmetrical and totally
independent (can have different clocks),
sharing only the stored data.
❑ Each port can be configured in one of the
available widths, independent of the other port.
The read port width can be different from the
write port width for each port.
❑ The memory content can be initialized or
cleared by the configuration bitstream.
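A quick arithmetic check on the listed 36Kb configurations: the power-of-two widths use only the 32Kb data portion, while widths 9, 18, and 36 also use the parity bits to reach the full 36Kb:

```python
# Depth x width for each 36Kb block RAM configuration from the slide.
configs_36k = {32 * 1024: 1, 16 * 1024: 2, 8 * 1024: 4,
               4 * 1024: 9, 2 * 1024: 18, 1 * 1024: 36}

for depth, width in configs_36k.items():
    # Power-of-two widths total 32Kb; widths 9/18/36 total 36Kb.
    assert depth * width in (32 * 1024, 36 * 1024)
```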

Block RAM Timing

❑ Optional output register would delay the appearance of output data by one cycle.
❑ Maximum clock rate: roughly 400 MHz.

Ultra-RAM Blocks

State-of-the-Art - Xilinx FPGAs

Virtex Ultra-scale

Caches

1977: DRAM faster than microprocessors
Apple II (1977)
CPU: 1000 ns
DRAM: 400 ns

[Photo: Steve Jobs and Steve Wozniak.]

1980-2003, CPU speed outpaced DRAM ...
Q. How did architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM, creating a "memory hierarchy".

[Chart, performance (1/latency) vs. year, 1980-2005: CPU performance grew ~60% per year (2X in 1.5 yrs) until hitting the power wall, while DRAM improved ~9% per year (2X in 10 yrs); the gap grew ~50% per year.]
Review from 61C
❑ Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced,
it will tend to be referenced again soon.
– Spatial Locality (Locality in Space): If an item is referenced,
items whose addresses are close by tend to be referenced
soon.
❑ By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the
cheapest technology.
– Provide access at the speed offered by the fastest
technology.
❑ DRAM is slow but cheap and dense:
– Good choice for presenting the user with a BIG memory
system
❑ SRAM is fast but expensive and not very dense:
– Good choice for providing the user FAST access time.

CPU-Cache Interaction
(5-stage pipeline)

[Figure: 5-stage pipeline datapath with a primary instruction cache at fetch and a primary data cache at memory; the entire CPU stalls on a data-cache miss, and cache refill data comes from the lower levels of the memory hierarchy via the memory controller.]
Nehalem Die Photo (i7, i5)

[Die photo with the per-core L1 and L2 caches labeled.]

❑ Per core:
▪ 32KB L1 I-Cache (4-way set associative (SA))
▪ 32KB L1 D-Cache (8-way SA)
▪ 256KB unified L2 (8-way SA, 64B blocks)
❑ Common L3 8MB cache
Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)

Block address fields (here N=10, M=5):
bits 31-10: Cache Tag (Ex: 0x50) | bits 9-5: Cache Index (Ex: 0x01) | bits 4-0: Byte Select (Ex: 0x00)

The tag is stored as part of the cache "state", along with a valid bit. Each of the 32 rows holds one 32-byte block (row 0: bytes 0-31, row 1: bytes 32-63, …, row 31: bytes 992-1023).
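The field extraction for this example can be checked with a short sketch:

```python
# 1KB direct-mapped cache, 32B blocks: M=5 byte-select bits,
# 32 lines -> 5 index bits, remaining high bits are the tag.
M, INDEX_BITS = 5, 5

def decompose(addr):
    byte_select = addr & 0x1F          # bits 4:0
    index = (addr >> M) & 0x1F         # bits 9:5
    tag = addr >> (M + INDEX_BITS)     # bits 31:10
    return tag, index, byte_select

# The slide's example values: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
assert decompose(addr) == (0x50, 0x01, 0x00)
```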
Fully Associative Cache
– No Cache Index
– For a read, compare the Cache Tags of all cache entries in parallel
– Example: with 32 B blocks, we need N 27-bit comparators

Address fields: bits 31-5: Cache Tag (27 bits long) | bits 4-0: Byte Select (Ex: 0x01)

[Figure: each entry holds a valid bit, a cache tag, and a 32-byte data block; every stored tag is compared (=) against the incoming tag simultaneously.]
Set Associative Cache
N-way set associative: N entries for each Cache Index
– (N direct mapped caches operates in parallel)
Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in parallel
– Data is selected based on the tag result

[Figure: the Cache Index selects one set (Cache Block 0 from each way, with a valid bit and cache tag per way); both stored tags are compared against the address tag in parallel (Compare → Sel1/Sel0), the results are ORed to form Hit, and a mux selects the matching Cache Block.]
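The two-way lookup can be modeled behaviorally (class and field names here are illustrative, not from the slides):

```python
# Two-way set-associative lookup: the index selects a set, both stored
# tags are compared in parallel, and the hit result selects the block.
class TwoWayCache:
    def __init__(self, num_sets):
        # Each way holds (valid, tag, block) entries, one per set.
        self.ways = [[(False, None, None)] * num_sets for _ in range(2)]

    def fill(self, way, index, tag, block):
        self.ways[way][index] = (True, tag, block)

    def lookup(self, tag, index):
        entries = [self.ways[w][index] for w in range(2)]
        for valid, stored_tag, block in entries:   # the two tag compares
            if valid and stored_tag == tag:
                return True, block                 # OR of the compares -> Hit
        return False, None

c = TwoWayCache(num_sets=4)
c.fill(0, index=2, tag=0x1A, block="blockA")
c.fill(1, index=2, tag=0x2B, block="blockB")
assert c.lookup(0x2B, index=2) == (True, "blockB")
assert c.lookup(0x3C, index=2) == (False, None)
```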
RAM Blocks and the Project

Processor Design Considerations
(FPGA Version)
❑ Register File: Consider distributed RAM (LUT RAM)
▪ Size is close to what is needed: distributed RAM primitive configurations
are 32 or 64 bits deep. Extra width is easily achieved by parallel
arrangements.
▪ LUT-RAM configurations offer multi-porting options - useful for register
files.
▪ Asynchronous read might be useful, providing flexibility on where to put the register read in the pipeline.
❑ Instruction / Data Memories : Consider Block RAM
▪ Higher density, lower cost for large number of bits
▪ A single 36kbit Block RAM implements 1K 32-bit words.
▪ Configuration-stream-based initialization permits a simple "bootstrap" procedure.

Processor Design Considerations
(ASIC Version)

❑ Register File: use synthesized RAM


▪ At this size (~1K bits), a synthesized register file is competitive with a dense RAM block
▪ Latch-based instead of flip-flop-based storage would save area
▪ Asynchronous read might be useful, providing flexibility on where to put the register read in the pipeline
❑ Instruction / Data Caches: use generated dense Block RAM
▪ Higher density, lower cost for a large number of bits
▪ We will provide this for you

