
18-548/15-548 Memory System Architecture

Lecture 13: Main Memory Architecture
Philip Koopman
October 19, 1998

Required Reading:
- Cragon 5.1 - 5.1.5

Supplemental Reading:
- Hennessy & Patterson 5.6
- IBM App. Note: Understanding DRAM

Assignments
- By next class, read about Main Memory Performance:
  - Cragon 5.1.6 - 5.1.7
  - Fast DRAM article from EDN (Feb. 1997)
  - Supplemental Reading: Siewiorek & Koopman 5.2.2
- Homework 7 due October 21
- Lab 4 due October 23
- Test #2 Wednesday, October 28
  - Emphasizes material since Test #1
  - In-class review Monday, October 26
  - Closed book, closed notes; bring erasers, sharpened pencils, calculator


Where Are We Now?
- Where we've been:
  - Cache memory
  - Tuning for performance
- Where we're going for two classes:
  - Main memory architecture & performance
- Where we're going next:
  - Vector computing
  - Buses

Preview
- DRAM chip operation
  - Constraints on minimum memory size vs. performance improvement techniques
- Increasing memory bandwidth
  - Burst transfers
  - Interleaved access
  - Some of these techniques help with latency as well


Main Memory

[Figure: memory hierarchy -- CPU (I-unit, E-unit, register file) with TLB,
special-purpose memory, and special-purpose caches; on-chip L1 and (possibly)
on-chip L2 cache; off-chip L2/L3 and (possibly) L3/L4 cache, with a cache
bypass path; main memory (usually implemented in DRAM); interconnection
network to other computers & WWW; virtual memory backed by disk files &
databases, CD-ROM, tape, etc.]


Main Memory
- Main memory is what programmers (think they) manipulate
  - Program space
  - Data space
  - Commonly referred to as physical memory (as opposed to virtual memory)
- Typically constructed from DRAM chips
  - Multiple clock cycles to access data, but may operate in a burst mode once
    data access is started
  - Optimized for capacity, not necessarily speed
- Latency -- determined by DRAM construction
  - Shared pins for the high & low halves of the address to save on packaging
    costs (sketched below)
  - Typically 2 or 3 bus cycles to begin accessing data
  - Once an access is initiated, can return multiple data at a rate of one
    datum per bus clock
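To make the address multiplexing concrete, here is a minimal sketch of how a controller splits an address across the shared pins, assuming a hypothetical part with a 10-bit row address and a 10-bit column address (real chips differ in their row/column split):

```python
# Sketch of DRAM address multiplexing (hypothetical 10-bit row / 10-bit
# column device; real chips vary in how the address is split).

ROW_BITS = 10
COL_BITS = 10

def split_address(addr):
    """Split a flat DRAM array address into the two halves that share the
    chip's address pins: row (driven with RAS), column (driven with CAS)."""
    col = addr & ((1 << COL_BITS) - 1)                 # low half -> column
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # high half -> row
    return row, col

row, col = split_address(0x5A5A5)
print(f"row={row:#x} (driven during RAS), col={col:#x} (driven during CAS)")
```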

Main Memory Capacities
- Main memory capacity is determined by DRAM chip size and organization
  - At least 1 bank of DRAM chips is required for minimum memory size
    (e.g., 4 Mbit chips arranged 4 bits wide (1 Mbit x 4) require 16 chips for
    a 64-bit bus -- an 8 Mbyte minimum memory size)
  - Multiple banks (or bigger chips) used to increase memory capacity
- Bandwidth -- determined by memory word width
  - Memory words typically same width as the bus
  - Peak memory bandwidth is usually one word per bus cycle
  - Sustained memory bandwidth varies with the complexity of the design
  - Sometimes multiple banks can be activated concurrently, exploiting
    interleaved memory
- Representative main memory bandwidth is 500 MB/sec peak; 125 MB/sec
  sustained
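The capacity arithmetic in the example above can be checked with a short sketch; the 66 MHz bus clock used for the peak bandwidth figure is an assumed illustrative value, not from the slide:

```python
# Minimum memory size for one bank of 4 Mbit (1M x 4) DRAM chips on a
# 64-bit bus, plus peak bandwidth at an assumed 66 MHz bus clock.

chip_bits = 4 * 2**20       # 4 Mbit per chip
chip_width = 4              # 1M x 4 organization
bus_width = 64              # bits

chips_per_bank = bus_width // chip_width        # 16 chips fill the bus
bank_bytes = chips_per_bank * chip_bits // 8    # 8 MB minimum memory size

bus_mhz = 66                                    # assumed bus clock
peak_mb_per_s = (bus_width // 8) * bus_mhz      # one word per bus cycle

print(f"{chips_per_bank} chips, {bank_bytes / 2**20:.0f} MB minimum bank")
print(f"peak bandwidth ~{peak_mb_per_s} MB/s at {bus_mhz} MHz")
```

At these assumed numbers the peak comes out near the representative 500 MB/sec figure quoted above; the sustained figure is lower because of RAS/CAS setup and cycle-time overheads.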


DRAM OPERATION

[Figure: Main Memory DRAM Cells -- Small and Slow]

[Figure: Micron 64 Mbit DRAM]

DRAM Chip Operation
- Row Address Strobe (RAS; the pin is active low, written RAS# or RAS.L)
  - Present first half of address to DRAM chip
  - Used to read a row from the memory array
- Column Address Strobe (CAS; also active low, written CAS# or CAS.L)
  - Present second half of address to DRAM chip
  - Used to select bits from the row for read/write
- Cycle time
  - RAS + CAS + rewriting data back to the array
- Refresh cycle
  - Access to refresh the storage capacitors
  - Needed every few milliseconds (say, 64 msec); varies with chip
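As a rough illustration of the refresh burden, assuming a 64 msec refresh period and 4096 rows (plausible for chips of this era; both values vary by part):

```python
# Distributed refresh: every row must be refreshed once per refresh period,
# so the controller issues one row refresh every (period / rows).

refresh_period_ms = 64   # assumed refresh period; varies with chip
rows = 4096              # assumed number of rows in the array

interval_us = refresh_period_ms * 1000 / rows
print(f"one refresh cycle every {interval_us:.1f} microseconds")  # ~15.6 us
```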


[Figure: DRAM Read Cycle timing (Micron MT4LC16M4A7)]

[Figure: DRAM Write Cycle timing (Micron MT4LC16M4A7)]


Main Memory vs. Cache Memory
- Cache is optimized for speed
  - On-chip when possible; usually SRAM design
  - If off-chip, a single bank of SRAM chips for simplicity & speed
- Main memory is optimized for capacity & cost
  - Off-chip; DRAM design
  - Multiple banks of DRAM for capacity, which introduces:
    - Delays for buffers, chip select, address multiplexing
    - Delays for the backplane if a separate memory card is used
    - Delays for bus arbitration if memory is shared with I/O or multiple CPUs
- High-capacity machines have longer main memory latency
  - Alpha L3 cache is 8 MB ...
  - ... which has lower latency than accessing the 512 MB main memory DRAM
  - An embedded system with exactly 1 bank of DRAM can get rid of memory
    system overhead and run faster (but only for small programs)

INCREASING DRAM BANDWIDTH

(Part 1 of 2 -- exotic DRAM components are in the next lecture)


Exponential Size Growth; Linear Speedups

[Figure: DRAM speed trends by year of introduction (1980-1994) for the
64 Kbit through 64 Mbit generations -- cycle time, slowest RAS access,
fastest RAS access, and CAS access times in ns. (Hennessy & Patterson
Figure 5.30)]

Concurrency To Speed Up DRAM Accesses
- Parallelism: wider paths from cache to DRAM
  - Provides high bandwidth and low latency
  - Increases cost
- Pipelining: pipelined access to DRAM
  - Can provide higher bandwidth with a modest latency penalty
  - Often a cost-effective tradeoff, since cache is already helping with
    latency on most accesses
- Replication: more than one bank of DRAM
  - Can start accessing a second DRAM bank while the first DRAM bank is
    refreshing its row
  - Can initiate accesses to many DRAM banks, then read the results later


Wide Paths to DRAM
- Low-cost systems use narrow paths to reduce board space & package count
  - A 32-byte cache block might need 32 DRAM cycles to service a miss
- A wide path to DRAM reduces latency and increases bandwidth
  - Only 1 DRAM cycle to provide a full cache line
  - But the minimum DRAM configuration might be very large (and expensive)
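A sketch of that tradeoff, using the 32-byte block from the example above and an assumed 110 ns DRAM cycle time:

```python
# DRAM cycles to fill a 32-byte cache block as a function of path width,
# assuming one full DRAM cycle per transfer (no page mode) at 110 ns/cycle.

block_bytes = 32
cycle_ns = 110                      # assumed DRAM read/write cycle time

for path_bytes in (1, 4, 8, 32):    # 8-bit ... 256-bit paths to DRAM
    cycles = block_bytes // path_bytes
    print(f"{path_bytes * 8:3d}-bit path: {cycles:2d} cycles"
          f" = {cycles * cycle_ns} ns to service the miss")
```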

Exploiting DRAM Spatial Locality
- Multiple CAS cycles for a single RAS
  - Can access multiple bits from the same row without refreshing cells, since
    all bits in the row are latched at the sense amps
  - Permits slow access to the initial word in DRAM, followed by fast access
    to subsequent words
  - A good match to servicing cache misses with block size > transfer size
- Various modes:
  - Nibble mode: DRAM provides several bits sequentially for every RAS
  - Fast Page mode: DRAM row can be randomly addressed with several CAS cycles
  - Static column: same as page mode, but asynchronous CAS access


Fast Page Mode

[Figure: fast page mode timing (Micron MT4LC16M4A7)]

Burst Transfers from DRAM
- Use fast page mode, etc., to read several words over a modest-width DRAM
  bank
  - Can provide higher bandwidth with a modest latency penalty
  - Often a cost-effective tradeoff, since cache is already helping with
    latency on most accesses


Page Mode DRAM Bandwidth Example
- 16-bit x 1M DRAM chips in a 64-bit module (8 MB module)
- 60 ns RAS+CAS access time; 25 ns CAS access time
- 110 ns read/write cycle time; 40 ns page-mode cycle time; 256 words per page
- Latency to first access = 60 ns; latency to subsequent accesses = 25 ns
- Bandwidth takes into account the 110 ns first cycle and 40 ns CAS cycles:
  - Bandwidth for one word = 8 bytes / 110 ns = 69.35 MB/sec
  - Bandwidth for two words = 16 bytes / (110 + 40 ns) = 101.73 MB/sec
  - Peak bandwidth = 8 bytes / 40 ns = 190.73 MB/sec
  - Maximum sustained bandwidth = (256 words * 8 bytes) / (110 ns + 256*40 ns)
    = 188.71 MB/sec
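These numbers follow directly from the stated timing parameters; a sketch reproducing them (note the slide's figures use MB = 2^20 bytes, and its sustained formula charges a full 40 ns page cycle to every one of the 256 words):

```python
# Page mode bandwidth example: 110 ns for the first (full RAS+CAS) cycle,
# then 40 ns page-mode cycles; 8-byte words; 256 words per page.

MB = 2**20
first_ns, page_ns, word_bytes, page_words = 110, 40, 8, 256

def bandwidth(words):
    """MB/sec for a burst of `words` words starting with a full cycle."""
    ns = first_ns + (words - 1) * page_ns
    return words * word_bytes / (ns * 1e-9) / MB

print(f"one word:  {bandwidth(1):6.2f} MB/s")   # ~69.35
print(f"two words: {bandwidth(2):6.2f} MB/s")   # ~101.73
print(f"peak:      {word_bytes / (page_ns * 1e-9) / MB:6.2f} MB/s")  # ~190.73

# Slide's sustained formula charges 40 ns to all 256 words, plus 110 ns:
sustained_ns = first_ns + page_words * page_ns
sustained = page_words * word_bytes / (sustained_ns * 1e-9) / MB
print(f"sustained: {sustained:6.2f} MB/s")      # ~188.71
```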

Cache on a Shoestring
- Use page or static column DRAM operation as a cache
  - A RAS access is analogous to a cache miss that moves data from the DRAM
    array into the row buffer
  - CAS cycles are analogous to cache hits that provide quick access when
    spatial locality is present (i.e., accesses all go to a single DRAM row)
  - Want the DRAM controller to keep the chip in column access mode unless the
    row address changes; a good trick for high-end systems too
- Acts as a one-block cache with block size = DRAM row size
- AMD 29000 used this technique
  - Targeted at moderately cost-sensitive applications (e.g., laser printers)
  - Used a branch instruction buffer to hide RAS latency when taking a branch
- Keep in mind for single-chip CPU+DRAM
  - A wide DRAM row can be fed to an instruction buffer
  - A DRAM row can provide wide data input to parallel functional units
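A minimal sketch of the controller policy described above, assuming a hypothetical 1024-column row; a real controller would also handle refresh and write-back/precharge:

```python
# "One-block cache": keep the DRAM row open and serve CAS-only accesses
# while the row address matches; reopen (RAS) only when the row changes.

ROW_SHIFT = 10          # assumed 1024 columns per row
open_row = None         # row currently latched in the sense amps

def access(addr):
    global open_row
    row, col = addr >> ROW_SHIFT, addr & ((1 << ROW_SHIFT) - 1)
    if row == open_row:
        return f"hit:  CAS-only access, col {col}"      # fast page-mode cycle
    open_row = row
    return f"miss: RAS row {row}, then CAS col {col}"   # full RAS+CAS cycle

for a in (0x1234, 0x1238, 0x5238):   # second access hits the open row
    print(access(a))
```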


INTERLEAVED MEMORY

Multiple Memory Banks
- Can increase available bandwidth
  - Multiple memory banks take turns supplying data -- interleaved access
  - Data can be streamed from memory faster than the DRAM cycle time
- Can reduce latency when multiple memory banks are active
  - Multiple banks can be used to hide cycle time
  - Multiple memory references can be serviced concurrently
- Typically the number of banks is a power of 2 for addressing ease
  - Use the lowest bits of the memory address to select the bank
  - Up to 128 banks on supercomputers (Cray 2 and NEC SX/3)

[Figure: interleaved memory organization (Cragon Figure 5.1)]


Interleaved Bandwidth Increase
- Banks take turns supplying data
  - Permits pipelining of addresses and data
  - Equivalent performance to page mode access of DRAM
  - Address x is in bank (x MOD b), where b is the number of banks
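A sketch of the bank mapping and the streaming condition, with assumed timing values (110 ns bank cycle time, 30 ns bus cycle); streaming at the full bus rate works as long as each bank recovers before its next turn:

```python
# Word address x lives in bank (x mod b); with enough banks, successive
# words come from different banks and stream at the bus rate even though
# each bank's own cycle time is much longer than a bus cycle.

banks = 4
bank_cycle_ns = 110     # assumed per-bank DRAM cycle time
bus_cycle_ns = 30       # assumed bus clock period

for x in range(8):
    print(f"word {x} -> bank {x % banks}")

# Each bank is accessed once every `banks` bus cycles, so sequential
# streaming works if banks * bus_cycle >= bank_cycle:
print("streams at full bus rate:", banks * bus_cycle_ns >= bank_cycle_ns)
```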

Interleaved Memory As Dual-Port Alternative
- Multiple independent memory banks have latent bandwidth available
  - Can access m of n single-ported banks simultaneously
  - If m << n, the chances of a bank conflict are reduced (see the simulation
    below)
  - If a bank conflict occurs, stall one access port (can also simplify data
    dependency handling)
- Example: Pentium interleaved data cache

[Figure: interleaved banks as a dual-port alternative (Cragon Figure 5.6)]
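The m << n claim can be checked with a quick simulation, assuming independent, uniformly distributed bank selections (which real access streams only approximate):

```python
import random

# Probability that m simultaneous accesses to n single-ported banks
# collide, assuming uniform, independent bank selection per access.

def conflict_rate(m, n, trials=100_000):
    hits = sum(len(set(random.randrange(n) for _ in range(m))) < m
               for _ in range(trials))
    return hits / trials

for n in (4, 16, 64):
    print(f"m=2, n={n:2d}: conflict rate ~{conflict_rate(2, n):.3f}")  # ~1/n
```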


Interleaved Latency Decrease -- Cycle Time
- Multiple banks hide the refresh/rewrite portion of cycle time
  - For example, ping-ponging between two banks hides the end of each bank's
    cycle time

Interleaved Latency Decrease -- Concurrency
- Multiple banks service multiple pending memory requests
  - If the bus runs faster than DRAM cycles, there can be multiple pending
    memory requests
  - Assumes a non-blocking cache or uncached memory accesses
- Multiple requests can be to arbitrary locations; no reliance on spatial
  locality
  - Memory requests must be to different banks to prevent conflicts
- BUT, going through the interleaving process costs time for a single,
  isolated memory access

[Figure: CPU connected through interleave controllers to multiple DRAM sets]


Why Interleaved Memory?
- Historically important on machines that:
  - Didn't have cache memory
  - Used magnetic core memory instead of DRAM
  - Had multiple CPUs sharing common memory
- But becoming prevalent again, because those situations generalize:
  - Large gap between CPU-type memory technology and main-memory-type
    technology:
    - Magnetic core instead of transistors
    - DRAM chips instead of SRAM chips
    - Off-chip cache instead of on-chip cache
  - Need for high bandwidth in successive accesses having poor spatial
    locality:
    - Superscalar access to multiple data locations
    - Multiprocessing accessing shared main memory

Practical Limits -- Minimum Memory Size
- Minimum memory size can be a cost constraint on all but the biggest systems
  - Assume a 64-bit data bus and 64 Mbit DRAMs in a 1-bit-wide configuration
  - A 64-bit DRAM module will have 64 chips with 512 MB of DRAM
- Assume 8-way interleaving
  - Minimum system size is 64 x 8 = 512 DRAMs = 4 GB of DRAM
  - Memory can only be added in 4 GB chunks
- More importantly (and independent of current technology):
  - minimum # DRAM chips = (# memory banks * access width) / DRAM chip width
    (see the sketch below)
- Cost-effective interleaving solutions
  - Use wide DRAM chips (8-bit-wide means 1/8 as many chips as 1-bit-wide)
  - Permit smaller interleave factors on low-end machines (expanded memory
    gives larger interleave factors)
  - Use multiple cycles in page mode to retrieve data (Titan trick discussed
    later)
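A sketch applying that formula to the slide's 64 Mbit example, showing how wider chips shrink the minimum configuration:

```python
# minimum # DRAM chips = (# memory banks * access width) / DRAM chip width

def min_chips(banks, bus_bits, chip_width_bits):
    return banks * bus_bits // chip_width_bits

chip_mbit = 64                        # 64 Mbit DRAMs, as in the slide
for width in (1, 8, 16):              # 1-, 8-, 16-bit-wide chip organizations
    chips = min_chips(banks=8, bus_bits=64, chip_width_bits=width)
    total_mb = chips * chip_mbit // 8
    print(f"{width:2d}-bit chips: {chips:3d} chips, {total_mb} MB minimum")
# 1-bit chips reproduce the 512-chip / 4 GB minimum from the slide.
```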


REVIEW

Review
- Main memory tradeoffs
  - DRAM optimized for capacity more than for speed
  - Exploit DRAM operation to provide bandwidth (e.g., fast page mode)
  - Minimum possible DRAM size can be a cost constraint
- Interleaved memory access
  - Helps with latency by hiding refresh/rewrite time & reducing access
    conflicts
  - Multiple banks can provide multiple concurrent accesses