Inside Pascal: Manuel Ujaldón
Manuel Ujaldón
CUDA Fellow @ NVIDIA Corporation
Full Professor @ Computer Architecture Dept.
University of Málaga (Spain)
Talk contents [27 slides]
I. Major innovations
[Chart: benefits added with each GPU generation, from Tesla (CUDA, FP64) through Fermi, Kepler (Dynamic Parallelism) and Maxwell (DX12, Unified Memory) up to Pascal (3D Memory, NVLink).]
II. The new memory: 3D DRAM
Time to fill a typical cache line (128 bytes)
by DDR memory (from 1997 to 2015)
[Timing diagram: for each memory generation, the control bus issues ACTIVE and READ (RCD and CL delays), then the data bus streams a burst of 16 words of 8 bytes to complete a 128-byte cache line; the time axis runs from 0 to 200 ns.]

Memory configuration                 | Latency (RCD + CL) | Time to fill the line | Latency weight
SDRAM-100 @ 100 MHz, CL=2 (1998)     | 2 + 2 cycles       | t = 200 ns            | 20%
DDR-200, CL=2 (2001)                 | 2 + 2 cycles       | t = 120 ns            | 33%
DDR-200, CL=2, dual-channel (2004)   | 2 + 2 cycles       | t = 80 ns             | 50%
DDR2-400, CL=4, quad-channel (2010)  | 4 + 4 cycles       | t = 50 ns             | 80%
DDR3-800, CL=8, quad-channel (2013)  | 8 + 8 cycles       | t = 45 ns             | 89%
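The pattern behind these figures: the access latency (RCD + CL) stays at roughly 40 ns across generations while the burst itself shrinks, so latency eats a growing share of the transfer. A minimal Python sketch (mine, not from the talk) that reproduces the numbers in the table above under that simple model:

```python
# Recompute the "time to fill a 128-byte cache line" figures.
# Assumptions (mine, for illustration): latency = (RCD + CL) clock cycles,
# the burst is 16 words of 8 bytes, moved at 1 or 2 words per clock
# (SDR vs. DDR) and split across the parallel channels.

CACHE_LINE_WORDS = 16  # 16 words x 8 bytes = 128 bytes

def fill_time_ns(clock_mhz, rcd, cl, ddr, channels):
    cycle_ns = 1000.0 / clock_mhz
    latency_ns = (rcd + cl) * cycle_ns
    words_per_cycle = (2 if ddr else 1) * channels
    transfer_ns = CACHE_LINE_WORDS / words_per_cycle * cycle_ns
    total_ns = latency_ns + transfer_ns
    return total_ns, latency_ns / total_ns

configs = [
    ("SDRAM-100, CL=2 (1998)",              100, 2, 2, False, 1),
    ("DDR-200, CL=2 (2001)",                100, 2, 2, True,  1),
    ("DDR-200, CL=2, dual-channel (2004)",  100, 2, 2, True,  2),
    ("DDR2-400, CL=4, quad-channel (2010)", 200, 4, 4, True,  4),
    ("DDR3-800, CL=8, quad-channel (2013)", 400, 8, 8, True,  4),
]

for name, mhz, rcd, cl, ddr, ch in configs:
    t, w = fill_time_ns(mhz, rcd, cl, ddr, ch)
    print(f"{name:38s} t = {t:5.1f} ns, latency weight = {w:4.0%}")
```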
The printed circuit board for Pascal, as introduced at GTC16
3x performance density.
15.3 billion FinFET 16 nm transistors on a 610 mm2 die:
Manufactured by TSMC.
HBM2 memory cubes with 4000 wires:
Manufactured by Samsung.
An industry pioneer: Tezzaron
Started in 2008.
Now in its 4th generation of 3D memory.
Other developments: Micron's prototypes (top)
and Viking's products (bottom)
Stacked DRAM memory in NVIDIA's Pascal
3D chip-on-wafer integration.
3x bandwidth vs. GDDR5.
2.7x capacity vs. GDDR5.
4x energy efficiency per bit.
How to break the 1 TB/s bandwidth barrier
with a 2x 500 MHz clock
BW = frequency x width  =>  1 TB/s = (2 x 500 MHz) x width
=>  width = 8000 Gbit/s / 1 GHz = 8000 bits
Bus width in Titan X: 384 bits. Maximum in GPU history: 512 bits.
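A small Python sketch (mine, not from the talk) of that arithmetic. The 715 MHz memory clock in the last line is not on the slides; it is inferred from the 4096-bit bus and 732 GB/s quoted later in the deck for the P100.

```python
# BW (bytes/s) = width (bits) * transfers per second / 8.
# With double-data-rate signalling, a 500 MHz clock gives 1 GT/s per pin.

def bandwidth_gbs(width_bits, clock_mhz, ddr=True):
    """Peak bandwidth in GB/s for a given bus width and memory clock."""
    transfers_per_sec = clock_mhz * 1e6 * (2 if ddr else 1)
    return width_bits * transfers_per_sec / 8 / 1e9

def width_bits_for(target_gbs, clock_mhz, ddr=True):
    """Bus width (bits) needed to reach a target bandwidth at a given clock."""
    transfers_per_sec = clock_mhz * 1e6 * (2 if ddr else 1)
    return target_gbs * 1e9 * 8 / transfers_per_sec

print(width_bits_for(1000, 500))   # -> 8000 bits for 1 TB/s at 2 x 500 MHz
print(bandwidth_gbs(8000, 500))    # -> 1000 GB/s, the 1 TB/s barrier
print(bandwidth_gbs(4096, 715))    # -> ~732 GB/s: the P100 HBM2 figure,
                                   #    which implies a ~715 MHz memory clock
```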
TSVs: Through-silicon vias
[Cross-section of the package: the Pascal GPU (GP100) and the HBM cubes, built to the same height as the GPU, sit side by side under a common heatsink; micro-bumps connect them to a passive silicon interposer, which in turn sits on the package substrate.]
III. 3D Memory Consortiums
Stacked DRAM: A tale of two consortiums
III.1. HMC (Hybrid Memory Cube)
Hybrid Memory Cube Consortium (HMCC)
Founders of the consortium: [member logos]
Details on silicon integration

3D technology for DRAM memory (HMC):
A typical DRAM devotes most of its silicon area to the cell matrices.
Step 2: Gather the common logic underneath (the logic base: memory controller, crossbar switch, and a vault control + link interface per vault).
Step 3: Pile up the DRAM layers (DRAM0 to DRAM7).
Step 4: Build vaults with TSVs.

3D technology for processor(s):
Lower latency, but a more heterogeneous chip:
- Base: CPU and GPU (the control logic).
- Layers: cache (SRAM0 to SRAM7).
A typical multi-core die uses >50% of its area for SRAM, and those transistors switch slower at lower voltage, so the cache will rely on interleaving over piled-up matrices, just the way DRAM does.
What it takes each technology to reach 640 GB/s

Circuitry required                | DDR3L-1600      | DDR4-3200       | Stacked DRAM (HMC 1.0)
Data bandwidth (GB/s)             | 12.8 per module | 25.6 per module | 20 per 16-bit link
Items required to reach 640 GB/s  | 50 modules      | 25 modules      | 32 links (8 3D chips)
Space:
4 chips of 256 MB occupy 672 mm2.
Using HBM, 1 GB occupies only 35 mm2 (5%).
A silicon interposer is required to benefit from the wire density.
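The "items required" row is just the target bandwidth divided by the per-item bandwidth; a minimal Python sketch (mine, not from the talk) reproducing those counts:

```python
# How many DDR modules / HMC links are needed to aggregate 640 GB/s,
# using the per-item bandwidths quoted in the table above.
import math

TARGET_GBS = 640
per_item_gbs = {
    "DDR3L-1600 modules": 12.8,
    "DDR4-3200 modules": 25.6,
    "HMC 1.0 16-bit links": 20.0,
}

for item, gbs in per_item_gbs.items():
    print(f"{item:22s}: {math.ceil(TARGET_GBS / gbs)} needed")

# 32 HMC links at 4 links per cube -> 8 stacked-DRAM chips.
print("HMC cubes:", math.ceil(TARGET_GBS / 20.0) // 4)
```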
The bandwidth battle:
DDR3 & GDDR5 versus HBM1 & HBM2
DDR versus HMC and HBM consortiums
                     | DDR3 & DDR4               | HMC                                | HBM
Target platforms     | PCs, laptops, servers     | High-end servers and enterprises   | GPUs, HPC
Cost                 | Low                       | High                               | Medium
JEDEC standard       | Yes                       | No                                 | Yes
Power consumption    | Medium                    | High                               | Low
Width                | 4-32/chip, 64/module      | 16/link, 4 links/cube              | 128/channel, 8 ch./cube
Data rate per pin    | Up to 3200 Mbps           | 10, 12.5, 15 Gbps                  | 2 Gbps
System configuration | PCB based, DIMM modules   | PCB based, point to point (SerDes) | 2.5D TSV based, silicon interposer
Availability         | 2008 (DDR3), 2014 (DDR4)  | 2016                               | 2015 (HBM1), 2016 (HBM2)
Benefits             | Mature infrastructure. Low risk and cost. Familiar interface. | High & scalable bandwidth. Power efficiency. PCB connectivity host-DRAM. | High & scalable bandwidth. Power efficiency.
Challenges           | Scalability for speed. Signal integrity. Logistics for 3D.    | Relies on TSVs. Not a JEDEC standard. Cost.                              | Relies on TSVs. Relies on interposer. Cost.
Pending challenges
                        | Tesla K40 (Kepler) | Tesla M40 (Maxwell) | P100 w. NVLink        | P100 w. PCI-e
Release date            | 2012               | 2014                | 2016                  | 2016
Transistors             | 7.1 B @ 28 nm      | 8 B @ 28 nm         | 15.3 B @ 16 nm FinFET | 15.3 B @ 16 nm FinFET
# of multiprocessors    | 15                 | 24                  | 56                    | 56
fp32 cores / multiproc. | 192                | 128                 | 64                    | 64
fp32 cores / GPU        | 2880               | 3072                | 3584                  | 3584
fp64 cores / multiproc. | 64                 | 4                   | 32                    | 32
fp64 cores / GPU        | 960 (1/3 fp32)     | 96 (1/32 fp32)      | 1792 (1/2 fp32)       | 1792 (1/2 fp32)
Base clock              | 745 MHz            | 948 MHz             | 1328 MHz              | 1126 MHz
GPU Boost clock         | 810 / 875 MHz      | 1114 MHz            | 1480 MHz              | 1303 MHz
Peak performance (DP)   | 1680 GFLOPS        | 213 GFLOPS          | 5304 GFLOPS           | 4670 GFLOPS
L2 cache size           | 1536 KB            | 3072 KB             | 4096 KB               | 4096 KB
Memory interface        | 384-bit GDDR5      | 384-bit GDDR5       | 4096-bit HBM2         | 4096-bit HBM2
Memory size             | Up to 12 GB        | Up to 24 GB         | 16 GB                 | 16 GB
Memory bandwidth        | 288 GB/s           | 288 GB/s            | 732 GB/s              | 732 GB/s
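The peak double-precision numbers in this table can be cross-checked from the fp64 core counts and boost clocks, assuming 2 flops per core per cycle (one fused multiply-add). A small Python sketch (mine, not from the talk):

```python
# Peak DP = fp64 cores x boost clock x 2 flops/cycle (one FMA per cycle).

def peak_dp_gflops(fp64_cores, boost_clock_mhz):
    return fp64_cores * boost_clock_mhz / 1e3 * 2  # GFLOPS

print(peak_dp_gflops(960, 875))     # Tesla K40:   1680 GFLOPS
print(peak_dp_gflops(96, 1114))     # Tesla M40:   ~214 GFLOPS
print(peak_dp_gflops(1792, 1480))   # P100 NVLink: ~5304 GFLOPS
print(peak_dp_gflops(1792, 1303))   # P100 PCI-e:  ~4670 GFLOPS
```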
IV. Performance analysis
with the roofline model
The roofline model. Example: Pascal
[Roofline plot for Pascal on a log/log scale: attainable GFLOP/s (double precision) vs. FLOP/byte (operational intensity = GFLOP/s / GB/s). The horizontal roof is the peak performance; the slanted roof is the HBM2 (720 GB/s) memory bandwidth.]

Peak performance: 5304 GFLOPS.
Memory bandwidth: 720 GB/s.
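As a quick way to play with the two roofs, here is a minimal Python sketch (mine, not from the talk) of the roofline formula, attainable performance = min(peak, bandwidth x operational intensity), using the P100 figures above. The ridge point it prints is where the two roofs meet.

```python
# A minimal roofline helper: attainable performance is capped either by the
# memory roof (bandwidth x operational intensity) or by the compute roof.

def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gbs):
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# Pascal P100 roofline as used in the talk: 5304 GFLOPS (DP) and 720 GB/s.
PEAK_DP, BW = 5304.0, 720.0

for oi in (0.25, 1, 4, 7.4, 16, 64):
    print(f"OI = {oi:5.2f} flop/byte -> "
          f"{attainable_gflops(oi, PEAK_DP, BW):7.1f} GFLOPS")

# The ridge point (where the two roofs meet) is peak / bandwidth:
print("Ridge point:", PEAK_DP / BW, "flop/byte")   # ~7.4
```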
The last 3 generations at NVIDIA
(single and double precision)
[Roofline plot comparing Tesla K40, Titan X and Tesla P100, each in single (SP) and double precision (DP); the memory roofs are GDDR5 (288 GB/s) and HBM2 (720 GB/s).]

Processor       | GFLOP/s | GB/s | FLOP/byte
Tesla K40 (DP)  | 1680    | 288  | 5.83

[Roofline plot comparing Pascal (GP100) and Xeon Phi 7290: the balance zone marks the operational intensity that separates memory-bound from compute-bound kernels.]
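To connect the balance zone with the memory-bound/compute-bound labels, a minimal sketch (mine, not from the talk) that classifies a kernel by comparing its operational intensity with a GPU's balance point. Only the K40 and P100 figures come from the deck; the stencil and GEMM intensities are illustrative values, not measurements.

```python
# Classify a kernel as memory-bound or compute-bound on a given GPU by
# comparing its operational intensity with the GPU's balance point
# (peak GFLOP/s divided by peak GB/s).

GPUS = {
    "Tesla K40 (DP)":  (1680.0, 288.0),   # balance point: 5.83 flop/byte
    "Tesla P100 (DP)": (5304.0, 720.0),   # balance point: ~7.4 flop/byte
}

def classify(kernel_intensity, peak_gflops, bandwidth_gbs):
    balance = peak_gflops / bandwidth_gbs
    return "compute-bound" if kernel_intensity > balance else "memory-bound"

# Illustrative kernels: a stencil at ~1 flop/byte, a dense GEMM at ~32.
for name, (peak, bw) in GPUS.items():
    print(name, "| stencil:", classify(1.0, peak, bw),
          "| GEMM:", classify(32.0, peak, bw))
```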
Conclusions