Inside Pascal: Manuel Ujaldón
Manuel Ujaldón
CUDA Fellow @ NVIDIA Corporation
Full Professor @ Computer Architecture Dept.
University of Málaga (Spain)
Talk contents [27 slides]
I. Major innovations
[Chart: benefits added with each GPU generation, from Tesla (CUDA, FP64) through Fermi, Kepler (Dynamic Parallelism) and Maxwell (DX12, Unified Memory) up to Pascal (3D Memory, NVLink).]
II. The new memory: 3D DRAM
Time to fill a typical cache line (128 bytes)
by DDR memory (from 1997 to 2015)
[Timing diagram: for each memory generation, the control bus issues ACTIVE and READ (RCD and CL delays), then the data bus streams a burst of 16 words of 8 bytes to complete a 128-byte cache line; the time axis runs from 0 to 200 ns.]

Memory configuration                 | Latency (RCD + CL) | Time to fill the line | Latency weight
SDRAM-100 @ 100 MHz, CL=2 (1998)     | 2 + 2 cycles       | t = 200 ns            | 20%
DDR-200, CL=2 (2001)                 | 2 + 2 cycles       | t = 120 ns            | 33%
DDR-200, CL=2, dual-channel (2004)   | 2 + 2 cycles       | t = 80 ns             | 50%
DDR2-400, CL=4, quad-channel (2010)  | 4 + 4 cycles       | t = 50 ns             | 80%
DDR3-800, CL=8, quad-channel (2013)  | 8 + 8 cycles       | t = 45 ns             | 89%
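The pattern behind these figures: the access latency (RCD + CL) stays at roughly 40 ns across generations while the burst itself shrinks, so latency eats a growing share of the transfer. A minimal Python sketch (mine, not from the talk) that reproduces the numbers in the table above under that simple model:

```python
# Recompute the "time to fill a 128-byte cache line" figures.
# Assumptions (mine, for illustration): latency = (RCD + CL) clock cycles,
# the burst is 16 words of 8 bytes, moved at 1 or 2 words per clock
# (SDR vs. DDR) and split across the parallel channels.

CACHE_LINE_WORDS = 16  # 16 words x 8 bytes = 128 bytes

def fill_time_ns(clock_mhz, rcd, cl, ddr, channels):
    cycle_ns = 1000.0 / clock_mhz
    latency_ns = (rcd + cl) * cycle_ns
    words_per_cycle = (2 if ddr else 1) * channels
    transfer_ns = CACHE_LINE_WORDS / words_per_cycle * cycle_ns
    total_ns = latency_ns + transfer_ns
    return total_ns, latency_ns / total_ns

configs = [
    ("SDRAM-100, CL=2 (1998)",              100, 2, 2, False, 1),
    ("DDR-200, CL=2 (2001)",                100, 2, 2, True,  1),
    ("DDR-200, CL=2, dual-channel (2004)",  100, 2, 2, True,  2),
    ("DDR2-400, CL=4, quad-channel (2010)", 200, 4, 4, True,  4),
    ("DDR3-800, CL=8, quad-channel (2013)", 400, 8, 8, True,  4),
]

for name, mhz, rcd, cl, ddr, ch in configs:
    t, w = fill_time_ns(mhz, rcd, cl, ddr, ch)
    print(f"{name:38s} t = {t:5.1f} ns, latency weight = {w:4.0%}")
```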
The printed circuit board for Pascal, as introduced at GTC16
3x performance density.
15.3 billion FinFET 16 nm transistors on a 610 mm2 die:
Manufactured by TSMC.
HBM2 memory cubes with 4000 wires:
Manufactured by Samsung.
An industry pioneer: Tezzaron
Started in 2008.
Now in its 4th generation of 3D memory.
Other developments: Micron's prototypes (top)
and Viking's products (bottom)
Stacked DRAM memory in NVIDIA's Pascal
3D chip-on-wafer integration.
3x bandwidth vs. GDDR5.
2.7x capacity vs. GDDR5.
4x energy efficiency per bit.
How to break the 1 TB/s bandwidth barrier
with a 2x 500 MHz clock
BW = frequency x width  =>  1 TB/s = (2 x 500 MHz) x width
=>  width = 8000 Gbit/s / 1 GHz = 8000 bits
Bus width in Titan X: 384 bits. Maximum in GPU history: 512 bits.
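A small Python sketch (mine, not from the talk) of that arithmetic. The 715 MHz memory clock in the last line is not on the slides; it is inferred from the 4096-bit bus and 732 GB/s quoted later in the deck for the P100.

```python
# BW (bytes/s) = width (bits) * transfers per second / 8.
# With double-data-rate signalling, a 500 MHz clock gives 1 GT/s per pin.

def bandwidth_gbs(width_bits, clock_mhz, ddr=True):
    """Peak bandwidth in GB/s for a given bus width and memory clock."""
    transfers_per_sec = clock_mhz * 1e6 * (2 if ddr else 1)
    return width_bits * transfers_per_sec / 8 / 1e9

def width_bits_for(target_gbs, clock_mhz, ddr=True):
    """Bus width (bits) needed to reach a target bandwidth at a given clock."""
    transfers_per_sec = clock_mhz * 1e6 * (2 if ddr else 1)
    return target_gbs * 1e9 * 8 / transfers_per_sec

print(width_bits_for(1000, 500))   # -> 8000 bits for 1 TB/s at 2 x 500 MHz
print(bandwidth_gbs(8000, 500))    # -> 1000 GB/s, the 1 TB/s barrier
print(bandwidth_gbs(4096, 715))    # -> ~732 GB/s: the P100 HBM2 figure,
                                   #    which implies a ~715 MHz memory clock
```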
TSVs: Through-silicon vias
[Cross-section of the package: the Pascal GPU (GP100) and the HBM cubes, built to the same height as the GPU, sit side by side under a common heatsink; micro-bumps connect them to a passive silicon interposer, which in turn sits on the package substrate.]
III. 3D Memory Consortiums
Stacked DRAM: A tale of two consortiums
III.1. HMC (Hybrid Memory Cube)
Hybrid Memory Cube Consortium (HMCC)
Founders of the consortium: [member logos]
Details on silicon integration

3D technology for DRAM memory (HMC):
A typical DRAM devotes most of its silicon area to the cell matrices.
Step 2: Gather the common logic underneath (the logic base: memory controller, crossbar switch, and a vault control + link interface per vault).
Step 3: Pile up the DRAM layers (DRAM0 to DRAM7).
Step 4: Build vaults with TSVs.

3D technology for processor(s):
Lower latency, but a more heterogeneous chip:
- Base: CPU and GPU (the control logic).
- Layers: cache (SRAM0 to SRAM7).
A typical multi-core die uses >50% of its area for SRAM, and those transistors switch slower at lower voltage, so the cache will rely on interleaving over piled-up matrices, just the way DRAM does.
What it takes each technology to reach 640 GB/s

Circuitry required                | DDR3L-1600      | DDR4-3200       | Stacked DRAM (HMC 1.0)
Data bandwidth (GB/s)             | 12.8 per module | 25.6 per module | 20 per 16-bit link
Items required to reach 640 GB/s  | 50 modules      | 25 modules      | 32 links (8 3D chips)
Space:
4 chips of 256 MB occupy 672 mm2.
Using HBM, 1 GB occupies only 35 mm2 (5%).
A silicon interposer is required to benefit from the wire density.
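The "items required" row is just the target bandwidth divided by the per-item bandwidth; a minimal Python sketch (mine, not from the talk) reproducing those counts:

```python
# How many DDR modules / HMC links are needed to aggregate 640 GB/s,
# using the per-item bandwidths quoted in the table above.
import math

TARGET_GBS = 640
per_item_gbs = {
    "DDR3L-1600 modules": 12.8,
    "DDR4-3200 modules": 25.6,
    "HMC 1.0 16-bit links": 20.0,
}

for item, gbs in per_item_gbs.items():
    print(f"{item:22s}: {math.ceil(TARGET_GBS / gbs)} needed")

# 32 HMC links at 4 links per cube -> 8 stacked-DRAM chips.
print("HMC cubes:", math.ceil(TARGET_GBS / 20.0) // 4)
```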
The bandwidth battle:
DDR3 & GDDR5 versus HBM1 & HBM2
DDR versus HMC and HBM consortiums
                     | DDR3 & DDR4               | HMC                                | HBM
Target platforms     | PCs, laptops, servers     | High-end servers and enterprises   | GPUs, HPC
Cost                 | Low                       | High                               | Medium
JEDEC standard       | Yes                       | No                                 | Yes
Power consumption    | Medium                    | High                               | Low
Width                | 4-32/chip, 64/module      | 16/link, 4 links/cube              | 128/channel, 8 ch./cube
Data rate per pin    | Up to 3200 Mbps           | 10, 12.5, 15 Gbps                  | 2 Gbps
System configuration | PCB based, DIMM modules   | PCB based, point to point (SerDes) | 2.5D TSV based, silicon interposer
Availability         | 2008 (DDR3), 2014 (DDR4)  | 2016                               | 2015 (HBM1), 2016 (HBM2)
Benefits             | Mature infrastructure. Low risk and cost. Familiar interface. | High & scalable bandwidth. Power efficiency. PCB connectivity host-DRAM. | High & scalable bandwidth. Power efficiency.
Challenges           | Scalability for speed. Signal integrity. Logistics for 3D.    | Relies on TSVs. Not a JEDEC standard. Cost.                              | Relies on TSVs. Relies on interposer. Cost.
Pending challenges
                        | Tesla K40 (Kepler) | Tesla M40 (Maxwell) | P100 w. NVLink        | P100 w. PCI-e
Release date            | 2012               | 2014                | 2016                  | 2016
Transistors             | 7.1 B @ 28 nm      | 8 B @ 28 nm         | 15.3 B @ 16 nm FinFET | 15.3 B @ 16 nm FinFET
# of multiprocessors    | 15                 | 24                  | 56                    | 56
fp32 cores / multiproc. | 192                | 128                 | 64                    | 64
fp32 cores / GPU        | 2880               | 3072                | 3584                  | 3584
fp64 cores / multiproc. | 64                 | 4                   | 32                    | 32
fp64 cores / GPU        | 960 (1/3 fp32)     | 96 (1/32 fp32)      | 1792 (1/2 fp32)       | 1792 (1/2 fp32)
Base clock              | 745 MHz            | 948 MHz             | 1328 MHz              | 1126 MHz
GPU Boost clock         | 810 / 875 MHz      | 1114 MHz            | 1480 MHz              | 1303 MHz
Peak performance (DP)   | 1680 GFLOPS        | 213 GFLOPS          | 5304 GFLOPS           | 4670 GFLOPS
L2 cache size           | 1536 KB            | 3072 KB             | 4096 KB               | 4096 KB
Memory interface        | 384-bit GDDR5      | 384-bit GDDR5       | 4096-bit HBM2         | 4096-bit HBM2
Memory size             | Up to 12 GB        | Up to 24 GB         | 16 GB                 | 16 GB
Memory bandwidth        | 288 GB/s           | 288 GB/s            | 732 GB/s              | 732 GB/s
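The peak double-precision numbers in this table can be cross-checked from the fp64 core counts and boost clocks, assuming 2 flops per core per cycle (one fused multiply-add). A small Python sketch (mine, not from the talk):

```python
# Peak DP = fp64 cores x boost clock x 2 flops/cycle (one FMA per cycle).

def peak_dp_gflops(fp64_cores, boost_clock_mhz):
    return fp64_cores * boost_clock_mhz / 1e3 * 2  # GFLOPS

print(peak_dp_gflops(960, 875))     # Tesla K40:   1680 GFLOPS
print(peak_dp_gflops(96, 1114))     # Tesla M40:   ~214 GFLOPS
print(peak_dp_gflops(1792, 1480))   # P100 NVLink: ~5304 GFLOPS
print(peak_dp_gflops(1792, 1303))   # P100 PCI-e:  ~4670 GFLOPS
```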
IV. Performance analysis
with the roofline model
The roofline model. Example: Pascal
[Roofline plot for Pascal on a log/log scale: attainable GFLOP/s (double precision) vs. FLOP/byte (operational intensity = GFLOP/s / GB/s). The horizontal roof is the peak performance; the slanted roof is the HBM2 (720 GB/s) memory bandwidth.]

Peak performance: 5304 GFLOPS.
Memory bandwidth: 720 GB/s.
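As a quick way to play with the two roofs, here is a minimal Python sketch (mine, not from the talk) of the roofline formula, attainable performance = min(peak, bandwidth x operational intensity), using the P100 figures above. The ridge point it prints is where the two roofs meet.

```python
# A minimal roofline helper: attainable performance is capped either by the
# memory roof (bandwidth x operational intensity) or by the compute roof.

def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gbs):
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

# Pascal P100 roofline as used in the talk: 5304 GFLOPS (DP) and 720 GB/s.
PEAK_DP, BW = 5304.0, 720.0

for oi in (0.25, 1, 4, 7.4, 16, 64):
    print(f"OI = {oi:5.2f} flop/byte -> "
          f"{attainable_gflops(oi, PEAK_DP, BW):7.1f} GFLOPS")

# The ridge point (where the two roofs meet) is peak / bandwidth:
print("Ridge point:", PEAK_DP / BW, "flop/byte")   # ~7.4
```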
The last 3 generations at NVIDIA
(single and double precision)
[Roofline plot comparing Tesla K40, Titan X and Tesla P100, each in single (SP) and double precision (DP); the memory roofs are GDDR5 (288 GB/s) and HBM2 (720 GB/s).]

Processor       | GFLOP/s | GB/s | FLOP/byte
Tesla K40 (DP)  | 1680    | 288  | 5.83

[Roofline plot comparing Pascal (GP100) and Xeon Phi 7290: the balance zone marks the operational intensity that separates memory-bound from compute-bound kernels.]
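To connect the balance zone with the memory-bound/compute-bound labels, a minimal sketch (mine, not from the talk) that classifies a kernel by comparing its operational intensity with a GPU's balance point. Only the K40 and P100 figures come from the deck; the stencil and GEMM intensities are illustrative values, not measurements.

```python
# Classify a kernel as memory-bound or compute-bound on a given GPU by
# comparing its operational intensity with the GPU's balance point
# (peak GFLOP/s divided by peak GB/s).

GPUS = {
    "Tesla K40 (DP)":  (1680.0, 288.0),   # balance point: 5.83 flop/byte
    "Tesla P100 (DP)": (5304.0, 720.0),   # balance point: ~7.4 flop/byte
}

def classify(kernel_intensity, peak_gflops, bandwidth_gbs):
    balance = peak_gflops / bandwidth_gbs
    return "compute-bound" if kernel_intensity > balance else "memory-bound"

# Illustrative kernels: a stencil at ~1 flop/byte, a dense GEMM at ~32.
for name, (peak, bw) in GPUS.items():
    print(name, "| stencil:", classify(1.0, peak, bw),
          "| GEMM:", classify(32.0, peak, bw))
```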
Conclusions