Talk 1 Satoshi Matsuoka
[Figure: TSUBAME evolution timeline — "TSUBAME0" (2000/2002, custom, first "TeraScale" JP Univ. supercomputer, 128 GigaFlops → 1.3 TeraFlops); TSUBAME1.0 (2006, 80 TeraFlops, 10,000 cores, No.1 Asia / No.7 World); TSUBAME1.2 (2008, 170 TeraFlops, world's first GPU supercomputer); TSUBAME2.0 (2010, 2.4 PetaFlops, No.1 Production Green, ACM Gordon Bell Prize) — roughly x40 growth per generation. Built from general-purpose CPUs & many-core processors (GPUs) (32 cores → 800 cores), advanced optical networks, non-volatile memory (PCIe NVMe drive bays x 4), and efficient power control and cooling.]
Peak FP64 Exaflop TSUBAME in 2020 — just getting FLOPS/W is within reach
• 7nm+ post-Volta GPU (Pascal P100 is 16nm)
• ~10,000 CUDA cores (P100: 3,840), 12.5 Teraflops/chip w/ matrix engine (P100: 5.3TF)
• 80,000 chips => 80 million small cores
• 4 GPUs/node => 20,000 nodes (x40 TSUBAME3, 500~600 racks)
• Scalable high-dimensional torus or hypercube topology (TSUBAME3: full fat-tree)
• x3 power efficiency, 50GF/W (x1.9 via process, x1.6 via architecture) (TSUBAME3.0: 14.1GF/W)
• 1 Exa DFP peak, ~600 PF Linpack, 12MW facility power (the quick check below reproduces these numbers)
• So the DARPA Exascale report projection turned out to be fairly accurate
• But is just getting FLOPS all that valuable?
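A minimal back-of-the-envelope check of the figures above; only numbers quoted in the bullets are used as inputs, everything else is derived:

# Sanity check of the 2020 exaflop projection above.
tflops_per_chip = 12.5          # post-Volta GPU, TF/chip (slide figure)
chips           = 80_000
gpus_per_node   = 4
linpack_pf      = 600           # ~600 PF Linpack (slide figure)
facility_mw     = 12            # 12 MW facility power (slide figure)

peak_pf  = tflops_per_chip * chips / 1_000              # PF of peak
nodes    = chips // gpus_per_node
gflops_w = linpack_pf * 1e6 / (facility_mw * 1e6)       # GF per W on Linpack

print(f"peak  = {peak_pf:.0f} PF")       # -> 1000 PF, i.e. 1 Exaflop DFP peak
print(f"nodes = {nodes}")                # -> 20000 nodes (x40 TSUBAME3)
print(f"eff   = {gflops_w:.0f} GF/W")    # -> 50 GF/W (vs. TSUBAME3.0's 14.1 GF/W)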
[Figure: the DARPA ExaScale Report projected an Exaflop around 2018 (timeline 2004–2020). Accompanying chart: power-efficiency trend of accelerated systems, 2008–2017 — Cell, BlueGene/Q, AMD FirePro, NVIDIA K20x–K80, Xeon Phi (MIC), TSUBAME-KFC, ZettaScaler-1.6 — versus the TOP500 average.]
Tokyo Tech GSIC leads Japan in aggregated AI-capable FLOPS (DFP 64-bit, SFP 32-bit, HFP 16-bit) with TSUBAME3 + 2.5 + KFC, across all supercomputers and clouds.
[Figure: simulation vs. AI-capable performance of Japanese systems (PFLOPS axis); inset: NVIDIA Pascal P100 DGEMM performance (GFLOPS) compared against the RIKEN K computer.]
[Figure: AI research framework — common modules (NLP/NLU & text mining, behavior prediction & mining/modeling, planning/recommendation/control, image recognition & 3D object recognition, …) and common data/models shared across planning/business teams.]
The "Real" ABCI – 2018Q1
• Extreme computing power
  – w/ >130 AI-PFlops for AI/ML, especially DNN
  – x1 million speedup over a high-end PC: 1-day training for a 3000-year DNN training job (see the arithmetic sketch below)
  – TSUBAME-KFC (1.4 AI-PFlops) x 90 users (T2 avg)
• Big Data and HPC converged modern design
  – For advanced data analytics (Big Data) and scientific simulation (HPC), etc.
  – Leverages Tokyo Tech's "TSUBAME3" design, but with AI/BD-centric differences and enhancements
• Ultra-high-BW & low-latency memory, network, and storage
  – For accelerating various AI/BD workloads
  – Data-centric architecture that optimizes data movement
• Big Data/AI and HPC SW stack convergence
  – Incl. results from JST-CREST EBD
  – Wide contributions from the PC Cluster community desirable
• Ultra-green (PUE < 1.1), high thermal density (60kW) racks
  – Custom, warehouse-like IDC building and internal pods
  – Final "commoditization" of HPC technologies into Clouds
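The "x1 million / 3000-year" bullet is simple arithmetic once a PC baseline is fixed; a minimal sketch (the ~0.13 AI-TFlops PC baseline is an assumption chosen for illustration, not a number from the slide):

# Rough illustration of the "1 day vs. 3000 years" claim above.
abci_ai_pflops = 130                        # >130 AI-PFlops (slide figure)
pc_ai_tflops   = 0.13                       # assumed high-end PC baseline (illustrative)

speedup      = abci_ai_pflops * 1e3 / pc_ai_tflops   # PFlops -> TFlops
years_on_pc  = 3000
days_on_abci = years_on_pc * 365 / speedup

print(f"speedup ~ {speedup:,.0f}x")                         # ~1,000,000x
print(f"{years_on_pc} PC-years ~ {days_on_abci:.1f} days")  # ~1.1 days on ABCI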
ABCI Cloud Infrastructure
• Ultra-dense IDC design from the ground up [ABCI AI-IDC CG image]
  – Custom, inexpensive, lightweight "warehouse" building w/ substantial earthquake tolerance
  – x20 thermal density of a standard IDC
• Extreme green
  – Ambient warm-liquid cooling, large Li-ion battery storage, high-efficiency power supplies, etc.
  – Commoditizing supercomputer cooling technologies to Clouds (60kW/rack)
• Cloud ecosystem
  – Wide-ranging Big Data and HPC standard software stacks
• Advanced cloud-based operation [reference image; source: NEC deployment case study]
  – Incl. dynamic deployment, container-based virtualized provisioning, multitenant partitioning, automatic failure recovery, etc.
  – Joining the HPC and Cloud software stacks for real
• Final piece in the commoditization of HPC (into the IDC)
Benchmark ranking excerpt:
– JCAHPC Oakforest-PACS (Joint Center for Advanced HPC), Fujitsu PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4 GHz, OmniPath, Japan: 4 / 7 / 0.3855 / 13.6 / 1.5% / 2.8%
– NASA Pleiades (Ames Research Center/NAS), HPE SGI ICE X, USA: 10 / 15 / 0.1750 / 5.95 / 2.5% / 2.9%
Sparse BYTES: The Graph500 – 2015~2016 – world #1 x 4
K Computer #1 — Tokyo Tech [Matsuoka EBD CREST], Kyushu Univ. [Fujisawa Graph CREST], RIKEN AICS, Fujitsu
• 88,000 nodes, 660,000 CPU cores, 1.3 Petabytes of memory
• 73% of total execution time is spent waiting in communication (communication ≫ computation)
• Graph500 history (list / rank / GTEPS / implementation): November 2013 – rank 4 – 5524.12 GTEPS – top-down-only algorithm; the #1 results come from an improved implementation plus superior BYTES (problem scales 30–40)
• BYTES-rich machine vs. Linpack-oriented machines: K (1.3 Petabytes mem) vs. LLNL-IBM Sequoia (1.6 million CPUs, 1.6 Petabytes mem) and TaihuLight (10 million CPUs, 1.3 Petabytes mem)
1. TOP500 List: 2, 4, 4, 4, 7, 8
2. HPCG: 2
3. HPC Challenge Awards (HPL, Random Access, STREAM, FFT)
4. Graph500: 4, 2
5. Green500 (TSUBAME): 2
(Source: Fujitsu, Copyright 2017 FUJITSU LIMITED)
Japan Flagship 2020 "Post-K" Supercomputer
✓ CPU
  • ARM v8 + 512-bit SVE extensions
  • Multi-hundred petaflops peak total
  • "Power knob" feature for saving power
✓ Memory
  • 3-D stacked DRAM, Terabyte/s BW
✓ Interconnect …
• 30MW+ power
[Figure: system layout — compute nodes and interconnect, I/O network, login and maintenance servers; photo: Prime Minister Abe visiting the K Computer, 2013]
• Fujitsu's inheritances
  – FMA
  – Math acceleration primitives
  – Inter-core barrier
  – Sector cache
  – Hardware prefetch assist
  – Explicit …
• Govt. committee recommendation to explore HW component collaboration
Post-K will be/have:
• Continuum of K (^-^)
• Pretty good DFP FLOPS (^-^)
• Very good low-precision FLOPS (^◇^)
• Awesome memory bandwidth \(◎o◎)/!
• (but) relatively low memory capacity (-_-;)
• Awesome network injection bandwidth \(◎o◎)/
• Very good network bisection bandwidth (^◇^)
• (but) modest I/O speed (-_-;)
• ARM ecosystem – 99.9% of codes work \(◎o◎)/
TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC
[Figure: TSUBAME3 node — intra-node GPU-to-GPU via NVLink at 20~40GB/s; HBM2 64GB at 2.5TB/s; terabit-class network per node, 800Gbps (400+400), full bisection; inter-node GPU communication via OmniPath at 12.5GB/s, fully switched.]
[Figure: Fujitsu-style compute node for comparison — performance core with 256-bit SIMD, 1.1 TFLOPS (FMA x2); L1 cache 140.8GB/s (in) / 70.4GB/s (out), 4.4TB/s aggregate; L2 cache 2.2TB/s, 140GB/s (read); memory via HMC controllers at 240GB/s x2 (in/out); Tofu2 interconnect 12.5GB/s x 10 ports, 125GB/s x2 (in/out). Memory BW : injection BW = 2:1 (TSUBAME3: 40:1).]
Post-Moore: from the Many-Core Era to a "Cambrian" Era
• FLOPS-centric monolithic algorithms and apps → Cambrian heterogeneous algorithms and apps
• Examples of specialized/neuromorphic processors: deep learning units (DLU), IBM TrueNorth, Manchester SpiNNaker (ARM-based)
• Hitachi @ ISSCC 2015: "An 1800-Times-Higher Power-Efficient 20k-spin Ising Chip for Combinational Optimization Problem with CMOS Annealing"
  – Competitive with quantum annealing, room temperature, easy to scale
  – Could be applicable to deep learning?
Tokyo Tech Work on FPGAs in the Post-Moore Era
[Artur Podobas, Hamid Zohouri, Satoshi Matsuoka]
• Moore's law is ending
  – Silicon area will become scarce
  – Cannot afford generality (general-purpose architectures)
• Expecting a "Cambrian explosion" in computer architecture
  – Many niche architectures for specific purposes, e.g. quantum computing, neuromorphic computing, specialized ASICs, DSPs, etc.
• Field-Programmable Gate Arrays (FPGAs)
  – Devices hosting a sea of logic and interconnect
  – Logic includes Look-Up Tables (LUTs), on-chip RAM (Block RAM), and Digital Signal Processing blocks (DSPs)
  – The programmer is responsible for programming and connecting the logic to specify device behavior
• FPGAs are Post-Moore friendly
  – The silicon area of FPGAs is malleable – dynamically reconfigurable
  – Better (more diverse) use of chip logic
Using FPGAs for High Performance Computing
"Evaluating High-Level Design Strategies on FPGAs for High-Performance Computing", A. Podobas, H.R. Zohouri, N. Maruyama, S. Matsuoka, IEEE FPL 2017
• Motivation
  – FPGAs are Post-Moore friendly: they allow dynamic reconfiguration of the silicon, "tuning" the architecture towards application needs
  – FPGAs are notoriously hard to program (they require hardware expertise); high-level programming approaches are attractive, but their performance is unknown
• Method
  – Evaluated three high-level programming approaches for FPGAs:
    • a 30-core many-core system (represents: programmability)
    • LegUp High-Level Synthesis (represents: multiple custom accelerators)
    • Intel OpenCL for FPGA (represents: deep-pipeline designs)
  – Improvements: we improved the memory hierarchy for the many-core and multi-accelerator designs through cache multi-banking
  – Evaluated on the Rodinia benchmark suite using a Stratix V FPGA
Using FPGAs for High Performance Computing (Cont.)
• Results
  – The Intel FPGA SDK for OpenCL achieves the highest average performance
  – LegUp can remain competitive for applications with high compute intensity and good spatial/temporal locality
  – The many-core approach offers good programmability (OpenMP) but relatively low performance
Optimizing Kernels for High-Performance Computing with FPGAs
"Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs", H.R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, S. Matsuoka, ACM Supercomputing 16
• Motivation
  – OpenCL for FPGAs is a promising High-Level Synthesis tool to leverage FPGA technology without hardware expertise
  – What performance can we expect from FPGAs compared to the more general-purpose CPUs and GPUs?
• Method
  – Six benchmarks from the Rodinia benchmark suite, optimized using advanced FPGA optimizations: minimizing initiation intervals, inferring shift registers and sliding windows
  – Two FPGA devices: Intel Stratix V (A7) and Intel Arria 10 GX1150
  – Power and performance compared against an Intel Xeon E5-2650v3 and GPUs
• Results
  – We found our FPGAs to be, on average, faster than the CPU implementations, especially the newer Arria 10 GX1150
  – Our FPGA implementations were up to 3.4x more power-efficient than the K20c, but the performance of the GPU could not be reached
Maximizing Performance of FPGAs in Stencil Computation
• Motivation
  – Stencil computation is one of the more important computation patterns in HPC: differential equations; weather, seismic, and fluid simulation; convolutional neural networks
  – Hypothesis: by exploiting the unique architecture of FPGAs, they can achieve performance comparable to GPUs in stencil computation, despite their much lower memory bandwidth and compute performance
[Figure: stencil accelerator block diagram — compute units fed from DDR memory, with spatial blocking (including out-of-bound handling) and blocking in time.]
Maximizing Performance of FPGAs in Stencil Computation (Cont.)
• Methodology
  – Hotspot 2D and 3D: two inputs and one output, 8/9 (2D) and 8/13 (3D) bytes per FLOP
  – Diffusion 2D and 3D: one input and one output, 12/15 (2D) and 12/17 (3D) bytes per FLOP (a minimal reference kernel is sketched below)
  – Intel Stratix V A7 and Arria 10 GX 1150
  – Four generations of highest-end NVIDIA GPUs
  – Estimation for the upcoming Intel Stratix 10 FPGAs
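For reference, a minimal NumPy sketch of the Diffusion 2D pattern evaluated here — a plain 5-point Jacobi-style update with illustrative coefficients; the actual FPGA kernels use deep spatial/temporal blocking and are not reproduced:

import numpy as np

def diffusion_2d_step(u, cc, cn, cs, ce, cw):
    # One Jacobi step of a 5-point 2D diffusion stencil (interior points only):
    # one input array is read and one output array is written per sweep,
    # matching the "one input and one output" description above.
    v = u.copy()
    v[1:-1, 1:-1] = (cc * u[1:-1, 1:-1]
                     + cn * u[:-2, 1:-1] + cs * u[2:, 1:-1]
                     + cw * u[1:-1, :-2] + ce * u[1:-1, 2:])
    return v

u = np.zeros((512, 512), dtype=np.float32)
u[256, 256] = 1.0                       # point source
for _ in range(100):
    u = diffusion_2d_step(u, 0.6, 0.1, 0.1, 0.1, 0.1)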
Stencil        Device     Performance (GBps/GFLOPS/GCell/s)   fmax (MHz)   Logic   Memory (Bits/Blocks)   DSP    Power (Watt)
Diffusion 2D   Stratix V  99.582 / 112.030 / 12.448           302.48       69%     22% / 52%              95%    29.845
Diffusion 2D   Arria 10   673.959 / 758.204 / 84.245          343.76       55%     38% / 83%              95%    72.530
Hotspot 2D     Stratix V  112.218 / 140.273 / 9.352           269.97       84%     27% / 61%              64%    33.361
Hotspot 2D     Arria 10   480.335 / 600.419 / 40.028          326.58       47%     53% / 94%              95%    52.411
Diffusion 3D   Stratix V  62.435 / 101.457 / 7.804            301.02       62%     36% / 67%              91%    21.135
Diffusion 3D   Arria 10   230.568 / 374.673 / 28.821          286.61       60%     94% / 100%             89%    71.628
Hotspot 3D     Stratix V  63.603 / 90.104 / 5.300             246.18       76%     68% / 100%             100%   36.126
Hotspot 3D     Arria 10   228.149 / 323.211 / 19.012          296.20       62%     81% / 100%             96%    73.398
Maximizing Performance of FPGAs in Stencil Computation (Cont.)
• Unlike GPUs, FPGAs can achieve higher computation throughput than their memory bandwidth
• Arria 10 achieves better performance than K40c, despite 8 times lower memory bandwidth
• Arria 10 achieves better power efficiency than 980 Ti, and close to P100 and V100
• Stratix 10 MX2100 will have better performance and power efficiency compared to next-generation GPUs
Neuromorphic Computing on FPGAs
"Designing and Accelerating Spiking Neural Networks using OpenCL for FPGAs", A. Podobas, S. Matsuoka, IEEE FPT 2017 (to appear)
• Motivation
  – Neuromorphic computing is an emerging Post-Moore computing paradigm
  – Spiking Neural Networks (SNNs) are one instance of neuromorphic computing
    • Information is conveyed temporally through events called "spikes"
    • Can be more power-efficient than traditional "rate-based" neural networks
    • Used in e.g. the IBM TrueNorth architecture
  – Can we leverage FPGAs to accelerate another Post-Moore computing paradigm, and what performance can we expect?
• Method
  – Created a custom FPGA accelerator for SNNs
    • Supports two vastly different but well-known neuron models
    • Leverages Python for simplicity
    • Exploits the timing characteristics (the "delays") of the neural networks to increase instruction- and dataflow-level parallelism
    • Described using OpenCL – portable across FPGA devices
  – Compared against well-known simulators:
    • NEST for CPUs, running on high-end Xeon processors and Xeon Phi accelerators
    • NeMo running on K20x and P100 GPUs
    • Our design running on a Stratix V FPGA
  – Evaluated on a variety of networks with different spiking activity
Neuromorphic Computing on FPGAs (Cont.)
• Results
  – Our accelerator reaches up to 2.25 billion spikes/second, despite our Stratix V being built on 5-year-old technology
  – Our performance surpasses that of NEST on recent multi-threaded CPUs, including both 24 hyper-threaded Xeon cores and Xeon Phi
  – Initial results indicate that our accelerator rivals the performance of NeMo on GPUs; this still needs further investigation and comparison against other GPU frameworks
Future Outlook and Ongoing Work: Manipulating Precision and Numerical Format (e.g. IEEE vs. POSIT)
• Motivation
  – Explosion in numerical precision formats:
    • IEEE 754 floating point (what we are used to)
    • Fixed-point arithmetic (what DSP builders are used to)
    • Intel FlexPoint (new AI format, shared exponent bits inside tensors)
    • POSIT floating point (an IEEE 754 replacement candidate)
  – Potentially great impact on performance and silicon: precision becomes a tunable knob in the Post-Moore workflow (a minimal illustration follows below)
  – Performance, power, area, and precision trade-offs are unclear:
    • performance trade-offs in computer simulations
    • accuracy trade-offs in training of neural networks
    • area/power trade-offs
• Leverage FPGAs to assist design-space exploration
  – Automatically generate custom-precision datapaths (chains of operators)
  – Integrate the datapath into existing infrastructure: Intel FPGA SDK for OpenCL; ASIP extensions to soft cores, e.g. RISC-V
  – Evaluate on existing applications
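As a minimal illustration of precision as a tunable knob, the sketch below runs the same reduction in the IEEE types that NumPy provides (FlexPoint and POSIT are not available in NumPy and are not shown):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096)
b = rng.standard_normal(4096)
ref = np.dot(a, b)                               # float64 reference result

for dt in (np.float16, np.float32, np.float64):
    approx = np.dot(a.astype(dt), b.astype(dt))  # same dot product, lower precision
    err = abs(float(approx) - ref) / abs(ref)
    print(f"{np.dtype(dt).name:8s} eps={np.finfo(dt).eps:.1e}  rel.err={err:.1e}")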
Performance growth via data-centric computing: "From FLOPS to BYTES"
• Identify the new parameter(s) for scaling over time
• Data-related parameters (e.g. capacity and bandwidth) will still likely continue to grow towards the 2040s
• We can grow the transistor count for compute, but CANNOT use them all AT THE SAME TIME (dark silicon) => multiple computing units specialized to the type of data
• Continued capacity growth: 3D stacking (esp. direct silicon layering) and low-power NVM (e.g. ReRAM)
• Continued BW growth: data-movement energy will be capped constant by dense 3D design and advanced optics from silicon photonics technologies
• Almost back to the old "vector" days(?), but no free lunch – latency is still a problem, locality is still important; we need general algorithmic acceleration through data capacity and bandwidth, not FLOPS
Non-Volatile Memory and 3-D Stacking
• Many devices
• Various stacking technologies
[Figure: capacity via dense stacking; core energy/area estimates (original slide courtesy John Shalf @ LBNL) — a compute op costs ~22 pJ, while moving data across ~1 mm / 3.6 mm / 12 mm of silicon costs roughly 0.2x / 1.6x / 5.5x of a compute op; assorted die dimensions (0.2–4.5 mm).]
[Excerpted paper page: 64-bit dual-core RISC-V processor with vector accelerators (Rocket + Hwacha), 45 nm SOI]

RISC-V ISA designed at the University of California, Berkeley. In a standard 40 nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARM's comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, we integrate a custom vector accelerator alongside each single-issue in-order scalar core. The vector accelerator is 1.8x more energy-efficient than the IBM Blue Gene/Q processor, and 2.6x more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves a maximum clock frequency of 1.3 GHz at 1.2 V and peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65 V with an area of 3 mm².

I. INTRODUCTION

As we approach the end of conventional transistor scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Many proposed accelerators, such as those based on GPU architectures, require a drastic reworking of application software to make use of separate ISAs operating in memory spaces disjoint from the demand-paged virtual memory of the host CPU. RISC-V [1] is a new completely open general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley, which is designed to be flexible and extensible to better integrate new efficient accelerators close to the host cores. The open-source RISC-V software toolchain includes a GCC cross-compiler, an LLVM cross-compiler, a software ISA simulator, an ISA verification suite, a Linux port, and additional documentation, and is available at www.riscv.org.

In this paper, we present a 64-bit dual-core RISC-V processor with custom vector accelerators in a 45 nm SOI process. Our RISC-V scalar core achieves 1.72 DMIPS/MHz, outperforming ARM's Cortex-A5 score of 1.57 DMIPS/MHz by 10% in a smaller footprint. Our custom vector accelerator is 1.8x more energy-efficient than the IBM Blue Gene/Q processor and 2.6x more than the IBM Cell processor for double-precision floating-point operations, demonstrating that high efficiency can be obtained without sacrificing a unified demand-paged virtual memory environment.

Fig. 1. Backside chip micrograph (taken with a removed silicon handle) and processor block diagram. [Labels: dual-core RISC-V vector processor with 1MB SRAM array; per core a Rocket scalar core (16K L1I$, 32K L1D$) and a Hwacha vector accelerator (8KB L1VI$), arbiters, coherence hub, FPGA FSB/HTIF interface.]

II. CHIP ARCHITECTURE

Figure 1 shows the block diagram of the dual-core processor. Each core incorporates a 64-bit single-issue in-order Rocket scalar core, a 64-bit Hwacha vector accelerator, and their associated instruction and data caches, as described below.

A. Rocket Scalar Core

Rocket is a 6-stage single-issue in-order pipeline that executes the 64-bit scalar RISC-V ISA (see Figure 2). The scalar datapath is fully bypassed but carefully designed to minimize the impact of long clock-to-output delays of compiler-generated SRAMs in the caches. A 64-entry branch target buffer, 256-entry two-level branch predictor, and return address stack together mitigate the performance impact of control hazards. Rocket implements an MMU that supports page-based virtual memory and is able to boot modern operating systems, including Linux. Both caches are virtually indexed, physically tagged with parallel TLB lookups. The data cache is non-blocking, allowing the core to exploit memory-level parallelism.

Rocket has an optional IEEE 754-2008-compliant FPU, which can execute single- and double-precision floating-point operations, including fused multiply-add (FMA), with hardware support for denormals and other exceptional values.

Fig. 2. Rocket scalar plus Hwacha vector pipeline diagram. [Rocket pipeline: PC gen. → I$ access → instruction decode → Int.EX (ITLB, Int.RF, DTLB) → D$ access → commit, handing off to Hwacha; FP.RF feeding FP.EX1–EX3; bypass paths omitted for simplicity. Hwacha pipeline: PC gen. → VI$ access → VInst. decode → sequencer → expand → Bank1…Bank8 R/W.]
Example Innovation: Tungsten TSVs at 2µm ultra-fine pitch with die thinning, by Tezzaron Semiconductor
• Suppose 4TF SFP @ 7nm, with 16TB/s internal chip BW vs. 200GB/s external chip memory BW => 80 times speedup!
• High-density, high-signaling TSV challenge (see the sanity-check sketch below)
  – Wide I/O 2: 1024 bits @ 1 GHz -> 2~3 GHz
  – We need 128,000 bits @ 1 GHz!
  – 10-micron TSV estimation: 400 x 400 TSVs on a 20mm x 20mm chip -> 50-micron spacing
  – With tungsten TSVs the chip area overhead is negligible
• Many-layer stacking via aggressive wafer thinning and self-diagnostics
Source: Tezzaron website, http://www.tezzaron.com
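A quick sanity check of the numbers quoted above; all inputs are the slide's own figures:

# Sanity check of the TSV bandwidth figures above.
bits     = 128_000                             # required I/O width
clock_hz = 1e9                                 # 1 GHz
internal = bits * clock_hz / 8 / 1e12          # internal BW in TB/s
external = 0.2                                 # 200 GB/s external memory BW, in TB/s

print(f"internal BW = {internal:.0f} TB/s")          # -> 16 TB/s
print(f"speedup     = {internal / external:.0f}x")   # -> 80x

tsvs_per_side = 400
chip_mm       = 20
print(f"TSV pitch   = {chip_mm * 1000 / tsvs_per_side:.0f} um")   # -> 50 um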
DiRAM4 Stack Overview (Tezzaron slides taken from http://www.tezzaron.com/media/Tezzaron-Presentation-EPS-100814-dist-.pptx)
• 64 Gb of memory in 175 mm²
• 256 fully independent RAMs
• 16 banks per RAM
• 64-bit separate-I/O data per RAM
• 7ns access time (closed page to data)
• 12ns tRC (page open to page open in a bank)
• 16 Tb/s data bandwidth
• Competitive manufacturing cost
[Figure: 3D integration cross-section — 3-layer 3D memory on a 2-layer processor, die-to-wafer Cu thermal-diffusion bonding, µbumps and C4 bumps.]
[Figure: stacking on an active silicon circuit board — FPGA (4X nm) at level #4 down to level #0, connected by solder bumps.]
[Figure: concept of daughter-chip connection — FPGAs with reconfigurable switches, an optical interconnect chip, and new-memory daughter chips.]
[Figure: host CPU + 3-accelerator chip stack fabricated in 65nm CMOS, with a microphotograph of the stacked test chips — network IF, TEG, MIPS CPU core, µ-controller, 8x8 PE array, and TCI Tx/Rx links between the host CPU and the accelerator chips.]
Strawman BYTES-Oriented Post-Moore Architecture
[Figure: node concept —
• Low-voltage, low-power CPUs for direct 3D stacking (configurable low-power CPU)
• 16TB/s DRAM & NVM bandwidth, large NVM/Flash capacity on stacked NVM/Flash silicon
• 5~10Tbps network
• Domain-specific hetero- and customizable processor configurations, including PIM
• Extreme multi-layer DRAM & NVRAM stacking via high-density tungsten TSVs
• Direct WDM optics onto the interposer; direct chip-to-chip interconnect with DWDM optics; optical switch & launch pad; interposer on PCB]
Interconnect Shortcomings
• Current technology:
  – $10 / Gbps and 50 pJ per bit, per link
  – 1 exaflops -> 10 PB/s injection bandwidth
  – O($1B) and O(5MW) for the node links alone (a rough derivation is sketched below)
• First stepping stone: mid-board optics – VCSELs
  – Advanced development program at HPE
  – Cheaper, more efficient, can be water-cooled
• Exascale technology target: silicon photonics – ring resonators
  – 10 cents per Gbps, 5 pJ per bit
  – Enabling enhanced topologies like the Hyper-X will require new "widgetry"
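A rough derivation of the O($1B) / O(5 MW) figures from the per-link numbers quoted above:

# Cost/power of exascale injection bandwidth at current link technology.
injection_Bps = 10e15            # 10 PB/s injection BW for ~1 exaflops (slide figure)
bits_per_s    = injection_Bps * 8

cost_per_gbps = 10               # $ per Gb/s per link (slide figure)
pj_per_bit    = 50               # pJ per bit (slide figure)

cost_usd = bits_per_s / 1e9 * cost_per_gbps
power_mw = bits_per_s * pj_per_bit * 1e-12 / 1e6

print(f"link cost  ~ ${cost_usd / 1e9:.1f}B")   # ~$0.8B  -> O($1B)
print(f"link power ~ {power_mw:.0f} MW")         # ~4 MW   -> O(5 MW)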
• 2.5pJ/bit power
• Bare-metal protocol
  – Ultra-low latency
  – Protocol agnostic
• 8-core fiber
• 25Gb SERDES or 3.125Gb interface
• Self-calibrating, self-tuning
• >1.6Tb/s payload
(Tezzaron slides taken from http://www.tezzaron.com/media/Tezzaron-Presentation-EPS-100814-dist-.pptx)
Problem: heavy optical loss
Fast Optical Crossbar Switch (EECS, UC Berkeley)
Seok et al., "Large-scale broadband digital silicon photonic switches with vertical adiabatic couplers", Optica 3(1), 2016
• Array-based 64x64 MEMS optical crossbar switch
• 3.7dB on-chip insertion loss
• 0.91-microsecond switching time
[Figure labels: low-latency small packets vs. bulk transfer]
NICT Optical Packet Switch Node (slides courtesy NICT)
• 4 x 4 OPS node with optical packet (OP) transponder
• 100Gb/s OPS ports, 10GbE x 10 client ports
• Stability: tolerance for environmental disturbance (polarization, power fluctuation)
• Total throughput: 800 Gb/s
• Total power consumption: 141 W (w/o transponder)
• 10-node hopping, 450 km fiber transmission
• Building blocks: burst-mode optical amplifiers, header processors, 100G optical packet transponder, 4 x 4 EA switch (1U) with switching speed < 8 ns and power consumption of 3 W
• Optical packet format: 100 Gb/s multi-wavelength packets — preamble + header plus 10 x 10 Gb/s payloads carried on λ1–λ10; the client network is 10Gb Ethernet
Y. Muranaka et al., Photonics in Switching 2015; H. Furukawa et al., no. P.4.16, ECOC 2015.
Contact: Hideaki Furukawa (furukawa@nict.go.jp), © 2016 National Institute of Information and Communications Technology
Applications & Algorithms
[Figure: implicit vs. explicit sparse solvers (memory-bound) — 89.5x difference; 23.66 DP-PF for the explicit case; axis values 0.00125–0.01.]
Improvement of performance on sparse matrix computations due to higher memory bandwidth

[Figure: a small FEM mesh with node numbering 1–16 and the corresponding 16x16 sparse coefficient matrix — diagonal entries D and off-diagonal couplings X — applied as {Y} = [A]{X}.]

Sparse matrix-vector multiply kernel:

do i= 1, N
  Y(i)= D(i)*X(i)
  do k= INDEX(i-1)+1, INDEX(i)
    Y(i)= Y(i) + AMAT(k)*X(ITEM(k))
  enddo
enddo

Sparse matrices:
• FEM
• Indirect memory access
• Memory-bound
Parallel-in-Space/Time (PiST)
• Multigrid (MG) is scalable, but the performance improvement is limited if parallelization is applied only in the space direction
• Time-dependent problems: exploit concurrency in the time direction — multigrid in (space + time)
  ✓ Traditional time-dependent method: point-wise Gauss-Seidel
  ✓ XBraid: Lawrence Livermore National Laboratory
• Application to nonlinear problems (transient Navier-Stokes equations)
• Minisymposium with 3 sessions at SIAM PP16 (April 2016)
• The PiST approach is suitable for Post-Moore systems with complex and deeply hierarchical networks that cause large latency
APPLICATION TOPIC: FUSION ENERGY SCIENCE
[Slides courtesy William Tang, Princeton University]
SITUATION ANALYSIS
The most critical problem for fusion energy: avoid/mitigate large-scale major disruptions.
• Approach: conventional "first-principles (hypothesis-based)" HPC simulations are unable to meet this challenge, so big-data-driven statistical machine-learning (ML) predictions of the occurrence of disruptions in fusion-grade plasmas — such as the "Joint European Torus (JET)" today and "ITER" in the near future — are now deployed.
• Current status: ~8 years of R&D results (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data executed on CPU clusters, yielding reported success rates in the mid-80% range for JET 30 ms before disruptions, BUT >95% with a false-alarm rate <5% is actually needed for ITER (reference: P. DeVries et al., 2015).
• Princeton team goals include:
  (i) improve physics fidelity via development of new multi-D, time-dependent ML software, including better classifiers;
  (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and
  (iii) enhance ML execution speed for very large datasets by using supercomputers (e.g., "Titan"/"Summit" in the US, "TSUBAME-3" in Japan, "Piz Daint" at CSCS in Europe)
  → development & deployment of advanced ML software via deep-learning recurrent neural networks
CLASSIFICATION
• Binary classification problem: shots are disruptive or non-disruptive
• Supervised ML techniques: physics domain scientists combine a knowledge base of observationally validated information with advanced statistical/ML predictive methods
• Machine learning (ML) methods engaged: the basic SVM approach initiated by the JET team produced the "APODIS" software, leading now to Princeton's new deep-learning Fusion Recurrent Neural Net (FRNN) code
• Approach: (i) begin with properly normalized data; (ii) use training sets to generate new models; (iii) use trained models to classify new samples & improve prediction of tokamak disruptions
  → multi-D data analysis requires new signal representations
  → the FRNN software includes deep-learning convolutional and recurrent neural-net features to respond to these new challenges
Machine Learning Workflow
[Figure: stacked RNN architecture — LSTM, 3 layers, 300 cells per layer, carrying internal state across time steps; an alarm is raised when the output exceeds a threshold. Example results: TP 93.5% / FP 7.5% and TP 90.0% / FP 5.0%; ROC area 0.96. Implementation: TensorFlow + MPI; new FRNN scaling tests on TSUBAME 3.0.]
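A minimal sketch of the RNN architecture described above, using Keras/TensorFlow; the number of input signals and the training setup are placeholders for illustration, not the actual FRNN code:

import tensorflow as tf   # FRNN uses TensorFlow (+MPI); this is only a shape sketch

n_signals = 14             # plasma signals per time step (placeholder value)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, n_signals)),             # variable-length shots
    tf.keras.layers.LSTM(300, return_sequences=True),    # 3 stacked LSTM layers,
    tf.keras.layers.LSTM(300, return_sequences=True),    # 300 cells each, carrying
    tf.keras.layers.LSTM(300, return_sequences=True),    # internal state in time
    tf.keras.layers.Dense(1, activation="sigmoid"),      # per-time-step disruption score
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# An alarm is raised when the predicted score exceeds a chosen threshold,
# trading true positives against false positives (the ROC trade-off above).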
JST-CREST "Extreme Big Data" (EBD) Project (2013–2018)
[Figure: co-design between problem domains — ultra-large-scale graphs and social infrastructures, massive sensors and data assimilation in weather prediction, large-scale metagenomics — and the EBD system software (EBD Bag, KV stores, Cartesian plane abstractions), on a convergent supercomputer architecture with 1.5TB/s DRAM & NVM bandwidth per node (NVM/Flash + DRAM + CPU); the aim is to bring HPC architectural rigor, performance, and acceleration/scaling to next-generation big-data supercomputers adapted to AI/BD.]
(Big Data) BYTES capabilities — bandwidth and capacity — are universally important but often missing from modern HPC machines in their pursuit of FLOPS…
• We need BOTH bandwidth and capacity (BYTES) in an HPC-BD/AI machine
• Obvious for sparse, bandwidth-dominated apps
• But also for DNNs: strong scaling, large networks and datasets, in particular for future 3D dataset analysis such as CT scans and seismic simulation vs. analysis
• Our measurement of the breakdown of one iteration of CaffeNet training on TSUBAME-KFC/DL (mini-batch size of 256, 8 GPUs per node): computation on the GPUs occupies only 3.9%
• A proper architecture must support large memory capacity and BW; network latency and BW are also important
4 Layers of Parallelism in DNN Training
• Hyper-parameter search
  – Searching for optimal network configurations and parameters
  – Often uses evolutionary algorithms
• Data parallelism (see the sketch below)
  – Split and parallelize the batch data
  – Synchronous, asynchronous, hybrid, …
• Model parallelism
  – Split and parallelize the layer calculations in forward/backward propagation
• ILP and other low-level parallelism
  – Parallelize the convolution operations etc. (in reality tensor ops / matrix multiplies)
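A minimal sketch of the data-parallelism layer (synchronous variant): split the mini-batch across workers, compute per-worker gradients, and apply the averaged gradient once. The toy least-squares gradient stands in for back-propagation; in practice each slice runs on its own GPU and the average is an all-reduce:

import numpy as np

def sgd_data_parallel_step(w, batch_x, batch_y, grad_fn, n_workers, lr=0.01):
    # Split the mini-batch across workers, compute each worker's gradient,
    # then apply the averaged gradient in a single synchronous update.
    xs = np.array_split(batch_x, n_workers)
    ys = np.array_split(batch_y, n_workers)
    grads = [grad_fn(w, x, y) for x, y in zip(xs, ys)]   # one GPU each, in reality
    return w - lr * np.mean(grads, axis=0)               # all-reduce + update

# Toy example: least-squares gradient, 4 "workers"
grad_fn = lambda w, x, y: 2 * x.T @ (x @ w - y) / len(y)
rng = np.random.default_rng(0)
x, y = rng.standard_normal((256, 8)), rng.standard_normal(256)
w = np.zeros(8)
for _ in range(100):
    w = sgd_data_parallel_step(w, x, y, grad_fn, n_workers=4)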
[Figure: asynchronous SGD in the DNN parameter space — weight updates W(t+1), W(t+2), W(t+3) of the form -η Σ_i ∇E_i; staleness = 0 when updates are applied immediately vs. staleness = 2 when two asynchronous updates occur within one gradient computation. Predicted vs. measured probability distributions of staleness and mini-batch size for 8 and 16 nodes, with N_subbatch = 4 and 8 (N_subbatch: number of samples per GPU iteration).]
Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5–8, 2016.
Interconnect performance is as important as GPU performance for accelerating DL
• ASGD DL system SPRINT (by DENSO IT Lab) and DL speedup prediction with a performance model
Background: cuDNN Convolution
2D convolution (forward):

Y[n, k, h, w] = Σ_{c,u,v} W[k, c, u, v] * X[n, c, h+u, w+v]

[Figure: input X of shape N x C x H x W, filters W of shape K x C x U x V, output Y of shape N x K x H x W.]
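A direct NumPy transcription of the formula above (no padding, stride 1; purely illustrative — the cuDNN algorithms discussed next compute exactly this, only much faster):

import numpy as np

def conv2d_forward(X, W):
    # Y[n,k,h,w] = sum_{c,u,v} W[k,c,u,v] * X[n,c,h+u,w+v]
    N, C, H, Wd = X.shape
    K, _, U, V  = W.shape
    Y = np.zeros((N, K, H - U + 1, Wd - V + 1), dtype=X.dtype)
    for n in range(N):
        for k in range(K):
            for h in range(Y.shape[2]):
                for w in range(Y.shape[3]):
                    Y[n, k, h, w] = np.sum(W[k] * X[n, :, h:h+U, w:w+V])
    return Y

X = np.random.rand(2, 3, 8, 8).astype(np.float32)   # N, C, H, W
W = np.random.rand(4, 3, 3, 3).astype(np.float32)   # K, C, U, V
print(conv2d_forward(X, W).shape)                    # (2, 4, 6, 6)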
Background: cuDNN Convolution (Cont.)
• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among the convolution algorithms
  – e.g. an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown (workspace limit < 323 MiB vs. ≥ 323 MiB)
  – cuDNN forward algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED
[Figure: execution time (ms) per algorithm below and above the 323 MiB workspace limit.]
[Figure: computation performance (images/ms) and workspace size (MiB) of FFT_TILING for AlexNet conv2 (forward) as a function of batch size.]
Approach and Contribution
[Figure: formulation sketch — per-layer configuration choices (u ∈ C1 for conv1, …) with associated costs c_µ and times T1(u), T2(v), …, with the objective of minimizing the total time T.]
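The slides do not reproduce the actual optimization model, but the selection problem it addresses can be illustrated with a toy brute-force search: pick one (algorithm, workspace) configuration per layer to minimize total time under a shared workspace budget. All numbers below are made up for illustration:

from itertools import product

# Per-layer candidates: (algorithm name, time in ms, workspace in MiB) — illustrative only.
layers = [
    [("IMPLICIT_GEMM", 4.0, 0), ("FFT_TILING", 1.2, 90), ("WINOGRAD", 1.6, 40)],
    [("IMPLICIT_GEMM", 6.0, 0), ("FFT", 2.0, 70), ("GEMM", 3.5, 20)],
]

def best_under_budget(layers, budget_mib):
    # Exhaustively pick one configuration per layer, minimizing total time
    # while keeping the summed workspace within the budget.
    best = None
    for combo in product(*layers):
        time = sum(c[1] for c in combo)
        ws   = sum(c[2] for c in combo)
        if ws <= budget_mib and (best is None or time < best[0]):
            best = (time, ws, [c[0] for c in combo])
    return best

print(best_under_budget(layers, budget_mib=120))   # -> (3.6, 110, ['WINOGRAD', 'FFT'])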
Evaluation: WD using Integer LP
[Figure: breakdown of the workspace size of AlexNet under different division policies — undivided, powerOfTwo, and all — for WR and WD; mini-batch size of 256, total workspace size of 120 MiB, P100-SXM2.]
Evaluation: WD using Integer LP (Cont.)
[Figure: per-layer algorithm choices and performance — BYTES-bound algorithms (FFT, WINOGRAD_NONFUSED) are faster than compute-bound ones (IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM); baseline: undivided (WR, 8 MiB).]