Talk 1 Satoshi Matsuoka


Cambrian Explosion of Computing in the Post-Moore Era


Satoshi Matsuoka
Professor, GSIC, Tokyo Institute of Technology /
Director, AIST-Tokyo Tech. Big Data Open Innovation Lab /
Fellow, Artificial Intelligence Research Center, AIST, Japan /
Fellow, Advanced Institute for Computational Science, Riken

ETH Collegium Helveticum


2017/12/08
Zurich Switzerland
Current Trend – Many-Core Processors, e.g. GPU, KNL, …
Small # of large CPU cores << Large # of small CPU cores (~x40)

K Computer CPU: 8 Sparc64 CPU cores, 128 GigaFlops
NVIDIA Pascal GPU (TSUBAME3.0): ~4000 “Streaming Processor” CUDA cores, 5300 GigaFlops
Tokyo Tech. TSUBAME Supercomputing History
World’s leading supercomputers: x100,000 speedup in 17 years, developing world-leading use of massively parallel, many-core technology
2000: Matsuoka GSIC appointment
2000-2002: “TSUBAME0”, custom JP university supercomputer, 32 cores, 128 GigaFlops, growing to 800 cores and 1.3 TeraFlops
2006: TSUBAME1.0, 80 TeraFlops, No.1 Asia / No.7 World, first “TeraScale” JP university supercomputer, 10,000 cores
2008: TSUBAME1.2, 170 TeraFlops, world’s first GPU supercomputer
2010: TSUBAME2.0, 2.4 PetaFlops, No.1 World production / Green supercomputer, ACM Gordon Bell Prize; general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling
2013: TSUBAME2.5, upgraded with 4118 GPUs, 5.7 PetaFlops (No.2 Japan), 17.1 AI-PetaFlops
2013: TSUBAME-KFC, TSUBAME3 prototype, oil immersion cooling, Green500 World No.1; 2015: AI prototype upgrade (KFC/DL)
2017: TSUBAME3.0, > 10 million cores, 12.1 PetaFlops (47.2 AI-PetaFlops), Green500 World No.1, HPC and Big Data / AI convergence
Overview of TSUBAME3.0
BYTES-centric architecture; scalability to all 2160 GPUs, all nodes, and the entire memory hierarchy
Full operations: Aug. 2017
Intel Omni-Path interconnect, 4 ports/node; full bisection bandwidth, 432 Terabits/s bidirectional (~x2 the bandwidth of the entire Internet backbone traffic)
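(As a rough cross-check, my arithmetic rather than the slide's: 540 nodes x 4 Omni-Path ports x 100 Gbps per port = 216 Tbps one way, i.e. 432 Tbps bidirectional, matching the full-bisection figure above.)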
DDN storage (Lustre FS 15.9 PB + Home 45 TB)
540 compute nodes: SGI ICE XA + new blade
Per node: Intel Xeon CPU x 2 + NVIDIA Pascal GPU x 4 (NVLink), 256 GB memory, 2 TB Intel NVMe SSD
47.2 AI-Petaflops, 12.1 Petaflops
TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new)
- No exterior cable mess (power, network, water)
- Plan to become a future HPE product
Liquid-cooled, “hot pluggable” ICE-XA blade: smaller than a 1U server, no cables or pipes
Per blade: Xeon x 2, PCIe switch, 256 GByte memory, PCIe NVMe drive bay x 4, liquid-cooled NVMe, 100 Gbps x 4, > 20 TeraFlops DFP

15 compute racks (144 GPUs & 72 CPUs per rack), 4 DDN storage racks, 3 peripheral & switch racks: 22 racks total
Integrated 100/200 Gbps fabric backplane
TSUBAME3.0 became the first large production petaflops-scale supercomputer in the world to be #1 on the “Green500” power-efficiency world ranking of supercomputers.
14.1 Gigaflops/W is more than x10 more efficient than PCs and smartphones!
(Photos: award ceremony at ISC2017 @ Frankfurt; power meters; Tokyo Tech / HPE benchmarking team)
ORNL Summit

• ~200 Petaflops FP64, ~3 Exaflop FP16 by 1H2018

Peak FP64 Exaflop TSUBAME in 2020 ‒
Just getting Flops/W is within reach
• 7nm+ post Volta GPU (Pascal P100 16nm)
• ~10,000 CUDA Cores (P100 3840),12.5 Teraflops/Chip (P100 5.3TF)
w/matrix engine
• 80,000 chips => 80 million small cores
• 4 GPUs/node => 20,000 nodes (x40 TSUBAME3, 500~600 racks)
• Scalable high-dimensional torus or hypercube topology (TSUBAME3: full fat tree)
• x3 power efficiency 50GF/W (x1.9 via process, x1.6 via arch) (TSUBAME3.0
14.1GF/W)
• 1 Exa DFP Peak, ~600 PF Linpack, 12MW Facility Power
• So the DARPA Exascale report projection turned out to be fairly accurate
• But is just getting FLOPS all that valuable?
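(A quick sanity check of these projections, my arithmetic from the figures above: 20,000 nodes x 4 GPUs = 80,000 chips, and 80,000 x 12.5 Teraflops = 1 Exaflop peak; at ~600 PF Linpack and 50 GFlops/W the Linpack power is roughly 600x10^15 / 50x10^9 = 12 MW, consistent with the stated facility power.)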
[Timeline figure, DARPA ExaScale Report: an Exaflop projected around 2018-2020 using lightweight CPU cores and 10^8~10^9-way parallelism; Exascale Applications Workshop (2008-2009); end of Dennard scaling.]
ENERGY EFFICIENCY
50GFlops/W in late 2020?
[Plot: maximum Linpack power efficiency (GFlops/kW, 0-16,000) of leading systems vs. the TOP500 average, 2008-2017: Cell, BlueGene/Q, AMD FirePro, NVIDIA K20x-K80, MIC, TSUBAME-KFC, ZettaScaler-1.6, NVIDIA Pascal P100 DGX SaturnV, and TSUBAME3.0 at 14.1 GFlops/W.]
Precision requirements: DFP 64-bit (simulation), SFP 32-bit (computer graphics, gaming), HFP 16-bit (big data, machine learning / AI)
[Plot: NVIDIA Pascal P100 DGEMM performance (GFLOPS, 0-16000) vs. matrix dimension (m=n=k, 0-4500) for P100-fp16, P100 and K40.]

Tokyo Tech GSIC leads Japan in aggregated AI-capable FLOPS: TSUBAME3.0 + 2.5 + KFC provide 65.8 Petaflops (~6700 GPUs + ~4000 CPUs) among all supercomputers and clouds.
[Chart: site comparison of AI-FP performance (PFLOPS, 0-70): Tokyo Tech TSUBAME3.0 / T2.5 / T-KFC, U-Tokyo Oakforest-PACS (JCAHPC), Reedbush (U&H), Riken K.]


Tremendous Recent Rise in Interest by the Japanese
Government on Big Data, DL, AI, and IoT
• Three national centers on Big Data and AI launched
by three competing Ministries for FY 2016 (Apr 2015-)
– METI – AIRC (Artificial Intelligence Research Center): AIST (AIST
internal budget + > $200 million FY 2017), April 2015
• Broad AI/BD/IoT, industry focus
– MEXT – AIP (Artificial Intelligence Platform): Riken and other
institutions ($~50 mil), April 2016 Vice Minsiter
• A separate Post-K related AI funding as well. Tsuchiya@MEXT
• Narrowly focused on DNN Annoucing AIP
estabishment
– MOST – Universal Communication Lab: NICT ($50~55 mil)
• Brain –related AI
– $1 billion commitment on inter-ministry AI research over
10 years => Supplanting HPC activities? 1
2015- AI Research Center (AIRC), AIST; now > 400 FTEs
Director: Jun-ichi Tsujii; Matsuoka: joint appointment as “Designated” Fellow since July 2017
Effective cycles among research and deployment of AI:
[Diagram: deployment of AI in real businesses and society (health care, elderly care, network services, communication, security, industrial robots, retailing, automobile, manufacturing, big sciences, bio-medical sciences, material sciences) across institutions, companies and start-ups, via technology transfer, joint research, standard tasks and data, and starting enterprises, around a Common AI Platform with planning/business teams, common modules and common data/models (NLP/NLU, text mining, behavior prediction, mining & modeling, planning, recommendation, control, image recognition, 3D object recognition); the AI research framework combines brain-inspired AI (models of basal ganglia, hippocampus, cerebral cortex) with data-knowledge integration AI (ontology, knowledge, logic & probabilistic modeling, Bayesian nets).]
Core Center of AI for Industry-Academia Co-operation
Joint lab (Director: Satoshi Matsuoka) established Feb. 2017 between the National Institute of Advanced Industrial Science and Technology (AIST) and Tokyo Institute of Technology / GSIC to pursue BD/AI joint research using large-scale HPC BD/AI infrastructure.
[Diagram: the AIST Artificial Intelligence Research Center (AIRC, under METI), with application areas such as natural language processing, robotics and security, provides AI / Big Data resources and acceleration of AI / Big Data systems research on TSUBAME 3.0/2.5; Tokyo Tech ITCS departments contribute basic research in Big Data / AI; joint research covers AI / Big Data applications and industrial collaboration in data, algorithms and methodologies, together with other Big Data / AI research organizations and proposals; related programs include ABCI (AI Bridging Cloud Infrastructure), JST BigData CREST, JST AI CREST, etc.]
METI AIST-AIRC ABCI as the world’s first large-scale OPEN AI Infrastructure
• ABCI: AI Bridging Cloud Infrastructure
• Top-level SC compute & data capability for DNN (550 AI-Petaflops)
• Contract won by Fujitsu
• Open public & dedicated infrastructure for AI & Big Data algorithms, software and applications – OPEN SOURCING the AI DATACENTER
• Platform to accelerate joint academic-industry R&D for AI in Japan
• > 550 AI-Petaflops, < 3 MW power, < 1.1 avg. PUE, operational ~2018H1
• ABCI datacenter at the Univ. Tokyo Kashiwa Campus: state of the art & cheap & ultra-high efficiency
The “Real” ABCI – 2018Q1
• Extreme computing power
– w/ >130 AI-PFlops for AI/ML especially DNN
– x1 million speedup over high-end PC: 1 Day training for 3000-Year DNN
training job
– TSUBAME-KFC (1.4 AI-Pflops) x 90 users (T2 avg)
• Big Data and HPC converged modern design
– For advanced data analytics (Big Data) and scientific simulation (HPC), etc.
– Leverage Tokyo Tech's “TSUBAME3” design, but with differences/enhancements, being AI/BD centric
• Ultra high BW & Low latency memory, network, and storage
– For accelerating various AI/BD workloads
– Data-centric architecture, optimizes data movement
• Big Data/AI and HPC SW Stack Convergence
– Incl. results from JST-CREST EBD
– Wide contributions from the PC Cluster community desirable.
• Ultra-Green (PUE<1.1), High Thermal (60KW) Rack
– Custom, warehouse-like IDC building and internal pods
– Final “commoditization” of HPC technologies into Clouds
ABCI Cloud Infrastructure
• Ultra-dense IDC design from the ground up
  – Custom inexpensive lightweight “warehouse” building w/ substantial earthquake tolerance (ABCI AI-IDC CG image)
  – x20 thermal density of standard IDC
• Extreme green
  – Ambient warm liquid cooling, large Li-ion battery storage, and high-efficiency power supplies, etc.
  – Commoditizing supercomputer cooling technologies to Clouds (60 KW/rack)
• Cloud ecosystem
  – Wide-ranging Big Data and HPC standard software stacks (reference image; source: NEC case study)
• Advanced cloud-based operation
  – Incl. dynamic deployment, container-based virtualized provisioning, multitenant partitioning, and automatic failure recovery, etc.
  – Joining the HPC and Cloud software stacks for real
• Final piece in the commoditization of HPC (into IDC)
• Open sourcing of next-gen IDC architecture for AI
Comparing TSUBAME3/ABCI to Classical IDC
AI IDC CAPEX/OPEX acceleration by > x100 (performance > 400~600, power efficiency > 200~300)
Traditional Xeon IDC: ~10 KW/rack, PUE 1.5~2; 15~20 1U Xeon servers/rack; 2 Tera AI-FLOPS (SFP) / server; 30~40 Tera AI-FLOPS / rack
TSUBAME3 (+Volta) & ABCI IDC: ~60 KW/rack, PUE 1.0x; ~36 T3-evolution servers/rack; ~500 Tera AI-FLOPS (HFP) / server; ~17 Peta AI-FLOPS / rack
ABCI Procurement Benchmarks
• Big Data benchmarks
  – (SPEC CPU Rate)
  – Graph 500
  – MinuteSort
  – Node-local storage I/O
  – Parallel FS I/O
• AI/ML benchmarks
  – Low-precision GEMM
    • CNN kernel, defines “AI-Flops”
  – Single-node CNN
    • AlexNet and GoogLeNet, ILSVRC2012 dataset
  – Multi-node scalable CNN
    • Caffe + MPI
  – Large-memory CNN
    • Convnet on Chainer
  – RNN / LSTM
    • Neural Machine Translation on Torch
No traditional HPC simulation benchmarks except SPEC CPU. Plan on “open-sourcing”.
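To illustrate what a low-precision GEMM benchmark measures, here is a minimal sketch of my own (not the actual ABCI benchmark code) timing C = A x B at several precisions with NumPy; note that real “AI-Flops” numbers are measured on GPU half-precision hardware (e.g. cuBLAS/Tensor Cores), whereas NumPy emulates fp16 on the CPU and will be slow:

    import time
    import numpy as np

    def gemm_rate(dtype, n=512, reps=3):
        # Time C = A @ B for n x n matrices and report GFLOP/s (2*n^3 flops).
        a = np.random.rand(n, n).astype(dtype)
        b = np.random.rand(n, n).astype(dtype)
        t0 = time.time()
        for _ in range(reps):
            a @ b
        dt = (time.time() - t0) / reps
        return 2 * n**3 / dt / 1e9

    for dt in (np.float64, np.float32, np.float16):
        print(dt.__name__, f"{gemm_rate(dt):.1f} GFLOP/s")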
Cutting Edge Research AI Infrastructures in Japan – Accelerating BD/AI with HPC (and my effort to design & build them)
• Oct. 2015, in production: TSUBAME-KFC/DL (Tokyo Tech./NEC), 1.4 AI-PF (Petaflops)
• Mar. 2017, in production: AIST AI Cloud (AIST-AIRC/NEC), 8.2 AI-PF (x5.8)
• Mar. 2017, in production: AI Supercomputer (Riken AIP/Fujitsu), 4.1 AI-PF
• Aug. 2017, in production: TSUBAME3.0 (Tokyo Tech./HPE), 47.2 AI-PF (65.8 AI-PF w/TSUBAME2.5) (x5.8)
• 1H 2018, in construction: ABCI (AIST-AIRC), 550 AI-PF (x11.7), IDC under construction
• 1H 2019?, undergoing engineering study: “ExaAI”, ~2~3 AI-ExaFlops (x4~6?); also Post-K, multi AI-Exaflops
R&D investment into world-leading AI/BD HW & SW & algorithms and their co-design for cutting-edge infrastructure is absolutely necessary (just as with Japan’s Post-K and the US ECP in HPC).
What is worse: Moore’s Law will end in the 2020’s
• Much of underlying IT performance growth is due to Moore’s law
  • “LSI: x2 transistors in 1~1.5 years”
  • Causing qualitative “leaps” in IT and societal innovations
  • The main reason we have supercomputers and Google...
• But this is slowing down & ending, by the mid 2020s…!!!
  • End of lithography shrinks
  • End of Dennard scaling
  • End of fab economics
  (“The curse of constant transistor power shall soon be upon us” – photo: Gordon Moore)
• How do we sustain “performance growth” beyond the “end of Moore”?
  • Not just one-time speed bumps
  • Will affect all aspects of IT, including BD/AI/ML/IoT, not just HPC
  • End of IT as we know it
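(To make the stakes concrete, my arithmetic: doubling transistors every ~1.5 years compounds to 2^(10/1.5) ≈ 100x more transistors per decade at roughly constant power; once that compounding stops, the same silicon and power budget delivers flat performance unless something else improves.)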
20-year Eras towards the End of Moore’s Law
• 1980s~2004: the 20-year Moore-Dennard single-core ILP-vector “killer micro” era; Dennard scaling: perf+ = single-thread+ = transistor+ & freq+ = power+
• 2004~2015: the 20-year post-Dennard many-core era; perf+ = transistor+ = core#+, constant power
• 2015~2025: all of the above gets harder
• 2025~: the next-gen post-Moore era; 3-5 nm and beyond, constant feature size & power = flat performance (“constant transistor power”)
Need to realize the next 20-year era of supercomputing.
The “curse of constant transistor power”
- Ignorance of this is like ignoring global warming -
• Systems people have been telling the algorithm people that
“FLOPS will be free, bandwidth is important, so devise
algorithms under that assumption”
• This will certainly be true until exascale in 2020…
• But when Moore’s Law ends in 2025-2030, constant transistor
power (esp. for logic) = FLOPS will no longer be free!
• So algorithms that simply increase arithmetic intensity will no
longer scale beyond that point
• Like countering global warming – need disruptive change in
computing – in HW-SW-Alg-Apps etc. for the next 20 year era
Many-core was a good step, but we already used it once and cannot use it again for boosting. Need another leap!
[Plot: power efficiency (MFlops/Watt, log scale) per machine generation, 2006H1-2016H1, measured for the 2011 Gordon Bell Award dendritic solidification app: TSUBAME1.0 (Opteron CPU, #7 Top500), TSUBAME1.2 (S1070 GPU), a test server (K10 GPU), TSUBAME2.0 (M2050 GPU, #4 Top500), TSUBAME-KFC (K20X GPU, immersion cooling etc.), and the 2017~2018 TSUBAME3 estimate (~x2 improvement over TSUBAME2.5 with the same GPU); the data points span 13 to 15,750 MFlops/W, i.e. x1,210 in 10 years, with a post-Moore flattening / crossover expected thereafter.]
Measured for the 2011 Gordon Bell Award Dendritic Solidification App
Flop/s/W = total #Flops / J, where J = energy to solution for the same problem
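(A worked example with made-up run sizes: a job that executes 10^18 flops on a 14.1 GFlops/W system consumes about 10^18 / 14.1x10^9 ≈ 7.1x10^7 J ≈ 20 kWh; at one tenth the efficiency the same job needs ~200 kWh, so the Flops/W ratio directly measures energy to solution for a fixed problem.)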
HPCG Top 10 ranking, June 2017 (# / Top500 rank / site / computer / country: HPCG [Pflop/s] / HPL Rmax [Pflop/s] / HPCG as % of peak / HPCG as % of HPL)
1 (8) RIKEN Advanced Institute for Computational Science, K Computer (Fujitsu SPARC64 VIIIfx 2.0GHz, Tofu interconnect), Japan: 0.6027 / 10.5 / 5.3% / 5.7%
2 (2) National University of Defense Technology, Tianhe-2 (NUDT TH-IVB-FEP, Xeon 12C 2.2GHz, Intel Xeon Phi), China: 0.5801 / 33.9 / 1.1% / 1.7%
3 (3) Swiss National Supercomputing Centre (CSCS), Piz Daint (Cray XC50, Xeon E5 12C 2.6GHz, Aries, NVIDIA Tesla P100), Switzerland: 0.4700 / 19.6 / 1.9% / 2.4%
4 (7) JCAHPC (Joint Center for Advanced HPC), Oakforest-PACS (Fujitsu PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Omni-Path), Japan: 0.3855 / 13.6 / 1.5% / 2.8%
5 (1) National Supercomputing Center in Wuxi, Sunway TaihuLight (NRCPC Sunway SW26010 260C 1.45GHz), China: 0.3712 / 93.0 / 0.3% / 0.4%
6 (6) Lawrence Berkeley National Laboratory, Cori (Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries), USA: 0.3554 / 14.0 / 1.3% / 2.5%
7 (5) Lawrence Livermore National Laboratory, Sequoia (IBM BlueGene/Q, Power BQC 16C 1.6GHz, custom), USA: 0.3304 / 17.2 / 1.6% / 1.9%
8 (4) Oak Ridge National Laboratory, Titan (Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x), USA: 0.3223 / 17.6 / 1.2% / 1.8%
9 (10) Los Alamos NL / Sandia NL, Trinity (Cray XC40, Xeon E5 16C 2.3GHz, Aries), USA: 0.1826 / 8.10 / 1.6% / 2.3%
10 (15) NASA Ames Research Center / NAS, Pleiades (HPE SGI ICE X), USA: 0.1750 / 5.95 / 2.5% / 2.9%
Sparse BYTES: The Graph500 – 2015~2016 – world #1 x 4
K Computer #1: Tokyo Tech [Matsuoka EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu
K Computer: 88,000 nodes, 660,000 CPU cores, 1.3 Petabytes memory, 20 GB/s Tofu network; #1 at 38621.4 GTEPS (#7 on Top500 at 10.51 PF); ~73% of total execution time is spent waiting in communication; effective performance x13 c.f. the Linpack ranking
[Chart: elapsed time (ms) split into computation vs. communication at 64 nodes (scale 30) and 65536 nodes (scale 40).]
K Computer Graph500 history (list / rank / GTEPS / implementation):
  November 2013: rank 4, 5524.12, top-down only algorithm
  June 2014: rank 1, 17977.05, efficient hybrid
  November 2014: rank 2, 19585.2, efficient hybrid
  June & Nov 2015, June & Nov 2016: rank 1, 38621.4, hybrid + node compression
A BYTES-rich machine + a superior implementation wins: c.f. LLNL-IBM Sequoia (1.6 million CPUs, 1.6 Petabytes memory) at #3 with 23751 GTEPS (#4 17.17 PF Top500) and Sunway TaihuLight (10 million CPUs, 1.3 Petabytes memory) at #2 with 23755.7 GTEPS (#1 93.01 PF Top500)
BYTES, not FLOPS! The K computer is “still the best” for bandwidth-bound (data-centric) workloads. (It’s the bandwidth!)
And TSUBAME3, too.
[Fujitsu table, K computer results by year 2011-2017: 1. TOP500 list: 2, 4, 4, 4, 7, 8; 2. HPCG: 2; 3. Gordon Bell Prize finalists; 3. HPC Challenge Awards (HPL, Random Access, STREAM, FFT); 4. Graph500: 4, 2; 5. Green500 (TSUBAME): 2, 3. All Rights Reserved, Copyright 2017 FUJITSU LIMITED]
Japan Flagship 2020 “Post-K” Supercomputer
✓ CPU
  • ARM v8 + 512-bit SVE extensions
  • Multi-hundred petaflops peak total
  • Power Knob feature for saving power
✓ Memory
  • 3-D stacked DRAM, Terabyte/s BW
✓ Interconnect
  • TOFU3 CPU-integrated 6-D torus network
  • I/O acceleration
• 30 MW+ power
• Being designed and will be manufactured by Fujitsu
• Development leaders: Yutaka Ishikawa, Mitsuhisa Sato (Riken)
(Photo: Prime Minister Abe visiting the K Computer, 2013. Diagram: compute nodes and interconnect with I/O network, maintenance servers, portal servers, login servers, and a hierarchical storage system.)
Post-K Instruction Set Architecture
• ARM v8 HPC extension
  • Fujitsu is a lead partner of ARM HPC extension development
  • Detailed features of the SVE vector extension ISA announced at Hot Chips 28 (2016): http://www.hotchips.org/program/ , Mon 8/22, Day 1, 9:45AM, “GPUs & HPCs”, “ARMv8-A Next Generation Vector Architecture for HPC”
• Fujitsu’s inheritances
  • FMA
  • Math acceleration primitives
  • Inter-core barrier
  • Sector cache
  • Hardware prefetch assist
• Explicit govt. committee recommendation to explore HW component collaboration
(ISC'16, June 21, 2016)
Post-K will be/have
• Continuum of K (^-^)
• Pretty good DFP FLOPS (^-^)
• Very good low-precision FLOPS (^◇^)
• Awesome memory bandwidth \(◎o◎)/!
• (but) relatively low memory capacity (-_-;)
• Awesome network injection bandwidth \(◎o◎)/
• Very good network bisection bandwidth (^◇^)
• (but) modest I/O speed (-_-;)
• ARM ecosystem – 99.9% of codes work \(◎o◎)/
(2017/03/09 RIKEN AICS)
TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC
• Intra-node GPU via NVLink: 20~40 GB/s
• Terabit-class network per node: 800 Gbps (400+400), full bisection
• Inter-node GPU via Omni-Path: 12.5 GB/s, fully switched
• Memory hierarchy per node: HBM2 64 GB @ 2.5 TB/s; DDR4 256 GB @ 150 GB/s; Intel Optane 1.5 TB @ 12 GB/s (planned); NVMe flash 2 TB @ 3 GB/s; 16 GB/s PCIe, fully switched
• Any “Big” data in the system can be moved anywhere at RDMA speeds, 12.5 GBytes/s minimum, also with stream processing; scalable to all 2160 GPUs, not just 8
~4 Terabytes/node of hierarchical memory for Big Data / AI (c.f. K computer: 16 GB/node)
=> Over 2 Petabytes in TSUBAME3, which can be moved at 54 Terabytes/s, i.e. 1.7 Zettabytes / year
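(Checking the aggregates, my arithmetic: 540 nodes x ~4 TB/node ≈ 2.2 Petabytes of hierarchical memory, and 54 TB/s x ~31.5 million seconds/year ≈ 1.7x10^21 bytes ≈ 1.7 Zettabytes per year.)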

Fujitsu SPARC64™ XIfx (2015)
[Chip diagram: 34 cores in two core-memory groups (CMG0/CMG1); 256-bit SIMD performance cores with FMA x2, 1.1 TFLOPS; L1 caches at 4.4 TB/s aggregate (140.8 GB/s in or 70.4 GB/s out per core), L2 caches at 2.2 TB/s (140 GB/s read per CMG); HMC memory at 240 GB/s x2 (in/out), 120 GB/s per HMC group; Tofu2 controller with 12.5 GB/s x 10 ports, 125 GB/s x2 (in/out).]
Memory BW : injection BW = 2:1 (TSUBAME3: 40:1)
Many-Core Era vs. the Post-Moore Cambrian Era (~2025: the “M-P extinction event”)
Many-Core Era:
  • Flops-centric monolithic algorithms and apps
  • Flops-centric monolithic system software
  • Hardware/software system APIs
  • Flops-centric massively parallel architecture: homogeneous general-purpose nodes (general-purpose CPUs) with compute + localized data, loosely coupled with electronic interconnect
  • Transistor lithography scaling (CMOS logic circuits, DRAM/SRAM)
Post-Moore Cambrian Era:
  • Cambrian heterogeneous algorithms and apps
  • Cambrian heterogeneous system software
  • Hardware/software system APIs
  • “Cambrian” heterogeneous architecture: heterogeneous CPUs + holistic data; massive-BW 3-D packaging, reconfigurable dataflow, optical computing, DNN & neuromorphic, non-volatile memory, low-precision / error-prone computing, quantum computing; ultra tightly coupled w/aggressive 3-D + photonic switching interconnect
  • Novel devices + CMOS (dark silicon): nanophotonics, non-volatile devices etc.
Post-Moore is NOT a More-Moore device as a panacea: device & architecture advances improve data-related parameters over time. “Rebooting Computing” in terms of devices, architectures, software, algorithms, and applications is necessary => co-design becomes even more important, c.f. Exascale.
[Layered diagram spanning memory, communication and computation:
 - Application domains: multi-physics simulation, manufacturing, fusion/plasma, EMF analysis, medical imaging, massive data assimilation
 - Post-Moore performance models and auto-tuning
 - Post-Moore computational science libraries: couplers, BW-reducing algorithms, low-rank approximation, parallel space-and-time, out-of-core algorithms, data assimilation
 - Post-Moore data science and AI libraries: high-B/F algorithms, machine-learning-based acceleration, uncertainty quantification
 - Post-Moore programming models: data-oriented scheduling, latency hiding, high-level synthesis compilers, accelerator-specific compilers
 - Post-Moore high-bandwidth hierarchical memory model: data-movement runtime, fault tolerance, hierarchical data abstractions, programmable logic, accelerator “binaries”
 - Data & custom compute centric platform: silicon photonics WDM interconnect, new memory devices (PC-RAM, ReRAM, STT-MRAM), photonic switching, photonic interposers, photonic compute devices, optical packet switching, brain-inspired computing, quantum computing, low-precision / probabilistic / neuromorphic / neural-network computing, Ising annealing, low-reliability communication and computing, 3D architecture, near-threshold computing, building-block “gluable” architecture, next-gen VIAs & silicon fabrication, inductive TCI, data memoization, customizable logic, tungsten VIAs and 3D silicon]
Portfolio of Accelerators – What are they good for – (1) General purpose
• Vector (SIMD) HPC accelerators => general-purpose HPC, esp. SIMD
  • GPUs, Xeon Phi, Shenwei SW26010, Post-K ARM-SVE
• Macro dataflow processors => asynchronous threads, functional programming etc.
  • ETL EM-4, Wave Computing, Celebrus…
• High memory bandwidth accelerators (FLOPS increase = Moore’s law ending) => memory capacity and BW increase
  • NEC SX-Aurora vector processor, BYTES/FLOPS ~= 1
  • Future 3-D die-stacked architectures, NVM including NV-DIMM
• FPGAs – programmable via HLS languages, e.g. OpenCL
  • Intel Stratix 10, Xilinx Virtex
• Superconducting accelerators
  • Massive single-thread performance to accelerate serial bottlenecks
Portfolio of Accelerators – What are they good for – (2) ML / AI
• DNN accelerators – accelerating tensor operations => many ML apps
  • Small matrix-tensor engines (NVIDIA Volta TensorCore, Intel Lake Crest?)
  • Systolic arrays (Google TPU2)
  • Small vector processor arrays (Fujitsu DLU)
• Neuromorphic accelerators – spiking neural networks => brain simulation, more power-efficient ML?
  • IBM TrueNorth
  • Manchester U SpiNNaker
  • Heidelberg U BrainScaleS
  • Many others (DoE, U-Tokyo – NEC, Singapore, …)
• Symbolic computing accelerators => string searches, e.g. genomics?
  • Micron Automata Processor
Portfolio of Accelerators – What are they good for – (3) Quantum and Pseudo-quantum
• Quantum annealers (theory invented at Tokyo Tech.) => ML?
  • D-Wave
  • Others in the lab
• Pseudo-quantum (CMOS) annealers => ML?
  • Hitachi Ising chip (ISSCC 2015)
  • Fujitsu (pseudo)quantum annealing chip
• Quantum gate processors => quantum simulation, cryptography
  • Much ongoing work
I/O and Data accelerators – What are they
good for (4) I/O
• Cray MTA – graph operations
• Burst Buffers – I/O intensive ops e.g. checkpoints
• Database accelerators – classic
Problem-Specific Architectures to exploit dark silicon
“What are they good for?” – c.f. the Berkeley Dwarfs
• Deep neural network accelerators (many, incl. Google)
• Spiking neuromorphic architectures (Manchester SpiNNaker, IBM TrueNorth, Heidelberg BrainScaleS)
• Ising-model optimization architecture (Hitachi)
• Automata Processor (Micron)
• Advanced FPGAs (Altera, Xilinx)
• Network & I/O accelerator (Mellanox)
• …
• And of course quantum annealing and computing (D-Wave)
Fujitsu Deep Learning Processor (DLU™, Deep Learning Unit), FY2018~
DLU™ features (leveraging Supercomputer K technologies):
• Architecture designed for deep learning
• High-performance HBM2 memory
• Low-power design
  => Goal: 10x performance/Watt compared to others
• Massively parallel: applies supercomputer interconnect technology
  => Ability to handle large-scale neural networks
  => TOFU network derivative for massive scaling
Designed for scalable learning, technically superior to Google TPU2; “Exascale” AI possible in 1H2019
(All Rights Reserved, Copyright 2017 FUJITSU LIMITED)
Neuromorphic Architectures (not to be confused with DNN accelerators)
• IBM TrueNorth
• Manchester SpiNNaker (ARM based)
Hitachi @ ISSCC2015: “An 1800-Times-Higher Power-Efficient 20k-spin Ising Chip for Combinational Optimization Problem with CMOS Annealing”
• Competitive with quantum annealing, room temperature, easy to scale
• Could be applicable to deep learning?
Tokyo Tech. Work on FPGAs in Post-Moore [Artur Podobas, Hamid Zohouri, Satoshi Matsuoka]
• Moore’s law is ending
  - Silicon area will become scarce
  - Cannot afford generality (general-purpose architectures)
• Expecting a “Cambrian Explosion” in computer architecture
  - Many niche architectures for specific purposes
    • e.g. quantum computing, neuromorphic computing, specialized ASICs, DSPs etc.
• Field-Programmable Gate Arrays (FPGAs)
  - Devices hosting a sea of logic and interconnect
    • Logic includes Look-Up Tables (LUTs), on-chip RAM (Block RAM) and Digital Signal Processing blocks (DSPs)
    • The programmer is responsible for programming and connecting the logic to specify device behavior
• FPGAs are Post-Moore friendly
  - The silicon area of FPGAs is malleable – dynamically reconfigurable
  - Better (more diverse) use of chip logic
Using FPGAs for High Performance Computing
“Evaluating High-Level Design Strategies on FPGAs for High-Performance Computing”, A. Podobas, H.R. Zohouri, N. Maruyama, S. Matsuoka, IEEE FPL 2017
• Motivation
  - FPGAs are Post-Moore friendly
    • FPGAs allow dynamic reconfiguration of silicon
    • “Tune” the architecture towards application needs
  - FPGAs are notoriously hard to program (they require hardware expertise)
    • High-level programming approaches are attractive, but their performance is unknown
• Method
  - Evaluated three high-level programming approaches for FPGAs
    • 30-core many-core system (represents: programmability)
    • LegUp High-Level Synthesis (represents: multiple custom accelerators)
    • Intel OpenCL for FPGA (represents: deep-pipeline designs)
  - Improvements:
    • We improved the memory hierarchy for the many-core and the multi-accelerator designs through cache multi-banking
  - Evaluated on the Rodinia benchmark suite using the Stratix V FPGA
Using FPGAs for High Performance Computing (Cont.)
• Results
  - Intel FPGA SDK for OpenCL achieves the highest average performance
  - LegUp can remain competitive for applications with high compute and good spatial/temporal locality
  - The many-core approach offers good programmability (OpenMP) but relatively low performance
Optimizing Kernels for High-Performance Computing with FPGAs
“Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs”, H.R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, S. Matsuoka, ACM Supercomputing 16
• Motivation
  - OpenCL for FPGAs is a promising High-Level Synthesis tool to leverage FPGA technology without hardware expertise
  - What performance can we expect from FPGAs compared to the more general-purpose CPUs and GPUs?
• Method
  - Six benchmarks from the Rodinia benchmark suite
    • Optimized using advanced FPGA optimizations: minimizing initiation intervals, inferring shift registers and sliding windows
  - Two FPGA devices: Intel Stratix V (A7), Intel Arria 10 GX1150
  - Power and performance compared against: Intel Xeon E5-2650v3, NVIDIA 980Ti and K20c
• Results
  - We found our FPGAs, on average, to be faster than the CPU implementations, especially the newer Arria 10 GX1150 FPGA
  - Our FPGA implementations were up to 3.4x more power efficient than the K20c
  - But the performance of the GPUs could not be reached
Maximizing Performance of FPGAs in Stencil Computation
“Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL”, H.R. Zohouri, A. Podobas, S. Matsuoka, IEEE FPGA 2018 (to appear)
• Motivation
  - Stencil computation is one of the more important computation patterns in HPC
    • Differential equations
    • Weather, seismic and fluid simulation
    • Convolutional neural networks
  - Hypothesis: by exploiting the unique architecture of FPGAs, they can achieve performance comparable to GPUs in stencil computation
    • This is despite their much lower memory bandwidth and compute performance compared to GPUs
[Diagram: stencil accelerator reading from and writing to DDR memory through a deep pipeline of processing elements PE0 … PEn-1.]
Maximizing Performance of FPGAs in Stencil Computation (Cont.)
• Implementation
  - Combines spatial and temporal blocking (a software sketch of the idea follows below)
    • Previous work avoids spatial blocking and puts a hard limit on the input size
  - Shift-register based spatial blocking
    • Minimizes local memory usage
  - Streaming temporal blocking
    • Efficiently realized using autorun kernels and on-chip channels
  - Deep-pipelined design
    • Avoids thread divergence and the need for warp specialization
[Diagrams: spatial blocks within the input size, each compute block containing valid compute, redundant compute (halo) and out-of-bound regions; shift-register mapping of the N/W/C/E/S stencil points between reads and the write; halo growth of the temporal blocks over time.]
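The software sketch referenced above (my own Python/NumPy illustration, not the paper's OpenCL design): a 1-D 3-point stencil processed in spatial tiles, where each tile carries a halo of width t_block that is recomputed redundantly so that t_block time steps can be applied before writing the tile back:

    import numpy as np

    def stencil_step(u):
        # One 3-point averaging step; boundary values stay fixed.
        v = u.copy()
        v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
        return v

    def blocked(u, steps, tile, t_block):
        # Spatial + temporal blocking with redundant halo computation.
        n = len(u)
        for t0 in range(0, steps, t_block):
            tb = min(t_block, steps - t0)
            out = u.copy()
            for lo in range(0, n, tile):
                hi = min(lo + tile, n)
                a, b = max(0, lo - tb), min(n, hi + tb)    # tile + halo of width tb
                block = u[a:b].copy()
                for _ in range(tb):                        # tb steps before write-back
                    block = stencil_step(block)
                out[lo:hi] = block[lo - a: lo - a + (hi - lo)]
            u = out
        return u

    u0 = np.random.rand(1000)
    ref = u0.copy()
    for _ in range(8):
        ref = stencil_step(ref)
    assert np.allclose(blocked(u0, steps=8, tile=128, t_block=4), ref)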
Maximizing Performance of FPGAs in Stencil Computation (Cont.)
• Methodology
  - Hotspot 2D and 3D
    • Two inputs and one output, 8/9 (2D) and 8/13 (3D) bytes per FLOP
  - Diffusion 2D and 3D
    • One input and one output, 12/15 (2D) and 12/17 (3D) bytes per FLOP
  - Intel Stratix V A7 and Arria 10 GX 1150
  - Four generations of highest-end NVIDIA GPUs
  - Estimation for the upcoming Intel Stratix 10 FPGAs
Results per stencil and device: performance (GBps / GFLOPS / GCell/s), fmax (MHz), logic, memory (bits/blocks), DSP, power (W)
  Diffusion 2D, Stratix V: 99.582 / 112.030 / 12.448, 302.48, 69%, 22%/52%, 95%, 29.845
  Diffusion 2D, Arria 10: 673.959 / 758.204 / 84.245, 343.76, 55%, 38%/83%, 95%, 72.530
  Hotspot 2D, Stratix V: 112.218 / 140.273 / 9.352, 269.97, 84%, 27%/61%, 64%, 33.361
  Hotspot 2D, Arria 10: 480.335 / 600.419 / 40.028, 326.58, 47%, 53%/94%, 95%, 52.411
  Diffusion 3D, Stratix V: 62.435 / 101.457 / 7.804, 301.02, 62%, 36%/67%, 91%, 21.135
  Diffusion 3D, Arria 10: 230.568 / 374.673 / 28.821, 286.61, 60%, 94%/100%, 89%, 71.628
  Hotspot 3D, Stratix V: 63.603 / 90.104 / 5.300, 246.18, 76%, 68%/100%, 100%, 36.126
  Hotspot 3D, Arria 10: 228.149 / 323.211 / 19.012, 296.20, 62%, 81%/100%, 96%, 73.398
Maximizing Performance of FPGAs in Stencil Computation (Cont.)

• Unlike GPUs, FPGAs can achieve higher computation throughput than their memory bandwidth

• Arria 10 achieves better performance than K40c, despite 8 times lower memory bandwidth
• Arria 10 achieves better power efficiency than 980 Ti, and close to P100 and V100

• Stratix 10 MX2100 will have better performance and power efficiency compared to next-generation GPUs
Neuromorphic Computing on FPGAs
“Designing and Accelerating Spiking Neural Networks using OpenCL for FPGAs”, A. Podobas, S. Matsuoka, IEEE FPT 2017 (to appear)
• Motivation
  - Neuromorphic computing is an emerging Post-Moore computing paradigm
  - Spiking Neural Networks (SNNs) are one instance of neuromorphic computing (see the toy sketch after this slide)
    • Information is conveyed temporally through events called “spikes”
    • Can be more power-efficient than traditional “rate-based” neural networks
    • Used in e.g. the IBM TrueNorth architecture
  - Can we leverage FPGAs to accelerate another Post-Moore computing paradigm, and what performance can we expect?
• Method
  - Created a custom FPGA accelerator for SNNs
    • Supports two vastly different but well-known neuron models
    • Leverages Python for simplicity
    • Exploits the timing characteristics (the “delays”) of the neural networks to increase instruction- and dataflow-level parallelism
    • Described using OpenCL – portable across FPGA devices
  - Compared to well-known simulators:
    • NEST for CPUs running on high-end Xeon processors and Xeon Phi accelerators
    • NeMo running on K20x and P100
    • Our design running on a Stratix V FPGA
  - Evaluated on a variety of networks with different spiking activity
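The toy sketch mentioned above, to illustrate what a spiking neuron model computes (a leaky integrate-and-fire example of my own in Python, not one of the paper's neuron models or its OpenCL implementation):

    import numpy as np

    def lif_simulate(spike_in, weights, tau=20.0, v_th=1.0, dt=1.0):
        # Toy leaky integrate-and-fire layer: membrane potentials decay with
        # time constant tau, integrate weighted input spikes, and emit a spike
        # (then reset) whenever they cross the threshold v_th.
        n_steps = spike_in.shape[0]
        v = np.zeros(weights.shape[1])
        out = np.zeros((n_steps, weights.shape[1]), dtype=np.uint8)
        decay = np.exp(-dt / tau)
        for t in range(n_steps):
            v = v * decay + spike_in[t] @ weights   # leak + integrate
            fired = v >= v_th
            out[t] = fired
            v[fired] = 0.0                          # reset after spiking
        return out

    rng = np.random.default_rng(0)
    inp = (rng.random((100, 32)) < 0.05).astype(float)   # sparse input spike train
    print("output spikes:", int(lif_simulate(inp, rng.random((32, 8)) * 0.3).sum()))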
Neuromorphic Computing on FPGAs (Cont.)
• Results
  - Our accelerator can reach up to 2.25 billion spikes/second
    • Despite our Stratix V being built on 5-year-old technology
  - Our performance surpasses that of NEST on recent multi-threaded CPUs
    • Including both 24 hyper-threaded Xeon cores as well as Xeon Phi
  - Initial results indicate that our accelerator rivals the performance of NeMo on GPUs
    • Still needs further investigation and comparison against other GPU frameworks
Future Outlook and Ongoing Work: Manipulate Precision and Numerical Format (e.g. IEEE vs. POSIT)
• Motivation
  - Explosion in numerical precision formats
    • IEEE 754 floating point (what we are used to)
    • Fixed-point arithmetic (what DSP builders are used to)
    • Intel FlexPOINT (new AI format, shared exponent bits inside tensors)
    • POSIT floating point (IEEE 754 replacement candidate)
  - Potentially great impact on performance/silicon
    • Precision is a tuneable knob in the Post-Moore workflow
  - Performance, power, area and precision trade-offs are unclear
    • Performance trade-offs in computer simulations
    • Accuracy trade-offs in the training of neural networks
    • Area/power trade-offs
• Leverage FPGAs to assist design-space exploration
  - Automatically generate custom-precision data-paths (chains of operators)
  - Integrate the data-path into existing infrastructure
    • Intel FPGA SDK for OpenCL
    • ASIP extensions to soft cores, e.g. RISC-V
  - Evaluate on existing applications
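A tiny illustration of precision as a tunable knob (my own NumPy example; it only covers IEEE widths, since Flexpoint and posit arithmetic need dedicated libraries or hardware):

    import numpy as np

    # Representation error of 1/3 in three IEEE widths: narrower formats trade
    # accuracy for silicon area, energy and bandwidth.
    for dtype in (np.float64, np.float32, np.float16):
        third = dtype(1.0) / dtype(3.0)
        err = abs(float(third) - 1.0 / 3.0)
        print(f"{np.dtype(dtype).name:8s} 1/3 = {float(third):.10f}  error = {err:.2e}")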
Performance growth via data-centric computing: “From FLOPS to BYTES”
• Identify the new parameter(s) for scaling over time
• Because data-related parameters (e.g. capacity and bandwidth) will still likely continue to grow towards the 2040s
• We can grow the transistor count for compute, but CANNOT use them AT THE SAME TIME (dark silicon) => multiple computing units specialized to the type of data
• Continued capacity growth: 3D stacking (esp. direct silicon layering) and low-power NVM (e.g. ReRAM)
• Continued BW growth: data movement energy will be capped constant by dense 3D design and advanced optics from silicon photonics technologies
• Almost back to the old “vector” days(?), but no free lunch – latency is still a problem, locality is still important; we need general algorithmic acceleration through data capacity and bandwidth, not FLOPS
Non-Volatile Memory and 3-D Stacking
• Many devices
• Various stacking technologies
• Results: massive capacity, extreme bandwidth, low power
• Exploits Z-direction locality
• New breed of “in-memory computing”
• Could persist as a trajectory for the next 20 years
[Figure notes: data movement energy could be reduced by orders of magnitude by 3D, as Z-direction movement is under 1 mm; capacity provided by dense NVM with a DRAM cache.]
(Original slide courtesy John Shalf @ LBNL)

When does data movement dominate?
[Figure: core energy/area estimates vs. the cost of moving data 20 mm across a chip.
 - Heavyweight core: ~12.25 mm2 (4.5 mm x 2.7 mm), 2.4 GHz, 2.5 W, ~651 pJ/op; moving data 20 mm costs ~0.2x a compute op; movement energy equals a compute op at ~108 mm.
 - Mid-size core: ~0.6 mm2 (1.2 mm x 0.5 mm), 1.3 GHz, 0.3 W (<0.2 W), ~150 (75) pJ/op; 20 mm movement ≈ 1.6x a compute op; equality at ~12 mm.
 - Lightweight core: ~0.046 mm2 (0.2 mm x 0.23 mm), 1.0 GHz, 0.025 W, ~22 pJ/op; 20 mm movement ≈ 5.5x a compute op; equality at ~3.6 mm.]
A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators
Yunsup Lee*, Andrew Waterman*, Rimas Avizienis*, Henry Cook*, Chen Sun*†, Vladimir Stojanović*†, Krste Asanović*
Email: {yunsup, waterman, rimas, hcook, vlada, krste}@eecs.berkeley.edu, sunchen@mit.edu
* Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
† Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

Abstract - A 64-bit dual-core RISC-V processor with vector accelerators has been fabricated in a 45 nm SOI process. This is the first dual-core processor to implement the open-source RISC-V ISA designed at the University of California, Berkeley. In a standard 40 nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARM's comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, we integrate a custom vector accelerator alongside each single-issue in-order scalar core. The vector accelerator is 1.8x more energy-efficient than the IBM Blue Gene/Q processor, and 2.6x more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves a maximum clock frequency of 1.3 GHz at 1.2 V and peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65 V with an area of 3 mm2.

[Fig. 1. Backside chip micrograph (taken with a removed silicon handle) and processor block diagram: two tiles, each pairing a Rocket scalar core (16K L1I$, 32K L1D$) with a Hwacha vector accelerator (8KB L1VI$, VRF), connected through arbiters to a coherence hub, a 1MB SRAM array, and an FPGA FSB/HTIF interface.]

I. INTRODUCTION
As we approach the end of conventional transistor scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Many proposed accelerators, such as those based on GPU architectures, require a drastic reworking of application software to make use of separate ISAs operating in memory spaces disjoint from the demand-paged virtual memory of the host CPU. RISC-V [1] is a new completely open general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley, which is designed to be flexible and extensible to better integrate new efficient accelerators close to the host cores. The open-source RISC-V software toolchain includes a GCC cross-compiler, an LLVM cross-compiler, a software ISA simulator, an ISA verification suite, a Linux port, and additional documentation, and is available at www.riscv.org.
In this paper, we present a 64-bit dual-core RISC-V processor with custom vector accelerators in a 45 nm SOI process. Our RISC-V scalar core achieves 1.72 DMIPS/MHz, outperforming ARM's Cortex-A5 score of 1.57 DMIPS/MHz by 10% in a smaller footprint. Our custom vector accelerator is 1.8x more energy-efficient than the IBM Blue Gene/Q processor and 2.6x more than the IBM Cell processor for double-precision floating-point operations, demonstrating that high efficiency can be obtained without sacrificing a unified demand-paged virtual memory environment.

II. CHIP ARCHITECTURE
Figure 1 shows the block diagram of the dual-core processor. Each core incorporates a 64-bit single-issue in-order Rocket scalar core, a 64-bit Hwacha vector accelerator, and their associated instruction and data caches, as described below.

A. Rocket Scalar Core
Rocket is a 6-stage single-issue in-order pipeline that executes the 64-bit scalar RISC-V ISA (see Figure 2). The scalar datapath is fully bypassed but carefully designed to minimize the impact of long clock-to-output delays of compiler-generated SRAMs in the caches. A 64-entry branch target buffer, 256-entry two-level branch predictor, and return address stack together mitigate the performance impact of control hazards. Rocket implements an MMU that supports page-based virtual memory and is able to boot modern operating systems, including Linux. Both caches are virtually indexed, physically tagged with parallel TLB lookups. The data cache is non-blocking, allowing the core to exploit memory-level parallelism. Rocket has an optional IEEE 754-2008-compliant FPU, which can execute single- and double-precision floating-point operations, including fused multiply-add (FMA), with hardware support for denormals and other exceptional values.

[Fig. 2. Rocket scalar plus Hwacha vector pipeline diagram (bypass paths omitted for simplicity).]
Example Innovation: Tungsten TSVs at 2 um ultra-fine pitch with die thinning, by Tezzaron Semiconductor
• Suppose 4 TF SFP @ 7nm, 16 TB/s internal chip BW vs. 200 GB/s external chip memory BW => 80 times speedup!
• High-density, high-signaling TSV challenge
  – Wide I/O 2: 1024 bits @ 1 GHz -> 2~3 GHz
  – We need 128,000 bits @ 1 GHz!
  – 10-micron TSV estimation: 400 x 400 TSVs on a 20 mm x 20 mm chip -> 50 micron spacing
  – With tungsten TSVs the chip area is negligible
• Many-layer stacking via aggressive wafer thinning and self-diagnostics
(Source: Tezzaron website http://www.tezzaron.com)
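(Checking those numbers, my arithmetic: 128,000 bits x 1 GHz = 1.28x10^14 bits/s = 16 TB/s of internal bandwidth, and 16 TB/s / 200 GB/s = 80, the claimed speedup over going off-chip.)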
DiRAM4 Stack Overview
(Tezzaron slides taken from
http://www.tezzaron.com/media/Tezzaron-
Presentation-EPS-100814-dist-.pptx)
• 64 Gb of Memory in 175 mm2
• 256 fully independent RAMs
• 16 Banks per RAM
• 64 bit Sep I/O Data per RAM
• 7ns Access Time (Closed page to data)
• 12ns tRC (Page Open to Page Open in a Bank)
• 16 Tb/s Data Bandwidth
• Competitive Manufacturing Cost

Tezzaron Semiconductor 10/08/2014 62


2.5/3D Circuits: IME A*STAR / Tezzaron Collaboration
[Cross-section diagram: a 3-layer 3D memory and a 2-layer processor plus FPGA (4X nm), stacked as levels #0-#4 using die-to-wafer Cu thermal diffusion bonds and μBumps, mounted via C4 bumps on an active silicon circuit board, which sits on an organic substrate with solder bumps.]
(Tezzaron slides taken from http://www.tezzaron.com/media/Tezzaron-Presentation-EPS-100814-dist-.pptx)
Tezzaron Semiconductor 10/08/2014


Stanford N3XT 3-D Chip Project (slide by Subhasish Mitra)
Super Building Block Architecture (Amano, Keio U): Holistic Control of Component Power
System view on “Post-Moore” architecture: not just a new device, but a focus on how devices are interconnected and integrated as a system, controlling their power, system SW and programming.
A hub architecture that employs inductive (3D) TCI and programmable FPGA+switch: hub chips connect special-purpose processors, FPGAs, new devices, NVM and standard processors over a circuit-switched optical network.
[Diagram (labels translated from Japanese): daughter-chip connection image with new-memory chips, FPGAs, a CPU for the OS, optical interconnect chips, reconfigurable switches, accelerators and new computing chips.]
[Test chips: a host CPU chip (MIPS CPU core, network IF, TEG, TCI Tx/Rx) stacked with three accelerator chips (8x8 PE array, µ-controller, network IF, TCI Rx/Tx), fabricated in 65 nm CMOS; microphotograph of the stacked test chips.]
Strawman BYTES-Oriented Post-Moore Architecture
[Diagram: two low-power CPUs, each carrying a stack of DRAM and NVM/Flash layers on a TSV interposer on the PCB, connected through an optical switch & launch pad.]
• Low-voltage & low-power CPU for direct stacking and large silicon area => 16 TB/s DRAM & NVM bandwidth, 5~10 Tbps network
• Domain-specific heterogeneous and customizable processor configurations, including PIM
• Extreme multi-layer DRAM & NVRAM stacking via high-density tungsten TSVs
• Direct WDM optics onto the interposer; direct chip-chip interconnect with DWDM optics
• A low-power processor allows direct 3D stacking; configurable low-power CPU
Interconnect Shortcomings (HPE)
– Current technology: $10 / Gbps and 50 pJ per bit, per link
  – 1 exaflops -> 10 PB/s injection bandwidth -> O($1B) and O(5 MW) (node links only)
– First stepping stone: mid-board optics (VCSELs)
  – Advanced development program at HPE
  – Cheaper, more efficient, can be water-cooled
– Exascale technology target: silicon photonics with ring resonators
  – 10 cents per Gbps, 5 pJ per bit
– Enabling enhanced topologies like HyperX will require new “widgetry”
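(Working out the current-technology numbers, my arithmetic: 10 PB/s of injection bandwidth is 8x10^16 bits/s; at $10 per Gbps that is 8x10^7 Gbps x $10 ≈ $800M, i.e. O($1B), and at 50 pJ/bit it draws 8x10^16 x 50x10^-12 ≈ 4 MW, i.e. O(5 MW), for the node links alone.)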


Optical Network Technology for Future Supercomputers & IDC (SC14 AIST Booth #2531)
● Large-scale silicon-photonics-based cluster switches
● DWDM, multi-level modulation, highly integrated “elastic” optical interconnects
● Ultra-low-energy network by making use of optical switches
[Diagram: datacenter server racks with 2.5D CPU cards (CPU/GPU, memory cube, DSP, comb source, Tx/Rx) connected by DWDM multi-level-modulation optical interconnects to silicon photonics optical cluster switches (~500 Pbps) vs. current electrical switches at ~1 Pbps.]
  > Ultra-compact switches based on silicon photonics
  > 3D integration by amorphous silicon
  > A new server architecture
Single-source wavelength supply from a wavelength bank (optical comb) through DEMUX, modulators and MUX onto fiber:
  1 wavelength, modulation order 1: 20 Gbps
  4 wavelengths, modulation order 8: 640 Gbps
  32 wavelengths, modulation order 8: 5.12 Tbps
Current state-of-the-art Tx: 100 Gbps; with silicon photonics integration: ~5.12 Tbps
Luxtera 2.5D Photonic Data Pump
• 2.5 pJ/bit power
• Bare-metal protocol
  – Ultra low latency
  – Protocol agnostic
• 8-core fiber
• 25 Gb SERDES or 3.125 Gb interface
• Self-calibrating, self-tuning
• > 1.6 Tb/s payload
(Tezzaron slides taken from http://www.tezzaron.com/media/Tezzaron-Presentation-EPS-100814-dist-.pptx)
Tezzaron Semiconductor 10/08/2014

32 x 32 Optical Circuit Switch (courtesy NTT/AIST): problem: heavy optical loss
Fast Optical Crossbar Switch (EECS, UC Berkeley)
Seok et al., “Large-scale broadband digital silicon photonic switches with vertical adiabatic couplers”, Optica 3(1), 2016
• Array-based 64x64 MEMS optical crossbar switch
• 3.7 dB on-chip insertion loss
• 0.91 microsecond switching time
• At 100,000 ports: a 9-hop network
  • 33 dB+ loss
  • 8.2 microsecond switching time => at 1 Tb/s, ~800 KByte of bandwidth x delay
Solution: Hybrid EO Network
• Idea1: use low (latency/diameter, bandwidth,
power) electrical network for low latency messages,
and use optical circuits for high bandwidth and
fixed topology messages
• Idea2: merge the electronic switch and optical
MEMS switch, and use the latter as the control
plane of the optical MEMs circuit
– Thus the electronic switches become the optical
speculative “buffer”
Hybrid Electro-Optical Network w/shortcuts
[Takizawa&Matsuoka LSPP07]
“Locality Aware MPI Communication on a Commodity Opto-Electronic Hybrid Network”

Low latency
small packets

Bulk
Transfer
NICT Optical Packet Switch Node (slides courtesy NICT)
• 4 x 4 OPS node with optical packet (OP) transponder
• 100 Gb/s OPS ports, 10 GbE x 10 client ports
• Stability: tolerance to environmental disturbance (polarization, power fluctuation)
• Total throughput: 800 Gb/s
• Total power consumption: 141 W (w/o transponder)
• 10-node hopping, 450 km fiber transmission
[Diagram: 100 Gbps optical packet transponder, burst-mode optical amplifiers, a multi-wavelength header processor, and a 4 x 4 EA switch (1U; switching speed < 8 ns; power consumption 3 W) connecting four 100 Gb/s OPS ports and a 10 Gb Ethernet client network. 100 Gb/s multi-wavelength optical packet format: preamble and header plus 10 x 10 Gb/s payloads carried across wavelengths λ1-λ10.]
Y. Muranaka et al., Photonics in Switching 2015; H. Furukawa et al., no. P.4.16, ECOC 2015.
HIDEAKI FURUKAWA, furukawa@nict.go.jp, September 6, 2016 © 2016 National Institute of Information and Communications Technology
Applications & Algorithms

Slides by Kengo Nakajima


Information Technology Center
The University of Tokyo

New Frontiers of Computer & Computational Science


towards Post Moore Era
December 22, 2015, Takeda Auditorium, The University of Tokyo

Assumptions & Expectations


towards Post-Moore Era
• Higher Bandwidth, Larger & Heterogeneous Latency
– Memory: 3D Stacked Memory
– Network: Optical Communication
– Both of Memory & Network will be more hierarchical
• Larger Size of Memory & Cache
• Transaction/Transactional Memory

• Application-Customized Hardware, FPGA


• Large Number of Nodes/Number of Cores per Node
– under certain constraints (e.g. power, space …)

Applications & Algorithms in Post-


Moore Era (1/2)
• Compute Intensity ⇒ Data Movement Intensity
– Non-Blocking Method, Out-of-Core Algorithm
• Implicit scheme strikes back !
– I believe it was never defeated
– Improvement of performance on sparse matrix computations
– Big change and advancement are expected in all research
areas related to algorithms for sparse matrices including
preconditioning
– Everything might be easier… but don’t relax too much!

– Other Compute to Data Algorithms: H-Matrices


Highly-Scalable Atmospheric Simulation Framework (ACM Gordon Bell Prize 2016)
(Slides courtesy Haohuan Fu)
[Weak-scaling plot: SYPD (simulated years per day, 0.00125-0.16, log scale) vs. total number of cores (0.33M to 10.64M) for the implicit vs. explicit solvers, at resolutions from 2.480 km down to 0.488 km with DOFs = 772B; the implicit solver sustains 7.95 DP-PF and the explicit 23.66 DP-PF (“Exa-scale” equivalent for the explicit); the implicit scheme is 34X to 89.5X faster in time to solution.]
The 488-m resolution run: 0.07 SYPD, 10.6M cores, dt=240s, 89.5X speedup over explicit.
GeoFEM Benchmark: ICCG for FEM – performance of a node (flat MPI)
                           SR11K/J2 (Power5+)  T2K (AMD)  FX10   K      Earth Sim 1
Cores / node               16                  16         16     8      8
Peak performance (GFLOPS)  147.2               147.2      236.5  128.0  64.0
STREAM Triad (GB/s)        101.0               20.0       64.7   43.3   256.0
B/F                        0.686               0.136      0.274  0.338  4.00
GeoFEM (GFLOPS)            19.0                4.69       16.0   11.0   25.6
% of peak                  12.9                3.18       6.77   8.59   40.0
LLC/core (MB)              18.0                2.00       0.75   0.75   -
Sparse solvers are memory-bound.
Improvement of performance on sparse matrix computations due to higher memory bandwidth
[Figure: a 4x4 finite-element mesh with nodes 1-16 and the corresponding 16x16 sparse matrix-vector product {Y} = [A]{X}, where each row has a diagonal entry D and off-diagonal entries X at the mesh-neighbor positions.]
Sparse matrix-vector multiply in CRS form (diagonal stored separately):

    do i= 1, N
      Y(i)= D(i)*X(i)
      do k= INDEX(i-1)+1, INDEX(i)
        Y(i)= Y(i) + AMAT(k)*X(ITEM(k))
      enddo
    enddo

Sparse matrices (FEM): indirect memory access, memory-bound.

Assumptions & Expectations


towards Post-K/Post-Moore Era
• Post-K (-2020)
– Memory Wall
– Hierarchical Memory (e.g. KNL: MCDRAM-DDR)
• Post-Moore (-2025? -2029?)
– Larger Size of Memory & Cache
– Higher Bandwidth, Larger & Heterogeneous Latency
• 3D Stacked Memory, Optical Network
• Both of Memory & Network will be more hierarchical
– Application-Customized Hardware, FPGA
• Common Issues
– Hierarchy, Latency (Memory, Network etc.)
– Large Number of Nodes/Number of Cores per Node
• under certain constraints (e.g. power, space …)

Parallel-in-Space/Time (PiST)
• MG is scalable, but the improvement of performance is limited by parallelization only in the space direction
• Time-dependent problems: concurrency in the time direction
• Multigrid in the (space+time) direction
  ✓ Traditional time-dependent method: point-wise Gauss-Seidel
  ✓ XBraid: Lawrence Livermore National Laboratory
    - Application to nonlinear problems (transient Navier-Stokes equations)
• MS with 3 sessions at SIAM PP16 (April 2016)
• The PiST approach is suitable for Post-Moore systems with a complex and deeply hierarchical network that causes large latency.
APPLICATION TOPIC: FUSION ENERGY SCIENCE
[Slides courtesy William Tang, Princeton University]
SITUATION ANALYSIS
The most critical problem for fusion energy: avoid/mitigate large-scale major disruptions.
• Approach: Conventional “1st-principles (hypothesis-based)” HPC simulations are unable to meet this challenge; big-data-driven statistical machine-learning (ML) predictions of the occurrence of disruptions in fusion-grade plasmas such as the “Joint European Torus (JET)” today and “ITER” in the near future are now deployed.
• Current status: ~8 years of R&D results (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data executed on CPU clusters yield reported success rates in the mid-80% range for JET at 30 ms before disruptions, BUT > 95% with a false-alarm rate < 5% is actually needed for ITER (reference: P. DeVries et al., 2015).
• Princeton team goals include:
  (i) improve physics fidelity via development of new ML multi-D, time-dependent software including better classifiers;
  (ii) develop “portable” (cross-machine) predictive software beyond JET to other devices and eventually ITER; and
  (iii) enhance ML execution speed for very large datasets by using supercomputers (e.g., “Titan”/“Summit” in the US, “TSUBAME-3” in Japan, “Piz Daint” at CSCS in Europe)
  => development & deployment of advanced ML software via deep-learning recurrent neural networks.
CLASSIFICATION
● Binary classification problem:
  ○ Shots are disruptive or non-disruptive
● Supervised ML techniques:
  ○ Physics domain scientists combine a knowledge base of observationally validated information with advanced statistical/ML predictive methods.
● Machine Learning (ML) methods engaged:
  The basic SVM approach initiated by the JET team produced the “APODIS” software, leading now to Princeton’s new deep-learning Fusion Recurrent Neural Net (FRNN) code
● Approach: (i) begin with properly normalized data; (ii) use training sets to generate new models; (iii) use trained models to classify new samples & improve prediction of tokamak disruptions
  → Multi-D data analysis requires new signal representations
  → The FRNN software includes deep-learning convolutional and recurrent neural-net features to respond to new challenges
Machine Learning Workflow
Identify signals (and classifiers) -> preprocessing, normalization and feature extraction -> train model, hyper-parameter tuning -> use model for prediction
• Preprocessing / normalization: all data placed on an appropriate numerical scale ~ O(1), e.g. data-based, with all signals divided by their standard deviation; measured sequential data arranged in patches of equal length for training
• Training: all available data analyzed; train an LSTM (Long Short-Term Memory network) iteratively; evaluate using ROC (Receiver Operating Characteristic) curves and cross-validation loss for every epoch (one epoch = one pass over the entire data set)
• Prediction: apply the ML/DL software to new data
Princeton/PPPL DL predictions are now advancing to multi-D time-trace signals (beyond zero-D).
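A minimal sketch of the normalization and patching step (my own Python illustration with hypothetical signal counts, not the Princeton/PPPL code):

    import numpy as np

    def preprocess(shot_signals, patch_len=128):
        # Scale each signal by its standard deviation (values ~O(1)) and cut
        # the sequential data into equal-length patches for training.
        x = np.asarray(shot_signals, dtype=np.float64)    # (time, n_signals)
        x = x / (x.std(axis=0) + 1e-12)                   # data-based normalization
        n_patches = len(x) // patch_len
        return x[: n_patches * patch_len].reshape(n_patches, patch_len, -1)

    patches = preprocess(np.random.rand(1000, 6))         # 6 hypothetical 0-D signals
    print(patches.shape)                                  # (7, 128, 6)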
Deep Recurrent Neural Nets: Schematic
[Unrolled-RNN diagram: at each timestep T = 0, 1, …, t ms, the input signals (0-D signals plus 1-D profiles passed through a CNN) feed an RNN cell carrying internal state; each step produces an output “Is a disruption coming?” that raises an alarm when it exceeds a threshold.]
RNN architecture: LSTM, 3 layers, 300 cells per layer
Signals: plasma current, locked-mode amplitude, plasma density, internal inductance, input power, radiated power, internal energy, 1-D profiles (electron temperature, density), …
(A minimal model sketch follows below.)
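The minimal model sketch mentioned above (my own PyTorch illustration of an FRNN-style predictor, not the actual FRNN code; the signal counts and the small CNN are placeholders): a 1-D CNN summarizes each profile, the result is concatenated with the 0-D scalars, and a 3-layer, 300-cell LSTM emits a per-timestep disruption score compared against an alarm threshold:

    import torch
    import torch.nn as nn

    class FRNNSketch(nn.Module):
        def __init__(self, n_scalars=8, profile_len=64, cnn_features=16):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv1d(1, cnn_features, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1))              # one feature vector per timestep
            self.lstm = nn.LSTM(n_scalars + cnn_features, 300,
                                num_layers=3, batch_first=True)
            self.head = nn.Linear(300, 1)

        def forward(self, scalars, profiles):
            # scalars: (batch, time, n_scalars); profiles: (batch, time, profile_len)
            b, t, p = profiles.shape
            feat = self.cnn(profiles.reshape(b * t, 1, p)).reshape(b, t, -1)
            h, _ = self.lstm(torch.cat([scalars, feat], dim=-1))
            return torch.sigmoid(self.head(h)).squeeze(-1)   # per-timestep score

    model = FRNNSketch()
    scores = model(torch.randn(2, 100, 8), torch.randn(2, 100, 64))
    alarm = scores > 0.5                                      # threshold per timestep
    print(scores.shape, alarm.any(dim=1))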
FRNN Code PERFORMANCE: ROC CURVES
JET ITER-like wall cases @ 30 ms before disruption
Performance tradeoff: tune true positives (good: correctly caught disruptions) vs. false positives (bad: safe shots incorrectly labeled disruptive).
Example operating points on the ROC curve: TP 93.5% at FP 7.5%; TP 90.0% at FP 5.0%; ROC area: 0.96
Data (~50 GB), 0-D signals:
• Training on 4100 shots from JET C-wall campaigns
• Testing on 1200 shots from JET ILW campaigns
• All shots used, no signal filtering or removal of shots
FRNN Scaling Results on GPUs
• Tests on the OLCF Titan Cray supercomputer
  – OLCF DD award: enabled scaling studies on Titan, currently up to 6000 GPUs
  – Titan total: ~18.7K Tesla K20X Kepler GPUs
  – TensorFlow + MPI
New FRNN scaling tests: TSUBAME 3.0
• Very recent results on the TSUBAME 3.0 supercomputer (Tokyo Tech, Tokyo, Japan)
  – TSUBAME 3.0 initial “Grand Challenge” runs
  – On the order of a thousand Tesla P100 SXM2 GPUs, 4 GPUs per node, NVLink
  – TensorFlow + MPI, CUDA 8, cuDNN 6, OpenMPI 2.1.1, GPU Direct
Multi-GPU Read Alignment Algorithm
Introduction
Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH)
• In most living organisms, DNA consists of units called nucleotides built from the four bases adenine (A), cytosine (C), guanine (G), and thymine (T); the sequence of the four bases encodes the genetic information that read alignment reconstructs

JST-CREST “Extreme Big Data” Project (2013-2018)
Future Non-Silo Extreme Big Data Scientific Apps
• Problem domains co-designed with the system: Ultra Large Scale Graphs and Social Infrastructures; Massive Sensors and Data Assimilation in Weather Prediction; Large Scale Metagenomics
• Question: given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. Clouds?
• Goal: bring HPC rigor in architectural, algorithmic, and system-software performance and modeling into big data: Exascale Big Data HPC
• EBD System Software, incl. EBD Object System: EBD Bag, Cartesian Plane, KVS, Graph Store, EBD KVS
• Co-designed node architecture: 2 Tbps HBM over 4~6 HBM channels, large NVM/Flash capacity, 1.5 TB/s DRAM & NVM bandwidth, 30 PB/s aggregate I/O bandwidth possible, low-power CPUs plus a high-powered main CPU on a TSV interposer and PCB; target data volumes on the order of 1 Yottabyte / Year
Convergent Architecture (Phases 1~4): Large Capacity NVM, High-Bisection Network
• Cloud IDC: very low bandwidth & efficiency, but highly available and resilient
• Supercomputers: compute- & batch-oriented, but more fragile
EBD System Software (Matsuoka-G)
• Big Data Algorithms for Accelerators (GPUs and FPGAs, low-level kernels for DNN & Graph)
  • Fast and Memory-saving SpGEMM on GPUs
  • Accelerating SpMV on GPU by Reducing Memory Access
  • OpenCL-based High-Performance 3D Stencil Computation on FPGAs
  • Evaluating Strategies to Accelerate Applications using FPGAs
  • Accelerating Spiking Neural Networks on FPGAs
  • Directive-based Temporal-Blocking application
• Large Scale Graph Algorithms and Sorting
  • No.1 on the Graph500 Benchmark, 5 consecutive times (collab. w/ Kyushu-U, Riken, etc.)
  • Distributed Large-Scale Dynamic Graph Store & Large-scale Graph Colouring (vertex coloring)
  • Dynamic Graph Data Structure Using Local NVRAM
  • Incremental Graph Community Detection
  • ScaleGraph: Large-scale Graph Processing Framework w/ User-Friendly Interface
  • GPU-HykSort: Large Scale Sorting on Massive GPUs
  • XtrSort: GPU out-of-core sorting
  • Efficient Parallel Sorting Algorithm for Variable-Length Keys
• Big-Data Performance Modeling and Analysis
  • Co-locating HPC and Big Data Analytics
  • Visualizing Traffic of Large-scale Networks
  • I/O vs MPI Traffic Interference on Fat-tree Networks
  • ibprof: Low-level Profiler of MPI Network Traffic
  • Evaluation of HPC-Big Data Applications in Clouds
  • Analysis on Configurations of Burst Buffers
• High Performance Big-Data Programming Middleware
  • mrCUDA: Remote-to-local GPU Migration Middleware
  • Transpiler between Python and Fortran
  • Hamar (Highly Accelerated Map Reduce)
  • Out-of-core GPU-MapReduce for Large-scale Graph Processing
  • DRAGON: Extending UVM to NVMe
  • Hierarchical, UseR-level and ON-demand File system (HuronFS)
• Optimizing Traffic Simulation App (ex-Suzumura Group)
  • Incremental Graph Community Detection
  • DeepGraph
  • Exact-Differential Traffic Simulation
Characteristics of Big Data and AI Computing
• As BD / AI workloads: Graph Analytics (e.g. social networks); Sort, Hash (e.g. DB, log analysis); Symbolic Processing (traditional AI) vs. Dense Linear Algebra: DNN inference, training, generation
• As HPC tasks: integer ops & sparse matrices; data movement, large memory; sparse and random data, low locality vs. dense matrices, reduced precision; dense and well-organized networks and data
• These are opposite ends of the HPC computing spectrum, but HPC simulation apps can also be categorized likewise
• Both ends need acceleration and scaling via supercomputers adapted to AI/BD
• (Big Data) BYTES capabilities, in both bandwidth and capacity, are critically important but often missing from modern HPC machines in their pursuit of FLOPS…
• Need BOTH bandwidth and capacity (BYTES) in an HPC-BD/AI machine:
  • Obvious for the left-hand sparse, bandwidth-dominated apps
  • But also for right-hand DNN: strong scaling, large networks and datasets, in particular for future 3D dataset analysis such as CT scans and seismic simulation vs. analysis
• Our measurement of the breakdown of one iteration of CaffeNet training on TSUBAME-KFC/DL (mini-batch size of 256, 8 GPUs per node): computation on the GPUs occupies only 3.9% of the time
• So a proper architecture supporting large memory capacity and bandwidth, as well as network latency and bandwidth, is important
4 Layers of Parallelism in DNN Training
• Hyper Parameter Search
• Searching optimal network configurations and
parameters
• Often use evolutionary algorithms
• Data Parallelism
• Split and parallelize the batch data
• Synchronous, asynchronous, hybrid, …
• Model Parallelism
• Split and parallelize the layer calculations in
forward/backward propagation
• ILP and other low level Parallelism
• Parallelize the convolution operations etc. (in
reality tensor op / matrix multiply)
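As a toy illustration of the top (hyper-parameter search) layer only, and not code from any system cited here, a minimal evolutionary search loop; the helper train_and_score and the hyper-parameter names are hypothetical placeholders:

    import random

    def mutate(cfg):
        """Randomly perturb one hyper-parameter of a candidate configuration."""
        new = dict(cfg)
        key = random.choice(list(new))
        new[key] = {"layers": random.randint(2, 6),
                    "lr": 10 ** random.uniform(-4, -1),
                    "batch": random.choice([64, 128, 256, 512])}[key]
        return new

    def evolve(train_and_score, generations=10, population=8):
        """Keep the best half each generation; train_and_score(cfg) would itself run
        a data/model-parallel training job and return validation accuracy."""
        base = {"layers": 3, "lr": 1e-3, "batch": 256}
        pop = [mutate(base) for _ in range(population)]
        for _ in range(generations):
            ranked = sorted(pop, key=train_and_score, reverse=True)
            parents = ranked[: population // 2]
            pop = parents + [mutate(random.choice(parents))
                             for _ in range(population - len(parents))]
        return max(pop, key=train_and_score)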

What about the other layers?
How do we co-design?
Deep Learning is “All about Scale”
• Andrew Ng:
– “Deep Learning is scalable”
– “Performance just gets better if you feed in more data”
• Data-parallel training with (Asynchronous)
Stochastic Gradient Descent

Fig. 2: Andrew Ng (Baidu), “What Data Scientists Should Know about Deep Learning”
Fig. 3: Simplified DL workflow with ASGD; per iteration:
1. Compute gradients
2. Exchange gradients via all-reduce
3. Update network parameters
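A minimal sketch of the synchronous, data-parallel version of this loop using mpi4py and NumPy (the asynchronous ASGD variant differs in when gradients are applied, but the all-reduce step is the same idea); compute_gradient is a hypothetical user-supplied function:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def training_step(params, local_batch, compute_gradient, lr=0.01):
        """One data-parallel iteration: local gradient, global all-reduce, update."""
        grad = compute_gradient(params, local_batch)      # step 1: local gradient
        global_grad = np.empty_like(grad)
        comm.Allreduce(grad, global_grad, op=MPI.SUM)     # step 2: exchange via all-reduce
        global_grad /= comm.Get_size()                    # average over workers
        return params - lr * global_grad                  # step 3: update parameters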

Example AI Research: Predicting Statistics of Asynchronous SGD Parameters
for a Large-Scale Distributed Deep Learning System on GPU Supercomputers
Background:
• In large-scale Asynchronous Stochastic Gradient Descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN
• Illustration (DNN parameter space): a gradient -ηΣi∇Ei computed from W(t) and applied immediately has staleness 0; if two asynchronous updates (producing W(t+1), W(t+2)) occur while the gradient is being computed, it is applied only at W(t+3) with staleness 2
Proposal:
• We propose an empirical performance model for the ASGD deep learning system SPRINT which considers the probability distributions of mini-batch size and staleness
• Figure: predicted vs. measured distributions of mini-batch size and staleness for 4, 8, and 16 nodes and NSubbatch = 1, 4, 8 (NSubbatch: # of samples per one GPU iteration)
• Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
Interconnect Performance as important
as GPU Performance to accelerate DL
• ASGD DL system SPRINT (by DENSO IT Lab) and DL speedup prediction
with performance model

– Data measured on TSUBAME2 and TSUBAME-KFC (both FDR InfiniBand) fitted to formulas
– Allreduce time (∈ TGPU) depends on #nodes and #DL_parameters
Fig. 4: Oyama et al., “Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers”
• Other approaches give similar improvements:
  – CUDA-aware CNTK optimizes the communication pipeline → 15%-23% speedup
    (Banerjee et al., “Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters”)
  – Reduced precision (FP[16|8|1]) to minimize message size with no or minor accuracy loss
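The dependence of all-reduce time on node count and parameter count can be illustrated with a generic latency/bandwidth model of a ring all-reduce; this is a textbook cost model, not the model fitted in the paper, and alpha/beta are assumed link parameters:

    def ring_allreduce_time(n_nodes, n_params, bytes_per_param=4,
                            alpha=5e-6, beta=1 / 12.5e9):
        """Estimate ring all-reduce time: 2(n-1) steps, each moving ~(message/n) bytes.

        alpha: per-step latency [s]; beta: seconds per byte (here a ~12.5 GB/s link).
        """
        msg_bytes = n_params * bytes_per_param
        steps = 2 * (n_nodes - 1)
        return steps * (alpha + (msg_bytes / n_nodes) * beta)

    # Example: a 61M-parameter AlexNet-like model on 32 nodes
    print(ring_allreduce_time(32, 61_000_000))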
101 Less is More: Accelerating Deep Neural Networks with Micro-Batching
Yosuke Oyama (1, a), Tal Ben-Nun (2), Torsten Hoefler (2), Satoshi Matsuoka (1)
1) Tokyo Institute of Technology, 2) ETH Zurich
a) oyama.y.aa@m.titech.ac.jp, Presenter
162nd High Performance Computing Research Meeting, 2017/12/19
102 Background: cuDNN Convolution
• NVIDIA cuDNN: a deep learning kernel library for NVIDIA GPUs
  • Adopted by most deep learning frameworks
• Contains multiple convolution algorithms for CNNs
  • GEMM, direct, FFT, Winograd, …
• Most algorithms use workspace: a buffer in GPU memory that stores intermediate data
• 2D convolution (forward), with input X of shape N x C x H x W, filters W of shape K x C x U x V, and output Y of shape N x K x H x W:
    Y[n, k, h, w] = Σc,u,v W[k, c, u, v] * X[n, c, h+u, w+v]
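A direct NumPy transcription of the formula above, as a naive reference only to pin down the index convention (cuDNN's algorithms compute the same result far faster):

    import numpy as np

    def conv2d_forward(X, W):
        """Naive forward convolution: X is (N, C, H, W_in), W is (K, C, U, V)."""
        N, C, H, W_in = X.shape
        K, _, U, V = W.shape
        Y = np.zeros((N, K, H - U + 1, W_in - V + 1))
        for n in range(N):
            for k in range(K):
                for h in range(Y.shape[2]):
                    for w in range(Y.shape[3]):
                        Y[n, k, h, w] = np.sum(W[k] * X[n, :, h:h+U, w:w+V])
        return Y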
103 Background: cuDNN Convolution
• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among convolution algorithms
  • e.g. an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown: conv2 (forward) is stuck with a slow algorithm if the workspace limit is < 323 MiB, but switches to a fast one if the limit is ≥ 323 MiB
• Algorithms compared: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED
• Figure: execution time vs. workspace size of AlexNet conv2 (forward); mini-batch size of 256, NVIDIA P100-SXM2, cuDNN 7.0
104 Background: cuDNN Convolution
• Observation: a smaller batch size yields higher achievable performance
  • Faster algorithms can be enabled by dividing the mini-batch; e.g. FFT_TILING attains 93% of its peak performance with 58% of the workspace
• Figure: computation performance (images per ms) and workspace size of FFT_TILING for AlexNet conv2 (forward) as a function of batch size
105 Approach and Contribution
• Approach: µ-cuDNN, a wrapper library for cuDNN
  • µ-cuDNN divides one mini-batch into more fine-grained batches (“micro-batches”) for cuDNN convolution
  • µ-cuDNN optimizes micro-batch sizes and algorithms using Dynamic Programming and Integer Linear Programming
• Contribution: µ-cuDNN on an NVIDIA Tesla P100-SXM2 GPU achieves
  • up to 2.33x speedup for a single convolution
  • up to 1.63x speedup for the convolutions of a CNN
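An illustrative Python sketch of the micro-batching idea (framework-agnostic pseudocode, not the µ-cuDNN C++ implementation): the mini-batch is processed as a sequence of smaller slices so each convolution call can use a faster algorithm within a tighter workspace limit:

    import numpy as np

    def microbatched_conv(conv_fn, X, micro_batch_size):
        """Run conv_fn on slices of the mini-batch and concatenate the outputs.

        conv_fn: callable taking an (n, C, H, W) slice and returning its output;
                 in µ-cuDNN this would be a cuDNN convolution with a chosen algorithm.
        """
        outputs = []
        for start in range(0, X.shape[0], micro_batch_size):
            outputs.append(conv_fn(X[start:start + micro_batch_size]))
        return np.concatenate(outputs, axis=0)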
106 Proposal: WD using Integer LP
• Problem: select one configuration (micro-batch sizes + algorithms) per convolution kernel so that the kernels together do not let the total workspace size exceed the given limit; we propose a 0-1 Integer Linear Programming (ILP) based optimization algorithm (Fig. 6)
• Notation:
  • K: set of convolution kernels; Ck: set of available configurations of kernel k
  • Tk(c), Mk(c): benchmarked execution time and workspace size of kernel k with configuration c
  • M: user-specified total workspace size limit
  • xk,c ∈ {0, 1}: xk,c = 1 ⇔ configuration c is selected for kernel k
• The workspace-division (WD) problem is formulated as:
    minimize    T = Σk∈K Σc∈Ck Tk(c) xk,c              (total execution time)          (4)
    subject to  Σk∈K Σc∈Ck Mk(c) xk,c ≤ M              (total workspace within limit)  (5)
                Σc∈Ck xk,c = 1   for all k ∈ K         (exactly one config per kernel) (6)
                xk,c ∈ {0, 1}    for all k ∈ K, c ∈ Ck                                 (7)
• Each kernel’s candidate configurations are built from micro-batch configurations cµ(b) ∈ Cµ(b), where Cµ(b) is the set of available configurations for micro-batch size b and D is a pruning function; the per-kernel WR algorithm appears as one element of this construction
• Desirable configuration selection: to keep the ILP small, only non-dominated configurations are kept (Fig. 7):
    D(C) = {c ∈ C | ∀c′ ∈ C \ {c}: T(c) < T(c′) ∨ M(c) < M(c′)}
  where T(c) and M(c) are the execution time and workspace size of configuration c; a configuration is excluded when some other configuration beats it on time without using more workspace
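A small sketch of the same 0-1 ILP written with the PuLP modeling library (assumed to be installed); the per-kernel (time, workspace) candidates are made-up numbers purely for illustration:

    import pulp

    # Hypothetical benchmark results: per kernel, candidate (time [ms], workspace [MiB])
    configs = {
        "conv1": [(1.9, 4), (1.2, 40), (1.0, 120)],
        "conv2": [(9.0, 8), (4.0, 60), (2.0, 300)],
    }
    M_LIMIT = 128  # total workspace budget [MiB]

    prob = pulp.LpProblem("workspace_division", pulp.LpMinimize)
    x = {(k, i): pulp.LpVariable(f"x_{k}_{i}", cat="Binary")
         for k, cs in configs.items() for i in range(len(cs))}

    # Objective (4): total execution time
    prob += pulp.lpSum(configs[k][i][0] * x[k, i] for (k, i) in x)
    # Constraint (5): total workspace must not exceed the budget
    prob += pulp.lpSum(configs[k][i][1] * x[k, i] for (k, i) in x) <= M_LIMIT
    # Constraint (6): exactly one configuration per kernel
    for k, cs in configs.items():
        prob += pulp.lpSum(x[k, i] for i in range(len(cs))) == 1

    prob.solve()
    print({k: i for (k, i), var in x.items() if var.value() == 1})

Pruning each candidate list to its desirable (non-dominated) configurations first keeps the number of binary variables small.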
107 Proposal: WD using Integer LP
• Figure: schematic of the ILP. For each kernel conv k exactly one configuration c ∈ Ck is selected (xk,c = 1), e.g. x1,u = 1 for conv1 (u ∈ C1) and x2,v = 1 for conv2 (v ∈ C2); each selected configuration is itself a sequence of micro-batch configurations cµ, contributes its time (e.g. T2(v)) to the objective min. T, and its workspace (e.g. M2(v)) stacks toward the total workspace limit M
108 Evaluation: WD using Integer LP
• The ILP-based algorithm nearly fully utilizes the workspace: performance-sensitive kernels (conv2, conv3) aggressively occupy the workspace
• Figure: breakdown of the workspace size consumed by conv1-conv5 of AlexNet under the undivided / powerOfTwo / all micro-batching policies, each for WR and WD, against the total workspace limit of 120 MiB
• Mini-batch size of 256, total workspace size of 120 MiB, P100-SXM2
109 Evaluation: WD using Integer LP

120

IMPLICIT_GEMM
IMPLICIT_PRECOMP_GEMM

100
GEMM
BYTES-bound

FFT

Workspace size [MiB]


FFT_TILING

80

●●
WINOGRAD_NONFUSED
algorithms
are faster


60

40 ●


20


● ●
0

0 2 4 6 8 10

Execution time [ms]

A desirable configuration set of AlexNet conv2 (Forward)


Mini-batch size of 256, P100-SXM2
Each bar represents proportion of micro-batch sizes and algorithms
110 Evaluation: WD using Integer LP
• WD overcomes WR at the same total workspace limit (e.g. 1.24x and 1.38x faster in the plotted cases); WD even beats WR when WR is given 8x more total memory
• Figure: execution time breakdown (conv1-conv5, etc.) of AlexNet on an NVIDIA Tesla P100-SXM2 for the undivided and all micro-batching policies, comparing WR with per-kernel workspace limits of 8, 64, and 512 MiB against WD with total workspace limits of 120, 960, and 7680 MiB; mini-batch size of 256
Co-Designing Cambrian HPC System Architecture
FLOPS-Oriented => BYTES-Oriented
• Numerical applications and algorithms
• System SW, programming models, and abstractions
• Programming & communication middleware, and performance modeling, for deep and high-bandwidth memory hierarchies
• Next-gen Exabit-class optically switched interconnect: wavelength bank (optical comb) supplying single-source wavelengths, DEMUX / modulators / MUX onto fiber, optical switch & launch pad, silicon photonics integration
• Specialized / integrated / reconfigurable super building block architecture: 2.5D CPU cards with advanced 3D-stacked non-volatile memory (>TBytes, >10 TByte/s), NVM/Flash and DRAM with low-power CPUs on a TSV interposer and PCB
112 Post Moore Era Supercomputing Workshop @ SC17
• https://sites.google.com/site/2017pmes/
• Jeff Vetter (ORNL), Satoshi Matsuoka (Tokyo Tech) et al.
