Mug21 DL ML v3
[Figure: Timeline (1805–2017) of machine learning and deep learning milestones – linear regression (A. Legendre, J. Gauss), PCA (K. Pearson), the Turing Machine (A. Turing), early neural networks (S. McCulloch – W. Pitts, F. Rosenblatt, B. Widrow – M. Hoff, M. Minsky – S. Papert), Bayesian methods (J. Pearl), backpropagation (D. Rumelhart – G. Hinton – R. Williams), SVM (V. Vapnik – C. Cortes), and modern deep learning (A. Krizhevsky, A. Ng, Y. LeCun, Y. Bengio); related methods shown include K-Means, KNN, decision trees, random forest, XGBoost, CatBoost, and evolutionary algorithms.]
www.top500.org
Courtesy: https://openai.com/blog/ai-and-compute/
Outline
• Introduction
• Deep Neural Network Training and Essential Concepts
• Parallelization Strategies for Distributed DNN Training
• Machine Learning
• Data Science using Dask
• Conclusion
[Figure: Forward and backward passes through a small neural network. An input X flows through layers with weights W1–W8 to produce a prediction Pred; the error is computed as Error = Loss(Pred, Output) and propagated back through the network as error terms E1–E8 in the backward pass. Courtesy: http://cs231n.github.io/neural-networks-1/]
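This forward-pass / loss / backward-pass cycle can be expressed in a few lines of framework code. The following is a minimal sketch (not from the slides), assuming PyTorch and made-up layer sizes, of one training step: a forward pass to get Pred, Error = Loss(Pred, Output), and the backward pass that produces the gradients used to update the weights W.

# Minimal sketch (assumed PyTorch, hypothetical layer sizes): one training step
import torch
import torch.nn as nn

model = nn.Sequential(          # weights W1..W8 live inside these layers
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 1),
)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 4)          # a mini-batch of inputs X
y = torch.randn(32, 1)          # the expected outputs

pred = model(x)                 # forward pass: Pred
error = loss_fn(pred, y)        # Error = Loss(Pred, Output)
error.backward()                # backward pass: compute gradients (the E terms)
opt.step()                      # update the weights using the gradients
opt.zero_grad()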
[Figure: Choosing the learning rate. Courtesy: https://www.jeremyjordan.me/nn-learning-rate/]
Model Parallelism
Drawback: if the dataset has 1 million images, training on a single machine will take prohibitively long.
Solution: can we use multiple machines to speed up the training of deep learning models (i.e., utilize supercomputers to parallelize)?
Need for Communication in Data Parallelism
[Figure: The dataset is partitioned across Machines 1–5, each holding a different subset and producing its own Y/N predictions. Problem: train a single model on the whole dataset, not 5 models on different subsets of the dataset.]
[Figure: Each of Machines 1–5 computes its own gradients on its data partition; an MPI AllReduce sums the gradients element-wise, so every machine ends up with the same reduced gradients.]
Allreduce Collective Communication Pattern
• Element-wise sum of data from all processes, with the result delivered to all processes
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
                  MPI_Op operation, MPI_Comm comm)
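As an illustration of how Allreduce is used in data-parallel training, here is a minimal sketch (not from the slides), assuming mpi4py and a made-up gradient array: each rank contributes its local gradients and receives the element-wise sum, which it then averages for the weight update.

# Minimal sketch (assumed mpi4py, hypothetical gradient array): average gradients
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

local_grad = np.random.rand(1024).astype(np.float32)  # this rank's gradients
summed = np.empty_like(local_grad)

# Element-wise sum across all ranks; every rank receives the result
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / comm.Get_size()                   # average used for the weight update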
– Added NCCL communication substrate for various MPI collectives
• MPI_Allreduce, MPI_Reduce, MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv, MPI_Gather, MPI_Gatherv, and MPI_Bcast
– Support for hybrid communication protocols using NCCL-based, CUDA-based, and IB verbs-based primitives
– Full support for NVIDIA DGX, NVIDIA DGX-2 V-100, and NVIDIA DGX-2 A-100 systems
• Enhanced architecture detection, process placement and HCA selection
• Enhanced intra-node and inter-node point-to-point tuning
• Enhanced collective tuning
– Tested with Horovod and common DL frameworks
• TensorFlow, PyTorch, and MXNet
– Tested with MPI4Dask 0.2
• MPI4Dask is a custom Dask Distributed package with MPI support
– Tested with MPI4cuML 0.1
• MPI4cuML is a custom cuML package with MPI support
[Figure: Performance scaling on 1–16 GPUs (two panels, x-axis: Number of GPUs); ~1.89x on 2 GPUs.]
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS '20.
Distributed TensorFlow on ORNL Summit (1,536 GPUs)
• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
• 1,281,167 (1.2 million) images – ImageNet-1k has 1.2 million images
• MVAPICH2-GDR reaching ~0.42 million images per second for ImageNet-1k!
• Time/epoch = 3 seconds
• Total Time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!
[Figure: Images per second (thousands) vs. number of GPUs (1–1,536) for NCCL-2.6 and MVAPICH2-GDR 2.3.4. *We observed issues for NCCL2 beyond 384 GPUs.]
Platform: The Summit Supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1
Distributed TensorFlow on TACC Frontera (2048 CPU nodes)
• Scaled TensorFlow to 2048 nodes on Frontera using MVAPICH2 and IntelMPI
[Figure: TensorFlow scaling results on Frontera (log-scale axis).]
A. Jain, A. A. Awan, H. Subramoni, DK Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (SC '19 Workshop).

• HyPar-Flow:
– An easy-to-use hybrid parallel training framework
• Hybrid = Data + Model
– Supports Keras models and exploits TF 2.0 Eager Execution
– Exploits MPI for point-to-point and collectives
[Figure: Memory (GB) required vs. input image size (width x height) for ResNet-1k and ResNet-5k, compared against the memory available on a Pascal GPU, a Volta GPU, a CPU Broadwell node (128 GB), and a CPU Skylake node (192 GB).]
Benchmarking large models leads to better insights and the ability to develop new approaches!
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", ISC '20, https://arxiv.org/pdf/1911.05146.pdf
[Figure: HyPar-Flow model partitioning – each replica (Replica-1) is split across Partition-0 (Conv1, Conv2), Partition-1 (Conv3, Conv4), and Partition-2 (FC, Output), with gradients exchanged between partitions via send(∇)/recv(∇).]
– 93.9% for 512 nodes
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", ISC '20, https://arxiv.org/pdf/1911.05146.pdf
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN", SC '20.
[Figure: Domain decomposition – divide data objects amongst P processors (Proc 0, Proc 1, ...); recompute centroids after MPI_Allreduce.]
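A minimal sketch of one such iteration, assuming mpi4py/NumPy and made-up shapes (an illustration of the pattern, not cuML's implementation): each processor accumulates per-cluster partial sums over its local data objects, and MPI_Allreduce combines them before the centroids are recomputed.

# Minimal sketch (assumed mpi4py/NumPy, hypothetical shapes): one K-Means update step
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
k, d = 8, 2
rng = np.random.default_rng(comm.Get_rank())
local_points = rng.random((1000, d))              # this processor's data objects

centroids = np.zeros((k, d))
if comm.Get_rank() == 0:                          # rank 0 picks initial centroids
    centroids = local_points[:k].copy()
comm.Bcast(centroids, root=0)

# Assign each local point to its nearest centroid
dist = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
labels = dist.argmin(axis=1)

# Partial sums and counts per cluster on this processor
local_sum = np.zeros((k, d))
local_cnt = np.zeros(k)
for c in range(k):
    mask = labels == c
    local_sum[c] = local_points[mask].sum(axis=0)
    local_cnt[c] = mask.sum()

# Combine partial results from all processors, then recompute centroids
global_sum = np.empty_like(local_sum)
global_cnt = np.empty_like(local_cnt)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)
comm.Allreduce(local_cnt, global_cnt, op=MPI.SUM)
centroids = global_sum / np.maximum(global_cnt, 1)[:, None]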
• cuML:
– Execution is supported on multiple nodes using Dask (http://dask.org) and
NVIDIA’s NCCL
The cuML Library: Accelerating ML on GPUs
• The NVIDIA RAPIDS project aims to build end-to-end data science
analytic pipelines on GPUs
• An important component is the cuML library:
– GPU-accelerated ML library
– GPU-counterpart of Scikit-learn
– Supports the execution of ML workloads on Multi-Node Multi-GPU (MNMG) systems
• Most existing ML libraries, including Scikit-learn and Apache Spark’s
MLlib, only support CPU execution of ML algorithms
– Conventional wisdom has been that only DNNs are a good match for GPUs
because of high computational requirements
Main components of the cuML library
• Main components
– Python layer
• Provides a Scikit-learn-like interface
• Hides the complexities of the CUDA/C/C++ layer
– Primitives and cuML algorithms built on top of CUDA
• ML Algorithms: linear regression, logistic regression, K-nearest neighbors, K-means, random forest
• Primitives: element-wise operations, matrix multiplication, eigendecomposition, standard deviation, distance/matrix calculations
– Reusable building blocks for building machine learning algorithms
– Common for different machine learning algorithms
– Used to build different machine learning algorithms
– Communication support in cuML:
• Point-to-point communication: Dask (UCX)
• Collective communication: NVIDIA Collective Communications Library (NCCL)
[Figure: cuML software stack – a Dask/Python layer with Scikit-learn-like interfaces and a Cython layer on top of the CUDA/C/C++ layer (cuML algorithms, cuML primitives, CUDA libraries), with UCX used for point-to-point and NCCL for collective communication over CUDA.]
• Trained Model:
– The trained parameters are brought to a single
worker using MPI_Reduce()
• Inference Stage:
– The trained parameters are broadcast to all workers with prediction partitions
Courtesy: https://arxiv.org/pdf/2002.04803.pdf
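A minimal sketch of this train/infer communication flow, assuming mpi4py and a hypothetical parameter vector (it illustrates the pattern, not cuML's internal code): the fitted parameters are reduced to a single worker after training and broadcast back out for inference.

# Minimal sketch (assumed mpi4py, hypothetical parameter vector)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Training phase: each worker contributes its locally fitted parameters
local_params = np.random.rand(256)
trained = np.empty_like(local_params) if rank == 0 else None
comm.Reduce(local_params, trained, op=MPI.SUM, root=0)   # bring onto one worker

# Inference phase: the trained parameters are broadcast to all workers,
# each of which predicts on its own partition of the data
params = trained if rank == 0 else np.empty(256)
comm.Bcast(params, root=0)
predictions = params.sum() * np.ones(10)                 # placeholder "prediction"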
Accelerating cuML with MVAPICH2-GDR
• Utilize MVAPICH2-GDR (with mpi4py) as the communication backend during the training phase (the fit() function) in the Multi-Node Multi-GPU (MNMG) setting over a cluster of GPUs
• Communication primitives:
– Allreduce
– Reduce
– Broadcast
• Exploit optimized collectives
[Figure: cuML stack with MVAPICH2-GDR – mpi4py added next to UCX-Py/Cython in the Dask/Python layer, and MVAPICH2-GDR added next to UCX and NCCL beneath the cuML algorithms/primitives and CUDA libraries.]
• MPI4cuML 0.1 was released in Feb ‘21 adding support for MPI to cuML:
– Can be downloaded from: http://hidl.cse.ohio-state.edu
• Features:
– Based on cuML 0.15
– MVAPICH2 support for C++ and Python APIs
• Included use of cuML C++ CUDA-Aware MPI example for KMeans clustering
• Enabled cuML handles to use MVAPICH2-GDR backend for Python cuML applications
– KMeans, PCA, tSVD, RF, LinearModels (a minimal KMeans usage sketch follows this list)
• Added switch between available communication backends (MVAPICH2 and NCCL)
– Built on top of mpi4py over the MVAPICH2-GDR library
– Tested with
• Mellanox InfiniBand adapters (FDR and HDR)
• Various x86-based multi-core platforms (AMD and Intel)
• NVIDIA V100 and P100 GPUs
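For reference, cuML's Python API mirrors Scikit-learn. The sketch below is a minimal single-GPU illustration with made-up data sizes (not part of the release); in the MNMG setting, the communication inside the corresponding fit() call is what the MVAPICH2-GDR backend accelerates.

# Minimal single-GPU sketch (assumed cuML/CuPy, hypothetical data sizes)
import cupy as cp
from cuml.cluster import KMeans

X = cp.random.random((100_000, 16), dtype=cp.float32)  # data resident on the GPU

km = KMeans(n_clusters=8, max_iter=300)
km.fit(X)                       # training phase: the fit() function
labels = km.predict(X)          # inference on the same data
print(km.cluster_centers_[:2])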
• Setup Instructions
– Installation Pre-requisites:
• Install the MVAPICH2-GDR Library
• Install the mpi4py Library
– Install MPI4cuML
$ wget http://hidl.cse.ohio-state.edu/download/hidl/cuml/mpi4cuml-0.1.tar.gz
$ tar -xzvf mpi4cuml-0.1.tar.gz
$ cd mpi4cuml-0.1/
$ conda env update -n mpi4cuml --file=conda/environments/cuml_dev_cuda10.2.yml
$ export LIBRARY_PATH=/path/to/miniconda3/envs/mpi4cuml/lib:$LIBRARY_PATH
$ export LD_LIBRARY_PATH=/path/to/miniconda3/envs/mpi4cuml/lib:$LIBRARY_PATH
$ ./build.sh cpp-mgtests
$ conda list
$ conda list | grep cuml
Build KMeans
$ cd mpi4cuml/cpp/examples/mg_kmeans
$ vim hosts
.. write name of the compute nodes ..
$ cmake .. -DCUML_LIBRARY_DIR=/path/to/directory/with/libcuml.so -DCUML_INCLUDE_DIR=/path/to/directory/with/kmeans/kmeans_c.h
$ make
$ LD_PRELOAD=$MPILIB/lib64/libmpi.so:$CUML_HOME/cpp/build/libcuml++.so
[Figure: cuML training time and speedup with the MVAPICH2-GDR backend on 1–32 GPUs (x-axis: Number of GPUs).]
M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and D. K. Panda, "Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR", MLHPC Workshop, Nov 2020.
Outline
• Introduction
• Deep Neural Network Training and Essential Concepts
• Parallelization Strategies for Distributed DNN Training
• Machine Learning
• Data Science using Dask
• Conclusion
[Figure: Dask architecture – a Client submits a Task Graph to the Distributed Scheduler, which executes it on Workers across hardware ranging from laptops/desktops to high performance computing systems; the communication layer includes UCX-Py and mpi4py (Cython wrappers).]
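As a quick illustration of the client / task graph / distributed scheduler / worker model, here is a minimal sketch (not from the slides) assuming Dask Distributed, with a hypothetical local cluster and made-up functions:

# Minimal sketch (assumed Dask Distributed; hypothetical local cluster and functions)
from dask.distributed import Client, LocalCluster
import dask

cluster = LocalCluster(n_workers=2, threads_per_worker=2)  # stand-in for an HPC cluster
client = Client(cluster)

@dask.delayed
def square(x):
    return x * x

@dask.delayed
def add(a, b):
    return a + b

# Building the expression only constructs the task graph ...
graph = add(square(3), square(4))
# ... and compute() hands it to the distributed scheduler to run on the workers
print(graph.compute())   # 25

client.close()
cluster.close()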
MPI4Dask: Bootstrapping and Dynamic Connectivity
• Several ways to start Dask programs:
– Manual

Install MPI4Dask
$ wget http://hibd.cse.ohio-state.edu/download/hibd/dask/mpi4dask-0.2.tar.gz
$ tar -xzvf mpi4dask-0.2.tar.gz
$ cd mpi4dask-0.2/distributed
$ python setup.py build
$ pip install .
$ conda list
$ conda list | grep distributed
$ conda list |grep dask
Sample Output:
+ /opt/tutorials/dl-tutorial-21/dask/mvapich2/install/bin/mpirun_rsh -np 4 -hostfile
/tmp/shafi.16/hosts LD_PRELOAD=/opt/tutorials/dl-tutorial-
21/dask/mvapich2/install/lib/libmpi.so MV2_USE_CUDA=1 MV2_CPU_BINDING_LEVEL=SOCKET
MV2_CPU_BINDING_POLICY=SCATTER MV2_USE_GDRCOPY=1
MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so
MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 /opt/tutorials/dl-tutorial-
21/dask/miniconda3/envs/mpi4dask_tutorial/bin/python /opt/tutorials/dl-tutorial-
21/labs/lab1/cupy_sum_tcp.py
<Client: 'tcp://10.3.1.1:46566' processes=2 threads=56, memory=269.79 GB>
Time for iteration 0 : 15.613967895507812
Time for iteration 1 : 12.97205138206482
Time for iteration 2 : 13.211132526397705
Time for iteration 3 : 12.941233396530151
Time for iteration 4 : 13.074704647064209
Median Wall-clock Time: 13.07 s
+ set +x
Sample Output:
+ /opt/tutorials/dl-tutorial-21/dask/mvapich2/install/bin/mpirun_rsh -np 4 -hostfile
/tmp/shafi.16/hosts LD_PRELOAD=/opt/tutorials/dl-tutorial-
21/dask/mvapich2/install/lib/libmpi.so MV2_USE_CUDA=1 MV2_CPU_BINDING_LEVEL=SOCKET
MV2_CPU_BINDING_POLICY=SCATTER MV2_USE_GDRCOPY=1
MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy2.0/lib64/libgdrapi.so
MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING=1 /opt/tutorials/dl-tutorial-
21/dask/miniconda3/envs/mpi4dask_tutorial/bin/python /opt/tutorials/dl-tutorial-
21/labs/lab1/cupy_sum_mpi.py
<Client: 'mpi://10.3.1.1:31549' processes=2 threads=56, memory=269.79 GB>
Time for iteration 0 : 9.499887466430664
Time for iteration 1 : 5.288544416427612
Time for iteration 2 : 4.123088598251343
Time for iteration 3 : 4.088623523712158
Time for iteration 4 : 4.115798234939575
Median Wall-clock Time: 4.12 s
+ set +x

Configurable options in the script (cupy_sum_mpi.py):
DASK_PROTOCOL = 'mpi'    # Options are ['mpi', 'tcp']
GPUS_PER_NODE = 1        # number of GPUs in the system
RUNS = 5                 # repetitions for the benchmark
DASK_INTERFACE = 'ib0'   # interface to use for communication
THREADS_PER_NODE = 28    # number of threads per node
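For context, the core of a cupy_sum-style benchmark looks roughly like the sketch below, assuming dask.array with CuPy-backed chunks and made-up sizes and scheduler address (this is an illustration, not the actual lab script): it builds a chunked GPU array, adds it to its transpose, and sums the result across the Dask workers.

# Rough sketch of a "sum of a CuPy array and its transpose" benchmark
# (assumed dask.array + CuPy; sizes and scheduler address are hypothetical)
import cupy
import dask.array as da
from dask.distributed import Client

client = Client("tcp://10.3.1.1:46566")      # or the 'mpi://...' address when DASK_PROTOCOL = 'mpi'

rs = da.random.RandomState(RandomState=cupy.random.RandomState)  # CuPy-backed chunks
x = rs.random_sample((20_000, 20_000), chunks=(2_500, 2_500))

result = (x + x.T).sum()                     # task graph: transpose, add, reduce
print(result.compute())                      # executed by the Dask workers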
[Figure: Benchmark wall-clock time and speedup for 2–6 Dask workers; MVAPICH2-GDR is 3.17x faster.]
A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, "Efficient MPI-based Communication for GPU-Accelerated Dask Applications", CCGrid '21, https://arxiv.org/abs/2101.08878
MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)
Benchmark #1: Sum of cuPy Array and its Transpose (TACC Frontera GPU Subsystem)
• 1.71x better on average
A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, "Efficient MPI-based Communication for GPU-Accelerated Dask Applications", CCGrid '21, https://arxiv.org/abs/2101.08878
MPI4Dask 0.2 release (http://hibd.cse.ohio-state.edu)
Benchmark #2: cuDF Merge (TACC Frontera GPU Subsystem)
• 2.91x better on average
• 2.90x better on average
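The cuDF merge benchmark stresses the worker-to-worker shuffle of a distributed join. A rough sketch of such a workload, assuming dask_cudf/cuDF with made-up column names and sizes (not the benchmark's actual code):

# Rough sketch of a distributed cuDF merge (assumed dask_cudf/cuDF; hypothetical data)
import cupy as cp
import cudf
import dask_cudf

n = 1_000_000
left = dask_cudf.from_cudf(
    cudf.DataFrame({"key": cp.arange(n) % 10_000, "x": cp.random.random(n)}),
    npartitions=8,
)
right = dask_cudf.from_cudf(
    cudf.DataFrame({"key": cp.arange(n) % 10_000, "y": cp.random.random(n)}),
    npartitions=8,
)

# The merge shuffles rows between Dask workers, which is where the
# communication backend (TCP, UCX, or MPI4Dask) matters
merged = left.merge(right, on="key")
print(len(merged))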
Follow us on
https://twitter.com/mvapich