Commit 240fb81

Merge pull request #436 from pytorch/1.6-additional-blog-posts
1.6 additional blog posts
2 parents 6726d5d + 725e0bd commit 240fb81

4 files changed: 217 additions, 0 deletions

---
layout: blog_detail
title: 'Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs'
author: Mengdi Huang, Chetan Tekur, Michael Carilli
---

Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However, this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for [mixed-precision training](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/), which combined [single-precision](https://blogs.nvidia.com/blog/2019/11/15/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/) (FP32) with half-precision (e.g. FP16) formats when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits on NVIDIA GPUs:

* Shorter training time;
* Lower memory requirements, enabling larger batch sizes, larger models, or larger inputs.

In order to streamline the user experience of training in mixed precision for researchers and practitioners, NVIDIA developed [Apex](https://developer.nvidia.com/blog/apex-pytorch-easy-mixed-precision-training/) in 2018, a lightweight PyTorch extension with an [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision) (AMP) feature. AMP automatically converts certain GPU operations from FP32 precision to mixed precision, improving performance while maintaining accuracy.

For the PyTorch 1.6 release, developers at NVIDIA and Facebook moved mixed precision functionality into PyTorch core as the AMP package, [torch.cuda.amp](https://pytorch.org/docs/stable/amp.html). `torch.cuda.amp` is more flexible and intuitive than `apex.amp`. Here are some of `apex.amp`'s known pain points that `torch.cuda.amp` is able to fix:

* Guaranteed PyTorch version compatibility, because it's part of PyTorch
* No need to build extensions
* Windows support
* Bitwise accurate [saving/restoring](https://pytorch.org/docs/master/amp.html#torch.cuda.amp.GradScaler.load_state_dict) of checkpoints (see the sketch after this list)
* [DataParallel](https://pytorch.org/docs/master/notes/amp_examples.html#dataparallel-in-a-single-process) and intra-process model parallelism (although we still recommend [torch.nn.DistributedDataParallel](https://pytorch.org/docs/master/notes/amp_examples.html#distributeddataparallel-one-gpu-per-process) with one GPU per process as the most performant approach)
* [Gradient penalty](https://pytorch.org/docs/master/notes/amp_examples.html#gradient-penalty) (double backward)
* `torch.cuda.amp.autocast()` has no effect outside regions where it's enabled, so it should serve cases that formerly struggled with multiple calls to [apex.amp.initialize()](https://github.com/NVIDIA/apex/issues/439) (including [cross-validation](https://github.com/NVIDIA/apex/issues/392#issuecomment-610038073)) without difficulty. Multiple convergence runs in the same script should each use a fresh [GradScaler instance](https://github.com/NVIDIA/apex/issues/439#issuecomment-610028282), but GradScalers are lightweight and self-contained, so that's not a problem.
* Sparse gradient support
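
As a concrete illustration of the checkpointing point above, here is a minimal sketch (using a toy model and optimizer as stand-ins) of saving and restoring the scaler's state alongside the usual model and optimizer state:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is self-contained; in practice use your own model/optimizer.
model = nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# ... train with scaler.scale(loss).backward(), scaler.step(optimizer), scaler.update() ...

# Save: include the scaler's state so a resumed run picks up the same loss scale.
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict(),
}, "checkpoint.pt")

# Restore:
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])
```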

With AMP being added to PyTorch core, we have started the process of deprecating `apex.amp`. We have moved `apex.amp` to maintenance mode and will continue to support customers using it. However, we highly encourage `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core.

# Example Walkthrough

Please see the official docs for usage:

* [https://pytorch.org/docs/stable/amp.html](https://pytorch.org/docs/stable/amp.html)
* [https://pytorch.org/docs/stable/notes/amp_examples.html](https://pytorch.org/docs/stable/notes/amp_examples.html)

Example:

```python
import torch

# model, optimizer and data_iter are assumed to be defined elsewhere.

# Create a GradScaler once at the beginning of training.
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()

    # Cast operations to mixed precision.
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scale the loss, and call backward()
    # to create scaled gradients.
    scaler.scale(loss).backward()

    # Unscale gradients and call,
    # or skip, optimizer.step().
    scaler.step(optimizer)

    # Update the scale for the next iteration.
    scaler.update()
```
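
If you need to work with unscaled gradients, for example for gradient clipping, a minimal sketch of the pattern (reusing the `model`, `optimizer` and `data_iter` assumed above; the `max_norm=1.0` value is purely illustrative) looks like this:

```python
for data, label in data_iter:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(data)
    scaler.scale(loss).backward()

    # Unscale the gradients in place so clipping operates on true gradient values.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # illustrative threshold

    # scaler.step() detects that this optimizer's gradients were already unscaled
    # this iteration and will not unscale them a second time.
    scaler.step(optimizer)
    scaler.update()
```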

# Performance Benchmarks

In this section, we discuss the accuracy and performance of mixed precision training with AMP on the latest NVIDIA A100 GPU and the previous-generation V100 GPU. Mixed precision performance is compared to FP32 performance when running Deep Learning workloads in the [NVIDIA pytorch:20.06-py3 container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) from NGC.
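
To get a rough feel for the speedup before looking at the numbers below, here is a hypothetical microbenchmark sketch (toy model, synthetic data, illustrative sizes; real workloads such as those in the NGC container are far more representative) that times the same training loop with and without AMP:

```python
import time
import torch
import torch.nn as nn

def benchmark(use_amp: bool, steps: int = 50) -> float:
    """Time `steps` optimizer steps of a toy model, with AMP on or off."""
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    loss_fn = nn.MSELoss()
    data = torch.randn(256, 4096, device="cuda")
    target = torch.randn(256, 4096, device="cuda")

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = loss_fn(model(data), target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.synchronize()
    return time.time() - start

print("FP32 time (s):", benchmark(use_amp=False))
print("AMP  time (s):", benchmark(use_amp=True))
```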
## Accuracy: AMP (FP16), FP32

The advantage of using AMP for Deep Learning training is that the models converge to a similar final accuracy while providing improved training performance. To illustrate this point, for [Resnet 50 v1.5 training](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5#training-accuracy-nvidia-dgx-a100-8x-a100-40gb), we see the following accuracy results, where higher is better. Please note that the accuracy numbers below are sample numbers that are subject to run-to-run variance of up to 0.4%. Accuracy numbers for other models, including BERT, Transformer, ResNeXt-101, Mask-RCNN and DLRM, can be found in the [NVIDIA Deep Learning Examples GitHub repository](https://github.com/NVIDIA/DeepLearningExamples).

Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

<table width="460" border="0" cellspacing="5" cellpadding="5">
<tbody>
<tr>
<td><strong>epochs</strong></td>
<td><strong>Mixed Precision Top 1 (%)</strong></td>
<td><strong>TF32 Top 1 (%)</strong></td>
</tr>
<tr>
<td>90</td>
<td>76.93</td>
<td>76.85</td>
</tr>
</tbody>
</table>

Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

<table width="460" border="0" cellspacing="5" cellpadding="5">
<tbody>
<tr>
<td><strong>epochs</strong></td>
<td><strong>Mixed Precision Top 1 (%)</strong></td>
<td><strong>FP32 Top 1 (%)</strong></td>
</tr>
<tr>
<td>50</td>
<td>76.25</td>
<td>76.26</td>
</tr>
<tr>
<td>90</td>
<td>77.09</td>
<td>77.01</td>
</tr>
<tr>
<td>250</td>
<td>78.42</td>
<td>78.30</td>
</tr>
</tbody>
</table>

## Speedup Performance

### FP16 on NVIDIA V100 vs. FP32 on V100

AMP with FP16 is the most performant option for DL training on the V100. In Figure 2, we can observe that for various models, AMP on V100 provides a speedup of 1.5x to 5.5x over FP32 on V100 while converging to the same final accuracy.

<div class="text-center">
  <img src="{{ site.url }}/assets/images/nvidiafp32onv100.jpg" width="100%">
</div>

*Figure 2. Performance of mixed precision training on NVIDIA 8xV100 vs. FP32 training on 8xV100 GPU. Bars represent the speedup factor of V100 AMP over V100 FP32. The higher the better.*

### FP16 on NVIDIA A100 vs. FP16 on V100

AMP with FP16 remains the most performant option for DL training on the A100. In Figure 3, we can observe that for various models, AMP on A100 provides a speedup of 1.3x to 2.5x over AMP on V100 while converging to the same final accuracy.

<div class="text-center">
  <img src="{{ site.url }}/assets/images/nvidiafp16onv100.png" width="100%">
</div>

*Figure 3. Performance of mixed precision training on NVIDIA 8xA100 vs. 8xV100 GPU. Bars represent the speedup factor of A100 over V100. The higher the better.*

# Call to action

AMP provides a healthy speedup for Deep Learning training workloads on NVIDIA Tensor Core GPUs, especially on the latest Ampere-generation A100 GPUs. You can start experimenting with AMP-enabled models and model scripts for A100, V100, T4 and other GPUs in the [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples) repository. NVIDIA PyTorch with native AMP support is available from the [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) version 20.06. We highly encourage existing `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core, available in the latest [PyTorch 1.6 release](https://pytorch.org/blog/pytorch-1.6-released/).

---
layout: blog_detail
title: 'Microsoft becomes maintainer of the Windows version of PyTorch'
author: Maxim Lukiyanov - Principal PM at Microsoft, Emad Barsoum - Group EM at Microsoft, Guoliang Hua - Principal EM at Microsoft, Nikita Shulga - Tech Lead at Facebook, Geeta Chauhan - PE Lead at Facebook, Chris Gottbrath - Technical PM at Facebook, Jiachen Pu - Engineer at Facebook
---

Along with the PyTorch 1.6 release, we are excited to announce that Microsoft has expanded its participation in the PyTorch community and is taking ownership of the development and maintenance of the PyTorch build for Windows.

According to the latest [Stack Overflow developer survey](https://insights.stackoverflow.com/survey/2020#technology-developers-primary-operating-systems), Windows remains the primary operating system for the developer community (46% Windows vs. 28% macOS). [Jiachen Pu](https://github.com/peterjc123) initially made a heroic effort to add support for PyTorch on Windows, but due to limited resources, Windows support for PyTorch has lagged behind other platforms. Lack of test coverage resulted in unexpected issues popping up every now and then. Some of the core tutorials, meant for new users to learn and adopt PyTorch, would fail to run. The installation experience was also not as smooth, given the lack of official PyPI support for PyTorch on Windows. Lastly, some PyTorch functionality was simply not available on the Windows platform, such as the TorchAudio domain library and distributed training support. To help alleviate this pain, Microsoft is happy to bring its Windows expertise to the table and bring PyTorch on Windows to its best possible self.

In the PyTorch 1.6 release, we have improved the core quality of the Windows build by bringing test coverage up to par with Linux for core PyTorch and its domain libraries and by automating tutorial testing. Thanks to the broader PyTorch community, which contributed TorchAudio support on Windows, we were able to add test coverage to all three domain libraries: TorchVision, TorchText and TorchAudio. In subsequent releases of PyTorch, we will continue improving the Windows experience based on community feedback and requests. So far, the feedback we have received from the community points to distributed training support and a better installation experience using pip as the next areas of improvement.

In addition to the native Windows experience, Microsoft released a preview adding [GPU compute support to Windows Subsystem for Linux (WSL) 2](https://blogs.windows.com/windowsdeveloper/2020/06/17/gpu-accelerated-ml-training-inside-the-windows-subsystem-for-linux/) distros, with a focus on enabling AI and ML developer workflows. WSL is designed for developers who want to run Linux-based tools directly on Windows. This preview enables valuable scenarios for a variety of frameworks and Python packages that utilize [NVIDIA CUDA](https://developer.nvidia.com/cuda/wsl) for acceleration and only support Linux. This means WSL customers using the preview can run Linux-based PyTorch applications on Windows unmodified, without the need for a traditional virtual machine or a dual-boot setup.

## Getting started with PyTorch on Windows

It's easy to get started with PyTorch on Windows. To install PyTorch using Anaconda with the latest GPU support, run the command below. To install different supported configurations of PyTorch, refer to the installation instructions on [pytorch.org](https://pytorch.org).

`conda install pytorch torchvision cudatoolkit=10.2 -c pytorch`
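
As a quick, optional sanity check (assuming the command above completed and an NVIDIA driver is installed), you can confirm from Python that PyTorch sees the GPU:

```python
import torch

# Print the installed PyTorch version and whether a CUDA-capable GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```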

Once you install PyTorch, learn more by visiting the [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) and [documentation](https://pytorch.org/docs/stable/index.html).

<div class="text-center">
  <img src="{{ site.url }}/assets/images/pytorch1.6.png" width="100%">
</div>

## Getting started with PyTorch on Windows Subsystem for Linux

The [preview of NVIDIA CUDA support in WSL](https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-cuda-in-wsl) is now available to Windows Insiders running Build 20150 or higher. In WSL, the command to install PyTorch using Anaconda is the same as the above command for native Windows. If you prefer pip, use the command below.

`pip install torch torchvision`
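
If you want to verify the CUDA-on-WSL preview from inside your WSL distro (assuming the preview driver is set up as described in the linked docs), a minimal smoke test is to run a small GPU computation:

```python
import torch

# Allocate a tensor on the GPU and run a small matrix multiply; this only works
# if the CUDA-on-WSL preview is functioning and PyTorch can see the GPU.
print(torch.cuda.is_available())
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
print(y.device, float(y.mean()))
```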

You can use the same tutorials and documentation inside your WSL environment as on native Windows. This functionality is still in preview, so if you run into issues with WSL, please share feedback via the [WSL GitHub repo](https://github.com/microsoft/WSL); for issues with NVIDIA CUDA support, share feedback via NVIDIA's [Community Forum for CUDA on WSL](https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-on-windows-subsystem-for-linux/303).

## Feedback

If you find gaps in the PyTorch experience on Windows, please let us know on the [PyTorch discussion forum](https://discuss.pytorch.org/c/windows/26) or file an issue on [GitHub](https://github.com/pytorch/pytorch) using the "module: windows" label.
---
layout: blog_detail
title: 'PyTorch feature classification changes'
author: Team PyTorch
---

Traditionally, features in PyTorch were classified as either Stable or Experimental, with an implicit third option of testing bleeding-edge features by building master or installing nightly builds (available via prebuilt whls). This has, in a few cases, caused some confusion around the level of readiness, commitment to the feature and backward compatibility that can be expected from a user perspective. Moving forward, we'd like to better classify the three types of features as well as define explicitly here what each means from a user perspective.

# New Feature Designations

We will continue to have three designations for features but, as mentioned, with a few changes: Stable, Beta (previously Experimental) and Prototype (previously Nightlies). Below is a brief description of each and a comment on the backward compatibility expected:

## Stable

Nothing changes here. A Stable feature means that the user value-add is or has been proven, the API isn't expected to change, the feature is performant and all documentation exists to support end user adoption.

*Level of commitment*: We expect to maintain these features long term. Generally there should be no major performance limitations or gaps in documentation, and we also expect to maintain backwards compatibility (although breaking changes can happen, and notice will be given one release ahead of time).

## Beta

We previously called these features 'Experimental' and we found that this created confusion amongst some of our users. In the case of a Beta-level feature, the value add, similar to a Stable feature, has been proven (e.g. pruning is a commonly used technique for reducing the number of parameters in NN models, independent of the implementation details of our particular choices) and the feature generally works and is documented. This feature is tagged as Beta because the API may change based on user feedback, because the performance needs to improve, or because coverage across operators is not yet complete.

*Level of commitment*: We are committing to seeing the feature through to the Stable classification. We are, however, not committing to backwards compatibility. Users can depend on us providing a solution for problems in this area going forward, but the APIs and performance characteristics of this feature may change.

<div class="text-center">
  <img src="{{ site.url }}/assets/images/install-matrix.png" width="100%">
</div>

## Prototype

Previously these were features known only to developers who paid close attention to RFCs and to features landing in master. In this case the feature is not available as part of binary distributions like PyPI or Conda (except maybe behind run-time flags), but we would like to get high-bandwidth partner feedback ahead of a real release in order to gauge utility and any changes we need to make to the UX. To test these kinds of features we would, depending on the feature, recommend building from master or using the nightly whls that are made available on pytorch.org. For each Prototype feature, a pointer to draft docs or other instructions will be provided.

*Level of commitment*: We are committing to gathering high-bandwidth feedback only. Based on this feedback and potential further engagement between community members, we as a community will decide if we want to upgrade the level of commitment or to fail fast. Additionally, while some of these features might be more speculative (e.g. new frontend APIs), others have obvious utility (e.g. model optimization) but may be in a state where gathering feedback outside of high-bandwidth channels is not practical, e.g. the feature may be in an earlier state, may be moving fast (PRs are landing too quickly to catch a major release) and/or generally active development is underway.

# What changes for current features?

First and foremost, you can find these designations on [pytorch.org/docs](http://pytorch.org/docs). We will also be linking any early-stage features here for clarity.

Additionally, the following features will be reclassified under this new rubric:

1. [High Level Autograd APIs](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api): Beta (was Experimental)
2. [Eager Mode Quantization](https://pytorch.org/docs/stable/quantization.html): Beta (was Experimental)
3. [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html): Prototype (was Experimental)
4. [TorchScript/RPC](https://pytorch.org/docs/stable/rpc.html#rpc): Prototype (was Experimental)
5. [Channels Last Memory Layout](https://pytorch.org/docs/stable/tensor_attributes.html#torch-memory-format): Beta (was Experimental)
6. [Custom C++ Classes](https://pytorch.org/docs/stable/jit.html?highlight=experimental): Beta (was Experimental)
7. [PyTorch Mobile](https://pytorch.org/mobile/home/): Beta (was Experimental)
8. [Java Bindings](https://pytorch.org/docs/stable/packages.html#): Beta (was Experimental)
9. [Torch.Sparse](https://pytorch.org/docs/stable/sparse.html?highlight=experimental#): Beta (was Experimental)

Cheers,

Joe, Greg, Woo & Jessica

assets/images/install-matrix.png (34.8 KB)
