
Commit 4802c42

cjyabraham and kyliewd authored
PyTorch 2.2 blog posts (pytorch#1566)
* Added PyTorch 2.2 blog post
* added another blog post
* highlight code

Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
Co-authored-by: Kylie Wagar-Dirks <107439830+kyliewd@users.noreply.github.com>
1 parent e282ce1 commit 4802c42

File tree

2 files changed, +263 -0 lines changed
+133
@@ -0,0 +1,133 @@
---
layout: blog_detail
title: "New Library Updates in PyTorch 2.2"
---

## Summary
We are bringing a number of improvements to the current PyTorch libraries, alongside the PyTorch 2.2 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch.

<table class="table table-bordered">
  <tr>
    <td colspan="3" style="font-weight: 600; text-align: center;">Latest Stable Library Versions (<a href="https://pytorch.org/docs/stable/index.html">Full List</a>)*</td>
  </tr>
  <tr>
    <td>TorchArrow 0.1.0</td>
    <td>TorchRec 0.6.0</td>
    <td>TorchVision 0.17</td>
  </tr>
  <tr>
    <td>TorchAudio 2.2.0</td>
    <td>TorchServe 0.9.0</td>
    <td>TorchX 0.7.0</td>
  </tr>
  <tr>
    <td>TorchData 0.7.1</td>
    <td>TorchText 0.17.0</td>
    <td>PyTorch on XLA Devices 2.1</td>
  </tr>
</table>

*To see [prior versions](https://pytorch.org/docs/stable/index.html) or (unstable) nightlies, click on versions in the top left menu above ‘Search Docs’.
## TorchRL

### Feature: TorchRL’s Offline RL Data Hub

TorchRL now provides one of the largest dataset hubs for offline RL and imitation learning, all under a single data format (TED, the TorchRL Episode Data format). This makes it easy to swap between different sources in a single training loop. It is also now possible to combine datasets from different sources through the ReplayBufferEnsemble class. The data processing is fully customizable. Sources include simulated tasks (Minari, D4RL, VD4RL), robotic datasets (Roboset, Open X-Embodiment) and gaming (GenDGRL/ProcGen, Atari/DQN). Check these out in the [documentation](https://pytorch.org/rl/reference/data.html#datasets).

Aside from these changes, our replay buffers can now be dumped to disk using the `.dumps()` method, which serializes them via the TensorDict API; this is faster, safer and more memory-efficient than torch.save.
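
As a minimal sketch (assuming a recent torchrl release; the checkpoint path and tensor shapes are arbitrary):

```python
import torch
from tensordict import TensorDict
from torchrl.data import LazyMemmapStorage, TensorDictReplayBuffer

# Build a small buffer backed by memory-mapped storage
buffer = TensorDictReplayBuffer(storage=LazyMemmapStorage(max_size=1_000))
data = TensorDict(
    {"obs": torch.randn(16, 4), "reward": torch.randn(16, 1)},
    batch_size=[16],
)
buffer.extend(data)

buffer.dumps("/tmp/my_buffer")  # serialize to disk via the TensorDict API
buffer.loads("/tmp/my_buffer")  # restore the buffer state later
```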
Finally, replay buffers can now be read and written from separate processes on the same machine without any extra code needed from the user!
### TorchRL2Gym environment API

To facilitate TorchRL’s integration in existing codebases and to let users enjoy all the features of TorchRL’s environment API (execution on device, batched operations, transforms, etc.), we provide a TorchRL-to-gym API that lets users register any TorchRL environment in gym or gymnasium. This in turn makes TorchRL a universal lib-to-gym converter that works across stateful (e.g., dm_control) and stateless (e.g., Brax, Jumanji) environments. The feature is thoroughly detailed in the [doc](https://pytorch.org/rl/reference/generated/torchrl.envs.EnvBase.html#torchrl.envs.EnvBase.register_gym). The info_dict reading API has also been improved.
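
A hedged sketch of the registration flow (this assumes gymnasium and dm_control are installed, that constructor kwargs are forwarded by `register_gym` as in the linked doc, and the id `"torchrl/pendulum-v0"` is an arbitrary example name):

```python
import gymnasium as gym
from torchrl.envs import DMControlEnv

# Register a TorchRL-wrapped dm_control task under a gym/gymnasium id;
# env_name/task_name are forwarded to the DMControlEnv constructor.
DMControlEnv.register_gym(
    "torchrl/pendulum-v0",
    backend="gymnasium",
    env_name="pendulum",
    task_name="swingup",
)

env = gym.make("torchrl/pendulum-v0")
obs, info = env.reset()
```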
### Environment speedups

We added the option of executing environments on a different device than the one used to deliver data in ParallelEnv. We also sped up the GymLikeEnv class to the point that it is now competitive with gym itself.
### Scaling objectives

The most popular objectives for RLHF and training at scale (PPO and A2C) are now compatible with FSDP and DDP models!
## TensorDict

### Feature: MemoryMappedTensor to replace MemmapTensor

We provide a much more efficient mmap backend for TensorDict: MemoryMappedTensor, which directly subclasses torch.Tensor. It comes with a set of construction utilities such as `from_tensor`, `empty` and many more. MemoryMappedTensor is now much safer and faster than its predecessor. The library remains fully compatible with the previous class to ease the transition.

We also introduce a new set of multithreaded serialization methods that make tensordict serialization highly competitive with torch.save, with serialization and deserialization speeds for LLMs more than [3x faster than with torch.save](https://github.com/pytorch/tensordict/pull/592#issuecomment-1850761831).
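
A rough sketch of both features (assuming a recent tensordict release; the path and thread count are arbitrary):

```python
import torch
from tensordict import MemoryMappedTensor, TensorDict

# MemoryMappedTensor is a plain torch.Tensor subclass backed by an mmap
mmap_t = MemoryMappedTensor.from_tensor(torch.randn(1024, 1024))
assert isinstance(mmap_t, torch.Tensor)

# Multithreaded (de)serialization of a whole TensorDict
td = TensorDict({"weights": torch.randn(4, 1024)}, batch_size=[4])
td.memmap("/tmp/td_ckpt", num_threads=8)      # write to disk
td2 = TensorDict.load_memmap("/tmp/td_ckpt")  # read back
```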
### Feature: Non-tensor data within TensorDict

It is now possible to carry non-tensor data in a tensordict through the `NonTensorData` tensorclass. This makes it possible to build tensordicts with metadata. The `memmap` API is fully compatible with these values, allowing users to seamlessly serialize and deserialize such objects. To store non-tensor data in a tensordict, simply assign it using the `__setitem__` method, as in the sketch below.
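
For instance (a minimal sketch; the key and value are arbitrary):

```python
import torch
from tensordict import TensorDict

td = TensorDict({"obs": torch.zeros(3, 4)}, batch_size=[3])
td["metadata"] = "collected-2024-01-30"  # wrapped in NonTensorData under the hood
assert td["metadata"] == "collected-2024-01-30"
```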
### Efficiency improvements

The runtime of several methods has been improved, including unbind, split, map and even TensorDict instantiation. Check our [benchmarks](https://pytorch.org/tensordict/dev/bench/)!
## TorchRec/fbgemm_gpu

### VBE

TorchRec now natively supports VBE (variable batched embeddings) within the `EmbeddingBagCollection` module. This allows a variable batch size per feature, unlocking sparse input data deduplication, which can greatly speed up embedding lookup and all-to-all time. To enable it, simply initialize `KeyedJaggedTensor` with the `stride_per_key_per_rank` and `inverse_indices` fields, which specify the batch size per feature and the inverse indices to reindex the embedding output, respectively.
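
A hedged sketch of what such an input could look like (field semantics follow the torchrec KeyedJaggedTensor docs; the values, per-rank strides and inverse indices here are illustrative only):

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([1, 2, 3, 4, 5]),
    lengths=torch.tensor([2, 1, 2]),  # f1: one bag of 2; f2: bags of 1 and 2
    # per-feature batch size on this rank: f1 has batch 1, f2 has batch 2
    stride_per_key_per_rank=[[1], [2]],
    # per-feature indices that re-expand deduplicated rows to the global batch
    inverse_indices=(["f1", "f2"], torch.tensor([[0, 0], [0, 1]])),
)
```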
In addition to the TorchRec library changes, [fbgemm_gpu](https://pytorch.org/FBGEMM/) has added support for variable batch size per feature in TBE. [VBE](https://github.com/pytorch/FBGEMM/pull/1752) is enabled on split TBE training for both weighted and unweighted cases. To use VBE, please make sure to use the latest fbgemm_gpu version.
### Embedding offloading

This technique refers to using CUDA UVM to cache ‘hot’ embeddings (i.e., storing embedding tables in host memory with a cache in HBM) and prefetching the cache. Embedding offloading makes it possible to run larger models with fewer GPUs while maintaining competitive performance. Use the prefetching pipeline ([PrefetchTrainPipelineSparseDist](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/train_pipeline.py?#L1056)) and pass in the [per-table cache load factor](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/types.py#L457) and the [prefetch_pipeline](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/types.py#L460) flag through constraints in the planner to use this feature.

fbgemm_gpu introduced [UVM cache pipeline prefetching](https://github.com/pytorch/FBGEMM/pull/1893) in [v0.5.0](https://github.com/pytorch/FBGEMM/releases/tag/v0.5.0) to speed up TBE. This allows cache insertion to be executed in parallel with the TBE forward/backward passes. To enable this feature, please be sure to use the latest fbgemm_gpu version.
### Trec.shard/shard_modules

These APIs replace embedding submodules with their sharded variants. The shard API applies to an individual embedding module, while the shard_modules API replaces all embedding modules and won’t touch other, non-embedding submodules.

Embedding sharding follows behavior similar to the prior TorchRec DistributedModuleParallel behavior, except that the ShardedModules have been made composable, meaning the modules are backed by [TableBatchedEmbeddingSlices](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/composable/table_batched_embedding_slice.py#L15), which are views into the underlying TBE (including .grad). This means that fused parameters are now returned by named_parameters(), including in DistributedModuleParallel.
## TorchVision

### The V2 transforms are now stable!

The `torchvision.transforms.v2` namespace was in BETA stage until now. It is now stable! Whether you’re new to Torchvision transforms or already experienced with them, we encourage you to start with [Getting started with transforms v2](https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_getting_started.html#sphx-glr-auto-examples-transforms-plot-transforms-getting-started-py) to learn more about what the new v2 transforms can do.

Browse our [main docs](https://pytorch.org/vision/stable/transforms.html#) for general information and performance tips. The available transforms and functionals are listed in the [API reference](https://pytorch.org/vision/stable/transforms.html#v2-api-ref). Additional information and tutorials can also be found in our [example gallery](https://pytorch.org/vision/stable/auto_examples/index.html#gallery), e.g. [Transforms v2: End-to-end object detection/segmentation example](https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_e2e.html#sphx-glr-auto-examples-transforms-plot-transforms-e2e-py) or [How to write your own v2 transforms](https://pytorch.org/vision/stable/auto_examples/transforms/plot_custom_transforms.html#sphx-glr-auto-examples-transforms-plot-custom-transforms-py).
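
A minimal sketch of the stable API (assuming torchvision 0.17; the image here is a random placeholder):

```python
import torch
from torchvision.transforms import v2

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),  # to float32, rescaled to [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = torch.randint(0, 256, size=(3, 256, 256), dtype=torch.uint8)
out = transforms(img)
```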
### Towards `torch.compile()` support

We are progressively adding support for `torch.compile()` to torchvision interfaces, reducing graph breaks and allowing dynamic shapes.

The torchvision ops (`nms`, `[ps_]roi_align`, `[ps_]roi_pool` and `deform_conv_2d`) are now compatible with `torch.compile` and dynamic shapes.

On the transforms side, the majority of [low-level kernels](https://github.com/pytorch/vision/blob/main/torchvision/transforms/v2/functional/__init__.py) (like `resize_image()` or `crop_image()`) should compile properly without graph breaks and with dynamic shapes. We are still addressing the remaining edge cases, moving up towards full functional support and classes, and you should expect more progress on that front with the next release.

_posts/2024-01-30-pytorch2-2.md

+130
@@ -0,0 +1,130 @@
---
layout: blog_detail
title: "PyTorch 2.2: FlashAttention-v2 integration, AOTInductor"
---
We are excited to announce the release of PyTorch® 2.2 ([release note](https://github.com/pytorch/pytorch/releases/tag/v2.2.0))! PyTorch 2.2 offers ~2x performance improvements to _[scaled_dot_product_attention](https://pytorch.org/docs/2.2/generated/torch.nn.functional.scaled_dot_product_attention.html)_ via [FlashAttention-v2](https://arxiv.org/abs/2307.08691) integration, as well as _AOTInductor_, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.

This release also includes improved _torch.compile_ support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are [deprecating macOS x86 support](https://github.com/pytorch/pytorch/issues/114602), and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits from 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page.

Summary:

* _[scaled_dot_product_attention](https://pytorch.org/docs/2.2/generated/torch.nn.functional.scaled_dot_product_attention.html)_ (SDPA) now supports _[FlashAttention-2](https://arxiv.org/abs/2307.08691)_, yielding around 2x speedups compared to previous versions.
* PyTorch 2.2 introduces a new ahead-of-time extension of [TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747) called _[AOTInductor](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html)_, designed to compile and deploy PyTorch programs for non-python server-side environments.
* _torch.distributed_ supports a new abstraction for initializing and representing ProcessGroups called _[device_mesh](https://pytorch.org/tutorials/recipes/distributed_device_mesh.html)_.
* PyTorch 2.2 ships a standardized, configurable logging mechanism called [TORCH_LOGS](https://pytorch.org/tutorials/recipes/torch_logs.html).
* A number of _torch.compile_ improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
* Please note that we are [deprecating macOS x86 support](https://github.com/pytorch/pytorch/issues/114602), and PyTorch 2.2.x will be the last version that supports macOS x64.
<table class="table table-bordered">
  <tr>
    <td style="width:25%"><strong>Stable</strong></td>
    <td><strong>Beta</strong></td>
    <td><strong>Performance Improvements</strong></td>
  </tr>
  <tr>
    <td></td>
    <td><a href="#bookmark=id.ok7v7pq0igzw">FlashAttention-2 Integration</a></td>
    <td><a href="#bookmark=id.rk3gf4pgy5m9">Inductor optimizations</a></td>
  </tr>
  <tr>
    <td></td>
    <td><a href="#bookmark=id.3qfc7y6r1dog">AOTInductor</a></td>
    <td><a href="#bookmark=id.gfep1ccb8bvk">aarch64 optimizations</a></td>
  </tr>
  <tr>
    <td></td>
    <td><a href="#bookmark=id.n2lkw22a8l2m">TORCH_LOGS</a></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td><em><a href="#bookmark=id.h50nybtt0fdm">device_mesh</a></em></td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td><a href="#bookmark=id.1lx0dkeu5zqt">Optimizer compilation</a></td>
    <td></td>
  </tr>
</table>

*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing).
## Beta Features

### [Beta] FlashAttention-2 support in _torch.nn.functional.scaled_dot_product_attention_

_[torch.nn.functional.scaled_dot_product_attention](https://pytorch.org/docs/2.2/generated/torch.nn.functional.scaled_dot_product_attention.html)_ (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to the previous version and reaching ~50-73% of theoretical maximum FLOPs/s on A100 GPUs.

More information on FlashAttention-2 is available in [this paper](https://arxiv.org/abs/2307.08691).

For a tutorial on how to use SDPA, please see [this tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html).
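
A minimal example (shapes are arbitrary; on supported GPUs with half-precision inputs, SDPA dispatches to the FlashAttention-2 kernel automatically):

```python
import torch
import torch.nn.functional as F

# (batch, num_heads, seq_len, head_dim)
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```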
### [Beta] AOTInductor: ahead-of-time compilation and deployment for torch.export-ed programs

AOTInductor is an extension of [TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747), designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts can be deployed in non-Python environments, which are frequently employed for server-side inference. Note that AOTInductor supports the same backends as Inductor, including CUDA, ROCm, and CPU.

For more information please see the [AOTInductor tutorial](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html).
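
A hedged sketch of the 2.2-era flow (it relies on the then-private `torch._export.aot_compile` entry point shown in the tutorial; that API has since evolved, and the model here is a toy placeholder):

```python
import torch

class MyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x @ x.T)

model = MyModel().eval()
example_inputs = (torch.randn(8, 8),)

# Compile ahead of time into a shared library that can later be loaded
# and run from a non-Python (e.g., C++) deployment environment.
so_path = torch._export.aot_compile(model, example_inputs)
print(so_path)  # path to the generated .so artifact
```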
### [Beta] Fine-grained configurable logging via TORCH_LOGS

PyTorch now ships a standardized, configurable logging mechanism that can be used to analyze the status of various subsystems such as compilation and distributed operations.

Logs can be enabled via the TORCH_LOGS environment variable. For example, to set the log level of TorchDynamo to logging.ERROR and the log level of TorchInductor to logging.DEBUG, pass _TORCH_LOGS="-dynamo,+inductor"_ to PyTorch.

For more information, please see the logging [documentation](https://pytorch.org/docs/2.2/logging.html) and [tutorial](https://pytorch.org/tutorials/recipes/torch_logs.html).
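
The same configuration can also be applied in-process; a small sketch mirroring the environment-variable example above (assuming PyTorch 2.2):

```python
import logging
import torch._logging

# Equivalent to TORCH_LOGS="-dynamo,+inductor"
torch._logging.set_logs(dynamo=logging.ERROR, inductor=logging.DEBUG)
```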
### [Beta] torch.distributed.device_mesh

PyTorch 2.2 introduces a new abstraction for representing the ProcessGroups involved in distributed parallelisms called _torch.distributed.device_mesh_. This abstraction allows users to represent inter-node and intra-node process groups via an N-dimensional array where, for example, one dimension can represent data parallelism in FSDP while another can represent tensor parallelism within FSDP.

For more information, see the [device_mesh tutorial](https://pytorch.org/tutorials/recipes/distributed_device_mesh.html).
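
A minimal sketch (assuming a launch via torchrun across 8 GPUs; the mesh shape and dimension names are illustrative):

```python
from torch.distributed.device_mesh import init_device_mesh

# 2 hosts x 4 GPUs: outer dim for data parallel, inner dim for tensor parallel
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

dp_group = mesh_2d["dp"].get_group()  # ProcessGroup for the data-parallel dim
tp_group = mesh_2d["tp"].get_group()  # ProcessGroup for the tensor-parallel dim
```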
### [Beta] Improvements to _torch.compile_-ing Optimizers

A number of improvements have been made to _torch.compile_-ing Optimizers, including reduced overhead and support for CUDA graphs.

More technical details of the improvements are available on [dev-discuss](https://dev-discuss.pytorch.org/t/compiling-the-optimizer-with-pt2/1669), and a recipe for _torch.compile_-ing optimizers is available [here](https://pytorch.org/tutorials/recipes/compiling_optimizer.html).
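
Following the recipe, compiling the optimizer step looks roughly like this (a toy model on CUDA):

```python
import torch

model = torch.nn.Linear(64, 64, device="cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

@torch.compile(fullgraph=False)
def opt_step():
    opt.step()

x = torch.randn(32, 64, device="cuda")
model(x).sum().backward()
opt_step()  # the Adam step executes through torch.compile
```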
## Performance Improvements

### Inductor Performance Optimizations

A number of performance optimizations have been added to TorchInductor, including [horizontal fusion support for torch.concat](https://github.com/pytorch/pytorch/pull/111437), [improved convolution layout optimizations](https://github.com/pytorch/pytorch/pull/114600), and improved _scaled_dot_product_attention_ [pattern](https://github.com/pytorch/pytorch/pull/109156) [matching](https://github.com/pytorch/pytorch/pull/110001).

For a complete list of inductor optimizations, please see the [Release Notes](https://github.com/pytorch/pytorch/tree/v2.2.0).
### aarch64 Performance Optimizations

PyTorch 2.2 includes a number of performance enhancements for aarch64, including support for [mkldnn weight pre-packing](https://github.com/pytorch/pytorch/pull/115037/files), improved [ideep](https://github.com/intel/ideep) [primitive caching](https://github.com/intel/ideep/pull/261), and improved inference speed via [fixed format kernel improvements](https://github.com/oneapi-src/oneDNN/pull/1590) to [oneDNN](https://github.com/oneapi-src/oneDNN/).

For a complete list of aarch64 optimizations, please see the [Release Notes](https://github.com/pytorch/pytorch/tree/v2.2.0).
