diff --git a/_posts/2021-09-08-pytorch-hackathon-2021.md b/_posts/2021-09-08-pytorch-hackathon-2021 copy.md similarity index 100% rename from _posts/2021-09-08-pytorch-hackathon-2021.md rename to _posts/2021-09-08-pytorch-hackathon-2021 copy.md diff --git a/_posts/2021-10-21-accelerating-pytorch-with-cuda-graphs.md b/_posts/2021-10-21-accelerating-pytorch-with-cuda-graphs.md new file mode 100644 index 000000000000..da7a84ffc9a2 --- /dev/null +++ b/_posts/2021-10-21-accelerating-pytorch-with-cuda-graphs.md @@ -0,0 +1,247 @@

---
layout: blog_detail
title: 'Accelerating PyTorch with CUDA Graphs'
author:
featured-img: 'assets/images/accelerating-pytorch-with-cuda-graphs/overview.png'
---

Today, we are pleased to announce that a new advanced CUDA feature, CUDA Graphs, has been brought to PyTorch. Modern DL frameworks have complicated software stacks that incur significant overheads associated with the submission of each operation to the GPU. When DL workloads are strong-scaled to many GPUs for performance, the time taken by each GPU operation diminishes to just a few microseconds, and in these cases the high work-submission latencies of frameworks often lead to low utilization of the GPU. As GPUs get faster and workloads are scaled to more devices, the likelihood of workloads suffering from these launch-induced stalls increases. To overcome these performance overheads, NVIDIA engineers worked with PyTorch developers to enable CUDA graph execution natively in PyTorch. This design was instrumental in scaling NVIDIA’s MLPerf workloads (implemented in PyTorch) to over 4000 GPUs in order to achieve [record-breaking performance](https://blogs.nvidia.com/blog/2021/06/30/mlperf-ai-training-partners/).

CUDA graphs support in PyTorch is just one more example of a long collaboration between NVIDIA and Facebook engineers. [torch.cuda.amp](https://pytorch.org/docs/stable/amp.html), for example, trains with half precision while maintaining the network accuracy achieved with single precision, automatically utilizing tensor cores wherever possible. AMP delivers up to 3X higher performance than FP32 with just a few lines of code change. Similarly, NVIDIA’s [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) was trained using PyTorch on up to 3072 GPUs. In PyTorch, one of the most performant methods to scale out GPU training is with [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) coupled with the NVIDIA Collective Communications Library ([NCCL](https://developer.nvidia.com/nccl)) backend.

# CUDA Graphs

[CUDA Graphs](https://developer.nvidia.com/blog/cuda-10-features-revealed/), which made its debut in CUDA 10, lets a series of CUDA kernels be defined and encapsulated as a single unit, i.e., a graph of operations, rather than a sequence of individually launched operations. It provides a mechanism to launch multiple GPU operations through a single CPU operation, and hence reduces launch overheads.

The benefits of CUDA graphs can be demonstrated with the simple example in Figure 1. On the top, a sequence of short kernels is launched one by one by the CPU, and the CPU launch overhead creates a significant gap between the kernels.
If we replace this sequence of kernels with a CUDA graph, we initially need to spend a little extra time building the graph and launching it as a whole on the first occasion, but subsequent executions are very fast, with very little gap between the kernels. The difference is more pronounced when the same sequence of operations is repeated many times, for example over many training steps: the initial cost of building and launching the graph is then amortized over the entire run of training iterations. For a more comprehensive introduction to the topic, see our blog [Getting Started with CUDA Graphs](https://developer.nvidia.com/blog/cuda-graphs) and the GTC talk [Effortless CUDA Graphs](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32082/).

Figure 1. Benefits of using CUDA graphs. CUDA graphs reduce launch overhead by bundling multiple GPU operations into a single launchable unit, i.e., a graph: on the top, five kernels are launched individually by the CPU; on the bottom, with CUDA graphs, they are all bundled into a single launch, reducing overhead.

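To make the idea concrete before the PyTorch-specific sections below, here is a minimal, hedged sketch (not from the original post; the toy `many_small_ops` function is invented for illustration) of capturing a sequence of short kernels with the beta `torch.cuda.CUDAGraph` API that this post describes later, so that replaying them costs a single CPU-side launch:

```python
import torch

def many_small_ops(x):
    # Stand-in for a sequence of short kernels whose launch overhead dominates.
    for _ in range(20):
        x = x * 1.01 + 0.5
    return x

static_x = torch.randn(1 << 20, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        many_small_ops(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole sequence as one graph, then replay it with a single launch.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = many_small_ops(static_x)

g.replay()  # static_y now holds the result for the current contents of static_x
```

The same pattern, with placeholder input and output tensors that are refilled between replays, is what the full training example in the PyTorch CUDA Graphs section below builds on.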
## NCCL support for CUDA graphs

The previously mentioned benefits of reducing launch overheads also extend to NCCL kernel launches. NCCL enables GPU-based collective and P2P communications. With [NCCL support for CUDA graphs](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/cudagraph.html), we can eliminate the NCCL kernel launch overhead as well.

Additionally, kernel launch timing can be unpredictable due to various CPU load and operating system factors. Such time skews can be harmful to the performance of NCCL collective operations. With CUDA graphs, kernels are clustered together, so performance is consistent across ranks in a distributed workload. This is especially useful in large clusters, where even a single slow node can bring down overall cluster-level performance.

For distributed multi-GPU workloads, NCCL is used for collective communications. If we look at training a neural network that leverages data parallelism, without NCCL support for CUDA graphs we need a separate launch for each forward and backward propagation and for each NCCL AllReduce. By contrast, with NCCL support for CUDA graphs, we can reduce launch overhead by combining the forward/backward propagation and the NCCL AllReduce into a single graph launch.

Figure 2. Looking at a typical neural network, with NCCL CUDA graph support all the kernel launches for the forward/backward propagation and the NCCL AllReduce can be bundled into a graph to reduce launch overhead.

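As an illustration of what this buys us, the following is a hedged sketch (not code from the MLPerf submissions; the model and sizes are invented) of capturing forward, backward, and an explicit gradient AllReduce into one graph. It assumes each rank has already called `torch.distributed.init_process_group` with the NCCL backend and that the installed NCCL version supports graph capture; production data-parallel training with `DistributedDataParallel` has additional requirements covered in the PyTorch CUDA graphs notes.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend="nccl", ...) has already run on this rank.
model = torch.nn.Linear(4096, 4096).cuda()
loss_fn = torch.nn.MSELoss()
static_input = torch.randn(64, 4096, device="cuda")
static_target = torch.randn(64, 4096, device="cuda")

def step():
    model.zero_grad(set_to_none=False)  # keep .grad buffers alive across replays
    loss = loss_fn(model(static_input), static_target)
    loss.backward()
    for p in model.parameters():
        # With NCCL support for CUDA graphs, this collective is recorded into
        # the graph alongside the forward/backward kernels.
        dist.all_reduce(p.grad)

# Warm up on a side stream, then capture the whole step as a single graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    step()

# Each iteration: copy a new batch into static_input/static_target, then g.replay().
```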
+ + +# PyTorch CUDA Graphs + + +From PyTorch v1.10, the CUDA graphs functionality is made available as a set of beta APIs. + +### API overview + +PyTorch supports the construction of CUDA graphs using [stream capture](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#creating-a-graph-using-stream-capture), which puts a CUDA stream in capture mode. CUDA work issued to a capturing stream doesn’t actually run on the GPU. Instead, the work is recorded in a graph. After capture, the graph can be launched to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data. + +Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for greatly reduced CPU overhead. A graph’s arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph’s work to the GPU with a single call to [cudaGraphLaunch](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1accfe1da0c605a577c22d9751a09597). Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit. + +You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other [constraints](https://pytorch.org/docs/master/notes/cuda.html#constraints)) and you suspect its runtime is at least somewhat CPU-limited. + +### API example + +PyTorch exposes graphs via a raw `torch.cuda.CUDAGraph` class and two convenience wrappers, `torch.cuda.graph` and `torch.cuda.make_graphed_callables`. + +torch.cuda.graph + +`torch.cuda.graph` is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to tensors that hold input and output data during capture. To run the graph on new input data, copy new data to the capture’s input tensor(s), replay the graph, then read the new output from the capture’s output tensor(s). + +If the entire network is capture safe, one can capture and replay the whole network as in the following example. + +```python +N, D_in, H, D_out = 640, 4096, 2048, 1024 +model = torch.nn.Sequential(torch.nn.Linear(D_in, H), + torch.nn.Dropout(p=0.2), + torch.nn.Linear(H, D_out), + torch.nn.Dropout(p=0.1)).cuda() +loss_fn = torch.nn.MSELoss() +optimizer = torch.optim.SGD(model.parameters(), lr=0.1) + +# Placeholders used for capture +static_input = torch.randn(N, D_in, device='cuda') +static_target = torch.randn(N, D_out, device='cuda') + +# warmup +# Uses static_input and static_target here for convenience, +# but in a real setting, because the warmup includes optimizer.step() +# you must use a few batches of real data. 
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.
```

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use `torch.cuda.make_graphed_callables()` to graph only the capture-safe part(s). This is demonstrated next.

torch.cuda.make_graphed_callables

`make_graphed_callables` accepts callables (functions or `nn.Module`s) and returns graphed versions. By default, callables returned by `make_graphed_callables()` are autograd-aware and can be used in the training loop as direct replacements for the functions or `nn.Module`s you passed. `make_graphed_callables()` internally creates `CUDAGraph` objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore, unlike with `torch.cuda.graph`, you don't need to handle those manually.

In the following example, data-dependent dynamic control flow means the network isn't capturable end-to-end, but `make_graphed_callables()` lets us capture and run graph-safe sections as graphs regardless:

```python
from itertools import chain

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()
```

# Example use cases

## MLPerf v1.0 training workloads

The PyTorch CUDA graphs functionality was instrumental in scaling NVIDIA’s MLPerf training v1.0 workloads (implemented in PyTorch) to over 4000 GPUs, setting new [records across the board](https://blogs.nvidia.com/blog/2021/06/30/mlperf-ai-training-partners/). We illustrate below two MLPerf workloads where the most significant gains were observed with the use of CUDA graphs, yielding up to ~1.7x speedup.

| | Number of GPUs | Speedup from CUDA-graphs |
|-----------------------------|----------------:|-------------------------:|
| Mask R-CNN | 74.042 | 91.340 |
| BERT | 67.668 | 87.402 |

Table 1. MLPerf training v1.0 performance improvement with PyTorch CUDA graph.

### Mask R-CNN

Deep learning frameworks use GPUs to accelerate computations, but a significant amount of code still runs on CPU cores. CPU cores process meta-data like tensor shapes in order to prepare arguments needed to launch GPU kernels. Processing meta-data is a fixed cost, while the cost of the computational work done by the GPUs is positively correlated with batch size. For large batch sizes, CPU overhead is a negligible percentage of total run time cost, but at small batch sizes CPU overhead can become larger than GPU run time. When that happens, GPUs go idle between kernel calls. This issue can be identified on the NSight timeline plot in Figure 3, which shows the “backbone” portion of Mask R-CNN with a per-GPU batch size of 1 before graphing. The green portion shows CPU load, while the blue portion shows GPU load. In this profile we see that the CPU is maxed out at 100% load while the GPU is idle most of the time; there is a lot of empty space between GPU kernels.

Figure 3: NSight timeline plot of Mask R-CNN. The CPU is maxed out at 100% load while the GPU is idle most of the time, with a lot of empty space between GPU kernels.

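For readers who want a quick, framework-level proxy for such a timeline on their own models (a hedged sketch, not part of the MLPerf analysis; the stacked-Linear model is invented for illustration), the PyTorch profiler exposes the same symptom: at tiny batch sizes the CPU-side time of the hot operators rivals or exceeds their GPU time.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(30)]).cuda()
x = torch.randn(1, 256, device="cuda")  # tiny batch: launch overhead dominates

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

# If CPU time is comparable to (or larger than) CUDA time for the top operators,
# the workload is launch-bound and a likely candidate for CUDA graphs.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```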
CUDA graphs can automatically eliminate this CPU overhead when tensor shapes are static. A complete graph of all the kernel calls is captured during the first step; in subsequent steps, the entire graph is launched with a single op, eliminating all the CPU overhead, as observed in Figure 4.

Figure 4: CUDA graphs optimization. With a CUDA graph, the entire graph is launched with a single op, eliminating all the CPU overhead.

With graphing, we see that the GPU kernels are tightly packed and GPU utilization remains high. The graphed portion now runs in 6 ms instead of 31 ms, a speedup of 5x. We did not graph the entire model, mostly just the ResNet backbone, which resulted in an overall speedup of ~1.7x.

In order to increase the scope of the graph, we made some changes in the software stack to eliminate some of the CPU-GPU synchronization points. In MLPerf v1.0, this work included changing the implementation of the torch.randperm function to use CUB instead of Thrust, because the latter is a synchronous C++ template library. These improvements are available in the latest NGC container.

### BERT

Similarly, by graph capturing the model, we eliminate CPU overhead and the accompanying synchronization overhead. The CUDA graphs implementation results in a 1.12x performance boost for our max-scale BERT configuration. To maximize the benefits from CUDA graphs, it is important to keep the scope of the graph as large as possible. To achieve this, we modified the model script to remove CPU-GPU synchronizations during execution so that the full model can be graph captured. Furthermore, we also made sure that the tensor sizes during execution are static within the scope of the graph. For instance, in BERT only a specific subset of the total tokens contributes to the loss function, determined by a pre-generated mask tensor. Extracting the indices of valid tokens from this mask, and using these indices to gather the tokens that contribute to the loss, results in a tensor with a dynamic shape, i.e. a shape that is not constant across iterations. To make sure tensor sizes are static, instead of using dynamic-shape tensors in the loss computation we used static-shape tensors, with a mask indicating which elements are valid. As a result, all tensor shapes are static. Dynamic shapes also require CPU-GPU synchronization, since they involve the framework’s memory management on the CPU side. With only static shapes, no CPU-GPU synchronizations are necessary. This is shown in Figure 5, and a small sketch of the masking pattern follows the figure.

Figure 5. Synchronization-free training: by using a fixed-size tensor and a boolean mask as described in the text, we are able to eliminate the CPU synchronizations needed for dynamically sized tensors.

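To illustrate the pattern (a schematic sketch, not the actual BERT model code; the tensor sizes and mask are invented), below is a comparison of a dynamic-shape loss that gathers only the valid tokens against a static-shape loss that keeps the full tensor and applies the mask:

```python
import torch
import torch.nn.functional as F

vocab = 30522
logits = torch.randn(8, 128, vocab, device="cuda")      # [batch, seq, vocab]
labels = torch.randint(0, vocab, (8, 128), device="cuda")
mask = torch.rand(8, 128, device="cuda") > 0.85          # tokens that contribute to the loss

# Dynamic shapes: the number of selected tokens changes every iteration, so the
# gathered tensors have data-dependent sizes and force a CPU-GPU synchronization.
idx = mask.nonzero(as_tuple=True)
dynamic_loss = F.cross_entropy(logits[idx], labels[idx])

# Static shapes: compute a per-token loss over the full, fixed-size tensors,
# zero out the ignored positions with the mask, and normalize by the mask sum.
per_token = F.cross_entropy(
    logits.view(-1, vocab), labels.view(-1), reduction="none").view_as(labels)
static_loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
```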
+ + +## CUDA graphs in NVIDIA DL examples collection + +Single GPU use cases can also benefit from using CUDA Graphs. This is particularly true for workloads launching many short kernels with small batches. A good example is training and inference for recommender systems. Below we present preliminary benchmark results for NVIDIA's implementation of the Deep Learning Recommendation Model (DLRM) from our Deep Learning Examples collection. Using CUDA graphs for this workload provides significant speedups for both training and inference. The effect is particularly visible when using very small batch sizes, where CPU overheads are more pronounced. + +CUDA graphs are being actively integrated into other PyTorch NGC model scripts and the NVIDIA Github deep learning examples. Stay tuned for more examples on how to use it. + + +

Figure 6: CUDA graphs optimization for the DLRM model, for both training and inference. The impact is larger for smaller batch sizes, where CPU overheads are more pronounced.

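As a flavor of what graphed inference looks like for this kind of workload (a hedged sketch; the small MLP below is a made-up stand-in for the actual DLRM implementation), the capture-and-replay pattern from earlier in the post applies directly to a batch-size-1 serving loop:

```python
import torch

# A small MLP stands in for DLRM's dense towers; batch size 1 maximizes the
# relative cost of kernel launches, which is where graphs help the most.
mlp = torch.nn.Sequential(
    torch.nn.Linear(13, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1)).cuda().eval()

static_x = torch.randn(1, 13, device="cuda")

with torch.no_grad():
    # Warm up on a side stream, then capture a single inference step.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            mlp(static_x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = mlp(static_x)

# Serving loop: copy each request into the static input and replay the graph.
for request in [torch.randn(1, 13, device="cuda") for _ in range(5)]:
    static_x.copy_(request)
    g.replay()
    result = static_out.clone()
```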
# Call to action: CUDA Graphs in PyTorch v1.10

CUDA graphs can provide substantial benefits for workloads that comprise many small GPU kernels and are hence bogged down by CPU launch overheads. This has been demonstrated in our MLPerf efforts, optimizing PyTorch models. Many of these optimizations, including CUDA graphs, have been or will eventually be integrated into our PyTorch NGC model scripts [collection](https://ngc.nvidia.com/catalog/collections?orderBy=scoreDESC&pageNumber=0&query=pytorch&quickFilter=&filters=) and the NVIDIA [Github deep learning examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/). For now, check out our open-source MLPerf training v1.0 [implementation](https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA), which could serve as a good starting point to see CUDA graphs in action. Alternatively, try the PyTorch CUDA graphs API on your own workloads.

We thank many NVIDIANs and Facebook engineers for their discussions and suggestions:
[Karthik Mandakolathur US](mailto:karthik@nvidia.com),
[Tomasz Grel](mailto:tgrel@nvidia.com),
[Joey Conway](mailto:jconway@nvidia.com),
[Arslan Zulfiqar US](mailto:azulfiqar@nvidia.com)

## Author bios

[Vinh Nguyen AU](mailto:vinhn@nvidia.com),
[Michael Carilli US](mailto:mcarilli@nvidia.com),
[Sukru Burc Eryilmaz US](mailto:seryilmaz@nvidia.com),
[Vartika Singh US](mailto:vartikas@nvidia.com),
[Michelle Lin US](mailto:miclin@nvidia.com),
[Natalia Gimelshein](mailto:ngimel@fb.com),
[Alban Desmaison](mailto:albandes@fb.com),
Edward Yang

diff --git a/_posts/2021-10-21-pytorch-1.10-main-release.md b/_posts/2021-10-21-pytorch-1.10-main-release.md new file mode 100644 index 000000000000..d38efc0dc184 --- /dev/null +++ b/_posts/2021-10-21-pytorch-1.10-main-release.md @@ -0,0 +1,104 @@

---
layout: blog_detail
title: 'PyTorch 1.10 Release, including CUDA Graphs APIs, TorchScript improvements'
author: Team PyTorch
---

We are excited to announce the release of PyTorch 1.10. This release is composed of around 3,400 commits since 1.9, made by 426 contributors. We want to sincerely thank our community for continuously improving PyTorch.

PyTorch 1.10 updates are focused on improving training and performance of PyTorch, and developer usability. The full release notes are available [here](https://github.com/pytorch/pytorch/releases/tag/v1.10.0). Highlights include:
1. CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads
2. New features to optimize usability and performance of TorchScript - profile-directed typing in TorchScript & LLVM-based JIT Compiler for CPUs
3. Android NNAPI support now in beta

We are also releasing major updates to TorchAudio and TorchVision along with 1.10, as well as introducing TorchX - a new SDK for quickly building and deploying ML applications from research to production. See [this blog post](https://pytorch.org/blog/pytorch-1.10-new-library-releases/) for details. Features in PyTorch releases are classified as Stable, Beta, and Prototype. You can learn more about the definitions in [this blog post](https://pytorch.org/blog/pytorch-feature-classification-changes/).

# Frontend APIs

### (Stable) Python code transformations with FX

FX provides a Pythonic platform for transforming and lowering PyTorch programs. It is a toolkit for pass writers to facilitate Python-to-Python transformation of functions and nn.Module instances.
This toolkit aims to support a subset of Python language semantics—rather than the whole Python language—to facilitate ease of implementation of transforms. With 1.10, FX is moving to stable.

You can learn more about FX in the [official documentation](https://pytorch.org/docs/master/fx.html) and [GitHub examples](https://github.com/pytorch/examples/tree/master/fx) of program transformations implemented using ```torch.fx```.

### (Stable) *torch.special*

A ```torch.special``` module, analogous to [SciPy’s special module](https://docs.scipy.org/doc/scipy/reference/special.html), is now available in stable. The module has 30 operations, including gamma, Bessel, and error functions. Refer to this [documentation](https://pytorch.org/docs/master/special.html) for more details.

### (Stable) nn.Module Parametrization

```nn.Module``` parametrization, a feature that allows users to parametrize any parameter or buffer of an ```nn.Module``` without modifying the ```nn.Module``` itself, is available in stable. This release adds weight normalization (```weight_norm```), orthogonal parametrization (matrix constraints and part of pruning), and more flexibility when creating your own parametrization.

Refer to this [tutorial](https://pytorch.org/tutorials/intermediate/parametrizations.html) and the general [documentation](https://pytorch.org/docs/master/generated/torch.nn.utils.parametrizations.spectral_norm.html?highlight=parametrize) for more details.

### (Beta) CUDA Graphs APIs Integration

PyTorch now integrates CUDA Graphs APIs to reduce CPU overheads for CUDA workloads.

CUDA Graphs greatly reduce the CPU overhead for CPU-bound CUDA workloads and thus improve performance by increasing GPU utilization. For distributed workloads, CUDA Graphs also reduce jitter, and since parallel workloads have to wait for the slowest worker, reducing jitter improves overall parallel efficiency.

Integration allows seamless interop between the parts of the network captured by CUDA graphs and the parts of the network that cannot be captured due to graph limitations.

Read the [note](https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs) for more details and examples, and refer to the general [documentation](https://pytorch.org/docs/master/generated/torch.cuda.CUDAGraph.html#torch.cuda.CUDAGraph) for additional information.

# Distributed Training

### Distributed Training Releases Now in Stable

In 1.10, there are a number of features that are moving from beta to stable in the distributed package:

* **(Stable) Remote Module**: This feature allows users to operate a module on a remote worker as if using a local module, where the RPCs are transparent to the user. Refer to this [documentation](https://pytorch.org/docs/master/rpc.html#remotemodule) for more details.

* **(Stable) DDP Communication Hook**: This feature allows users to override how DDP synchronizes gradients across processes. Refer to this [documentation](https://pytorch.org/docs/master/ddp_comm_hooks.html) for more details.

* **(Stable) ZeroRedundancyOptimizer**: This feature can be used in conjunction with DistributedDataParallel to reduce the size of per-process optimizer states. With this stable release, it can now handle uneven inputs to different data-parallel workers. Check out this [tutorial](https://pytorch.org/tutorials/advanced/generic_join.html). We also improved the parameter partition algorithm to better balance memory and computation overhead across processes.
Refer to this [documentation](https://pytorch.org/docs/master/distributed.optim.html) and this [tutorial](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html) to learn more.

# Performance Optimization and Tooling

### (Beta) Profile-directed typing in TorchScript

TorchScript has a hard requirement for source code to have type annotations in order for compilation to be successful. For a long time, it was only possible to add missing or incorrect type annotations through trial and error (i.e., by fixing the type-checking errors generated by torch.jit.script one by one), which was inefficient and time-consuming.

Now, we have enabled profile-directed typing for torch.jit.script by leveraging existing tools like MonkeyType, which makes the process much easier, faster, and more efficient. For more details, refer to the [documentation](https://pytorch.org/docs/1.9.0/jit.html).

### (Beta) CPU Fusion

In PyTorch 1.10, we've added an LLVM-based JIT compiler for CPUs that can fuse together sequences of `torch` library calls to improve performance. While we've had this capability for some time on GPUs, this release is the first time we've brought compilation to the CPU.

You can check out a few sample performance results for yourself in this [Colab notebook](https://colab.research.google.com/drive/1xaH-L0XjsxUcS15GG220mtyrvIgDoZl6?usp=sharing).

### (Beta) PyTorch Profiler

The objective of PyTorch Profiler is to target the execution steps that are the most costly in time and/or memory, and visualize the workload distribution between GPUs and CPUs. PyTorch 1.10 includes the following key features:

* **Enhanced Memory View**: This helps you understand your memory usage better. This tool will help you avoid Out of Memory errors by showing active memory allocations at various points of your program run.

* **Enhanced Automated Recommendations**: This helps provide automated performance recommendations to help optimize your model. The tools recommend changes to batch size, TensorCore, memory reduction technologies, etc.

* **Distributed Training**: Gloo is now supported for distributed training jobs.

* **Correlate Operators in the Forward & Backward Pass**: This helps map the operators found in the forward pass to the backward pass, and vice versa, in a trace view.

* **TensorCore**: This tool shows the Tensor Core (TC) usage and provides recommendations for data scientists and framework developers.

Refer to this [documentation](https://pytorch.org/docs/stable/profiler.html) for details. Check out this [tutorial](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) to learn how to get started with this feature.

# PyTorch Mobile

### (Beta) Android NNAPI Support in Beta

Last year we [released prototype support](https://medium.com/pytorch/pytorch-mobile-now-supports-android-nnapi-e2a2aeb74534) for Android’s Neural Networks API (NNAPI). NNAPI allows Android apps to run computationally intensive neural networks on the most powerful and efficient parts of the chips that power mobile phones, including GPUs (Graphics Processing Units) and NPUs (specialized Neural Processing Units).

Try out this feature using the [tutorial](https://pytorch.org/tutorials/prototype/nnapi_mobilenetv2.html). Please provide your feedback or ask questions on [the forum](https://discuss.pytorch.org/c/mobile/18). You can also check out [this presentation](https://www.youtube.com/watch?v=B-2spa3UCTU) to learn more.
+ +### (Beta) PyTorch Bundle Inputs + +PyTorch now provides a utility that allows TorchScript models to have inputs bundled directly to them. It allows users to streamline the process of passing runnable inputs with a model. These inputs can be used to actually run the model in benchmarking applications or trace the used operators in something like mobile’s upcoming tracing based selective build. Also, they could be used to just specify input shapes for certain pipelines. + +You can find a tutorial for this feature here [], and provide your feedback on the [PyTorch Discussion Forum - Mobile](https://discuss.pytorch.org/c/mobile/18). + +Thanks for reading. If you’re interested in these updates and want to join the PyTorch community, we encourage you to join the [discussion forums](https://discuss.pytorch.org/) and [open GitHub issues](https://github.com/pytorch/pytorch/issues). To get the latest news from PyTorch, follow us on [Facebook](https://www.facebook.com/pytorch/), [Twitter](https://twitter.com/PyTorch), [Medium](https://medium.com/pytorch), [YouTube](https://www.youtube.com/pytorch), or [LinkedIn](https://www.linkedin.com/company/pytorch). + +Cheers! + +Team PyTorch diff --git a/_posts/2021-10-27-fx-based-feature-extraction.md b/_posts/2021-10-27-fx-based-feature-extraction.md new file mode 100644 index 000000000000..0e12a0a9aa6d --- /dev/null +++ b/_posts/2021-10-27-fx-based-feature-extraction.md @@ -0,0 +1,400 @@ +--- +layout: blog_detail +title: 'FX based Feature Extraction in TorchVision' +author: Alexander Soare and Francisco Massa +featured-img: 'assets/images/fx-based-feature-extraction/overview.png' +--- + + + +# Introduction + +[FX](https://pytorch.org/docs/stable/fx.html) based feature extraction is a new [TorchVision utility](https://pytorch.org/vision/stable/feature_extraction.html) that lets us access intermediate transformations of an input during the forward pass of a PyTorch Module. It does so by symbolically tracing the forward method to produce a graph where each node represents a single operation. Nodes are named in a human-readable manner such that one may easily specify which nodes they want to access. + +Did that all sound a little complicated? Not to worry as there’s a little in this article for everyone. Whether you’re a beginner or an advanced deep-vision practitioner, chances are you will want to know about FX feature extraction. If you still want more background on feature extraction in general, read on. If you’re already comfortable with that and want to know how to do it in PyTorch, skim ahead to Existing Methods in PyTorch: Pros and Cons. And if you already know about the challenges of doing feature extraction in PyTorch, feel free to skim forward to FX to The Rescue. + + +## A Recap On Feature Extraction + +We’re all used to the idea of having a deep neural network (DNN) that takes inputs and produces outputs, and we don’t necessarily think of what happens in between. Let’s just consider a ResNet-50 classification model as an example: + +

Figure 1: ResNet-50 takes an image of a bird and transforms that into the abstract concept "bird". Source: Bird image from ImageNet.

We know, though, that there are many sequential “layers” within the ResNet-50 architecture that transform the input step by step. In Figure 2 below, we peek under the hood to show the layers within ResNet-50, and we also show the intermediate transformations of the input as it passes through those layers.

Figure 2: ResNet-50 transforms the input image in multiple steps. Conceptually, we may access the intermediate transformation of the image after each one of these steps. Source: Bird image from ImageNet.

## Existing Methods In PyTorch: Pros and Cons

There were already a few ways of doing feature extraction in PyTorch prior to FX based feature extraction being introduced.

To illustrate these, let’s consider a simple convolutional neural network that does the following:

* Applies several “blocks” each with several convolution layers within.
* After several blocks, it uses a global average pool and flatten operation.
* Finally it uses a single output classification layer.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """
    Applies `num_layers` 3x3 convolutions each followed by ReLU then downsamples
    via 2x2 max pool.
    """

    def __init__(self, num_layers, in_channels, out_channels):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Sequential(
                nn.Conv2d(in_channels if i==0 else out_channels, out_channels, 3, padding=1),
                nn.ReLU()
             )
             for i in range(num_layers)]
        )
        self.downsample = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        for conv in self.convs:
            x = conv(x)
        x = self.downsample(x)
        return x


class CNN(nn.Module):
    """
    Applies several ConvBlocks each doubling the number of channels, and
    halving the feature map size, before taking a global average and classifying.
    """

    def __init__(self, in_channels, num_blocks, num_classes):
        super().__init__()
        first_channels = 64
        self.blocks = nn.ModuleList(
            [ConvBlock(
                2 if i==0 else 3,
                in_channels=(in_channels if i == 0 else first_channels*(2**(i-1))),
                out_channels=first_channels*(2**i))
             for i in range(num_blocks)]
        )
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.cls = nn.Linear(first_channels*(2**(num_blocks-1)), num_classes)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        x = self.global_pool(x)
        x = x.flatten(1)
        x = self.cls(x)
        return x


model = CNN(3, 4, 10)
out = model(torch.zeros(1, 3, 32, 32))  # This will be the final logits over classes
```

Let’s say we want to get the final feature map before global average pooling. We could…

### Modify the forward method

```python
def forward(self, x):
    for block in self.blocks:
        x = block(x)
    self.final_feature_map = x
    x = self.global_pool(x)
    x = x.flatten(1)
    x = self.cls(x)
    return x
```

Or return it directly:

```python
def forward(self, x):
    for block in self.blocks:
        x = block(x)
    final_feature_map = x
    x = self.global_pool(x)
    x = x.flatten(1)
    x = self.cls(x)
    return x, final_feature_map
```

That looks pretty easy. But there are some downsides here, which all stem from the same underlying issue: modifying the source code is not ideal:

* It’s not always easy to access and change given the practical considerations of a project.
* If we want flexibility (switching feature extraction on or off, or having variations on it), we need to further adapt the source code to support that.
* It’s not always just a question of inserting a single line of code. Think about how you would go about getting the feature map from one of the intermediate blocks with the way I’ve written this module.
* Overall, we’d rather avoid the overhead of maintaining source code for a model, when we actually don’t need to change anything about how it works.

One can see how these downsides can start to get a lot more thorny when dealing with larger, more complicated models, and trying to get at features from within nested submodules.
+ +### Write a new module using the parameters from the original one + +Following on the example from above, say we want to get a feature map from each block. We could write a new module like so: + +```python +class CNNFeatures(nn.Module): + def __init__(self, backbone): + super().__init__() + self.blocks = backbone.blocks + + def forward(self, x): + feature_maps = [] + for block in self.blocks: + x = block(x) + feature_maps.append(x) + return feature_maps + + +backbone = CNN(3, 4, 10) +model = CNNFeatures(backbone) +out = model(torch.zeros(1, 3, 32, 32)) # This is now a list of Tensors, each representing a feature map +``` + +In fact, this is much like the method that TorchVision used internally to make many of its detection models. + +Although this approach solves some of the issues with modifying the source code directly, there are still some major downsides: + +* It’s only really straight-forward to access the outputs of top-level submodules. Dealing with nested submodules rapidly becomes complicated. +* We have to be careful not to miss any important operations in between the input and the output. We introduce potential for errors in transcribing the exact functionality of the original module to the new module. + +Overall, this method and the last both have the complication of tying in feature extraction with the model’s source code itself. Indeed, if we examine the source code for TorchVision models we might suspect that some of the design choices were influenced by the desire to use them in this way for downstream tasks. + +### Use hooks + +Hooks move us away from the paradigm of writing source code, towards one of specifying outputs. Considering our toy CNN example above, and the goal of getting feature maps for each layer, we could use hooks like this: + + +```python +model = CNN(3, 4, 10) + +feature_maps = [] # This will be a list of Tensors, each representing a feature map + +def hook_feat_map(mod, inp, out): + feature_maps.append(out) + +for block in model.blocks: + block.register_forward_hook(hook_feat_map) + +out = model(torch.zeros(1, 3, 32, 32)) # This will be the final logits over classes +``` + +Now we have full flexibility in terms of accessing nested submodules, and we free ourselves of the responsibilities of fiddling with the source code. But this approach comes with its own downsides: + +* We can only apply hooks to modules. If we have functional operations (reshape, view, functional non-linearities, etc) for which we want the outputs, hooks won’t work directly on them. +* We have not modified anything about the source code, so the whole forward pass is executed, regardless of the hooks. If we only need to access early features without any need for the final output, this could result in a lot of useless computation. +* Hooks are not TorchScript friendly. + +Here’s a summary of the different methods and their pros/cons: + + +| | Can use source code as is without any modifications or rewriting | Full flexibility in accessing features | Drops unnecessary computational steps | TorchScript friendly | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| Modify forward method | NO | Technically yes. Depends on how much code you’re willing to write. So in practice, NO. 
| YES | YES | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| New module that reuses submodules / parameters of original module | NO | Technically yes. Depends on how much code you’re willing to write. So in practice, NO. | YES | YES | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| Hooks | YES | Mostly YES. Only outputs of submodules | NO | NO | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| + +Table 1: The pros (or cons) of some of the existing methods for feature extraction with PyTorch + +In the next section of this article, let’s see how we can get greens across the board. + + +## FX to The Rescue + +The natural question for some new-starters in Python and coding at this point might be: *“Can’t we just point to a line of code and tell Python or PyTorch that we want the result of that line?”* For those who have spent more time coding, the reason this can’t be done is clear: multiple operations can happen in one line of code, whether they are explicitly written there, or they are implicit as sub-operations. Just take this simple module as an example: + +```python +class MyModule(torch.nn.Module): + def __init__(self): + super().__init__() + self.param = torch.nn.Parameter(torch.rand(3, 4)) + self.submodule = MySubModule() + + def forward(self, x): + return self.submodule(x + self.param).clamp(min=0.0, max=1.0) +``` + +The forward method has a single line of code which we can unravel as: + +1. Add `self.param` to `x` +2. Pass x through self.submodule. Here we would need to consider the steps happening in that submodule. I’m just going to use dummy operation names for illustration: + I. submodule.op_1 + II. submodule.op_2 +3. Apply the clamp operation + +So even if we point at this one line, the question then is: “For which step do we want to extract the output?”. + +[FX](https://pytorch.org/docs/stable/fx.html) is a core PyTorch toolkit that (oversimplifying) does the unravelling I just mentioned. It does something called “symbolic tracing”, which means the Python code is interpreted and stepped through, operation-by-operation, using some dummy proxy for a real input. Introducing some nomenclature, each step as described above is considered a **“node”**, and consecutive nodes are connected to one another to form a **“graph”** (not unlike the common mathematical notion of a graph). Here are the “steps” above translated to this concept of a graph. + +

Figure 3: Graphical representation of the result of symbolically tracing our example of a simple forward method.

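To see these nodes concretely (a small illustrative sketch using `torch.fx` directly; the operations inside `MySubModule` are invented stand-ins for `submodule.op_1` and `submodule.op_2`), we can symbolically trace the module above and print its graph:

```python
import torch
from torch import fx, nn

class MySubModule(nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2  # stand-ins for submodule.op_1 and submodule.op_2

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.param = nn.Parameter(torch.rand(3, 4))
        self.submodule = MySubModule()

    def forward(self, x):
        return self.submodule(x + self.param).clamp(min=0.0, max=1.0)

traced = fx.symbolic_trace(MyModule())
print(traced.graph)  # one node per operation: add, relu, mul, clamp
```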
+ +Note that we call this a graph, and not just a set of steps, because it’s possible for the graph to branch off and recombine. Think of the skip connection in a residual block. This would look something like: + +

Figure 4: Graphical representation of a residual skip connection. The middle node is like the main branch of a residual block, and the final node represents the sum of the input and output of the main branch.

+ +Now, TorchVision’s **get_graph_node_names** function applies FX as described above, and in the process of doing so, tags each node with a human readable name. Let’s try this with our toy CNN model from the previous section: + +```python +model = CNN(3, 4, 10) +from torchvision.models.feature_extraction import get_graph_node_names +nodes, _ = get_graph_node_names(model) +print(nodes) +``` +which will result in: +```python +['x', 'blocks.0.convs.0.0', 'blocks.0.convs.0.1', 'blocks.0.convs.1.0', 'blocks.0.convs.1.1', 'blocks.0.downsample', 'blocks.1.convs.0.0', 'blocks.1.convs.0.1', 'blocks.1.convs.1.0', 'blocks.1.convs.1.1', 'blocks.1.convs.2.0', 'blocks.1.convs.2.1', 'blocks.1.downsample', 'blocks.2.convs.0.0', 'blocks.2.convs.0.1', 'blocks.2.convs.1.0', 'blocks.2.convs.1.1', 'blocks.2.convs.2.0', 'blocks.2.convs.2.1', 'blocks.2.downsample', 'blocks.3.convs.0.0', 'blocks.3.convs.0.1', 'blocks.3.convs.1.0', 'blocks.3.convs.1.1', 'blocks.3.convs.2.0', 'blocks.3.convs.2.1', 'blocks.3.downsample', 'global_pool', 'flatten', 'cls'] +``` + +We can read these node names as hierarchically organised “addresses” for the operations of interest. For example 'blocks.1.downsample' refers to the MaxPool2d layer in the second `ConvBlock`. + +[`create_feature_extractor`](https://pytorch.org/vision/stable/feature_extraction.html#torchvision.models.feature_extraction.create_feature_extractor), which is where all the magic happens, goes a few steps further than **`get_graph_node_names`**. It takes desired node names as one of the input arguments, and then uses more FX core functionality to: + +1. Assign the desired nodes as outputs. +2. Prune unnecessary downstream nodes and their associated parameters. +3. Translate the resulting graph back into Python code. +4. Return another PyTorch Module to the user. This has the python code from step 3 as the forward method. + +As a demonstration, here’s how we would apply `create_feature_extractor` to get the 4 feature maps from our toy CNN model + +```python +from torchvision.models.feature_extraction import create_feature_extractor + +# Confused about the node specification here? +# We are allowed to provide truncated node names, and `create_feature_extractor` +# will choose the last node with that prefix. +feature_extractor = create_feature_extractor( + model, return_nodes=['blocks.0', 'blocks.1', 'blocks.2', 'blocks.3']) + +# `out` will be a dict of Tensors, each representing a feature map +out = feature_extractor(torch.zeros(1, 3, 32, 32)) +``` + +It’s as simple as that. When it comes down to it, FX feature extraction is just a way of making it possible to do what some of us would have naively hoped for when we first started programming: *“just give me the output of this code (*points finger at screen)”*. + +- [ ] … does not require us to fiddle with source code. +- [ ] … provides full flexibility in terms of accessing any intermediate transformation of our inputs, whether they are the results of a module or a functional operation +- [ ] … does drop unnecessary computations steps once features have been extracted +- [ ] … and I didn’t mention this before, but it’s also TorchScript friendly! 
+ +Here’s that table again with another row added for FX feature extraction + + +| | Can use source code as is without any modifications or rewriting | Full flexibility in accessing features | Drops unnecessary computational steps | TorchScript friendly | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| Modify forward method | NO | Technically yes. Depends on how much code you’re willing to write. So in practice, NO. | YES | YES | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| New module that reuses submodules / parameters of original module | NO | Technically yes. Depends on how much code you’re willing to write. So in practice, NO. | YES | YES | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| Hooks | YES | Mostly YES. Only outputs of submodules | NO | NO | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| +| FX | YES | YES | YES | YES | +|-------------------------------------------------------------------|:-----------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:--------------------------------------:|:--------------------:| + +Table 2: A copy of Table 1 with an added row for FX feature extraction. FX feature extraction gets greens across the board! + + +## Current FX Limitations + +Although I would have loved to end the post there, FX does have some of its own limitations which boil down to: + +1. There may be some Python code that isn’t yet handled by FX when it comes to the step of interpretation and translation into a graph. +2. Dynamic control flow can’t be represented in terms of a static graph. + +The easiest thing to do when these problems crop up is to bundle the underlying code into a “leaf node”. Recall the example graph from Figure 3? Conceptually, we may agree that the `submodule` should be treated as a node in itself rather than a set of nodes representing the underlying operations. If we do so, we can redraw the graph as: + +

Figure 5: The individual operations within `submodule` (left, within the red box) may be consolidated into one node (right, node #2) if we consider `submodule` a "leaf" node.

+ + +We would want to do so if there is some problematic code within the submodule, but we don’t have any need for extracting any intermediate transformations from within it. In practice, this is easily achievable by providing a keyword argument to create_feature_extractor or get_graph_node_names. + + +```python +model = CNN(3, 4, 10) +nodes, _ = get_graph_node_names(model, tracer_kwargs={'leaf_modules': [ConvBlock]}) +print(nodes) +``` + +for which the output will be: + +```python +['x', 'blocks.0', 'blocks.1', 'blocks.2', 'blocks.3', 'global_pool', 'flatten', 'cls'] +``` + +Notice how, as compared to previously, all the nodes for any given `ConvBlock` are consolidated into a single node. + +We could do something similar with functions. For example, Python’s inbuilt `len` needs to be wrapped and the result should be treated as a leaf node. Here’s how you can do that with core FX functionality: + +```python +torch.fx.wrap('len') + +class MyModule(nn.Module): + def forward(self, x): + x += 1 + len(x) + +model = MyModule() +feature_extractor = create_feature_extractor(model, return_nodes=['add']) +``` + +For functions you define, you may instead use another keyword argument to `create_feature_extractor` (minor detail: here’s[ why you might want to do it this way instead](https://github.com/pytorch/pytorch/issues/62021#issue-950458396)): + + +```python +def myfunc(x): + return len(x) + +class MyModule(nn.Module): + def forward(self, x): + x += 1 + myfunc(x) + +model = MyModule() +feature_extractor = create_feature_extractor( + model, return_nodes=['add'], tracer_kwargs={'autowrap_functions': [myfunc]}) +``` + +Notice that none of the fixes above involved modifying source code. + +Of course, there may be times when the very intermediate transformation one is trying to get access to is within the same forward method or function that is causing problems. Here, we can’t just treat that module or function as a leaf node, because then we can’t access the intermediate transformations within. In these cases, some rewriting of the source code will be needed. Here are some examples (not exhaustive) + +- FX will raise an error when trying to trace through code with an `assert` statement. In this case you may need to remove that assertion or switch it with [`torch._assert`](https://pytorch.org/docs/stable/generated/torch._assert.html) (this is not a public function - so consider it a bandaid and use with caution). +- Symbolically tracing in-place changes to slices of tensors is not supported. You will need to make a new variable for the slice, apply the operation, then reconstruct the original tensor using concatenation or stacking. +- Representing dynamic control flow in a static graph is just not logically possible. See if you can distill the coded logic down to something that is not dynamic - see FX documentation for tips. + +In general, you may consult the FX documentation for more detail on the [limitations of symbolic tracing](https://pytorch.org/docs/stable/fx.html#limitations-of-symbolic-tracing) and the possible workarounds. + +## Conclusion + +We did a quick recap on feature extraction and why one might want to do it. Although there are existing methods for doing feature extraction in PyTorch they all have rather significant shortcomings. We learned how TorchVision’s FX feature extraction utility works and what makes it so versatile compared to the existing methods. 
While there are still some minor kinks to iron out for the latter, we understand the limitations, and can trade them off against the limitations of other methods depending on our use case. Hopefully by adding this new utility to your PyTorch toolkit, you’re now equipped to handle the vast majority of feature extraction requirements you may come across. + +Happy coding! + diff --git a/_posts/2021-6-15-pytorch-1.9-released.md b/_posts/2021-6-15-pytorch-1.9-released copy.md similarity index 100% rename from _posts/2021-6-15-pytorch-1.9-released.md rename to _posts/2021-6-15-pytorch-1.9-released copy.md diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image1.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image1.png new file mode 100644 index 000000000000..cc2dca9e4db6 Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image1.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image2.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image2.png new file mode 100644 index 000000000000..4b03cabd7938 Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image2.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image3.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image3.png new file mode 100644 index 000000000000..01bc2e1a4e7a Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image3.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image4.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image4.png new file mode 100644 index 000000000000..87c57d85e9a2 Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image4.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image6.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image6.png new file mode 100644 index 000000000000..908d1f5c96b8 Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image6.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image7.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image7.png new file mode 100644 index 000000000000..31e26545a257 Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image7.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/image8.png b/assets/images/accelerating-pytorch-with-cuda-graphs/image8.png new file mode 100644 index 000000000000..ce851019d5cd Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/image8.png differ diff --git a/assets/images/accelerating-pytorch-with-cuda-graphs/overview.png b/assets/images/accelerating-pytorch-with-cuda-graphs/overview.png new file mode 100644 index 000000000000..598929dec45d Binary files /dev/null and b/assets/images/accelerating-pytorch-with-cuda-graphs/overview.png differ diff --git a/assets/images/fx-based-feature-extraction/image1.png b/assets/images/fx-based-feature-extraction/image1.png new file mode 100644 index 000000000000..563cb7f02b33 Binary files /dev/null and b/assets/images/fx-based-feature-extraction/image1.png differ diff --git a/assets/images/fx-based-feature-extraction/image2.png b/assets/images/fx-based-feature-extraction/image2.png new file mode 100644 index 000000000000..437177444fbf Binary files /dev/null and b/assets/images/fx-based-feature-extraction/image2.png differ diff --git 
a/assets/images/fx-based-feature-extraction/image3.png b/assets/images/fx-based-feature-extraction/image3.png new file mode 100644 index 000000000000..c7942ebe996b Binary files /dev/null and b/assets/images/fx-based-feature-extraction/image3.png differ diff --git a/assets/images/fx-based-feature-extraction/image4.png b/assets/images/fx-based-feature-extraction/image4.png new file mode 100644 index 000000000000..75a62c9ea296 Binary files /dev/null and b/assets/images/fx-based-feature-extraction/image4.png differ diff --git a/assets/images/fx-based-feature-extraction/image5.png b/assets/images/fx-based-feature-extraction/image5.png new file mode 100644 index 000000000000..fa77c8748103 Binary files /dev/null and b/assets/images/fx-based-feature-extraction/image5.png differ