Add "Quantization in Practice" blogpost #912
Conversation
👷 Deploy Preview for pytorch-dot-org-preview processing. 🔨 Explore the source changes: c1b8fee 🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-dot-org-preview/deploys/6202972b0ccea50008fd004a
@woo-kim @holly1238 The MathJax sections are rendering smaller than the surrounding text. Do you have suggestions on how to fix this?
> If someone asks you what time it is, you don't respond "10:14:34:430705", but you might say "a quarter past 10".
Quantization has roots in information compression; in deep networks it refers to reducing the numerical precision of a network's weights and/or activations. Overparameterized DNNs have more degrees of freedom, and this makes them good candidates for information compression. When you quantize a model, two things generally happen: the model gets smaller and runs more efficiently. Processing 8-bit numbers is faster than processing 32-bit numbers, and a smaller model has a lower memory footprint and power consumption.
A source would help for the overparameterized point.
For the point that processing 8-bit numbers is faster: that's the case because hardware providers made it explicitly available, so it's worth mentioning.
There are a few different ways to quantize your model with PyTorch. In this blog post, we'll take a look at what each technique looks like in practice. I will use a non-standard model that is not traceable, to paint an accurate picture of how much effort is really needed when quantizing your model.
<div class="text-center">
<img src="/assets/images/quantization_gif.gif" width="60%">
Very cool image
To reconvert to floating point space, the inverse function is given by $\tilde r = (Q(r) - Z) \cdot S$. $\tilde r \neq r$, and their difference constitutes the **quantization error**.
The scaling factor $S$ is simply the ratio of the input range to the output range: $S = \frac{\beta - \alpha}{\beta_q - \alpha_q}$
why is the scaling factor important to highlight here?
A) to provide an intuition of where this number is coming from, and B) to help make sense of how a/symmetric quantization schemes are better/worse for a given use case
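To make the formulas above concrete, here is a minimal sketch. The clipping-range values are illustrative (not taken from the post); it computes $S$ and $Z$ for an affine uint8 scheme and round-trips a tensor through `torch.quantize_per_tensor` to show the quantization error:

```
import torch

# Illustrative clipping range [alpha, beta] and 8-bit unsigned output range
alpha, beta = -1.0, 3.5
alpha_q, beta_q = 0, 255

S = (beta - alpha) / (beta_q - alpha_q)   # scale: ratio of input range to output range
Z = round(alpha_q - alpha / S)            # zero-point: maps alpha onto alpha_q

r = torch.tensor([-0.9, 0.0, 1.234, 3.2])
q = torch.quantize_per_tensor(r, scale=S, zero_point=Z, dtype=torch.quint8)
r_tilde = q.dequantize()                  # r_tilde = (Q(r) - Z) * S
print(r_tilde - r)                        # the quantization error
```

In a symmetric scheme the zero-point is pinned (e.g. $Z = 0$ for `qint8`), which is part of why the choice of scheme matters for a given use case.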
### Quantization Schemes
$S, Z$ can be calculated and used for quantizing an entire tensor ("per-tensor"), or individually for each channel ("per-channel").
picture would help here
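In lieu of a picture, here is a small sketch (using observers purely for illustration) of how per-tensor and per-channel schemes differ in the quantization parameters they produce:

```
import torch
from torch.quantization import MinMaxObserver, PerChannelMinMaxObserver

w = torch.randn(8, 16)   # e.g. a weight tensor with 8 output channels

per_tensor = MinMaxObserver(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
per_tensor(w)
print(per_tensor.calculate_qparams())   # a single (scale, zero_point) pair

per_channel = PerChannelMinMaxObserver(ch_axis=0, dtype=torch.qint8,
                                       qscheme=torch.per_channel_symmetric)
per_channel(w)
scales, zero_points = per_channel.calculate_qparams()
print(scales.shape)                     # one scale per output channel: torch.Size([8])
```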
```
## FX GRAPH
from torch.quantization import quantize_fx
qconfig_dict = {"": torch.quantization.get_default_qconfig('fbgemm')}
```
what does `fbgemm` stand for? is this like saying `nn.Linear` in a `qconfig`?
```
model_quantized = quantize_fx.convert_fx(model_prepared)
```
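The snippet above is abridged in the diff. A fuller sketch of the FX Graph Mode post-training static flow (prepare, calibrate, convert), on a hypothetical toy model rather than the blog's model, looks roughly like this; note that on newer PyTorch versions `prepare_fx` also takes an `example_inputs` argument:

```
import copy
import torch
from torch import nn
from torch.quantization import quantize_fx

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))  # toy model
m = copy.deepcopy(model).eval()

qconfig_dict = {"": torch.quantization.get_default_qconfig('fbgemm')}

# Prepare: symbolically trace the model and insert observers
model_prepared = quantize_fx.prepare_fx(m, qconfig_dict)

# Calibrate: run representative data through the observers
with torch.no_grad():
    for _ in range(16):
        model_prepared(torch.randn(8, 16))

# Convert: swap observed ops for their quantized counterparts
model_quantized = quantize_fx.convert_fx(model_prepared)
print(model_quantized(torch.randn(8, 16)).shape)
```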
### Quantization-aware Training (QAT)
NVIDIA had a nice diagram on QAT; since this flow is a bit complex, a picture would help to explain it: https://developer.nvidia.com/blog/improving-int8-accuracy-using-quantization-aware-training-and-tao-toolkit/
Same comment for the other 2 techniques
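For readers who want the shape of the QAT flow in code before a diagram exists, here is a minimal eager-mode sketch on a hypothetical toy model; the stubs, qconfig and the single training step are all illustrative:

```
import torch
from torch import nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = ToyModel().train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# "Fine-tune" with fake-quant modules in place (one dummy step shown)
opt = torch.optim.SGD(model_prepared.parameters(), lr=1e-3)
model_prepared(torch.randn(8, 16)).sum().backward()
opt.step()

# Convert the fine-tuned model to a quantized one
model_quantized = torch.quantization.convert(model_prepared.eval())
```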
**Download the [notebook](https://gist.github.com/suraj813/735357e56321237950a0348b50f2f3b4) or run it on [Colab](https://colab.research.google.com/gist/suraj813/735357e56321237950a0348b50f2f3b4/fx-and-eager-mode-quantization-example.ipynb) (note that Colab runtimes may differ significantly from local machines).**
Traceable models can be easily quantized with FX Graph Mode, but it's possible the model you're using is not traceable end-to-end. Maybe it has loops or `if` statements on inputs (dynamic control flow), or relies on third-party libraries. The model I use in this example has [dynamic control flow and uses third-party libraries](https://github.com/facebookresearch/demucs/blob/v2/demucs/model.py). As a result, it cannot be symbolically traced directly. In this code walkthrough, I show how you can bypass this limitation by quantizing the child modules individually for FX Graph Mode, and how to patch Quant/DeQuant stubs in Eager Mode.
"The model we use", since this is an official tutorial.
It's likely that you can still use QAT by "fine-tuning" it on a sample of the training dataset, but I did not try it on demucs (yet).
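To sketch the FX workaround described above: quantize only a traceable child module and swap it back into the non-traceable parent. The toy model below is a stand-in (not demucs), and on newer PyTorch versions `prepare_fx` may additionally require `example_inputs`:

```
import torch
from torch import nn
from torch.quantization import quantize_fx

class Parent(nn.Module):
    """Not traceable end-to-end: forward() branches on the input shape."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, x):
        if x.shape[0] > 1:        # dynamic control flow
            x = x.flip(0)
        return self.body(x)

model = Parent().eval()
qconfig_dict = {"": torch.quantization.get_default_qconfig('fbgemm')}

# Quantize the traceable child on its own, then put it back into the parent
prepared = quantize_fx.prepare_fx(model.body, qconfig_dict)
with torch.no_grad():
    for _ in range(8):            # calibration with representative data
        prepared(torch.randn(4, 16))
model.body = quantize_fx.convert_fx(prepared)
print(model(torch.randn(4, 16)).shape)
```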
## Quantizing "real-world" models
This is interesting in its own right; I'd rather you either write it out in this blog post or do a follow-up post on quantizing non-traceable models.
Yeah, it's the most interesting part of the article imo. But it's also very verbose, containing code and commentary, which is why I've only linked to the notebook from here. A follow-up post might be easier to parse, thanks for the suggestion.
## What's next - Define-by-Run Quantization
This would be in the follow-up post too, and you can get people excited about reading that follow-up by briefly describing how cool it is and how it works.
Also make sure to synthesize everything people learned at the very end, so that your table and figures help this doc be a reference for people new to quantization.
The scaling factor $S$ is simply the ratio of the input range to the output range: $S = \frac{\beta - \alpha}{\beta_q - \alpha_q}$
where $[\alpha, \beta]$ is the clipping range of the input, i.e. the boundaries of permissible inputs, and $[\alpha_q, \beta_q]$ is the range in quantized output space that it is mapped to. For 8-bit quantization, the output range $\beta_q - \alpha_q \leq 2^8 - 1$.
The process of choosing the appropriate input range is known as **calibration**; commonly used methods are MinMax and Entropy.
should we mention observers here? calibration is just a step that runs some sample data through the model and also through the inserted observers, so that the values can be recorded and used to calculate quantization parameters
The `QConfig` ([code](https://github.com/PyTorch/PyTorch/blob/d6b15bfcbdaff8eb73fa750ee47cef4ccee1cd92/torch/ao/quantization/qconfig.py#L165)) NamedTuple specifies the observers and quantization schemes for the network's weights and activations. The default qconfig is at `torch.quantization.get_default_qconfig(backend)`, where `backend='fbgemm'` for x86 CPU and `backend='qnnpack'` for ARM.
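Per the suggestion above, a short sketch of what an observer does during calibration, and of how a `QConfig` bundles observer factories for activations and weights. `MinMaxObserver` is used purely as an illustration here:

```
import torch
from torch.quantization import MinMaxObserver, get_default_qconfig

obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
for _ in range(4):                  # "calibration": feed representative data
    obs(torch.randn(32, 16))
print(obs.min_val, obs.max_val)     # recorded clipping range [alpha, beta]
print(obs.calculate_qparams())      # derived (scale, zero_point)

# A QConfig bundles observer factories for activations and weights
qconfig = get_default_qconfig('fbgemm')
print(qconfig.activation(), qconfig.weight())
```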
## In PyTorch
there are two other dimensions: quantization mode (static/dynamic/weight-only) and backend (server CPU/mobile CPU/GPU). I have more info in my slides
are the gpu backends usable yet? i don't see this as a backend in
GPU path with TensorRT is in prototype for internal users, GPU path with cudnn and cuda is WIP
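To illustrate the quantization-mode dimension mentioned above, here is a one-call sketch of dynamic quantization (weights quantized ahead of time, activations quantized on the fly) on a hypothetical toy model:

```
import torch
from torch import nn

model_fp32 = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,          # model to quantize
    {nn.Linear},         # module types whose weights get int8
    dtype=torch.qint8,
)
print(model_int8(torch.randn(2, 16)).shape)
```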
<img src="/assets/images/quantization_gif.gif" width="60%">
</div>
## A quick introduction to quantization
would you like to integrate some part of this in the official quantization api?
do you mean the API itself, or in the docs? I'm happy to add it!
The documentation. We can add the explanations of core things like qconfig, quantized tensor, observer/fake_quant, and qscheme to the documentation: https://pytorch.org/docs/master/quantization.html#