update quantization doc: add x86 backend as default backend of server inference (#86794)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86794
Approved by: https://github.com/jgong5, https://github.com/kit1980
XiaobingSuper
2022-11-30 21:07:19 -05:00
committed by PyTorch MergeBot
parent 69d7afc799
commit 8b2f9887bf
2 changed files with 30 additions and 23 deletions


@@ -13,7 +13,7 @@ Default values for native configurations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Below is the output of the configuration for quantization of ops
-in fbgemm and qnnpack (PyTorch's default quantized backends).
+in x86 and qnnpack (PyTorch's default quantized backends).
 Results:
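For a quick local look at what these defaults resolve to, a minimal sketch (illustrative only, not the script that generates this page; it assumes a PyTorch build recent enough to accept 'x86' as a backend name)::

import torch

# Print the default qconfig for each of PyTorch's native quantized backends;
# 'x86' replaces 'fbgemm' as the recommended server-side default.
for backend in ('x86', 'qnnpack'):
    print(backend, torch.ao.quantization.get_default_qconfig(backend))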


@@ -256,11 +256,14 @@ PTSQ API Example::
 model_fp32.eval()
 # attach a global qconfig, which contains information about what kind
-# of observers to attach. Use 'fbgemm' for server inference and
-# 'qnnpack' for mobile inference. Other quantization configurations such
-# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
-# calibration techniques can be specified here.
-model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
+# of observers to attach. Use 'x86' for server inference and 'qnnpack'
+# for mobile inference. Other quantization configurations such as selecting
+# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
+# can be specified here.
+# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
+# for server inference.
+# model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
+model_fp32.qconfig = torch.quantization.get_default_qconfig('x86')
 # Fuse the activations to preceding layers, where applicable.
 # This needs to be done manually depending on the model architecture.
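For orientation, the lines above sit inside the eager-mode PTSQ flow; a compressed, self-contained sketch of that flow with the new 'x86' default (the tiny model below is hypothetical and purely illustrative)::

import torch
from torch import nn

# Eager-mode PTSQ on a tiny, purely illustrative model: attach the 'x86' qconfig,
# fuse conv+relu, calibrate with representative data, then convert to int8.
model_fp32 = nn.Sequential(
    torch.ao.quantization.QuantStub(),    # fp32 -> int8 boundary
    nn.Conv2d(1, 1, 1),
    nn.ReLU(),
    torch.ao.quantization.DeQuantStub(),  # int8 -> fp32 boundary
).eval()
model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')
model_fused = torch.ao.quantization.fuse_modules(model_fp32, [['1', '2']])
model_prepared = torch.ao.quantization.prepare(model_fused)
model_prepared(torch.randn(4, 1, 8, 8))   # calibration pass with representative data
model_int8 = torch.ao.quantization.convert(model_prepared)
model_int8(torch.randn(4, 1, 8, 8))       # now runs with quantized kernels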
@@ -352,11 +355,14 @@ QAT API Example::
 model_fp32.eval()
 # attach a global qconfig, which contains information about what kind
-# of observers to attach. Use 'fbgemm' for server inference and
-# 'qnnpack' for mobile inference. Other quantization configurations such
-# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
-# calibration techniques can be specified here.
-model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+# of observers to attach. Use 'x86' for server inference and 'qnnpack'
+# for mobile inference. Other quantization configurations such as selecting
+# symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
+# can be specified here.
+# Note: the old 'fbgemm' is still available but 'x86' is the recommended default
+# for server inference.
+# model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('x86')
 # fuse the activations to preceding layers, where applicable
 # this needs to be done manually depending on the model architecture
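The QAT flow differs mainly in the qconfig helper and the prepare step; a minimal sketch under the same assumptions (hypothetical toy model, 'x86' backend available in the build)::

import torch
from torch import nn

# QAT variant of the flow above: attach a QAT qconfig, fuse, insert fake-quant
# modules with prepare_qat, fine-tune, then convert after switching back to eval.
model_fp32 = nn.Sequential(
    torch.ao.quantization.QuantStub(),
    nn.Conv2d(1, 1, 1),
    nn.ReLU(),
    torch.ao.quantization.DeQuantStub(),
).eval()
model_fp32.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
model_fused = torch.ao.quantization.fuse_modules(model_fp32, [['1', '2']])
model_prepared = torch.ao.quantization.prepare_qat(model_fused.train())

# ... a real fine-tuning loop would go here; one forward pass stands in for it ...
model_prepared(torch.randn(4, 1, 8, 8))

model_prepared.eval()
model_int8 = torch.ao.quantization.convert(model_prepared)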
@@ -732,7 +738,7 @@ Backend/Hardware Support
 |                 |               |Quantization|Mode        |Mode Support|
 |                 |               |            |Quantization|            |
 +-----------------+---------------+------------+------------+------------+
-|server CPU       |fbgemm         |Supported   |All         |
+|server CPU       |fbgemm/onednn  |Supported   |All         |
 |                 |               |            |Supported   |
 +-----------------+---------------+            |            +
 |mobile CPU       |qnnpack/xnnpack|            |            |
@@ -746,30 +752,31 @@ Backend/Hardware Support
 Today, PyTorch supports the following backends for running quantized operators efficiently:
-* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `fbgemm <https://github.com/pytorch/FBGEMM>`_
+* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86`, optimized by `fbgemm <https://github.com/pytorch/FBGEMM>`_ and `onednn <https://github.com/oneapi-src/oneDNN>`_ (see the details in the `RFC <https://github.com/pytorch/pytorch/issues/83888>`_)
 * ARM CPUs (typically found in mobile/embedded devices), via `qnnpack <https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native/quantized/cpu/qnnpack>`_
 * (early prototype) support for NVidia GPU via `TensorRT <https://developer.nvidia.com/tensorrt>`_ through `fx2trt` (to be open sourced)
 Note for native CPU backends
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We expose both `fbgemm` and `qnnpack` with the same native pytorch quantized operators, so we need additional flag to distinguish between them. The corresponding implementation of `fbgemm` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantization.engine` to `fbgemm` or `qnnpack`.
+We expose both `x86` and `qnnpack` with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of `x86` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantized.engine` to `x86` or `qnnpack`.
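To see which engines a given build actually provides, and which one is currently selected, a quick check (a minimal sketch; the exact list depends on how PyTorch was built)::

import torch

# Engines compiled into this build (typically includes 'x86', 'fbgemm', 'qnnpack', ...).
print(torch.backends.quantized.supported_engines)

# Inspect the engine currently used for weight packing and quantized kernels,
# and override it if needed (only to a value listed above).
print(torch.backends.quantized.engine)
torch.backends.quantized.engine = 'x86'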
 When preparing a quantized model, it is necessary to ensure that qconfig
 and the engine used for quantized computations match the backend on which
 the model will be executed. The qconfig controls the type of observers used
-during the quantization passes. The qengine controls whether `fbgemm` or
-`qnnpack` specific packing function is used when packing weights for linear
-and convolution functions and modules. For example:
+during the quantization passes. The qengine controls whether the `x86` or
+`qnnpack` specific packing function is used when packing weights for
+linear and convolution functions and modules. For example:
-Default settings for fbgemm::
+Default settings for x86::
 # set the qconfig for PTQ
-qconfig = torch.quantization.get_default_qconfig('fbgemm')
+# Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
+qconfig = torch.quantization.get_default_qconfig('x86')
 # or, set the qconfig for QAT
-qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+qconfig = torch.quantization.get_default_qat_qconfig('x86')
 # set the qengine to control weight packing
-torch.backends.quantized.engine = 'fbgemm'
+torch.backends.quantized.engine = 'x86'
 Default settings for qnnpack::
@@ -996,11 +1003,11 @@ Custom API Example::
 Best Practices
 --------------
-1. If you are using the ``fbgemm`` backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for the ``quant\_min``, ``quant\_max``, e.g.
+1. If you are using the ``x86`` backend, you need to use 7 bits instead of 8 bits. Make sure you reduce the range for ``quant_min`` and ``quant_max``, e.g.
 if ``dtype`` is ``torch.quint8``, make sure to set a custom ``quant_min`` to be ``0`` and ``quant_max`` to be ``127`` (``255`` / ``2``)
 if ``dtype`` is ``torch.qint8``, make sure to set a custom ``quant_min`` to be ``-64`` (``-128`` / ``2``) and ``quant_max`` to be ``63`` (``127`` / ``2``), we already set this correctly if
 you call the `torch.ao.quantization.get_default_qconfig(backend)` or `torch.ao.quantization.get_default_qat_qconfig(backend)` function to get the default ``qconfig`` for
-``fbgemm`` or ``qnnpack`` backend
+the ``x86`` or ``qnnpack`` backend
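If you build a custom qconfig instead of using the helpers, the reduced range can be spelled out explicitly; a minimal sketch (the observer choices and the ``reduced_range_qconfig`` name are illustrative, not the exact library defaults)::

import torch
from torch.ao.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

# Activations: quint8 restricted to [0, 127] instead of [0, 255].
activation_observer = MinMaxObserver.with_args(
    dtype=torch.quint8, quant_min=0, quant_max=127)

# Weights: qint8 restricted to [-64, 63] instead of [-128, 127].
weight_observer = PerChannelMinMaxObserver.with_args(
    dtype=torch.qint8, quant_min=-64, quant_max=63,
    qscheme=torch.per_channel_symmetric)

reduced_range_qconfig = QConfig(
    activation=activation_observer, weight=weight_observer)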
Frequently Asked Questions
--------------------------