Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 12:54:11 +08:00)
update quantization doc: add x86 backend as default backend of server inference (#86794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86794 Approved by: https://github.com/jgong5, https://github.com/kit1980
Committed by: PyTorch MergeBot
Parent: 69d7afc799
Commit: 8b2f9887bf
@@ -13,7 +13,7 @@ Default values for native configurations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Below is the output of the configuration for quantization of ops
-in fbgemm and qnnpack (PyTorch's default quantized backends).
+in x86 and qnnpack (PyTorch's default quantized backends).
 
 Results:
 
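For readers who want to reproduce that output locally, the default qconfig of each native backend can simply be printed. A minimal sketch, assuming a PyTorch build recent enough to accept the ``'x86'`` string::

    import torch

    # Print the default observer configuration for each native quantized backend.
    for backend in ('x86', 'qnnpack'):
        print(backend, torch.ao.quantization.get_default_qconfig(backend))
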
@@ -256,11 +256,14 @@ PTSQ API Example::
     model_fp32.eval()
 
     # attach a global qconfig, which contains information about what kind
-    # of observers to attach. Use 'fbgemm' for server inference and
-    # 'qnnpack' for mobile inference. Other quantization configurations such
-    # as selecting symmetric or asymmetric quantization and MinMax or L2Norm
-    # calibration techniques can be specified here.
-    model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
+    # of observers to attach. Use 'x86' for server inference and 'qnnpack'
+    # for mobile inference. Other quantization configurations such as selecting
+    # symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
+    # can be specified here.
+    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default
+    # for server inference.
+    # model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
+    model_fp32.qconfig = torch.quantization.get_default_qconfig('x86')
 
     # Fuse the activations to preceding layers, where applicable.
     # This needs to be done manually depending on the model architecture.
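For readers landing on this diff without the surrounding document, a self-contained sketch of the eager-mode PTSQ flow that the hunk above edits could look as follows. The toy module ``M`` is hypothetical and only illustrates the steps; it assumes an x86 server build where the ``'x86'`` qconfig and engine are available::

    import torch

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.ao.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.relu = torch.nn.ReLU()
            self.dequant = torch.ao.quantization.DeQuantStub()

        def forward(self, x):
            return self.dequant(self.relu(self.conv(self.quant(x))))

    model_fp32 = M().eval()
    # 'x86' qconfig for server inference; use 'qnnpack' for mobile targets.
    model_fp32.qconfig = torch.ao.quantization.get_default_qconfig('x86')
    model_fused = torch.ao.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
    model_prepared = torch.ao.quantization.prepare(model_fused)
    model_prepared(torch.randn(4, 1, 8, 8))     # calibrate on representative data
    model_int8 = torch.ao.quantization.convert(model_prepared)
    model_int8(torch.randn(4, 1, 8, 8))         # runs int8 kernels
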
@@ -352,11 +355,14 @@ QAT API Example::
     model_fp32.eval()
 
     # attach a global qconfig, which contains information about what kind
-    # of observers to attach. Use 'fbgemm' for server inference and
-    # 'qnnpack' for mobile inference. Other quantization configurations such
-    # as selecting symmetric or asymmetric quantization and MinMax or L2Norm
-    # calibration techniques can be specified here.
-    model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+    # of observers to attach. Use 'x86' for server inference and 'qnnpack'
+    # for mobile inference. Other quantization configurations such as selecting
+    # symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques
+    # can be specified here.
+    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default
+    # for server inference.
+    # model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+    model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('x86')
 
     # fuse the activations to preceding layers, where applicable
     # this needs to be done manually depending on the model architecture
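Analogously, a minimal QAT sketch under the same assumptions (the toy model and the dummy training loop are illustrative only)::

    import torch

    # Hypothetical toy model; QuantWrapper adds the quant/dequant stubs for eager mode.
    float_model = torch.ao.quantization.QuantWrapper(
        torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()))

    float_model.qconfig = torch.ao.quantization.get_default_qat_qconfig('x86')
    prepared = torch.ao.quantization.prepare_qat(float_model.train())

    # A couple of dummy training steps with fake-quantization enabled.
    opt = torch.optim.SGD(prepared.parameters(), lr=0.01)
    for _ in range(3):
        opt.zero_grad()
        prepared(torch.randn(4, 16)).sum().backward()
        opt.step()

    quantized = torch.ao.quantization.convert(prepared.eval())
    quantized(torch.randn(4, 16))               # int8 inference
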
@@ -732,7 +738,7 @@ Backend/Hardware Support
 | | |Quantization|Mode |Mode Support|
 | | | |Quantization| |
 +-----------------+---------------+------------+------------+------------+
-|server CPU |fbgemm |Supported |All |
+|server CPU |fbgemm/onednn |Supported |All |
 | | | |Supported |
 +-----------------+---------------+ | +
 |mobile CPU |qnnpack/xnnpack| | |
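The kernel libraries in this table correspond to quantized engines at runtime; which engines a given PyTorch build actually ships with can be checked directly. A small sketch::

    import torch

    # Engines compiled into this build, e.g. ['none', 'fbgemm', 'x86', ...] depending on the build.
    print(torch.backends.quantized.supported_engines)
    # The engine currently used for weight packing and quantized kernels.
    print(torch.backends.quantized.engine)
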
@@ -746,30 +752,31 @@ Backend/Hardware Support
 
 Today, PyTorch supports the following backends for running quantized operators efficiently:
 
-* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `fbgemm <https://github.com/pytorch/FBGEMM>`_
+* x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations), via `x86` optimized by `fbgemm <https://github.com/pytorch/FBGEMM>`_ and `onednn <https://github.com/oneapi-src/oneDNN>`_ (see the details at `RFC <https://github.com/pytorch/pytorch/issues/83888>`_)
 * ARM CPUs (typically found in mobile/embedded devices), via `qnnpack <https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/native/quantized/cpu/qnnpack>`_
 * (early prototype) support for NVidia GPU via `TensorRT <https://developer.nvidia.com/tensorrt>`_ through `fx2trt` (to be open sourced)
 
 
 Note for native CPU backends
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-We expose both `fbgemm` and `qnnpack` with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of `fbgemm` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantized.engine` to `fbgemm` or `qnnpack`.
+We expose both `x86` and `qnnpack` with the same native PyTorch quantized operators, so we need an additional flag to distinguish between them. The corresponding implementation of `x86` and `qnnpack` is chosen automatically based on the PyTorch build mode, though users have the option to override this by setting `torch.backends.quantized.engine` to `x86` or `qnnpack`.
 
 When preparing a quantized model, it is necessary to ensure that qconfig
 and the engine used for quantized computations match the backend on which
 the model will be executed. The qconfig controls the type of observers used
-during the quantization passes. The qengine controls whether `fbgemm` or
-`qnnpack` specific packing function is used when packing weights for linear
-and convolution functions and modules. For example:
+during the quantization passes. The qengine controls whether `x86` or `qnnpack`
+specific packing function is used when packing weights for
+linear and convolution functions and modules. For example:
 
-Default settings for fbgemm::
+Default settings for x86::
 
     # set the qconfig for PTQ
-    qconfig = torch.quantization.get_default_qconfig('fbgemm')
+    # Note: the old 'fbgemm' is still available but 'x86' is the recommended default on x86 CPUs
+    qconfig = torch.quantization.get_default_qconfig('x86')
     # or, set the qconfig for QAT
-    qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
+    qconfig = torch.quantization.get_default_qat_qconfig('x86')
     # set the qengine to control weight packing
-    torch.backends.quantized.engine = 'fbgemm'
+    torch.backends.quantized.engine = 'x86'
 
 Default settings for qnnpack::
 
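To make the "qconfig and engine must match" point above concrete, here is a minimal end-to-end sketch with a hypothetical toy model, assuming the chosen backend appears in ``torch.backends.quantized.supported_engines``::

    import torch

    backend = 'x86'   # or 'qnnpack' on ARM / mobile-oriented builds

    # Toy float model; QuantWrapper inserts the quant/dequant stubs.
    float_model = torch.ao.quantization.QuantWrapper(
        torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())).eval()

    # The qconfig (observers) and the qengine (weight-packing kernels)
    # must refer to the same backend.
    float_model.qconfig = torch.ao.quantization.get_default_qconfig(backend)
    torch.backends.quantized.engine = backend

    prepared = torch.ao.quantization.prepare(float_model)
    prepared(torch.randn(1, 3, 32, 32))         # calibration pass
    quantized = torch.ao.quantization.convert(prepared)
    quantized(torch.randn(1, 3, 32, 32))        # runs with the chosen engine
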
@@ -996,11 +1003,11 @@ Custom API Example::
 Best Practices
 --------------
 
-1. If you are using the ``fbgemm`` backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for the ``quant_min``, ``quant_max``, e.g.
+1. If you are using the ``x86`` backend, we need to use 7 bits instead of 8 bits. Make sure you reduce the range for the ``quant_min``, ``quant_max``, e.g.
 if ``dtype`` is ``torch.quint8``, make sure to set a custom ``quant_min`` to be ``0`` and ``quant_max`` to be ``127`` (``255`` / ``2``);
 if ``dtype`` is ``torch.qint8``, make sure to set a custom ``quant_min`` to be ``-64`` (``-128`` / ``2``) and ``quant_max`` to be ``63`` (``127`` / ``2``). We already set this correctly if
 you call the `torch.ao.quantization.get_default_qconfig(backend)` or `torch.ao.quantization.get_default_qat_qconfig(backend)` function to get the default ``qconfig`` for the
-``fbgemm`` or ``qnnpack`` backend
+``x86`` or ``qnnpack`` backend
 
 Frequently Asked Questions
 --------------------------
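As an illustration of the 7-bit recommendation, a hedged sketch of a custom reduced-range qconfig follows; the observer choices (``MinMaxObserver`` for activations, ``PerChannelMinMaxObserver`` for weights) are assumptions made for the example, and the defaults returned by ``get_default_qconfig('x86')`` already handle this for you::

    import torch
    from torch.ao.quantization import QConfig, MinMaxObserver, PerChannelMinMaxObserver

    # Activations: quint8 restricted to [0, 127]; weights: qint8 restricted to [-64, 63].
    reduced_range_qconfig = QConfig(
        activation=MinMaxObserver.with_args(
            dtype=torch.quint8, quant_min=0, quant_max=127),
        weight=PerChannelMinMaxObserver.with_args(
            dtype=torch.qint8, qscheme=torch.per_channel_symmetric,
            quant_min=-64, quant_max=63),
    )
    # e.g. model.qconfig = reduced_range_qconfig before calling prepare()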