doc: document quantization mode and clean up quantization page
## Quantization Model

oneDNN supports two main categories of quantization:

- static quantization with scales only (symmetric) or scales and
  zero-points (asymmetric), where scales are applied after zero-point.
- dynamic quantization compliant with the Open Compute Project (OCP)
  Microscaling (MX) [formats specification][1].

To support quantization, primitives should be created and executed as
follows:

- during primitive descriptor creation, the source, weights, or
  destination memory descriptors use a low precision datatype
  (e.g. `s8` or `fp8_e4m3`);
- during primitive descriptor creation, the group sizes, data types, and
  broadcasting masks of the scaling factors and zero-points are
  provided using primitive attributes;
- during primitive execution, the actual quantization parameters are
  provided as arguments to the execute function.
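
As an illustration of these three steps, here is a minimal sketch that
creates an int8 matmul with a per-tensor source scale and per-column
weights scales, and passes the actual scale values at execution time.
The shapes, the engine setup, and the use of the `set_scales_mask`
convenience function are illustrative assumptions, not requirements of
the library.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

void int8_matmul_sketch() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 4, K = 8, N = 16;
    // Step 1: low precision data types in the memory descriptors.
    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Step 2: describe the quantization parameters with attributes.
    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);          // one scale for the whole src
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1); // one scale per column

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);

    // Step 3: pass the actual scale values at execution time.
    memory src_scale_m({{1}, memory::data_type::f32, memory::format_tag::x}, eng);
    memory wei_scale_m({{N}, memory::data_type::f32, memory::format_tag::x}, eng);
    memory src_m(src_md, eng), wei_m(wei_md, eng), dst_m(dst_md, eng);

    prim.execute(strm,
            {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
                    {DNNL_ARG_DST, dst_m},
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_SRC, src_scale_m},
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, wei_scale_m}});
    strm.wait();
}
~~~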

For performance reasons, each primitive implementation typically
supports only a subset of quantization parameter masks, group sizes,
and data type combinations. The supported and optimized combinations
are listed in each primitive's documentation page.

This guide does not cover how the appropriate scaling factors can be
found. Refer to the materials in the [Introduction](@ref dgaq_intro).

### Static quantization

The only formula for static quantization currently supported by oneDNN
applies the scale after the zero-point subtraction, as follows:

\f[
    x_{f32}[:] = scale_{x} \cdot (x_{quant}[:] - zp_{x})
\f]

where \f$x_{f32}\f$ and \f$x_{quant}\f$ are the non-quantized and
quantized representations of \f$x\f$ respectively, \f$scale_{x}\f$ is a
*scaling factor* in a floating-point format, \f$zp_{x}\f$ is a *zero
point* (typically in an integral format), and \f$[:]\f$ denotes
elementwise application of the formula to the arrays.

In this model, oneDNN assumes that quantization parameters are inputs
provided by the user: the library does not compute those scaling
factors and zero-points as part of primitive computation.

These quantization parameters can either be computed ahead of time
using calibration tools or at runtime based on the actual minimum and
maximum values of a tensor. Either method can be used in conjunction
with oneDNN static quantization, as long as the quantization
parameters are passed as input to the oneDNN primitives at execution
time.
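
For instance, a user-side helper applying this formula could look like
the sketch below; the function name and the numeric values are
illustrative only.

~~~cpp
#include <cstdint>

// Dequantize one s8 value with a per-tensor scale and zero-point,
// following x_f32 = scale_x * (x_quant - zp_x).
float dequantize_s8(std::int8_t x_quant, float scale_x, std::int32_t zp_x) {
    return scale_x * (static_cast<float>(x_quant) - static_cast<float>(zp_x));
}

// Example: scale_x = 0.5f and zp_x = -10 map x_quant = 22 to x_f32 = 16.f.
~~~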

### Dynamic quantization

The only formula for dynamic quantization currently supported by
oneDNN is with scales computed according to the OCP Microscaling (MX)
specification [1], namely:

\f[
    x_{f32}[:] = scale_{x} \cdot x_{quant}[:]
\f]

where \f$x_{f32}\f$ and \f$x_{quant}\f$ are the non-quantized and
quantized representations of \f$x\f$ respectively, and \f$scale_{x}\f$ is a
*scaling factor*:
- in e8m0 format,
- computed for each group of size 32 (see [set_scales](@ref dnnl::primitive_attr::set_scales)),
- and computed as the largest power-of-two less than or equal to the
  maximum absolute value of the group divided by the largest
  power-of-two representable in the \f$x_{quant}\f$ data type
  (i.e. \f$E8M0(amax(x_{quant}[:])) / E8M0(MAX\_QUANT\_DT)\f$).
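
As an illustration of this rule, the user-side sketch below derives the
e8m0 shared scale of one group of values. The helper is not a oneDNN
API; the group size of 32 and the use of `448.f` as the largest
representable magnitude (the `fp8_e4m3` maximum) are assumptions for
the example.

~~~cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative only: e8m0 shared scale of one group, computed as
// E8M0(amax(group)) / E8M0(max_quant_dt), where E8M0(v) is the largest
// power of two less than or equal to v.
float mx_group_scale(const float *group, std::size_t size, float max_quant_dt) {
    float amax = 0.f;
    for (std::size_t i = 0; i < size; ++i)
        amax = std::max(amax, std::fabs(group[i]));
    if (amax == 0.f) return 0.f; // all-zero group: no scaling needed
    auto e8m0 = [](float v) { return std::exp2(std::floor(std::log2(v))); };
    return e8m0(amax) / e8m0(max_quant_dt);
}

// Usage sketch: mx_group_scale(x, 32, 448.f) for fp8_e4m3 elements.
~~~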

## General numerical behavior notes

Primitive implementations are allowed to convert inputs to wider
datatypes (e.g. int8 to int16 or int32) when those conversions do not
impact accuracy.

During execution, primitive implementations avoid integer overflows
and maintain integer accuracy by using wider datatypes (e.g. int32)
for intermediate values and accumulators.

Results are then converted as necessary before being written to the
output memory objects.

The scales are applied in single precision floating point data type
(#dnnl::memory::data_type::f32) before downconversion to the
destination datatype. When converting to integral datatypes,
implementations typically saturate, whereas for floating-point
datatypes, underflow/overflow can occur. To force saturation in
floating-point datatypes, use @ref dev_guide_attributes_post_ops_eltwise
with the clip algorithm. Rounding happens according to the
[rounding mode attribute](@ref dev_guide_attributes_rounding_mode).
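
For example, a clip post-op can be appended to force saturation of a
floating-point destination to a chosen range; the f16 bounds below are
an assumption for illustration.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

primitive_attr make_clip_attr() {
    // Clamp the f32 result to the f16 representable range before the
    // downconversion to the destination data type.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_clip, -65504.f, 65504.f);

    primitive_attr attr;
    attr.set_post_ops(po);
    return attr;
}
~~~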

@warning
Depending on the architecture, the behavior of int8 computations might slightly
vary. For more details, refer to @ref dev_guide_int8_computations.

When multiple operations are fused in a single primitive using the
[post ops attribute](@ref dev_guide_attributes_post_ops), those are assumed to be
computed in f32 precision. As a result, the destination quantization
parameters are applied after the post-ops as follows:

\f[
    \dst[:] = post\_ops(OP(src[:], weights[:], ...)) / scale_{\dst} + zp_{\dst}
\f]

Quantizing/dequantizing values between post-operations can be achieved
using one of [eltwise](@ref dev_guide_attributes_post_ops_eltwise),
[binary](@ref dev_guide_attributes_post_ops_binary), or the scale
parameter of the appropriate post-operation.
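
As an illustration, the scale parameter of a sum post-op can dequantize
the int8 values already present in the destination before they are
accumulated into the result; the scale value of `0.5f` below is an
assumed destination scale, and the trailing ReLU shows that further
post-ops then operate on the dequantized f32 values.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

primitive_attr make_sum_relu_attr() {
    // Accumulate the previous s8 contents of dst, dequantized with an
    // assumed scale of 0.5f and a zero-point of 0, then apply ReLU.
    post_ops po;
    po.append_sum(/*scale=*/0.5f, /*zero_point=*/0, memory::data_type::s8);
    po.append_eltwise(algorithm::eltwise_relu, 0.f, 0.f);

    primitive_attr attr;
    attr.set_post_ops(po);
    return attr;
}
~~~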
## API

oneDNN provides the following APIs to set scales:
- C: @ref dnnl_primitive_attr_set_scales
- C++: @ref dnnl::primitive_attr::set_scales

and the following APIs to set zero-points:
- C: @ref dnnl_primitive_attr_set_zero_points
- C++: @ref dnnl::primitive_attr::set_zero_points

Those take five parameters:
- an argument index, to specify which argument is having its
  quantization parameter description set;
- a mask, to specify along which axes the quantization parameters are
  applied. If the argument being specified is a \f$D_0 \times
  ... \times D_{n-1}\f$ tensor and we want to have scales per \f$d_i\f$
  dimension (where \f$0 \le d_i < n\f$), then the mask should be set to
  \f$mask = \sum \limits_{d_i} 2^{d_i}\f$, and the number of scales
  should be \f$\prod\limits_{d_i}D_{d_i}\f$;
- an array of group sizes, which specifies the number of consecutive
  elements a single scale/zero-point applies to, for each axis along
  which the quantization parameters apply;
- a scale/zero-point data type, which is f32 by default for scales and
  s32 by default for zero-points;
- a quantization mode, which specifies how the scales are computed
  (e.g. static or dynamic).
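
For example, for a \f$K \times N\f$ weights tensor with one scale per
group of 64 elements along \f$K\f$ and per column along \f$N\f$, the
mask is \f$2^0 + 2^1 = 3\f$ and the number of scale values is
\f$(K/64) \cdot N\f$. The sketch below shows one way to express this;
the sizes, the f16 scale data type, and leaving the quantization mode
at its default are assumptions for illustration.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

primitive_attr make_grouped_scales_attr() {
    const memory::dim G = 64; // illustrative group size along K

    primitive_attr attr;
    // Scales apply along both weights dimensions: mask = 2^0 + 2^1 = 3.
    // Groups of G along K and of 1 along N, so for K = 1024 and N = 512
    // the user provides (K / G) * N = 8192 f16 scale values at execution.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f16);
    return attr;
}
~~~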
### Special Case: Host-side Scalar Scale and Zero-point

When using the GPU engine and a single scale/zero-point is used for an
argument (mask=0), oneDNN supports passing those values from the host
to reduce the overhead of copying data from host to device or of
allocating extra device memory. The host scale or zero-point attributes
should be set at creation time using the following API:

~~~cpp
dnnl::primitive_attr attr;
attr.set_host_scale(DNNL_ARG_DST,
        memory::data_type::f32);

attr.set_host_zero_point(DNNL_ARG_DST,
        memory::data_type::s32);
~~~

The corresponding memory objects for the scale or zero-point host
values should be created as host-side scalars (see @ref
dev_guide_host_side_scalars for details) and passed to the primitive
execution function.

## Examples of quantization workflow

### Convolution Quantization Workflow

Consider a convolution with bias. The tensors are represented as:

necessary to apply `f32` scaling factors.

#### Per-Channel Scaling

Some of the primitives have limited support of multiple scales for a quantized
tensor. The most popular use case is the @ref dev_guide_convolution primitive
#### Preparing the weights with per-output-channel scaling

~~~cpp
// weights dimensions
// ...
~~~
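
A hedged sketch of such a preparation step with the C++ API is shown
below: it reorders f32 weights into s8 with one scale per output
channel. The tensor sizes and the use of destination-side scales are
assumptions for illustration.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

void quantize_weights_per_oc() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim OC = 128, IC = 64, KH = 3, KW = 3;
    memory::desc wei_f32_md({OC, IC, KH, KW}, memory::data_type::f32,
            memory::format_tag::oihw);
    memory::desc wei_s8_md({OC, IC, KH, KW}, memory::data_type::s8,
            memory::format_tag::oihw);

    // One scale per output channel of the destination: mask = 1 << 0.
    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_DST, 1 << 0);

    memory wei_f32_m(wei_f32_md, eng), wei_s8_m(wei_s8_md, eng);
    memory scales_m({{OC}, memory::data_type::f32, memory::format_tag::x}, eng);

    auto r_pd = reorder::primitive_desc(eng, wei_f32_md, eng, wei_s8_md, attr);
    reorder(r_pd).execute(strm,
            {{DNNL_ARG_FROM, wei_f32_m}, {DNNL_ARG_TO, wei_s8_m},
                    {DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST, scales_m}});
    strm.wait();
}
~~~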

#### Create the convolution with per-output-channel quantization


This example is complementary to the previous example (which should
ideally be the first one). Let's say we want to create an int8
convolution with per-output channel scaling.

~~~cpp
// ...
~~~
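
A hedged sketch of such a convolution setup is shown below; the shapes,
the use of `set_scales_mask`, and the per-tensor activation scales are
illustrative assumptions rather than the only supported configuration.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

void int8_conv_per_oc_sketch() {
    engine eng(engine::kind::cpu, 0);

    // int8 convolution: s8 src/weights, u8 dst, f32 (non-quantized) bias.
    memory::desc src_md({1, 64, 28, 28}, memory::data_type::s8,
            memory::format_tag::any);
    memory::desc wei_md({128, 64, 3, 3}, memory::data_type::s8,
            memory::format_tag::any);
    memory::desc bia_md({128}, memory::data_type::f32, memory::format_tag::any);
    memory::desc dst_md({1, 128, 28, 28}, memory::data_type::u8,
            memory::format_tag::any);

    primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_SRC, 0);          // per-tensor src scale
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 0); // per-output-channel scales
    attr.set_scales_mask(DNNL_ARG_DST, 0);          // per-tensor dst scale

    auto conv_pd = convolution_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::convolution_direct,
            src_md, wei_md, bia_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1},
            attr);
    // The actual scale values are then passed at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_{SRC, WEIGHTS, DST}.
}
~~~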

### Matmul with weights-only quantization

This example describes a process of weights decompression, or
weights-only quantization (WoQ), in the matmul primitive.

~~~cpp
// ...
~~~
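
A hedged sketch of one possible WoQ configuration is shown below: f16
activations with s8 weights that are dequantized using per-group f16
scales. The sizes, the group size of 128, and the use of the
integer-aware fpmath mode are assumptions for illustration.

~~~cpp
#include "dnnl.hpp"
using namespace dnnl;

void woq_matmul_sketch() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 32, K = 4096, N = 4096, G = 128;
    memory::desc src_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Dequantize the s8 weights with one f16 scale per group of 128 along K
    // and per column along N: the mask covers both weights dimensions.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f16);
    // Request f16 computation even though the weights are stored as integers.
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
~~~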

### Matmul with precomputed reductions and advanced quantization

This example is a complementary addition to the one above. It describes
a process of dynamic quantization with asymmetric quantization of the
weights tensor.

~~~cpp
// ...
~~~
[1]: [Open Compute Project Microscaling specification version 1](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)