.. _numerical_accuracy:

Numerical accuracy
==================

In modern computers, floating point numbers are represented using the IEEE 754 standard.
For more details on floating point arithmetic and the IEEE 754 standard, please see
`Floating point arithmetic <https://en.wikipedia.org/wiki/Floating-point_arithmetic>`_.
In particular, note that floating point provides limited accuracy (about 7 decimal digits
for single precision floating point numbers, about 16 decimal digits for double precision
floating point numbers) and that floating point addition and multiplication are not
associative, so the order of the operations affects the results.
Because of this, PyTorch is not guaranteed to produce bitwise identical results for
floating point computations that are mathematically identical. Similarly, bitwise
identical results are not guaranteed across PyTorch releases, individual commits, or
different platforms. In particular, CPU and GPU results can differ even for
bitwise-identical inputs and even after controlling for the sources of randomness.
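
As a minimal illustration of the non-associativity of floating point addition (the values
below are chosen purely for illustration):

.. code:: python

    import torch

    a = torch.tensor(1e8)   # fp32 by default
    b = torch.tensor(-1e8)
    c = torch.tensor(1.0)

    (a + b) + c  # tensor(1.)
    a + (b + c)  # tensor(0.), because -1e8 + 1 rounds back to -1e8 in fp32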

Batched computations or slice computations
------------------------------------------

Many operations in PyTorch support batched computation, where the same operation is performed
for the elements of the batches of inputs. Examples of this are :meth:`torch.mm` and
:meth:`torch.bmm`. It is possible to implement batched computation as a loop over batch
elements and apply the necessary math operations to the individual batch elements, but for
efficiency reasons we are not doing that, and typically perform the computation for the whole
batch. The mathematical libraries that we are calling, and PyTorch's internal implementations
of operations, can produce slightly different results in this case, compared to non-batched
computations. In particular, let ``A`` and ``B`` be 3D tensors with the dimensions suitable
for batched matrix multiplication. Then ``(A@B)[0]`` (the first element of the batched result)
is not guaranteed to be bitwise identical to ``A[0]@B[0]`` (the matrix product of the first
elements of the input batches) even though mathematically it's an identical computation.

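A minimal sketch of how such a discrepancy might be observed with :meth:`torch.bmm` (whether
the results actually differ, and by how much, depends on the hardware and library versions in
use):

.. code:: python

    import torch

    torch.manual_seed(0)
    A = torch.randn(8, 128, 128)
    B = torch.randn(8, 128, 128)

    batched = torch.bmm(A, B)  # computed for the whole batch at once
    looped = torch.stack([A[i] @ B[i] for i in range(A.shape[0])])

    torch.equal(batched, looped)                # not guaranteed to be True
    torch.allclose(batched, looped, atol=1e-4)  # differences, if any, are tiny
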
Similarly, an operation applied to a tensor slice is not guaranteed to produce results that are
identical to the slice of the result of the same operation applied to the full tensor. E.g. let
``A`` be a 2-dimensional tensor. ``A.sum(-1)[0]`` is not guaranteed to be bitwise equal to
``A[0,:].sum()``.
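
A similar sketch for the slice case (again, whether the results actually differ depends on the
backend and tensor sizes):

.. code:: python

    import torch

    torch.manual_seed(0)
    A = torch.randn(1024, 1024)

    # Summing the first row directly vs. taking the first entry of the full
    # row-sum reduction; the two may use different reduction orders internally.
    torch.equal(A.sum(-1)[0], A[0, :].sum())                # not guaranteed to be True
    torch.allclose(A.sum(-1)[0], A[0, :].sum(), atol=1e-4)  # differences, if any, are tiny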

Extremal values
---------------

When inputs contain large values such that intermediate results may overflow the range of the
used datatype, the end result may overflow too, even though it is representable in the original
datatype. E.g.:

.. code:: python

    import torch

    a = torch.tensor([1e20, 1e20])  # fp32 type by default
    a.norm()           # produces tensor(inf), because squaring 1e20 overflows fp32
    a.double().norm()  # produces tensor(1.4142e+20, dtype=torch.float64), representable in fp32

TensorFloat-32 (TF32) on Nvidia Ampere devices
----------------------------------------------

On Ampere Nvidia GPUs, PyTorch by default uses TensorFloat-32 (TF32) to speed up
math-intensive operations, in particular matrix multiplications and convolutions. When an
operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa
are read. This can lead to less accurate results and to surprising effects, such as
multiplying a matrix by the identity matrix producing a result that differs from the input.
Most neural network workloads have the same convergence behavior when using TF32 as they
have with fp32; however, if better accuracy is desired, TF32 can be turned off with
``torch.backends.cuda.matmul.allow_tf32 = False``.

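A minimal sketch of toggling the flag (``torch.backends.cudnn.allow_tf32`` is the companion
setting for cuDNN convolutions):

.. code:: python

    import torch

    # Disable TF32 for matrix multiplications (more accurate, potentially slower on Ampere).
    torch.backends.cuda.matmul.allow_tf32 = False
    # Corresponding flag for cuDNN convolutions.
    torch.backends.cudnn.allow_tf32 = False

    if torch.cuda.is_available():
        a = torch.randn(1024, 1024, device="cuda")
        b = torch.randn(1024, 1024, device="cuda")
        c = a @ b  # now computed without TF32 tensor cores
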
For more information see :ref:`TensorFloat32<tf32_on_ampere>`.

Reduced Precision Reduction for FP16 GEMMs
------------------------------------------

Half-precision GEMM operations are typically done with intermediate accumulations (reduction)
in single-precision for numerical accuracy and improved resilience to overflow. For
performance, certain GPU architectures, especially more recent ones, allow a few truncations
of the intermediate accumulation results to the reduced precision (e.g., half-precision). This
change is often benign from the perspective of model convergence, though it may lead to
unexpected results (e.g., ``inf`` values when the final result should be representable in
half-precision).
If reduced-precision reductions are problematic, they can be turned off with
``torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False``.

For more information see :ref:`allow_fp16_reduced_precision_reduction<fp16reducedprecision>`.
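
A minimal sketch of disabling the reduced-precision reduction before running a half-precision
GEMM (whether this changes the results depends on the GPU architecture and cuBLAS version):

.. code:: python

    import torch

    # Force full-precision (fp32) accumulation for fp16 GEMMs.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

    if torch.cuda.is_available():
        a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
        b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
        # Accumulation now happens in fp32 before the final cast back to fp16,
        # trading some throughput for improved resistance to overflow.
        c = a @ b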