Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will use the algorithm names directly.

### Design Choice: Directly use algorithm names like "TF32", "BF16"

#### Pros
- The names are more informative. "tf32" is more informative than a plain "high".
- Easier to extend to new algorithms such as `tf32x3`.

#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can add more documentation to discuss this.

### We provide a layered structure for backends/operators ('f32' is short for 'fp32_precision')

### The following fp32 compute precisions can be set:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as an internal computation data type.
- **"bf16"**: Allowed to use bf16 as an internal computation data type.
- **"none"**: Precision is not set; it can be overridden by its parent node.

### Overriding Precision Settings

A child node is overridden by its parent node if the child is set to the default. The current default settings are:

```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```

- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` are also overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes are also overridden to "bf16".

A short sketch of how these settings compose is shown after this description.

### Backward Compatibility

Since the new API allows more fine-grained control, there can be conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` flag cannot represent the state `torch.backends.cudnn.rnn.fp32_precision="ieee"` combined with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they work as before.
- If the user uses the **new** API to reach a state that is **not representable** by the old API and then reads that state through the **old** API, we raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
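As a minimal sketch of the layered control described above (using only the attribute names documented in this PR; the override semantics follow the bullets in the "Overriding Precision Settings" section):

```python
import torch

# Fine-grained: allow BF16 only for mkldnn matmuls; keep conv and rnn in FP32.
torch.backends.mkldnn.matmul.fp32_precision = "bf16"
torch.backends.mkldnn.conv.fp32_precision = "ieee"
torch.backends.mkldnn.rnn.fp32_precision = "ieee"

# Coarse-grained: a backend-level setting overrides child operators that are
# still at their default ("none").
torch.backends.mkldnn.fp32_precision = "bf16"

# Generic: the top-level setting likewise flows down to backends and operators
# that are still at their default.
torch.backends.fp32_precision = "bf16"
```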
.. meta::
   :description: A guide to torch.backends.mkldnn, a PyTorch backend to run MKLDNN operations
   :keywords: optimize PyTorch, MKLDNN

.. _mkldnn_backend:

MKLDNN backend
---------------------------------------------------

MKLDNN is an open-source cross-platform performance library of basic building blocks
for deep learning applications.

.. code:: python

    # The flag below controls whether the MKLDNN backend is enabled in PyTorch.
    torch.backends.mkldnn.enabled = True

Users can disable the MKLDNN backend by:

.. code:: python

    torch.backends.mkldnn.enabled = False

.. _bf16_on_mkldnn:

Bfloat16 (BF16) on MKLDNN backend
---------------------------------------------------

Starting in PyTorch 2.4, there is a set of APIs to control the internal computation precision
for `float32` operators.

.. code:: python

    # The flag below controls the internal computation precision for mkldnn matmul.
    # The default, "ieee", computes in float32.
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"

    # The flag below controls the internal computation precision for mkldnn conv.
    # The default, "ieee", computes in float32.
    torch.backends.mkldnn.conv.fp32_precision = "ieee"

    # The flag below controls the internal computation precision for mkldnn rnn.
    # The default, "ieee", computes in float32.
    torch.backends.mkldnn.rnn.fp32_precision = "ieee"

Note that besides matmuls and convolutions themselves, functions and nn modules that internally use
matmuls or convolutions are also affected. These include :class:`torch.nn.Linear`, :class:`torch.nn._ConvNd`, :func:`torch.cdist`,
:func:`torch.tensordot`, :func:`torch.nn.functional.affine_grid` and :func:`torch.nn.functional.grid_sample`,
:class:`torch.nn.AdaptiveLogSoftmaxWithLoss`, :class:`torch.nn.GRU` and :class:`torch.nn.LSTM`.

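For example, because :class:`torch.nn.Linear` uses a matmul internally, the matmul flag also applies to
linear layers. The snippet below is a minimal, illustrative sketch; whether BF16 is actually used, and any
speedup, depends on hardware BF16 support.

.. code:: python

    import torch

    linear = torch.nn.Linear(1024, 1024)
    x = torch.randn(64, 1024)

    # BF16 internal computation is allowed for the underlying matmul.
    torch.backends.mkldnn.matmul.fp32_precision = "bf16"
    y_bf16 = linear(x)

    # Full FP32 internal computation.
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"
    y_fp32 = linear(x)
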
To get an idea of the precision and speed, see the example code and benchmark data (on SPR) below:

.. code:: python

    torch.manual_seed(0)
    a_full = torch.randn(10240, 10240, dtype=torch.double)
    b_full = torch.randn(10240, 10240, dtype=torch.double)
    ab_full = a_full @ b_full
    mean = ab_full.abs().mean()  # 80.7451

    a = a_full.float()
    b = b_full.float()

    # Do matmul in BF16 mode.
    torch.backends.mkldnn.matmul.fp32_precision = 'bf16'
    ab_bf16 = a @ b  # expected speedup with BF16 dot-product acceleration
    error = (ab_bf16 - ab_full).abs().max()  # 1.3704
    relative_error = error / mean  # 0.0170
    print(error, relative_error)

    # Do matmul in FP32 mode.
    torch.backends.mkldnn.matmul.fp32_precision = 'ieee'
    ab_fp32 = a @ b
    error = (ab_fp32 - ab_full).abs().max()  # 0.0003
    relative_error = error / mean  # 0.00000317
    print(error, relative_error)

From the above example, we can see that with BF16, the speed is ~7x faster on SPR, and that
the relative error compared to double precision is approximately 2 orders of magnitude larger.
If full FP32 precision is needed, users can disable BF16 by:

.. code:: python

    torch.backends.mkldnn.matmul.fp32_precision = 'ieee'
    torch.backends.mkldnn.conv.fp32_precision = 'ieee'
    torch.backends.mkldnn.rnn.fp32_precision = 'ieee'

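When changing the precision only for part of a program, it can be useful to save and restore the previous
value. The snippet below is a minimal usage sketch; it assumes the ``fp32_precision`` flags can be read back
the same way they are assigned.

.. code:: python

    import torch

    a = torch.randn(1024, 1024)
    b = torch.randn(1024, 1024)

    # Save the current setting, switch to BF16 for this matmul only,
    # and restore the previous setting afterwards.
    prev = torch.backends.mkldnn.matmul.fp32_precision
    torch.backends.mkldnn.matmul.fp32_precision = 'bf16'
    try:
        ab = a @ b
    finally:
        torch.backends.mkldnn.matmul.fp32_precision = prev
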
To toggle the BF16 flags off in C++, you can do:

.. code:: C++

    at::globalContext().setFloat32Precision("ieee", "mkldnn", "matmul");
    at::globalContext().setFloat32Precision("ieee", "mkldnn", "conv");
    at::globalContext().setFloat32Precision("ieee", "mkldnn", "rnn");

We can override a generic setting for a specific operator or backend if the ``fp32_precision`` is set to `ieee`:

.. code:: python

    torch.backends.fp32_precision = "bf16"
    torch.backends.mkldnn.fp32_precision = "ieee"
    torch.backends.mkldnn.matmul.fp32_precision = "ieee"

In such a case, both `torch.backends.mkldnn.fp32_precision` and `torch.backends.mkldnn.matmul.fp32_precision`
are overridden to bf16.