### Description
This PR enables TF32 as the fp32 internal precision for matmul/linear/conv in the `mkldnn` backend. Since we refined the fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend that API to support TF32 for the `mkldnn` backend.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
The related kernel and unit-test updates are done. The wrapper `bf32_on_and_off` is renamed to `reduced_f32_on_and_off`; it can run tests 3 times: once with reduced_f32 OFF and twice with reduced_f32 ON (`bf32` ON and `tf32` ON).
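A minimal usage sketch (assuming a build with oneDNN and hardware support for the reduced-precision path); inputs and outputs stay fp32, only the internal compute precision changes:
```python
import torch

# Allow TF32 as the internal fp32 precision for oneDNN matmul on CPU.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"

a = torch.randn(256, 256)   # fp32 inputs
b = torch.randn(256, 256)
y = a @ b                   # fp32 output; oneDNN may compute internally in TF32
```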
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.
### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
- The names are more informative: 'tf32' says more than a plain "high".
- Easier to extend to new algorithms like `tf32x3`.
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide 3 fp32 compute precisions that can be set:
- **"ieee"**: Not allowed to use any other internal computation data type.
- **"tf32"**: Allowed to use tf32 as the internal computation data type.
- **"bf16"**: Allowed to use bf16 as the internal computation data type.
- **"none"**: Precision is not set; the value is inherited from (overridden by) the parent node.
### Overriding Precision Settings
A child node is overridden by its parent node if it is set to the default ("none").
For current default settings:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the sketch below).
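A minimal sketch of the override behavior described above:
```python
import torch

# Setting the backend-level precision overrides child nodes that are still "none":
# matmul / conv / rnn under mkldnn now effectively use bf16 as the internal
# fp32 compute precision.
torch.backends.mkldnn.fp32_precision = "bf16"

# A child node that is set explicitly keeps its own value and is not overridden.
torch.backends.mkldnn.conv.fp32_precision = "ieee"
```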
### Backward Compatibility
Since the new API allows more fine-grained control, there can be conflicts with the old API. For example, the previous `torch.backends.cudnn.allow_tf32` flag is not enough to represent a state where `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they work as before.
- If the user uses the **new** API to reach a state that is **un-representable** by the old API and then accesses that state through the **old** API, we raise a RuntimeError and point the user to the documentation (see the sketch below).
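A sketch of the un-representable case (the exact error message is not specified here):
```python
import torch

# The legacy cudnn.allow_tf32 flag cannot express this mixed configuration:
torch.backends.cudnn.rnn.fp32_precision = "ieee"
torch.backends.cudnn.conv.fp32_precision = "tf32"

try:
    _ = torch.backends.cudnn.allow_tf32   # reading the state through the old API
except RuntimeError as e:
    print(e)                              # error points the user to the documentation
```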
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__` prefix to positional-only arguments using the `/` separator in function signatures (see the example below)
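A small illustration of items 2 and 3 (the function and its arguments are hypothetical, not code from this PR):
```python
# Item 2: %-formatting replaced with an f-string.
name, count = "ruff", 3
old_msg = "%s found %d issues" % (name, count)
new_msg = f"{name} found {count} issues"

# Item 3: a double-underscore prefix only marks an argument as positional-only
# by convention; the `/` separator enforces it in the signature.
def clip_old(__value, low, high):
    return max(low, min(high, __value))

def clip_new(value, /, low, high):
    return max(low, min(high, value))

assert old_msg == new_msg
assert clip_old(5, 0, 3) == clip_new(5, 0, 3) == 3
```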
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that can instead be expressed with a list constructor or an `extend` call (illustrated below). All test cases have `noqa` added since performance is not as sensitive in that folder.
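For reference, a small illustration of the pattern PERF402 flags (hypothetical variable names):
```python
items = ["a", "b", "c"]

# Flagged by PERF402: element-by-element copy in a loop.
copied = []
for item in items:
    copied.append(item)

# Preferred: a list constructor (or copied.extend(items)).
copied = list(items)
```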
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
Given the following case:
```
import torch
a = torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
b = torch.randn(64, 33, 256).to(dtype=torch.bfloat16)
y = torch.ops.aten.bmm(a, b)
```
`a` is a contiguous tensor, but its strides are not the default contiguous strides ([33, 33, 1]), so the oneDNN matmul always runs a non-optimized path:
```
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,7.28711
```
This PR converts the inputs' strides to the default contiguous strides before calling oneDNN, so that it runs an optimized path:
```
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,3.06396
```
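A minimal sketch of the idea behind the fix (the actual change lives in the ATen oneDNN matmul path; `as_strided` is used here only to illustrate the stride normalization):
```python
import torch

a = torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
assert a.is_contiguous()                      # contiguous, but strides are [33, 3, 1]

# Re-express the tensor with the default contiguous strides [33, 33, 1];
# the size-1 dimension means the underlying layout is unchanged.
a_fixed = a.as_strided(a.size(), [33, 33, 1])

b = torch.randn(64, 33, 256).to(dtype=torch.bfloat16)
y = torch.bmm(a_fixed, b)                     # oneDNN can now pick the optimized kernel
```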
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99511
Approved by: https://github.com/mingfeima, https://github.com/jgong5
## Description
Currently, both inference and training use `forward_training` in the RNN primitive, which degrades inference performance (the drop comes from the RNN primitive itself and from the unnecessary creation of `pd` and `workspace`). This PR splits them into `forward_inference` and `forward_training` separately.
## Performance
With this fix, RNN-T inference time is reduced by 167 ms, improving E2E time by about `3.7%`.
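For context, a minimal sketch of the two paths (assuming an MKLDNN-enabled CPU build; the choice of primitive happens inside the mkldnn RNN kernel):
```python
import torch

rnn = torch.nn.LSTM(64, 128, batch_first=True)
x = torch.randn(8, 16, 64)

rnn.train()
y_train, _ = rnn(x)        # training: uses forward_training (needs a workspace)

rnn.eval()
with torch.no_grad():
    y_infer, _ = rnn(x)    # inference: uses the cheaper forward_inference after this PR
```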
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96736
Approved by: https://github.com/jgong5
Summary:
Enable Gelu bf16/fp32 in the CPU path using the MKLDNN implementation. The user doesn't need to call `to_mkldnn()` explicitly. The new Gelu fp32 kernel performs better than the original one.
Add Gelu backward for https://github.com/pytorch/pytorch/pull/53615.
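A minimal usage sketch (per the description above, a plain dense CPU tensor hits the MKLDNN Gelu kernel without an explicit layout conversion):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024)        # fp32, dense CPU tensor
y = F.gelu(x)                      # no to_mkldnn()/to_dense() round-trip needed

x_bf16 = x.to(torch.bfloat16)      # the bf16 path is covered as well
y_bf16 = F.gelu(x_bf16)
```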
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58525
Reviewed By: ejguan
Differential Revision: D29940369
Pulled By: ezyang
fbshipit-source-id: df9598262ec50e5d7f6e96490562aa1b116948bf
Summary:
Fixes https://github.com/pytorch/pytorch/issues/43039. When tracing an MKLDNN model with **check_trace=True**, there is an error: **RuntimeError: unsupported memory format option Preserve**. This PR solves that problem.
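A hedged repro sketch of the scenario described above (the model and shapes are illustrative):
```python
import torch
from torch.utils import mkldnn as mkldnn_utils

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, kernel_size=3)).eval()
mkldnn_model = mkldnn_utils.to_mkldnn(model)
x = torch.randn(1, 3, 32, 32).to_mkldnn()

# Before this fix, check_trace=True failed with
# "RuntimeError: unsupported memory format option Preserve".
traced = torch.jit.trace(mkldnn_model, x, check_trace=True)
```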
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61241
Reviewed By: anjali411
Differential Revision: D29737365
Pulled By: suo
fbshipit-source-id: e8f7f124bc6256f10b9d29969e0c65d332514625
Summary:
- Use `functools.lru_cache` to avoid calling this function multiple times.
- Check that we are running on a Linux platform before trying to open `/proc/cpuinfo`.
- Do not spawn a new process; simply `open("/proc/cpuinfo").read()` and search the output for the keywords (see the sketch below).
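A minimal sketch of this approach (the helper name and the searched keyword are illustrative, not the actual torch internals):
```python
import functools
import platform

@functools.lru_cache(maxsize=None)
def cpu_has_flag(flag: str) -> bool:
    # Only Linux exposes /proc/cpuinfo; on other platforms report False.
    if platform.system() != "Linux":
        return False
    with open("/proc/cpuinfo") as f:
        return flag in f.read()

print(cpu_has_flag("avx2"))   # repeated calls hit the lru_cache
```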
Fixes https://github.com/pytorch/pytorch/issues/57360
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57408
Reviewed By: driazati
Differential Revision: D28136769
Pulled By: malfet
fbshipit-source-id: ab476774c3be2913cb576d98d47a2f7ec03c19aa
Summary:
## 🚀 Feature
Add Mkl-Layout kernel for tanh.
## Motivation
We want to add an Mkl-Layout kernel for tanh to improve tanh's performance when the input Tensor uses the Mkl layout. PyTorch currently has no Mkl-Layout kernel for tanh, so it cannot execute tanh on an Mkl-Layout Tensor.
Of course, you can temporarily avoid this problem by calling to_dense/to_mkldnn, but performance is significantly reduced due to the copy overhead (1.6-4.3 times slower than the CPU kernel).
## Performance results
### Environment
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- OS: Ubuntu 18.04.1 LTS
- compiler: gcc 7.5.0
- branch: master
- commit ID: fe2c126
- build Environment variable: USE_CUDA=0
- Python: 3.6.9
- Intel MKL(Math Kernel Library): 2020.2-254
- Intel oneDNN: 1.8.1
### Benchmark script
```python
import torch
import torch.nn as nn

torch.manual_seed(1)
x = torch.randn(2048, 2048)
x_mkl = x.to_mkldnn()

print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### CPU tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.to_dense().tanh().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### to_dense/to_mkldnn + tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.to_dense().tanh_().to_mkldnn()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        output = x_mkl.tanh()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

print("\n### Mkl-Layout tanh_")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(100):
        x_mkl.tanh_()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```
### Results
#### OMP_NUM_THREADS=1 results (self CPU time total, ms)
| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh | 579.662 | 1658.000 | 617.565 |
| tanh_ | 554.477 | 881.997 | 589.426 |
#### OMP_NUM_THREADS=6 results (self CPU time total, ms)
| Operation | CPU kernel | to_dense/to_mkldnn + CPU kernel | Mkl-Layout kernel (this PR) |
| --------- | ---------- | ------------------------------- | --------------------------- |
| tanh | 182.387 | 421.336 | 136.226 |
| tanh_ | 94.331 | 404.931 | 99.254 |
## Modification policy for the code
oneDNN already supports the tanh operation:
[oneDNN: Elementwise](https://spec.oneapi.com/versions/latest/elements/oneDNN/source/primitives/eltwise.html)
A sigmoid implementation that uses the same Elementwise API as tanh already exists, so we created the code in this PR with reference to that sigmoid implementation.
527c1e0e37/aten/src/ATen/native/mkldnn/UnaryOps.cpp (L28-L42)
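For completeness, a short usage sketch of the new kernel (assuming a build with MKLDNN enabled):
```python
import torch

x = torch.randn(2048, 2048)
x_mkl = x.to_mkldnn()

y_mkl = x_mkl.tanh()   # out-of-place Mkl-Layout tanh (added in this PR)
x_mkl.tanh_()          # in-place variant
```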
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54656
Test Plan:
A test for sigmoid has already been created, as shown below.
So, I added a new test for tanh modeled on the sigmoid test.
527c1e0e37/test/test_mkldnn.py (L944-L954)
### mkldnn tanh test result
```
$ python3 test/test_mkldnn.py TestMkldnn.test_tanh
Couldn't download test skip set, leaving all tests enabled...
.
----------------------------------------------------------------------
Ran 1 test in 0.004s
OK
```
Reviewed By: gchanan
Differential Revision: D27395827
Pulled By: ezyang
fbshipit-source-id: d4481332de187e2dea095f9b6aabc73a497960fe
Summary:
There are the following two patterns for calling add in-place.
```python
torch.add(a, b, out=a) # (1) a in-placed
torch.add(a, b, out=b) # (2) b in-placed
```
If `a` and `b` are mkldnn Tensors, the result differs from the expected value in case (2).
**Sample code to reproduce the behavior:**
```python
import torch
torch.manual_seed(4)
a = torch.randn(4, 4)
b = torch.randn(4, 4)
b.fill_(1.0)
a_mkl = a.to_mkldnn()
b_mkl = b.to_mkldnn()
torch.add(b, a, alpha=1.0, out=a)
torch.add(b_mkl, a_mkl, alpha=1.0, out=a_mkl)
print(a)
print(a_mkl)
```
**Results:**
Actual:
```python
tensor([[ 0.0586, 2.2632, 0.8162, 1.1505],
        [ 1.1075, 0.7220, -1.6021, 1.6245],
        [ 0.1316, 0.7949, 1.3976, 1.6699],
        [ 0.9463, 1.0467, -0.7671, -1.1205]])
tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]], layout=torch._mkldnn)
```
Expected:
```python
tensor([[ 0.0586, 2.2632, 0.8162, 1.1505],
        [ 1.1075, 0.7220, -1.6021, 1.6245],
        [ 0.1316, 0.7949, 1.3976, 1.6699],
        [ 0.9463, 1.0467, -0.7671, -1.1205]])
tensor([[ 0.0586, 2.2632, 0.8162, 1.1505],
        [ 1.1075, 0.7220, -1.6021, 1.6245],
        [ 0.1316, 0.7949, 1.3976, 1.6699],
        [ 0.9463, 1.0467, -0.7671, -1.1205]], layout=torch._mkldnn)
```
This is because `dnnl::sum` called in `mkldnn_add` has the following specifications:
[oneDNN doc : Sum](https://oneapi-src.github.io/oneDNN/dev_guide_sum.html)
> The sum primitive supports in-place operation, meaning that the src0 tensor can be used as both input and output.
> In-place operation overwrites the original data. Using in-place operation requires the memory footprint of the
> output tensor to be either bigger than or equal to the size of the dst memory descriptor used for primitive creation.
However, in case (2) the tensor being modified in place is passed as the second source rather than the first.
So, we modified the code to swap `a` and `b` before passing them to `sum` in case (2).
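A quick behavioral check of case (2) after the fix (assuming a build with MKLDNN enabled):
```python
import torch

torch.manual_seed(4)
a = torch.randn(4, 4)
b = torch.full((4, 4), 1.0)
a_mkl = a.to_mkldnn()
b_mkl = b.to_mkldnn()

torch.add(b, a, alpha=1.0, out=a)               # dense reference
torch.add(b_mkl, a_mkl, alpha=1.0, out=a_mkl)   # case (2): out aliases the second input

assert torch.allclose(a, a_mkl.to_dense())      # results now match
```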
**Environment**
- CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
- Build: USE_MKLDNN=1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51687
Reviewed By: jbschlosser
Differential Revision: D27062172
Pulled By: VitalyFedyunin
fbshipit-source-id: bf76d36f9fdb1b4337d71d87bcdbaf4edb11f12f