Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of #50256, the below improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
* SkyLake 8180 (1 socket of 28 cores):

* SkyLake 8180 (single thread):

* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support Pytorch-oneDNN-Graph integration on Linux platform. Support on Windows and MacOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
This allows us to provide OpOverloadPacket.overloads method that
lists all of the overloads.
This isn't tested; will be exercised in the next PR.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76814
Approved by: https://github.com/mruberry
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76485
Adds an environment variable `PYTORCH_JIT_ENABLE_NVFUSER` for
controlling whether or not nvfuser is enabled. This required changing
the PassManager behavior to support the case where nvfuser gets enabled
by default when PYTORCH_JIT_ENABLE_NVFUSER=1.
Previously the solution for turning nvfuser on or off was to use the
PassManager to register or un-register the pass. That works fine if the
pass starts of _disabled_, but causes issues once we try to enable the
pass by default.
The main issue with enabling by default is with the validation check to
see whether NVFuser can be turned on. The check relies on
at::globalContext().hasCUDA(), which requires CUDAHooks to be registered
before hasCUDA() wil work correctly. At static initialization time it's
difficult to ensure that CUDAHooks will be registered _before_ we
attempt to register the nvfuser pass. In OSS it worked fine, but in
internal builds it would fail on ROCm builds.
To fix this, we switch the control of NVFuser enablement to a check in
the pass. i.e. previously, we enabled/disabled nvfuser by registering or
de-registering the pass in pass manager; now, the pass is always
registered in pass manager, and enablement is done by a check within the
nvfuser pass.
Remaining TODO: Connect this with NNC so that in cases where NNC is
available but not NVFuser (i.e. on AMD gpus), NNC can be turned on
automatically.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35982618
Pulled By: davidberard98
fbshipit-source-id: fd5b76bc0b8c8716c96fdc04bebfb15026a7ef60
(cherry picked from commit ff14603ff5ac8d9b6c749c4f111f4a8be8023b7f)
- Allow registering custom decompositions
- Add easier API for invoking decompositions
- Shorten API names (no users yet)
I am doing these as one pr because they are fairly short/simple and because github first does not support ghstack yet.
cc @Chillee @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76252
Approved by: https://github.com/davidberard98
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75119
Add support for parsing Tensor constants like Double(4, 4) ... by initializing random tensors. This makes saving IR and then parsing it lossy, so I have it toggled as default not on, but is useful in cases like repro-ing Fusions with tensor constants post-freezing.
cc Krovatkin
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D35373999
Pulled By: eellison
fbshipit-source-id: a5c8d9f93f23a7442258fc745ed6b6def330dca8
(cherry picked from commit 32dd6567522973563bd452bf486ed27b02e4e35c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74361
This adds an optional validation after executing an NVFuser node, which checks that the output is the same as the unfused implementation. Then the outputs and the graph are reported via a callback.
```python
import torch
def callback(x, y, graph):
for i in range(len(x)-amt, len(x)):
print(x[i])
print(y[i])
print(graph)
with torch.jit.fuser("fuser2"):
torch._C._jit_nvfuser_set_comparison_callback(True, callback)
torch.jit.script
def g(x, y):
z = torch.add(x, y)
return torch.sin(z)
def f(x, y, a):
z = torch.add(x, y)
return g(torch.relu(z), a)
f_s = torch.jit.script(f)
x = torch.rand((10, 10), dtype=torch.half).cuda()
y = torch.rand((10, 10), dtype=torch.half).cuda()
a = torch.rand((10, 10), dtype=torch.half).cuda()
f_s(x, y, a)
f_s(x, y, a)
f_s(x, y, a)
```
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D34975310
Pulled By: davidberard98
fbshipit-source-id: 2379c9a6f371cd58da6a187c1f16882f3923ab24
(cherry picked from commit 96c87992c65f5e6bb1bdd51791682dd837af99b4)
This is a technical revert of 6d36bbde7eb2eb0aed448f694338cb49c2ae47f3 to reconcile it with e50478c02592597f12b8490ec5496f76c7d8b8cc (which is the same + lint changes applied)
Should be skipped during import
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73938
This is a first step in porting and making usable all of the decompositions defined in [functorch](https://github.com/pytorch/functorch/blob/main/functorch/_src/decompositions.py#L349) in core and in JIT as well as C++.
The decompositions are defined in python, scripted and inlined, and then serialized as C++ code which TorchScript can parse. The workflow is edit python decomposition file then run [tools/codegen/decompositions/gen_jit_decompositions.py](https://github.com/pytorch/pytorch/pull/73938/files#diff-6adef2116be233c3524e3b583e373ab0ffc9169beb6c1f6d96b5d0385e75afa1).
Decompositions are mapped to their corresponding aten schemas via the schema in their python def. This allows multiple decompositions for an overloaded op like `aten.var` (shown here in the example).
This is just a first PR, i'm sure there will be many follows ups such as:
- making these runnable in C++ with simple executor
- porting over more decompositions from AOT Autograd
- Using opinfos / more robust testing
- Categorizing decompositions
- Hooking in decompositions at various points of JIT execution
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D34938126
Pulled By: eellison
fbshipit-source-id: 9559a7cb731982e3a726f2f95af498b84fb09c13
(cherry picked from commit a4e0e748791e378e7e12a9dd0b63fb3c62dc1890)
Summary:
added python API to disable nvfuser on certain opkind.
```
"_jit_set_nvfuser_skip_node_kind",
[](const std::string& op_name, bool flip = true) {
return fuser::cuda::skipNode(op_name, flip);
})
```
Args:
`op_name`: Symbol of op;
`flip`: flag indicating whether to flip the given op in the skip list.
Returns:
a bool flag indicating if `op_name` was already in the skip list.
The python example that disables the fusion of `aten::add` afterwards.
`torch._C._jit_set_nvfuser_skip_node_kind("aten::add", True) # returns False, as no op is in skip list by default`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74520
Reviewed By: saketh-are
Differential Revision: D35046110
Pulled By: davidberard98
fbshipit-source-id: 689f5286513dbab206768823a852467b9f6b49b6
(cherry picked from commit 9a31129f7591ba2d393ab057b1cd137a6a25e7e8)
Summary:
## Description
Preview4 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
On the basis of https://github.com/pytorch/pytorch/pull/50256, the below improvements are included:
- The [preview4 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.4.1) of the oneDNN Graph API is used
- The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users could enable it by:
```
torch.jit.enable_onednn_fusion(True)
```
### Performance:
[pytorch/benchmark](https://github.com/pytorch/benchmark) tool is used to compare the performance:
- SkyLake 8180 (1 socket of 28 cores):

- SkyLake 8180 (single thread):

\* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
\** We expect performance gain after mapping transpose, contiguous & view to oneDNN graph ops
### Directory structure of the integration code
Fuser-related code are placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is:
```
caffe2/CMakeLists.txt
```
## Limitations
- In this PR, we have only supported the optimization on Linux platform. The support on Windows and MacOS will be enabled as the next step.
- We have only optimized the inference use case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68111
Reviewed By: eellison
Differential Revision: D34584878
Pulled By: malfet
fbshipit-source-id: ce817aa8cc9052ee9ed930c9cf66be83449e61a4
(cherry picked from commit cd17683aa7d9c0947df45a1ab53627feff795587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71230
DBR quantization uses `torch.Tensor.as_subclass` frequently. When
the quantized model is traced with `torch.jit.trace`, these calls appear
in the resulting graph as `aten::alias`. This PR adds a pass to remove
these calls from the graph, for two reasons:
1. ease of debugging (these calls do nothing)
2. less work for downstream passes (for example, converting to ONNX currently breaks if these alias calls are present)
For now, we have to inline the graph in order for `aliasDb` to determine
safety properly. In the future, we may choose to relax this if there is
a need for it.
Test Plan:
Test plan is pretty basic for now, it can be improved in future PRs.
```
python test/test_quantization.py TestQuantizeDBR.test_jit_tracing_removes_aliases
```
Reviewed By: eellison
Differential Revision: D33552387
Pulled By: vkuzo
fbshipit-source-id: 681a33ddfff394a91e971263ac593afd93c5ea78
(cherry picked from commit 0f8412725d0c6fd9ef1072a50d4203465aa5d1f9)
Summary:
Based on past PRs, here is an non-exhaustive list of files to consider for extension. The PR is not meant to be final. Based on feedback and discussion, files could be dropped from the list, or PR could be updated to move code around such that extension is no longer needed.
List of files below and description:
* These files are for converting from IR to ONNX proto. These should be used only for ONNX.
```
"torch/csrc/jit/serialization/export.*",
"torch/csrc/jit/serialization/onnx.*",
```
* This file is touched whenever pass signature is updated.
```
"torch/_C/__init__.pyi.in",
```
* These files are touched whenever pass signature is updated. Somehow it's been convention that onnx passes are also added here, but it could be possible to move them. Let me know what you think.
~~"torch/csrc/jit/python/init.cpp",~~
~~"torch/csrc/jit/python/script_init.cpp",~~
Update: Bowen will move onnx passes to files under onnx folder.
* ~~Touched when need new attr::xxx, or onnx::xxx.~~
~~"aten/src/ATen/core/interned_strings.h"~~
Update: Nikita will help separate this file.
malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72297
Reviewed By: H-Huang
Differential Revision: D34254666
Pulled By: malfet
fbshipit-source-id: 032cfa590cbedf4648b7335fe8f09a2380ab14cb
(cherry picked from commit 88653eadbf5b6dfe1f84acec8f1c3256a49f2f68)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69547
ScriptModule export introduces duplicated ONNX initializers for shared weights, unnecessarily increases ONNX model size. This PR de-duplicates ONNX initializers for model exported in eval mode, by checking if the underlying tensors share the same `data_ptr`, `strides` and `sizes`.
Test Plan: Imported from OSS
Reviewed By: msaroufim
Differential Revision: D32994271
Pulled By: malfet
fbshipit-source-id: 10ac66638b6255890875272472aa9ed07a5b1d9a
Co-authored-by: BowenBao <bowbao@microsoft.com>
(cherry picked from commit d7cbde940c5c259a3feff5af870b01dd21fbf3e0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71650
*
Refactors PE so there is a current fusion strategy set, which will take in a vector of e.g. [(STATIC, 2), (DYNAMIC, 10)] which means fuse two static invocations then fuse 10 dynamic ones, then stop specializing.
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33801501
Pulled By: eellison
fbshipit-source-id: ebc7ac3c57e35a3b9bb15ab751f0aa1d25cc9bd5
(cherry picked from commit 8dd89088d3ceae800ea110d0b6949b759d4fe582)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71443
cogwheel test inline_cvr_infer_canary_pyper_model_publish is timing out.
The convert_fx call takes > 20 mins for local and local_ro sub modules, which used to take ~ 2 mins.
Test Plan:
Fblearn flow run
* the following cmd took 1113 seconds before the diff and 5002 seconds after.
flow-cli clone-locally 320014219 --run-as-secure-group pytorch_at_scale --operators pyper_model_publish_workflow.pyper_model_publish_workflow.process_torch_package_model_files.process_non_sparse_parameters[0]
Cogwheel test
* Cogwheel test with packages in B3588 (the last good run) took 4694.48s
* Cogwheel test with packages in B3590 (the first timeout) took 13975.83s
* Cogwheel test with the following packages took 4535.04s
* all packages in B3588 except the model publish
* the model publish built with D33469839 (043e84b3d2) reversed (created D33633570)
Reviewed By: albanD, jerryzh168
Differential Revision: D33633570
fbshipit-source-id: dc5e777c48a90c551641a3f79126461f6a60449e
(cherry picked from commit 03ab65023a9f4175584ddac1cca7eab51397c84a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70464
Add handling of strided input tensors to dynamic fusion. This is done with the same set of input striding specializations as https://github.com/pytorch/pytorch/pull/60684/:
```
S_ONE, // STRIDE_ONE: packed
S_CONT, // STRIDE_CONTIGUOUS: stride[i + 1] * sizes[i + 1]
S_TRAN_CONT, // STRIDE_TRANSPOSED_CONTIGUOUS: stride[i-1] * sizes[i-1]
S_AS_ARG, // STRIDE_AS_ARG: stride passed in as runtime value
```
and then two additional specializations for a) contiguous tensor and b) channels-last tensor. channels-last is a common case and we should optimize for it. additionally, tensors natively store whether they are contiguous/channels-last contiguous, which makes it faster to check if tensors follow this pattern.
Output striding will be done in a follow up.
The striding is stored on both the TensorGroup node and on the guard node. The striding descriptors are stored as a vector of strings on the node for debugability and to make use of storing ivalues as attributes on nodes.
As an example:
```
%8 : Double(10, 11, 12, 13, strides=[1716, 1, 143, 11], requires_grad=0, device=cpu) = prim::TensorExprGroup_0[symbolic_shape_inputs=[-37, -36, -35, -34], striding_inputs_desc=[["TENSOR_CONT_CHANNELS_LAST"]](%x, %24, %23, %22, %21)```
```
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D33458649
Pulled By: eellison
fbshipit-source-id: c42616d3c683d70f6258180d23d3841a31a6030d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67254
Fixes https://github.com/pytorch/pytorch/issues/65997
BC breaking:
`output = torch.ops._test.leaky_relu(self=torch.tensor(-1.0))` now fails with the error `TypeError: __call__() got multiple values for argument 'self'` since we call into `OpOverloadBundle`'s `__call__` method that has `self` bound to it as its first argument.
Follow up work:
1. disallow `default` as an overload name for aten operators.
2. Add a method to obtain a list of all overloads (exclude the ones registered by JIT)
3. Add methods/properties to `OpOverload` to access more schema information (types of input and output args etc)
cc ezyang gchanan
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D33469839
Pulled By: anjali411
fbshipit-source-id: c3fc43460f1c7c9651c64b4d46337be21c400621
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65645
This is a retry of PR: https://github.com/pytorch/pytorch/pull/59492
Latest Changes: Added more tests, added the getOrCreateDB pattern, updated logic to remove unnecessary checks
addressed all comments.
Adding code to find common expressions from the two subblocks of an if
operation and hoist them before the if block.
This also allows Dead Code Elimination to
then eliminate some if blocks.
Test Plan: python test_jit.py TestIfHoisting
Reviewed By: eellison
Differential Revision: D33302065
Pulled By: Gamrix
fbshipit-source-id: a5a184a480cf07354359aaca344c6e27b687a3c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67254
Fixes https://github.com/pytorch/pytorch/issues/65997
TODO: disallow `default` as an overload name for aten operators.
BC breaking:
`output = torch.ops._test.leaky_relu(self=torch.tensor(-1.0))` now fails with the error `TypeError: __call__() got multiple values for argument 'self'` since we call into `OpOverloadBundle`'s `__call__` method that has `self` bound to it as its first argument.
cc ezyang gchanan
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33262228
Pulled By: anjali411
fbshipit-source-id: 600dbf511514ea9b41aea3e6b1bc1102dab08909