Re-landing #68111/#74596
## Description
v0.5 PR of this [RFC](https://github.com/pytorch/pytorch/issues/49444).
Building on #50256, the following improvements are included:
* The [v0.5 release branch](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.5) of the oneDNN Graph API is used
* The fuser now works with the profiling graph executor. We have inserted type check nodes to guard the profiled tensor properties.
### User API:
The optimization pass is disabled by default. Users can enable it with:
```
torch.jit.enable_onednn_fusion(True)
```
`torch.jit.freeze` should be used after tracing (recommended) or scripting a model.
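For illustration, a minimal end-to-end sketch of this flow (the model and input shape are placeholders, not from the PR):
```
import torch

# Enable the oneDNN Graph fusion pass (off by default).
torch.jit.enable_onednn_fusion(True)

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3),
    torch.nn.Hardswish(),
).eval()
example = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    frozen = torch.jit.freeze(traced)
    # A couple of warm-up runs let the profiling executor record
    # tensor properties before fusion kicks in.
    frozen(example)
    frozen(example)
    out = frozen(example)
```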
### Performance:
The [pytorch/benchmark](https://github.com/pytorch/benchmark) tool was used to compare performance:
* SkyLake 8180 (1 socket of 28 cores): benchmark results were attached as charts in the original PR
* SkyLake 8180 (single thread): benchmark results were attached as charts in the original PR
* By mapping hardswish to oneDNN Graph, it’s 8% faster than PyTorch JIT (NNC + OFI)
  * We expect further performance gains after mapping `transpose`, `contiguous` & `view` to oneDNN Graph ops
### Directory structure of the integration code
Fuser-related code is placed under:
```
torch/csrc/jit/codegen/onednn/
```
Optimization pass registration is done in:
```
torch/csrc/jit/passes/onednn_graph_fuser.h
```
CMake for the integration code is in:
```
caffe2/CMakeLists.txt
cmake/public/mkldnn.cmake
cmake/Modules/FindMKLDNN.cmake
```
## Limitations
* In this PR, we only support the PyTorch-oneDNN Graph integration on Linux. Support for Windows and macOS will be enabled as a next step.
* We have only optimized the inference use-case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76622
Approved by: https://github.com/eellison
This functionality does not seem to be used, and there are requests to update the dependency.
Add `third_party` to torch_cpu include directories if compiling with
Caffe2 support, as `caffe2/quantization/server/conv_dnnlowp_op.cc` depends on `third_party/fbgemm/src/RefImplementations.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75394
Approved by: https://github.com/janeyx99, https://github.com/seemethere
This is enabled on some of our internal builds, is a common source of fbcode-only errors, and apparently we are already relatively clean on it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74996
Approved by: https://github.com/malfet
Summary:
This is the very first step of the UCC-NCCL integration. This PR lets `ProcessGroupNCCL` load `torch_ucc.so` if the user specifies the environment variable `TORCH_UCC_LIBRARY_PATH`. If this environment variable is not specified, there will be no visible change.
In the future, we may want to make PyTorch smart enough to automatically detect the `torch_ucc.so` in the user's system, but before doing that, I believe we should first make sure that `ProcessGroupUCC` is very well tested.
Note that in this PR, `ProcessGroupNCCL` just loads the library but will not use it. I am trying to make PRs small, so the usage of `torch_ucc.so` will be submitted in later PRs.
This PR requires the change in https://github.com/facebookresearch/torch_ucc/pull/56, otherwise `torch_ucc.so` cannot be successfully loaded. But this PR can be landed separately without waiting for https://github.com/facebookresearch/torch_ucc/pull/56 because, in PyTorch's unit tests, UCC is never used or tested.
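For illustration, opting in via the environment variable might look like the following sketch (the library path and rendezvous settings are placeholders):
```
import os
import torch.distributed as dist

# Point ProcessGroupNCCL at a torch_ucc build; without this variable,
# behavior is unchanged.
os.environ["TORCH_UCC_LIBRARY_PATH"] = "/path/to/torch_ucc.so"

dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)
```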
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69552
Reviewed By: mruberry
Differential Revision: D34675212
Pulled By: jiayisuse
fbshipit-source-id: a3d1fb98340dbe3a931af555423863efd381f1ae
(cherry picked from commit 3778b6fabe70c26b5a65e6ddec641d2ef9113cd1)
Summary:
Also enables the Bazel build to run lazy codegen. The Bazel (OSS) build feeds off the same file lists (`build_variables.bzl`) as the CMake/Buck builds, so enabling it is easier than keeping it disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74111
Test Plan: Run CI and verify test_lazy_ops is running via OSS cmake builds
Reviewed By: bdhirsh
Differential Revision: D34772403
fbshipit-source-id: 8a63f58b9536e6ac1be530667932176ef2549496
(cherry picked from commit e807ffb1918853d10b924fdc24f85ee5b1a39021)
Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols. Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74009
Approved by: https://github.com/malfet
Summary:
RFC: https://github.com/pytorch/rfcs/pull/40
This PR (re)introduces Python codegen for unboxing wrappers. Given an entry of `native_functions.yaml`, the codegen generates the corresponding C++ code to convert IValues from the stack to their proper types. To trigger the codegen, run
```
tools/jit/gen_unboxing.py -d cg/torch/share/ATen
```
The CI test changes have been merged. In https://github.com/pytorch/pytorch/issues/71782 I added an e2e test for static dispatch + codegen unboxing. The test exports a mobile model of MobileNetV2, then loads and runs it on a new binary for the lite interpreter: `test/mobile/custom_build/lite_predictor.cpp`.
## Lite predictor build specifics
1. Codegen: `gen.py` generates `RegisterCPU.cpp` and `RegisterSchema.cpp`. Now with this PR, once `static_dispatch` mode is enabled, `gen.py` will not generate `TORCH_LIBRARY` API calls in those cpp files, hence avoids interaction with the dispatcher. Once `USE_LIGHTWEIGHT_DISPATCH` is turned on, `cmake/Codegen.cmake` calls `gen_unboxing.py` which generates `UnboxingFunctions.h`, `UnboxingFunctions_[0-4].cpp` and `RegisterCodegenUnboxedKernels_[0-4].cpp`.
2. Build: `USE_LIGHTWEIGHT_DISPATCH` adds the generated sources into `all_cpu_cpp` in `aten/src/ATen/CMakeLists.txt`. All other files remain unchanged. In reality, the `Operators_[0-4].cpp` files are not necessary, but we can rely on the linker to strip them out.
## Current CI job test coverage update
Created a new CI job `linux-xenial-py3-clang5-mobile-lightweight-dispatch-build` that enables the following build options:
* `USE_LIGHTWEIGHT_DISPATCH=1`
* `BUILD_LITE_INTERPRETER=1`
* `STATIC_DISPATCH_BACKEND=CPU`
This job triggers `test/mobile/lightweight_dispatch/build.sh` and builds `libtorch`. Then the script runs the C++ tests written in `test_lightweight_dispatch.cpp` and `test_codegen_unboxing.cpp`. Recent commits added tests to cover as many C++ argument types as possible: in `build.sh` we install the PyTorch Python API so that we can export test models in `tests_setup.py`, then run the C++ test binary to run these models on the lightweight-dispatch-enabled runtime.
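As a hedged sketch of that export step (the module here is made up; `tests_setup.py` is where the PR exports models covering many argument types):
```
import torch

class AddOne(torch.nn.Module):
    def forward(self, x):
        return x + 1.0

scripted = torch.jit.script(AddOne())
# Save in lite-interpreter format so the C++ test binary can load it.
scripted._save_for_lite_interpreter("add_one.ptl")
```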
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69881
Reviewed By: iseeyuan
Differential Revision: D33692299
Pulled By: larryliu0820
fbshipit-source-id: 211e59f2364100703359b4a3d2ab48ca5155a023
(cherry picked from commit 58e1c9a25e3d1b5b656282cf3ac2f548d98d530b)
Summary:
When enabled, it will generate the `torch_cuda_linalg` library, which depends on cuSOLVER and MAGMA and registers dynamic bindings to it from `LinearAlgebraStubs`.
Avoid symbol clashes that can result in infinite recursion by moving all symbols in the library to its own namespace.
Add checks that should prevent calling self in recursion to `LinearAlgebraStubs.cpp`
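From Python the dynamic binding is invisible; a sketch of the user-visible behavior on a build with the library enabled:
```
import torch

a = torch.randn(8, 8, device="cuda")
# The first linear-algebra call dispatches through LinearAlgebraStubs,
# which loads the torch_cuda_linalg library and re-dispatches to the
# real kernel.
q, r = torch.linalg.qr(a)
```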
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73447
Reviewed By: albanD
Differential Revision: D34538827
Pulled By: malfet
fbshipit-source-id: f2535b471d3524768a84b2e169b6aa24c26c03bf
(cherry picked from commit 4ec24b079c861c1122f0fa86e280b977c3c2f7ac)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/73377
We've migrated to CUDA 11.3 as the default toolkit in 1.9, so it's time to stop these builds (especially considering the forward-compatibility guarantee across CUDA 11.x drivers).
Hence we are removing CUDA 11.1 support. We should also clean up old CUDA-related code from our builder and pytorch repos, making the scripts a little cleaner.
We have code that references CUDA 9.2, 10.1, 11.0, 11.1, and 11.2, and none of these are currently used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73514
Reviewed By: janeyx99
Differential Revision: D34551989
Pulled By: atalman
fbshipit-source-id: 9ceaaa9b25ad49689986f4b29a26d20370d9d011
(cherry picked from commit fe109c62daf429e9053c03f6e374568ba23cd041)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make `USE_XNNPACK` an option that depends on CMake version 3.12 or newer
- Print `USE_XNNPACK` under the cmake options summary, and print the
  availability from collect_env.py
- Skip XNNPACK based tests when XNNPACK is not available
- Add `SkipIfNoXNNPACK` wrapper to skip tests (see the sketch after this list)
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
- This is required for the backwards compatibility test.
The PyTorch op schema is XNNPACK-dependent. See
aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for
example. The nightly version is assumed to have USE_XNNPACK=ON,
so with this change we ensure that the test build can also
have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
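A hedged sketch of how a test can use that wrapper (the import path is an assumption; the `prepacked` ops are the XNNPACK-backed ones registered in `RegisterOpContextClass.cpp`):
```
import torch
from torch.testing._internal.common_utils import TestCase, run_tests, skipIfNoXNNPACK

class TestWithXNNPACK(TestCase):
    @skipIfNoXNNPACK
    def test_prepacked_linear(self):
        x = torch.randn(2, 4)
        w = torch.randn(3, 4)
        b = torch.randn(3)
        # These ops only exist when PyTorch is built with USE_XNNPACK=ON.
        ctx = torch.ops.prepacked.linear_clamp_prepack(w, b)
        y = torch.ops.prepacked.linear_clamp_run(x, ctx)
        self.assertEqual(y.shape, (2, 3))

if __name__ == "__main__":
    run_tests()
```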
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642
Reviewed By: kimishpatel
Differential Revision: D34456794
Pulled By: digantdesai
fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72306
When enabled, it will generate the `torch_cuda_linalg` library, which depends on cuSOLVER and MAGMA and registers dynamic bindings to it from `LinearAlgebraStubs`.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33992795
Pulled By: malfet
fbshipit-source-id: d1fa351a320659b29754997c20d754e69bfe36c0
(cherry picked from commit d5d6c69a988b9454538ecd28674206da2541de17)
Summary:
We noticed that on M1 Macs, Transformer network profiles are dominated by the scalar `exp` and `erff` functions (for softmax and GELU).
The NEON `Vectorized<float>` implementation does not use SLEEF functions in order to compile on mobile platforms. However, SLEEF is already compiled on macOS ARM64 and is safe to use there. This change adds another implementation of `Vectorized<float>` that uses SLEEF functions. This implementation is only used on macOS ARM64.
This change speeds up e.g. prediction of spaCy transformer models by 20% on M1 Macs.
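A rough way to observe the effect (a sketch, not the spaCy benchmark from the PR; shapes are illustrative):
```
import timeit
import torch

x = torch.randn(64, 128, 768)  # transformer-sized activations
for name, fn in [
    ("softmax", lambda: torch.softmax(x, dim=-1)),
    ("gelu", lambda: torch.nn.functional.gelu(x)),
]:
    # Both ops are dominated by scalar exp/erff without the SLEEF path.
    print(name, timeit.timeit(fn, number=200))
```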
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70354
Reviewed By: albanD
Differential Revision: D33659540
Pulled By: kimishpatel
fbshipit-source-id: b8f02a61321873fc60778190a005c466c7d0cc0c
(cherry picked from commit 71286a207cefaae5a0be4eb3d618b55366ee4861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68247
This splits `Functions.h`, `Operators.h`, `NativeFunctions.h` and `NativeMetaFunctions.h` into separate headers per operator base name.
With `at::sum` as an example, we can include:
```cpp
<ATen/ops/sum.h>        // Like Functions.h
<ATen/ops/sum_ops.h>    // Like Operators.h
<ATen/ops/sum_native.h> // Like NativeFunctions.h
<ATen/ops/sum_meta.h>   // Like NativeMetaFunctions.h
```
The umbrella headers are still being generated, but all they do is include from the `ATen/ops` folder.
Further, `TensorBody.h` now only includes the operators that have
method variants, which means files that only include `Tensor.h` don't
need to be rebuilt when you modify function-only operators. Currently
there are about 680 operators that don't have method variants, so this
is potentially a significant win for incremental builds.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D32596272
Pulled By: albanD
fbshipit-source-id: 447671b2b6adc1364f66ed9717c896dae25fa272
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69710
Namely, no range-loop-analysis (which detects when a loop variable cannot be a const reference).
Test Plan: Imported from OSS
Reviewed By: r-barnes
Differential Revision: D32997003
Pulled By: malfet
fbshipit-source-id: dba0e7875e5b667e2cc394c70dd75e2403265918
Summary:
This fixes the `USE_PRECOMPILED_HEADERS` CMake version check, which was accidentally inverted so the feature was always disabled.
I've also trimmed the precompiled header so it only includes headers used in 95% or more of code, weighted by compile time. This limits it to the standard library, `c10`, and a limited subset of `ATen/core`. Crucially, the new PCH doesn't depend on `native_functions.yaml`, so it won't cause as much unnecessary rebuilding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67851
Reviewed By: zou3519
Differential Revision: D32290902
Pulled By: dagitses
fbshipit-source-id: dfc33330028c99b02ff40963926c1f1260d00d00
Summary:
There is a bug in CMake's Ninja generator where files considered inputs to the cmake command could not be generated by another build step. The fix was included in CMake 3.13, but 3.10.3 is still sufficient for other CMake generators, e.g. Makefiles.
For reference, the bug is here https://gitlab.kitware.com/cmake/cmake/-/issues/18584
This is necessary for https://github.com/pytorch/pytorch/issues/68246 but I'm isolating the change here to make testing easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68731
Reviewed By: jbschlosser
Differential Revision: D32604545
Pulled By: malfet
fbshipit-source-id: 9bc0bd8641ba415dd63ce21a05c177e2f1dd9866
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180
Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377
Test Plan: CIs
Reviewed By: seemethere
Differential Revision: D32358467
fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
Summary:
Caffe2 has been deprecated for a while but is still included in every PyTorch build.
We should stop building it by default, although CI should still validate that caffe2 code is buildable.
Build even fewer dependencies when compiling mobile builds without Caffe2
Introduce `TEST_CAFFE2` in torch.common.utils
Skip `TestQuantizedEmbeddingOps` and `TestJit.test_old_models_bc` if code is compiled without Caffe2
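A hedged sketch of that guard (the import location is an assumption; the PR text only says `torch.common.utils`):
```
import unittest
from torch.testing._internal.common_utils import TEST_CAFFE2  # assumed location

class TestOldModelsBC(unittest.TestCase):
    @unittest.skipIf(not TEST_CAFFE2, "PyTorch was built without Caffe2")
    def test_old_models_bc(self):
        pass  # placeholder body; the real test loads legacy serialized models
```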
Should be landed after https://github.com/pytorch/builder/pull/864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66658
Reviewed By: driazati, seemethere, janeyx99
Differential Revision: D31669156
Pulled By: malfet
fbshipit-source-id: 1cc45e2d402daf913a4685eb9f841cc3863e458d
Summary:
As titled, introduce the file `TracerRunner`, shared by the internal and external tracers; the main function is
```
TracerResult trace_run(const std::string& input_module_path);
```
which takes the path to a model file and generates the trace result. The main differences between the external and internal tracers are:
1. the dependency on `<yaml-cpp/yaml.h>`;
2. the output YAML file from the internal tracer includes `model_version` and `model_asset`, which are only needed internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64152
ghstack-source-id: 140692467
Test Plan:
```
./build/bin/model_tracer --model_input_path "/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_with_bundled_input.ptl" --build_yaml_path "/Users/chenlai/Documents/pytorch/tracing/tmp.yaml"
```
```
./fbcode/caffe2/fb/model_tracer/run_model_with_bundled_inputs.sh ~/local/notebooks/prod_models/deeplabv3_scripted_with_bundled_input.ptl
```
Both produce the same operator output:
selected_operators.yaml (P460296279)
selected_mobile_ops.h (P460296258)
Reviewed By: dhruvbird
Differential Revision: D30632224
fbshipit-source-id: eb0321dbc0f1fcf6d2e05384695eebb59ac04f8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65401
Per https://github.com/pytorch/pytorch/issues/57744 statically linked CUPTI
causes exception handling to break on certain compiler configurations, likely
because CUPTI comes with incompatible libstdc++ symbols. Rather than pray that
something reasonable happens, use the safer configuration (dynamic linking) by
default and give a warning if the user inverts the setting.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: gdankel
Differential Revision: D31082208
Pulled By: ezyang
fbshipit-source-id: 14f66af920847e158436b5801c43f3124b109b34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445
PyTorch currently uses the old style of compiling CUDA in CMake, which is just a bunch of scripts in `FindCUDA.cmake`. Newer CMake versions support CUDA natively as a language, just like C++ or C.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31503350
fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
Summary:
Previously https://github.com/pytorch/pytorch/pull/64087 broke the test `binary_macos_wheel_3_7_cpu_build` because the wheel build was not happy with `model_tracer`. Considering it's a prototype and there is no need to ship `model_tracer` via wheel at the moment, the tracer is now built behind the `TRACING_BASED` option. When tracing-based selective build is mature enough, we can ship the tracer binary via wheel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66267
Original commit changeset: 8ac3d75a52d0
ghstack-source-id: 140122106
Test Plan:
binary_macos_wheel_3_7_cpu_build passes
Reviewed By: dhruvbird
Differential Revision: D31478593
fbshipit-source-id: 726cab1b31c4596f6268b7824eecb20e2e59d161
Summary:
Delete `-Wno-unused-variable` from the top-level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from Caffe2 code
Use `(void)var;` to suppress unused variables in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls, as they have side effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8