Summary:
Request to update the ROCm CI Docker images to release 3.1.
Changes required to the PyTorch source base are attached:
* switch to the fast path for the Caffe2 ReLU operator
* switch to the new `hipMemcpyWithStream(stream)` API, replacing the `hipMemcpyAsync(stream)` + `hipStreamSynchronize(stream)` paradigm with a single, optimized call (see the sketch after this list)
* disable two regressed unit tests
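For illustration, a minimal HIP/C++ sketch of this substitution, assuming a host-to-device copy (not the verbatim PyTorch change):
```cpp
#include <hip/hip_runtime.h>

// Old paradigm: enqueue an async copy, then block on the whole stream.
//   hipMemcpyAsync(dst, src, bytes, hipMemcpyHostToDevice, stream);
//   hipStreamSynchronize(stream);
// New API: one call with the same blocking semantics, which the runtime
// can service more cheaply than a full stream synchronization.
void copy_to_device(void* dst, const void* src, size_t bytes, hipStream_t stream) {
  hipMemcpyWithStream(dst, src, bytes, hipMemcpyHostToDevice, stream);
}
```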
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33930
Differential Revision: D20589048
Pulled By: ezyang
fbshipit-source-id: 568f40c1b90f311eb2ba57f02a9901114d8364af
Summary:
This makes PyTorch compilable (but not linkable) with the `CUDA_SEPARABLE_COMPILATION` option enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34899
Test Plan: CI
Differential Revision: D20501050
Pulled By: malfet
fbshipit-source-id: 02903890a827fcc430a26f397d4d05999cf3a441
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34228
This PR adds LLVM codegen to tensor expressions. LLVM is added as an
optional build dependency specified with the `USE_LLVM=<path_to_llvm>`
variable. If this variable is not set or LLVM is not found at the
specified path, the LLVM codegen is completely disabled.
Differential Revision: D20251832
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: 77e203ab4421eb03afc64f8da17e0daab277ecc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34547
This enables threading by passing a threadpool to the XNNPACK ops.
Test Plan:
python test/test_xnnpack_integration.py
Imported from OSS
Differential Revision: D20370553
fbshipit-source-id: 4db08e73f8c69b9e722b0e11a00621c4e229a31a
Summary:
Currently if we run
```bash
DEBUG=1 ONNX_ML=0 MAX_JOBS=8 CMAKE_CXX_COMPILER_LAUNCHER=ccache CMAKE_C_COMPILER_LAUNCHER=ccache CMAKE_CUDA_COMPILER_LAUNCHER=ccache USE_OPENMP=0 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_NCCL=0 USE_CUDA=1 USE_CUDNN=0 USE_STATIC_CUDNN=0 USE_NNPACK=0 USE_QNNPACK=0 USE_FBGEMM=0 BUILD_TEST=0 TORCH_CUDA_ARCH_LIST="6.1" python setup.py develop --cmake-only
```
then `touch build/CMakeCache.txt` (which is what adjusting build options
does), and then run `python setup.py develop`, the following error
message shows up:
```
CMake Error at build/clog-source/CMakeLists.txt:249 (ADD_SUBDIRECTORY):
ADD_SUBDIRECTORY not given a binary directory but the given source
directory "/home/hong/wsrc/pytorch/build/clog-source" is not a subdirectory
of "/home/hong/wsrc/pytorch/build/clog-source". When specifying an
out-of-tree source a binary directory must be explicitly specified.
```
This is due to a conflict between our cpuinfo submodule and XNNPACK's
external clog dependency. Moving our cpuinfo upward and setting
CLOG_SOURCE_DIR resolves the issue.
---
Also reverted https://github.com/pytorch/pytorch/issues/33947, where making `CLOG_SOURCE_DIR` an option was not quite appropriate (given that cpuinfo uses its included clog subdirectory); the variable should instead be set a bit later, once the directory of cpuinfo is known.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33922
Differential Revision: D20193572
Pulled By: ezyang
fbshipit-source-id: 7cdbdc947a6c7e0ef10df33feccb5b20e1b3ba43
Summary:
Mainly renames Caffe2's `pthread_create`, the only conflicting symbol referred to internally by NNPACK, to `pthread_create_c2`.
Removes two other conflicting symbols that are not used internally at all.
Points XNNPACK at the original repo instead of the fork.
Copies the new interface and implementation to
caffe2/utils/threadpool, so that internal builds compile against
this.
This will be removed once the threadpool is unified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33869
Differential Revision: D20140580
Pulled By: kimishpatel
fbshipit-source-id: de70df0af9c7d6bc065e85ede0e1c4dd6a9e6be3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33947
XNNPACK was downloading clog because we weren't setting CLOG_SOURCE_DIR.
Actually, it was downloading cpuinfo and pointing to the copy of clog
within that. So let's just point to the copy of clog within the cpuinfo
submodule we already have.
(Note: this ignores all push blocking failures!)
Test Plan:
Ran cmake and didn't see any downloading.
Verified that our clog is the same as the one that was being downloaded
with `diff -Naur`.
Differential Revision: D20169656
Pulled By: suo
fbshipit-source-id: ba0f7d1535f702e504fbc4f0102e567f860db94b
Summary:
Compiling with `NDEBUG` disables `assert`-based checks, which might lead to silent undefined behaviour (e.g. with out-of-bound indices). This affects `test_multinomial_invalid_probs_cuda`, which is now removed.
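A minimal C++ sketch of the hazard, with a hypothetical bounds check standing in for the real kernel code:
```cpp
#include <cassert>

// With -DNDEBUG, assert() expands to a no-op, so this bounds check
// vanishes in release builds: an out-of-bound index then reads past
// the buffer silently instead of aborting loudly.
float read_prob(const float* probs, int idx, int n) {
  assert(idx >= 0 && idx < n);  // compiled out under -DNDEBUG
  return probs[idx];
}
```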
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32719
Test Plan:
* Build with VERBOSE=1 and manually inspect `less ndebug.build.log | grep 'c++' | grep -v -- -DNDEBUG` (only with ninja on Linux)
* CI
Fixes https://github.com/pytorch/pytorch/issues/22745
Differential Revision: D20104340
Pulled By: yf225
fbshipit-source-id: 2ebfd7ddae632258a36316999eeb5c968fb7642c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33722
In order to improve CPU performance for floating-point models on mobile, this PR introduces a new mobile CPU backend that implements the most common mobile operators with NHWC memory-layout support through integration with XNNPACK.
XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset builds XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable (default ON) is enabled, but does not expose or enable this code path in any other way.
Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, provided that a series of XNNPACK-specific conditions is met, much like how NNPACK is integrated with PyTorch today.
Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it provides no way to compute one-time operations once and factor them out of the innermost forward() loop.
The more optimal solution, and one we will decide on soon, would involve either a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or new eager-mode modules that directly call into these implementations through c10 or some other mechanism, which would also decouple op creation from op execution.
This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen to make the codebase a little simpler, granted that there is widespread support for such a move.
Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
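As a rough illustration of the preprocessor gating described above (a minimal sketch; the function and messages are hypothetical placeholders, not the actual PyTorch operators):
```cpp
#include <cstdio>

// USE_XNNPACK is deliberately not defined yet, so the guarded branch is
// dead code until a follow-up PR starts passing the symbol to the compiler.
void conv2d_dispatch() {
#ifdef USE_XNNPACK
  std::puts("NHWC fast path via XNNPACK");      // hypothetical placeholder
#else
  std::puts("existing native implementation");  // hypothetical placeholder
#endif
}
```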
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509
Test Plan:
Build: CI
Functionality: Not exposed
Reviewed By: dreiss
Differential Revision: D20069796
Pulled By: AshkanAliabadi
fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
Summary:
In order to improve CPU performance for floating-point models on mobile, this PR introduces a new mobile CPU backend that implements the most common mobile operators with NHWC memory-layout support through integration with XNNPACK.
XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset builds XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable (default ON) is enabled, but does not expose or enable this code path in any other way.
Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding **native** implementations, provided that a series of XNNPACK-specific conditions is met, much like how NNPACK is integrated with PyTorch today.
Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it provides no way to compute one-time operations once and factor them out of the innermost forward() loop.
The more optimal solution, and one we will decide on soon, would involve either a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or new eager-mode modules that directly call into these implementations through c10 or some other mechanism, which would also decouple op creation from op execution.
This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original https://github.com/pytorch/pytorch/issues/30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen to make the codebase a little simpler, granted that there is widespread support for such a move.
Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32509
Reviewed By: dreiss
Differential Revision: D19521853
Pulled By: AshkanAliabadi
fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
Summary:
Detection of the ONNX_ML environment variable is already properly handled in
tools/setup_helpers/cmake.py, line 242.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33424
Differential Revision: D20043991
Pulled By: ezyang
fbshipit-source-id: 91d1d49a5a12f719e67d9507cc203c8a40992f03
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33250
As the title says: FBGEMM has recently added support for Windows.
ghstack-source-id: 97932881
Test Plan: CI
Reviewed By: jspark1105
Differential Revision: D19738268
fbshipit-source-id: e7f3c91f033018f6355edeaf6003bd2803119df4
Summary:
Stacked PRs
* #32958 - Make zip serialization the default
* **#32244 - Fix some bugs with zipfile serialization**
It includes the following changes:
* Split up tests so that we can test both serialization methods
* Loading from within a buffer doesn't work anymore, so those tests run only against the old serialization method (supporting it is possible but would introduce a big slowdown, since it requires a linear scan of the entire zipfile to find the magic number at the end)
* Call `readinto` on a buffer if possible instead of `read` + a copy
* Disable CRC-32 checks on read (there was some issue where miniz said the CRC was wrong but `zipinfo` and `unzip` said the zip file was fine)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32244
Pulled By: driazati
Reviewed By: eellison
Differential Revision: D19418935
fbshipit-source-id: df140854f52ecd04236225417d625374fd99f573
Summary:
For system pybind11 installs, this is a system header location that should not get installed, since it might include other unrelated headers. The headers are already present for a system install, so there is no need to install them; only do the install when we use the bundled pybind11 version.
Closes https://github.com/pytorch/pytorch/issues/29823. Closes https://github.com/pytorch/pytorch/issues/30627.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30758
Differential Revision: D18820189
Pulled By: bddppq
fbshipit-source-id: fcc9fa657897e18c07da090752c912e3be513b17
Summary:
Also move the logic that installs the pybind11 headers from setup.py to cmake (to align with other headers).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29659
Differential Revision: D18458208
Pulled By: bddppq
fbshipit-source-id: cfd1e74b892d4a65591626ab321780c8c87b810d
Summary:
```
c10/util/Half.h:467:37: warning: implicit conversion from 'long' to 'double' changes value from 9223372036854775807 to 9223372036854775808 [-Wimplicit-int-float-conversion]
return f < limit::lowest() || f > limit::max();
~ ^~~~~~~~~~~~
c10/util/Half.h:497:41: note: in instantiation of function template specialization 'c10::overflows<long, double>' requested here
if (!std::is_same<To, bool>::value && overflows<To, From>(f)) {
```
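The warning is easy to reproduce in isolation: the largest 64-bit `long` value (2^63 - 1) is not exactly representable as a `double`, so the implicit conversion rounds it to 2^63. A minimal reproducer, assuming a 64-bit `long`:
```cpp
#include <climits>
#include <cstdio>

int main() {
  long l = LONG_MAX;                  // 9223372036854775807 (2^63 - 1)
  double d = static_cast<double>(l);  // nearest representable double is 2^63
  std::printf("%ld -> %.1f\n", l, d); // 9223372036854775807 -> 9223372036854775808.0
}
```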
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29604
Differential Revision: D18440713
Pulled By: bddppq
fbshipit-source-id: f059b4e37e90fa84308be52ff5e1070ffd04031e
Summary:
This PR makes Caffe2 compatible with TensorRT 6. To make sure it works well, a new unit test is added. This test checks the PyTorch->ONNX->TRT6 inference flow for all classification models from the TorchVision model zoo.
Note on the CMake changes: they are required in order to import the onnx-tensorrt project. See https://github.com/pytorch/pytorch/issues/18524 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26426
Reviewed By: hl475
Differential Revision: D17495965
Pulled By: houseroad
fbshipit-source-id: 3e8dbe8943f5a28a51368fd5686c8d6e86e7f693
Summary:
Fixes https://github.com/pytorch/pytorch/issues/15476, supersedes https://github.com/pytorch/pytorch/issues/23496, supersedes and closes https://github.com/pytorch/pytorch/issues/27607
As explained by rgommers in https://github.com/pytorch/pytorch/issues/23496, linking against the expanded library path for `libculibos` in `cmake/Dependencies.cmake` hard-codes the path into the distributed cmake files.
Instead, I only link against the targets (e.g. `caffe2::cudnn`) and move the dependency on `libculibos` into the cuda import targets declared in `cmake/public/cuda.cmake`. That file is distributed with the other cmake files, so the variable is expanded on the user's machine. I am now also using `CMAKE_STATIC_LIBRARY_SUFFIX` instead of `.a` to fix the Windows issue from https://github.com/pytorch/pytorch/issues/15828. I don't have a Windows setup to confirm, though.
Finally, to get PyTorch to compile with the extra libraries enabled, I also had to link `__caffe2_nccl` to `torch_python`; otherwise I was getting include errors because the hard-coded include directory was wrong: `nccl` is built into `build`, not `third_party/build`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27887
Differential Revision: D17929440
Pulled By: ezyang
fbshipit-source-id: 3db6bd94d758fca2e1d6a64f4f5eea03cc07cf64
Summary:
ROCm 2.9 brings support for the rocTX API through rocTracer.
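For context, a minimal sketch of what rocTX annotation looks like (the header path may vary across ROCm versions; usage mirrors NVTX):
```cpp
#include <roctx.h>  // rocTX markers, shipped with the rocTracer package

void train_step() {
  roctxRangePush("forward");   // open a named range visible to the profiler
  // ... forward pass ...
  roctxRangePop();             // close the range
  roctxMark("step complete");  // a point marker
}
```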
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27416
Differential Revision: D17777480
Pulled By: bddppq
fbshipit-source-id: 6bce9b54c94e5b4c5787570d2b85736882bd23a7
Summary:
FindCUDNN.cmake and cuda.cmake already handle the detection. This commit deletes `tools/setup_helpers/cudnn.py`, as it is no longer needed.
Previously, in https://github.com/pytorch/pytorch/issues/25482, one test failed because TensorRT detects cuDNN differently, and there may be situations where we can find cuDNN but TensorRT cannot. This is fixed by passing our detection result down to TensorRT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25876
Differential Revision: D17346270
Pulled By: ezyang
fbshipit-source-id: c1e7ad4a1cb20f964fe07a72906f2f002425d894
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25977
Call add_subdirectory() explicitly (before NNPACK/QNNPACK) with the
EXCLUDE_FROM_ALL property so that the pthreadpool target won't be installed
by default for the libtorch mobile build.
Test Plan: Imported from OSS
Differential Revision: D17312083
Pulled By: ljk53
fbshipit-source-id: 79851d0aa9402c5b9287ef4bbd8d7fd3a341497d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25958
Should have cleaned up the remaining protobuf dependencies before landing PR #25896.
Test Plan: - CI build;
Reviewed By: dreiss
Differential Revision: D17296949
Pulled By: ljk53
fbshipit-source-id: 20c444e63900c7fa054db3cc757d3f18614af630
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650
This PR removes protobuf dependencies from the mobile build altogether:
- caffe2/proto: protobuf files, including caffe2.proto and torch.proto;
- caffe2 components that depend on caffe2.proto, including most of
caffe2/core and caffe2/utils;
- libprotobuf / libprotobuf-lite dependencies;
- the protobuf compiler;
- some utility classes, e.g. netdef_converter.cpp;
- introduce a macro to disable third_party/onnx, which depends on protobuf.
Test Plan:
- builds;
- link with demo app to make sure it can load and run a model in pickle format;
Differential Revision: D17183548
Pulled By: ljk53
fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531
Summary:
As of ROCm 2.6, we support hiprtc, the HIP runtime compilation API. Enable the JIT fusion feature depending on the existence of such an API. This entails:
* new hipification rules for API_RTC
* adding the hiprtc APIs to the shim loader
* updating the CMake infrastructure to find the hiprtc library (it is part of the HIP package)
* enabling unit tests in the jit_fuser test set
* special-casing the resource strings for HIP, since the typedefs CUDA requires would be redundant
* disabling, for now, the occupancy calculation we do not yet support, and hard-coding it instead
Thanks to t-vi for working with me on getting this integration done!
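A minimal sketch of the runtime-compilation flow such a fuser relies on, using only public hiprtc entry points (error handling elided; not the actual fuser code):
```cpp
#include <hip/hiprtc.h>
#include <string>

// Compile HIP C++ source at runtime and return the device code blob,
// which can then be loaded and launched via the HIP module API.
std::string compile_kernel(const char* src) {
  hiprtcProgram prog;
  hiprtcCreateProgram(&prog, src, "fused_kernel.cu", 0, nullptr, nullptr);
  hiprtcCompileProgram(prog, 0, nullptr);  // no extra compile options
  size_t size = 0;
  hiprtcGetCodeSize(prog, &size);
  std::string code(size, '\0');
  hiprtcGetCode(prog, &code[0]);
  hiprtcDestroyProgram(&prog);
  return code;
}
```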
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22872
Differential Revision: D17207425
Pulled By: bddppq
fbshipit-source-id: 93409f3051ad0ea06afacc2239fd6c402152debe
Summary:
In facebookincubator/gloo#212, a libuv based Gloo transport was introduced,
which allows us to use Gloo on macOS (and later perhaps also Windows). This
commit updates CMake code to enable building with USE_DISTRIBUTED=1 on macOS.
A few notes:
* The Caffe2 ops are not compiled, for they depend on `gloo::transport::tcp`.
* The process group implementation uses `gloo::transport::tcp` on Linux (because of `epoll(2)`) and `gloo::transport::uv` on macOS.
* The TCP store works but sometimes crashes on process termination.
* The distributed tests are not yet run.
* The nightly builds don't use `USE_DISTRIBUTED=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25260
Reviewed By: mrshenli
Differential Revision: D17202381
Pulled By: pietern
fbshipit-source-id: ca80a82e78a05b4154271d2fb0ed31c8d9f26a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25602
Enable rocThrust with hipCUB and rocPRIM for ROCm. They are the ROCm implementations of the Thrust and CUB APIs and replace the older hip-thrust and cub-hip packages going forward. ROCm 2.5 is the first release to contain the new packages as an option; as of 2.6 they are the only available option.
Add hipification rules to correctly hipify thrust::cuda to thrust::hip and cub:: to hipcub:: going forward, and to hipify specific cub headers to the general hipcub header (see the sketch below).
Infrastructure work to correctly find, include, and link against the new packages. Add the macro definition that selects the HIP backend for Thrust.
Since the include chains are now a little different from CUDA's Thrust, add includes for the functionality used where applicable.
Skip four tests that fail with the new rocThrust for now.
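To illustrate the new rules, a simplified sketch of what hipified code looks like under the mappings above (not actual PyTorch source):
```cpp
#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>   // was: #include <cub/cub.cuh> (umbrella-header rule)
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>

// Reduce a device array on a given stream with the rocThrust policy.
float sum_on_stream(const float* d_in, int n, hipStream_t stream) {
  // was: thrust::reduce(thrust::cuda::par.on(stream), ...)
  return thrust::reduce(thrust::hip::par.on(stream), d_in, d_in + n, 0.0f);
}
```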
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21864
Reviewed By: xw285cornell
Differential Revision: D16940768
Pulled By: bddppq
fbshipit-source-id: 3dba8a8f1763dd23d89eb0dd26d1db109973dbe5
Summary:
We have the environment variable USE_CUDNN, with a self-explanatory name. However, the C++ code is compiled based on the preprocessor macro AT_CUDNN_ENABLED, which is defined as:
```
IF (NOT AT_CUDA_ENABLED OR NOT CUDNN_FOUND)
MESSAGE(STATUS "CuDNN not found. Compiling without CuDNN support")
set(AT_CUDNN_ENABLED 0)
ELSE()
include_directories(SYSTEM ${CUDNN_INCLUDE_DIRS})
set(AT_CUDNN_ENABLED 1)
ENDIF()
```
So even if USE_CUDNN is set to 0, the C++ code is compiled with cuDNN support whenever CMake finds cuDNN on the system. I actually ran into this and was very surprised to find myself debugging cuDNN code in a build made with USE_CUDNN=0. I believe the CMake code above should instead look like this:
`IF (NOT AT_CUDA_ENABLED OR NOT CUDNN_FOUND OR NOT USE_CUDNN) ...`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25037
Differential Revision: D17048683
Pulled By: pbelevich
fbshipit-source-id: 48afa19eaae0bba2ffd49c1f68db0b4efd5cf85e
Summary:
Currently, cuDNN detection sits together with other code in cuda.cmake. This commit is the first step toward cleaning up cuDNN detection in our build system.
This is another attempt at https://github.com/pytorch/pytorch/issues/24293, which broke the manywheels build because it did not handle `USE_STATIC_CUDNN` properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24938
Differential Revision: D17070920
Pulled By: ezyang
fbshipit-source-id: a4d017a3505c102d9c435a73ae62332e4336c52e
Summary:
Currently, cuDNN detection sits together with other code in cuda.cmake. This commit
is the first step toward cleaning up cuDNN detection in our build system.
This is another attempt at https://github.com/pytorch/pytorch/issues/24293, which broke the manywheels build because it did not handle `USE_STATIC_CUDNN` properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24784
Differential Revision: D16914345
Pulled By: ezyang
fbshipit-source-id: fd261478c01d879dc770c1f1a56b17cc1a587be2