Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336
This PR was generated by removing `const` for all types of nodes in NNC IR, and fixing compilation errors that were the result of this change.
This is the first step in making all NNC mutations in-place.
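For illustration, a minimal before/after sketch of the kind of signature change this implies (simplified stand-in types, not the actual diff):
```cpp
// Simplified stand-in types; the real NNC IR classes live in torch/csrc/jit/tensorexpr.
struct Expr {
  virtual ~Expr() = default;
};
struct Add : Expr {
  Expr* lhs = nullptr;
  Expr* rhs = nullptr;
};

struct IRMutator {
  // Before: `const Expr* mutate(const Add* v);` forced every pass to rebuild nodes.
  // After: dropping const leaves the door open for passes to mutate nodes in place.
  virtual Expr* mutate(Add* v) { return v; }
  virtual ~IRMutator() = default;
};
```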
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30049829
Pulled By: navahgar
fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`, so the per-line `NOLINTNEXTLINE` suppressions for that check are removed.
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h" | xargs grep cppcoreguidelines-avoid-non-const-global-variables | cut -f1 -d: | sort | uniq`; do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334
Here's a possibly controversial PR. These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value. While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D29471484
Pulled By: bertmaher
fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449
Make dtypeToCppString a virtual method so that a child
class can easily override the dtype string generation rule. This is
needed in preparation for making loop and tensor indices int64_t.
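A minimal sketch of the pattern, using stand-in types rather than the real NNC printer classes:
```cpp
#include <string>

struct Dtype {
  bool is_int64 = false;
};

struct Printer {
  // Virtual so a child class can override how dtypes are spelled in generated code.
  virtual std::string dtypeToCppString(const Dtype& d) {
    return d.is_int64 ? "int64_t" : "int";
  }
  virtual ~Printer() = default;
};

// A child printer that always emits 64-bit loop/index types.
struct Int64IndexPrinter : Printer {
  std::string dtypeToCppString(const Dtype& /*d*/) override {
    return "int64_t";
  }
};
```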
Test Plan:
```
build/bin/test_tensorexpr
```
Reviewed By: H-Huang
Differential Revision: D29173969
Pulled By: desertfire
fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.
- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of the CUDA and HIP differences in the generated code:
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using the hiprtc precompiled header feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400
Reviewed By: ejguan
Differential Revision: D28421065
Pulled By: malfet
fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57551
The new method allows passing input and output arguments as `void*`
pointers instead of `CallArg`s. That helps reduce the invocation
overhead. Currently this is only supported in LLVM codegen.
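A rough sketch of the two entry points with simplified stand-in types (the actual method name and NNC types may differ):
```cpp
#include <vector>

struct CallArg {
  void* ptr;
  explicit CallArg(void* p) : ptr(p) {}
};

struct CodeGenLike {
  // Existing path: arguments are boxed into CallArg objects on every call.
  void call(const std::vector<CallArg>& args) { call_raw(unbox(args)); }

  // New path: callers hand over raw void* pointers directly, skipping the
  // per-invocation boxing/unboxing work.
  void call_raw(const std::vector<void*>& args) { /* jump into compiled code */ (void)args; }

 private:
  static std::vector<void*> unbox(const std::vector<CallArg>& args) {
    std::vector<void*> raw;
    raw.reserve(args.size());
    for (const auto& a : args) {
      raw.push_back(a.ptr);
    }
    return raw;
  }
};
```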
Relanding #55113 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195049
Pulled By: ZolotukhinM
fbshipit-source-id: 035b77ae996dbbcd542b4b0e4c011b41e8d7828b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55113
The new method allows passing input and output arguments as `void*`
pointers instead of `CallArg`s. That helps reduce the invocation
overhead. Currently this is only supported in LLVM codegen.
Differential Revision: D27487549
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: d8f3d92262cde1c155beefb629454370d9af2f89
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tool/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header". Reland PR https://github.com/pytorch/pytorch/issues/54350.
This reverts commit 204ac21bf1457022caab197001788239720b96d6.
The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965
Reviewed By: jbschlosser
Differential Revision: D27755907
Pulled By: malfet
fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release. Conditionally add support for this feature. Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.
Use of this feature is gated on a new ROCM_VERSION macro.
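For illustration, a hedged sketch of how such a gate might look (the macro encoding is assumed to be major * 10000 + minor * 100 + patch; not the actual PyTorch source):
```cpp
// Assumption: ROCM_VERSION encodes 4.2.0 as 40200.
#if defined(USE_ROCM) && defined(ROCM_VERSION) && ROCM_VERSION >= 40200
// hiprtc >= 4.2 ships a precompiled HIP header, so JIT users don't need
// HIP headers installed separately.
constexpr bool kUseHiprtcPrecompiledHeader = true;
#else
constexpr bool kUseHiprtcPrecompiledHeader = false;
#endif
```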
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350
Reviewed By: H-Huang
Differential Revision: D27449031
Pulled By: malfet
fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51594
ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
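For illustration, a hypothetical bridge function shaped after this description (the exact NNC signature and registration API may differ):
```cpp
#include <cstdint>
#include <cstring>

// buf_data[0] is the output buffer, buf_data[1..] are inputs; buf_ranks and
// buf_dims describe each buffer's shape so the bridge can reconstruct tensors.
extern "C" void my_external_copy(
    int64_t bufs_num,
    void** buf_data,
    int64_t* buf_ranks,
    int64_t* buf_dims,
    int64_t args_num,
    int64_t* extra_args) {
  (void)args_num;
  (void)extra_args;
  if (bufs_num < 2) {
    return;
  }
  // Element count of the input (buffer 1); its dims start after buffer 0's dims.
  int64_t numel = 1;
  int64_t offset = buf_ranks[0];
  for (int64_t i = 0; i < buf_ranks[1]; ++i) {
    numel *= buf_dims[offset + i];
  }
  // The "external" work: here just a raw copy of float data into the output.
  std::memcpy(buf_data[0], buf_data[1], numel * sizeof(float));
}
```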
The reason the PR was previously reverted was that the LLVM generated
calls to bridge functions were breaking unwind tables. This is now fixed
by requiring bridge functions to never throw and setting the
corresponding attribute in the LLVM generated code.
Differential Revision: D26213882
Test Plan: Imported from OSS
Reviewed By: pbelevich, ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: db954d8338e2d750c2bf0a41e88e38bd494f2945
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51475
ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
Test Plan: Imported from OSS
Reviewed By: pbelevich, Chillee
Differential Revision: D26179083
Pulled By: ZolotukhinM
fbshipit-source-id: 9e44de098ae94d25772cf5e2659d539fa6f3f659
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility: it allows a future TK to work with an older driver, which might not be able to load the generated PTX through JIT compilation and would error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported archs we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
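A rough sketch of the output-selection logic under these assumptions (error handling omitted; the arch-support check is assumed to come from elsewhere):
```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

std::vector<char> getCompiledCode(nvrtcProgram prog, bool arch_supported_by_toolkit) {
  size_t size = 0;
  std::vector<char> code;
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11010
  if (arch_supported_by_toolkit) {
    // SASS (cubin): older drivers can load it without a runtime JIT step.
    nvrtcGetCUBINSize(prog, &size);
    code.resize(size);
    nvrtcGetCUBIN(prog, code.data());
    return code;
  }
#endif
  // PTX fallback: JIT-compiled by the driver, so it still runs on future devices.
  nvrtcGetPTXSize(prog, &size);
  code.resize(size);
  nvrtcGetPTX(prog, code.data());
  return code;
}
```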
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319
Reviewed By: malfet
Differential Revision: D26114475
Pulled By: ngimel
fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318
We can skip the dispatcher and go to the device-specific
`at::native::empty_strided` implementation.
Also, unpacking the TensorOptions struct at kernel launch time actually takes a
bit of work, since the optionals are encoded in a bitfield. Do this upfront
and use the optionals directly at runtime.
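A stand-in sketch of the idea, with simplified types rather than the real TensorOptions/ATen signatures:
```cpp
#include <cstdint>
#include <vector>

struct PackedOptions {   // plays the role of TensorOptions: optionals behind a bitfield
  uint8_t has_bits = 0;
  int dtype = 0;
  int device = 0;
};

struct UnpackedOptions { // decoded once, ahead of time
  int dtype;
  int device;
};

// Done once when the fused kernel is set up, not on every output allocation.
UnpackedOptions unpack(const PackedOptions& o) {
  return {(o.has_bits & 1) ? o.dtype : /*default float*/ 6,
          (o.has_bits & 2) ? o.device : /*cpu*/ 0};
}

// At run time, the device-specific allocator is called directly with the
// already-unpacked fields, skipping both the dispatcher and the bitfield decode.
void* allocate_output(const std::vector<int64_t>& sizes, const UnpackedOptions& opts);
```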
ghstack-source-id: 119735738
Test Plan:
Before:
```
-------------------------------------------------------
Benchmark               Time             CPU  Iterations
-------------------------------------------------------
FusedOverhead        2143 ns         2142 ns      332946
UnfusedOverhead      2277 ns         2276 ns      315130
```
After:
```
-------------------------------------------------------
Benchmark               Time             CPU  Iterations
-------------------------------------------------------
FusedOverhead        2175 ns         2173 ns      321877
UnfusedOverhead      2394 ns         2394 ns      307360
```
(The noise in the baseline makes this really hard to read; it seemed to be about 3-5% faster in my local testing.)
Reviewed By: eellison
Differential Revision: D25859132
fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357
This is a follow-up fix for PR #48679, where the previous PR
added support for integer inputs to aten::abs by promoting integers to
float and then demoting the result back to integers. This PR supports
integer inputs to aten::abs more efficiently in the SimpleIREvaluator
by implementing integer support for kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in the SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
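A simplified sketch of dtype-aware abs in an interpreter-style evaluator (stand-in code, not the actual SimpleIREvaluator):
```cpp
#include <cmath>
#include <type_traits>

template <typename T>
T eval_abs(T v) {
  if constexpr (std::is_integral_v<T>) {
    // Integer abs: no promotion to float and demotion back.
    return v < 0 ? -v : v;
  } else {
    // Floating-point path, as before the kFabs -> kAbs rename.
    return std::fabs(v);
  }
}
```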
Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
Imported from OSS
Reviewed By: eellison
Differential Revision: D25545791
fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on Cuda. The HalfChecker analysis, which determines whether we need to define the Half type, searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we return a false negative. Fix this by adding the parameter list to the HalfChecker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068
Reviewed By: ZolotukhinM
Differential Revision: D25009680
Pulled By: nickgg
fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
Summary:
Take 2 of this fix; I removed the repro from the issue, which is a bit flaky due to parallelism. It broke on Windows but isn't specific to Windows or this fix, I think. I'll make sure all the tests pass this time (cc zou3519).
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47448
Reviewed By: glaringlee
Differential Revision: D24765070
Pulled By: nickgg
fbshipit-source-id: 5297e647534d53657bef81f4798e8aa6a93d1fbd
Summary:
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47229
Reviewed By: agolynski
Differential Revision: D24706475
Pulled By: nickgg
fbshipit-source-id: 9df72bbbf203353009e98b9cce7ab735efff8b21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47254
CUDA codegen used a static global counter for picking names for
functions, but the names only need to be unique within the scope of the
given codegen. This PR fixes that.
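A minimal sketch of the fix pattern, with simplified names (not the actual CudaCodeGen code):
```cpp
#include <string>

class KernelNamer {
 public:
  // Each codegen instance numbers its own kernels, so names only need to be
  // unique within that codegen and concurrent codegens don't share state.
  std::string freshName(const std::string& base) {
    return base + "_" + std::to_string(counter_++);
  }

 private:
  int counter_ = 0;  // previously: a static global shared by every codegen
};
```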
Differential Revision: D24698271
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 516c0087b86b35bbb6ea7c71bb0ed9c3daaca2b8
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We're not handling explicit casts to Half inserted by the user (in the report the user being the JIT). Addressing this by replacing these with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506
Reviewed By: zhangguanheng66
Differential Revision: D23991410
Pulled By: Krovatkin
fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
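A minimal sketch of the specialization idea (not the generated CUDA helpers verbatim):
```cpp
#include <cmath>
#include <type_traits>

template <typename T>
T maximum(T a, T b) {
  if constexpr (std::is_integral_v<T>) {
    return a > b ? a : b;  // isnan is meaningless for integers, so skip it
  } else {
    return std::isnan(a) ? a : (a > b ? a : b);  // propagate NaN for floats
  }
}

template <typename T>
T minimum(T a, T b) {
  if constexpr (std::is_integral_v<T>) {
    return a < b ? a : b;
  } else {
    return std::isnan(a) ? a : (a < b ? a : b);
  }
}
```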
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 to be flaky and ultimately disabled.
The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.
To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.
Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.
For example it will transform the following:
```
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  for k in 0..5 // threadIdx.x
    do other thing(i, k);
```
Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```
And handle the case where statements are not bound by any axis, eg.
```
do outer thing;
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  do other thing(i);
```
will become:
```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.
More info and tests coming soon...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325
Reviewed By: colesbury
Differential Revision: D23628859
Pulled By: nickgg
fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator, by inserting the Cuda-specific cast to float during handling of the Cast node rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum      Caffe2    NNC      Simple   Better
(10, 100)       5.7917    9.7037   6.9386   6.0448
(100, 100)      5.9338    14.972   7.1139   6.3254
(100, 10000)    21.453    741.54   145.74   12.555
(1000, 1000)    8.0678    122.75   22.833   9.0778

Row sum         Caffe2    NNC      Simple   Better
(10, 100)       5.4502    7.9661   6.1469   5.5587
(100, 100)      5.7613    13.897   21.49    5.5808
(100, 10000)    21.702    82.398   75.462   22.793
(1000, 1000)    22.527    129      176.51   22.517
```
After:
```
Column sum      Caffe2    NNC      Simple   Better
(10, 100)       6.0458    9.4966   7.1094   6.056
(100, 100)      5.9299    9.1482   7.1693   6.593
(100, 10000)    21.739    121.97   162.63   14.376
(1000, 1000)    9.2374    29.01    26.883   10.127

Row sum         Caffe2    NNC      Simple   Better
(10, 100)       5.9773    8.1792   7.2307   5.8941
(100, 100)      6.1456    9.3155   24.563   5.8163
(100, 10000)    25.384    30.212   88.531   27.185
(1000, 1000)    26.517    32.702   209.31   26.537
```
Speedup about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and simple is closed or eliminated - remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for replacing Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097
Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.
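A stand-in sketch of the promotion step (simplified types, not the TE fuser code):
```cpp
enum class Dtype { Bool, Int, Float };

struct ExprRef {
  Dtype dtype;
};

ExprRef cast(ExprRef e, Dtype to) {
  e.dtype = to;  // in real IR this would wrap e in a Cast node
  return e;
}

Dtype promoted(Dtype a, Dtype b) {
  if (a == Dtype::Float || b == Dtype::Float) return Dtype::Float;
  if (a == Dtype::Int || b == Dtype::Int) return Dtype::Int;
  return Dtype::Bool;
}

// Before emitting `lhs == rhs`, bring both sides to a common dtype so a
// Tensor(Bool) == Int comparison typechecks.
void promoteOperands(ExprRef& lhs, ExprRef& rhs) {
  Dtype common = promoted(lhs.dtype, rhs.dtype);
  if (lhs.dtype != common) lhs = cast(lhs, common);
  if (rhs.dtype != common) rhs = cast(rhs, common);
}
```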
Test Plan: Imported from OSS
Reviewed By: protonu, zheng-xq
Differential Revision: D23167926
Pulled By: bertmaher
fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906