Commit Graph

76 Commits

59dd12042e [nnc] Removed const from all fields in IR. (#62336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336

This PR was generated by removing `const` from all types of nodes in the NNC IR, and fixing the compilation errors that resulted from this change.

This is the first step in making all NNC mutations in-place.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30049829

Pulled By: navahgar

fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
2021-08-03 11:44:36 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
349f2f767c Modernize to default constructor and nullptr in torch (#61735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61735

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29716659

fbshipit-source-id: ec2a0a0b7e55d2e50b1d35f0b651bd40675ae7e8
2021-07-16 10:51:13 -07:00
635d864b26 Fix modernize-use-equals-default nolint failures in torch/csrcs (#61142)
Summary:
Test-plan: Compile + clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61142

Reviewed By: VitalyFedyunin

Differential Revision: D29529372

Pulled By: malfet

fbshipit-source-id: 2ccde7712a51c28243b16bbb4d1d68086e0414a6
2021-07-06 09:46:46 -07:00
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
93772792e3 [nnc] Get rid of fuser trigger counters (#57334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334

Here's a possibly controversial PR.  These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value.  While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29471484

Pulled By: bertmaher

fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
2021-06-29 22:22:15 -07:00
96b3537e71 [NNC] Add a dtypeToCppString virtual method in IRPrinter (#59449)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449

Make dtypeToCppString a virtual method so that a child
class can easily override the dtype string generation rule. This is
needed as preparation for making loop and tensor indices int64_t.
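
As a rough illustration of the pattern (class and member names below are stand-ins, not the actual NNC types), a child printer can override just the dtype rule to widen index types:
```
#include <iostream>
#include <string>

// Stand-in dtype; the real NNC Dtype is richer.
struct Dtype {
  std::string name;
};

struct IRPrinter {
  virtual ~IRPrinter() = default;
  // Virtual hook: subclasses may change how a dtype prints in C++ source.
  virtual std::string dtypeToCppString(const Dtype& d) {
    return d.name;
  }
};

// A child codegen that widens index types without touching anything else.
struct WideIndexPrinter : IRPrinter {
  std::string dtypeToCppString(const Dtype& d) override {
    if (d.name == "int") {
      return "int64_t";
    }
    return IRPrinter::dtypeToCppString(d);
  }
};

int main() {
  WideIndexPrinter p;
  std::cout << p.dtypeToCppString({"int"}) << "\n";   // prints int64_t
  std::cout << p.dtypeToCppString({"float"}) << "\n"; // prints float
}
```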

Test Plan:
```
build/bin/test_tensorexpr
```

Reviewed By: H-Huang

Differential Revision: D29173969

Pulled By: desertfire

fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776
2021-06-17 09:34:58 -07:00
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
4c24d820ff [TensorExpr] Implement 'call_raw' in CUDA codegen. (#57901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57901

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28312107

Pulled By: ZolotukhinM

fbshipit-source-id: 53b4fd418d0c7bf70647278ee03efa5ef60b3af8
2021-05-12 14:08:20 -07:00
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using the following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
0bf69278f7 Reland: [TensorExpr] Add CodeGen::call_raw method. (#57551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57551

The new method allows passing input and output arguments via `void*`
pointers instead of CallArgs, which helps reduce the invocation overhead.
Currently this is only supported in the LLVM codegen.
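
A toy sketch of the difference, using made-up `CallArg`/`RawKernel` stand-ins rather than the real NNC types:
```
#include <cstdio>
#include <vector>

struct CallArg {   // stand-in for a boxed argument type
  void* ptr;
};

using RawKernel = void (*)(void**);

// What a compiled kernel might look like at the ABI level.
void kernel_impl(void** args) {
  const float* in = static_cast<const float*>(args[0]);
  float* out = static_cast<float*>(args[1]);
  out[0] = in[0] * 2.0f;
}

// Boxed path: unpack every CallArg into a void* on each invocation.
void call(RawKernel k, const std::vector<CallArg>& boxed) {
  std::vector<void*> raw;
  raw.reserve(boxed.size());
  for (const CallArg& a : boxed) raw.push_back(a.ptr);
  k(raw.data());
}

// Raw path: no per-call unpacking, hence lower invocation overhead.
void call_raw(RawKernel k, std::vector<void*>& raw) {
  k(raw.data());
}

int main() {
  float in = 21.0f, out = 0.0f;
  std::vector<CallArg> boxed = {{&in}, {&out}};
  call(kernel_impl, boxed);
  std::vector<void*> raw = {&in, &out};
  call_raw(kernel_impl, raw);
  std::printf("%g\n", out); // 42
}
```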

Relanding #55113 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195049

Pulled By: ZolotukhinM

fbshipit-source-id: 035b77ae996dbbcd542b4b0e4c011b41e8d7828b
2021-05-05 09:10:25 -07:00
05b255c543 Revert D27487549: [TensorExpr] Add CodeGen::call_raw method.
Test Plan: revert-hammer

Differential Revision:
D27487549 (c9ab384af7)

Original commit changeset: d8f3d92262cd

fbshipit-source-id: ea8e71dbe2d632bc0fb557362c8bd899eb6aa83a
2021-05-01 19:48:07 -07:00
c9ab384af7 [TensorExpr] Add CodeGen::call_raw method. (#55113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55113

The new method allows passing input and output arguments via `void*`
pointers instead of CallArgs, which helps reduce the invocation overhead.
Currently this is only supported in the LLVM codegen.

Differential Revision: D27487549

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: d8f3d92262cde1c155beefb629454370d9af2f89
2021-04-30 15:24:37 -07:00
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; this fixes that.
This change is autogenerated by `python tools/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
f3743f097f [TensorExpr] Nuke tensorexpr::ScalarType and instead use c10::ScalarType directly. (#56825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56825

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27977461

Pulled By: ZolotukhinM

fbshipit-source-id: f8a72938ba395e426e2d9449627113abb1c9c34f
2021-04-26 01:51:21 -07:00
e1752ffa04 [reland][ROCm] use hiprtc precompiled header (#55965)
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header".  Reland PR https://github.com/pytorch/pytorch/issues/54350.

This reverts commit 204ac21bf1457022caab197001788239720b96d6.

The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965

Reviewed By: jbschlosser

Differential Revision: D27755907

Pulled By: malfet

fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
2021-04-15 15:47:56 -07:00
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
204ac21bf1 Revert D27449031: [pytorch][PR] [ROCm] use hiprtc precompiled header
Test Plan: revert-hammer

Differential Revision:
D27449031 (2a7df657fe)

Original commit changeset: 81a8d7847a47

fbshipit-source-id: b7b970c8ea4110357fba3ad4d52a86fa5641d90c
2021-04-01 06:42:04 -07:00
2a7df657fe [ROCm] use hiprtc precompiled header (#54350)
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release.  Conditionally add support for this feature.  Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.

The use of this feature is conditionalized on a new ROCM_VERSION macro.
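
A hedged sketch of the gating pattern; `ROCM_VERSION` is the macro named above, but the value encoding used below (major*10000 + minor*100) is an assumption for illustration:
```
#include <iostream>

#ifndef ROCM_VERSION
#define ROCM_VERSION 40200 // pretend we are building against ROCm 4.2
#endif

int main() {
#if ROCM_VERSION >= 40200
  // hiprtc understands the precompiled HIP header from 4.2 onward.
  std::cout << "using hiprtc precompiled header\n";
#else
  // Older hiprtc: users must install the HIP headers separately.
  std::cout << "falling back to plain HIP headers\n";
#endif
}
```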

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350

Reviewed By: H-Huang

Differential Revision: D27449031

Pulled By: malfet

fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
2021-03-31 13:36:50 -07:00
c639513378 [TensorExpr] Resubmit: Introduce ExternalCall nodes to TE IR. (#51594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51594

ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include operations that are
otherwise not representable in TE, or whose TE representation is currently too
slow.

To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
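
A hedged sketch of what such a bridge might look like; every name and the exact parameter layout here are illustrative assumptions, not the actual NNC signature:
```
#include <cstdint>

// The "real" external function being wrapped: C = A x B for row-major
// [m,k] x [k,n] matrices.
static void external_matmul(float* out, const float* a, const float* b,
                            int64_t m, int64_t n, int64_t k) {
  for (int64_t i = 0; i < m; ++i)
    for (int64_t j = 0; j < n; ++j) {
      float acc = 0.f;
      for (int64_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
      out[i * n + j] = acc;
    }
}

// Bridge: buf_data[0] is the output, the rest are inputs; buf_dims holds
// the dimensions of all buffers concatenated (ranks in buf_ranks).
extern "C" void nnc_bridge_matmul(int64_t /*num_bufs*/, void** buf_data,
                                  int64_t* /*buf_ranks*/, int64_t* buf_dims) {
  float* out = static_cast<float*>(buf_data[0]);
  const float* a = static_cast<const float*>(buf_data[1]);
  const float* b = static_cast<const float*>(buf_data[2]);
  // Output dims first: [m,n], then a: [m,k], then b: [k,n].
  int64_t m = buf_dims[0], n = buf_dims[1], k = buf_dims[3];
  external_matmul(out, a, b, m, n, k);
  // Must not throw: the LLVM-generated call site is marked nounwind.
}

int main() {
  float a[4] = {1, 2, 3, 4};   // 2x2
  float b[4] = {5, 6, 7, 8};   // 2x2
  float out[4] = {};
  void* bufs[3] = {out, a, b};
  int64_t ranks[3] = {2, 2, 2};
  int64_t dims[6] = {2, 2, 2, 2, 2, 2};
  nnc_bridge_matmul(3, bufs, ranks, dims);
  return out[0] == 19.f ? 0 : 1; // [1 2;3 4]*[5 6;7 8] -> 19 at (0,0)
}
```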

The reason the PR was previously reverted was that the LLVM generated
calls to bridge functions were breaking unwind tables. This is now fixed
by requiring bridge functions to never throw and setting the
corresponding attribute in the LLVM generated code.

Differential Revision: D26213882

Test Plan: Imported from OSS

Reviewed By: pbelevich, ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: db954d8338e2d750c2bf0a41e88e38bd494f2945
2021-02-03 10:22:54 -08:00
4f37150f40 Revert D26179083: [TensorExpr] Introduce ExternalCall nodes to TE IR.
Test Plan: revert-hammer

Differential Revision:
D26179083 (f4fc3e3920)

Original commit changeset: 9e44de098ae9

fbshipit-source-id: d15684e04c65c395b4102d4f98a4488482822d1b
2021-02-02 05:29:41 -08:00
f4fc3e3920 [TensorExpr] Introduce ExternalCall nodes to TE IR. (#51475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51475

ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include operations that are
otherwise not representable in TE, or whose TE representation is currently too
slow.

To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.

Test Plan: Imported from OSS

Reviewed By: pbelevich, Chillee

Differential Revision: D26179083

Pulled By: ZolotukhinM

fbshipit-source-id: 9e44de098ae94d25772cf5e2659d539fa6f3f659
2021-02-02 00:50:46 -08:00
392abde8e6 patch nvrtc API for cuda TK >= 11.1 (#50319)
Summary:
CUDA toolkit >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility: it allows a future toolkit to work with an older driver, which might not be able to load the generated PTX through JIT compilation and would otherwise error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported archs we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
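
A logic-only sketch of this fallback policy (no real nvrtc calls; `max_supported_arch` is an assumed input for illustration):
```
#include <iostream>

enum class OutputKind { SASS, PTX };

// SASS for archs this toolkit knows (works with older drivers); PTX for
// newer, unknown archs (the driver can still JIT-compile it).
OutputKind pickOutput(int device_arch, int max_supported_arch) {
  return device_arch <= max_supported_arch ? OutputKind::SASS
                                           : OutputKind::PTX;
}

int main() {
  std::cout << (pickOutput(86, 86) == OutputKind::SASS) << "\n"; // 1
  std::cout << (pickOutput(90, 86) == OutputKind::PTX) << "\n";  // 1
}
```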

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319

Reviewed By: malfet

Differential Revision: D26114475

Pulled By: ngimel

fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
2021-01-27 23:58:20 -08:00
2569dc71e1 Reapply D25859132: [te] Optimize allocation of kernel outputs (#50546)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50546

And fix the ROCm build
ghstack-source-id: 119837166

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D25912464

fbshipit-source-id: 023e1f6c9fc131815c5a7a31f4860dfe271f7ae1
2021-01-15 17:02:49 -08:00
269193f5f5 Revert D25859132: [te] Optimize allocation of kernel outputs
Test Plan: revert-hammer

Differential Revision:
D25859132 (62f676f543)

Original commit changeset: 8753289339e3

fbshipit-source-id: 580069c7fa7565643d3204f3740e64ac94c4db39
2021-01-14 04:28:29 -08:00
62f676f543 [te] Optimize allocation of kernel outputs (#50318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318

We can skip the dispatcher and go to the device-specific
`at::native::empty_strided` implementation.

Also, unpacking the TensorOptions struct at kernel launch time actually takes a
bit of work, since the optionals are encoded in a bitfield.  Do this upfront
and use the optionals directly at runtime.
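
An illustrative sketch of the unpack-upfront idea with stand-in types (not the real TensorOptions): decode the optionals once when the kernel is set up, so the launch path reads plain values.
```
#include <optional>

struct TensorOptionsLike {     // stand-in: the real TensorOptions packs
  std::optional<int> device;   // these into a bitfield, so reading them
  std::optional<int> dtype;    // repeatedly is not free
};

struct UnpackedOptions {       // plain values, cheap to read per launch
  int device;
  int dtype;
};

// Done once at kernel-compile time, not on every launch.
UnpackedOptions unpackOnce(const TensorOptionsLike& o) {
  return {o.device.value_or(/*cpu*/ 0), o.dtype.value_or(/*float*/ 6)};
}

int main() {
  TensorOptionsLike opts{std::nullopt, 6};
  UnpackedOptions u = unpackOnce(opts); // launch path now just reads u
  return u.device;
}
```
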
ghstack-source-id: 119735738

Test Plan:
Before:
```
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2143 ns       2142 ns     332946
UnfusedOverhead       2277 ns       2276 ns     315130
```

After:
```
-------------------------------------------------------
Benchmark                Time           CPU Iterations
-------------------------------------------------------
FusedOverhead         2175 ns       2173 ns     321877
UnfusedOverhead       2394 ns       2394 ns     307360
```

(The noise in the baseline makes this really hard to read; it seemed to be
about 3-5% faster in my local testing.)

Reviewed By: eellison

Differential Revision: D25859132

fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509
2021-01-13 12:12:43 -08:00
6568572712 Support integral types for kAbs in SimpleIREvaluator (#49357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357

This is a follow-up fix for PR #48679, which added support for integer
inputs to aten::abs by promoting integers to float and then demoting the
result back to integers. This PR supports integer inputs to aten::abs more
efficiently in the SimpleIREvaluator by implementing integer inputs for
kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in the SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
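
A small standalone sketch of the evaluator-side idea (the template dispatch here is an illustrative stand-in for the SimpleIREvaluator's dispatch): integers take a native abs path with no float round-trip.
```
#include <cassert>
#include <cmath>
#include <type_traits>

// Evaluate kAbs natively per type instead of promoting ints to float.
template <typename T>
T evalAbs(T v) {
  if constexpr (std::is_integral_v<T>) {
    return v < 0 ? -v : v; // native integer path
  } else {
    return std::fabs(v);   // floating-point path
  }
}

int main() {
  assert(evalAbs(-3) == 3);
  assert(evalAbs(-2.5) == 2.5);
}
```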

Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`

Imported from OSS

Reviewed By: eellison

Differential Revision: D25545791

fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
2020-12-18 07:57:58 -08:00
50386b9988 [NNC] Add Support For is_nan (#48973)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48973

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D25413166

Pulled By: eellison

fbshipit-source-id: 0c79258345df18c60a862373fa16931228fb92ef
2020-12-16 18:31:01 -08:00
9ead558899 Add max supported SM for nvrtc-11.0 (#48151)
Summary:
Should fix the regression when nvrtc from CUDA-11.0 is used on a system with an RTX 3080

Addresses issue described in https://github.com/pytorch/pytorch/issues/47669#issuecomment-725073808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48151

Reviewed By: ngimel

Differential Revision: D25043899

Pulled By: malfet

fbshipit-source-id: 998ded59387e3971c2c1a5df4af595630515a72e
2020-11-18 08:17:28 -08:00
aabc87cd04 [NNC] Fix HalfChecker when half present but unused (#48068)
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on Cuda. The HalfChecker analysis that determines whether we need to define the Half type searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we'll return a false negative. Fix this by adding the parameter list to the HalfChecker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068

Reviewed By: ZolotukhinM

Differential Revision: D25009680

Pulled By: nickgg

fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
2020-11-17 12:07:57 -08:00
dcca712d3c [NNC] refactor cuda half support to more general file (#47373)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47373

Test Plan: Imported from OSS

Reviewed By: ansley

Differential Revision: D24805246

Pulled By: eellison

fbshipit-source-id: 33b5c84c9212d51bac3968e02aae2434dde40cd8
2020-11-12 11:14:00 -08:00
e985503d80 [NNC] Fix an issue with half-scalar vars coerced to float (Take 2) (#47448)
Summary:
Take 2 of this fix; I removed the repro from the issue, which is a bit flaky due to parallelism. It broke on Windows but isn't specific to Windows or this fix, I think. I'll make sure all the tests pass this time (cc zou3519).

Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats - causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.
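
A sketch of the collapse on a toy cast chain (the expression type below is illustrative, not the NNC IR); the rewrite is safe here because half math is done in float anyway:
```
#include <cassert>
#include <memory>

enum class Ty { Float, Half, Var };

struct Expr {
  Ty ty;
  std::shared_ptr<Expr> inner; // set for casts, null for leaf vars
};

std::shared_ptr<Expr> cast(Ty t, std::shared_ptr<Expr> e) {
  return std::make_shared<Expr>(Expr{t, std::move(e)});
}

// Rewrite float(half(float(X))) -> float(X).
std::shared_ptr<Expr> collapse(const std::shared_ptr<Expr>& e) {
  if (e->ty == Ty::Float && e->inner && e->inner->ty == Ty::Half &&
      e->inner->inner && e->inner->inner->ty == Ty::Float) {
    return cast(Ty::Float, e->inner->inner->inner);
  }
  return e;
}

int main() {
  auto x = std::make_shared<Expr>(Expr{Ty::Var, nullptr});
  auto chain = cast(Ty::Float, cast(Ty::Half, cast(Ty::Float, x)));
  auto out = collapse(chain);
  assert(out->ty == Ty::Float && out->inner == x);
}
```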

Fixes https://github.com/pytorch/pytorch/issues/47138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47448

Reviewed By: glaringlee

Differential Revision: D24765070

Pulled By: nickgg

fbshipit-source-id: 5297e647534d53657bef81f4798e8aa6a93d1fbd
2020-11-05 19:31:52 -08:00
745899f926 Revert D24706475: [pytorch][PR] [NNC] Fix an issue in Cuda fusion with fp16 scalar vars coerced to float
Test Plan: revert-hammer

Differential Revision:
D24706475 (33cf7fddd2)

Original commit changeset: 9df72bbbf203

fbshipit-source-id: f16ff04818de4294713d5b97eab5b298c1a75a6b
2020-11-05 08:25:48 -08:00
33cf7fddd2 [NNC] Fix an issue in Cuda fusion with fp16 scalar vars coerced to float (#47229)
Summary:
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats - causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.

Fixes https://github.com/pytorch/pytorch/issues/47138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47229

Reviewed By: agolynski

Differential Revision: D24706475

Pulled By: nickgg

fbshipit-source-id: 9df72bbbf203353009e98b9cce7ab735efff8b21
2020-11-04 15:48:12 -08:00
9b168a1fed [TensorExpr] Pick meaningful names for functions in TE codegen. (#47255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47255

As a result of this change, the generated CUDA code for the following fusion group:
```
graph(%0 : Float(32, 32, 1, 1, strides=[32, 1, 1, 1], requires_grad=0, device=cuda:0),
      %1 : Float(32, 32, strides=[32, 1], requires_grad=0, device=cuda:0),
      %2 : Float(32, 32, 1, strides=[32, 1, 1], requires_grad=0, device=cuda:0)):
  %3 : int = prim::Constant[value=1]()
  %v1.1 : Float(32, 32, 32, strides=[1024, 32, 1], requires_grad=0, device=cuda:0) = aten::add(%1, %2, %3) # test/test_tensorexpr.py:155:0
  %5 : int = prim::Constant[value=1]()
  %6 : Float(32, 32, 32, 32, strides=[32768, 1024, 32, 1], requires_grad=0, device=cuda:0) = aten::add(%v1.1, %0, %5) # test/test_tensorexpr.py:156:0
  return (%6)
```

Would look like the following:
```
extern "C" __global__
void fused_add_add(float* t0, float* t1, float* t2, float* aten_add) {
{
  float v = __ldg(t1 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32) + (512 * blockIdx.x + threadIdx.x) % 32);
  float v_1 = __ldg(t2 + ((512 * blockIdx.x + threadIdx.x) / 32) % 32 + 32 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32));
  float v_2 = __ldg(t0 + ((512 * blockIdx.x + threadIdx.x) / 1024) % 32 + 32 * ((512 * blockIdx.x + threadIdx.x) / 32768));
  aten_add[((((512 * blockIdx.x + threadIdx.x) / 32768) * 32768 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32)) + 1024 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32)) + (512 * blockIdx.x + threadIdx.x) % 32] = (v + v_1) + v_2;
}
}
```

Previously we generated:
```
extern "C" __global__
void func(float* t0, float* t1, float* t2, float* aten_add) {
{
  float v = __ldg(t1 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32) + (512 * blockIdx.x + threadIdx.x) % 32);
  float v_1 = __ldg(t2 + ((512 * blockIdx.x + threadIdx.x) / 32) % 32 + 32 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32));
  float v_2 = __ldg(t0 + ((512 * blockIdx.x + threadIdx.x) / 1024) % 32 + 32 * ((512 * blockIdx.x + threadIdx.x) / 32768));
  aten_add[((((512 * blockIdx.x + threadIdx.x) / 32768) * 32768 + 32 * (((512 * blockIdx.x + threadIdx.x) / 32) % 32)) + 1024 * (((512 * blockIdx.x + threadIdx.x) / 1024) % 32)) + (512 * blockIdx.x + threadIdx.x) % 32] = (v + v_1) + v_2;
}
}
```

Differential Revision: D24698273

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 6da95c6ac3d5155ebfaaab4f84f55a24deb6d10d
2020-11-03 16:41:22 -08:00
a65e757057 [TensorExpr] CudaCodegen: restart counter for function names unique ID inside each codegen instantiation. (#47254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47254

CUDA codegen used a static global counter for picking names for
functions, but function names only need to be unique within the scope of a
given codegen. This PR fixes that.
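
A minimal sketch of the change (names illustrative): the counter becomes a per-instance member instead of a global static, so numbering restarts for each codegen.
```
#include <iostream>
#include <string>

struct CudaCodeGen {
  std::string uniqueName(const std::string& base) {
    return base + "_" + std::to_string(counter_++);
  }
 private:
  int counter_ = 0; // per-instance, not static: resets per codegen
};

int main() {
  CudaCodeGen a, b;
  std::cout << a.uniqueName("fused_add") << "\n"; // fused_add_0
  std::cout << a.uniqueName("fused_add") << "\n"; // fused_add_1
  std::cout << b.uniqueName("fused_add") << "\n"; // fused_add_0 again
}
```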

Differential Revision: D24698271

Test Plan: Imported from OSS

Reviewed By: bertmaher

Pulled By: ZolotukhinM

fbshipit-source-id: 516c0087b86b35bbb6ea7c71bb0ed9c3daaca2b8
2020-11-03 16:41:20 -08:00
f3db68776c [NNC] Fix two more bugs in Cuda Half support (#46129)
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:

1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.

2. We're not handling explicit casts to Half inserted by the user (in the report, the "user" being the JIT). Address this by replacing them with casts to Float, since that's the type we do Half math in.

Fixes https://github.com/pytorch/pytorch/issues/45953.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129

Reviewed By: glaringlee

Differential Revision: D24253639

Pulled By: nickgg

fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
2020-10-12 13:31:07 -07:00
22a34bcf4e ROCm ❤ TensorExpr (#45506)
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506

Reviewed By: zhangguanheng66

Differential Revision: D23991410

Pulled By: Krovatkin

fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
2020-09-29 16:52:16 -07:00
3dd0e362db [TensorExpr] Fix min and max for integral inputs in CUDA backend (#44984)
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
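
A standalone sketch of the shape of such specializations (not the actual CUDA backend code): the floating-point overload checks and propagates NaN, while the integral overload never calls isnan.
```
#include <cassert>
#include <cmath>
#include <type_traits>

// Floating-point: propagate NaN, matching eager-mode semantics.
template <typename T,
          std::enable_if_t<std::is_floating_point<T>::value, int> = 0>
T maximum(T a, T b) {
  return std::isnan(a) ? a : (std::isnan(b) ? b : (a > b ? a : b));
}

// Integral: isnan is meaningless, so skip it entirely.
template <typename T,
          std::enable_if_t<std::is_integral<T>::value, int> = 0>
T maximum(T a, T b) {
  return a > b ? a : b;
}

int main() {
  assert(maximum(1, 2) == 2);                     // integral overload
  assert(std::isnan(maximum(1.0, std::nan("")))); // NaN propagates
}
```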

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984

Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops

Reviewed By: ezyang

Differential Revision: D23885259

Pulled By: asuhan

fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
2020-09-23 23:19:12 -07:00
4bbb6adff5 [NNC] fix SyncThreads insertion and reenable CudaSharedMem test (#44909)
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 test to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think it's not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat, I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
2020-09-21 09:27:22 -07:00
82ab167cce [NNC] Fix masking for all block and thread dimensions in CudaCodeGen (#44733)
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.

For example it will transform the following:
```
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  for k in 0..5 // threadIdx.x
    do other thing(i, k);
```

Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```

And handle the case where statements are not bound by any axis, e.g.
```
do outer thing;
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  do other thing(i);
```

will become:

```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733

Reviewed By: mruberry

Differential Revision: D23736878

Pulled By: nickgg

fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
2020-09-16 14:23:47 -07:00
64b4307d47 [NNC] Cuda Codegen - mask loops bound to block/thread dimensions (#44325)
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.

More info and tests coming soon...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325

Reviewed By: colesbury

Differential Revision: D23628859

Pulled By: nickgg

fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
2020-09-11 16:48:16 -07:00
960c088a58 [te] Fix casting of unsigned char, and abs(int) (#44157)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44157

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D23528507

Pulled By: bertmaher

fbshipit-source-id: c5ef0422a91a4665b616601bed8b7cd137be39f9
2020-09-09 11:08:36 -07:00
be94dba429 [NNC] fix support for FP16 in CudaCodgen (#44209)
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator. The fix inserts the CUDA-specific cast to float during handling of the Cast node, rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast that immediately precedes a load.

Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209

Reviewed By: izdeby

Differential Revision: D23575577

Pulled By: nickgg

fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
2020-09-08 18:00:39 -07:00
c14a3613a8 Fix NaN propagation in TE fuser's min/max implementation (#43609)
Summary:
Per eager mode source-of-truth, NaNs shall be propagated by min/max.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43609

Reviewed By: ZolotukhinM

Differential Revision: D23349184

Pulled By: bertmaher

fbshipit-source-id: 094eb8b89a02b27d5ecf3988d0f473c0f91e4afb
2020-09-01 02:10:13 -07:00
1390cad2d8 [NNC] Hook up registerizer to Cuda codegen [2/x] (#42878)
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.

First up the good stuff, benchmark before:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.7917          9.7037          6.9386          6.0448
          (100, 100)          5.9338          14.972          7.1139          6.3254
        (100, 10000)          21.453          741.54          145.74          12.555
        (1000, 1000)          8.0678          122.75          22.833          9.0778

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.4502          7.9661          6.1469          5.5587
          (100, 100)          5.7613          13.897           21.49          5.5808
        (100, 10000)          21.702          82.398          75.462          22.793
        (1000, 1000)          22.527             129          176.51          22.517

```

After:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          6.0458          9.4966          7.1094           6.056
          (100, 100)          5.9299          9.1482          7.1693           6.593
        (100, 10000)          21.739          121.97          162.63          14.376
        (1000, 1000)          9.2374           29.01          26.883          10.127

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.9773          8.1792          7.2307          5.8941
          (100, 100)          6.1456          9.3155          24.563          5.8163
        (100, 10000)          25.384          30.212          88.531          27.185
        (1000, 1000)          26.517          32.702          209.31          26.537
```

Speedup is about 3-8x depending on the size of the data (increasing with bigger inputs).

The gap between NNC and simple is closed or eliminated - the remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.

It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer depends on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where, when adding a default initializer statement, it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the Allocate statement logic was inverted, replacing them only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878

Reviewed By: glaringlee

Differential Revision: D23382499

Pulled By: nickgg

fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
2020-08-31 10:39:46 -07:00
6c99d5611d [tensorexpr] Fix promotion of booleans (#43097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097

Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.
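
A toy sketch of the effect (the lattice below is an illustrative simplification of the real promotion rules): Bool promotes to the other operand's type before the comparison is typed.
```
#include <cassert>

// Ordered so that a larger enum value wins promotion: Bool < Int < Float.
enum class ScalarType { Bool = 0, Int = 1, Float = 2 };

ScalarType promoteTypes(ScalarType a, ScalarType b) {
  return a > b ? a : b;
}

int main() {
  // Tensor(Bool) == Int now typechecks: both sides promote to Int.
  assert(promoteTypes(ScalarType::Bool, ScalarType::Int) == ScalarType::Int);
  assert(promoteTypes(ScalarType::Float, ScalarType::Bool) ==
         ScalarType::Float);
}
```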

Test Plan: Imported from OSS

Reviewed By: protonu, zheng-xq

Differential Revision: D23167926

Pulled By: bertmaher

fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906
2020-08-18 15:19:38 -07:00