Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336
This PR was generated by removing `const` for all types of nodes in NNC IR, and fixing compilation errors that were the result of this change.
This is the first step in making all NNC mutations in-place.
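For illustration, a minimal before/after sketch of the kind of signature change this implies (simplified stand-in types, not the actual diff):
```cpp
// Simplified stand-in types; the real NNC IR classes live in torch/csrc/jit/tensorexpr.
struct Expr {
  virtual ~Expr() = default;
};
struct Add : Expr {
  Expr* lhs = nullptr;
  Expr* rhs = nullptr;
};

struct IRMutator {
  // Before: `const Expr* mutate(const Add* v);` forced every pass to rebuild nodes.
  // After: dropping const leaves the door open for passes to mutate nodes in place.
  virtual Expr* mutate(Add* v) { return v; }
  virtual ~IRMutator() = default;
};
```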
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30049829
Pulled By: navahgar
fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`, so the per-line `NOLINTNEXTLINE` suppressions for that check are removed.
All changes but the ones to `.clang-tidy` are generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h" | xargs grep cppcoreguidelines-avoid-non-const-global-variables | cut -f1 -d: | sort | uniq`; do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334
Here's a possibly controversial PR. These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value. While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D29471484
Pulled By: bertmaher
fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59449
Make dtypeToCppString a virtual method so that a child
class can easily override the dtype string generation rule. This is
needed in preparation for making loop and tensor indices int64_t.
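A minimal sketch of the pattern, using stand-in types rather than the real NNC printer classes:
```cpp
#include <string>

struct Dtype {
  bool is_int64 = false;
};

struct Printer {
  // Virtual so a child class can override how dtypes are spelled in generated code.
  virtual std::string dtypeToCppString(const Dtype& d) {
    return d.is_int64 ? "int64_t" : "int";
  }
  virtual ~Printer() = default;
};

// A child printer that always emits 64-bit loop/index types.
struct Int64IndexPrinter : Printer {
  std::string dtypeToCppString(const Dtype& /*d*/) override {
    return "int64_t";
  }
};
```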
Test Plan:
```
build/bin/test_tensorexpr
```
Reviewed By: H-Huang
Differential Revision: D29173969
Pulled By: desertfire
fbshipit-source-id: a447badba76788354da1c79f80c834c99f105776
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.
- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of the CUDA and HIP differences in the generated code:
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using the hiprtc precompiled header feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400
Reviewed By: ejguan
Differential Revision: D28421065
Pulled By: malfet
fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57551
The new method allows passing input and output arguments as `void*`
pointers instead of `CallArg`s. That helps reduce the invocation
overhead. Currently this is only supported in LLVM codegen.
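A rough sketch of the two entry points with simplified stand-in types (the actual method name and NNC types may differ):
```cpp
#include <vector>

struct CallArg {
  void* ptr;
  explicit CallArg(void* p) : ptr(p) {}
};

struct CodeGenLike {
  // Existing path: arguments are boxed into CallArg objects on every call.
  void call(const std::vector<CallArg>& args) { call_raw(unbox(args)); }

  // New path: callers hand over raw void* pointers directly, skipping the
  // per-invocation boxing/unboxing work.
  void call_raw(const std::vector<void*>& args) { /* jump into compiled code */ (void)args; }

 private:
  static std::vector<void*> unbox(const std::vector<CallArg>& args) {
    std::vector<void*> raw;
    raw.reserve(args.size());
    for (const auto& a : args) {
      raw.push_back(a.ptr);
    }
    return raw;
  }
};
```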
Relanding #55113 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195049
Pulled By: ZolotukhinM
fbshipit-source-id: 035b77ae996dbbcd542b4b0e4c011b41e8d7828b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55113
The new method allows passing input and output arguments as `void*`
pointers instead of `CallArg`s. That helps reduce the invocation
overhead. Currently this is only supported in LLVM codegen.
Differential Revision: D27487549
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: d8f3d92262cde1c155beefb629454370d9af2f89
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tool/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
Revert "Revert D27449031 (2a7df657fe): [pytorch][PR] [ROCm] use hiprtc precompiled header". Reland PR https://github.com/pytorch/pytorch/issues/54350.
This reverts commit 204ac21bf1457022caab197001788239720b96d6.
The original PR was reverted under suspicion that it was causing CI instability, but it was instead due to a hardware failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55965
Reviewed By: jbschlosser
Differential Revision: D27755907
Pulled By: malfet
fbshipit-source-id: 75bf0b9d888df3dee62f00a366b1123757e0474e
Summary:
HIP's runtime compiler (hiprtc) is adding support for precompiled HIP headers in the ROCm 4.2 release. Conditionally add support for this feature. Using this feature will improve the ROCm torch wheel user experience; users will no longer need to install HIP headers separately to use torch JIT features.
Use of this feature is gated on a new ROCM_VERSION macro.
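For illustration, a hedged sketch of how such a gate might look (the macro encoding is assumed to be major * 10000 + minor * 100 + patch; not the actual PyTorch source):
```cpp
// Assumption: ROCM_VERSION encodes 4.2.0 as 40200.
#if defined(USE_ROCM) && defined(ROCM_VERSION) && ROCM_VERSION >= 40200
// hiprtc >= 4.2 ships a precompiled HIP header, so JIT users don't need
// HIP headers installed separately.
constexpr bool kUseHiprtcPrecompiledHeader = true;
#else
constexpr bool kUseHiprtcPrecompiledHeader = false;
#endif
```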
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54350
Reviewed By: H-Huang
Differential Revision: D27449031
Pulled By: malfet
fbshipit-source-id: 81a8d7847a47ce2bb253d1ea58740ef66ed154a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51594
ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
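For illustration, a hypothetical bridge function shaped after this description (the exact NNC signature and registration API may differ):
```cpp
#include <cstdint>
#include <cstring>

// buf_data[0] is the output buffer, buf_data[1..] are inputs; buf_ranks and
// buf_dims describe each buffer's shape so the bridge can reconstruct tensors.
extern "C" void my_external_copy(
    int64_t bufs_num,
    void** buf_data,
    int64_t* buf_ranks,
    int64_t* buf_dims,
    int64_t args_num,
    int64_t* extra_args) {
  (void)args_num;
  (void)extra_args;
  if (bufs_num < 2) {
    return;
  }
  // Element count of the input (buffer 1); its dims start after buffer 0's dims.
  int64_t numel = 1;
  int64_t offset = buf_ranks[0];
  for (int64_t i = 0; i < buf_ranks[1]; ++i) {
    numel *= buf_dims[offset + i];
  }
  // The "external" work: here just a raw copy of float data into the output.
  std::memcpy(buf_data[0], buf_data[1], numel * sizeof(float));
}
```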
The reason the PR was previously reverted was that the LLVM generated
calls to bridge functions were breaking unwind tables. This is now fixed
by requiring bridge functions to never throw and setting the
corresponding attribute in the LLVM generated code.
Differential Revision: D26213882
Test Plan: Imported from OSS
Reviewed By: pbelevich, ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: db954d8338e2d750c2bf0a41e88e38bd494f2945
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51475
ExternalCall nodes represent opaque calls to external functions to fill a
tensor (buffer) with values. They can be used to include nodes that are
otherwise not representable as TE, or whose TE representation is currently too
slow.
To make an external function available in NNC as ExternalCall, one needs to
implement a "bridge" function that would take raw (void*) pointers to the data
along with the arrays containing dimension info. This function would then
internally call the desired external function and make sure the results of the
call are correctly placed in the provided raw data buffers.
Test Plan: Imported from OSS
Reviewed By: pbelevich, Chillee
Differential Revision: D26179083
Pulled By: ZolotukhinM
fbshipit-source-id: 9e44de098ae94d25772cf5e2659d539fa6f3f659
Summary:
CUDA TK >= 11.1 provides a ptxjitcompiler that emits SASS instead of PTX.
1. This gives better backward compatibility: it allows a future TK to work with an older driver, which might not be able to load the generated PTX through JIT compilation and would error out at runtime;
https://docs.nvidia.com/deploy/cuda-compatibility/#using-ptx
2. Meanwhile, SASS doesn't provide good forward compatibility, so for unsupported archs we fall back to PTX to support future devices.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cubin-compatibility
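A rough sketch of the output-selection logic under these assumptions (error handling omitted; the arch-support check is assumed to come from elsewhere):
```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

std::vector<char> getCompiledCode(nvrtcProgram prog, bool arch_supported_by_toolkit) {
  size_t size = 0;
  std::vector<char> code;
#if defined(CUDA_VERSION) && CUDA_VERSION >= 11010
  if (arch_supported_by_toolkit) {
    // SASS (cubin): older drivers can load it without a runtime JIT step.
    nvrtcGetCUBINSize(prog, &size);
    code.resize(size);
    nvrtcGetCUBIN(prog, code.data());
    return code;
  }
#endif
  // PTX fallback: JIT-compiled by the driver, so it still runs on future devices.
  nvrtcGetPTXSize(prog, &size);
  code.resize(size);
  nvrtcGetPTX(prog, code.data());
  return code;
}
```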
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50319
Reviewed By: malfet
Differential Revision: D26114475
Pulled By: ngimel
fbshipit-source-id: 046e9e7b3312d910f499572608a0bc1fe53feef5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50318
We can skip the dispatcher and go to the device-specific
`at::native::empty_strided` implementation.
Also, unpacking the TensorOptions struct at kernel launch time actually takes a
bit of work, since the optionals are encoded in a bitfield. Do this upfront
and use the optionals directly at runtime.
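A stand-in sketch of the idea, with simplified types rather than the real TensorOptions/ATen signatures:
```cpp
#include <cstdint>
#include <vector>

struct PackedOptions {   // plays the role of TensorOptions: optionals behind a bitfield
  uint8_t has_bits = 0;
  int dtype = 0;
  int device = 0;
};

struct UnpackedOptions { // decoded once, ahead of time
  int dtype;
  int device;
};

// Done once when the fused kernel is set up, not on every output allocation.
UnpackedOptions unpack(const PackedOptions& o) {
  return {(o.has_bits & 1) ? o.dtype : /*default float*/ 6,
          (o.has_bits & 2) ? o.device : /*cpu*/ 0};
}

// At run time, the device-specific allocator is called directly with the
// already-unpacked fields, skipping both the dispatcher and the bitfield decode.
void* allocate_output(const std::vector<int64_t>& sizes, const UnpackedOptions& opts);
```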
ghstack-source-id: 119735738
Test Plan:
Before:
```
-------------------------------------------------------
Benchmark               Time             CPU  Iterations
-------------------------------------------------------
FusedOverhead        2143 ns         2142 ns      332946
UnfusedOverhead      2277 ns         2276 ns      315130
```
After:
```
-------------------------------------------------------
Benchmark               Time             CPU  Iterations
-------------------------------------------------------
FusedOverhead        2175 ns         2173 ns      321877
UnfusedOverhead      2394 ns         2394 ns      307360
```
(The noise in the baseline makes this really hard to read; it seemed to be about 3-5% faster in my local testing.)
Reviewed By: eellison
Differential Revision: D25859132
fbshipit-source-id: 8753289339e365f78c790bee076026cd649b8509
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357
This is a follow-up fix for PR #48679, where the previous PR
added support for integer inputs to aten::abs by promoting integers to
float and then demoting the result back to integers. This PR supports
integer inputs to aten::abs more efficiently in the SimpleIREvaluator
by implementing integer support for kAbs (renamed from kFabs).
- Rename kFabs to kAbs
- Add support for integer inputs to kAbs in the SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
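A simplified sketch of dtype-aware abs in an interpreter-style evaluator (stand-in code, not the actual SimpleIREvaluator):
```cpp
#include <cmath>
#include <type_traits>

template <typename T>
T eval_abs(T v) {
  if constexpr (std::is_integral_v<T>) {
    // Integer abs: no promotion to float and demotion back.
    return v < 0 ? -v : v;
  } else {
    // Floating-point path, as before the kFabs -> kAbs rename.
    return std::fabs(v);
  }
}
```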
Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
Imported from OSS
Reviewed By: eellison
Differential Revision: D25545791
fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
Summary:
Fixes an internally reported issue in the tensorexpr fuser when using FP16 on Cuda. The HalfChecker analysis, which determines whether we need to define the Half type, searches the IR for expressions that use Half. If one of the parameters is of type Half but it (or any other Half expr) is not used in the IR, we return a false negative. Fix this by adding the parameter list to the HalfChecker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48068
Reviewed By: ZolotukhinM
Differential Revision: D25009680
Pulled By: nickgg
fbshipit-source-id: 24fddef06821f130db3d3f45d6d041c7f34a6ab0
Summary:
Take 2 of this fix; I removed the repro from the issue, which is a bit flaky due to parallelism. It broke on Windows but isn't specific to Windows or this fix, I think. I'll make sure all the tests pass this time (cc zou3519).
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47448
Reviewed By: glaringlee
Differential Revision: D24765070
Pulled By: nickgg
fbshipit-source-id: 5297e647534d53657bef81f4798e8aa6a93d1fbd
Summary:
Fixes an issue where fp16 scalars created by the registerizer could be referenced as floats, causing invalid conversions which would crash in the NVRTC compile. I also noticed that we were inserting patterns like `float(half(float(X)))` and added a pass to collapse those down inside the CudaHalfScalarRewriter.
Fixes https://github.com/pytorch/pytorch/issues/47138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47229
Reviewed By: agolynski
Differential Revision: D24706475
Pulled By: nickgg
fbshipit-source-id: 9df72bbbf203353009e98b9cce7ab735efff8b21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47254
CUDA codegen used a static global counter for picking names for
functions, but the names only need to be unique within the scope of the
given codegen. This PR fixes that.
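A minimal sketch of the fix pattern, with simplified names (not the actual CudaCodeGen code):
```cpp
#include <string>

class KernelNamer {
 public:
  // Each codegen instance numbers its own kernels, so names only need to be
  // unique within that codegen and concurrent codegens don't share state.
  std::string freshName(const std::string& base) {
    return base + "_" + std::to_string(counter_++);
  }

 private:
  int counter_ = 0;  // previously: a static global shared by every codegen
};
```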
Differential Revision: D24698271
Test Plan: Imported from OSS
Reviewed By: bertmaher
Pulled By: ZolotukhinM
fbshipit-source-id: 516c0087b86b35bbb6ea7c71bb0ed9c3daaca2b8
Summary:
Fixes two bugs reported by https://github.com/pytorch/pytorch/issues/45953 in the NNC Cuda codegen which could break when using Half floats:
1. The Registerizer will generate new scalars with the type of the load being replaced, and doesn't have Cuda specific logic to avoid using the half type. I've added a quick mutator to coerce these to float, similar to the existing load casting rules.
2. We're not handling explicit casts to Half inserted by the user (in the report the user being the JIT). Addressing this by replacing these with casts to Float, since that's the type we do Half math in.
Fixes https://github.com/pytorch/pytorch/issues/45953.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46129
Reviewed By: glaringlee
Differential Revision: D24253639
Pulled By: nickgg
fbshipit-source-id: 3fef826eab00355c81edcfabb1030332cae595ac
Summary:
This might be an alternative to reverting https://github.com/pytorch/pytorch/issues/45396.
The obvious rough edge is that I'm not really seeing the work group limits that TensorExpr produces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45506
Reviewed By: zhangguanheng66
Differential Revision: D23991410
Pulled By: Krovatkin
fbshipit-source-id: 11d3fc4600e4bffb1d1192c6b8dd2fe22c1e064e
Summary:
For integral types, isnan is meaningless. Provide specializations for
maximum and minimum which don't call it.
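A minimal sketch of the specialization idea (not the generated CUDA helpers verbatim):
```cpp
#include <cmath>
#include <type_traits>

template <typename T>
T maximum(T a, T b) {
  if constexpr (std::is_integral_v<T>) {
    return a > b ? a : b;  // isnan is meaningless for integers, so skip it
  } else {
    return std::isnan(a) ? a : (a > b ? a : b);  // propagate NaN for floats
  }
}

template <typename T>
T minimum(T a, T b) {
  if constexpr (std::is_integral_v<T>) {
    return a < b ? a : b;
  } else {
    return std::isnan(a) ? a : (a < b ? a : b);
  }
}
```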
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44984
Test Plan: python test/test_jit_fuser_te.py -k TestTEFuser.test_minmax_int_ops
Reviewed By: ezyang
Differential Revision: D23885259
Pulled By: asuhan
fbshipit-source-id: 2e6da2c43c0ed18f0b648a2383d510894c574437
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 to be flaky and ultimately disabled.
The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.
To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.
Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
Summary:
Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix https://github.com/pytorch/pytorch/issues/44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.
For example it will transform the following:
```
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  for k in 0..5 // threadIdx.x
    do other thing(i, k);
```
Into:
```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```
And handle the case where statements are not bound by any axis, eg.
```
do outer thing;
for i in 0..10 // blockIdx.x
  for j in 0..10 // threadIdx.x
    do thing(i, j);
  do other thing(i);
```
will become:
```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
Summary:
Fix an issue where loops of different sizes are bound to the same Cuda dimension / metavar.
More info and tests coming soon...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44325
Reviewed By: colesbury
Differential Revision: D23628859
Pulled By: nickgg
fbshipit-source-id: 3621850a4cc38a790b62ad168d32e7a0e2462fad
Summary:
Fixes a bug where FP16 values could be incorrectly cast to a half type that doesn't have a cast operator, by inserting the Cuda-specific cast to float during handling of the Cast node rather than as a wrapper around printing Loads and Stores. Two main changes: the HalfChecker now inserts the casts to float explicitly in the IR, and the PrioritizeLoad mutator now consumes both Loads and a Cast which immediately precedes a load.
Tested with test_jit_fuser_te.py and test_tensorexpr.py, plus C++ tests obv.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44209
Reviewed By: izdeby
Differential Revision: D23575577
Pulled By: nickgg
fbshipit-source-id: 808605aeb2af812758f96f9fdc11b07e08053b46
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.
First up the good stuff, benchmark before:
```
Column sum      Caffe2    NNC      Simple   Better
(10, 100)       5.7917    9.7037   6.9386   6.0448
(100, 100)      5.9338    14.972   7.1139   6.3254
(100, 10000)    21.453    741.54   145.74   12.555
(1000, 1000)    8.0678    122.75   22.833   9.0778

Row sum         Caffe2    NNC      Simple   Better
(10, 100)       5.4502    7.9661   6.1469   5.5587
(100, 100)      5.7613    13.897   21.49    5.5808
(100, 10000)    21.702    82.398   75.462   22.793
(1000, 1000)    22.527    129      176.51   22.517
```
After:
```
Column sum      Caffe2    NNC      Simple   Better
(10, 100)       6.0458    9.4966   7.1094   6.056
(100, 100)      5.9299    9.1482   7.1693   6.593
(100, 10000)    21.739    121.97   162.63   14.376
(1000, 1000)    9.2374    29.01    26.883   10.127

Row sum         Caffe2    NNC      Simple   Better
(10, 100)       5.9773    8.1792   7.2307   5.8941
(100, 100)      6.1456    9.3155   24.563   5.8163
(100, 10000)    25.384    30.212   88.531   27.185
(1000, 1000)    26.517    32.702   209.31   26.537
```
Speedup about 3-8x depending on the size of the data (increasing with bigger inputs).
The gap between NNC and simple is closed or eliminated - remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.
It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic: it will now recognize that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for replacing Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878
Reviewed By: glaringlee
Differential Revision: D23382499
Pulled By: nickgg
fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43097
Boolean arguments weren't promoted, so if you tried to write a comparison with
types such as `Tensor(Bool) == Int` you'd fail typechecking inside the TE
engine.
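A stand-in sketch of the promotion step (simplified types, not the TE fuser code):
```cpp
enum class Dtype { Bool, Int, Float };

struct ExprRef {
  Dtype dtype;
};

ExprRef cast(ExprRef e, Dtype to) {
  e.dtype = to;  // in real IR this would wrap e in a Cast node
  return e;
}

Dtype promoted(Dtype a, Dtype b) {
  if (a == Dtype::Float || b == Dtype::Float) return Dtype::Float;
  if (a == Dtype::Int || b == Dtype::Int) return Dtype::Int;
  return Dtype::Bool;
}

// Before emitting `lhs == rhs`, bring both sides to a common dtype so a
// Tensor(Bool) == Int comparison typechecks.
void promoteOperands(ExprRef& lhs, ExprRef& rhs) {
  Dtype common = promoted(lhs.dtype, rhs.dtype);
  if (lhs.dtype != common) lhs = cast(lhs, common);
  if (rhs.dtype != common) rhs = cast(rhs, common);
}
```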
Test Plan: Imported from OSS
Reviewed By: protonu, zheng-xq
Differential Revision: D23167926
Pulled By: bertmaher
fbshipit-source-id: 47091a815d5ae521637142a5c390e8a51a776906