181 Commits

901bbcba12 Gate division bitwise numerics under a flag (#165566)
https://github.com/pytorch/pytorch/pull/164144 ensures that division under compile is bitwise equivalent with eager. However, as reported in https://github.com/pytorch/pytorch/issues/164301, kernel performance regressed.

On B200:
- standard triton `/`: 6511 GB/s
- triton `div_rn`: 4692 GB/s

Further investigation of the generated PTX is required to see why there is such a large slowdown. For now, bitwise-equivalent results are gated under `TORCHINDUCTOR_EMULATE_DIVISION_ROUNDING`, similar to `emulate_precision_casts`.
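
A minimal sketch of opting in (the environment-variable name comes from this commit; how and when Inductor reads it is an implementation detail, so it is set before importing torch here, and a CUDA device is assumed):

```py
import os
os.environ["TORCHINDUCTOR_EMULATE_DIVISION_ROUNDING"] = "1"  # set before importing torch

import torch

@torch.compile
def div(a, b):
    return a / b

a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
# With the flag honored, the compiled result should match eager exactly.
print(torch.equal(div(a, b), a / b))
```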

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165566
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-10-15 23:41:01 +00:00
c8c5187e85 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753
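
A minimal sketch of the kind of check this addresses, comparing eager and compiled float division bit-for-bit (shapes, dtype, and device are illustrative; assumes a CUDA device):

```py
import torch

def f(a, b):
    return a / b

a = torch.randn(2**16, device="cuda")
b = torch.randn(2**16, device="cuda")
eager = f(a, b)
compiled = torch.compile(f)(a, b)
# Bitcast to int32 so the comparison is bitwise rather than tolerance-based.
print(torch.equal(eager.view(torch.int32), compiled.view(torch.int32)))
```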

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/bobrenjc93
2025-10-10 22:18:11 +00:00
abb2f7179e Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 68913d8f2a953bdbada4033101b04f6e8d49dabe.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to It breaks CI again, why was it landed for 3 times in a row without any changes? ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3390973016))
2025-10-10 16:10:25 +00:00
68913d8f2a Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
2025-10-10 14:00:46 +00:00
d272ed4b3e Fix identity expansion (#165066)
In some cases, we wrap indexing with `Identity` to prevent expansion from the int32 to the int64 range. Some checks in codegen that are intended to detect constants did not handle `Identity`. This updates those checks and updates `Identity` so that it recursively prints its inputs.

Fix for https://github.com/pytorch/pytorch/issues/164700

Replaces https://github.com/pytorch/pytorch/pull/160190 cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @njriasan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165066
Approved by: https://github.com/njriasan, https://github.com/shunting314, https://github.com/jansel
2025-10-10 13:07:15 +00:00
ed2d514ad8 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit 724463d5a2fba369cd14e89215b84d1b01435df7.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/malfet due to Not sure if it's related, but looks it triggered fuzzer compiler test failure, see a2f29bcd63/1 ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3387288464))
2025-10-09 19:53:38 +00:00
ee6a1ecb0a [ROCm] Enable MI355 CI on PRs, and run full set of UTs on PRs (#160215)
Useful to have PR testing for PRs such as https://github.com/pytorch/pytorch/pull/151360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160215
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-09 18:03:12 +00:00
724463d5a2 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 14:31:33 +00:00
e09fb44ef1 Revert "Fix truediv numerics between eager and compile (#164144)"
This reverts commit d386325ca9a142419f45b987391f4bb175dd7d0b.

Reverted https://github.com/pytorch/pytorch/pull/164144 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164144#issuecomment-3384769092))
2025-10-09 08:40:52 +00:00
d386325ca9 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 04:22:03 +00:00
54ae61c573 Change test_emulate_precision_casts_mean_ratio_chain from gelu to relu (#164997)
gelu can be unstable across local builds due to libdevice differences, since we lower it to libdevice.erf. Combined with the semantics of this test, that can lead to catastrophic cancellation. We switch this test from gelu to relu to fix the instability.
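
A toy illustration (not the PR's test) of why near-cancelling terms amplify ULP-level differences such as those between erf implementations:

```py
import torch

# 1.0 + 1e-7 rounds to the nearest float32, 1.00000011920928955...
a = torch.tensor(1.0 + 1e-7, dtype=torch.float32)
b = torch.tensor(1.0, dtype=torch.float32)
# The subtraction is exact, but it amplifies the tiny rounding error in `a`
# into a ~19% relative error versus the intended value 1e-07.
print((a - b).item())  # ~1.1920929e-07
```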

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164997
Approved by: https://github.com/eellison, https://github.com/jansel
2025-10-09 03:14:05 +00:00
86474ce996 Update mask dtype (#164472)
Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472
Approved by: https://github.com/bdhirsh
2025-10-03 00:19:36 +00:00
6fa972796e [inductor] Fix bugs in emulate_precision_casts (#163520)
Fixes #163449
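
A minimal sketch of turning the emulation on, assuming the `torch._inductor.config.emulate_precision_casts` knob referenced by the test names in this stack (the exact repro in #163449 may differ; assumes a CUDA device):

```py
import torch
import torch._inductor.config as inductor_config

inductor_config.emulate_precision_casts = True  # match eager's intermediate-precision casts

@torch.compile
def f(x):
    return torch.nn.functional.gelu(x).float().mean()

print(f(torch.randn(1024, device="cuda", dtype=torch.bfloat16)))
```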
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163520
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481
2025-09-24 02:52:36 +00:00
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
3ef1bef36c [sdpa] make sure to recompile if alignment is different than before (#163083)
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is divisible by 8 (no padding needed). Call this `Run1`.
- We then run the compiled model with a different length that isn't divisible by 8 (requires padding). Call this `Run2`.
- `Run2` then fails with this error:
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
    buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile for `Run2`; instead, we ran the inputs through the same graph as `Run1`.

### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)

## Fix
`size_hint` doesn't guard, so the fix is to use `guard_or*`, which does.
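
A minimal sketch (a hypothetical helper, not Inductor's actual lowering code) of what `guard_or_false` provides over a size-hint check: it evaluates the condition and, per the fix above, guards on the answer, falling back to `False` for data-dependent expressions, so a later call with different alignment triggers recompilation instead of silently reusing `Run1`'s graph:

```py
from torch.fx.experimental.symbolic_shapes import guard_or_false

def should_realize_unpadded(stride_is_aligned):
    # `stride_is_aligned` may be a plain bool or a SymBool built from a symbolic
    # stride during tracing. A size hint would just pick a concrete value without
    # recording anything; guard_or_false guards when it can and returns False
    # when the expression cannot be decided.
    return guard_or_false(stride_is_aligned)

print(should_realize_unpadded(True), should_realize_unpadded(False))
```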

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
2025-09-23 01:33:33 +00:00
36c2a1325c [inductor] Fix bug where viewed outputs get padded (#163398)
Fixes #163328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163398
Approved by: https://github.com/eellison
ghstack dependencies: #163386
2025-09-22 21:52:45 +00:00
77d8e98e1b [Inductor] update exp codegen for better precision (#161829)
Prior to this PR, we have:
```
[Default Behavior] uses `tl.math.exp({x})`:
eager diff: tensor(2.6935e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(9.2757e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013996509159580942, compile_latency:0.0013981951951980592

TORCHINDUCTOR_USE_FAST_MATH=1 uses `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)`:
eager diff: tensor(2.2315e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(3.5329e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0013982331859319662, compile_latency:0.0013824134564199367

Update inductor to use `tl.extra.libdevice.exp(tmp0)`:
eager diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
compile diff: tensor(2.3421e-06, device='cuda:0', dtype=torch.float64)
eager_latency:0.0014109122834153282, compile_latency:0.0014062877025520593
```

Since `tl.extra.libdevice.exp` leads to both better precision and on-par latency, we use it by default now.

Note that `tl.extra.libdevice.exp` used to have a perf issue in [January 2025](https://github.com/triton-lang/triton/issues/5735) because it used `ex2.approx.f32` instead of `ex2.approx.ftz.f32`, so `tl.extra.libdevice.exp2(tmp0 * 1.4426950408889634)` was used as a workaround. I double-checked that the issue is resolved and that `tl.extra.libdevice.exp` also uses [ex2.approx.ftz.f32](https://github.com/triton-lang/triton/issues/5735#issuecomment-3238421293) today.
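
For reference, the workaround relied on the identity exp(x) = exp2(x · log2 e); a plain-Python check of the two formulations (the Triton codegen itself is not reproduced here):

```py
import math

x = 3.7
print(math.exp(x))                      # analogue of libdevice.exp(x)
print(2.0 ** (x * 1.4426950408889634))  # analogue of exp2(x * log2(e))
# Mathematically identical; in floating point the two can differ in the last
# bits, which is the precision difference measured above.
```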

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161829
Approved by: https://github.com/jansel
2025-08-30 04:56:51 +00:00
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to a 6.2% speedup on vision_maskrcnn and a 5.8% speedup on yolov3 [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), a 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), and an 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Running the same diff on two different days shows a speedup on average in both runs.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
09381f5dac Revert "[Graph Partition] Pass all OSS unit tests (#154667)"
This reverts commit ca7315c17162ea21b1ca5ba23f4bf6168766c7b9.

Reverted https://github.com/pytorch/pytorch/pull/154667 on behalf of https://github.com/clee2000 due to broke inductor/test_memory.py::TestOperatorReorderForPeakMemory::test_reorder_peak_memory_lpmf [GH job link](https://github.com/pytorch/pytorch/actions/runs/16885961204/job/47836769279) [HUD commit link](ca7315c171) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/154667#issuecomment-3176805477))
2025-08-11 20:34:27 +00:00
ca7315c171 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to a 6.2% speedup on vision_maskrcnn and a 5.8% speedup on yolov3 [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), a 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), and an 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Running the same diff on two different days shows a speedup on average in both runs.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-11 16:25:12 +00:00
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
4b11428cb5 [BE][testing] Skip test_repeated_masked_load internally (#158355)
Summary: The test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to express this dependency in the TARGETS file, but the obvious things didn't work. I'm wondering whether this test is important enough to keep running in OSS and internally anyway?

Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
2025-07-16 16:15:44 +00:00
a04a13c449 [BE][testing] Skip test_triton_interpret internally (#158260)
Summary: Subprocesses in fbcode are tricky because of .par files. I'm thinking it's not an important enough test to get it running and skipping is fine.

Test Plan: `buck test`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158260
Approved by: https://github.com/eellison
2025-07-16 16:14:44 +00:00
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
b40981c630 Fix incorrect stride handling in adaptive_avg_pool3d (#157326)
Fixes #157248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157326
Approved by: https://github.com/eqy
ghstack dependencies: #157242
2025-07-01 03:03:48 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
517d2995e0 Add __int__ and __float__ methods to _sympy.functions.Identity (#155873)
Fixes #155688

Root cause:
In [`torch/_inductor/index_propagation.py`](f151b20123/torch/_inductor/index_propagation.py (L57-L68)), when creating a `TypedExpr` from an `Identity` (a `torch.utils._sympy.functions.Identity`, not a `sympy.matrices.expressions.Identity`) whose inner value, `Identity.args[0]`, is any torch int type, the `TypedExpr.__post_init__` method tries to cast the Identity object to a Python `int`. This is where the `TypeError` from the issue was raised, because `Identity` did not know how to cast itself to an `int`.

Fix:
Define an `__int__` method for `torch.utils._sympy.functions.Identity`, and likewise a `__float__` method.
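
A minimal sketch of the shape of the fix, using a toy stand-in for `torch.utils._sympy.functions.Identity` (requires sympy):

```py
import sympy

class Identity(sympy.Function):
    """Toy stand-in: forwards int()/float() to its single wrapped argument."""

    def __int__(self):
        return int(self.args[0])

    def __float__(self):
        return float(self.args[0])

print(int(Identity(sympy.Integer(7))))    # 7
print(float(Identity(sympy.Float(2.5))))  # 2.5
```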

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155873
Approved by: https://github.com/williamwen42
2025-06-15 04:24:40 +00:00
f6b83d4cc6 sort iteration over index vars (#154846)
Fix for https://github.com/pytorch/pytorch/issues/154741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154846
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
2025-06-02 22:06:00 +00:00
ef1d45b12d Cleanup parent fallback logic (#154006)
The `parent` argument in `fallback_node_due_to_unsupported_type` duplicates the `unsupported_output_tensor` logic, so remove it. Verified that the tests in test_add_complex produce the same codegen. This fixes an issue in mx that @drisspg was running into.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154006
Approved by: https://github.com/drisspg
2025-05-29 13:40:36 +00:00
d6e29bf875 Reflect back mutation if we clone misaligned tensors (#154442)
Fix for https://github.com/pytorch/pytorch/issues/152425

Inductor specializes on whether a tensor is 16-byte aligned at the first invocation. Then, on subsequent invocations, if we inferred alignment but are passed a non-aligned tensor, we clone the tensor.

If we infer alignment, then run with an unaligned tensor and mutate the input, we need to reflect the mutation back to the input. This PR adds that write-back.

We could also have been less aggressive about inferring alignment for mutated tensors, but that has a significant perf hit. See the following benchmark:
```
import torch

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

import triton

print(triton.testing.do_bench(lambda: foo(t[:-1])))
torch._dynamo.reset()
print(triton.testing.do_bench(lambda: foo(t[1:])))
```
gives
```
0.04063070610165596
0.07613472988113162
```
So the non-aligned case is almost twice as slow. Tensors changing alignment across invocations is a relatively rare case.

In the future, we could consider a multi-kernel approach, or codegenning a triton kernel that does most of the loads with aligned instructions plus a prologue/epilogue handling the unaligned elements. But it remains to be seen whether this is a significant issue.
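
A minimal sketch of the scenario the fix covers (assumes a CUDA device): alignment is inferred on the first call, the second call passes a misaligned view that gets cloned internally, and the in-place add must still be written back to the caller's tensor:

```py
import torch

@torch.compile(dynamic=False)
def foo(x):
    return x.add_(1)

t = torch.rand(4096 * 4096, device="cuda", dtype=torch.float16)
foo(t[:-1])                        # aligned view: alignment is specialized here
before = t[1].item()
foo(t[1:])                         # misaligned view: cloned under the hood...
assert t[1].item() == before + 1   # ...but the mutation must still reach `t`
```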

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154442
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-05-29 13:36:48 +00:00
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
02cecd1018 [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-21 20:14:34 +00:00
e434a9152e Revert "[inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)"
This reverts commit 6246c7d62ca2f091838d5c707e3d932994c5e35a.

Reverted https://github.com/pytorch/pytorch/pull/151506 on behalf of https://github.com/henrylhtsang due to seems to be breaking some rocm mi300 run ([comment](https://github.com/pytorch/pytorch/pull/151506#issuecomment-2815999009))
2025-04-18 18:40:17 +00:00
9ccdeae7db Fix uint view copy (#151598)
Fix for https://github.com/pytorch/pytorch/issues/151156. We have some logic to undo our upcast prior to a dtype bitcast. This PR cleans up that logic using the dtypes tracked in codegen.
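
A minimal sketch of the pattern involved (hedged: the exact repro in #151156 may differ): small-unsigned-int arithmetic followed by a dtype bitcast via `Tensor.view`, compared between eager and compile:

```py
import torch

def f(x):
    # uint8 compute is upcast internally; the result must be narrowed back
    # before the bitcast to int8 for the bits to match eager.
    return (x + 1).view(torch.int8)

x = torch.arange(8, dtype=torch.uint8)
torch.testing.assert_close(torch.compile(f)(x), f(x))
```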

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151598
Approved by: https://github.com/zou3519
ghstack dependencies: #151562
2025-04-18 18:13:39 +00:00
6246c7d62c [inductor][test] Skip triton tests for MPS as well, also change reason for skipping SM89 to not IS_BIG_GPU (#151506)
Differential Revision:
[D73162091](https://our.internmc.facebook.com/intern/diff/D73162091/)

Combining / improving https://github.com/pytorch/pytorch/pull/150485 and https://github.com/pytorch/pytorch/pull/150343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151506
Approved by: https://github.com/ColinPeppler
2025-04-18 17:26:16 +00:00
6d46b530fc Remove libdevice ops in inductor (#151562)
Now that we track dtypes during codegen, we can delete all these extra ops that worked around the problem by doing dispatch at lowering time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151562
Approved by: https://github.com/isuruf, https://github.com/jansel
2025-04-17 22:18:00 +00:00
fe961679d5 [Inductor] add support for disabling atomic adds (#151033)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151033
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-04-11 18:41:56 +00:00
27ded359a5 Fix inplacing with multiple, fused uses (#150845)
We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check whether the other accesses use the same index; otherwise inplacing may cause us to read memory that has already been overwritten.
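
A hypothetical shape of the hazard (not the exact regression test): an intermediate buffer read at two different indices within one fused kernel, where inplacing the output into that buffer could clobber values the flipped read still needs; assumes a CUDA device:

```py
import torch

def f(x):
    y = x.sin()                 # intermediate buffer a pointwise output could reuse
    return y.flip(0) + y.cos()  # y is read at both reversed and direct indices

x = torch.randn(1024, device="cuda")
torch.testing.assert_close(torch.compile(f)(x), f(x))
```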

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel
2025-04-09 00:05:07 +00:00
49b7d0d84d [ROCm] Enable more inductor UTs (#149513)
Primarily enable inductor fp8 tests, also enable other inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513
Approved by: https://github.com/jeffdaily
2025-04-01 00:30:36 +00:00
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.29963893827160504 -> 0.0396331632970453.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.

This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative-reduction handling of this case.
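
A minimal sketch of the access pattern behind the fix (hedged: the issue's exact repro may differ; assumes a CUDA device): a full reduction over a tensor whose dense, stride-1 dimension is not the last one, e.g. a transposed view:

```py
import torch

x = torch.randn(2048, 2048, device="cuda").t()  # dense dimension is now dim 0
fn = torch.compile(lambda t: t.sum())           # large full reduction -> split reduction
print(fn(x), x.sum())
```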

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on it. But we can support bitcasting and view/memory operators by treating the values as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.

The one question I'm not sure about is whether we need to explicitly disable Triton template fusion, since it would fuse these dtypes in as uint8.
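
A minimal sketch of the supported subset (bitcasts plus view/memory ops), assuming a PyTorch build that exposes `torch.float8_e8m0fnu` and a CUDA device; the data only moves around as raw uint8 bits, with no float8 arithmetic generated:

```py
import torch

def f(bits):
    # bitcast uint8 -> float8_e8m0fnu, reshape, bitcast back: memory ops only
    return bits.view(torch.float8_e8m0fnu).reshape(4, 4).view(torch.uint8)

bits = torch.arange(16, dtype=torch.uint8, device="cuda")
torch.testing.assert_close(torch.compile(f)(bits), f(bits))
```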

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.
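
A minimal sketch of the fusion pattern in question, a matmul with a `torch.bucketize` epilogue (shapes and boundaries are illustrative; assumes a CUDA device):

```py
import torch

def f(a, b, boundaries):
    # boundaries must be a 1-D, sorted tensor for torch.bucketize
    return torch.bucketize(a @ b, boundaries)

a = torch.randn(128, 64, device="cuda")
b = torch.randn(64, 128, device="cuda")
boundaries = torch.linspace(-3, 3, steps=16, device="cuda")
print(torch.compile(f)(a, b, boundaries))
```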

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00