Commit Graph

1031 Commits

e3dadb1d36 [opaque obj] torch.compile support (#163936)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163936
Approved by: https://github.com/zou3519
ghstack dependencies: #163284, #163714
2025-11-13 00:35:20 +00:00
35571fe94b [effects] Add register_effectful_op (#163284)
Refactored `register_effectful_op` to return a handler, matching how fake kernels are registered. This makes it easier to deregister effects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163284
Approved by: https://github.com/zou3519
2025-11-13 00:35:20 +00:00
1df723e6f5 [inductor] Fix constant creation (#167398)
We ran into this issue when debugging inductor-lite. Calling `torch.tensor` within a fake mode (which is the case inside of inductor) will create a FakeTensor, which causes this FakeTensor to be used as a constant within inductor.
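
A minimal standalone illustration of the pitfall (not the inductor code path itself): under an active fake mode, `torch.tensor` produces a `FakeTensor` with no real storage, which cannot serve as a constant.
```python
# Standalone illustration of the pitfall described above: under an active FakeTensorMode,
# torch.tensor() returns a FakeTensor rather than a real tensor constant.
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.tensor([1.0, 2.0])

print(type(t))  # torch._subclasses.fake_tensor.FakeTensor
```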

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167398
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng
2025-11-11 16:30:46 +00:00
3ea829a337 Fix torch.cond HOP device in inductor (#167354)
Fixes #166918

The output device may not be on the same device as the predicate device.

```
python test/inductor/test_control_flow.py -k test_output_on_different_device
```
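
A hedged repro sketch of the situation the fix handles (assumes a CUDA device is available): the predicate lives on CPU while the branch outputs live on CUDA, so the output device cannot be inferred from the predicate.
```python
# Repro sketch (assumes CUDA is available): CPU predicate, CUDA branch outputs.
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(pred, x):
    return torch.cond(pred, true_fn, false_fn, (x,))

out = f(torch.tensor(True), torch.randn(4, device="cuda"))  # CPU pred, CUDA output
```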

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167354
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-11-10 18:19:38 +00:00
e1a1aeaf5b [1/N] Use key in dict for existence checks (#167035)
This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`.
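
A small example of the preferred form described above: the membership check goes straight to the dict.
```python
# Membership checks on the dict directly, as this PR standardizes on.
config = {"max_autotune": True, "debug": False}

if "max_autotune" in config:          # preferred form
    pass

if "max_autotune" in config.keys():   # equivalent result, but needlessly builds a keys view
    pass
```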

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167035
Approved by: https://github.com/janeyx99
2025-11-06 02:25:10 +00:00
a6c6acea9d [11/N] Apply ruff UP035 rule (#166225)
This PR continues applying the ruff UP035 rule to inductor code. UP035 enforces Python 3.10+ syntax and import locations (for example, importing `Callable` from `collections.abc` rather than `typing`).
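
A typical rewrite that UP035 asks for, shown as a minimal before/after sketch:
```python
# Before: flagged by ruff UP035 (deprecated import location)
from typing import Callable, Sequence

# After: import from the modern Python 3.9/3.10+ location
from collections.abc import Callable, Sequence
```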

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166225
Approved by: https://github.com/aorenste
2025-11-04 04:53:40 +00:00
6c98657239 Add some Triton related suppressions that don't show on CI (#166868)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166868
Approved by: https://github.com/maggiemoss, https://github.com/zou3519
2025-11-03 22:54:50 +00:00
3a38ec78e1 [inductor] Expand use of generic benchmark function (#164938)
Use the more generic `Benchmarker.benchmark` function to allow benchmarking on other devices that support the required functionality; for example, prologue and epilogue fusion can be benchmarked for Triton CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938
Approved by: https://github.com/nmacchioni, https://github.com/eellison
2025-11-03 20:15:25 +00:00
82d86bacf3 [inductor] track reduction before splitting (#166053)
Keep track of the reduction before splitting.

In the mix-order reduction context, if one of the reductions is split, it becomes much harder to fuse with the other reduction. Tracking the metadata of the reduction before splitting makes the fusion possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166053
Approved by: https://github.com/jansel
2025-11-01 19:41:21 +00:00
dfebdcab86 [GraphPartition] cache get_free_symbol_uses (#166338)
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)

I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried the aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600-th node, a single call to `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
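
A hedged sketch of the caching idea (not the actual inductor implementation): memoize the recursive symbol collection so shared sub-nodes are only walked once.
```python
# Hedged sketch of the fix described above, not the actual inductor code:
# cache the recursive symbol collection so shared sub-nodes are visited once.
import functools

class Node:
    def __init__(self, symbols, deps=()):
        self.symbols = frozenset(symbols)
        self.deps = tuple(deps)

    @functools.cache  # the real code would also need to key on arguments such as unbacked_only
    def get_free_symbol_uses(self):
        out = set(self.symbols)
        for dep in self.deps:
            out |= dep.get_free_symbol_uses()  # recursion no longer re-walks cached subtrees
        return frozenset(out)

shared = Node({"s0"})
root = Node({"s1"}, deps=(shared, Node({"s2"}, deps=(shared,))))
print(sorted(root.get_free_symbol_uses()))  # ['s0', 's1', 's2']
```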

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
2025-10-31 21:24:05 +00:00
26534e9809 Revert "[GraphPartition] cache get_free_symbol_uses (#166338)"
This reverts commit a6b1ef17173f56ba93ac97ff4384fa4060b5e41e.

Reverted https://github.com/pytorch/pytorch/pull/166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](a6b1ef1717) ([comment](https://github.com/pytorch/pytorch/pull/166338#issuecomment-3472980329))
2025-10-31 12:57:56 +00:00
a6b1ef1717 [GraphPartition] cache get_free_symbol_uses (#166338)
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)

I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried the aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600-th node, a single call to `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
2025-10-31 02:50:10 +00:00
694db5f549 Use 'is' in callable comparisons (#166624)
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions.
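
A small illustration of the recommended pattern: identity comparison against the torch function object.
```python
# Identity comparisons against torch function objects, as recommended above.
import torch

def name_of(fn):
    if fn is torch.relu:        # preferred over `fn == torch.relu`
        return "relu"
    if fn is not torch.sigmoid:
        return "something else"
    return "sigmoid"

print(name_of(torch.relu))      # relu
```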

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166624
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-30 19:00:09 +00:00
2de4cf2102 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-30 12:22:25 +00:00
1dd6b76914 Revert "[1/N] Remove unused loop variables (#166258)"
This reverts commit 76b2c37045e52540ec51e967aa7b6436a6b9b174.

Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](76b2c37045) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))
2025-10-29 11:10:37 +00:00
76b2c37045 [1/N] Remove unused loop variables (#166258)
This PR removes unused loop variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-10-29 01:34:15 +00:00
236ce736a1 [reland] Add provenance to inductor IR nodes created after graph.run (#164255) (#164746)
Summary:

as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r  test_deferred_triton_kernels
```

Differential Revision: D83979975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
2025-10-28 02:20:20 +00:00
8d4e48831e Remove JITFunction constexpr and some arg_names (#166280)
https://github.com/triton-lang/triton/pull/8536 breaks torch.compile integration. This PR attempts to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166280
Approved by: https://github.com/jansel
2025-10-27 09:29:03 +00:00
9940e894ea Fix pyrefly ignore syntax in _inductor (#166247)
Ensures that pyrefly ignore comments suppress only the intended error code.

pyrefly check
lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166247
Approved by: https://github.com/oulgen
2025-10-27 02:48:42 +00:00
380d440d1c Revert "inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040)"
This reverts commit 9038a30cee56e0d577a666fffa32e990732572d4.

Reverted https://github.com/pytorch/pytorch/pull/164040 on behalf of https://github.com/karthickai due to Kindly add the test case mentioned in the issue ([comment](https://github.com/pytorch/pytorch/pull/164040#issuecomment-3444137989))
2025-10-24 17:14:45 +00:00
9038a30cee inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040)
…semantics on views; add regression test for transposed mutation (fixes #163929)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164040
Approved by: https://github.com/ngimel, https://github.com/jansel
2025-10-24 16:37:43 +00:00
017d2985f3 set unbacked bindings in reinplace pass for newly created nodes during generalize_scatter decomp (#164948)
Two fixes:
1. In the reinplace pass, set unbacked bindings for newly created nodes.
2. In inductor, ComputedBuffer used to miss detecting some used symbols; fixed that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164948
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #164341
2025-10-18 03:20:30 +00:00
e925dfcc6b Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhancing code readability.
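
One example of the kind of simplification the SIM rules enforce (SIM108, collapsing an if/else assignment into a ternary):
```python
# Example of a SIM-rule simplification (SIM108): replace an if/else assignment with a ternary.
cond = True

# Before (flagged):
if cond:
    x = 1
else:
    x = 2

# After:
x = 1 if cond else 2
```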

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-10-17 07:27:11 +00:00
84d141e910 Revert "[inductor] Expand use of generic benchmark function (#164938)"
This reverts commit 5c583e2573f29243742e00b9fa36b266c5c78bb3.

Reverted https://github.com/pytorch/pytorch/pull/164938 on behalf of https://github.com/clee2000 due to I think this broke test/inductor/test_cuda_repro.py::CudaReproTests::test_epilogue_fusion_with_view? [GH job link](https://github.com/pytorch/pytorch/actions/runs/18529735968/job/52813191763) [HUD commit link](f58f301313) on both rocm and the slow grad check for linux. It did run successfully on cuda workflow on trunk, I wonder if this a gpu capability thing? no clue though ([comment](https://github.com/pytorch/pytorch/pull/164938#issuecomment-3407600224))
2025-10-15 17:48:38 +00:00
5c583e2573 [inductor] Expand use of generic benchmark function (#164938)
Use the more generic `Benchmarker.benchmark` function to allow benchmarking on other devices that support the required functionality; for example, prologue and epilogue fusion can be benchmarked for Triton CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938
Approved by: https://github.com/nmacchioni, https://github.com/eellison
2025-10-15 09:18:24 +00:00
18b3658df9 [inductor][ez] properly print Pointwise (#165369)
Previously, when we printed a ComputedBuffer for a reduction, we got something like:
```
ComputedBuffer(name='buf0', layout=FixedLayout('cuda:0', torch.float32, size=[1, 768], stride=[768, 1]), data=Reduction(
  'cuda',
  torch.float32,
  def inner_fn(index, rindex):
      _, i1 = index
      r0_0 = rindex
      tmp0 = ops.load(tangents_1, i1 + 768 * r0_0)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(primals_1, i1 + 768 * r0_0)
      tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16)
      tmp4 = ops.load(rsqrt, r0_0)
      tmp5 = tmp3 * tmp4
      tmp6 = tmp1 * tmp5
      return tmp6
  ,
```
But if we printed a ComputedBuffer for a pointwise op, we got something like:
```
ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f12922c5bc0>, ranges=[32768, 768]))

```

Note that the inner_fn string is not printed.

With the change, we get the inner_fn string printed in this case:
```

ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(
  'cuda',
  torch.bfloat16,
  def inner_fn(index):
      i0, i1 = index
      tmp0 = ops.load(tangents_1, i1 + 768 * i0)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(primals_2, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(rsqrt, i0)
      tmp5 = tmp3 * tmp4
      tmp6 = ops.load(buf1, i0)
      tmp7 = ops.constant(-0.5, torch.float32)
      tmp8 = tmp6 * tmp7
      tmp9 = ops.load(rsqrt, i0)
      tmp10 = tmp9 * tmp9
      tmp11 = tmp10 * tmp9
      tmp12 = tmp8 * tmp11
      tmp13 = ops.constant(0.0013020833333333333, torch.float32)
      tmp14 = tmp12 * tmp13
      tmp15 = ops.load(primals_1, i1 + 768 * i0)
      tmp16 = ops.to_dtype(tmp15, torch.float32, src_dtype=torch.bfloat16)
      tmp17 = tmp14 * tmp16
      tmp18 = tmp5 + tmp17
      tmp19 = ops.load(buf1, i0)
      tmp20 = ops.constant(-0.5, torch.float32)
      tmp21 = tmp19 * tmp20
      tmp22 = ops.load(rsqrt, i0)
      tmp23 = tmp22 * tmp22
      tmp24 = tmp23 * tmp22
      tmp25 = tmp21 * tmp24
      tmp26 = ops.constant(0.0013020833333333333, torch.float32)
      tmp27 = tmp25 * tmp26
      tmp28 = ops.load(primals_1, i1 + 768 * i0)
      tmp29 = ops.to_dtype(tmp28, torch.float32, src_dtype=torch.bfloat16)
      tmp30 = tmp27 * tmp29
      tmp31 = tmp18 + tmp30
      tmp32 = ops.to_dtype(tmp31, torch.bfloat16, src_dtype=torch.float32)
      return tmp32
  ,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165369
Approved by: https://github.com/eellison
2025-10-14 06:08:12 +00:00
ac529df244 Native matmul (#157743)
### Implementation of #151705

This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.

To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:

1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)

### Summary of This PR

This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.

### Notable Changes

1. Adds a new config flag: `config.triton.enable_native_matmul`
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tiling suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.
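
A usage sketch for the flag from item (1) above; the config name is taken from this PR's description, assumes a CUDA device, and may change as the feature evolves.
```python
# Usage sketch for the new flag described in (1); config name taken from this PR,
# assumes a CUDA device, and may evolve with the feature.
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.enable_native_matmul = True  # lower aten.mm/aten.bmm to ops.dot

@torch.compile
def mm(a, b):
    return a @ b

out = mm(torch.randn(64, 64, device="cuda"), torch.randn(64, 64, device="cuda"))
```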

@eellison @jansel @PaulZhang12 @shunting314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743
Approved by: https://github.com/jansel
2025-10-14 04:22:30 +00:00
f363114852 [Bugfix][Inductor][Dynamo] Fix stride incorrectness issues for stride 0 tensor (#164897)
Fixes #164814 - we update the checks to include cases where we know a symbolic expression is statically one. There are two errors here: the first is in graph capture, where a tensor with size 0 yet a symbolic stride would attempt to keep the symbolic stride, resulting in a mismatch. The second is in inductor codegen, where squeeze only checked whether size == 1, missing the case where a symbolic stride equals 1.

Also fixes #164924 (@bobrenjc93  for fuzzer finding an issue affecting users : )

### Test plan:
```
python test/dynamo/test_aot_autograd.py AotAutogradFallbackTests
```

Results in:
```
..
----------------------------------------------------------------------
Ran 49 tests in 45.622s

OK (expected failures=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164897
Approved by: https://github.com/laithsakka
2025-10-10 21:26:57 +00:00
600db525bd [easy][while_loop] use copy_input instead of clone in _clone_aliased_inputs (#164955)
Compared with clone, `ExternKernel.copy_input` additionally realizes the buffer; downstream code assumes the input buffers are realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164955
Approved by: https://github.com/BoyuanFeng
2025-10-09 23:39:00 +00:00
9944cac6e6 Add suppressions to torch/_inductor (#165062)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Split this directory into two PRs to keep them from being too large.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062
Approved by: https://github.com/oulgen, https://github.com/mlazos
2025-10-09 20:34:20 +00:00
2982406721 [inductor] ban benchmarking by default in deterministic mode (#164532)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164532
Approved by: https://github.com/eellison
ghstack dependencies: #164801
2025-10-08 20:55:15 +00:00
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
9eb89a4ad5 Add missing TypeIs to torch/_inductor/ir.py (#164489)
This should be a TypeIs here
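
A minimal illustration of why `TypeIs` fits here (names are illustrative, not the actual `ir.py` helper): unlike `TypeGuard`, `TypeIs` also narrows the negative branch for the type checker.
```python
# Illustrative only (not the actual ir.py helper): TypeIs narrows both branches,
# whereas TypeGuard would only narrow the positive one.
from typing_extensions import TypeIs

class IRNode: ...
class StorageBox(IRNode): ...

def is_storage_box(node: IRNode) -> TypeIs[StorageBox]:
    return isinstance(node, StorageBox)

def handle(node: IRNode) -> None:
    if is_storage_box(node):
        pass  # type checker narrows `node` to StorageBox here
    else:
        pass  # ...and to "IRNode, not StorageBox" here
```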

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164489
Approved by: https://github.com/mlazos
2025-10-03 19:34:20 +00:00
a34797e031 Revert "Add provenance to inductor IR nodes created after graph.run (#164255)"
This reverts commit b9e73e639e36f3aa628752161711e68878231b30.

Reverted https://github.com/pytorch/pytorch/pull/164255 on behalf of https://github.com/jeffdaily due to broke rocm; inductor/test_provenance_tracing.py::TestProvenanceTracingStackTraces::test_deferred_triton_kernels [GH job link](https://github.com/pytorch/pytorch/actions/runs/18200790301/job/51821738132) [HUD commit link](b9e73e639e) ([comment](https://github.com/pytorch/pytorch/pull/164255#issuecomment-3363360088))
2025-10-02 22:01:41 +00:00
b9e73e639e Add provenance to inductor IR nodes created after graph.run (#164255)
Summary:
as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Differential Revision: D82871244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255
Approved by: https://github.com/mlazos
2025-10-02 17:32:46 +00:00
0d7994ca97 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other misc:
- Validate accuracy of nccl estimations  ( use ruisi's profiling instead ?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-30 04:53:58 +00:00
474d07554a [dynamic shapes] unbacked-safe slicing (#161414)
Summary:
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Test Plan:
contbuild & OSS CI, see 56218d85e2

Rollback Plan:

Differential Revision: D80948073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161414
Approved by: https://github.com/laithsakka
2025-09-30 01:15:19 +00:00
0f619c1f89 Revert "[inductor] do comm compute overlap at aten fx level (#163215)"
This reverts commit c9b5af9a384e7ef5f95613abe1622f5f55133c3a.

Reverted https://github.com/pytorch/pytorch/pull/163215 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
c9b5af9a38 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other misc:
- Validate accuracy of nccl estimations  ( use ruisi's profiling instead ?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-29 18:18:03 +00:00
eb4361a801 [Fix] Adding missing f prefixes to formatted strings [1/N] (#164065)
As stated in the title.

* #164068
* #164067
* #164066
* __->__ #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164065
Approved by: https://github.com/Skylion007
2025-09-29 04:53:00 +00:00
3413490f53 [scan] materialize combine_fn in forward add more autograd tests (#161732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025
2025-09-27 18:13:15 +00:00
520fca82c8 Refactor Provenance Tracking (#163378)
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simplify the code

- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.

- If some `call_kernel` call sites don't have a preceding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied by a `codegen_comment` call.

- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling

- Remove `debug_handle` from `call_kernel`.

Test Plan:
CI

```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D82839271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
2025-09-25 22:55:59 +00:00
9341ede617 Revert to old behaviour of not padding strides if shape or stride is dynamic (#163639)
Differential Revision: D83053287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163639
Approved by: https://github.com/blaine-rister
2025-09-24 18:31:01 +00:00
f68de58c9d [Inductor-FX] Support symbol and dynamic scalar graph inputs and outputs (#163596)
# Problems
This PR fixes a few edge cases that the FX converter missed related to dynamic shapes.

1. Inductor graphs can sometimes take `sympy.Symbol` inputs. We have logic to convert these to FX placeholder nodes. However, this logic did not update the `self.expr_to_proxy` table mapping symbols to proxy nodes. (There was existing logic to do this for `ir.TensorBox` inputs, but not `sympy.Symbol`.) This caused sympy tracing to fail when these symbol inputs were used in other expressions.

2. We lacked codegen for `ShapeAsConstantBuffer`. This IR node is seen when the graph input or output is a scalar computed from dynamic shapes.

# Fixes
a. Update `self.expr_to_proxy` when generating placeholders for `sympy.Symbol` inputs. Change `SymbolBuffer.get_example` to convert the symbol to a `torch.SymInt`, so we can populate `meta["val"]` correctly and use the value in other computations.
b. Support `ShapeAsConstantBuffer` by tracing the sympy expression.
c. Move output generation inside the metadata hook, allowing us to populate `meta["val"]` for the nodes computing `ShapeAsConstantBuffer`.

# Test plan
Added several new CI tests:
 1. `torch.cond` with dynamic shapes. This exposes both issues, as the predicate is a `ShapeAsConstantBuffer` and one of the subgraphs uses a symbol input, due to the closure. Also tests when the parent and subgraphs have different input shapes.
 2. Output dynamic shape scalar. This tests `ShapeAsConstantBuffer` as an output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163596
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-09-24 06:08:14 +00:00
3ef1bef36c [sdpa] make sure to recompile if alignment is different than before (#163083)
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is divisible by 8 (no padding needed). Call this `Run1`.
- If we then run the compiled model with a different length that isn't divisible by 8 (requires padding), call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
    buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.

### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)

## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
2025-09-23 01:33:33 +00:00
60c2bdedcd Replace Literal[None] with None in typing (#163489)
This PR replaces Literal[None] with None in typing.
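
The simplification in question: `None` in an annotation already denotes its own type, so wrapping it in `Literal` adds nothing.
```python
# The change in question: None already names its own type in annotations.
from typing import Literal

def before(flag: Literal[None] = None) -> None: ...
def after(flag: None = None) -> None: ...  # equivalent, and the form this PR standardizes on
```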

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163489
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-22 22:10:08 +00:00
4967ad8baa [Graph Partition] improve custom op output alias (#163227)
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their usage.

However, when there are mutating args, we don't see `del buf1` immediately.

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of FallbackKernel op1) has inputs that alias its output.
72fedf0575/torch/_inductor/ir.py (L7976-L7982)

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227
Approved by: https://github.com/zou3519
2025-09-19 17:01:36 +00:00
6e680ae8de add more restriction to fusion with large accumulate reads (#163163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163163
Approved by: https://github.com/yf225
2025-09-19 01:20:30 +00:00
c230ac7300 [inductor][ez] add ChoiceCaller annotations (#162672)
# why

- enable ChoiceCaller generation to provide extra information that
  feedback_saver_fns (functions registered to run at the end of
  benchmarking) can use afterwards
- users that extend ChoiceCaller creation e.g. by creating their own
  InductorChoices can use this to shuttle through information

# what

- add an annotations dictionary to ChoiceCaller class

# testing

n/a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162672
Approved by: https://github.com/nmacchioni
2025-09-16 20:49:55 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00