pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-18 01:15:12 +08:00

Author	SHA1	Message	Date
angelayi	e3dadb1d36	[opaque obj] torch.compile support (#163936 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163936 Approved by: https://github.com/zou3519 ghstack dependencies: #163284, #163714	2025-11-13 00:35:20 +00:00
angelayi	35571fe94b	[effects] Add register_effectful_op (#163284 ) Refactored register_effectful_op to return a handler to match how fake kernels are registered. This makes it easier to deregister effects Pull Request resolved: https://github.com/pytorch/pytorch/pull/163284 Approved by: https://github.com/zou3519	2025-11-13 00:35:20 +00:00
angelayi	1df723e6f5	[inductor] Fix constant creation (#167398 ) We ran into this issue when debugging inductor-lite. Calling `torch.tensor` within a fake mode (which is the case inside of inductor) will create a FakeTensor, which causes this FakeTensor to be used as a constant within inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167398 Approved by: https://github.com/eellison, https://github.com/BoyuanFeng	2025-11-11 16:30:46 +00:00
Shangdi Yu	3ea829a337	Fix torch.cond HOP device in inductor (#167354 ) Fixes #166918 The output device may not be on the same device as the predicate device. ``` python test/inductor/test_control_flow.py -k test_output_on_different_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/167354 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2025-11-10 18:19:38 +00:00
Yuanyuan Chen	e1a1aeaf5b	[1/N] Use `key in dict` for existence checks (#167035 ) This PR uses `key in dict` expressions for existence checks of dict elements in Python code. This operation is more efficient than `key in dict.keys()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167035 Approved by: https://github.com/janeyx99	2025-11-06 02:25:10 +00:00
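The existence-check change described in the commit above can be illustrated with a minimal sketch. The dictionary and helper names here are hypothetical and are not taken from the PyTorch code base; both forms return the same answer, and the direct form simply skips building a keys view first.

```python
# Hypothetical option table used only for illustration.
options = {"alpha": 1, "beta": 2}


def has_key_via_keys(mapping: dict, key: str) -> bool:
    # Older style: creates a keys view, then tests membership in it.
    return key in mapping.keys()


def has_key_direct(mapping: dict, key: str) -> bool:
    # Preferred style: membership is tested on the dict itself.
    return key in mapping


# The two spellings agree for present and absent keys alike.
for candidate in ("alpha", "gamma"):
    assert has_key_via_keys(options, candidate) == has_key_direct(options, candidate)
```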
Yuanyuan Chen	a6c6acea9d	[11/N] Apply ruff UP035 rule (#166225 ) This PR continues to apply ruff UP035 rule to inductor code. ruff UP035 rule aims to use Python 3.10 syntax and libraries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166225 Approved by: https://github.com/aorenste	2025-11-04 04:53:40 +00:00
Edward Z. Yang	6c98657239	Add some Triton related suppressions that don't show on CI (#166868 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166868 Approved by: https://github.com/maggiemoss, https://github.com/zou3519	2025-11-03 22:54:50 +00:00
kundaMwiza	3a38ec78e1	[inductor] Expand use of generic benchmark function (#164938 ) Use the more generic `Benchmarker.benchmark` function to allow benchmarking other devices that support the required functionality, for example prologue and epilogue fusion can be benchmarked for triton CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938 Approved by: https://github.com/nmacchioni, https://github.com/eellison	2025-11-03 20:15:25 +00:00
Shunting Zhang	82d86bacf3	[inductor] track reduction before splitting (#166053 ) Keep track of the reduction before splitting. In the mix-order reduction context, if one of the reductions is split, it becomes much harder to fuse with the other reduction. Tracking the metadata of the reduction before splitting makes the fusion possible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166053 Approved by: https://github.com/jansel	2025-11-01 19:41:21 +00:00
Boyuan Feng	dfebdcab86	[GraphPartition] cache get_free_symbol_uses (#166338 ) Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs. `ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)` I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried to aten fallback for torchtitan which results in 10k+ aten nodes. When processing the 600-th node, it takes seconds to `get_free_symbol_uses()` for 1 node. Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times. `ee7434be82/torch/_inductor/ir.py (L4541-L4543)` This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338 Approved by: https://github.com/eellison	2025-10-31 21:24:05 +00:00
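The slowdown described in the commit above (a recursive query that re-walks shared inputs) and the caching fix can be sketched on a toy graph. This is only an illustration under stated assumptions: `ToyNode`, `build_chain`, and the method names are hypothetical and are not Inductor's classes or its `get_free_symbol_uses()` implementation, and the sketch assumes nodes are not mutated after the cache is filled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyNode:
    """Hypothetical stand-in for an IR node; not Inductor's real class."""

    own_symbols: frozenset[str]
    inputs: list[ToyNode] = field(default_factory=list)
    visits: int = 0  # how many times this node's symbols were recomputed
    _cache: frozenset[str] | None = None

    def free_symbols_uncached(self) -> frozenset[str]:
        # Recomputes the whole subtree on every call.
        self.visits += 1
        result = set(self.own_symbols)
        for node in self.inputs:
            result |= node.free_symbols_uncached()
        return frozenset(result)

    def free_symbols_cached(self) -> frozenset[str]:
        # Computes each node once and reuses the stored result afterwards.
        if self._cache is None:
            self.visits += 1
            result = set(self.own_symbols)
            for node in self.inputs:
                result |= node.free_symbols_cached()
            self._cache = frozenset(result)
        return self._cache


def build_chain(depth: int) -> tuple[ToyNode, ToyNode]:
    """Build a chain where every node reads its predecessor twice."""
    leaf = ToyNode(frozenset({"s0"}))
    node = leaf
    for level in range(1, depth + 1):
        node = ToyNode(frozenset({f"s{level}"}), [node, node])
    return node, leaf


root, leaf = build_chain(10)
root.free_symbols_uncached()
uncached_leaf_visits = leaf.visits  # grows as 2**depth

root, leaf = build_chain(10)
root.free_symbols_cached()
cached_leaf_visits = leaf.visits  # each node is computed once
```

In the toy graph the uncached walk visits the leaf once per path from the root, while the cached walk visits it once in total, which is the kind of saving the commit message reports.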
PyTorch MergeBot	26534e9809	Revert "[GraphPartition] cache get_free_symbol_uses (#166338 )" This reverts commit a6b1ef17173f56ba93ac97ff4384fa4060b5e41e. Reverted https://github.com/pytorch/pytorch/pull/166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](`a6b1ef1717`) ([comment](https://github.com/pytorch/pytorch/pull/166338#issuecomment-3472980329))	2025-10-31 12:57:56 +00:00
Boyuan Feng	a6b1ef1717	[GraphPartition] cache get_free_symbol_uses (#166338 ) Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs. `ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)` I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried to aten fallback for torchtitan which results in 10k+ aten nodes. When processing the 600-th node, it takes seconds to `get_free_symbol_uses()` for 1 node. Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times. `ee7434be82/torch/_inductor/ir.py (L4541-L4543)` This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338 Approved by: https://github.com/eellison	2025-10-31 02:50:10 +00:00
Yuanyuan Chen	694db5f549	Use 'is' in callable comparisons (#166624 ) Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166624 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007	2025-10-30 19:00:09 +00:00
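A small sketch of why identity comparison is the safer check for callables. It uses `operator.add` from the standard library as a stand-in for a torch function, and the `AlwaysEqual` class is a hypothetical object whose `__eq__` answers `True` to everything; neither comes from the PyTorch code base.

```python
import operator


class AlwaysEqual:
    """Hypothetical object whose equality check always succeeds."""

    def __eq__(self, other: object) -> bool:
        return True

    __hash__ = object.__hash__


def is_add_by_equality(target: object) -> bool:
    # `==` dispatches to the left operand's __eq__, which may be overridden.
    return target == operator.add


def is_add_by_identity(target: object) -> bool:
    # `is` only succeeds for the very same function object.
    return target is operator.add


impostor = AlwaysEqual()
equality_fooled = is_add_by_equality(impostor)
identity_fooled = is_add_by_identity(impostor)
```

Note that this advice fits module-level functions. Bound methods are created afresh on each attribute access, so `obj.method is obj.method` is `False` and identity checks on them are unreliable.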
Yuanyuan Chen	2de4cf2102	[1/N] Remove unused loop variables (#166258 ) This PR removes unused loop variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-10-30 12:22:25 +00:00
PyTorch MergeBot	1dd6b76914	Revert "[1/N] Remove unused loop variables (#166258 )" This reverts commit 76b2c37045e52540ec51e967aa7b6436a6b9b174. Reverted https://github.com/pytorch/pytorch/pull/166258 on behalf of https://github.com/atalman due to breaks test/distributed/test_serialization.py::TestSerialization::test_weights_only [GH job link](https://github.com/pytorch/pytorch/actions/runs/18894311802/job/53929321703) [HUD commit link](`76b2c37045`) ([comment](https://github.com/pytorch/pytorch/pull/166258#issuecomment-3460964612))	2025-10-29 11:10:37 +00:00
Yuanyuan Chen	76b2c37045	[1/N] Remove unused loop variables (#166258 ) This PR removes unused loop variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166258 Approved by: https://github.com/Lucaskabela, https://github.com/mlazos	2025-10-29 01:34:15 +00:00
Shangdi Yu	236ce736a1	[reland] Add provenance to inductor IR nodes created after graph.run (#164255 ) (#164746 ) Summary: as title - Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes. - Extract `assign_origin_node` function Test Plan: ``` buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_deferred_triton_kernels ``` Differential Revision: D83979975 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746 Approved by: https://github.com/mlazos	2025-10-28 02:20:20 +00:00
Oguz Ulgen	8d4e48831e	Remove JITFunction constexpr and some arg_names (#166280 ) https://github.com/triton-lang/triton/pull/8536 breaks torch.compile integration. This PR attempts to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166280 Approved by: https://github.com/jansel	2025-10-27 09:29:03 +00:00
Maggie Moss	9940e894ea	Fix pyrefly ignore syntax in _inductor (#166247 ) Ensures each pyrefly ignore suppresses only the intended error code. pyrefly check lintrunner Pull Request resolved: https://github.com/pytorch/pytorch/pull/166247 Approved by: https://github.com/oulgen	2025-10-27 02:48:42 +00:00
PyTorch MergeBot	380d440d1c	Revert "inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040 )" This reverts commit 9038a30cee56e0d577a666fffa32e990732572d4. Reverted https://github.com/pytorch/pytorch/pull/164040 on behalf of https://github.com/karthickai due to Kindly add the test case mentioned in the issue ([comment](https://github.com/pytorch/pytorch/pull/164040#issuecomment-3444137989))	2025-10-24 17:14:45 +00:00
Jupiter-Guy	9038a30cee	inductor: avoid unrolling argmin/argmax reductions to preserve index … (#164040 ) …semantics on views; add regression test for transposed mutation (fixes #163929) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164040 Approved by: https://github.com/ngimel, https://github.com/jansel	2025-10-24 16:37:43 +00:00
Laith Sakka	017d2985f3	set unbacked bindings in reinplace pass for newly created nodes during generalize_scatter decomp (#164948 ) Two fixes: 1. In the reinplace pass, set unbacked bindings for newly created nodes. 2. In inductor, ComputeBuffer used to miss some used symbols; that is now fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164948 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #164341	2025-10-18 03:20:30 +00:00
Yuanyuan Chen	e925dfcc6b	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang, https://github.com/mlazos	2025-10-17 07:27:11 +00:00
PyTorch MergeBot	84d141e910	Revert "[inductor] Expand use of generic benchmark function (#164938 )" This reverts commit 5c583e2573f29243742e00b9fa36b266c5c78bb3. Reverted https://github.com/pytorch/pytorch/pull/164938 on behalf of https://github.com/clee2000 due to I think this broke test/inductor/test_cuda_repro.py::CudaReproTests::test_epilogue_fusion_with_view? [GH job link](https://github.com/pytorch/pytorch/actions/runs/18529735968/job/52813191763) [HUD commit link](`f58f301313`) on both rocm and the slow grad check for linux. It did run successfully on cuda workflow on trunk, I wonder if this a gpu capability thing? no clue though ([comment](https://github.com/pytorch/pytorch/pull/164938#issuecomment-3407600224))	2025-10-15 17:48:38 +00:00
Mwiza Kunda	5c583e2573	[inductor] Expand use of generic benchmark function (#164938 ) Use the more generic `Benchmarker.benchmark` function to allow benchmarking other devices that support the required functionality, for example prologue and epilogue fusion can be benchmarked for triton CPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938 Approved by: https://github.com/nmacchioni, https://github.com/eellison	2025-10-15 09:18:24 +00:00
Shunting Zhang	18b3658df9	[inductor][ez] properly print Pointwise (#165369 ) Previously when we print a ComputedBuffer for reduction, we get something like: ``` ComputedBuffer(name='buf0', layout=FixedLayout('cuda:0', torch.float32, size=[1, 768], stride=[768, 1]), data=Reduction( 'cuda', torch.float32, def inner_fn(index, rindex): _, i1 = index r0_0 = rindex tmp0 = ops.load(tangents_1, i1 + 768 * r0_0) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(primals_1, i1 + 768 * r0_0) tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16) tmp4 = ops.load(rsqrt, r0_0) tmp5 = tmp3 * tmp4 tmp6 = tmp1 * tmp5 return tmp6 , ``` But if we print a ComputedBuffer for a pointwise, we get something like ``` ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f12922c5bc0>, ranges=[32768, 768])) ``` Note that the inner function str is not printed. 
With the change, we get the inner_fn string printed in this case: ``` ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise( 'cuda', torch.bfloat16, def inner_fn(index): i0, i1 = index tmp0 = ops.load(tangents_1, i1 + 768 * i0) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(primals_2, i1) tmp3 = tmp1 * tmp2 tmp4 = ops.load(rsqrt, i0) tmp5 = tmp3 * tmp4 tmp6 = ops.load(buf1, i0) tmp7 = ops.constant(-0.5, torch.float32) tmp8 = tmp6 * tmp7 tmp9 = ops.load(rsqrt, i0) tmp10 = tmp9 * tmp9 tmp11 = tmp10 * tmp9 tmp12 = tmp8 * tmp11 tmp13 = ops.constant(0.0013020833333333333, torch.float32) tmp14 = tmp12 * tmp13 tmp15 = ops.load(primals_1, i1 + 768 * i0) tmp16 = ops.to_dtype(tmp15, torch.float32, src_dtype=torch.bfloat16) tmp17 = tmp14 * tmp16 tmp18 = tmp5 + tmp17 tmp19 = ops.load(buf1, i0) tmp20 = ops.constant(-0.5, torch.float32) tmp21 = tmp19 * tmp20 tmp22 = ops.load(rsqrt, i0) tmp23 = tmp22 * tmp22 tmp24 = tmp23 * tmp22 tmp25 = tmp21 * tmp24 tmp26 = ops.constant(0.0013020833333333333, torch.float32) tmp27 = tmp25 * tmp26 tmp28 = ops.load(primals_1, i1 + 768 * i0) tmp29 = ops.to_dtype(tmp28, torch.float32, src_dtype=torch.bfloat16) tmp30 = tmp27 * tmp29 tmp31 = tmp18 + tmp30 tmp32 = ops.to_dtype(tmp31, torch.bfloat16, src_dtype=torch.float32) return tmp32 , ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165369 Approved by: https://github.com/eellison	2025-10-14 06:08:12 +00:00
nullplay	ac529df244	Native matmul (#157743 ) ### Implementation of #151705 This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly, without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705: 1. Basic support (this PR) 2. Lazy broadcasting for optimal performance (future PR) ### Summary of This PR This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead. ### Notable Changes 1. Adds a new config flag: `config.triton.enable_native_matmul` 2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled 3. Enforces tiling suitable for matmul when the native matmul flag is enabled 4. Implements code generation for `ops.dot` 5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal: it currently takes a long time to tune, and I think there must be a better way to tackle this. @eellison @jansel @PaulZhang12 @shunting314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743 Approved by: https://github.com/jansel	2025-10-14 04:22:30 +00:00
Lucas Kabela	f363114852	[Bugfix][Inductor][Dynamo] Fix stride incorrectness issues for stride 0 tensor (#164897 ) Fixes #164814 - we update to include cases where we know symbolic expression is statically one. There are two errors here; first in graph capture, where a tensor with size 0 yet symbolic stride would attempt to keep the symbolic stride, resulting in a mismatch. The second is in inductor code gen, where we only checked in squeeze if size == 1, missing the case where a symbolic stride equals 1. Also fixes #164924 (@bobrenjc93 for fuzzer finding an issue affecting users : ) ### Test plan: ``` python test/dynamo/test_aot_autograd.py AotAutogradFallbackTests ``` Results in: ``` .. ---------------------------------------------------------------------- Ran 49 tests in 45.622s OK (expected failures=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164897 Approved by: https://github.com/laithsakka	2025-10-10 21:26:57 +00:00
Yidi Wu	600db525bd	[easy][while_loop] use copy_input instead of clone in _clone_aliased_inputs (#164955 ) Compared with clone, ExternKernel.copy_input additionally realizes the buffer; downstream code assumes the input buffers are realized. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164955 Approved by: https://github.com/BoyuanFeng	2025-10-09 23:39:00 +00:00
Maggie Moss	9944cac6e6	Add suppressions to torch/_inductor (#165062 ) Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283 Split this directory into two PRs to keep them from being too large. Test plan: dmypy restart && python3 scripts/lintrunner.py -a pyrefly check step 1: delete lines in the pyrefly.toml file from the project-excludes field step 2: run pyrefly check step 3: add suppressions, clean up unused suppressions before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199 after: INFO 0 errors (6,884 ignored) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062 Approved by: https://github.com/oulgen, https://github.com/mlazos	2025-10-09 20:34:20 +00:00
Shunting Zhang	2982406721	[inductor] ban benchmarking by default in deterministic mode (#164532 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164532 Approved by: https://github.com/eellison ghstack dependencies: #164801	2025-10-08 20:55:15 +00:00
Yuanyuan Chen	35c4130fd1	[2/N] Fix ruff warnings (#164460 ) Apply ruff `SIM` rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460 Approved by: https://github.com/ezyang	2025-10-04 03:40:32 +00:00
Aaron Gokaslan	9eb89a4ad5	Add missing TypeIs to torch/_inductor/ir.py (#164489 ) This should be a TypeIs here Pull Request resolved: https://github.com/pytorch/pytorch/pull/164489 Approved by: https://github.com/mlazos	2025-10-03 19:34:20 +00:00
PyTorch MergeBot	a34797e031	Revert "Add provenance to inductor IR nodes created after graph.run (#164255 )" This reverts commit b9e73e639e36f3aa628752161711e68878231b30. Reverted https://github.com/pytorch/pytorch/pull/164255 on behalf of https://github.com/jeffdaily due to broke rocm; inductor/test_provenance_tracing.py::TestProvenanceTracingStackTraces::test_deferred_triton_kernels [GH job link](https://github.com/pytorch/pytorch/actions/runs/18200790301/job/51821738132) [HUD commit link](`b9e73e639e`) ([comment](https://github.com/pytorch/pytorch/pull/164255#issuecomment-3363360088))	2025-10-02 22:01:41 +00:00
Shangdi Yu	b9e73e639e	Add provenance to inductor IR nodes created after graph.run (#164255 ) Summary: as title - Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes. - Extract `assign_origin_node` function Differential Revision: D82871244 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255 Approved by: https://github.com/mlazos	2025-10-02 17:32:46 +00:00
eellison	0d7994ca97	[inductor] do comm compute overlap at aten fx level (#163215 ) This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing. Subsequent prs will handle: - use of exposure analysis to do bucketing - make sure inductor respects comm/compute overlapping done at fx level - non-profiling mm estimation/rank broadcasting of profile results Other mis: - Validate accuracy of nccl estimations ( use ruisi's profiling instead ?) For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives. fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3 bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215 Approved by: https://github.com/IvanKobzarev	2025-09-30 04:53:58 +00:00
Pian Pawakapan	474d07554a	[dynamic shapes] unbacked-safe slicing (#161414 ) Summary: Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics. Test Plan: contbuild & OSS CI, see `56218d85e2` Rollback Plan: Differential Revision: D80948073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161414 Approved by: https://github.com/laithsakka	2025-09-30 01:15:19 +00:00
PyTorch MergeBot	0f619c1f89	Revert "[inductor] do comm compute overlap at aten fx level (#163215 )" This reverts commit c9b5af9a384e7ef5f95613abe1622f5f55133c3a. Reverted https://github.com/pytorch/pytorch/pull/163215 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see `c9b5af9a38 (51526707590-box)` ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))	2025-09-29 21:53:42 +00:00
eellison	c9b5af9a38	[inductor] do comm compute overlap at aten fx level (#163215 ) This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing. Subsequent prs will handle: - use of exposure analysis to do bucketing - make sure inductor respects comm/compute overlapping done at fx level - non-profiling mm estimation/rank broadcasting of profile results Other mis: - Validate accuracy of nccl estimations ( use ruisi's profiling instead ?) For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives. fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3 bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215 Approved by: https://github.com/IvanKobzarev	2025-09-29 18:18:03 +00:00
can-gaa-hou	eb4361a801	[Fix] Adding missing `f` prefixes to formatted strings [1/N] (#164065 ) As stated in the title. * #164068 * #164067 * #164066 * __->__ #164065 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164065 Approved by: https://github.com/Skylion007	2025-09-29 04:53:00 +00:00
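The class of bug fixed by the commit above is easy to reproduce in isolation. The snippet below is a minimal sketch with a hypothetical buffer name and message; it is not a line from the PyTorch code base.

```python
buffer_name = "buf0"  # hypothetical value used only for illustration

# Without the `f` prefix the braces are kept literally and nothing is substituted.
message_without_prefix = "unknown buffer {buffer_name}"

# With the `f` prefix the expression inside the braces is evaluated.
message_with_prefix = f"unknown buffer {buffer_name}"

print(message_without_prefix)
print(message_with_prefix)
```

Because the unprefixed string is still valid Python, the mistake raises no error at runtime and usually surfaces only as an unhelpful log or exception message.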
Yidi Wu	3413490f53	[scan] materialize combine_fn in forward add more autograd tests (#161732 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732 Approved by: https://github.com/zou3519 ghstack dependencies: #161557, #161664, #161808, #162025	2025-09-27 18:13:15 +00:00
Shangdi Yu	520fca82c8	Refactor Provenance Tracking (#163378 ) Summary: - Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simplify the code - Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`. - If some `call_kernel` call sites don't have a preceding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied by a `codegen_comment` call. - Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling - Remove `debug_handle` from `call_kernel`. Test Plan: CI ``` buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing ``` Differential Revision: D82839271 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378 Approved by: https://github.com/angelayi	2025-09-25 22:55:59 +00:00
Nan Zhang	9341ede617	Revert to old behaviour of not padding strides if shape or stride is dynamic (#163639 ) Differential Revision: D83053287 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163639 Approved by: https://github.com/blaine-rister	2025-09-24 18:31:01 +00:00
Blaine Burton Rister	f68de58c9d	[Inductor-FX] Support symbol and dynamic scalar graph inputs and outputs (#163596 ) # Problems This PR fixes a few edge cases that the FX converter missed related to dynamic shapes. 1. Inductor graphs can sometimes take `sympy.Symbol` inputs. We have logic to convert these to FX placeholder nodes. However, this logic did not update the `self.expr_to_proxy` table mapping symbols to proxy nodes. (There was existing logic to do this for `ir.TensorBox` inputs, but not `sympy.Symbol`.) This caused sympy tracing to fail when these symbol inputs were used in other expressions. 2. We lacked codegen for `ShapeAsConstantBuffer`. This IR node is seen when the graph input or output is a scalar computed from dynamic shapes. # Fixes a. Update `self.expr_to_proxy` when generating placeholders for `sympy.Symbol` inputs. Change `SymbolBuffer.get_example` to convert the symbol to a `torch.SymInt`, so we can populate `meta["val"]` correctly and use the value in other computations. b. Support `ShapeAsConstantBuffer` by tracing the sympy expression. c. Move output generation inside the metadata hook, allowing us to populate `meta["val"]` for the nodes computing `ShapeAsConstantBuffer`. # Test plan Added several new CI tests: 1. `torch.cond` with dynamic shapes. This exposes both issues, as the predicate is a `ShapeAsConstantBuffer` and one of the subgraphs uses a symbol input, due to the closure. Also tests when the parent and subgraphs have different input shapes. 2. Output dynamic shape scalar. This tests `ShapeAsConstantBuffer` as an output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163596 Approved by: https://github.com/angelayi, https://github.com/jansel	2025-09-24 06:08:14 +00:00
Colin Peppler	3ef1bef36c	[sdpa] make sure to recompile if alignment is different than before (#163083 ) ## Context An example from Qwen2-7B - This comes from running torch.compile with a sequence length that is divisible by 8 (no padding needed). Call this `Run1`. - If we then run the compiled model with a different length that isn't divisible by 8 (requires padding). Call this `Run2`. - Then we'll see this error. ``` File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48(512 // (512 // s19))(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845) File "torch/_ops.py", line 841, in __call__ return self._op(args, *kwargs) RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4. ``` - We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`. ### A bit more on why. Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`. `0897affcd5/torch/_inductor/lowering.py (L2687-L2694)` ## Fix Size hint doesn't guard, so the fix is to use `guard_or` to guard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083 Approved by: https://github.com/eellison	2025-09-23 01:33:33 +00:00
Yuanyuan Chen	60c2bdedcd	Replace Literal[None] with None in typing (#163489 ) This PR replaces Literal[None] with None in typing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163489 Approved by: https://github.com/Skylion007, https://github.com/mlazos	2025-09-22 22:10:08 +00:00
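For context on the typing change above: type checkers treat `Literal[None]` as equivalent to plain `None` (PEP 586), so the shorter spelling loses nothing. A minimal sketch, with hypothetical alias and function names that are not taken from the PyTorch code base:

```python
from typing import Literal, Union, get_args

# Older spelling: None wrapped in Literal.
OldStyle = Union[int, Literal[None]]

# Preferred spelling: plain None, which type checkers treat the same way.
NewStyle = Union[int, None]


def describe(value: NewStyle) -> str:
    # Behaviour is identical whichever annotation is used; only the spelling differs.
    return "missing" if value is None else f"value={value}"


# At runtime, Literal[None] simply carries the None object as its single argument.
literal_args = get_args(Literal[None])
```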
Boyuan Feng	4967ad8baa	[Graph Partition] improve custom op output alias (#163227 ) For a custom op with multiple outputs, we will see the following generated code: ``` buf1 = op1(arg0) buf3 = buf0[0] buf4 = buf0[1] del buf1 # <--- if buf1 is not accessed in the future ``` If `buf1` is not accessed in the future, it's good to deallocate early. So we don't delay `del` until both buf3 and buf4 are not used anymore. Note that buf3 and buf4 hold reference to the data such that `del buf1` does not prevent their usage. However, when there are mutating args, we don't see `del buf1` immediately. ```python @torch.library.custom_op( "mylib::op1", mutates_args=["x"], schema="(Tensor(a!)? x) -> (Tensor, Tensor)", device_types="cuda", ) def op1(x) -> tuple[torch.Tensor, torch.Tensor]: x = x + 1 return (x + 1, x + 2) ``` <img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" /> Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias output. `72fedf0575/torch/_inductor/ir.py (L7976-L7982)` According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that are auto-functionalizable, buf1's output should NOT alias any of the inputs. This PR improves get_inputs_that_alias_output of Fallback Kernel. Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163227 Approved by: https://github.com/zou3519	2025-09-19 17:01:36 +00:00
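The claim in the commit above that deleting the container early does not invalidate the extracted outputs follows from ordinary Python reference semantics. The sketch below uses a hypothetical `Buffer` class and a toy `op1` in place of a real custom op; it only mirrors the variable names from the generated code quoted in the message.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a tensor-like output."""

    def __init__(self, name: str) -> None:
        self.name = name


def op1() -> tuple:
    # Toy op returning two outputs packed in a tuple.
    return (Buffer("out0"), Buffer("out1"))


buf1 = op1()
buf3 = buf1[0]
buf4 = buf1[1]

# Track the first output without keeping it alive ourselves.
tracker = weakref.ref(buf3)

# Dropping the container releases only the tuple; buf3 and buf4 still
# hold their own references to the individual outputs.
del buf1
still_alive = tracker() is buf3
```

After `del buf1`, both `buf3` and `buf4` remain usable, which is why the generated code can drop the container as soon as it is no longer accessed directly.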
Xuan Zhang	6e680ae8de	add more restriction to fusion with large accumulate reads (#163163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163163 Approved by: https://github.com/yf225	2025-09-19 01:20:30 +00:00
Ruben Rodriguez Buchillon	c230ac7300	[inductor][ez] add ChoiceCaller annotations (#162672 ) # why - enable ChoiceCaller generation to provide extra information that feedback_saver_fns (functions registered to run at the bench of benchmarking) can use afterwards - users that extend ChoiceCaller creation e.g. by creating their own InductorChoices can use this to shuttle through information # what - add an annotations dictionary to ChoiceCaller class # testing n/a Pull Request resolved: https://github.com/pytorch/pytorch/pull/162672 Approved by: https://github.com/nmacchioni	2025-09-16 20:49:55 +00:00
joshuamarkovic	559e8d1c20	[doc]: Small typos (#162982 ) Small typo fixes Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-09-16 17:42:19 +00:00


1031 Commits