Adds a [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which Inductor also respects by adding weak deps on those dependencies.
This can be generally useful (such as for ordering collectives), but here I am using it so that fusions do not interfere with the comm-compute overlap planned at the aten level.
There's definitely some similarity with the `with_effects` HOP. Talked with @angelayi - when @zou3519 is back we will figure out how we want to consolidate.
The implementation needs a subgraph (as opposed to `with_effects`) because Inductor relies on `V.graph.current_node`. Changing the node's signature with `with_effects` breaks this, and it also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By keeping the node with its original calling structure inside a subgraph, this all works.
Example transformation:
Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```
The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.
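Semantically, the HOP is a passthrough: it returns exactly what the wrapped subgraph computes on its real arguments, and the leading tuple of dependencies only contributes scheduling edges. A rough sketch of that contract (illustrative only, not the actual HOP registration):
```python
# Hypothetical sketch of the contract, not the actual HOP implementation.
def control_deps(additional_deps, subgraph, *args):
    # additional_deps carry ordering information only; their values are never read.
    # The result is exactly what the wrapped subgraph computes on *args.
    return subgraph(*args)
```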
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
Summary:
As titled.
Without the diff, we got P1963055009
With the diff, passing in the environment lets us do the correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579
Differential Revision: D83211018
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is divisible by 8 (no padding needed). Call this `Run1`.
- We then run the compiled model with a different sequence length that isn't divisible by 8 (requires padding). Call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`; instead, we ran the `Run2` inputs through the same graph as `Run1`.
### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice), which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)
## Fix
`size_hint` doesn't install a guard, so the fix is to use `guard_or*`, which does.
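A minimal sketch of the intended semantics, with a hypothetical stand-in for the check in `lowering.py` (not the actual code): a size hint just picks the concrete value seen at trace time and bakes the decision in silently, while `guard_or_false` records a guard when the condition is decidable and returns the safe fallback when it is data-dependent.
```python
# Sketch only; `should_realize_unpadded` and the condition are hypothetical.
from torch.fx.experimental.symbolic_shapes import guard_or_false

def should_realize_unpadded(seq_len):
    # Decidable: a guard is recorded, so Run2 (not divisible by 8) recompiles
    # instead of silently reusing Run1's choice.
    # Data-dependent (unbacked): returns False rather than raising a DDE.
    return guard_or_false(seq_len % 8 == 0)
```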
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
# Feature
This PR supports lowering `IndexPutFallback` through Inductor's FX converter. The approach is very similar to the one taken in https://github.com/pytorch/pytorch/pull/162686.
Compared to `ScatterFallback`, this required one additional change: the value of `self.op_overload` for `IndexPutFallback` was inaccurate. Previously, it used `aten.index_put`, which would result in unsound FX IR. The existing Python/C++ codegen uses `aten.index_put_`, since the fallback mutates its input. This PR changes `self.op_overload` to match that.
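For context, the functional and in-place overloads differ in whether the destination is mutated, which is why emitting the functional `aten.index_put` for a mutating fallback would be unsound. A quick standalone illustration (not from the PR):
```python
import torch

x = torch.zeros(4)
idx = (torch.tensor([0, 2]),)
vals = torch.tensor([1.0, 2.0])

y = torch.ops.aten.index_put.default(x, idx, vals)  # functional: x is untouched
torch.ops.aten.index_put_.default(x, idx, vals)      # in-place: x is mutated

print(x)  # tensor([1., 0., 2., 0.])
print(y)  # tensor([1., 0., 2., 0.])
```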
# Test plan
Added a CI test lowering deterministic index put via the FX converter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162863
Approved by: https://github.com/angelayi
This PR introduces a device_assert op to trigger device-side assertions within torch.compile. This implementation is based on the suggestion in [this comment](https://github.com/pytorch/pytorch/issues/147282#issuecomment-2756056084).
Changes Included
- Implemented the device_assert op and overrode has_side_effect to return True so the op isn't removed by dead code elimination.
- Commented out the assert_async_msg_decomp and functional_assert_async_msg_decomp decompositions to disable the default assert decomposition inside Inductor.
- Added a lowering for torch.ops.aten._assert_async.msg to convert assert calls into the ops_handler (see the example below).
- Implemented the codegen method for the device_assert op, supporting both C++ and Triton code generation.
- Added test cases to verify both "should throw" and "should not throw" scenarios.
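For example, an assert like the following should now survive compilation and fire on the device instead of being decomposed away (a minimal sketch; the condition and message are made up):
```python
import torch

@torch.compile
def f(x):
    # Lowers through torch.ops.aten._assert_async.msg; has_side_effect keeps it
    # alive, so Inductor emits a device-side (or C++) assert.
    torch.ops.aten._assert_async.msg((x > 0).all(), "x must be positive")
    return x * 2

f(torch.ones(8, device="cuda" if torch.cuda.is_available() else "cpu"))
```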
Fixes #147282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160677
Approved by: https://github.com/mlazos, https://github.com/atalman
scaled_grouped_mm's kernel only supports a column-major second operand. I *think* this is just for efficiency reasons. But Inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.
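For reference, "column-major on the second operand" here means the second-to-last dimension has stride 1; a quick way to see the two layouts (illustrative only, not from the PR):
```python
import torch

b = torch.randn(8, 64, 32)                                  # row-major in the last two dims
b_col = b.transpose(-2, -1).contiguous().transpose(-2, -1)  # column-major in the last two dims
print(b.stride())      # (2048, 32, 1)
print(b_col.stride())  # (2048, 1, 64)
```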
~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have a registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From the discussion below, the op tags are expected to work.
FIXES https://github.com/pytorch/pytorch/issues/159097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
Rewrites the bucketing of all_gather and reduce_scatter to define the "merge graph" via torch functions that are traced:
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`
(instead of creating nodes and doing FakeTensor prop manually).
This makes it easy to experiment with the merge function.
The all_gather merge function uses `foreach_copy_`, so an Inductor lowering for `foreach_copy_` was added.
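For reference, `foreach_copy_` copies each source tensor into the matching destination in place; a minimal standalone example (not from the PR):
```python
import torch

dsts = [torch.empty(4) for _ in range(3)]
srcs = [torch.randn(4) for _ in range(3)]
torch._foreach_copy_(dsts, srcs)  # in place: each dsts[i] now equals srcs[i]
```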
Also adds a topological sort after the bucketing passes (see the comment in post_grad.py below, and the sketch after it):
```
# Fx collectives bucketing passes require topological sort for the cases:
# when bucketed collectives have users before the last collective in the bucket
# AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
#
# In this case we can not manually pick the place for bucketed collective insertion.
# But we are guaranteed by the bucketing (independent collectives in the bucket),
# that it is possible to reorder nodes to satisfy all ordering requirements.
#
# --- before bucketing ---
# in0 = ...
# wait_ag0 = ag(in0)
# user0(wait_ag0)
# ...
# pre_in1 = ...
# in1 = transform(pre_in1)
# wait_ag1 = ag(in1)
# user1(wait_ag1)
#
# --- after bucketing ---
#
# in0 = ...
# user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
#
# pre_in1 = ...
# in1 = transform(pre_in1)
# ag_bucket(in0+in1)
# wait_bucket
# wait_ag0 = wait_bucket[0]
# wait_ag1 = wait_bucket[1]
# user1(wait_ag1)
```
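A minimal sketch of what such a stable topological re-sort over an FX graph could look like (illustrative only; `topo_resort` is a hypothetical name, not the actual pass):
```python
import torch.fx as fx

def topo_resort(graph: fx.Graph) -> fx.Graph:
    # Stable Kahn-style sort: emit nodes in their original order as soon as
    # all of their inputs have already been emitted.
    new_graph = fx.Graph()
    env = {}  # maps old nodes to their copies in new_graph
    pending = list(graph.nodes)
    while pending:
        remaining = []
        for node in pending:
            if all(inp in env for inp in node.all_input_nodes):
                env[node] = new_graph.node_copy(node, lambda n: env[n])
            else:
                remaining.append(node)
        if len(remaining) == len(pending):
            raise RuntimeError("cycle in FX graph; cannot topologically sort")
        pending = remaining
    return new_graph
```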
Correctness of the passes is verified by the loss curves for Llama3 8B with simple_fsdp and with autoparallel:
<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
## Fixes https://github.com/pytorch/pytorch/issues/157959
## Mini repro from the issue
```python
import torch
from torch import nn

class Foo(nn.Module):
    def __init__(
        self,
        use_parameter: bool
    ) -> None:
        super().__init__()
        self.b = 101
        if use_parameter:
            self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # return x + self.b
        # return x - self.b
        return x / self.b
        # return x * self.b

torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)
models = [
    Foo(False),
    Foo(True),
    torch.compile(Foo(False), fullgraph=True),
    torch.compile(Foo(True), fullgraph=True),
]
for m in models:
    print((m(x) - expected).sum())
```
All outputs equal zero except the result of `torch.compile(Foo(False), fullgraph=True)`.
## Summary
When the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal. This can lose precision in the C++ kernel, but not in the Triton kernel.
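A quick standalone way to observe the effect (illustrative only, not from the PR; it emulates the float32 reciprocal-multiply that the generated C++ kernel performs):
```python
import torch

torch.manual_seed(42)
x = torch.rand(5, 5)              # float32, matching the repro
eager = x / 101                   # true division
recip = x * 0.009900990099009901  # scalar is effectively rounded to float32 before the multiply
print((eager - recip).abs().max())  # typically nonzero: the float32 reciprocal path loses precision
```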
## Why
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const float* in_ptr0,
float* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
}
}
}
}
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>
int main() {
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = static_cast<double>(0.009900990099009901);
std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
return 0;
}
```
The output is:
```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
`auto tmp1 = static_cast<float>(0.009900990099009901);` causes tmp1 to become ~0.00990099 (float32), losing precision, so the final result does not match the expected value.
I also found the place in the lowering where this happens:
86d8af6a6c/torch/_inductor/lowering.py (L6238)
The original commit states that this precision loss is expected and matches the CUDA implementation.
Original commit:
03439d4c1c
CUDA implementation:
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)
Interestingly, the Triton kernel works correctly because of the precision of Python's float type (it is double precision).
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 25
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = 0.009900990099009901
tmp2 = tmp0 * tmp1
tl.store(out_ptr0 + (x0), tmp2, xmask)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
When select has a data-dependent input, we can't tell whether the actual index should be index + size (negative-index wrap-around) or index.
To avoid throwing a DDE, we allocate a new unbacked symbol to represent the storage offset of the
output view and compute its value dynamically at runtime during Inductor lowering.
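For context, the ambiguity comes from negative-index wrap-around; a small concrete example with a plain int index (with an unbacked SymInt, the branch below is exactly what we cannot decide):
```python
import torch

x = torch.arange(12).reshape(3, 4)
i = -1  # imagine this is an unbacked SymInt (e.g. from .item())
view = x.select(0, i)
# The storage offset depends on the sign of i:
offset = (i + x.size(0) if i < 0 else i) * x.stride(0)
print(view.storage_offset(), offset)  # 8 8
```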
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise it raises a DDE and we can't fall back to the safe/slower path (see the sketch after this list).
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2 and that falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).
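A rough sketch of the pattern (the helper below is hypothetical, not the actual Inductor code): take the optimized path only when it is provably safe, and let an unbacked/data-dependent size fall through to the general path instead of raising.
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def broadcast_needed(a_dim, b_dim):
    # Hypothetical stand-in for a broadcast check.
    # statically_known_true never raises a DDE: it returns True only when the
    # equality is provable, and False otherwise (including for unbacked sizes).
    if statically_known_true(a_dim == b_dim):
        return False  # provably equal: no broadcast needed, fast path is safe
    return True       # unknown or unequal: take the general path
```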
### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/ir.py:2797 in create)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison