pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Aleksei Nikiforov	c733072874	Fix IValue from SymBool on big-endian system (#163647 ) Skip test_compiled_autograd_attribution on s390x It fails both on s390x and x86_64 at least under some circumstances. Disable it on s390x for now until it works reliably. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163647 Approved by: https://github.com/malfet	2025-10-14 15:07:48 +00:00
Lucas Kabela	1fa11f42b1	[Bugfix][vLLM] Explicitly do not support instead of crashing for named tuples in infer schema (#165191 ) Fixes https://github.com/vllm-project/vllm/issues/25270 by being explicit in erroring; previously we had a cryptic `__origin__ undefined` error, but now should give proper error message that we don't support NamedTuples in schema Test with ``` python test/test_custom_ops.py TestCustomOp.test_unsupported_param_types ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165191 Approved by: https://github.com/zou3519	2025-10-14 14:18:42 +00:00
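The idea in the commit above (fail with an explicit message instead of a cryptic `__origin__` error) can be illustrated with a minimal plain-Python sketch. This is not the code from the PR: `is_namedtuple_type`, `check_param_type`, and the `Point` class are hypothetical names introduced here for illustration only.

```python
from typing import NamedTuple


class Point(NamedTuple):  # hypothetical parameter type used only for this example
    x: int
    y: int


def is_namedtuple_type(annotation) -> bool:
    """Return True for classes built with typing.NamedTuple or collections.namedtuple."""
    try:
        # NamedTuple classes are tuple subclasses that carry a `_fields` attribute.
        return issubclass(annotation, tuple) and hasattr(annotation, "_fields")
    except TypeError:
        # `annotation` is not a class (for example a generic alias such as list[int]).
        return False


def check_param_type(name: str, annotation) -> None:
    """Hypothetical schema check: reject NamedTuples early with a readable error."""
    if is_namedtuple_type(annotation):
        raise ValueError(
            f"Parameter {name!r} has unsupported type {annotation.__name__}: "
            "NamedTuples are not supported in the inferred schema."
        )
```

Calling `check_param_type("p", Point)` raises a `ValueError` that names the offending parameter, while `check_param_type("n", int)` returns silently.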
Colin Peppler	306c55ba27	[atomically_apply_size_hint] Make unbacked replacements reconciles to a single expr (#164324 ) ## Problem There are limitations with today's `atomically_apply_size_hint`, though it works for most of the failures we've observed so far. However, it's easy to come up with an edge case. Suppose you encounter this setup. ``` a: [s0 + u0] b: [s1 + u1] c: [u2 + u3] d: [u100] ``` Today, we use a few heuristics to specify the LHS and RHS for replacements. `10d2734d9b/torch/_inductor/sizevars.py (L730-L759)` It's possible to end up with these replacement rules. Notice how there's no replacement for `s1 + u1` and `u2 + u3` :( That's because today picking the LHS and RHS matters a lot, and `s1 + u1` & `u2 + u3` happened to end up on the RHS. ``` s0 + u0 => s1 + u1 s0 + u0 => u2 + u3 # overrides previous replacement; each expr only gets one replacement s0 + u0 => u100 # overrides previous replacement; ditto ``` I believe what we really want is this: everybody gets a replacement! And they all should (eventually) settle at the same canonical expr (i.e. `u100`) when running the replacement several times. ``` s1 + u1 ==> s0 + u0 u2 + u3 ==> s0 + u0 s0 + u0 ==> u100 ``` We can just short-cut this by using the canonical expr as the replacement. ``` s1 + u1 ==> u100 u2 + u3 ==> u100 s0 + u0 ==> u100 ``` ## Implementation I offer one way to deal with this: 1. ensure every expression has one canonical replacement (i.e. `u100`) 2. if two expressions are equal (inferred from `deferred_runtime_asserts`), then they must have the same canonical replacement We can implement the above with union find. * Whenever you see `Eq(lhs, rhs)` then do `union(lhs, rhs)`. * Whenever you want to find the canonical replacement for a given expr then do `find(expr)`. * When picking the canonical replacement we can use a few heuristics like (1) prefer a fully backed expr, (2) replacing with sub-expressions, and whatever else we'd like.
Differential Revision: [D84549260](https://our.internmc.facebook.com/intern/diff/D84549260) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164324 Approved by: https://github.com/laithsakka	2025-10-14 13:57:33 +00:00
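The union-find approach described in the commit above can be sketched as follows. This is an illustrative stand-alone version, not the code from the PR: expressions are plain strings rather than sympy expressions, and the `prefer` heuristic (fewest terms, then lexicographic order) is an assumption chosen only so that `u100` wins in the example from the message.

```python
class UnionFind:
    """Union-find over expressions; the root of each set is its canonical replacement."""

    def __init__(self, prefer):
        self.parent = {}
        self.prefer = prefer  # prefer(a, b) returns whichever root should stay canonical

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        keep = self.prefer(ra, rb)
        other = rb if keep == ra else ra
        self.parent[other] = keep


def prefer(a, b):
    # Hypothetical heuristic: fewer "+"-separated terms wins, ties broken lexicographically.
    return min(a, b, key=lambda e: (len(e.split("+")), e))


# Equalities as they might be inferred from deferred runtime asserts (example from the message).
equalities = [("s0 + u0", "s1 + u1"), ("s0 + u0", "u2 + u3"), ("s0 + u0", "u100")]

uf = UnionFind(prefer)
for lhs, rhs in equalities:
    uf.union(lhs, rhs)

# Every expression maps straight to the canonical one; the canonical expr needs no rule.
replacements = {e: uf.find(e) for e in list(uf.parent) if uf.find(e) != e}
```

With this heuristic every non-canonical expression ends up with `u100` as its replacement, regardless of which side of each equality it appeared on.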
Dzmitry Huba	5fbf93b774	Introduce automatic wrapper to run DTensor tests under local tensor mode (#165383 ) The wrapper makes it possible to share the test body implementation while eliminating the need to write the test class by hand. As an example, this change converts the whole DTensorTest to use local tensor mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165383 Approved by: https://github.com/ezyang	2025-10-14 06:08:03 +00:00
Angel Li	a856a17799	bf16 support for per_channel bwd (#165325 ) Follow up to #165098 - adding bf16 support for the backward pass. To avoid BC breaking changes/losing precision, we upcast the parameters to fp32 after the op gets called, and downcast the gradients to bf16 before returning. For testing, we upcast to fp32 before calling the reference function. We increase the tolerance to 1e-2 for bf16 inputs because of a difference in casting calculations between python's `x.to(torch.bfloat16)` and cpp's `x.to(at::kBFloat16)` (after comparing intermediate tensors, we found that the numerics diverge after the final casting). We don't explicitly cast in the CPP op but rather let autograd/optimizer handle it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165325 Approved by: https://github.com/andrewor14	2025-10-14 05:47:32 +00:00
PyTorch MergeBot	33bfec27ff	Revert "use sym_numel, to allow fake tensors to work (#163831 )" This reverts commit e71c75680f2d6ce5f61ad4b2125f4934087762eb. Reverted https://github.com/pytorch/pytorch/pull/163831 on behalf of https://github.com/isuruf due to test failure on mps introduced ([comment](https://github.com/pytorch/pytorch/pull/163831#issuecomment-3400131730))	2025-10-14 05:10:56 +00:00
nullplay	ac529df244	Native matmul (#157743 ) ### Implementation of #151705 This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly, without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705: 1. Basic support (this PR) 2. Lazy broadcasting for optimal performance (future PR) ### Summary of This PR This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead. ### Notable Changes 1. Adds a new config flag: `config.triton.enable_native_matmul` 2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled 3. Enforces tiling suitable for matmul when the native matmul flag is enabled 4. Implements code generation for `ops.dot` 5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal: it currently takes a long time to tune, and I think there must be a better way to tackle this. @eellison @jansel @PaulZhang12 @shunting314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743 Approved by: https://github.com/jansel	2025-10-14 04:22:30 +00:00
PyTorch MergeBot	fa3916f466	Revert "[export] Turn on install_free_tensors flag (#164691 )" This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590. Reverted https://github.com/pytorch/pytorch/pull/164691 on behalf of https://github.com/seemethere due to Breaks some internal things, both me and author agreed that revert was the best course of action ([comment](https://github.com/pytorch/pytorch/pull/164691#issuecomment-3400013759))	2025-10-14 03:58:12 +00:00
PyTorch MergeBot	267348fe7f	Revert "Fix double dispatch to Python for detach (#163671 )" This reverts commit a3e3efe474bef63940ded803e78bb2a382681f1e. Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/seemethere due to We should've reverted this when we decided to revert https://github.com/pytorch/pytorch/pull/164691 since they were actually stacked ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3400009953))	2025-10-14 03:55:36 +00:00
PyTorch MergeBot	1803d40c99	Reapply "[export] Turn on install_free_tensors flag (#164691 )" (#165353 ) This reverts commit 9166f6120f63e2d5d76e6ccdbfccb8d6e41cbb43. Reverted https://github.com/pytorch/pytorch/pull/165353 on behalf of https://github.com/seemethere due to This is causing merge conflicts since a dependent PR wasn't reverted ([comment](https://github.com/pytorch/pytorch/pull/165353#issuecomment-3400006587))	2025-10-14 03:52:50 +00:00
VINAY PRITHYANI	e71c75680f	use sym_numel, to allow fake tensors to work (#163831 ) Fixes #[163759](https://github.com/pytorch/pytorch/issues/163759) Replace `numel` with `sym_numel`. Tested with the example in the issue, and it works now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163831 Approved by: https://github.com/bobrenjc93	2025-10-14 03:33:28 +00:00
Nikita Shulga	770e6b910c	[DTensor] Extend conv ops to 3D (#165241 ) Current implementation hardcodes 4D input and output tensor shapes Change that by computing `output_conv_shape` for any number of input dims Replace `[.., .., .., slice]` with `[..., slice]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165241 Approved by: https://github.com/ezyang	2025-10-14 02:30:46 +00:00
Animesh Jain	9166f6120f	Revert "[export] Turn on install_free_tensors flag (#164691 )" (#165353 ) This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/165353 Approved by: https://github.com/seemethere	2025-10-13 23:40:11 +00:00
Animesh Jain	f3683453ae	[compile] Regional inductor compilation with fx.annotate (#164776 ) This PR introduces a way to compile a region of FX graph using `fx.traceback.annotate`. ### UX 1) In the user code, mark the region that you want to be compiled with inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we just rely on the string `compile_with_inductor` and ignore the integer. As the needs arise, we can update the logic. Example ``` def fn(x, y): sin = torch.sin(x) with fx_traceback.annotate({"compile_with_inductor": 0}): mul = sin * y add = mul + 1 return torch.sin(add) ``` 2) You have to instruct the compiler to use the annotations with `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that just setting annotation is enough. But for now to control the blast radius, we need to explicitly do this. One such example is ``` # Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor` def aot_eager_regional_inductor(): return aot_autograd( fw_compiler=compile_fx_annotated_nodes_with_inductor, bw_compiler=compile_fx_annotated_nodes_with_inductor, ) ``` 3) Fixable in short-term - You have to wrap the user code in `torch.fx.traceback.preserve_node_meta` to ensure that annotations are propagated to the compiler. This is fixable, just need to make CI happy. ### Implementation 1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on annotations, and then create subgraphs in the main graph. 
2) Call `torch._inductor.standalone_compile` on these subgraphs, and jam the returned callable into the FX graph at the place of call_module Resulting graph looks something like this - search for `torch__inductor_standalone_compile_inner` Forward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) sin: "f32[10]" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(sin, primals_2) # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1 getitem: "f32[10]" = inner[0]; inner = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem) return (sin_1, primals_1, primals_2, sin, getitem) ``` Backward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1); primals_1 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) cos: "f32[10]" = torch.ops.aten.cos.default(add); add = None mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos); tangents_1 = cos = None # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2); mul_1 = sin = primals_2 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y getitem: "f32[10]" = inner[0] getitem_1: "f32[10]" = inner[1]; inner = None # File: 
/data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1); getitem_1 = cos_1 = None return (mul_4, getitem) ``` ### Some issues raised in the HOP meeting 1) CSE will not differentiate different meta custom nodes and will do the wrong thing. 2) SAC - The recomputed forward will be smaller than the forward. Will we compile a smaller region than? 3) What happens if you have an op in the middle which does not disturb the topology, is it still 1 subgraph? 4) What happens with the nesting of `fx_traceback.annotate`? Are there any ordering requirements? 5) What are we going to use the annotations for? a) compile flex b) streams c) nn.Module info to organize MoE components for pipelining d) PP stages e) Rename graph nodes for more debugging f) No nested regional compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776 Approved by: https://github.com/SherlockNoMad ghstack dependencies: #165188	2025-10-13 22:22:20 +00:00
Animesh Jain	1191e51c44	[dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165188 Approved by: https://github.com/yushangdi	2025-10-13 22:22:20 +00:00
zpcore	3edd94485f	[5/N][DTensor device order] Implement graph based redistribution algorithm (#164902 ) (Extract out the algorithm from https://github.com/pytorch/pytorch/pull/160266.) Build a graph to search for the path from source placement to destination placement (with device order). The current solution introduces too many all-gathers and misses the opportunity for all-to-all when redistributing, especially when we consider the device order. ### How to build the graph: For Shard placements, think of a collective op as an operation on a stack of device axes: - I, J are tensor dimensions; - X, Y, Z, Y are ordered mesh dimensions. (image: https://github.com/user-attachments/assets/23bb3cc3-0506-4071-9053-3c525cf0e526) The detailed collective op transition is implemented in `DTensorRedistributePlanner.get_next_state`. ### How to find the min cost path: Assign weights to the different types of collective ops and use Dijkstra to find the min cost path in the graph we build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164902 Approved by: https://github.com/ezyang	2025-10-13 22:03:57 +00:00
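The min-cost search described in the commit above can be sketched with a generic Dijkstra over a small state graph. This is not `DTensorRedistributePlanner`: the state names, the transitions, and the edge weights below are all hypothetical and only illustrate how weighted collectives turn redistribution into a shortest-path problem.

```python
import heapq


def min_cost_path(graph, source, target):
    """Dijkstra over graph[state] = [(next_state, op_name, cost), ...].

    Returns (total_cost, [op_name, ...]) or None if target is unreachable.
    """
    dist = {source: 0}
    prev = {}  # state -> (previous state, op that led here)
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, op, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = (node, op)
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    ops = []
    node = target
    while node != source:
        node, op = prev[node]
        ops.append(op)
    return dist[target], ops[::-1]


# Hypothetical placement states and collective costs, for illustration only.
graph = {
    "Shard(0) on X": [
        ("Replicate", "all_gather", 4),
        ("Shard(1) on X", "all_to_all", 3),
    ],
    "Replicate": [("Shard(1) on X", "local_chunk", 1)],
}

result = min_cost_path(graph, "Shard(0) on X", "Shard(1) on X")
```

Under these made-up weights the direct all-to-all (cost 3) beats the all-gather followed by a local chunk (cost 5), which is the kind of opportunity the message says the previous approach missed.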
PyTorch MergeBot	a71ca4dcb9	Revert "[opaque_obj_v2] PyObject custom op schema type (#165004 )" This reverts commit 3faee200674c0c2bca3f395a063264cfd8a9a5b7. Reverted https://github.com/pytorch/pytorch/pull/165004 on behalf of https://github.com/seemethere due to This fails internal tests, see D84399300 ([comment](https://github.com/pytorch/pytorch/pull/165004#issuecomment-3398906856))	2025-10-13 20:08:38 +00:00
Aidyn-A	c44d638b15	[Easy][Test][Dynamo] Avoid direct string comparison in MiscTestsDevice::get_device_module (#165314 ) Fixes a small issue on string comparison, as the test fails with: ``` AssertionError: String comparison failed: 'cuda' != 'cuda:0' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165314 Approved by: https://github.com/soulitzer	2025-10-13 19:58:59 +00:00
Ti-Tai Wang	cb328c0b20	[ONNX] TorchTensor supports tofile() (#165195 ) Fixes #165120 ref: `43ebf47bb5/src/onnx_ir/tensor_adapters.py (L171-L200)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165195 Approved by: https://github.com/justinchuby	2025-10-13 19:12:06 +00:00
PyTorch MergeBot	955cd7060b	Revert "Update round size with 1 division behavior (#162203 )" This reverts commit 12d2ef557f6e127100267c31a31572d8ab5cc788. Reverted https://github.com/pytorch/pytorch/pull/162203 on behalf of https://github.com/izaitsevfb due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162203#issuecomment-3398622898))	2025-10-13 18:32:37 +00:00
can-gaa-hou	0ce945790e	[NJT] Fix schema validation error in jagged functions (#165307 ) Fixes #161812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165307 Approved by: https://github.com/soulitzer	2025-10-13 17:59:18 +00:00
Chien-Chin Huang	e93343cfab	[CP] Introduce flex_cp_forward custom op for FlexAttention CP (#163185 ) The custom op will fetch the required K and V. Currently, the forward pass is just an all-gather, and the backward pass is a reduce-scatter. While the logic is the same as all_gather_tensor_autograd, the custom op avoids the Autograd warning that wait_tensor() is registered to autograd. For the next step, we should explore how to interpolate the required communication based on the information from BlockMask. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163185 Approved by: https://github.com/XilunWu ghstack dependencies: #162542, #164500	2025-10-13 17:16:32 +00:00
Catherine Lee	c86a7c5f5e	Disable failing test_int8_woq_mm_concat_cuda on slow grad check (#165331 ) Same as https://github.com/pytorch/pytorch/pull/165147, I missed some Pull Request resolved: https://github.com/pytorch/pytorch/pull/165331 Approved by: https://github.com/bbeckca	2025-10-13 17:08:00 +00:00
Kurt Mohler	83cbba8759	[MPS] Support large tensors in `torch.cat` (#164416 ) Fixes #164415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164416 Approved by: https://github.com/malfet	2025-10-13 16:56:56 +00:00
Scott Wolchok	a3e3efe474	Fix double dispatch to Python for detach (#163671 ) This fixes #71725. Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671 Approved by: https://github.com/ezyang, https://github.com/albanD	2025-10-13 16:10:17 +00:00
Chien-Chin Huang	6bda3bb286	[PP] Fix split_args_kwargs_into_chunks issues (#165306 ) 1. https://github.com/pytorch/pytorch/pull/164111/ adds support for splitting BlockMask. But BlockMask also has a B=1 case, in which the BlockMask is broadcast. This PR adds support for the B=1 case. 2. The original split_args_kwargs_into_chunks doesn't initialize the default specs correctly. Since we now use tree_flatten and tree_unflatten to do the split, we should also use tree_map to initialize the default spec. This also supports the case where the values are not torch.Tensor, which was previously supported only if users explicitly provided the shard spec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165306 Approved by: https://github.com/H-Huang	2025-10-13 15:52:39 +00:00
PyTorch MergeBot	8580112682	Revert "[dynamo][DebugMode] mask python keys in dispatch_key_set guard checks (#164992 )" This reverts commit 306b344a1847749f0baf085dcd92560f4e99cd1b. Reverted https://github.com/pytorch/pytorch/pull/164992 on behalf of https://github.com/jeffdaily due to broke ROCm CI test/inductor/test_inductor_scheduler.py::TestSchedulerCUDA::test_flop_counter_op_options0_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18417066364/job/52485636942) [HUD commit link](`306b344a18`) ([comment](https://github.com/pytorch/pytorch/pull/164992#issuecomment-3397927142))	2025-10-13 15:14:34 +00:00
PyTorch UpdateBot	c509a78645	Update slow tests (#165301 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165301 Approved by: https://github.com/pytorchbot	2025-10-13 11:47:32 +00:00
Chien-Chin Huang	8461b63f2c	[CP] Replace context_parallel context manager with functional APIs (#164500 ) `context_parallel()` being a context manager has annoyed users. Now that we plan to redesign CP's UX to explicitly ask users to: 1. Wrap the attention op into an `nn.Module` 2. Lift any buffers that are not sequence agnostic to input We can replace `context_parallel()` with two functional APIs: `_context_parallel_shard` and `_enable_context_parallel_dispatcher`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164500 Approved by: https://github.com/XilunWu ghstack dependencies: #162542	2025-10-13 06:30:18 +00:00
Yuanyuan Chen	8de85896e0	Enable ruff rule E721 (#165162 ) `E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162 Approved by: https://github.com/Skylion007	2025-10-13 01:48:55 +00:00
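As a small illustration of what `E721` is about (a sketch written for this log, not code from the PR), the function below uses the identity form for an exact type check and `isinstance` where subclasses should also match. The function name `describe` is hypothetical.

```python
def describe(value):
    """Classify a value using the comparison styles E721 steers you toward."""
    # E721 would flag `type(value) == int`; an exact type check uses `is` instead.
    if type(value) is int:
        return "exact int"
    # isinstance() also accepts subclasses, e.g. bool is a subclass of int.
    if isinstance(value, int):
        return "int subclass"
    return "other"
```

Here `describe(3)` takes the exact-type branch, while `describe(True)` falls through to the `isinstance` branch because `type(True)` is `bool`.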
Dzmitry Huba	5e58420dff	LocalTensor (#164537 ) A LocalTensor is a tensor subclass which simulates a tensor that is distributed across SPMD ranks. A LocalTensor might be size N, but in fact there are world_size shards/replicas of it stored internally. When you do a plain PyTorch operation on it, we apply the operation to each shard; when you do a collective, we do the mathematically equivalent operation on the local shards. A LocalTensor is associated with a list of ranks which specify which ranks it holds local tensors for. NB, this is NOT a DataParallel-like abstraction where you can run operations on multiple different GPUs. It is intended purely for debugging purposes; the overhead is almost certainly too high to keep eight GPUs (even the C++ autograd needs multithreading to keep up!) (It might potentially be possible to trace through this with torch.compile and then compile it with CUDA graphs but this is currently a non-goal.) In order to handle MPMD, we provide a helper decorator that allows you to run a function with no side effects for each LocalTensor shard and combine results back into LocalTensor or LocalIntNode. Note: This PR converts all DTensor ops and some DTensor tests to illustrate intended usage and ensure correctness. In subsequent PRs, more tests will be converted. During test conversion we aim to share as much test logic as possible between multi-process / multi-threaded and local tensor tests. We would like developers to be able to run both flavors of the tests. Note: This work is based on the original proposal by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537 Approved by: https://github.com/ezyang	2025-10-12 20:06:41 +00:00
Howard Huang	2beead7523	[PP] move FSDP reduce scatters to end of step (#165106 ) Move FSDP reduce scatters to the end of the PP step. The reduce scatter compute stream sync blocks the other stages from executing their backwards leading to bubbles. There should be a way to execute these RS earlier, but doing this for now as a quick fix. (image: https://github.com/user-attachments/assets/b945dd55-8ab1-4acc-b862-c6e2e476b834) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165106 Approved by: https://github.com/weifengpy ghstack dependencies: #164976	2025-10-12 13:28:02 +00:00
Yu, Guangye	3a110c9bb2	Add a new API torch.xpu.is_tf32_supported for Intel GPU (#163141 ) # Motivation Aligned with other backends, this PR introduces a new API `torch.xpu.is_tf32_supported`, which should be used before setting `torch.backends.mkldnn.allow_tf32=True` or to provide hardware capability information to Triton # Additional Context On Intel Xe architecture and newer, TF32 operations can be accelerated through DPAS (Dot Product Accumulate Systolic) instructions. Therefore, TF32 support can be determined by checking whether the device supports subgroup matrix multiply-accumulate operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163141 Approved by: https://github.com/EikanWang	2025-10-12 12:11:57 +00:00
Shunting Zhang	5171f14064	[inductor] verify determinism with inductor benchmark script (#164904 ) Verify the deterministic mode with torch.compile benchmark scripts. Here is what my testing script does (pasted at the end): - run a model in default mode, save its result - run the model again in default mode, but distort the benchmarking results. Compare it with the saved result. - Do the above again in deterministic mode. I tried to test a few models - BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in the deterministic mode - DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. It does not surprise me much. Reduction order change does not always cause numeric change. ``` model=GoogleFnet export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0 export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 # disable autotune cache export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0 export TORCHINDUCTOR_FX_GRAPH_CACHE=0 export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/ export TORCHINDUCTOR_BENCHMARK_KERNEL=1 export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 # Non deterministic mode # --float32 rather than --amp to make it easier to repro non-deterministic echo "Save results for non-deterministic mode" python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl echo "Compare results with distorted benchmarking in non-deterministic mode" TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl echo "Save results for deterministic mode" TORCHINDUCTOR_DETERMINISTIC=1 python 
benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl echo "Compare results with distorted benchmarking in deterministic mode" TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904 Approved by: https://github.com/jansel, https://github.com/v0i0	2025-10-12 00:03:42 +00:00
Raman Kumar	df26c51478	error message for instantiating CUDA Stream if CUDA not available (#159868 ) Fixes #159744 Summary: ``` import torch # Generate input data input_tensor = torch.randn(3, 3) stream = torch.cuda.Stream() # Call the API input_tensor.record_stream(stream) ``` ⚠️ will now show an error message `torch.cuda.Stream requires CUDA support` Pull Request resolved: https://github.com/pytorch/pytorch/pull/159868 Approved by: https://github.com/malfet, https://github.com/isuruf	2025-10-11 23:21:35 +00:00
PyTorch MergeBot	8d49cd5b26	Revert "[compile] Regional inductor compilation with fx.annotate (#164776 )" This reverts commit 1e4c7dffa31b3284a4cd4daa4424602827bd9d0f. Reverted https://github.com/pytorch/pytorch/pull/164776 on behalf of https://github.com/malfet due to Looks like this one broke everything, not the top of the stack ([comment](https://github.com/pytorch/pytorch/pull/164776#issuecomment-3393725466))	2025-10-11 23:14:23 +00:00
PyTorch MergeBot	a19123b37e	Revert "[dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188 )" This reverts commit f0325d07876b5a52d29a44ee02dcf7a7c21b258a. Reverted https://github.com/pytorch/pytorch/pull/165188 on behalf of https://github.com/malfet due to Looks like it broke bunch of tests, see `2d4654d208/1` ([comment](https://github.com/pytorch/pytorch/pull/165188#issuecomment-3393674273))	2025-10-11 21:38:45 +00:00
Laith Sakka	2d4654d208	do not overguard when comparing lists (#165091 ) If we are comparing two lists l1, l2 of different lengths for equality, we should exit early if len(l1) != len(l2) and avoid guarding/comparing inner elements. This avoids recompilations as in the unit test. Addresses https://github.com/pytorch/pytorch/issues/137515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165091 Approved by: https://github.com/aorenste, https://github.com/mlazos ghstack dependencies: #164884, #164885, #164886, #164887, #164888, #164889	2025-10-11 20:37:51 +00:00
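The early-exit idea in the commit above can be shown with a plain-Python sketch. This is not Dynamo's implementation: `lists_equal` and the `compare` callback are hypothetical stand-ins, where each `compare` call plays the role of an element comparison that would otherwise install a guard.

```python
def lists_equal(l1, l2, compare=lambda a, b: a == b):
    """Compare two lists, checking lengths first so no element is touched on a mismatch."""
    if len(l1) != len(l2):
        return False  # early exit: no per-element comparison (and hence no guard) happens
    return all(compare(a, b) for a, b in zip(l1, l2))


# Record which element pairs were actually compared.
calls = []


def recording_compare(a, b):
    calls.append((a, b))
    return a == b


different_lengths = lists_equal([1, 2, 3], [1, 2], compare=recording_compare)
calls_after_mismatch = list(calls)
same_lengths = lists_equal([1, 2], [1, 2], compare=recording_compare)
```

After the first call `calls` is still empty, since the length check short-circuits before any element pair is inspected; only the second call records comparisons.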
Animesh Jain	f0325d0787	[dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165188 Approved by: https://github.com/yushangdi ghstack dependencies: #164776	2025-10-11 15:49:42 +00:00
Animesh Jain	1e4c7dffa3	[compile] Regional inductor compilation with fx.annotate (#164776 ) This PR introduces a way to compile a region of FX graph using `fx.traceback.annotate`. ### UX 1) In the user code, mark the region that you want to be compiled with inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we just rely on the string `compile_with_inductor` and ignore the integer. As the needs arise, we can update the logic. Example ``` def fn(x, y): sin = torch.sin(x) with fx_traceback.annotate({"compile_with_inductor": 0}): mul = sin * y add = mul + 1 return torch.sin(add) ``` 2) You have to instruct the compiler to use the annotations with `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that just setting annotation is enough. But for now to control the blast radius, we need to explicitly do this. One such example is ``` # Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor` def aot_eager_regional_inductor(): return aot_autograd( fw_compiler=compile_fx_annotated_nodes_with_inductor, bw_compiler=compile_fx_annotated_nodes_with_inductor, ) ``` 3) Fixable in short-term - You have to wrap the user code in `torch.fx.traceback.preserve_node_meta` to ensure that annotations are propagated to the compiler. This is fixable, just need to make CI happy. ### Implementation 1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on annotations, and then create subgraphs in the main graph. 
2) Call `torch._inductor.standalone_compile` on these subgraphs, and jam the returned callable into the FX graph at the place of call_module Resulting graph looks something like this - search for `torch__inductor_standalone_compile_inner` Forward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) sin: "f32[10]" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(sin, primals_2) # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1 getitem: "f32[10]" = inner[0]; inner = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem) return (sin_1, primals_1, primals_2, sin, getitem) ``` Backward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1); primals_1 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) cos: "f32[10]" = torch.ops.aten.cos.default(add); add = None mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos); tangents_1 = cos = None # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2); mul_1 = sin = primals_2 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y getitem: "f32[10]" = inner[0] getitem_1: "f32[10]" = inner[1]; inner = None # File: 
/data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1); getitem_1 = cos_1 = None return (mul_4, getitem) ``` ### Some issues raised in the HOP meeting 1) CSE will not differentiate between different meta custom nodes and will do the wrong thing. 2) SAC - The recomputed forward will be smaller than the forward. Will we compile a smaller region then? 3) What happens if you have an op in the middle which does not disturb the topology, is it still 1 subgraph? 4) What happens with the nesting of `fx_traceback.annotate`? Are there any ordering requirements? 5) What are we going to use the annotations for? a) compile flex b) streams c) nn.Module info to organize MoE components for pipelining d) PP stages e) Rename graph nodes for more debugging f) No nested regional compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776 Approved by: https://github.com/SherlockNoMad	2025-10-11 15:49:42 +00:00
PyTorch MergeBot	816fb7f48d	Revert "Enable ruff rule E721 (#165162 )" This reverts commit 9e7c19f72b6d0690915c307409c0c0a76b5a3bf0. Reverted https://github.com/pytorch/pytorch/pull/165162 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165162#issuecomment-3393328271))	2025-10-11 13:25:40 +00:00
zpcore	512dd79ff0	[4/N] [DTensor device order] Support debugmode to show dtensor distribution transform path (#164821 ) Enable the DebugMode to print out how `placements` and `shard_order` will update when we execute `transform_infos` to transform from source placement to target placement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164821 Approved by: https://github.com/SherlockNoMad, https://github.com/pianpwk ghstack dependencies: #164806, #164820	2025-10-11 09:44:54 +00:00
zpcore	9dac4e2540	[2/N] [DTensor device order] Add shard_order attribute in DTensorSpec (#164806 ) Add `shard_order` field in DTensorSpec. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164806 Approved by: https://github.com/XilunWu, https://github.com/wanchaol	2025-10-11 09:39:08 +00:00
Yuanyuan Chen	9e7c19f72b	Enable ruff rule E721 (#165162 ) `E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162 Approved by: https://github.com/Skylion007	2025-10-11 06:43:53 +00:00
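As an illustration of the pattern `E721` flags, here is a minimal sketch. It is not code from the PR, and the helper names are hypothetical. The first function compares type objects with `==`, the second uses `is` for an exact match, and the third uses `isinstance` for cases where subclasses should also match.

```python
def is_exact_int_eq(value):
    # Pattern flagged by E721: type objects compared with ==.
    return type(value) == int


def is_exact_int(value):
    # Exact type match using identity comparison, as the rule recommends.
    return type(value) is int


def is_int_like(value):
    # Use isinstance when subclasses (e.g. bool) should also be accepted.
    return isinstance(value, int)
```

Note that `bool` is a subclass of `int`, so the exact-match helpers reject `True` while `is_int_like` accepts it.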
Animesh Jain	220a34118f	[export] Turn on install_free_tensors flag (#164691 ) The final step in removing the discrepancy between torch.compile(fullgraph=True) and torch.export(strict=True). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691 Approved by: https://github.com/avikchaudhuri	2025-10-11 04:26:09 +00:00
Edward Z. Yang	de8d81275a	Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939 ) This fixes AOTAutograd rms_norm not being bitwise equivalent to eager, because it avoids a decomposition. You can force the decomposition by having the decomposition in the dispatch table, but if eager mode wouldn't have decomposed (because it went to the fused one), we now preserve the fused call by default. This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel. Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939 Approved by: https://github.com/bdhirsh	2025-10-11 01:03:55 +00:00
Animesh Jain	d73416642f	[test] Skip testing of source_fn_stack in light of export changes (#165176 ) This relates to https://github.com/pytorch/pytorch/pull/164691, where we now inline into nn modules, which causes this test to fail. The test here looks for node.name, which is quite different with inlining. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165176 Approved by: https://github.com/andrewor14 ghstack dependencies: #165172	2025-10-11 00:16:59 +00:00
PaulZhang12	c8c5187e85	Fix truediv numerics between eager and compile (#164144 ) Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144 Approved by: https://github.com/bobrenjc93	2025-10-10 22:18:11 +00:00
Chien-Chin Huang	ee0a8a5a50	[CP] Introduce ContextParallel plan for parallelize_module() (#162542 ) Motivation Since FlexAttention and SDPA are both functions, not modules, we have tried numerous mechanisms to dispatch FlexAttention and SDPA to customized call paths so that we can inject the CP logic. Unfortunately, all of these approaches have their own composability issues with different techniques. Candidate Approaches 1. Ask users to write a module to wrap FlexAttention/SDPA and use `parallelize_module` to install a forward hook. - Pros: This is similar to how we do TP. - Cons: 1) It is cumbersome for users as they need to create a new module. 2) We need two places to parallelize the CP, as a context_parallel context manager is still required for splitting the inputs. 2. Provide a function wrapper. - Pros: Users just need to replace their FlexAttention/SDPA calls with the wrapper. - Cons: It is not the same API, though we can keep the API signatures the same as the core API. Summary ~~This PR implements approach 2 and refactor the code in such a way that most code can be used by option approach 1, which will be introduced in another PR.~~ We changed this PR to implement option 1, as people prefer option 1 for its consistency with the existing parallelisms. But this PR can also serve as the foundation for implementing option 2, which was the early version of this PR. This PR also changes the `create_cp_block_mask` logic, since we now focus only on the ModuleWrapper approach, which doesn't require hacking the seq_len field in a BlockMask. This PR also removes the TorchFunctionMode dispatcher mode, as it doesn't work well with SAC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162542 Approved by: https://github.com/XilunWu	2025-10-10 22:03:43 +00:00
fduwjj	50c338c2da	[DeviceMesh] Move global state into class method (#164510 ) This PR tries to move the bookkeeping state maps from MeshEnv to DeviceMesh class members. The reason is that this global state is thread-local, which can cause issues. We will also need to benchmark DTensor CPU overhead for this change. 3-5% CPU overhead in DTensor has been observed: Before: <img width="1147" height="535" alt="image" src="https://github.com/user-attachments/assets/9e4ac018-ec0a-46a4-8f2c-64b4dbec465c" /> After: <img width="1114" height="576" alt="image" src="https://github.com/user-attachments/assets/eaf83660-652b-4c6b-8591-f6049ccdd14c" /> when running the benchmark mentioned here: https://github.com/pytorch/pytorch/issues/159169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164510 Approved by: https://github.com/lw, https://github.com/fegin	2025-10-10 21:37:17 +00:00


37028 Commits