pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Sean McGovern	8c60f4ae08	[Distributed] update table in docs (#165009 ) Fixes #162248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165009 Approved by: https://github.com/ezyang trunk/8c60f4ae085ad7c497ee0a0f7731d514f2c0ada8	2025-10-14 18:17:22 +00:00
Rohit Singh Rathaur	c4565c3b94	[distributed] Replace 164 assert statements in fsdp directory (#165235 ) Replace assert statements with explicit if/raise patterns across 20 files: - _optim_utils.py (38 asserts) - _flat_param.py (25 asserts) - _fully_shard/_fsdp_param.py (23 asserts) - sharded_grad_scaler.py (12 asserts) - fully_sharded_data_parallel.py (11 asserts) - wrap.py (10 asserts) - _state_dict_utils.py (9 asserts) - _fully_shard/_fsdp_param_group.py (8 asserts) - _runtime_utils.py (6 asserts) - _init_utils.py (6 asserts) - 10 additional files (16 asserts) This prevents assertions from being disabled with Python -O flag. Fixes partially #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165235 Approved by: https://github.com/albanD trunk/c4565c3b946e2a72702e8d346ae6e405d6ee992f	2025-10-14 18:04:57 +00:00
Wei Feng	6918f17114	[FSDP2] provide public API to share cuda streams across roots (#165024 ) For pipeline parallel, we can have multiple FSDP roots (chunks): ``` model = nn.Sequential([chunk0, chunk1]) fully_shard(model.chunk0) fully_shard(model.chunk1) ``` We can call `share_comm_ctx` to share the all-gather, reduce-scatter, and all-reduce CUDA streams, which avoids inter-stream memory fragmentation: ``` from torch.distributed.fsdp import share_comm_ctx share_comm_ctx([model.chunk0, model.chunk1]) ``` Unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_share_comm_context` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165024 Approved by: https://github.com/mori360 trunk/6918f17114d9781b03844ee96784cf83b6f162c3	2025-10-14 17:50:46 +00:00
Rohit Singh Rathaur	9b6be53326	[distributed] Replace 94 assert statements in tensor ops files (#165229 ) Replace assert statements with explicit if/raise patterns in: - _math_ops.py (43 asserts) - _matrix_ops.py (27 asserts) - _view_ops.py (24 asserts) This prevents assertions from being disabled with Python -O flag. Fixes partially #164878. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165229 Approved by: https://github.com/albanD trunk/9b6be5332694f1945e4491bd40bcdefe77736681	2025-10-14 17:28:06 +00:00
Kathryn-cat	7fee6bbf34	[Fix] Completely remove stride normalization on DLPack Tensor (#164161 ) A followup on PR #163282 Fixes #163274 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164161 Approved by: https://github.com/ngimel, https://github.com/eqy trunk/7fee6bbf34c15c971b91c51a21da159965eabacf	2025-10-14 17:17:11 +00:00
ruisizhang123	6adaa328f4	[autobucketing] aten autobucketing fix to enable aot_eager pass (#165063 ) When the autobucketing pass is registered as aot_eager backend `fw_compiler` and `bw_compiler`, this pr ensures the tensors are all-gathers on "cpu/cuda" device instead of "meta" device. When we do `dist.all_gather_object`, it will create new bytestorage outside no_dispatch [here](`a2e2e1d8c0/torch/distributed/distributed_c10d.py (L3303)`), which is on meta device. Thus, I updated the code to use `unset_fake_temporarily`, which would gather RealTensor from other ranks. It is needed to unblock the aot_eager+autobucketing pass in this [PR](https://github.com/pytorch/torchtitan/pull/1813). Otherwise, I hit the error as follows: ```bash traceback : Traceback (most recent call last): File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 358, in wrapper return f(args, kwargs) File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 607, in train self.train_step(data_iterator) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 507, in train_step loss = self.forward_backward_step(input_dict, labels) File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 483, in forward_backward_step pred = model_parts[0](inputs, extra_inputs, extra_args) File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 418, in __call__ return super().__call__(args, *kwargs) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _wrapped_call_impl return self._call_impl(args, *kwargs) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^ File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1795, in _call_impl return forward_call(args, kwargs) File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 901, in compile_wrapper raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2359, in _call_user_compiler raise BackendCompilerFailed( self.compiler_fn, e, inspect.currentframe() ).with_traceback(e.__traceback__) from None File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2334, in _call_user_compiler compiled_fn = compiler_fn(gm, example_inputs) File "/home/ruisizhang123/pytorch/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__ compiled_gm = compiler_fn(gm, example_inputs) File "/home/ruisizhang123/pytorch/torch/__init__.py", line 2441, in __call__ return self.compiler_fn(model_, inputs_, self.kwargs) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ruisizhang123/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__ cg = aot_module_simplified(gm, example_inputs, *self.kwargs) File "/home/ruisizhang123/pytorch/torch/_functorch/aot_autograd.py", line 1100, in aot_module_simplified compiled_fn, _ = aot_stage2_compile( ~~~~~~~~~~~~~~~~~~^ aot_state, ^^^^^^^^^^ ...<4 lines>... 
inference_compiler, ^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 257, in aot_stage2_compile return aot_stage2_autograd(aot_state, aot_graph_capture) File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 1696, in aot_stage2_autograd compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args) File "/home/ruisizhang123/torchtitan/torchtitan/experiments/simple_fsdp/backend.py", line 35, in aten_autobucketing_reordering_pass schedule_overlap_bucketing(gm) ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^ File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 755, in schedule_overlap_bucketing ).run() ~~~^^ File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 358, in run self._align_compute_nodes_runtime_estimations_across_all_distributed_ranks() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 337, in _align_compute_nodes_runtime_estimations_across_all_distributed_ranks dist.all_gather_object( ~~~~~~~~~~~~~~~~~~~~~~^ gathered_runtime_estimations, runtime_estimations, pg ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/home/ruisizhang123/pytorch/torch/distributed/c10d_logger.py", line 82, in wrapper return func(args, **kwargs) File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3170, in all_gather_object input_tensor, local_size = _object_to_tensor(obj, current_device, group) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3079, in _object_to_tensor byte_tensor = torch.ByteTensor(byte_storage).to(device) ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^ torch._dynamo.exc.BackendCompilerFailed: backend='compiler_fn' raised: RuntimeError: Attempted to set the storage of a tensor on device "cpu" 
to a storage on different device "meta". This is no longer allowed; the devices must match. Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165063 Approved by: https://github.com/eellison trunk/6adaa328f4de37130296d1ed59cebc505060c6a4	2025-10-14 17:09:54 +00:00
Paul Zhang	4a7eed527f	Make truediv numerics change external only for now (#165328 ) Summary: For D84399286, failing ads ne deterministic tests now. These tests are especially brittle with subtle bitwise numerics changes. Will reenable for fbcode once e2e validation tests are performed Test Plan: N/A Differential Revision: D84514361 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165328 Approved by: https://github.com/izaitsevfb trunk/4a7eed527fbdecf05eacf7c9e56759cee871a6c5	2025-10-14 17:08:17 +00:00
PyTorch MergeBot	d2494cbb2b	Revert "[distributed] Replace assert statements with AssertionError exceptions (#165216 )" This reverts commit 74db92b21868b7e9e77cc966e5d57a8246723cbd. Reverted https://github.com/pytorch/pytorch/pull/165216 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_pg_wrapper.py::ProcessGroupNCCLWrapperTest::test_debug_level_detail_no_gloo [GH job link](https://github.com/pytorch/pytorch/actions/runs/18492765290/job/52693842750) [HUD commit link](`74db92b218`), note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165216#issuecomment-3402838765)) trunk/d2494cbb2b98a3105f0fc3eea79abe7d58df61c6	2025-10-14 17:05:16 +00:00
Shangdi Yu	5eddbb5e47	[annotate] Annotation should be mapped across submod (#165202 ) The match for backward nodes might be in a different submod, so we should check all submod for potential matches. In flex attention, this could happen if `mask_mod` has operations (such as index) that increase the seq_nr of the forward graph nodes. Then the backward flex_attention nodes cannot find a match in its own subgraph. ``` python test/functorch/test_aot_joint_with_descriptors.py -k preserve_annotate ``` Also tested on torchtitan joint_graph_runner branch. The flex_attention backward nodes are annotated now. ``` NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" LOG_RANK=0 TRAIN_FILE="torchtitan.train" TORCHFT_LIGHTHOUSE="http://localhost:29510" PYTORCH_ALLOC_CONF="expandable_segments:True" torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint="localhost:0" --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165202 Approved by: https://github.com/SherlockNoMad trunk/5eddbb5e47499b94fd18764cdf022845471219c6	2025-10-14 16:19:38 +00:00
Animesh Jain	c9b2a09530	[export] Turn on install_free_tensors flag (#164691 ) The final step in removing the discrepancy between torch.compile(fullgraph=True) and torch.export(strict=True). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691 Approved by: https://github.com/avikchaudhuri trunk/c9b2a0953017053aa3abdcfcfd0e0faa4845d0db	2025-10-14 15:33:50 +00:00
KarhouTam	bf5aeb3148	[torch/utils][Code Clean] Clean asserts in `hipify/`, `jit/`, `model_dump` and `tensorboard` of `torch/utils` (#165311 ) Including: - `torch/utils/hipify/` - `torch/utils/jit/` - `torch/utils/model_dump/` - `torch/utils/tensorboard/` Fixes part of #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165311 Approved by: https://github.com/albanD trunk/bf5aeb31480df7335f1b7e0b55d15198bf7d10d1	2025-10-14 15:26:23 +00:00
Rohit Singh Rathaur	45b8c0f75c	[distributed] Replace 54 assert statements in tensor/_ops/_tensor_ops.py (#165226 ) Replace assert statements with explicit if/raise patterns to prevent assertions from being disabled with Python -O flag. Fixes partially #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165226 Approved by: https://github.com/albanD trunk/45b8c0f75cb79139adda8f931cc19fb2f3e823fb	2025-10-14 15:10:03 +00:00
Aleksei Nikiforov	c733072874	Fix IValue from SymBool on big-endian system (#163647 ) Skip test_compiled_autograd_attribution on s390x. It fails on both s390x and x86_64, at least under some circumstances. Disable it on s390x for now until it works reliably. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163647 Approved by: https://github.com/malfet trunk/c73307287494a075a1ee69f3a77f877792ee9166	2025-10-14 15:07:48 +00:00
Yuanyuan Chen	fbe0d20a17	[2/N] More ruff SIM fixes (#165031 ) This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031 Approved by: https://github.com/mlazos trunk/fbe0d20a173063c3d15f310a8c4f9cfa852f5234	2025-10-14 14:22:54 +00:00
Lucas Kabela	1fa11f42b1	[Bugfix][vLLM] Explicitly do not support instead of crashing for named tuples in infer schema (#165191 ) Fixes https://github.com/vllm-project/vllm/issues/25270 by being explicit in erroring; previously we had a cryptic `__origin__ undefined` error, but now should give proper error message that we don't support NamedTuples in schema Test with ``` python test/test_custom_ops.py TestCustomOp.test_unsupported_param_types ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165191 Approved by: https://github.com/zou3519 trunk/1fa11f42b152ffe55cddb7439e4659136c860c7d	2025-10-14 14:18:42 +00:00
FFFrog	6f713e25bb	[CodeClean] Replace std::runtime_error with TORCH_CHECK (#164130 ) As the title stated. Changes: - torch/csrc/inductor(Part 1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164130 Approved by: https://github.com/albanD, https://github.com/Skylion007 trunk/6f713e25bb37ef2e30a785d441d671d0ceaf8f3d	2025-10-14 14:09:53 +00:00
Shangdi Yu	09a4187b8e	Update windows cuda build to use 12.8 (#165345 ) As title Motivation: The rest of the pytorch and inductor build is using 12.8 and we're deprecating cuda 12.6 builds soon per https://github.com/pytorch/pytorch/issues/165111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165345 Approved by: https://github.com/atalman, https://github.com/malfet trunk/09a4187b8ed34355d4d25b31c41586290ef56e67	2025-10-14 13:58:20 +00:00
Colin Peppler	306c55ba27	[atomically_apply_size_hint] Make unbacked replacements reconciles to a single expr (#164324 ) ## Problem Okay there's limitations with today's `atomically_apply_size_hint` though it works for most observed failures we've seen so far. However, it's easy to come up with an edge case. Suppose you encounter this setup. ``` a: [s0 + u0] b: [s1 + u1] c: [u2 + u3] d: [u100] ``` Today, we use a few heuristics to specify the LHS and RHS for replacements. `10d2734d9b/torch/_inductor/sizevars.py (L730-L759)` It's possible to end up with these replacement rules. Notice how there's no replacement for `s1 + u1` and `u2 + u3` :( That's because today picking the LHS and RHS matters a lot, and `s1 + u1` & `u2 + u3` happened to end up on the RHS. ``` s0 + u0 => s1 + u1 s0 + u0 => u2 + u3 # overrides previous replacement; each expr only gets one replacement s0 + u0 => u100 # overrides previous replacement; ditto ``` I believe what we really want is this: everybody gets a replacement! And they all should (eventually) settle at the same canonical expr (i.e. `u100`) when running the replacement several times. ``` s1 + u1 ==> s0 + u0 u2 + u3 ==> s0 + u0 s0 + u0 ==> u100 ``` We can just short-cut this by using the canonical expr as the replacement. ``` s1 + u1 ==> u100 u2 + u3 ==> u100 s0 + u0 ==> u100 ``` ## Implementation I offer one way to deal with this: 1. assure every expression has one canonical replacement (i.e. `u100`) 2. if two expressions are equal (inferred from `deferred_runtime_asserts`), then they must have the same canonical replacement We can implement the above with union find. * Whenever you see `Eq(lhs, rhs)` then do `union(lhs, rhs)`. * Whenever you want to find the canonical replacement for a given expr then do `find(expr)`. * When picking the canonical replacement we can use a few heuristics like (1) prefer a fully backed expr, (2) replacing with sub-expressions, and whatever we'd like. 
Differential Revision: [D84549260](https://our.internmc.facebook.com/intern/diff/D84549260) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164324 Approved by: https://github.com/laithsakka trunk/306c55ba27bc2bd45468e0586ccb38726c676b7f	2025-10-14 13:57:33 +00:00
Isalia20	56d6229ff9	[MPS] fix comment for normcdf (#165233 ) Just a small comment fix for normcdf Pull Request resolved: https://github.com/pytorch/pytorch/pull/165233 Approved by: https://github.com/malfet trunk/56d6229ff944a508e1d6bc14b4dbbf92637bc029	2025-10-14 13:56:31 +00:00
Rohit Singh Rathaur	74db92b218	[distributed] Replace assert statements with AssertionError exceptions (#165216 ) Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks raising AssertionError to prevent assertions from being disabled with Python -O flag. Fixes #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165216 Approved by: https://github.com/albanD trunk/74db92b21868b7e9e77cc966e5d57a8246723cbd	2025-10-14 09:58:59 +00:00
Chien-Chin Huang	c48843e4c6	[CP][BE] Docstrings, comments polish and remove unused variables (#165039 ) No logic change, just polish the docstrings, comments and remove unused variables Pull Request resolved: https://github.com/pytorch/pytorch/pull/165039 Approved by: https://github.com/XilunWu ghstack dependencies: #162542, #164500, #163185 trunk/c48843e4c6e6e800530719a15f3685f2c752820b	2025-10-14 09:35:32 +00:00
Cui, Yifeng	9e89b1c4c7	Update torch-xpu-ops commit pin (#165321 ) Update the torch-xpu-ops commit to [intel/torch-xpu-ops@ce9db1](`ce9db15136`), includes: - Fix test_barrier hang by using static global rank in ProcessGroupXCCL - Update install_xpu_headers only when content should change to speedup recompilation - Add global rank information to communication logging - Remove duplicate normalization from FFT methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/165321 Approved by: https://github.com/EikanWang trunk/9e89b1c4c77575aa4785296be25e48082aa94224	2025-10-14 09:07:24 +00:00
PyTorch MergeBot	c5972ebdfb	Revert "Update windows cuda build to use 12.8 (#165345 )" This reverts commit ca96c675001fa87b9d9c648972415ab8b1591f11. Reverted https://github.com/pytorch/pytorch/pull/165345 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165345#issuecomment-3400344079)) trunk/c5972ebdfb509a0d415fec447d4b7c0df1932fff	2025-10-14 06:46:33 +00:00
Shunting Zhang	18b3658df9	[inductor][ez] properly print Pointwise (#165369 ) Previously when we print a ComputedBuffer for reduction, we get something like: ``` ComputedBuffer(name='buf0', layout=FixedLayout('cuda:0', torch.float32, size=[1, 768], stride=[768, 1]), data=Reduction( 'cuda', torch.float32, def inner_fn(index, rindex): _, i1 = index r0_0 = rindex tmp0 = ops.load(tangents_1, i1 + 768 * r0_0) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(primals_1, i1 + 768 * r0_0) tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16) tmp4 = ops.load(rsqrt, r0_0) tmp5 = tmp3 * tmp4 tmp6 = tmp1 * tmp5 return tmp6 , ``` But if we print a ComputedBuffer for a pointwise, we get something like ``` ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f12922c5bc0>, ranges=[32768, 768])) ``` Note that the inner function str is not printed. 
With the change, we get the inner_fn string printed in this case: ``` ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise( 'cuda', torch.bfloat16, def inner_fn(index): i0, i1 = index tmp0 = ops.load(tangents_1, i1 + 768 * i0) tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16) tmp2 = ops.load(primals_2, i1) tmp3 = tmp1 * tmp2 tmp4 = ops.load(rsqrt, i0) tmp5 = tmp3 * tmp4 tmp6 = ops.load(buf1, i0) tmp7 = ops.constant(-0.5, torch.float32) tmp8 = tmp6 * tmp7 tmp9 = ops.load(rsqrt, i0) tmp10 = tmp9 * tmp9 tmp11 = tmp10 * tmp9 tmp12 = tmp8 * tmp11 tmp13 = ops.constant(0.0013020833333333333, torch.float32) tmp14 = tmp12 * tmp13 tmp15 = ops.load(primals_1, i1 + 768 * i0) tmp16 = ops.to_dtype(tmp15, torch.float32, src_dtype=torch.bfloat16) tmp17 = tmp14 * tmp16 tmp18 = tmp5 + tmp17 tmp19 = ops.load(buf1, i0) tmp20 = ops.constant(-0.5, torch.float32) tmp21 = tmp19 * tmp20 tmp22 = ops.load(rsqrt, i0) tmp23 = tmp22 * tmp22 tmp24 = tmp23 * tmp22 tmp25 = tmp21 * tmp24 tmp26 = ops.constant(0.0013020833333333333, torch.float32) tmp27 = tmp25 * tmp26 tmp28 = ops.load(primals_1, i1 + 768 * i0) tmp29 = ops.to_dtype(tmp28, torch.float32, src_dtype=torch.bfloat16) tmp30 = tmp27 * tmp29 tmp31 = tmp18 + tmp30 tmp32 = ops.to_dtype(tmp31, torch.bfloat16, src_dtype=torch.float32) return tmp32 , ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165369 Approved by: https://github.com/eellison trunk/18b3658df9ab5e78468221668416878f67bdd42c	2025-10-14 06:08:12 +00:00
Dzmitry Huba	5fbf93b774	Introduce automatic wrapper to run DTensor tests under local tensor mode (#165383 ) The wrapper makes it possible to share the test body implementation while eliminating the need to write the test class by hand. As an example, this change converts the whole DTensorTest to use local tensor mode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165383 Approved by: https://github.com/ezyang trunk/5fbf93b7747447ec1b140b7f426d96d62a1507c3	2025-10-14 06:08:03 +00:00
Angel Li	a856a17799	bf16 support for per_channel bwd (#165325 ) Follow up to #165098 - adding bf16 support for the backward pass. To avoid BC breaking changes/losing precision, we upcast the parameters to fp32 after the op gets called, and downcast the gradients to bf16 before returning. For testing, we upcast to fp32 before calling the reference function. We increase the tolerance to 1e-2 for bf16 inputs because of a difference in casting calculations between python's `x.to(torch.bfloat16)` and cpp's `x.to(at::kBFloat16)` (after comparing intermediate tensors, we found that the numerics diverge after the final casting). We don't explicitly cast in the CPP op but rather let autograd/optimizer handle it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165325 Approved by: https://github.com/andrewor14 trunk/a856a17799f81924da1a654f97f87207aef89610	2025-10-14 05:47:32 +00:00
Michael Lazos	bc6e08954d	[user-cuda-streams] Add fork/join custom ops (#162900 ) Creates the fork/join stream ops. These ops are passthrough ops which mutate all of their args (without actually performing any computation on them) so that during functionalization, implicit dependencies are added on all of their args. This allows us to prevent reordering during our pre/post grad graph passes. Make custom ops inplace Pull Request resolved: https://github.com/pytorch/pytorch/pull/162900 Approved by: https://github.com/anijain2305 ghstack dependencies: #163027, #162899, #163028 trunk/bc6e08954daec4da712690c13c7c821195ed3e01	2025-10-14 05:43:19 +00:00
Michael Lazos	45a96b2081	[user-streams] Handle aliasing properly (#163028 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163028 Approved by: https://github.com/williamwen42, https://github.com/anijain2305 ghstack dependencies: #163027, #162899	2025-10-14 05:43:19 +00:00
Michael Lazos	04e36611bb	[user-cuda-streams] Pass streams/events to the graph via lookup table (#162899 ) Stores streams in a global object lookup table that maps a Dynamo-selected index to objects. This index is generated during tracing, and at runtime a helper function is called from the bytecode to populate this map. This differs from the previous implementation, which simply mapped IDs to the associated objects and therefore required specialization on the IDs of the specific objects; the new approach does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162899 Approved by: https://github.com/anijain2305 ghstack dependencies: #163027	2025-10-14 05:43:19 +00:00
Michael Lazos	f15c25d5c3	[user-streams] Move stream code to streams module (#163027 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163027 Approved by: https://github.com/StrongerXi, https://github.com/anijain2305	2025-10-14 05:43:19 +00:00
Kostas Tsiampouris	e93981c243	[PyTorch][aarch64] Cast to signed char to fix aarch64 build (#165021 ) Summary: Initial fix: D39198776 Reverted by clang-tidy bot: D83948172 Test Plan: Can now build on aarch64 {P1983767795} Reviewed By: bigning Differential Revision: D84203406 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165021 Approved by: https://github.com/cyyever, https://github.com/Skylion007 trunk/e93981c243b61233755a2697ba3d5bce31c7dc05	2025-10-14 05:37:34 +00:00
Lakshay Garg	496adf9f9c	Replace insert with std::rotate_copy for RingBuffer (#165348 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165348 Approved by: https://github.com/eqy, https://github.com/Skylion007 trunk/496adf9f9c9263bf337539f1a933713e3826f11c	2025-10-14 05:11:28 +00:00
PyTorch MergeBot	33bfec27ff	Revert "use sym_numel, to allow fake tensors to work (#163831 )" This reverts commit e71c75680f2d6ce5f61ad4b2125f4934087762eb. Reverted https://github.com/pytorch/pytorch/pull/163831 on behalf of https://github.com/isuruf due to test failure on mps introduced ([comment](https://github.com/pytorch/pytorch/pull/163831#issuecomment-3400131730)) trunk/33bfec27ff867cf5e719fa997f00c1ec3dbb9859	2025-10-14 05:10:56 +00:00
KarhouTam	f44935cc14	[torch/utils][Code Clean] Clean asserts in `torch/utils/_sympy` (#165279 ) Including: `torch/utils/_sympy/` Fixes part of #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165279 Approved by: https://github.com/albanD trunk/f44935cc1429b15ef312b1aa4c9e3a8a08d45b84	2025-10-14 04:52:23 +00:00
KarhouTam	39116409a1	[torch/utils][Code Clean] Clean asserts in `benchmark/` and `data/` in `torch/utils/` (#165299 ) Including: - `torch/utils/benchmarks/` - `torch/utils/data/` Fixes part of #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165299 Approved by: https://github.com/albanD trunk/39116409a11db0797e6941610d67943bf4b786d7	2025-10-14 04:50:39 +00:00
James Wu	515d1326c1	Add CLAUDE_CONTEXT directory to gitignore (#165358 ) Claude often adds a bunch of MD files or other stuff that is specific to a local session, add a folder for claude to put this stuff that doesn't get checked into the repo Pull Request resolved: https://github.com/pytorch/pytorch/pull/165358 Approved by: https://github.com/oulgen trunk/515d1326c1c398454c261ed3556105ee05c14181	2025-10-14 04:47:21 +00:00
nullplay	ac529df244	Native matmul (#157743 ) ### Implementation of #151705 This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly, without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705: 1. Basic support (this PR) 2. Lazy broadcasting for optimal performance (future PR) ### Summary of This PR This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead. ### Notable Changes 1. Adds a new config flag: `config.triton.enable_native_matmul` 2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled 3. Enforces tiling suitable for matmul when the native matmul flag is enabled 4. Implements code generation for `ops.dot` 5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal: it currently takes a long time to tune, and I think there must be a better way to tackle this. @eellison @jansel @PaulZhang12 @shunting314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743 Approved by: https://github.com/jansel trunk/ac529df244c8e6e02040e1e54a894dd0d6b5d874	2025-10-14 04:22:30 +00:00
PyTorch MergeBot	fa3916f466	Revert "[export] Turn on install_free_tensors flag (#164691 )" This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590. Reverted https://github.com/pytorch/pytorch/pull/164691 on behalf of https://github.com/seemethere due to Breaks some internal things, both me and author agreed that revert was the best course of action ([comment](https://github.com/pytorch/pytorch/pull/164691#issuecomment-3400013759)) trunk/fa3916f4668bf095b1cb8d28bae93554a7ad8bdf	2025-10-14 03:58:12 +00:00
PyTorch MergeBot	267348fe7f	Revert "Fix double dispatch to Python for detach (#163671 )" This reverts commit a3e3efe474bef63940ded803e78bb2a382681f1e. Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/seemethere due to We should've reverted this when we decided to revert https://github.com/pytorch/pytorch/pull/164691 since they were actually stacked ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3400009953)) trunk/267348fe7fda1ac8aa6b57cbcbe8db0ce6362baa	2025-10-14 03:55:36 +00:00
PyTorch MergeBot	1803d40c99	Reapply "[export] Turn on install_free_tensors flag (#164691 )" (#165353 ) This reverts commit 9166f6120f63e2d5d76e6ccdbfccb8d6e41cbb43. Reverted https://github.com/pytorch/pytorch/pull/165353 on behalf of https://github.com/seemethere due to This is causing merge conflicts since a dependent PR wasn't reverted ([comment](https://github.com/pytorch/pytorch/pull/165353#issuecomment-3400006587)) trunk/1803d40c995e72a5993ee0940ec38bca760978b5	2025-10-14 03:52:50 +00:00
Tristan Trouwen	29c5368e0f	MTIA _cdist_forward registration (#165333 ) Summary: Added registration for _cdist_forward on MTIA Differential Revision: D84357997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165333 Approved by: https://github.com/albanD trunk/29c5368e0f4ca094dbe328fbb0b7ebb508baead8	2025-10-14 03:51:31 +00:00
VINAY PRITHYANI	e71c75680f	use sym_numel, to allow fake tensors to work (#163831 ) Fixes [#163759](https://github.com/pytorch/pytorch/issues/163759). Replace `numel` with `sym_numel`. Tested with the example in the issue, and it works now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163831 Approved by: https://github.com/bobrenjc93 trunk/e71c75680f2d6ce5f61ad4b2125f4934087762eb	2025-10-14 03:33:28 +00:00
Shangdi Yu	ca96c67500	Update windows cuda build to use 12.8 (#165345 ) As title Motivation: The rest of the pytorch and inductor build is using 12.8 and we're deprecating cuda 12.6 builds soon per https://github.com/pytorch/pytorch/issues/165111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165345 Approved by: https://github.com/atalman trunk/ca96c675001fa87b9d9c648972415ab8b1591f11	2025-10-14 02:33:44 +00:00
Nikita Shulga	770e6b910c	[DTensor] Extend conv ops to 3D (#165241 ) The current implementation hardcodes 4D input and output tensor shapes. Change that by computing `output_conv_shape` for any number of input dims. Replace `[.., .., .., slice]` with `[..., slice]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165241 Approved by: https://github.com/ezyang trunk/770e6b910c556699d96ed629c49409fbef20007f	2025-10-14 02:30:46 +00:00
Colin Peppler	37d57ac9cb	Use sym_eq in _check_rms_norm_inputs_symint (#165112 ) Summary: ### Problem ArrayRef's `equals()` does elementwise equality using the `==` operator. This can cause a DDE for unbacked symints since the `==` operator calls `guard_bool`. ``` // SymInt.h bool operator==(const SymInt& o) const { return sym_eq(o).guard_bool(__FILE__, __LINE__); } ``` ### Solution Adds `sym_equals()` to do elementwise equality for `SymIntArrayRef`. Use this instead of `equals()` for `SymIntArrayRef`. Reviewed By: guangy10, pianpwk, muchulee8 Differential Revision: D84168401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165112 Approved by: https://github.com/Skylion007 trunk/37d57ac9cb7f538b812cf1d9851b55b46213fe15	2025-10-14 00:06:24 +00:00
Animesh Jain	9166f6120f	Revert "[export] Turn on install_free_tensors flag (#164691 )" (#165353 ) This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/165353 Approved by: https://github.com/seemethere trunk/9166f6120f63e2d5d76e6ccdbfccb8d6e41cbb43	2025-10-13 23:40:11 +00:00
Nicolas Macchioni	fb0291d14b	[pt2][caching] fix runtime error in context on cpu-only machine when compile for gpu (#165220 ) re https://github.com/pytorch/pytorch/pull/165186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165220 Approved by: https://github.com/clee2000 trunk/fb0291d14b1b31190f32fa763a5951da0c60f08f	2025-10-13 22:47:41 +00:00
Animesh Jain	f3683453ae	[compile] Regional inductor compilation with fx.annotate (#164776 ) This PR introduces a way to compile a region of FX graph using `fx.traceback.annotate`. ### UX 1) In the user code, mark the region that you want to be compiled with inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we just rely on the string `compile_with_inductor` and ignore the integer. As the needs arise, we can update the logic. Example ``` def fn(x, y): sin = torch.sin(x) with fx_traceback.annotate({"compile_with_inductor": 0}): mul = sin * y add = mul + 1 return torch.sin(add) ``` 2) You have to instruct the compiler to use the annotations with `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that just setting annotation is enough. But for now to control the blast radius, we need to explicitly do this. One such example is ``` # Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor` def aot_eager_regional_inductor(): return aot_autograd( fw_compiler=compile_fx_annotated_nodes_with_inductor, bw_compiler=compile_fx_annotated_nodes_with_inductor, ) ``` 3) Fixable in short-term - You have to wrap the user code in `torch.fx.traceback.preserve_node_meta` to ensure that annotations are propagated to the compiler. This is fixable, just need to make CI happy. ### Implementation 1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on annotations, and then create subgraphs in the main graph. 
2) Call `torch._inductor.standalone_compile` on these subgraphs, and jam the returned callable into the FX graph at the place of call_module Resulting graph looks something like this - search for `torch__inductor_standalone_compile_inner` Forward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) sin: "f32[10]" = torch.ops.aten.sin.default(primals_1) # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(sin, primals_2) # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1 getitem: "f32[10]" = inner[0]; inner = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem) return (sin_1, primals_1, primals_2, sin, getitem) ``` Backward graph ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"): # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1); primals_1 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add) cos: "f32[10]" = torch.ops.aten.cos.default(add); add = None mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos); tangents_1 = cos = None # No stacktrace found for following nodes inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2); mul_1 = sin = primals_2 = None # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y getitem: "f32[10]" = inner[0] getitem_1: "f32[10]" = inner[1]; inner = None # File: 
/data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x) mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1); getitem_1 = cos_1 = None return (mul_4, getitem) ``` ### Some issues raised in the HOP meeting 1) CSE will not differentiate nodes with different custom meta and will do the wrong thing. 2) SAC - The recomputed forward will be smaller than the forward. Will we compile a smaller region then? 3) What happens if you have an op in the middle which does not disturb the topology, is it still 1 subgraph? 4) What happens with the nesting of `fx_traceback.annotate`? Are there any ordering requirements? 5) What are we going to use the annotations for? a) compile flex b) streams c) nn.Module info to organize MoE components for pipelining d) PP stages e) Rename graph nodes for more debugging f) No nested regional compile Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776 Approved by: https://github.com/SherlockNoMad ghstack dependencies: #165188 trunk/f3683453aefb7ca4d7874452a74b74258b59527f	2025-10-13 22:22:20 +00:00
Animesh Jain	1191e51c44	[dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165188 Approved by: https://github.com/yushangdi	2025-10-13 22:22:20 +00:00
zpcore	3edd94485f	[5/N][DTensor device order] Implement graph based redistribution algorithm (#164902 ) (Extract out the algorithm from https://github.com/pytorch/pytorch/pull/160266.) Build a graph to search for the path from source placement to destination placement (with device order). The current solution introduces too many all-gathers and misses the opportunity for all-to-all when redistributing, especially when we consider the device order. ### How to build the graph: For Shard placements, think of each collective op as an operation on a stack of device axes: - I, J are tensor dimensions; - X, Y, Z, Y are ordered mesh dimensions. <img width="357" height="253" alt="image" src="https://github.com/user-attachments/assets/23bb3cc3-0506-4071-9053-3c525cf0e526" /> The detailed collective op transition is implemented in `DTensorRedistributePlanner.get_next_state`. ### How to find the min cost path: Assign a weight to each type of collective op and use Dijkstra's algorithm to find the min-cost path in the graph we build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164902 Approved by: https://github.com/ezyang trunk/3edd94485f55e9b9ca4edc633ef8fbaa5868c885	2025-10-13 22:03:57 +00:00
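
The min-cost path search described in the graph-based redistribution commit (#164902) can be illustrated with a small, self-contained sketch. This is not the code from that PR: the state labels, the `COST` table, the `local_chunk` transition, and the `min_cost_path` helper are all hypothetical, chosen only to show how weighting collective ops and running Dijkstra's algorithm can prefer one all-to-all over an all-gather followed by re-sharding.

```python
import heapq

# Hypothetical per-op weights; the commit message does not state the real values.
COST = {"all_gather": 4.0, "all_to_all": 1.0, "local_chunk": 0.0}


def min_cost_path(graph, src, dst):
    """Dijkstra over graph: {state: [(next_state, op_name), ...]}.

    Returns (total_cost, [op_name, ...]) or None if dst is unreachable.
    """
    best = {src: 0.0}          # cheapest known cost to reach each state
    prev = {}                  # state -> (previous_state, op used to get here)
    heap = [(0.0, src)]
    while heap:
        cost, state = heapq.heappop(heap)
        if state == dst:
            break
        if cost > best.get(state, float("inf")):
            continue           # stale heap entry
        for nxt, op in graph.get(state, []):
            new_cost = cost + COST[op]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = (state, op)
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None
    # Walk back from dst to src to recover the sequence of ops.
    ops = []
    node = dst
    while node != src:
        node, op = prev[node]
        ops.append(op)
    ops.reverse()
    return best[dst], ops


# Toy placement graph (labels are illustrative, not DTensor objects).
graph = {
    "Shard(I) on X": [("Replicate", "all_gather"), ("Shard(J) on X", "all_to_all")],
    "Replicate": [("Shard(J) on X", "local_chunk")],
}
result = min_cost_path(graph, "Shard(I) on X", "Shard(J) on X")
print(result)
```

With these assumed weights the search picks the direct all-to-all edge instead of going through the replicated state.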

94722 Commits