Fixes #160752
# Background:
`torch.func.jacfwd` is implemented as vmap over forward-mode JVP. With torch.compile(dynamic=True), FakeTensor + SymInt shape reasoning is used while tracing through the transform. The old vmap rule for one_hot decomposed into “zeros_symint + scatter,” which interacted poorly with the transform stack and dynamic shapes, leading to failures mid-trace. Using a functional equality construction makes one_hot composable with vmap/JVP and friendly to dynamic shape tracing.
# Changes:
- The functorch vmap batching rule for `aten::one_hot` now uses a purely functional formulation:
  - Replace “zeros + scatter” with `eq(self.unsqueeze(-1), arange(num_classes)).to(kLong)` under FuncTorchBatched (see the sketch below).
- The `one_hot` native path remains unchanged for regular eager; the vmap transform no longer relies on scatter, which was fragile under dynamic shape tracing.
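A minimal Python sketch of the functional formulation (illustrative only; the actual batching rule is implemented in C++ under FuncTorchBatched):
```python
import torch
import torch.nn.functional as F

def one_hot_functional(idxs: torch.Tensor, num_classes: int) -> torch.Tensor:
    # Compare each index against [0, num_classes): no zeros + scatter needed,
    # so the rule composes cleanly with vmap/JVP and dynamic shapes.
    classes = torch.arange(num_classes, device=idxs.device)
    return idxs.unsqueeze(-1).eq(classes).to(torch.long)

idxs = torch.tensor([0, 2, 1])
assert torch.equal(one_hot_functional(idxs, 3), F.one_hot(idxs, 3))
```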
The minimal repro from the issue is now fixed:
```python
import torch
import torch.nn.functional as F
MAX, BATCH = 3, 37
def func(x, idxs):
    return x.square() * F.one_hot(idxs, MAX)

def jacfunc(x, idxs):
    return torch.func.jacfwd(func, argnums=0)(x, idxs)
idxs = torch.randint(MAX, (BATCH,), dtype=torch.int64)
x = torch.rand((BATCH, MAX), dtype=torch.float64)
# eager
out_eager = jacfunc(x, idxs)
# compiled dynamic
jacfunc_c = torch.compile(jacfunc, dynamic=True)
out_comp = jacfunc_c(x, idxs)
torch.testing.assert_close(out_eager, out_comp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160837
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
Summary:
The `nDims` variable is mutated inside the inner loop but never restored to its original value. The mutation carries over into subsequent iterations of the outer loop, so every batch after the first may see an incorrect `nDims`.
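A minimal sketch of the bug pattern (hypothetical Python stand-in; the actual code is C++):
```python
def buggy(num_batches: int, nDims: int = 4) -> list[int]:
    seen = []
    for _ in range(num_batches):
        seen.append(nDims)   # value observed by this batch
        nDims -= 1           # mutation leaks into the next iteration
    return seen

def fixed(num_batches: int, nDims: int = 4) -> list[int]:
    orig_nDims = nDims
    seen = []
    for _ in range(num_batches):
        nDims = orig_nDims   # restore before processing each batch
        seen.append(nDims)
        nDims -= 1
    return seen

print(buggy(3))  # [4, 3, 2] -- wrong after the first batch
print(fixed(3))  # [4, 4, 4]
```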
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D84612194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165446
Approved by: https://github.com/ngimel
Summary: If a function is wrapped with `functools.wraps`, we should not look at the wrapped function's signature but rather the wrapper's, since we need to construct the frame for the top-level function here.
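For context, a small illustration of why the distinction matters: `inspect.signature` follows `__wrapped__` (set by `functools.wraps`) by default, yielding the inner function's signature rather than the wrapper's:
```python
import functools
import inspect

def decorator(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):  # the frame actually executed at the top level
        return fn(*args, **kwargs)
    return wrapper

@decorator
def f(x, y=1):
    return x + y

print(inspect.signature(f))                        # (x, y=1)          -- wrapped fn
print(inspect.signature(f, follow_wrapped=False))  # (*args, **kwargs) -- wrapper
```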
Test Plan: test_decorated_function_with_functools_wrap_aot
Differential Revision: D84626752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165454
Approved by: https://github.com/yiming0416
Summary: This diff fixes a stress test failure by adding a new binary echo4.py and modifying the existing echo1.py binary. The changes are made in both fbcode and xplat directories. The api_test.py file is updated to use the new echo4.py binary, and the BUCK file is updated to include the new binary.
Test Plan:
```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary_redirect_and_tee (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```
```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```
https://www.internalfb.com/intern/testinfra/testrun/17732923648474906
https://www.internalfb.com/intern/testinfra/testrun/15481123834815653
Differential Revision: D83623694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164353
Approved by: https://github.com/d4l3k
**Summary**
Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.
This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms | 0.43171775817871094 ms |
| TFLOPs | 231.812 | 320.840 |
The sparsity is 0.453, which matches the speedup we see from varlen (approximately 50%). TFLOPs remain around the same, with SDPA slightly higher due to potentially higher overhead and total FLOPs scaling with the padded sequence length.
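For reference, a hedged sketch of how the benchmark's variable-length inputs could be packed (cumulative-offset layout as used by Flash Attention varlen kernels; the `varlen_attn` call shape shown is an assumption, not the confirmed signature):
```python
import torch

batch_size, max_seq_len, embed_dim, num_heads = 8, 2048, 1024, 16
head_dim = embed_dim // num_heads

# Benchmark setting: random sequence lengths that are multiples of 64.
seq_lens = torch.randint(1, max_seq_len // 64 + 1, (batch_size,)) * 64

# Cumulative offsets delimit each sequence in the packed (total_tokens, H, D) layout.
cu_seqlens = torch.zeros(batch_size + 1, dtype=torch.int32)
cu_seqlens[1:] = seq_lens.cumsum(0)
total_tokens = int(cu_seqlens[-1])

q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
max_len = int(seq_lens.max())

# Assumed call shape (see the PR for the actual signature):
# out = varlen_attn(q, k, v, cu_seqlens, cu_seqlens, max_len, max_len)
```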
**Testing**
Run `python test/test_varlen_attention.py` for unit tests, where we verify basic functionality and confirm the numerical match between varlen outputs and SDPA.
**Next steps**
Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.
(This stack builds on top of #162326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
This stack is going to turn off functionalization and turn on the default partitioner, so I'm going to separate out a few changes before turning off functionalization in our OpInfo tests:
(1) run our tests with input mutations allowed inside the graph
(2) run our tests with the default partitioner
(3) run with functionalization off
(4) (later) make the tests properly test for bitwise equivalence
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165327
Approved by: https://github.com/ezyang
I just want to print CommDebugMode and know whether there is communication, so this implements `__repr__` for `print(comm_mode)`:
```
comm_mode = CommDebugMode()
with comm_mode:
    out = torch.mm(inps, weight)
print(comm_mode)
# CommDebugMode(get_total_counts()=0)
```
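A minimal sketch of what the added `__repr__` amounts to (illustrative stand-in; the real class is a TorchDispatchMode):
```python
class CommDebugMode:  # sketch only, not the actual implementation
    def __init__(self):
        self._counts = 0

    def get_total_counts(self) -> int:
        return self._counts

    def __repr__(self) -> str:
        # Summarize whether any collectives were recorded under this mode.
        return f"CommDebugMode(get_total_counts()={self.get_total_counts()})"

print(CommDebugMode())  # CommDebugMode(get_total_counts()=0)
```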
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165006
Approved by: https://github.com/anshul-si
ghstack dependencies: #165024
When `repeat_interleave` is decomposed into:
```python
cumsum = repeat.cumsum(0)
pos = torch.arange(output_size, device=repeat.device)
indices = torch.searchsorted(cumsum, pos, right=True)
```
`searchsorted` op with `right=True` returns the insertion point after matching elements. When query values `pos` are `>= cumsum[-1]`, searchsorted returns `len(cumsum)`, which is out of bounds for indexing (valid range: `[0, len(cumsum)-1]`). These invalid indices trigger CUDA device-side assert errors in downstream indexing operations.
This fix adds clamping to ensure all indices stay within the valid range `[0, repeat.size(0)-1]`.
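A small illustration of the decomposition with the clamp applied (illustrative values):
```python
import torch

repeat = torch.tensor([2, 0, 3])
output_size = int(repeat.sum())           # 5

cumsum = repeat.cumsum(0)                 # [2, 2, 5]
pos = torch.arange(output_size)           # [0, 1, 2, 3, 4]
indices = torch.searchsorted(cumsum, pos, right=True)
# Any pos >= cumsum[-1] would map to len(cumsum), which is out of bounds,
# so clamp into the valid index range [0, repeat.size(0) - 1]:
indices = indices.clamp(max=repeat.size(0) - 1)
print(indices)  # tensor([0, 0, 2, 2, 2]) -- matches repeat_interleave semantics
```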
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165368
Approved by: https://github.com/mlazos
Use the existing benchmark infra to get signal on the AOT precompile pass rate for OSS models. Here we also measure and log the loading time.
```
python ./benchmarks/dynamo/huggingface.py --accuracy --inference --aot-precompile
python ./benchmarks/dynamo/timm_models.py --accuracy --inference --aot-precompile
python ./benchmarks/dynamo/torchbench.py --accuracy --inference --aot-precompile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164906
Approved by: https://github.com/zhxchen17
Summary: Add a note mentioning which scaling type pairs are supported in Inductor ATen, since this was a source of confusion and also informs which scaling strategies we choose to support for other backends, like Triton.
Test Plan: n/a
Reviewed By: lw
Differential Revision: D84522373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165450
Approved by: https://github.com/NikhilAPatel
We noticed that disabling autorevert for any and all ci:sev issues is too impactful, as ci:sev issues are sometimes created just to communicate an action or an impactful change. And sometimes during a SEV we might not want to disable autorevert anyway; an example is a ci:sev impacting jobs we don't use as a basis for autorevert.
So, a note is added reminding the ci:sev author to optionally add this tag to disable auto-revert.
Note: using this opportunity to fix the ci: disable-autorevert issue. It is best for the title to be simple and for the displayed message in the GitHub interface to be decorated with an emoji :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165459
Approved by: https://github.com/malfet
This is a cleaner implementation of opaque objects (https://github.com/pytorch/pytorch/pull/162660). Now we just need to call `register_opaque_type` to register the type as being "opaque" and allowed by custom ops. You also need to pass a unique name that maps to the type.
```python
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

register_opaque_type(OpaqueQueue, "_TestOpaqueObject_OpaqueQueue")
```
When creating the custom op, the schema will then use the unique name:
```python
self.lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    "_TestOpaqueObject::queue_push",
    "(_TestOpaqueObject_OpaqueQueue a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=self.lib,
)

@torch.library.impl(
    "_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=self.lib
)
def push_impl(queue: OpaqueQueue, b: torch.Tensor) -> None:
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```
Using the custom op:
```python
queue = OpaqueQueue([], torch.zeros(3))
torch.ops._TestOpaqueObject.queue_push(queue, torch.ones(3))
self.assertEqual(queue.size(), 1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165004
Approved by: https://github.com/albanD
## Issue
During autotune, we're not applying size hints atomically for the example inputs used for benchmarking.
If an unbacked symint shows up in the inputs' strides, this can lead to a CUDA IMA (illegal memory access). This is reproduced by the added unittest: with stride `[128 * u0, 128, 1]` and an unbacked fallback of 8192, after calling `benchmark_example_value` we get back a tensor with stride `[8192, 128, 1]` as opposed to `[128 * 8192, 128, 1]`.
## Fix
Use the atomic API when applying size hints to the input tensors' strides.
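A schematic of the difference (sympy used here just to illustrate; the real logic lives in Inductor's size-hint machinery):
```python
import sympy

u0 = sympy.Symbol("u0", positive=True, integer=True)
strides = [128 * u0, sympy.Integer(128), sympy.Integer(1)]
fallback = 8192

# Atomic: substitute the unbacked symbol itself, so every expression that
# mentions u0 stays mutually consistent.
atomic = [int(s.subs({u0: fallback})) for s in strides]
print(atomic)  # [1048576, 128, 1] == [128 * 8192, 128, 1]

# Non-atomic (the bug): each stride expression is hinted independently, so
# `128 * u0` can collapse to the bare fallback 8192, yielding [8192, 128, 1]
# and a benchmarking tensor that is too small -> CUDA illegal memory access.
```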
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163660
Approved by: https://github.com/ColinPeppler
Summary: Apparently if I just do `tensor + eps` this turns into `add.Tensor`, which is bad because the constant Tensor ends up getting hoisted into an input, which is a bozo thing to do. Just make sure it's exactly compatible.
Test Plan:
```
buck run 'fbcode//mode/opt' fbcode//bolt/nn/executorch/backends/tests:qnn_test_ar1g1 bolt.nn.executorch.backends.tests.qnn_test_ar1g1.QnnTestAR1G1.test_RMSNorm
```
Reviewed By: tugsbayasgalan
Differential Revision: D84613184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165437
Approved by: https://github.com/tugsbayasgalan
For pipeline parallel, we can have multiple FSDP roots (chunks):
```
model = nn.Sequential(chunk0, chunk1)
fully_shard(model[0])
fully_shard(model[1])
```
We can call `share_comm_ctx` to share the all-gather, reduce-scatter, and all-reduce CUDA streams; this avoids inter-stream memory fragmentation:
```
from torch.distributed.fsdp import share_comm_ctx
share_comm_ctx([model[0], model[1]])
```
unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_share_comm_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165024
Approved by: https://github.com/mori360
When the autobucketing pass is registered as the aot_eager backend's `fw_compiler` and `bw_compiler`, this PR ensures the tensors are all-gathered on the "cpu"/"cuda" device instead of the "meta" device.
When we do `dist.all_gather_object`, it creates a new byte storage outside `no_dispatch` [here](a2e2e1d8c0/torch/distributed/distributed_c10d.py (L3303)), which ends up on the meta device. Thus, I updated the code to use `unset_fake_temporarily`, which gathers real tensors from the other ranks.
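A hedged sketch of the fix pattern (hypothetical helper; assumes the `unset_fake_temporarily` context manager from `torch._subclasses.fake_tensor`):
```python
import torch.distributed as dist
from torch._subclasses.fake_tensor import unset_fake_temporarily

def gather_runtime_estimations(runtime_estimations, pg):
    # all_gather_object pickles into a ByteTensor, which must live on a real
    # (cpu/cuda) device, so step out of fake-tensor mode for the collective.
    gathered = [None] * dist.get_world_size(pg)
    with unset_fake_temporarily():
        dist.all_gather_object(gathered, runtime_estimations, group=pg)
    return gathered
```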
This is needed to unblock the aot_eager+autobucketing pass in this [PR](https://github.com/pytorch/torchtitan/pull/1813).
Otherwise, I hit the following error:
```bash
traceback : Traceback (most recent call last):
File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 358, in wrapper
return f(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 607, in train
self.train_step(data_iterator)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 507, in train_step
loss = self.forward_backward_step(input_dict, labels)
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 483, in forward_backward_step
pred = model_parts[0](inputs, **extra_inputs, **extra_args)
File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 418, in __call__
return super().__call__(*args, **kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1795, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 901, in compile_wrapper
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2359, in _call_user_compiler
raise BackendCompilerFailed(
self.compiler_fn, e, inspect.currentframe()
).with_traceback(e.__traceback__) from None
File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2334, in _call_user_compiler
compiled_fn = compiler_fn(gm, example_inputs)
File "/home/ruisizhang123/pytorch/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
File "/home/ruisizhang123/pytorch/torch/__init__.py", line 2441, in __call__
return self.compiler_fn(model_, inputs_, **self.kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
File "/home/ruisizhang123/pytorch/torch/_functorch/aot_autograd.py", line 1100, in aot_module_simplified
compiled_fn, _ = aot_stage2_compile(
~~~~~~~~~~~~~~~~~~^
aot_state,
^^^^^^^^^^
...<4 lines>...
inference_compiler,
^^^^^^^^^^^^^^^^^^^
)
^
File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 257, in aot_stage2_compile
return aot_stage2_autograd(aot_state, aot_graph_capture)
File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 1696, in aot_stage2_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
File "/home/ruisizhang123/torchtitan/torchtitan/experiments/simple_fsdp/backend.py", line 35, in aten_autobucketing_reordering_pass
schedule_overlap_bucketing(gm)
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 755, in schedule_overlap_bucketing
).run()
~~~^^
File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 358, in run
self._align_compute_nodes_runtime_estimations_across_all_distributed_ranks()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 337, in _align_compute_nodes_runtime_estimations_across_all_distributed_ranks
dist.all_gather_object(
~~~~~~~~~~~~~~~~~~~~~~^
gathered_runtime_estimations, runtime_estimations, pg
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/ruisizhang123/pytorch/torch/distributed/c10d_logger.py", line 82, in wrapper
return func(*args, **kwargs)
File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3170, in all_gather_object
input_tensor, local_size = _object_to_tensor(obj, current_device, group)
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3079, in _object_to_tensor
byte_tensor = torch.ByteTensor(byte_storage).to(device)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='compiler_fn' raised:
RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "meta". This is no longer allowed; the devices must match.
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165063
Approved by: https://github.com/eellison
Summary: For D84399286, ads NE deterministic tests are failing now. These tests are especially brittle under subtle bitwise numerics changes. Will re-enable for fbcode once e2e validation tests are performed.
Test Plan: N/A
Differential Revision: D84514361
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165328
Approved by: https://github.com/izaitsevfb
The match for backward nodes might be in a different submod, so we should check all submods for potential matches.
In flex attention, this could happen if `mask_mod` has operations (such as index) that increase the seq_nr of the forward graph nodes. Then the backward flex_attention nodes cannot find a match in their own subgraph.
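A rough sketch of the broadened search (hypothetical helper; the real matching is keyed on the nodes' `seq_nr` metadata):
```python
from torch.fx import GraphModule

def find_forward_match(bwd_node, gm: GraphModule):
    # Instead of searching only the backward node's own subgraph, walk every
    # submodule's graph for a forward node with a matching sequence number.
    target = bwd_node.meta.get("seq_nr")
    for _, submod in gm.named_modules():
        if not isinstance(submod, GraphModule):
            continue
        for node in submod.graph.nodes:
            if node.meta.get("seq_nr") == target:
                return node
    return None
```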
```
python test/functorch/test_aot_joint_with_descriptors.py -k preserve_annotate
```
Also tested on the torchtitan joint_graph_runner branch. The flex_attention backward nodes are now annotated.
```
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" LOG_RANK=0 TRAIN_FILE="torchtitan.train" TORCHFT_LIGHTHOUSE="http://localhost:29510" PYTORCH_ALLOC_CONF="expandable_segments:True" torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint="localhost:0" --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165202
Approved by: https://github.com/SherlockNoMad
Skip test_compiled_autograd_attribution on s390x
It fails on both s390x and x86_64, at least under some circumstances. Disable it on s390x for now until it works reliably.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163647
Approved by: https://github.com/malfet