**Summary**
Today, the only way to get variable sequence length support in PyTorch attention is through [nested tensors](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want an explicit lower-level API that provides variable sequence length support in SDPA without padding/masking.
This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.
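To illustrate the input layout (an illustrative sketch only; see the PR for the actual `varlen_attn` signature and argument names), variable-length sequences are packed into a single tensor with cumulative sequence lengths instead of being padded to `max_seq_len`:
```python
import torch

# hypothetical per-sequence lengths (multiples of 64, as in the benchmark below)
seq_lens = [640, 128, 1984, 2048]
embed_dim = 1024

# Padded layout used by the SDPA baseline: (batch, max_seq_len, embed_dim),
# where shorter sequences are zero-padded and masked.
padded = torch.zeros(len(seq_lens), max(seq_lens), embed_dim)

# Packed layout used by varlen-style kernels: all tokens concatenated along
# one dimension, plus cumulative sequence lengths marking sequence boundaries.
packed = torch.randn(sum(seq_lens), embed_dim)
cu_seq = torch.cumsum(torch.tensor([0] + seq_lens), dim=0)
print(packed.shape, cu_seq)  # torch.Size([4800, 1024]), tensor([0, 640, 768, 2752, 4800])
```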
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| Metric | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.2175 ms | 0.4317 ms |
| TFLOPs | 231.812 | 320.840 |
The sparsity is 0.453, which matches the roughly 50% speedup we see from varlen. TFLOPs remain about the same, with SDPA slightly higher, likely due to extra overhead and because its total FLOPs scale with the padded sequence length.
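For reference, a minimal sketch of the kind of CUDA-event timing loop such a comparison typically uses (illustrative, not the actual benchmark script), shown here for the padded-SDPA baseline:
```python
import torch
import torch.nn.functional as F

batch, max_seq_len, embed_dim, num_heads = 8, 2048, 1024, 16
head_dim = embed_dim // num_heads
q = torch.randn(batch, num_heads, max_seq_len, head_dim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

for _ in range(10):  # warmup
    F.scaled_dot_product_attention(q, k, v, is_causal=False)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    F.scaled_dot_product_attention(q, k, v, is_causal=False)
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.4f} ms per call")
```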
**Testing**
Run `python test/test_varlen_attention.py` for unit tests that verify basic functionality and confirm numerical parity between varlen outputs and SDPA.
**Next steps**
Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.
(This stack builds on top of #162326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
This should fix two types of failures that started with https://github.com/pytorch/pytorch/pull/163665. These happen when building with CMAKE_BUILD_TYPE=RelWithAssert.
Disclaimer: I used a lot of AI since I don't know how pybind works or what refcounts and pointers are, so I'm not sure if this is a good solution, or even a solution at all (fwiw, the tests pass now).
The first type of failure is
Truncated:
```
default_pg, _ = _new_process_group_helper(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
backend_class = creator_fn(dist_backend_opts, backend_options)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0
#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0
#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0
#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
I fix it here by getting rid of `DontIncreaseRefcount` and using `make_intrusive` to handle the refcount instead. However, I also had to make the constructor public, which I think is not good, based on the reasoning in the original PR.
The other type of failure is
```
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
self.assertEqual(out, "")
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
- if is_available() and not torch._C._c10d_init():
To execute this test, run the following from the base repo dir:
python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__`; I think this is OK since it will just error if you try to construct one directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165479
Approved by: https://github.com/ezyang
Summary:
* Add `torch._scaled_grouped_mm_v2` with more functionality and
extensibility for future formats
* Add `torch.nn.functional.scaled_grouped_mm` as public entrypoint
* Test both original and v2 functionality
Test Plan:
```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165154
Approved by: https://github.com/drisspg, https://github.com/danielvegamyhre
Summary: Moves the function used to load CuTeDSL Jinja templates up one level, out of the flex attention folder, so it can be used for more general Inductor templates in the future.
Test Plan: `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:flex_flash -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8`
Differential Revision: D84527470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165347
Approved by: https://github.com/drisspg
The `_flatten_mapping` field was defined as a class attribute with a mutable default value {}:
```
_flatten_mapping: dict[str, "DeviceMesh"] = {}
```
This caused all DeviceMesh instances to share the same dictionary object. When multiple test instances tried to create flattened meshes with the same name (like "dp"), they would conflict because they were all using the same shared dictionary, resulting in the error: "Flatten mesh with mesh_dim_name dp has been created before, Please specify another valid mesh_dim_name."
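A minimal standalone illustration of the failure mode (hypothetical class names; the general fix is to give each instance its own dict, e.g. in `__init__`):
```python
class Shared:
    _flatten_mapping: dict = {}  # one dict shared by every instance (the bug)

class PerInstance:
    def __init__(self):
        self._flatten_mapping: dict = {}  # fresh dict per instance (the fix)

a, b = Shared(), Shared()
a._flatten_mapping["dp"] = "mesh"
assert "dp" in b._flatten_mapping  # state leaks across instances

c, d = PerInstance(), PerInstance()
c._flatten_mapping["dp"] = "mesh"
assert "dp" not in d._flatten_mapping  # isolated
```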
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165521
Approved by: https://github.com/fegin, https://github.com/lw
Fixes #165110
The `PUBLIC` scope causes FBGEMM's CUTLASS to be included in all PyTorch targets, including the special matmuls (RowwiseScaledMM, ScaledGroupMM, and GroupMM). Due to the version mismatch between FBGEMM/CUTLASS and PyTorch/CUTLASS, it is unacceptable to use FBGEMM/CUTLASS in PyTorch targets. This PR limits the scope of FBGEMM/CUTLASS to the `fbgemm_genai` target only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165424
Approved by: https://github.com/cthi, https://github.com/eqy, https://github.com/danielvegamyhre
https://github.com/pytorch/pytorch/pull/164790 modifies ATen to perform a different intra-warp reduction order. However, this change exposed a large discrepancy in a sum over complex32 inputs, namely this case:
```
import torch
a = torch.tensor([[ 4.82031250+7.34765625j,
-3.37109375-1.9501953125j],
[ 3.7832031250-2.43359375j,
-6.07812500+5.32812500j]], dtype=torch.complex32, device='cuda:0')
sum_out = torch.sum(a)
nansum_out = torch.nansum(a)
torch.testing.assert_close(
sum_out,
nansum_out,
rtol=0,
atol=0,
)
```
Here, the results of `sum` and `nansum` differed significantly (by about 1e-2). Further investigation showed that the explicit cast of `b` from `scalar_t` back to `arg_t` was the root cause: `arg_t` is the accumulator dtype (ComplexFloat) and `scalar_t` is the input dtype (ComplexHalf). Casting to the accumulator type only at that point in the reduction means intermediate values are still held as ComplexHalf, which loses precision because ComplexHalf cannot store the intermediate values accurately.
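An analogous standalone illustration (real-valued, on CPU) of why holding intermediate values in the low-precision input dtype loses accuracy while a wider accumulator does not:
```python
import torch

x = torch.full((10000,), 1e-3, dtype=torch.float16)

# Accumulate in the input dtype: once the running sum reaches 4.0, each 1e-3
# increment falls below half the float16 spacing and stops contributing.
acc = torch.zeros((), dtype=torch.float16)
for v in x:
    acc = acc + v

ref = x.sum(dtype=torch.float32)  # accumulate in a wider dtype
print(acc.item(), ref.item())     # 4.0 vs ~10.0
```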
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165494
Approved by: https://github.com/ngimel
Fixes #161943
## The Fix
I implemented a recursive unwrapping helper function in the `tensor_to_list.cpp` file that looks for wrapped tensors and unwraps them. The recursive implementation was needed for multi-level gradTrackingTensors.
Let me know if there are any more suggestions on fixing this issue!
@guilhermeleobas @KimbingNg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165184
Approved by: https://github.com/zou3519
Reproducer:
```
import torch
# does not crash
a = torch.rand((0), device="cpu")
b = torch.rand((0), device="cpu")
a.dot(b)
# crashes due to internal assert
a = torch.rand((0), device="mps")
b = torch.rand((0), device="mps")
a.dot(b)
```
Discovered when implementing an op for SparseMPS backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165237
Approved by: https://github.com/malfet
Fixes #160752
# Background:
`torch.func.jacfwd` is implemented as vmap over forward-mode JVP. With torch.compile(dynamic=True), FakeTensor + SymInt shape reasoning is used while tracing through the transform. The old vmap rule for one_hot decomposed into “zeros_symint + scatter,” which interacted poorly with the transform stack and dynamic shapes, leading to failures mid-trace. Using a functional equality construction makes one_hot composable with vmap/JVP and friendly to dynamic shape tracing.
# Changes:
- functorch vmap batching rule for `aten::one_hot` now uses a purely functional formulation:
- Replace “zeros + scatter” with eq(self.unsqueeze(-1), arange(num_classes)).to(kLong) under FuncTorchBatched.
- one_hot native path remains unchanged for regular eager; vmap transform no longer relies on scatter, which was fragile under dynamic shape tracing.
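In Python terms, the functional formulation used by the new batching rule is equivalent to the following (illustrative; the actual rule is implemented in C++ under `FuncTorchBatched`):
```python
import torch
import torch.nn.functional as F

num_classes = 5
idxs = torch.randint(num_classes, (7,))

# eq(self.unsqueeze(-1), arange(num_classes)).to(long): no zeros/scatter needed
functional = (idxs.unsqueeze(-1) == torch.arange(num_classes)).to(torch.long)
torch.testing.assert_close(functional, F.one_hot(idxs, num_classes))
```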
The minimal repro from the issue is now fixed:
```python
import torch
import torch.nn.functional as F
MAX, BATCH = 3, 37
def func(x, idxs):
    return x.square() * F.one_hot(idxs, MAX)
def jacfunc(x, idxs):
    return torch.func.jacfwd(func, argnums=0)(x, idxs)
idxs = torch.randint(MAX, (BATCH,), dtype=torch.int64)
x = torch.rand((BATCH, MAX), dtype=torch.float64)
# eager
out_eager = jacfunc(x, idxs)
# compiled dynamic
jacfunc_c = torch.compile(jacfunc, dynamic=True)
out_comp = jacfunc_c(x, idxs)
torch.testing.assert_close(out_eager, out_comp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160837
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
Summary:
The `nDims` variable is mutated inside the loop but never restored to its original value, which affects subsequent iterations of the outer loop: after the first batch, each batch iteration may see an incorrect `nDims`.
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D84612194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165446
Approved by: https://github.com/ngimel
Summary: If a function is wrapped with functools, we should not look at the wrapped function signature but rather the wrapper, since we need to construct the frame for the top level function here.
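A small illustration of the distinction (hypothetical decorator and function names): `inspect.signature` follows `__wrapped__` by default and reports the inner function's signature, while the frame that actually runs is the wrapper's:
```python
import functools
import inspect

def decorator(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):  # this is the frame that actually gets constructed
        return fn(*args, **kwargs)
    return wrapper

@decorator
def f(x, y=1):
    return x + y

print(inspect.signature(f))                        # (x, y=1) -- follows __wrapped__
print(inspect.signature(f, follow_wrapped=False))  # (*args, **kwargs) -- the wrapper itself
```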
Test Plan: test_decorated_function_with_functools_wrap_aot
Differential Revision: D84626752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165454
Approved by: https://github.com/yiming0416
Summary: This diff fixes a stress test failure by adding a new binary echo4.py and modifying the existing echo1.py binary. The changes are made in both fbcode and xplat directories. The api_test.py file is updated to use the new echo4.py binary, and the BUCK file is updated to include the new binary.
Test Plan:
```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary_redirect_and_tee (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```
```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```
https://www.internalfb.com/intern/testinfra/testrun/17732923648474906
https://www.internalfb.com/intern/testinfra/testrun/15481123834815653
Differential Revision: D83623694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164353
Approved by: https://github.com/d4l3k
This stack is going to turn off functionalization and turn on the default partitioner, so I'm going to separate out a few changes before turning off functionalization in our OpInfo tests:
(1) run our tests with input mutations allowed inside the graph
(2) run our tests with the default partitioner
(3) run with functionalization off
(4) (later) make the tests properly test for bitwise equivalence
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165327
Approved by: https://github.com/ezyang
I just want to print CommDebugMode and know whether there is communication, so this implements `__repr__` so that `print(comm_mode)` works:
```
comm_mode = CommDebugMode()
with comm_mode:
    out = torch.mm(inps, weight)
print(comm_mode)
# CommDebugMode(get_total_counts()=0)
```
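A standalone toy sketch of the `__repr__` pattern (illustrative only, not the actual CommDebugMode code):
```python
class ToyDebugMode:
    """Toy stand-in for CommDebugMode, just to show the __repr__ pattern."""
    def __init__(self):
        self.total_counts = 0

    def get_total_counts(self):
        return self.total_counts

    def __repr__(self):
        return f"ToyDebugMode(get_total_counts()={self.get_total_counts()})"

print(ToyDebugMode())  # ToyDebugMode(get_total_counts()=0)
```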
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165006
Approved by: https://github.com/anshul-si
ghstack dependencies: #165024
When `repeat_interleave` is decomposed into:
```python
cumsum = repeat.cumsum(0)
pos = torch.arange(output_size, device=repeat.device)
indices = torch.searchsorted(cumsum, pos, right=True)
```
`searchsorted` op with `right=True` returns the insertion point after matching elements. When query values `pos` are `>= cumsum[-1]`, searchsorted returns `len(cumsum)`, which is out of bounds for indexing (valid range: `[0, len(cumsum)-1]`). These invalid indices trigger CUDA device-side assert errors in downstream indexing operations.
This fix adds clamping to ensure all indices stay within the valid range [0, repeat.size(0)-1].
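A self-contained sketch of the decomposition with the clamp applied (illustrative; the actual decomposition in PyTorch may differ in details):
```python
import torch

repeat = torch.tensor([2, 0, 3])
output_size = int(repeat.sum())

cumsum = repeat.cumsum(0)
pos = torch.arange(output_size, device=repeat.device)
indices = torch.searchsorted(cumsum, pos, right=True)
# keep indices inside [0, repeat.size(0) - 1] even when pos >= cumsum[-1]
indices = indices.clamp(max=repeat.size(0) - 1)

out = torch.arange(repeat.size(0))[indices]
torch.testing.assert_close(out, torch.repeat_interleave(torch.arange(repeat.size(0)), repeat))
```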
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165368
Approved by: https://github.com/mlazos
Use the existing benchmark infra to get some signals for AOT precompile pass rate on OSS models. Here we also measure and log the loading time.
```
python ./benchmarks/dynamo/huggingface.py --accuracy --inference --aot-precompile
python ./benchmarks/dynamo/timm_models.py --accuracy --inference --aot-precompile
python ./benchmarks/dynamo/torchbench.py --accuracy --inference --aot-precompile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164906
Approved by: https://github.com/zhxchen17
Summary: Add note mentioning which scaling type pairs are supported in Inductor ATen, since this was a source of confusion and also informs which scaling strategies we choose to support for other backends, like Triton.
Test Plan: n/a
Reviewed By: lw
Differential Revision: D84522373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165450
Approved by: https://github.com/NikhilAPatel